Limited Dependent Variables

Lecture notes for Econometrics


(First year, MPhil in Economics)
Simon Quinn*
Hilary 2013

These notes provide some background for six lectures that I will be giving this year for the M.Phil
Econometrics core course. I will use slides for the lectures themselves. I will make the slides
available online after our last lecture. It is likely that there will be some things in these notes that
we do not have time to cover in class, and we may cover some things in class that are not covered in these notes. Though we will focus in class on the most important issues, please consider
all of the lectures and all of these notes to be potentially relevant for the exam (except where noted).
For each lecture, I have starred (⋆) references to Cameron and Trivedi (2005) and to Wooldridge
(2002 and 2010). You are required to read at least one of these, but you do not need to read more.
I have also provided other references; you are not required or expected to read these.

* Department of Economics, Centre for the Study of African Economies and St Antony's College, University of Oxford: simon.quinn@economics.ox.ac.uk. I must particularly thank Victoria Prowse for her assistance in preparing these lectures. I must also thank Cameron Chisholm, Cosmina Dorobantu, Sebastian Königs, Julien Labonne and Jeremy Magruder for very useful comments. All errors remain my own.

Lectures
We will spend six lectures on limited dependent variables; the following table summarises.

HILARY   Week 7   Lecture 1   Binary Choice I                   Monday, 11.30am - 1.00pm
         Week 7   Lecture 2   Binary Choice II                  Tuesday, 11.30am - 1.00pm
         Week 7   Lecture 3   Discrete Ordered Choice           Wednesday, 9.30am - 11.00am
         Week 8   Lecture 4   Discrete Multinomial Choice       Monday, 11.30am - 1.00pm
         Week 8   Lecture 5   Censored and Truncated Outcomes   Tuesday, 11.30am - 1.00pm
         Week 8   Lecture 6   Selection                         Wednesday, 9.30am - 11.00am

Problem sets
There are two problem sets for limited dependent variables. Both were written by Victoria Prowse,
and have evolved on this course over several years. The problem sets do not attempt to cover
every aspect of our lectures, but they will provide a very useful opportunity for you to consider key
concepts and techniques.

So, what's changed?

This is the second year that I have lectured this topic. In revising for the exam, you may wish to look at past papers, and hence you should note the following small differences between the content for this year and the content prior to 2012:

• We will not be discussing the specific details of any numerical optimisation techniques. (Therefore, for example, you may disregard Question 7(ii)(c) on the June 2011 Exam.)
• We will discuss the Generalised Ordered Probit (and Generalised Ordered Logit), which were not covered prior to 2012.


Are any of these notes not examinable?

There are a few small aspects of these notes that are not examinable for any question on limited dependent variables:

• I provide some example Stata code; you do not need to know any Stata code for any exam questions about limited dependent variables.
• I briefly discuss nonparametric and semiparametric estimation in a few places; you do not need to know anything about these estimation techniques for any exam questions about limited dependent variables.
• You do not need to be able to repeat the derivation in equations 4.13 to 4.24.

Nonetheless, I include this extra material because I think it may be useful to help your general understanding of the other themes that we will discuss.


Lecture 1: Binary Choice I

Required readings (for Lectures 1 and 2):


⋆ CAMERON, A.C. AND TRIVEDI, P.K. (2005): Microeconometrics: Methods and Applications. Cambridge University Press, pages 463-478 (i.e. sections 14.1 to 14.4, inclusive)
or
⋆ WOOLDRIDGE, J. (2002): Econometric Analysis of Cross Section and Panel Data. The MIT Press, pages 453-461 (i.e. sections 15.1 to 15.4, inclusive)
or
⋆ WOOLDRIDGE, J. (2010): Econometric Analysis of Cross Section and Panel Data (2nd ed.). The MIT Press, pages 561-569 (i.e. sections 15.1 to 15.4, inclusive).

Other references:

TRAIN, K. (2009): Discrete Choice Methods with Simulation. Cambridge University Press.
GOULD, W., PITBLADO, J. AND SRIBNEY, W. (2006): Maximum Likelihood Estimation with Stata. Stata Press.

1.1 An illustrative empirical question

Our first two lectures consider models for binary dependent variables; that is, models for contexts in which our outcome of interest takes just two values. We will focus on a simple illustrative question: how has primary school attendance changed over time in Tanzania? There are many reasons that this question may be important for empirical researchers: for example, it may be of historical interest in understanding Tanzania's long-run economic development, or it may be important for considering present-day earnings differences across Tanzanian age cohorts.

We shall consider this question using data from Tanzania's 2005/2006 Integrated Labour Force Survey (ILFS). For simplicity, we will consider a single explanatory variable: the year in which a respondent was born. We index individuals by i, and denote the ith individual's year of birth as x_i. We record educational attainment by a dummy variable, y_i, defined such that:

y_i = 0 if the ith individual did not complete primary education;
      1 if the ith individual did complete primary education.   (1.1)

(Note immediately that, as with all binary outcome models, this denomination is arbitrary: for example, we could just as easily reverse the assignment of 0 and 1 without changing anything of the structure of the problem.)
Figure 1.1 illustrates the data: it shows the education dummy variable on the y axis (with data
points jittered, for illustrative clarity), and the age variable on the x axis. Note that we will limit
consideration to individuals born between 1920 and 1980 (inclusive).

Figure 1.1: Primary school attainment in Tanzania across age cohorts

1.2 A simple model of binary choice

We began with a somewhat imprecise question: how has primary school attendance changed over time in Tanzania? More formally, we will be interested in estimating the following object of interest:

Pr(y_i = 1 | x_i).   (1.2)

That is, we will build and estimate a model of the probability of attaining a primary school education, conditional upon year of birth.

As with most econometric outcome variables, investment in primary school education is a matter of choice. We therefore begin by specifying a (very simple) microeconomic model of investment in education. Denote the ith household's utility of attending primary school as U_i^S(x_i) and the utility of not attending school (i.e. staying home) as U_i^H(x_i). For simplicity, we will assume that each utility function is additive in the year in which a child was born:¹

U_i^S(x_i) = β₀^S + β₁^S·x_i + ε_i^S   (1.3)
U_i^H(x_i) = β₀^H + β₁^H·x_i + ε_i^H.   (1.4)

¹ This would be the case, for example, if we think that the utility cost of primary education has changed linearly over time (or, indeed, the utility benefit from a primary education).


This is a very simple example of an "additive random utility model" (ARUM).

Define β₀ ≡ β₀^S − β₀^H, β₁ ≡ β₁^S − β₁^H and ε_i ≡ ε_i^S − ε_i^H. Then, trivially, we model a household as having invested in primary education if:

β₀ + β₁x_i + ε_i ≥ 0.   (1.5)

We can therefore define a latent variable, y_i*:

y_i*(x_i; β₀, β₁) ≡ β₀ + β₁x_i + ε_i.   (1.6)

We can express this latent variable as determining our outcome variable for the ith individual:

y_i = 0 if y_i* < 0;
      1 if y_i* ≥ 0.   (1.7)

So far, so good; but we're still not in a position to estimate the object of interest. To do this, we need to make a distributional assumption.

1.3 The probit model

Assumption 1.1 (DISTRIBUTION OF ε_i): ε_i is i.i.d. with a standard normal distribution, independent of x_i:

ε_i | x_i ∼ N(0, 1).   (1.8)

You will be familiar with the normal distribution, and with the concepts of the probability density and the cumulative density. Recall that the probability density function φ(x) is:

φ(x) = (1/√(2π)) · exp(−x²/2),   (1.9)

and that the cumulative density function Φ(x) has no closed-form expression:

Φ(x) = ∫ from −∞ to x of φ(z) dz.   (1.10)

Figure 1.2 illustrates. With our distributional assumption, we can now write the conditional probability of primary education:²

Pr(y_i = 1 | x_i; β₀, β₁) = Pr(β₀ + β₁x_i + ε_i ≥ 0 | x_i)   (1.11)
                          = Pr(ε_i ≥ −β₀ − β₁x_i | x_i)   (1.12)
                          = Pr(ε_i ≤ β₀ + β₁x_i | x_i)   (1.13)
                          = Φ(β₀ + β₁x_i)   (1.14)
Pr(y_i = 0 | x_i; β₀, β₁) = 1 − Φ(β₀ + β₁x_i).   (1.15)

² Note that equation 1.13 follows from equation 1.12 because, under the assumption of normality, the distribution of ε is symmetric.


Figure 1.2: The standard normal distribution: probability density φ(·) (panel a) and cumulative density Φ(·) (panel b)

Equation 1.14 (and, equivalently, equation 1.15) defines the probit model. There is certainly no need to motivate the probit model with an additive random utility approach, as we have done; indeed, the vast majority of empirical papers (and econometrics textbooks) start merely with some equivalent to equation 1.14. But the additive random utility approach is useful (i) for thinking about how microeconomic models may motivate econometric analysis, (ii) for explaining the latent variable interpretation of the probit model, and (iii) to lay the conceptual groundwork for other limited dependent variable models that we will discuss later (in particular, models of multinomial choice). Train (2009, page 14) explains the utility-based approach in discrete choice models as follows:

Discrete choice models are usually derived under an assumption of utility-maximising behaviour by the decision maker... It is important to note, however, that models derived from utility maximisation can also be used to represent decision making that does not entail utility maximisation. The derivation assumes that the model is consistent with utility maximisation; it does not preclude the model from being consistent with other forms of behaviour. The models can also be seen as simply describing the relation of explanatory variables to the outcome of a choice, without reference to exactly how the choice is made.


1.4 Estimation by maximum likelihood

1.4.1 The log-likelihood

Equation 1.14 defines the probit model. But this still requires a method of estimation. The method used for the probit model is maximum likelihood.³ For the ith individual, the likelihood can be written as:

L_i(β₀, β₁; y_i | x_i) = Pr(y_i = 1 | x_i; β₀, β₁)^(y_i) · Pr(y_i = 0 | x_i; β₀, β₁)^(1−y_i)   (1.16)
                       = Φ(β₀ + β₁x_i)^(y_i) · [1 − Φ(β₀ + β₁x_i)]^(1−y_i).   (1.17)

The log-likelihood, therefore, is:

ℓ_i(β₀, β₁; y_i | x_i) = y_i ln Φ(β₀ + β₁x_i) + (1 − y_i) ln[1 − Φ(β₀ + β₁x_i)].   (1.18)

Denoting the stacked values of y_i and x_i as y and x respectively, we can write the log-likelihood for a sample of N individuals as:

ℓ(β₀, β₁; y | x) = Σ_{i=1}^{N} { y_i ln Φ(β₀ + β₁x_i) + (1 − y_i) ln[1 − Φ(β₀ + β₁x_i)] }.   (1.19)

You will be aware that several numerical algorithms may be used to find the values (β̂₀, β̂₁) that jointly maximise this log-likelihood: for example, the Newton-Raphson method, the Berndt-Hall-Hall-Hausman algorithm, the Davidson-Fletcher-Powell algorithm, the Broyden-Fletcher-Goldfarb-Shanno algorithm, etc. Happily, Stata (and other statistical packages) has these algorithms built in, so we can use them without having to code them ourselves.
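As an illustration of what these routines are doing, the log-likelihood in equation 1.18 can also be coded by hand using Stata's ml command. The following is a minimal sketch, assuming the variable names from the appendix to this lecture; the program name myprobit is just an illustrative label. (See Gould, Pitblado and Sribney (2006) for a full treatment of this syntax.)

program define myprobit
    args lnf xb                                   // lnf: observation log-likelihood; xb: the index b0 + b1*x
    quietly replace `lnf' = ln(normal(`xb'))     if $ML_y1 == 1
    quietly replace `lnf' = ln(1 - normal(`xb')) if $ML_y1 == 0
end
ml model lf myprobit (primaryplus = yborn)        // declare the model in "linear form"
ml maximize                                       // maximise (by Newton-Raphson, the default)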
1.4.2 Properties of the maximum likelihood estimator

Before we go on with our probit example, we should briefly revise several important properties of maximum likelihood estimators. Suppose that we have an outcome vector, y, and a matrix of explanatory variables, X; further, suppose that we are interested in fitting a parameter vector θ. You will recall that we can generally specify the log-likelihood as:

ℓ(θ; y | X) = ln f(y | X; θ);   (1.20)

that is, the log-likelihood is the log of the conditional probability density (or probability mass) of y, given X, for some parameter value θ. This can formally be described as the conditional log-likelihood function, but we usually just term it the log-likelihood.⁴ Further, you will recall that, if we assume observations are independent across individuals, we can decompose the log-likelihood as:

ℓ(θ; y | X) = Σ_{i=1}^{N} ℓ_i(θ; y_i | x_i) = Σ_{i=1}^{N} ln f(y_i | x_i; θ).   (1.21)

³ However, this is certainly not the only way we could estimate the probit model. For example, equation 1.14 implies that E(y_i | x_i) = Φ(β₀ + β₁x_i); the model could therefore also be estimated by Nonlinear Least Squares (i.e. a method-of-moments estimator).
⁴ Cameron and Trivedi (page 139) note that ignoring the marginal likelihood of X is not a problem if f(y | X) and f(X) depend on mutually exclusive sets of parameters; that is, if there is no endogeneity problem.

The maximum likelihood estimate θ̂_ML therefore solves:

∂ℓ(θ; y | X)/∂θ |_{θ = θ̂_ML} = 0,   (1.22)

where the left-hand side of this expression is called the "score vector".
You will recall further that all maximum likelihood estimators share at least four important properties:

(i) Consistency: In general terms, an estimator is consistent if, as the number of observations becomes very large, the probability of the estimator missing the true parameter value goes to zero. Suppose that we are trying to estimate some true scalar parameter θ, and that we are using a maximum likelihood estimator θ̂_ML, with N observations in our sample. Then consistency means that, for any ε > 0,

lim_{N→∞} Pr(|θ̂_ML − θ| > ε) = 0.   (1.23)

We can describe this by saying that θ̂_ML "converges in probability" to the true value θ, and we can write:

plim θ̂_ML = θ.   (1.24)

(ii) Asymptotic normality: Assuming some regularity conditions, the asymptotic distribution of a maximum likelihood estimator is normal:⁵

√N·(θ̂_ML − θ) →d N(0, I(θ)⁻¹),   (1.25)

where I(θ) = −E[∂²ℓ_i(θ; y_i | x_i) / ∂θ∂θ′].   (1.26)

We generally estimate I(θ) using:

Î(θ̂_ML) = −(1/N) Σ_{i=1}^{N} ∂²ℓ_i(θ)/∂θ∂θ′ |_{θ = θ̂_ML}.   (1.27)

An alternative approach is to use the outer product of the gradient (sometimes known as the BHHH estimate, or just the "OPG"):

Î_OPG(θ̂_ML) = (1/N) Σ_{i=1}^{N} [∂ℓ_i(θ)/∂θ]·[∂ℓ_i(θ)/∂θ]′ |_{θ = θ̂_ML}.   (1.28)

As Gould et al explain, the OPG "has the same asymptotic properties as [Î(θ̂_ML)⁻¹], the conventional variance estimator" (page 11).⁶ One advantage of the OPG is that it does not require calculation of the second derivatives of the log-likelihood; if we are maximising our log-likelihood with an algorithm that does not itself require those second derivatives (for example, the BHHH method), the OPG method may therefore be computationally more convenient (see, for example, Gould et al, pages 11-12).

Suppose, then, that we want to test a hypothesis H₀: θ = θ₀. The estimated covariance matrix Î(θ̂_ML)⁻¹ (or its OPG equivalent) can be used to perform a Wald test. Alternatively, we can perform a Likelihood Ratio test (a minimal Stata sketch follows this list):

2·[ℓ(θ̂_ML) − ℓ(θ₀)] ∼ χ²(k),   (1.29)

where k is the number of restricted parameters in θ₀.

(iii) Efficiency: Equations 1.25 and 1.26 show that, asymptotically, the variance of maximum likelihood estimators is the Cramér-Rao lower bound (i.e. the inverse of the Fisher information matrix). That is, maximum likelihood estimators are efficient: the asymptotic variance of the maximum likelihood estimator is at least as small as the variance of any other consistent estimator of the parameter.

(iv) Invariance: If γ = f(θ) is a one-to-one, continuous and continuously differentiable function, γ̂_ML = f(θ̂_ML).

⁵ See, for example, page 392 of Wooldridge (2002) for those conditions. In this context, the conditions are: (i) that θ is interior to the set of possible values for θ, and (ii) that the log-likelihood is twice continuously differentiable on the interior of that set. We will not need to worry about these conditions in these lectures.
⁶ I use square brackets in the quote to show where I have inserted our notation in place of Gould et al's. I use this notation at several points in these notes.
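As promised, here is a minimal Stata sketch of the Likelihood Ratio test in equation 1.29, assuming the variables from our Tanzanian example; the restricted model drops yborn, so here k = 1:

probit primaryplus yborn                // unrestricted model
estimates store unrestricted
probit primaryplus                      // restricted model: a constant only
estimates store restricted
lrtest unrestricted restricted          // computes 2[l(unrestricted) - l(restricted)]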
1.4.3 Goodness of fit in the probit model

For simplicity, let's return to our earlier example of a probit model with a single explanatory variable. You will be familiar with the R² statistic from linear regression models; this statistic reports the proportion of variation in the outcome variable that is explained by variation in the regressors. Unfortunately, this statistic does not generalise naturally to the maximum likelihood context. Instead, the standard goodness-of-fit statistic for maximum likelihood estimates is McFadden's Pseudo-R²:

R²_p ≡ 1 − ℓ(θ̂)/ℓ₀,   (1.30)

where ℓ(θ̂) is the value of the maximised log-likelihood, and ℓ₀ is the log-likelihood for a model without explanatory variables (so, in the context of our probit model, ℓ₀ is the log-likelihood for a probit estimation using Pr(y_i = 1 | x) = Φ(β₀)). You should confirm that we will always obtain R²_p ∈ (0, 1).

Additionally, in a binary outcome model, we may wish to report the "percent correctly predicted". Wooldridge (2002, page 465) explains:

For each i, we compute the predicted probability that y_i = 1, given the explanatory variables, x_i. If [Φ(β̂₀ + β̂₁x_i) > 0.5], we predict y_i to be unity; if [Φ(β̂₀ + β̂₁x_i) ≤ 0.5], y_i is predicted to be zero. The percentage of times the predicted y_i matches the actual y_i is the percent correctly predicted. In many cases it is easy to predict one of the outcomes and much harder to predict another outcome, in which case the percent correctly predicted can be misleading as a goodness-of-fit statistic. More informative is to compute the percent correctly predicted for each outcome, y = 0 and y = 1.
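Both statistics are simple to compute after estimation. The following is a minimal Stata sketch, again assuming the variable names from the appendix; after probit, Stata stores the maximised log-likelihood in e(ll) and the constant-only log-likelihood in e(ll_0), and phat and yhat are just illustrative names:

probit primaryplus yborn
display 1 - e(ll)/e(ll_0)            // McFadden's Pseudo-R2 (equation 1.30)
predict phat                         // predicted probability of success
generate yhat = phat > 0.5           // predict y = 1 if the fitted probability exceeds 0.5
tabulate primaryplus yhat            // cross-tabulate outcomes against predictions, by outcome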
1.4.4 Back to Tanzania...

Table 1.1 reports the results of the probit estimation for Tanzania (see column (1)). We estimate β̂₀ = −90.395 and β̂₁ = 0.046; both estimates are highly significant. Columns (2) and (3) show respectively the estimated mean marginal effect and the estimated marginal effect at the mean (that is, the estimated marginal effect for x_i = 1962.627). (We will discuss the concept of marginal effects shortly.) Figure 1.3 shows the predicted probability of primary school attainment: Φ(β̂₀ + β̂₁x_i). (Appendix 1 provides the basic Stata commands for producing these estimates.)
Table 1.1: Probit estimates for primary school attainment in Tanzania

                                     Estimates    Marginal Effects
                                                  Mean effect   Effect at mean
                                     (1)          (2)           (3)
Year born                            0.046        0.015         0.018
                                     (0.001)      (0.0002)      (0.0004)
Const.                               -90.395
                                     (2.075)

Obs.                                 10000
Log-likelihood                       -5684.679
Pseudo-R2                            0.165
Successes correctly predicted (%)    84.7
Failures correctly predicted (%)     57.2
Mean of "year born"                  1962.627

Confidence: *** 99%, ** 95%, * 90%.


Figure 1.3: Probit estimates for primary school attainment in Tanzania

1.5 Normalisations in the probit model

We assumed earlier that ε_i ∼ N(0, 1). But suppose instead that we had assumed, more generally, that ε_i ∼ N(μ, σ²). In that case, we would write the conditional probability as:

Pr(y_i = 1 | x_i; β₀, β₁) = Pr(β₀ + β₁x_i + ε_i ≥ 0 | x_i)   (1.31)
                          = Pr(ε_i ≥ −β₀ − β₁x_i | x_i)   (1.32)
                          = Pr[ (ε_i − μ)/σ ≤ (β₀ + μ)/σ + (β₁/σ)·x_i | x_i ]   (1.33)
                          = Φ[ (β₀ + μ)/σ + (β₁/σ)·x_i ]   (1.34)
Pr(y_i = 0 | x_i; β₀, β₁) = 1 − Φ[ (β₀ + μ)/σ + (β₁/σ)·x_i ].   (1.35)

This clearly presents a problem: the best that we can now do is to identify the objects (β₀ + μ)/σ and β₁/σ. That is, the earlier assumptions that μ = 0 and σ = 1 are identifying assumptions: they are normalisations, without which we cannot identify either β₀ or β₁.


This should not come as a surprise: remember that we can always take a monotone increasing
transformation of a utility function without changing any of the observed choices. This has an
important implication for the way that we interpret the magnitude of parameter estimates from
discrete choice models; as Train explains (2009, page 24, emphasis in original):
The [estimated coefficients in a probit model] reflect, therefore, the effect of the observed variables relative to the standard deviation of the unobserved factors.

1.6 Interpreting the results: Marginal effects

The parameter estimates from a probit model are often difficult to interpret in any intuitive sense; a policymaker, for example, is hardly likely to be impressed if told that the estimated effect of age on primary school completion in Tanzania is β̂₁ = 0.046! Instead, the interpretation of probit estimates tends to focus upon (i) the predicted probabilities of success and, consequently, (ii) the estimated marginal effects.

In a binary outcome model, a given marginal effect is the ceteris paribus effect of changing one individual characteristic upon an individual's probability of success. In the context of the Tanzanian education data, the marginal effects measure the predicted difference in the probability of primary school attainment between individuals born one year apart.

Having estimated the parameters β̂₀ and β̂₁, the estimated marginal effects follow straightforwardly. For an individual i born in year x_i, the predicted probability of completing primary education is:

P̂r(y_i = 1 | x_i; β̂₀, β̂₁) = Φ(β̂₀ + β̂₁x_i).   (1.36)

Had the ith individual been born a year later, (s)he would have a predicted probability of completing primary education of:

Φ(β̂₀ + β̂₁(x_i + 1)).   (1.37)

For the ith individual, the estimated marginal effect of the variable x is therefore:

M_d(x_i; β̂₀, β̂₁) = Φ(β̂₀ + β̂₁(x_i + 1)) − Φ(β̂₀ + β̂₁x_i).   (1.38)

M_d(x_i; β̂₀, β̂₁) provides the marginal effect for the discrete variable x_i. If x_i were continuous (or treated as being continuous for simplicity, an approximation that often works well), we would instead use:

M_c(x_i; β̂₀, β̂₁) = ∂P̂r(y_i = 1 | x_i; β̂₀, β̂₁)/∂x_i   (1.39)
                  = β̂₁ · φ(β̂₀ + β̂₁x_i).   (1.40)
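Equation 1.40 is easy to compute by hand after estimation. A minimal Stata sketch, again assuming the appendix variable names (xbhat and me are just illustrative names):

probit primaryplus yborn
predict xbhat, xb                            // the fitted index b0 + b1*x
generate me = _b[yborn]*normalden(xbhat)     // equation 1.40: b1 * phi(b0 + b1*x)
summarize me                                 // the sample mean of me is the mean marginal effect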


Figure 1.4: Probit estimates and marginal effects for primary school attainment in Tanzania

Figure 1.4 shows the predicted probabilities of success for the Tanzanian data as in Figure 1.3, but superimposes the estimated marginal effects (where, for simplicity, I have treated year of birth as a continuous variable). You will see that the estimated marginal effect is greatest for individuals born in 1959, and that this is the year for which the predicted probability of primary attainment is (approximately) 0.5. This is clearly no coincidence: the function φ(x) is maximised at x = 0, and Φ(0) = 0.5. This restriction (i.e. that the marginal effect is largest for individuals with a predicted probability of 0.5) can be relaxed by, for example, the "scobit" estimator. This estimator, however, is beyond the scope of our course.

Figure 1.4 shows one way of reporting the marginal effects: by calculating the marginal effects separately for all individuals in the sample, and then graphing. But there are several alternative approaches: for example, statistical packages generally report either the average of the marginal effects across the sample, or the marginal effect at the mean of the regressors.⁷ Alternatively, you may wish to take some weighted average of the sample marginal effects, if the sample is unrepresentative of the population of interest. Table 1.1 earlier reported both the mean marginal effect and the marginal effect at the mean.

⁷ Cameron and Trivedi (page 467) prefer the former; they say, "it is best to use... the sample average of the marginal effects. Some programs instead evaluate at the sample average of the regressors...".


In short, the marginal effects are particularly important for binary outcome variables, and it is
generally a very good idea to report marginal effects alongside estimates of the parameters (or,
indeed, instead of them). Standard statistical packages can compute estimated marginal effects
straightforwardly and, similarly, can use the delta method to calculate corresponding standard
errors.

1.7 Appendix to Lecture 1: Stata code

Note: You do not need to know any Stata code for any exam question about limited dependent variables.

First, let's clear Stata's memory and load our dataset:

clear
use WorkingSample

We can then tabulate primaryplus, the dummy variable that records whether or not a respondent has primary education (or higher):

tab primaryplus

Similarly, let's summarise the variable yborn, which records the year in which each respondent was born:

summarize yborn

Time to estimate. We begin with our probit model, explaining primaryplus by yborn (and a constant):

probit primaryplus yborn

Finally, let's calculate marginal effects: first as the mean across the sample, and then as the marginal effect at the mean of yborn:

margins, dydx(yborn)
margins, dydx(yborn) atmean


Lecture 2: Binary Choice II

Required readings (for Lectures 1 and 2):


⋆ CAMERON, A.C. AND TRIVEDI, P.K. (2005): Microeconometrics: Methods and Applications. Cambridge University Press, pages 463-478 (i.e. sections 14.1 to 14.4, inclusive)
or
⋆ WOOLDRIDGE, J. (2002): Econometric Analysis of Cross Section and Panel Data. The MIT Press, pages 453-461 (i.e. sections 15.1 to 15.4, inclusive)
or
⋆ WOOLDRIDGE, J. (2010): Econometric Analysis of Cross Section and Panel Data (2nd ed.). The MIT Press, pages 561-569 (i.e. sections 15.1 to 15.4, inclusive).

Other references:

HARRISON, G. (2011): "Randomisation and Its Discontents", Journal of African Economies, 20(4), 626-652.
ANGRIST, J.D. AND PISCHKE, J.S. (2009): Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.

2.1 The logit model

Lecture 1 considered the probit model: a model of binary choice in which the latent error variable is assumed to have a standard normal distribution. You will recall that, in the context of a single explanatory variable (x_i), this model can be summarised succinctly by our earlier equation 1.14:

Pr(y_i = 1 | x_i; β₀, β₁) = Φ(β₀ + β₁x_i).   (1.14)

An alternative approach is to assume that the latent unobservable has a logistic distribution:

Assumption 2.1 (DISTRIBUTION OF ε_i): ε is i.i.d. with a logistic distribution, independent of x:

Pr(ε ≤ Z | x) = Λ(Z)   (2.1)
              = exp(Z) / [1 + exp(Z)].   (2.2)

Figure 2.1 shows the cdf of the logistic distribution, compared to the normal.


Figure 2.1: Cumulative density functions: normal and logistic distributions

Symmetric to our derivation of the probit specification, we can write:

Pr(y_i = 1 | x_i; β₀, β₁) = Pr(ε_i ≤ β₀ + β₁x_i | x_i)   (2.3)
                          = Λ(β₀ + β₁x_i).   (2.4)

Equation 2.4 is directly analogous to equation 1.14; it defines the logit model. The logit model is
an alternative to the probit model for estimating the conditional probability of a binary outcome.
For any given dataset, the predicted probabilities from a logit model are generally almost identical
to those from a probit model, as we will see later in this lecture.
All of the reasoning from Lecture 1 extends by analogy to the case of the logit model: we can
follow the same principles to (i) form the log-likelihood, (ii) maximise the log-likelihood, (iii)
measure the goodness-of-fit, (iv) normalise our parameter estimates and (v) interpret the marginal
effects. We will not rehearse these principles in this lecture, but you should understand the way
that they extend from the probit case to the logit case.

2.2 The log-odds ratio in the logit model

The probit model and the logit model are almost identical in their implications. However, when we use the logit model, we sometimes speak about the "odds ratio", because this ratio has a natural relationship to the estimated parameters from a logit specification.⁸

Generally, the odds ratio is defined as:

odds ratio = probability of success / probability of failure.   (2.5)

In the context of our Tanzanian problem, and using the logit specification, we can write:

odds ratio_i = Pr(y_i = 1 | x_i; β₀, β₁) / Pr(y_i = 0 | x_i; β₀, β₁)   (2.6)
             = Pr(y_i = 1 | x_i; β₀, β₁) / [1 − Pr(y_i = 1 | x_i; β₀, β₁)]   (2.7)
             = [exp(β₀ + β₁x_i) / (1 + exp(β₀ + β₁x_i))] · [1 / (1 + exp(β₀ + β₁x_i))]⁻¹   (2.8)
             = exp(β₀ + β₁x_i).   (2.9)

In the logit model, we can therefore interpret the index β₀ + β₁x_i as providing the log odds ratio, so that the parameter β₁ shows the effect of x_i on this log ratio:

β₀ + β₁x_i = ln(odds ratio_i)   (2.10)
β₁ = ∂ ln(odds ratio_i) / ∂x_i.   (2.11)

This implies that, for a small change in x_i, the value 100·β₁·Δx_i is approximately the percentage change in the odds ratio.
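As an aside, Stata's logit command will report either the coefficients (the effects on the log odds ratio) or their exponentiated values. A minimal sketch, again assuming the appendix variable names:

logit primaryplus yborn          // coefficients: effects of yborn on the log odds ratio
logit primaryplus yborn, or      // the "or" option reports exp(b), i.e. odds ratios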

2.3 Probit or logit?

Cameron and Trivedi have an extensive discussion of the theoretical and empirical considerations in choosing between the probit or logit model: see pages 471-473. In these notes, I would like simply to emphasise their comment about empirical considerations:

Empirically, either logit or probit can be used. There is often little difference between the predicted probabilities from probit and logit models. The difference is greatest in the tails where probabilities are close to 0 or 1. The difference is much less if interest lies only in marginal effects averaged over the sample rather than for each individual.

Figure 2.2 illustrates this point by comparing estimates from the Tanzanian data.

⁸ Of course, this doesn't mean that we can't talk about the odds ratio when discussing other models; just that the ratio has a more intuitive relationship to the parameters of interest in the logit model.


Figure 2.2: Probit estimates and logit estimates for primary school attainment in Tanzania

2.4 The Linear Probability Model

To this point, we have considered two models: probit and logit. We have specified these models in terms of a conditional probability of success, but we could equally specify them in terms of the conditional expectation of the outcome variable:

E(y_i | x_i; β₀, β₁) = 1 · Pr(y_i = 1 | x_i; β₀, β₁) + 0 · Pr(y_i = 0 | x_i; β₀, β₁)   (2.12)
                     = Pr(y_i = 1 | x_i; β₀, β₁).   (2.13)

Thus, for the probit model, we used:

E(y_i | x_i; β₀, β₁) = Φ(β₀ + β₁x_i);   (2.14)

for the logit model, we used:

E(y_i | x_i; β₀, β₁) = Λ(β₀ + β₁x_i).   (2.15)

A simpler approach is to assume that the conditional probability of success (and, therefore, the conditional expectation of the outcome) is linear in the explanatory variable(s):

E(y_i | x_i; β₀, β₁) = β₀ + β₁x_i   (2.16)
⇒ y_i = β₀ + β₁x_i + ε_i,   (2.17)

where E(ε_i | x_i) = 0.
This is known as the Linear Probability Model, or LPM for short. As equation 2.17 implies,
the parameters of interest for the LPM can be obtained very simply: just use OLS. We will not
rehearse here the principles involved in OLS estimation.
2.4.1 Predicted probabilities and marginal effects in the Linear Probability Model

Predicted probabilities in the LPM are trivial:

P̂r(y_i = 1 | x_i; β̂₀, β̂₁) = β̂₀ + β̂₁x_i.   (2.18)

Note that nothing constrains this predicted probability to lie in the unit interval. We will return to this point shortly.

Marginal effects in the LPM are similarly trivial: whether x_i is discrete or continuous, its estimated marginal effect is β̂₁. Note that this marginal effect is identical across all values of x.
2.4.2 Heteroskedasticity in the Linear Probability Model

The Linear Probability Model generally produces heteroskedastic errors. We can illustrate this straightforwardly using our simple example; for a given x_i, we have:

ε_i = 1 − β₀ − β₁x_i with conditional probability β₀ + β₁x_i;
      −β₀ − β₁x_i with conditional probability 1 − β₀ − β₁x_i.   (2.19)

Figure 2.3 illustrates. We know that Var(ε_i | x_i) = E(ε_i² | x_i) − [E(ε_i | x_i)]², and, as we noted earlier, E(ε_i | x_i) = 0. We therefore have:

Var(ε_i | x_i) = E(ε_i² | x_i)   (2.20)
               = Pr(y_i = 1 | x_i)·(1 − β₀ − β₁x_i)² + Pr(y_i = 0 | x_i)·(−β₀ − β₁x_i)²   (2.21)
               = (β₀ + β₁x_i)·(1 − β₀ − β₁x_i)² + (1 − β₀ − β₁x_i)·(β₀ + β₁x_i)²   (2.22)
               = (β₀ + β₁x_i)·(1 − β₀ − β₁x_i).   (2.23)

Therefore, Var(ε_i | x_i) depends upon x_i so long as β₁ ≠ 0.⁹ The simplest way of dealing with this problem is to use White's heteroskedasticity-robust standard errors (which can be implemented straightforwardly in Stata using the robust option). Alternatively, we could use Weighted Least Squares; this produces more efficient estimates, but requires predicted probabilities to lie between 0 and 1 (which, as we discussed, is not guaranteed in the LPM).

⁹ Note, then, that if we are testing a null hypothesis H₀: β₁ = 0 in this model, we do not need to worry about heteroskedasticity.
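As an aside, the Weighted Least Squares alternative is also straightforward to code. A minimal Stata sketch, again assuming the appendix variable names (phat and w are illustrative names; note that observations with fitted probabilities outside (0, 1) receive a missing weight and are dropped, which is precisely the difficulty just noted):

regress primaryplus yborn, robust         // LPM with White's robust standard errors
predict phat                              // fitted probabilities b0 + b1*x
generate w = 1/(phat*(1 - phat)) if phat > 0 & phat < 1
regress primaryplus yborn [aweight=w]     // WLS, weighting by the inverse of the error variance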

Figure 2.3: Heteroskedasticity in the Linear Probability Model
[figure: the two possible values of ε_i, namely 1 − β₀ − β₁x_i and −β₀ − β₁x_i, plotted against x_i]

2.5 LPM or MLE?

2.5.1 Relative advantages and disadvantages

We noted earlier that the difference between probit and logit is very small, both in terms of the estimates that they provide and in terms of their underlying structure. The Linear Probability Model, however, is clearly quite different: for example, as Cameron and Trivedi note (page 466), the Linear Probability Model, unlike probit and logit, does not use a cdf. So which approach should be preferred: probit/logit on the one hand, or the Linear Probability Model on the other?

This can be quite a controversial issue in applied research! On the one hand, many researchers prefer the probit or logit models, on the basis that they constrain predicted probabilities to the unit interval, and that they therefore imply sensible marginal effects across the entire range of explanatory variables. Cameron and Trivedi, for example, say (page 471):

Although OLS estimation with heteroskedastic standard errors can be a useful exploratory data analysis tool, it is best to use the logit or probit MLE for final data analysis.

In a 2011 article about Randomised Controlled Trials (RCT) in the Journal of African Economies, Harrison said this about the use of OLS for limited dependent variable models (footnote omitted):¹⁰

¹⁰ Harrison also discussed the issue in his presentation at the 2011 CSAE Annual Conference, available at http://www.csae.ox.ac.uk/conferences/2011-EdiA/video.html.


One side-effect of the popularity of RCT is the increasing use of Ordinary Least Squares estimators when dependent variables are binary, count or otherwise truncated in some manner. One is tempted to call this the "OLS Gone Wild" reality show, akin to the "Girls Gone Wild" reality TV show, but it is much more sober and demeaning stuff. I have long given up asking researchers in seminars why they do not just report the marginal effects for the right econometric specification. Instead I ask if we should just sack those faculty in the room who seem to waste our time teaching things like logit, count models or hurdle models.

In their book Mostly Harmless Econometrics, Angrist and Pischke (2009, page 94) take a different approach:

Should the fact that a dependent variable is limited affect empirical practice? Many econometrics textbooks argue that, while OLS is fine for continuous dependent variables, when the outcome of interest is a limited dependent variable (LDV), linear regression models are inappropriate and nonlinear models such as probit and Tobit are preferred. In contrast, our view of regression as inheriting its legitimacy from the [Conditional Expectation Function] makes LDVness less central.

That is, the Linear Probability Model can still be used to estimate the average marginal effect. Cameron and Trivedi acknowledge (page 471) that:

The OLS estimator [that is, the Linear Probability Model] is nonetheless useful as an exploratory tool. In practice it provides a reasonable direct estimate of the sample-average marginal effect on the probability that y = 1 as x changes, even though it provides a poor model for individual probabilities. In practice it provides a good guide to which variables are statistically significant.

Further, the Linear Probability Model is sometimes preferred for computational reasons; maximum likelihood models can prove much more difficult to estimate where, for example, there is a very large number of observations or a large number of explanatory variables.
2.5.2 Estimates from Tanzania

Table 2.1 reports estimates from the probit, logit and LPM models for the Tanzanian education example; Figure 2.4 shows the predicted probabilities. Together, the table and figure illustrate several important features of the three models. First, all three models predict very similar mean marginal effects. Second, the mean marginal effect for the Linear Probability Model is identical to the parameter estimate.¹¹ Third, the probit and logit models predict conditional probabilities in the unit interval; in contrast, the LPM implies nonsensical predicted probabilities for people born before about 1925.

¹¹ For this reason, we would never report the estimate and the marginal effect separately for the LPM; I have done so here simply to emphasise their equivalence.


Table 2.1: Probit, logit and LPM results from Tanzania

                        Probit                   Logit                    LPM
                        Estimate   Marginal      Estimate    Marginal     Estimate   Marginal
                        (1)        (2)           (3)         (4)          (5)        (6)
Year born               0.046      0.015         0.077       0.015        0.016      0.016
                        (0.001)    (0.0002)      (0.002)     (0.0002)     (0.0003)   (0.0003)
Const.                  -90.395                  -149.997                 -30.510
                        (2.075)                  (3.651)                  (0.603)

Obs.                    10000                    10000                    10000
Log-likelihood          -5684.679                -5680.831
Pseudo-R2               0.165                    0.165
R2                                                                        0.210
Correctly predicted:
  Successes (%)         84.7                     84.7                     86.3
  Failures (%)          57.2                     57.2                     55.2

Confidence: *** 99%, ** 95%, * 90%.
"Marginal" refers to the mean marginal effect. The Linear Probability Model was run using White's heteroskedasticity-robust standard errors.

Figure 2.4: LPM, probit and logit estimates for primary school attainment in Tanzania

2.6 The single-index assumption

Albert Einstein is sometimes quoted (misquoted, perhaps) as having said that everything should be made as simple as possible, but no simpler. In this spirit, we have considered the probit, logit and LPM models solely in the context of a single (scalar) explanatory variable, x_i. All of the basic principles of these estimators can be understood in this way, so we have not yet considered the multivariate context.

In most empirical applications, however, we have more than one explanatory variable. It is straightforward to take all of our previous reasoning on x_i and generalise it to a vector x_i, by replacing β₀ + β₁x_i with the linear index β′x_i (where, generally, x_i is understood as including an element 1, to allow an intercept). This is how we deal with multiple explanatory variables in the context of the probit, logit and LPM models; thus, in the multivariate case, we specify either:

Pr(y_i = 1 | x_i) = Φ(β′x_i) for probit,   (2.24)
or Pr(y_i = 1 | x_i) = Λ(β′x_i) for logit,   (2.25)
or Pr(y_i = 1 | x_i) = β′x_i for LPM.   (2.26)

This is a very general structure; it permits quite flexible estimation of binary outcome models with a large number of explanatory variables. However, note that the explanatory variables enter linearly through a single index, β′x_i. It is easy to think of functions that violate this assumption. For example, the left surface of Figure 2.5 shows the function y = Φ(3(x₁ + x₂ − 1)); this satisfies the single-index assumption because we can write y = F(β′x). But the right surface in Figure 2.5 shows the function y = 0.5·[Φ(10x₁ − 3) + Φ(10x₂ − 3)]; this cannot be expressed as y = F(β′x), so it violates the single-index assumption.

2.7 A general framework

If we are willing to impose the single-index assumption, we can write the probit, logit and LPM models as special cases of a very general structure:

Pr(y_i = 1 | x_i) = F(β′x_i),   (2.27)

where F(z) = Φ(z) for probit, F(z) = Λ(z) for logit and F(z) = z for the LPM. F can be referred to as a "link function". If F(z) is a cdf (as it is, for example, for probit and logit), then we can rewrite all of our earlier maximum likelihood results more generally in terms of F(z). This is the way, for example, that Cameron and Trivedi introduce probit and logit (see pages 465 to 469 of their text); Wooldridge takes the same approach (see pages 457-458 of his 2002 text).
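This link-function view maps directly onto generalised linear model software. A minimal Stata sketch, again assuming the appendix variable names; each line fits the same binary outcome model under a different choice of F:

glm primaryplus yborn, family(binomial) link(probit)    // F = Phi
glm primaryplus yborn, family(binomial) link(logit)     // F = Lambda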


Figure 2.5: The single-index restriction: an example (left) and a violation (right). The surface on the left shows the function y = Φ(3(x₁ + x₂ − 1)); the graph on the right shows the function y = 0.5·[Φ(10x₁ − 3) + Φ(10x₂ − 3)]. Thus, the surface on the left may be expressed as y = f(β′x), but the surface on the right requires a bivariate function y = g(x₁, x₂).

More generally, equation 2.27 permits semiparametric estimation, in which a researcher can jointly estimate the parameter vector β and the link function F. In our simple application with one explanatory variable, this kind of approach just implies fitting a univariate nonparametric function, y = f(x_i) + ε_i, where E(ε_i | x_i) = 0. Figure 2.6 illustrates, where the function F is fitted with a kernel.

Figure 2.6: Probit and kernel estimates for primary school attainment in Tanzania

You do not need to understand any nonparametric or semiparametric methods for these lectures. However, the underlying point is worth remembering: the probit, logit and LPM models all impose particular assumptions on the data, and we can sometimes relax these assumptions using more flexible estimation techniques. (Some of these techniques are covered in the second-year M.Phil course Advanced Econometrics 1: Microeconometrics.)

2.8 Appendix to Lecture 2: Stata code

Note: You do not need to know any Stata code for any exam question about limited dependent variables.

Let's again clear Stata's memory and load our dataset:

clear
use WorkingSample

We can run a logit estimation with the logit command:

logit primaryplus yborn

We can use the same margins command as for probit:

margins, dydx(yborn)
margins, dydx(yborn) atmean

Finally, we can run the Linear Probability Model using the command regress, for an OLS regression; we add the option robust to calculate White's heteroskedasticity-robust standard errors:

regress primaryplus yborn, robust


Lecture 3: Discrete ordered choice

Required reading:
⋆ CAMERON, A.C. AND TRIVEDI, P.K. (2005): Microeconometrics: Methods and Applications. Cambridge University Press, pages 519-520 (i.e. section 15.9.1)
or
⋆ WOOLDRIDGE, J. (2002): Econometric Analysis of Cross Section and Panel Data. The MIT Press, pages 504-507 (i.e. section 15.10)
or
⋆ WOOLDRIDGE, J. (2010): Econometric Analysis of Cross Section and Panel Data (2nd ed.). The MIT Press, pages 655-659 (i.e. sections 16.3.1 and 16.3.2).

Other references:

CUNHA, F., HECKMAN, J. AND NAVARRO, S. (2007): "The Identification and Economic Content of Ordered Choice Models with Stochastic Thresholds", International Economic Review, 48(4), 1273-1309.

3.1 The concept of ordered choice

In Lectures 1 and 2, we considered the problem of binary outcome variables; we did so by considering Tanzanians' decision whether or not to complete primary school education. In this lecture, we extend our earlier reasoning to consider the problem of discrete ordered choice. To do so, we will continue to work with the Tanzanian ILFS dataset; we will now consider Tanzanians' decision between three choices: (i) not completing primary education, (ii) completing primary education but not secondary education, and (iii) completing secondary education.

We will denote our outcome variable as follows:

y_i = 0 if the ith individual did not complete primary education;
      1 if the ith individual completed primary education but not secondary education;
      2 if the ith individual completed secondary education.   (3.1)

Figure 3.1 shows how attainment of primary and secondary education has changed over time in Tanzania; it plots our new variable y_i against respondents' year of birth (x_i).

Figure 3.1: Primary and secondary school attainment in Tanzania across age cohorts

Note again one of the key characteristics of many limited dependent variable models: the outcome variable is categorical, so the numerical values taken by y_i have no cardinal meaning. There is no sense, for example, in which completing secondary education (y_i = 2) is twice as good, or twice as useful, or twice as anything as completing primary education (y_i = 1).

We will model this education decision as an ordered choice. It is clear that, in some intuitive sense, the categories "no education", "primary education", "secondary education" are ordered; for example, secondary education requires more time than primary education, which itself (obviously) requires more time than no education. This kind of intuitive reasoning often justifies the description of a choice as a discrete ordered choice. Ideally, though, we should be able to go further: we should be able to describe the outcome as a monotone step function of some continuous latent variable. To illustrate what this might mean, we will consider a simple microeconomic model of Tanzanians' investment in education.

3.2 A simple optimal stopping model

Assume that a student obtains some utility from attending school (or, equivalently, pays some utility cost), and that this utility changes with (i) the student's year of birth (x_i), and (ii) the student's unobserved taste for education (ε_i):

u_it^s(x_i) = β₀ + β₁x_i + ε_i.   (3.2)

Additionally, assume that the student may work and receive in-period utility determined by the student's level of education (s_i):

u_it(s_i) = s_i.   (3.3)

We assume that ε_i is known to the student, but unobservable to a researcher.


For simplicity, let's assume that the student faces only three choices: (i) do not attend school (s = 0), (ii) finish primary school (s = 7), and (iii) finish secondary school (s = 12). We will assume that students have a lifetime of known finite duration T > 12 years and, for simplicity, we will make the extreme assumption that students assign equal utility weight to each period.¹² Given these assumptions, we can write three value functions, one corresponding to each choice:

V₀(0, x_i, ε_i) = 0   (3.4)
V₀(7, x_i, ε_i) = 7·(β₀ + β₁x_i + ε_i) + (T − 7)·7   (3.5)
V₀(12, x_i, ε_i) = 12·(β₀ + β₁x_i + ε_i) + (T − 12)·12.   (3.6)

Therefore, the student prefers s = 7 to s = 0 if and only if:¹³

β₀ + β₁x_i + ε_i ≥ (7 − T).   (3.7)

Similarly, the student prefers s = 12 to s = 7 if and only if:

12·(β₀ + β₁x_i + ε_i) + (T − 12)·12 ≥ 7·(β₀ + β₁x_i + ε_i) + (T − 7)·7   (3.8)
⇔ β₀ + β₁x_i + ε_i ≥ (19 − T).   (3.9)

We can therefore define two cutpoints,

α₁ = (7 − T) − β₀   (3.10)
α₂ = (19 − T) − β₀,   (3.11)

and express the ith student's decision (y_i) as an ordered choice in the latent variable β₁x_i + ε_i:

y(x_i, ε_i; β₁, α₁, α₂) = 0 if β₁x_i + ε_i < α₁;
                          1 if β₁x_i + ε_i ∈ [α₁, α₂);
                          2 if β₁x_i + ε_i ≥ α₂.   (3.12)

Our simple optimal stopping model therefore implies that y(x, ε; β₀, β₁) is a monotone step function in β₁x_i + ε_i. Cunha, Heckman and Navarro (2007) discuss several classes of models (including a dynamic schooling choice model) that imply this kind of monotone step function solution. Figure 3.2 illustrates.

Notice that, in this model, the observable covariate x affects the latent index rather than the cutpoints; for this reason, we can describe the Ordered Probit as a model of "index shift". There are two reasons that this result matters for empirical analysis:

(i) We may wish to exploit the ordered nature of the outcome variable for more efficient estimation;
(ii) We may wish to test the microeconomic model by testing the implication that x affects educational choice through index shift.

¹² That is, we will use a subjective discount factor of 1.
¹³ We will assume that the indifferent student chooses the higher level of education.

Figure 3.2: Optimal schooling, y(x_i, ε_i; β₁, α₁, α₂), as a monotone step function in β₁x_i + ε_i (with steps at the cutpoints α₁ and α₂)

3.3 The Ordered Probit

The implications of our simple optimal stopping model are important. However, we need more before we can take these implications to data: once again, we need a distributional assumption about ε. We will make the same assumption that we made in Lecture 1.

Assumption 3.1 (DISTRIBUTION OF ε_i): ε_i is i.i.d. with a standard normal distribution, independent of x_i:

ε_i | x_i ∼ N(0, 1).   (3.13)

Armed with this assumption, it is straightforward to write the log-likelihood for the ith individual:

ℓ_i(β₁, α₁, α₂; y_i | x_i) = ln Φ(α₁ − β₁x_i)                     if y_i = 0;
                             ln[Φ(α₂ − β₁x_i) − Φ(α₁ − β₁x_i)]   if y_i = 1;
                             ln[1 − Φ(α₂ − β₁x_i)]               if y_i = 2.   (3.14)
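In Stata, this model can be fitted with the oprobit command. A minimal sketch, assuming a hypothetical variable eduLevel coded 0/1/2 as in equation 3.1:

oprobit eduLevel yborn                       // fits b1 and the cutpoints a1 and a2
margins, dydx(yborn) predict(outcome(2))     // mean marginal effect on Pr(y = 2)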

3.4 Marginal effects in the Ordered Probit model

Marginal effects in the Ordered Probit model are directly analogous to marginal effects in the probit model. For simplicity, we will consider only the case in which x_i is continuous. Consider first the marginal effects for the extreme categories, y_i = 2 and y_i = 0. Following the reasoning in subsection 1.6, we have:




M₀(x_i; β̂₁, α̂₁) = ∂P̂r(y_i = 0 | x_i; β̂₁, α̂₁)/∂x_i = −β̂₁ · φ(α̂₁ − β̂₁x_i), and   (3.15)
M₂(x_i; β̂₁, α̂₂) = ∂P̂r(y_i = 2 | x_i; β̂₁, α̂₂)/∂x_i = β̂₁ · φ(α̂₂ − β̂₁x_i).   (3.16)

For the intermediate category, we can find the marginal effect simply by considering the effect of x_i at both cutoffs:

M₁(x_i; β̂₁, α̂₁, α̂₂) = ∂P̂r(y_i = 1 | x_i; β̂₁, α̂₁, α̂₂)/∂x_i = β̂₁ · [φ(α̂₁ − β̂₁x_i) − φ(α̂₂ − β̂₁x_i)].   (3.17)

These principles generalise naturally to the case where x_i is discrete, and to the case in which there are more than three categories.

3.5 The Ordered Probit in Tanzania

Table 3.1 shows the estimates from the Ordered Probit model for our Tanzanian data: we estimate β̂₁ = 0.039, α̂₁ = 76.672 and α̂₂ = 78.517. Columns (2) and (3) respectively show the mean marginal effects for the outcomes y = 1 and y = 2 (that is, I omit the mean marginal effect for outcome y = 0; you should be able to calculate this, however). Figure 3.3 shows the consequent predicted probabilities.
Table 3.1: Estimates from Tanzania: Ordered Probit

                  Estimates    Mean Marginal Effects
                               y = 1       y = 2
                  (1)          (2)         (3)
Year born         0.039        0.008       0.005
                  (0.001)      (0.002)     (0.002)
Cutoff 1 (α̂₁)     76.672
                  (1.886)
Cutoff 2 (α̂₂)     78.517
                  (1.890)

Obs.              10000
Log-likelihood    -7972.275
Pseudo-R2         0.105

Confidence: *** 99%, ** 95%, * 90%.

Figure 3.3: Ordered Probit estimates for primary and secondary school attainment in Tanzania across age cohorts

3.6 The Generalised Ordered Probit

We just noted that the Ordered Probit model is a model of index shift: the observable variable x_i affects the latent index β1·x_i + ε_i. This was justified by our simple optimal stopping model, in which year of birth (x_i) directly affected each student's utility from attending school. This model (i.e. both the optimal stopping model and the Ordered Probit) therefore implied that we could summarise the effect of age on both primary and secondary education with a single parameter: β1. This implies, for example, that if we estimate that, over time, students are more likely to complete primary school (i.e. Pr(y_i = 0 | x_i) is decreasing), we must also estimate that students are more likely to complete secondary school (i.e. that Pr(y_i = 2 | x_i) is increasing). This is implied in equations 3.15 and 3.16: the marginal effects on the largest and smallest outcomes must have opposite signs.
However, we might be concerned that this structure is too restrictive. After all, there might be many good reasons that, over time, students have become less likely to complete secondary education and less likely to complete no education at all (i.e. with more students stopping after primary school). This might be the case if, for example, the cost of primary education has fallen over time but the cost of secondary education has increased. In that case, we may still believe that educational choice is a monotone step function in the student's unobserved taste for education (ε_i), but we may
want to allow the explanatory variable to affect each cutpoint differently. That is, we may want to use a model of 'cutpoint shift', rather than of 'index shift':

    y(x_i, ε_i; μ1, μ2, δ1, δ2) =
        0 if ε_i < μ1 − δ1·x_i;
        1 if ε_i ∈ [μ1 − δ1·x_i, μ2 − δ2·x_i);      (3.18)
        2 if ε_i ≥ μ2 − δ2·x_i.

This model is identical to our earlier model in the special case δ1 = δ2 = β1. But, by allowing δ1 and δ2 to vary separately, we can allow for a more flexible model while still exploiting the ordered structure of the decision. (Note, of course, that we haven't gone back to modify our simple optimal stopping model to reflect this change; however, we could certainly do so, for example by allowing x_i to affect the cost of each schooling level differently.) Figure 3.4 illustrates this more general model.

Figure 3.4: Optimal schooling as a monotone step function in ε_i
[Figure: a step function of ε_i taking values 0, 1 and 2, with jumps at the cutoffs μ1 − δ1·x_i and μ2 − δ2·x_i.]

If we maintain the assumption that ε_i has a standard normal distribution, we can describe this new model as a 'Generalised Ordered Probit'. We will not write the log-likelihood for this model, but it is straightforward and directly analogous to the log-likelihood for the Ordered Probit. Table 3.2 shows the estimation results, with estimated mean marginal effects. Compared to the Ordered Probit, the Generalised Ordered Probit implies a slightly higher mean marginal effect upon the probability of primary education (y_i = 1), but a slightly lower effect upon the probability of secondary (y_i = 2).

Table 3.2: Estimates from Tanzania: Generalised Ordered Probit

                    Estimates      Mean Marginal Effects
                                   y = 1         y = 2
Year born                          0.013         0.001
                                  (0.0002)      (0.0002)
Cutoff 1:
  Year born          0.045
                    (0.001)
  Const.            88.725
                    (2.017)
Cutoff 2:
  Year born          0.01
                    (0.001)
  Const.            21.794
                    (2.882)
Obs.                10000
Log-likelihood      -7779.772
Pseudo-R2            0.126

Confidence: *** 99%, ** 95%, * 90%.

Figure 3.5 shows the consequent predicted probabilities. Figure 3.6 shows the estimated cutoff functions for the Generalised Ordered Probit (that is, μ1 − δ1·x_i and μ2 − δ2·x_i) along with the cutoffs for the Ordered Probit (μ1 − β1·x_i and μ2 − β1·x_i). The diagram shows the fundamental difference between the Ordered Probit and Generalised Ordered Probit: the Ordered Probit restricts the cutoff functions to be parallel. Of course, this may or may not be a valid (or useful) restriction, depending on our particular empirical context. We can test the restriction straightforwardly: you should verify that, using a Likelihood Ratio test, we can compare the results in Tables 3.2 and 3.1 and obtain LR = 2 × (7972.275 − 7779.772) = 385.006, which implies a tiny p-value when compared to a χ²(1) distribution.
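A minimal sketch of this test in Stata, assuming both models have been stored with estimates store (the force option is needed because Stata cannot verify nesting across two different estimation commands):

oprobit educ_cat yborn
estimates store op
goprobit educ_cat yborn
estimates store gop
* LR = 2 x (7972.275 - 7779.772) = 385.006, compared to chi2(1)
lrtest gop op, force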

Figure 3.5: Generalised Ordered Probit estimates for primary and secondary school attainment in Tanzania across age cohorts

Figure 3.6: Estimated cutoff functions: Ordered Probit and Generalised Ordered Probit

3.7 The Ordered Logit and Generalised Ordered Logit

Recall that, in the binary outcome case, the probit model is motivated by the assumption that the latent error term has a normal distribution, and the logit model is motivated by the assumption that the error has a logistic distribution. In this lecture, we have considered the Ordered Probit and the Generalised Ordered Probit. Both specifications have relied upon the assumption that ε has a normal distribution. However, as in the binary outcome case, we could assume instead that ε has a logistic distribution. By analogy to the binary outcome case, we would then call our estimators the Ordered Logit and the Generalised Ordered Logit.
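In Stata, the logistic analogues use the same syntax as their probit counterparts; a minimal sketch (gologit2 is a user-written command, so it may need to be installed first):

ologit educ_cat yborn
* ssc install gologit2
gologit2 educ_cat yborn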

3.8 The Linear Probability Model and discrete ordered choice
In Lecture 2, we considered the Linear Probability Model as an alternative to the probit or logit model. We can also use a Linear Probability Model as an alternative to the Generalised Ordered Probit (or Generalised Ordered Logit). It would be tempting to write such an alternative like this:

    y_i = α0 + α1·x_i + u_i,      (3.19)

where y_i again refers to our three-outcome measure of educational achievement and x_i is again year of birth. That is, we could simply run an OLS regression of y_i on x_i. However, it is very difficult (if not impossible) to justify this approach. The reason for the difficulty is simple: as we noted earlier, y_i is a categorical outcome, where the values 0, 1 and 2 have no cardinal meaning. It is therefore not meaningful to talk about a 'one unit increase' in the outcome variable (for example, we cannot interpret our estimate of α1 in terms of a marginal effect on a conditional probability). Unfortunately, it is not uncommon to see researchers using specifications like equation 3.19 for studying discrete outcomes.
Instead, we ought to estimate in a way that respects the categorical nature of the dependent variable. If we want to use a linear probability structure, we can do this by using multiple LPM estimates. In our Tanzanian example, we can do this by defining two new binary outcomes:

    p_i = 1 if y_i = 1, and p_i = 0 if y_i ≠ 1;      (3.20)
    s_i = 1 if y_i = 2, and s_i = 0 if y_i ≠ 2.      (3.21)
We can then run two separate Linear Probability Models to estimate the effect of x_i on the probability of primary completion and the probability of secondary completion:

    p_i = γ0 + γ1·x_i + u_i      (3.22)
    s_i = θ0 + θ1·x_i + v_i.     (3.23)

Figure 3.7 shows the resulting estimates. We can compare this graph directly to Figure 3.5. Of
course, the estimates illustrated in Figure 3.7 are not necessarily good estimates: all of the objections to the Linear Probability Model that we discussed in Lecture 2 still apply. Arguably, these
objections apply with even more force where the dependent variable has multiple outcomes: we may think that it is even more important, in this case, to use an estimator that can be rationalised in terms of an underlying economic structure. But these estimates can at least be defended as providing reasonable estimates of the marginal effect of x_i on the probability of choosing y_i = 1 and the probability of choosing y_i = 2. Unfortunately, this is not something that we can say about equation 3.19.[14]
Figure 3.7: Linear Probability Model estimates for primary and secondary school attainment
in Tanzania across age cohorts

[14] I have introduced this 'multiple LPM' approach as an alternative to the Generalised Ordered Probit. We could also use it as an alternative to models of discrete multinomial choice (i.e. unordered choice), which we will discuss in Lecture 4. However, as we will see in that lecture, models of unordered choice have traditionally placed particular emphasis upon having choice-theoretic foundations, which, as we saw in Lecture 2, the Linear Probability Model does not provide.

3.9 Appendix to Lecture 3: Stata code

Note: You do not need to know any Stata code for any exam question about limited dependent
variables.
We can start by clearing the memory and loading the data as we did previously. Then, we can
tabulate our categorical education variable:
tab educ_cat
We can run an Ordered Probit with the oprobit command:
oprobit educ_cat yborn
We can then calculate mean marginal effects for the outcomes y = 1 and y = 2:
margins, dydx(yborn) predict(outcome(1))
margins, dydx(yborn) predict(outcome(2))
We can then fit the Generalised Ordered Probit using the goprobit command:
goprobit educ_cat yborn
(Note that goprobit actually estimates, in our terminology, −μ1 and −μ2, rather than μ1 and μ2. Note also that this command may not be installed on your computer; the command is not currently built in to Stata, so you may have to download it separately.)
The same margins commands will then calculate mean marginal effects.
To use multiple Linear Probability Models instead, we can generate new dummy variables, then
run OLS regressions:
gen p = (educ_cat == 1)
gen s = (educ_cat == 2)
reg p yborn, robust
reg s yborn, robust

Lecture 4: Discrete multinomial choice

Required reading:
* Cameron, A.C. and Trivedi, P.K. (2005): Microeconometrics: Methods and Applications. Cambridge University Press, pages 490–506 (i.e. sections 15.1 to 15.5.3, inclusive); or
* Wooldridge, J. (2002): Econometric Analysis of Cross Section and Panel Data. The MIT Press, pages 497–502 (i.e. section 15.9.1 and part of section 15.9.2); or
* Wooldridge, J. (2010): Econometric Analysis of Cross Section and Panel Data (2nd ed.). The MIT Press, pages 643–649 (i.e. sections 16.1, 16.2.1 and part of 16.2.2).

Other references:
Lewis, W.A. (1954): 'Economic Development with Unlimited Supplies of Labour', The Manchester School, 22(2), 139–191.
McFadden, D. (1974): 'The Measurement of Urban Travel Demand', Journal of Public Economics, 3(4), 303–328.
McFadden, D. (2000): 'Economic Choices', Nobel Prize Lecture, 8 December 2000.

4.1 Occupational choice in Tanzania

Travel demand forecasting has long been the province of transportation engineers, who have built up over the years considerable empirical wisdom and a repertory of largely ad hoc models which have proved successful in various applications. . . [but] there still does not exist a solid foundation in behavioral theory for demand forecasting practices. Because travel behavior is complex and multifaceted, and involves non-marginal choices, the task of bringing economic consumer theory to bear is a challenging one. Particularly difficult is the integration of a satisfactory behavioural theory with practical statistical procedures for calibration and forecasting.
McFadden (1974, emphasis added)

The main sources from which workers come as economic development proceeds are subsistence agriculture, casual labour, petty trade, domestic service, wives and daughters in the household, and the increase of population. In most but not all of these sectors, if the country is overpopulated relatively to its natural resources, the marginal productivity of labour is negligible, zero, or even negative.
Lewis (1954)
In this lecture, we consider occupational choice in Tanzania. Both geographically and conceptually, the Tanzanian labour market is a long way from the San Francisco Bay Area Rapid Transit system (the 'BART'). Nonetheless, we will analyse occupational choice using some of the econometric methods that Daniel McFadden famously developed to predict demand for the new BART. Like McFadden, our concern shall be to estimate the conditional probability of various discrete and unordered choices, and to do so with (as McFadden termed it) 'a solid foundation in behavioral theory'.
Let's begin in Tanzania. Figure 4.1 describes occupational choice among employed Tanzanians of different ages. The figure, and our subsequent analysis, uses a ternary outcome variable, covering three mutually exclusive categories:

    y_i = 1 if the ith respondent works in agriculture;
          2 if the ith respondent is self-employed (outside of agriculture);      (4.1)
          3 if the ith respondent is wage employed (outside of agriculture).
Figure 4.1: Occupational categories and age in Tanzania

We will not worry too much today about why occupational choice in Tanzania might matter; as in earlier lectures, we will use the Tanzanian data as an illustrative vehicle for our econometric techniques. However, it is not difficult to see why this kind of occupational choice might be important for understanding Tanzania's development; the quote from Lewis's famous 1954 paper,
for example, highlights sectoral shifts as an important mechanism for long-run development, and
the data in Figure 4.1 might provide insights into the flexibility with which workers can achieve
such shifts. (For example, if older workers are more likely to choose employment in agriculture
than younger workers, this may suggest some switching costs between sectors.)

4.2 An Additive Random Utility Model
As in earlier lectures, we will motivate our econometric methods by a simple underlying microeconomic model. Typically, this kind of choice-theoretic foundation is more common in the analysis of discrete unordered choice than in the models that we have studied earlier. For example, the latent variable interpretation is a useful approach for thinking about the probit and logit models, but is not generally a starting point for analysis; similarly, an optimal stopping model is just one possible foundation for models of discrete ordered choice. But in the analysis of discrete unordered choice, an additive random utility model is a common starting point. As Cameron and Trivedi (page 506) explain:

The econometrics literature has placed great emphasis in restricting attention to multinomial models that are consistent with maximisation of a random utility function. This is similar to restricting analysis to demand functions that are consistent with consumer choice theory.

Suppose, therefore, that we again have data on N individuals, indexed i ∈ {1, . . . , N}. Assume that each individual makes a choice y_i = j, where there are a finite number J of options available. Critically, suppose that we observe information at the level of each individual, including his/her choices (that is, we observe x_i and y_i). That is, we do not observe information at the level of each available option; we will consider this alternative kind of data structure later in this lecture. You may query how reasonable it may be to model occupational outcomes purely as a matter of choice (after all, could an agricultural employee simply choose to take a wage job?), but we will leave this concern aside for this lecture.
As in Lecture 1, we will assume that the ith individual's utility from the jth choice is determined by an additive random utility model:

    U_ij(x_i) = β0^(j) + β1^(j)·x_i + ε_ij.      (4.2)

Thus, for example, for choices j ∈ {1, 2, 3}, the individual obtains the following utilities:

    U_i1(x_i) = β0^(1) + β1^(1)·x_i + ε_i1      (4.3)
    U_i2(x_i) = β0^(2) + β1^(2)·x_i + ε_i2      (4.4)
    U_i3(x_i) = β0^(3) + β1^(3)·x_i + ε_i3.     (4.5)

Together, these three utilities determine the choice of an optimising agent. Figure 4.2 illustrates preferences between the three options in two-dimensional space; in each box, the bold outcome represents the agent's choice.
Figure 4.2: Multinomial choice among three options
[Figure: two-dimensional space divided into regions by the boundaries U3(x) = U1(x), U3(x) = U2(x) and U2(x) = U1(x); in each region, the bold outcome marks the agent's choice among options 1, 2 and 3.]
We can, therefore, express the conditional probability of the ith individual choosing, say, option 1:

    Pr(y_i = 1 | x_i) = Pr( U^(1)(x_i) > U^(2)(x_i) and U^(1)(x_i) > U^(3)(x_i) | x_i )      (4.6)
        = Pr( β0^(1) + β1^(1)·x_i + ε_i1 > β0^(2) + β1^(2)·x_i + ε_i2 and
              β0^(1) + β1^(1)·x_i + ε_i1 > β0^(3) + β1^(3)·x_i + ε_i3 | x_i ).               (4.7)
More generally, if the ith individual were to choose y_i = j out of J choices, we could write:

    Pr(y_i = j | x_i) = Pr( U^(j)(x_i) > max_{k≠j} U^(k)(x_i) | x_i )                             (4.8)
        = Pr( β0^(j) + β1^(j)·x_i + ε_ij > max_{k≠j} [ β0^(k) + β1^(k)·x_i + ε_ik ] | x_i ).      (4.9)

In order to estimate using equation 4.9, we again need to make a distributional assumption.
4.3 The Multinomial Logit model
Assumption 4.1 (DISTRIBUTION OF ε_ij): ε_ij is i.i.d. with a Type I Extreme Value distribution, independent of x_i:

    Pr(ε_ij < z | x_i) = Pr(ε_ij < z) = exp( −exp(−z) ).      (4.10)

Equation 4.10, of course, defines the cumulative distribution function F(z); this implies a probability density function of:

    f(z) = (d/dz) exp( −exp(−z) ) = exp(−z)·exp( −exp(−z) )      (4.11)
         = exp(−z)·F(z).                                          (4.12)

Figure 4.3 shows the cumulative distribution function for the Type I Extreme Value distribution, compared to the cdf of the normal.

Figure 4.3: Cumulative distribution functions: Normal and Type I Extreme Value distributions

Figure 4.4 shows how this distributional assumption might imply the different outcomes y_i = 1, y_i = 2 and y_i = 3; the figure shows the same two-dimensional space as Figure 4.2, but with simulated values for U_i1, U_i2 and U_i3. (For simplicity, I have generated the graph by setting all of the parameters β0^(j) and β1^(j) to zero; that is, the graph shows variation generated solely by the Type I Extreme Value distribution on ε_ij.)
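As a sketch of how such simulated draws can be generated: equation 4.10 gives the cdf, so a draw can be constructed by inverting it (all names here are illustrative):

clear
set obs 10000
set seed 42
* if U ~ Uniform(0,1), then -ln(-ln(U)) has cdf exp(-exp(-z))
forvalues j = 1/3 {
    gen U`j' = -ln(-ln(runiform()))
}
* with all intercepts and slopes set to zero, utility equals the error term
gen choice = 1
replace choice = 2 if U2 > U1 & U2 > U3
replace choice = 3 if U3 > U1 & U3 > U2
tab choice    // each option is chosen roughly one-third of the time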
Figure 4.4: Multinomial choice among three options: Simulated data

With this distributional assumption, we can now find an expression for the conditional probability that the ith individual chooses outcome j from J choices.[15] Note that you do not need to memorise this derivation for the exam; however, I think the derivation is useful for understanding the underlying structure required by multinomial choice models.
First, suppose that the error term for the chosen option, ε_ij, were known. Then we could write:

    Pr(y_i = j | x_i, ε_ij) = Pr( U^(j)(x_i) > max_{k≠j} U^(k)(x_i) | x_i, ε_ij )                            (4.13)
        = Pr( β0^(k) + β1^(k)·x_i + ε_ik < β0^(j) + β1^(j)·x_i + ε_ij | x_i, ε_ij  ∀ k ≠ j )                 (4.14)
        = Pr( ε_ik < ε_ij + β0^(j) + β1^(j)·x_i − β0^(k) − β1^(k)·x_i | x_i, ε_ij  ∀ k ≠ j )                 (4.15)
        = ∏_{k≠j} exp( −exp( −(ε_ij + β0^(j) + β1^(j)·x_i − β0^(k) − β1^(k)·x_i) ) ).                        (4.16)
[15] This derivation is taken from Train (2009, pages 74–75).

Of course, ε_ij is not known; we therefore need to integrate across its possible values:

    Pr(y_i = j | x_i) = ∫ f(ε_ij) · Pr(y_i = j | x_i, ε_ij) dε_ij                                                               (4.17)
        = ∫ exp(−ε_ij) exp[−exp(−ε_ij)] · ∏_{k≠j} exp( −exp( −(ε_ij + β0^(j) + β1^(j)·x_i − β0^(k) − β1^(k)·x_i) ) ) dε_ij      (4.18)
        = ∫ exp(−ε_ij) · ∏_k exp( −exp( −(ε_ij + β0^(j) + β1^(j)·x_i − β0^(k) − β1^(k)·x_i) ) ) dε_ij                           (4.19)
        = ∫ exp(−ε_ij) exp( −Σ_k exp( −(ε_ij + β0^(j) + β1^(j)·x_i − β0^(k) − β1^(k)·x_i) ) ) dε_ij                             (4.20)
        = ∫ exp(−ε_ij) exp( −exp(−ε_ij) · Σ_k exp( −(β0^(j) + β1^(j)·x_i − β0^(k) − β1^(k)·x_i) ) ) dε_ij.                      (4.21)

(In moving from 4.18 to 4.19, the k = j term of the product equals exp(−exp(−ε_ij)), so that factor can be absorbed into a product over all k.)

We can now integrate by substitution. Define t = exp(−ε_ij), so that dt = −exp(−ε_ij) dε_ij. Note that lim_{ε_ij→−∞} t = ∞ and lim_{ε_ij→∞} t = 0. Then we can rewrite our integral as:

    Pr(y_i = j | x_i) = ∫_0^∞ exp( −t · Σ_k exp( −(β0^(j) + β1^(j)·x_i − β0^(k) − β1^(k)·x_i) ) ) dt                                             (4.22)
        = [ −exp( −t · Σ_k exp(−(β0^(j) + β1^(j)·x_i − β0^(k) − β1^(k)·x_i)) ) / Σ_k exp(−(β0^(j) + β1^(j)·x_i − β0^(k) − β1^(k)·x_i)) ]_0^∞      (4.23)
        = 1 / Σ_k exp( −(β0^(j) + β1^(j)·x_i − β0^(k) − β1^(k)·x_i) )                                                                             (4.24)
        = exp(β0^(j) + β1^(j)·x_i) / Σ_k exp(β0^(k) + β1^(k)·x_i).                                                                                (4.25)
Let's return to our example with three outcomes, y ∈ {1, 2, 3}. The last derivation implies that we can write the conditional probabilities of the outcomes as:

    Pr(y_i = 1 | x_i) = exp(β0^(1) + β1^(1)·x_i) / [ exp(β0^(1) + β1^(1)·x_i) + exp(β0^(2) + β1^(2)·x_i) + exp(β0^(3) + β1^(3)·x_i) ];      (4.26)
    Pr(y_i = 2 | x_i) = exp(β0^(2) + β1^(2)·x_i) / [ exp(β0^(1) + β1^(1)·x_i) + exp(β0^(2) + β1^(2)·x_i) + exp(β0^(3) + β1^(3)·x_i) ];      (4.27)
    Pr(y_i = 3 | x_i) = exp(β0^(3) + β1^(3)·x_i) / [ exp(β0^(1) + β1^(1)·x_i) + exp(β0^(2) + β1^(2)·x_i) + exp(β0^(3) + β1^(3)·x_i) ].      (4.28)
It would be tempting to take these three conditional probabilities and write the log-likelihood; for our three outcomes, we could therefore try to maximise the log-likelihood across six unknown parameters (that is, the parameters β0^(1), β0^(2), β0^(3), β1^(1), β1^(2) and β1^(3)). However, this would be a mistake, because we would be unable to find a unique set of values that would maximise the function. (That is, the model would be underidentified.) The reason, of course, is that we can only ever express utility in relative terms: we have seen this principle already in deriving both the probit and the Ordered Probit models. We therefore need to choose a 'base category', and estimate relative to the utility from that category. We shall choose 1 as the base category, and define γ0^(2) ≡ β0^(2) − β0^(1) and γ1^(2) ≡ β1^(2) − β1^(1) (and symmetrically for γ0^(3) and γ1^(3)). Note the emphasis here that the choice of base category is arbitrary: our predicted probabilities would be identical if we were to choose a different base category. Then we can multiply the numerator and denominator of each conditional probability by exp(−β0^(1) − β1^(1)·x_i), to obtain:

    Pr(y_i = 1 | x_i) = 1 / [ 1 + exp(γ0^(2) + γ1^(2)·x_i) + exp(γ0^(3) + γ1^(3)·x_i) ];                            (4.29)
    Pr(y_i = 2 | x_i) = exp(γ0^(2) + γ1^(2)·x_i) / [ 1 + exp(γ0^(2) + γ1^(2)·x_i) + exp(γ0^(3) + γ1^(3)·x_i) ];     (4.30)
    Pr(y_i = 3 | x_i) = exp(γ0^(3) + γ1^(3)·x_i) / [ 1 + exp(γ0^(2) + γ1^(2)·x_i) + exp(γ0^(3) + γ1^(3)·x_i) ].     (4.31)
These conditional probabilities can then be used to define the log-likelihood; it is now straightforward that, for the ith individual, the log-likelihood is:[16]

    ℓ_i(γ0^(2), γ1^(2), γ0^(3), γ1^(3); y_i | x_i) = 1(y_i = 1)·ln[Pr(y_i = 1 | x_i)] + 1(y_i = 2)·ln[Pr(y_i = 2 | x_i)] + 1(y_i = 3)·ln[Pr(y_i = 3 | x_i)].      (4.32)

This log-likelihood function defines the Multinomial Logit model. The Multinomial Logit is the simplest model for unordered choice. (Note that, if J = 2, the Multinomial Logit collapses to the logit model that we considered in Lecture 2.) Marginal effects in the Multinomial Logit model are directly analogous to marginal effects in the earlier models.
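For concreteness, a sketch of these formulae at work in Stata (commands as in the appendix to this lecture; the [2] and [3] equation prefixes assume the outcome is coded 1, 2 and 3, with 1 as the base category):

mlogit work_cat yborn, baseoutcome(1)
gen v2 = [2]_b[_cons] + [2]_b[yborn]*yborn    // estimated relative utilities
gen v3 = [3]_b[_cons] + [3]_b[yborn]*yborn
gen denom = 1 + exp(v2) + exp(v3)
gen p1 = 1/denom            // equation 4.29
gen p2 = exp(v2)/denom      // equation 4.30
gen p3 = exp(v3)/denom      // equation 4.31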

4.4 Estimates from Tanzania
Table 4.1 shows the estimates from the Tanzanian data. Note that, given the foundation of our additive random utility model, we can express the estimates in terms of relative utility from self-employment and wage employment; we estimate γ0^(2) = −36.383, γ1^(2) = 0.019, γ0^(3) = −48.562 and γ1^(3) = 0.025. All four estimates are highly significant. Figure 4.5 shows the consequent predicted probabilities of all three work categories (conditional upon having employment); this shows that older employed Tanzanians are significantly more likely to be working in agriculture than are younger employed Tanzanians, and that the converse applies for wage employment and self-employment.

[16] I denote the indicator function by 1(·). Note that, for simplicity, I have not explicitly written each conditional probability as depending upon the parameters of interest, though they clearly do.

Table 4.1: Estimates from Tanzania: Multinomial Logit

                                        Estimates      Mean Marginal Effects
                                                       y = 1         y = 2
                                           (1)          (2)           (3)
Year born                                              0.001         0.002
                                                      (0.0007)      (0.0006)
Relative utility of self-employment:
  Year born                               0.019
                                         (0.003)
  Const.                                -36.383
                                         (6.308)
Relative utility of wage employment:
  Year born                               0.025
                                         (0.004)
  Const.                                -48.562
                                         (7.289)
Obs.                                     4136
Log-likelihood                          -4218.448
Pseudo-R2                                0.006

Confidence: *** 99%, ** 95%, * 90%.

Figure 4.5: Occupational categories and age in Tanzania: Multinomial Logit estimates

4.5 From Multinomial Logit to Conditional Logit
We assumed earlier that the ith individual's utility from the jth choice depends linearly upon (i) the observable characteristics of the individual, x_i, and (ii) unobservable characteristics of the individual's taste for that choice, ε_ij:

    U_ij(x_i) = β0^(j) + β1^(j)·x_i + ε_ij.      (4.2)

Critically, this structure does not allow for observable characteristics of different options. However, we can allow straightforwardly for that possibility, by allowing the variable x to be indexed by both individual and potential choice: x_ij. That is, we now assume that the researcher observes information on characteristics of options that the ith individual did not choose; for example, a researcher might know what the ith individual would have paid to take the train, even though she actually chose the bus.[17] We can therefore write U_ij as a linear function of characteristics; and, for the sake of generality, we now use vector notation to allow for multiple characteristics:[18]

    U_ij(x_ij) = x_ij·β + ε_ij.      (4.33)

Note that x_ij is indexed at the level of the option-individual combination; the vector includes characteristics that vary at the level of the individual and the choice (for example, respondents may face different relative costs of using different types of transport). Wooldridge cautions (2002, page 500) that x_ij 'cannot contain elements that vary only across i and not j; in particular, x_ij does not contain unity'. This means, therefore, that x_ij cannot include personal characteristics (for example, age, sex, income, etc.); we will see an intuitive reason for this shortly (in equation 4.39).
In the Multinomial Logit, the outcome y was indexed by individuals, and took the value of the particular choice made; for example, we might write y_i = j. But in the Conditional Logit model, we need to express the outcome at the level of the individual-choice combination. Therefore, we redefine our outcome as a binary variable:

    y_ij = 1 if the ith individual chooses option j, and;
           0 if the ith individual chooses some other option k ≠ j.      (4.34)
All of our reasoning from the Multinomial Logit extends to the Conditional Logit. The conditional probability of individual i choosing outcome j is:

    Pr(y_ij = 1 | x_i1, . . . , x_iJ) = exp(x_ij·β) / Σ_{k=1}^{J} exp(x_ik·β).      (4.35)

[17] Cameron and Trivedi provide an example of this kind of data structure in their Table 15.1 on page 492; in that application, a researcher observes the price that different respondents faced for each of four types of fishing, even though each respondent chose only one type.
[18] Of course, we could also have used vector notation for all of our earlier reasoning in the Multinomial Logit model; this would have implied estimating two vectors γ1^(2) and γ1^(3). But the scalar case captured all of the important aspects of Multinomial Logit in a simpler context, and matched directly our illustrative empirical application.

The log-likelihood follows straightforwardly:

    ℓ_i(β; y_i1, . . . , y_iJ | x_i1, . . . , x_iJ) = Σ_{j=1}^{J} y_ij · ln Pr(y_ij = 1 | x_i1, . . . , x_iJ).      (4.36)
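In Stata, the Conditional Logit can be fit with clogit on data in 'long' format, with one row per individual-option pair; a minimal sketch, with all variable names invented for illustration:

* id identifies the individual; chosen is y_ij; cost varies across i and j
clogit chosen cost, group(id)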

4.6 Independence assumptions
The Multinomial Logit and Conditional Logit are very tractable models. As we have discussed, they provide an analytical expression for the log-likelihood; this function can therefore be evaluated and maximised easily. But this analytical tractability comes at a cost: the Multinomial Logit and Conditional Logit both require that the unobservable terms, ε_ij, have a Type I Extreme Value distribution, and that these terms are distributed independently of each other. This has serious implications for a structure of individual choice. Consider, for example, the Multinomial Logit. Using equation 4.25, we can write the ratio of the conditional probability that y_i = A and that y_i = B:

    Pr(y_i = A | x_i) / Pr(y_i = B | x_i) = exp(β0^(A) + β1^(A)·x_i) / exp(β0^(B) + β1^(B)·x_i)      (4.37)
        = exp( β0^(A) − β0^(B) + (β1^(A) − β1^(B))·x_i ).                                            (4.38)
That is, the ratio of probabilities for any two alternatives cannot depend upon how much the respondents like any of the other alternatives on offer. Similarly, consider the Conditional Logit. Using equation 4.35, the ratio of conditional probabilities for two choices is:

    Pr(y_iA = 1 | x_i1, . . . , x_iJ) / Pr(y_iB = 1 | x_i1, . . . , x_iJ) = exp[ (x_iA − x_iB)·β ].      (4.39)

Thus, in the Conditional Logit model, the ratio of probabilities for two alternatives cannot depend upon the characteristics of any other alternative (or, as noted, on any characteristics that do not vary across j).
Cameron and Trivedi (page 503, emphasis in original) describe why these results are so concerning:
A limitation of the [Conditional Logit] and [Multinomial Logit] models is that discrimination among the [J] alternatives reduces to a series of pairwise comparisons
that are unaffected by the characteristics of alternatives other than the pair under consideration. . .
As an extreme example, the conditional probability of commute by car given commute
by car or red bus is assumed in an MNL or CL model to be independent of whether
commuting by blue bus is an option. However, in practice we would expect introduction of a blue bus, which is the same as a red bus in every aspect except colour, to have
little impact on car use and to halve use of the red bus, leading to an increase in the
conditional probability of car use given commute by car or red bus.
This weakness of MNL is known in the literature as the 'red bus/blue bus problem', or more formally as the assumption of independence of irrelevant alternatives.

This is clearly a serious limitation of the Conditional Logit and Multinomial Logit. Indeed, in his Nobel Prize Lecture in 2000, Daniel McFadden even went so far as to say (page 339):

The MNL model has proven to have wide empirical applicability, but as a theoretical model of choice behaviour its IIA property is unsatisfactorily restrictive.
A variety of alternative models have been developed in order to relax these independence assumptions while still maintaining a clear basis in individual utility maximisation. For example, the Alternative-Specific Multinomial Probit assumes that the values of ε_ij are drawn from a multivariate normal distribution with a flexible covariance matrix; this would allow, for example, that the unobservable utility from taking a blue bus is very close to the unobservable utility from taking a red bus. However, this model (like most other alternative models) does not admit a closed-form expression for the log-likelihood. The log-likelihood is therefore evaluated using simulation-based methods (e.g. Simulated Maximum Likelihood). These models are beyond the scope of our lectures, though they build directly upon the principles that we have discussed.
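For reference, Stata implements exactly this model as asmprobit, estimated by maximum simulated likelihood; a one-line sketch, again with invented variable names and long-format data:

asmprobit chosen cost, case(id) alternatives(option)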

4.7 Appendix to Lecture 4: Stata code
Note: You do not need to know any Stata code for any exam question about limited dependent
variables.
First, clear the memory and load the data, as in previous lectures. We can then tabulate the variable work_cat:
tab work_cat
We can estimate a multinomial logit with a base category of 1 (agricultural employment) by:
mlogit work_cat yborn, baseoutcome(1)
Mean marginal effects for the self-employment and wage employment categories (2 and 3) can be calculated by:
margins, dydx(yborn) predict(outcome(2))
margins, dydx(yborn) predict(outcome(3))

Lecture 5: Censored and truncated outcomes

Required reading:
* Cameron, A.C. and Trivedi, P.K. (2005): Microeconometrics: Methods and Applications. Cambridge University Press, pages 529–544 (i.e. sections 16.1 to 16.3.7, inclusive); or
* Wooldridge, J. (2002): Econometric Analysis of Cross Section and Panel Data. The MIT Press, pages 517–527 (i.e. sections 16.1 to 16.4, inclusive); or
* Wooldridge, J. (2010): Econometric Analysis of Cross Section and Panel Data (2nd ed.). The MIT Press, pages 667–677 (i.e. sections 17.1 to 17.3, inclusive).

5.1 The problem of incompletely observed data
Data is often observed incompletely. For example, we may often need to worry about problems
of measurement error, missing observations, non-random attrition, and so on. In this lecture, we
will discuss the specific problem of censoring on an outcome variable (and the related problem of
truncation). There are two reasons that this problem is worth discussing. First, censoring often
arises in real data, and can cause serious problems for empirical analysis if it is not addressed.
Second, censoring provides a useful introduction to the issues and methods involved in correcting
for problems of sample selection (the topic of Lecture 6). Cameron and Trivedi define truncated
and censored data as follows (page 529, emphasis in original):
Leading cases of incompletely observed data are truncation and censoring. For truncated data some observations on both the dependent variable and the regressors are
lost. For example, income may be the dependent variable and only low-income people
are included in the sample. For censored data information on the dependent variable
is lost, but not data on the regressors. For example, people of all income levels may
be included in the sample, but for confidentiality reasons the income of high-income
people may be top-coded and reported only as exceeding, say, $100,000 per year.
Truncation entails greater information loss than does censoring.
In this lecture, we will focus upon the problem of censored data.[19] Once again, we will use the Tanzanian data for illustrative purposes. However, we will use an artificially modified version of the data. Specifically, we will imagine that, for purposes of survey anonymity, the Tanzanian National Bureau of Statistics censored the measure of income at a gross weekly income of 250,000 Tanzanian shillings (i.e. almost exactly £100). Figure 5.1 shows the actual recorded Tanzanian income distribution (in the top panel), and then the imaginary income distribution with artificial top-coding (in the bottom panel).
[19] Truncated data requires very similar methods, and a similar log-likelihood function.

Figure 5.1: Income in Tanzania with artificial top-coding

The question for this lecture is simple: how can we estimate the effect of education on earnings when earnings are censored? For simplicity, we will consider four educational achievements: (i) no education (i.e. not having completed primary school), (ii) primary education, (iii) secondary education and (iv) tertiary education. Figure 5.2 shows the distribution of (true) gross weekly income across those four categories, with the top-coding point marked. The figure illustrates immediately why censorship produces empirical problems: censorship has different effects across the different educational categories. Thus, for example, only about 12% of income-earning Tanzanians with no education would be affected by our artificial top-coding; however, the top-coding would affect almost half of income-earning Tanzanians with a tertiary education. Figure 5.3 shows the effect on conditional means: top-coding reduces mean measured income by more for those education categories with higher earnings; if we do not correct for top-coding, we will therefore underestimate the effect of education.

5.2 The Tobit model
Let's consider the problem more formally. We will define D_i^p = 1 if the ith individual has completed primary education, and define D_i^s and D_i^t respectively for secondary and tertiary. We can denote log earnings by the variable y*, and therefore specify:

    y_i* = α + βp·D_i^p + βs·D_i^s + βt·D_i^t + ε_i.      (5.1)
Figure 5.2: Income distributions in Tanzania by education category

In the actual Tanzanian data, y* is observed, so we can run an OLS regression on equation 5.1 to estimate the average earnings of individuals with different educational achievements. (Note that we are not saying anything here about causation: because we are saying nothing about 'ability bias', we are simply seeking here to describe conditional average earnings.)
In our artificial data, however, things are not so simple; recall that, in our artificial data, we are imagining that earnings is top-coded at a weekly income of 250,000 Tanzanian shillings. In our artificial data, we therefore observe not y* but y, where we have:

    y_i = y_i* if y_i* < z;
          z    if y_i* ≥ z,      (5.2)

where z denotes the censorship point (in this case, defined such that z = ln(250000)). Alternatively, equation 5.2 can be rewritten as:

    y_i = min(y_i*, z).      (5.3)

As in earlier lectures, we need a distributional assumption on the error term.

Assumption 5.1 (DISTRIBUTION OF ε_i): ε_i is i.i.d. with a normal distribution with mean 0 and variance σ², independent of D_i^p, D_i^s and D_i^t:

    ε_i | D_i^p, D_i^s, D_i^t ∼ N(0, σ²).      (5.4)
Figure 5.3: Mean log income by education category, with and without top-coding

For simplicity, we can define ν_i such that ν_i | D_i^p, D_i^s, D_i^t ∼ N(0, 1), and write:

    y_i* = α + βp·D_i^p + βs·D_i^s + βt·D_i^t + σ·ν_i.      (5.5)

From this equation, we can derive the log-likelihood. The idea is simple: we can recover consistent estimates of our parameters of interest if we can write a log-likelihood that jointly models (i) the income-generating process, and (ii) the process of top-coding. First, we can write the conditional probability of censorship:[20]

    Pr(y_i = z | D_i^p, D_i^s, D_i^t) = Pr( y_i* ≥ z | D_i^p, D_i^s, D_i^t )                                 (5.6)
        = Pr( ν_i ≥ [z − (α + βp·D_i^p + βs·D_i^s + βt·D_i^t)] / σ | D_i^p, D_i^s, D_i^t )                   (5.7)
        = 1 − Φ( [z − (α + βp·D_i^p + βs·D_i^s + βt·D_i^t)] / σ )                                            (5.8)
        = Φ( [α + βp·D_i^p + βs·D_i^s + βt·D_i^t − z] / σ ).                                                 (5.9)

[20] Note that, because ν_i has a standard normal distribution, the conditional probability of censorship follows a probit model.

For individuals whose income is not censored, the likelihood is the probability that income is not censored, multiplied by the conditional density of income (conditional also upon income not being censored):

    Pr(y_i* < z | D_i^p, D_i^s, D_i^t) · f(y_i* | y_i* < z, D_i^p, D_i^s, D_i^t)                                        (5.10)
        = Pr(y_i* < z | D_i^p, D_i^s, D_i^t) · [ f(y_i* | D_i^p, D_i^s, D_i^t) / Pr(y_i* < z | D_i^p, D_i^s, D_i^t) ]   (5.11)
        = f(y_i* | D_i^p, D_i^s, D_i^t)                                                                                 (5.12)
        = (1/σ) · φ( [y_i − (α + βp·D_i^p + βs·D_i^s + βt·D_i^t)] / σ ).                                                (5.13)

(Note that, for any random variable X distributed such that X ∼ N(μ, σ²), the probability density is f(X; μ, σ²) = (1/σ)·φ[(X − μ)/σ].)
We can therefore write the log-likelihood for the ith individual as:

    ℓ_i(α, βp, βs, βt, σ; y_i | D_i^p, D_i^s, D_i^t) =
        ln Φ( [α + βp·D_i^p + βs·D_i^s + βt·D_i^t − z] / σ )                     if y_i = z;      (5.14)
        ln { (1/σ)·φ( [y_i − (α + βp·D_i^p + βs·D_i^s + βt·D_i^t)] / σ ) }       if y_i < z.
We can write the log-likelihood for the sample as:

    ℓ(θ) = Σ_{i=1}^{N} { 1(y_i = z) · ln Φ( [α + βp·D_i^p + βs·D_i^s + βt·D_i^t − z] / σ )
           + 1(y_i < z) · ln [ (1/σ)·φ( [y_i − (α + βp·D_i^p + βs·D_i^s + βt·D_i^t)] / σ ) ] }.      (5.15)
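To see how equation 5.15 maps into code, here is a sketch that programs this log-likelihood directly with Stata's ml machinery (z = log(250000), as in the appendix; the built-in tobit command gives the same estimates):

program define mytobit_lf
    args lnfj xb lnsigma
    * censored observations: first branch of equation 5.15
    quietly replace `lnfj' = ln(normal((`xb' - log(250000))/exp(`lnsigma'))) ///
        if $ML_y1 >= log(250000)
    * uncensored observations: second branch of equation 5.15
    quietly replace `lnfj' = ln(normalden(($ML_y1 - `xb')/exp(`lnsigma'))) ///
        - `lnsigma' if $ML_y1 < log(250000)
end
ml model lf mytobit_lf (xb: logincome_cens = primary secondary tertiary) /lnsigma
ml maximize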

5.3 Back to Tanzania. . .

Table 5.1 shows estimates using the earnings data. Column (1) shows the OLS estimates using the actual earnings data; i.e. the column shows consistent estimates of the relationship between education and earnings. Columns (2) and (3) use the top-coded data: column (2) estimates with the top-coded observations included, while column (3) drops those observations. Neither approach works well in recovering the column (1) estimates. Column (4) shows the Tobit estimates: the estimator performs very well in recovering the column (1) estimates.[21] Figure 5.4 illustrates.

[21] Admittedly, the Tobit estimator overestimates the average earnings for the tertiary-educated; however, this difference is not significant.

Table 5.1: Tobit estimates for the artificial Tanzanian top-coding

                             OLS                                     Tobit
                 Real data   Censored data   Truncated data   Censored data
                    (1)           (2)             (3)              (4)
Const.            10.786        10.705          10.470           10.776
                  (0.041)       (0.035)         (0.034)          (0.041)
Primary            0.556         0.501           0.445            0.564
                  (0.049)       (0.042)         (0.042)          (0.05)
Secondary          1.076         0.981           0.982            1.111
                  (0.071)       (0.06)          (0.062)          (0.072)
Tertiary           1.401         1.185           0.936            1.577
                  (0.21)        (0.177)         (0.224)          (0.227)
Obs.               3660          3660            2990             3660
Log-likelihood                                                  -5613.710

Confidence: *** 99%, ** 95%, * 90%.
'Real data' refers to the true Tanzanian earnings data. 'Censored data' refers to the data including top-coded observations (at 250,000 Tanzanian shillings). 'Truncated data' refers to the data omitting top-coded observations.

Figure 5.4: Tobit estimates for Tanzania

5.4 The Inverse Mills Ratio

Hopefully, it is clear from the earlier discussion that censorship (in the form of top-coding) has created serious problems for the OLS estimator: censorship caused OLS to underestimate substantially the effect of education on conditional earnings. So what's going on? We can consider the problem by writing the expectation of y_i, conditional on covariates and upon earnings lying below the top-coding point. We noted earlier that even dropping the censored observations (that is, transforming the problem to one of truncation, rather than of censorship) did not produce consistent estimates. In this section, we will use the truncated sample in order to introduce the concept of the Inverse Mills Ratio.
Figure 5.5: Mean log income by education category, dropping top-coded observations

Figure 5.5 shows mean log income across education category, comparing the mean of the true data to the mean of truncated data. (That is, Figure 5.3 compared the true data to censored data; Figure 5.5 shows the equivalent figure for truncated data. In doing so, the figure summarises the coefficients in column (3) of Table 5.1.)
Our objective is to derive an expression for the expected heights of the light gray bars in Figure 5.5. To do this, we need to consider a very nice feature of the normal distribution. As you know, the probability density function of the standard normal is:

    φ(x) = (1/√(2π)) · exp(−x²/2).      (5.16)
It follows that the slope of this pdf is:

    dφ(x)/dx = −x · (1/√(2π)) · exp(−x²/2)      (5.17)
             = −x·φ(x).                          (5.18)
This is very useful; we can use this result to derive the expectation of a truncated normal distribution. Suppose that X ∼ N(0, 1). Then we can write:

    E(X | X < z) = [ ∫_{−∞}^{z} x·φ(x) dx ] / [ ∫_{−∞}^{z} φ(x) dx ]      (5.19)
        = [ ∫_{−∞}^{z} −(dφ(x)/dx) dx ] / Φ(z)                            (5.20)
        = −[φ(x)]_{−∞}^{z} / Φ(z)                                         (5.21)
        = −φ(z)/Φ(z).                                                     (5.22)

Symmetrically, we have:

    E(X | X > z) = φ(z) / (1 − Φ(z)).      (5.23)
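Equation 5.22 is also easy to verify numerically; a quick simulation sketch in Stata (the truncation point 0.5 is arbitrary):

clear
set obs 1000000
set seed 1
gen x = rnormal()
summarize x if x < 0.5
display -normalden(0.5)/normal(0.5)    // matches the mean just reported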

Therefore, we can write the expectation of y_i, conditional on covariates and upon no truncation, as:[22]

    E(y_i | y_i* < z, D_i^p, D_i^s, D_i^t)
        = E(y_i* | y_i* < z, D^p, D^s, D^t)                                                                    (5.24)
        = α + βp·D_i^p + βs·D_i^s + βt·D_i^t + σ·E( ν_i | ν_i < [z − (α + βp·D_i^p + βs·D_i^s + βt·D_i^t)]·σ⁻¹ )      (5.25)
        = α + βp·D_i^p + βs·D_i^s + βt·D_i^t
          − σ · φ( [z − (α + βp·D_i^p + βs·D_i^s + βt·D_i^t)]·σ⁻¹ ) / Φ( [z − (α + βp·D_i^p + βs·D_i^s + βt·D_i^t)]·σ⁻¹ ),      (5.26)

where the final equality follows from our earlier results about the truncated standard normal.
[22] Note that equation 5.25 follows from equation 5.24 because we are conditioning on D_i^p, D_i^s and D_i^t; that is, those variables are known, and enter linearly, so we can take them outside of the expectation operator.

The final term (the 'Inverse Mills Ratio' term) is clearly central to understanding why top-coding causes us to underestimate the conditional means of education. To see this clearly, we can write the conditional mean for each level of education separately:

    E(y_i | y_i* < z, D^p, D^s, D^t) =
        α − σ·φ([z − α]·σ⁻¹) / Φ([z − α]·σ⁻¹)                                    if no education;
        α + βp − σ·φ([z − α − βp]·σ⁻¹) / Φ([z − α − βp]·σ⁻¹)                     if primary education;      (5.27)
        α + βs − σ·φ([z − α − βs]·σ⁻¹) / Φ([z − α − βs]·σ⁻¹)                     if secondary education;
        α + βt − σ·φ([z − α − βt]·σ⁻¹) / Φ([z − α − βt]·σ⁻¹)                     if tertiary education.

Figure 5.6 shows the Inverse Mills Ratio function: the function is strictly decreasing and strictly convex, with lim_{x→∞} φ(x)/Φ(x) = 0 and lim_{x→−∞} φ(x)/(−x·Φ(x)) = 1. Thus, if the effect of education is larger (e.g. βp is higher), then the argument to the Inverse Mills Ratio will be smaller, the Ratio will be larger, and the difference between true mean earnings and mean censored earnings will be greater. This, of course, is precisely what we observed in Figure 5.5.
Figure 5.6: The Inverse Mills Ratio
[Figure: the Inverse Mills Ratio, y = φ(x)/Φ(x): strictly decreasing and strictly convex, approaching the line y = −x as x → −∞ and approaching 0 as x → ∞.]
The Inverse Mills Ratio is therefore useful for understanding the consequences of truncation. (And,
indeed, we could also write a slightly more complicated expression for the conditional expectations
under censorship.) But the Ratio is more important than that; we will see in our next lecture that it
is also fundamental for correcting problems of sample selection.

5.5 A warning: Censoring is a data problem, not an economic outcome

This has been a lecture about modelling incompletely observed data: we have considered an example in which an observable economic variable (income) has been censored, through top-coding. This is quite different to a context in which an outcome variable appears to be censored or truncated because of agents' economic decisions. For example, consider the top panel of Figure 5.7; it shows the number of hours worked in the previous week, for a sample from the Tanzanian ILFS (for respondents' main activity). In some respects, this looks a bit like the outcome that we might expect from a Tobit model with censorship from below at zero hours worked.

Figure 5.7: Using a Tobit structure for hours worked

The Tobit model is certainly one structure that is sometimes placed upon data like this. Researchers sometimes argue, for example, that respondents each have a preferred number of hours worked: if that preferred number is greater than zero, respondents work that number of hours; otherwise, they work zero hours. However, this interpretation is problematic, for at least two reasons. First, as a matter of principle, we may not find it reasonable to model individuals' labour supply decision in this way. For example, we may think that individuals require some effort to find a job, or some transportation cost of travelling to work; so it may not make sense, say, to model anybody as working for 10 minutes a week. Second, as a matter of empirical observation, many censored or truncated variables actually have a distribution quite different to that implied by a Tobit structure. For example, consider the bottom panel of Figure 5.7: this shows the distribution of a simulated
hours of work variable, where the parameters used are obtained by a Tobit estimation on the true data (in the top panel). Loosely, we can interpret the bottom panel as saying, 'This is what the distribution of hours should look like, if the underlying process truly is a Tobit model.' It is clear that the distributions look quite different: for example, the proportion of respondents who do not work is substantially higher in the real data than in the simulated data. In short, the Tobit model does not seem to describe this data well, either as a matter of empirical observation or in its implications about underlying behaviour.
Angrist and Pischke make a similar point (2009, page 102):
Do Tobit-type latent variables models ever make sense? Yes, if the data you are working with are truly censored. True censoring means the latent variable has an empirical
counterpart that is the outcome of primary interest. A leading example from labor economics is CPS earnings data, which topcodes (censors) very high values of earnings
to protect respondent confidentiality.
Instead of using the Tobit model, it may be more reasonable to consider individuals as making two labour supply decisions, i.e. (i) whether to work at all and, if so, (ii) how many hours to work. The Tobit model links those two decisions in a very rigid and very specific way, but it may be preferable to use a model that treats those two outcomes separately. Such an approach might fit the data more closely, while also allowing more flexibility in its implications about economic behaviour. This is the kind of approach that we will consider in our final lecture.

5.6 Appendix to Lecture 5: Stata code
Note: You do not need to know any Stata code for any exam question about limited dependent
variables.

gen logincome_cens = logincome
replace logincome_cens = log(250000) ///
    if logincome >= log(250000) & logincome != .
reg logincome primary secondary tertiary
reg logincome_cens primary secondary tertiary
tobit logincome_cens primary secondary tertiary, ul(12.429)
(Note, in interpreting this last line, that exp(12.429) ≈ 250000; that is, the last line is telling Stata to use 12.429 as the upper limit for censorship of log wages.)

Lecture 6: Selection

Required reading:
* Cameron, A.C. and Trivedi, P.K. (2005): Microeconometrics: Methods and Applications. Cambridge University Press, pages 546–553 (i.e. section 16.5); or
* Wooldridge, J. (2002): Econometric Analysis of Cross Section and Panel Data. The MIT Press, pages 551–566 (i.e. sections 17.1 to 17.4.1, inclusive); or
* Wooldridge, J. (2010): Econometric Analysis of Cross Section and Panel Data (2nd ed.). The MIT Press, pages 790–808 (i.e. sections 19.3 to 19.6.1, inclusive).

Other references:
Deaton, A. (1997): The Analysis of Household Surveys: A Microeconometric Approach to Development Policy. The World Bank.
Newey, W., J. Powell and J. Walker (1990): 'Semiparametric Estimation of Selection Models: Some Empirical Results', American Economic Review, 80(2), 324–328.

6.1 The problem of endogenous selection
What is the effect of education upon worker productivity in Tanzania? This is an important question, for several reasons: for example, it may be important in assessing the value of public investment in education, or it may be useful for learning about how firms reward their employees. It is also a canonical illustration for many issues in microeconometrics; in this lecture, we will consider this question in order to think about the problem of endogenous selection.
For illustrative purposes, we will begin by considering a simulated dataset. In that simulated data, students choose years of education x ∈ {1, . . . , 20}, and the ith student then has a potential log income y_i*, where y_i* is determined by a standard Mincer earnings equation:

    y_i* = β0 + β1·x_i + ε_i.      (6.1)

That is, we can interpret y_i* as an income offer: the income that the ith individual would earn if that individual were to hold employment. Critically, we will assume that we cannot fully observe y_i*: we will assume that the ith individual earns y_i* only if that individual finds a job. Otherwise, we will assume that the value for log income is missing (in the sense that log(0) is undefined). As researchers, we therefore observe not potential log income, y_i*, but actual log income, y_i:

    y_i = y_i*    if the ith individual has a job;
          missing if the ith individual does not have a job.      (6.2)
We will assume that E(ε_i | x_i) = 0; that is, we will assume that, if we could observe y_i*, we could estimate β0 and β1 consistently simply by using OLS. (In reality, of course, we may be concerned that this assumption does not hold; we might be concerned both about endogenous sample selection and about endogenous choice of education. But we will ignore this possibility in this lecture so that we can focus on the problem of endogenous selection.)
Before we go any further, we must consider a simple question: why does it matter whether an individual has a job or not? The simplest approach to estimating β0 and β1 would simply be to regress y_i on x_i for those individuals who have a job. This would be a valid approach (that is, it would produce consistent estimates of β0 and β1) if the individuals with a job were a representative sample of all individuals in the population. For example, it would be a valid approach if each individual's employment status were determined by flipping a coin.
The problem, of course, is that employment is not determined by flipping a coin, and there are good reasons to believe that the mechanism that determines whether someone has a job is not independent of the mechanism that determines how much that individual can earn. For example, it may be that individuals with a higher earning capacity invest more resources in job search; similarly, it may be that employers are more likely to offer a job to someone who is likely to be more productive. This is what researchers sometimes term a 'selection problem': the individuals for whom we have data (in this case, individuals with a job) may be an unrepresentative subsample of the population, because the process by which the sample is chosen depends upon the mechanism being studied (in this case, the relationship between education and earnings). In this lecture, we consider a method for recovering consistent estimates of β0 and β1 while allowing for this selection problem.

6.2 The Heckman selection model
6.2.1 A triangular simultaneous equations model
The central idea behind a selection model in this context is simple: in order to estimate consistently the relationship between education and income (that is, to obtain consistent estimates of β0 and β1), we need to model both the relationship between education and earnings and the mechanism by which individuals find employment. Equation 6.1 includes a single error term, ε_i; for simplicity, we will term this 'work ability'. But we will now introduce a second error term, denoted ν_i, which we shall term 'work desire'. We will also introduce a new variable, z_i ∈ {0, . . . , 4}, denoting the number of children in the ith individual's household. This variable requires an important assumption.

Assumption 6.1: The number of children in an individual's household (z_i) is independent of that individual's work ability (ε_i) and work desire (ν_i).

Like all assumptions, this is debatable. For example, the assumption rules out the possibility that individuals with higher work ability choose to have fewer children in order to pursue their careers. I leave it to you to consider how reasonable you find this assumption; it is an assumption
that we shall maintain throughout this lecture.


We will assume that the $i$th individual's employment status depends upon (i) his or her education, $x_i$, (ii) the number of children in his or her household, $z_i$, and (iii) his or her work desire, $\eta_i$. We will denote $j_i$ as a dummy variable for whether the $i$th individual has a job, and will assume:

$$
j_i =
\begin{cases}
0 & \text{if } \gamma_0 + \gamma_1 x_i + \gamma_2 z_i + \eta_i < 0; \\
1 & \text{if } \gamma_0 + \gamma_1 x_i + \gamma_2 z_i + \eta_i \geq 0.
\end{cases}
\tag{6.3}
$$
(Note, in passing, the similarity between equation 6.3 and our earlier equations 1.6 and 1.7.)
Together, equations 6.1, 6.2 and 6.3 describe what can be termed a 'triangular simultaneous equations system'; more generally, we can write the system as:

$$ y_i = y(x_i, j_i, \varepsilon_i) \tag{6.4} $$
$$ j_i = j(x_i, z_i, \eta_i). \tag{6.5} $$

Critically, note that, because of Assumption 6.1, $z_i$ directly affects $j_i$ but does not directly affect $y_i$. That is, $z_i$ is excluded from the function $y$; this assumption is therefore sometimes termed the 'exclusion restriction'. In this context, the validity of the exclusion restriction depends critically upon our assumption that $z_i$ and $\varepsilon_i$ are independent; if, for example, individuals with higher work ability choose to have fewer children, the exclusion restriction will be violated.
Together, equations 6.4 and 6.5 emphasise the importance of the relationship between $\varepsilon_i$ and $\eta_i$: as we noted earlier, it is this relationship that creates the selection problem that our estimator will need to address. In the Heckman selection model, we proceed by assuming that $\varepsilon_i$ and $\eta_i$ are drawn from the following bivariate normal distribution, independent of $x_i$ and $z_i$:

$$
\begin{pmatrix} \varepsilon_i \\ \eta_i \end{pmatrix} \,\Big|\, x_i, z_i
\;\sim\; N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},\;
\begin{pmatrix} \sigma^2 & \rho\sigma \\ \rho\sigma & 1 \end{pmatrix} \right).
\tag{6.6}
$$

In our simulated data, we will set $\rho = 0.75$ and $\sigma = 1$. Figure 6.1 shows the simulated distribution of $(\varepsilon_i, \eta_i)$ for a sample of size $N = 10000$. Additionally, we have $z_i \in \{0, \dots, 4\}$, simulated with equal probability on each point of support (i.e. a probability of 0.2).
We will derive formal results shortly. But, even before we do this, Figure 6.1 provides an intuitive illustration of the selection problem. First, note that individuals with a higher value of $\eta_i$ are more likely to have a job, and that those individuals are also likely to have a higher value of $\varepsilon_i$. Moreover, we can see intuitively why the value of $\varepsilon_i$ among individuals with a job is likely to correlate with individuals' education. If $\gamma_1 > 0$, more educated individuals are more likely to have a job. Therefore, those less educated individuals who have a job will, on average, have a higher work desire than their more educated workmates; in turn, this implies that they will also have a higher average work ability. We should therefore expect an OLS regression of $y_i$ on $x_i$ to underestimate the effects of schooling. To formalise this, we need to revise some characteristics of the bivariate normal.
Figure 6.1: Work desire and work ability (simulated data)

6.2.2 Marginal and conditional distributions under the bivariate normal

Suppose that two random variables, $X$ and $Y$, have a bivariate normal distribution:

$$
\begin{pmatrix} Y \\ X \end{pmatrix}
\;\sim\; N\left( \begin{pmatrix} \mu_Y \\ \mu_X \end{pmatrix},\;
\begin{pmatrix} \sigma_Y^2 & \rho\,\sigma_X \sigma_Y \\ \rho\,\sigma_X \sigma_Y & \sigma_X^2 \end{pmatrix} \right).
\tag{6.7}
$$

It follows that the marginal distribution of $X$ is:

$$ X \sim N\left( \mu_X, \sigma_X^2 \right). \tag{6.8} $$

The distribution of $Y$, conditional on $X = x$, is:

$$ Y \mid (X = x) \;\sim\; N\left( \mu_Y + \rho\,\frac{\sigma_Y}{\sigma_X}(x - \mu_X),\; (1 - \rho^2)\,\sigma_Y^2 \right). \tag{6.9} $$

(If you are interested in an intuitive illustration of the bivariate normal distribution, I recommend
http://demonstrations.wolfram.com/TheBivariateNormalDistribution/.)
Equation 6.6 is clearly a special case of 6.7, where the distributional assumption on $(\varepsilon_i, \eta_i)$ normalises the means of each variable to zero and the variance of $\eta_i$ to 1. Two important results follow. First, consider the marginal distribution of $\eta_i$:

$$ \eta_i \sim N(0, 1). \tag{6.10} $$
The $i$th individual's employment status is therefore determined by a linear index of $x_i$ and $z_i$ (see equation 6.3), plus an error term that has a standard normal distribution: that is, employment status is determined by a probit model. This implies that we can estimate $\gamma_0$, $\gamma_1$ and $\gamma_2$ consistently by running a probit estimation of $j_i$ on $x_i$ and $z_i$.
Second, consider the conditional distribution of $\varepsilon_i$ given $\eta_i$:

$$ \varepsilon_i \mid \eta_i \;\sim\; N\left( \rho\sigma\,\eta_i,\; (1 - \rho^2)\,\sigma^2 \right). \tag{6.11} $$

Critically, this implies that the conditional expectation of $\varepsilon_i$ is linear in $\eta_i$ (as Figure 6.1 illustrated):

$$ E(\varepsilon_i \mid \eta_i) = \rho\sigma\,\eta_i. \tag{6.12} $$
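As a quick sanity check (this sketch is mine, not part of the original derivation), we can verify the linearity result in equation 6.12 by simulation, using the values $\rho = 0.75$ and $\sigma = 1$ from our simulated data:

* Sketch: verify E(epsilon | eta) = rho*sigma*eta by simulation (rho = 0.75, sigma = 1)
clear
set seed 12345
set obs 10000
matrix C = (1, 0.75 \ 0.75, 1)   // correlation matrix; equal to the covariance matrix here
drawnorm eps eta, corr(C)        // joint normal draws for (epsilon, eta)
regress eps eta                  // slope should be close to rho*sigma = 0.75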

Consider, then, the conditional expectation of $\varepsilon_i$ for someone who has a job:

$$
\begin{aligned}
E(\varepsilon_i \mid x_i, z_i, j_i = 1; \gamma_0, \gamma_1, \gamma_2)
&= E(\varepsilon_i \mid x_i, z_i, j_i = 1) \qquad (6.13) \\
&= E(\varepsilon_i \mid \gamma_0 + \gamma_1 x_i + \gamma_2 z_i + \eta_i \geq 0) \qquad (6.14) \\
&= E(\varepsilon_i \mid \eta_i \geq -(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)) \qquad (6.15) \\
&= \rho\sigma\,\frac{\phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)}{\Phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)}. \qquad (6.16)
\end{aligned}
$$

This derivation follows our earlier results on the Inverse Mills Ratio (see, for example, equation 5.22). From this result, we can write the conditional expectation of log income, given that an individual has a job:

$$ E(y_i \mid x_i, z_i, j_i = 1; \beta_0, \beta_1, \gamma_0, \gamma_1, \gamma_2) = \beta_0 + \beta_1 x_i + \rho\sigma\,\frac{\phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)}{\Phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)}. \tag{6.17} $$

This implies that, for the subsample having a job, we can write:

$$ y_i = \beta_0 + \beta_1 x_i + \rho\sigma\,\frac{\phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)}{\Phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)} + \nu_i, \tag{6.18} $$

where $E(\nu_i \mid x_i, z_i) = 0$.
This should remind you of our earlier derivation of the Tobit estimator; for this reason, the Heckman selection correction model is sometimes known as a 'Generalised Tobit' estimator (or a 'Type 2 Tobit').
6.2.3 Estimation using a two-step method

The simplest way of thinking about the Heckman model is through a two-step estimation method, implied by equation 6.18:

(i) Run a probit of $j_i$ on $x_i$ and $z_i$ (i.e. for the entire sample). Use the estimates $\hat\gamma_0$, $\hat\gamma_1$ and $\hat\gamma_2$ to construct an estimated Inverse Mills Ratio,

$$ \frac{\phi(\hat\gamma_0 + \hat\gamma_1 x_i + \hat\gamma_2 z_i)}{\Phi(\hat\gamma_0 + \hat\gamma_1 x_i + \hat\gamma_2 z_i)}. $$

(ii) Run an OLS regression of $y_i$ on $x_i$ and the estimated Inverse Mills Ratio, for all individuals who have a job (see the short Stata sketch at the end of this subsection).
As Wooldridge (2002, page 564) explains, 'The procedure is sometimes called Heckit, after Heckman (1976) and the tradition of putting "it" on the end of procedures related to probit (such as Tobit).' Assuming that all of the model assumptions are correct, this method will yield consistent estimates of $\beta_0$ and $\beta_1$. The Inverse Mills Ratio term has acted in the OLS regression as a control for the differences in work ability created by the selection process; for this reason, we can term the Inverse Mills Ratio function in this context a 'control function'. Control function methods are useful for many different types of applied problems.
The Heckit procedure produces consistent estimates of $\beta_0$ and $\beta_1$, but does not generally produce consistent estimates of the standard errors if $\rho \neq 0$. There are two reasons for this: (i) $\rho \neq 0$ implies heteroskedasticity, and (ii) $\hat\gamma_0$, $\hat\gamma_1$ and $\hat\gamma_2$ are estimates of the true $\gamma_0$, $\gamma_1$ and $\gamma_2$, so the estimated Inverse Mills Ratio is a 'generated regressor'. Fortunately, Stata adjusts the standard errors to correct these problems (using the heckman ..., twostep command). Note that if we are only interested in testing $H_0 : \rho = 0$, these problems do not arise (i.e. we can trust the standard errors from the OLS regression).
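For concreteness, here is a minimal sketch of the two-step procedure written by hand. The variable names (logincome, educ, kids and job) are hypothetical, and the second-step standard errors are not corrected here; in practice, use heckman ..., twostep instead:

* Sketch: manual Heckit on hypothetical variables; SEs in step (ii) are NOT corrected
probit job educ kids                            // step (i): probit on the full sample
predict index, xb                               // fitted linear index from the probit
generate imr = normalden(index)/normal(index)   // estimated Inverse Mills Ratio
regress logincome educ imr if job == 1          // step (ii): OLS on the employed subsample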
6.2.4 Estimation using maximum likelihood

We can estimate the Heckman selection model efficiently using maximum likelihood. Like the two-step approach, the maximum likelihood estimator must explain two features of the data:

(i) whether the $i$th individual has a job, and
(ii) if the individual has a job, how much (s)he earns.
To write the log-likelihood, let's start by turning our earlier logic around. Recall that the marginal distribution of $\varepsilon_i$ is $N(0, \sigma^2)$; for a given value of $\varepsilon_i$, the conditional distribution of $\eta_i$ is:

$$ \eta_i \mid \varepsilon_i \;\sim\; N\left( \rho\sigma^{-1}\varepsilon_i,\; (1 - \rho^2) \right). \tag{6.19} $$

We can use this expression to write the conditional probability that $\eta_i \geq -(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i)$ (i.e. that the $i$th individual has a job), for a given value of $\varepsilon_i$. First, note that equation 6.19 implies:

$$ \frac{\eta_i - \rho\sigma^{-1}\varepsilon_i}{\sqrt{1 - \rho^2}} \,\Big|\, \varepsilon_i \;\sim\; N(0, 1). \tag{6.20} $$
The probability of a respondent having a job, conditional upon having a work ability $\varepsilon_i$, is therefore:

$$
\begin{aligned}
\Pr(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i + \eta_i \geq 0 \mid \varepsilon_i)
&= \Pr(\eta_i \geq -(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i) \mid \varepsilon_i) \qquad (6.21) \\
&= \Pr\left( \frac{\eta_i - \rho\sigma^{-1}\varepsilon_i}{\sqrt{1 - \rho^2}} \;\geq\; -\,\frac{\gamma_0 + \gamma_1 x_i + \gamma_2 z_i + \rho\sigma^{-1}\varepsilon_i}{\sqrt{1 - \rho^2}} \;\Big|\; \varepsilon_i \right) \qquad (6.22) \\
&= \Phi\left( \frac{\gamma_0 + \gamma_1 x_i + \gamma_2 z_i + \rho\sigma^{-1}\varepsilon_i}{\sqrt{1 - \rho^2}} \right). \qquad (6.23)
\end{aligned}
$$

This provides the probability mass of having a job, conditional on an observed value of $\varepsilon_i$ (which, for an individual in work, we can recover from the observed wage: $\varepsilon_i = y_i - \beta_0 - \beta_1 x_i$). Given that the marginal distribution of $\varepsilon_i$ is normal with variance $\sigma^2$, the marginal probability density of $\varepsilon_i$ is:

$$ \frac{1}{\sigma}\,\phi\left( \frac{y_i - \beta_0 - \beta_1 x_i}{\sigma} \right). \tag{6.24} $$

Therefore, the joint likelihood of observing the $i$th respondent in a job and of observing that respondent's particular wage is simply the marginal probability density times the conditional probability mass:

$$ \frac{1}{\sigma}\,\phi\left( \frac{y_i - \beta_0 - \beta_1 x_i}{\sigma} \right) \cdot \Phi\left( \frac{\gamma_0 + \gamma_1 x_i + \gamma_2 z_i + \rho\sigma^{-1}(y_i - \beta_0 - \beta_1 x_i)}{\sqrt{1 - \rho^2}} \right). \tag{6.25} $$
Thus, for the $i$th individual, we can write the log-likelihood as:

$$
\ell_i(\theta) =
\begin{cases}
\ln\left[ 1 - \Phi(\gamma_0 + \gamma_1 x_i + \gamma_2 z_i) \right] & \text{if the } i\text{th individual has no job;} \\[2ex]
\ln\left[ \dfrac{1}{\sigma}\,\phi\left( \dfrac{y_i - \beta_0 - \beta_1 x_i}{\sigma} \right) \right]
+ \ln \Phi\left( \dfrac{\gamma_0 + \gamma_1 x_i + \gamma_2 z_i + \rho\sigma^{-1}(y_i - \beta_0 - \beta_1 x_i)}{\sqrt{1 - \rho^2}} \right) & \text{if the } i\text{th individual has a job at wage } y_i.
\end{cases}
\tag{6.26}
$$
This, it must be said, is quite an intimidating expression, and you will have the opportunity to revise its derivation as part of the second problem set. The underlying idea, however, should be clear: we can write the log-likelihood for a selection model by thinking about the marginal distribution of the observed wage and the probability of employment conditional upon that wage having been offered.
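If you would like to see equation 6.26 at work, it can be handed directly to Stata's ml command. The following is a rough sketch under the same hypothetical variable names as before (logincome, educ, kids, job). Note that it estimates sigma and rho directly; production code (including Stata's own heckman) would instead reparameterise them as ln(sigma) and atanh(rho), so that the maximiser cannot leave the parameter space:

program define heckman_lf
    args lnf xb zg sigma rho
    tempvar u
    quietly generate double `u' = ($ML_y1 - `xb')/`sigma'
    * No job: contribution is ln[1 - Phi(zg)] = ln Phi(-zg), as in equation 6.26
    quietly replace `lnf' = ln(normal(-`zg')) if $ML_y2 == 0
    * Job at wage y: marginal density of epsilon times conditional employment probability
    quietly replace `lnf' = ln(normalden(`u')/`sigma') ///
        + ln(normal((`zg' + `rho'*`u')/sqrt(1 - `rho'^2))) if $ML_y2 == 1
end

* ml drops observations with missing values, so give non-workers a placeholder wage;
* the placeholder is never used, because the job == 0 branch ignores $ML_y1
replace logincome = 0 if job == 0
ml model lf heckman_lf (xb: logincome job = educ) (zg: = educ kids) /sigma /rho
ml init sigma:_cons=1 rho:_cons=0.5     // feasible starting values
ml maximize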

6.2.5 Maximum Likelihood or Heckman's Two-Step?
Both the Heckit estimator and the maximum likelihood estimator produce consistent estimates under the bivariate normality assumption of equation 6.6. If we are willing to accept that assumption, we should use the maximum likelihood estimator, because of its efficiency (see, for example, our earlier discussion on page 10 of these notes).

However, note that the Heckit estimator did not require the full distributional assumption of equation 6.6; instead, it required only (i) the standard normal marginal distribution assumption of equation 6.10 and (ii) the linear conditional expectation assumption of equation 6.12. As in many econometric contexts, the choice between the Heckit estimator and the maximum likelihood estimator is therefore a choice between an estimator that is more efficient and one that is more robust. Wooldridge (2002, page 566) summarises the trade-off as follows:[23]

    [Maximum likelihood estimation] will be more efficient than the two-step procedure under joint normality of [$\varepsilon_i$ and $\eta_i$], and it will produce standard errors and likelihood ratio statistics that can be used directly. ... The drawbacks are that it is less robust than the two-step procedure and that it is sometimes difficult to get the problem to converge.

Cameron and Trivedi make the same point: see pages 550-551.

[23] Wooldridge actually prefers the term 'partial maximum likelihood' here, because the wage does not enter the likelihood unless an individual has a job.
6.2.6 Illustrating using simulated data

These principles can be illustrated using our simulated data. First, Figure 6.2 shows the predicted probabilities of having a job from the first-stage probit model. I simulated the first stage using $\gamma_0 = -0.5$, $\gamma_1 = 0.1$ and $\gamma_2 = -0.25$; the five fitted lines therefore correspond to different values of $z_i$, with the highest line being $z_i = 0$ and the lowest being $z_i = 4$. I simulated the second stage using $\rho = 0.75$, $\sigma = 1$, $\beta_0 = 10$ and $\beta_1 = 0.1$.

Figure 6.3 shows the distribution of simulated $\varepsilon_i$ across different levels of education. The dotted line shows that $E(\varepsilon_i \mid x_i) = 0$ (that is, it confirms that OLS would produce consistent estimates of $\beta_0$ and $\beta_1$ if we were able to observe earnings for the entire sample).
Figure 6.4 shows the distribution of simulated $\varepsilon_i$ conditional upon employment, i.e. the distribution of $\varepsilon_i$ for individuals with a job (for simplicity, I now show only individuals having no children in the household: $z_i = 0$). The conditional expectation is noticeably larger for individuals with less education; as discussed, these individuals have 'overcome more' in order to find a job, and therefore have a higher average work ability. The smoothed line is given by $\rho\sigma$ times the Inverse Mills Ratio (i.e. as in equation 6.16).
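For readers who want to replicate these figures, the following sketch generates data with the same structure. The variable names and the uniform schooling support are my own assumptions; the parameter values are those stated above:

* Sketch: simulate the selection model with rho = 0.75, sigma = 1,
* gamma = (-0.5, 0.1, -0.25) and beta = (10, 0.1)
clear
set seed 42
set obs 10000
matrix C = (1, 0.75 \ 0.75, 1)
drawnorm eps eta, corr(C)
generate educ = floor(21*runiform())             // hypothetical schooling support 0-20
generate kids = floor(5*runiform())              // z in {0,...,4} with equal probability
generate job  = (-0.5 + 0.1*educ - 0.25*kids + eta >= 0)
generate logincome = 10 + 0.1*educ + eps if job == 1
heckman logincome educ, select(job = educ kids)  // ML; add the twostep option for Heckit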

Figure 6.2: Conditional probability of earning income (simulated data)

Figure 6.3: Conditional expectation of $\varepsilon$ (simulated data)

Figure 6.4: Conditional expectation of $\varepsilon$ (simulated data, $z = 0$)

Table 6.1 shows three sets of estimates. Note that the primary object of interest is $\beta_1$, the effect of education upon log earnings. Column 3 reports the OLS estimate: $\hat\beta_1 = 0.059$. OLS substantially underestimates the effect of education, for the reasons that we have discussed. Columns 1 and 2 report the Heckman estimates; they respectively show the estimates from maximum likelihood and from the two-step method. The estimates are nearly identical, and both estimate $\beta_1$ very close to its true value ($\beta_1 = 0.1$).

Table 6.1: Heckman estimates for a simulated selection model

                                 Heckman ML    Heckman two-step      OLS
                                    (1)              (2)             (3)
First stage:
  γ0 (true γ0 = -0.5)             -0.505           -0.509
                                  (0.031)          (0.032)
  γ1 (true γ1 = 0.1)               0.100            0.100
                                  (0.002)          (0.002)
  γ2 (true γ2 = -0.25)            -0.247           -0.245
                                  (0.009)          (0.010)
Second stage:
  β0 (true β0 = 10.0)             10.041            9.965          10.987
                                  (0.055)          (0.094)         (0.029)
  β1 (true β1 = 0.1)               0.098            0.101           0.059
                                  (0.003)          (0.004)         (0.002)
Other parameters:
  ρ (true ρ = 0.75)                0.727            0.768
                                  (0.024)
  σ (true σ = 1)                   0.983            1.005
                                  (0.017)
  ρσ (true ρσ = 0.75)              0.715            0.772
                                  (0.034)          (0.067)
Obs.                               10000            10000           10000
Log-likelihood                  -11822.11

Confidence: *** 99%, ** 95%, * 90%. (In the two-step column, ρ and σ are derived
from the estimated coefficient on the Inverse Mills Ratio, so no standard errors
are reported for them.)

6.3 An unconvincing alternative (or 'How can I find log(0)?')

'I've got a bit of a problem with my thesis...'
'Go on...'
'Well, I have log earnings on the left-hand side of my estimation.'
'Okay...'
'And lots of people in my data don't have any earnings.'
'Sure... what's the problem?'
'Well... how do I find the log of zero?'
In one form or another, almost every applied researcher has had this conversation at some point in his or her life, yet it is a conversation that is rarely addressed directly by econometrics textbooks. Of course, the problem really has nothing to do with finding the log of zero; as we have discussed in this lecture, the problem is that those individuals who do not work are unlikely to be drawn randomly from our population of interest. This is why we needed to discuss 'work ability' and 'work desire', and the consequent Heckman selection model.
But there is another approach that deserves a mention, if only because it is reasonably popular: just ignore the selection problem, and instead code the outcome variable as $y = \log(\text{wage} + 1)$. This function is almost identical to $y = \log(\text{wage})$ for individuals who have a positive wage, but now allows us to include $y = 0$ in our estimations for everyone without a job. For example, we could then estimate the effect of education using the regression $\log(\text{wage}_i + 1) = \beta_0 + \beta_1 x_i + \varepsilon_i$.

This is a superficially appealing approach, because we appear to have resolved the entire selection problem without needing any of the irritations of Heckman's approach (and, indeed, without needing an exclusion restriction). There is just one problem: it doesn't work. Figure 6.5 shows why, using the same simulated example that we have just been examining: because the $\log(\text{wage}_i + 1)$ transformation combines individuals who have a job with those who do not, the regression estimates (shown by the dotted line) conflate two separate effects: (i) the effect of education on whether an individual has a job, and (ii) the effect of education on the individual's potential earnings.[24] Selection problems should be confronted directly, rather than sidestepped with arbitrary transformations.[25]
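The failure is easy to reproduce by continuing the simulation sketch from section 6.2.6 (again, the variable names are my own):

* Sketch: the log(wage + 1) 'fix', applied to the simulated data from section 6.2.6
generate wage = cond(job == 1, exp(logincome), 0)   // non-workers earn zero
generate y_plus1 = ln(wage + 1)
regress y_plus1 educ    // badly biased: the notes report 0.430 against a true beta1 of 0.1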

[24] The log(y + 1) transformation produces an estimate $\hat\beta_1 = 0.430$, whereas the true value used for simulation was $\beta_1 = 0.1$.

[25] The same could be said of several variations on the theme, including, for example, using $y = \log(\text{wage}_i + c)$ for some arbitrarily chosen $c > 1$, and the inverse hyperbolic sine transformation, $y = \log\left(\text{wage}_i + \sqrt{\text{wage}_i^2 + 1}\right)$.

Figure 6.5: Scatter plot of log(wage + 1) against years of education (with jitter)

6.4 Selection in the Tanzanian data

Table 6.2 shows results from a Heckman two-step estimation on the Tanzanian data. For the excluded variable, $z_i$, we use the number of babies in the household (where a 'baby' is defined as a child aged 0 or 1). The estimations are run separately for men ($N = 4953$) and women ($N = 5047$). Compared to the OLS estimations, the Heckman correction slightly reduces the estimated return to education for men, and slightly increases the estimate for women. This suggests that there may be a positive correlation between work desire and work ability for women, but a negative correlation for men (note the sign of $\widehat{\rho\sigma}$ for the two Heckman estimations). However, note that neither selection effect is significant; i.e. we cannot rule out, for either group, that selection into employment is 'as if' random, so that the OLS estimates are consistent.[26]

Of course, these conclusions depend upon our model being correct, and this includes the exclusion restriction. I leave you to consider how reasonable these assumptions are.

[26] Notice, however, that the standard errors on the Heckman estimates are much larger than the standard errors on OLS.

Table 6.2: Income and education in Tanzania: Heckman estimates

                                         Men                      Women
                                  OLS        Heckman        OLS        Heckman
                                  (1)          (2)          (3)          (4)
First stage:
  Const.                                     -0.799                    -0.967
                                             (0.040)                   (0.034)
  Years of education                          0.127                     0.097
                                             (0.006)                   (0.005)
  Babies in the household                    -0.162                    -0.252
                                             (0.036)                   (0.042)
Second stage:
  Const.                        10.639       10.835       10.570        9.769
                                (0.064)      (0.692)      (0.063)      (0.649)
  Years of education             0.112        0.098        0.089        0.126
                                (0.009)      (0.043)      (0.009)      (0.031)
Other:
  ρσ                                         -0.178                     0.524
                                             (0.518)                   (0.423)
Obs.                              2240         4953         1420         5047

Confidence: *** 99%, ** 95%, * 90%.

6.5 Flexible extensions to the Heckman model

We have considered two versions of the Heckman selection model. One version, the maximum likelihood approach, depends upon a strong assumption about the joint distribution of $\varepsilon_i$ and $\eta_i$. The other version, the two-step approach, is less efficient, but relies only upon the assumption that the conditional expectation of $\varepsilon_i$ is linear in $\eta_i$. But this assumption can be relaxed further. We can generalise equation 6.12 by writing that conditional expectation as a flexible function of $\eta_i$:

$$ E(\varepsilon_i \mid \eta_i) = f(\eta_i). \tag{6.27} $$

This implies that, instead of equation 6.18, we can write our second stage as a flexible function of the Inverse Mills Ratio:

$$ y_i = \beta_0 + \beta_1 x_i + g\left( \frac{\phi(\hat\gamma_0 + \hat\gamma_1 x_i + \hat\gamma_2 z_i)}{\Phi(\hat\gamma_0 + \hat\gamma_1 x_i + \hat\gamma_2 z_i)} \right) + \nu_i. \tag{6.28} $$
Deaton (1997, page 105) notes that the function $g$ can be estimated parametrically, for example by using a polynomial specification. Writing $\hat\lambda_i = \phi(\hat\gamma_0 + \hat\gamma_1 x_i + \hat\gamma_2 z_i)/\Phi(\hat\gamma_0 + \hat\gamma_1 x_i + \hat\gamma_2 z_i)$ for the estimated Inverse Mills Ratio, such a specification is:

$$ y_i = \beta_0 + \beta_1 x_i + \delta_1 \hat\lambda_i + \delta_2 \hat\lambda_i^2 + \delta_3 \hat\lambda_i^3 + \delta_4 \hat\lambda_i^4 + \nu_i. \tag{6.29} $$

This specification nests the two-step estimator of equation 6.18 for the special case that $\delta_1 = \rho\sigma$ and $\delta_2 = \delta_3 = \delta_4 = 0$. Alternatively, we could estimate equation 6.28 semiparametrically, as a partial linear model, as in, for example, Newey, Powell and Walker (1990).
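In Stata, the polynomial second stage is a short extension of the manual Heckit sketch from section 6.2.3 (continuing under the same hypothetical variable names, so the variables index and imr already exist):

* Sketch: flexible (polynomial) control function in the second step, as in equation 6.29
generate imr2 = imr^2
generate imr3 = imr^3
generate imr4 = imr^4
regress logincome educ imr imr2 imr3 imr4 if job == 1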

6.6 Appendix to Lecture 6: Stata code

Note: You do not need to know any Stata code for any exam question about limited dependent variables.
Having cleared the memory and loaded our data, we can implement the Heckman model using the command heckman; we choose the two-step estimator by adding the option twostep. We can therefore run the following:

* OLS on workers only, then the Heckit two-step, separately by sex
reg logincome educ if sex == "male"
heckman logincome educ if sex == "male", ///
    twostep select(educ HHbabies)
reg logincome educ if sex == "female"
heckman logincome educ if sex == "female", ///
    twostep select(educ HHbabies)
