
Econometrics - II By: Assegid A.

Chapter One
Regression Analysis with Qualitative Information: Binary (or Dummy Variables)
1.1 Describing Qualitative Information
Qualitative factors often come in the form of binary information: a person is female or male; a
person does or does not own a personal computer; a firm offers a certain kind of employee
pension plan or it does not; a state administers capital punishment or it does not. In all of these
examples, the relevant information can be captured by defining a binary variable or a zero-one
variable. In econometrics, binary variables are most commonly called dummy variables,
although this name is not especially descriptive.

In defining a dummy variable, we must decide which event is assigned the value one and which
is assigned the value zero. For example, in a study of individual wage determination, we might
define female to be a binary variable taking on the value one for females and the value zero for
males. The name in this case indicates the event with the value one. The same information is
captured by defining male to be one if the person is male and zero if the person is female. Either
of these is better than using gender because this name does not make it clear when the dummy
variable is one: does gender 1 correspond to male or female? What we call our variables is
unimportant for getting regression results, but it always helps to choose names that clarify
equations and expositions.

Suppose in the wage example that we have chosen the name female to indicate gender. Further,
we define a binary variable married to equal one if a person is married and zero if otherwise.
Table 1.1 describes partial listing of data containing dummy variables. We see that Person 1 is
female and not married, Person 2 is female and married, Person 3 is male and not married, and so
on.

Why do we use the values zero and one to describe qualitative information? In a sense, these
values are arbitrary: any two different values would do. The real benefit of capturing qualitative
information using zero-one variables is that it leads to regression models where the parameters
have very natural interpretations, as we will see

1|Page
DoE CoBE DU

Table 1.1 Partial List of Data Containing Dummy Variables

Person   wage    educ   exper   female   married
1        3.10    11     2       1        0
2        3.24    12     22      1        1
3        3.00    11     2       0        0
4        6.00    8      44      0        1
5        5.30    12     7       0        1
...      ...     ...    ...     ...      ...
525      11.56   16     5       0        1
526      3.50    14     5       1        0

1.2 Dummy as Independent Variables


A single dummy independent variable
How do we incorporate binary information into regression models? In the simplest case, with
only a single dummy explanatory variable, we just add it as an independent variable in the
equation. For example, consider the following simple model of hourly wage determination:
wage = β0 + δ0 female + β1 educ + u …………………………………..…….(1.1)

We use δ0 as the parameter on female in order to highlight the interpretation of the parameters multiplying dummy variables; we can use whichever notation is most convenient.

In model (1.1), only two observed factors affect wage: gender and education. Since female = 1 when the person is female and female = 0 when the person is male, the parameter δ0 has the following interpretation: δ0 is the difference in hourly wage between females and males, given the same amount of education (and the same error term u). Thus, the coefficient δ0 determines whether there is discrimination against women: if δ0 < 0, then, for the same level of other factors, women earn less than men on average.

In terms of expectations, if we assume the zero conditional mean assumption E(u | female, educ) = 0, then

δ0 = E(wage | female = 1, educ) − E(wage | female = 0, educ)


Since female = 1 corresponds to females and female = 0 corresponds to males, we can write this more simply as

δ0 = E(wage | female, educ) − E(wage | male, educ) ………………………..(1.2)

The key here is that the level of education is the same in both expectations; the difference, δ0, is due to gender only.

The situation can be depicted graphically as an intercept shift between males and females.

[Figure: wage-education profiles for an example with δ0 > 0. The line for females (female = 1) is y = (β0 + δ0) + β1x and the line for males (female = 0) is y = β0 + β1x; both lines have slope β1, and the vertical distance between the intercepts is δ0.]

The figure shows that, when δ0 > 0, men earn a fixed amount less per hour than women. The difference does not depend on the amount of education, which explains why the wage-education profiles for women and men are parallel. The intercept for males is β0, and the intercept for females is β0 + δ0. Since there are just two groups, we only need two different intercepts. This means that, in addition to β0, we need to use only one dummy variable; we have chosen to include the dummy variable for females. Using two dummy variables would introduce perfect collinearity because female + male = 1, which means that male is a perfect linear function of female. Including dummy variables for both genders is the simplest example of the so-called dummy variable trap, which arises when too many dummy variables describe a given number of groups.
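The collinearity behind the dummy variable trap can be verified numerically. The sketch below uses made-up data (the variable values are hypothetical, not from the text's sample): including an intercept plus dummies for both genders costs the design matrix a column of rank, while keeping only one dummy does not.

```python
import numpy as np

# Dummy variable trap: with an intercept, including dummies for BOTH genders
# makes the design matrix rank-deficient, because female + male equals the
# intercept column (male = 1 - female). Data below are made up for illustration.
female = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0], dtype=float)
male = 1.0 - female
educ = np.array([11, 12, 8, 14, 16, 9, 10, 13, 15, 12], dtype=float)
const = np.ones(10)

X_trap = np.column_stack([const, female, male, educ])  # 4 columns, rank only 3
X_ok = np.column_stack([const, female, educ])          # 3 columns, full rank 3

print(np.linalg.matrix_rank(X_trap))  # 3
print(np.linalg.matrix_rank(X_ok))    # 3
```

Because X_trap has fewer independent columns than parameters, OLS cannot separately identify coefficients on the constant, female, and male; dropping one dummy (the base group) restores identification.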


Nothing much changes when more explanatory variables are involved. Taking males as the base
group, a model that controls for experience and tenure in addition to education is:

wage = β0 + δ0 female + β1 educ + β2 exper + β3 tenure + u …………………….(1.3)

If educ, exper, and tenure are all relevant productivity characteristics, the null hypothesis of no difference between men and women is H0: δ0 = 0. The alternative that there is discrimination against women is H1: δ0 < 0.

How can we actually test for wage discrimination? The answer is simple: just estimate the model
by OLS, exactly as before, and use the usual t statistic. Nothing changes about the mechanics of
OLS or the statistical theory when some of the independent variables are defined as dummy
variables. The only difference with what we have done up until now is in the interpretation of the
coefficient on the dummy variable.

EXAMPLE 1.1 Hourly Wage Equation

wage-hat = −1.57 − 1.81 female + .572 educ + .025 exper + .141 tenure
           (.72)   (.26)         (.049)      (.012)       (.021)
n = 526, R² = .364 …………………(1.4)

(The numbers in parentheses are the standard errors of the coefficient estimates.)

The negative intercept—the intercept for men, in this case—is not very meaningful. The
coefficient on female is interesting, because it measures the average difference in hourly wage
between a woman and a man, given the same levels of educ, exper, and tenure. If we take a
woman and a man with the same levels of education, experience, and tenure, the woman earns,
on average, $1.81 less per hour than the man.
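The mechanics can be seen by simulating data with a known gap and re-estimating it by OLS. The sketch below uses synthetic data (not the 526-worker sample behind equation (1.4)); the true gap is set to −1.81 only to echo that estimate, and the other coefficients are arbitrary.

```python
import numpy as np

# Simulate wages with a built-in female/male gap of -1.81 (echoing the
# estimate in (1.4)), then recover it by OLS. Synthetic data, for illustration.
rng = np.random.default_rng(1)
n = 5000
female = rng.integers(0, 2, size=n).astype(float)
educ = rng.integers(8, 17, size=n).astype(float)
wage = 1.0 - 1.81 * female + 0.5 * educ + rng.normal(0, 1, size=n)

X = np.column_stack([np.ones(n), female, educ])
beta_hat, *_ = np.linalg.lstsq(X, wage, rcond=None)
print(beta_hat[1])   # estimated gap, close to -1.81
```

Nothing about the OLS computation changes because one regressor is a dummy; only the interpretation of its coefficient does.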

Interpreting Coefficients on Dummy Explanatory Variables When the Dependent Variable is log(y)
A common specification in applied work has the dependent variable appearing in logarithmic
form, with one or more dummy variables appearing as independent variables. How do we
interpret the dummy variable coefficients in this case? Not surprisingly, the coefficients have a
percentage interpretation.
EXAMPLE 1.2: Housing Price Regression


log(price)-hat = −1.35 + .168 log(lotsize) + .707 log(sqrft) + .027 bdrms + .054 colonial
n = 88, R² = .649 ……(1.5)

All the explanatory variables are continuous except colonial, which is a binary variable equal to one if the house is of the colonial style. What does the coefficient on colonial mean? For given levels of lotsize, sqrft, and bdrms, the difference in log(price) between a house of colonial style and a house of another style is .054. This means that a colonial-style house is predicted to sell for about 5.4% more, holding other factors fixed.
This example shows that, when log(y) is the dependent variable in a model, the coefficient on a
dummy variable, when multiplied by 100, is interpreted as the percentage difference in y,
holding all other factors fixed.
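The "multiply by 100" rule is an approximation; the exact percentage difference implied by a dummy coefficient d in a log(y) model is 100·[exp(d) − 1]. A quick check with the colonial coefficient from (1.5):

```python
import math

# Approximate vs. exact percentage interpretation of a dummy coefficient
# in a log(y) regression, using the colonial coefficient from (1.5).
d = 0.054
approx_pct = 100 * d                    # approximate: 5.4%
exact_pct = 100 * (math.exp(d) - 1)     # exact: about 5.55%
print(round(approx_pct, 2), round(exact_pct, 2))
```

For small coefficients the two agree closely; the gap widens as the coefficient grows, so the exact formula is preferred for large dummy coefficients.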
1.3 Dummy as Dependent Variable
In all of the models up until now, the dependent variable y has had quantitative meaning (for example, y is a dollar amount, a test score, a percent, or the log of one of these). What happens if we want to use multiple regression to explain a qualitative event? In the simplest case, and one that often arises in practice, the event we would like to explain is a binary outcome. In other words, our dependent variable, y, takes on only two values: zero and one.

1.3.1 The Linear Probability Model (LPM)


The regression model places no restrictions on the values that the independent (exogenous)
variables take on, except that they not be exact linear combinations of each other. They may be
continuous, interval level (net worth of a company), they may be only positive or zero (percent
of vote a party received), they may be integers (number of children in a family), or they may be
dichotomous (so-called dummy variables, e.g., scored as a one if male, zero if female).
The dependent variable, however, is assumed to be continuous. Since there are no restrictions on the X's, the β's, or the ε's, Yi must be free to take on any value from negative infinity to positive infinity. In practice, Yi will take on only a small set of values in the data. For example, even if Yi is family income for the year, only a relatively small range of values will be observed. But in this case, since any predicted value of Yi will be similarly restricted, the assumption of continuous, interval measurement of Yi will not be a bad approximation. But if Yi can take on only two values (say, zero and one), the violation of this assumption is so egregious as to merit special attention.


The term linear probability model is used to denote a regression model in which the dependent
variable y is a dichotomous variable taking the value 1 or zero. What does it mean to write down
a multiple regression model such as

y = β0 + β1x1 + β2x2 + … + βkxk + u ………………………………………………….(1.6)

Since y takes the value 1 or zero, the errors in equation (1.6) can take only two values: u = 1 − (β0 + β1x1 + … + βkxk) when y = 1, and u = −(β0 + β1x1 + … + βkxk) when y = 0, and the respective probabilities of these events are P(y = 1 | x) and 1 − P(y = 1 | x). The probability of observing a 0 or 1 in any one case is treated as depending on one or more explanatory variables. For the linear probability model, this relationship is a particularly simple one, and it allows the model to be fitted by ordinary regression. The linear regression model applied to a dummy response variable is called the linear probability model.
Since y can take on only two values, βj cannot be interpreted as the change in y given a one-unit increase in xj, holding all other factors fixed: y either changes from zero to one or from one to zero. Nevertheless, the βj still have useful interpretations. If we assume that the zero conditional mean assumption holds, that is, E(u | x1, …, xk) = 0 (in which case the OLS estimators are unbiased), then we have, as always,

E(y | x) = β0 + β1x1 + … + βkxk,

where x is shorthand for all of the explanatory variables. The key point is that when y is a binary variable taking on the values zero and one, it is always true that P(y = 1 | x) = E(y | x): the probability of "success", that is, the probability that y = 1, is the same as the expected value of y. Thus, we have the important equation

P(y = 1 | x) = β0 + β1x1 + … + βkxk ……………………………………………(1.7)

which says that the probability of success, say p(x) = P(y = 1 | x), is a linear function of the xj. Equation (1.7) is an example of a binary response model, and P(y = 1 | x) is also called the response probability. Because probabilities must sum to one, P(y = 0 | x) = 1 − P(y = 1 | x) is also a linear function of the xj.

The multiple linear regression model with a binary dependent variable is called the linear probability model (LPM) because the response probability is linear in the parameters βj. In the LPM, βj measures the change in the probability of success when xj changes, holding other factors fixed:

ΔP(y = 1 | x) = βj Δxj …………………..……………………………….…………….(1.8)


With this in mind, the multiple regression model allows us to estimate the effect of various explanatory variables on qualitative events. If we write the estimated equation as

ŷ = β̂0 + β̂1x1 + … + β̂kxk,

we must now remember that ŷ is the predicted probability of success. Therefore, β̂0 is the predicted probability of success when each xj is set to zero, which may or may not be interesting. The slope coefficient β̂1 measures the predicted change in the probability of success when x1 increases by one unit.
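Because the LPM is just OLS on a 0/1 outcome, it can be sketched in a few lines. The data below are synthetic, with a true response probability P(y = 1 | x) = .2 + .5x, so the fitted slope should land near .5:

```python
import numpy as np

# Linear probability model on synthetic data: a 0/1 outcome regressed on x
# by OLS. The slope estimates the change in P(y=1) per one-unit change in x.
rng = np.random.default_rng(2)
n = 10000
x = rng.uniform(0, 1, size=n)
p = 0.2 + 0.5 * x                           # true response probability
y = (rng.uniform(size=n) < p).astype(float)

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)   # roughly [0.2, 0.5]
```

Here the intercept is the predicted probability of success at x = 0 and the slope is the predicted change in that probability per unit of x, exactly the interpretation given above.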
In order to correctly interpret a linear probability model, we must know what constitutes a
“success”. Thus, it is a good idea to give the dependent variable a name that describes the event
y = 1. As an example, let inlf ("in the labor force") be a binary variable indicating labor force participation by a married woman: inlf = 1 if the woman reports working for a wage outside the home at some point during the year, and zero otherwise. We assume that labor force participation depends on other sources of income, including the husband's earnings (nwifeinc, measured in thousands of dollars), years of education (educ), past years of labor market experience (exper), age, number of children less than six years old (kidslt6), and number of kids between 6 and 18 years of age (kidsge6). The estimated linear probability model, where 428 of the 753 women in the sample report being in the labor force at some point, is:

inlf-hat = .586 − .0034 nwifeinc + .038 educ + .039 exper − .00060 exper² − .016 age − .262 kidslt6 + .013 kidsge6
n = 753, R² = .264 ……...(1.9)

The significance of coefficients can be tested by using the usual t statistics.


To interpret the estimates, we must remember that a change in an independent variable changes the probability that inlf = 1. For example, the coefficient on educ means that, everything else in (1.9) held fixed, another year of education increases the probability of labor force participation by .038. If we take this equation literally, 10 more years of education increases the probability of being in the labor force by 10(.038) = .38, which is a pretty large increase in a probability. The coefficient on nwifeinc implies that, if Δnwifeinc = 10 (which means an increase of $10,000), the probability that a woman is in the labor force falls by .034. Experience has been entered as a quadratic to allow past experience to have a diminishing effect on the labor force participation probability. Holding other factors fixed, the estimated change in the probability is approximated as .039 − 2(.0006)exper = .039 − .0012 exper. The point at which past experience has no effect on the probability of labor force participation is .039/.0012 = 32.5, which is a high level of experience.
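The turning-point arithmetic for the quadratic can be checked directly; the coefficients below are those reported in (1.9):

```python
# Turning point of the quadratic in experience from equation (1.9):
# the marginal effect .039 - 2(.0006)*exper is zero at exper = .039/.0012.
b_exper = 0.039        # coefficient on exper
b_exper2 = -0.0006     # coefficient on exper squared

def marginal_effect(e):
    """Estimated change in P(inlf=1) from one more year at experience e."""
    return b_exper + 2 * b_exper2 * e

turning_point = -b_exper / (2 * b_exper2)
print(turning_point)   # 32.5 years of experience
```

Past 32.5 years, the quadratic implies additional experience lowers the participation probability, though few women in the sample would be in that range.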
This example illustrates how easy linear probability models are to estimate and interpret, but it
also highlights some shortcomings of the LPM. First, it is easy to see that, if we plug in certain
combinations of values for the independent variables into (1.9), we can get predictions either less
than zero or greater than one. Since these are predicted probabilities, and probabilities must be
between zero and one, this can be a little embarrassing. For example, what would it mean to
predict that a woman is in the labor force with a probability of -.10?
A related problem is that a probability cannot be linearly related to the independent variables for
all their possible values.
Predicted probabilities outside the unit interval are a little troubling when we want to make
predictions, but this is rarely central to an analysis. Usually, we want to know the ceteris paribus
effect of certain variables on the probability.
Due to the binary nature of y, the linear probability model does violate one of the Gauss-Markov
assumptions. When y is a binary variable, its variance, conditional on x, is

Var(y | x) = p(x)[1 − p(x)] ……………………………………………………(1.10)

where p(x) = β0 + β1x1 + … + βkxk is shorthand for the probability of success. This means that, except in the case where the probability does not depend on any of the independent variables, there must be heteroskedasticity in a linear probability model. This does not cause bias in the OLS estimators of the βj. But homoskedasticity is crucial for justifying the usual t and F statistics, even in large samples.
We can also include dummy independent variables in models with dummy dependent variables.
The coefficient measures the predicted difference in probability when the dummy variable goes
from zero to one.
One problem with the linear probability model is that the error term is not normally distributed. Because yi can take on only the values 0 and 1, the error is dichotomous as well, not normally distributed: if yi = 1, which occurs with probability p(xi), then ui = 1 − p(xi); alternatively, if yi = 0, then ui = −p(xi).

Another problem is non-constant error variance. If the assumption of linearity holds over the range of the data, then E(ui) = 0, and using the relations just noted,

Var(ui) = p(xi)[1 − p(xi)],

where p(xi) = β0 + β1xi1 + … + βkxik.

The heteroskedasticity of the errors is a problem for ordinary least squares estimation of the linear probability model. Because the variance is not constant, the estimators are not efficient: the OLS estimators will be unbiased but not best (i.e., they will not have the smallest possible sampling variance). As a further result, estimates of the sampling variances will not be correct, and any hypothesis tests (e.g., the t and F tests) or confidence intervals based on these sampling variances will be invalid, even for very large samples. Thus, even in the best circumstances, OLS regression estimates of a dichotomous dependent variable are, although unbiased, not very desirable. There is, however, a solution to this problem that is a fairly simple modification of OLS regression.
Goldberger (1964) proposed a two-step, weighted estimator to correct the problems of OLS regression of the linear probability model. The first step is to run the usual OLS regression of Yi on the X's, which yields the unbiased estimates and fitted values Ŷi. From these, construct a weight for each observation, wi = 1/√(Ŷi(1 − Ŷi)); these weights are just the reciprocals of the estimated standard deviations of ui. In the second step, multiply both sides of the model by wi and estimate the parameters by OLS on the weighted data (that is, by weighted least squares).
Nonlinearity: Most seriously, the assumption that E(ui) = 0, that is, the assumption of linearity, is only tenable over a limited range of X values. If the range of the X's is sufficiently broad, then the linear specification cannot confine p(xi) to the unit interval [0, 1]. It makes no sense, of course, to interpret a number outside of the unit interval as a probability. This difficulty motivates the latent variable approach behind the logit and probit models discussed next.
1.3.2 The Logit and Probit Models
To overcome the problems with the linear model, there exists a class of binary choice models (or univariate dichotomous models) designed to model the "choice" between two discrete alternatives. These models essentially describe the probability that yi = 1 directly, although they are often derived from an underlying latent variable model. In general, we assume that


P{yi = 1 | xi} = G(xi, β) ……………………………………………………………..…(1.11)

for some function G(·). This equation says that the probability of having yi = 1 depends on the vector xi containing individual characteristics. So, for example, the probability that a person owns a house depends on his income, education level, age and marital status. Or, from a different field, the probability that an insect survives a dose of poisonous insecticide depends upon the quantity of the dose, and possibly some other characteristics. Clearly, the function G(·) in (1.11) should take on values in the interval [0, 1] only. Usually, one restricts attention to functions of the form G(xi, β) = F(xi′β). As F(·) also has to be between 0 and 1, it seems natural to choose F to be some distribution function. Common choices are the standard normal distribution function

F(w) = Φ(w) = ∫ from −∞ to w of (2π)^(−1/2) exp{−t²/2} dt ……………………………..……………(1.12)

leading to the so-called probit model, and the standard logistic distribution function, given by

F(w) = L(w) = e^w / (1 + e^w) ……………………………………(1.13)

This results in the logit model. The logit function can be derived from the odds ratio: writing the log-odds as

log[ p / (1 − p) ] = xi′β

and solving this for p gives p = e^(xi′β) / (1 + e^(xi′β)), which is exactly L(xi′β).
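The logistic function and its inverse (the log-odds, or logit, transform) can be written out directly; a short numerical check confirms that solving the log-odds equation for p recovers F(w):

```python
import math

# Standard logistic distribution function F(w) = e^w / (1 + e^w) and its
# inverse, the logit (log-odds) transform log[p/(1-p)].
def logistic(w):
    return math.exp(w) / (1.0 + math.exp(w))

def logit(p):
    return math.log(p / (1.0 - p))

p = logistic(0.7)
print(p)          # strictly inside (0, 1)
print(logit(p))   # recovers 0.7: the logit is the inverse of the logistic CDF
```

Unlike the linear specification, logistic(w) stays inside (0, 1) for every real w, which is precisely why it is a valid choice for F.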

A third choice corresponds to a uniform distribution over the interval [0, 1] with distribution
function

F(w) = 0 if w < 0;  F(w) = w if 0 ≤ w ≤ 1;  F(w) = 1 if w > 1 ………………………………………..(1.14)

This results in the so-called linear probability model, which is similar to the regression model discussed in Section 1.3.1, except that the probabilities are set to 0 or 1 if xi′β exceeds the lower or upper limit, respectively.
In fact, the first two models (probit and logit) are the most common in applied work. Both the standard normal and the standard logistic random variable have an expectation of zero, while the latter has a variance of π²/3 instead of 1. These two distribution functions are very similar if one corrects for this difference in scaling; the logistic distribution has slightly heavier tails. Accordingly, the probit and logit models typically yield very similar results in empirical work.


Apart from their signs, the coefficients in these binary choice models are not easy to interpret directly. One way to interpret the parameters (and to ease comparison across different models) is to consider the partial derivative of the probability that yi equals one with respect to a continuous explanatory variable, xik, say. For the three models above, we obtain:

∂Φ(xi′β)/∂xik = φ(xi′β) βk,
∂[e^(xi′β)/(1 + e^(xi′β))]/∂xik = [e^(xi′β)/(1 + e^(xi′β))²] βk,
∂(xi′β)/∂xik = βk,

where φ(·) denotes the standard normal density function. Except for the last model, the effect of a change in xik depends upon the values of xi. In all cases, however, the sign of the effect of a change in xik corresponds to the sign of its coefficient βk. For a discrete explanatory variable, for example a dummy, the effect of a change can be determined from computing the implied probabilities for the two different outcomes, fixing the values of all other explanatory variables.
1.3.4 Estimation
Most commonly, the parameters in binary choice models (or limited dependent variable models in general) are estimated by the method of maximum likelihood. In general, the likelihood contribution of observation i with yi = 1 is given by P{yi = 1 | xi; β} as a function of the unknown parameter vector β, and similarly for yi = 0. The likelihood function for the entire sample is thus given by

L(β) = ∏ over i of P{yi = 1 | xi; β}^(yi) · P{yi = 0 | xi; β}^(1−yi) ………….….. (1.15)

where we include β in the expressions for the probabilities to stress that the likelihood function is a function of β. Substituting P{yi = 1 | xi; β} = F(xi′β), the log likelihood is

log L(β) = Σi yi log F(xi′β) + Σi (1 − yi) log[1 − F(xi′β)] …………………………….(1.16)

Substituting the appropriate form for F gives an expression that can be maximized with respect to β. As indicated above, the values of β and their interpretation depend upon the distribution function that is chosen.
It is instructive to consider the first order conditions of the maximum likelihood problem.
Differentiating (1.16) with respect to β yields

∂ log L(β)/∂β = Σi [ (yi − F(xi′β)) f(xi′β) / ( F(xi′β)(1 − F(xi′β)) ) ] xi = 0 …………………………………………..(1.17)


where f(·) = F′(·) is the derivative of the distribution function (so f is the density function). The term in square brackets is often referred to as the generalized residual of the model. It equals f(xi′β)/F(xi′β) for the positive observations (yi = 1) and −f(xi′β)/[1 − F(xi′β)] for the zero observations (yi = 0). The first order conditions thus say that each explanatory variable should be orthogonal to the generalized residual (over the whole sample). This is comparable to the OLS first order conditions, which state that the least squares residuals are orthogonal to each variable in xi. In a probit model, the value of xi′β is taken to be the z-value of a normal distribution: higher values of xi′β mean that the event yi = 1 is more likely to happen. One has to be careful about the interpretation of estimation results here: a one-unit change in xik leads to a change of βk in the z-score of yi.
For the logit model we can simplify (1.17) to

Σi [ yi − e^(xi′β)/(1 + e^(xi′β)) ] xi = 0 …………………………………………………….(1.18)

The solution of (1.18) is the maximum likelihood estimator β̂. In linear regression, if the coefficient on x is β, then a one-unit increase in x increases Y by β. But what exactly does it mean in probit that the coefficient on x is β? It means that a one-unit increase in x raises the z-score of Pr(Y = 1) by β; the implied change in the probability itself depends on the value of xi′β.
From this estimate we can estimate the probability that yi = 1 for a given xi as

p̂i = e^(xi′β̂) / (1 + e^(xi′β̂))
Consequently, the first order conditions for the logit model imply that

Σi p̂i xi = Σi yi xi

Thus, if xi contains a constant term (and there is no reason why it should not), then the sum of the estimated probabilities is equal to Σi yi, the number of observations in the sample for which yi = 1. In other words, the predicted frequency is equal to the actual frequency. Similarly, if xi includes a dummy variable, say 1 for females and 0 for males, then the predicted frequency will be equal to the actual frequency for each gender group. Although a similar result does not hold exactly for the probit model, it does hold approximately by virtue of the similarity of the logit and probit models.
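Because the log likelihood is globally concave, the logit first order conditions (1.18) can be solved reliably by Newton's method. The sketch below, on synthetic data with made-up true coefficients (0.5, 1.0), also verifies the frequency-matching property: with a constant in xi, the fitted probabilities sum to the number of ones.

```python
import numpy as np

# Fit a logit model by Newton's method on synthetic data, then check the
# first order condition for the constant: sum of fitted probabilities
# equals the number of observations with y = 1.
rng = np.random.default_rng(4)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * x)))
y = (rng.uniform(size=n) < p_true).astype(float)

beta = np.zeros(2)
for _ in range(25):                                  # Newton iterations
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    score = X.T @ (y - p)                            # cf. equation (1.18)
    hessian = -(X * (p * (1 - p))[:, None]).T @ X    # negative definite
    beta = beta - np.linalg.solve(hessian, score)

p_hat = 1.0 / (1.0 + np.exp(-X @ beta))
print(beta)                     # near the true values (0.5, 1.0)
print(p_hat.sum(), y.sum())     # predicted frequency = actual frequency
```

At the solution, the first element of the score, Σ(yi − p̂i), is zero, which is exactly the statement that predicted and actual frequencies coincide.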


A look at the second order conditions of the ML problem reveals that the matrix of second order derivatives is negative definite (assuming that the x's are not collinear). Consequently, the log likelihood function is globally concave, and convergence of the iterative maximum likelihood algorithm is guaranteed (and usually quite fast).
1.4 Goodness-of-fit
A goodness-of-fit measure is a summary statistic indicating the accuracy with which the model approximates the observed data, like the R² measure in the linear regression model. When the dependent variable is qualitative, accuracy can be judged either in terms of the fit between the calculated probabilities and observed response frequencies or in terms of the model's ability to forecast observed responses. Contrary to the linear regression model, there is no single measure of goodness-of-fit in binary choice models, and a variety of measures exists.
Often, goodness-of-fit measures are implicitly or explicitly based on comparison with a model
that contains only a constant as explanatory variable. Let logL1 denote the maximum log
likelihood value of the model of interest and let logL0 denote the maximum value of the log
likelihood function when all parameters, except the intercept, are set to be zero. Clearly,
logL1≥logL0. The larger the difference between the two log likelihood values, the more the
extended model adds to the very restrictive model. (Indeed, a formal likelihood ratio test can be
based on the difference between the two values.) A first goodness-of-fit measure is defined as

pseudo-R² = 1 − 1 / [1 + 2(log L1 − log L0)/N] ……………………………………………….…(1.19)

where N denotes the number of observations. An alternative measure is suggested by McFadden (1974):

McFadden R² = 1 − log L1 / log L0 …………………………………………..……..(1.20)

sometimes referred to as the likelihood ratio index. Related is the likelihood ratio (LR) test, used to test the overall significance of the model: LR = −2(log L0 − log L1). Because the log likelihood is the sum of log probabilities, it follows that log L0 ≤ log L1 < 0, from which it is straightforward to show that both measures take on values in the interval [0, 1] only. If all estimated slope coefficients are equal to zero, we have log L1 = log L0, such that both R²s are equal to zero. If the model were able to generate (estimated) probabilities that correspond exactly to the observed values (that is, ŷi = yi for all i), all probabilities in the log likelihood would be equal to one, such that the log likelihood would be exactly equal to zero. Consequently, the upper limit for the two measures above is obtained for log L1 = 0. The upper bound of 1 can therefore, in theory, only be attained by McFadden's measure.
To compute log L0 it is not necessary to estimate a probit or logit model with an intercept term only. If there is only a constant term in the model, the distribution function is irrelevant for the implied probabilities, and the model essentially says P{yi = 1} = p for some unknown p. The ML estimator for p can easily be shown to be

p̂ = N1/N,

where N1 = Σi yi; that is, the estimated probability is equal to the proportion of ones in the sample. The maximum log likelihood value is therefore given by

log L0 = N1 log(N1/N) + N0 log(N0/N) ………………………………(1.21)

where N0 = N − N1 denotes the number of zeros in the sample. It can be directly computed from the sample size N and the sample frequencies N1 and N0. The value of log L1 should be given by your computer package.
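The restricted log likelihood, both fit measures, and the LR statistic can be computed from the counts alone. In the sketch below, N and N1 are the labor force participation counts from (1.9), while logL1 is a hypothetical fitted value standing in for what a software package would report:

```python
import math

# Restricted log likelihood (1.21) from sample frequencies, then the two
# goodness-of-fit measures and the LR statistic. logL1 here is hypothetical.
N, N1 = 753, 428                  # sample size, number of ones (inlf = 1)
N0 = N - N1
logL0 = N1 * math.log(N1 / N) + N0 * math.log(N0 / N)

logL1 = -450.0                    # placeholder fitted value, logL1 >= logL0
mcfadden = 1 - logL1 / logL0                         # equation (1.20)
pseudo_r2 = 1 - 1 / (1 + 2 * (logL1 - logL0) / N)    # equation (1.19)
lr = -2 * (logL0 - logL1)                            # LR test statistic
print(round(logL0, 1), round(mcfadden, 3), round(pseudo_r2, 3), round(lr, 1))
```

Both measures lie in [0, 1) here, and the LR statistic is positive whenever the fitted model improves on the intercept-only model, consistent with log L1 ≥ log L0.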
Testing in Binary Response Index Models
Any of the three tests from general MLE analysis—the Wald, LR, or LM test—can be used to
test hypotheses in binary response contexts. Since the tests are all asymptotically equivalent
under local alternatives, the choice of statistic usually depends on computational simplicity
(since finite sample comparisons must be limited in scope).

