Lecture 4
1. Introduction
2. Assumptions
3. Estimation and Testing
4. Comparison of Estimators
1. INTRODUCTION
∙ We already covered panel data models where the error term had no
particular structure. But we assumed either contemporaneous
exogeneity (pooled OLS) or strict exogeneity (feasible GLS).
∙ Now we explicitly add a time-constant, unobserved effect to the
model, often called unobserved heterogeneity.
∙ Start with the balanced panel case, and assume random sampling
across i (the cross section dimension), with fixed time periods T. So
{(x_it, y_it) : t = 1, …, T, c_i} is drawn, where c_i is the unobserved
effect drawn along with the observed data.
∙ The unbalanced case is trickier because we must know why we are
missing some time periods for some units. We consider this much later
under missing data/sample selection issues.
∙ For a random draw i from the population, the basic model is
y_it = x_it β + c_i + u_it,  t = 1, …, T,
v_it ≡ c_i + u_it
∙ Useful to write a population version of the model in conditional
expectation form:
E(y_t | x_t, c) = x_t β + c,  t = 1, …, T.
Therefore,
∂E(y_t | x_t, c)/∂x_tj = β_j,
the partial effect of x_tj holding c fixed.
∙ With a single cross section, there is nothing we can do unless we can
find good observable proxies for c or IVs for the endogenous elements
of x t . But with two or more periods we have more options.
∙ We can write the population model as
y_t = x_t β + c + u_t
E(u_t | x_t, c) = 0
y_1 = x_1 β + c + u_1
y_2 = x_2 β + c + u_2
∙ Subtract t = 1 from t = 2 and define Δy ≡ y_2 − y_1, Δx ≡ x_2 − x_1, and
Δu ≡ u_2 − u_1:
Δy = Δx β + Δu,
∙ The orthogonality condition is
E[(x_2 − x_1)′(u_2 − u_1)] = 0.
But this expands into cross-period terms such as E(x_2′ u_1) and
E(x_1′ u_2), which contemporaneous exogeneity does not restrict.
∙ OLS on the differences will only be consistent if we add
E(x_s′ u_t) = 0,  s ≠ t.
∙ Would we really omit an intercept from the differenced equation?
Very unlikely. If we start with a model with different intercepts,
y_1 = α_1 + x_1 β + c + u_1
y_2 = α_2 + x_2 β + c + u_2
then
Δy = α + Δx β + Δu,
where α ≡ α_2 − α_1.
2. ASSUMPTIONS
∙ As mentioned earlier, we assume a balanced panel, and all asymptotic
analysis – implicit or explicit – is with fixed T and N → ∞, where N is
the size of the cross section.
∙ The basic unobserved effects model is
y_it = x_it β + c_i + u_it,  t = 1, …, T,
∙ An extension of the basic model is
y_it = x_it β + θ_t + c_i + u_it,  t = 1, …, T,
∙ A general specification is
y_it = g_t θ + z_i γ + w_it δ + c_i + u_it
where g_t contains aggregate time effects, z_i contains time-constant
observables, and w_it varies across i and t.
Assumptions about the Unobserved Effect
∙ In modern applications, “random effect” essentially means
Cov(x_it, c_i) = 0,  t = 1, …, T,
Exogeneity Assumptions on the Explanatory Variables
y_it = x_it β + c_i + u_it
E(u_it | x_it, c_i) = 0
or
E(y_it | x_it, c_i) = x_it β + c_i.
∙ Strict Exogeneity Conditional on the Unobserved Effect:
E(y_it | x_i1, …, x_iT, c_i) = E(y_it | x_it, c_i) = x_it β + c_i,
which implies
E(y_it | x_i1, …, x_iT) = x_it β + E(c_i | x_i1, …, x_iT).
∙ But strict exogeneity conditional on c i rules out lagged dependent
variables and feedback. Written in terms of the idiosyncratic errors,
strict exogeneity is
E(u_it | x_i1, …, x_iT, c_i) = 0,  t = 1, …, T.
∙ A more reasonable assumption that we will use later is
E(y_it | x_it, x_i,t−1, …, x_i1, c_i) = E(y_it | x_it, c_i) = x_it β + c_i,
3. ESTIMATION AND TESTING
∙ There are four common methods: pooled OLS, random effects, fixed
effects, and first differencing.
3.1. Pooled OLS
∙ Contemporaneous exogeneity is weaker than strict exogeneity, but it
buys us little in practice because POLS also uses E(x_it′ c_i) = 0, which
cannot hold for lagged dependent variables and is unlikely for other
variables that are not strictly exogenous.
∙ Inference should be made robust to serial correlation and
heteroskedasticity.
∙ Let v̂_it = y_it − x_it β̂_POLS be the POLS residuals. Then
Avar̂(β̂_POLS) = (Σ_{i=1}^N Σ_{t=1}^T x_it′ x_it)^{−1} [Σ_{i=1}^N Σ_{t=1}^T Σ_{r=1}^T v̂_it v̂_ir x_it′ x_ir] (Σ_{i=1}^N Σ_{t=1}^T x_it′ x_it)^{−1},
∙ Can also write this estimator as
Avar̂(β̂_POLS) = (Σ_{i=1}^N X_i′ X_i)^{−1} [Σ_{i=1}^N X_i′ v̂_i v̂_i′ X_i] (Σ_{i=1}^N X_i′ X_i)^{−1}
∙ In Stata:
reg y x1 x2 ... xK, cluster(id)
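For intuition, the sandwich formula above can be coded directly. Here is a minimal numpy sketch on simulated data (the data-generating process and all variable names are mine, purely for illustration):

```python
import numpy as np

# Hypothetical balanced panel: y_it = x_it*beta + c_i + u_it
rng = np.random.default_rng(0)
N, T, K = 200, 4, 2
c = rng.normal(size=N)                       # unobserved effect in the composite error
x = rng.normal(size=(N, T, K))
x[:, :, 0] = 1.0                             # first column is an intercept
beta = np.array([1.0, 0.5])
y = x @ beta + c[:, None] + rng.normal(size=(N, T))

# Pooled OLS on the stacked data
X = x.reshape(N * T, K)
b_pols = np.linalg.solve(X.T @ X, X.T @ y.ravel())

# Cluster-robust sandwich:
# (sum_i X_i'X_i)^(-1) (sum_i X_i' vhat_i vhat_i' X_i) (sum_i X_i'X_i)^(-1)
vhat = (y.ravel() - X @ b_pols).reshape(N, T)
bread = np.linalg.inv(X.T @ X)
meat = np.zeros((K, K))
for i in range(N):
    s = x[i].T @ vhat[i]                     # K-vector X_i' vhat_i
    meat += np.outer(s, s)
avar = bread @ meat @ bread
se_robust = np.sqrt(np.diag(avar))
```

This is the same calculation reported by the cluster option: the standard errors allow arbitrary serial correlation and heteroskedasticity in v_it = c_i + u_it within unit i.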
3.2. Random Effects Estimation
(a) E(u_it | x_i1, x_i2, …, x_iT, c_i) = 0, t = 1, …, T
(b) E(c_i | x_i1, x_i2, …, x_iT) = E(c_i)
∙ A GLS approach also leaves c_i in the error term:
y_it = x_it β + v_it,  v_it = c_i + u_it,  t = 1, 2, …, T
∙ Write the equation in system form (for all time periods) as
y_i = X_i β + v_i = X_i β + c_i j_T + u_i
where j_T is the T × 1 vector of ones.
∙ RE imposes a special structure on Ω = E(v_i v_i′) (which could be
wrong!). Under RE.1(a), c_i and u_it are uncorrelated. Assume further that
Var(u_it) = σ_u²,  t = 1, …, T
Cov(u_it, u_is) = 0,  t ≠ s
Then
Var(v_it) = σ_c² + σ_u².
∙ Further, for t ≠ s,
Cov(v_it, v_is) = Cov(c_i + u_it, c_i + u_is)
= Var(c_i) + Cov(c_i, u_is) + Cov(u_it, c_i) + Cov(u_it, u_is)
= σ_c²
or
Ω =
⎡ σ_c² + σ_u²      σ_c²      ⋯      σ_c²     ⎤
⎢     σ_c²     σ_c² + σ_u²   ⋯      σ_c²     ⎥
⎢      ⋮                     ⋱       ⋮       ⎥
⎣     σ_c²         σ_c²      ⋯  σ_c² + σ_u²  ⎦
∙ We can also write Ω as
Ω = σ_v² ·
⎡ 1  ρ  ⋯  ρ ⎤
⎢ ρ  1  ⋯  ρ ⎥
⎢ ⋮        ⋮ ⎥
⎣ ρ  ρ  ⋯  1 ⎦
where σ_v² ≡ σ_c² + σ_u² and ρ ≡ σ_c²/σ_v².
∙ We can use pooled OLS to get the residuals, v̂_it, across all i and t.
Then a consistent estimator of σ_v² (not generally unbiased), as N gets
large for fixed T, is
σ̂_v² = (NT − K)^{−1} Σ_{i=1}^N Σ_{t=1}^T v̂_it² = SSR/(NT − K),
the usual variance estimator from OLS regression. This is based on, for
each i, σ_v² = T^{−1} Σ_{t=1}^T E(v_it²), averaged across i, too. Then
replace the population average with a sample average and the v_it with
pooled OLS residuals, and subtract K as a degrees-of-freedom adjustment.
∙ For σ_c², note that
σ_c² = [T(T − 1)/2]^{−1} Σ_{t=1}^{T−1} Σ_{s=t+1}^{T} E(v_it v_is).
∙ An actual estimator replaces v_it with the POLS residuals:
σ̂_c² = [NT(T − 1)/2 − K]^{−1} Σ_{i=1}^N Σ_{t=1}^{T−1} Σ_{s=t+1}^{T} v̂_it v̂_is,
plim_{N→∞} σ̂_c² = σ_c²
with T fixed.
∙ Now we can use
Ω̂ = σ̂_c² j_T j_T′ + σ̂_u² I_T,
the T × T matrix with σ̂_v² = σ̂_c² + σ̂_u² down the diagonal and σ̂_c²
everywhere off the diagonal, where σ̂_u² ≡ σ̂_v² − σ̂_c².
∙ It is possible for σ̂_c² to be negative, which means the basic unobserved
effects variance-covariance structure is faulty.
∙ Typically, σ̂_c² > 0 unless the variables have been transformed in some
way – such as being first differenced – before applying GLS.
∙ The FGLS estimator that uses this particular structure of Ω̂ is the
random effects (RE) estimator.
∙ Fully robust inference is available for RE, and there are good reasons
for using it.
(1) Ω may not have the special (and restrictive, especially for large T)
RE structure; that is, E(v_i v_i′) need not have the RE form. Serial
correlation or changing variances in u_it, t = 1, …, T, invalidate the
RE structure.
(2) The system homoskedasticity requirement,
E(v_i v_i′ | X_i) = E(v_i v_i′),
may fail.
∙ A fully robust estimator is
Avar̂(β̂_RE) = (Σ_{i=1}^N X_i′ Ω̂^{−1} X_i)^{−1} [Σ_{i=1}^N X_i′ Ω̂^{−1} v̂_i v̂_i′ Ω̂^{−1} X_i] (Σ_{i=1}^N X_i′ Ω̂^{−1} X_i)^{−1},
∙ For first-order asymptotics, there is no efficiency gain from iterating.
It might help with smaller N, though.
∙ What is the advantage of RE, which imposes specific assumptions on
Ω, over the unrestricted FGLS we discussed earlier? Theoretically,
nothing. We do not get more efficiency with large N and small T by
imposing restrictions on Ω.
∙ If system homoskedasticity holds but Ω is not of the RE form, an
unrestricted FGLS analysis is more efficient than RE (again, fixed T,
N → ∞).
∙ As we will see later, RE does have some appeal because of its
implicit transformation.
∙ A nonrobust variance matrix estimator can be used if we add an
assumption:
ASSUMPTION RE.3:
(a) E(u_i u_i′ | x_i, c_i) = σ_u² I_T
(b) E(c_i² | x_i) = σ_c²
∙ Under RE.1, RE.2, and RE.3,
Avar̂(β̂_RE) = (Σ_{i=1}^N X_i′ Ω̂^{−1} X_i)^{−1}
is a valid estimator.
∙ Inference is straightforward. Typically use Wald or robust Wald
statistic for multiple restrictions.
∙ In Stata, fully robust inference uses the “cluster” option; for the
“usual” variance matrix estimator, drop this option:
xtreg y x1 x2 ... xK, re cluster(id)
∙ Occasionally, one might want to test
H_0: σ_c² = 0
H_1: σ_c² > 0
It’s rare that one cannot strongly reject this because of the strong
positive serial correlation in the POLS residuals in most applications.
The formal test, derived under joint normality of (c_i, u_i), is called the
Breusch-Pagan test.
∙ A fully robust test does not add any additional assumptions, and
allows for heteroskedasticity. The key is that if v̂ it now denotes the
POLS residuals – which is what the B-P test uses – then
N^{−1/2} Σ_{i=1}^N Σ_{t=1}^{T−1} Σ_{s=t+1}^T v̂_it v̂_is = N^{−1/2} Σ_{i=1}^N Σ_{t=1}^{T−1} Σ_{s=t+1}^T v_it v_is + o_p(1)
∙ Therefore, under
H_0: E(v_it v_is) = 0, all t ≠ s,
it follows that
[N^{−1/2} Σ_{i=1}^N Σ_{t=1}^{T−1} Σ_{s=t+1}^T v̂_it v̂_is] / {E[(Σ_{t=1}^{T−1} Σ_{s=t+1}^T v_it v_is)²]}^{1/2} →d Normal(0, 1)
∙ Now estimate the denominator and cancel the sample sizes:
[Σ_{i=1}^N Σ_{t=1}^{T−1} Σ_{s=t+1}^T v̂_it v̂_is] / [Σ_{i=1}^N (Σ_{t=1}^{T−1} Σ_{s=t+1}^T v̂_it v̂_is)²]^{1/2} →d Normal(0, 1).
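A quick numpy illustration of this statistic on simulated residual matrices (hypothetical data: under H0 it behaves like a standard normal; when c_i is present it is large and positive):

```python
import numpy as np

def robust_stat(v):
    """v: N x T residual matrix. Ratio of the summed cross-period
    products to the square root of its estimated variance."""
    T = v.shape[1]
    inner = np.zeros(v.shape[0])             # inner_i = sum_{t<s} v_it v_is
    for t in range(T - 1):
        for s in range(t + 1, T):
            inner += v[:, t] * v[:, s]
    return inner.sum() / np.sqrt((inner ** 2).sum())

rng = np.random.default_rng(2)
N, T = 400, 4
stat_h0 = robust_stat(rng.normal(size=(N, T)))       # iid errors, no c_i
stat_h1 = robust_stat(rng.normal(size=(N, 1)) + rng.normal(size=(N, T)))  # c_i present
```

Unlike Breusch-Pagan, nothing here relies on normality or homoskedasticity; only the cross-section CLT over i is used.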
3.3. Fixed Effects Estimation
ȳ_i = x̄_i β + c_i + ū_i,
∙ The equation ȳ i x̄ i c i ū i is often called the between equation
because it relies on variation in the data between cross section
observations. The between estimator is the OLS estimator from the
cross section regression
ȳ_i on x̄_i,  i = 1, …, N.
∙ Instead, subtract the time-averaged equation from the original
equation to eliminate c_i:
y_it − ȳ_i = (x_it − x̄_i)β + (u_it − ū_i),  t = 1, …, T
or
ÿ_it = ẍ_it β + ü_it,  t = 1, …, T
∙ Key is that c_i is gone from the time-demeaned equation. So, we can
use pooled OLS:
ÿ_it on ẍ_it,  t = 1, …, T; i = 1, …, N,
which is the same as using y_it in place of ÿ_it because
Σ_{t=1}^T (x_it − x̄_i)′(y_it − ȳ_i) = Σ_{t=1}^T (x_it − x̄_i)′ y_it.
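The within transformation is simple to code; a minimal numpy sketch with a hypothetical DGP in which c_i is correlated with x_it (so pooled OLS is inconsistent but the within estimator is not):

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 300, 5
c = rng.normal(size=N)
x = c[:, None] + rng.normal(size=(N, T))     # x_it correlated with c_i
y = 2.0 * x + c[:, None] + rng.normal(size=(N, T))

# Time-demean and run pooled OLS on the demeaned data
xdd = x - x.mean(axis=1, keepdims=True)      # x_it - xbar_i
ydd = y - y.mean(axis=1, keepdims=True)
b_fe = (xdd * ydd).sum() / (xdd ** 2).sum()

# Pooled OLS slope for comparison: inconsistent under this DGP
xc = x.ravel() - x.mean()
b_pols = (xc * (y.ravel() - y.mean())).sum() / (xc ** 2).sum()
```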
∙ What is the weakest orthogonality assumption for consistency? We
can just apply the results for POLS, but it is useful to see it directly.
∙ Write the estimator by substituting ÿ_it = ẍ_it β + ü_it:
β̂_FE = β + (Σ_{i=1}^N Σ_{t=1}^T ẍ_it′ ẍ_it)^{−1} (Σ_{i=1}^N Σ_{t=1}^T ẍ_it′ ü_it)
= β + (Σ_{i=1}^N Σ_{t=1}^T ẍ_it′ ẍ_it)^{−1} (Σ_{i=1}^N Σ_{t=1}^T ẍ_it′ u_it)
∙ By the WLLN as N → ∞ with fixed T, the key moment condition for
consistency is
Σ_{t=1}^T E(ẍ_it′ u_it) = Σ_{t=1}^T E[(x_it − x̄_i)′ u_it] = 0.
ASSUMPTION FE.1: Same as RE.1(a), that is,
E(u_it | x_i, c_i) = 0,  t = 1, …, T.
∙ The rank condition rules out elements in x it that have no time
variation for any unit in the population. Such variables get swept away
by the within transformation.
∙ Under FE.1 and FE.2,
β̂_FE →p β as N → ∞
∙ The FE estimator works well for large T, too, but showing that
requires putting restrictions on the time series process
{(x_it, y_it) : t = 1, 2, …}.
∙ What parameters can we identify with FE? Suppose we start with
y_it = θ_1 + θ_2 d2_t + … + θ_T dT_t + z_i γ_1 + d2_t z_i γ_2 + … + dT_t z_i γ_T
+ w_it δ + c_i + u_it
∙ We can estimate θ_2, …, θ_T and γ_2, …, γ_T. So we can estimate whether
the effect of the time-constant variables has changed over time. We
cannot estimate the level of the effect in any period t because it is γ_1
for t = 1 and γ_1 + γ_t for t = 2, …, T, and γ_1 is not identified.
∙ As another example, suppose w_it is a scalar policy variable and z_i are
time-constant characteristics, and the model is
y_it = θ_1 + θ_2 d2_t + … + θ_T dT_t + z_i γ_1 + d2_t z_i γ_2 + … + dT_t z_i γ_T
+ w_it δ + w_it (z_i − μ_z) π + c_i + u_it
where μ_z = E(z_i).
∙ We can estimate δ (the average partial effect) as well as π, which
means we can see how the policy effects change with individual
characteristics (and test H_0: π = 0). As a practical matter, we would
replace the population mean μ_z with the sample average,
z̄ = N^{−1} Σ_{i=1}^N z_i.
∙ We can obtain a variance matrix estimator valid under Assumptions
FE.1 and FE.2.
∙ Define the FE residuals as
ü̂_it = ÿ_it − ẍ_it β̂_FE,  t = 1, …, T; i = 1, …, N
∙ These are “estimates” of the ü_it, not the u_it. This has implications for
estimating the error variance, σ_u².
∙ Without additional assumptions, use the “cluster-robust” matrix
Avar̂(β̂_FE) = (Σ_{i=1}^N Σ_{t=1}^T ẍ_it′ ẍ_it)^{−1} [Σ_{i=1}^N Σ_{t=1}^T Σ_{r=1}^T ü̂_it ü̂_ir ẍ_it′ ẍ_ir] (Σ_{i=1}^N Σ_{t=1}^T ẍ_it′ ẍ_it)^{−1}.
∙ In Stata, again use the “cluster” option:
xtreg y x1 x2 ... xK, fe cluster(id)
∙ Of course, a nonrobust form requires an extra assumption:
ASSUMPTION FE.3: Same as RE.3(a), that is,
E(u_i u_i′ | x_i, c_i) = σ_u² I_T.
Under FE.3, we can simplify the middle matrix. First, use
Σ_{t=1}^T ẍ_it′ ü_it = Σ_{t=1}^T ẍ_it′ u_it, that is, Ẍ_i′ ü_i = Ẍ_i′ u_i.
Therefore,
E(Ẍ_i′ ü_i ü_i′ Ẍ_i) = E(Ẍ_i′ u_i u_i′ Ẍ_i) = E[E(Ẍ_i′ u_i u_i′ Ẍ_i | Ẍ_i)]
= E[Ẍ_i′ E(u_i u_i′ | Ẍ_i) Ẍ_i] = E(Ẍ_i′ σ_u² I_T Ẍ_i)
= σ_u² E(Ẍ_i′ Ẍ_i)
because E(u_i u_i′ | Ẍ_i) = σ_u² I_T under FE.3.
∙ So Avar[√N(β̂_FE − β)] = σ_u² [E(Ẍ_i′ Ẍ_i)]^{−1} under FE.1, FE.2, and
FE.3.
∙ Estimating σ_u² requires some care because we effectively observe ü_it,
not u_it.
∙ Under the constant variance and no serial correlation assumptions on
u_it,
E(ü_it²) = σ_u²(T − 1)/T.
∙ So
Σ_{t=1}^T E(ü_it²) = (T − 1)σ_u².
∙ One degree of freedom is lost for each unit i because of the time
demeaning: Σ_{t=1}^T ü_it = 0.
∙ Therefore,
σ_u² = [N(T − 1)]^{−1} Σ_{i=1}^N Σ_{t=1}^T E(ü_it²)
and, with a degrees-of-freedom adjustment for the K estimated slopes,
σ̂_u² = [N(T − 1) − K]^{−1} Σ_{i=1}^N Σ_{t=1}^T ü̂_it² = SSR/[N(T − 1) − K]
Avar̂(β̂_FE) = σ̂_u² (Σ_{i=1}^N Ẍ_i′ Ẍ_i)^{−1}
∙ If you do the time-demeaning and run pooled OLS, the usual statistics
do not reflect the lost degrees of freedom (N of them). The estimate of
σ_u² will be SSR/(NT − K), which is too small. Canned FE packages
properly compute the statistics.
∙ The FE estimator β̂_FE can also be obtained by running a long
regression on the original data, including a dummy variable for each
cross section unit:
y_it on d1_i, d2_i, …, dN_i, x_it,  t = 1, …, T; i = 1, …, N,
often called the dummy variable regression. The statistics are properly
computed because of the inclusion of the N dummy variables.
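The algebraic equivalence between the within regression and the dummy variable regression can be verified numerically; a numpy sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 30, 4
c = rng.normal(size=N)
x = c[:, None] + rng.normal(size=(N, T))
y = 1.5 * x + c[:, None] + rng.normal(size=(N, T))

# Within (time-demeaned) estimator
xdd = x - x.mean(axis=1, keepdims=True)
ydd = y - y.mean(axis=1, keepdims=True)
b_within = (xdd * ydd).sum() / (xdd ** 2).sum()

# Dummy variable regression: y on N unit dummies and x
D = np.kron(np.eye(N), np.ones((T, 1)))      # NT x N block of unit dummies
Z = np.column_stack([D, x.ravel()])
coef = np.linalg.lstsq(Z, y.ravel(), rcond=None)[0]
b_dummy = coef[-1]                           # coefficient on x
```

The two slopes agree to machine precision; only the degrees-of-freedom bookkeeping differs across implementations.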
∙ Only danger: treating the c_i as parameters to estimate, while sensible
with “large” T, can lead to trouble later with nonlinear models. Here,
we get a consistent estimator of β for fixed T.
∙ Sometimes we want to estimate the c_i using the T time periods. We do
not have to run the dummy variable regression:
ĉ_i = ȳ_i − x̄_i β̂_FE,  i = 1, …, N.
μ̂_c = N^{−1} Σ_{i=1}^N ĉ_i = N^{−1} Σ_{i=1}^N (ȳ_i − x̄_i β̂_FE) = N^{−1} Σ_{i=1}^N [c_i + ū_i + x̄_i(β − β̂_FE)]
= N^{−1} Σ_{i=1}^N c_i + N^{−1} Σ_{i=1}^N ū_i + (N^{−1} Σ_{i=1}^N x̄_i)(β − β̂_FE)
= N^{−1} Σ_{i=1}^N c_i + o_p(1) + O_p(1)·o_p(1) →p μ_c.
∙ Can estimate other features of the distribution, too, although some
“obvious” estimators are inconsistent. For example, we might try to
estimate σ_c² using the sample variance of the ĉ_i, i = 1, …, N:
σ̃_c² = (N − 1)^{−1} Σ_{i=1}^N (ĉ_i − μ̂_c)².
∙ We can adjust for the “bias” using the estimate σ̂_u²:
σ̂_c² = σ̃_c² − σ̂_u²/T = (N − 1)^{−1} Σ_{i=1}^N (ĉ_i − μ̂_c)² − σ̂_u²/T
∙ We can also obtain an estimate of σ_c² using
σ_c² = σ_v² − σ_u²
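A numpy sketch of ĉ_i and the bias adjustment, on a hypothetical DGP with σ_c² = 1 and σ_u² = 0.64 (here K = 1, so the df correction subtracts one):

```python
import numpy as np

rng = np.random.default_rng(5)
N, T = 1000, 4
c = rng.normal(scale=1.0, size=N)
x = rng.normal(size=(N, T))
y = 2.0 * x + c[:, None] + rng.normal(scale=0.8, size=(N, T))

# FE slope via the within transformation
xdd = x - x.mean(axis=1, keepdims=True)
ydd = y - y.mean(axis=1, keepdims=True)
b_fe = (xdd * ydd).sum() / (xdd ** 2).sum()

# chat_i = ybar_i - xbar_i*b_fe, and sigma_u^2 = SSR/(N(T-1) - K)
c_hat = y.mean(axis=1) - x.mean(axis=1) * b_fe
sig2_u = ((ydd - xdd * b_fe) ** 2).sum() / (N * (T - 1) - 1)

sig2_c_naive = c_hat.var(ddof=1)             # inconsistent: includes Var(ubar_i)
sig2_c = sig2_c_naive - sig2_u / T           # bias-adjusted estimate
```

The naive sample variance of the ĉ_i overshoots by roughly σ_u²/T, which the adjustment removes.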
∙ Recent work by Orme and Yamagata (2006, Econometric Reviews)
has shown that the F statistic is approximately valid if we drop the
normality assumption on u it , but it is still unknown how to test
constancy of the c i with serial correlation or heteroskedasticity in u it .
Testing for Serial Correlation
∙ Because we can obtain fully robust inference, why should we test for
serial correlation in the u it ? The answer is that we might be able to
improve efficiency using a GLS-type method.
∙ We can test for serial correlation in u it , but it is tricky because we
effectively only have ü it .
∙ When u_it is serially uncorrelated with constant variance, for t ≠ r
we have
Cov(ü_it, ü_ir) = −σ_u²/T.
Therefore,
Corr(ü_it, ü_ir) = (−σ_u²/T)/[σ_u²(T − 1)/T] = −1/(T − 1).
∙ A simple test is based on a pooled AR(1) regression. First obtain the
FE residuals, ü̂_it. (In Stata, use the “areg” command.) Then run the
pooled OLS regression
ü̂_it on ü̂_i,t−1,  t = 3, …, T; i = 1, …, N
and let ρ̂ be the coefficient on ü̂_i,t−1. The tricky thing is that, under the
null, the ü_it are serially correlated.
∙ We obtain a simple statistic using a fully robust standard error for ρ̂,
se(ρ̂) (available from the “cluster” option in POLS). The t statistic is
[ρ̂ + (T − 1)^{−1}]/se(ρ̂).
∙ Typically we observe ρ̂ > 0 if u_it is positively serially correlated. A
positive, significant estimate of ρ̂ reveals some positive serial
correlation. If ρ̂ ≈ −(T − 1)^{−1}, no serial correlation in u_it might be
reasonable.
∙ If we find strong evidence of serial correlation in u_it, we might
want to exploit it in estimation rather than just making FE inference
robust.
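The −1/(T − 1) centering is easy to check numerically: even when u_it is iid, the pooled AR(1) slope on the demeaned errors converges to −1/(T − 1), not 0. A numpy sketch with simulated errors (T = 5, so the plim is −0.25):

```python
import numpy as np

rng = np.random.default_rng(6)
N, T = 2000, 5
u = rng.normal(size=(N, T))                  # serially uncorrelated u_it
udd = u - u.mean(axis=1, keepdims=True)      # time-demeaned errors

# Pooled AR(1) slope of udd_it on udd_{i,t-1}
lag = udd[:, :-1].ravel()
cur = udd[:, 1:].ravel()
rho_hat = (lag * cur).sum() / (lag ** 2).sum()
center = -1.0 / (T - 1)                      # what rho_hat estimates under H0
```

This is why the test statistic centers ρ̂ at −1/(T − 1) rather than at zero.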
Fixed Effects GLS
∙ Write the T time periods for a random draw i as
y_i = X_i β + c_i j_T + u_i
∙ Because j_T′ ü_i = 0, we know the T × T matrix E(ü_i ü_i′) has rank less
than T. In fact, the (unconditional) variance-covariance matrix of ü_i is
Λ ≡ E(ü_i ü_i′) = E(Q_T u_i u_i′ Q_T) = Q_T Ω Q_T,
where Q_T ≡ I_T − j_T(j_T′ j_T)^{−1} j_T′ is the time-demeaning matrix and
Ω ≡ E(u_i u_i′).
∙ There is a simple solution. After demeaning to obtain ÿ_i and Ẍ_i using
all T time periods and obtaining
Λ̂ = N^{−1} Σ_{i=1}^N ü̂_i ü̂_i′,
drop one of the time periods. It does not matter which one is dropped
(but the first or last are easiest).
∙ Apply FGLS to the T − 1 remaining equations using Λ̂ (with the
corresponding row and column deleted).
∙ Remember, can still make a case for robust inference because system
heteroskedasticity is always a possibility.
Some Practical Hints in Applying Fixed Effects
∙ Possible confusion concerning the term “fixed effects.” Suppose i is a
firm. Then the phrase “firm fixed effect” corresponds to allowing c i in
the model to be correlated with the covariates. If c i is called a firm
“random effect” then it is being assumed to be uncorrelated with x it .
∙ Suppose that we cannot, or do not want to, use FE estimation. This
might occur because the key variable at the firm level is constant across
time for all firms – and so the FE transformation sweeps it away – or
there is little time variation within firm in the key variable, leading to
large standard errors.
∙ Instead, we might use a random effects analysis at the firm level but
include industry dummy variables to account for systematic differences
across industries. So, we include in x it a set of industry dummy
variables while also allowing a firm effect c i in a “random effects”
analysis.
∙ If there are many firms per industry, the industry “fixed effects” – the
coefficients on the industry dummies – can be precisely estimated. So
the industry “fixed effects” are really parameters to estimate whereas
the c i are not.
∙ Generally, including dummies for more aggregated levels and then
applying RE is common when the covariates of interest vary in the
cross section but not (much) over time.
∙ Keep in mind that an RE analysis at the firm level with industry
dummies need not be entirely convincing: the key elements of x_it might
be correlated with unobserved firm features that are not adequately
captured by industry differences.
Application
For N = 1,149 U.S. air routes and the years 1997 through 2000, y_it is
log(fare_it) and the key explanatory variable is concen_it, the
concentration ratio for route i. Other covariates are year dummies and
the time-constant variables log(dist_i) and [log(dist_i)]². Note that what I
call c_i Stata refers to as u_i.
. use airfare
. tab year
1997, 1998, |
1999, 2000 | Freq. Percent Cum.
-----------------------------------------------
1997 | 1,149 25.00 25.00
1998 | 1,149 25.00 50.00
1999 | 1,149 25.00 75.00
2000 | 1,149 25.00 100.00
-----------------------------------------------
Total | 4,596 100.00
. reg lfare concen ldist ldistsq y98 y99 y00
------------------------------------------------------------------------------
lfare | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------------------------------------------------------------------
concen | .3601203 .0300691 11.98 0.000 .3011705 .4190702
ldist | -.9016004 .128273 -7.03 0.000 -1.153077 -.6501235
ldistsq | .1030196 .0097255 10.59 0.000 .0839529 .1220863
y98 | .0211244 .0140419 1.50 0.133 -.0064046 .0486533
y99 | .0378496 .0140413 2.70 0.007 .010322 .0653772
y00 | .09987 .0140432 7.11 0.000 .0723385 .1274015
_cons | 6.209258 .4206247 14.76 0.000 5.384631 7.033884
------------------------------------------------------------------------------
. reg lfare concen ldist ldistsq y98 y99 y00, cluster(id)
. xtreg lfare concen ldist ldistsq y98 y99 y00, re
------------------------------------------------------------------------------
lfare | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-----------------------------------------------------------------------------
concen | .2089935 .0265297 7.88 0.000 .1569962 .2609907
ldist | -.8520921 .2464836 -3.46 0.001 -1.335191 -.3689931
ldistsq | .0974604 .0186358 5.23 0.000 .0609348 .133986
y98 | .0224743 .0044544 5.05 0.000 .0137438 .0312047
y99 | .0366898 .0044528 8.24 0.000 .0279626 .0454171
y00 | .098212 .0044576 22.03 0.000 .0894752 .1069487
_cons | 6.222005 .8099666 7.68 0.000 4.6345 7.80951
-----------------------------------------------------------------------------
sigma_u | .31933841
sigma_e | .10651186
rho | .89988885 (fraction of variance due to u_i)
------------------------------------------------------------------------------
. * The coefficient on the time-varying variable concen drops quite a bit.
. * Notice that the RE and POLS coefficients on the time-constant
. * distance variables are pretty similar, something that often occurs.
. * What if we do not control for distance in RE?
. * The RE estimate is now much smaller than when ldist and ldistsq are
. * controlled for, and much smaller than the FE estimate. Thus, it can be
. * very harmful to omit time-constant variables in RE estimation.
. * Allow an unrestricted unconditional variance-covariance matrix, but
. * make robust to system heteroskedasticity:
. xtgee lfare concen ldist ldistsq y98 y99 y00, corr(uns) robust
. xtreg lfare concen ldist ldistsq y98 y99 y00, fe
F(4,3443) = 134.61
corr(u_i, Xb) = -0.2033                Prob > F = 0.0000
------------------------------------------------------------------------------
lfare | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------------------------------------------------------------------
concen | .168859 .0294101 5.74 0.000 .1111959 .226522
ldist | (dropped)
ldistsq | (dropped)
y98 | .0228328 .0044515 5.13 0.000 .0141048 .0315607
y99 | .0363819 .0044495 8.18 0.000 .0276579 .0451058
y00 | .0977717 .0044555 21.94 0.000 .089036 .1065073
_cons | 4.953331 .0182869 270.87 0.000 4.917476 4.989185
-----------------------------------------------------------------------------
sigma_u | .43389176
sigma_e | .10651186
rho | .94316439 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:  F(1148, 3443) = 36.90    Prob > F = 0.0000
. xtreg lfare concen ldist ldistsq y98 y99 y00, fe cluster(id)
. * Let the effect of concen depend on route distance.
. xtreg lfare concen ldistconcen y98 y99 y00, fe cluster(id)
. * Effect at the average of ldist is similar to before. But at one standard
. * deviation of ldist above its mean, the effect of concen is zero:
------------------------------------------------------------------------------
lfare | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------------------------------------------------------------------
(1) | .0003449 .0554442 0.01 0.995 -.1084383 .1091281
------------------------------------------------------------------------------
. di 209/1149
.1818973
. * So about 18.2% of the routes have ldist greater than one standard
. * deviation above the mean.
3.4. First-Differencing Estimation
y_it = x_it β + c_i + u_it,  t = 1, …, T.
Δy_it = Δx_it β + Δu_it,  t = 2, …, T.
∙ FD also requires a kind of strict exogeneity. The weakest assumption
is
E(Δx_it′ Δu_it) = 0,  t = 2, …, T.
∙ A sufficient condition is
ASSUMPTION FD.1: Same as FE.1, E(u_it | x_i, c_i) = 0, t = 1, …, T.
ASSUMPTION FD.2: Let ΔX_i be the (T − 1) × K matrix with rows
Δx_it. Then,
rank E(ΔX_i′ ΔX_i) = K.
∙ After POLS on the first differences, let
ê_it = Δy_it − Δx_it β̂_FD,  t = 2, …, T; i = 1, …, N
Avar̂(β̂_FD) = (Σ_{i=1}^N ΔX_i′ ΔX_i)^{−1} [Σ_{i=1}^N ΔX_i′ ê_i ê_i′ ΔX_i] (Σ_{i=1}^N ΔX_i′ ΔX_i)^{−1}
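A minimal numpy sketch of the FD estimator on a hypothetical DGP (differencing removes c_i even though c_i is correlated with x_it):

```python
import numpy as np

rng = np.random.default_rng(7)
N, T = 400, 5
c = rng.normal(size=N)
x = 0.8 * c[:, None] + rng.normal(size=(N, T))
y = 1.0 * x + c[:, None] + rng.normal(size=(N, T))

dx = np.diff(x, axis=1)                      # Delta x_it, t = 2, ..., T
dy = np.diff(y, axis=1)
b_fd = (dx * dy).sum() / (dx ** 2).sum()     # POLS on first differences

e = dy - dx * b_fd                           # FD residuals, N x (T-1)
```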
ASSUMPTION FD.3:
E(e_i e_i′ | ΔX_i) = σ_e² I_{T−1}
∙ For a given i, the time series “model” would be
y_it = c_i + x_it β + u_it
u_it = u_i,t−1 + e_it,
where c_i is the intercept for unit i. This does not define a sensible time
series regression because u_it is not “mean reverting.” One way to see
this is Var(u_it) = σ_e² t, and so the idiosyncratic error variance grows as a
linear function of t.
∙ Here we can allow random walk behavior in u it with a short T
because we have cross section variation driving the large-sample
analysis.
∙ Testing for serial correlation in e_it = Δu_it is easy. If we start with
T ≥ 3, then use a t test (or a heteroskedasticity-robust version) for ρ̂, where
ρ̂ is the coefficient on ê_i,t−1 in the pooled dynamic OLS regression
ê_it on ê_i,t−1,  t = 3, …, T; i = 1, …, N.
To test the null that u_it is serially uncorrelated – under which
Corr(e_it, e_i,t−1) = −.5 – use
[ρ̂ + .5]/se(ρ̂).
∙ Can use the FD residuals to recover an estimate of ρ if we think
u_it, t = 1, 2, …, T, follows a stationary AR(1) process. Then
Cov(u_it, u_i,t−h) = ρ^h σ_u², h = 0, 1, ….
∙ Further,
Var(e_it) = 2σ_u² − 2Cov(u_it, u_i,t−1) = 2σ_u²(1 − ρ)
∙ It follows that
Corr(e_it, e_i,t−1) = −σ_u²(1 − ρ)²/[2σ_u²(1 − ρ)] = −(1 − ρ)/2.
∙ Notice we get the right answer when the correlation is 0: namely, ρ = 1
(so that u_it follows a random walk). Solving −(1 − ρ)/2 = ρ_e gives
ρ = 1 + 2ρ_e, so we can use
ρ̂ = 1 + 2ρ̂_e
where ρ̂_e is the estimated first-order correlation of the FD residuals.
∙ Applying feasible GLS after differencing is especially easy because
the lost degree of freedom for each i is automatically incorporated by
losing the first time period.
∙ The resulting estimator is the FDGLS estimator. It uses an unrestricted
(T − 1) × (T − 1) variance matrix in the FD equation
Δy_i = ΔX_i β + Δu_i
where Δu_i is (T − 1) × 1.
∙ Easy to use the xtgee command in Stata.
. sort id year
------------------------------------------------------------------------------
clfare | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------------------------------------------------------------------
cconcen | .1759764 .0284387 6.19 0.000 .1202181 .2317348
y99 | -.0091019 .0052688 -1.73 0.084 -.0194322 .0012284
y00 | .0386441 .0052301 7.39 0.000 .0283897 .0488985
_cons | .0227692 .0036988 6.16 0.000 .0155171 .0300212
------------------------------------------------------------------------------
. * Fairly close to FE estimate of .169, but standard errors are probably
. * not correct. The R-squared gives us a measure of how well changes
. * in concentration explain changes in lfare.
. gen cy99 = y99 - y99[_n-1] if year > 1997
(1149 missing values generated)
. * All estimates are now similar to FE. This R-squared is less useful
. * than when a constant is included because it does not remove the average.
. * It is the "uncentered" R-squared.
. * Test for serial correlation using FD.
------------------------------------------------------------------------------
| Robust
eh | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------------------------------------------------------------------
eh_1 | -.1275163 .0274343 -4.65 0.000 -.1813148 -.0737177
_cons | -3.30e-11 .0024386 -0.00 1.000 -.0047821 .0047821
------------------------------------------------------------------------------
. * Can use xtgee to obtain the FGLS estimator on the FD equation:
------------------------------------------------------------------------------
clfare | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-----------------------------------------------------------------------------
cconcen | .169649 .0285421 5.94 0.000 .1137076 .2255904
y99 | -.0092635 .0054855 -1.69 0.091 -.0200149 .001488
y00 | .0385667 .0054062 7.13 0.000 .0279707 .0491627
_cons | .0228257 .0036967 6.17 0.000 .0155802 .0300712
------------------------------------------------------------------------------
. xtgee clfare cconcen y99 y00, corr(uns) robust
. * The robust standard error for FGLS is about 50% larger than the nonrobust
. * one.
. reg eh eh_1, cluster(id)
. lincom eh_1 + .5
( 1)  eh_1 + .5 = 0
------------------------------------------------------------------------------
eh | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------------------------------------------------------------------
(1) | .3724837 .0272003 13.69 0.000 .3191159 .4258515
------------------------------------------------------------------------------
. * And we can easily reject -.5, too, which is what would happen under FE.3.
. di 1 + 2*(-.128)
.744
. * Test for serial correlation using FE. Use "areg" to get the FE
. * residuals.
------------------------------------------------------------------------------
lfare | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------------------------------------------------------------------
concen | .168859 .0294101 5.74 0.000 .1111959 .226522
y98 | .0228328 .0044515 5.13 0.000 .0141048 .0315607
y99 | .0363819 .0044495 8.18 0.000 .0276579 .0451058
y00 | .0977717 .0044555 21.94 0.000 .089036 .1065073
_cons | 4.953331 .0182869 270.87 0.000 4.917476 4.989185
-----------------------------------------------------------------------------
id | F(1148, 3443) = 60.521   0.000   (1149 categories)
. sort id year
. reg udh udh_1, cluster(id)
( 1)  udh_1 + .333 = 0
------------------------------------------------------------------------------
udh | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------------------------------------------------------------------
(1) | .3044832 .0304886 9.99 0.000 .2446636 .3643028
------------------------------------------------------------------------------
3.5. Prediction
∙ The RE prediction of y_i,T+1 is
x_i,T+1 β̂_RE + [σ̂_c²/(σ̂_c² + σ̂_u²/T)](ȳ_i − x̄_i β̂_RE),
which shrinks the second term toward zero; the FE-based prediction,
x_i,T+1 β̂_FE + ĉ_i, does not shrink the influence of the second term. As σ̂_c²
increases relative to σ̂_u², or for large T, the two predictions are similar.
∙ Seems unlikely that either of these can match dynamic models
estimated by pooled OLS. The RE and FE methods each give the same
weight to the most recent and earliest outcomes on y.
4. COMPARISON OF ESTIMATORS
FE versus FD.
∙ Estimates and inference are identical when T = 2. Generally, we can see
differences as T increases.
∙ Usually think a significant difference signals violation of
Cov(x_is, u_it) = 0, all s, t. FE has some robustness if Cov(x_it, u_it) = 0 but
Cov(x_it, u_is) ≠ 0 for some s ≠ t: the “bias” is of order 1/T. FD does not
average out the bias over T.
∙ To see this, maintain contemporaneous exogeneity:
E(x_it′ u_it) = 0.
∙ Under contemporaneous exogeneity,
E(ẍ_it′ u_it) = −E(x̄_i′ u_it)
and so
T^{−1} Σ_{t=1}^T E(ẍ_it′ u_it) = −T^{−1} Σ_{t=1}^T E(x̄_i′ u_it) = −E(x̄_i′ ū_i).
∙ Under stationarity and weak dependence, E(x̄_i′ ū_i) = O(T^{−1}) because,
by the Cauchy-Schwarz inequality, for each j,
|E(x̄_ij ū_i)| ≤ sd(x̄_ij)·sd(ū_i),
and sd(x̄_ij), sd(ū_i) are O(T^{−1/2}) when each series is weakly dependent.
(If uncorrelated with constant variance, sd(ū_i) = σ_u/√T.)
∙ Further, T^{−1} Σ_{t=1}^T E(ẍ_it′ ẍ_it) is bounded as a function of T. It follows
that the inconsistency in β̂_FE shrinks to zero at rate 1/T.
∙ If x_it, t = 1, 2, …, is weakly dependent, so is Δx_it, and so the first
average is generally bounded. (In fact, under stationarity this average
does not depend on T.)
∙ As for the second average,
E(Δx_it′ Δu_it) = −[E(x_it′ u_i,t−1) + E(x_i,t−1′ u_it)]
even if E(x_i,t−1′ u_it) = 0 (so the dynamics given the elements of x_it are
correct).
∙ Can show the previous results hold even if x_it is I(1) as a time
series process (has a “unit root”), but it is crucial that u_it is I(0)
(weakly dependent). If the regression is “spurious” in levels, it is better
to first difference!
∙ In simple cases, such as the AR(1) model with x_it = y_i,t−1, we can find
what the O(T^{−1}) term is for FE. If we write the model as
y_it = ρ y_i,t−1 + (1 − ρ)a_i + u_it
∙ Simple test for feedback when the model does not contain lagged
dependent variables, that is, of Cov(x_i,t+1, u_it) ≠ 0. Estimate
y_it = x_it β + w_i,t+1 δ + c_i + u_it,  t = 1, …, T − 1
. * We found that the FE and FD estimates of concen coefficient were
. * pretty close.
. sort id year
. gen concenp1 = concen[_n+1] if year < 2000
. xtreg lfare concen concenp1 y98 y99 y00, fe cluster(id)
F(4,1148) = 25.63
corr(u_i, Xb) = -0.2949                Prob > F = 0.0000
∙ If we do not reject strict exogeneity, we can use the serial correlation
properties of u_it to choose between FE and FD. Generally it is a good
idea to do both FE and FD and report robust standard errors.
∙ If we maintain system homoskedasticity (sufficient is
Var(u_i | x_i, c_i) = Var(u_i)), then unrestricted FDGLS and FEGLS (with a
time period dropped) are asymptotically equivalent.
FE versus RE.
∙ Time-constant variables drop out of FE estimation. On the
time-varying covariates, are FE and RE so different after all? Define
the parameter
λ = 1 − [1/(1 + T σ_c²/σ_u²)]^{1/2},
and RE can be computed as the pooled OLS regression
(y_it − λ̂ ȳ_i) on (x_it − λ̂ x̄_i),  t = 1, …, T; i = 1, …, N.
∙ Call y_it − λ̂ ȳ_i a “quasi-time-demeaned” variable: only a fraction of the
mean is removed.
λ̂ ≈ 0 ⇒ β̂_RE ≈ β̂_POLS
λ̂ ≈ 1 ⇒ β̂_RE ≈ β̂_FE
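The two limiting cases can be checked directly; a numpy sketch in which one quasi-demeaning routine reproduces pooled OLS at λ = 0 and FE at λ = 1 (simulated data, with c_i correlated with x_it so the two slopes differ):

```python
import numpy as np

rng = np.random.default_rng(8)
N, T = 500, 4
c = rng.normal(size=N)
x = 0.5 * c[:, None] + rng.normal(size=(N, T))
y = 2.0 * x + c[:, None] + rng.normal(size=(N, T))

def quasi_demeaned_slope(lam):
    # pooled OLS of (y_it - lam*ybar_i) on an intercept and (x_it - lam*xbar_i)
    xq = (x - lam * x.mean(axis=1, keepdims=True)).ravel()
    yq = (y - lam * y.mean(axis=1, keepdims=True)).ravel()
    X = np.column_stack([np.ones(xq.size), xq])
    return np.linalg.solve(X.T @ X, X.T @ yq)[1]

b_pols = quasi_demeaned_slope(0.0)           # lam = 0: pooled OLS
b_fe = quasi_demeaned_slope(1.0)             # lam = 1: fixed effects

# Direct within slope for comparison
xdd = x - x.mean(axis=1, keepdims=True)
ydd = y - y.mean(axis=1, keepdims=True)
b_within = (xdd * ydd).sum() / (xdd ** 2).sum()
```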
. * Can get the quasi-time-demeaning parameter, which Stata calls “theta.”
. xtreg lfare concen ldist ldistsq y98 y99 y00, re cluster(id) theta
Testing for Serial Correlation after RE
∙ Can show that under the RE variance matrix assumptions,
r_it ≡ v_it − λ v̄_i = (1 − λ)c_i + (u_it − λ ū_i)
has constant (unconditional) variance and is serially uncorrelated.
∙ Suggests a way to test u it for serial correlation. After RE
estimation, obtain r̂ it from the regression on the quasi-time-demeaned
data, and use a standard test for, say, AR(1) serial correlation. (Can
ignore estimation of parameters.)
Efficiency of RE
∙ Can show that RE is asymptotically more efficient than FE under
RE.1, RE.2, FE.2, and RE.3. Assume, for simplicity, x it has all
time-varying elements. (See text Section 10.7.2 for more general case.)
∙ Then
Avar(β̂_FE) = σ_u² [E(Ẍ_i′ Ẍ_i)]^{−1}/N
∙ Using Σ_{t=1}^T ẍ_it = 0, we have
X̆_i′ X̆_i = Σ_{t=1}^T x̆_it′ x̆_it = Σ_{t=1}^T [ẍ_it + (1 − λ)x̄_i]′[ẍ_it + (1 − λ)x̄_i]
= Σ_{t=1}^T ẍ_it′ ẍ_it + (1 − λ)² T x̄_i′ x̄_i
= Ẍ_i′ Ẍ_i + (1 − λ)² T x̄_i′ x̄_i
so
E(X̆_i′ X̆_i) − E(Ẍ_i′ Ẍ_i) = (1 − λ)² T E(x̄_i′ x̄_i)
Testing the Key RE Assumption
∙ Recall the key RE assumption is Covx it , c i 0. With lots of good
time-constant controls (“observed heterogeneity”) might be able to
make this condition roughly true.
∙ a. The traditional Hausman Test: Compare the coefficients on the
time-varying explanatory variables, and compute a chi-square statistic.
∙ Caution: Usual Hausman test maintains RE.3 – second moment
assumptions – yet has no systematic power for detecting violations
from this assumption.
∙ With time effects, must use generalized inverse. Easy to get the
degrees of freedom wrong.
∙ b. Variable addition test. Write the model as
y_it = g_t θ + z_i γ + w_it δ + c_i + u_it.
∙ Let w_it be 1 × J. Use a correlated random effects (CRE) formulation
due to Mundlak (1978):
c_i = ψ + w̄_i ξ + a_i
E(a_i | z_i, w_i) = 0.
∙ If we substitute c_i = ψ + w̄_i ξ + a_i into the original equation we get
y_it = ψ + g_t θ + z_i γ + w_it δ + w̄_i ξ + a_i + u_it.
∙ In the CRE formulation, a (robust) Wald test of H_0: ξ = 0 is a test of
E(c_i | z_i, w_i) = E(c_i).
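A numpy illustration of the Mundlak device on simulated data (here ξ = 0.7 in the DGP; all names are mine): adding w̄_i to a pooled regression reproduces the FE coefficient on w_it, and ξ̂ is the coefficient the variable-addition test examines.

```python
import numpy as np

rng = np.random.default_rng(9)
N, T = 300, 4
base = rng.normal(size=N)
w = base[:, None] + rng.normal(size=(N, T))
c = 0.7 * w.mean(axis=1) + rng.normal(size=N)    # c_i correlated with wbar_i
y = 1.0 * w + c[:, None] + rng.normal(size=(N, T))

# Pooled OLS of y on (1, w_it, wbar_i): the Mundlak/CRE regression
wbar = np.repeat(w.mean(axis=1), T)
X = np.column_stack([np.ones(N * T), w.ravel(), wbar])
coef = np.linalg.solve(X.T @ X, X.T @ y.ravel())
b_w, xi_hat = coef[1], coef[2]

# FE (within) slope: equals the CRE coefficient on w_it
wdd = w - w.mean(axis=1, keepdims=True)
ydd = y - y.mean(axis=1, keepdims=True)
b_fe = (wdd * ydd).sum() / (wdd ** 2).sum()
```

A robust t (or Wald) statistic on ξ̂ is the regression-based test of the RE assumption; when ξ̂ is large relative to its standard error, RE is rejected in favor of FE.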
∙ Guggenberger (2010, Journal of Econometrics) has recently pointed
out the pre-testing problem in using the Hausman test to decide
between RE and FE. The regression-based version of the test shows it
is related to the classic problem of pre-testing on a set of regressors –
w̄_i in this case – in order to decide whether or not to include them.
∙ If ξ ≠ 0 but the test has low power, we will omit w̄_i when we should
include it. That is, we will incorrectly opt for RE.
∙ As always, we need to distinguish between a statistical and a practical
rejection.
Airfare Example
. * First use the Hausman test that maintains all of the RE assumptions under
. * the null and directly compares the RE and FE estimates:
. hausman b_fe b_re
chi2(4) = (b-B)'[(V_b-V_B)^(-1)](b-B) = 10.00
Prob>chi2 = 0.0405
(V_b-V_B is not positive definite)
.
. di -.0401/.0127
-3.1574803
. * This is the nonrobust Hausman t test based just on the concen variable.
. * There is only one restriction to test, not four. The p-value reported for
. * the chi-square statistic is incorrect. Notice that the rejection using the
. * correct df is much stronger than if we act as if there are four restrictions.
. * Using the same variance matrix estimator solves the problem of wrong df.
. * The next command uses the matrix of the relatively efficient estimator.
Note: the rank of the differenced variance matrix (1) does not equal the
number of coefficients being tested (4); be sure this is what you expect,
or there may be problems computing the test. Examine the output of your
estimators for anything unexpected and possibly consider scaling your
variables so that the coefficients are on a similar scale.
chi2(1) = (b-B)'[(V_b-V_B)^(-1)](b-B) = 9.89
Prob>chi2 = 0.0017
. * The regression-based test is better: it gets the df right AND is fully
. * robust to violations of the RE variance-covariance matrix:
. xtreg lfare concen concenbar ldist ldistsq y98 y99 y00, re cluster(id)
. * So the robust t statistic is 2.62 --- still a rejection, but not as strong.
. * Using the CRE formulation, we get the FE estimate on the time-varying
. * covariate concen. In this case, the coefficients on the time-constant
. * variables are close to the usual RE estimates, and even closer to the
. * POLS estimates.