Lecture 4
1. Introduction
2. Assumptions
3. Estimation and Testing
4. Comparison of Estimators
1. INTRODUCTION
∙ We already covered panel data models where the error term had no
particular structure. But we assumed either contemporaneous
exogeneity (pooled OLS) or strict exogeneity (feasible GLS).
∙ Now we explicitly add a time-constant, unobserved effect to the
model, often called unobserved heterogeneity.
∙ Start with the balanced panel case, and assume random sampling
across i (the cross section dimension), with fixed time periods T. So
{(x_it, y_it) : t = 1, …, T, c_i} is drawn, where c_i is the unobserved
effect drawn along with the observed data.
∙ The unbalanced case is trickier because we must know why we are
missing some time periods for some units. We consider this much later
under missing data/sample selection issues.
∙ For a random draw i from the population, the basic model is
y_it = x_it β + c_i + u_it,  t = 1, …, T,
v_it ≡ c_i + u_it
∙ Useful to write a population version of the model in conditional
expectation form:
E(y_t | x_t, c) = x_t β + c,  t = 1, …, T.
Therefore,
∂E(y_t | x_t, c)/∂x_tj = β_j,
the partial effect of x_tj holding c fixed.
∙ With a single cross section, there is nothing we can do unless we can
find good observable proxies for c or IVs for the endogenous elements
of x t . But with two or more periods we have more options.
∙ We can write the population model as
y_t = x_t β + c + u_t
E(u_t | x_t, c) = 0
y_1 = x_1 β + c + u_1
y_2 = x_2 β + c + u_2
∙ Subtract t = 1 from t = 2 and define Δy ≡ y_2 − y_1, Δx ≡ x_2 − x_1, and
Δu ≡ u_2 − u_1:
Δy = Δx β + Δu,
∙ The orthogonality condition is
E[(x_2 − x_1)′(u_2 − u_1)] = 0.
But this expands into cross-period terms such as E(x_2′ u_1) and
E(x_1′ u_2), which contemporaneous exogeneity does not restrict.
∙ OLS on the differences will only be consistent if we add
E(x_s′ u_t) = 0,  s ≠ t.
∙ Would we really omit an intercept from the differenced equation?
Very unlikely. If we start with a model with different intercepts,
y_1 = α_1 + x_1 β + c + u_1
y_2 = α_2 + x_2 β + c + u_2
then
Δy = α + Δx β + Δu,
where α ≡ α_2 − α_1.
2. ASSUMPTIONS
∙ As mentioned earlier, we assume a balanced panel, and all asymptotic
analysis – implicit or explicit – is with fixed T and N → ∞, where N is
the size of the cross section.
∙ The basic unobserved effects model is
y_it = x_it β + c_i + u_it,  t = 1, …, T,
∙ An extension of the basic model is
y_it = x_it β + θ_t + c_i + u_it,  t = 1, …, T,
∙ A general specification is
y_it = g_t θ + z_i γ + w_it δ + c_i + u_it
where g_t contains aggregate time effects, z_i contains time-constant
observables, and w_it varies across i and t.
Assumptions about the Unobserved Effect
∙ In modern applications, “random effect” essentially means
Cov(x_it, c_i) = 0,  t = 1, …, T,
Exogeneity Assumptions on the Explanatory Variables
y_it = x_it β + c_i + u_it
E(u_it | x_it, c_i) = 0
or
E(y_it | x_it, c_i) = x_it β + c_i.
∙ Strict Exogeneity Conditional on the Unobserved Effect:
E(y_it | x_i1, …, x_iT, c_i) = E(y_it | x_it, c_i) = x_it β + c_i,
which implies
E(y_it | x_i1, …, x_iT) = x_it β + E(c_i | x_i1, …, x_iT).
∙ But strict exogeneity conditional on c i rules out lagged dependent
variables and feedback. Written in terms of the idiosyncratic errors,
strict exogeneity is
E(u_it | x_i1, …, x_iT, c_i) = 0,  t = 1, …, T.
∙ A more reasonable assumption that we will use later is
E(y_it | x_it, x_i,t−1, …, x_i1, c_i) = E(y_it | x_it, c_i) = x_it β + c_i,
3. ESTIMATION AND TESTING
∙ There are four common methods: pooled OLS, random effects, fixed
effects, and first differencing.
3.1. Pooled OLS
∙ Contemporaneous exogeneity is weaker than strict exogeneity, but it
buys us little in practice because POLS also uses E(x_it′ c_i) = 0, which
cannot hold for lagged dependent variables and is unlikely for other
variables that are not strictly exogenous.
∙ Inference should be made robust to serial correlation and
heteroskedasticity.
∙ Let v̂_it = y_it − x_it β̂_POLS be the POLS residuals. Then
Avar̂(β̂_POLS) = (Σ_{i=1}^N Σ_{t=1}^T x_it′ x_it)^{−1} [Σ_{i=1}^N Σ_{t=1}^T Σ_{r=1}^T v̂_it v̂_ir x_it′ x_ir] (Σ_{i=1}^N Σ_{t=1}^T x_it′ x_it)^{−1},
∙ Can also write this estimator as
Avar̂(β̂_POLS) = (Σ_{i=1}^N X_i′ X_i)^{−1} [Σ_{i=1}^N X_i′ v̂_i v̂_i′ X_i] (Σ_{i=1}^N X_i′ X_i)^{−1}
∙ In Stata:
reg y x1 x2 ... xK, cluster(id)
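For intuition, the sandwich formula above can be coded directly. Here is a minimal numpy sketch on simulated data (the data-generating process and all variable names are mine, purely for illustration):

```python
import numpy as np

# Hypothetical balanced panel: y_it = x_it*beta + c_i + u_it
rng = np.random.default_rng(0)
N, T, K = 200, 4, 2
c = rng.normal(size=N)                       # unobserved effect in the composite error
x = rng.normal(size=(N, T, K))
x[:, :, 0] = 1.0                             # first column is an intercept
beta = np.array([1.0, 0.5])
y = x @ beta + c[:, None] + rng.normal(size=(N, T))

# Pooled OLS on the stacked data
X = x.reshape(N * T, K)
b_pols = np.linalg.solve(X.T @ X, X.T @ y.ravel())

# Cluster-robust sandwich:
# (sum_i X_i'X_i)^(-1) (sum_i X_i' vhat_i vhat_i' X_i) (sum_i X_i'X_i)^(-1)
vhat = (y.ravel() - X @ b_pols).reshape(N, T)
bread = np.linalg.inv(X.T @ X)
meat = np.zeros((K, K))
for i in range(N):
    s = x[i].T @ vhat[i]                     # K-vector X_i' vhat_i
    meat += np.outer(s, s)
avar = bread @ meat @ bread
se_robust = np.sqrt(np.diag(avar))
```

This is the same calculation reported by the cluster option: the standard errors allow arbitrary serial correlation and heteroskedasticity in v_it = c_i + u_it within unit i.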
3.2. Random Effects Estimation
(a) E(u_it | x_i1, x_i2, …, x_iT, c_i) = 0, t = 1, …, T
(b) E(c_i | x_i1, x_i2, …, x_iT) = E(c_i)
∙ A GLS approach also leaves c_i in the error term:
y_it = x_it β + v_it,  v_it = c_i + u_it,  t = 1, 2, …, T
∙ Write the equation in system form (for all time periods) as
y_i = X_i β + v_i = X_i β + c_i j_T + u_i
where j_T is the T × 1 vector of ones.
∙ RE imposes a special structure on Ω = E(v_i v_i′) (which could be
wrong!). Under RE.1(a), c_i and u_it are uncorrelated. Assume further that
Var(u_it) = σ_u²,  t = 1, …, T
Cov(u_it, u_is) = 0,  t ≠ s
Then
Var(v_it) = σ_c² + σ_u².
∙ Further, for t ≠ s,
Cov(v_it, v_is) = Cov(c_i + u_it, c_i + u_is)
= Var(c_i) + Cov(c_i, u_is) + Cov(u_it, c_i) + Cov(u_it, u_is)
= σ_c²
or
Ω =
⎡ σ_c² + σ_u²      σ_c²      ⋯      σ_c²     ⎤
⎢     σ_c²     σ_c² + σ_u²   ⋯      σ_c²     ⎥
⎢      ⋮                     ⋱       ⋮       ⎥
⎣     σ_c²         σ_c²      ⋯  σ_c² + σ_u²  ⎦
∙ We can also write Ω as
Ω = σ_v² ·
⎡ 1  ρ  ⋯  ρ ⎤
⎢ ρ  1  ⋯  ρ ⎥
⎢ ⋮        ⋮ ⎥
⎣ ρ  ρ  ⋯  1 ⎦
where σ_v² ≡ σ_c² + σ_u² and ρ ≡ σ_c²/σ_v².
∙ We can use pooled OLS to get the residuals, v̂_it, across all i and t.
Then a consistent estimator of σ_v² (not generally unbiased), as N gets
large for fixed T, is
σ̂_v² = (NT − K)^{−1} Σ_{i=1}^N Σ_{t=1}^T v̂_it² = SSR/(NT − K),
the usual variance estimator from OLS regression. This is based on, for
each i, σ_v² = T^{−1} Σ_{t=1}^T E(v_it²), averaged across i, too. Then
replace the population average with a sample average and the v_it with
pooled OLS residuals, and subtract K as a degrees-of-freedom adjustment.
∙ For σ_c², note that
σ_c² = [T(T − 1)/2]^{−1} Σ_{t=1}^{T−1} Σ_{s=t+1}^{T} E(v_it v_is).
∙ An actual estimator replaces v_it with the POLS residuals:
σ̂_c² = [NT(T − 1)/2 − K]^{−1} Σ_{i=1}^N Σ_{t=1}^{T−1} Σ_{s=t+1}^{T} v̂_it v̂_is,
plim_{N→∞} σ̂_c² = σ_c²
with T fixed.
∙ Now we can use
Ω̂ = σ̂_c² j_T j_T′ + σ̂_u² I_T,
the T × T matrix with σ̂_v² = σ̂_c² + σ̂_u² down the diagonal and σ̂_c²
everywhere off the diagonal, where σ̂_u² ≡ σ̂_v² − σ̂_c².
∙ It is possible for σ̂_c² to be negative, which means the basic unobserved
effects variance-covariance structure is faulty.
∙ Typically, σ̂_c² > 0 unless the variables have been transformed in some
way – such as being first differenced – before applying GLS.
∙ The FGLS estimator that uses this particular structure of Ω̂ is the
random effects (RE) estimator.
∙ Fully robust inference is available for RE, and there are good reasons
for using it.
(1) Ω may not have the special (and restrictive, especially for large T)
RE structure; that is, E(v_i v_i′) need not have the RE form. Serial
correlation or changing variances in u_it, t = 1, …, T, invalidate the
RE structure.
(2) The system homoskedasticity requirement,
E(v_i v_i′ | X_i) = E(v_i v_i′),
may fail.
∙ A fully robust estimator is
Avar̂(β̂_RE) = (Σ_{i=1}^N X_i′ Ω̂^{−1} X_i)^{−1} [Σ_{i=1}^N X_i′ Ω̂^{−1} v̂_i v̂_i′ Ω̂^{−1} X_i] (Σ_{i=1}^N X_i′ Ω̂^{−1} X_i)^{−1},
∙ For first-order asymptotics, there is no efficiency gain from iterating.
It might help with smaller N, though.
∙ What is the advantage of RE, which imposes specific assumptions on
Ω, over the unrestricted FGLS we discussed earlier? Theoretically,
nothing. We do not get more efficiency with large N and small T by
imposing restrictions on Ω.
∙ If system homoskedasticity holds but Ω is not of the RE form, an
unrestricted FGLS analysis is more efficient than RE (again, fixed T,
N → ∞).
∙ As we will see later, RE does have some appeal because of its
implicit transformation.
∙ A nonrobust variance matrix estimator can be used if we add an
assumption:
ASSUMPTION RE.3:
(a) E(u_i u_i′ | x_i, c_i) = σ_u² I_T
(b) E(c_i² | x_i) = σ_c²
∙ Under RE.1, RE.2, and RE.3,
Avar̂(β̂_RE) = (Σ_{i=1}^N X_i′ Ω̂^{−1} X_i)^{−1}
is a valid estimator.
∙ Inference is straightforward. Typically use Wald or robust Wald
statistic for multiple restrictions.
∙ In Stata, fully robust inference uses the “cluster” option; for the
“usual” variance matrix estimator, drop this option:
xtreg y x1 x2 ... xK, re cluster(id)
∙ Occasionally, one might want to test
H_0: σ_c² = 0
H_1: σ_c² > 0
It’s rare that one cannot strongly reject this because of the strong
positive serial correlation in the POLS residuals in most applications.
The formal test, derived under joint normality of (c_i, u_i), is called the
Breusch-Pagan test.
∙ A fully robust test does not add any additional assumptions, and
allows for heteroskedasticity. The key is that if v̂ it now denotes the
POLS residuals – which is what the B-P test uses – then
N^{−1/2} Σ_{i=1}^N Σ_{t=1}^{T−1} Σ_{s=t+1}^T v̂_it v̂_is = N^{−1/2} Σ_{i=1}^N Σ_{t=1}^{T−1} Σ_{s=t+1}^T v_it v_is + o_p(1)
∙ Therefore, under
H_0: E(v_it v_is) = 0, all t ≠ s,
it follows that
[N^{−1/2} Σ_{i=1}^N Σ_{t=1}^{T−1} Σ_{s=t+1}^T v̂_it v̂_is] / {E[(Σ_{t=1}^{T−1} Σ_{s=t+1}^T v_it v_is)²]}^{1/2} →d Normal(0, 1)
∙ Now estimate the denominator and cancel the sample sizes:
[Σ_{i=1}^N Σ_{t=1}^{T−1} Σ_{s=t+1}^T v̂_it v̂_is] / [Σ_{i=1}^N (Σ_{t=1}^{T−1} Σ_{s=t+1}^T v̂_it v̂_is)²]^{1/2} →d Normal(0, 1).
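A quick numpy illustration of this statistic on simulated residual matrices (hypothetical data: under H0 it behaves like a standard normal; when c_i is present it is large and positive):

```python
import numpy as np

def robust_stat(v):
    """v: N x T residual matrix. Ratio of the summed cross-period
    products to the square root of its estimated variance."""
    T = v.shape[1]
    inner = np.zeros(v.shape[0])             # inner_i = sum_{t<s} v_it v_is
    for t in range(T - 1):
        for s in range(t + 1, T):
            inner += v[:, t] * v[:, s]
    return inner.sum() / np.sqrt((inner ** 2).sum())

rng = np.random.default_rng(2)
N, T = 400, 4
stat_h0 = robust_stat(rng.normal(size=(N, T)))       # iid errors, no c_i
stat_h1 = robust_stat(rng.normal(size=(N, 1)) + rng.normal(size=(N, T)))  # c_i present
```

Unlike Breusch-Pagan, nothing here relies on normality or homoskedasticity; only the cross-section CLT over i is used.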
3.3. Fixed Effects Estimation
ȳ_i = x̄_i β + c_i + ū_i,
∙ The equation ȳ i x̄ i c i ū i is often called the between equation
because it relies on variation in the data between cross section
observations. The between estimator is the OLS estimator from the
cross section regression
ȳ_i on x̄_i,  i = 1, …, N.
∙ Instead, subtract the time-averaged equation from the original
equation to eliminate c_i:
y_it − ȳ_i = (x_it − x̄_i)β + (u_it − ū_i),  t = 1, …, T
or
ÿ_it = ẍ_it β + ü_it,  t = 1, …, T
∙ Key is that c_i is gone from the time-demeaned equation. So, we can
use pooled OLS:
ÿ_it on ẍ_it,  t = 1, …, T; i = 1, …, N,
which is the same as using y_it in place of ÿ_it because
Σ_{t=1}^T (x_it − x̄_i)′(y_it − ȳ_i) = Σ_{t=1}^T (x_it − x̄_i)′ y_it.
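The within transformation is simple to code; a minimal numpy sketch with a hypothetical DGP in which c_i is correlated with x_it (so pooled OLS is inconsistent but the within estimator is not):

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 300, 5
c = rng.normal(size=N)
x = c[:, None] + rng.normal(size=(N, T))     # x_it correlated with c_i
y = 2.0 * x + c[:, None] + rng.normal(size=(N, T))

# Time-demean and run pooled OLS on the demeaned data
xdd = x - x.mean(axis=1, keepdims=True)      # x_it - xbar_i
ydd = y - y.mean(axis=1, keepdims=True)
b_fe = (xdd * ydd).sum() / (xdd ** 2).sum()

# Pooled OLS slope for comparison: inconsistent under this DGP
xc = x.ravel() - x.mean()
b_pols = (xc * (y.ravel() - y.mean())).sum() / (xc ** 2).sum()
```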
∙ What is the weakest orthogonality assumption for consistency? We
can just apply the results for POLS, but it is useful to see it directly.
∙ Write the estimator by substituting ÿ_it = ẍ_it β + ü_it:
β̂_FE = β + (Σ_{i=1}^N Σ_{t=1}^T ẍ_it′ ẍ_it)^{−1} (Σ_{i=1}^N Σ_{t=1}^T ẍ_it′ ü_it)
= β + (Σ_{i=1}^N Σ_{t=1}^T ẍ_it′ ẍ_it)^{−1} (Σ_{i=1}^N Σ_{t=1}^T ẍ_it′ u_it)
∙ By the WLLN as N → ∞ with fixed T, the key moment condition for
consistency is
Σ_{t=1}^T E(ẍ_it′ u_it) = Σ_{t=1}^T E[(x_it − x̄_i)′ u_it] = 0.
ASSUMPTION FE.1: Same as RE.1(a), that is,
E(u_it | x_i, c_i) = 0,  t = 1, …, T.
∙ The rank condition rules out elements in x it that have no time
variation for any unit in the population. Such variables get swept away
by the within transformation.
∙ Under FE.1 and FE.2,
β̂_FE →p β as N → ∞
∙ The FE estimator works well for large T, too, but showing that
requires putting restrictions on the time series process
{(x_it, y_it) : t = 1, 2, …}.
∙ What parameters can we identify with FE? Suppose we start with
y_it = θ_1 + θ_2 d2_t + … + θ_T dT_t + z_i γ_1 + d2_t z_i γ_2 + … + dT_t z_i γ_T
+ w_it δ + c_i + u_it
∙ We can estimate θ_2, …, θ_T and γ_2, …, γ_T. So we can estimate whether
the effect of the time-constant variables has changed over time. We
cannot estimate the level of the effect in any period t because it is γ_1
for t = 1 and γ_1 + γ_t for t = 2, …, T, and γ_1 is not identified.
∙ As another example, suppose w_it is a scalar policy variable and z_i are
time-constant characteristics, and the model is
y_it = θ_1 + θ_2 d2_t + … + θ_T dT_t + z_i γ_1 + d2_t z_i γ_2 + … + dT_t z_i γ_T
+ w_it δ + w_it (z_i − μ_z) π + c_i + u_it
where μ_z = E(z_i).
∙ We can estimate δ (the average partial effect) as well as π, which
means we can see how the policy effects change with individual
characteristics (and test H_0: π = 0). As a practical matter, we would
replace the population mean μ_z with the sample average,
z̄ = N^{−1} Σ_{i=1}^N z_i.
∙ We can obtain a variance matrix estimator valid under Assumptions
FE.1 and FE.2.
∙ Define the FE residuals as
ü̂_it = ÿ_it − ẍ_it β̂_FE,  t = 1, …, T; i = 1, …, N
∙ These are “estimates” of the ü_it, not the u_it. This has implications for
estimating the error variance, σ_u².
∙ Without additional assumptions, use the “cluster-robust” matrix
Avar̂(β̂_FE) = (Σ_{i=1}^N Σ_{t=1}^T ẍ_it′ ẍ_it)^{−1} [Σ_{i=1}^N Σ_{t=1}^T Σ_{r=1}^T ü̂_it ü̂_ir ẍ_it′ ẍ_ir] (Σ_{i=1}^N Σ_{t=1}^T ẍ_it′ ẍ_it)^{−1}.
∙ In Stata, again use the “cluster” option:
xtreg y x1 x2 ... xK, fe cluster(id)
∙ Of course, a nonrobust form requires an extra assumption:
ASSUMPTION FE.3: Same as RE.3(a), that is,
E(u_i u_i′ | x_i, c_i) = σ_u² I_T.
Under FE.3, we can simplify the middle matrix. First, use
Σ_{t=1}^T ẍ_it′ ü_it = Σ_{t=1}^T ẍ_it′ u_it, that is, Ẍ_i′ ü_i = Ẍ_i′ u_i.
Therefore,
E(Ẍ_i′ ü_i ü_i′ Ẍ_i) = E(Ẍ_i′ u_i u_i′ Ẍ_i) = E[E(Ẍ_i′ u_i u_i′ Ẍ_i | Ẍ_i)]
= E[Ẍ_i′ E(u_i u_i′ | Ẍ_i) Ẍ_i] = E(Ẍ_i′ σ_u² I_T Ẍ_i)
= σ_u² E(Ẍ_i′ Ẍ_i)
because E(u_i u_i′ | Ẍ_i) = σ_u² I_T under FE.3.
∙ So Avar[√N(β̂_FE − β)] = σ_u² [E(Ẍ_i′ Ẍ_i)]^{−1} under FE.1, FE.2, and
FE.3.
∙ Estimating σ_u² requires some care because we effectively observe ü_it,
not u_it.
∙ Under the constant variance and no serial correlation assumptions on
u_it,
E(ü_it²) = σ_u²(T − 1)/T.
∙ So
Σ_{t=1}^T E(ü_it²) = (T − 1)σ_u².
∙ One degree of freedom is lost for each unit i because of the time
demeaning: Σ_{t=1}^T ü_it = 0.
∙ Therefore,
σ_u² = [N(T − 1)]^{−1} Σ_{i=1}^N Σ_{t=1}^T E(ü_it²)
and, with a degrees-of-freedom adjustment for the K estimated slopes,
σ̂_u² = [N(T − 1) − K]^{−1} Σ_{i=1}^N Σ_{t=1}^T ü̂_it² = SSR/[N(T − 1) − K]
Avar̂(β̂_FE) = σ̂_u² (Σ_{i=1}^N Ẍ_i′ Ẍ_i)^{−1}
∙ If you do the time-demeaning and run pooled OLS, the usual statistics
do not reflect the lost degrees of freedom (N of them). The estimate of
σ_u² will be SSR/(NT − K), which is too small. Canned FE packages
properly compute the statistics.
∙ The FE estimator β̂_FE can also be obtained by running a long
regression on the original data, including a dummy variable for each
cross section unit:
y_it on d1_i, d2_i, …, dN_i, x_it,  t = 1, …, T; i = 1, …, N,
often called the dummy variable regression. The statistics are properly
computed because of the inclusion of the N dummy variables.
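The algebraic equivalence between the within regression and the dummy variable regression can be verified numerically; a numpy sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 30, 4
c = rng.normal(size=N)
x = c[:, None] + rng.normal(size=(N, T))
y = 1.5 * x + c[:, None] + rng.normal(size=(N, T))

# Within (time-demeaned) estimator
xdd = x - x.mean(axis=1, keepdims=True)
ydd = y - y.mean(axis=1, keepdims=True)
b_within = (xdd * ydd).sum() / (xdd ** 2).sum()

# Dummy variable regression: y on N unit dummies and x
D = np.kron(np.eye(N), np.ones((T, 1)))      # NT x N block of unit dummies
Z = np.column_stack([D, x.ravel()])
coef = np.linalg.lstsq(Z, y.ravel(), rcond=None)[0]
b_dummy = coef[-1]                           # coefficient on x
```

The two slopes agree to machine precision; only the degrees-of-freedom bookkeeping differs across implementations.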
∙ Only danger: treating the c_i as parameters to estimate, while sensible
with “large” T, can lead to trouble later with nonlinear models. Here,
we get a consistent estimator of β for fixed T.
∙ Sometimes we want to estimate the c_i using the T time periods. We do
not have to run the dummy variable regression:
ĉ_i = ȳ_i − x̄_i β̂_FE,  i = 1, …, N.
μ̂_c = N^{−1} Σ_{i=1}^N ĉ_i = N^{−1} Σ_{i=1}^N (ȳ_i − x̄_i β̂_FE) = N^{−1} Σ_{i=1}^N [c_i + ū_i + x̄_i(β − β̂_FE)]
= N^{−1} Σ_{i=1}^N c_i + N^{−1} Σ_{i=1}^N ū_i + (N^{−1} Σ_{i=1}^N x̄_i)(β − β̂_FE)
= N^{−1} Σ_{i=1}^N c_i + o_p(1) + O_p(1)·o_p(1) →p μ_c.
∙ Can estimate other features of the distribution, too, although some
“obvious” estimators are inconsistent. For example, we might try to
estimate σ_c² using the sample variance of the ĉ_i, i = 1, …, N:
σ̃_c² = (N − 1)^{−1} Σ_{i=1}^N (ĉ_i − μ̂_c)².
∙ We can adjust for the “bias” using the estimate σ̂_u²:
σ̂_c² = σ̃_c² − σ̂_u²/T = (N − 1)^{−1} Σ_{i=1}^N (ĉ_i − μ̂_c)² − σ̂_u²/T
∙ We can also obtain an estimate of σ_c² using
σ_c² = σ_v² − σ_u²
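A numpy sketch of ĉ_i and the bias adjustment, on a hypothetical DGP with σ_c² = 1 and σ_u² = 0.64 (here K = 1, so the df correction subtracts one):

```python
import numpy as np

rng = np.random.default_rng(5)
N, T = 1000, 4
c = rng.normal(scale=1.0, size=N)
x = rng.normal(size=(N, T))
y = 2.0 * x + c[:, None] + rng.normal(scale=0.8, size=(N, T))

# FE slope via the within transformation
xdd = x - x.mean(axis=1, keepdims=True)
ydd = y - y.mean(axis=1, keepdims=True)
b_fe = (xdd * ydd).sum() / (xdd ** 2).sum()

# chat_i = ybar_i - xbar_i*b_fe, and sigma_u^2 = SSR/(N(T-1) - K)
c_hat = y.mean(axis=1) - x.mean(axis=1) * b_fe
sig2_u = ((ydd - xdd * b_fe) ** 2).sum() / (N * (T - 1) - 1)

sig2_c_naive = c_hat.var(ddof=1)             # inconsistent: includes Var(ubar_i)
sig2_c = sig2_c_naive - sig2_u / T           # bias-adjusted estimate
```

The naive sample variance of the ĉ_i overshoots by roughly σ_u²/T, which the adjustment removes.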
∙ Recent work by Orme and Yamagata (2006, Econometric Reviews)
has shown that the F statistic is approximately valid if we drop the
normality assumption on u it , but it is still unknown how to test
constancy of the c i with serial correlation or heteroskedasticity in u it .
Testing for Serial Correlation
∙ Because we can obtain fully robust inference, why should we test for
serial correlation in the u it ? The answer is that we might be able to
improve efficiency using a GLS-type method.
∙ We can test for serial correlation in u it , but it is tricky because we
effectively only have ü it .
∙ When u_it is serially uncorrelated with constant variance, for t ≠ r
we have
Cov(ü_it, ü_ir) = −σ_u²/T.
Therefore,
Corr(ü_it, ü_ir) = (−σ_u²/T)/[σ_u²(T − 1)/T] = −1/(T − 1).
∙ A simple test is based on a pooled AR(1) regression. First obtain the
FE residuals, ü̂_it. (In Stata, use the “areg” command.) Then run the
pooled OLS regression
ü̂_it on ü̂_i,t−1,  t = 3, …, T; i = 1, …, N
and let ρ̂ be the coefficient on ü̂_i,t−1. The tricky thing is that, under the
null, the ü_it are serially correlated.
∙ We obtain a simple statistic using a fully robust standard error for ρ̂,
se(ρ̂) (available from the “cluster” option in POLS). The t statistic is
[ρ̂ + (T − 1)^{−1}]/se(ρ̂).
∙ Typically we observe ρ̂ > 0 if u_it is positively serially correlated. A
positive, significant estimate of ρ̂ reveals some positive serial
correlation. If ρ̂ ≈ −(T − 1)^{−1}, no serial correlation in u_it might be
reasonable.
∙ If we find strong evidence of serial correlation in u_it, we might
want to exploit it in estimation rather than just making FE inference
robust.
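The −1/(T − 1) centering is easy to check numerically: even when u_it is iid, the pooled AR(1) slope on the demeaned errors converges to −1/(T − 1), not 0. A numpy sketch with simulated errors (T = 5, so the plim is −0.25):

```python
import numpy as np

rng = np.random.default_rng(6)
N, T = 2000, 5
u = rng.normal(size=(N, T))                  # serially uncorrelated u_it
udd = u - u.mean(axis=1, keepdims=True)      # time-demeaned errors

# Pooled AR(1) slope of udd_it on udd_{i,t-1}
lag = udd[:, :-1].ravel()
cur = udd[:, 1:].ravel()
rho_hat = (lag * cur).sum() / (lag ** 2).sum()
center = -1.0 / (T - 1)                      # what rho_hat estimates under H0
```

This is why the test statistic centers ρ̂ at −1/(T − 1) rather than at zero.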
Fixed Effects GLS
∙ Write the T time periods for a random draw i as
y_i = X_i β + c_i j_T + u_i
∙ Because j_T′ ü_i = 0, we know the T × T matrix E(ü_i ü_i′) has rank less
than T. In fact, the (unconditional) variance-covariance matrix of ü_i is
Λ ≡ E(ü_i ü_i′) = E(Q_T u_i u_i′ Q_T) = Q_T Ω Q_T,
where Q_T ≡ I_T − j_T(j_T′ j_T)^{−1} j_T′ is the time-demeaning matrix and
Ω ≡ E(u_i u_i′).
∙ There is a simple solution. After demeaning to obtain ÿ_i and Ẍ_i using
all T time periods and obtaining
Λ̂ = N^{−1} Σ_{i=1}^N ü̂_i ü̂_i′,
drop one of the time periods. It does not matter which one is dropped
(but the first or last are easiest).
∙ Apply FGLS to the T − 1 remaining equations using Λ̂ (with the
corresponding row and column deleted).
∙ Remember, can still make a case for robust inference because system
heteroskedasticity is always a possibility.
Some Practical Hints in Applying Fixed Effects
∙ Possible confusion concerning the term “fixed effects.” Suppose i is a
firm. Then the phrase “firm fixed effect” corresponds to allowing c i in
the model to be correlated with the covariates. If c i is called a firm
“random effect” then it is being assumed to be uncorrelated with x it .
∙ Suppose that we cannot, or do not want to, use FE estimation. This
might occur because the key variable at the firm level is constant across
time for all firms – and so the FE transformation sweeps it away – or
there is little time variation within firm in the key variable, leading to
large standard errors.
∙ Instead, we might use a random effects analysis at the firm level but
include industry dummy variables to account for systematic differences
across industries. So, we include in x it a set of industry dummy
variables while also allowing a firm effect c i in a “random effects”
analysis.
∙ If there are many firms per industry, the industry “fixed effects” – the
coefficients on the industry dummies – can be precisely estimated. So
the industry “fixed effects” are really parameters to estimate whereas
the c i are not.
∙ Generally, including dummies for more aggregated levels and then
applying RE is common when the covariates of interest vary in the
cross section but not (much) over time.
∙ Keep in mind that an RE analysis at the firm level with industry
dummies need not be entirely convincing: the key elements of x_it might
be correlated with unobserved firm features that are not adequately
captured by industry differences.
Application
For N = 1,149 U.S. air routes and the years 1997 through 2000, y_it is
log(fare_it) and the key explanatory variable is concen_it, the
concentration ratio for route i. Other covariates are year dummies and
the time-constant variables log(dist_i) and [log(dist_i)]². Note that what I
call c_i Stata refers to as u_i.
. use airfare
. tab year
1997, 1998, |
1999, 2000 | Freq. Percent Cum.
-----------------------------------------------
1997 | 1,149 25.00 25.00
1998 | 1,149 25.00 50.00
1999 | 1,149 25.00 75.00
2000 | 1,149 25.00 100.00
-----------------------------------------------
Total | 4,596 100.00
. reg lfare concen ldist ldistsq y98 y99 y00
------------------------------------------------------------------------------
lfare | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------------------------------------------------------------------
concen | .3601203 .0300691 11.98 0.000 .3011705 .4190702
ldist | -.9016004 .128273 -7.03 0.000 -1.153077 -.6501235
ldistsq | .1030196 .0097255 10.59 0.000 .0839529 .1220863
y98 | .0211244 .0140419 1.50 0.133 -.0064046 .0486533
y99 | .0378496 .0140413 2.70 0.007 .010322 .0653772
y00 | .09987 .0140432 7.11 0.000 .0723385 .1274015
_cons | 6.209258 .4206247 14.76 0.000 5.384631 7.033884
------------------------------------------------------------------------------
. reg lfare concen ldist ldistsq y98 y99 y00, cluster(id)
. xtreg lfare concen ldist ldistsq y98 y99 y00, re
------------------------------------------------------------------------------
lfare | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-----------------------------------------------------------------------------
concen | .2089935 .0265297 7.88 0.000 .1569962 .2609907
ldist | -.8520921 .2464836 -3.46 0.001 -1.335191 -.3689931
ldistsq | .0974604 .0186358 5.23 0.000 .0609348 .133986
y98 | .0224743 .0044544 5.05 0.000 .0137438 .0312047
y99 | .0366898 .0044528 8.24 0.000 .0279626 .0454171
y00 | .098212 .0044576 22.03 0.000 .0894752 .1069487
_cons | 6.222005 .8099666 7.68 0.000 4.6345 7.80951
-----------------------------------------------------------------------------
sigma_u | .31933841
sigma_e | .10651186
rho | .89988885 (fraction of variance due to u_i)
------------------------------------------------------------------------------
. * The coefficient on the time-varying variable concen drops quite a bit.
. * Notice that the RE and POLS coefficients on the time-constant
. * distance variables are pretty similar, something that often occurs.
. * What if we do not control for distance in RE?
. * The RE estimate is now much smaller than when ldist and ldistsq are
. * controlled for, and much smaller than the FE estimate. Thus, it can be
. * very harmful to omit time-constant variables in RE estimation.
. * Allow an unrestricted unconditional variance-covariance matrix, but
. * make robust to system heteroskedasticity:
. xtgee lfare concen ldist ldistsq y98 y99 y00, corr(uns) robust
. xtreg lfare concen ldist ldistsq y98 y99 y00, fe
F(4,3443) = 134.61
corr(u_i, Xb) = -0.2033                Prob > F = 0.0000
------------------------------------------------------------------------------
lfare | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------------------------------------------------------------------
concen | .168859 .0294101 5.74 0.000 .1111959 .226522
ldist | (dropped)
ldistsq | (dropped)
y98 | .0228328 .0044515 5.13 0.000 .0141048 .0315607
y99 | .0363819 .0044495 8.18 0.000 .0276579 .0451058
y00 | .0977717 .0044555 21.94 0.000 .089036 .1065073
_cons | 4.953331 .0182869 270.87 0.000 4.917476 4.989185
-----------------------------------------------------------------------------
sigma_u | .43389176
sigma_e | .10651186
rho | .94316439 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:  F(1148, 3443) = 36.90    Prob > F = 0.0000
. xtreg lfare concen ldist ldistsq y98 y99 y00, fe cluster(id)
. * Let the effect of concen depend on route distance.
. xtreg lfare concen ldistconcen y98 y99 y00, fe cluster(id)
. * Effect at the average of ldist is similar to before. But at one standard
. * deviation of ldist above its mean, the effect of concen is zero:
------------------------------------------------------------------------------
lfare | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------------------------------------------------------------------
(1) | .0003449 .0554442 0.01 0.995 -.1084383 .1091281
------------------------------------------------------------------------------
. di 209/1149
.1818973
. * So about 18.2% of the routes have ldist greater than one standard
. * deviation above the mean.
3.4. First-Differencing Estimation
y_it = x_it β + c_i + u_it,  t = 1, …, T.
Δy_it = Δx_it β + Δu_it,  t = 2, …, T.
∙ FD also requires a kind of strict exogeneity. The weakest assumption
is
E(Δx_it′ Δu_it) = 0,  t = 2, …, T.
∙ A sufficient condition is
ASSUMPTION FD.1: Same as FE.1, E(u_it | x_i, c_i) = 0, t = 1, …, T.
ASSUMPTION FD.2: Let ΔX_i be the (T − 1) × K matrix with rows
Δx_it. Then,
rank E(ΔX_i′ ΔX_i) = K.
∙ After POLS on the first differences, let
ê_it = Δy_it − Δx_it β̂_FD,  t = 2, …, T; i = 1, …, N
Avar̂(β̂_FD) = (Σ_{i=1}^N ΔX_i′ ΔX_i)^{−1} [Σ_{i=1}^N ΔX_i′ ê_i ê_i′ ΔX_i] (Σ_{i=1}^N ΔX_i′ ΔX_i)^{−1}
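A minimal numpy sketch of the FD estimator on a hypothetical DGP (differencing removes c_i even though c_i is correlated with x_it):

```python
import numpy as np

rng = np.random.default_rng(7)
N, T = 400, 5
c = rng.normal(size=N)
x = 0.8 * c[:, None] + rng.normal(size=(N, T))
y = 1.0 * x + c[:, None] + rng.normal(size=(N, T))

dx = np.diff(x, axis=1)                      # Delta x_it, t = 2, ..., T
dy = np.diff(y, axis=1)
b_fd = (dx * dy).sum() / (dx ** 2).sum()     # POLS on first differences

e = dy - dx * b_fd                           # FD residuals, N x (T-1)
```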
ASSUMPTION FD.3:
E(e_i e_i′ | ΔX_i) = σ_e² I_{T−1}
∙ For a given i, the time series “model” would be
y_it = c_i + x_it β + u_it
u_it = u_i,t−1 + e_it,
where c_i is the intercept for unit i. This does not define a sensible time
series regression because u_it is not “mean reverting.” One way to see
this is Var(u_it) = σ_e² t, and so the idiosyncratic error variance grows as a
linear function of t.
∙ Here we can allow random walk behavior in u it with a short T
because we have cross section variation driving the large-sample
analysis.
∙ Testing for serial correlation in e_it = Δu_it is easy. If we start with
T ≥ 3, then use a t test (or a heteroskedasticity-robust version) for ρ̂, where
ρ̂ is the coefficient on ê_i,t−1 in the pooled dynamic OLS regression
ê_it on ê_i,t−1,  t = 3, …, T; i = 1, …, N.
To test the null that u_it is serially uncorrelated – under which
Corr(e_it, e_i,t−1) = −.5 – use
[ρ̂ + .5]/se(ρ̂).
∙ Can use the FD residuals to recover an estimate of ρ if we think
u_it, t = 1, 2, …, T, follows a stationary AR(1) process. Then
Cov(u_it, u_i,t−h) = ρ^h σ_u², h = 0, 1, ….
∙ Further,
Var(e_it) = 2σ_u² − 2Cov(u_it, u_i,t−1) = 2σ_u²(1 − ρ)
∙ It follows that
Corr(e_it, e_i,t−1) = −σ_u²(1 − ρ)²/[2σ_u²(1 − ρ)] = −(1 − ρ)/2.
∙ Notice we get the right answer when the correlation is 0: namely, ρ = 1
(so that u_it follows a random walk). Solving −(1 − ρ)/2 = ρ_e gives
ρ = 1 + 2ρ_e, so we can use
ρ̂ = 1 + 2ρ̂_e
where ρ̂_e is the estimated first-order correlation of the FD residuals.
∙ Applying feasible GLS after differencing is especially easy because
the lost degree of freedom for each i is automatically incorporated by
losing the first time period.
∙ The resulting estimator is the FDGLS estimator. It uses an unrestricted
(T − 1) × (T − 1) variance matrix in the FD equation
Δy_i = ΔX_i β + Δu_i
where Δu_i is (T − 1) × 1.
∙ Easy to use the xtgee command in Stata.
. sort id year
------------------------------------------------------------------------------
clfare | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------------------------------------------------------------------
cconcen | .1759764 .0284387 6.19 0.000 .1202181 .2317348
y99 | -.0091019 .0052688 -1.73 0.084 -.0194322 .0012284
y00 | .0386441 .0052301 7.39 0.000 .0283897 .0488985
_cons | .0227692 .0036988 6.16 0.000 .0155171 .0300212
------------------------------------------------------------------------------
. * Fairly close to FE estimate of .169, but standard errors are probably
. * not correct. The R-squared gives us a measure of how well changes
. * in concentration explain changes in lfare.
. gen cy99 = y99 - y99[_n-1] if year > 1997
(1149 missing values generated)
. * All estimates are now similar to FE. This R-squared is less useful
. * than when a constant is included because it does not remove the average.
. * It is the "uncentered" R-squared.
. * Test for serial correlation using FD.
------------------------------------------------------------------------------
| Robust
eh | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------------------------------------------------------------------
eh_1 | -.1275163 .0274343 -4.65 0.000 -.1813148 -.0737177
_cons | -3.30e-11 .0024386 -0.00 1.000 -.0047821 .0047821
------------------------------------------------------------------------------
. * Can use xtgee to obtain the FGLS estimator on the FD equation:
------------------------------------------------------------------------------
clfare | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-----------------------------------------------------------------------------
cconcen | .169649 .0285421 5.94 0.000 .1137076 .2255904
y99 | -.0092635 .0054855 -1.69 0.091 -.0200149 .001488
y00 | .0385667 .0054062 7.13 0.000 .0279707 .0491627
_cons | .0228257 .0036967 6.17 0.000 .0155802 .0300712
------------------------------------------------------------------------------
. xtgee clfare cconcen y99 y00, corr(uns) robust
. * The robust standard error for FGLS is about 50% larger than the nonrobust
. * one.
. reg eh eh_1, cluster(id)
. lincom eh_1 + .5
( 1)  eh_1 + .5 = 0
------------------------------------------------------------------------------
eh | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------------------------------------------------------------------
(1) | .3724837 .0272003 13.69 0.000 .3191159 .4258515
------------------------------------------------------------------------------
. * And we can easily reject -.5, too, which is what would happen under FE.3.
. di 1 + 2*(-.128)
.744
. * Test for serial correlation using FE. Use "areg" to get the FE
. * residuals.
------------------------------------------------------------------------------
lfare | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------------------------------------------------------------------
concen | .168859 .0294101 5.74 0.000 .1111959 .226522
y98 | .0228328 .0044515 5.13 0.000 .0141048 .0315607
y99 | .0363819 .0044495 8.18 0.000 .0276579 .0451058
y00 | .0977717 .0044555 21.94 0.000 .089036 .1065073
_cons | 4.953331 .0182869 270.87 0.000 4.917476 4.989185
-----------------------------------------------------------------------------
id | F(1148, 3443) = 60.521   0.000   (1149 categories)
. sort id year
. reg udh udh_1, cluster(id)
( 1)  udh_1 + .333 = 0
------------------------------------------------------------------------------
udh | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------------------------------------------------------------------
(1) | .3044832 .0304886 9.99 0.000 .2446636 .3643028
------------------------------------------------------------------------------
3.5. Prediction
∙ The RE prediction of y_i,T+1 is
x_i,T+1 β̂_RE + [σ̂_c²/(σ̂_c² + σ̂_u²/T)](ȳ_i − x̄_i β̂_RE),
which shrinks the second term toward zero; the FE-based prediction,
x_i,T+1 β̂_FE + ĉ_i, does not shrink the influence of the second term. As σ̂_c²
increases relative to σ̂_u², or for large T, the two predictions are similar.
∙ Seems unlikely that either of these can match dynamic models
estimated by pooled OLS. The RE and FE methods each give the same
weight to the most recent and earliest outcomes on y.
4. COMPARISON OF ESTIMATORS
FE versus FD.
∙ Estimates and inference are identical when T = 2. Generally, we can see
differences as T increases.
∙ Usually think a significant difference signals violation of
Cov(x_is, u_it) = 0, all s, t. FE has some robustness if Cov(x_it, u_it) = 0 but
Cov(x_it, u_is) ≠ 0 for some s ≠ t: the “bias” is of order 1/T. FD does not
average out the bias over T.
∙ To see this, maintain contemporaneous exogeneity:
E(x_it′ u_it) = 0.
∙ Under contemporaneous exogeneity,
E(ẍ_it′ u_it) = −E(x̄_i′ u_it)
and so
T^{−1} Σ_{t=1}^T E(ẍ_it′ u_it) = −T^{−1} Σ_{t=1}^T E(x̄_i′ u_it) = −E(x̄_i′ ū_i).
∙ Under stationarity and weak dependence, E(x̄_i′ ū_i) = O(T^{−1}) because,
by the Cauchy-Schwarz inequality, for each j,
|E(x̄_ij ū_i)| ≤ sd(x̄_ij)·sd(ū_i),
and sd(x̄_ij), sd(ū_i) are O(T^{−1/2}) when each series is weakly dependent.
(If uncorrelated with constant variance, sd(ū_i) = σ_u/√T.)
∙ Further, T^{−1} Σ_{t=1}^T E(ẍ_it′ ẍ_it) is bounded as a function of T. It follows
that the inconsistency in β̂_FE shrinks to zero at rate 1/T.
∙ If x_it, t = 1, 2, …, is weakly dependent, so is Δx_it, and so the first
average is generally bounded. (In fact, under stationarity this average
does not depend on T.)
∙ As for the second average,
E(Δx_it′ Δu_it) = −[E(x_it′ u_i,t−1) + E(x_i,t−1′ u_it)]
even if E(x_i,t−1′ u_it) = 0 (so the dynamics given the elements of x_it are
correct).
∙ Can show the previous results hold even if x_it is I(1) as a time
series process (has a “unit root”), but it is crucial that u_it is I(0)
(weakly dependent). If the regression is “spurious” in levels, it is better
to first difference!
∙ In simple cases, such as the AR(1) model with x_it = y_i,t−1, we can find
what the O(T^{−1}) term is for FE. If we write the model as
y_it = ρ y_i,t−1 + (1 − ρ)a_i + u_it
∙ Simple test for feedback when the model does not contain lagged
dependent variables, that is, of Cov(x_i,t+1, u_it) ≠ 0. Estimate
y_it = x_it β + w_i,t+1 δ + c_i + u_it,  t = 1, …, T − 1
. * We found that the FE and FD estimates of concen coefficient were
. * pretty close.
. sort id year
. gen concenp1 = concen[_n+1] if year < 2000
. xtreg lfare concen concenp1 y98 y99 y00, fe cluster(id)
F(4,1148) = 25.63
corr(u_i, Xb) = -0.2949                Prob > F = 0.0000
∙ If we do not reject strict exogeneity, we can use the serial correlation
properties of u_it to choose between FE and FD. Generally it is a good
idea to do both FE and FD and report robust standard errors.
∙ If we maintain system homoskedasticity (sufficient is
Var(u_i | x_i, c_i) = Var(u_i)), then unrestricted FDGLS and FEGLS (with a
time period dropped) are asymptotically equivalent.
FE versus RE.
∙ Time-constant variables drop out of FE estimation. On the
time-varying covariates, are FE and RE so different after all? Define
the parameter
λ = 1 − [1/(1 + T σ_c²/σ_u²)]^{1/2},
and RE can be computed as the pooled OLS regression
(y_it − λ̂ ȳ_i) on (x_it − λ̂ x̄_i),  t = 1, …, T; i = 1, …, N.
∙ Call y_it − λ̂ ȳ_i a “quasi-time-demeaned” variable: only a fraction of the
mean is removed.
λ̂ ≈ 0 ⇒ β̂_RE ≈ β̂_POLS
λ̂ ≈ 1 ⇒ β̂_RE ≈ β̂_FE
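The two limiting cases can be checked directly; a numpy sketch in which one quasi-demeaning routine reproduces pooled OLS at λ = 0 and FE at λ = 1 (simulated data, with c_i correlated with x_it so the two slopes differ):

```python
import numpy as np

rng = np.random.default_rng(8)
N, T = 500, 4
c = rng.normal(size=N)
x = 0.5 * c[:, None] + rng.normal(size=(N, T))
y = 2.0 * x + c[:, None] + rng.normal(size=(N, T))

def quasi_demeaned_slope(lam):
    # pooled OLS of (y_it - lam*ybar_i) on an intercept and (x_it - lam*xbar_i)
    xq = (x - lam * x.mean(axis=1, keepdims=True)).ravel()
    yq = (y - lam * y.mean(axis=1, keepdims=True)).ravel()
    X = np.column_stack([np.ones(xq.size), xq])
    return np.linalg.solve(X.T @ X, X.T @ yq)[1]

b_pols = quasi_demeaned_slope(0.0)           # lam = 0: pooled OLS
b_fe = quasi_demeaned_slope(1.0)             # lam = 1: fixed effects

# Direct within slope for comparison
xdd = x - x.mean(axis=1, keepdims=True)
ydd = y - y.mean(axis=1, keepdims=True)
b_within = (xdd * ydd).sum() / (xdd ** 2).sum()
```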
. * Can get the quasi-time-demeaning parameter, which Stata calls “theta.”
. xtreg lfare concen ldist ldistsq y98 y99 y00, re cluster(id) theta
Testing for Serial Correlation after RE
∙ Can show that under the RE variance matrix assumptions,
r_it ≡ v_it − λ v̄_i = (1 − λ)c_i + (u_it − λ ū_i)
has constant (unconditional) variance and is serially uncorrelated.
∙ Suggests a way to test u it for serial correlation. After RE
estimation, obtain r̂ it from the regression on the quasi-time-demeaned
data, and use a standard test for, say, AR(1) serial correlation. (Can
ignore estimation of parameters.)
Efficiency of RE
∙ Can show that RE is asymptotically more efficient than FE under
RE.1, RE.2, FE.2, and RE.3. Assume, for simplicity, x it has all
time-varying elements. (See text Section 10.7.2 for more general case.)
∙ Then
Avar(β̂_FE) = σ_u² [E(Ẍ_i′ Ẍ_i)]^{−1}/N
∙ Using Σ_{t=1}^T ẍ_it = 0, we have
X̆_i′ X̆_i = Σ_{t=1}^T x̆_it′ x̆_it = Σ_{t=1}^T [ẍ_it + (1 − λ)x̄_i]′[ẍ_it + (1 − λ)x̄_i]
= Σ_{t=1}^T ẍ_it′ ẍ_it + (1 − λ)² T x̄_i′ x̄_i
= Ẍ_i′ Ẍ_i + (1 − λ)² T x̄_i′ x̄_i
so
E(X̆_i′ X̆_i) − E(Ẍ_i′ Ẍ_i) = (1 − λ)² T E(x̄_i′ x̄_i)
Testing the Key RE Assumption
∙ Recall the key RE assumption is Covx it , c i 0. With lots of good
time-constant controls (“observed heterogeneity”) might be able to
make this condition roughly true.
∙ a. The traditional Hausman Test: Compare the coefficients on the
time-varying explanatory variables, and compute a chi-square statistic.
∙ Caution: Usual Hausman test maintains RE.3 – second moment
assumptions – yet has no systematic power for detecting violations
from this assumption.
∙ With time effects, must use generalized inverse. Easy to get the
degrees of freedom wrong.
∙ b. Variable addition test. Write the model as
y_it = g_t θ + z_i γ + w_it δ + c_i + u_it.
∙ Let w_it be 1 × J. Use a correlated random effects (CRE) formulation
due to Mundlak (1978):
c_i = ψ + w̄_i ξ + a_i
E(a_i | z_i, w_i) = 0.
∙ If we substitute c_i = ψ + w̄_i ξ + a_i into the original equation we get
y_it = ψ + g_t θ + z_i γ + w_it δ + w̄_i ξ + a_i + u_it.
∙ In the CRE formulation, a (robust) Wald test of H_0: ξ = 0 is a test of
E(c_i | z_i, w_i) = E(c_i).
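A numpy illustration of the Mundlak device on simulated data (here ξ = 0.7 in the DGP; all names are mine): adding w̄_i to a pooled regression reproduces the FE coefficient on w_it, and ξ̂ is the coefficient the variable-addition test examines.

```python
import numpy as np

rng = np.random.default_rng(9)
N, T = 300, 4
base = rng.normal(size=N)
w = base[:, None] + rng.normal(size=(N, T))
c = 0.7 * w.mean(axis=1) + rng.normal(size=N)    # c_i correlated with wbar_i
y = 1.0 * w + c[:, None] + rng.normal(size=(N, T))

# Pooled OLS of y on (1, w_it, wbar_i): the Mundlak/CRE regression
wbar = np.repeat(w.mean(axis=1), T)
X = np.column_stack([np.ones(N * T), w.ravel(), wbar])
coef = np.linalg.solve(X.T @ X, X.T @ y.ravel())
b_w, xi_hat = coef[1], coef[2]

# FE (within) slope: equals the CRE coefficient on w_it
wdd = w - w.mean(axis=1, keepdims=True)
ydd = y - y.mean(axis=1, keepdims=True)
b_fe = (wdd * ydd).sum() / (wdd ** 2).sum()
```

A robust t (or Wald) statistic on ξ̂ is the regression-based test of the RE assumption; when ξ̂ is large relative to its standard error, RE is rejected in favor of FE.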
∙ Guggenberger (2010, Journal of Econometrics) has recently pointed
out the pre-testing problem in using the Hausman test to decide
between RE and FE. The regression-based version of the test shows it
is related to the classic problem of pre-testing on a set of regressors –
w̄_i in this case – in order to decide whether or not to include them.
∙ If ξ ≠ 0 but the test has low power, we will omit w̄_i when we should
include it. That is, we will incorrectly opt for RE.
∙ As always, we need to distinguish between a statistical and a practical
rejection.
Airfare Example
. * First use the Hausman test that maintains all of the RE assumptions under
. * the null and directly compares the RE and FE estimates:
. hausman b_fe b_re
chi2(4) = (b-B)'[(V_b-V_B)^(-1)](b-B) = 10.00
Prob>chi2 = 0.0405
(V_b-V_B is not positive definite)
.
. di -.0401/.0127
-3.1574803
. * This is the nonrobust Hausman t test based just on the concen variable.
. * There is only one restriction to test, not four. The p-value reported for
. * the chi-square statistic is incorrect. Notice that the rejection using the
. * correct df is much stronger than if we act as if there are four restrictions.
. * Using the same variance matrix estimator solves the problem of wrong df.
. * The next command uses the matrix of the relatively efficient estimator.
Note: the rank of the differenced variance matrix (1) does not equal the
number of coefficients being tested (4); be sure this is what you expect,
or there may be problems computing the test. Examine the output of your
estimators for anything unexpected and possibly consider scaling your
variables so that the coefficients are on a similar scale.
chi2(1) = (b-B)'[(V_b-V_B)^(-1)](b-B) = 9.89
Prob>chi2 = 0.0017
. * The regression-based test is better: it gets the df right AND is fully
. * robust to violations of the RE variance-covariance matrix:
. xtreg lfare concen concenbar ldist ldistsq y98 y99 y00, re cluster(id)
. * So the robust t statistic is 2.62 --- still a rejection, but not as strong.
. * Using the CRE formulation, we get the FE estimate on the time-varying
. * covariate concen. In this case, the coefficients on the time-constant
. * variables are close to the usual RE estimates, and even closer to the
. * POLS estimates.