Applied Econometrics 2024


Applied Econometrics

Prof. Dr. Joachim Grammig

University of Tübingen

Summer term 2024

1
Fact sheet Applied Econometrics 2024
Joachim Grammig (Lecturer) joachim.grammig@uni-tuebingen.de
Sylvia Bürger (Sek./Admin) sylvia.buerger@uni-tuebingen.de
ILIAS Password: AE24

● Lecture with embedded practical parts and presentations of selected problems (from
q4r): Monday 8-10h and Thursday 10-12h, H23 Kupferbau
● Tutorial videos on technical aspects, e.g. proofs, detailed derivations
● Practical sessions using R will be via Zoom (to make up for Thursday
public holidays)
● Questions for Review (q4r) updated each week for revision of lecture (in
pairs, group or solo)
● Practical parts using statistical software R: implement/code key concepts
● Course material will be made available on ILIAS
● ... a weekly updated time table
● ... slides (may be updated or extended)
● ... a forum for discussion
● ... tutorial videos
● ... R code from practical part
● ... pdfs of books (parts) and papers
2
Fact sheet Applied Econometrics 2024
● Recommended text books are in the Uni library with many copies
● Please check ILIAS, discussion forum, and your student email address
regularly
● Recommendation: form study groups or pairs and work regularly through
q4r and practical parts
● Work continuously, do not procrastinate
● Take intensive notes during the lecture and while working through the tutorial
videos
● Help each other out (in case a friend missed a lecture)
● You may bring excerpts of your handwritten lecture notes to the 90 min
exam (cheat sheets): five handwritten DIN A4 (or Letter)-sized pages,
which may be written on front and back, so 10 sides in total
● TIMMS videos of the lecture recorded during the pandemic (summer 2020) remain
available, but should be used mainly for revision (lecture content/focus
changes).
● My regular office hours: Wednesdays 13-14h, please contact Ms. Bürger

3
What is econometrics?

● There are several aspects of the quantitative approach to economics, and


no single one of these aspects, taken by itself, should be confounded with
econometrics.
● Thus, econometrics is by no means the same as economic statistics. Nor
is it identical with what we call general economic theory, although a
considerable portion of this theory has a definitely quantitative character.
Nor should econometrics be taken as synonymous with the application of
mathematics to economics.
● Experience has shown that each of these three view-points, that of
statistics, economic theory, and mathematics, is a necessary, but not by
itself a sufficient condition for a real understanding of the quantitative
relations in modern economic life.
● It is the unification of all three that is powerful.
And it is this unification that constitutes econometrics
Ragnar Frisch, Econometrica, (1933), 1, pp.1-2

4
Econometrics-Econ-Nobels: Gen X and Y

● 1989 TRYGVE HAAVELMO for his clarification of the probability theory


foundations of econometrics and his analyses of simultaneous economic
structures (simultaneity in econometric analysis).
● 1980 LAWRENCE R. KLEIN for the creation of econometric models and
the application to the analysis of economic fluctuations and economic
policies (large scale econometric models)
● 1969 RAGNAR FRISCH and JAN TINBERGEN for having developed and
applied dynamic models for the analysis of economic processes (basic
econometrics)

5
Econometrics-Econ-Nobels: Gen Z and Alpha

● 2021 JOSHUA ANGRIST and GUIDO IMBENS for establishing new


methods of conducting natural experiments in economics studies using
data in which otherwise similar groups of people are separated by crucial
variables - allowing researchers to better understand cause and effect in
complex social situations
● 2013 LARS PETER HANSEN for the empirical analysis of asset pricing
(but really for the development of the Generalized Method of Moments)
● 2011 THOMAS J. SARGENT and CHRISTOPHER A. SIMS for their
empirical research on cause and effect in the macroeconomy
● 2003 ROBERT F. ENGLE for methods of analyzing economic time series
with time-varying volatility and CLIVE W. J. GRANGER, for methods of
analyzing economic time series with common trends and cointegration
● 2000 JAMES J. HECKMAN for his development of theory and methods
for analyzing selective samples and DANIEL L. MCFADDEN for his
development of theory and methods for analyzing discrete choice
(Microeconometrics)

6
Recommended texts

● Hansen, B. (2022): Econometrics. Princeton University Press.


● Angrist, J. and Pischke, J. S. (2008): Mostly Harmless Econometrics.
Princeton University Press.
● Hayashi, F. (2000): Econometrics. Princeton University Press.
● Wooldridge, J. (2019): Introductory Econometrics - A Modern Approach. 7th
ed. Cengage Learning EMEA

(available in Uni lib and (parts) as pdfs on Ilias)

7
Revise and to dos

1 Review mathematical statistics (Probability and Risk, P & R): random
variables, distribution and density functions, expected values (mean,
variance, moments), orthogonality, moment conditions/restrictions, joint
distributions, independence, covariance and correlation, conditional
probability/density and conditional distributions, conditional expectation,
properties of the multivariate normal distribution. (Math. appendix of J.
Hamilton, Time Series Analysis, Princeton 1994, p. 739 ff., available on
Ilias)
2 Review basic linear algebra: matrix multiplication, inverse, definiteness,
and rank of a matrix (W. Greene, Econometrics, Matrix Appendix,
available on Ilias) Review statistical testing (QM) and OLS algebra
(EDA/QM)
3 Work through Easy Pieces in Statistics (Ilias)
4 Download presentation by Prof. Philipp Harms (Ilias)

8
Table of Contents (may be modified)

1 Six Justifications for Linear Regression

2 Parameter Estimation

3 Finite Sample Properties of OLS

4 Hypothesis Testing under Normality

5 Goodness-of-fit Measures

6 Large Sample Theory and OLS

9
Table of Contents (may be modified)

1 Time Series Basics

2 Generalized Least Squares (GLS)

3 Multicollinearity

4 Endogeneity: Problem and Solutions

5 Instrumental Variables

10
1. Six Justifications for Linear Regression

Angrist and Pischke, 2008, Ch. 1/2

11
Six justifications for linear regression

1 Structural model suggested/derived from theory

2 Population regression (pure statistical motivation)

3 Linear conditional expectation function (CEF)

4 Smallest mean squared error (MSE) approximation to a nonlinear CEF

5 Smallest MSE prediction of dependent variable using a linear forecast


function

6 (Rubin) causal model

12
Justification A: structural economic model
Regression equation derived from economic/finance theory

dependent variable = constant + β × “key” regressor(s)
[ + γ × “control” variables ]
+ unobservable component / “residual”

● unobservable component has economic meaning, a “life of its own”


● Parameters β have a structural interpretation
● return on education
● price elasticity (of demand or supply)
● marginal propensity to consume
● Testing of economic hypotheses:
● return on education (> market interest rate ?)
● price elasticity of demand (> 1?)
● marginal propensity to consume (equal one ?)

13
Example 1: Supply and demand functions

Simultaneous equations model of market equilibrium (structural form):


q_i^d = α_0 + α_1 p_i + u_i   (linear demand function)
q_i^s = β_0 + β_1 p_i + v_i   (linear supply function)

q_i^d = q_i^s   (market clearing condition)

u_i : demand shock, v_i : supply shock; α_1 and β_1 : price sensitivities of the
demand and supply functions

● Estimate α_0 , α_1 (−), β_0 , β_1 (+) by OLS?
● Regress demand on prices and supply on prices?
● We observe equilibria q_i^d = q_i^s ; can we estimate the slopes of the demand
and the supply curves from the data?

14
Example 2: Glosten-Harris model
● How do financial asset prices evolve?
(Journal of Financial Economics, 1988, 21 (1), pp.123-142)
● Importance of public and private information on price formation
Ingredients and notation:
● market maker (MM): sets bid (buy) and ask (sell) quotes
● traders: buy from/sell to MM at prevailing quotes
● trade (transaction) events indexed by i = 1, . . .
● Efficient price: mi , incorporates all public and private info
● Transaction price: Pi , per share, of ith trade
● Pia (Pib ) prevailing ask (bid) quote before (!) ith trade


● Indicator of transaction type: Q_i = +1 for a buyer-initiated trade,
  Q_i = −1 for a seller-initiated trade

● Trade volume of ith transaction: vi

15
Glosten-Harris model (2)
● Efficient price:
mi = µ + mi−1 + εi + Qi zi , where zi = z0 + z1 vi
● Drift parameter: µ
● new public information accumulated since (i − 1)th trade: εi
● Private information conveyed through trade: Qi zi
● MM sets bid and ask quotes anticipating price impact on m:
MM’s sell price (ask): Pia = µ + mi−1 + εi + zi + c
MM’s buy price (bid): Pib = µ + mi−1 + εi − zi − c
● (Opportunity) costs of MM: c (per share)
⇒ Transaction price change

∆Pi = µ + z0 Qi + z1 vi Qi + c∆Qi + εi

Goal: Estimation of structural parameters µ, z0 , z1 , c

16
Example 3: Mincer equation

Derived from Human Capital theory:

ln(WAGEi ) = β1 + β2 Si + β3 TENUREi + β4 EXPRi + εi

Ingredients and notation:


● Logarithm of the wage rate: ln(WAGEi )
● Years of schooling: Si
● Experience in the current job: TENUREi
● Experience in the labor market: EXPRi
● Unobserved individual effects (“residual”): εi

⇒ β2 : return to schooling

17
Example 4: Linear factor asset pricing models

Asset pricing theory postulates:



E(R_t^ej) = β^j × λ

R_{t+1}^j = x_{t+1}^j / p_t^j = (p_{t+1}^j + d_{t+1}^j) / p_t^j   (return)

R_t^ej = R_t^j − R_t^f : excess return of asset j
f_t = (f_t^1, ..., f_t^K)′ : K risk factors
λ = (λ_1, ..., λ_K)′ : prices of the risk factors (proportional to)
β^j = (β_1^j, ..., β_K^j)′ : exposures of asset j to the K factor risks

18
Example 4: Linear factor asset pricing models

Asset pricing theory:



E(R_t^ej) = β^j × λ

with a single risk factor f_t = f_t^1 (e.g. CAPM):

β^j = Cov(R_t^ej, f_t) / Var(f_t)

with K risk factors, f_t = (f_t^1, ..., f_t^K)′ :

β^j = E(f_t f_t′)^{−1} E(R_t^ej f_t)   (population regression coefficients, see below)

19
CAPM and Fama-French model

CAPM: f_t = R_t^em = R_t^m − R_t^f   (“excess return of the market portfolio”)

Fama-French model: f_t = (R_t^em, HML_t, SMB_t)′

f_t contains excess returns → λ = [E(f_t^1), ..., E(f_t^K)]′

E(R_t^ej) = β^j E(R_t^em)   (CAPM)

E(R_t^ej) = β_1^j E(R_t^em) + β_2^j E(SMB_t) + β_3^j E(HML_t)   (FF)

20
“Compatible regression” (for Fama-French model)

R_t^ej = α^j + β_1^j R_t^em + β_2^j SMB_t + β_3^j HML_t + ε_t

Taking expectations:

E(R_t^ej) = α^j + β_1^j E(R_t^em) + β_2^j E(SMB_t) + β_3^j E(HML_t)

Implied by asset pricing theory:

E(R_t^ej) = β_1^j E(R_t^em) + β_2^j E(SMB_t) + β_3^j E(HML_t)

● no constant in regression equation


● moment restrictions:
E(εt ) = 0, E(HMLt εt ) = 0, E(SMBt εt ) = 0, E(Rtem εt ) = 0
● β j are population regression coefficients

21
Justification B: Population regression

Y : dependent variable (e.g. wage)
X : random vector (K × 1) of explanatory variables (regressors)
    (e.g. gender, age, union, experience)

f_YX : joint density in the population

Y_i , X_i : i-th draw from the population

f_YX = f_{Y_i X_i}   ∀ i

22
Justification B: Population regression

Population regression coefficients (PRC) from

β̆ = argmin_{β̃} E[ (Y_i − X_i′ β̃)² ]

PRC β̆ solve the F.O.C.   E[ X_i (Y_i − X_i′ β̆) ] = 0 ,   where ε̆_i = Y_i − X_i′ β̆

⇒ β̆ = E(X_i X_i′)^{−1} E(X_i Y_i)

Y_i = X_i′ β̆ + ε̆_i
ε̆_i : ● population regression residual
      ● constructed as ε̆_i = Y_i − X_i′ β̆ , “no life of its own”

Interpretation of β̆?
(Angrist/Pischke notation: β̃ = b, β̆ = β, ε̆_i = e_i)

23
For one constant and single regressor

X_i = (1, X_i2)′ ,   β̆ = (β̆_1 , β̆_2)′

⇒ β̆ = E(X_i X_i′)^{−1} E(X_i Y_i) = ( E(Y_i) − β̆_2 E(X_i2) ,  Cov(Y_i, X_i2) / Var(X_i2) )′


Population regression = linear projection

PRC = projection coefficients

24
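A minimal R sketch in the style of the practical parts (not from the slides): the population regression coefficients β̆ = E(X_iX_i′)^{−1}E(X_iY_i) are approximated by their sample analogues on a large simulated sample; the data-generating process and all parameter values are hypothetical.

```r
# Approximate population regression coefficients by sample moments (simulated data).
set.seed(1)
n  <- 1e6
x2 <- rnorm(n, mean = 2, sd = 1)
y  <- 1 + 0.5 * x2 + 0.3 * x2^2 + rnorm(n)        # E(Y|X) is nonlinear in x2

X <- cbind(1, x2)                                  # regressor vector with a constant
b_breve <- solve(crossprod(X), crossprod(X, y))    # sample analogue of E(XX')^{-1} E(XY)

# Bivariate formulas from the slide: slope = Cov(Y, X2)/Var(X2),
# intercept = E(Y) - slope * E(X2)
slope     <- cov(y, x2) / var(x2)
intercept <- mean(y) - slope * mean(x2)
rbind(moment_formula = drop(b_breve), bivariate_formula = c(intercept, slope))
```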
“Regression anatomy” formula (Frisch-Waugh)

For X_i = (1, X_i2, ..., X_ik, ..., X_iK)′
and β̆ = (β̆_1, β̆_2, ..., β̆_k, ..., β̆_K)′

β̆_k = Cov(Y_i, X̆_ik) / Var(X̆_ik)   (bivariate regression)

X̆_ik : residual from the population regression of X_ik (dependent variable)
       on X_i.k (including a constant)
X_i.k : X_i without X_ik

X_ik = γ̆_.k′ X_i.k + X̆_ik   with   γ̆_.k = E(X_i.k X_i.k′)^{−1} E(X_i.k X_ik)

25
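A minimal R sketch of the regression-anatomy formula on simulated data (variable names and coefficients are hypothetical): the coefficient on a regressor in the long regression equals the bivariate coefficient of Y on that regressor's residual.

```r
# Frisch-Waugh / regression anatomy check on simulated data.
set.seed(2)
n  <- 1e5
x2 <- rnorm(n)
x3 <- 0.6 * x2 + rnorm(n)
y  <- 1 + 2 * x2 - 1.5 * x3 + rnorm(n)

full   <- lm(y ~ x2 + x3)
x3_res <- resid(lm(x3 ~ x2))                         # x3 purged of the other regressor(s)
c(long_regression   = unname(coef(full)["x3"]),
  anatomy_formula   = cov(y, x3_res) / var(x3_res))  # both approx -1.5
```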
Important laws
● Law of Total Expectation (LTE):

EX [EY ∣X (Y ∣X )] = EY (Y )

● Double Expectation Theorem (DET):

EX [EY ∣X (g(Y )∣X )] = EY [g(Y )]

● Law of Iterated Expectations (LIE):

EZ ∣X [EY ∣X ,Z (Y ∣X , Z )∣X ] = EY ∣X (Y ∣X )

● Generalized DET:

EX [EY ∣X (g(X , Y ))∣X ] = EX ,Y [g(X , Y )]

● Linearity of Conditional Expectations:

EY ∣X [g(X )Y ∣X ] = g(X )EY ∣X [Y ∣X ]

26
Justification C: linear cond. expectation function

CEF E(Yi ∣Xi ) = f (Xi ) [resp. f (xi )]

CEF decomposition property: orthogonal decomposition

Yi = E(Yi ∣Xi ) + ε∗i


ε∗i = Yi − E(Yi ∣Xi ) constructed; “no life of its own”
E(ε∗i ∣Xi ) =0 ⇒ E(ε∗i Xi ) =0

Marginal effect:

∂E(Y_i | X_i = x_i) / ∂x_ik   can be nonlinear in x_ik

(Angrist/Pischke notation: ε*_i = ε_i)

27
Justification C: linear cond. expectation function

Angrist/Pischke, Mostly Harmless Econometrics: An Empiricist’s Companion, 2008, p.31

28
Justification C: linear cond. expectation function

If E(Yi ∣Xi ) = Xi′ β ∗

with β ∗ = E(Xi Xi′ )−1 E(Xi Yi ) = β̆

Yi = Xi′ β ∗ + ε∗i

= Xi′ β̆ + ε̆i

ε∗i = ε̆i and β ∗ = β̆

E(ε∗i ∣Xi ) = 0 ⇒ E(ε∗i Xi ) = 0


E(ε∗i Xi ) = 0 ⇏ E(ε∗i ∣Xi ) = 0

Interpretation of the PRC as marginal effects: ∂E(Y_i | X_i = x_i)/∂x_ik = β*_k = β̆_k

29
Justification D: best approximation to nonlin. CEF

Best (smallest MSE) linear approximation to a nonlinear CEF:

E_{Y|X}(Y_i | X_i) ≈ X_i′ β̆ ,   β̆ = E(X_i X_i′)^{−1} E(X_i Y_i)

from

argmin_{β̃} E_X [ ( E_{Y|X}(Y_i | X_i) − X_i′ β̃ )² ]
            (the term in parentheses is the approximation error)

30
Justification D: best approximation to nonlin. CEF

Angrist/Pischke, Mostly Harmless Econometrics: An Empiricist’s Companion, 2008, p.39

31
Justification D: best approximation to nonlin. CEF

Again:
Y_i = X_i′ β̆ + ε̆_i   (population regression)

here: β̆ used in E_{Y|X}(Y_i | X_i) ≈ X_i′ β̆   [or x_i′ β̆ respectively]

∂E_{Y|X}(Y_i | X_i = x_i) / ∂x_ik ≈ β̆_k   (approximate marginal effect)

32
Justification E: Optimal prediction

Goal: minimize the MSE of the prediction of Y_i using X_i

argmin_{m(X_i)} E_{XY} [ ( Y_i − m(X_i) )² ]
                (m(X_i): function used to forecast Y_i; the term in parentheses is the forecast error)

mean-square optimal function m(X_i):

⇒ m(X_i) = E_{Y|X}(Y_i | X_i)

33
Justification E: Optimal prediction
If only linear m(X_i) are used:

β̆ = argmin_{β̃} E_{XY} [ ( Y_i − X_i′ β̃ )² ]
                (ε̃ : forecast error)

solution to the F.O.C. yields

β̆ = E(X_i X_i′)^{−1} E(X_i Y_i)   (PRC)

Y_i = X_i′ β̆ + ε̆_i as in B, but β̆ interpreted as “linear prediction coefficients”.

ε̆_i = Y_i − X_i′ β̆ : “orthogonal forecast error”

Y_i = linear prediction + orthogonal forecast error = X_i′ β̆ + ε̆_i

34
Justification F: (Rubin’s) causal model

Basic concepts and notation


Ci binary treatment indicator
Ci = 1 i th individual drawn received treatment (small class, studied
econometrics)
Ci = 0 i th individual drawn received no treatment
Yi observed (actual) outcome (SAT score, wage)
Y1i potential outcome if i received treatment
Y0i potential outcome if i received no treatment
Y1i − Y0i causal effect of treatment
Problem: either Y1i or Y0i observed - not both

35
Justification F: (Rubin’s) causal model

Y_i = Y_0i + (Y_1i − Y_0i) ⋅ C_i ,   where (Y_1i − Y_0i) is the (causal) treatment effect

Y_0i : “counterfactual” for a treated i

E(Y_i | C_i = 1) − E(Y_i | C_i = 0)                       (difference in population averages*)
   = E(Y_1i − Y_0i | C_i = 1)                             (average treatment effect on the treated, ATET**)
   + E(Y_0i | C_i = 1) − E(Y_0i | C_i = 0)                (selection bias***)

* estimated by group sample means
** causal effect, but unobserved
*** unobserved, since E(Y_0i | C_i = 1) is counterfactual

36
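A minimal R sketch with hypothetical potential outcomes that reproduces the decomposition above: the naive difference in group means equals the ATET plus selection bias.

```r
# Naive group comparison = ATET + selection bias (simulated potential outcomes).
set.seed(3)
n       <- 1e6
ability <- rnorm(n)
y0      <- 1 + ability + rnorm(n)               # potential outcome without treatment
y1      <- y0 + 0.5                             # constant treatment effect of 0.5
treat   <- as.integer(ability + rnorm(n) > 0)   # treatment more likely for high ability
y       <- y0 + (y1 - y0) * treat               # observed outcome

naive <- mean(y[treat == 1]) - mean(y[treat == 0])
atet  <- mean(y1[treat == 1] - y0[treat == 1])
bias  <- mean(y0[treat == 1]) - mean(y0[treat == 0])
c(naive = naive, atet = atet, selection_bias = bias, atet_plus_bias = atet + bias)
```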
Conditional independence assumption (CIA)

f (Y0i , Ci ∣Zi ) = f (Y0i ∣Zi ) ⋅ f (Ci ∣Zi )


f (Y1i , Ci ∣Zi ) = f (Y1i ∣Zi ) ⋅ f (Ci ∣Zi )

Y0i ⊥⊥ Ci ∣Zi

Y1i ⊥⊥ Ci ∣Zi

E(Y0i ∣Ci = 1, Zi ) = E(Y0i ∣Ci = 0, Zi ) = E(Y0i ∣Zi )



E(Y1i ∣Ci = 1, Zi ) = E(Y1i ∣Ci = 0, Zi ) = E(Y1i ∣Zi )

37
Conditional independence assumption (CIA)

E(Y_i | C_i = 1, Z_i) − E(Y_i | C_i = 0, Z_i)
   = E(Y_1i − Y_0i | C_i = 1, Z_i)                         (conditional average treatment effect on the treated, CATET, conditional on Z)
   + E(Y_0i | C_i = 1, Z_i) − E(Y_0i | C_i = 0, Z_i)       (= 0 because of the CIA)

38
CIA can remove selection bias: matching estimator

Group the sample by identical Z variables (ordinal); estimate the CATET by group
average differences of treated and untreated.
Compute the ATET by

E_{Z|C=1} ( E(Y_1i − Y_0i | C_i = 1, Z_i) | C_i = 1 )
   = E_{Z|C=1} ( E(Y_i | C_i = 1, Z_i) | C_i = 1 ) − E_{Z|C=1} ( E(Y_i | C_i = 0, Z_i) | C_i = 1 )
   = E(Y_1i − Y_0i | C_i = 1) = ATET   (by the LIE)

estimate by weighting the estimated CATETs by the group frequencies

(Angrist/Pischke notation: X instead of Z)

39
CIA can remove selection bias: matching estimator

Angrist/Pischke, Mostly Harmless Econometrics: An Empiricist’s Companion, 2008, p.73

40
Causal regression

Ysi = fi (S ) potential outcome


fi (S ) = α + ρS + ηi causal model

● Si : observed (chosen) treatment


● Yi : observed outcome
Yi = α + ρSi + ηi
● ρ: causal effect of increase of S by one unit

S ∈ {0, 1} ⇒ Y1i = α + ρ + ηi Y0i = α + ηi Y1i − Y0i = ρ

41
Causal regression

Yi = α + ρSi + ηi
Problem if ηi and Si correlated

Population regression:

Yi = β̆1 + β̆2 Si + ε̆i

Cov(Yi , Si )
β̆2 = ≠ρ
Var(Si )

(ε̆_i and S_i are uncorrelated by construction, i.e. E(S_i ε̆_i) = 0)

42
Causal regression

Ysi ⊥⊥ Si ∣Ai ∀ S (→ ηi ⊥⊥ Si ∣Ai and νi ⊥⊥ Si ∣Ai )

If CIA holds and E[ηi ∣Ai ] = γ̆ ′ Ai :

Ysi = fi (S ) = α + ρS + A′i γ̆ + νi
Yi = α + ρSi + A′i γ̆ + νi

Si ∶ years schooling
e.g.
Ai ∶ ability variables

population regression of Y_i on (S_i , A_i′)′ (long regression) has a causal interpretation.

43
Long and short regression and OVB

Short (population) regression: Yi on Si (only)


Cov(Yi , Si )
Yi = αs + ρs Si + ε̆i PRC: ρs =
Var(Si )
Insert Yi from “long regression” in numerator

⇒ ρs = ρ + γ̆ ′ δAS

Cov(Ai1 , Si ) Cov(Ai2 , Si ) Cov(AiM , Si )
δAS = ( , ,..., )
Var(Si ) Var(Si ) Var(Si )
● δAS : vector of slope coefficients of regression (including constant) of each
element of Ai on Si .
● A variable that affects Ysi (via ηi ) and that is correlated with Si should
be included in Ai . Else: Omitted Variable Bias (OVB)!
● Short population regression coefficient ρs = ρ only if γ̆ = 0 (control
variables don’t affect outcome) or δAS = 0 (control variables and Si
uncorrelated).

44
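A minimal R sketch of the OVB formula on simulated data (ability, schooling and all coefficients are hypothetical); the short-regression coefficient reproduces the long-regression coefficient plus γ̂ × δ̂_AS exactly in sample.

```r
# Omitted variable bias: short coefficient = long coefficient + gamma * delta_AS.
set.seed(4)
n <- 1e5
a <- rnorm(n)                        # "ability" (the omitted control)
s <- 0.8 * a + rnorm(n)              # schooling, correlated with ability
y <- 1 + 0.1 * s + 0.5 * a + rnorm(n)

long  <- lm(y ~ s + a)
short <- lm(y ~ s)
delta <- coef(lm(a ~ s))["s"]        # slope of the omitted regressor on s
c(short_coef = unname(coef(short)["s"]),
  ovb_formula = unname(coef(long)["s"] + coef(long)["a"] * delta))  # identical
```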
Long and short regression and OVB

Angrist/Pischke, Mostly Harmless Econometrics: An Empiricist’s Companion, 2008, p.62

45
Epistemological problems

● non-experimental data

● unobservable variables

● endogeneity

● causality

● simultaneity

46
Notation of the different justifications

1 population

justification of linear regression      Angrist/Pischke     Script / QM
A  structural model                     —                   β, γ, ε / β, δ, ε
B  population regression                β, e                β̆, ε̆
C  linear CEF                           β*, ε*              β*, ε*
D  best approx. to nonlinear CEF        β, e (+)            β̆, ε̆ (+)
E  optimal prediction                   β (+)               β̆, ε̆ (+, ++)

+ same as in population regression


++ referred to as “linear prediction coeff.”, “orthogonal forecast error”

47
Notation of the different justifications

2 objective function

justification of linear regression      Angrist/Pischke                          Script

B  population regression                argmin_{b} E[(Y − X′b)²]                 argmin_{β̃} E[(Y − X′β̃)²]

D  best approx. to nonlinear CEF        argmin_{b} E[(E(Y|X) − X′b)²]            argmin_{β̃} E_X[(E(Y|X) − X′β̃)²]
                                        E(Y_i|X_i) ≈ X_i′β                        E(Y_i|X_i) ≈ X_i′β̆

E  optimal prediction                   argmin_{m(X_i)} E[(Y_i − m(X_i))²]        argmin_{m(X_i)} E_{YX}[(Y_i − m(X_i))²]
                                        m(X_i) = X_i′β                            m(X_i) = X_i′β̆

48
2. Parameter Estimation

Hayashi p. 3-18

49
Change in Notation
● random variable Z_i :        Hayashi: z_i     Angrist & Pischke: Z_i     QM: Z_i

● realization of Z_i :         Hayashi: z_i     Angrist & Pischke: z_i     QM: z_i

● vectors of random variables: Hayashi: x_i = (x_i1, x_i2, ..., x_iK)′
                               Angrist & Pischke: X_i = (X_i1, X_i2, ..., X_iK)′
                               QM: x_i = (X_i1, X_i2, ..., X_iK)′
                               with a constant: X_i1 = 1 or x_i1 = 1

● parameter estimate:          Hayashi: b       Angrist & Pischke: β̂       QM: β̂

The script from here on uses Hayashi notation.

50
Linear regression model (CLRM) à la Hayashi

yi = β1 xi1 + β2 xi2 + ... + βK xiK + εi = x′i ⋅ β + εi


(1 × K ) (K × 1)

● yi : Dependent variable, observed


● xi = (xi1 , xi2 , ..., xiK )′ : Explanatory variables, observed
● β = (β1 , β2 , ..., βK )′ : Parameters
● εi : “Disturbance” component, unobserved
● b = (b1 , b2 , ..., bK )′ estimate of β
● ei = yi − x′i b: (estimated) residual

51
Introduction of matrix notation

For convenience we introduce matrix notation:

y = X β + ε ,   with y (n × 1), X (n × K), β (K × 1), ε (n × 1)

y = (y_1, y_2, ..., y_n)′ ,  β = (β_1, β_2, ..., β_K)′ ,  ε = (ε_1, ε_2, ..., ε_n)′ ,
and X is the data matrix with rows x_i′ = (1, x_i2, x_i3, ..., x_iK).

52
System of linear equations

written extensively:

y_1 = β_1 + β_2 x_12 + . . . + β_K x_1K + ε_1
y_2 = β_1 + β_2 x_22 + . . . + β_K x_2K + ε_2
⋮
y_n = β_1 + β_2 x_n2 + . . . + β_K x_nK + ε_n

53
Four classical assumptions

(1.1) Linearity: yi = x′i β + εi or y = Xβ + ε

(1.2) Strict exogeneity: E(εi ∣X) = 0


⇒ E(εi ) = 0 and Cov(εi , xik ) = E(εi xik ) = 0

(1.3) No exact multicollinearity: P(rank(X) = K ) = 1


⇒ No linear dependencies in the data matrix

(1.4) Spherical disturbances: Var(ε_i | X) = E(ε_i² | X) = σ²  and
      Cov(ε_i, ε_j | X) = E(ε_i ε_j | X) = 0 for i ≠ j
      ⇒ E(ε_i²) = σ² and Cov(ε_i, ε_j) = 0 by the LTE

54
(Somewhat sloppy) interpretations of parameters

Interpreting the parameters β of different types of linear equations


● Linear model yi = β1 + β2 xi2 + ... + βK xiK + εi : A one unit c.p. increase in
the variable xik increases the conditional expected value of the dependent
variable by βk units
● Semi-log model ln(yi ) = β1 + β2 xi2 + ... + βK xiK + εi : A one unit c.p.
increase in the variable xk increases the conditional expected value of the
dependent variable approximately by 100 ⋅ βk percent
● Log linear model ln(yi ) = β1 ln(xi1 ) + β2 ln(xi2 ) + ... + βK ln(xiK ) + εi : A
one percent increase in xik c.p. increases the conditional expected value
of the dependent variable yi approximately by βk percent

55
Estimation via minimization of SSR
We estimate the linear model and choose b such that SSR is minimized
Obtain an estimate b of β by minimizing the SSR (sum of squared residuals):

argmin_{β̃} SSR(β̃) = argmin_{β̃} Σ_{i=1}^n (y_i − x_i′ β̃)²

Differentiation with respect to β̃_1, β̃_2, ..., β̃_K ⇒ FOCs:

∂SSR(β̃)/∂β̃_1 = 0  ⇒  Σ (y_i − x_i′ b) = 0 ,   i.e.  (1/n) Σ e_i = 0
⋮
∂SSR(β̃)/∂β̃_K = 0  ⇒  Σ (y_i − x_i′ b) x_iK = 0 ,   i.e.  (1/n) Σ e_i x_iK = 0

⇒ The FOCs can be conveniently written in matrix notation as (1/n) X′e = 0

56
Estimation via minimization of SSR
The system of K equations is solved by matrix algebra

X′ e = X′ (y − Xb) = X′ y − X′ Xb = 0

Premultiplying by (X′ X)−1 :

(X′ X)−1 X′ y − (X′ X)−1 X′ Xb = 0

(X′ X)−1 X′ y − IK b = 0

OLS-estimator:

b = (X′ X)−1 X′ y

Alternatively:
b = ( (1/n) X′X )^{−1} (1/n) X′y = ( (1/n) Σ_{i=1}^n x_i x_i′ )^{−1} (1/n) Σ_{i=1}^n x_i y_i

57
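A minimal R sketch in the practical-part style (simulated data, hypothetical coefficients): computing b = (X′X)^{−1}X′y directly with matrix algebra and checking it against R's lm().

```r
# OLS estimator via matrix algebra versus lm().
set.seed(5)
n  <- 500
x2 <- rnorm(n)
x3 <- runif(n)
X  <- cbind(1, x2, x3)
y  <- drop(X %*% c(1, 2, -3)) + rnorm(n)

b_manual <- solve(crossprod(X), crossprod(X, y))   # (X'X)^{-1} X'y
b_lm     <- coef(lm(y ~ x2 + x3))
cbind(manual = drop(b_manual), lm = b_lm)          # identical up to numerical precision
```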
Zooming in

b = ( (1/n) X′X )^{−1} (1/n) X′y = ( (1/n) Σ_{i=1}^n x_i x_i′ )^{−1} (1/n) Σ_{i=1}^n x_i y_i

(1/n) Σ_{i=1}^n x_i x_i′ : K × K matrix of sample second moments of the regressors,
with typical element (1/n) Σ_i x_ik x_il (diagonal elements (1/n) Σ_i x_ik²)

(1/n) Σ_{i=1}^n x_i y_i : K × 1 vector of sample cross moments,
with typical element (1/n) Σ_i x_ik y_i

58
3. Finite sample properties of OLS

Hayashi p. 27-31

59
Finite sample properties of b = (X′ X)−1 X′ y

1 E(b) = β: Unbiasedness of OLS


● Holds for any sample size
● Holds under assumptions 1.1 - 1.3

2 Var(b∣X) = σ 2 (X′ X)−1 : Conditional variance of b


● Conditional variance depends on the data
● Holds under assumptions 1.1 - 1.4

3 Var(β̂∣X) ≥ Var(b∣X)
● β̂ is any other linear unbiased estimate of β
● Holds under assumptions 1.1 - 1.4

60
An important result from mathematical statistics

z = (z_1, z_2, ..., z_n)′ an (n × 1) random vector, A an (m × n) matrix of constants a_jk .

A new random vector: v = A z   (m × 1)

E(v) = ( E(v_1), E(v_2), ..., E(v_m) )′ = A E(z)

Var(v) = A Var(z) A′   (m × m)

61
Unbiasedness of OLS

E(b) = β ⇒ E(b − β) = 0
sampling error

b−β = (X′ X)−1 X′ y − β


= (X′ X)−1 X′ (Xβ + ε) − β
= (X′ X)−1 X′ Xβ + (X′ X)−1 X′ ε − β
= β + (X′ X)−1 X′ ε − β
= (X′ X)−1 X′ ε

⇒ E(b − β∣X) = (X′ X)−1 X′ E(ε∣X) = 0 under assumption 1.2

⇒ E(b∣X) = E(β∣X)

⇒ EX (E(b∣X)) = E(b) = EX (β) = β by the LTE

62
We show that Var(b∣X) = σ 2 (X′ X)−1

Var(b∣X) = Var(b − β∣X)


= Var((X′ X)−1 X′ ε∣X) = Var(Aε∣X)
= AVar(ε∣X)A′ = Aσ 2 In A′
= σ 2 AIn A′ = σ 2 AA′
= σ 2 (X′ X)−1 X′ X(X′ X)−1 = σ 2 (X′ X)−1

Note:
● β non-random
● b − β sampling error
● A = (X′ X)−1 X′
● Var(ε∣X) = σ 2 In

63
Sketch proof of the Gauss Markov theorem

Var(β̂∣X) = Var(β̂ − β∣X) = Var[(D + A)ε∣X]


= (D + A)Var(ε∣X)(D′ + A′ ) = σ 2 (D + A)(D′ + A′ )
= σ 2 (DD′ + AD′ + DA′ + AA′ ) = σ 2 [DD′ + (X′ X)−1 ]
≥ σ 2 (X′ X)−1 = Var(b∣X)

where
● C is a function of X
● β̂ = Cy
● D=C−A
● A ≡ (X′ X)−1 X′

Details of proof: Hayashi pages 29 - 30

64
OLS is BLUE

● OLS is linear
⇒ Holds under assumption 1.1

● OLS is unbiased
⇒ Holds under assumption 1.1 - 1.3

● OLS is the best (minimum variance) linear unbiased estimator

⇒ Holds under assumptions 1.1 - 1.4 (Gauss-Markov theorem): Var(β̂|X) ≥ Var(b|X)

65
4. Hypothesis Testing under Normality

Hayashi p. 33-45

66
Hypothesis testing

Economic theory provides hypotheses about parameters:


● theory ⇒ testable implications
● But: Hypotheses cannot be tested without distributional assumptions
about ε

Distributional assumption:

1.5 Conditional normality: ε∣X ∼ N (0, σ 2 In )

Normality assumption about the conditional distribution of ε∣X

67
Important facts from multivariate statistics

Vector of random variables: x = (x1 , x2 , ..., xn )′

Expectation vector:

E(x) = µ = (µ1 , µ2 , ..., µn )′ = (E(x1 ), E(x2 ), ..., E(xn ))′

Variance-covariance matrix:

⎛ Var(x1 ) Cov(x1 , x2 ) ... Cov(x1 , xn ) ⎞


⎜ Cov(x1 , x2 ) Var(x2 ) ⎟
Var(x) = Σ = ⎜ ⎟
⎜ ⋮ ⋱ ⋮ ⎟
⎝ Cov(x1 , xn ) ... Var(xn ) ⎠

y = c + Ax; c, A non-random vector/matrix


⇒ E(y) = (E(y1 ), E(y2 ), ..., E(yn ))′ = c + Aµ
⇒ Var(y) = AΣA′
⇒ x ∼ N (µ, Σ) ⇒ y = c + Ax ∼ N (c + Aµ, AΣA′ )

68
Apply facts from mult. statistics and A1.1 - A1.5

b−β = (X′ X)−1 X′ ε


²
sampling error

Assuming
ε∣X ∼ N (0, σ 2 In )
⎛ ⎞
⇒ b − β∣X ∼ N ⎜
⎜(X′
X)−1 ′
X E(ε∣X) , (X′
X) −1 ′ 2
X σ In X(X′
X) −1 ⎟

⎝ ´¹¹ ¹ ¹ ¹ ¸¹ ¹ ¹ ¹ ¹ ¶ ⎠
0

⇒ b − β∣X ∼ N (0, σ 2 (X′ X)−1 )

Note that Var(b∣X) = σ 2 (X′ X)−1

OLS-estimate is conditionally normally distributed if ε∣X is multivariate normal.

69
Testing individual parameters (t-Test)

Null hypothesis: H0 ∶ βk = β̄k

Alternative hypothesis: HA ∶ βk ≠ β̄k

where β̄k is a hypothesized value, a real number

If H0 is true, then E(bk ) = β̄k and the test statistic

t_k = (b_k − β̄_k) / √( σ² [(X′X)^{−1}]_kk )  ∼ N(0, 1)

[(X′X)^{−1}]_kk : k-th row, k-th column element of (X′X)^{−1}

70
Nuisance parameter σ 2

Nuisance parameter σ 2 can be estimated:

σ 2 = E(ε2i ∣X) = Var(εi ∣X) = E(ε2i ) = Var(εi )

We don’t know εi but we use the estimate ei = yi − x′i b

σ̂² = (1/n) Σ_{i=1}^n ( e_i − (1/n) Σ_{i=1}^n e_i )² = (1/n) Σ_{i=1}^n e_i² = (1/n) e′e

σ̂² is a biased estimate:

E(σ̂² | X) = ((n − K)/n) σ²

71
An unbiased estimate of σ 2

For s² = (1/(n − K)) Σ_{i=1}^n e_i² = (1/(n − K)) e′e we get an unbiased estimate:

E(s² | X) = (1/(n − K)) E(e′e | X) = σ²

E_X ( E(s² | X) ) = E(s²) = σ²

Using s² for σ² provides an unbiased estimate of Var(b|X) = σ² (X′X)^{−1} :

V̂ar(b|X) = s² (X′X)^{−1}

⇒ t-statistic under H0:

t_k = (b_k − β̄_k) / √( [V̂ar(b|X)]_kk ) = (b_k − β̄_k) / s.e.(b_k)  ∼ t(n − K)

72
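A minimal R sketch (simulated data, hypothetical coefficients) that reproduces s², the standard errors and the t-statistics reported by summary(lm()) from the formulas above.

```r
# s^2, standard errors and t-statistics by hand.
set.seed(6)
n  <- 200
x2 <- rnorm(n)
y  <- 1 + 0.5 * x2 + rnorm(n)
X  <- cbind(1, x2)
K  <- ncol(X)

b  <- solve(crossprod(X), crossprod(X, y))
e  <- y - X %*% b
s2 <- sum(e^2) / (n - K)                   # unbiased estimate of sigma^2
V  <- s2 * solve(crossprod(X))             # estimated Var(b|X)
se <- sqrt(diag(V))
t  <- drop(b) / se                         # t-statistics for H0: beta_k = 0
cbind(estimate = drop(b), std_error = se, t_value = t)
# compare: summary(lm(y ~ x2))$coefficients
```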
Decision rule for the t-test

1 H0 ∶ βk = β̄k (often β̄k = 0)


HA ∶ βk ≠ β̄k
bk −β̄k
2 Given β̄k , OLS-estimate bk and s 2 , we compute tk = s.e.(bk )

3 Fix significance level α of two-sided test


4 Fix non-rejection and rejection regions ⇒ decision

Remark:

√( σ² [(X′X)^{−1}]_kk ) : standard deviation of b_k | X

√( s² [(X′X)^{−1}]_kk ) : standard error of b_k | X

73
Duality of t-test and confidence interval

Under H0 : β_k = β̄_k

t_k = (b_k − β̄_k) / s.e.(b_k) ∼ t(n − K)

Probability for non-rejection:

P( −t_{α/2}(n − K) ≤ t_k ≤ t_{α/2}(n − K) ) = 1 − α

−t_{α/2}(n − K) : lower critical value
t_{α/2}(n − K) : upper critical value
t_k : random variable (value of the test statistic)
1 − α : fixed number

⇒ P( b_k − s.e.(b_k) t_{α/2}(n − K) ≤ β̄_k ≤ b_k + s.e.(b_k) t_{α/2}(n − K) ) = 1 − α

74
1 − α confidence interval for βk

P (bk − s.e.(bk )t α2 (n − K ) ≤ βk ≤ bk + s.e.(bk )t α2 (n − K )) = 1 − α

● b_k − s.e.(b_k) t_{α/2}(n − K) : lower bound
● b_k + s.e.(b_k) t_{α/2}(n − K) : upper bound
● Confidence bounds are random variables!
● H0 : β_k = β̄_k is rejected at significance level α if β̄_k is NOT within the bounds of
the 1 − α confidence interval.
● H0 : β_k = β̄_k cannot be rejected at significance level α for all values β̄_k inside
the 1 − α confidence interval.
● Beware of the wrong interpretation: “The true parameter β_k lies with
probability 1 − α within the bounds of the confidence interval” is downright wrong!

75
Testing joint hypotheses (F -test/Wald test)

Write hypothesis as:

H0 ∶ R β = r
(#r × K) (K × 1) (#r × 1)

R: matrix of real numbers


r: vector of real numbers
#r: number of restrictions

Replacing β = (β1 , β2 , ..., βk )′ by estimate b = (b1 , b2 , ..., bK )′ :

Rb − r should be close to 0

76
Wald/F -test statistic
Distributional properties of R b:

R E(b∣X) = Rβ [= r only if H0 is true]

R Var(b∣X)R′ = Rσ 2 (X′ X)−1 R′

Rb∣X ∼ N (Rβ, R σ 2 (X′ X)−1 R′ )

Using additional facts from multivariate statistics


● z = (z1 , z2 , ..., zm )′ ∼ N (µ, Ω)
● ⇒ (z − µ)′ Ω−1 (z − µ) ∼ χ2 (m)
Result applied: Wald statistic under H0

(Rb − r)′ (σ 2 R(X′ X)−1 R′ )−1 (Rb − r) ∼ χ2 (#r)

77
Distributional properties

Replace σ² by its unbiased estimate s² = (1/(n − K)) Σ_{i=1}^n e_i² = (1/(n − K)) e′e and
divide by #r:

⇒ F-ratio:

F = [ (Rb − r)′ [R(X′X)^{−1}R′]^{−1} (Rb − r) / #r ] / [ (e′e)/(n − K) ]

  = (Rb − r)′ [R V̂ar(b|X) R′]^{−1} (Rb − r) / #r   ∼ F(#r, n − K)

Note: F -test is one-sided

Proof: see Hayashi p. 41

78
Decision rule of the F -test

1 Specify H0 in the form Rβ = r and HA ∶ Rβ ≠ r.

2 Calculate F -statistic.

3 Look up entry in the table of the F -distribution for #r and n − K at


given significance level.

4 Null is not rejected on the significance level α for F less than


Fα (#r, n − K )

79
Alternative representation of the F -statistic

Minimization of the unrestricted sum of squared residuals:

n
→ ∑(yi − x′i b)2 ⇒ SSRU
i=1

Minimization of the restricted sum of squared residuals:

n
→ ∑(yi − x′i bR )2 ⇒ SSRR
i=1

(Note that Hayashi uses β̃ for bR )

F -ratio:
(SSRR − SSRU )/#r
F=
SSRU /(n − K )

80
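A minimal R sketch of the restricted/unrestricted-SSR form of the F-statistic on simulated data (the null hypothesis and coefficient values are hypothetical); anova() on the two nested fits reports the same statistic.

```r
# F-statistic from restricted and unrestricted SSR for H0: beta2 = beta3 = 0.
set.seed(7)
n  <- 300
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 1 + 0.4 * x2 + 0.0 * x3 + rnorm(n)

unres <- lm(y ~ x2 + x3)             # unrestricted model, K = 3
restr <- lm(y ~ 1)                   # restricted model under H0 (constant only)
SSR_U <- sum(resid(unres)^2)
SSR_R <- sum(resid(restr)^2)
num_r <- 2                           # number of restrictions
F_stat <- ((SSR_R - SSR_U) / num_r) / (SSR_U / (n - 3))
c(F_statistic = F_stat, critical_5pct = qf(0.95, num_r, n - 3))
# anova(restr, unres) reports the same F statistic
```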
5. Goodness-of-fit measures

Hayashi p. 38/20

81
Coefficient of determination: uncentered R 2

Measure of the variability of the dependent variable: ∑ yi2 = y′ y

Orthogonal decomposition of y = ŷ + e:

y′ y = (ŷ + e)′ (ŷ + e)


= ŷ′ ŷ + 2ŷ′ e + e′ e
= ŷ′ ŷ + e′ e

⇒ R²_uc ≡ 1 − (e′e)/(y′y)

A good model explains much and therefore the residual variation is very small
compared to the explained variation.

82
Coefficient of determination: centered R 2

Use a centered R 2 if there is a constant in the model (xi1 = 1)

Σ_{i=1}^n (y_i − ȳ)² = Σ_{i=1}^n (ŷ_i − ȳ)² + Σ_{i=1}^n e_i²

⇒ R²_c ≡ 1 − Σ_{i=1}^n e_i² / Σ_{i=1}^n (y_i − ȳ)² ≡ 1 − SSR/SST

Note that R²_uc and R²_c both lie in the interval [0, 1] but describe different
models. They are not comparable!

83
Model selection criteria
adjusted R²_adj :

R²_adj = 1 − [SSR/(n − K)] / [SST/(n − 1)] = 1 − (n − 1)/(n − K) ⋅ SSR/SST

Akaike criterion (AIC):

AIC = log(SSR/n) + 2K/n

Schwarz criterion (SBC):

SBC = log(SSR/n) + log(n) K/n

Note:
● penalty term for heavy parametrization
● Select the model with the smallest AIC/SBC, highest R²_adj

84
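A minimal R sketch (simulated data) computing adjusted R², AIC and SBC as defined above for two competing models; these definitions differ from R's built-in AIC()/BIC() output by constants and a scaling with n, but for a fixed sample they should rank models the same way.

```r
# Model selection criteria as defined on the slide (crit() is a helper defined here).
set.seed(8)
n  <- 250
x2 <- rnorm(n)
x3 <- rnorm(n)                       # irrelevant regressor
y  <- 1 + 0.6 * x2 + rnorm(n)

crit <- function(fit, y) {
  K   <- length(coef(fit))
  SSR <- sum(resid(fit)^2)
  SST <- sum((y - mean(y))^2)
  c(adjR2 = 1 - (n - 1) / (n - K) * SSR / SST,
    AIC   = log(SSR / n) + 2 * K / n,
    SBC   = log(SSR / n) + log(n) * K / n)
}
rbind(small = crit(lm(y ~ x2), y),
      large = crit(lm(y ~ x2 + x3), y))   # the small model should win on AIC/SBC
```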
6. Large Sample Theory and OLS

Hayashi p. 88-97/109-133

85
Basic concepts of large sample theory

Using large sample theory we can dispense with basic assumptions from finite
sample theory:
● 1.2 E(εi ∣X) = 0:
strict exogeneity
● 1.4 Var(ε∣X) = σ 2 I:
homoscedasticity
● 1.5 ε∣X ∼ N (0, σ 2 In ):
conditional normality

Asymptotic distribution of b, and t- and the F -statistic can be obtained.

86
Modes of convergence

Modes of convergence:

● Convergence in probability: →_p
● Convergence almost surely: →_a.s.
● Convergence in mean square: →_m.s.
● Convergence in distribution: →_d

{z_n} : sequence of random variables
{z_n} : sequence of random vectors

87
Convergence in probability

Convergence in probability:

A sequence {z_n} converges in probability to a constant α if for any ε > 0

lim_{n→∞} P( |z_n − α| > ε ) = 0

Short-hand we write: plim z_n = α  or  z_n →_p α  or  z_n − α →_p 0

Extends to random vectors:

If lim_{n→∞} P( |z_nk − α_k| > ε ) = 0   ∀ k = 1, 2, ..., K

then z_n →_p α (element-wise convergence).

88
Convergence almost surely

Convergence almost surely:

A sequence {z_n} converges almost surely to a constant α if

P( lim_{n→∞} z_n = α ) = 1

Short-hand we write: z_n →_a.s. α.

Extends to random vectors:

If P( lim_{n→∞} z_nk = α_k ) = 1   ∀ k = 1, 2, ..., K

then z_n →_a.s. α (element-wise convergence).

89
Convergence in mean square and distribution

Convergence in mean square:

lim_{n→∞} E[ (z_n − α)² ] = 0   or   z_n →_m.s. α

Convergence in mean square implies convergence in probability.

Convergence in distribution:

z_n →_d z

if the c.d.f. of z_n , F_{z_n} , converges to the c.d.f. of z , F_z , at each point of
continuity.

Convergence in mean square, in probability, almost surely, and in distribution
extend to random vectors.

90
Khinchin’s Weak Law of Large Numbers (WLLN)

If {z_i} is i.i.d. with E(z_i) = µ < ∞,

then for z̄_n = (1/n) Σ_{i=1}^n z_i it holds that

z̄_n →_p µ

or lim_{n→∞} P( |z̄_n − µ| > ε ) = 0

or plim z̄_n = µ

91
WLLN: extensions

● Extension (1): Multivariate WLLN:


Sequence of random vectors {zi }

● Extension (2): Relaxation of independence

● Extension (3): Functions of random variables h(zi )

● Extension (4): Vector valued functions f (zi )

92
Central Limit Theorems (Lindeberg-Levy)

If {z_i} is i.i.d. with E(z_i) = µ and Var(z_i) = σ², and z̄_n = (1/n) Σ_{i=1}^n z_i →_p µ, then

√n ( z̄_n − µ ) →_d y ∼ N(0, σ²)

or z̄_n − µ ∼a N(0, σ²/n)

or z̄_n ∼a N(µ, σ²/n)

Remark: read “∼a” as ‘approximately distributed as’.
The CLT also holds for the multivariate extension: sequence of random vectors {z_i}.

93
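A minimal R Monte Carlo sketch of the Lindeberg-Levy CLT; the choice of an Exp(1) population (mean 1, variance 1) is only for illustration.

```r
# CLT illustration: sqrt(n)*(mean - mu) from a skewed population behaves like N(0, sigma^2).
set.seed(9)
n      <- 200
reps   <- 10000
mu     <- 1
sigma2 <- 1                                       # Exp(1): mean 1, variance 1
zbar   <- replicate(reps, mean(rexp(n, rate = 1)))
stat   <- sqrt(n) * (zbar - mu)
c(mean = mean(stat), var = var(stat))             # approx 0 and sigma2
# hist(stat, freq = FALSE); curve(dnorm(x, 0, sqrt(sigma2)), add = TRUE)
```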
Useful lemmas: Continuous Mapping Theorem

Lemma 1: Hayashi 2.3(a)

a(⋅) : R^K → R^M

If z_n →_p α and a is a continuous function that does not depend on n, then:

a(z_n) →_p a(α)   or   plim a(z_n) = a(plim z_n) = a(α)

Examples:
● x_n →_p α ⇒ ln(x_n) →_p ln(α)
● x_n →_p β and y_n →_p γ ⇒ x_n + y_n →_p β + γ
● Y_n →_p Γ ⇒ Y_n^{−1} →_p Γ^{−1}

94
Useful lemmas: Continuous Mapping Theorem

Lemma 2: Hayashi 2.3(b)

If z_n →_d z, then:

a(z_n) →_d a(z)

Example:
● z_n →_d z ∼ N(0, 1) ⇒ z_n² →_d z² ∼ χ²(1)

95
Useful lemmas: Slutzky Theorem

Lemma 3: Hayashi 2.4(a)

If x_n →_d x and y_n →_p α, then:

x_n + y_n →_d x + α

Examples:
● x_n →_d N(0, 1), y_n →_p α ⇒ x_n + y_n →_d N(α, 1)
● x_n →_d x, y_n →_p 0 ⇒ x_n + y_n →_d x

Lemma 4: Hayashi 2.4(b)

If x_n →_d x and y_n →_p 0, then:

x_n ⋅ y_n →_p 0

96
Useful lemmas: Slutzky Theorem

Lemma 5: Hayashi 2.4(c)

If x_n →_d x and A_n →_p A, then:

A_n x_n →_d A x

Example:
● x_n →_d N(0, Σ) ⇒ A_n x_n →_d N(0, A Σ A′)

Lemma 6: Hayashi 2.4(d)

If x_n →_d x and A_n →_p A, then:

x_n′ A_n^{−1} x_n →_d x′ A^{−1} x

97
Large sample assumptions for OLS

Using Hayashi’s numbering (see pp. 109-113):

(2.1) Linearity: yi = x′i β + εi ∀ i = 1, 2, ..., n

(2.2) and (2.5) assumptions regarding dependence of {yi , xi }

(2.3) Orthogonality / predetermined regressors: E(x_ik ⋅ ε_i) = 0 ∀ k = 1, ..., K
      If 1 ∈ x_i ⇒ E(ε_i) = 0 ⇒ Cov(x_ik, ε_i) = 0 ∀ k = 1, ..., K

(2.4) Rank condition: E(x_i x_i′) ≡ Σ_XX (K × K) is non-singular

98
Large sample distribution of OLS estimator

We get for b = (X′X)^{−1} X′y:

b_n = [ (1/n) Σ_{i=1}^n x_i x_i′ ]^{−1} (1/n) Σ_{i=1}^n x_i y_i

(the subscript n indicates the dependence on the sample size)

Using a WLLN and Lemma 1:

● b_n →_p β
● √n (b_n − β) →_d N(0, AVar(b))   or   b ∼a N(β, AVar(b)/n)

⇒ b_n is consistent and asymptotically normal (CAN).

99
bn = (X′ X)−1 X′ y is consistent

b_n = [ (1/n) Σ_{i=1}^n x_i x_i′ ]^{−1} (1/n) Σ_{i=1}^n x_i y_i

⇒ b_n − β = [ (1/n) Σ x_i x_i′ ]^{−1} (1/n) Σ x_i ε_i   (sampling error)

We show: b_n →_p β

When the sequence {y_i, x_i} allows application of a WLLN:

⇒ (1/n) Σ_{i=1}^n x_i x_i′ →_p E(x_i x_i′)

⇒ (1/n) Σ_{i=1}^n x_i ε_i →_p E(x_i ε_i) = 0

100
bn = (X′ X)−1 X′ y is consistent

Lemma 1 (CMT) implies:

⇒ [ (1/n) Σ_{i=1}^n x_i x_i′ ]^{−1} →_p [E(x_i x_i′)]^{−1}

b_n − β = [ (1/n) Σ x_i x_i′ ]^{−1} (1/n) Σ x_i ε_i
        →_p E(x_i x_i′)^{−1} E(x_i ε_i) = E(x_i x_i′)^{−1} ⋅ 0 = 0

b_n = (X′X)^{−1} X′y is consistent.

101
bn = (X′ X)−1 X′ y is asymptotically normal
The sequence {g_i} = {x_i ε_i} allows applying a CLT to ḡ = (1/n) Σ x_i ε_i :

√n ( ḡ − E(g_i) ) →_d N(0, E(g_i g_i′))

√n (b_n − β) = [ (1/n) Σ x_i x_i′ ]^{−1} √n ḡ

Applying Lemma 5:

A_n = [ (1/n) Σ x_i x_i′ ]^{−1} →_p A = Σ_xx^{−1}

x_n = √n ḡ →_d x ∼ N(0, E(g_i g_i′))

⇒ √n (b_n − β) →_d A x ∼ N(0, Σ_xx^{−1} E(g_i g_i′) Σ_xx^{−1})

⇒ b_n is CAN

102
White standard errors

Adjust the test statistics to make them robust against violations of
conditional homoskedasticity.

t-statistic:

t_k = (b_k − β̄_k) / √( [ ((1/n) Σ x_i x_i′)^{−1} ((1/n) Σ e_i² x_i x_i′) ((1/n) Σ x_i x_i′)^{−1} / n ]_kk )   ∼a N(0, 1)

holds under H0 : β_k = β̄_k

Wald statistic:

W = (Rb − r)′ [ R (AV̂ar(b)/n) R′ ]^{−1} (Rb − r)   ∼a χ²(#r)

holds under H0 : Rβ − r = 0; allows for linear restrictions on β

103
How to estimate AVar(b)

AVar(b) = Σ_xx^{−1} E(g_i g_i′) Σ_xx^{−1}   with g_i = x_i ε_i

(1/n) Σ_{i=1}^n x_i x_i′ →_p E(x_i x_i′)

Estimation of E(g_i g_i′):   Ŝ = (1/n) Σ e_i² x_i x_i′ →_p E(g_i g_i′)

⇒ AV̂ar(b) = [ (1/n) Σ x_i x_i′ ]^{−1} Ŝ [ (1/n) Σ x_i x_i′ ]^{−1}
           →_p AVar(b) = E(x_i x_i′)^{−1} E(g_i g_i′) E(x_i x_i′)^{−1}

104
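A minimal R sketch (simulated heteroskedastic data) that builds the White/robust standard errors from the AVar(b) estimate above and compares them with the conventional OLS standard errors.

```r
# White (HC0) standard errors from the sandwich formula above.
set.seed(10)
n  <- 1000
x2 <- rnorm(n)
y  <- 1 + 0.5 * x2 + rnorm(n, sd = exp(0.5 * x2))   # Var(eps|x) depends on x

X  <- cbind(1, x2)
K  <- ncol(X)
b  <- solve(crossprod(X), crossprod(X, y))
e  <- drop(y - X %*% b)

Sxx_inv  <- solve(crossprod(X) / n)                 # [1/n sum x_i x_i']^{-1}
S_hat    <- crossprod(X * e) / n                    # 1/n sum e_i^2 x_i x_i'
avar_b   <- Sxx_inv %*% S_hat %*% Sxx_inv           # estimated AVar(b)
se_white <- sqrt(diag(avar_b / n))

s2     <- sum(e^2) / (n - K)
se_ols <- sqrt(diag(s2 * solve(crossprod(X))))      # conventional standard errors
cbind(ols = se_ols, white = se_white)
```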
Testing with conditional homoskedasticity

Develop a test statistic under the assumption of conditional homoskedasticity.

Assumption: E(ε_i² | x_i) = σ²

AV̂ar(b) = [ (1/n) Σ x_i x_i′ ]^{−1} σ̂² (1/n) Σ x_i x_i′ [ (1/n) Σ x_i x_i′ ]^{−1}
         = σ̂² [ (1/n) Σ x_i x_i′ ]^{−1}

with σ̂² = (1/n) Σ_{i=1}^n e_i²

Note: (1/n) Σ e_i² is a biased but consistent estimate of σ²

105
7. Time Series Basics
(Stationarity and Ergodicity)

Hayashi p. 97-107

106
Time series dependence

Certain degree of dependence in the data in time series analysis; only one
realization of the data generating process is given.

CLT and WLLN rely on i.i.d. data, but dependence in real world data.

Examples:
● Inflation rate
● Stock market returns

Stochastic process: sequence of r.v.s. indexed by time {z1 , z2 , z3 , ...} or {zi }


with i = 1, 2, ...

A realization/sample path: One possible outcome of the process

107
Parallel worlds: Ensemble means

If we were able to ‘run the world several times’, we had different realizations of
the process at one point in time.

⇒ We could compute ensemble means and apply the WLLN.

As the described repetition is not possible, we take the mean over the one
realization of the process.

Key question:

Does (1/T) Σ_{t=1}^T x_t →_p E(X) hold?

Condition: Stationarity and ergodicity of the process

108
Stationarity restricts the heterogeneity of a s.p.

Strict stationarity:

The joint distribution of zi , zi1 , zi2 , ..., zir depends only on the relative position
i1 − i, i2 − i, ..., ir − i but not on i itself.

In other words: The joint distribution of (zi , zir ) is the same as the joint
distribution of (zj , zjr ) if i − ir = j − jr .

Weak stationarity:

● E(zi ) does not depend on i


● Cov(zi , zi−j ) depends on j (distance), but not on i (absolute position)

109
Ergodicity restricts memory of stochastic process

A stationary process is called ergodic if

lim_{n→∞} | E[ f(z_i, z_{i+1}, ..., z_{i+k}) ⋅ g(z_{i+n}, z_{i+n+1}, ..., z_{i+n+l}) ] |
   = | E[ f(z_i, z_{i+1}, ..., z_{i+k}) ] | ⋅ | E[ g(z_{i+n}, z_{i+n+1}, ..., z_{i+n+l}) ] |

Ergodic Theorem:

If the sequence {z_i} is stationary and ergodic with E(z_i) = µ, then

z̄_n = (1/n) Σ_{i=1}^n z_i →_p µ

110
Martingale difference sequence

Stationarity and Ergodicity are not enough for applying a CLT. To derive the
CAN property of OLS we assume:

{gi } = {xi εi }

is a stationary and ergodic martingale difference sequence (m.d.s.):

E(gi ∣gi−1 , gi−2 , ..., g1 ) = 0


⇒ E(gi ) = 0 (LTE).

Implications of m.d.s. when 1 ∈ xi :


εi and εi−j are uncorrelated, i.e. Cov(εi , εi−j ) = 0

111
8. Generalized Least Squares

Hayashi p. 54-58

112
GLS Assumptions

(1.1) Linearity: yi = x′i β + εi


(1.2) Strict exogeneity: E(εi ∣X) = 0
⇒ E(εi ) = 0 and Cov(εi , xik ) = E(εi xik ) = 0
(1.3) Full rank: P(rank(X) = K ) = 1

Relaxing assumption (1.4): the conditional variance-covariance matrix of ε is now a
general n × n matrix with diagonal elements Var(ε_i|X) and off-diagonal elements
Cov(ε_i, ε_j|X):

⇒ Var(ε|X) = E(εε′|X) = σ² V(X)

NOT (as in 1.4): Var(ε|X) = σ² I_n

113
Generalized Least Squares (GLS)

GLS estimator derived under the assumption that V(X) is known, symmetric,
and positive definite

Let V(X)−1 = C′ C

Transformation of y = Xβ + ε: Premultiplying with C

Cy = CXβ + Cε

ỹ = X̃β + ε̃

where ỹ = Cy, X̃ = CX, and ε̃ = Cε

114
Least squares estimation of β (transformed data)

β̂_GLS = (X̃′X̃)^{−1} X̃′ỹ
      = (X′C′CX)^{−1} X′C′Cy
      = ( (1/σ²) X′V(X)^{−1}X )^{−1} (1/σ²) X′V(X)^{−1}y
      = [ X′[Var(ε|X)]^{−1}X ]^{−1} X′[Var(ε|X)]^{−1}y

GLS is the best linear unbiased estimator (BLUE)

Problems:
● Difficult to work out the asymptotic properties of β̂ GLS
● In real world applications Var(ε∣X) not known
● If Var(ε∣X) is estimated the BLUE-property of β̂ GLS is lost

115
Special case of GLS - weighted least squares

E(εε′|X) = Var(ε|X) = σ² V(X) with V(X) diagonal:
V(X) = diag( V_1(X), V_2(X), ..., V_n(X) )

As V(X)^{−1} = C′C:

⇒ C = diag( 1/√V_1(X), 1/√V_2(X), ..., 1/√V_n(X) ) = diag( 1/s_1, 1/s_2, ..., 1/s_n )

⇒ β̂_GLS = argmin_{β̃} Σ_{i=1}^n ( y_i/s_i − β̃_1 (1/s_i) − β̃_2 (x_i2/s_i) − ... − β̃_K (x_iK/s_i) )²

Observations are inversely weighted by their standard deviations.

116
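A minimal R sketch of this weighted-least-squares special case on simulated data with an assumed known skedastic function V_i(x): dividing each observation by s_i and running OLS is equivalent to lm() with weights 1/V_i.

```r
# Weighted least squares: transform-and-OLS versus the weights argument of lm().
set.seed(11)
n  <- 1000
x2 <- runif(n, 1, 5)
Vi <- x2^2                                    # assumed known skedastic function V_i(x)
y  <- 1 + 2 * x2 + rnorm(n, sd = sqrt(Vi))

si    <- sqrt(Vi)
b_gls <- coef(lm(I(y / si) ~ 0 + I(1 / si) + I(x2 / si)))   # OLS on transformed data
b_wls <- coef(lm(y ~ x2, weights = 1 / Vi))                  # built-in equivalent
rbind(transformed = b_gls, weights_arg = b_wls)              # both approx (1, 2)
```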
9. Multicollinearity

117
Exact multicollinearity

Expressing a regressor as linear combination of (an)other regressor(s)


● rank(X) ≠ K : No full rank
● ⇒ Assumption 1.3 or 2.4 is violated
● (X′ X)−1 does not exist

Often economic variables are correlated to some degree


● BLUE result is not affected
● Large sample results are not affected

118
Effects of multicollinearity and solutions

Effects:
● Coefficients may have high standard errors and low significance levels
● Estimates may have the wrong sign
● Small changes in the data produce wide swings in the parameter estimates

Solutions:
● Increase precision by adding more data. (Costly!)
● Build a better fitting model that leaves less unexplained.
● Exclude some regressors. (Dangerous! Omitted variable bias!)

119
10. Endogeneity

Hayashi p. 186-196

120
Omitted variable bias (OVB)

Correctly specified model:

y = X1 β 1 + X2 β 2 + ε

Regression of y on X1
⇒ X2 gets into disturbance term
⇒ Omitted variable bias

b1 = (X′1 X1 )−1 X′1 y


= (X′1 X1 )−1 X′1 (X1 β 1 + X2 β 2 + ε)
= β 1 + (X′1 X1 )−1 X′1 X2 β 2 + (X′1 X1 )−1 X′1 ε

OLS is biased:
● If β 2 ≠ 0 ⇒ (X′1 X1 )−1 X′1 X2 β 2 ≠ 0
● If (X′1 X1 )−1 X′1 X2 ≠ 0 ⇒ (X′1 X1 )−1 X′1 X2 β 2 ≠ 0

121
Endogeneity bias: Working example

Simultaneous equations model of market equilibrium (structural form):

q_i^d = α_0 + α_1 p_i + u_i
q_i^s = β_0 + β_1 p_i + v_i

Markets clear: q_i^d = q_i^s

It is not possible to estimate α_0, α_1, β_0, β_1 by regressing quantity on price, as we
do not know whether changes in the market equilibrium are due to supply or demand
shocks.
We observe many equilibria, but we cannot recover the slopes of the demand and the
supply curves from these data alone.
Endogeneity: correlation between disturbances and regressors; the regressors are
not predetermined.
Here: simultaneous equations bias

122
From structural form to reduced form

Solving for q_i and p_i yields the reduced form:

p_i = (β_0 − α_0)/(α_1 − β_1) + (v_i − u_i)/(α_1 − β_1)

q_i = (α_1 β_0 − α_0 β_1)/(α_1 − β_1) + (α_1 v_i − β_1 u_i)/(α_1 − β_1)

Price is a function of the two disturbance terms:

● v_i : supply shifter
● u_i : demand shifter

Calculating the covariance of p_i and the demand shifter u_i :

⇒ Cov(p_i, u_i) = − Var(u_i) / (α_1 − β_1)

123
With endogeneity OLS is not consistent

The FOCs in the simple regression context yield:

α̂_1 = [ (1/n) Σ (q_i − q̄)(p_i − p̄) ] / [ (1/n) Σ (p_i − p̄)² ] →_p Cov(p_i, q_i) / Var(p_i)

But here: Cov(p_i, q_i) = α_1 Var(p_i) + Cov(p_i, u_i)

⇒ Cov(p_i, q_i) / Var(p_i) = α_1 + Cov(p_i, u_i) / Var(p_i) ≠ α_1

⇒ OLS is not consistent

The same holds for β_1

124
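A minimal R sketch of the simultaneity problem: simulating the market model with hypothetical parameter values and regressing equilibrium quantity on price does not recover the demand slope α_1.

```r
# Simultaneous equations bias: OLS on equilibrium data is inconsistent.
set.seed(12)
n  <- 1e5
a0 <- 10; a1 <- -1       # demand: q = a0 + a1*p + u
b0 <-  2; b1 <-  1       # supply: q = b0 + b1*p + v
u  <- rnorm(n)
v  <- rnorm(n)

p  <- (b0 - a0) / (a1 - b1) + (v - u) / (a1 - b1)   # reduced form for price
q  <- a0 + a1 * p + u                               # equilibrium quantity

c(ols_slope = unname(coef(lm(q ~ p))["p"]),          # not close to a1 = -1
  plim      = a1 + cov(p, u) / var(p))               # matches the probability limit above
```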
Instruments for the market model

Properties of the instruments:

● Uncorrelated with the disturbances; instruments are predetermined
● Correlated with the endogenous regressors

Cov(x_i, p_i) = ( β_2 / (α_1 − β_1) ) Var(x_i)
Cov(x_i, u_i) = 0

⇒ x_i is an instrument for p_i ⇒ yields a new reduced form
(x_i enters the supply equation with coefficient β_2 ; ζ_i denotes the supply shock)

⇒ p_i = (β_0 − α_0)/(α_1 − β_1) + (β_2/(α_1 − β_1)) x_i + (ζ_i − u_i)/(α_1 − β_1)
⇒ q_i = (α_1 β_0 − α_0 β_1)/(α_1 − β_1) + (α_1 β_2/(α_1 − β_1)) x_i + (α_1 ζ_i − β_1 u_i)/(α_1 − β_1)

Cov(x_i, q_i) = (α_1 β_2/(α_1 − β_1)) Var(x_i) = α_1 Cov(x_i, p_i)
⇒ α_1 = Cov(x_i, q_i) / Cov(x_i, p_i)

by the WLLN: α̂_1 →_p α_1

125
A basic macroeconomic model: Haavelmo (1943)

● Aggregate consumption function: C_i = α_0 + α_1 Y_i + u_i
● GDP identity: Y_i = C_i + I_i
● Y_i affects C_i, but at the same time C_i influences Y_i
● Reduced form: Y_i = α_0/(1 − α_1) + (1/(1 − α_1)) I_i + u_i/(1 − α_1)

⇒ C_i cannot be regressed on Y_i, as the regressor is correlated with the
disturbance:

Cov(Y_i, u_i) = Var(u_i)/(1 − α_1) > 0

⇒ OLS is inconsistent: upward biased

Cov(C_i, Y_i)/Var(Y_i) = α_1 + Cov(Y_i, u_i)/Var(Y_i) ≠ α_1

Valid instrument for income Yi : investment Ii

126
Errors in variables

Explanatory variable is measured with error (e.g. reporting errors)

Classical example: Friedman’s permanent income hypothesis

Permanent consumption is proportional to permanent income: C_i* = k Y_i*

Observed variables:
● Yi = Yi∗ + yi
● Ci = Ci∗ + ci
● ci = kyi + ui

Endogeneity due to measurement errors

⇒ Solution: IV

127
11. Instrumental Variables

Hayashi p. 186-196, Angrist and Pischke p. 114-133, 138-140


(Note that Hayashi uses x for instruments and z for regressors and
δ instead of β.)

128
Solution for endogeneity problem: IV

Linear regression:
yi = x′i β + εi
But the assumption of predetermined regressors does not hold:

E(xi εi ) ≠ 0

⇒ For consistency, instrumental variables zi are needed:

z_i = (z_i1, z_i2, ..., z_iL)′

Every element of z_i is correlated with the endogenous regressors but uncorrelated
with the disturbance term.

129
IV Assumptions

(3.1) Linearity yi = x′i β + εi


(3.2) Ergodic stationarity
- L instruments zi
- K regressors xi
- Data sequence wi ≡ {yi , zi , xi } is stationary and ergodic
(3.3) Orthogonality conditions:

E(z_i1 (y_i − x_i′β)) = 0
E(z_i2 (y_i − x_i′β)) = 0
⋮
E(z_iL (y_i − x_i′β)) = 0

⇒ E(z_i (y_i − x_i′β)) = E(z_i ε_i) = 0

130
IV Assumptions

(3.4) Rank condition for identification: rank(Σ_ZX) = K with

Σ_ZX = E(z_i x_i′) the L × K matrix with typical element E(z_il x_ik)

K = L ⇒ Σ_ZX^{−1} exists.

131
Deriving the IV-estimator (K = L)

E(z_i ε_i) = E( z_i (y_i − x_i′β) ) = 0

E(z_i y_i) − E(z_i x_i′) β = 0

β = [E(z_i x_i′)]^{−1} E(z_i y_i)

β̂_IV = [ (1/n) Σ z_i x_i′ ]^{−1} [ (1/n) Σ z_i y_i ]

If K = L the rank condition implies that Σ_ZX^{−1} exists and the system is exactly
identified.

132
Deriving the IV-estimator (K = L)

Applying the WLLN, the CLT and the useful lemmas, it can be shown that the IV
estimate β̂_IV is CAN:

β̂_IV →_p β

√n (β̂_IV − β) →_d N( 0, [E(z_i x_i′)]^{−1} E(ε_i² z_i z_i′) [[E(z_i x_i′)]^{−1}]′ )

AV̂ar(β̂_IV) = [ (1/n) Σ z_i x_i′ ]^{−1} (1/n) Σ e_i² z_i z_i′ [ [(1/n) Σ z_i x_i′]^{−1} ]′

with e_i = y_i − x_i′ β̂_IV

V̂ar(β̂_IV) = AV̂ar(β̂_IV) / n

133
IV in the context of causal model

y_si = f_i(s) = α + ρ s + η_i
y_i = α + ρ S_i + γ′A_i + ν_i ,   where γ′A_i + ν_i = η_i

CIA ⇒ OLS delivers consistent estimates of ρ: “selection on observables” (A_i).

What if A_i is not observed?

Solution: instrumental variables estimation ⇒ z_i uncorrelated with η_i, but
correlated with S_i.

134
IV in the context of causal model

“exclusion restriction”
yi = α + ρSi + ηi
figure out:

Cov(yi , zi ) = E[yi zi ] − E[zi ]E[yi ]


= E[(α + ρSi + ηi )zi ] − E[α + ρSi + ηi ]E[zi ]
= αE[zi ] + ρE[Si zi ] + E[ηi zi ] − αE[zi ]
− ρE[Si ]E[zi ] − E[ηi ]E[zi ]
= ρCov(Si , zi ) + Cov(ηi , zi )

135
IV in the context of causal model

ρ = [ Cov(y_i, z_i) / Var(z_i) ] / [ Cov(S_i, z_i) / Var(z_i) ]
    (PR of y_i on z_i divided by PR of S_i on z_i)

ρ̂ = [ sample Cov(y_i, z_i) / sample Var(z_i) ] / [ sample Cov(S_i, z_i) / sample Var(z_i) ]

136
IV in the context of causal model

Formulate problem into IV framework

η̃i = ηi − E(ηi )
yi = α̃ + ρSi + η̃i

with α̃ = α + E(ηi )

and E(η˜i ) = 0

Cov(ηi , zi ) = Cov(η̃i , zi ) = E(η̃i zi ) = 0

137
Special case of IV

z_i = (1, z_i)′ ,  x_i = (1, S_i)′ ,  β = (α̃, ρ)′ ,  ε_i = η̃_i ,  y_i = y_i

β = [E(z_i x_i′)]^{−1} E(z_i y_i) = (α̃, ρ)′

ρ = Cov(y_i, z_i) / Cov(S_i, z_i)

α̃ = E(y_i) − ρ E(S_i)

β̂ = [ (1/n) Σ z_i x_i′ ]^{−1} (1/n) Σ z_i y_i = (α̃̂, ρ̂)′

138
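A minimal R sketch of this special case on simulated data (all coefficients hypothetical): the ratio of sample covariances recovers ρ while OLS does not.

```r
# Simple IV (Wald) estimator: rho_hat = Cov(y, z) / Cov(S, z).
set.seed(13)
n   <- 1e5
eta <- rnorm(n)
z   <- rnorm(n)                              # instrument, independent of eta
S   <- 1 + 0.8 * z + 0.7 * eta + rnorm(n)    # endogenous regressor
y   <- 2 + 0.5 * S + eta                     # true rho = 0.5

c(ols = unname(coef(lm(y ~ S))["S"]),        # biased upward
  iv  = cov(y, z) / cov(S, z))               # approx 0.5
```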
Causal model with covariates

y_i = α′x_i + ρ S_i + η_i

Consider two population regressions (PR):

S_i = x_i′ π_10 + π_11 z_i + ξ_1i
y_i = x_i′ π_20 + π_21 z_i + ξ_2i

Use Frisch-Waugh ⇒ π_11 = Cov(z̃_i, S_i)/Var(z̃_i) ,  π_21 = Cov(z̃_i, y_i)/Var(z̃_i)

π_21 / π_11 = Cov(z̃_i, y_i) / Cov(z̃_i, S_i) = ρ

139
An alternative view: Two-Stage-Least Squares

Insert: y_i = α′x_i + ρ [ x_i′ π_10 + π_11 z_i + ξ_1i ] + η_i

y_i = [α′ + ρ π_10′] x_i + ρ π_11 z_i + ρ ξ_1i + η_i
      (= π_20′ x_i  +  π_21 z_i  +  ξ_2i)

Empirical strategy:

1 Run a regression of S_i on x_i and z_i ⇒ estimates of π_10 and π_11 ⇒ π̂_10 , π̂_11
  ⇒ compute Ŝ_i = π̂_10′ x_i + π̂_11 z_i

2 Run the regression y_i = α′x_i + ρ Ŝ_i + ξ_2i

140
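A minimal R sketch of the two-stage strategy on simulated data (all coefficients hypothetical). The manual second stage recovers ρ; for correct standard errors one would use a dedicated IV routine such as AER::ivreg() rather than the naive second-stage output.

```r
# Two-stage least squares by hand.
set.seed(14)
n   <- 1e5
x   <- rnorm(n)                              # exogenous covariate
eta <- rnorm(n)
z   <- rnorm(n)                              # instrument
S   <- 0.5 * x + 0.8 * z + 0.7 * eta + rnorm(n)
y   <- 1 + 0.3 * x + 0.5 * S + eta           # true rho = 0.5

stage1 <- lm(S ~ x + z)                      # first stage: S on x and z
S_hat  <- fitted(stage1)
stage2 <- lm(y ~ x + S_hat)                  # second stage: y on x and fitted S
coef(stage2)["S_hat"]                        # approx 0.5
```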
3rd view IV

y_i = α′x_i + ρ S_i + η_i

Instrument z_i :  E(η_i z_i) = 0 ,  E(η_i x_i) = 0

Redefine:  z_i = (x_i′, z_i)′ ,  x_i = (x_i′, S_i)′ ,  β = (α′, ρ)′

η_i = y_i − α′x_i − ρ S_i   (original)
    = y_i − x_i′β           (redefined)

⇒ E(z_i η_i) = 0 (redefined)

141
3rd view IV

⇒ E(z_i (y_i − x_i′β)) = 0 :

β = [E(z_i x_i′)]^{−1} E(z_i y_i) = (α, ρ)′

β̂ = [ (1/n) Σ z_i x_i′ ]^{−1} (1/n) Σ z_i y_i = (α̂, ρ̂)′

⇒ 2SLS, ILS, IV are identical here!

⇒ use IV inference!

142
Hayashi in a nutshell

143
Hayashi in a nutshell

144
Hayashi in a nutshell

145
