Applied Econometrics 2024


Applied Econometrics

Prof. Dr. Joachim Grammig

University of Tübingen

Summer term 2024

1
Fact sheet Applied Econometrics 2024
Joachim Grammig (Lecturer) joachim.grammig@uni-tuebingen.de
Sylvia Bürger (Sek./Admin) sylvia.buerger@uni-tuebingen.de
ILIAS Password: AE24

● Lecture with embedded practical parts and presentations of selected problems (from
q4r): Monday 8-10h and Thursday 10-12h, H23 Kupferbau
● Tutorial videos on technical aspects, e.g. proofs, detailed derivations
● Practical sessions using R will be via Zoom (to make up for Thursday
public holidays)
● Questions for Review (q4r) updated each week for revision of lecture (in
pairs, group or solo)
● Practical parts using statistical software R: implement/code key concepts
● Course material will be made available on ILIAS
● ... a weekly updated time table
● ... slides (may be updated or extended)
● ... a forum for discussion
● ... tutorial videos
● ... R code from practical part
● ... pdfs of books (parts) and papers
2
Fact sheet Applied Econometrics 2024
● Recommended text books are in the Uni library with many copies
● Please check ILIAS, discussion forum, and your student email address
regularly
● Recommendation: form study groups or pairs and work regularly through
q4r and practical parts
● Work continuously, do not procrastinate
● Take intensive notes during the lecture and while working through the tutorial
videos
● Help each other out (in case a friend missed a lecture)
● You may bring excerpts of your handwritten lecture notes to the 90 min
exam (cheat sheets): five handwritten DIN A4 (or Letter)-sized pages,
which may be written on front and back, so 10 sides in total
● TIMMS videos of the lecture recorded during the pandemic (summer 2020) remain
available, but should be used mainly for revision (lecture content/focus
changes).
● My regular office hours: Wednesdays 13-14h, please contact Ms. Bürger

3
What is econometrics?

● There are several aspects of the quantitative approach to economics, and


no single one of these aspects, taken by itself, should be confounded with
econometrics.
● Thus, econometrics is by no means the same as economic statistics. Nor
is it identical with what we call general economic theory, although a
considerable portion of this theory has a definitely quantitative character.
Nor should econometrics be taken as synonymous with the application of
mathematics to economics.
● Experience has shown that each of these three view-points, that of
statistics, economic theory, and mathematics, is a necessary, but not by
itself a sufficient condition for a real understanding of the quantitative
relations in modern economic life.
● It is the unification of all three that is powerful.
And it is this unification that constitutes econometrics
Ragnar Frisch, Econometrica, (1933), 1, pp.1-2

4
Econometrics-Econ-Nobels: Gen X and Y

● 1989 TRYGVE HAAVELMO for his clarification of the probability theory


foundations of econometrics and his analyses of simultaneous economic
structures (simultaneity in econometric analysis).
● 1980 LAWRENCE R. KLEIN for the creation of econometric models and
the application to the analysis of economic fluctuations and economic
policies (large scale econometric models)
● 1969 RAGNAR FRISCH and JAN TINBERGEN for having developed and
applied dynamic models for the analysis of economic processes (basic
econometrics)

5
Econometrics-Econ-Nobels: Gen Z and Alpha

● 2021 JOSHUA ANGRIST and GUIDO IMBENS for establishing new


methods of conducting natural experiments in economics studies using
data in which otherwise similar groups of people are separated by crucial
variables - allowing researchers to better understand cause and effect in
complex social situations
● 2013 LARS PETER HANSEN for the empirical analysis of asset pricing
(but really for the development of the Generalized Method of Moments)
● 2011 THOMAS J. SARGENT and CHRISTOPHER A. SIMS for their
empirical research on cause and effect in the macroeconomy
● 2003 ROBERT F. ENGLE for methods of analyzing economic time series
with time-varying volatility and CLIVE W. J. GRANGER, for methods of
analyzing economic time series with common trends and cointegration
● 2000 JAMES J. HECKMAN for his development of theory and methods
for analyzing selective samples and DANIEL L. MCFADDEN for his
development of theory and methods for analyzing discrete choice
(Microeconometrics)

6
Recommended texts

● Hansen, B. (2022): Econometrics. Princeton University Press.


● Angrist, J. and Pischke, J. S. (2008): Mostly Harmless Econometrics.
Princeton University Press.
● Hayashi, F. (2000): Econometrics. Princeton University Press.
● Wooldridge, J. (2019): Introductory Econometrics - A Modern Approach. 7th
ed. Cengage Learning EMEA

(available in Uni lib and (parts) as pdfs on Ilias)

7
Revise and to dos

1 Review mathematical statistics (Probability and Risk, P & R): random
variables, distribution and density functions, expected values (mean,
variance, moments), orthogonality, moment conditions/restrictions, joint
distributions, independence, covariance and correlation, conditional
probability/density and conditional distributions, conditional expectation,
properties of the multivariate normal distribution. (Math. appendix of J.
Hamilton, Time Series Analysis, Princeton 1994, p. 739 ff., available on
Ilias)
2 Review basic linear algebra: matrix multiplication, inverse, definiteness,
and rank of a matrix (W. Greene, Econometrics, Matrix Appendix,
available on Ilias) Review statistical testing (QM) and OLS algebra
(EDA/QM)
3 Work through Easy Pieces in Statistics (Ilias)
4 Download presentation by Prof. Philipp Harms (Ilias)

8
Table of Contents (may be modified)

1 Six Justifications for Linear Regression

2 Parameter Estimation

3 Finite Sample Properties of OLS

4 Hypothesis Testing under Normality

5 Goodness-of-fit Measures

6 Large Sample Theory and OLS

9
Table of Contents (may be modified)

1 Time Series Basics

2 Generalized Least Squares (GLS)

3 Multicollinearity

4 Endogeneity: Problem and Solutions

5 Instrumental Variables

10
1. Six Justifications for Linear Regression

Angrist and Pischke, 2008, Ch. 1/2

11
Six justifications for linear regression

1 Structural model suggested/derived from theory

2 Population regression (pure statistical motivation)

3 Linear conditional expectation function (CEF)

4 Smallest mean squared error (MSE) approximation to a nonlinear CEF

5 Smallest MSE prediction of dependent variable using a linear forecast


function

6 (Rubin) causal model

12
Justification A: structural economic model
Regression equation derived from economic/finance theory

dependent variable = constant + β × “key” regressor(s)
[ + γ × “control” variables ]
+ unobservable component / “residual”

● unobservable component has economic meaning, a “life of its own”


● Parameters β have a structural interpretation
● return on education
● price elasticity (of demand or supply)
● marginal propensity to consume
● Testing of economic hypotheses:
● return on education (> market interest rate ?)
● price elasticity of demand (> 1?)
● marginal propensity to consume (equal one ?)

13
Example 1: Supply and demand functions

Simultaneous equations model of market equilibrium (structural form):


q_i^d = α_0 + α_1 p_i + u_i   (linear demand function)
q_i^s = β_0 + β_1 p_i + v_i   (linear supply function)

q_i^d = q_i^s   (market clearing condition)

u_i : demand shock, v_i : supply shock; α_1 and β_1 : price sensitivities of the
demand and supply functions

● Estimate α_0 , α_1 (−), β_0 , β_1 (+) by OLS?
● Regress demand on prices and supply on prices?
● We observe equilibria q_i^d = q_i^s ; can we estimate the slopes of the demand
and the supply curves from the data?

14
Example 2: Glosten-Harris model
● How do financial asset prices evolve?
(Journal of Financial Economics, 1988, 21 (1), pp.123-142)
● Importance of public and private information on price formation
Ingredients and notation:
● market maker (MM): sets bid (buy) and ask (sell) quotes
● traders: buy from/sell to MM at prevailing quotes
● trade (transaction) events indexed by i = 1, . . .
● Efficient price: mi , incorporates all public and private info
● Transaction price: Pi , per share, of ith trade
● Pia (Pib ) prevailing ask (bid) quote before (!) ith trade


● Indicator of transaction type: Q_i = +1 for a buyer-initiated trade,
  Q_i = −1 for a seller-initiated trade

● Trade volume of ith transaction: vi

15
Glosten-Harris model (2)
● Efficient price:
mi = µ + mi−1 + εi + Qi zi , where zi = z0 + z1 vi
● Drift parameter: µ
● new public information accumulated since (i − 1)th trade: εi
● Private information conveyed through trade: Qi zi
● MM sets bid and ask quotes anticipating price impact on m:
MM’s sell price (ask): Pia = µ + mi−1 + εi + zi + c
MM’s buy price (bid): Pib = µ + mi−1 + εi − zi − c
● (Opportunity) costs of MM: c (per share)
⇒ Transaction price change

∆Pi = µ + z0 Qi + z1 vi Qi + c∆Qi + εi

Goal: Estimation of structural parameters µ, z0 , z1 , c

16
Example 3: Mincer equation

Derived from Human Capital theory:

ln(WAGEi ) = β1 + β2 Si + β3 TENUREi + β4 EXPRi + εi

Ingredients and notation:


● Logarithm of the wage rate: ln(WAGEi )
● Years of schooling: Si
● Experience in the current job: TENUREi
● Experience in the labor market: EXPRi
● Unobserved individual effects (“residual”): εi

⇒ β2 : return to schooling

17
Example 4: Linear factor asset pricing models

Asset pricing theory postulates:



E(R_t^ej) = β^j × λ

R_{t+1}^j = x_{t+1}^j / p_t^j = (p_{t+1}^j + d_{t+1}^j) / p_t^j   (return)

R_t^ej = R_t^j − R_t^f : excess return of asset j
f_t = (f_t^1, ..., f_t^K)′ : K risk factors
λ = (λ_1, ..., λ_K)′ : prices of the risk factors (proportional to)
β^j = (β_1^j, ..., β_K^j)′ : exposures of asset j to the K factor risks

18
Example 4: Linear factor asset pricing models

Asset pricing theory:



E(R_t^ej) = β^j × λ

with a single risk factor f_t = f_t^1 (e.g. CAPM):

β^j = Cov(R_t^ej, f_t) / Var(f_t)

with K risk factors, f_t = (f_t^1, ..., f_t^K)′ :

β^j = E(f_t f_t′)^{−1} E(R_t^ej f_t)   (population regression coefficients, see below)

19
CAPM and Fama-French model

CAPM: f_t = R_t^em = R_t^m − R_t^f   (“excess return of the market portfolio”)

Fama-French model: f_t = (R_t^em, HML_t, SMB_t)′

f_t contains excess returns → λ = [E(f_t^1), ..., E(f_t^K)]′

E(R_t^ej) = β^j E(R_t^em)   (CAPM)

E(R_t^ej) = β_1^j E(R_t^em) + β_2^j E(SMB_t) + β_3^j E(HML_t)   (FF)

20
“Compatible regression” (for Fama-French model)

R_t^ej = α^j + β_1^j R_t^em + β_2^j SMB_t + β_3^j HML_t + ε_t

Taking expectations:

E(R_t^ej) = α^j + β_1^j E(R_t^em) + β_2^j E(SMB_t) + β_3^j E(HML_t)

Implied by asset pricing theory:

E(R_t^ej) = β_1^j E(R_t^em) + β_2^j E(SMB_t) + β_3^j E(HML_t)

● no constant in regression equation


● moment restrictions:
E(εt ) = 0, E(HMLt εt ) = 0, E(SMBt εt ) = 0, E(Rtem εt ) = 0
● β j are population regression coefficients

21
Justification B: Population regression

Y : dependent variable (e.g. wage)
X : random vector (K × 1) of explanatory variables (regressors)
    (e.g. gender, age, union, experience)

f_YX : joint density in the population

Y_i , X_i : i-th draw from the population

f_YX = f_{Y_i X_i}   ∀ i

22
Justification B: Population regression

Population regression coefficients (PRC) from

β̆ = argmin_{β̃} E[ (Y_i − X_i′ β̃)² ]

PRC β̆ solve the F.O.C.   E[ X_i (Y_i − X_i′ β̆) ] = 0 ,   where ε̆_i = Y_i − X_i′ β̆

⇒ β̆ = E(X_i X_i′)^{−1} E(X_i Y_i)

Y_i = X_i′ β̆ + ε̆_i
ε̆_i : ● population regression residual
      ● constructed as ε̆_i = Y_i − X_i′ β̆ , “no life of its own”

Interpretation of β̆?
(Angrist/Pischke notation: β̃ = b, β̆ = β, ε̆_i = e_i)

23
For one constant and single regressor

X_i = (1, X_i2)′ ,   β̆ = (β̆_1 , β̆_2)′

⇒ β̆ = E(X_i X_i′)^{−1} E(X_i Y_i) = ( E(Y_i) − β̆_2 E(X_i2) ,  Cov(Y_i, X_i2) / Var(X_i2) )′


Population regression = linear projection

PRC = projection coefficients

24
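A minimal R sketch in the style of the practical parts (not from the slides): the population regression coefficients β̆ = E(X_iX_i′)^{−1}E(X_iY_i) are approximated by their sample analogues on a large simulated sample; the data-generating process and all parameter values are hypothetical.

```r
# Approximate population regression coefficients by sample moments (simulated data).
set.seed(1)
n  <- 1e6
x2 <- rnorm(n, mean = 2, sd = 1)
y  <- 1 + 0.5 * x2 + 0.3 * x2^2 + rnorm(n)        # E(Y|X) is nonlinear in x2

X <- cbind(1, x2)                                  # regressor vector with a constant
b_breve <- solve(crossprod(X), crossprod(X, y))    # sample analogue of E(XX')^{-1} E(XY)

# Bivariate formulas from the slide: slope = Cov(Y, X2)/Var(X2),
# intercept = E(Y) - slope * E(X2)
slope     <- cov(y, x2) / var(x2)
intercept <- mean(y) - slope * mean(x2)
rbind(moment_formula = drop(b_breve), bivariate_formula = c(intercept, slope))
```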
“Regression anatomy” formula (Frisch-Waugh)

For X_i = (1, X_i2, ..., X_ik, ..., X_iK)′
and β̆ = (β̆_1, β̆_2, ..., β̆_k, ..., β̆_K)′

β̆_k = Cov(Y_i, X̆_ik) / Var(X̆_ik)   (bivariate regression)

X̆_ik : residual from the population regression of X_ik (dependent variable)
       on X_i.k (including a constant)
X_i.k : X_i without X_ik

X_ik = γ̆_.k′ X_i.k + X̆_ik   with   γ̆_.k = E(X_i.k X_i.k′)^{−1} E(X_i.k X_ik)

25
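A minimal R sketch of the regression-anatomy formula on simulated data (variable names and coefficients are hypothetical): the coefficient on a regressor in the long regression equals the bivariate coefficient of Y on that regressor's residual.

```r
# Frisch-Waugh / regression anatomy check on simulated data.
set.seed(2)
n  <- 1e5
x2 <- rnorm(n)
x3 <- 0.6 * x2 + rnorm(n)
y  <- 1 + 2 * x2 - 1.5 * x3 + rnorm(n)

full   <- lm(y ~ x2 + x3)
x3_res <- resid(lm(x3 ~ x2))                         # x3 purged of the other regressor(s)
c(long_regression   = unname(coef(full)["x3"]),
  anatomy_formula   = cov(y, x3_res) / var(x3_res))  # both approx -1.5
```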
Important laws
● Law of Total Expectation (LTE):

EX [EY ∣X (Y ∣X )] = EY (Y )

● Double Expectation Theorem (DET):

EX [EY ∣X (g(Y )∣X )] = EY [g(Y )]

● Law of Iterated Expectations (LIE):

EZ ∣X [EY ∣X ,Z (Y ∣X , Z )∣X ] = EY ∣X (Y ∣X )

● Generalized DET:

EX [EY ∣X (g(X , Y ))∣X ] = EX ,Y [g(X , Y )]

● Linearity of Conditional Expectations:

EY ∣X [g(X )Y ∣X ] = g(X )EY ∣X [Y ∣X ]

26
Justification C: linear cond. expectation function

CEF E(Yi ∣Xi ) = f (Xi ) [resp. f (xi )]

CEF decomposition property: orthogonal decomposition

Yi = E(Yi ∣Xi ) + ε∗i


ε∗i = Yi − E(Yi ∣Xi ) constructed; “no life of its own”
E(ε∗i ∣Xi ) =0 ⇒ E(ε∗i Xi ) =0

Marginal effect:

∂E(Y_i | X_i = x_i) / ∂x_ik   can be nonlinear in x_ik

(Angrist/Pischke notation: ε*_i = ε_i)

27
Justification C: linear cond. expectation function

Angrist/Pischke, Mostly Harmless Econometrics: An Empiricist’s Companion, 2008, p.31

28
Justification C: linear cond. expectation function

If E(Yi ∣Xi ) = Xi′ β ∗

with β ∗ = E(Xi Xi′ )−1 E(Xi Yi ) = β̆

Yi = Xi′ β ∗ + ε∗i

= Xi′ β̆ + ε̆i

ε∗i = ε̆i and β ∗ = β̆

E(ε∗i ∣Xi ) = 0 ⇒ E(ε∗i Xi ) = 0


E(ε∗i Xi ) = 0 ⇏ E(ε∗i ∣Xi ) = 0

Interpretation of the PRC as marginal effects: ∂E(Y_i | X_i = x_i)/∂x_ik = β*_k = β̆_k

29
Justification D: best approximation to nonlin. CEF

Best (smallest MSE) linear approximation to a nonlinear CEF:

E_{Y|X}(Y_i | X_i) ≈ X_i′ β̆ ,   β̆ = E(X_i X_i′)^{−1} E(X_i Y_i)

from

argmin_{β̃} E_X [ ( E_{Y|X}(Y_i | X_i) − X_i′ β̃ )² ]
            (the term in parentheses is the approximation error)

30
Justification D: best approximation to nonlin. CEF

Angrist/Pischke, Mostly Harmless Econometrics: An Empiricist’s Companion, 2008, p.39

31
Justification D: best approximation to nonlin. CEF

Again:
Y_i = X_i′ β̆ + ε̆_i   (population regression)

here: β̆ used in E_{Y|X}(Y_i | X_i) ≈ X_i′ β̆   [or x_i′ β̆ respectively]

∂E_{Y|X}(Y_i | X_i = x_i) / ∂x_ik ≈ β̆_k   (approximate marginal effect)

32
Justification E: Optimal prediction

Goal: minimize the MSE of the prediction of Y_i using X_i

argmin_{m(X_i)} E_{XY} [ ( Y_i − m(X_i) )² ]
                (m(X_i): function used to forecast Y_i; the term in parentheses is the forecast error)

mean-square optimal function m(X_i):

⇒ m(X_i) = E_{Y|X}(Y_i | X_i)

33
Justification E: Optimal prediction
If only linear m(X_i) are used:

β̆ = argmin_{β̃} E_{XY} [ ( Y_i − X_i′ β̃ )² ]
                (ε̃ : forecast error)

solution to the F.O.C. yields

β̆ = E(X_i X_i′)^{−1} E(X_i Y_i)   (PRC)

Y_i = X_i′ β̆ + ε̆_i as in B, but β̆ interpreted as “linear prediction coefficients”.

ε̆_i = Y_i − X_i′ β̆ : “orthogonal forecast error”

Y_i = linear prediction + orthogonal forecast error = X_i′ β̆ + ε̆_i

34
Justification F: (Rubin’s) causal model

Basic concepts and notation


Ci binary treatment indicator
Ci = 1 i th individual drawn received treatment (small class, studied
econometrics)
Ci = 0 i th individual drawn received no treatment
Yi observed (actual) outcome (SAT score, wage)
Y1i potential outcome if i received treatment
Y0i potential outcome if i received no treatment
Y1i − Y0i causal effect of treatment
Problem: either Y1i or Y0i observed - not both

35
Justification F: (Rubin’s) causal model

Y_i = Y_0i + (Y_1i − Y_0i) ⋅ C_i ,   where (Y_1i − Y_0i) is the (causal) treatment effect

Y_0i : “counterfactual” for a treated i

E(Y_i | C_i = 1) − E(Y_i | C_i = 0)                       (difference in population averages*)
   = E(Y_1i − Y_0i | C_i = 1)                             (average treatment effect on the treated, ATET**)
   + E(Y_0i | C_i = 1) − E(Y_0i | C_i = 0)                (selection bias***)

* estimated by group sample means
** causal effect, but unobserved
*** unobserved, since E(Y_0i | C_i = 1) is counterfactual

36
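A minimal R sketch with hypothetical potential outcomes that reproduces the decomposition above: the naive difference in group means equals the ATET plus selection bias.

```r
# Naive group comparison = ATET + selection bias (simulated potential outcomes).
set.seed(3)
n       <- 1e6
ability <- rnorm(n)
y0      <- 1 + ability + rnorm(n)               # potential outcome without treatment
y1      <- y0 + 0.5                             # constant treatment effect of 0.5
treat   <- as.integer(ability + rnorm(n) > 0)   # treatment more likely for high ability
y       <- y0 + (y1 - y0) * treat               # observed outcome

naive <- mean(y[treat == 1]) - mean(y[treat == 0])
atet  <- mean(y1[treat == 1] - y0[treat == 1])
bias  <- mean(y0[treat == 1]) - mean(y0[treat == 0])
c(naive = naive, atet = atet, selection_bias = bias, atet_plus_bias = atet + bias)
```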
Conditional independence assumption (CIA)

f (Y0i , Ci ∣Zi ) = f (Y0i ∣Zi ) ⋅ f (Ci ∣Zi )


f (Y1i , Ci ∣Zi ) = f (Y1i ∣Zi ) ⋅ f (Ci ∣Zi )

Y0i ⊥⊥ Ci ∣Zi

Y1i ⊥⊥ Ci ∣Zi

E(Y0i ∣Ci = 1, Zi ) = E(Y0i ∣Ci = 0, Zi ) = E(Y0i ∣Zi )



E(Y1i ∣Ci = 1, Zi ) = E(Y1i ∣Ci = 0, Zi ) = E(Y1i ∣Zi )

37
Conditional independence assumption (CIA)

E(Y_i | C_i = 1, Z_i) − E(Y_i | C_i = 0, Z_i)
   = E(Y_1i − Y_0i | C_i = 1, Z_i)                         (conditional average treatment effect on the treated, CATET, conditional on Z)
   + E(Y_0i | C_i = 1, Z_i) − E(Y_0i | C_i = 0, Z_i)       (= 0 because of the CIA)

38
CIA can remove selection bias: matching estimator

Group the sample by identical Z variables (ordinal); estimate the CATET by group
average differences of treated and untreated.
Compute the ATET by

E_{Z|C=1} ( E(Y_1i − Y_0i | C_i = 1, Z_i) | C_i = 1 )
   = E_{Z|C=1} ( E(Y_i | C_i = 1, Z_i) | C_i = 1 ) − E_{Z|C=1} ( E(Y_i | C_i = 0, Z_i) | C_i = 1 )
   = E(Y_1i − Y_0i | C_i = 1) = ATET   (by the LIE)

estimate by weighting the estimated CATETs by the group frequencies

(Angrist/Pischke notation: X instead of Z)

39
CIA can remove selection bias: matching estimator

Angrist/Pischke, Mostly Harmless Econometrics: An Empiricist’s Companion, 2008, p.73

40
Causal regression

Ysi = fi (S ) potential outcome


fi (S ) = α + ρS + ηi causal model

● Si : observed (chosen) treatment


● Yi : observed outcome
Yi = α + ρSi + ηi
● ρ: causal effect of increase of S by one unit

S ∈ {0, 1} ⇒ Y1i = α + ρ + ηi Y0i = α + ηi Y1i − Y0i = ρ

41
Causal regression

Yi = α + ρSi + ηi
Problem if ηi and Si correlated

Population regression:

Yi = β̆1 + β̆2 Si + ε̆i

Cov(Yi , Si )
β̆2 = ≠ρ
Var(Si )

(ε̆_i and S_i are uncorrelated by construction, i.e. E(S_i ε̆_i) = 0)

42
Causal regression

Ysi ⊥⊥ Si ∣Ai ∀ S (→ ηi ⊥⊥ Si ∣Ai and νi ⊥⊥ Si ∣Ai )

If CIA holds and E[ηi ∣Ai ] = γ̆ ′ Ai :

Ysi = fi (S ) = α + ρS + A′i γ̆ + νi
Yi = α + ρSi + A′i γ̆ + νi

Si ∶ years schooling
e.g.
Ai ∶ ability variables

population regression of Y_i on (S_i , A_i′)′ (long regression) has a causal interpretation.

43
Long and short regression and OVB

Short (population) regression: Yi on Si (only)


Cov(Yi , Si )
Yi = αs + ρs Si + ε̆i PRC: ρs =
Var(Si )
Insert Yi from “long regression” in numerator

⇒ ρs = ρ + γ̆ ′ δAS

Cov(Ai1 , Si ) Cov(Ai2 , Si ) Cov(AiM , Si )
δAS = ( , ,..., )
Var(Si ) Var(Si ) Var(Si )
● δAS : vector of slope coefficients of regression (including constant) of each
element of Ai on Si .
● A variable that affects Ysi (via ηi ) and that is correlated with Si should
be included in Ai . Else: Omitted Variable Bias (OVB)!
● Short population regression coefficient ρs = ρ only if γ̆ = 0 (control
variables don’t affect outcome) or δAS = 0 (control variables and Si
uncorrelated).

44
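A minimal R sketch of the OVB formula on simulated data (ability, schooling and all coefficients are hypothetical); the short-regression coefficient reproduces the long-regression coefficient plus γ̂ × δ̂_AS exactly in sample.

```r
# Omitted variable bias: short coefficient = long coefficient + gamma * delta_AS.
set.seed(4)
n <- 1e5
a <- rnorm(n)                        # "ability" (the omitted control)
s <- 0.8 * a + rnorm(n)              # schooling, correlated with ability
y <- 1 + 0.1 * s + 0.5 * a + rnorm(n)

long  <- lm(y ~ s + a)
short <- lm(y ~ s)
delta <- coef(lm(a ~ s))["s"]        # slope of the omitted regressor on s
c(short_coef = unname(coef(short)["s"]),
  ovb_formula = unname(coef(long)["s"] + coef(long)["a"] * delta))  # identical
```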
Long and short regression and OVB

Angrist/Pischke, Mostly Harmless Econometrics: An Empiricist’s Companion, 2008, p.62

45
Epistemological problems

● non-experimental data

● unobservable variables

● endogeneity

● causality

● simultaneity

46
Notation of the different justifications

1 population

justification of linear regression      Angrist/Pischke     Script / QM
A  structural model                     —                   β, γ, ε / β, δ, ε
B  population regression                β, e                β̆, ε̆
C  linear CEF                           β*, ε*              β*, ε*
D  best approx. to nonlinear CEF        β, e (+)            β̆, ε̆ (+)
E  optimal prediction                   β (+)               β̆, ε̆ (+, ++)

+ same as in population regression


++ referred to as “linear prediction coeff.”, “orthogonal forecast error”

47
Notation of the different justifications

2 objective function

justification of linear regression      Angrist/Pischke                          Script

B  population regression                argmin_{b} E[(Y − X′b)²]                 argmin_{β̃} E[(Y − X′β̃)²]

D  best approx. to nonlinear CEF        argmin_{b} E[(E(Y|X) − X′b)²]            argmin_{β̃} E_X[(E(Y|X) − X′β̃)²]
                                        E(Y_i|X_i) ≈ X_i′β                        E(Y_i|X_i) ≈ X_i′β̆

E  optimal prediction                   argmin_{m(X_i)} E[(Y_i − m(X_i))²]        argmin_{m(X_i)} E_{YX}[(Y_i − m(X_i))²]
                                        m(X_i) = X_i′β                            m(X_i) = X_i′β̆

48
2. Parameter Estimation

Hayashi p. 3-18

49
Change in Notation
● random variable Z_i :        Hayashi: z_i     Angrist & Pischke: Z_i     QM: Z_i

● realization of Z_i :         Hayashi: z_i     Angrist & Pischke: z_i     QM: z_i

● vectors of random variables: Hayashi: x_i = (x_i1, x_i2, ..., x_iK)′
                               Angrist & Pischke: X_i = (X_i1, X_i2, ..., X_iK)′
                               QM: x_i = (X_i1, X_i2, ..., X_iK)′
                               with a constant: X_i1 = 1 or x_i1 = 1

● parameter estimate:          Hayashi: b       Angrist & Pischke: β̂       QM: β̂

The script from here on uses Hayashi notation.

50
Linear regression model (CLRM) à la Hayashi

yi = β1 xi1 + β2 xi2 + ... + βK xiK + εi = x′i ⋅ β + εi


(1 × K ) (K × 1)

● yi : Dependent variable, observed


● xi = (xi1 , xi2 , ..., xiK )′ : Explanatory variables, observed
● β = (β1 , β2 , ..., βK )′ : Parameters
● εi : “Disturbance” component, unobserved
● b = (b1 , b2 , ..., bK )′ estimate of β
● ei = yi − x′i b: (estimated) residual

51
Introduction of matrix notation

For convenience we introduce matrix notation:

y = X β + ε ,   with y (n × 1), X (n × K), β (K × 1), ε (n × 1)

y = (y_1, y_2, ..., y_n)′ ,  β = (β_1, β_2, ..., β_K)′ ,  ε = (ε_1, ε_2, ..., ε_n)′ ,
and X is the data matrix with rows x_i′ = (1, x_i2, x_i3, ..., x_iK).

52
System of linear equations

written extensively:

y_1 = β_1 + β_2 x_12 + . . . + β_K x_1K + ε_1
y_2 = β_1 + β_2 x_22 + . . . + β_K x_2K + ε_2
⋮
y_n = β_1 + β_2 x_n2 + . . . + β_K x_nK + ε_n

53
Four classical assumptions

(1.1) Linearity: yi = x′i β + εi or y = Xβ + ε

(1.2) Strict exogeneity: E(εi ∣X) = 0


⇒ E(εi ) = 0 and Cov(εi , xik ) = E(εi xik ) = 0

(1.3) No exact multicollinearity: P(rank(X) = K ) = 1


⇒ No linear dependencies in the data matrix

(1.4) Spherical disturbances: Var(ε_i | X) = E(ε_i² | X) = σ²  and
      Cov(ε_i, ε_j | X) = E(ε_i ε_j | X) = 0 for i ≠ j
      ⇒ E(ε_i²) = σ² and Cov(ε_i, ε_j) = 0 by the LTE

54
(Somewhat sloppy) interpretations of parameters

Interpreting the parameters β of different types of linear equations


● Linear model yi = β1 + β2 xi2 + ... + βK xiK + εi : A one unit c.p. increase in
the variable xik increases the conditional expected value of the dependent
variable by βk units
● Semi-log model ln(yi ) = β1 + β2 xi2 + ... + βK xiK + εi : A one unit c.p.
increase in the variable xk increases the conditional expected value of the
dependent variable approximately by 100 ⋅ βk percent
● Log linear model ln(yi ) = β1 ln(xi1 ) + β2 ln(xi2 ) + ... + βK ln(xiK ) + εi : A
one percent increase in xik c.p. increases the conditional expected value
of the dependent variable yi approximately by βk percent

55
Estimation via minimization of SSR
We estimate the linear model and choose b such that SSR is minimized
Obtain an estimate b of β by minimizing the SSR (sum of squared residuals):

argmin_{β̃} SSR(β̃) = argmin_{β̃} Σ_{i=1}^n (y_i − x_i′ β̃)²

Differentiation with respect to β̃_1, β̃_2, ..., β̃_K ⇒ FOCs:

∂SSR(β̃)/∂β̃_1 = 0  ⇒  Σ (y_i − x_i′ b) = 0 ,   i.e.  (1/n) Σ e_i = 0
⋮
∂SSR(β̃)/∂β̃_K = 0  ⇒  Σ (y_i − x_i′ b) x_iK = 0 ,   i.e.  (1/n) Σ e_i x_iK = 0

⇒ The FOCs can be conveniently written in matrix notation as (1/n) X′e = 0

56
Estimation via minimization of SSR
The system of K equations is solved by matrix algebra

X′ e = X′ (y − Xb) = X′ y − X′ Xb = 0

Premultiplying by (X′ X)−1 :

(X′ X)−1 X′ y − (X′ X)−1 X′ Xb = 0

(X′ X)−1 X′ y − IK b = 0

OLS-estimator:

b = (X′ X)−1 X′ y

Alternatively:
b = ( (1/n) X′X )^{−1} (1/n) X′y = ( (1/n) Σ_{i=1}^n x_i x_i′ )^{−1} (1/n) Σ_{i=1}^n x_i y_i

57
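A minimal R sketch in the practical-part style (simulated data, hypothetical coefficients): computing b = (X′X)^{−1}X′y directly with matrix algebra and checking it against R's lm().

```r
# OLS estimator via matrix algebra versus lm().
set.seed(5)
n  <- 500
x2 <- rnorm(n)
x3 <- runif(n)
X  <- cbind(1, x2, x3)
y  <- drop(X %*% c(1, 2, -3)) + rnorm(n)

b_manual <- solve(crossprod(X), crossprod(X, y))   # (X'X)^{-1} X'y
b_lm     <- coef(lm(y ~ x2 + x3))
cbind(manual = drop(b_manual), lm = b_lm)          # identical up to numerical precision
```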
Zooming in

b = ( (1/n) X′X )^{−1} (1/n) X′y = ( (1/n) Σ_{i=1}^n x_i x_i′ )^{−1} (1/n) Σ_{i=1}^n x_i y_i

(1/n) Σ_{i=1}^n x_i x_i′ : K × K matrix of sample second moments of the regressors,
with typical element (1/n) Σ_i x_ik x_il (diagonal elements (1/n) Σ_i x_ik²)

(1/n) Σ_{i=1}^n x_i y_i : K × 1 vector of sample cross moments,
with typical element (1/n) Σ_i x_ik y_i

58
3. Finite sample properties of OLS

Hayashi p. 27-31

59
Finite sample properties of b = (X′ X)−1 X′ y

1 E(b) = β: Unbiasedness of OLS


● Holds for any sample size
● Holds under assumptions 1.1 - 1.3

2 Var(b∣X) = σ 2 (X′ X)−1 : Conditional variance of b


● Conditional variance depends on the data
● Holds under assumptions 1.1 - 1.4

3 Var(β̂∣X) ≥ Var(b∣X)
● β̂ is any other linear unbiased estimate of β
● Holds under assumptions 1.1 - 1.4

60
An important result from mathematical statistics

z = (z_1, z_2, ..., z_n)′ an (n × 1) random vector, A an (m × n) matrix of constants a_jk .

A new random vector: v = A z   (m × 1)

E(v) = ( E(v_1), E(v_2), ..., E(v_m) )′ = A E(z)

Var(v) = A Var(z) A′   (m × m)

61
Unbiasedness of OLS

E(b) = β ⇒ E(b − β) = 0
sampling error

b−β = (X′ X)−1 X′ y − β


= (X′ X)−1 X′ (Xβ + ε) − β
= (X′ X)−1 X′ Xβ + (X′ X)−1 X′ ε − β
= β + (X′ X)−1 X′ ε − β
= (X′ X)−1 X′ ε

⇒ E(b − β∣X) = (X′ X)−1 X′ E(ε∣X) = 0 under assumption 1.2

⇒ E(b∣X) = E(β∣X)

⇒ EX (E(b∣X)) = E(b) = EX (β) = β by the LTE

62
We show that Var(b∣X) = σ 2 (X′ X)−1

Var(b∣X) = Var(b − β∣X)


= Var((X′ X)−1 X′ ε∣X) = Var(Aε∣X)
= AVar(ε∣X)A′ = Aσ 2 In A′
= σ 2 AIn A′ = σ 2 AA′
= σ 2 (X′ X)−1 X′ X(X′ X)−1 = σ 2 (X′ X)−1

Note:
● β non-random
● b − β sampling error
● A = (X′ X)−1 X′
● Var(ε∣X) = σ 2 In

63
Sketch proof of the Gauss Markov theorem

Var(β̂∣X) = Var(β̂ − β∣X) = Var[(D + A)ε∣X]


= (D + A)Var(ε∣X)(D′ + A′ ) = σ 2 (D + A)(D′ + A′ )
= σ 2 (DD′ + AD′ + DA′ + AA′ ) = σ 2 [DD′ + (X′ X)−1 ]
≥ σ 2 (X′ X)−1 = Var(b∣X)

where
● C is a function of X
● β̂ = Cy
● D=C−A
● A ≡ (X′ X)−1 X′

Details of proof: Hayashi pages 29 - 30

64
OLS is BLUE

● OLS is linear
⇒ Holds under assumption 1.1

● OLS is unbiased
⇒ Holds under assumption 1.1 - 1.3

● OLS is the best (minimum variance) linear unbiased estimator

⇒ Holds under assumptions 1.1 - 1.4 (Gauss-Markov theorem): Var(β̂|X) ≥ Var(b|X)

65
4. Hypothesis Testing under Normality

Hayashi p. 33-45

66
Hypothesis testing

Economic theory provides hypotheses about parameters:


● theory ⇒ testable implications
● But: Hypotheses cannot be tested without distributional assumptions
about ε

Distributional assumption:

1.5 Conditional normality: ε∣X ∼ N (0, σ 2 In )

Normality assumption about the conditional distribution of ε∣X

67
Important facts from multivariate statistics

Vector of random variables: x = (x1 , x2 , ..., xn )′

Expectation vector:

E(x) = µ = (µ1 , µ2 , ..., µn )′ = (E(x1 ), E(x2 ), ..., E(xn ))′

Variance-covariance matrix:

⎛ Var(x1 ) Cov(x1 , x2 ) ... Cov(x1 , xn ) ⎞


⎜ Cov(x1 , x2 ) Var(x2 ) ⎟
Var(x) = Σ = ⎜ ⎟
⎜ ⋮ ⋱ ⋮ ⎟
⎝ Cov(x1 , xn ) ... Var(xn ) ⎠

y = c + Ax; c, A non-random vector/matrix


⇒ E(y) = (E(y1 ), E(y2 ), ..., E(yn ))′ = c + Aµ
⇒ Var(y) = AΣA′
⇒ x ∼ N (µ, Σ) ⇒ y = c + Ax ∼ N (c + Aµ, AΣA′ )

68
Apply facts from mult. statistics and A1.1 - A1.5

b−β = (X′ X)−1 X′ ε


²
sampling error

Assuming
ε∣X ∼ N (0, σ 2 In )
⎛ ⎞
⇒ b − β∣X ∼ N ⎜
⎜(X′
X)−1 ′
X E(ε∣X) , (X′
X) −1 ′ 2
X σ In X(X′
X) −1 ⎟

⎝ ´¹¹ ¹ ¹ ¹ ¸¹ ¹ ¹ ¹ ¹ ¶ ⎠
0

⇒ b − β∣X ∼ N (0, σ 2 (X′ X)−1 )

Note that Var(b∣X) = σ 2 (X′ X)−1

OLS-estimate is conditionally normally distributed if ε∣X is multivariate normal.

69
Testing individual parameters (t-Test)

Null hypothesis: H0 ∶ βk = β̄k

Alternative hypothesis: HA ∶ βk ≠ β̄k

where β̄k is a hypothesized value, a real number

If H0 is true, then E(bk ) = β̄k and the test statistic

t_k = (b_k − β̄_k) / √( σ² [(X′X)^{−1}]_kk )  ∼ N(0, 1)

[(X′X)^{−1}]_kk : k-th row, k-th column element of (X′X)^{−1}

70
Nuisance parameter σ 2

Nuisance parameter σ 2 can be estimated:

σ 2 = E(ε2i ∣X) = Var(εi ∣X) = E(ε2i ) = Var(εi )

We don’t know εi but we use the estimate ei = yi − x′i b

σ̂² = (1/n) Σ_{i=1}^n ( e_i − (1/n) Σ_{i=1}^n e_i )² = (1/n) Σ_{i=1}^n e_i² = (1/n) e′e

σ̂² is a biased estimate:

E(σ̂² | X) = ((n − K)/n) σ²

71
An unbiased estimate of σ 2

For s² = (1/(n − K)) Σ_{i=1}^n e_i² = (1/(n − K)) e′e we get an unbiased estimate:

E(s² | X) = (1/(n − K)) E(e′e | X) = σ²

E_X ( E(s² | X) ) = E(s²) = σ²

Using s² for σ² provides an unbiased estimate of Var(b|X) = σ² (X′X)^{−1} :

V̂ar(b|X) = s² (X′X)^{−1}

⇒ t-statistic under H0:

t_k = (b_k − β̄_k) / √( [V̂ar(b|X)]_kk ) = (b_k − β̄_k) / s.e.(b_k)  ∼ t(n − K)

72
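A minimal R sketch (simulated data, hypothetical coefficients) that reproduces s², the standard errors and the t-statistics reported by summary(lm()) from the formulas above.

```r
# s^2, standard errors and t-statistics by hand.
set.seed(6)
n  <- 200
x2 <- rnorm(n)
y  <- 1 + 0.5 * x2 + rnorm(n)
X  <- cbind(1, x2)
K  <- ncol(X)

b  <- solve(crossprod(X), crossprod(X, y))
e  <- y - X %*% b
s2 <- sum(e^2) / (n - K)                   # unbiased estimate of sigma^2
V  <- s2 * solve(crossprod(X))             # estimated Var(b|X)
se <- sqrt(diag(V))
t  <- drop(b) / se                         # t-statistics for H0: beta_k = 0
cbind(estimate = drop(b), std_error = se, t_value = t)
# compare: summary(lm(y ~ x2))$coefficients
```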
Decision rule for the t-test

1 H0 ∶ βk = β̄k (often β̄k = 0)


HA ∶ βk ≠ β̄k
bk −β̄k
2 Given β̄k , OLS-estimate bk and s 2 , we compute tk = s.e.(bk )

3 Fix significance level α of two-sided test


4 Fix non-rejection and rejection regions ⇒ decision

Remark:

√( σ² [(X′X)^{−1}]_kk ) : standard deviation of b_k | X

√( s² [(X′X)^{−1}]_kk ) : standard error of b_k | X

73
Duality of t-test and confidence interval

Under H0 : β_k = β̄_k

t_k = (b_k − β̄_k) / s.e.(b_k) ∼ t(n − K)

Probability for non-rejection:

P( −t_{α/2}(n − K) ≤ t_k ≤ t_{α/2}(n − K) ) = 1 − α

−t_{α/2}(n − K) : lower critical value
t_{α/2}(n − K) : upper critical value
t_k : random variable (value of the test statistic)
1 − α : fixed number

⇒ P( b_k − s.e.(b_k) t_{α/2}(n − K) ≤ β̄_k ≤ b_k + s.e.(b_k) t_{α/2}(n − K) ) = 1 − α

74
1 − α confidence interval for βk

P (bk − s.e.(bk )t α2 (n − K ) ≤ βk ≤ bk + s.e.(bk )t α2 (n − K )) = 1 − α

● b_k − s.e.(b_k) t_{α/2}(n − K) : lower bound
● b_k + s.e.(b_k) t_{α/2}(n − K) : upper bound
● Confidence bounds are random variables!
● H0 : β_k = β̄_k is rejected at significance level α if β̄_k is NOT within the bounds of
the 1 − α confidence interval.
● H0 : β_k = β̄_k cannot be rejected at significance level α for all values β̄_k inside
the 1 − α confidence interval.
● Beware of the wrong interpretation: “The true parameter β_k lies with
probability 1 − α within the bounds of the confidence interval” is downright wrong!

75
Testing joint hypotheses (F -test/Wald test)

Write hypothesis as:

H0 ∶ R β = r
(#r × K) (K × 1) (#r × 1)

R: matrix of real numbers


r: vector of real numbers
#r: number of restrictions

Replacing β = (β1 , β2 , ..., βk )′ by estimate b = (b1 , b2 , ..., bK )′ :

Rb − r should be close to 0

76
Wald/F -test statistic
Distributional properties of R b:

R E(b∣X) = Rβ [= r only if H0 is true]

R Var(b∣X)R′ = Rσ 2 (X′ X)−1 R′

Rb∣X ∼ N (Rβ, R σ 2 (X′ X)−1 R′ )

Using additional facts from multivariate statistics


● z = (z1 , z2 , ..., zm )′ ∼ N (µ, Ω)
● ⇒ (z − µ)′ Ω−1 (z − µ) ∼ χ2 (m)
Result applied: Wald statistic under H0

(Rb − r)′ (σ 2 R(X′ X)−1 R′ )−1 (Rb − r) ∼ χ2 (#r)

77
Distributional properties

Replace σ² by its unbiased estimate s² = (1/(n − K)) Σ_{i=1}^n e_i² = (1/(n − K)) e′e and
divide by #r:

⇒ F-ratio:

F = [ (Rb − r)′ [R(X′X)^{−1}R′]^{−1} (Rb − r) / #r ] / [ (e′e)/(n − K) ]

  = (Rb − r)′ [R V̂ar(b|X) R′]^{−1} (Rb − r) / #r   ∼ F(#r, n − K)

Note: F -test is one-sided

Proof: see Hayashi p. 41

78
Decision rule of the F -test

1 Specify H0 in the form Rβ = r and HA ∶ Rβ ≠ r.

2 Calculate F -statistic.

3 Look up entry in the table of the F -distribution for #r and n − K at


given significance level.

4 Null is not rejected on the significance level α for F less than


Fα (#r, n − K )

79
Alternative representation of the F -statistic

Minimization of the unrestricted sum of squared residuals:

n
→ ∑(yi − x′i b)2 ⇒ SSRU
i=1

Minimization of the restricted sum of squared residuals:

n
→ ∑(yi − x′i bR )2 ⇒ SSRR
i=1

(Note that Hayashi uses β̃ for bR )

F -ratio:
(SSRR − SSRU )/#r
F=
SSRU /(n − K )

80
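A minimal R sketch of the restricted/unrestricted-SSR form of the F-statistic on simulated data (the null hypothesis and coefficient values are hypothetical); anova() on the two nested fits reports the same statistic.

```r
# F-statistic from restricted and unrestricted SSR for H0: beta2 = beta3 = 0.
set.seed(7)
n  <- 300
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 1 + 0.4 * x2 + 0.0 * x3 + rnorm(n)

unres <- lm(y ~ x2 + x3)             # unrestricted model, K = 3
restr <- lm(y ~ 1)                   # restricted model under H0 (constant only)
SSR_U <- sum(resid(unres)^2)
SSR_R <- sum(resid(restr)^2)
num_r <- 2                           # number of restrictions
F_stat <- ((SSR_R - SSR_U) / num_r) / (SSR_U / (n - 3))
c(F_statistic = F_stat, critical_5pct = qf(0.95, num_r, n - 3))
# anova(restr, unres) reports the same F statistic
```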
5. Goodness-of-fit measures

Hayashi p. 38/20

81
Coefficient of determination: uncentered R 2

Measure of the variability of the dependent variable: ∑ yi2 = y′ y

Orthogonal decomposition of y = ŷ + e:

y′ y = (ŷ + e)′ (ŷ + e)


= ŷ′ ŷ + 2ŷ′ e + e′ e
= ŷ′ ŷ + e′ e

⇒ R²_uc ≡ 1 − (e′e)/(y′y)

A good model explains much and therefore the residual variation is very small
compared to the explained variation.

82
Coefficient of determination: centered R 2

Use a centered R 2 if there is a constant in the model (xi1 = 1)

Σ_{i=1}^n (y_i − ȳ)² = Σ_{i=1}^n (ŷ_i − ȳ)² + Σ_{i=1}^n e_i²

⇒ R²_c ≡ 1 − Σ_{i=1}^n e_i² / Σ_{i=1}^n (y_i − ȳ)² ≡ 1 − SSR/SST

Note that R²_uc and R²_c both lie in the interval [0, 1] but describe different
models. They are not comparable!

83
Model selection criteria
adjusted R²_adj :

R²_adj = 1 − [SSR/(n − K)] / [SST/(n − 1)] = 1 − (n − 1)/(n − K) ⋅ SSR/SST

Akaike criterion (AIC):

AIC = log(SSR/n) + 2K/n

Schwarz criterion (SBC):

SBC = log(SSR/n) + log(n) K/n

Note:
● penalty term for heavy parametrization
● Select the model with the smallest AIC/SBC, highest R²_adj

84
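A minimal R sketch (simulated data) computing adjusted R², AIC and SBC as defined above for two competing models; these definitions differ from R's built-in AIC()/BIC() output by constants and a scaling with n, but for a fixed sample they should rank models the same way.

```r
# Model selection criteria as defined on the slide (crit() is a helper defined here).
set.seed(8)
n  <- 250
x2 <- rnorm(n)
x3 <- rnorm(n)                       # irrelevant regressor
y  <- 1 + 0.6 * x2 + rnorm(n)

crit <- function(fit, y) {
  K   <- length(coef(fit))
  SSR <- sum(resid(fit)^2)
  SST <- sum((y - mean(y))^2)
  c(adjR2 = 1 - (n - 1) / (n - K) * SSR / SST,
    AIC   = log(SSR / n) + 2 * K / n,
    SBC   = log(SSR / n) + log(n) * K / n)
}
rbind(small = crit(lm(y ~ x2), y),
      large = crit(lm(y ~ x2 + x3), y))   # the small model should win on AIC/SBC
```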
6. Large Sample Theory and OLS

Hayashi p. 88-97/109-133

85
Basic concepts of large sample theory

Using large sample theory we can dispense with basic assumptions from finite
sample theory:
● 1.2 E(εi ∣X) = 0:
strict exogeneity
● 1.4 Var(ε∣X) = σ 2 I:
homoscedasticity
● 1.5 ε∣X ∼ N (0, σ 2 In ):
conditional normality

Asymptotic distribution of b, and t- and the F -statistic can be obtained.

86
Modes of convergence

Modes of convergence:

● Convergence in probability: →_p
● Convergence almost surely: →_a.s.
● Convergence in mean square: →_m.s.
● Convergence in distribution: →_d

{z_n} : sequence of random variables
{z_n} : sequence of random vectors

87
Convergence in probability

Convergence in probability:

A sequence {z_n} converges in probability to a constant α if for any ε > 0

lim_{n→∞} P( |z_n − α| > ε ) = 0

Short-hand we write: plim z_n = α  or  z_n →_p α  or  z_n − α →_p 0

Extends to random vectors:

If lim_{n→∞} P( |z_nk − α_k| > ε ) = 0   ∀ k = 1, 2, ..., K

then z_n →_p α (element-wise convergence).

88
Convergence almost surely

Convergence almost surely:

A sequence {z_n} converges almost surely to a constant α if

P( lim_{n→∞} z_n = α ) = 1

Short-hand we write: z_n →_a.s. α.

Extends to random vectors:

If P( lim_{n→∞} z_nk = α_k ) = 1   ∀ k = 1, 2, ..., K

then z_n →_a.s. α (element-wise convergence).

89
Convergence in mean square and distribution

Convergence in mean square:

lim_{n→∞} E[ (z_n − α)² ] = 0   or   z_n →_m.s. α

Convergence in mean square implies convergence in probability.

Convergence in distribution:

z_n →_d z

if the c.d.f. of z_n , F_{z_n} , converges to the c.d.f. of z , F_z , at each point of
continuity.

Convergence in mean square, in probability, almost surely, and in distribution
extend to random vectors.

90
Khinchin’s Weak Law of Large Numbers (WLLN)

If {z_i} is i.i.d. with E(z_i) = µ < ∞,

then for z̄_n = (1/n) Σ_{i=1}^n z_i it holds that

z̄_n →_p µ

or lim_{n→∞} P( |z̄_n − µ| > ε ) = 0

or plim z̄_n = µ

91
WLLN: extensions

● Extension (1): Multivariate WLLN:


Sequence of random vectors {zi }

● Extension (2): Relaxation of independence

● Extension (3): Functions of random variables h(zi )

● Extension (4): Vector valued functions f (zi )

92
Central Limit Theorems (Lindeberg-Levy)

If {z_i} is i.i.d. with E(z_i) = µ and Var(z_i) = σ², and z̄_n = (1/n) Σ_{i=1}^n z_i →_p µ, then

√n ( z̄_n − µ ) →_d y ∼ N(0, σ²)

or z̄_n − µ ∼a N(0, σ²/n)

or z̄_n ∼a N(µ, σ²/n)

Remark: read “∼a” as ‘approximately distributed as’.
The CLT also holds for the multivariate extension: sequence of random vectors {z_i}.

93
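A minimal R Monte Carlo sketch of the Lindeberg-Levy CLT; the choice of an Exp(1) population (mean 1, variance 1) is only for illustration.

```r
# CLT illustration: sqrt(n)*(mean - mu) from a skewed population behaves like N(0, sigma^2).
set.seed(9)
n      <- 200
reps   <- 10000
mu     <- 1
sigma2 <- 1                                       # Exp(1): mean 1, variance 1
zbar   <- replicate(reps, mean(rexp(n, rate = 1)))
stat   <- sqrt(n) * (zbar - mu)
c(mean = mean(stat), var = var(stat))             # approx 0 and sigma2
# hist(stat, freq = FALSE); curve(dnorm(x, 0, sqrt(sigma2)), add = TRUE)
```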
Useful lemmas: Continuous Mapping Theorem

Lemma 1: Hayashi 2.3(a)

a(⋅) : R^K → R^M

If z_n →_p α and a is a continuous function that does not depend on n, then:

a(z_n) →_p a(α)   or   plim a(z_n) = a(plim z_n) = a(α)

Examples:
● x_n →_p α ⇒ ln(x_n) →_p ln(α)
● x_n →_p β and y_n →_p γ ⇒ x_n + y_n →_p β + γ
● Y_n →_p Γ ⇒ Y_n^{−1} →_p Γ^{−1}

94
Useful lemmas: Continuous Mapping Theorem

Lemma 2: Hayashi 2.3(b)

If z_n →_d z, then:

a(z_n) →_d a(z)

Example:
● z_n →_d z ∼ N(0, 1) ⇒ z_n² →_d z² ∼ χ²(1)

95
Useful lemmas: Slutzky Theorem

Lemma 3: Hayashi 2.4(a)

If x_n →_d x and y_n →_p α, then:

x_n + y_n →_d x + α

Examples:
● x_n →_d N(0, 1), y_n →_p α ⇒ x_n + y_n →_d N(α, 1)
● x_n →_d x, y_n →_p 0 ⇒ x_n + y_n →_d x

Lemma 4: Hayashi 2.4(b)

If x_n →_d x and y_n →_p 0, then:

x_n ⋅ y_n →_p 0

96
Useful lemmas: Slutzky Theorem

Lemma 5: Hayashi 2.4(c)

If x_n →_d x and A_n →_p A, then:

A_n x_n →_d A x

Example:
● x_n →_d N(0, Σ) ⇒ A_n x_n →_d N(0, A Σ A′)

Lemma 6: Hayashi 2.4(d)

If x_n →_d x and A_n →_p A, then:

x_n′ A_n^{−1} x_n →_d x′ A^{−1} x

97
Large sample assumptions for OLS

Using Hayashi’s numbering (see pp. 109-113):

(2.1) Linearity: yi = x′i β + εi ∀ i = 1, 2, ..., n

(2.2) and (2.5) assumptions regarding dependence of {yi , xi }

(2.3) Orthogonality / predetermined regressors: E(x_ik ⋅ ε_i) = 0 ∀ k = 1, ..., K
      If 1 ∈ x_i ⇒ E(ε_i) = 0 ⇒ Cov(x_ik, ε_i) = 0 ∀ k = 1, ..., K

(2.4) Rank condition: E(x_i x_i′) ≡ Σ_XX (K × K) is non-singular

98
Large sample distribution of OLS estimator

We get for b = (X′X)^{−1} X′y:

b_n = [ (1/n) Σ_{i=1}^n x_i x_i′ ]^{−1} (1/n) Σ_{i=1}^n x_i y_i

(the subscript n indicates the dependence on the sample size)

Using a WLLN and Lemma 1:

● b_n →_p β
● √n (b_n − β) →_d N(0, AVar(b))   or   b ∼a N(β, AVar(b)/n)

⇒ b_n is consistent and asymptotically normal (CAN).

99
bn = (X′ X)−1 X′ y is consistent

b_n = [ (1/n) Σ_{i=1}^n x_i x_i′ ]^{−1} (1/n) Σ_{i=1}^n x_i y_i

⇒ b_n − β = [ (1/n) Σ x_i x_i′ ]^{−1} (1/n) Σ x_i ε_i   (sampling error)

We show: b_n →_p β

When the sequence {y_i, x_i} allows application of a WLLN:

⇒ (1/n) Σ_{i=1}^n x_i x_i′ →_p E(x_i x_i′)

⇒ (1/n) Σ_{i=1}^n x_i ε_i →_p E(x_i ε_i) = 0

100
bn = (X′ X)−1 X′ y is consistent

Lemma 1 (CMT) implies:

⇒ [ (1/n) Σ_{i=1}^n x_i x_i′ ]^{−1} →_p [E(x_i x_i′)]^{−1}

b_n − β = [ (1/n) Σ x_i x_i′ ]^{−1} (1/n) Σ x_i ε_i
        →_p E(x_i x_i′)^{−1} E(x_i ε_i) = E(x_i x_i′)^{−1} ⋅ 0 = 0

b_n = (X′X)^{−1} X′y is consistent.

101
bn = (X′ X)−1 X′ y is asymptotically normal
The sequence {g_i} = {x_i ε_i} allows applying a CLT to ḡ = (1/n) Σ x_i ε_i :

√n ( ḡ − E(g_i) ) →_d N(0, E(g_i g_i′))

√n (b_n − β) = [ (1/n) Σ x_i x_i′ ]^{−1} √n ḡ

Applying Lemma 5:

A_n = [ (1/n) Σ x_i x_i′ ]^{−1} →_p A = Σ_xx^{−1}

x_n = √n ḡ →_d x ∼ N(0, E(g_i g_i′))

⇒ √n (b_n − β) →_d A x ∼ N(0, Σ_xx^{−1} E(g_i g_i′) Σ_xx^{−1})

⇒ b_n is CAN

102
White standard errors

Adjust the test statistics to make them robust against violations of
conditional homoskedasticity.

t-statistic:

t_k = (b_k − β̄_k) / √( [ ((1/n) Σ x_i x_i′)^{−1} ((1/n) Σ e_i² x_i x_i′) ((1/n) Σ x_i x_i′)^{−1} / n ]_kk )   ∼a N(0, 1)

holds under H0 : β_k = β̄_k

Wald statistic:

W = (Rb − r)′ [ R (AV̂ar(b)/n) R′ ]^{−1} (Rb − r)   ∼a χ²(#r)

holds under H0 : Rβ − r = 0; allows for linear restrictions on β

103
How to estimate AVar(b)

AVar(b) = Σ_xx^{−1} E(g_i g_i′) Σ_xx^{−1}   with g_i = x_i ε_i

(1/n) Σ_{i=1}^n x_i x_i′ →_p E(x_i x_i′)

Estimation of E(g_i g_i′):   Ŝ = (1/n) Σ e_i² x_i x_i′ →_p E(g_i g_i′)

⇒ AV̂ar(b) = [ (1/n) Σ x_i x_i′ ]^{−1} Ŝ [ (1/n) Σ x_i x_i′ ]^{−1}
           →_p AVar(b) = E(x_i x_i′)^{−1} E(g_i g_i′) E(x_i x_i′)^{−1}

104
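A minimal R sketch (simulated heteroskedastic data) that builds the White/robust standard errors from the AVar(b) estimate above and compares them with the conventional OLS standard errors.

```r
# White (HC0) standard errors from the sandwich formula above.
set.seed(10)
n  <- 1000
x2 <- rnorm(n)
y  <- 1 + 0.5 * x2 + rnorm(n, sd = exp(0.5 * x2))   # Var(eps|x) depends on x

X  <- cbind(1, x2)
K  <- ncol(X)
b  <- solve(crossprod(X), crossprod(X, y))
e  <- drop(y - X %*% b)

Sxx_inv  <- solve(crossprod(X) / n)                 # [1/n sum x_i x_i']^{-1}
S_hat    <- crossprod(X * e) / n                    # 1/n sum e_i^2 x_i x_i'
avar_b   <- Sxx_inv %*% S_hat %*% Sxx_inv           # estimated AVar(b)
se_white <- sqrt(diag(avar_b / n))

s2     <- sum(e^2) / (n - K)
se_ols <- sqrt(diag(s2 * solve(crossprod(X))))      # conventional standard errors
cbind(ols = se_ols, white = se_white)
```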
Testing with conditional homoskedasticity

Develop a test statistic under the assumption of conditional homoskedasticity.

Assumption: E(ε_i² | x_i) = σ²

AV̂ar(b) = [ (1/n) Σ x_i x_i′ ]^{−1} σ̂² (1/n) Σ x_i x_i′ [ (1/n) Σ x_i x_i′ ]^{−1}
         = σ̂² [ (1/n) Σ x_i x_i′ ]^{−1}

with σ̂² = (1/n) Σ_{i=1}^n e_i²

Note: (1/n) Σ e_i² is a biased but consistent estimate of σ²

105
7. Time Series Basics
(Stationarity and Ergodicity)

Hayashi p. 97-107

106
Time series dependence

Certain degree of dependence in the data in time series analysis; only one
realization of the data generating process is given.

CLT and WLLN rely on i.i.d. data, but dependence in real world data.

Examples:
● Inflation rate
● Stock market returns

Stochastic process: sequence of r.v.s. indexed by time {z1 , z2 , z3 , ...} or {zi }


with i = 1, 2, ...

A realization/sample path: One possible outcome of the process

107
Parallel worlds: Ensemble means

If we were able to ‘run the world several times’, we had different realizations of
the process at one point in time.

⇒ We could compute ensemble means and apply the WLLN.

As the described repetition is not possible, we take the mean over the one
realization of the process.

Key question:

Does (1/T) Σ_{t=1}^T x_t →_p E(X) hold?

Condition: Stationarity and ergodicity of the process

108
Stationarity restricts the heterogeneity of a s.p.

Strict stationarity:

The joint distribution of zi , zi1 , zi2 , ..., zir depends only on the relative position
i1 − i, i2 − i, ..., ir − i but not on i itself.

In other words: The joint distribution of (zi , zir ) is the same as the joint
distribution of (zj , zjr ) if i − ir = j − jr .

Weak stationarity:

● E(zi ) does not depend on i


● Cov(zi , zi−j ) depends on j (distance), but not on i (absolute position)

109
Ergodicity restricts memory of stochastic process

A stationary process is called ergodic if

lim_{n→∞} | E[ f(z_i, z_{i+1}, ..., z_{i+k}) ⋅ g(z_{i+n}, z_{i+n+1}, ..., z_{i+n+l}) ] |
   = | E[ f(z_i, z_{i+1}, ..., z_{i+k}) ] | ⋅ | E[ g(z_{i+n}, z_{i+n+1}, ..., z_{i+n+l}) ] |

Ergodic Theorem:

If the sequence {z_i} is stationary and ergodic with E(z_i) = µ, then

z̄_n = (1/n) Σ_{i=1}^n z_i →_p µ

110
Martingale difference sequence

Stationarity and Ergodicity are not enough for applying a CLT. To derive the
CAN property of OLS we assume:

{gi } = {xi εi }

is a stationary and ergodic martingale difference sequence (m.d.s.):

E(gi ∣gi−1 , gi−2 , ..., g1 ) = 0


⇒ E(gi ) = 0 (LTE).

Implications of m.d.s. when 1 ∈ xi :


εi and εi−j are uncorrelated, i.e. Cov(εi , εi−j ) = 0

111
8. Generalized Least Squares

Hayashi p. 54-58

112
GLS Assumptions

(1.1) Linearity: yi = x′i β + εi


(1.2) Strict exogeneity: E(εi ∣X) = 0
⇒ E(εi ) = 0 and Cov(εi , xik ) = E(εi xik ) = 0
(1.3) Full rank: P(rank(X) = K ) = 1

Relaxing assumption (1.4): the conditional variance-covariance matrix of ε is now a
general n × n matrix with diagonal elements Var(ε_i|X) and off-diagonal elements
Cov(ε_i, ε_j|X):

⇒ Var(ε|X) = E(εε′|X) = σ² V(X)

NOT (as in 1.4): Var(ε|X) = σ² I_n

113
Generalized Least Squares (GLS)

GLS estimator derived under the assumption that V(X) is known, symmetric,
and positive definite

Let V(X)−1 = C′ C

Transformation of y = Xβ + ε: Premultiplying with C

Cy = CXβ + Cε

ỹ = X̃β + ε̃

where ỹ = Cy, X̃ = CX, and ε̃ = Cε

114
Least squares estimation of β (transformed data)

β̂_GLS = (X̃′X̃)^{−1} X̃′ỹ
      = (X′C′CX)^{−1} X′C′Cy
      = ( (1/σ²) X′V(X)^{−1}X )^{−1} (1/σ²) X′V(X)^{−1}y
      = [ X′[Var(ε|X)]^{−1}X ]^{−1} X′[Var(ε|X)]^{−1}y

GLS is the best linear unbiased estimator (BLUE)

Problems:
● Difficult to work out the asymptotic properties of β̂ GLS
● In real world applications Var(ε∣X) not known
● If Var(ε∣X) is estimated the BLUE-property of β̂ GLS is lost

115
Special case of GLS - weighted least squares

E(εε′|X) = Var(ε|X) = σ² V(X) with V(X) diagonal:
V(X) = diag( V_1(X), V_2(X), ..., V_n(X) )

As V(X)^{−1} = C′C:

⇒ C = diag( 1/√V_1(X), 1/√V_2(X), ..., 1/√V_n(X) ) = diag( 1/s_1, 1/s_2, ..., 1/s_n )

⇒ β̂_GLS = argmin_{β̃} Σ_{i=1}^n ( y_i/s_i − β̃_1 (1/s_i) − β̃_2 (x_i2/s_i) − ... − β̃_K (x_iK/s_i) )²

Observations are inversely weighted by their standard deviations.

116
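A minimal R sketch of this weighted-least-squares special case on simulated data with an assumed known skedastic function V_i(x): dividing each observation by s_i and running OLS is equivalent to lm() with weights 1/V_i.

```r
# Weighted least squares: transform-and-OLS versus the weights argument of lm().
set.seed(11)
n  <- 1000
x2 <- runif(n, 1, 5)
Vi <- x2^2                                    # assumed known skedastic function V_i(x)
y  <- 1 + 2 * x2 + rnorm(n, sd = sqrt(Vi))

si    <- sqrt(Vi)
b_gls <- coef(lm(I(y / si) ~ 0 + I(1 / si) + I(x2 / si)))   # OLS on transformed data
b_wls <- coef(lm(y ~ x2, weights = 1 / Vi))                  # built-in equivalent
rbind(transformed = b_gls, weights_arg = b_wls)              # both approx (1, 2)
```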
9. Multicollinearity

117
Exact multicollinearity

Expressing a regressor as linear combination of (an)other regressor(s)


● rank(X) ≠ K : No full rank
● ⇒ Assumption 1.3 or 2.4 is violated
● (X′ X)−1 does not exist

Often economic variables are correlated to some degree


● BLUE result is not affected
● Large sample results are not affected

118
Effects of multicollinearity and solutions

Effects:
● Coefficients may have high standard errors and low significance levels
● Estimates may have the wrong sign
● Small changes in the data produce wide swings in the parameter estimates

Solutions:
● Increase precision by adding more data. (Costly!)
● Build a better fitting model that leaves less unexplained.
● Exclude some regressors. (Dangerous! Omitted variable bias!)

119
10. Endogeneity

Hayashi p. 186-196

120
Omitted variable bias (OVB)

Correctly specified model:

y = X1 β 1 + X2 β 2 + ε

Regression of y on X1
⇒ X2 gets into disturbance term
⇒ Omitted variable bias

b1 = (X′1 X1 )−1 X′1 y


= (X′1 X1 )−1 X′1 (X1 β 1 + X2 β 2 + ε)
= β 1 + (X′1 X1 )−1 X′1 X2 β 2 + (X′1 X1 )−1 X′1 ε

OLS is biased:
● If β 2 ≠ 0 ⇒ (X′1 X1 )−1 X′1 X2 β 2 ≠ 0
● If (X′1 X1 )−1 X′1 X2 ≠ 0 ⇒ (X′1 X1 )−1 X′1 X2 β 2 ≠ 0

121
Endogeneity bias: Working example

Simultaneous equations model of market equilibrium (structural form):

q_i^d = α_0 + α_1 p_i + u_i
q_i^s = β_0 + β_1 p_i + v_i

Markets clear: q_i^d = q_i^s

It is not possible to estimate α_0, α_1, β_0, β_1 by regressing quantity on price, as we
do not know whether changes in the market equilibrium are due to supply or demand
shocks.
We observe many equilibria, but we cannot recover the slopes of the demand and the
supply curves from these data alone.
Endogeneity: correlation between disturbances and regressors; the regressors are
not predetermined.
Here: simultaneous equations bias

122
From structural form to reduced form

Solving for q_i and p_i yields the reduced form:

p_i = (β_0 − α_0)/(α_1 − β_1) + (v_i − u_i)/(α_1 − β_1)

q_i = (α_1 β_0 − α_0 β_1)/(α_1 − β_1) + (α_1 v_i − β_1 u_i)/(α_1 − β_1)

Price is a function of the two disturbance terms:

● v_i : supply shifter
● u_i : demand shifter

Calculating the covariance of p_i and the demand shifter u_i :

⇒ Cov(p_i, u_i) = − Var(u_i) / (α_1 − β_1)

123
With endogeneity OLS is not consistent

The FOCs in the simple regression context yield:

α̂_1 = [ (1/n) Σ (q_i − q̄)(p_i − p̄) ] / [ (1/n) Σ (p_i − p̄)² ] →_p Cov(p_i, q_i) / Var(p_i)

But here: Cov(p_i, q_i) = α_1 Var(p_i) + Cov(p_i, u_i)

⇒ Cov(p_i, q_i) / Var(p_i) = α_1 + Cov(p_i, u_i) / Var(p_i) ≠ α_1

⇒ OLS is not consistent

The same holds for β_1

124
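A minimal R sketch of the simultaneity problem: simulating the market model with hypothetical parameter values and regressing equilibrium quantity on price does not recover the demand slope α_1.

```r
# Simultaneous equations bias: OLS on equilibrium data is inconsistent.
set.seed(12)
n  <- 1e5
a0 <- 10; a1 <- -1       # demand: q = a0 + a1*p + u
b0 <-  2; b1 <-  1       # supply: q = b0 + b1*p + v
u  <- rnorm(n)
v  <- rnorm(n)

p  <- (b0 - a0) / (a1 - b1) + (v - u) / (a1 - b1)   # reduced form for price
q  <- a0 + a1 * p + u                               # equilibrium quantity

c(ols_slope = unname(coef(lm(q ~ p))["p"]),          # not close to a1 = -1
  plim      = a1 + cov(p, u) / var(p))               # matches the probability limit above
```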
Instruments for the market model

Properties of the instruments:

● Uncorrelated with the disturbances; instruments are predetermined
● Correlated with the endogenous regressors

Cov(x_i, p_i) = ( β_2 / (α_1 − β_1) ) Var(x_i)
Cov(x_i, u_i) = 0

⇒ x_i is an instrument for p_i ⇒ yields a new reduced form
(x_i enters the supply equation with coefficient β_2 ; ζ_i denotes the supply shock)

⇒ p_i = (β_0 − α_0)/(α_1 − β_1) + (β_2/(α_1 − β_1)) x_i + (ζ_i − u_i)/(α_1 − β_1)
⇒ q_i = (α_1 β_0 − α_0 β_1)/(α_1 − β_1) + (α_1 β_2/(α_1 − β_1)) x_i + (α_1 ζ_i − β_1 u_i)/(α_1 − β_1)

Cov(x_i, q_i) = (α_1 β_2/(α_1 − β_1)) Var(x_i) = α_1 Cov(x_i, p_i)
⇒ α_1 = Cov(x_i, q_i) / Cov(x_i, p_i)

by the WLLN: α̂_1 →_p α_1

125
A basic macroeconomic model: Haavelmo (1943)

● Aggregate consumption function: C_i = α_0 + α_1 Y_i + u_i
● GDP identity: Y_i = C_i + I_i
● Y_i affects C_i, but at the same time C_i influences Y_i
● Reduced form: Y_i = α_0/(1 − α_1) + (1/(1 − α_1)) I_i + u_i/(1 − α_1)

⇒ C_i cannot be regressed on Y_i, as the regressor is correlated with the
disturbance:

Cov(Y_i, u_i) = Var(u_i)/(1 − α_1) > 0

⇒ OLS is inconsistent: upward biased

Cov(C_i, Y_i)/Var(Y_i) = α_1 + Cov(Y_i, u_i)/Var(Y_i) ≠ α_1

Valid instrument for income Yi : investment Ii

126
Errors in variables

Explanatory variable is measured with error (e.g. reporting errors)

Classical example: Friedman’s permanent income hypothesis

Permanent consumption is proportional to permanent income: C_i* = k Y_i*

Observed variables:
● Yi = Yi∗ + yi
● Ci = Ci∗ + ci
● ci = kyi + ui

Endogeneity due to measurement errors

⇒ Solution: IV

127
11. Instrumental Variables

Hayashi p. 186-196, Angrist and Pischke p. 114-133, 138-140


(Note that Hayashi uses x for instruments and z for regressors and
δ instead of β.)

128
Solution for endogeneity problem: IV

Linear regression:
yi = x′i β + εi
But the assumption of predetermined regressors does not hold:

E(xi εi ) ≠ 0

⇒ For consistency, instrumental variables zi are needed:

z_i = (z_i1, z_i2, ..., z_iL)′

Every element of z_i is correlated with the endogenous regressors but uncorrelated
with the disturbance term.

129
IV Assumptions

(3.1) Linearity yi = x′i β + εi


(3.2) Ergodic stationarity
- L instruments zi
- K regressors xi
- Data sequence wi ≡ {yi , zi , xi } is stationary and ergodic
(3.3) Orthogonality conditions:

E(z_i1 (y_i − x_i′β)) = 0
E(z_i2 (y_i − x_i′β)) = 0
⋮
E(z_iL (y_i − x_i′β)) = 0

⇒ E(z_i (y_i − x_i′β)) = E(z_i ε_i) = 0

130
IV Assumptions

(3.4) Rank condition for identification: rank(Σ_ZX) = K with

Σ_ZX = E(z_i x_i′) the L × K matrix with typical element E(z_il x_ik)

K = L ⇒ Σ_ZX^{−1} exists.

131
Deriving the IV-estimator (K = L)

E(z_i ε_i) = E( z_i (y_i − x_i′β) ) = 0

E(z_i y_i) − E(z_i x_i′) β = 0

β = [E(z_i x_i′)]^{−1} E(z_i y_i)

β̂_IV = [ (1/n) Σ z_i x_i′ ]^{−1} [ (1/n) Σ z_i y_i ]

If K = L the rank condition implies that Σ_ZX^{−1} exists and the system is exactly
identified.

132
Deriving the IV-estimator (K = L)

Applying the WLLN, the CLT and the useful lemmas, it can be shown that the IV
estimate β̂_IV is CAN:

β̂_IV →_p β

√n (β̂_IV − β) →_d N( 0, [E(z_i x_i′)]^{−1} E(ε_i² z_i z_i′) [[E(z_i x_i′)]^{−1}]′ )

AV̂ar(β̂_IV) = [ (1/n) Σ z_i x_i′ ]^{−1} (1/n) Σ e_i² z_i z_i′ [ [(1/n) Σ z_i x_i′]^{−1} ]′

with e_i = y_i − x_i′ β̂_IV

V̂ar(β̂_IV) = AV̂ar(β̂_IV) / n

133
IV in the context of causal model

y_si = f_i(s) = α + ρ s + η_i
y_i = α + ρ S_i + γ′A_i + ν_i ,   where γ′A_i + ν_i = η_i

CIA ⇒ OLS delivers consistent estimates of ρ: “selection on observables” (A_i).

What if A_i is not observed?

Solution: instrumental variables estimation ⇒ z_i uncorrelated with η_i, but
correlated with S_i.

134
IV in the context of causal model

“exclusion restriction”
yi = α + ρSi + ηi
figure out:

Cov(yi , zi ) = E[yi zi ] − E[zi ]E[yi ]


= E[(α + ρSi + ηi )zi ] − E[α + ρSi + ηi ]E[zi ]
= αE[zi ] + ρE[Si zi ] + E[ηi zi ] − αE[zi ]
− ρE[Si ]E[zi ] − E[ηi ]E[zi ]
= ρCov(Si , zi ) + Cov(ηi , zi )

135
IV in the context of causal model

ρ = [ Cov(y_i, z_i) / Var(z_i) ] / [ Cov(S_i, z_i) / Var(z_i) ]
    (PR of y_i on z_i divided by PR of S_i on z_i)

ρ̂ = [ sample Cov(y_i, z_i) / sample Var(z_i) ] / [ sample Cov(S_i, z_i) / sample Var(z_i) ]

136
IV in the context of causal model

Formulate problem into IV framework

η̃i = ηi − E(ηi )
yi = α̃ + ρSi + η̃i

with α̃ = α + E(ηi )

and E(η˜i ) = 0

Cov(ηi , zi ) = Cov(η̃i , zi ) = E(η̃i zi ) = 0

137
Special case of IV

z_i = (1, z_i)′ ,  x_i = (1, S_i)′ ,  β = (α̃, ρ)′ ,  ε_i = η̃_i ,  y_i = y_i

β = [E(z_i x_i′)]^{−1} E(z_i y_i) = (α̃, ρ)′

ρ = Cov(y_i, z_i) / Cov(S_i, z_i)

α̃ = E(y_i) − ρ E(S_i)

β̂ = [ (1/n) Σ z_i x_i′ ]^{−1} (1/n) Σ z_i y_i = (α̃̂, ρ̂)′

138
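A minimal R sketch of this special case on simulated data (all coefficients hypothetical): the ratio of sample covariances recovers ρ while OLS does not.

```r
# Simple IV (Wald) estimator: rho_hat = Cov(y, z) / Cov(S, z).
set.seed(13)
n   <- 1e5
eta <- rnorm(n)
z   <- rnorm(n)                              # instrument, independent of eta
S   <- 1 + 0.8 * z + 0.7 * eta + rnorm(n)    # endogenous regressor
y   <- 2 + 0.5 * S + eta                     # true rho = 0.5

c(ols = unname(coef(lm(y ~ S))["S"]),        # biased upward
  iv  = cov(y, z) / cov(S, z))               # approx 0.5
```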
Causal model with covariates

y_i = α′x_i + ρ S_i + η_i

Consider two population regressions (PR):

S_i = x_i′ π_10 + π_11 z_i + ξ_1i
y_i = x_i′ π_20 + π_21 z_i + ξ_2i

Use Frisch-Waugh ⇒ π_11 = Cov(z̃_i, S_i)/Var(z̃_i) ,  π_21 = Cov(z̃_i, y_i)/Var(z̃_i)

π_21 / π_11 = Cov(z̃_i, y_i) / Cov(z̃_i, S_i) = ρ

139
An alternative view: Two-Stage-Least Squares

Insert: y_i = α′x_i + ρ [ x_i′ π_10 + π_11 z_i + ξ_1i ] + η_i

y_i = [α′ + ρ π_10′] x_i + ρ π_11 z_i + ρ ξ_1i + η_i
      (= π_20′ x_i  +  π_21 z_i  +  ξ_2i)

Empirical strategy:

1 Run a regression of S_i on x_i and z_i ⇒ estimates of π_10 and π_11 ⇒ π̂_10 , π̂_11
  ⇒ compute Ŝ_i = π̂_10′ x_i + π̂_11 z_i

2 Run the regression y_i = α′x_i + ρ Ŝ_i + ξ_2i

140
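A minimal R sketch of the two-stage strategy on simulated data (all coefficients hypothetical). The manual second stage recovers ρ; for correct standard errors one would use a dedicated IV routine such as AER::ivreg() rather than the naive second-stage output.

```r
# Two-stage least squares by hand.
set.seed(14)
n   <- 1e5
x   <- rnorm(n)                              # exogenous covariate
eta <- rnorm(n)
z   <- rnorm(n)                              # instrument
S   <- 0.5 * x + 0.8 * z + 0.7 * eta + rnorm(n)
y   <- 1 + 0.3 * x + 0.5 * S + eta           # true rho = 0.5

stage1 <- lm(S ~ x + z)                      # first stage: S on x and z
S_hat  <- fitted(stage1)
stage2 <- lm(y ~ x + S_hat)                  # second stage: y on x and fitted S
coef(stage2)["S_hat"]                        # approx 0.5
```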
3rd view IV

y_i = α′x_i + ρ S_i + η_i

Instrument z_i :  E(η_i z_i) = 0 ,  E(η_i x_i) = 0

Redefine:  z_i = (x_i′, z_i)′ ,  x_i = (x_i′, S_i)′ ,  β = (α′, ρ)′

η_i = y_i − α′x_i − ρ S_i   (original)
    = y_i − x_i′β           (redefined)

⇒ E(z_i η_i) = 0 (redefined)

141
3rd view IV

⇒ E(z_i (y_i − x_i′β)) = 0 :

β = [E(z_i x_i′)]^{−1} E(z_i y_i) = (α, ρ)′

β̂ = [ (1/n) Σ z_i x_i′ ]^{−1} (1/n) Σ z_i y_i = (α̂, ρ̂)′

⇒ 2SLS, ILS, IV are identical here!

⇒ use IV inference!

142
Hayashi in a nutshell

143
Hayashi in a nutshell

144
Hayashi in a nutshell

145
