
Block 1

Repetition from BSc courses


LRM estimators & non-linear extensions
Predictions from regression models

Advanced econometrics 1 4EK608


Pokročilá ekonometrie 1 4EK416

Vysoká škola ekonomická v Praze


Outline

1 Estimation methods, predictions from a model


Ordinary least squares
General properties of estimators
Method of moments
Maximum likelihood estimator

2 Non-linear extensions to LRM, quantile regression


Non-linear regression models
Quantile regression

3 Predictions from a regression model


Predictions from a CLRM (repetition from BSc courses)
Predictions: general features, kFCV, Variance vs. Bias
Linear regression model (LRM) and OLS estimation

y = Xβ + ε
LRM assumptions (for OLS estimation):
(Notation follows Greene, Econometric analysis, 7th ed.)

A1 Linearity: yi = β1 + β2 xi2 + · · · + βK xiK + εi


LRM describes linear relationship between yi and xi .
A2 Full rank: Matrix X is an n×K matrix with rank K.
Columns of X are linearly independent and n ≥ K.
A3 Exogeneity of regressors: E[εi |X] = 0 (strict form).
If relaxed to contemporaneous form in TS: E[εt |xt ] = 0.
Law of iterated expectations: E[εi |X] = 0 ⇒ E[ε] = 0.
Linear regression model (LRM) and OLS estimation

y = Xβ + ε
LRM assumptions (continued):
A4 Homoscedastic & nonautocorrelated disturbances:
E[εε′|X] = σ²In
Homoscedasticity: var[εi|X] = σ², ∀ i = 1, . . . , n.
Independent disturbances: cov[εt, εs|X] = 0, ∀ t ≠ s.
GARCH-type models [e.g. ARCH(1): var[εt|εt−1] = σ² + αε²t−1]
do not violate the conditional variance assumption
var[εi|X] = σ². However, var[εt|εt−1] ≠ var[εt], with
conditioning on X omitted from the notation but left implicit.
A5 DGP of X: Variables in X may be fixed or random.
A6 Normal distribution of disturbances:
ε|X ∼ N [0, σ 2 In ].
Ordinary least squares (OLS)
y = Xβ + ε
The least squares estimator is unbiased (given A1 – A3):

β̂ = b = (X′X)⁻¹X′y = β + (X′X)⁻¹X′ε,
take expectations:
E[b|X] = β + E[(X′X)⁻¹X′ε|X] = β, (the second term is zero by A3).
Variance of the least squares estimator (A1 – A4):
var[b|X] = var[(X′X)⁻¹X′ε|X],
because var(β) = 0. Using A3 & A4:
= A σ²In A′, where A = (X′X)⁻¹X′,
the matrix analogue of var(cZ) = c² var(Z),
= σ²(X′X)⁻¹,
because (AB)′ = B′A′ for dimensionally compatible matrices A, B.
Normal distribution of the least squares estimator (A1 – A6):
b|X ∼ N[β, σ²(X′X)⁻¹].
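
A minimal numerical sketch of the matrix formulas above, on simulated data (the DGP, sample size and variable names are illustrative assumptions, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(42)
n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])  # intercept + 2 regressors
beta_true = np.array([1.0, 0.5, -2.0])
y = X @ beta_true + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                    # b = (X'X)^{-1} X'y
resid = y - X @ b
s2 = resid @ resid / (n - K)             # unbiased estimate of sigma^2
var_b = s2 * XtX_inv                     # estimated var[b|X] = s^2 (X'X)^{-1}
print(b, np.sqrt(np.diag(var_b)))        # point estimates and standard errors
```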
General properties of estimators

Estimators and estimation methods:

LRM is not the only type of regression model.

OLS is not the only useful estimator.

Let’s approach estimators and their properties more


generally.

(again, notation follows Greene, Econometric analysis.)


Estimators and estimation methods
Notation/definitions:
xj = (x1j, . . . , xnj)′ – a random sample of n observations.
θ – population parameter [unknown parameter(s)]
f(xj, θ): probability distribution function
θ̂ is some estimator of θ
Basic notions:
All estimators have sampling distributions
mean: E(θ̂)
variance: E[(θ̂ − E(θ̂))²], etc.
Estimators vs. estimates
Generally, many estimators exist for a given parameter.
Population mean example:

θ̂1 = x̄ = (1/n) Σ_{i=1}^{n} xi   (the sample mean)

θ̂2 = x̃ = (x_max + x_min)/2   (the midrange)
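
A small simulation sketch comparing the two estimators above (the uniform DGP, seed and sample size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)      # true population mean = 5
theta_1 = x.mean()                   # sample mean
theta_2 = 0.5 * (x.max() + x.min())  # midrange estimator
print(theta_1, theta_2)              # two different estimates of the same parameter
```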
Properties of estimators - classification:

Unbiasedness: can be described as E(θ̂) = θ.

Occasionally useful – in a finite (small) sample context.
Asymptotic unbiasedness (large sample property):
not very useful on its own; the discussion is better directed towards consistency
(which is a far more desirable feature).

Consistency: plim(θ̂) = θ.
As n → ∞, θ̂ becomes an (at least asymptotically) unbiased estimator of θ and
plim(var(θ̂)) = 0 [i.e. var(θ̂) → 0 as n → ∞].
Consistent estimators: unbiased (or at least asymptotically
unbiased) & their variance shrinks to zero as sample size
grows (entire population is used).
Minimal requirement for estimator used in statistics or
econometrics.
If some estimator is not consistent, then it does not provide
relevant estimates of population θ values, even with
unlimited data, i.e. as n → ∞.
Unbiased estimators are not necessarily consistent.
Properties of estimators - classification:

Efficiency: an estimator is efficient if it is unbiased and no


other unbiased estimator has a smaller variance. This is often
difficult to prove, so we usually simplify the concept to
relative efficiency (e.g. efficiency with respect to linear
unbiased estimators, etc.).
Asymptotic efficiency: holds for an estimator that is
asymptotically unbiased and no other asymptotically
unbiased estimator has smaller asymptotic variance.

Normality, asymptotic normality: basis for most


statistical inference performed with common estimators.
Estimators and estimation methods

Extremum estimator: obtained as the optimizer of some


criterion function q(θ|data). Most common estimators:
LS: θ̂LS = argmax [ −(1/n) Σ_{i=1}^{n} (yi − h(xi, θLS))² ],

ML: θ̂ML = argmax [ (1/n) Σ_{i=1}^{n} log f(yi|xi, θML) ],

GMM: θ̂GMM = argmax [ −m(data, θGMM)′ W m(data, θGMM) ],

where h(·) is a function (linear/non-linear → OLS/NLS),


f (·) is a probability density function (pdf),
m denotes sample moments,
W is a convenient positive definite matrix.
LS and ML estimators belong to a class of M estimators
(type of extremum estimators where objective function is a sample average).
Estimators and estimation methods

Assumptions for asymptotic properties of extremum estimators:

1 Parameter space: must be convex and the parameter


vector that is the object of estimation must be a point in its
interior. Gaps and nonconvexities in parameter spaces
would generally collide with estimation algorithms
(settings such as σ 2 ≥ 0 are OK).

2 Criterion function: must be concave in the parameters


(concave in the neighborhood of the true parameter
vector). Criterion functions need not be globally concave.
In such a situation, there may be multiple local optima
(often associated with poor model specification).
Estimators and estimation methods
Assumptions for asymptotic properties of extremum estimators:

3 Identifiability of parameters: has a relatively complex


technical definition (anything like “true parameters θ0 are
identified if...” is problematic - leads to a paradox if condition is
not met). Simple way to secure identification:

LS: for any two different parameter vectors θ
and θ0, a vector of observations xi must exist (for some i)
leading to different conditional mean functions (ŷi).
ML: For any two parameter vectors θ ≠ θ0, a data vector
(yi, xi) must exist which generates different values of the
density function: f(yi|xi, θ) ≠ f(yi|xi, θ0).
Note: identifiability does not rule out the possibility of
f(yi|xi, θ) = f(yℓ|xℓ, θ), where yi = yℓ, xi ≠ xℓ.
GMM: sufficient condition for identification:
E[m(data, θ)] ≠ 0 if θ ≠ θ0.
Estimators and estimation methods

Assumptions for asymptotic properties of extremum estimators:

4 Behavior of the data: Grenander conditions for


well-behaved data:
G1 For each column xk of X and d²nk = xk′xk = Σ_{i=1}^{n} x²ik,
it must hold that lim_{n→∞} d²nk = +∞.
The sum of squares continues to grow with the sample size, i.e. xk
does not degenerate into a sequence of zeros.
G2 lim_{n→∞} x²ik/d²nk = 0 for all i = 1, 2, . . . , n. Single
observations become less important as the sample size grows.
No single observation will dominate xk′xk.
G3 Let Cn be the sample correlation matrix of the columns of X
(excluding the intercept, if present). Then lim_{n→∞} Cn = C,
where C is positive definite. This implies that the full rank
condition for X (A2) is not asymptotically violated.
Estimators and estimation methods

Quick convergence recap (terminology):

Convergence in probability: a sequence of random variables
X1, X2, X3, . . . converges in probability to a random
variable X, denoted as Xn →ᵖ X [or plim(Xn) = X], if:

lim_{n→∞} P(|Xn − X| ≥ ε) = 0, ∀ ε > 0.

Convergence in distribution: a weaker type of convergence.
It states that the CDF of Xn converges to the CDF of X as
n goes to infinity (does not require dependency between
Xn and X). Xn →ᵈ X, if:

lim_{n→∞} F_Xn(x) = F_X(x), at points where F_X(x) is continuous.
Estimators and estimation methods

Theorem: Consistency of M estimators


If:
(a) the parameter space is convex and the true parameter
vector is a point in its interior,
(b) the criterion function is concave,
(c) the parameters are identified by the criterion function,
(d) the data are well behaved,
then the M estimator converges in probability to the true
parameter vector.
Estimators and estimation methods

Theorem: Asymptotic normality of M estimators


If:
(a) θ̂ is a consistent estimator of θ0 where θ0 is a point in the
interior of the parameter space Θ,

(b) q(θ|data) is concave and twice continuously differentiable in θ in
a neighborhood of θ0,

(c) √n [∂q(θ0|data)/∂θ0] →ᵈ N(0, Φ),

(d) lim_{n→∞} Pr[ |∂²q(θ|data)/∂θk∂θm − hkm(θ)| > ε ] = 0 ∀ε > 0 for
any θ in Θ; hkm(θ) is a continuous, finite-valued function of θ,

(e) the matrix of elements H(θ) is nonsingular at θ0,

then √n(θ̂ − θ0) →ᵈ N{0, [H⁻¹(θ0) Φ H⁻¹(θ0)]},

where Φ is a variance-covariance matrix,
and H(θ0) = ∂²q(θ|data)/∂θ∂θ′ is the Hessian (evaluated at θ0).
Method of moments

Method of moments (MM)

Generalized method of moments (GMM)


Method of moments

With the method of moments, we simply estimate


population moments by corresponding sample moments.

Under very general conditions, sample moments are


consistent estimators of the corresponding population
moments, but NOT necessarily unbiased estimators.

Application example 1
Sample covariance is a consistent estimator of population
covariance.

Application example 2
OLS estimators we have used for parameters in the CLRM can
be derived by the method of moments.
Method of moments

Method of moments (MM)


Population moments for a stochastic variable X
E(X r ): rth population moment about zero
E(X): population mean: 1st population moment about zero
E[(X − E(X))2 ]: population variance is the second moment
about the mean

Sample moments for sample observations (x1 , x2 , . . . , xn )


(1/n) Σ_{i=1}^{n} xi^r : rth sample moment about zero
(1/n) Σ_{i=1}^{n} xi = x̄ : sample mean is the first moment about zero
[1/(n−1)] Σ_{i=1}^{n} (xi − x̄)² : sample variance is the second sample moment
about the mean
Method of moments

For MM, the usual linear model assumption (concerning


1st population moment) E[xi εi ] = 0 implies:

E[xi(yi − xi′β)] = 0,

which constitutes a population moment equation:

E[xi(yi − xi′β)] = E[m(β)] = 0,

and the corresponding sample (empirical) moment equation
can be formalized as:

(1/n) Σ_{i=1}^{n} xi(yi − xi′β̂) = m(β̂) = 0.
Method of moments
For a LRM with K regressors, MM sample equations can be
cast as:
(1/n) Σ_{i=1}^{n} (yi − β̂1 − β̂2 xi2 − · · · − β̂K xiK) = 0
(1/n) Σ_{i=1}^{n} xi2 (yi − β̂1 − β̂2 xi2 − · · · − β̂K xiK) = 0
...
(1/n) Σ_{i=1}^{n} xiK (yi − β̂1 − β̂2 xi2 − · · · − β̂K xiK) = 0

Removing the 1/n terms from the equations does not affect the solution.
This is a system of K equations with K unknown parameters βj.
The set of moment equations is equivalent to the 1st order conditions for
the OLS estimator:

min_{β̂} Σ_{i=1}^{n} (yi − β̂1 − β̂2 xi2 − · · · − β̂K xiK)²
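
A brief sketch illustrating that the root of the K sample moment equations coincides with the OLS solution of the normal equations (simulated data; all names and parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)

b_mm = np.linalg.solve(X.T @ X, X.T @ y)   # solves X'X b = X'y
print(X.T @ (y - X @ b_mm) / n)            # sample moments are (numerically) zero at b_mm
```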
Generalized method of moments

GMM is a very general class of estimators, includes many


other estimators as a special case (IVR, simultaneous
equations, Arellano-Bond estimator for dynamic panels).

For single equation linear models, GMM may be


conveniently described using the instrumental variable case:

For the LRM yi = xi′β + εi,

we abandon the assumption E[xi εi] = 0 and
we replace it by E[zi εi] = 0.
Hence, columns of X (n×K) are potentially endogenous
and Z (n×L) is a matrix of exogenous instruments.
Generalized method of moments

GMM equation (matrix form) can be cast by analogy to


the MM case:

we start by E[zi εi ] = 0, which implies a population


moment equation:

E[zi(yi − xi′β)] = E[m(β)] = 0,

and the corresponding sample (empirical) moment equation:

(1/n) Σ_{i=1}^{n} zi(yi − xi′β̂) = m(β̂) = 0.
Generalized method of moments
The equation form of GMM empirical equations can be
produced as:
(1/n) Σ_{i=1}^{n} (yi − β̂1 − β̂2 xi2 − · · · − β̂K xiK) = 0
(1/n) Σ_{i=1}^{n} zi2 (yi − β̂1 − β̂2 xi2 − · · · − β̂K xiK) = 0
...
(1/n) Σ_{i=1}^{n} ziL (yi − β̂1 − β̂2 xi2 − · · · − β̂K xiK) = 0

First column of Z is assumed to be a vector of ones (same as for X).


For Z = X as a special case, the above equations are identical to MM
(shown previously) and the solution is identical to the OLS estimator:
β̂ = (X′X)⁻¹X′y.
For Z ≠ X, where Z is (n×L) and X is (n×K), three identification
possibilities have to be considered.
Generalized method of moments
Identification of GMM equations

1 Underidentified: with L < K, there are fewer moment


equations than unknown parameters (βj ). Without
additional information (parameter restrictions), there is no
solution to the system of GMM equations.

2 Exactly identified: for L = K, single solution exists:


(1/n) Σ_{i=1}^{n} zi(yi − xi′β̂) = m(β̂) = 0,

can be conveniently re-written as:

m(β̂) = (1/n) Z′y − (1/n) Z′X β̂ = 0

and the solution yields the familiar IV estimator:
β̂ = (Z′X)⁻¹Z′y.
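
A sketch of the exactly identified (L = K) case, β̂ = (Z′X)⁻¹Z′y, on a simulated DGP with one endogenous regressor and one instrument (all names and parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
z = rng.normal(size=n)                    # exogenous instrument
v = rng.normal(size=n)
x = 0.7 * z + v                           # regressor, endogenous through v
eps = 0.8 * v + rng.normal(size=n)        # error correlated with x
y = 1.0 + 2.0 * x + eps

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)      # (Z'X)^{-1} Z'y
b_ols = np.linalg.solve(X.T @ X, X.T @ y)     # inconsistent under endogeneity
print(b_iv, b_ols)
```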
Generalized method of moments

Identification of GMM equations (continued)

3 With L > K, there is no unique solution to the equation
system m(β̂) = 0.
One intuitive solution is the “least squares approach”:

min_β [ m(β̂)′ m(β̂) ]

Through the first order conditions, we obtain a GMM
estimator as:

β̂ = [(X′Z)(Z′X)]⁻¹ (X′Z)Z′y.
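
A companion sketch for the overidentified case (L > K) with W = I; the two-instrument simulated DGP below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
z1, z2 = rng.normal(size=n), rng.normal(size=n)
v = rng.normal(size=n)
x = 0.6 * z1 + 0.4 * z2 + v                  # endogenous regressor
y = 1.0 + 2.0 * x + (0.8 * v + rng.normal(size=n))

X = np.column_stack([np.ones(n), x])         # K = 2
Z = np.column_stack([np.ones(n), z1, z2])    # L = 3 > K
XtZ = X.T @ Z
b_gmm = np.linalg.solve(XtZ @ Z.T @ X, XtZ @ Z.T @ y)   # [(X'Z)(Z'X)]^{-1}(X'Z)Z'y
print(b_gmm)
```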

Generalized method of moments
GMM - consistency conditions
Convergence of the moments: Empirical (sample)
moments converge in probability to their population
counterparts. DGP meets the conditions for LLN.
m(β) = (1/n)(Z′y − Z′Xβ) →ᵖ 0.
Identification: For any n ≥ K and β1 ≠ β2 it holds that
m(β1) ≠ m(β2). Three implications:
Order condition: L ≥ K. Number of moment equations
at least as large as number of parameters.
Rank condition: matrix G(β) = ∂m(β)/∂β′ (i.e. (1/n)Z′X)
is an L×K matrix with rank equal to K.
Uniqueness: unique solution/optimizer exists.

Limiting Normal distribution for the sample


moments: Population moments obey central limit
theorem (CLT) or some similar variant.
Generalized method of moments
GMM - final remarks & summary
GMM-based asymptotic covariance matrix of β̂ is discussed
in Greene (Econometric analysis, ch. 13.6) for the classical,
heteroscedastic and generalized case (includes TS-based
estimation).
GMM is robust to differences in “specification” of the data
generating process (DGP). → i.e. sample mean or sample
variance estimate their population counterparts (assuming
they exist) regardless of DGP.
GMM is free from distributional assumptions. “Cost” of
this approach: if we know the specific distribution of a
DGP, GMM does not make use of such information →
inefficient estimates.
Alternative approach: method of maximum likelihood
utilizes distributional information and is more efficient
(provided this information is available & valid).
Maximum likelihood estimator

Maximum likelihood estimator (MLE)

Normal distribution & MLE


Maximum likelihood estimator
Maximum likelihood estimator – single parameter
For a stochastic variable y with a known distribution, described
by a single θ parameter:
f (y|θ) is the pdf of y, conditioned on parameter θ.
For n iid observations, joint density of this process:
f(y1, y2, . . . , yn|θ) = Π_{i=1}^{n} f(yi|θ) = L(θ|y)
is the likelihood function.
We estimate θ by maximizing L(θ|y) with respect to the
parameter (1st order conditions). Solution (MLE) often
denoted as θ̂ML .
For maximization (MLE), it is usually simpler to work with
a log-transformed likelihood function:
log L(θ|y) = Σ_{i=1}^{n} log f(yi|θ).
Maximum likelihood estimator

MLE – Poisson distribution example


Consider 10 iid observations from a Poisson distribution:
y′ = (5, 0, 1, 1, 0, 3, 2, 3, 4, 1).
The pdf: f(yi|λ) = e^(−λ) λ^(yi) / yi!.
Likelihood function: L(λ|y) = Π_{i=1}^{n} e^(−λ) λ^(yi) / yi! = e^(−10λ) λ^(Σ_{i=1}^{10} yi) / Π_{i=1}^{10} yi!.
Log-likelihood: log L(λ|y) = −nλ + log λ · Σ_{i=1}^{n} yi − Σ_{i=1}^{n} log(yi!).
1st order condition: ∂ log L(λ|y)/∂λ = −n + (1/λ) Σ_{i=1}^{n} yi = 0.

From the 1st order condition: λ̂ML = ȳ (the sample mean).

For our empirical example: λ̂ML = 2.
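
A numerical check of the Poisson example: maximizing the log-likelihood above reproduces the analytic solution λ̂ML = ȳ = 2 (using scipy here is an implementation choice, not part of the slides):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

y = np.array([5, 0, 1, 1, 0, 3, 2, 3, 4, 1])

def neg_loglik(lam):
    # -log L(lambda|y) = n*lambda - log(lambda)*sum(y) + sum(log(y_i!))
    return len(y) * lam - np.log(lam) * y.sum() + gammaln(y + 1).sum()

res = minimize_scalar(neg_loglik, bounds=(1e-6, 20), method="bounded")
print(res.x, y.mean())   # both approximately 2.0
```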
Maximum likelihood estimator

Maximum likelihood estimator – vector of parameters

θ = (θ1 , . . . , θm )0

L = L(θ1 , θ2 , ...θm |y1 , y2 , ..., yn )

We find MLEs of the m parameters by partially


differentiating the likelihood function L (often, log L is
used) with respect to each θ and then setting all the partial
derivatives obtained to zero.
Maximum likelihood estimator

MLE – Normal distribution


L(θ|data) = L(β, σ²|yi, xi) = Π_{i=1}^{n} (2πσ²)^(−1/2) exp[ −(yi − xi′β)² / (2σ²) ]

In matrix form, the log-likelihood function is:
LL(β, σ²|y, X) = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) (y − Xβ)′(y − Xβ)

Recall that:
(y − Xβ)′(y − Xβ) = y′y − 2β′X′y + β′X′Xβ
and
∂(y − Xβ)′(y − Xβ)/∂β′ = −2X′y + 2X′Xβ.
Maximum likelihood estimator
MLE – Normal distribution (continued)
LL(β, σ²|y, X) = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) (y − Xβ)′(y − Xβ)

1st order conditions:

∂LL/∂β′ = (1/(2σ²)) [2X′y − 2X′Xβ] = 0
is solved by:
β̂ = (X′X)⁻¹X′y

∂LL/∂σ² = −n/(2σ²) + (1/(2σ⁴)) (y − Xβ)′(y − Xβ) = 0
is solved by:
σ̂² = (y − Xβ̂)′(y − Xβ̂)/n = û′û/n = SSR/n.

Note: the MLE estimate σ̂ 2 is biased downwards in small


samples, as the unbiased estimate is equal to SSR/(n − K).
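
A sketch that maximizes the normal log-likelihood numerically and recovers the OLS β̂ together with σ̂² = SSR/n (simulated data; the log-variance reparameterization is an assumption made only to keep σ² positive):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, K = 150, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

def neg_ll(params):
    beta, log_s2 = params[:K], params[K]
    s2 = np.exp(log_s2)                     # sigma^2 > 0 by construction
    u = y - X @ beta
    return 0.5 * (n * np.log(2 * np.pi) + n * log_s2 + u @ u / s2)

res = minimize(neg_ll, x0=np.zeros(K + 1), method="BFGS")
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(res.x[:K], b_ols)                                    # MLE beta ~ OLS beta
print(np.exp(res.x[K]), ((y - X @ b_ols) ** 2).sum() / n)  # sigma2_ML = SSR/n
```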
Maximum likelihood estimator

Basic MLE assumptions

Parameter space: Gaps and nonconvexities in parameter


spaces would generally collide with estimation algorithms.
Identifiability: The parameter vector θ is identified
(estimable) if, for two vectors θ* ≠ θ and for some data
observations x, L(θ*|x) ≠ L(θ|x).
Well-behaved data: Laws of large numbers (LLN) apply.
Some form of CLT can be applied to the gradient (i.e. for
the estimation method).
Regularity conditions: “well behaved” derivatives of
f (yi |θ) with respect to θ (see Greene, chapter 14.4.1).
Maximum likelihood estimator

MLE properties

Consistency: plim(θ̂) = θ0 (θ0 is the true parameter)

Asymptotic normality of θ̂

Asymptotic efficiency: θ̂ is asymptotically efficient and


achieves the Cramér-Rao lower bound for consistent
estimators (see Greene, chapter 14.4.5)

Invariance: the MLE of γ0 = c(θ0) is c(θ̂) if c(θ0)
is a continuous and continuously differentiable function.
(empirical advantages: we can use reparameterization in MLE,
e.g. γj = 1/θj or θ2 = 1/σ².)
Maximum likelihood estimator
MLE - properties of the estimator
(Normal distribution):

Under the above assumptions, the variance-covariance matrix of θ̂ is
the inverse of the Information matrix:

var(θ̂) = I[θ̂]⁻¹ = [ σ²(X′X)⁻¹      0
                          0        2σ⁴/n ],

where I[θ] = −E[H(θ)]. MLE gives the familiar formula for
the variance-covariance matrix of β̂: σ²(X′X)⁻¹, and
a simple expression for the variance of σ̂².
a simple expression for the variance of σ̂ 2 .
The square root of the diagonal elements of I[θ̂]−1 gives
estimates of the standard errors of the parameter estimates.
We can construct simple z-scores to test the null hypothesis
concerning any individual parameter, just as in OLS, but using
the normal instead of the t-distribution.
Maximum likelihood estimator

MLE - inference, three classic tests:


Consider MLE of parameter θ and a test of the hypothesis
H0 : h(θ) = 0. Recall that ML parameter estimates are
asymptotically normally distributed.

1 Likelihood ratio test: If the restriction h(θ) = 0 is valid,


then imposing it should not lead to a large reduction in the
log-likelihood function.

LR = 2(LLU − LLR) ∼ χ²(r) under H0,

where LLU is the log-likelihood of the unconstrained model, LLR denotes
that of the restricted model and r is the number of restrictions
imposed. To do this test you have to estimate two models
(one nested in the other) and get the results of both.
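
A minimal LR-test sketch using statsmodels, assuming normal errors so that OLS log-likelihoods can be compared; the simulated DGP, variable names and the single restriction (dropping x2) are illustrative assumptions:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(5)
n = 300
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * x1 + rng.normal(size=n)          # x2 is irrelevant under H0

ll_u = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit().llf  # unrestricted
ll_r = sm.OLS(y, sm.add_constant(x1)).fit().llf                         # restricted (H0: beta_x2 = 0)
LR = 2 * (ll_u - ll_r)
print(LR, chi2.sf(LR, df=1))                     # statistic and chi2(1) p-value
```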
Maximum likelihood estimator

MLE - inference, three classic tests:

We have an unrestricted ML estimate θ̂ = (θ̂1, . . . , θ̂m)′,
and test the hypothesis H0: h(θ) = q,
where h(θ) − q is an (r × 1) vector function of θ (linear/non-linear
restrictions; continuous partial derivatives assumed).

2 Wald test: If restriction h(θ) = q is valid, then h(θ̂) − q


should be close to zero since MLE is consistent.

W = [h(θ̂) − q]′ {Asy.Var[h(θ̂) − q]}⁻¹ [h(θ̂) − q] ∼ χ²(r) under H0,

where the estimated
Asy.Var[h(θ̂) − q] = [∂h(θ̂)/∂θ̂′] Asy.Var(θ̂) [∂h(θ̂)/∂θ̂′]′.
Maximum likelihood estimator
MLE - inference, three classic tests:

We have a ML estimate θ̂R – i.e. ML estimation of the


restricted model, under H0 : h(θ) = 0,

3 Lagrange multiplier test: If the restriction is valid, then


the restricted estimator should be near the point that
maximizes the log-likelihood. Therefore, the slope of the
log-likelihood function should be near zero at the restricted
estimator. The test is based on the slope of the
log-likelihood at the point where the function is maximized
subject to the restriction.

LM = [∂ log L(θ̂R)/∂θ̂R]′ I[θ̂R]⁻¹ [∂ log L(θ̂R)/∂θ̂R] ∼ χ²(r) under H0,

where −I[θ̂R] = ∂²LL(θ)/∂θ∂θ′ evaluated at θ = θ̂R.


Maximum likelihood estimator

MLE - inference, three classic tests:

The χ2 distributions of the three test statistics are


asymptotically valid.

The three tests are asymptotically equivalent, but may


differ in small samples:

W ≥ LR ≥ LM.

Hence, in finite samples, LR rejects H0 less often than W


but more often than LM.

The above tests are discussed in ML context, i.e. with a


known distribution of the variable/error term
(ML parameter estimates are asymptotically normally
distributed).
Maximum likelihood estimator

MLE – summary

MLE is only possible if we know the form of the probability


distribution function for the population (Normal, Poisson,
Negative Binomial, etc.).

MLE has the large sample properties of consistency and


asymptotic efficiency. There is no guarantee of desirable
small-sample properties.

Under CLRM assumptions (A1 – A6), ML estimator is


identical to OLS estimator (for β̂).
Non-linear extensions to LRM, quantile regression

Non-linear regression models

Quantile regression
Non-linear regression models

Nonlinear regression model:

yi = h(xi , β) + εi

Linear model is a special case of the nonlinear model.


yi = h(xi, β) + εi = xi′β + εi.
Linear models: linear in parameters. Definition includes
non-linear regressors such as x2i , etc.
Many nonlinear models can be transformed into linear
models (log-transformation)

For nonlinear models that cannot be transformed into


LRM, nonlinear LS (NLS) are available.

∂h(xi , β)/∂x is no longer equal to β


(interpretation based on estimated model . . . )
Nonlinear regression
Assumptions relevant to the nonlinear regression model
1 Functional form: The conditional mean function for yi ,
given xi is:

E[yi |xi ] = h(xi , β) , i = 1, 2, . . . , n

2 Identifiability of model parameters: The parameter


vector in the model is identified (estimable) if there is no
nonzero parameter β0 ≠ β such that h(xi, β0) = h(xi, β)
for all xi .

3 Zero mean of the disturbance: For yi = h(xi , β) + εi ,


we assume

E[εi |h(xi , β)] = 0 , i = 1, 2, . . . , n

i.e. disturbance at observation i is uncorrelated with the


conditional mean function.
Nonlinear regression

Assumptions relevant to the nonlinear regression model

4 Homoscedasticity and non-autocorrelation:


conditional homoscedasticity:

E[ε2i |h(xi , β)] = σ 2 , i = 1, 2, . . . , n

non-autocorrelation:

E[εt εs|h(xt, β), h(xs, β)] = 0, for all t ≠ s


Nonlinear regression

Assumptions relevant to the nonlinear regression model

5 Data generating process: DGP for xi is assumed to be


a well-behaved population such that first and second
sample moments of the data can be assumed to converge to
fixed, finite population counterparts. The crucial
assumption is that the process generating xi is strictly
exogenous to that generating εi.
6 Underlying probability model: There is a well-defined
probability distribution generating εi . At this point, we
assume only that this process produces a sample of
uncorrelated, identically (marginally) distributed random
variables εi with mean zero and variance σ 2 conditioned on
h(xi , β). Hence, our statement of the model is
semi-parametric (i.e. specific distributional assumption
on residuals are replaced by weaker assumptions).
Nonlinear Regression: NLS

NLS: estimator of the nonlinear regression model

NLS: min S(β) = Σ_{i=1}^{n} [yi − h(xi, β)]²

Using the standard procedure, we can get k first order


conditions for the minimization:
∂S(β)/∂β = Σ_{i=1}^{n} [yi − h(xi, β)] · ∂h(xi, β)/∂β = 0

The above first order conditions are also moment conditions


and this defines the NLS estimator as a GMM estimator.
Nonlinear regression: NLS

NLS: estimator of the nonlinear regression model

NLS being a GMM estimator allows us to deduce that the


NLS estimator has good large sample properties:
consistency and asymptotic normality (if assumptions are
fulfilled).

Hypothesis testing: The principal testing procedure is the


Wald test, which relies on the consistency and asymptotic
normality of the estimator. Likelihood ratio and LM tests
can also be constructed.
Nonlinear regression: computing NLS estimates

For nonlinear models, a closed-form solution (NLS estimator)


usually does not exist.

Most of the nonlinear maximization problems are solved by


an iterative algorithm.
The most commonly used of iterative algorithms are
gradient methods.
The template for most gradient methods in common use is
Newton's method.
Check which methods your software package offers for
computing NLS estimates.
Nonlinear regression: examples

LRM on TS with autocorrelation:

yt = xt′β + ut,   ut = ρut−1 + εt,
yt = xt′β + ρut−1 + εt;   note: ut−1 = yt−1 − xt−1′β,
hence:
yt = ρyt−1 + xt′β − ρ(xt−1′β) + εt,
which is non-linear in the parameters (the product ρβ).

Non-linear consumption function example:

consi = β1 + β2 inci^β3 + εi
special case: the model is linear for β3 = 1
(such an assumption can be tested).
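
A hedged NLS sketch for the consumption function above, fitted with scipy's curve_fit on simulated data (the "true" parameter values below only mimic the magnitudes reported in Greene's example and are assumptions):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(6)
inc = rng.uniform(1_000, 10_000, size=200)
cons = 450 + 0.10 * inc ** 1.25 + rng.normal(scale=50, size=200)

def h(inc, b1, b2, b3):
    return b1 + b2 * inc ** b3               # cons_i = b1 + b2 * inc_i^b3

# Starting values matter; b3 = 1 starts from the linear (OLS) special case.
b_hat, b_cov = curve_fit(h, inc, cons, p0=[0.0, 1.0, 1.0])
print(b_hat, np.sqrt(np.diag(b_cov)))        # NLS estimates and asymptotic std. errors
```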
Nonlinear regression: examples

Examples 7.4 & 7.8 (Greene):


Analysis of a Nonlinear Consumption Function
OLS version: for β3 = 1.

Dependent Variable: REALCONS
Method: Least Squares (Marquardt - EViews legacy)
Date: 09/19/16  Time: 16:31
Sample: 1950Q1 2000Q4
Included observations: 204
REALCONS=C(1)+C(2)*REALDPI

        Coefficient   Std. Error   t-Statistic   Prob.
C(1)    -80.35475     14.30585     -5.616915     0.0000
C(2)    0.921686      0.003872     238.0540      0.0000

R-squared            0.996448   Mean dependent var      2999.436
Adjusted R-squared   0.996431   S.D. dependent var      1459.707
S.E. of regression   87.20983   Akaike info criterion   11.78427
Sum squared resid    1536322    Schwarz criterion       11.81680
Log likelihood       -1199.995  Hannan-Quinn criter.    11.79743
F-statistic          56669.72   Durbin-Watson stat      0.092048
Prob(F-statistic)    0.000000
Nonlinear regression: examples

Examples 7.4 & 7.8 (Greene):


Analysis of a Nonlinear Consumption Function
NLS with starting values equal to 0

Dependent Variable: REALCONS
Method: Least Squares (Marquardt - EViews legacy)
Sample: 1950Q1 2000Q4    Included observations: 204
Convergence achieved after 200 iterations
REALCONS=C(1)+C(2)*REALDPI^C(3)

        Coefficient   Std. Error   t-Statistic   Prob.
C(1)    458.7991      22.50140     20.38980      0.0000
C(2)    0.100852      0.010910     9.243667      0.0000
C(3)    1.244827      0.012055     103.2632      0.0000

R-squared            0.998834   Mean dependent var      2999.436
Adjusted R-squared   0.998822   S.D. dependent var      1459.707
S.E. of regression   50.09460   Akaike info criterion   10.68030
Sum squared resid    504403.2   Schwarz criterion       10.72910
Log likelihood       -1086.391  Hannan-Quinn criter.    10.70004
F-statistic          86081.29   Durbin-Watson stat      0.295995
Prob(F-statistic)    0.000000
Nonlinear regression: examples

Examples 7.4 & 7.8 (Greene):


Analysis of a Nonlinear Consumption Function
NLS with starting values equal to the parameters from the OLS
estimation (c(3) equal to 1)

Dependent Variable: REALCONS
Method: Least Squares (Marquardt - EViews legacy)
Sample: 1950Q1 2000Q4    Included observations: 204
Convergence achieved after 80 iterations
REALCONS=C(1)+C(2)*REALDPI^C(3)

        Coefficient   Std. Error   t-Statistic   Prob.
C(1)    458.7989      22.50149     20.38971      0.0000
C(2)    0.100852      0.010911     9.243447      0.0000
C(3)    1.244827      0.012055     103.2632      0.0000

R-squared            0.998834   Mean dependent var      2999.436
Adjusted R-squared   0.998822   S.D. dependent var      1459.707
S.E. of regression   50.09460   Akaike info criterion   10.68030
Sum squared resid    504403.2   Schwarz criterion       10.72910
Log likelihood       -1086.391  Hannan-Quinn criter.    10.70004
F-statistic          86081.28   Durbin-Watson stat      0.295995
Prob(F-statistic)    0.000000
Quantile regression - LAD

Quantile regression estimates the relationship between


regressors and a specified quantile of dependent variable.
LAD estimator is the QREG for q = 1/2 (median) and the
loss function can be described as (compare to the OLS
objective function):

min_{β̂q} Qn(β̂q) = Σ_{i=1}^{n} |yi − xi′β̂q|

The LAD estimator predates OLS (which is itself more than 200 years old).

Until recently, QREG and LAD have seen little use in
econometrics, as OLS is vastly easier to compute.
Different software packages use a variety of optimization
algorithms for QREG/LAD estimation.
Linear programming can be used for finding QREG
estimates (Koenker and Bassett, 1978).
Quantile regression (QREG)

For LRMs, the q-th quantile QREG estimator βq minimizes:


min_{β̂q} Qn(β̂q) = Σ_{i: ei ≥ 0} q|yi − xi′β̂q| + Σ_{i: ei < 0} (1 − q)|yi − xi′β̂q|,

where ei = (yi − xi′β̂q).


We use the notation β̂q to make clear that different choices
of q lead to different β̂.
Slope of the loss function Qn is asymmetrical
(around ei = 0).
The loss function is not differentiable (at ei = 0)
→ gradient methods are not applicable
(linear programming can be used).
Quantile regression (QREG)

Quantile regression: used to describe relationship between


regressors and a specified quantile of dependent variable.

The (linear) quantile model can be defined as


Q[y|x, q] = x′βq, such that Prob[y ≤ x′βq | x] = q, 0 < q < 1,
where q denotes the q-th quantile of y.

One important special case of quantile regression is the
least absolute deviations (LAD) estimator, which
corresponds to fitting the conditional median of the
response variable (q = 1/2).

QREG (LAD) estimator can be motivated as a robust


alternative to OLS (with respect to outliers).
Quantile regression

QREG coefficient interpretation example:

(1) wagei = β1 + ui
(2) wagei = β1 + β2 femalei + ui
(3) wagei = β1 + β2 femalei + β3 experi + ui

The above equations are estimated by OLS / LAD / QREG:


Coefficient       OLS                          LAD (q = 1/2)                QREG (q = 3/4)
(1) β1            β̂1 = ȳ (sample mean)         β̂1 = ỹ (sample median)       β̂1 = Q3 (sample 3rd quartile)
(2) β1, β1+β2     conditional sample mean      cond. sample median          conditional sample Q3
                  wage: male / female          wage: male / female          wage: male / female
(3) β3            change in expected mean      change in exp. median        change in expected Q3
                  wage for ∆exper = 1          wage for ∆exper = 1          wage for ∆exper = 1
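
A sketch of the wage example with statsmodels' QuantReg, comparing OLS, LAD (q = 1/2) and the 3rd-quartile fit (q = 3/4); the simulated wage/exper DGP is an illustrative assumption:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 1000
exper = rng.uniform(0, 30, size=n)
wage = 10 + 0.4 * exper + rng.standard_t(df=3, size=n)   # heavy-tailed errors

X = sm.add_constant(exper)
fit_ols = sm.OLS(wage, X).fit()              # conditional mean
fit_lad = sm.QuantReg(wage, X).fit(q=0.50)   # conditional median (LAD)
fit_q75 = sm.QuantReg(wage, X).fit(q=0.75)   # conditional 3rd quartile
print(fit_ols.params, fit_lad.params, fit_q75.params)
```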
Quantile regression example

Example 7.10 (Greene):


Income Elasticity of Credit Cards Expenditure
OLS & LAD & Income elasticity at different deciles

Dependent Variable: LOGSPEND
Method: Least Squares
Date: 09/15/16  Time: 13:53
Sample (adjusted): 3 13443
Included observations: 10499 after adjustments

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           -3.055807     0.239699     -12.74852     0.0000
LOGINC      1.083438      0.032118     33.73296      0.0000
AGE         -0.017364     0.001348     -12.88069     0.0000
ADEPCNT     -0.044610     0.010921     -4.084857     0.0000

R-squared            0.100572   Mean dependent var      4.728778
Adjusted R-squared   0.100315   S.D. dependent var      1.404820
S.E. of regression   1.332496   Akaike info criterion   3.412366
Sum squared resid    18634.35   Schwarz criterion       3.415131
Log likelihood       -17909.21  Hannan-Quinn criter.    3.413300
F-statistic          391.1750   Durbin-Watson stat      1.888912
Prob(F-statistic)    0.000000
Quantile regression example 2

Example 7.10 (Greene):


Income Elasticity of Credit Cards Expenditure (LAD)
Dependent Variable: LOGSPEND    Method: Quantile Regression (Median)
Sample (adjusted): 3 13443    Included observations: 10499 after adjustments
Huber Sandwich Standard Errors & Covariance
Sparsity method: Kernel (Epanechnikov) using residuals
Bandwidth method: Hall-Sheather, bw=0.04437
Estimation successfully identifies unique optimal solution

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C           -2.803756     0.233534     -12.00577     0.0000
LOGINC      1.074928      0.030923     34.76139      0.0000
AGE         -0.016988     0.001530     -11.10597     0.0000
ADEPCNT     -0.049955     0.011055     -4.518599     0.0000

Pseudo R-squared          0.058243   Mean dependent var    4.728778
Adjusted R-squared        0.057974   S.D. dependent var    1.404820
S.E. of regression        1.346476   Objective             5096.818
Quantile dependent va...  4.941583   Restr. objective      5412.032
Sparsity                  2.659971   Quasi-LR statistic    948.0224
Prob(Quasi-LR stat)       0.000000
Quantile regression example 2

Example 7.10 (Greene):


Income Elasticity of Credit Cards Expenditure
[Figure: QREG coefficient estimates plotted against the quantile q (horizontal axis, approx. 0.2–0.8) in four panels: (Intercept), log(INCOME), AGE, ADEPCNT.]


Predictions from a model

Predictions from a CLRM (repetition from BSc courses)

Predictions: general features, kFCV, Variance vs. Bias


Predictions - basics

CLRM and its estimate:

y = β1 + β2 x2 + β3 x3 + · · · + βK xK + u
ŷ = β̂1 + β̂2 x2 + β̂3 x3 + · · · + β̂K xK

Prediction of expected value:

ŷp = E(y|x1 = 1, x2 = c2 , . . . , xK = cK )
ŷp = β̂1 + β̂2 c2 + β̂3 c3 + · · · + β̂K cK

Rough (underestimated) confidence interval for the


expected value prediction: (95%): ŷp ± 2 × s.e.(ŷp ).
(Rule of thumb)
Predictions - basics

s.e.(ŷp ) can be obtained by reparametrization:

Reparametrized CLRM:

y ∗ = β1∗ + β2∗ (x2 − c2 ) + β3∗ (x3 − c3 ) + · · · + u

The following holds:

ŷp = β̂1*
s.e.(ŷp) = s.e.(β̂1*), i.e.
var(ŷp) = var(β̂1*)
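
A small sketch of the reparametrization trick: regressing y on (x2 − c2) makes the intercept equal to ŷp, so its standard error is s.e.(ŷp) (simulated data; names and the prediction point are illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 120
x2 = rng.normal(size=n)
y = 3.0 + 1.5 * x2 + rng.normal(size=n)
c2 = 1.0                                     # prediction point (assumed)

fit = sm.OLS(y, sm.add_constant(x2 - c2)).fit()
print(fit.params[0], fit.bse[0])             # y_hat_p and s.e.(y_hat_p)
```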
Predictions - basics

Predicted and actual values of yp :

ŷp = β̂1 + β̂2 c2 + β̂3 c3 + · · · + β̂K cK


yp = β1 + β2 c2 + β3 c3 + · · · + βK cK + up

Prediction error

êp = yp − ŷp = (β1 + β2 c2 + β3 c3 + · · · + βK cK ) + up − ŷp

Prediction error variance

var(êp ) = var(up ) + var(ŷp )

because var(β1 + β2 c2 + β3 c3 + · · · + βK cK ) = 0
Predictions - basics

In CLRM, homoscedasticity holds, σ 2 = var(up ):


var(êp ) = σ 2 + var(ŷp )
We estimate σ 2 from the original CLRM as (SSR/(n − K))
We get var(ŷp ) from the reparametrized LRM

Standard prediction error:


p
s.e.(êp ) = var(êp )

Prediction interval (95%)


ŷp ± t0.025 × s.e.(êp )
Predictions - basics

Prediction with logarithmic dependent variable

log(y) = β1 + β2 x2 + · · · + βK xK + u,
with fitted values log(y)^ = β̂1 + β̂2 x2 + · · · + β̂K xK.

ŷ = e^(log(y)^) systematically underestimates y;
we can use a correction: ŷ = α̂0 · e^(log(y)^),
where α̂0 = n⁻¹ Σ_{i=1}^{n} exp(ûi)
is a consistent (but not unbiased) estimator of E[exp(u)].
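
A sketch of the correction factor α̂0 for level predictions from a log(y) regression (simulated data; the DGP and variable names are assumptions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 400
x = rng.normal(size=n)
y = np.exp(1.0 + 0.5 * x + rng.normal(scale=0.6, size=n))

fit = sm.OLS(np.log(y), sm.add_constant(x)).fit()
alpha0 = np.exp(fit.resid).mean()            # alpha0_hat = n^{-1} sum exp(u_hat_i)
y_hat = alpha0 * np.exp(fit.fittedvalues)    # corrected level predictions
print(alpha0)
```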


Predictions - basics (Matrix form)

Prediction based on estimated model:

ŷp = xp′β̂

Difference between the prediction and the actual yp value:

êp = ŷp − yp = xp′β̂ − xp′β − up = xp′(β̂ − β) − up

If β̂ is an unbiased estimator for β,
ŷp is an unbiased estimator for the yp value:

E(êp) = E(ŷp − yp) = xp′E(β̂ − β) + E(−up) = 0

and the variance of êp can be expressed as:

E(êp²) = var(êp) = xp′ var(β̂) xp + var(up)


Predictions - basics (Matrix form)

Variance of êp (continued):

var(êp) = xp′ var(β̂) xp + var(up)
        = xp′ [σ² (X′X)⁻¹] xp + var(up)
substitute σ², var(up) with σ̂² (homoscedasticity):
        = xp′ [σ̂² (X′X)⁻¹] xp + σ̂²,
where the first term is denoted σ̂p².

With growing sample size (asymptotically),
var(êp) = σ̂p² + σ̂² converges to σ̂²
. . . plim β̂ = β ↔ plim σ̂p² = 0
(Note: recall consistency of the OLS estimator under A1–A5
conditions & for the CLRM model - i.e. under A1–A6.)
Predictions - basics (Matrix form)

Variance of êp (continued):


var(êp) = xp′ [σ̂² (X′X)⁻¹] xp + σ̂²;
after re-arranging, s.e.(êp) may be written as

s.e.(êp) = σ̂ · √(1 + xp′(X′X)⁻¹xp),

which relates to the individual prediction error.

For mean prediction errors (considering σ̂p² only):

s.e.(ẽp) = σ̂ · √(xp′(X′X)⁻¹xp).
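
A sketch that evaluates both standard errors from the formulas above at one prediction point xp (all numbers are simulated and illustrative):

```python
import numpy as np

rng = np.random.default_rng(10)
n, K = 80, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, 1.0]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
sigma2 = ((y - X @ b) ** 2).sum() / (n - K)        # sigma^2_hat = SSR/(n-K)
xp = np.array([1.0, 0.5])                          # prediction point (assumed)

h = xp @ np.linalg.inv(X.T @ X) @ xp               # x_p'(X'X)^{-1}x_p
se_mean = np.sqrt(sigma2 * h)                      # mean-value prediction s.e.
se_indiv = np.sqrt(sigma2 * (1 + h))               # individual prediction s.e.
print(xp @ b, se_mean, se_indiv)
```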
Predictions - basics (Matrix form)

Prediction intervals: individual vs. mean value predictions:

Individual prediction: yp ∈ ŷp ± t*α/2 × s.e.(êp)

Mean value: E(yp|xp) ∈ ŷp ± t*α/2 × s.e.(ẽp)


Predictions – general discussion:

Reliability of predictions:

we work with estimated parameters


(if we generalize from the CLRM paradigm, finite/small
sample properties of estimators may be difficult to
describe),
model parameters can change in time
(discussed separately in next Block – see Chow tests),
predictions include “individual” random errors.

Impacts of random errors on predictions of individual


values are usually much bigger than the impacts of
variance in estimated parameters.
Mean Squared Error of prediction

We can generalize the previous discussion on predictions by


considering both biased and unbiased predictors and by
allowing for different functional forms and complexity levels in
predictive models.

Predictions may be compared/evaluated using:

MSE = E[ (yi − f̂(xi))² ],

where fˆ(xi ) is the prediction that fˆ generates for the i-th


regressor set. Here, fˆ represents a general class of
predictors (linear, non-linear, non-parametric, etc.) and it
may produce either biased or unbiased predictions
Variance vs. Bias trade-off

Example for a “sine-like” function: y = f (x) + u


Train sample & Test sample

Suppose we fit a model f̂(x) to some training data
Tr = {yi, xi}, i = 1, . . . , n, and we wish to see how well it performs.
We could compute the MSE over Tr:

MSETr = (1/n) Σ_{i∈Tr} [yi − f̂(xi)]²

When searching for the “best” model by minimizing MSE, the


above statistic would lead to over-fit models.

Instead, we should (if possible) compute the MSE using
fresh test data Te = {yi, xi}, i = 1, . . . , m:

MSETe = (1/m) Σ_{i∈Te} [yi − f̂(xi)]²
Variance vs. Bias trade-off

Suppose we have a model fˆ(x), fitted to some training data Tr


and let {y0 , x0 } be a test observation drawn from the
population. If the true model is yi = f (xi ) + εi ,
with f (xi ) = E(yi |xi ), then the expected test MSE can be
decomposed into:
E(MSE0) = var(f̂(x0)) + [Bias(f̂(x0))]² + var(ε0),
where
Bias(f̂(x0)) = E[f̂(x0)] − f(x0),
var(ε0) is the irreducible error: E(MSE0) ≥ var(ε0),
all three RHS elements are non-negative,
The above equation refers to the average test MSE that we
would obtain if we repeatedly estimated f (x) using a large
number of training sets and then tested each fˆ(x) at x0 .
Variance vs. Bias trade-off

E(MSE0) = var(f̂(x0)) + [Bias(f̂(x0))]² + var(ε0)

[Figure: illustration of the Variance vs. Bias² trade-off; var(ε0) is not shown explicitly. The lowest expected test MSE lies near the (asymptotic) minima of the Variance and Bias² curves.]
k-Fold Cross Validation

Training error (MSETr ) can be calculated easily.


However, MSETr is not a good approximation of
MSETe (the out-of-sample predictive properties of the model).
Usually, MSETr dramatically underestimates MSETe.

Cross-validation is based on re-sampling (similar to bootstrap).


Repeatedly fit a model of interest to samples formed from the
training set & make “test sample” predictions, in order to
obtain additional information about predictive properties of the
model.
k-Fold Cross Validation

In k-Fold Cross-Validation (kFCV), the original sample is


randomly partitioned into k roughly equal subsamples
(divisibility).
One of the k subsamples is retained as the test sample, and
the remaining (k − 1) subsamples are used as training data.

The cross-validation process is then repeated k times


(the k folds), with each of the k subsamples used exactly
once as the test sample.
The k results from the folds can then be averaged to
produce a single estimate.
k = 5 or k = 10 is commonly used.
k-Fold Cross Validation

kFCV example for CS data & k = 5:


(random sampling, no replacement)

In TS, a similar “Walk forward” test procedure may be applied.


k-Fold Cross Validation

CV(k) = (1/k) Σ_{s=1}^{k} MSEs,

where CV(k) is the cross-validated estimate of MSE,
k is the number of folds used (e.g. 5 or 10),
MSEs = (1/ms) Σ_{i∈Cs} (yi − ŷi)²,
ms is the number of observations in the s-th test sample,
Cs refers to the s-th set of test sample observations.

As we evaluate predictions from two or more models,


we look for the lowest CV(k) .
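
A sketch of CV(k) for k = 5 using scikit-learn's KFold splitter and a simple OLS fit on each training fold (the "sine-like" DGP mirrors the earlier illustration and is an assumption):

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(11)
n = 200
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + rng.normal(scale=0.3, size=n)
X = np.column_stack([np.ones(n), x, x**2, x**3])     # cubic approximation of f(x)

fold_mse = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    b = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)[0]
    resid = y[test_idx] - X[test_idx] @ b
    fold_mse.append(np.mean(resid ** 2))             # MSE_s on the s-th test fold

print(np.mean(fold_mse))                             # CV_(k): average of the k fold MSEs
```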
