MultivariableRegression 4
MultivariableRegression 4
MultivariableRegression 4
Much like in the case of the univariate regression with one independent variable, the multiple
regression model has a number of required assumptions:
(MR.1): Linear Model The Data Generating Process (DGP), or in other words, the
population, is described by a linear (in terms of the coefficients) model:
Y = Xβ + ε (MR.1)
E (ε|X) = 0 (MR.2)
This assumption implies that all error pairs are uncorrelated. For cross-sectional data,
this assumption implies that there is no spatial correlation between errors.
(MR.5) There exists no exact linear relationship between the explanatory variables.
This means that:
rank (X) = k + 1
ε|X ∼ N 0, σ2 I
(MR.6)
Generalized Least Squares
Generalized Least Squares
Y = Xβ + ε
So far, one of our assumptions about the error term was defined by (MR.3) - (MR.4), namely
that:
Var (ε|X) = E εε> = σ2 I
However, sometimes the variance-covariance matrix of the residuals Var (ε|X) 6= σ2 I.
In this lecture we will examine how this change effects parameter estimation, as well as the
possible solutions for such cases.
A General Multiple Linear Regression
Consider the following model:
where Ω is symmetric and positive definite N × N matrix. This model allows for the
errors to be heteroskedastic or autocorrelated (or both) and is often referred to as
special case of the generalized linear (regression) model (GLM).
Consequently, we may refer to this type of a models as a general multiple linear
regression (GMLR), to distinguish it as only a special case of a GLM. We will examine
GLM’s in more detail in a later lecture.
I If Ω = I, then the GMLR is just the simple multiple linear regression model that we are
already familiar with;
I If Ω is diagonal with non-constant diagonal elements, then the error terms are
uncorrelated but they are heteroskedastic;
I If Ω is not diagonal then Cov(i , j ) = Ωi,j 6= 0 for some i 6= j. In econometrics,
non-diagonal covariance matrices are most commonly encountered in time-series data, with
higher correlations for observations closer in time (usually when i and j are differ by 1 or 2).
I If Ω is not diagonal and the diagonal elements are constant, then the errors are
autocorrelated and homoskedastic;
I If Ω is not diagonal and the diagonal elements are not constant, then the errors are
autocorrelated and heteroskedastic;
As such, we will now assume that Ω 6= I, since we have already covered this case in previous
chapters.
Consequences when OLS is used on GMLR
Since the OLS estimator can be expressed as:
−1
>
β
b
OLS = X X X >Y
then:
I The expected value of the OLS estimator:
−1
>
E βb
OLS = β + X X X > E (ε) = β
In other words, the OLS estimates are unbiased and consistent, but their variance
estimators are biased and inconsistent, which leads to incorrect results for statistical
tests.
Generalized Least Squares (GLS)
The general idea behind GLS is that in order to obtain an efficient estimator of β,
b we need to
transform the model, so that the transformed model satisfies the Gauss-Markov theorem (which
is defined by our (MR.1) - (MR.5) assumptions).
Then, estimating the transformed model by OLS yields efficient estimates. The transformation is
expressed in terms of a (usually) triangular matrix Ψ, such that:
Ω−1 = ΨΨ>
Var Ψ> ε = Var Ψ> ε = Ψ> Var (ε) Ψ
−1
= σ 2 Ψ> ΩΨ = σ 2 Ψ> Ω−1 Ψ
−1 −1
= σ 2 Ψ> ΨΨ> Ψ = σ 2 Ψ> Ψ> Ψ−1 Ψ
= σ2 I
which means that now the assumptions (MR.3) - (MR.4) hold true for the model, defined in (1).
The Parameter GLS Estimator
Consequently, the OLS estimator of β from regression in (1) is:
−1 −1
b = X > ΨΨ> X
β X > ΨΨ> Y = X > Ω−1 X X > Ω−1 Y
Alternatively, it can also be obtained as a solution to the minimization of the GLS criterion
function:
>
(Y − Xβ) Ω−1 (Y − Xβ) → min
β
This criterion function can be thought of as a generalization of the RSS function, which is
minimized in the OLS case. The effect of such weighting is clear when Ω is diagonal - each
observation is simply given a weight proportional to the inverse of the variance of its error term.
The Error Variance Estimator
Next, we need to estimate the error variance:
1 >
b2 =
σ Y ∗ − X ∗βb
GLS Y ∗ − X ∗β
b
GLS
N − (k + 1)
1 >
= Ψ> Y − Ψ> X β b
GLS Ψ> Y − Ψ> X βb
GLS
N − (k + 1)
1 >
>
= Y − Xβ b
GLS ΨΨ Y − X β
b
GLS
N − (k + 1)
In other words, all of the variables, including the constant term, are multiplied by the same
weights ωi−1 .
Consequently, β0 is no longer multiplied by a constant, so when estimating the model with
these transformed variables via OLS, create a new variable for 1/ωi , and exclude a
constant from your model.
There are various ways of determining the weights used in weighted least squares estimation. In
the simplest case, either from economic theory, or from some preliminary testing, we may assume
that E(2i ) is proportional to Zi2 , where Zi is some observed variable.
For example Zi may be the population, or income. Alternatively, sometimes E(2i ) may be
proportional the sample size used to obtain an average (or total) value of observation i, which is
saved in the dataset.
I If observation Yi is an average of Ni equally variable (uncorrelated) observations,
√
then Var (Yi ) = σ 2 /Ni and ωi = 1/ Ni ;
I If observation Yi is an aggregated total of Ni (uncorrelated) observations, then
√
Var (Yi ) = Ni σ 2 and ωi = Ni ;
I If the variance of Yi is proportional to some predictor Zi , then Var (Yi ) = Zi2 σ 2
and ωi = Zi ;
It is possible to report various summary statistics, like R 2 , ESS and RSS in terms of Yi , or
Yi /ωi , however, R 2 is only valid for the transformed variables Yi /ωi (since the coefficients are
estimated on the transformed dependent variables).
Feasible Generalized Least Squares
In practice the true form of Ω is unknown.
Even in the simplest case of Ω = diag ω12 , ..., ωN2 , we would need to estimate a total of
k + 1 + N parameters (β0 , ..., βk and ω12 , ..., ωN2 ), which is impossible since we only have N
observations (and a rule of thumb says that we would need a minimum of 5 - 20 observations per
parameters in order to get satisfactory estimates of β).
Therefore, we need to assume some simplifying conditions for Ω. For example, if Ω is diagonal,
suppose that ωi2 = α0 + α1 Xm,i for some m = 1, ..., k - now we only need to estimate two
additional parameters (as opposed to N).
where Ω
b = Ω(θ)
b is a parametric estimation (with parameter vector θ) of the true
unknown matrix Ω.
For the weighted least squares case we would need to:
1. Estimate a model via OLS and calculate the residuals b i,(OLS) . If the regression is correctly
specified, an estimate of ωi2 is the OLS squared residual b 2i,(OLS) .
2. Generally, the residuals are much too variable to be used directly in estimating the weights,
so instead we could use:
I some other kind of transformation of the residuals, or their squares. For example, log(b2i ).
Then regressing these variables on some selected predictors and using the fitted values as
weights.
I regressing the squared residuals against the fitted values. Then the square root of the fitted
values could be used as ωi ;
I regressing the absolute residuals against the fitted values. Then the fitted values could be
used as ωi ;
Plotting the residuals (or their squared values, or their absolute values, or the root square of their
absolute value) against the independent variable X would help in determining the regression form.
The resulting fitted values from this regression are estimates of ωi .
Generally, the structure can be imposed in two most popular ways: by assuming error
heteroskedasticity, or by assuming error serial correlation.
If the assumption about the error covariance structure is incorrect - heteroskedasticity
still remains and FGLS is is no longer BLUE. In this case:
I FGLS it is still unbiased, just like OLS;
I the FGLS estimator of the error variance is biased, just like OLS (but the
magnitude of the bias is not the same);
Furthermore, regarding the use of FGLS instead of GLS, it can be stated that:
Since we do not know the true Ω, but instead estimate Ω b from the sample, then it
becomes a random variable (just like when we estimate other parameters, like β).
b This
fact affects the distribution of the GLS estimator. Very little is known about the
finite-sample properties of the FGLS estimator.
A Note on Coefficient Interpretation
Since we are using the following model:
to estimate β
b with GLS (or FGLS), then the predicted values are:
Yc∗ = X ∗ β
b = Ψ> b
X β i.e. = Ψ >b
Y
GLS GLS
−1
In other words multiplying both sides by Ψ> yields:
−1
Ψ> c∗ = X β
Y b
GLS = Y
b
If we were to use X ∗ to calculate the fitted/predicted values, then we would need to multiply the
−1
c∗ by Ψ>
fitted values Y to get the fitted values for the original data Y
b . Alternatively, this
expression also means that we can use the the original design matrix X with the GLS
estimates of β to get the fitted/predicted values of Y . In other words:
The GLS (as well as WLS and FGLS) estimates of β retain their coefficient interpretation of how a
unit increase in one explanatory variable Xj affects the dependent variable Y , ceteris paribus.
Heteroskedastic Errors
Heteroskedastic Errors
Consider the case where assumption (MR.3) does not hold, but assumption (MR.4) (and the
other remaining assumptions (MR.1), (MR.2), (MR.5) and, optionally, (MR.6)) are still valid.
Then we can write the following model as:
The case when the error variance-covariance matrix is diagonal, but not equal to σ2 I, expressed
as: 2
σ1 0 0 ... 0
0 σ22 0 ... 0
Σ= .
.. .. .. ..
.. . . . .
0 0 0 ... σN2
is referred to as heteroskedasticity. As mentioned before:
I the OLS estimators will remain unbiased;
I the OLS variance estimator is biased and inconsistent;
I the usual t-statistics of the OLS estimates do not have Student’s t distribution, even in
large samples.
1. Assume that Y = Xβ + ε is the true model.
2. Test the null hypothesis that the residuals are homoskedastic:
3. If we fail to reject the null hypothesis - we can use the usual OLS estimators.
4. If we reject the null hypothesis, there are two ways we can go:
I Use the OLS estimators, but correct their variance estimators (i.e. make them
consistent);
I Instead of OLS, use the weighted least squares (WLS) to estimate the parameters;
I Attempt to specify a different model, which would hopefully, be able to account for
heteroskedasticity (this is the least preferred method - our initial model should
already be the one we want in terms of variables, signs, interpretation, etc.).
Example
We will simulate the following model:
Yi = β0 + β1 X1,i + β2 X2,i + ui
ui = i · i , i ∼ N (0, σ 2 ) or ui ∼ N (0, i 2 · σ 2 )
set.seed(123)
#
N <- 200
beta_vec <- c(10, 5, -3)
#
x1 <- seq(from = 0, to = 5, length.out = N)
x2 <- sample(seq(from = 3, to = 17, length.out = 80), size = N, replace = TRUE)
e <- rnorm(mean = 0, sd = 1:N, n = N)
x_mat <- cbind(1, x1, x2)
y <- x_mat %*% beta_vec + e
#
data_mat <- data.frame(y, x1, x2)
Testing For Heteroskedasticity
We can examine the presence of heteroskedasticity from the residuals plots, as well as conducting
a number of formal tests.
We will begin by estimating our model via OLS, as we usually would.
mdl_1 <- lm(y ~ x1 + x2, data = data_mat)
print(round(coef(summary(mdl_1)), 5))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.89066 28.42373 -0.03134 0.97503
## x1 7.21929 5.93429 1.21654 0.22524
## x2 -1.83213 2.18885 -0.83703 0.40359
Residual Plot Diagnostic
One way of investigating the existence of heteroskedasticity is to visually examine the OLS model
residuals. If they are homoskedastic, there should be no pattern in the residuals. If the errors are
heteroskedastic, they would exhibit increasing or decreasing variation in some systematic way.
For example, variation may increase with larger values of Y b , or with larger values of Xj .
par(mfrow = c(2, 2))
#
plot(mdl_1$residuals, main = "Run-Sequence Plot")
plot(mdl_1$fitted.values, mdl_1$residuals, main = "Residuals vs Fitted")
plot(x1, mdl_1$residuals, main = bquote("Residuals vs"~X[1]))
plot(x2, mdl_1$residuals, main = bquote("Residuals vs"~X[2]))
Run−Sequence Plot Residuals vs Fitted
400
400
mdl_1$residuals
mdl_1$residuals
0
0
−400
−400
0 50 100 150 200 −30 −20 −10 0 10 20
Index mdl_1$fitted.values
Residuals vs X1 Residuals vs X2
400
400
mdl_1$residuals
mdl_1$residuals
0
0
−400
−400
0 1 2 3 4 5 4 6 8 10 12 14 16
x1 x2
ε> εj
j b
bj2 =
b
σ
Nj − (kj + 1)
b12
σ
GQ = ∼ F(N1 −(k1 +1),N2 −(k2 +1))
b22
σ
If we reject the null hypothesis, for a chosen significance level α, then the variances
in the sub-samples are different, hence we cannot reject the null hypothesis of het-
eroskedasticity.
In our example model, we see that the residuals can be ordered by either the fitted values, or by X1 . If we specify
to order by X1 :
GQ_t <- lmtest::gqtest(mdl_1, alternative = "two.sided", order.by = ~ x1)
print(GQ_t)
##
## Goldfeld-Quandt test
##
## data: mdl_1
## GQ = 10.217, df1 = 97, df2 = 97, p-value < 2.2e-16
## alternative hypothesis: variance changes from segment 1 to 2
since the p-value is less than 0.05, we reject the null hypothesis and conclude that the residuals are
heteroskedastic.
I Furthermore, we can order by the fitted values:
GQ_t <- lmtest::gqtest(mdl_1, alternative = "two.sided", order.by = order(mdl_1$fitted.values))
print(GQ_t)
##
## Goldfeld-Quandt test
##
## data: mdl_1
## GQ = 5.7826, df1 = 97, df2 = 97, p-value = 3.736e-16
## alternative hypothesis: variance changes from segment 1 to 2
and we would arrive at the same conclusions.
On the other hand, if we would have specified to order by X2 :
GQ_t <- lmtest::gqtest(mdl_1, alternative = "two.sided", order.by = ~ x2)
print(GQ_t)
##
## Goldfeld-Quandt test
##
## data: mdl_1
## GQ = 1.1747, df1 = 97, df2 = 97, p-value = 0.4292
## alternative hypothesis: variance changes from segment 1 to 2
We would have no grounds to reject the null hypothesis, and would have concluded that the errors are
homoskedastic.
This is why the residual plots are important.
Without them, we may have blindly selected one explanatory variable at random to order by, and would have
arrived at a completely different conclusion!
By default, if order.by is not specified, then the data is taken as ordered - i.e. it would be the equivalent of the
residual run-sequence plot.
A General Heteroskedasticity Test
The next test is used for conditional heteroskedasticity, which is related to explanatory variables.
This test is a generalization of the Breusch–Pagan Test.
Consider the following regression:
Y = Xβ + ε
We assume that the general expression for the conditional variance can be expressed as:
Var (ε|Z ) = E εε> |Z = diag (h (Z α))
where Z is some explanatory variable matrix, which may include some (or all) of the explanatory
variables from X. h(·) is some smooth function, and α = [α0 , ..., αm ]> .
We want to test the null hypothesis:
n
H0 : α1 = ... = αm = 0
H1 : at least one αj 6= 0, j ∈ {1, ..., m}
Then the associated BP test specifies the following linear model for the squared residuals:
bεbε> = diag (Z α + v)
Estimating the squared residuals model via OLS yields the parameter estimates. Then, using the F -test
for the joint significance of α1 , ..., αm would be equivalent to testing for homoskedasticity. Alternatively,
and more conveniently, there is a test based on the R 2 :
b
LM = N · R 2 ∼ χ2m
b
If we reject the null hypothesis, for a chosen significance level α, then the error variance is heteroskedastic.
BP_t <- lmtest::bptest(mdl_1)
print(BP_t)
##
## studentized Breusch-Pagan test
##
## data: mdl_1
## BP = 35.896, df = 2, p-value = 1.604e-08
The p-value is less than the chosen 0.05 significance level, so the residuals are heteroskedastic.
ε> = diag (Z α + v)
εb
b
except now, the matrix Z also includes s additional polynomial and interaction terms
of the explanatory variables. The parameter vector is α1 , ..., αm+s . We want to test
the null hypothesis:
(
H0 : α1 = ... = αm+s = 0
H1 : at least one αj 6= 0, j ∈ {1, ..., m + s}
LM = N · Rb2 ∼ χ2m+s
One difficulty with the White test is that it can detect problems other than heteroskedas-
ticity.
Thus, while it is a useful diagnostic, be careful about interpreting the test result -
instead of heteroskedasticity, it may be that you have an incorrect functional form, or
an omitted variable.
The test is also performed similar to how it was for the univariate regression with one
explanatory variable:
W_t <- lmtest::bptest(mdl_1, ~ x1 + I(x1^2) + x2 + I(x2^2) + x1:x2)
print(W_t)
##
## studentized Breusch-Pagan test
##
## data: mdl_1
## BP = 43.195, df = 5, p-value = 3.373e-08
We again reject the null hypothesis of homoskedasticity and conclude that the residuals are heteroskedastic.
Heteroskedasticity-Consistent Standard Errors (HCE)
Once, we have determined that the errors are heteroskedastic, we want to have a way to account
for that.
One alternative is to stick with OLS estimates, but correct their variance. This is known as the
White correction
- it will not change the β OLS , which are unbiased and ineffective, but it will
b
correct Var
d β b
OLS .
If the errors from the error vector ε are independent, but have distinct variances, so that Var (ε|X) =
Σ = diag(σ12 , ..., σN
2 ).
Then the true underlying variance-covariance matrix of the OLS estimates would be:
−1 −1
β OLS = X > X
V b X > ΣX X > X
Since E(i ) = 0, then Var(i ) = E(2i ) = σi2 . Which means that we can estimate the variance diagonal
elements as:
σi2 = b
b 2i
b = diag(b21 , ..., b2N ). Then the Whites Estimators, also known as the Heteroscedasticity-Consistent
Let Σ
Estimator (HCE), can be specified as:
−1 −1
β OLS = X > X
VHCE b X >Σ
bX X >X
## x1 x2
## 693.25117 -91.740409 -52.244062
## x1 -91.74041 46.570134 2.341761
## x2 -52.24406 2.341761 4.749053
or, via the built-in functions:
V_HC_1 <- sandwich::vcovHC(mdl_1, type = "HC0")
print(V_HC_1)
## (Intercept) x1 x2
## (Intercept) 693.25117 -91.740409 -52.244062
## x1 -91.74041 46.570134 2.341761
## x2 -52.24406 2.341761 4.749053
Having estimated the standard errors is nice, but we would also like to get the associated p-values.
print(lmtest::coeftest(mdl_1, vcov. = V_HC_1))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.89066 26.32966 -0.0338 0.9730
## x1 7.21929 6.82423 1.0579 0.2914
## x2 -1.83213 2.17923 -0.8407 0.4015
compared to the biased standard errors:
print(coef(summary(mdl_1)))
I for small sample sizes, the standard errors from HC0 are quite biased. This
β OLS
b
results in overly liberal inferences in regression models.
I HC0 is consistent when the errors are heteroskedastic - the bias shrinks as the
β OLS
b
sample size increases.
Estimators in this family have come to be known as sandwich estimators - X > ΣX b is the filling
−1
between two matrices X > X .
Consequently, alternative estimators to HC0b β OLS
were proposed, which are asymptotically
equivalent to HC0b β
, but have superior small sample properties, when compared to HC0b β
.
OLS OLS
I An alternative HC estimator adjusts the degrees of freedom of HC0 :
β OLS
b
N −1 −1
HC1 = X >X X > diag(b 2N )X X > X
21 , ..., b
β OLS
b N − (k + 1)
Another HC estimator is defined based on extensive research, that finite-sample bias is a result of the
existence of points of high leverage. It uses the weight matrix, which is defined as:
−1
H = X X >X X>
I Then the HC estimator is constructed by by weighting the i-th squared OLS residual by using the
i-th diagonal elements of the weight matrix H:
−1 b21 b2N −1
HC2 = X >X X > diag , ..., X X >X
β OLS
b 1 − h11 1 − hNN
I Similarly, using 1/(1 − h11 )2 , instead of 1/(1 − h11 ) for the weights leads to yet another HCE
specification:
>
−1 > b21 b2N −1
HC3 = X X X diag , ..., X X >X
β OLS
b (1 − h11 )2 (1 − hNN )2
Evaluating the empirical power of the four methods: HC0, HC1, HC2 and HC3 suggested that
HC3 is a superior estimate, regardless of the presence, or absence, of heteroskedasticity.
Nevertheless, the performance of HC3 depends on the presence, or absence, of points of high
leverage in X and it may fail for certain forms of heteroskedasticity (for example, when the
predictors are from heavy-tailed distributions, and the errors are from light-tailed distributions).
21 2N
−1 −1
> > b b >
HC4b = X X X diag , ..., X X X
β OLS (1 − h11 )δ1 (1 − hNN )δN
where:
N · hii
δi = min 4,
k +1
Simulation tests indicated that HC4 can outperform HC3 when there are high
leverage points and non-normal errors.
Nevertheless, even with these alternative specifications, the variability of HCE are often larger
than model-based estimators, like WLS, GLS of FGLS, if the residual covariance-matrix is
correctly specified. On the other hand, HCE are derived under a minimal set of assumptions
about the errors. As such, it is useful when heteroskedasticity if of an unknown form and cannot
be adequately evaluated from the data.
Therefore, when using HCE (instead of some other model-based estimation methods),
we are trading efficiency for consistency.
Furthermore, there is also another specification, HC5 (Source).
In our example dataset different HCE can be easily specified with the built-in functions:
print(lmtest::coeftest(mdl_1, vcov. = sandwich::vcovHC(mdl_1, type = "HC0")))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.89066 26.32966 -0.0338 0.9730
## x1 7.21929 6.82423 1.0579 0.2914
## x2 -1.83213 2.17923 -0.8407 0.4015
print(lmtest::coeftest(mdl_1, vcov. = sandwich::vcovHC(mdl_1, type = "HC1")))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.89066 26.52939 -0.0336 0.9733
## x1 7.21929 6.87600 1.0499 0.2950
## x2 -1.83213 2.19576 -0.8344 0.4051
print(lmtest::coeftest(mdl_1, vcov. = sandwich::vcovHC(mdl_1, type = "HC2")))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.89066 26.59922 -0.0335 0.9733
## x1 7.21929 6.89407 1.0472 0.2963
## x2 -1.83213 2.20143 -0.8322 0.4063
print(lmtest::coeftest(mdl_1, vcov. = sandwich::vcovHC(mdl_1, type = "HC3")))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.89066 26.87208 -0.0331 0.9736
## x1 7.21929 6.96475 1.0365 0.3012
## x2 -1.83213 2.22390 -0.8238 0.4110
print(lmtest::coeftest(mdl_1, vcov. = sandwich::vcovHC(mdl_1, type = "HC4")))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.89066 26.72908 -0.0333 0.9735
## x1 7.21929 6.92585 1.0424 0.2985
## x2 -1.83213 2.21168 -0.8284 0.4085
Note: Python currently does not have HC4 specification. Available specifications can be found
[here](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.html
As per the sandwich::vcovHC documentation, we can even calculate the HC5 specification:
print(lmtest::coeftest(mdl_1, vcov. = sandwich::vcovHC(mdl_1, type = "HC5")))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.89066 26.52816 -0.0336 0.9733
## x1 7.21929 6.87476 1.0501 0.2950
## x2 -1.83213 2.19536 -0.8345 0.4050
Weighted Least Squares (WLS)
After detecting heteroskedasticity in the errors, we may want to impose a structure on the
residual covariance matrix and estimate the coefficients via GLS. If we know that there is no
serial correlation in the errors, then the covariance is diagonal. This leads to a specific case of
GLS - weighted least squares (WLS).
In reality, we do not know the true form of heteroskedasticity. So, even knowing that the
covariance matrix is diagonal still does not say anything about the diagonal elements, they could
be either of the following:
taking logarithms of both sides and adding/removing the OLS residual yields:
2i
2i ) = α0 + α1 Z1,i + ... + αm Zm,i + log
b
log(b
σi2
= α0 + α1 Z1,i + ... + αm Zm,i + vi
using this model we can estimate α0 , ..., αm via OLS. the properties of this model depend on the
introduced error term vi - whether it is homoskedastic, with zero-mean. In smaller samples it is
not, but in larger samples it is closer to what we expect from the error term.
Feasible GLS Procedure:
1. Estimate the regression Y = Xβ + ε via OLS.
2. Use the residuals bεOLS and create log(2i ).
3. Estimate the regression log(2i ) = α0 + α1 Z1,i + ... + αm Zm,i + vi and calculate
\
the fitted values log( 2 ). In practice we use the same variables from X, unless we
i
know for sure that there are additional explanatory variables, which may determine
heteroskedasticity.
4. Take the exponent of the fitted values: bhi = exp log(\ 2) .
i
q
−1
5. Estimate the regression Y = Xβ + ε via WLS using weights ωi = 1/ b hi , i.e.
q q
use FGLS with Ψ> = Ψ = diag 1/ b h1 , ..., 1/ bhN .
q
As noted before, applying WLS is equivalent to dividing each observation by b hi and estimating
OLS on the transformed data:
q q q q q
Yi / bhi = β0 · 1/ b hi + β1 · X1,i / b hi + ... + βk · Xk,i / b hi + i / b hi
## est se
## -1.392372 10.1948202
## x1 7.976445 4.0032528
## x2 -2.024586 0.8815747
Note: in most econometric software, you need to pass weights, which are inversely
proportional to the variances. In other words you need to supply weights = 1/σi2 = 1/ωi2
which are the diagonal elements of Ω−1 , not Ψ (regardless of your specification of Σ). Then,
the software automatically takes the square root of the specified values when calculating.
Using built-in functions:
mdl_wls <- lm(y ~ x1 + x2, data = data_mat, weights = 1 / h_est)
print(round(coef(summary(mdl_wls)), 5))
ε∗WLS = Y ∗ − X ∗ β
In other words we need to calculate: b b
WLS .
In most software, using built-in WLS estimation, the residuals are calculated as:
εWLS = Y − X β
b b
WLS .
q q
Consequently, since we have specified Ψ> = Ψ = diag 1/ b h1 , ..., 1/ b
hN , we can calculate
the residuals for the transformed model as:
q q
εb∗ WLS = Ψ> b
εWLS = diag 1/ b
h1 , ..., 1/ b εWLS
hN b
mdl_1$residuals
400
400
0
0
−400
−400
−30 −20 −10 0 10 20 0 1 2 3 4 5
mdl_1$fitted.values x1
4
e_star
e_star
0
0
−4
mdl_1$residuals
400
400
0
0
−400
−400
−30 −20 −10 0 10 20 0 1 2 3 4 5
mdl_1$fitted.values x1
WLS Residuals (of ORIGINAL data) vs Fitted WLS Residuals (of ORIGINAL data) vs X1
mdl_wls$residuals
mdl_wls$residuals
400
400
0
0
−400
−400
−30 −20 −10 0 10 20 30 0 1 2 3 4 5
mdl_wls$fitted.values x1
Looking at the original, untransformed, data it would appear that we did not account
for heteroskedasticity but this is not the case. As mentioned, we have created a
model on the transformed data, hence we should analyse the residuals of the
transformed data.
As mentioned before - we have attempted to approximately calculate the weights, using the
logarithms of the squared residuals. Therefore, we may not always be able to capture all of the
heteroskedasticity. Nevertheless, it is much better than it was initially.
q
On the other hand, if we were to use 1/i as the weights, instead of 1/ b
hi :
mdl_wls_2 <- lm(y ~ x1 + x2, data = data_mat, weights = 1 / (1:N)^2)
print(round(coef(summary(mdl_wls_2)), 5))
1/1:N * mdl_wls_2$residuals
2
2
1
1
0
0
−1
−1
−2
Rg2 = Corr(Y , Y
b )2
where Y
b are the fitted values of the original (i.e. non-weighted) dependent variable).
## [1] 0.04576885
print(summary(mdl_wls)$r.squared)
## [1] 0.04576885
r2g_ols <- cor(y, mdl_1$fitted)^2
r2g_wls <- cor(y, mdl_wls$fitted)^2
print(paste0("OLS pseudo-R^2 = ", r2g_ols))
As was mentioned before regarding R 2 - it is not always a good measure of the overall model
adequacy. Even if the OLS R 2 is higher, if the residuals do not conform to (MR.2) - (MR.6)
assumptions, then the model and its hypothesis test results are not valid.
Choosing between HCE and WLS
The generalized least squares estimator require that we know the underlying form of the
variance-covariance matrix.
Regarding HCE:
I The variance estimator is quite robust because it is valid whether
heteroskedasticity is present or not, but only in a matter that is appropriate
asymptotically.
I In other words, if we are not sure whether the random errors are heteroskedastic or
homoskedastic, then we can use a robust variance estimator and be confident that
our standard errors, t-tests, and interval estimates are valid in large samples.
I In small samples, whether we modify the covariance estimator or not, the usual
statistics will still be unreliable.
I This estimator needs to be modified, if we suspect that the errors may exhibit
autocorrelation (of some unknown form).
Regarding WLS:
I If we know the underlying form of the residual variance-covariance matrix, then
FGLS would result in more efficient estimators;
I If we specify the covariance structure incorrectly, i.e. we do not completely remove
heteroskedasticity, then the FGLS estimator will be unbiased, but not the best and
the standard errors will still be biased (as in the OLS case).
So, both methods have their benefits and their drawbacks. However, we can combine
them both:
I Attempt to correct for heteroskedasticity via the WLS;
I Test the residuals from WLS for heteroskedasticity;
I If the WLS residuals are homoskedastic - the WLS has improved the precision of
our model;
I If the WLS residuals are heteroskedastic - we may use HCE on the WLS residuals.
I This way we protect ourselves from a possible misspecification of the unknown
variance-covariance structure.
Monte Carlo Simulation: OLS vs FGLS
To illustrate the effects of heteroskedasticity on the standard errors of the estimates, and
efficiency between OLS, GLS and FGLS, we will carry out a Monte Carlo simulation. We will
simulate the following model:
(m) (m) (m) (m) (m)
Yi = β0 + β1 X1,i + β2 X2,i + i , i ∼ N (0, i 2 · σ 2 ), i = 1, ..., N, m = 1, ..., MC
We will simulate a total of MC = 1000 samples from this model with specific coefficients and
estimate the parameters via OLS, WLS, as well as correct the standard errors of OLS via HCE.
We will do so with the following code:
set.seed(123)
# Number of simulations:
MC <- 1000
# Fixed parameters
N <- 100
beta_vec <- c(10, 5, -3)
# matrix of parameter estimates for each sample:
beta_est_ols <- NULL
beta_pval_ols <- NULL
beta_pval_hce <- NULL
#
beta_est_wls <- NULL
beta_pval_wls <- NULL
#
beta_est_gls <- NULL
beta_pval_gls <- NULL
for(i in 1:MC){
# simulate the data:
x1 <- seq(from = 0, to = 5, length.out = N)
x2 <- sample(seq(from = 3, to = 17, length.out = 80), size = N, replace = TRUE)
e <- rnorm(mean = 0, sd = 1:N, n = N)
y <- cbind(1, x1, x2) %*% beta_vec + e
data_mat <- data.frame(y, x1, x2)
# estimate via OLS
mdl_0 <- lm(y ~ x1 + x2, data = data_mat)
# correct OLS se's:
mdl_hce <- lmtest::coeftest(mdl_0, vcov. = sandwich::vcovHC(mdl_0, type = "HC0"))
# estimate via WLS
resid_data <- data.frame(log_e2 = log(mdl_0$residuals^2), x1, x2)
h_est <- exp(lm(log_e2 ~ x1 + x2, data = resid_data)$fitted.values)
mdl_wls <- lm(y ~ x1 + x2, data = data_mat, weights = 1 / h_est)
# estimate via GLS (by knowing the true covariance matrix)
mdl_gls <- lm(y ~ x1 + x2, data = data_mat, weights = 1 / (1:N)^2)
# Save the estimates from each sample:
beta_est_ols <- rbind(beta_est_ols, coef(summary(mdl_0))[, 1])
beta_est_wls <- rbind(beta_est_wls, coef(summary(mdl_wls))[, 1])
beta_est_gls <- rbind(beta_est_gls, coef(summary(mdl_gls))[, 1])
# Save the coefficient p-values from each sample:
beta_pval_ols <- rbind(beta_pval_ols, coef(summary(mdl_0))[, 4])
beta_pval_hce <- rbind(beta_pval_hce, mdl_hce[, 4])
beta_pval_wls <- rbind(beta_pval_wls, coef(summary(mdl_wls))[, 4])
beta_pval_gls <- rbind(beta_pval_gls, coef(summary(mdl_gls))[, 4])
}
Firstly, it is interesting to see how many times we would have rejected the null hypothesis that a parameter is
insignificant with significance level α = 0.05. We will divide the number of times that we would have rejected the
null hypothesis by the total number of samples MC to get the rejection rate:
alpha = 0.05
a1 <- colSums(beta_pval_ols < alpha) / MC
a2 <- colSums(beta_pval_hce < alpha) / MC
a3 <- colSums(beta_pval_wls < alpha) / MC
a4 <- colSums(beta_pval_gls < alpha) / MC
#
a <- t(data.frame(a1, a2, a3, a4))
colnames(a) <- names(a1)
rownames(a) <- c("OLS: H0 rejection rate", "HCE: H0 rejection rate", "WLS: H0 rejection rate", "GLS:
print(a)
## (Intercept) x1 x2
## OLS: H0 rejection rate 0.062 0.254 0.569
## HCE: H0 rejection rate 0.109 0.216 0.584
## WLS: H0 rejection rate 0.195 0.369 0.982
## GLS: H0 rejection rate 0.856 0.605 1.000
I We see that we would have rejected the null hypothesis that X2 is insignificant in around 55.5% of the
simulated samples with OLS.
I If we were to correct for heteroskedasticity, this would have increased to 57.4%.
I On the other hand, if we were to re-estimate the model with WLS, then we would have rejected the null
hypothesis that X2 is insignificant around in around 97.6% of the simulated samples.
One possible explanation for the relatively poor performance of HCE is the fact that it uses
Σ
b = diag(b21 , ..., b
2N ) for the covariance matrix. In this case it is clearly inferior to the covariance
matrix specification used in WLS. On the other hand, as we have already mentioned - HCE are
only useful in large samples.
Because of the large variance of i , we see that we would often not reject the null hypothesis that
X1 is insignificant. On the other hand, as we can see, using WLS reduces this risk significantly.
Finally, if we were to know the true covariance structure - we could incorporate GLS - this
would mean that we would almost never reject the null hypothesis that the coefficient of X2 is
insignificant! Furthermore, the rejection rate of the null hypothesis for β0 and β1 are also
significantly larger.
Unfortunately, in empirical applications we will never know the true covariance structure,
so the GLS results are only presented as the best case scenario, which we would hope
to achieve with FGLS.
We can look at the coefficient estimate histograms:
x2 x1
100 150
80 120
True Value True Value
OLS OLS
Frequency
Frequency
WLS WLS
GLS GLS
50
40
0
0
−6 −4 −2 0 2 −10 −5 0 5 10 15 20
beta_est_ols[, j] beta_est_ols[, j]
(Intercept)
100 150
True Value
OLS
Frequency
WLS
GLS
50
0
−40 −20 0 20 40 60
beta_est_ols[, j]
I OLS estimates are less efficient than WLS in terms of the estimate variance.
I On the other hand, both estimates are unbiased - their average is equal to the true
parameter value. - Consequently, because OLS estimators are less efficient, we would need a
larger sample to achieve a similar level of precision as the WLS.
As mentioned before, GLS is the best case scenario, when we exactly know the true covariance
structure and do not need to estimate it.
Autocorrelated (Serially Correlated)
Errors
Autocorrelated (Serially Correlated) Errors
Consider the case where assumption (MR.4) does not hold, but assumption (MR.3) (and the
other remaining assumptions (MR.1), (MR.2), (MR.5) and, optionally, (MR.6)) are still valid.
Then we can write the following model as:
The case when the error variance-covariance matrix is no longer diagonal, but with equal
diagonal elements, expressed as:
2
σ σ1,2 σ1,3 ... σ1,N
σ2,1 σ2 σ2,3 ... σ2,N
Σ= . .. , σi,j = σj,i = Cov(i , j ) 6= 0, i, j = 1, ..., N
.. .. ..
.. . . . .
σN,1 σN,2 σN3, ... σ2
is referred to as autocorrelation (or serial correlation). Just like before with heteroskedasticity:
I the OLS estimators remain unbiased and consistent;
I OLS estimators are no longer efficient;
I the variance estimator of the OLS estimates is biased and inconsistent;
I t-statistics of the OLS estimates are invalid.
It should be stressed that serial correlation is usually present in time-series data. For
cross-sectional data, the errors may be correlated in terms of social, or geographical distance.
For example, the distance between cities, towns, neighborhoods, etc.
3. Test the null hypothesis that the residuals are serially correlated: create a
model on the OLS residuals
H0 : ρ1 = ρ2 = ... = ρp = 0
4. If we fail to reject the null hypothesis - we can use the usual OLS estimators.
5. If we reject the null hypothesis, there are two ways we can go:
I Use the OLS estimators, but correct their variance estimators (i.e. make them
consistent);
I Instead of OLS, use FGLS (and its variations).
I Attempt to specify a different model, which would hopefully, be able to account for
serial correlation (autocorrelation may be the cause of a misspecified model).
Example
We will simulate the following model:
(
Yi = β0 + β1 X1,i + β2 X2,i + i
i = ρi−1 + ui , |ρ| < 1, ui ∼ N (0, σ 2 ), 0 = 0
set.seed(123)
#
N <- 200
beta_vec <- c(10, 5, -3)
rho <- 0.8
#
x1 <- seq(from = 0, to = 5, length.out = N)
x2 <- sample(seq(from = 3, to = 17, length.out = 80), size = N, replace = TRUE)
# serially correlated residuals:
ee <- rnorm(mean = 0, sd = 3, n = N)
for(i in 2:N){
ee[i] <- rho * ee[i-1] + ee[i]
}
#
x_mat <- cbind(1, x1, x2)
y <- x_mat %*% beta_vec + ee
data_mat <- data.frame(y, x1, x2)
Testing For Serial Correlation
We can examine the presence of autocorrelation from the residuals plots, as well as conducting a
number of formal tests.
We will begin by estimating our model via OLS, as we usually would.
mdl_1 <- lm(y ~ x1 + x2, data = data_mat)
print(round(coef(summary(mdl_1)), 5))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.61601 0.97983 10.83457 0
## x1 4.89018 0.20457 23.90497 0
## x2 -2.93686 0.07545 -38.92241 0
Residual Correlogram
We begin by calculating the residuals from the OLS model:
ε = Y − Xβ
b b
OLS
we will use these residuals as estimates for the true (unobserved) error term ε and examine their
autocorrelations. Note that from the OLS structure it holds true that E (b ε|X) = 0. So, assume
that we want to calculate the sample correlation between b i and b
i−k - we want to calculate the
autocorrelation at lag k:
PN
d i , b
Cov(b ) i b
i−k
ρb(k) = Corr(b
[ i , b
i−k ) = q q i−k = i=k+1 b
PN 2
d i ) Var(b
Var(b d i−k ) i
i=1 b
Furthermore, assume that we are interested in testing whether the sample (auto)correlation ρb(k)
is significantly different from zero:
H0 : ρb(k) = 0
H1 : ρb(k) 6= 0
a a
Under the null hypothesis it holds true that ρb(k) ∼ N (0, 1/N), where ∼ indicates asymptotic
distribution - the distribution as the sample size N → ∞. In this case, it means that for large
samples the distribution is approximately normal. Consequently, a suitable statistic can be
constructed as:
ρb(k) − 0 √ a
Z= p = ρb(k) · N ∼ N (0, 1)
1/N
So, at a 5% significance level, the critical value is Zc ≈ 1.96. In this case, w would reject the
null hypothesis when: √ √
ρb(k) · N ≥ 1.96, or ρb(k) · N ≤ 1.96
or alternatively, if we want to have the notation similar to
“estimate ± critical value · se(estimate)”, we can write it as:
√ √
ρb(k) ≥ 1.96/ N, or ρb(k) ≤ 1.96/ N
As a rule of thumb, we can sometimes take 2 instead of 1.96. We can also calculate this via software:
z_c <- qnorm(p = 1 - 0.05 / 2, mean = 0, sd = 1)
print(z_c)
## [1] 1.959964
We will begin by manually calculating the autocorrelations and their confidence bounds for k = 1, ..., 20:
rho <- NULL
e <- mdl_1$residuals
for(k in 1:20){
r <- sum(e[-c(1:k)] * e[1:(N-k)]) / sum(e^2)
rho <- c(rho, r)
}
rho_lower <- - z_c / sqrt(N)
rho_upper <- + z_c / sqrt(N)
5
10
1:k
15
20
Alternatively, we can use the built-in functions:
forecast::Acf(e)
Series e
0.6
0.4
ACF
0.2
0.0
−0.2
5 10 15 20
Lag
We can see from the plots that there are some sample autocorrelations ρb(k), which are
statistically significantly different from zero (i.e. their values are above the blue horizontal line).
As with most visual diagnostics tools - it may not always be a clear-cut answer from the plots
alone. So, we can apply a number of different statistical tests to check whether there are
statistically significant sample correlations.
Autocorrelation Tests
The tests are identical to the ones described in the univariate regression, but re-visited for the
multiple regression model case. These tests will be re-examined when dealing with time-series
data models.
Durbin-Watson Test
The DW test is used to test the hypothesis that the residuals are serially correlated at
lag 1, i.e. that in the following model:
i = ρi−1 + vi
Since we do not observe the true error terms, we use the OLS residuals b
i and calculate
the Durbin-Watson statistic as:
PN
i − b
(b i−1 )2
DW = i=2PN
2i
i=1 b
The DW test statistics critical values may not be available in some econometric software.
Furthermore, its distribution no longer holds, when the equation of Yi contains a lagged
dependent variable, Yi−1 .
As a quick rule of thumb, if the DW statistic is near 2, then we do not reject the null
hypothesis of no serial correlation.
PN PN
If we assume that 2i
i=2 b ≈ 2i−1 ,
i=2 b then we can re-write the DW statistic as:
hP i
N 2−
PN (
PN
2
b −2
PN
+
PN
2 2 i=2
b i i=2
bi
bi−1 4, if ρb(1) = −1
i=2 i i=2 i i−1 i=2 i−1
bb b
DW = P N
≈ PN 2 =2 1−ρ
b(1) = 2, if ρb(1) = 0
2
i=1 i
b b
i=1 i 0, if ρb(1) = 1
which helps in understanding why we expect the DW statistic to be close to 2 under the null
hypothesis.
print(lmtest::dwtest(mdl_1, alternative = "two.sided"))
##
## Durbin-Watson test
##
## data: mdl_1
## DW = 0.56637, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is not 0
because p − value < 0.05, we reject the null hypothesis at the 5% significance level and conclude that the
residuals are serially correlated at lag order 1.
Breusch-Godfrey (BG or LM) Test
The BG test can be applied to a model, with a lagged response variable on the right-hand side, for
example: Yi = β0 + β1 X1,i + ... + βk Xk,i + βk+1 Yi−1 + i .
In general, we estimate the model parameters via OLS and calculate the residuals:
bε = Y − X b
β
LM = (N − p)R 2 ∼ χ2p
b
where R2 is the R-squared from the auxiliary regression on b
ε.
b
The BG test is a more general test compared to DW
Looking at the correlogram plots Let’s test the null hypothesis that at least on of the first three
lags of the residuals is not zero, i.e. p = 3:
bg_t <- lmtest::bgtest(mdl_1, order = 3)
print(bg_t)
##
## Breusch-Godfrey test for serial correlation of order up to 3
##
## data: mdl_1
## LM test = 100.72, df = 3, p-value < 2.2e-16
Since the p − value < 0.05, we reject the null hypothesis and conclude that there is serial correlation in the
residuals.
Heteroskedasticity-and-Autocorrelation-Consistent Standard Errors (HAC)
So far, we have assumed that the diagonal elements of Σ are constant - i.e. that the residuals
are serially correlated but homoskedastic. We can further generalize this for the case of
heteroskedastic serially correlated standard errors:
2
σ1 σ1,2 σ1,3 ... σ1,N
σ2,1 σ22 σ2,3 ... σ2,N
Σ= . .. , σi,j = σj,i = Cov(i , j ) 6= 0, i, j = 1, ..., N
. .. ..
.. .. . . .
σN,1 σN,2 σN3, ... σN2
Similarly to dealing with heteroskedasticity, we can use OLS estimator and correct the standard
errors. In this case the corrected standard errors are known as HAC
(Heteroskedasticity-and-Autocorrelation-Consistent) standard errors, or Newey–West
standard errors.
The White covariance matrix assumes serially uncorrelated residuals. On the other hand, the Newey-
West proposed a more general covariance estimator, which is robust to both heteroskedasticity and
autocorrelation, of the residuals of unknown covariance form. The HAC coefficient covariance
estimator handles autocorrelation with lags up to p - it is assumed that lags larger than p are
insignificant and thus can be ignored. It is defined as:
−1 −1
HAC = X >X b X >X
NΣ
β OLS
b
where:
p h
j
X i
Σ
b =b
Γ(0) + 1− Γ(j) + b
b Γ(−j)
p+1
j=1
N
!
1 X >
Γ(j) =
b i i−j X i X >
bb i−j , X i = 1, X1,i , X2,i , ..., Xk,i
N
i=1
Σ
b =b
Γ(0)
## (Intercept) x1 x2
## (Intercept) 1.43560709 -0.303808295 -0.043572462
## x1 -0.30380830 0.183608167 -0.002018108
## x2 -0.04357246 -0.002018108 0.004421496
Following the documentation, NeweyWest() is a convenience interface to vcovHAC() using Bartlett kernel weights.
In comparison vcovHAC() allows choosing weights as either weightsAndrews, or weightsLumley, or a custom
function to calculate the weights.
Then the coefficients, their standard errors and p-values can be summarized as:
# print(lmtest::coeftest(mdl_1, sandwich::vcovHAC(mdl_1)))
print(lmtest::coeftest(mdl_1, sandwich::NeweyWest(mdl_1, lag = 1))[, ])
we note that:
Var(i ) = Var(ρi−1 + ui ) = Var(ρ(ρi−2 + ui−1 ) + ui )
= Var(ρ2 (ρi−3 + ui−2 ) + ui + ρui−1 )
= ...
∞ ∞
!
X X
j
= Var ρ ui−j = ρ2j · Var (ui−j )
j=0 j=0
σ2
=
1 − ρ2
and:
Cov(i , i−k ) = Cov(ρ(ρi−2 + ui−1 ) + ui , i−k )
= Cov(ρ2 (ρi−3 + ui−2 ) + ρui−1 , i−k )
= ...
= Cov(ρk i−k , i−k )
= ρk Cov(i−k , i−k )
= ρk σ 2 , since Cov(ui , j ) = 0, i 6= j
ρ2 ρN−1
1 ρ ...
σ 2 ρ 1 ρ ... ρN−2
Σ=
. . .. .. ..
1 − ρ2 .. ..
. . .
ρN−1 ρN−2 ρN−3 ... 1
In this case, the knowledge of parameter ρ allows us to empirically apply the GLS - as FGLS.
Cochrane–Orcutt (CORC) estimator:
1. Estimate β via OLS from:
Y = Xβ + ε
(1)
ε = Y − Xb
and calculate the residuals b β , where (1) denotes the first iteration.
2. Estimate the following residual regression via OLS:
b(1)
i = ρb
(1)
i−1 + b
ui
The final value ρb(K ) is then used to get the FGLS (CORC) estimates of β.b Again,
depending on your specification, you may need to transform the intercept coefficient:
βb0 = βb0∗ /(1 − ρb(K ) ).
These FGLS estimators are not unbiased but they are consistent and asymp-
totically efficient.
The procedure can be carried out with the built-in functions as follows:
mdl_1_CORC <- orcutt::cochrane.orcutt(mdl_1)
#
print(coef(summary(mdl_1_CORC)))
## [1] 0.7088817
which is close to the true value of ρ used in the data generation.
Alternative procedures to CORC are Hildreth-Lu Procedure and Prais–Winsten Procedure.
Choosing Between HAC and FGLS
It has become more popular to estimate models by OLS and correct the standard errors for fairly
arbitrary forms of serial correlation and heteroskedasticity.
Regarding HAC:
I Computation of robust standard errors works generally well in large datasets;
I With increase in computational power, not only has it become possible to
(quickly) estimate models on large datasets, but also to calculate their robust
covariance (HAC) estimators;
I While FGLS offers a theoretical efficiency, it involves making additional
assumptions on the error covariance matrix, which may not be easy to test/verify,
which may threaten the consistency of the estimator.
Regarding FGLS:
I if the explanatory variables are not strictly exogenous (i.e. if we include Yi−1 on
the right-hand-side of the equation) - the FGLS is not only inefficient, but it is
also inconsistent;
I in most applications of FGLS, the errors are assumed to follow a first order
autoregressive process. It may be better to evaluate OLS estimates and use a
robust correction on their standard errors for more general forms of serial
correlation;
I in addition to imposing an assumption of the residual covariance structure in
regard to autocorrelation, GLS also requires an exogeneity assumption (MR.3) to
hold, unlike HAC.
Generally, serial correlation is usually encountered in time-series data, which has its own
set of models that specifically deal address serial correlation of either the residuals ,
the endogenous variable Y , or the exogeneous variables Xj , or even all at once.
It is worth noting that autocorrelated residuals are more frequently the result of a
misspecified regression equation, rather than a genuine autocorrelation.
A final thing to note:
In many cases, the presence of autocorrelation, especially in cross-sectional data, is not
an indication that the model has autocorrelated errors, but rather that it:
I is misspecified;
I is suffering from omitted variable bias;
I has an incorrect functional form for either Y , or X .
Monte Carlo Simulation: OLS vs FGLS
To illustrate the effects of heteroskedasticity on the standard errors of the estimates, and
efficiency between OLS and FGLS, we will carry out a Monte Carlo simulation. We will simulate
the following model:
(
(m) (m) (m) (m)
Yi = β0 + β1 X1,i + β2 X2,i + i
(m) (m) (m) (m) (m)
i = ρi−1 + ui , |ρ| < 1, ui ∼ N (0, σ 2 ), 0 = 0, i = 1, ..., N, m = 1, ..., MC
We will simulate a total of MC = 1000 samples from this model with specific coefficients and
estimate the parameters via OLS, WLS, as well as correct the standard errors of OLS via HCE.
See the lecture notes for the code sample
Regarding the rejection rate of the null hypothesis that a parameter is insignificant, we have that:
## (Intercept) x1 x2
## OLS: H0 rejection rate 0.373 0.505 0.757
## HAC: H0 rejection rate 0.457 0.365 0.919
## FGLS: H0 rejection rate 0.135 0.122 0.999
So, the FGLS seems to be better in some parameter cases, while HAC may be similar to OLS. All
in all, if we only have an autocorrelation of order 1 problem - the corrections may not have a
huge impact on our conclusions in some cases.
We can look at the coefficient estimate histograms:
x2 x1
200
150
Frequency
Frequency
True Value True Value
OLS OLS
100
FGLS FGLS
0 50
0
−0.6 −0.4 −0.2 0.0 −2 −1 0 1 2 3
beta_est_ols[, j] beta_est_ols[, j]
(Intercept)
Frequency
True Value
100
OLS
FGLS
0 40
−5 0 5 10
beta_est_ols[, j]
For higher order and more complex correlation structure of the residuals, this may not always be the
case, hence, if we detect serial correlation, we should account for it in some way.