
UNDERGRADUATE: YEAR 2

ECONOMETRICS
(08 29172)

LECTURE NOTES BOOKLET


(WEEKS 1 – 5)

LECTURER: Dr Joanne Ercolani



What is Econometrics?
Econometrics is the interaction of economic theory, observed data and
statistical methods.

Before we make a start on our econometric journey, it is useful to explain to you why
econometrics is important, without technical details. It is perhaps an oversimplified analysis
and makes the job of an econometrician look much less complicated than it is in practice, but
it should give you a general view of what econometricians do and why it is so useful.

Economic theory tends to make qualitative statements about the relationships between
economic variables, e.g. we know that if the price of a good decreases then the demand for
that good should increase, ceteris paribus, or that the more education an individual receives,
the more money they will earn. These theories do not provide a numerical measure of the
relationship; they simply state that the relationship is negative or positive. It is the
econometrician's job to provide the numerical or empirical content to the economic theory.
Econometricians are therefore responsible for the empirical verification of economic theories.
To do this they use mathematical equations to express economic theories and use real
economic data to test these theories.

The methodology might proceed as follows:

1. Statement of economic theory or hypothesis


2. Specification of the theory in an equation
3. Specification of the econometric model
4. Obtaining the data
5. Estimation of the parameters of the econometric model
6. Hypothesis testing
7. Beyond inference

Let’s consider each step using a particular economic theory, the Keynesian theory of
consumption.

1. The theory is that on average, people increase their consumption as their income
increases, but by a smaller amount than the increase in income. Essentially, Keynes was
stating that the relationship is positive (income and consumption move in the same
direction) and that the Marginal Propensity to Consume (MPC) which measures how
much of a rise in income gets consumed, is less than 1 (consumption increases by less
than the increase in income).


2. The economist might suggest a simple way to specify this Keynesian consumption
function using the equation of a straight line as follows:

Y = β₁ + β₂X    (1)

In this function we call Y the dependent variable (in this case consumption) and X is the
explanatory variable (in this case income). Equation (1) explicitly states the direction of
causality, i.e. changes in income determine changes in consumption, not the other way
around. The terms β₁ and β₂ are the parameters of the model. The β₁ parameter is the
intercept coefficient and β₂ is the slope coefficient. The slope is very important as it
reflects how much influence the X variable has on the Y variable. In the consumption
function above, β₂ is interpreted as the MPC because it is the coefficient on income and
so directly reflects how income affects consumption. We would expect 0 < β₂ < 1 to signal
both the positive relationship between income and consumption (β₂ > 0) and the MPC
being less than one (β₂ < 1).

In a diagrammatic representation of this, the function is upward sloping.


[Figure: the consumption function as an upward-sloping straight line, with consumption Y on the vertical axis and income X on the horizontal axis; the intercept is β₁ and the slope, β₂, is the MPC.]

3. Equation (1) is an exact relationship (also called deterministic or non-random). In practice


however economic relationships are rarely exact. If we could collect data on income and
consumption for many different households, it is unlikely that all households would lie
exactly along the line drawn above. Firstly, income is not the only determinant of
consumption (the model is a simplification of true economic behaviour) and secondly
there is randomness in human behaviour such that two people on the same income do
not necessarily consume the same amount. So equation (1) and the diagram depict the
general trend of the relationship between the two variables, rather than the behaviour
of each individual. To reflect that the economic relationship is not exact, we transform
(1) into an econometric model by adding a disturbance or error term ε so that

Y = β₁ + β₂X + ε    (2)

The term ε is a random variable, also called a stochastic variable. It is used to represent
all of the factors that affect consumption that we have not included explicitly in the model
(maybe because they are variables that cannot be measured or observed) as well as the
fact that there is inherent randomness in people’s behaviour.

4. The next step is to collect data that are relevant to the model. In this example, we would
collect data on consumption and income. You can find many sources of economic data
on the internet. For the consumption function example, one could do a micro-level
analysis by using data on the income and consumption behaviour of the individuals living
in a particular country. This would be called a cross-sectional regression. Or one could do
a macro-level analysis using aggregate economy data where the data are observed over
time, say 1950-2016. This would be a time series regression.

5. Once you have specified your model and have an appropriate set of data, we bring the
two together and estimate the unknown parameters β₁ and β₂ in the regression model.
This is one of the main topics of this module and we will deal with the procedures used
for estimation in detail later. For now, let’s suppose that we have applied an estimation
technique and our estimates are β̂₁ = 184.28 and β̂₂ = 0.71, so that the estimated model
can be written as

Ŷᵢ = 184.28 + 0.71Xᵢ

where a ^ denotes an estimated value. The i subscript here denotes the individuals over
which the data are observed. If we have 1000 people in our sample, so 1000 pairs of
observations of income and consumption, then i = 1, …, 1000. We have now provided
some empirical content to Keynes' theory. The parameter β₂ represents the MPC, and
this has been estimated to have a value of 0.71. This means that for every £1 increase in
income, on average this leads to an increase in consumption of 71p (assuming the data
are for individuals who live in the UK).

6. Once we have estimated the parameters as above, we can test various hypotheses to see
if the economic theory actually holds. In this case, Keynes' theory was that the MPC is
positive but less than 1. Although the value obtained is 0.71 and therefore satisfies the
requirements, we need to ensure that the value is sufficiently above 0 and sufficiently
below 1 that it could not have occurred by chance or because of the particular data set
that we have used. What we want to make sure of is that 0.71 is statistically above 0 and
statistically below 1. This is where hypothesis testing comes in. Again we will not go into
details here but this does form another major component of this module.

7. Beyond this inference, once we have shown that the economic theory stands up to
empirical scrutiny, econometricians can use their estimates to predict values for the
dependent variable based on values of the explanatory variable(s). Ultimately the
inference may help government to formulate better policies. From simply knowing the
value estimated for the MPC the government can manipulate the control variable X to
produce a desired level for the target variable Y.
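To make the prediction idea in step 7 concrete, here is a minimal Python sketch (an addition, not part of the original notes) that plugs a few hypothetical income values into the estimated consumption function from step 5, Ŷ = 184.28 + 0.71X.

```python
# A minimal sketch: predictions from the estimated consumption function
# Y-hat = 184.28 + 0.71 * X obtained in step 5.

def predicted_consumption(income: float) -> float:
    """Return predicted consumption for a given income, using the
    estimated intercept (184.28) and MPC (0.71) from the example."""
    beta1_hat, beta2_hat = 184.28, 0.71
    return beta1_hat + beta2_hat * income

for income in [10_000, 17_000, 30_000]:  # hypothetical income values
    print(income, round(predicted_consumption(income), 2))
```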


Module Overview (weeks 1–5 material)

In this module, we will cover some of the aspects from the description of econometrics in the
previous section. We will not focus too much attention on parts 1 – 4 from the list. On the
whole, in this part of the module we consider generic econometric models, specified as simple
bivariate or multiple regressions and do not often tie these to any particular economic theory
or model. Further, whilst we briefly discuss what makes a good set of data,
we do not consider where data can be accessed.

In general, the main content of the weeks 1–5 material is: (i) the estimation of simple
regression models; (ii) the statistical properties that we would like these estimators to have;
(iii) discussion of the conditions required of the regression model in order that the estimators
have these properties; (iv) discussion of what happens to the properties of estimators when
some of these model conditions cannot be met; and (v) how to test various types of
hypotheses within different models. Some basic experience of the empirical side of
econometrics, with the application of economic data to simple economic models, will be
provided in the Eviews workshops.

Before we can do any of that, it is appropriate to refresh our memories on some basic
statistical concepts. Many of the topics in the review Section 1 were considered in the year 1
Applied Economics and Statistics module and so should be fairly familiar to you.


1. A Review of Statistical Concepts


Understanding and using even the most basic econometric techniques requires an
understanding of various statistical concepts, and this section (re-)introduces you to the
concepts that you will need to get to grips with for this module. In general these are
distribution and density functions of random variables; moments, i.e. mean, variance,
covariance and correlation; random sampling; and statistical inference with its branches of
estimation and hypothesis testing.

1.1 Random Variables, Distribution and Density Functions


Usually in econometrics we deal with random variables that are continuous in nature. Hence,
the set of values that these variables take can be anywhere along a continuum and there is a
probability associated with it taking values within a particular range along that continuum.
For example, if we picked a person off the street and asked them how much they earn per
year, their income value could be anywhere on the real line above 0. However, there is a
higher probability that they earn a value around £30k than around £100k. Or we could
measure that person’s height. Their height will be anywhere along a continuum above 0, but
there is a greater probability that they will be around 5.5 feet than around 6.5 feet.

The probabilities associated with the values taken by random variables can be depicted on
graphs, called density and distribution functions. The density function of a random variable X
is often denoted f(x) and might look something like this:

[Figure: a density function f(x) over values of X; the area under the curve between two points a and b gives the probability that X falls between a and b.]

The x-axis represents the continuous range of values over which the random variable X exists
and the y-axis is the value of the density function. The probabilities associated with variable
X are measured by areas under the density function (under the curve). So, for example, the
probability that the value of X will be between the values a and b is equal to the value of area A, i.e. Area A = P(a ≤ X ≤ b) = ∫_{a}^{b} f(x) dx. Hence calculating probabilities from density functions requires integration.

A feature of random variables and their densities is that the area under the whole density is equal to unity, i.e. ∫_{−∞}^{∞} f(x) dx = 1. Why is this? Let's suppose that the variable can only take on values in the range 0 to 10. Then the probability of the variable taking on a value between 0 and 10 has to be 1 and therefore ∫_{0}^{10} f(x) dx = 1.

Related to the density function is the distribution function, often denoted F(x), which depicts
the probability that X takes on values less than x, i.e. F(a) = P(X ≤ a) = ∫_{−∞}^{a} f(x) dx, where a is a number. The distribution therefore shows the accumulation of the probabilities associated with values of X up to X = a. For the density function drawn above, this will look
something like the following:

[Figure: a distribution function F(x) rising from 0 towards 1 as X increases, with the value F(a) marked at X = a.]

So the distribution function point F(a) represents the area under the density function up to
the value a.
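As an illustration of the link between the density and the distribution function, the short Python sketch below (an addition, not part of the original notes) computes P(a ≤ X ≤ b) for a density by numerical integration and checks it against F(b) − F(a); the standard normal density and the endpoints a = 0, b = 1 are arbitrary choices.

```python
# Illustrative sketch: P(a <= X <= b) as the area under a density f(x),
# and as the difference of the distribution function, F(b) - F(a).
from scipy import stats
from scipy.integrate import quad

a, b = 0.0, 1.0                      # arbitrary endpoints
dist = stats.norm(loc=0, scale=1)    # take f(x) to be the N(0, 1) density

area, _ = quad(dist.pdf, a, b)       # integrate the density from a to b
via_cdf = dist.cdf(b) - dist.cdf(a)  # same probability via F(b) - F(a)

print(area, via_cdf)                 # both approximately 0.3413
```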

1.2 Moments – Mean and Variance


Moments describe certain characteristics of a random variable and its density/distribution
function. We will concentrate on the most common moments, the first and second, that are
usually called the mean and variance.

Mean (or expected value): often denoted as E(X) or simply μ, this measures the central
tendency of the density function of variable X. You could think of it as being the point at which
the function is balanced, where there is equal mass on either side of this point. The expected
value is given by



E(X) = ∫_{−∞}^{∞} x f(x) dx ≡ μ    (1)

For a symmetric distribution this would correspond to the value of X for which the density is
symmetric on either side, i.e. the center point.

[Figure: a symmetric density function f(x) with its mean E(X) at the centre.]

The function E(·) is the expectations operator. To calculate an expectation you take
whatever is inside the bracket and multiply it by the density function, then integrate.
So in general

E(g(X)) = ∫ g(x) f(x) dx

which is why we get (1) when g(X) = X.

Variance: this is a measure of the dispersion of the density around the mean, i.e. how spread
out is the density function. It is given by

var(X) = E[(X − μ)²] = ∫ (x − μ)² f(x) dx.    (2)

We can see what the variance implies for a density function:

[Figure: two density functions with the same mean E(X); one, with variance σ₁², is more spread out than the other, with variance σ₂².]


Both of these distributions have the same mean, but σ₁² > σ₂² because the density with variance σ₁² is more dispersed around the mean. Another term that we may often refer to is the standard deviation, which is the square root of the variance, i.e. s.d.(X) = √var(X) = √(σ_X²) = σ_X.
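The integral definitions of the mean and variance can be checked numerically. The sketch below is an added illustration: it computes E(X) and var(X) for a normal density by numerical integration, with an arbitrary choice of mean 2 and standard deviation 1.5.

```python
# Illustrative sketch: E(X) and var(X) as integrals against a density f(x).
import numpy as np
from scipy import stats
from scipy.integrate import quad

mu, sigma = 2.0, 1.5                         # arbitrary true moments
f = stats.norm(loc=mu, scale=sigma).pdf      # density f(x)

mean, _ = quad(lambda x: x * f(x), -np.inf, np.inf)               # E(X)
var, _ = quad(lambda x: (x - mean) ** 2 * f(x), -np.inf, np.inf)  # var(X)

print(mean, var)   # approximately 2.0 and 2.25 (= 1.5**2)
```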

1.3 Covariance and Correlation


The mean and variance are properties of a single random variable. Economists are usually
interested in the relationship between two or more variables.

Covariance: Covariance is a measure of how two variables say, X and Y, are associated with
each other, i.e. how they co-vary. This is given by

E ( X −  X )(Y − Y )  =   ( x −  )( y −  ) f ( x, y ) dydx  cov ( X , Y )


X Y
x y

where μ_X and μ_Y are the means of X and Y respectively and f(x, y) is the joint density
function (depicting the probabilities associated with X taking on certain values while Y takes
on certain values).

The value of the covariance can be positive or negative. When positive it implies that as the
values assumed by X increase/decrease, then the values of the Y variable also
increase/decrease. If the covariance is negative it means that as the values that the X variable
assumes increase/decrease, then the values of the Y variable decrease/increase, i.e. they
move in the opposite direction.

So for example, you would expect that the more smokers there are in a population, the more
cases of lung disease you would encounter. Hence one would expect a positive correlation
between these two variables. An economic example, and one that we will refer to many times
in this course, is the relationship between income and consumption. There would be a
positive covariance between the amount people earn and how much they consume. Or
another example, the higher the price of a good the lower the demand for that good, this
would give a negative covariance.

Correlation coefficient: this concept is similar to covariance. Covariance tells us whether two
variables are positively or negatively related. The correlation coefficient does the same but it
also gives us information about the strength of the relationship, i.e. it numerically quantifies
the relationship. It is given by

corr(X, Y) = cov(X, Y) / (σ_X σ_Y) ≡ ρ

where σ denotes the standard deviation. This coefficient takes values −1 ≤ ρ ≤ 1. If positive/negative, this means that the variables X and Y are positively/negatively related; the closer the value is to 1 or −1, the stronger is this positive/negative relationship, and the closer
to 0, the weaker the relationship. So, one would expect a correlation coefficient close to 1
between income and consumption because the income a person earns almost solely dictates
how much money they can spend. In the price/demand example one would perhaps expect
a value close to -1, indicating that when a consumer is deciding whether or not to buy a
product, one of the main factors in this decision is the price of the good. In the smoking
example, the correlation between number of smokers and incidence of lung cancer will be
positive but it is difficult to say how close to 1 it would be. Clearly it is not a foregone
conclusion that if you smoke you will get lung cancer, otherwise very few people would
undertake such an activity, so the correlation would not equal to 1 or close to 1. So it is
probably quite low, but definitely positive.

It is important to discuss at this point the difference between correlation and causation. The
correlation coefficient between income and consumption is the same as that between
consumption and income, but at the microeconomic level, it is income that affects
consumption not the other way around. So how much I earn determines how much I spend,
rather than how much I spend determining how much I earn, the causation runs from income
to consumption. And it is the smoking of cigarettes that causes lung cancer, not the lung
cancer that causes the smoking.

A more sophisticated example would look at the correlations between three or more
variables. Let’s consider number of smokers, incidence of lung cancer and sales of cigarette
lighters. The more smokers there are in a population, the more sales of cigarette lighters there
would be, hence a positive correlation. A positive correlation between smoking and lung
cancer and a positive correlation between smoking and sales of cigarette lighters would result
in a positive correlation between sales of cigarette lighters and lung cancer. But clearly it is
not the sale of cigarette lighters per se that causes lung cancer. So, one must be careful to
distinguish between correlation and causation. Correlation does not imply causation.

1.4 The Normal Distribution

This is a very useful distribution that can describe fairly well the distribution of values taken
by many everyday things. If you look at the picture of a normal density function you can see
that it has a symmetric bell shape.


[Figure: a symmetric, bell-shaped normal density f(x) centred on E(X).]

Most of the density is around the mean, in the centre, and the probability is small that the
variable will take values in the tails, i.e. relatively high or low values compared to the mean.
You could think of peoples’ heights as being normally distributed, i.e. if we were to collect
data on the heights of everyone taking Econometrics, we would find that a majority would
cluster between 5.5 and 6 foot, and very few would be shorter than 4.5 feet and very few
taller than 6.5 feet. When you look at the distribution of marks for economics modules, you
will see that a majority are clustered around the low 60s, with very few getting below say 20%
and very few getting above say 80%. The income of a population is likely to be normally
distributed with most people earning close to the mean and very few earning below £10k or
above say £200k, i.e. in the tails.

A random variable X that is normally distributed with mean μ and variance σ² is denoted X ~ N(μ, σ²). The probability of getting a value for X that is within one standard deviation of
the mean is about 68%, within 2 standard deviations is about 95% and within 3 about 99.7%.

[Figure: a normal density f(x) with the points μ − 3σ, μ − 2σ, μ − σ, μ, μ + σ, μ + 2σ and μ + 3σ marked on the X axis.]

The density function of a normal random variable is given by

f(x) = (1/√(2πσ²)) exp( −(x − μ)² / (2σ²) )

where σ > 0, −∞ < μ < ∞ and −∞ < x < ∞. A special case of the normal is the Standard Normal distribution, which is a normal distribution with a specific mean value of 0 and a specific variance of 1. If X ~ N(μ, σ²) then Z = (X − μ)/σ ~ N(0, 1). This is an extremely useful

result and is one we will refer to again. We will encounter other distributions in this course,
like t and F distributions, and these are all built upon normal or standard normal distributions
in some way.
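To illustrate these probabilities and the standardisation result, the following sketch (an addition, with an arbitrary choice of μ = 30 and σ = 5) uses scipy to compute P(μ − kσ ≤ X ≤ μ + kσ) for k = 1, 2, 3 and shows that the standardised variable gives the same N(0, 1) probabilities.

```python
# Illustrative sketch: the 68/95/99.7 rule and standardisation Z = (X - mu)/sigma.
from scipy import stats

mu, sigma = 30.0, 5.0                 # arbitrary mean and standard deviation
X = stats.norm(loc=mu, scale=sigma)
Z = stats.norm(loc=0, scale=1)        # standard normal

for k in (1, 2, 3):
    p_x = X.cdf(mu + k * sigma) - X.cdf(mu - k * sigma)  # P(mu-k*sigma <= X <= mu+k*sigma)
    p_z = Z.cdf(k) - Z.cdf(-k)                           # same probability for Z
    print(k, round(p_x, 4), round(p_z, 4))               # ~0.6827, 0.9545, 0.9973
```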

1.5 Random Sampling


Suppose we are interested in calculating the mean of some process, X. We might know that
X has a certain distribution, say a normal, but we do not have enough information to know
the mean of the distribution. An example could be that we are interested in what mean
income is of people that live in England. We suspect that income is normally distributed but
without having the income values of everyone that lives in England, we cannot plot a
distribution to find the mean. So what do we do? In practice, this would involve some kind of
method of data collection and a method for using this data to estimate the mean of the
process. We call the dataset to be used the sample and the estimator used would be the
sample mean.

The basic idea is therefore that given we do not know the true value of the thing that we are
interested in (mean income of people in England in the above example) we need to find a way
of estimating it and we need this estimated value to be accurate and representative.
Therefore we need (i) the sample of data that we use to be a representative sample and (ii)
we need to use an estimator that has good properties. Both of these are essential if we are
to have any confidence in the estimate that we obtain. For example, suppose I collect data on
the income of a sample of people that live in England but I limit my sample to be people in
my family, and I happen to be the daughter of a millionaire. When I calculate mean income I
arrive at a value somewhere in the millions. Is this representative and accurate? It’s doubtful.
The poor data collection technique has led to an estimate that is very strongly upward biased.
I know this is an extreme example, but it highlights my point. Clearly the people we choose to
be in our sample are extremely important. Luckily we have agencies that do this kind of thing
for us and they collect data from people who live all over the country with all sorts of
backgrounds.

In technical statistical terms we want our sample to be a random sample. The set of data
observations that we collect for our variable X, denoted X₁, …, X_n, is a random sample of size
n if each observation Xᵢ has been drawn independently from the same distribution, so that
each Xᵢ has the same distribution. These are known as i.i.d. random variables, which means
independently and identically distributed. (Many statistical packages nowadays have a
random number generator function, which will generate random samples of data drawn from
particular distributions).
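As a small illustration of the point in brackets, the sketch below (not part of the original notes) uses numpy's random number generator to draw an i.i.d. sample of size n = 1000 from a N(25000, 5000²) distribution, loosely mimicking an income variable.

```python
# Illustrative sketch: drawing an i.i.d. random sample with numpy.
import numpy as np

rng = np.random.default_rng(seed=42)                   # seeded for reproducibility
n = 1000
sample = rng.normal(loc=25_000, scale=5_000, size=n)   # i.i.d. N(25000, 5000^2) draws

print(sample[:5])          # first few simulated "income" observations
print(sample.mean())       # should be close to 25,000
```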


1.6 Estimation
Even if we use a proper random sample, our estimate may still be inaccurate if the estimator
that we use has poor properties. We will look at what properties our estimators should
possess in a bit. Let’s first consider what we mean by an estimator. An estimator is simply an
equation that involves the values from the sample of data. When you plug the data into the
estimator equation you get a single value as your estimate. In our example of calculating mean
income, an estimator we could use is the sample mean. This is given by


X̄ = (1/n) Σ_{i=1}^{n} Xᵢ.

So X̄ is the sample mean estimator, whereby you add up the values in your sample, the Xᵢs, and divide by the number of observations, n, to arrive at a value for X̄. This is your estimated value. This is the standard way in which you calculate a mean or an average for a set of data (another measure of central tendency is the median).

If you were interested in estimating the variance of the variable, or whether the variable was
correlated with another variable, then you could use the following sample moment
estimators:

Sample variance: S_X² = (1/(n−1)) Σ_{i=1}^{n} (Xᵢ − X̄)² = (1/(n−1)) (Σ_{i=1}^{n} Xᵢ² − nX̄²)

Sample covariance: S_XY = (1/(n−1)) Σ_{i=1}^{n} (Xᵢ − X̄)(Yᵢ − Ȳ) = (1/(n−1)) (Σ_{i=1}^{n} XᵢYᵢ − nX̄Ȳ)

Sample correlation coefficient: r = S_XY / (S_X S_Y) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / √( Σ(Xᵢ − X̄)² Σ(Yᵢ − Ȳ)² )
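The sample moment formulae above translate directly into code. The sketch below (an added illustration using simulated data) computes the sample mean, variance, covariance and correlation; note the use of ddof=1 so that numpy divides by n − 1, matching the formulae.

```python
# Illustrative sketch: sample mean, variance, covariance and correlation.
import numpy as np

rng = np.random.default_rng(seed=0)
n = 500
x = rng.normal(loc=10, scale=2, size=n)        # simulated X sample
y = 3 + 0.5 * x + rng.normal(scale=1, size=n)  # simulated Y, positively related to X

x_bar = x.sum() / n                                       # sample mean
s2_x = ((x - x_bar) ** 2).sum() / (n - 1)                 # sample variance (divides by n-1)
s_xy = ((x - x_bar) * (y - y.mean())).sum() / (n - 1)     # sample covariance
r = s_xy / np.sqrt(s2_x * y.var(ddof=1))                  # sample correlation coefficient

print(x_bar, s2_x, s_xy, r)
print(np.corrcoef(x, y)[0, 1])                 # numpy's built-in check on r
```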

It is important to realise that, because estimators are functions of the random sample, for
which each observation Xᵢ has been drawn from the same distribution, the estimators
are themselves random variables and hence have distributions. We call these sampling
distributions. As we have been interested in the mean estimator, let’s continue with this and
consider what the sampling distribution of this estimator might be. Assume that the sample
Xᵢ has come from a normal distribution with mean μ and variance σ², i.e. Xᵢ ~ N(μ, σ²). What is the distribution of X̄? Well, it turns out that it is also normally distributed, with the same mean as the underlying observations but a variance of σ²/n, i.e. the variance is scaled down by the sample size n. Hence

X̄ ~ N(μ, σ²/n)


An even stronger result than this, called the central limit theorem, states that if the sample size n is large, i.e. we have a lot of observations in our sample, then X̄ ~ N(μ, σ²/n) approximately, even if the underlying random sample has been picked from a distribution that is not a normal distribution.
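A quick simulation (an added sketch, not from the notes) makes both results visible: repeatedly drawing samples of size n from a decidedly non-normal distribution (here an exponential, chosen arbitrarily) and computing X̄ each time gives a collection of sample means whose variance is close to σ²/n and whose histogram looks approximately normal.

```python
# Illustrative sketch: the sampling distribution of the sample mean and the CLT.
import numpy as np

rng = np.random.default_rng(seed=1)
n, reps = 50, 10_000           # sample size and number of simulated samples
scale = 2.0                    # exponential with mean 2 and variance 4 (non-normal)

samples = rng.exponential(scale=scale, size=(reps, n))
x_bars = samples.mean(axis=1)  # one sample mean per simulated sample

print(x_bars.mean())           # close to the true mean, 2.0
print(x_bars.var())            # close to sigma^2 / n = 4 / 50 = 0.08
```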

Having discussed what an estimator is and looked at some particular moment estimators, we
should consider what properties we desire from our estimators. The issue for the
econometrician/statistician is that they are trying to estimate a value for a parameter whose
true value is unknown to us (if it was known to us then why would you bother estimating it?!).
For example, there is some true value for the mean income of people that live in England, but
we do not know what it is, so we collect a sample of data and estimate it, using an estimator
like the sample mean X̄. This will provide us with a point estimate, i.e. a single value that
represents our best guess at what the mean income value is. How do we know that this value
is close to the true but unknown value? The answer is that we don't. But so long as we use a
proper random sample of data and an estimator that exhibits properties that make it likely
that our estimates are close, then we should be OK. These properties are important and we
will return to them on many occasions.

Properties of point estimators: Let θ̂ be an estimator of some parameter θ, e.g. X̄ is an estimator of μ. If θ̂ is to be considered a good estimator, and therefore an estimator that we can rely on to provide a close estimate to the true value of θ, it should satisfy the following properties:

i. Unbiasedness: θ̂ is an unbiased estimator of θ if E(θ̂) = θ. This means that the sampling distribution of the estimator θ̂ is centered on θ. Hence on average the estimator will yield the true value.

ii. Efficiency: Unbiasedness alone is not adequate to indicate a good estimator. Not only
do we want the sampling distribution of the estimator to be centered around the true
but unknown value θ, we want as much of the mass of the distribution to be around
this point, i.e. the variance must be small. This means that the estimator is more precise
and is hence more likely to estimate the true value. If, amongst all unbiased estimators
of θ, the estimator θ̂ has the smallest variance, it is said to be the most efficient or
the best and is called a "Best Unbiased Estimator". The diagram shows how a smaller
variance can make an estimator more precise.


[Figure: the sampling distributions of two unbiased estimators, θ̂ and θ̃, both centered on θ; the distribution of θ̃ is more spread out.]

Remember that the area under the curve represents the probability that the value of the estimator lies in a certain range. Notice that both estimators are unbiased (their distributions are centered around θ) but the variance of θ̃ is greater because it is more spread out. It is therefore more likely to estimate a value far away from the true value θ when the variance of the estimator is higher. In this case, θ̂ is more efficient and a better estimator than θ̃.

iii. Consistency: This is a large sample, or asymptotic, property. It means that as the size of the sample gets bigger, i.e. as n → ∞, the variance of the estimator tends to 0 (the estimator becomes more accurate) until the density collapses to a single spike at θ. If an estimator is consistent it therefore means that if we could increase the sample size indefinitely, we would estimate the true value. Although it is not possible to have an infinite sample size, some estimators actually have variances that decrease very quickly when the sample size increases, so that the sample does not have to be too large to obtain an accurate estimate. Of course, an estimator without this property is no good, as this means that even if we had all the information available, the estimator still could not estimate the correct value.

It is easy to see, given we know that the variance of X̄ is σ²/n, that as the sample size gets larger the variance of this estimator gets smaller, and the variance tends to 0 as n → ∞. We can therefore say that the sample mean is a consistent estimator of the true mean μ.

1.7 Hypothesis Testing


Once we have estimated values for the parameter of interest, in the above case the mean,
we can test hypotheses about the parameter. For example, we may have had a pre-conceived
expectation that mean income of people living in England was £25k per year. This is a testable
hypothesis. If the true value for mean income μ really is £25k, our estimated value, obtained after collecting a random sample of data and using the estimator X̄, is very unlikely to actually
come out as £25k exactly. Hypothesis testing allows us to distinguish between


(i) a value for X̄ that is different from £25k because the actual value of μ is different from £25k, and
(ii) a value for X̄ that is different from £25k even though the actual value of μ is £25k.

Essentially we are asking whether our estimated value is “sufficiently” far away from our
hypothesised value to suggest that the hypothesis is incorrect.

Suppose that we wish to test a hypothesis concerning a population parameter θ, and we specifically want to test whether its value is equal to θ*. This is called the null hypothesis, denoted H₀, and is written

H₀: θ = θ*

In our example, this would be H₀: μ = £25k, i.e. we hypothesise that the true mean is £25k. The sample evidence will either reject or not reject this null, against an alternative hypothesis, denoted H₁. We shall consider three types:

H₁: θ ≠ θ*

H₁: θ > θ*

H₁: θ < θ*

The first is a two-sided hypothesis (because we consider either side of the hypothesised value, i.e. values above and below θ*), and the others are one-sided hypotheses (either above or below). To test the null against one of these alternatives we use sample data to develop a test statistic and we use decision rules to help us decide whether the sample evidence supports or rejects the null.

To show how this works we will use the sample mean example. First let's consider how we can develop a test statistic. We know that X̄ ~ N(μ, σ²/n). Let's suppose that we want to test the hypothesis H₀: μ = £25k, i.e. that the true mean income value is £25k. Under the null hypothesis, meaning when the null is correct, the distribution of the estimator X̄ is centered on £25k, i.e. X̄ ~ N(£25k, σ²/n), and the distribution will look as follows:


[Figure: the density of X̄ centered on £25,000; the small area A in the upper tail is the probability that X̄ exceeds the value X*.]

From this picture you can see that if the true value of mean income is £25k, then the probability that our estimator, X̄, will produce a value above the amount X* is very small, i.e. area A. Interpreting this, we can say that if our estimated value for X̄ is greater than X*, then this is evidence against the null hypothesis that mean income is £25k, because it is so unlikely that you would estimate such a high value for X̄ if the null were true. This is the basis of hypothesis testing – does the sample evidence suggest that the null is appropriate?

The problem with the above approach is that we don't actually know the position of the distribution because we do not know the variance of X̄ (we do not know σ²). We need to adjust things a bit. If we turn the distribution of X̄ into a standard normal by subtracting the mean and dividing by the standard deviation, we get

Z = (X̄ − μ)/√(σ²/n) ~ N(0, 1)

If we replace σ² by its sample estimator, given above by S², then the standard normal distribution becomes a t distribution (we will not prove that this is the case in this module). Hence

t = (X̄ − μ)/√(S²/n) ~ t_{n−1}

I have called the statistic t because it has a t distribution. A t distribution looks very similar to
a normal distribution, but it is always centred on 0 and has fatter tails;

[Figure: a t distribution centred on 0 with critical values −t_c and t_c marked; the area in each tail beyond these points is α/2 and the area between them is 1 − α.]


In the above diagram, the area under the distribution, which equals 1 in total, has been divided up in such a way that the area above t_c and below −t_c equals α in total and the area in between is 1 − α.

If the null hypothesis is true, i.e. that μ = 25,000, then the t statistic below has the distribution:

t = (X̄ − 25,000)/√(S²/n) ~ t_{n−1}

Any value of X̄ that produces a value for this t statistic that is in the tails of this distribution is treated as evidence against the null hypothesis (such a t statistic is so unlikely to come about if the null were true that, if we do get one, it must mean that the null is not true).

So how do you perform such a test in practice? Let's assume that the alternative hypothesis is H₁: μ ≠ 25,000 (a two-sided alternative).

i. In the example t statistic above, you will see that on replacing X̄ and S² with the values obtained for these estimates, and plugging in the value of the sample size n, you get a value for the t statistic.
ii. If you look at the distribution diagram above, you will see that we need to choose a value for α, which dictates how the distribution is divided. The term α is called the significance level and usually we set this at 5%, i.e. α = 0.05, and this then dictates the values of t_c and −t_c (these are called the critical values and are obtained from a t distribution table) as t_{n−1, 0.025} and −t_{n−1, 0.025}. This means that, if the null is true, there is only a 5% chance of getting a value of the t statistic above t_{n−1, 0.025} or below −t_{n−1, 0.025}. Hence, if we do get a value for our t statistic that is above t_{n−1, 0.025} or below −t_{n−1, 0.025} (i.e. in the tails of the distribution), then this is treated as evidence against the null.
iii. Now that you have a t statistic and a critical value, you compare the two values and then apply the decision rules. If |t| > t_{n−1, 0.025}, i.e. your t statistic is in the tails of the distribution, then you reject the null hypothesis; otherwise you do not reject the null hypothesis.

If the alternative hypothesis was H₁: μ > 25,000 then you would find the critical value for which there is 5% in the upper tail only (rather than 2.5% in each of the upper and lower tails as above), and the decision rule is that if t > t_{n−1, 0.05} then you reject the null, otherwise you do not reject. Conversely, if the alternative was H₁: μ < 25,000 then you would find the critical value for which there is 5% in the lower tail only, and the decision rule is that if t < −t_{n−1, 0.05} then you reject the null, otherwise you do not reject. Whether you do two-sided or one-sided tests is
often dictated by the economic theory that you are testing.
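The testing recipe above can be carried out in a few lines. The sketch below (an illustration on simulated data, not part of the original notes) computes the t statistic for H₀: μ = 25,000 by hand and compares it with the two-sided 5% critical value from the t distribution; scipy's ttest_1samp is shown as a cross-check.

```python
# Illustrative sketch: a one-sample t test of H0: mu = 25,000.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
n = 200
income = rng.normal(loc=26_000, scale=8_000, size=n)   # simulated income sample

x_bar = income.mean()
s2 = income.var(ddof=1)                                # sample variance S^2
t_stat = (x_bar - 25_000) / np.sqrt(s2 / n)            # t = (X-bar - mu0) / sqrt(S^2 / n)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)          # two-sided 5% critical value

print(t_stat, t_crit, abs(t_stat) > t_crit)            # reject H0 if |t| > t_crit
print(stats.ttest_1samp(income, popmean=25_000))       # scipy's cross-check
```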

Some important concepts:

A Type I error is the error of rejecting a hypothesis when it is true. It is possible that a true
null could be rejected by mistake 100α% of the time, e.g. 5% of the time when α = 0.05.

A Type II error is the error of not rejecting a false hypothesis. We will denote the probability
of making such an error as β.

These errors are unavoidable and are an intrinsic part of hypothesis testing. Of course we
would like to minimise the chances of making either error, but it is not possible to minimise
them both simultaneously. Statisticians and econometricians usually take the approach that
making a Type I error is worse than making a Type II error. Hence they try to reduce the
probability of making a Type I error by keeping the level of significance α at a low value,
usually 0.01 (or 1%), 0.05 (5%) or 0.1 (10%). Hence if we perform a test at the 5% level of
significance, we are effectively stating that we are prepared to accept at most a 5% probability
of committing a Type I error.

Other useful terms:

The size of a test refers to the probability of making a Type I error, i.e. α. So the size of a test
is the probability of rejecting a true hypothesis. The power of a test refers to the probability
of not committing a Type II error. This is therefore the probability of rejecting a false
hypothesis, i.e. doing the right thing.

1.8 Confidence Bands


Something we haven’t covered yet is confidence intervals. Now that we know more about
hypothesis testing and estimation, we can more easily discuss this idea. Confidence bands
provide an interval of values as an estimate of a parameter within which we have a certain
level of confidence that the true parameter value lies. So if we set up a 95% confidence
interval then we have 95% confidence that the true value lies somewhere in that interval.

As an example, we can find a confidence interval for the mean μ of a variable X. We know that the sample mean X̄ is a point estimator of μ and that a t statistic related to this estimator (shown in Section 1.7) is t = (X̄ − μ)/√(S²/n) ~ t_{n−1}. To define a 100(1 − α)% confidence interval we need to find the cut-off points t_c such that P(−t_c ≤ t ≤ t_c) = 1 − α (see the t distribution diagram in Section 1.7). Substituting the t statistic from above we get

P( −t_c ≤ (X̄ − μ)/√(S²/n) ≤ t_c ) = 1 − α

and with a bit of re-arranging

P( X̄ − t_c √(S²/n) ≤ μ ≤ X̄ + t_c √(S²/n) ) = 1 − α.
The interpretation is that there is a 100(1 − α)% (95% if α = 0.05) probability that the random interval ( X̄ − t_c √(S²/n), X̄ + t_c √(S²/n) ) contains μ. Suppose we have a sample of observations of size 100, the sample mean is 5.13 and the sample variance is 1.76. The 95% confidence interval for μ is therefore

5.13 − t_c √(1.76/100) ≤ μ ≤ 5.13 + t_c √(1.76/100)

The value for t_c is the value of t_{99, 0.025} from a t distribution table and is found to be about 1.984. Therefore there is a 95% probability that the interval (4.867, 5.393) contains μ. So with a point estimate you get a single value as an estimate, but with a confidence band you get a range of values in which you have a certain degree of confidence that the true value lies.
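The numerical example above can be reproduced directly. The sketch below (an added illustration) plugs n = 100, X̄ = 5.13 and S² = 1.76 into the confidence-interval formula, using scipy for the t critical value.

```python
# Illustrative sketch: reproducing the 95% confidence interval (4.867, 5.393).
import numpy as np
from scipy import stats

n, x_bar, s2 = 100, 5.13, 1.76
alpha = 0.05

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)   # about 1.984 for df = 99
half_width = t_crit * np.sqrt(s2 / n)           # t_c * sqrt(S^2 / n)

lower, upper = x_bar - half_width, x_bar + half_width
print(round(t_crit, 3), (round(lower, 3), round(upper, 3)))  # ~ (4.867, 5.393)
```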

What if we were only interested in having 90% confidence? The interval would then be (4.910,
5.350) because the t critical value has decreased to 1.660. You will see that the interval has
shrunk, which helps to say a bit more about the true value of μ. But we have less confidence in that interval. It would be nice to be able to say that we are 99.99% confident that μ lies in the interval 5.125 to 5.135. This would be incredibly accurate, with a high level of confidence in a very tight range. But this is impossible because there is a trade-off between the size of the interval and confidence in that interval. The more confidence in the range we have, the larger is the interval. The extreme example is to look at the interval in which we have 100% confidence. The interval here would be (−∞, ∞) and so we would say that we are 100% confident that the true value for μ lies between −∞ and ∞, which doesn't help at all. Of course we are
100% sure of this. So, in order to get a reasonably sized interval, we are forced to lose some
confidence in this interval.

Once we have calculated our confidence band, given there is such a high probability that the
true value for μ lies within it, any hypothesised value outside this interval must be rejected. So if we had hypothesised μ = 5.5, we can reject this based on the fact that there is only a 5% chance that a value for μ would be outside the range (4.867, 5.393).


APPENDIX

Properties of the Expected Value operator

If a, b are constants:

• E(a) = a
• E(aX) = aE(X)
• E(aX + b) = E(aX) + E(b) = aE(X) + b
• E(X + Y) = E(X) + E(Y)
• In general E(XY) ≠ E(X)E(Y) unless X and Y are independent, in which case E(XY) = E(X)E(Y)

Properties of Variance

• var(b) = 0 because b is a constant and therefore its value does not change

  Proof: var(b) = E[(b − E(b))²] = E[(b − b)²] = E(0) = 0

• var(X + b) = var(X)
• var(aX) = a² var(X)
• var(aX + b) = a² var(X)

  var(X + Y) = E[(X + Y − E(X) − E(Y))²]
             = E[(X + Y − μ_X − μ_Y)²]
             = E[(X − μ_X)² + (Y − μ_Y)² + 2(X − μ_X)(Y − μ_Y)]
             = var(X) + var(Y) + 2 cov(X, Y)

• var(X − Y) = var(X) + var(Y) − 2 cov(X, Y)
• var(aX + bY) = a² var(X) + b² var(Y) + 2ab cov(X, Y)
• cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]
            = E[XY − Xμ_Y − Yμ_X + μ_Y μ_X]
            = E(XY) − μ_Y E(X) − μ_X E(Y) + μ_Y μ_X
            = E(XY) − μ_Y μ_X − μ_X μ_Y + μ_Y μ_X
            = E(XY) − μ_Y μ_X


If X and Y are independent, i.e. E(XY) = E(X)E(Y) = μ_X μ_Y, then cov(X, Y) = 0

• If X and Y are independent, so that cov(X, Y) = 0, then var(aX + bY) = a² var(X) + b² var(Y)

Properties of Covariance

• cov(a + bX, c + dY) = E[(a + bX − E(a + bX))(c + dY − E(c + dY))]
                      = E[(a + bX − a − bE(X))(c + dY − c − dE(Y))]
                      = E[(bX − bμ_X)(dY − dμ_Y)]
                      = E[b(X − μ_X) d(Y − μ_Y)]
                      = bd E[(X − μ_X)(Y − μ_Y)]
                      = bd cov(X, Y)
• cov(X, X) = E[(X − μ_X)(X − μ_X)] = var(X)


2. Bivariate Linear Regression

2.1 Discussion of the model


You were introduced to the terms regression model (sometimes called an econometric
model) and regression analysis in the first lecture. We looked at these concepts in the context
of the Keynesian consumption function, i.e. a regression model that shows how income
influences consumption. In this section we will look in more detail at regression analysis and
consider one particular method that econometricians often use for estimating the unknown
parameters in these models. This will involve the concepts of estimation and hypothesis
testing.

For the moment we will be concerned only with bivariate econometric models, which means
that we are analysing simple models that involve only two variables. We want to reveal more
about the relationship between the two economic variables. The simple consumption function
just mentioned fits into this category because the model comprises just two variables, income
and consumption. We make use of economic theory to indicate the causation in the
relationship, i.e. is it changes in the value of X that cause changes in the value of Y or vice
versa. For notational purposes we usually indicate the explanatory variable (also called the
regressor or exogenous variable) by X and the dependent variable (also called the regressand
or endogenous variable) by Y so that the causation runs from X to Y, hence we believe that
changes in economic variable X cause movements in economic variable Y. We are also only
interested in linear econometric models, which here implies that the equations represent
straight line relationships between the variables X and Y. Linearity in regression analysis is
actually broader than this and we’ll discuss this shortly.

Let’s look again at the simple consumption function, at the microeconomic level, such that we
consider the relationship between individuals’ consumption and income behaviour. Clearly,
the more an individual earns, the more they can spend. So, the relationship is a positive one
and might be represented with the following diagram:
[Figure: an upward-sloping line with consumption Y on the vertical axis and income X on the horizontal axis.]


An economic theory suggesting that the variation in some economic variable Y (in the above
example this is consumption) depends linearly upon the variability in another economic
variable X (income in this example), can be written as

Y = β₁ + β₂X    (1)

This is just the equation of a straight line where β₁ is the intercept on the Y axis (i.e. the value of Y when X is equal to 0) and β₂ is the slope of the line, which represents how much variable
Y changes when the values of variable X change. This equation should attempt to mimic the
behaviour of the economic system that it is representing. But we know that the economy
would not move in such an exact way as this. There might be unobservable factors that
influence Y that we simply cannot include in the equation and hence the line simply mimics
the general relationship between the variables. For example, suppose β₁ = 2 and β₂ = 0.75, such that Y = 2 + 0.75X. If X = 17 then the equation suggests that Y = 2 + 0.75(17) = 14.75, but obviously not everyone who earns £17,000 will spend £14,750.

So, to add some realism to the equation we add what we call an error term or random
disturbance term, often denoted as ε, to the equation, i.e.

Y = β₁ + β₂X + ε.    (2)

Equation (2) is what we would call a regression model. In our consumption function example,
the error term takes account of the fact that randomness in human behaviour prevents all
income and consumption values from the population from sitting exactly along the line. They
may be above it or below it and hence ε may be positive or negative in value to reflect this. The error also accounts for any factors that influence Y but are not included in the equation. These may be factors that we cannot easily measure or observe.

We are very interested in what ε "looks like" and we will concentrate on this later in this
section. The best way to deal with the error term is to treat it as a random variable, after all
it does account for random behaviour that cannot be quantified or easily modelled. The
properties of this random error are crucially important as we will see later.

The econometrician is interested in the values of the parameters β₁ and β₂, particularly β₂, as


this directly shows how X affects Y. If we knew their numerical values then we would know
the position of the true regression line. However, we do not know these parameter values and therefore we do not know the position of the true regression line. So what is the way forward
here?

Well, we collect data on the variables of interest, X and Y i.e. income and consumption in our
example, and we estimate values for β₁ and β₂. The plot below shows a possible sample of
data on the income and consumption values of different individuals from the population:

Each point represents the income and consumption values of an individual in the sample. You
can see the general upward trend in the relationship between income and consumption,
indicating that people with higher incomes also tend to have higher consumption values. This
is the positive relationship showing up in the data. So, now we have the data, we also need a
statistical technique that will allow us to estimate values for β₁ and β₂. We'll next consider
ways in which this might be achieved.

Going back to the issue of linear models, a model is defined as linear if it is linear in the
parameters. This means that the model’s parameters do not appear as exponents or products
of other parameters. If the model contains squares or products of variables, we would still
refer to this as a linear regression if it is still linear in parameters. So as examples

Y = β₁ + β₂X + β₃X² + ε

is linear as it is linear in parameters. On the other hand

Y = β₁ + β₂X₁ + β₃X₂ + β₂β₃X₃ + ε


is not linear in parameters and so is referred to as a non-linear regression. Non-linear


regression techniques would be required here, but we do not cover these methods in this
module.

2.2 Fitting the “best” line


As we said above, equation (2) can be thought of as the “true” model that represents the
actual relationship between X and Y. There exists a value for each of the parameters β₁ and β₂ that we could therefore call the "true" parameter values. The problem is we do not know what these values are. Hence we do not know the position of the true regression line (if we don't know β₁ and β₂ then we don't know where the line crosses the Y axis or what the slope
of the line is).

To glean any information about this economic relationship we have to use some kind of
estimation technique to estimate the values of β₁ and β₂. We will denote these estimated values as β̂₁ and β̂₂. So, β₁ and β₂ are the true parameters whose values are unknown and β̂₁ and β̂₂ are estimates of these parameters. The main issue for the econometrician is how to get these estimates. We know already that to do estimation we need data and we need a statistical procedure. Combining the two will produce estimated values for the parameters β₁ and β₂. An important question is whether the estimates we obtain are accurate. This relies
on us using good quality data and an appropriate estimation method. There are issues such as
the fact that different model specifications may require different techniques, or it may be that
the data we collect is not exactly in the form that the model specifies and this may affect the
properties of the model and hence suggest a certain type of estimation procedure. But for
now we will assume that the data are i.i.d. and we will consider the situation where it is
appropriate to apply the most basic of estimation techniques.

Assume that we have a set of data that represents the income and consumption of a sample
of people, shown in the earlier plot. We want to find the line that “best” fits through this plot
of data. Once we have found this "best" line then we have found our estimates of β₁ and β₂, i.e. β̂₁ and β̂₂, where β̂₁ is the value where the line crosses the Y axis and β̂₂ is the slope of
the line.

Now that we are relating the regression model (2) to the sample of data observations on
income and consumption, we can show that for each individual i in the sample:


Yᵢ = β₁ + β₂Xᵢ + εᵢ    (3)

where i = 1, …, n. The subscript i indicates that we are looking at data on individuals and there
are n individuals in the data set. We say that the sample is of size n. In our example this is 28
as there are 28 data points in the plot. If we replace the parameters with their estimates (we’ll
discuss shortly how to get these estimates), we get

Yᵢ = β̂₁ + β̂₂Xᵢ + ε̂ᵢ.

This is the estimated regression line and the term ε̂ᵢ is called the residual, which is essentially the estimated version of the error term εᵢ. The residual gives us the distance that each data
point in the sample lies away from the estimated regression line. Using Excel, we find that the
best fitting line for our sample of data is

As you can see, some individuals are above the line, some below, some close to it, some not so close. But it is not possible to fit a single straight line through all the points. You can see on the graph that the estimates are β̂₁ = 0.7841 and β̂₂ = 0.7765.

The important question for the moment is how to actually find the best fitting line, i.e. how
did Excel come up with the line in the diagram above. We need a mathematical/statistical
criterion with which to do this. Obviously we want the residuals to be as small as possible, i.e.
we want observations to be as close to the line as possible. The example lines below are pretty
bad at fulfilling such a criterion and clearly do not represent the general relationship between
the two variables X and Y. For one of the lines all of the residuals would be negative because
all of the points lie below it. The other line has a negative slope.


So, what can we do? What if we consider adding up all of the residual values and choose the
line that gives the lowest absolute summed value, i.e. find the values of β̂₁ and β̂₂ that give the value of S that is closest to 0, where

S = Σ_{i=1}^{28} ε̂ᵢ.

The problem with this criterion is that a line like the downward sloping one above is likely to
produce the lowest value for S. This is because all of the positive and negative residuals would
cancel each other out and the sum would be close to 0. We clearly do not want to use a
criterion that chooses a downward sloping line for a set of data that is clearly trending
upwards.

The criterion that works the best and which is used most often by econometricians is to
minimise the sum of squared residuals, i.e.

S = Σ_{i=1}^{28} ε̂ᵢ².

That way, negative residuals, when squared, would become positive and would no longer
cancel out with the squares of the positive residuals when summed together. The process that
finds estimates based on this criterion is called Ordinary Least Squares estimation or OLS for
short. This is an extremely common estimation technique for econometricians and is often the
basis for other techniques, when OLS itself is not appropriate (we will discuss some cases of
this later in the course). Let’s look at OLS in more detail.


2.2.1 Ordinary Least Squares Estimation


As we said, OLS is the procedure that minimises the sum of squared residuals S = Σ_{i=1}^{n} ε̂ᵢ², i.e. we need to

min S = min Σ_{i=1}^{n} ε̂ᵢ² = min Σ_{i=1}^{n} (Yᵢ − β̂₁ − β̂₂Xᵢ)².

We know that to find the maximum or minimum of a function we need to differentiate and
set to 0. This gives what we call the first-order conditions. So

∂S/∂β̂₁ = −2 Σ_{i=1}^{n} (Yᵢ − β̂₁ − β̂₂Xᵢ) = 0    (4)

∂S/∂β̂₂ = −2 Σ_{i=1}^{n} Xᵢ(Yᵢ − β̂₁ − β̂₂Xᵢ) = 0    (5)

On solving these equations we find that

β̂₁ = Ȳ − β̂₂X̄    (6)

β̂₂ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)² = (ΣXᵢYᵢ − nX̄Ȳ) / (ΣXᵢ² − nX̄²)    (7)

These are the OLS estimators of β₁ and β₂ and as you can see they are simply equations that
involve our data, X and Y. We can now see how we combine data with a statistical technique
to get estimates of the unknown parameters in the regression model. We have our data on
variables X and Y. We have the statistical technique of OLS and this gives us the formulae with
which we can use the data in order to get our estimates, i.e. input the values of our dataset,
X and Y , into our estimator equations (6) and (7), and out pop two numbers, one an estimate
of 1 , the other an estimate of 2 . In the example in the plots above, Excel used OLS to
calculate the values ˆ = 0.7841 and ˆ = 0.7765 . The 28 data points on consumption and
1 2

income (the Y and X variables) were plugged into equations (6) and (7) to produce these
estimates.
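
To make the mechanics concrete, here is a minimal Python sketch of equations (6) and (7). The income and consumption figures below are made up purely for illustration; they are not the 28 observations used in the lecture example.

```python
import numpy as np

# Hypothetical income (X) and consumption (Y) figures -- illustrative only
X = np.array([10.0, 14.0, 18.0, 22.0, 26.0, 30.0])
Y = np.array([ 9.0, 11.5, 14.5, 17.0, 21.0, 23.5])

# Equation (7): slope estimator
beta2_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
# Equation (6): intercept estimator
beta1_hat = Y.mean() - beta2_hat * X.mean()

print(f"intercept estimate: {beta1_hat:.4f}")
print(f"slope estimate:     {beta2_hat:.4f}")
```

Any OLS routine (Excel's trendline, Eviews, or a statistics library) returns the same two numbers, since they all minimise the same sum of squared residuals.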

So this has introduced the concept of estimation and the technique of OLS. We now know one
method for obtaining estimates of the parameters in a regression model. With these
estimated values we can make statements about the economic relationship under
investigation. In the example above we can state that the MPC is 0.7765. What does this
mean? It means that as income goes up by £1, our consumption goes up by 77p. We have


managed therefore to provide something quantitative to the economic theory posited by the
regression model.

The next question is, how do we know that this estimate is an accurate measure of the real but unknown MPC, denoted as $\beta_2$? We don’t know $\beta_2$, so we cannot say whether our estimate of it, 77p, is close to or far away from it. Much of the answer to this question is based on the
quality of the estimator we have used, in this case OLS. We need to know whether the OLS
estimator has “good properties” and we need to know what conditions have to be satisfied in
order for our estimator to have these good properties.

2.2.2 Properties of OLS estimators


The question we need to ask is, how do we know that the values that we obtain for $\hat{\beta}_1$ and $\hat{\beta}_2$ are good estimates of the unknown parameters $\beta_1$ and $\beta_2$? We asked the same question
in Section 1 when we looked at the sample mean. The answer here is exactly the same. So
long as we use estimators that possess good properties (and we use appropriate data) then
we should have some faith that our estimates are close to the true unknown values.

The properties that we are interested in are, again, unbiasedness and efficiency. Remember
that these properties are about the mean and variance of the sampling distribution of our
estimators. Unbiasedness is about where the centre of the distribution is and efficiency relates
to the variance of the estimator. But as yet we haven’t mentioned anything about our
estimators, ˆ1 and ̂ 2 , having sampling distributions. So it’s worth making the following point:

NOTE: From equation (3) we can see that the dependent variable is a function of the random
disturbance and therefore Y can be treated itself as a random variable. Further, from
equations (6) and (7) we can see that the OLS estimators are functions of this dependent
variable. This means that our estimators are also random variables (akin to the sample mean
being random in the Section 1 notes) and hence our estimators have sampling distributions.

So, are we able to say that OLS has these good properties? The answer is that yes, OLS does
have these good properties, but ONLY if certain conditions are satisfied. We call these the
classical linear regression assumptions, and they are stated below. Some of these conditions
are quite strong, meaning that they may be hard to satisfy in some regression models or for
some economic theories, in which case OLS may no longer have the good properties that we
desire. But what we can say is that IF these conditions are satisfied then OLS is the best method


of estimation that we can use, i.e. it will provide the most accurate estimate possible. So what
are these assumptions?
Classical Linear Regression Assumptions

• CLRM1: The regression is linear in parameters and correctly specified.


• CLRM2: The regressor is assumed fixed, or non-stochastic (non-random), in the sense that
its values are fixed in repeated sampling. This is a strong assumption and implies that we
can choose values of X in order to observe the effects on Y, so is more useful in
experimental settings. If we relax this assumption a little and allow the regressor to be
stochastic, then we assume that the regressor is independent of the error term, i.e.
$E(X_i\varepsilon_i) = E(X_i)E(\varepsilon_i) \Rightarrow \operatorname{corr}(X_i, \varepsilon_i) = 0$.
• CLRM3: $E(\varepsilon_i) = 0$ for all $i$. This means that the errors have zero mean.
• CLRM4: $\operatorname{var}(\varepsilon_i) = \sigma^2$ for all $i$. This means that the variance of the error is constant for all observations and hence the error is said to be homoskedastic.
• CLRM5: $\operatorname{cov}(\varepsilon_i, \varepsilon_j) = E(\varepsilon_i\varepsilon_j) = 0$ for all $i \neq j$. This means that there is no correlation between two error terms and hence the errors are said to be not autocorrelated.
• CLRM6: $\varepsilon_i \sim N(0, \sigma^2)$. Each error has the same normal distribution with the same mean and variance. We make this distributional assumption to enable us to derive the distributions of the OLS estimators and hence allow us to perform hypothesis tests. It is not actually required for unbiasedness or efficiency.

The OLS estimators have properties that are established in the Gauss-Markov Theorem, which
states:

Given the assumptions of the classical linear regression model, amongst all linear unbiased
estimators, the OLS estimators have the minimum variance, i.e. they are Best Linear Unbiased
Estimators or B.L.U.E.

Hence the OLS estimators $\hat{\beta}_1$ and $\hat{\beta}_2$:

• are unbiased estimators, meaning that $E(\hat{\beta}_1) = \beta_1$ and $E(\hat{\beta}_2) = \beta_2$, i.e. the sampling distributions of the estimators are centred around their true but unknown values;
• are efficient, i.e. they have the smallest variance when compared to all other linear unbiased estimators. Section 1 contains more detail on what efficiency means.
Altogether this means that the OLS estimators will more accurately estimate $\beta_1$ and $\beta_2$ than
any other linear unbiased estimator under the assumptions above. As a reminder, the classical
assumptions in many circumstances may be quite inappropriate and this has repercussions on


the properties of OLS. Later in the course we will examine the relevance of some of these
assumptions and look at how the OLS estimators are affected when the assumptions are
violated.

Proof of unbiasedness of the slope estimator: $\hat{\beta}_2 = \frac{\sum\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sum\left(X_i - \bar{X}\right)^2}$.

The numerator can be written as $\sum\left(X_i - \bar{X}\right)Y_i - \bar{Y}\sum\left(X_i - \bar{X}\right)$, the latter term of which is 0 since $\sum\left(X_i - \bar{X}\right) = \sum X_i - n\bar{X} = n\bar{X} - n\bar{X} = 0$. Therefore

$$\hat{\beta}_2 = \frac{\sum\left(X_i - \bar{X}\right)Y_i}{\sum\left(X_i - \bar{X}\right)^2} = \sum w_iY_i \quad \text{where} \quad w_i = \frac{X_i - \bar{X}}{\sum\left(X_i - \bar{X}\right)^2} .$$

Some properties of $w_i$ are that $\sum w_i = 0$ and $\sum w_iX_i = 1$. Therefore

$$\hat{\beta}_2 = \sum w_i\left(\beta_1 + \beta_2 X_i + \varepsilon_i\right) = \beta_1\sum w_i + \beta_2\sum w_iX_i + \sum w_i\varepsilon_i = \beta_2 + \sum w_i\varepsilon_i .$$

Hence when we take the expectation

$$E\left(\hat{\beta}_2\right) = \beta_2 + \sum w_iE\left(\varepsilon_i\right) = \beta_2$$

given the CLRM2 and 3 assumptions. Hence we have unbiasedness. Unbiasedness of the intercept term can also be established using a similar proof.
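
Because unbiasedness is a repeated-sampling property, it can be illustrated by simulation. The Python sketch below uses arbitrarily chosen true values $\beta_1 = 1$, $\beta_2 = 0.8$ and normal errors (all purely illustrative), draws many samples, applies the OLS slope formula to each, and checks that the estimates average out close to the true slope.

```python
import numpy as np

rng = np.random.default_rng(0)
beta1, beta2, sigma, n, reps = 1.0, 0.8, 1.0, 50, 5000  # illustrative values
X = np.linspace(1, 10, n)                               # fixed regressor (CLRM2)

slope_estimates = []
for _ in range(reps):
    eps = rng.normal(0.0, sigma, n)      # errors satisfying CLRM3-CLRM6
    Y = beta1 + beta2 * X + eps
    b2 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    slope_estimates.append(b2)

# The average of the slope estimates should be close to the true beta2 = 0.8
print(np.mean(slope_estimates))
```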

2.2.3 Sampling distributions of the OLS estimators


We need to know the nature of the distributions of our OLS estimators so that we can perform
hypothesis tests on them. We made the assumption above that the errors are normally
distributed. An important distributional result is that a linear transform of a normally
distributed variable is also normally distributed. Given that Y is a linear function of the errors,
it is normally distributed and given further that the OLS estimators are linear functions of Y,
then the OLS estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ are also normally distributed.

We now need to establish the mean and variance of our OLS estimators. By the Gauss-Markov theorem, OLS is unbiased. This implies that the means of their sampling distributions are $\beta_1$ and $\beta_2$ respectively, because unbiasedness implies that $E(\hat{\beta}_1) = \beta_1$ and $E(\hat{\beta}_2) = \beta_2$. Their variances are given by the following equations

$$\sigma_{\beta_1}^2 = \operatorname{var}\left(\hat{\beta}_1\right) = \frac{\sigma^2\sum X_i^2}{n\sum\left(X_i - \bar{X}\right)^2} \qquad (6*)$$

$$\sigma_{\beta_2}^2 = \operatorname{var}\left(\hat{\beta}_2\right) = \frac{\sigma^2}{\sum\left(X_i - \bar{X}\right)^2} \qquad (7*)$$

Hence $\hat{\beta}_1 \sim N\left(\beta_1, \sigma_{\beta_1}^2\right)$ and $\hat{\beta}_2 \sim N\left(\beta_2, \sigma_{\beta_2}^2\right)$.

Derivation of the variance of the slope estimator:

$$\operatorname{var}\left(\hat{\beta}_2\right) = \operatorname{var}\left(\beta_2 + \sum w_i\varepsilon_i\right) = \operatorname{var}\left(\sum w_i\varepsilon_i\right) = \sum w_i^2\operatorname{var}\left(\varepsilon_i\right)$$

This last term comes about because of CLRM5 and if we further impose CLRM4 we get

$$\operatorname{var}\left(\hat{\beta}_2\right) = \sigma^2\sum w_i^2$$

and because $\sum w_i^2 = \frac{1}{\sum\left(X_i - \bar{X}\right)^2}$ then $\operatorname{var}\left(\hat{\beta}_2\right) = \frac{\sigma^2}{\sum\left(X_i - \bar{X}\right)^2}$.
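
In practice $\sigma^2$ is unknown and is replaced by the estimator $\hat{\sigma}^2$ introduced formally in Section 2.4.1 below (the residual sum of squares divided by $n-2$ in the bivariate model). A minimal Python sketch of the standard-error calculation, again on made-up data:

```python
import numpy as np

# Illustrative data only
X = np.array([10.0, 14.0, 18.0, 22.0, 26.0, 30.0])
Y = np.array([ 9.0, 11.5, 14.5, 17.0, 21.0, 23.5])
n, k = len(Y), 2

b2 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b1 = Y.mean() - b2 * X.mean()
resid = Y - b1 - b2 * X

sigma2_hat = np.sum(resid ** 2) / (n - k)          # estimate of the error variance
var_b2 = sigma2_hat / np.sum((X - X.mean()) ** 2)  # equation (7*) with sigma2 replaced
var_b1 = sigma2_hat * np.sum(X ** 2) / (n * np.sum((X - X.mean()) ** 2))  # equation (6*)

print(np.sqrt(var_b1), np.sqrt(var_b2))            # standard errors of the estimates
```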

We have now discussed the methodology behind OLS, we have derived the equations for the
OLS estimators (for a bivariate model), we have established that OLS has good properties
under certain assumptions and we have found that OLS estimators are normally distributed.
Now we are in a position to move on to other important statistics we should consider when
doing regression analysis.

2.3 Coefficient of determination


It is important to note that although the method of OLS estimation finds the best fitting line
through a scatter of points, it does not mean that it finds a good fitting line. To measure the "goodness of fit", i.e. how well the fitted regression line fits through the scatter of observations, we often use the coefficient of determination. It has the notation $R^2$, although in some texts it is written as $r^2$ in the context of bivariate models. We can derive the expression for $R^2$ in the following way:

Yi = ˆ1 + ˆ2 X i + ˆi = Yˆi + ˆi

(
 Yi − Y = Yˆi − Y + ˆi )
or yi = yˆi + ˆi

where the lower case letters denote mean adjusted variables, i.e. yi = Yi − Y . By squaring and
summing this function it can be shown that

 y =  yˆ +  ˆ
i
2
i i
2
i i i
2

where y 2
i = TSS (Total Sum of Squares),  yˆ 2
i = ESS (Explained Sum of Squares) and

 ˆ i
2
= RSS (Residual Sum of Squares) and hence

TSS = ESS + RSS

In words this means that the total variation of the actual Y values around their mean is equal
to the sum of the total variation of the estimated Y values around the mean and the residual
variation of Y.

If the regression line fits through the points very well we would expect the residual variation to be small. In the extreme case all points lie exactly on the line so that $RSS = 0$, but this is very unlikely to occur. The $R^2$ value tells us how much of the total variation of Y is attributable to the regression line, so that

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$

How do we interpret the $R^2$ value? It is the case that $0 \le R^2 \le 1$. The closer this value is to 1, the better the fit of the regression line; the closer to 0, the worse the fit and hence the higher the residual variation. We therefore would always be happier if the coefficient has a value close to 1.
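
As a quick numerical illustration, the sketch below computes $TSS$, $RSS$ and $R^2$ for the same made-up data used in the earlier snippets (the values are illustrative only):

```python
import numpy as np

# Illustrative data only
X = np.array([10.0, 14.0, 18.0, 22.0, 26.0, 30.0])
Y = np.array([ 9.0, 11.5, 14.5, 17.0, 21.0, 23.5])

b2 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b1 = Y.mean() - b2 * X.mean()
Y_hat = b1 + b2 * X

TSS = np.sum((Y - Y.mean()) ** 2)   # total sum of squares
RSS = np.sum((Y - Y_hat) ** 2)      # residual sum of squares
R2 = 1 - RSS / TSS
print(f"R-squared: {R2:.4f}")
```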

2.4 Hypothesis testing


In Section 1 we introduced the concept of hypothesis testing in the context of an estimator, which we denoted $\hat{\theta}$, of a parameter $\theta$. In examples we looked at testing the mean of a variable using the sample mean estimator. We are now going to look at testing hypotheses regarding the parameters in the bivariate regression model, i.e. $\beta_1$ and $\beta_2$. For example, in the context of the consumption function, we may have some pre-conceived idea of the value for the MPC, which we can test once we have estimated the parameters.

Let’s say that we hypothesise that the true value of the parameter $\beta_2$ is 0.5. It has to be remembered that the least squares estimate is a random variable with a continuous density, so it is unlikely that the estimated value $\hat{\beta}_2$ would come out at exactly 0.5, even if the hypothesis is true. So we should not make simple inspections of the estimated values to decide whether a hypothesis is true or not. We need a test that can distinguish between a $\hat{\beta}_2$ that is different from 0.5 because the actual value of $\beta_2$ is different from 0.5, and a $\hat{\beta}_2$ that is different from 0.5 even though the actual value of $\beta_2$ is 0.5.

2.4.1 The simple t-test


To run these tests we require knowledge of the distributional aspects of the OLS estimators.
This was done above where we stated that

$$\hat{\beta}_1 \sim N\left(\beta_1, \sigma_{\beta_1}^2\right) \quad \text{and} \quad \hat{\beta}_2 \sim N\left(\beta_2, \sigma_{\beta_2}^2\right)$$

Suppose that we wish to test that $\beta_2$ is equal to some value $\beta^*$ and that there are no theoretical suggestions to help us specify the direction in which the parameter should deviate under the alternative. We would therefore choose to perform a two-sided test where the null and alternative hypotheses are specified as

$$H_0: \beta_2 = \beta^* \qquad H_1: \beta_2 \neq \beta^* \qquad (8)$$

There may however be some a priori evidence to suggest a particular direction for the
alternative. For example, economic theory suggests that the MPC is positive, so that if we
wanted to test whether the MPC is equal to 0 we could choose an alternative hypothesis in
which the relevant parameter is greater than 0. To test a hypothesis like (8), we can use the t
test technique that we analysed in Section 1. We specify a test statistic and then compare this
value to the relevant critical value from an appropriate distribution table.


We know that $\hat{\beta}_2 \sim N\left(\beta_2, \frac{\sigma^2}{\sum\left(X_i - \bar{X}\right)^2}\right)$. From this we can create a variable that has a standard normal distribution as follows:

$$Z = \frac{\hat{\beta}_2 - \beta_2}{\sqrt{\dfrac{\sigma^2}{\sum\left(X_i - \bar{X}\right)^2}}} \sim N(0,1) \qquad (9)$$

The parameter $\sigma^2$ in (9) is unknown and needs to be estimated. We are not going to show how to derive such an estimator, so take as given that an unbiased estimator of the error variance is

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n}\hat{\varepsilon}_i^2}{n-k}$$

where $k$ represents the number of unknown parameters in the regression model, so in this case $k = 2$. On replacing $\sigma^2$ in (9) with its estimator $\hat{\sigma}^2$, this changes the distribution of the variable. We usually call the new variable $t$ because it has a t distribution as shown here:

$$t = \frac{\hat{\beta}_2 - \beta_2}{\sqrt{\dfrac{\hat{\sigma}^2}{\sum\left(X_i - \bar{X}\right)^2}}} \sim t_{n-2}$$

This statistic has a $t_{n-2}$ distribution, where $n-2$ is called the degrees of freedom. Now we can state that, if the null hypothesis is true, i.e. that $\beta_2 = \beta^*$, then

$$t = \frac{\hat{\beta}_2 - \beta^*}{\sqrt{\dfrac{\hat{\sigma}^2}{\sum\left(X_i - \bar{X}\right)^2}}} \sim t_{n-2}$$

or in a more condensed form,

$$t = \frac{\hat{\beta}_2 - \beta^*}{s.e.\left(\hat{\beta}_2\right)} \sim t_{n-2} \qquad (10)$$

where $s.e.\left(\hat{\beta}_2\right)$ denotes the standard error of the estimator and is the square root of the estimated variance of $\hat{\beta}_2$.


To perform the test we therefore calculate the test statistic from (10) which must then be
compared to a critical value from the t distribution table. Following a set of decision rules we
can decide whether to reject or not reject the null hypothesis. On choosing a significance level $\alpha$ (usually 5%, so $\alpha = 0.05$), and given the degrees of freedom $n-2$, the critical value is easily found. The decision is made via the following rules:

• $H_1: \beta_2 \neq \beta^*$: if $|t| > t_{n-2}^{\alpha/2}$, reject the null in favour of the alternative hypothesis;
• $H_1: \beta_2 > \beta^*$: if $t > t_{n-2}^{\alpha}$, reject the null in favour of the alternative hypothesis;
• $H_1: \beta_2 < \beta^*$: if $t < -t_{n-2}^{\alpha}$, reject the null in favour of the alternative hypothesis.
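
A hedged sketch of this procedure in Python, using scipy for the critical value; the estimate, standard error, hypothesised value and sample size below are made-up numbers (the slope estimate simply reuses the 0.7765 from the earlier example, while the standard error is hypothetical):

```python
from scipy import stats

# Illustrative numbers only
beta2_hat = 0.7765     # estimated slope
se_beta2 = 0.11        # its standard error (hypothetical)
beta_star = 0.5        # hypothesised value under H0
n, k = 28, 2           # sample size and number of parameters

t_stat = (beta2_hat - beta_star) / se_beta2
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - k)   # 5% two-sided critical value

print(f"t statistic = {t_stat:.3f}, critical value = {t_crit:.3f}")
print("reject H0" if abs(t_stat) > t_crit else "do not reject H0")
```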

A test of special interest to econometricians is the test of significance. It is used so often that statistical software packages automatically produce the test statistic alongside the coefficient estimates for the $\beta$ parameters. It is called the test of significance because the null hypothesis is that the parameter is equal to the specific value 0.

2.4.2 The test of significance


If we wish to test the significance of the X variable in regression model (3) then we would test
whether the parameter on X, i.e. $\beta_2$, is equal to zero. Please note that to test the significance of X we do not test whether X = 0. We are not testing whether the data are 0. We are testing whether the effect that variable X has on Y is 0. Hence we would test

$$H_0: \beta_2 = 0 \quad \text{against} \quad H_1: \beta_2 \neq 0$$

Under the null hypothesis the following test statistic has a t distribution

$$t = \frac{\hat{\beta}_2}{s.e.\left(\hat{\beta}_2\right)} \sim t_{n-2}$$

If we cannot reject the null then we are effectively saying that $\beta_2 = 0$ and that the regression model should actually be written as

$$Y_i = \beta_1 + \varepsilon_i$$


and therefore variable X is not a significant determinant of Y, i.e. any change in the value of
variable X has no impact on variable Y. If we reject the null then X is a significant determinant
of Y and the model is

Yi = 1 + 2 X i +  i .

Of course this test can also be performed on the other parameter 1 . In Section 3 we will
consider multiple regression models in which more X variables appear on the right-hand side
of the equation and this widens the scope for many more forms of hypothesis to be tested.

2.5 Confidence Bands


We looked at constructing confidence intervals for the mean $\mu$ of a variable. We can now look at building such intervals for the parameters in regression models using OLS estimators. So, we know that

$$t = \frac{\hat{\beta}_i - \beta_i}{s.e.\left(\hat{\beta}_i\right)} \sim t_{n-2}$$

therefore $P\left(\hat{\beta}_i - t^c\, s.e.\left(\hat{\beta}_i\right) \le \beta_i \le \hat{\beta}_i + t^c\, s.e.\left(\hat{\beta}_i\right)\right) = 1 - \alpha$ and hence we have $100(1-\alpha)\%$ confidence that the interval $\left[\hat{\beta}_i - t^c\, s.e.\left(\hat{\beta}_i\right),\; \hat{\beta}_i + t^c\, s.e.\left(\hat{\beta}_i\right)\right]$ contains the true value for $\beta_i$.
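
A short Python sketch of the interval construction, again with made-up values for the estimate, standard error and sample size:

```python
from scipy import stats

# Illustrative numbers only
beta_hat, se_hat, n, k = 0.7765, 0.11, 28, 2

t_c = stats.t.ppf(1 - 0.05 / 2, df=n - k)   # critical value for a 95% interval
lower = beta_hat - t_c * se_hat
upper = beta_hat + t_c * se_hat
print(f"95% confidence interval: [{lower:.3f}, {upper:.3f}]")
```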

2.6 Analysing Eviews output


It is useful at this point to indicate the different aspects of Eviews output that you should be
able to interpret using the theory we have done so far. Let’s suppose that we want to estimate
a simple consumption function

$$CONS_t = \beta_1 + \beta_2 INC_t + \varepsilon_t$$

where the data are for the UK over the time period 1955 to 2010, i.e. $t = 1955, \ldots, 2010$, such
that we have 56 observations of data. The dependent variable is consumption, here denoted
CONS and the regressor is income, denoted INC. The table of Eviews output is below. We have
not covered many of the statistics in this table yet, but let’s interpret as much as we can.


Dependent Variable: CONS


Method: Least Squares
Sample: 1955 2010
Included observations: 56

Variable Coefficient Std. Error t-Statistic Prob.

C -4264.223 5332.261 -0.799703 0.4274


INC 0.938440 0.009947 94.34038 0.0000

Root MSE 15416.11 R-squared 0.993969


Mean dependent var 458215.0 Adjusted R-squared 0.993858
S.D. dependent var 200309.6 S.E. of regression 15699.00
Akaike info criterion 22.19564 Sum squared resid 1.33E+10
Schwarz criterion 22.26798 Log likelihood -619.4780
Hannan-Quinn criter. 22.22369 F-statistic 8900.108
Durbin-Watson stat 0.344329 Prob(F-statistic) 0.000000

The table tells us what the dependent variable is, the estimation procedure, sample time span
and number of observations. Then we have 5 columns.

• The 1st tells us the name of each variable on the right hand side of the equation, in this
case C which is the constant term in the regression and INC.
• The 2nd contains the OLS parameter estimates. So here we have that $\hat{\beta}_1 = -4264.223$ and $\hat{\beta}_2 = 0.93844$.
• The 3rd is the standard error of the estimate, e.g. $s.e.(\hat{\beta}_2) = 0.009947$.
• The 4th is the t statistic of significance, i.e. the test of each parameter being equal to 0. You should see that this column is the result of dividing the 2nd column by the 3rd column because $t = \hat{\beta}_i / s.e.(\hat{\beta}_i)$. The critical value is roughly equal to 2, so these suggest that we reject the null that the variable is insignificant. Hence income is a statistically significant determinant of consumption (it would have been surprising to get anything different).
• The 5th is the P value which measures the probability under the t distribution that lies
above the t statistic value. The values here imply that the t statistics are so large that the
area above these values in a t distribution is so small it cannot be read to 4 decimal places.
This again implies that the variables are significant. This column is useful because it avoids
the need to find the critical value in a distribution table. If you are doing a 5% two-sided
test then you compare this value to 0.05. If the P value is less than 0.05 then you reject
the null.
At the bottom of the table is the goodness of fit statistic, which is quite high at 0.994. This
suggests that the model fits the data well, even in a simple bivariate regression. This is because
income is a very important variable and the main determinant of how much we spend.
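
Although the output above comes from Eviews, much the same table can be reproduced in Python with the statsmodels package. This is only a sketch: it assumes the data sit in a CSV file (hypothetically called uk_consumption.csv here) with columns named CONS and INC.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file and column names -- adjust to your own dataset
data = pd.read_csv("uk_consumption.csv")

y = data["CONS"]
X = sm.add_constant(data["INC"])   # adds the intercept term C

results = sm.OLS(y, X).fit()
print(results.summary())           # coefficients, std errors, t stats, p values, R-squared
```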


3 Multiple Linear Regression

3.1 Discussion of the model


The regression model in Section 2 involved just two variables, X and Y and we looked at the
OLS method for estimating the unknown coefficients and testing simple hypotheses about
these parameters. However, for many economic theories, bivariate models are much too
simplistic. If we wish to analyse the effects of other variables on the dependent then we need
to expand the model. For example in the consumption function, we may wish to include other
factors like prices and interest rates, as well as income, to help determine the variation in
consumption. In general, a linear k-variable model is given by

Yi = 1 + 2 X 2i + 3 X 3i + + k X ki +  i , for i = 1, , n (1)

where there now exist k-1 explanatory variables and k parameters to estimate. So what are
the differences between this model and the bivariate model, can we estimate the parameters
in the same way and do these estimates have the same properties as before? What about
hypothesis testing? In this section we will analyse these issues and point out where differences
lie between the multiple and the bivariate case.

Think of (1) as an extension of the bivariate model; we just have more factors to analyse. The term $\beta_1$ still represents the intercept or constant term, but the $\beta_j$ for $j = 2, \ldots, k$ are now interpreted as partial slope coefficients. This means that, for example, $\beta_2$ measures the change in the mean value of Y per unit change in $X_2$, ceteris paribus (whilst holding the values of the other explanatory variables constant). Or, we could say that given the other variables are in the model, the parameter $\beta_2$ measures the additional explanatory power of variable $X_2$. We can therefore analyse how much of the variation in Y is directly attributable to $X_2$, how much to $X_3$, etc.

To give a good example (given in Koop pg 44-46), we have a model that tries to explain how
house prices are determined. Let Y be house prices in £s, P is the size of the plot in square
feet, BD is the number of bedrooms, BT is the number of bathrooms and F is the number of
floors, so the regression is

Yi = 1 + 2 Pi + 3 BDi + 4 BTi + 5 Fi +  i


In this example you would expect all parameters to be estimated with positive values because
each house feature should increase the price of the house. Suppose that $\hat{\beta}_2 = 48.36$. How do we interpret this value? Can we simply state that houses with bigger plots are worth more?
Well not strictly because there will be some exceptions, a derelict house on a large plot is
unlikely to be more expensive than a luxury house on a smaller plot. What we can say is that
for houses that are comparable in other respects the one on the bigger plot will be worth
more. Or more precisely for this example, an extra square foot raises the price of a house by
£48.36, ceteris paribus, or alternatively, for houses with the same number of bedrooms,
bathrooms and floors, an extra square foot will increase the price by £48.36.

3.2 Ordinary Least Squares estimation


We can still use OLS to estimate the unknown parameters in a multiple regression model,
although now we have more than two. Of course to do so requires the collection of more data,
for example data on consumption, income, as well as prices and interest rates etc. The
methodology of OLS is also exactly the same, i.e. we minimise the sum of squared residuals.
Of course it becomes more complicated because of the higher dimension of the parameter
space. To show how it works we will look at the OLS estimators of the parameters of a 3-
variable model

Yi = 1 + 2 X 2i + 3 X 3i +  i , for i = 1, , n

for which we have the following minimisation problem, i.e. to minimise S where

(
S =  i =1 ˆi2 =  i =1 Yi − ˆ1 − ˆ2 X 2i − ˆ3 X 3i )
n n 2

S S S
Three derivatives need to be found, , and . Each equation is set to zero and
ˆ ˆ
1  2 ˆ3
solved. The results are the following estimators:

ˆ1 = Y − ˆ2 X 2 − ˆ3 X 3

ˆ
2 =  yi x2i  x32i −  yi x3i  x2i x3i
 x22i  x32i − (  x2i x3i )
2

ˆ3 =  i 3i  2i  i 2i  22i 3i
yx x − yx 2
x x
 x2i  x3i − (  x2i x3i )
2 2


where the lower case letters represent deviations from means, e.g. $x_{2i} = X_{2i} - \bar{X}_2$. You can
easily imagine how difficult this becomes when we add more variables into the model. We are
lucky that we have computer software packages that have these procedures programmed in
to them, so that we do not have to worry about calculating these equations by hand.

Now although the procedure for deriving the OLS estimators is the same here as in the
bivariate model, do these estimators still have the same properties of unbiasedness and
efficiency? That is, are they still B.L.U.E.? As with the 2-variable model, this depends upon a set
of assumptions. The classical linear regression assumptions are still appropriate in the multiple
regression model, but with minor modifications. Assumption CLRM2 needs to be modified to
now hold for all explanatory variables in the model, so each regressor is non stochastic. We
must also add a new assumption:

• CLRM7: No exact collinearity exists between any of the explanatory variables. This means
that there should not be an exact linear relationship between any regressors.

Under the original CLRM assumptions plus this extra one, the OLS estimators of the $\beta$ parameters in multiple linear regression models are indeed B.L.U.E.

The reasoning behind this new assumption should be explained. Suppose we wish to estimate
the parameters in the regression model

Yi = 1 + 2 X 2i + 3 X 3i + 4 X 4i +  i

Let's say that one regressor in the model can be expressed as an exact linear function of
another, e.g. X 3i = 5 − 3 X 2i . This would cause problems for the OLS estimation of some of the
 parameters. This relationship implies that we can write the model as

Yi = 1 +  2 X 2i +  3 ( 5 − 3 X 2i ) +  4 X 4i +  i
 Yi = ( 1 + 53 ) + (  2 − 33 ) X 2i +  4 X 4i +  i

So the model that we are really estimating only has two regressors rather than three, i.e.

Yi =  1 +  2 X 2i +  3 X 4i +  i

42
J.S.Ercolani Section 3: Multiple Linear Regression

where  1 = 1 + 53 ,  2 = 2 − 33 and  3 = 4 . We can therefore estimate only the three 
parameters, not the four  parameters. We therefore cannot assess the individual effects
here of X 2 or X 3 on Y .

Exact collinearity, also called pure multicollinearity, is an extreme case and rare. But it is often
the case that regressors can be highly (not exactly) correlated with each other, which itself
brings about estimation problems. This concept of multicollinearity will be explored in a bit
more depth later.

Sampling distributions of the OLS estimators


In the same way as for bivariate regressions, the sampling distributions of the OLS estimators are vital for hypothesis testing. Under the assumption of normality of the disturbance term $\varepsilon$ we showed in Section 2 that the estimators are also normally distributed. Therefore we can state that $\hat{\beta}_j \sim N\left(\beta_j, \sigma_{\beta_j}^2\right)$, showing the unbiasedness of the estimators by the fact that the means equal the true but unknown values, i.e. $E\left(\hat{\beta}_j\right) = \beta_j$. Consider the three-variable model again; the variances are much more complicated than they were in the bivariate model and are given by

$$\sigma_{\beta_1}^2 = \operatorname{var}\left(\hat{\beta}_1\right) = \sigma^2\left[\frac{1}{n} + \frac{\bar{X}_2^2\sum x_{3i}^2 + \bar{X}_3^2\sum x_{2i}^2 - 2\bar{X}_2\bar{X}_3\sum x_{2i}x_{3i}}{\sum x_{2i}^2\sum x_{3i}^2 - \left(\sum x_{2i}x_{3i}\right)^2}\right]$$

$$\sigma_{\beta_2}^2 = \operatorname{var}\left(\hat{\beta}_2\right) = \sigma^2\left[\frac{\sum x_{3i}^2}{\sum x_{2i}^2\sum x_{3i}^2 - \left(\sum x_{2i}x_{3i}\right)^2}\right]$$

$$\sigma_{\beta_3}^2 = \operatorname{var}\left(\hat{\beta}_3\right) = \sigma^2\left[\frac{\sum x_{2i}^2}{\sum x_{2i}^2\sum x_{3i}^2 - \left(\sum x_{2i}x_{3i}\right)^2}\right]$$

Remember that we need to estimate the error variance $\sigma^2$ and in a three-variable model this is given by

$$\hat{\sigma}^2 = \frac{\sum\hat{\varepsilon}_i^2}{n-3}$$

Notice the change in the denominator in this equation relative to its equivalent estimator in the bivariate case. In a general k-variable model like (1), this estimator is $\hat{\sigma}^2 = \sum\hat{\varepsilon}_i^2 / (n-k)$.
These variance estimators will also allow you to appreciate the complexities involved in
including more regressors in a model. This is why econometricians tend to analyse multiple
regressions using the matrix form of the model (this is not covered in this module).

3.3 Coefficient of determination


The function of this statistic and its interpretation, as a measure of the goodness of fit of the
regression line to the scatter of observations, is the same as in the bivariate model and is
calculated in the same way, i.e. $R^2 = 1 - \frac{RSS}{TSS}$.

One use of this statistic is as a way of helping to choose between different economic models,
i.e. between models with different variables on the right-hand side. So long as the models
have the same dependent variable, one would, in general, prefer the model with a higher $R^2$ value (although on its own this is not enough to choose between models). However, there is a problem with doing this: the $R^2$ statistic will always increase in value when more explanatory variables are included. Therefore one should be wary of comparing one model with another on the basis of their $R^2$ values. Even if the variables that you add to the model are not important or relevant to the economic theory, the $R^2$ value will always increase, making it look as if these variables are important in helping to describe the variation in the dependent variable. For example, in our consumption function example, we could run a regression of consumption on income, prices and interest rates, and then run a regression of consumption on income, prices, interest rates and rainfall in the UK. You would find that the second regression produces a higher $R^2$ even though rainfall is unlikely to have
an effect on our consumption patterns.

So how can we properly compare two models with the same dependent variable but a
different number of explanatory variables? This can be done using the adjusted $R^2$, often denoted $\bar{R}^2$. This statistic essentially penalises the inclusion of more explanatory variables. The statistic is calculated as follows

$$\bar{R}^2 = 1 - \left[\left(\frac{n-1}{n-k}\right)\left(1 - R^2\right)\right]$$

It is therefore more appropriate to compare two models with the same dependent variable on the basis of their $\bar{R}^2$ values than their $R^2$ values. The value of the $\bar{R}^2$ will increase only when the extra variables added have something important to add to the analysis.
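
A one-line calculation in Python, with purely illustrative values for $R^2$, $n$ and $k$:

```python
# Illustrative values: R-squared of 0.96 from a model with n = 50 observations
# and k = 4 estimated parameters (intercept plus three slopes)
R2, n, k = 0.96, 50, 4

R2_adj = 1 - (n - 1) / (n - k) * (1 - R2)
print(round(R2_adj, 4))   # roughly 0.9574
```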


3.4 Hypothesis testing


Now that there are more explanatory variables in the regression model, there are far more
hypothesis tests that can be performed. We will consider tests that involve one parameter
and tests that involve multiple parameters.

3.4.1 Tests involving one parameter


Consider a k-variable model like model (1) that has k-1 explanatory variables and k parameters
to estimate. We may wish to test the hypothesis

$$H_0: \beta_j = \beta_j^* \qquad H_1: \beta_j \neq \beta_j^*$$

where the subscript j can be any number from 1 to k. The test is performed in exactly the same
way as in the bivariate model. From knowledge of the sampling distribution of the estimator
of this parameter we can calculate a t statistic and base our reject/not reject decision upon a
comparison of this statistic to a critical value from a t table. Therefore we calculate

$$t = \frac{\hat{\beta}_j - \beta_j^*}{s.e.\left(\hat{\beta}_j\right)} \sim t_{n-k}$$

Notice however that the degrees of freedom parameter has changed. This is because the number of parameters that we must estimate before performing the test has increased. Also, in calculating the $s.e.(\hat{\beta}_j)$, we must estimate $\sigma^2$, which is now done using the estimator $\hat{\sigma}^2 = \sum\hat{\varepsilon}_i^2/(n-k)$. The critical value that we use is also dependent upon the form of the alternative hypothesis, i.e. whether we are doing a one or a two-tailed test. In the example above we are doing a two-tailed test. Note that each of the $\beta$ parameters can be tested in the same way.

The test of significance can still be applied to each parameter, i.e. the test that the parameter
is equal to 0. Suppose we are interested in testing whether $\beta_3 = 0$. What we are testing is whether, given the presence of the other variables in the regression, variable $X_3$ has any additional explanatory power. If we find that we cannot reject the null that $\beta_3 = 0$, then we conclude that the variable $X_3$ is not a significant determinant of the dependent variable Y. The test statistic in this case is $t = \hat{\beta}_3 / s.e.(\hat{\beta}_3)$.

3.4.2 Testing a single linear restriction


The t statistic can also be used to test slightly more complicated forms of restriction.
Sometimes econometricians may want to test a linear combination of the parameters, for
example that the sum of parameters equals 1, or that one parameter is equal in value to
another, etc. For example, suppose you want to test whether $\beta_2 = \beta_3$, i.e. that the additional explanatory power of variable $X_2$ is exactly the same as that of $X_3$. As an example, maybe you are interested in the factors that affect how much we earn, and you want to test whether the marginal impact on earnings of doing a degree is the same as that of doing on-the-job training.

The null hypothesis can be equivalently stated as $H_0: \beta_2 - \beta_3 = 0$. So we need to establish the distribution of $\hat{\beta}_2 - \hat{\beta}_3$. If we define their individual distributions as

$$\hat{\beta}_2 \sim N\left(\beta_2, \sigma_{\beta_2}^2\right) \quad \text{and} \quad \hat{\beta}_3 \sim N\left(\beta_3, \sigma_{\beta_3}^2\right)$$

then

$$\hat{\beta}_2 - \hat{\beta}_3 \sim N\left(\beta_2 - \beta_3,\; \sigma_{\beta_2}^2 + \sigma_{\beta_3}^2 - 2\sigma_{\beta_2\beta_3}\right)$$

where $\sigma_{\beta_2\beta_3}$ is the covariance between the two parameter estimates. Hence

$$Z = \frac{\hat{\beta}_2 - \hat{\beta}_3 - \left(\beta_2 - \beta_3\right)}{\sqrt{\sigma_{\beta_2}^2 + \sigma_{\beta_3}^2 - 2\sigma_{\beta_2\beta_3}}} \sim N(0,1)$$

On replacing the denominator with its estimated values, we can say that under the null where $\beta_2 - \beta_3 = 0$,

$$t = \frac{\hat{\beta}_2 - \hat{\beta}_3}{\sqrt{\hat{\sigma}_{\beta_2}^2 + \hat{\sigma}_{\beta_3}^2 - 2\hat{\sigma}_{\beta_2\beta_3}}} \sim t_{n-k}$$

or more simply

$$t = \frac{\hat{\beta}_2 - \hat{\beta}_3}{s.e.\left(\hat{\beta}_2 - \hat{\beta}_3\right)} \sim t_{n-k}$$

This statistic is then compared to the appropriate critical value from a t table.


3.4.3 Tests of joint restrictions


If the null hypothesis contains more than one restriction, the testing procedure is very
different. Suppose from model (1) we wish to test the joint hypothesis that

$$H_0: \beta_2 = \beta_4 = 0 \qquad H_1: \beta_2 \neq 0 \text{ and/or } \beta_4 \neq 0$$

This is a test of the joint significance of $\beta_2$ and $\beta_4$. The test involves estimating two models, an unrestricted and a restricted model, and comparing the RSS, $\sum\hat{\varepsilon}_i^2$ (residual sum of squares), from both. The unrestricted model is the original, in this case

$$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + \beta_5 X_{5i} + \cdots + \beta_k X_{ki} + \varepsilon_i$$

The restricted model is the one that results from imposing the restrictions under the null onto the unrestricted model. In this example the restrictions are that both $\beta_2$ and $\beta_4$ are equal to 0. On imposing these restrictions we get the model

$$Y_i = \beta_1 + \beta_3 X_{3i} + \beta_5 X_{5i} + \cdots + \beta_k X_{ki} + \varepsilon_i$$

The test statistic in this case is not a t but an F statistic, i.e. it has an F distribution. The statistic itself is calculated from the following equation

$$F = \frac{\left(RSS_R - RSS_U\right)/q}{RSS_U/\left(n-k\right)} \sim F_{q,\,n-k} \qquad (2)$$

where $RSS_R$ and $RSS_U$ are the residual sums of squares from the restricted and unrestricted models respectively and q represents the number of restrictions that we are imposing, which in the above case is 2 ($\beta_2 = 0$ and $\beta_4 = 0$). As with the t testing procedures that we have looked at, we must compare the test statistic with a critical value, this time from an F distribution table, not a t distribution. The decision rules at the $\alpha\%$ level of significance are as follows: if $F > F_{q,n-k}^{\alpha}$ then reject the null hypothesis; otherwise do not reject the null.

On the diagram of an F distribution this looks as follows:

[Figure: F distribution with area $1-\alpha$ below the critical value $F^c$ (do not reject the null) and area $\alpha$ in the right tail above $F^c$ (reject the null).]

You may be wondering what the difference is between doing an F test of the restriction
$\beta_2 = \beta_4 = 0$ and two separate t tests, one for $\beta_2 = 0$ and the other for $\beta_4 = 0$. In other words, what is the difference between testing the joint significance of $X_2$ and $X_4$ and testing their individual significance? The former tests whether both $\beta_2$ and $\beta_4$ are 0. The latter tests whether $\beta_2$ is 0, when $\beta_4$ is free to be whatever it is estimated to be, and vice versa. One important feature is a consequence of multicollinearity. If the two regressors $X_2$ and $X_4$ are
highly correlated with each other, then individual t tests of the significance of each and a joint
F test of joint significance are likely to lead to different conclusions. The strong correlation of
the two variables means that when we do a t test on X 2 , given X 4 is already in the model,
X 2 is likely to look insignificant because its additional explanatory power above that of X 4 is
negligible. And vice versa, when we do a t test on X 4 , given X 2 is already in the model, X 4 is
likely to look insignificant because its additional explanatory power above that of X 2 is
negligible. Hence on the basis of t tests they both appear insignificant, and the econometrician
may exclude them from the model on that basis. However, an F test of joint significance may
suggest that at least one of them is significant. It may be that you are interested in how
interest rates affect a particular economic variable, maybe growth. There are many different
types of interest rate that you could choose from and maybe you include two of them. Interest
rates are clearly important in this setting and an F test of their joint significance will reflect
this. But because the two interest rate variables are likely to be highly correlated, they will
both look insignificant on the basis of t tests. The presence of one interest rate variable means
that the additional explanatory power of the other is minimal. The solution here would be to include only one interest rate variable; the second is redundant given that it just mimics the movements of the first.

Now let’s consider a more complicated set of restrictions. Suppose we wish to test


$$H_0: \beta_2 + \beta_3 = 1 \text{ and } \beta_4 = \beta_5 \qquad H_1: \beta_2 + \beta_3 \neq 1 \text{ and/or } \beta_4 \neq \beta_5$$

from regression model (1). We again perform the test using an F statistic and a comparison of restricted and unrestricted models as we did above; the problem now, though, is that the restrictions are not as simple to impose as the previous set of restrictions, where we were simply setting parameters equal to zero. Here, to get the restricted model, we need to either replace $\beta_4$ with $\beta_5$, or replace $\beta_5$ with $\beta_4$, and we need to either replace $\beta_2$ with $1 - \beta_3$, or $\beta_3$ with $1 - \beta_2$. All would be correct and produce the same result. Let's do the latter combination of each restriction, so that we get

$$Y_i = \beta_1 + \beta_2 X_{2i} + \left(1 - \beta_2\right) X_{3i} + \beta_4 X_{4i} + \beta_4 X_{5i} + \cdots + \beta_k X_{ki} + \varepsilon_i$$

A little bit of re-arranging provides

$$Y_i - X_{3i} = \beta_1 + \beta_2\left(X_{2i} - X_{3i}\right) + \beta_4\left(X_{4i} + X_{5i}\right) + \cdots + \beta_k X_{ki} + \varepsilon_i$$

so that the restricted model that we should estimate has the form

$$Z_i = \beta_1 + \beta_2 W_{1i} + \beta_4 W_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$$

where $Z = Y - X_3$, $W_1 = X_2 - X_3$ and $W_2 = X_4 + X_5$. We can then proceed to estimate the restricted and unrestricted models to obtain the RSS statistics and compute the F statistic using (2).

3.4.4 Test of overall significance


One important form of joint hypothesis test is that of the test of the overall significance, or
the test of the significance of the regression, for which the null has the specific form

$$H_0: \beta_2 = \beta_3 = \cdots = \beta_k = 0 \qquad H_1: \text{at least one} \neq 0$$

This is a test that all k-1 slope parameters (doesn’t include the constant parameter) are jointly
equal to 0, i.e. that none of the variables in the model are important in determining Y. The test
is performed in the same way, using an F statistic for which the restricted and unrestricted
models are compared. The restricted model in this case has the form


Yi = 1 +  i

The test statistic is the same as above with q = k − 1 , so that it has a Fk −1,n − k distribution.

An alternative way of performing the test of overall significance is to use the coefficient of
determination. Note that $\beta_2 = \beta_3 = \cdots = \beta_k = 0$ is equivalent to stating that $R^2 = 0$. So we are also testing

$$H_0: R^2 = 0 \qquad H_1: R^2 \neq 0$$

The test statistic can be written as

$$F = \frac{R^2/\left(k-1\right)}{\left(1 - R^2\right)/\left(n-k\right)} \sim F_{k-1,\,n-k}$$

It is important to note that this particular test statistic involving the $R^2$ is only relevant when testing the overall significance of the regression. It would be wrong to use it to test, for example, the null that we looked at before, $\beta_2 = 0$ and $\beta_4 = 0$. We have to use the statistic that involves the RSS in this case.

3.4.5 Testing a single restriction hypothesis using the F statistic


In Sections 3.4.1 and 3.4.2 we looked at testing hypotheses that involve a single restriction. In
the former we were looking at testing single parameter restrictions like $\beta_j = \beta^*$ and in the latter the restrictions involved multiple parameters, and we showed how to use the t statistic for both. However, both could be done using the F test given in equation (2). The procedure is the same: you estimate the unrestricted model, impose the restriction and estimate the restricted model (obtaining the RSS values), and plug all the values into (2) to get your statistic. The value of q is 1 here because only one restriction is being tested. Your distribution will be an $F_{1,\,n-k}$ and you will find that the F statistic obtained is equal to the squared value of
the equivalent t statistic.

3.4.6 A numerical example


This example uses real data to model the determinants of food expenditure in the US from
1991 to 2015. The model assumes that aggregate expenditure on food, Y, depends upon
aggregate personal income, Z, aggregate personal taxation, T, and the relative price of food,


P. Because the data are observed over time (annually in fact), we will use the subscript t for
time rather than i . The model is therefore

Yt = 1 + 2 Zt + 3Tt + 4 Pt +  t for t = 1991, ,2015

The estimated regression with the sample of size 25 annual observations (1991-2015) is

$$\hat{Y}_t = 116.7 + 0.113 Z_t - 0.115 T_t - 0.741 P_t$$
(9.800)  (0.009)  (0.040)  (0.120)
[11.9]  [12.7]  [2.9]  [6.2]

$$R^2 = 0.9923; \quad \hat{\sigma} = 1.764; \quad RSS = 65.380$$

The values in round parentheses are standard errors and those in square brackets are t ratios
of the null that the relevant parameter equals 0. You should note that the coefficient divided
by the standard error gives the t ratio. Notice that all variables are statistically significant at
the 5% level and the R2 is close to 1. These are good results.

An alternative theory is that food expenditure depends only upon income. If this is the case
our model is

Yt = 1 + 2 Zt +  t

for which the estimated version is given by

Yˆt = 59.600 + 0.076Z t


( 2.2 ) ( 0.002 )
R2 = 0.9778; ˆ = 2.862; RSS = 188.405

This model is a restricted version of the first with the restrictions $\beta_3 = \beta_4 = 0$ imposed. Hence $RSS_U = 65.380$ and $RSS_R = 188.405$. We can therefore formally test the hypothesis

$$H_0: \beta_3 = \beta_4 = 0 \qquad H_1: \text{at least one} \neq 0$$

using the F statistic

$$F = \frac{\left(188.405 - 65.380\right)/2}{65.380/21} = 19.76 > F_{2,21}^{0.05} = 3.47$$

This means that we reject the null hypothesis. The alternative theory is therefore invalid.
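
This calculation is easy to reproduce; here is a small Python sketch using the RSS values reported above and scipy for the critical value:

```python
from scipy import stats

RSS_U, RSS_R = 65.380, 188.405   # unrestricted and restricted residual sums of squares
q, n, k = 2, 25, 4               # 2 restrictions, 25 observations, 4 parameters

F = ((RSS_R - RSS_U) / q) / (RSS_U / (n - k))
F_crit = stats.f.ppf(0.95, dfn=q, dfd=n - k)

print(f"F = {F:.2f}, 5% critical value = {F_crit:.2f}")   # roughly 19.76 and 3.47
print("reject H0" if F > F_crit else "do not reject H0")
```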

IMPORTANT: Notice that the estimated coefficients for $\beta_1$ and $\beta_2$ in the two models are very different from each other. The reason for this is that in the second model important variables have been excluded. This has led to a bias effect on the remaining estimated parameters, i.e. the estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ in the second model no longer have the unbiasedness property because of the omission of T and P.

Notice from the first estimated regression that the parameters on Z and T are very similar in
size but opposite in sign, i.e. $\hat{\beta}_2 \approx -\hat{\beta}_3$ ($0.113 \approx 0.115$). There may be some economic reasoning for this, so let's formally test the hypothesis

$$H_0: \beta_2 = -\beta_3 \qquad H_1: \beta_2 \neq -\beta_3$$

which is equivalent to testing

$$H_0: \beta_2 + \beta_3 = 0 \qquad H_1: \beta_2 + \beta_3 \neq 0$$

using the F statistic (although this could be done using a t test, where the t statistic would be $t = \frac{\hat{\beta}_2 + \hat{\beta}_3}{\sqrt{\hat{\sigma}_{\beta_2}^2 + \hat{\sigma}_{\beta_3}^2 + 2\hat{\sigma}_{\beta_2\beta_3}}} \sim t_{n-k}$). To get the restricted model we need to impose the restriction onto the unrestricted model so that

$$Y_t = \beta_1 + \beta_2 Z_t - \beta_2 T_t + \beta_4 P_t + \varepsilon_t$$
$$\Rightarrow\; Y_t = \beta_1 + \beta_2\left(Z_t - T_t\right) + \beta_4 P_t + \varepsilon_t$$

The estimated version of this restricted model is

$$\hat{Y}_t = 116.7 + 0.112\left(Z_t - T_t\right) - 0.739 P_t$$
(9.6)  (0.003)  (0.114)

$$R^2 = 0.9923; \quad \hat{\sigma} = 1.724; \quad RSS = 65.398$$

The test statistic is now of the form


$$F = \frac{\left(65.398 - 65.380\right)/1}{65.380/21} = 0.01 < F_{1,21}^{0.05} = 4.3$$

We therefore do not reject the null here, which implies that instead of including the variables Z and T separately, we should include them as Z-T. What is the economic rationale for this variable transformation? Well, the term Z-T is personal income less tax, i.e. personal disposable income. Notice also that the number of restrictions here is 1, not 2. Although two $\beta$ parameters are involved, there is only one restriction.

3.5 Multicollinearity
Multicollinearity is a problem that can occur with multiple regressions. We mentioned it at
the start of this section and discussed the concept of collinearity or pure multicollinearity. This
is the extreme situation in which an explanatory variable is exactly related to other
explanatory variables in the model. We looked at the consequences of this, showing that we
cannot estimate the parameters in the model, only combinations of them. In this section we
want to investigate the effects on estimation when explanatory variables are highly
correlated, but not exactly correlated.

3.5.1 Consequences of multicollinearity


Remember that under the classical assumptions, OLS estimators are B.L.U.E. One of the
classical assumptions is that there should be no exact multicollinearity. We have seen what
happens when there is, in fact if you use a computer to estimate a regression model that
contains explanatory variables that are exactly linearly related, it will refuse to compute
anything. The case that we are interested in here is not of the exact kind and so
multicollinearity does not strictly violate the classical assumptions. Hence the OLS estimators
are still B.L.U.E. However, there are some serious consequences for OLS estimators with the
presence of multicollinearity amongst the explanatory variables, even though they are still the
best estimators! So although OLS remains the best procedure that we can use, the estimates
still might not be very good, i.e. we cannot guarantee, in any situation, that the estimates we
get are near to the true but unknown parameter values. The reasons are explained below.

• With multicollinear variables, the OLS estimators are still unbiased. It is important to
remember however, that unbiasedness is what we call a repeated sample property. It
means that if we had many samples of data then the estimators would, on average,
estimate the correct but unknown values, i.e. $E(\hat{\beta}) = \beta$. In economics, we only ever have
the chance to work with one sample of real data for any given empirical problem. So we
get just one estimate for each parameter.


• The presence of multicollinearity does not affect the property of minimum variance for
the OLS estimators, i.e. they still have minimum variance amongst all linear unbiased
estimators. But just because they provide the smallest variances, does not imply that they
will provide small variances, they could still be the smallest and yet quite large. The larger
the variance the less precise the estimator.
• We say that multicollinearity is a sample phenomenon. This means that even if economic
theory does not suggest that two variables will be highly correlated, in the particular
sample of data that we have, the values may give a high correlation coefficient. This can
happen with time series data, where variables tend to increase in value over time, which
can make the variables look as though they are related, but there is no economic reason
why they should be.

In practice therefore, what signs should we expect to see if multicollinearity is present in a


model?
• We would find that our parameters are estimated with large variances (and therefore large
standard errors).
• This means that the t statistics for the tests of significance, which are calculated as $t = \hat{\beta}/s.e.(\hat{\beta})$, are likely to be small in value. We may find that we cannot reject the null hypotheses that the parameters are equal to 0 and hence conclude that variables are insignificant.
• Even though you may find several insignificant variables in the model, the $R^2$ value will still be high, suggesting that the model is significant and that the regression model fits the data well. The t statistics and $R^2$ therefore seem to contradict each other.
• You may find the coefficients are estimated to have values that are the wrong sign to what
the theory predicts.
• If some of the data values were to change by a small amount, the OLS estimates would
change considerably. This means that the estimators are sensitive to the data and are said
to be unstable.

3.5.2 Detecting multicollinearity


There are several methods that can be used:
• One good way of detecting whether multicollinearity is present in the data is to carry on
with the OLS estimation anyway and use the regression results as a guide to detection. It
was stated above that one of the consequences of multicollinearity is to have few
significant t values even though the variables are jointly significant, i.e. a high goodness of
fit. Hence, if you notice that this is the case, then it may imply that some of your
explanatory variables are strongly related to each other.
• We can look at the correlation coefficients between some of these variables to see how strongly related they are. Of course, we can only look at correlation between two variables at a time, so if you had three explanatory variables in your model you could check the correlation between $(X, Y)$, $(X, Z)$ and $(Y, Z)$.
• We could run additional regressions where we take one explanatory variable and regress it on the other Xs, and we do this for each of the explanatory variables. So if your model contains three X variables, you run three extra regressions, i.e.

$$X_{1i} = \alpha_1 + \alpha_2 X_{2i} + \alpha_3 X_{3i} + v_{1i}$$
$$X_{2i} = \gamma_1 + \gamma_2 X_{1i} + \gamma_3 X_{3i} + v_{2i}$$
$$X_{3i} = \delta_1 + \delta_2 X_{1i} + \delta_3 X_{2i} + v_{3i}$$

These regressions will tell you which variables are related to the rest by the size and statistical significance of the $R^2$ value from each of the estimated regressions (a short sketch of these checks is given below).
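
Here is a minimal Python sketch of the second and third checks: the pairwise correlation matrix of the regressors and the auxiliary-regression $R^2$ values. The data are randomly generated so that two of the regressors are deliberately highly correlated; everything here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Made-up regressors: X2 is built from X1, so the two are highly correlated
X1 = rng.normal(0, 1, n)
X2 = 0.9 * X1 + rng.normal(0, 0.2, n)
X3 = rng.normal(0, 1, n)

regressors = np.column_stack([X1, X2, X3])
print(np.corrcoef(regressors, rowvar=False))   # pairwise correlation matrix

# Auxiliary regression R^2 for each regressor on the others
for j in range(3):
    y_aux = regressors[:, j]
    X_aux = np.column_stack([np.ones(n), np.delete(regressors, j, axis=1)])
    coef, *_ = np.linalg.lstsq(X_aux, y_aux, rcond=None)
    resid = y_aux - X_aux @ coef
    r2 = 1 - np.sum(resid ** 2) / np.sum((y_aux - y_aux.mean()) ** 2)
    print(f"auxiliary R^2 for regressor {j + 1}: {r2:.3f}")
```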

3.5.3 Dealing with multicollinearity


Consider the situation where we discover that multicollinearity exists in our model. Is there
anything that we can do to remedy, or at least help, the situation?

• You could consider dropping one or more of the problem variables, i.e. some of those that
are collinear. This may get rid of a multicollinearity problem, but unfortunately it could
cause another problem. This is due to the fact that in formulating our econometric models
we include variables that economic theory states to be important. By excluding variables
from the equation we may be mis-specifying the model. This in itself causes the estimates
of the remaining variables to be biased. So, you're stuck between a rock and a hard place!
• Given that multicollinearity is a sample problem, it may be eradicated if we use a different
sample of data. Of course, we are likely to be very restricted in this respect as good data
can be hard to come by. But if it is possible to increase the sample size by either increasing
the number of years in the sample or, in the case of cross-sectional data, including more
individuals or countries in the analysis, then this could reduce the scale of the problem.
• It is possible that changing the functional form of the model could help, e.g. there may be
multicollinearity present in a log-linear model that does not appear in a purely linear form.
• If the empirical study that you are interested in is the focus of previous literature, then it
may be possible to use results from these studies to help with a multicollinearity problem.
For example, if empirical studies have already been done in your chosen area of research
then you could use the relevant estimated values from them. By replacing the coefficients
of the problem variables with these previously estimated values, it should be possible to
estimate the remaining parameters in your model with more precision. The problem of
course is that the information you utilise may itself be incorrect. It may hold under the
sample of observations used in that study but not relevant for yours.


Multicollinearity is obviously a potential problem in any econometric exercise for which


detection and remedy can be troublesome. Much research by econometricians has been done
on this problem and the subject is deep. We will not go any deeper.

3.6 Alternative Functional Forms


The models that we have considered so far are both linear in variables and linear in
parameters. We could still use OLS to estimate the parameters in the model

Yi = 1 +  2 X 1i + 3 X 12i +  4 X 1i X 2i +  i
because it is still linear in parameters, even though it is not linear in the variables. In fact, OLS
could be used to estimate the parameters in, for example

Yi = 1 +  2 f1 ( X 1i , X 2i ) + 3 f 2 ( X 1i , X 2i ) +  i

where the f1 (.) and f 2 (.) are any non-linear functions of the variables. However, OLS could
not be used to estimate the parameters directly from, for example

Yi = 1 X 1i 2 X 2i3  i (3)

or

Yi = 1e 2 X1i e 3 X 2 i e i . (4)

These models can however be transformed in such a way that OLS does become applicable.
The transformation involves taking logarithms of the variables. It suffices to note the following
properties of natural logs:

• If $z = \ln x$ then $x = e^z$ where $e$ is the exponential;
• $\ln xy = \ln x + \ln y$;
• $\ln x^b = b\ln x$;
• $\dfrac{d\ln x}{dx} = \dfrac{1}{x}$.

Using the second and third rules here allows us to re-write (3) as

$$\ln Y_i = \alpha + \beta_2\ln X_{1i} + \beta_3\ln X_{2i} + \ln\varepsilon_i \qquad (5)$$


and (4) as

$$\ln Y_i = \alpha + \beta_2 X_{1i} + \beta_3 X_{2i} + \varepsilon_i \qquad (6)$$

(where $\alpha = \ln\beta_1$) and both equations are now linear in the parameters. Model (5) is called a log-linear model because it is linear in the logarithms of all the variables. Model (6) is called a semi-log model because only the dependent variable has been log transformed. So long as the disturbance terms in (5) and (6) satisfy the classical assumptions, the OLS estimators of $\alpha$ (from which one can obtain $\beta_1 = e^{\alpha}$), $\beta_2$ and $\beta_3$ are B.L.U.E.

A feature of model (5) that is often taken advantage of by applied economists is that the slope parameters are interpreted as elasticities. To see this we make use of the fourth log rule. The parameter $\beta_2$ in (5) is

$$\beta_2 = \frac{d\ln Y}{d\ln X_1} = \frac{\frac{1}{Y}\,dY}{\frac{1}{X_1}\,dX_1} = \frac{dY}{dX_1}\cdot\frac{X_1}{Y}$$

This is the definition of an elasticity. This model is often called the constant elasticity model because the $\beta_2$ and $\beta_3$ parameters are constant. Going back to our consumption function example, an alternative functional form could be given by $Y_i = \beta_1 X_i^{\beta_2}\varepsilon_i$, which would be transformed to $\ln Y_i = \alpha + \beta_2\ln X_i + u_i$, where Y is consumption, X is income and the $\beta_2$ parameter is now interpreted as the income elasticity (this is the specification of the consumption function in Computer Practical 3). Remember though that in the linear consumption function $\beta_2$ was the MPC. So depending upon how the model is formulated, the interpretation of the parameters is different.
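
Estimating the log-linear form is just OLS on transformed data. A hedged Python sketch (the consumption and income values are made up; the slope estimated on the logged data is then read as an elasticity):

```python
import numpy as np

# Illustrative consumption (Y) and income (X) data only
X = np.array([100.0, 150.0, 200.0, 260.0, 330.0, 410.0])
Y = np.array([ 90.0, 130.0, 165.0, 210.0, 255.0, 310.0])

# Transform to logs and apply the usual OLS formulae
x, y = np.log(X), np.log(Y)
beta2_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta2_hat * x.mean()

print(f"estimated elasticity (slope in logs): {beta2_hat:.3f}")
print(f"implied beta1 = exp(alpha): {np.exp(alpha_hat):.3f}")
```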

To decide which functional form is better, i.e. linear or log-linear, might be a matter of
economic theory dictating the correct form, or a matter of empirics. We could firstly plot the
data on X against the data on Y to see what the scatter of observations looks like. This can only
give a rough idea of the relationship, but obviously if the observations seem to follow a curve
rather than a straight line, a linear regression is inappropriate. You are not going to get
accurate results if you try to fit a straight line through a set of data that do not follow a straight
line. If however you transform X and Y into logarithms and plot x = ln X against y = ln Y, and
find that the scatter follows a straight line, then maybe the log-linear model is appropriate.
Choosing on the basis of the highest R² value is a bad idea. You should not compare R² or
adjusted R² values for models that have different dependent variables. This is the case here
because one model has dependent variable Y and the other has dependent variable ln Y.
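
To see what estimating a log-linear model involves in practice, here is a minimal sketch in
Python using statsmodels and simulated data; the variable names, sample size and parameter
values (a true income elasticity of 0.9) are purely illustrative and are not taken from the
module's data sets.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)

    # Simulate a constant-elasticity consumption function Y = b1 * X^b2 * e
    income = rng.uniform(5_000, 50_000, size=200)      # X: income (illustrative)
    eps = np.exp(rng.normal(0, 0.1, size=200))         # multiplicative error term
    consumption = 0.8 * income**0.9 * eps              # true elasticity = 0.9

    # Taking logs gives ln Y = alpha + b2 ln X + ln e, which is linear in the
    # parameters and can be estimated by OLS.
    X = sm.add_constant(np.log(income))
    res = sm.OLS(np.log(consumption), X).fit()

    alpha_hat, b2_hat = res.params
    print("estimated income elasticity:", b2_hat)      # should be close to 0.9
    print("implied b1 = exp(alpha):", np.exp(alpha_hat))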


4 Classical assumption violations


Until now we have assumed that the classical linear regression assumptions hold. In this
section we consider violations of two of these assumptions (the no-autocorrelation and
homoskedasticity assumptions respectively) and examine the consequences for the properties
of the OLS estimators of the model parameters. We know that under the classical assumptions
OLS is BLUE, but what happens when these assumptions do not hold?

The classical assumptions are actually rather strong in some contexts. In models in which the
data are observed over time, we find that the errors tend to be autocorrelated, violating
CLRM5. In models where the data are cross-sectional, we find that the errors are often
heteroskedastic, violating CLRM4.

4.1 Autocorrelated errors


4.1.1 Autocorrelated variables
A time series variable is autocorrelated if it is correlated with itself at different points in time.
In notation this is written, for a zero-mean random variable X_t, as

E(X_t X_s) ≠ 0 for t ≠ s,   or   E(X_t X_{t−k}) ≠ 0 for k = 1, 2, …

For example, suppose that the variable follows a first-order autoregressive process (AR(1)).
This is where the variable is modelled as a function of itself from the previous time period,

X_t = ρX_{t−1} + ε_t

where ρ is a parameter called the autocorrelation coefficient, with −1 < ρ < 1. Let's first look at
the variance of this variable:

σ_X² = var(X_t) = var(ρX_{t−1} + ε_t) = ρ² var(X_{t−1}) + var(ε_t) = ρ² var(X_t) + var(ε_t)

     = ρ²σ_X² + σ_ε²,  so that  (1 − ρ²)σ_X² = σ_ε²  and therefore  σ_X² = σ_ε²/(1 − ρ²).

Note: this proof has assumed that X t is homoskedastic, i.e. the variance is the same no matter
what time period we are in ( var ( X t ) = var ( X t −1 ) ). Now let’s look at the covariance between
the variable and its first lag:


cov(X_t, X_{t−1}) = E(X_t X_{t−1}) = E((ρX_{t−1} + ε_t)X_{t−1}) = E(ρX_{t−1}² + ε_t X_{t−1})

                  = ρE(X_{t−1}²) = ρ var(X_{t−1}) = ρ var(X_t) = ρσ_X² = ρσ_ε²/(1 − ρ²).

So, there is a non-zero first-order autocorrelation. We can also see that there is a non-zero
second-order autocorrelation:

cov(X_t, X_{t−2}) = E(X_t X_{t−2}) = E((ρX_{t−1} + ε_t)X_{t−2}) = E((ρ(ρX_{t−2} + ε_{t−1}) + ε_t)X_{t−2})

                  = E(ρ²X_{t−2}² + ρε_{t−1}X_{t−2} + ε_t X_{t−2}) = ρ²σ_ε²/(1 − ρ²)

and in general

cov(X_t, X_{t−j}) = E(X_t X_{t−j}) = ρ^j σ_ε²/(1 − ρ²).

Given that the autocorrelation coefficient is less than 1 in absolute value, ρ^j → 0 as j → ∞.
Hence as we look at autocorrelations further and further into the past, the correlation gets
smaller. This makes sense: a variable is more likely to be correlated with itself in the near past
than in the distant past.

The following are plots of a simulated process X that I generated according to the AR(1) above,
to exhibit no autocorrelation (ρ = 0), positive autocorrelation (ρ > 0) and negative
autocorrelation (ρ < 0). When ρ = 0, the series looks fairly random, with no discernible pattern
or predictability.

When ρ = 0.9 there is strong positive autocorrelation. Here the series is smoother, with runs
of positive and negative values, and looks a bit like a cycle.


When  = −0.9 there is strong negative autocorrelation. Negative autocorrelation shows a


spikier plot in which a positive value in one period is likely to be followed by a negative value
in the next period.
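
The plots appear in the printed notes; if you want to regenerate series of this kind yourself,
the following is a minimal sketch (my own illustration, not the lecturer's code) that simulates
an AR(1) process for ρ = 0, 0.9 and −0.9.

    import numpy as np
    import matplotlib.pyplot as plt

    def simulate_ar1(rho, n=200, sigma=1.0, seed=0):
        """Simulate X_t = rho * X_{t-1} + eps_t with eps_t ~ N(0, sigma^2)."""
        rng = np.random.default_rng(seed)
        eps = rng.normal(0, sigma, size=n)
        x = np.zeros(n)
        for t in range(1, n):
            x[t] = rho * x[t - 1] + eps[t]
        return x

    fig, axes = plt.subplots(3, 1, figsize=(8, 6), sharex=True)
    for ax, rho in zip(axes, [0.0, 0.9, -0.9]):
        ax.plot(simulate_ar1(rho))
        ax.set_title(f"AR(1) with rho = {rho}")
    plt.tight_layout()
    plt.show()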

This sub-section has shown you what autocorrelated variables in general look like. For
econometricians we are not particularly worried about autocorrelation in our variables, unless
the autocorrelation arises in the error terms of our regression models. So if the X variable
above is a regressor in a model then the autocorrelation is not a problem. If it is the
disturbance  from a regression model then this is a concern. The rest of this section therefore
concentrates on autocorrelated errors.

4.1.2 Autocorrelated regression errors


Consider the multiple regression model

Yt = 1 + 2 X 2t + + k X kt +  t (1)

Previously we assumed that the error ε satisfied the classical assumptions and we showed
that the OLS estimators of β1, …, β_k are BLUE. Now let's assume that ε satisfies all of the
classical assumptions except for the "no autocorrelation" assumption (CLRM5 in these notes)
that states that cov(ε_t, ε_s) = 0 for all t ≠ s. Hence the error term now violates this
assumption, so that


cov (  t ,  s )  0 for t  s

This means that the error is correlated with itself in different time periods. Here we are only
going to consider what we call first-order autocorrelation, which arises when the error term
can be written as an AR(1) process, i.e.

ε_t = ρε_{t−1} + u_t     (2)

where ρ is the autocorrelation coefficient and u_t is itself an error term (we assume that the
error u satisfies all of the classical assumptions). Using the information from section 4.1.1, we
know that

cov(ε_t, ε_{t−1}) = ρσ_u²/(1 − ρ²),  cov(ε_t, ε_{t−2}) = ρ²σ_u²/(1 − ρ²),  and in general
cov(ε_t, ε_{t−j}) = ρ^j σ_u²/(1 − ρ²).

Unless  = 0 , the classical assumption is clearly violated here. We know that in order for OLS
estimators to be B.L.U.E, autocorrelation must not exist in  t . So we are interested in
understanding:

(i) the problems that autocorrelation creates for the properties of OLS estimators,

(ii) ways in which we can test for the presence of autocorrelation and

(iii) methods that can be used to resolve these issues.

We will assume here that this is the only assumption to be violated, i.e. that we have
autocorrelation but everything else is OK.

4.1.3 The consequences of autocorrelation


Suppose we wish to estimate a model such as (1) in which the error process can be written as
an AR(1), such as (2). If we ignore the autocorrelation in the errors and just estimate the β
parameters in (1) using OLS, what are the consequences?

• The OLS estimators still have the good property of being unbiased. We know this because
when we proved the OLS estimators were unbiased back in Section 2, we did not need to
use the “no autocorrelation” assumption, i.e. it doesn’t matter whether the errors are
autocorrelated or not, OLS will still be unbiased.
• When the errors are AR(1), the equations for the variances of the OLS estimators are
incorrect. For example, look at the derivation of var(β̂2) in Section 2. Notice that it
includes the term var(Σ(X_t − X̄)ε_t), which we condensed to Σ(X_t − X̄)²σ². But this
could only be done because we were under the classical assumptions and could impose
homoskedastic and non-autocorrelated errors. In this section we can still impose
homoskedasticity but we have autocorrelation. This means that

  var(Σ(X_t − X̄)ε_t) = var((X_1 − X̄)ε_1 + (X_2 − X̄)ε_2 + … + (X_n − X̄)ε_n)

                     = E[((X_1 − X̄)ε_1 + (X_2 − X̄)ε_2 + … + (X_n − X̄)ε_n)²]

                     = E[(X_1 − X̄)²ε_1²] + E[(X_2 − X̄)²ε_2²] + … + E[(X_n − X̄)²ε_n²]
                       + E[(X_1 − X̄)(X_2 − X̄)ε_1ε_2] + … + E[(X_1 − X̄)(X_n − X̄)ε_1ε_n]
                       + E[(X_2 − X̄)(X_1 − X̄)ε_2ε_1] + … + E[(X_2 − X̄)(X_n − X̄)ε_2ε_n] + …

It is the cross-product terms that make this different, and this is a consequence of the
autocorrelated errors. The OLS variance can be shown to be

  var(β̂2) = [σ²/Σ(X_t − X̄)²] × [1 + 2ρ(X_1 − X̄)(X_2 − X̄) + 2ρ²(X_1 − X̄)(X_3 − X̄) + …]

There are many terms inside this bracket that I have not defined. But it serves to show that
if we use the variance estimator from Section 2 in the situation when the errors are AR(1),
we are excluding the whole of the bracketed term and would hence get the wrong
estimated variance.

Remember that when we do hypothesis testing, say a simple t-test, this uses the standard
error of the estimate (square root of its variance) in the denominator. If this standard error
is estimated incorrectly then the t-test will be wrong and we may make incorrect decisions
about whether our variables are significant or not. So, if we use OLS and ignore
autocorrelation in the errors we could seriously affect the inferences we make about the
parameters and variables in our models.
• Even if we use the OLS estimator and the correctly adjusted variance above, OLS is still not
the best estimator as it does not have the smallest variance. There is another estimator,
called a Generalised Least Squares or GLS estimator that provides a smaller variance than
OLS when the errors are autocorrelated. We will discuss this estimator shortly. So, OLS is
inefficient, it is no longer the best, i.e. no longer BLUE.


4.1.4 Testing for autocorrelated errors


We have established that there are serious consequences for OLS estimation when the errors
are subject to autocorrelation. It is important therefore that we know whether the errors are
autocorrelated, otherwise we would simply assume that the errors satisfy the classical
assumptions and use OLS. However, in any given practical situation how do we know that
there is autocorrelation in the errors? A non-rigorous way of detecting autocorrelation in the
errors is to analyse their time plots as in the plots at the start of this section. Of course, the
errors in any regression model are unobserved. Hence to plot graphs like those above one
would have to estimate the regression model first and use the residuals from that regression.
Although not a formal test for autocorrelation, visual inspection of the residuals can give a
rough impression of whether such a problem as autocorrelation exists. If the plot looked
similar to the second, then this may indicate positive autocorrelation. We do need a way of
formally testing for autocorrelation, and some methods are considered here.

• The Durbin-Watson test


The test that we are interested in here, and a very common test for autocorrelation among
applied econometricians, is the Durbin-Watson (DW) test. This is a test for first-order
autocorrelation only and it assumes that the error term is written as in (2), i.e.

ε_t = ρε_{t−1} + u_t

The null hypothesis for this test is that there is no autocorrelation, which in the above
equation means that ρ = 0. Hence the DW test is a test of the following

H0: ρ = 0
H1: ρ ≠ 0

The test statistic is of the form

DW = Σ_{t=2}^n (ε̂_t − ε̂_{t−1})² / Σ_{t=1}^n ε̂_t²  ≈  2(1 − ρ̂)

where the ˆ terms are the residuals obtained from the OLS estimation of the linear regression
model of interest, and ̂ is the estimated value for the coefficient of autocorrelation. Most
econometric software packages automatically compute the DW statistic for you. Once you
have the statistic, you need to know what to do with it, i.e. how to use it to decide whether
you reject or do not reject the null hypothesis. This is similar to the testing procedures that
we have looked at before, where we compare the test statistic to a critical value. Here we use


a DW distribution table. The difference with this particular test is that we have two critical
values, which we denote d_L and d_U, where the L and U represent the Lower and Upper values
that we read from the table. We base our decision on where the DW statistic falls on the
following line (it can only take a value between 0 and 4):

   0 ......... d_L ......... d_U ......... 2 ......... 4 − d_U ......... 4 − d_L ......... 4

   0 to d_L              evidence of positive autocorrelation
   d_L to d_U            inconclusive
   d_U to 4 − d_U        evidence of no autocorrelation
   4 − d_U to 4 − d_L    inconclusive
   4 − d_L to 4          evidence of negative autocorrelation
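
As a rough illustration, the DW statistic is easy to compute from the OLS residuals, and
statsmodels reports it via its durbin_watson function. The regression below uses simulated
data purely for illustration; the d_L and d_U critical values still have to be looked up in a
DW table.

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    # Illustrative data: y regressed on a single x (replace with your own series)
    rng = np.random.default_rng(1)
    x = rng.normal(size=100)
    y = 1.0 + 2.0 * x + rng.normal(size=100)

    res = sm.OLS(y, sm.add_constant(x)).fit()
    e = res.resid

    # DW = sum_{t=2}^n (e_t - e_{t-1})^2 / sum_{t=1}^n e_t^2, roughly 2(1 - rho_hat)
    dw_manual = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
    print(dw_manual, durbin_watson(e))   # the two values coincide
    # Compare the statistic with the d_L and d_U critical values from a DW table.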

There are some drawbacks to this test.

o The inconclusive regions mean that we cannot make a decision about whether there is any
autocorrelation if the DW statistic lies in this region. In most cases these regions will be
small.

o It tests only for 1st order autocorrelation, i.e. where the error term is written as a first order
autoregressive process as in (2). However, there are many different ways that the error
could be represented and still exhibit autocorrelation. For example, it could have 2nd order
autocorrelation, in which we could write the error as a second order autoregression, AR(2),
 t = 1 t −1 + 2 t −2 + ut . If the error does in fact take this form, the DW is not an appropriate
test.

o The test is invalid if one of the regressors in our regression model is the lagged dependent
variable, Y_{t−1}. In this case we would need to use Durbin’s h statistic.

• Durbin’s h test
If the model that you are estimating contains the lag of the dependent variable, as below

Yt = 1 + 2 X 2t + + k X kt + Yt −1 +  t

then the DW statistic is not appropriate. As yet we haven't come across the use of lagged
variables or the concept of including a lag of the dependent variable on the right-hand side of the
equation. This form of dynamic model is very important in time series econometrics and you
will consider such models in the latter weeks of this module.

The DW test in these circumstances might find evidence of no autocorrelation when in fact
there is autocorrelation present in the errors, which is obviously no good to us. The test that


we should use here is the Durbin's h statistic. The null hypothesis that we are testing is the
same as in the DW test but the statistic is calculated from

h = (1 − DW/2) √( n / (1 − n·var(λ̂)) ) = ρ̂ √( n / (1 − n·var(λ̂)) )  ~  N(0,1)

where n is the sample size and var(λ̂) is the variance of the OLS estimator of the parameter
on the lagged dependent variable. It does not matter how many regressors or how many lags
of the dependent variable that we include in the model, the h statistic is the same. Hence we
always use the variance of the coefficient on the first lag of Y. Because this statistic has a
standard normal distribution, the critical value that we use to compare to the value for h is
taken from a standard normal table. If |h| > h_c then you would reject the null hypothesis of no
autocorrelation.
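
A minimal sketch of Durbin's h for an illustrative dynamic regression with simulated data;
the variable names are my own and the statistic is only defined when n·var(λ̂) < 1.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson
    from scipy import stats

    # Illustrative dynamic model: Y_t = 1 + 0.5*X_t + 0.3*Y_{t-1} + e_t
    rng = np.random.default_rng(2)
    n = 200
    x = rng.normal(size=n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = 1.0 + 0.5 * x[t] + 0.3 * y[t - 1] + rng.normal()

    df = pd.DataFrame({"y": y, "x": x})
    df["y_lag"] = df["y"].shift(1)
    df = df.dropna()

    res = sm.OLS(df["y"], sm.add_constant(df[["x", "y_lag"]])).fit()

    nobs = int(res.nobs)
    rho_hat = 1 - durbin_watson(res.resid) / 2      # rho_hat = 1 - DW/2
    var_lam = res.bse["y_lag"] ** 2                 # variance of coefficient on Y_{t-1}

    # Durbin's h; only defined when nobs * var_lam < 1
    h = rho_hat * np.sqrt(nobs / (1 - nobs * var_lam))
    p_value = 2 * (1 - stats.norm.cdf(abs(h)))      # compare |h| with N(0,1)
    print(h, p_value)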

4.1.5 Dealing with autocorrelated errors


If we find no evidence of autocorrelation then all is fine and we can be assured that the OLS
estimators are B.L.U.E (so long as all the other classical assumptions hold). However, if we
detect autocorrelation in the errors of our regression model, we know that the OLS estimates
are inefficient and that the computed standard errors are biased, which means that
hypothesis testing on the parameters may result in incorrect inference. So what can we do?
To estimate the model's parameters precisely, we need to use a different estimation
procedure.

• Generalised Least Squares (GLS)


The idea behind this method is that ultimately we use the OLS procedure, but the model that
we estimate is a transformed version of our original model that had autocorrelated errors.
The model is transformed so that the errors are no longer autocorrelated, which is why it is
OK to use OLS. So, consider that we wish to estimate the following bivariate model

Yt = 1 + 2 X t +  t (3)

and we have performed the appropriate tests which show that the errors are autocorrelated
and that they can be written as ε_t = ρε_{t−1} + u_t. We assume here that the error u_t satisfies all
of the classical assumptions. To use GLS, we firstly transform the model by doing the following:

1. Lag the model by one period and multiply by ρ:

   ρY_{t−1} = ρβ1 + ρβ2 X_{t−1} + ρε_{t−1}

2. Subtract this model from the original so that

   Y_t − ρY_{t−1} = β1(1 − ρ) + β2(X_t − ρX_{t−1}) + ε_t − ρε_{t−1}
   or  Y*_t = β1* + β2 X*_t + u_t

where Y*_t = Y_t − ρY_{t−1} and X*_t = X_t − ρX_{t−1} are the quasi-differences, β1* = β1(1 − ρ) and the
new error term is u_t = ε_t − ρε_{t−1}.

The crucial thing to notice here is that the error term in the transformed model is u_t, which
satisfies the classical assumptions (given our original AR(1) error ε_t = ρε_{t−1} + u_t). Hence if we
make these transformations to our data on Y and X, we can estimate the transformed model
using OLS, because we have not violated the assumptions. Hence the OLS estimates of β1* and
β2 will be B.L.U.E. We can determine β1 once we know β1*. Note that the procedure can be
used in multiple regression models with autocorrelated errors, but we must remember to take
quasi-differences of all of the explanatory variables.

In summary, GLS is essentially OLS applied to a model written in quasi-differences, in which the
errors are not autocorrelated. You will notice however that to perform GLS estimation you
need to know the value of ρ, so that the data can be transformed into the quasi-differences.
In most practical situations this value is unknown. We therefore need a method that is
similar to GLS in set-up but has an added step in which ρ is estimated as well.
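
To make the quasi-differencing step concrete, here is a minimal sketch that assumes ρ is
known (ρ = 0.7 in the simulated example); in practice ρ is unknown, which is exactly why the
Cochrane-Orcutt procedure below is needed.

    import numpy as np
    import statsmodels.api as sm

    def gls_quasi_difference(y, x, rho):
        """OLS on the quasi-differenced model Y*_t = b1(1-rho) + b2 X*_t + u_t."""
        y_star = y[1:] - rho * y[:-1]
        x_star = x[1:] - rho * x[:-1]
        res = sm.OLS(y_star, sm.add_constant(x_star)).fit()
        b1_star, b2_hat = res.params
        b1_hat = b1_star / (1 - rho)        # recover the original intercept
        return b1_hat, b2_hat

    # Illustrative use with simulated AR(1) errors (rho assumed known here)
    rng = np.random.default_rng(3)
    n, rho = 200, 0.7
    x = rng.normal(size=n)
    e = np.zeros(n)
    for t in range(1, n):
        e[t] = rho * e[t - 1] + rng.normal()
    y = 1.0 + 2.0 * x + e

    print(gls_quasi_difference(y, x, rho))   # roughly (1, 2)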

• The Cochrane-Orcutt iterative procedure


One such procedure is called the Cochrane-Orcutt procedure, which is essentially an extension
of GLS. Consider the same set-up as for GLS, i.e. the same model and the same autocorrelated
error process. The procedure follows these steps:

1. Estimate the parameters of the original model (3) using OLS, so that the estimated
model is

Yt = ˆ1 + ˆ2 X t + ˆt

2. Using the observations on the residuals, use OLS to estimate the ρ parameter in the AR
model

   ε̂_t = ρε̂_{t−1} + u_t


so that we then have an estimated value ρ̂.

3. Use this estimated value to transform the variables into their quasi-differences,
Y*_t = Y_t − ρ̂Y_{t−1} and X*_t = X_t − ρ̂X_{t−1}.

4. Now use OLS to estimate the parameters in the transformed model

Y*_t = β1* + β2 X*_t + u_t
where u_t satisfies the classical assumptions.

5. Use the residuals from this model to repeat step 2, so that we get another estimate of
ρ, denoted ρ̂⁽²⁾ (the second estimate).

6. Repeat step 3 using ρ̂⁽²⁾, so that Y**_t = Y_t − ρ̂⁽²⁾Y_{t−1} etc., and use OLS again to estimate the
parameters in a model that regresses Y**_t on X**_t.

7. Continue repeating these steps until the consecutive estimates for the model
parameters and ρ change by a very small amount. When this happens we say that the
estimates have converged.

This is why the procedure is said to be iterative, because we repeat the steps until
convergence. Once convergence has occurred, these are the final estimates of our unknown
parameters.
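
A minimal sketch of the iterative idea with simulated data. The exact residuals that are fed
back at each pass differ slightly across presentations (this version re-computes the residuals
of the original equation with the latest coefficient estimates), and statsmodels' GLSAR class
with its iterative_fit method packages a similar procedure, so in practice you would rarely
code this by hand.

    import numpy as np
    import statsmodels.api as sm

    def cochrane_orcutt(y, x, tol=1e-6, max_iter=50):
        """Iterate: OLS betas -> estimate rho from residuals -> OLS on quasi-differences."""
        X = sm.add_constant(x)
        b = sm.OLS(y, X).fit().params                        # step 1: OLS on the original model
        rho = 0.0
        for _ in range(max_iter):
            e = y - X @ b                                    # residuals of the original equation
            rho_new = sm.OLS(e[1:], e[:-1]).fit().params[0]  # step 2: estimate rho
            y_star = y[1:] - rho_new * y[:-1]                # step 3: quasi-differences
            x_star = x[1:] - rho_new * x[:-1]
            res = sm.OLS(y_star, sm.add_constant(x_star)).fit()   # step 4
            b = np.array([res.params[0] / (1 - rho_new), res.params[1]])
            if abs(rho_new - rho) < tol:                     # step 7: stop at convergence
                break
            rho = rho_new
        return b, rho_new

    # Illustrative use: simulate a model with AR(1) errors and recover (b1, b2, rho)
    rng = np.random.default_rng(3)
    n, rho_true = 200, 0.7
    x = rng.normal(size=n)
    e = np.zeros(n)
    for t in range(1, n):
        e[t] = rho_true * e[t - 1] + rng.normal()
    y = 1.0 + 2.0 * x + e

    print(cochrane_orcutt(y, x))    # roughly ([1, 2], 0.7)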

4.1.6 Artificial autocorrelation


It is appropriate to mention that a mis-specified model can make an error process look
autocorrelated. We have mentioned in previous sections that the error term often takes on
the role of a proxy variable for all of the effects that we have not included in the main part of
the model. Suppose that the true model is of the form

Yt = 1 + 2 X 2t + 3 X 3t + t

where  is the error term. Suppose instead that we exclude X 3 and estimate

Yt = 1 + 2 X 2t +  t

so that effectively  t = 3 X 3t + t . Any pattern in X 3 would therefore show up in a plot of the


residuals. If X 3 has cyclical behaviour then the residuals may exhibit positive autocorrelation.


This should however be viewed as false or artificial autocorrelation because it has arisen out
of model mis-specification rather than because the errors are specifically of the form given in
(2).

It is probably better to make sure that any autocorrelation present in the error term is not
artificial, before attempting GLS or Cochrane-Orcutt. The GLS and Cochrane-Orcutt
procedures are reserved for autocorrelation that arises because the error term is inherently
in the form of an AR process. They are not really appropriate if the autocorrelation is borne
out of model mis-specification. This is often caused by an inappropriate dynamic structure in
the model, i.e. not enough lags of the variables are included.

So suppose that we wish to estimate model (3) and find that the error is autocorrelated. A
method that applied econometricians usually try first is to add a lagged dependent variable
into the regression model, i.e.

Yt = 1 + 2 X t + 3Yt −1 +  t

and estimate this model using OLS. If this resolves the autocorrelation problem, checked by
using Durbin's h statistic, then there is no need to use GLS or Cochrane-Orcutt.

4.2 Heteroskedastic errors


Here we are concerned with the effects of a violation of the classical assumption that states
that the errors are homoskedastic. Again let’s consider a multiple regression model

Yi = 1 + 2 X 2i + 3 X 3i + + k X ki +  i (1)

The assumption of homoskedasticity states that var (  i ) =  2 for all i , which means that the
variance of each  i is the same, i.e. var ( 1 ) = var (  2 ) = = var (  n ) =  2 (we have changed
back to the i subscript here because heteroskedastic errors tend to be a problem in cross-
sectional models). This condition must hold in order for the OLS estimators of the 
parameters in model (1) to be B.L.U.E. If this condition is violated then the errors are said to
be heteroskedastic, which can be represented by

var (  i ) =  i2


This implies that var(ε_1) = σ_1², var(ε_2) = σ_2², …, var(ε_n) = σ_n², i.e. each error term is
allowed to have a different variance, which is in clear contradiction to the classical
assumption.

To show you how heteroskedasticity can take effect in cross-sectional studies, let's look at an
example. Consider a bivariate model in which we analyse the effect that income X has on our
savings Y. We would expect the relationship to be upward sloping to show that as income
increases, savings increase. If this model exhibits homoskedastic or constant variance errors,
then the distributions of the observations around the regression line at each X value would be
dispersed to the same degree. Represented on a graph we have

[Figure: the distribution of savings around the regression line has the same spread at every
income level (homoskedasticity). Axes: income, savings, density.]

Under the homoskedasticity assumption, the spread of savings at each income level will be
the same. This is represented by the distributions in the graph having the same dispersions or
variance. But in the real world is this likely to be the case? Isn't it more likely that as income
increases, we would observe a greater spread of savings? People on low incomes tend to save
less because they have little money left over after they have bought all of the necessities for
living. Hence the spread of savings for all those earning low incomes would be small. However,
looking at the savings patterns of people who earn a lot, you would find that some of them
are likely to spend most of it and hence save only a small amount and others will save a lot.
This is because human behaviour is random, so we wouldn't expect everyone to do the same
thing under the same conditions. Hence in actuality the graph is more likely to look like this:


[Figure: the distribution of savings around the regression line spreads out as income
increases (heteroskedasticity). Axes: income, savings, density.]

Notice how the spread of the distribution increases as income increases. This would be the
case with heteroskedastic errors. Hence, in this scenario we are more likely to observe
heteroskedastic errors than homoskedastic errors.

So we can see how heteroskedasticity can arise quite naturally in cross-sectional studies. The
form of heteroskedasticity in the error discussed above was related to the explanatory
variable, because as income increased the variance of the error increased. We could therefore
write this relationship as

var (  i ) =  i2 =  2 f ( X i )

i.e. the variance of the error is a function of the X variable. For example, if var (  i ) =  2 X i
then a plot of the error terms against X may look like this

[Scatter plot: residuals ε_i against X, centred on zero, with the spread fanning out as X
increases.]

whereas if var(ε_i) = σ²X_i² it could look like this:


[Scatter plot: residuals ε_i against X for the case var(ε_i) = σ²X_i², again centred on zero and
fanning out as X increases.]

Of course any form of non-constant variance can represent heteroskedasticity, not just those
considered above. As with autocorrelation, we can inspect plots of the error terms to get a
rough impression of whether heteroskedasticity is present. This again would involve
estimating the regression model, like (1) and plotting the residuals.
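
A quick way to produce a residual plot of this kind in practice is sketched below, using
simulated data whose error variance is proportional to X (the numbers are illustrative only).

    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    x = rng.uniform(1, 10, size=300)
    e = rng.normal(0, np.sqrt(x))           # var(e_i) = sigma^2 * X_i with sigma^2 = 1
    y = 1.0 + 2.0 * x + e

    res = sm.OLS(y, sm.add_constant(x)).fit()

    plt.scatter(x, res.resid, s=10)
    plt.axhline(0, linewidth=1)
    plt.xlabel("X")
    plt.ylabel("OLS residuals")
    plt.title("Residuals fanning out as X increases")
    plt.show()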

This section will take on a very similar structure to the section on autocorrelation. This is
because we want to ask the same kinds of questions, i.e. what are the problems caused by
heteroskedasticity for OLS estimation, how can we test for heteroskedasticity and what
measures can we take to deal with it?

4.2.1 The consequences of heteroskedastic errors


These are identical to the effects of autocorrelation on OLS estimators. Suppose that we wish
to estimate a model such as (1) in which the error process is heteroskedastic. If we ignore the
heteroskedasticity and use OLS to estimate the β parameters, what are the consequences?

• The OLS estimators still have the good property of being unbiased. Again when we proved
the unbiasedness of OLS in Section 2, we did not need to refer to the homoskedasticity
assumption, i.e. it doesn’t matter whether the errors are heteroskedastic, OLS is still
unbiased.
• When the errors are heteroskedastic, the equations for the variances of the OLS estimators
are incorrect. Notice that in the variance derivation in Section 2, the term var(ε_i) was
replaced with σ² as we assumed homoskedastic errors. Adjusting the variance equation
to account for heteroskedastic errors leaves us with the equation

  var(β̂2) = Σ(X_i − X̄)² σ_i² / (Σ(X_i − X̄)²)²

As in the case of autocorrelated errors, the existence of heteroskedasticity means that if


we go ahead and use the original OLS variance estimator (perhaps because we are not
aware that the errors are heteroskedastic), then this will seriously affect our hypothesis
tests.
• Even if we use OLS and the heteroskedasticity adjusted variance above, OLS is still not the
best estimator as it does not have the smallest variance. OLS is no longer efficient. There
is another estimator called Weighted Least Squares or WLS that is unbiased and has a
smaller variance than OLS.

As with autocorrelation, the consequences of heteroskedasticity are serious for OLS so it is


best to be sure that we know whether we are facing heteroskedastic errors or not. We need
therefore to formally test for heteroskedastic errors.

4.2.2 Testing for heteroskedasticity


We will cover just a few of the range of tests available for the detection of heteroskedasticity.

• White’s test
We considered above the situation in which the error variance was dependent upon the
explanatory variable X. In a multiple regression model it is possible for the error variance to
be a function of all of the explanatory variables, i.e.

var (  i ) =  i2 =  2 f ( X 1i , X 2i , , X ki )

In this test the function f (.) includes all variables, their squared values and their cross
products. The idea is that we estimate a regression that relates the error variance to this f (.)
function. Of course we do not observe these variances, i.e.  i2 is unobservable, so we use the
squared residuals as a proxy ˆi2 . To show you how this works, assume that we want to
estimate the following model and we want to test for heteroskedasticity in the errors

Yi = 1 + 2 X 2i + 3 X 3i +  i

We estimate this model using OLS so that we can obtain the residuals ε̂_i. We then run another
regression, using the squared residuals as the dependent variable, as follows

ε̂_i² = γ1 + γ2 X_2i + γ3 X_3i + γ4 X_2i² + γ5 X_3i² + γ6 X_2i X_3i + u_i     (4)


where ui is an error term that satisfies the classical assumptions.

The White’s test is a test of the null hypothesis of homoskedasticity against the alternative of
heteroskedasticity. In model (4), homoskedasticity occurs if all of the coefficients on the Xs
are zero. Hence the null hypothesis takes the form

H0: γ2 = γ3 = γ4 = γ5 = γ6 = 0
H1: at least one γ_j ≠ 0

To see why the restrictions under the null imply homoskedasticity, when we impose them on
(4) we get

ε̂_i² = γ1 + u_i     (5)

and when we take the expected value of this we have

var(ε̂_i) = E(ε̂_i²) = E(γ1 + u_i) = E(γ1) + E(u_i) = γ1

Hence the variances of the residuals under the null are constant, i.e. homoskedastic. If at least
one of the γ coefficients is non-zero then we have heteroskedastic errors. For example, if
γ3 ≠ 0 then

ε̂_i² = γ1 + γ3 X_3i + u_i

such that

var(ε̂_i) = E(ε̂_i²) = γ1 + γ3 X_3i

which will change as the values of X_3 change.

But how do we test this null hypothesis? Well you should recognise the form of the null
hypothesis from the section on hypothesis testing in multiple regression models. It is
effectively a test of the overall significance of regression (4) for which the test statistic is an F
statistic. Only for such a test of overall significance can you use both types of F statistic, i.e.


F = [R²/(m − 1)] / [(1 − R²)/(n − m)]  ~  F_{m−1, n−m}

where R² is the goodness of fit value from the estimated regression (4), n is the sample size
and m is the number of parameters to be estimated in (4). In the above example, m = 6. Or
one can use

F = [(RSS_R − RSS_U)/(m − 1)] / [RSS_U/(n − m)]  ~  F_{m−1, n−m}

where the unrestricted model is (4) and the restricted model is (5). As in all F testing
procedures, if F > F^c, where F^c is the appropriate critical value, then we would reject the
null hypothesis. In this case it would imply that the errors were heteroskedastic.
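
A sketch of the White’s test carried out "by hand" for a two-regressor model with simulated
heteroskedastic errors; statsmodels also provides het_white in statsmodels.stats.diagnostic,
which automates the auxiliary regression. All names and numbers below are illustrative.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    n = 300
    x2 = rng.uniform(1, 10, size=n)
    x3 = rng.uniform(1, 10, size=n)
    y = 1.0 + 0.5 * x2 + 0.5 * x3 + rng.normal(0, x2)   # error variance depends on x2

    # Step 1: OLS of the model of interest, keep the residuals
    res = sm.OLS(y, sm.add_constant(np.column_stack([x2, x3]))).fit()
    e2 = res.resid ** 2

    # Step 2: auxiliary regression (4): levels, squares and the cross product
    aux_X = sm.add_constant(np.column_stack([x2, x3, x2**2, x3**2, x2 * x3]))
    aux = sm.OLS(e2, aux_X).fit()

    # Step 3: F test of the overall significance of the auxiliary regression
    print("F statistic:", aux.fvalue, "p-value:", aux.f_pvalue)
    # A small p-value (F above the critical value) rejects homoskedasticity.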

• The variant of White’s test


This test uses a similar approach to the White’s test. However, when we estimate our model
using OLS to obtain the residuals, we also obtain the fitted values of Y, which are denoted Ŷ_i,
where Ŷ_i = β̂1 + β̂2 X_2i + … + β̂_k X_ki. The test proceeds by estimating another regression of the
form

ε̂_i² = δ1 + δ2 Ŷ_i² + u_i     (6)

Although it does not look like it at first glance, this regression is actually similar to that used in
the White’s test. Suppose our original model has two explanatory variables as it did in the
above sub-section, i.e. Y_i = β1 + β2 X_2i + β3 X_3i + ε_i. Then Ŷ_i² implicitly contains the variables
X_2i, X_3i, X_2i², X_3i² and X_2i X_3i.

Again the test is of the null of homoskedasticity against the alternative of heteroskedasticity,
which for this form of model amounts to testing

H0: δ2 = 0
H1: δ2 ≠ 0

To see why the null represents homoskedasticity here, impose the restriction on (6) to find
that ε̂_i² = δ1 + u_i. This is the same form as the restricted version in the White’s test.


We have spent a long time on testing hypotheses of this sort, that is tests of a single
parameter, and therefore you should immediately realise that we can actually use a simple t
statistic. Hence we have

t = δ̂2 / se(δ̂2)  ~  t_{n−2}

and we compare this value to the critical value from a t table. This is a two-sided test so we
would reject the null and conclude that the errors are heteroskedastic if |t| > t_{n−2}^{α/2}.
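
A sketch of the variant under the same illustrative set-up: regress the squared residuals on
the squared fitted values and examine the t statistic on the slope.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    n = 300
    x2 = rng.uniform(1, 10, size=n)
    x3 = rng.uniform(1, 10, size=n)
    y = 1.0 + 0.5 * x2 + 0.5 * x3 + rng.normal(0, x2)    # heteroskedastic errors

    res = sm.OLS(y, sm.add_constant(np.column_stack([x2, x3]))).fit()

    # Regression (6): squared residuals on squared fitted values
    aux = sm.OLS(res.resid ** 2, sm.add_constant(res.fittedvalues ** 2)).fit()

    t_stat, p_val = aux.tvalues[1], aux.pvalues[1]   # t test on the slope delta_2
    print(t_stat, p_val)     # |t| above the critical value => heteroskedasticity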

• Goldfeld-Quandt test
Both the White’s test and its variant are useful if the form of heteroskedasticity is unknown,
i.e. the econometrician suspects that the errors are heteroskedastic but is not sure of what
form the heteroskedasticity takes and which X variables are causing the problem.

The Goldfeld-Quandt test however is useful if it is known that the variance of the disturbance
term changes with the value of a particular regressor X_i. Consider estimating the model
Y_i = β1 + β2 X_2i + β3 X_3i + ε_i, where the econometrician has some suspicion that the error
variance is changing and that it depends upon the values of X_2i. To run this test, the data
must be re-ordered in ascending order of the variable upon which the variance is thought to
depend, in this case the X_2i variable. The data are then split into two groups of size n1 and n2
which correspond to small values for X_2i and large values for X_2i respectively. Usually
n1 + n2 < n because some middle observations are left out; this leaves a clear distinction
between the samples involving small and large values.

Once the data have been ordered and the subsamples created, two separate regressions are
estimated by OLS, using the different samples. These are of the form

Yi a = 1a +  2a X 2ai +  3a X 3ai +  ia for i = 1, , n1


Yi b = 1b +  2b X 2bi +  3b X 3bi +  ib for i = n − n2 + 1, ,n

where the superscripts denote the two regressions with the different samples, a and b. The
estimated variances of the disturbances are obtained from each regression. We will denote
these as


σ̂1² = Σ(ε̂_i^a)² / (n1 − k)   and   σ̂2² = Σ(ε̂_i^b)² / (n2 − k)

where k is the number of parameters in the regressions which in the above example is 3. The
idea is that a comparison of these values should indicate whether or not the variance of the
error term is different in the two sub-samples. The null hypothesis is again of homoskedasticity
and the alternative of heteroskedasticity, i.e.

H0: σ1² = σ2²
H1: σ1² < σ2²

The test statistic takes the form

F_GQ = σ̂2² / σ̂1²  ~  F_{n2−k, n1−k}

As with all F tests, if F > F^c we reject the null hypothesis in favour of the alternative, i.e., we
reject homoskedasticity in favour of heteroskedasticity.

The choice of the sub-sample sizes n1 and n2 is somewhat arbitrary. It is usual to set n1 = n2 and
to make them greater than a third of the total sample size, with observations missing in the
middle. Notice what happens to the F statistic when n1 = n2:

F_GQ = σ̂2² / σ̂1² = Σ(ε̂_i^b)² / Σ(ε̂_i^a)² = RSS_b / RSS_a

Another thing to note is that the form of the test statistic assumes that the variance is
increasing with X_2i, such that σ̂1² < σ̂2² and the statistic is greater than 1. If this is not the case
and the variance is decreasing in the X variable, the test should still be set up so that the
statistic is greater than 1, so that F_GQ = σ̂1²/σ̂2² and the alternative hypothesis is that σ2² < σ1².
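
A sketch of the Goldfeld-Quandt steps coded directly with simulated data (statsmodels offers
het_goldfeldquandt as a packaged alternative); the ordering variable, subsample sizes and
numbers used here are illustrative choices.

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(6)
    n = 300
    x2 = rng.uniform(1, 10, size=n)
    x3 = rng.uniform(1, 10, size=n)
    y = 1.0 + 0.5 * x2 + 0.5 * x3 + rng.normal(0, x2)   # variance rises with x2

    # Re-order by the suspect regressor and drop the middle third
    order = np.argsort(x2)
    y, x2, x3 = y[order], x2[order], x3[order]
    n1 = n2 = n // 3
    k = 3                                               # parameters per regression

    def rss(yy, xx2, xx3):
        X = sm.add_constant(np.column_stack([xx2, xx3]))
        return np.sum(sm.OLS(yy, X).fit().resid ** 2)

    rss_a = rss(y[:n1], x2[:n1], x3[:n1])               # small-x2 subsample
    rss_b = rss(y[-n2:], x2[-n2:], x3[-n2:])            # large-x2 subsample

    F_gq = (rss_b / (n2 - k)) / (rss_a / (n1 - k))      # = RSS_b / RSS_a when n1 = n2
    F_crit = stats.f.ppf(0.95, n2 - k, n1 - k)
    print(F_gq, F_crit)                                 # F_gq > F_crit => heteroskedasticity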

4.2.3 Dealing with heteroskedastic errors


As with autocorrelated errors, if we find no evidence of heteroskedasticity then we can
estimate our regression model using OLS knowing that the estimators are B.L.U.E (again
assuming that all other classical assumptions hold). But if we do find evidence of


heteroskedasticity then we need to consider estimating our model by some alternative


procedure to OLS because we know that the consequences are serious.

It is sometimes the case that transforming the variables into logarithms can transform a
heteroskedastic error into a homoskedastic error. If this doesn't work, or perhaps
heteroskedasticity is a problem when your model is already specified in logs, then an
alternative procedure is required. Here are a couple of suggestions.

• Weighted Least Squares (WLS)


This is a similar approach to the Generalised Least Squares method suggested for dealing with
autocorrelation and in fact it is often referred to as GLS. The idea is to transform the model so
that the errors are no longer heteroskedastic and estimate the parameters of the transformed
model using OLS. Suppose we wish to estimate a bivariate regression model,

Yi = 1 + 2 X i +  i

but have found heteroskedasticity in the errors and that the heteroskedasticity is of the
following form

var (  i ) =  i2 =  2 Z i

where  is just a number. I am not being specific about the variable Z, it could be the
explanatory variable but it may be some other variable. Whatever Z is, it is obvious that the
variance changes as the values of Z change. It is possible to get rid of the heteroskedasticity
from the error term by dividing the regression equation through by the square root of the Z
variable that appears in the variance function, so that

Y_i/√Z_i = β1 (1/√Z_i) + β2 (X_i/√Z_i) + v_i

where v_i = ε_i/√Z_i. It is not obvious at first glance how this has solved the problem. We need
to analyse the variance of the error more closely and we need to know a little bit about how
the variance operator works. The only property that we need to understand is that if

var(ε) = σ²  then  var(cε) = c² var(ε) = c²σ²


where c is a non-random term. Applying this rule to the error in the transformed model we
have

var(v_i) = var(ε_i/√Z_i) = (1/√Z_i)² var(ε_i) = (1/Z_i) var(ε_i) = (1/Z_i) σ² Z_i = σ²

Hence although the variance of ε is heteroskedastic, that of v is not. This transformed model
can therefore safely be estimated using OLS. For example, if var(ε_i) = σ² Z_i² then the
appropriate transformation would be to divide the regression model by Z.

The problem of course is in identifying a variable or variables that create the


heteroskedasticity in the first place. It is possible to identify if the heteroskedasticity is a
function of one of the explanatory variables from the regression model by plotting the
residuals against each of the regressors in turn. The shape of these plots should help
determine which explanatory variables affect the error variance and what form the
relationship takes. Examples of such plots were given at the start of this section. The method of
estimation would then be to use OLS to estimate the model once transformed by the
appropriate X variable, in a similar way to the method involving Z above.
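
A sketch of WLS when the variance is assumed proportional to a known variable Z (here
Z = X purely for illustration): either divide the equation through by √Z and apply OLS, or
equivalently use statsmodels' WLS with weights 1/Z.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    n = 300
    x = rng.uniform(1, 10, size=n)
    z = x                                          # assume var(e_i) = sigma^2 * Z_i with Z = X
    y = 1.0 + 2.0 * x + rng.normal(0, np.sqrt(z))

    # Manual transformation: divide the whole equation by sqrt(Z)
    w = np.sqrt(z)
    y_star = y / w
    X_star = np.column_stack([1 / w, x / w])       # transformed intercept and slope regressors
    res_manual = sm.OLS(y_star, X_star).fit()

    # Equivalent built-in approach: WLS with weights proportional to 1/Z
    res_wls = sm.WLS(y, sm.add_constant(x), weights=1 / z).fit()

    print(res_manual.params, res_wls.params)       # the two sets of estimates coincide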

• White's heteroskedasticity-consistent variance estimator


We showed in Section 2 that the variance of the OLS slope estimator from a bivariate
regression model has the form var(β̂2) = σ²/Σ(X_i − X̄)². However, this was derived under the
assumption of homoskedasticity. When heteroskedasticity is present in the errors, the
variance of the OLS slope estimator is actually

var(β̂2) = Σ(X_i − X̄)² σ_i² / (Σ(X_i − X̄)²)²     (7)

ˆ ˆ2 =
Therefore if we use the normal variance estimator var ( )
( Xi − X )
ˆ 2
2 where ˆ 2 = 
ˆi2
n − 2 , this

would be a biased estimator of the true variance and hypothesis tests and confidence intervals
will be incorrect. White shows that a consistent estimator of the correct variance is

( ) (
X − X ) ˆ
2 2

ˆ ˆ2 =
var
i i
(8)
(( X − X ) ) i
2 2

which simply replaces the variance  i2 by the squared residuals ˆi2 . This being a consistent
estimator of the true OLS variance parameter means that as the sample size increases, this
variance estimator (8) will tend to its true value. It must be remembered however, that even


though this heteroskedasticity-consistent variance estimator is better than the usual OLS
variance estimator, it is still not efficient. For efficient estimation one would do better with a
WLS type estimator.
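
In practice you rarely compute (8) by hand: statsmodels reports White's heteroskedasticity-
consistent standard errors if a robust covariance type is requested when fitting. A sketch
with simulated, purely illustrative data:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(8)
    x = rng.uniform(1, 10, size=300)
    y = 1.0 + 2.0 * x + rng.normal(0, x)             # heteroskedastic errors

    X = sm.add_constant(x)
    res_usual = sm.OLS(y, X).fit()                   # classical (homoskedastic) std errors
    res_robust = sm.OLS(y, X).fit(cov_type="HC0")    # White's estimator, as in (8)

    print(res_usual.bse)      # these standard errors are biased here
    print(res_robust.bse)     # heteroskedasticity-consistent standard errors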
