Econometrics Lecture Notes Booklet
ECONOMETRICS
(08 29172)
What is Econometrics?
Econometrics is the interaction of economic theory, observed data and
statistical methods.
Before we make a start on our econometric journey, it is useful to explain to you why
econometrics is important, without technical details. It is perhaps an oversimplified analysis
and makes the job of an econometrician look much less complicated than it is in practice, but
it should give you a general view of what econometricians do and why it is so useful.
Economic theory tends to make qualitative statements about the relationships between
economic variables, e.g. we know that if the price of a good decreases then the demand for
that good should increase, ceteris paribus, or that the more education an individual receives,
the more money they will earn. These theories do not provide a numerical measure of the
relationship; they simply state that the relationship is negative or positive. It is the
econometrician's job to provide the numerical or empirical content to the economic theory.
Econometricians are therefore responsible for the empirical verification of economic theories.
To do this they use mathematical equations to express economic theories and use real
economic data to test these theories.
Let’s consider each step using a particular economic theory, the Keynesian theory of
consumption.
1. The theory is that on average, people increase their consumption as their income
increases, but by a smaller amount than the increase in income. Essentially, Keynes was
stating that the relationship is positive (income and consumption move in the same
direction) and that the Marginal Propensity to Consume (MPC) which measures how
much of a rise in income gets consumed, is less than 1 (consumption increases by less
than the increase in income).
2. The economist might suggest a simple way to specify this Keynesian consumption
function using the equation of a straight line as follows:
Y = β₁ + β₂X (1)

In this function we call Y the dependent variable (in this case consumption) and X is the
explanatory variable (in this case income). Equation (1) explicitly states the direction of
causality, i.e. changes in income determine changes in consumption, not the other way
around. The terms β₁ and β₂ are the parameters of the model. The β₁ parameter is the
intercept coefficient and β₂ is the slope coefficient. The slope is very important as it
reflects how much influence the X variable has on the Y variable. In the consumption
function above, β₂ is interpreted as the MPC because it is the coefficient on income and
so directly reflects how income affects consumption. We would expect 0 < β₂ < 1 to signal
both the positive relationship between income and consumption (β₂ > 0) and the MPC
being less than one (β₂ < 1).
[Figure: the consumption function line, with intercept β₁, slope β₂ = MPC, and X (income) on the horizontal axis.]
3. An exact relationship like (1) cannot capture the actual consumption behaviour
of each individual. To reflect that the economic relationship is not exact, we transform
(1) into an econometric model by adding a disturbance or error term so that

Y = β₁ + β₂X + ε (2)

The term ε is a random variable, also called a stochastic variable. It is used to represent
all of the factors that affect consumption that we have not included explicitly in the model
(maybe because they are variables that cannot be measured or observed) as well as the
fact that there is inherent randomness in people’s behaviour.
4. The next step is to collect data that are relevant to the model. In this example, we would
collect data on consumption and income. You can find many sources of economic data
on the internet. For the consumption function example, one could do a micro-level
analysis by using data on the income and consumption behaviour of the individuals living
in a particular country. This would be called a cross-sectional regression. Or one could do
a macro-level analysis using aggregate economy data where the data are observed over
time, say 1950-2016. This would be a time series regression.
5. Once you have specified your model and have an appropriate set of data, we bring the
two together and estimate the unknown parameters β₁ and β₂ in the regression model.
This is one of the main topics of this module and we will deal with the procedures used
for estimation in detail later. For now, let’s suppose that we have applied an estimation
technique and our estimates are ˆ = 184.28 and ˆ = 0.71 so that the estimated model
1 2
can be written as
where a ^ denotes an estimated value. The i subscript here denotes the individuals over
which the data are observed. If we have 1000 people in our sample, so 1000 pairs of
observations of income and consumption, then i = 1, ..., 1000. We have now provided
some empirical content to Keynes' theory. The parameter β₂ represents the MPC, and
this has been estimated to have a value of 0.71. This means that for every £1 increase in
income, on average this leads to an increase in consumption of 71p (assuming the data
are for individuals who live in the UK).
6. Once we have estimated the parameters as above, we can test various hypotheses to see
if the economic theory actually holds. In this case, Keynes' theory was that the MPC is
positive but less than 1. Although the value obtained is 0.71 and therefore satisfies the
requirements, we need to ensure that the value is sufficiently above 0 and sufficiently
below 1 that it could not have occurred by chance or because of the particular data set
that we have used. What we want to make sure of is that 0.71 is statistically above 0 and
statistically below 1. This is where hypothesis testing comes in. Again we will not go into
details here but this does form another major component of this module.
7. Beyond this inference, once we have shown that the economic theory stands up to
empirical scrutiny, econometricians can use their estimates to predict values for the
dependent variable based on values of the explanatory variable(s). Ultimately the
inference may help government to formulate better policies. From simply knowing the
value estimated for the MPC the government can manipulate the control variable X to
produce a desired level for the target variable Y.
Module overview
In this module, we will cover some of the aspects from the description of econometrics in the
previous section. We will not focus too much attention on parts 1 – 4 from the list. On the
whole, in this part of the module we consider generic econometric models, specified as simple
bivariate or multiple regressions and do not often tie these to any particular economic theory
or model. Further, whilst we discuss a little the concepts of what makes a good set of data,
we do not consider where data can be accessed.
In general, the main content of the weeks 1-5 material is about: (i) the estimation of simple
regression models; (ii) the statistical properties that we would like these estimators to have;
(iii) discussion of the conditions required of the regression model in order that the estimators
have these properties; (iv) discussion of what happens to the properties of estimators when
some of these model conditions cannot be met; and (v) how to test various types of
hypotheses within different models. Some basic experience of the empirical side of
econometrics, with the application of economic data to simple economic models, will be
provided in the Eviews workshops.
Before we can do any of that, it is appropriate to refresh our memories on some basic
statistical concepts. Many of the topics in the review Section 1 were considered in the year 1
Applied Economics and Statistics module and so should be fairly familiar to you.
Section 1: A Review of Statistical Concepts
The probabilities associated with the values taken by random variables can be depicted on
graphs, called density and distribution functions. The density function of a random variable X
is often denoted f ( x ) and might look something like this:
[Figure: a density function f(x), with the area under the curve between two values a and b shaded.]
The x-axis represents the continuous range of values over which the random variable X exists
and the y-axis is the value of the density function. The probabilities associated with variable
X are measured by areas under the density function (under the curve). So, for example, the
probability that the value of X will be between the values a and b is equal to the area under the curve between a and b.
A feature of random variables and their densities is that the area under the whole density is
equal to unity, i.e. ∫_{−∞}^{∞} f(x) dx = 1. Why is this? Let’s suppose that the variable can only take
on values in the range 0 to 10. Then the probability of the variable taking on a value between
0 and 10 has to be 1 and therefore ∫_{0}^{10} f(x) dx = 1.
Related to the density function is the distribution function, often denoted F(x), which depicts
the probability that X takes on values less than x, i.e. F(a) = P(X ≤ a) = ∫_{−∞}^{a} f(x) dx, where
f(x) is the density function. It might look something like this:
[Figure: a distribution function F(x), increasing from 0 to 1, with the value F(a) marked at the point a.]
So the distribution function point F ( a ) represents the area under the density function up to
the value a.
Mean (or expected value): often denoted as E(X) or simply μ, this measures the central
tendency of the density function of variable X. You could think of it as being the point at which
the function is balanced, where there is equal mass on either side of this point. The expected
value is given by
E(X) = ∫_{−∞}^{∞} x f(x) dx (1)
For a symmetric distribution this would correspond to the value of X for which the density is
symmetric on either side, i.e. the center point.
[Figure: a symmetric density f(x) with its mean E(X) at the centre.]
The function E (.) is the expectations operator. To calculate an expectation you take
whatever is inside the bracket and use it to multiply with the density function, then integrate.
So in general
E(g(X)) = ∫_{−∞}^{∞} g(x) f(x) dx
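As a quick illustration of the expectations operator (an aside, not part of the module's Eviews work), the sketch below assumes Python with scipy is available and evaluates E(X), E(X²) and hence var(X) numerically for a variable that is uniform on [0, 10], whose density is f(x) = 1/10 on that range.

# Numerical illustration of E(g(X)) = ∫ g(x) f(x) dx for X ~ Uniform(0, 10)
from scipy.integrate import quad

f = lambda x: 1.0 / 10.0                         # density of X on [0, 10]

E_X, _  = quad(lambda x: x * f(x),    0, 10)     # E(X)   -> 5.0
E_X2, _ = quad(lambda x: x**2 * f(x), 0, 10)     # E(X^2) -> 33.33
var_X = E_X2 - E_X**2                            # var(X) -> 8.33

print(E_X, E_X2, var_X)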
Variance: this is a measure of the dispersion of the density around the mean, i.e. how spread
out is the density function. It is given by
var(X) = E[(X − μ)²] = ∫_{−∞}^{∞} (x − μ)² f(x) dx. (2)
[Figure: two densities with the same mean E(X) but different variances, σ₁² and σ₂²; the second is more spread out.]
Both of these distributions have the same mean, but σ₂² > σ₁², because the second distribution is
more dispersed around the mean. Another term that we may often refer to is the standard
deviation, which is the square root of the variance, i.e. s.d.(X) = √var(X) = √(σ_X²) = σ_X.
Covariance: Covariance is a measure of how two variables, say X and Y, are associated with
each other, i.e. how they co-vary. This is given by

cov(X, Y) = E[(X − μ_X)(Y − μ_Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − μ_X)(y − μ_Y) f(x, y) dx dy

where μ_X and μ_Y are the means of X and Y respectively and f(x, y) is the joint density
function (depicting the probabilities associated with X taking on certain values while Y takes
on certain values).
The value of the covariance can be positive or negative. When positive it implies that as the
values assumed by X increase/decrease, then the values of the Y variable also
increase/decrease. If the covariance is negative it means that as the values that the X variable
assumes increase/decrease, then the values of the Y variable decrease/increase, i.e. they
move in the opposite direction.
So for example, you would expect that the more smokers there are in a population, the more
cases of lung disease you would encounter. Hence one would expect a positive correlation
between these two variables. An economic example, and one that we will refer to many times
in this course, is the relationship between income and consumption. There would be a
positive covariance between the amount people earn and how much they consume. Or
another example, the higher the price of a good the lower the demand for that good, this
would give a negative covariance.
Correlation coefficient: this concept is similar to covariance. Covariance tells us whether two
variables are positively or negatively related. The correlation coefficient does the same but it
also gives us information about the strength of the relationship, i.e. it numerically quantifies
the relationship. It is given by

corr(X, Y) = cov(X, Y) / (σ_X σ_Y)

where σ denotes the standard deviation. This coefficient takes values −1 ≤ corr(X, Y) ≤ 1. If
positive/negative this means that the variables X and Y are positively/negatively related; the
closer the value is to 1 or -1, the stronger is this positive/negative relationship and the closer
to 0, the weaker the relationship. So, one would expect a correlation coefficient close to 1
between income and consumption because the income a person earns almost solely dictates
how much money they can spend. In the price/demand example one would perhaps expect
a value close to -1, indicating that when a consumer is deciding whether or not to buy a
product, one of the main factors in this decision is the price of the good. In the smoking
example, the correlation between number of smokers and incidence of lung cancer will be
positive but it is difficult to say how close to 1 it would be. Clearly it is not a foregone
conclusion that if you smoke you will get lung cancer, otherwise very few people would
undertake such an activity, so the correlation would not equal 1 or be close to 1. It is
probably quite low, but definitely positive.
It is important to discuss at this point the difference between correlation and causation. The
correlation coefficient between income and consumption is the same as that between
consumption and income, but at the microeconomic level, it is income that affects
consumption not the other way around. So how much I earn determines how much I spend,
rather than how much I spend determining how much I earn, the causation runs from income
to consumption. And it is the smoking of cigarettes that causes lung cancer, not the lung
cancer that causes the smoking.
A more sophisticated example would look at the correlations between three or more
variables. Let’s consider number of smokers, incidence of lung cancer and sales of cigarette
lighters. The more smokers there are in a population, the more sales of cigarette lighters there
would be, hence a positive correlation. A positive correlation between smoking and lung
cancer and a positive correlation between smoking and sales of cigarette lighters would result
in a positive correlation between sales of cigarette lighters and lung cancer. But clearly it is
not the sale of cigarette lighters per se that causes lung cancer. So, one must be careful to
distinguish between correlation and causation. Correlation does not imply causation.
The normal distribution is a very useful distribution that can describe fairly well the distribution
of values taken by many everyday things. If you look at the picture of a normal density function
you can see that it has a symmetric bell shape.
[Figure: a normal density function f(x), symmetric and bell-shaped around its mean E(X).]
Most of the density is around the mean, in the centre, and the probability is small that the
variable will take values in the tails, i.e. relatively high or low values compared to the mean.
You could think of people’s heights as being normally distributed, i.e. if we were to collect
data on the heights of everyone taking Econometrics, we would find that a majority would
cluster between 5.5 and 6 feet, and very few would be shorter than 4.5 feet and very few
taller than 6.5 feet. When you look at the distribution of marks for economics modules, you
will see that a majority are clustered around the low 60s, with very few getting below say 20%
and very few getting above say 80%. The income of a population is likely to be normally
distributed with most people earning close to the mean and very few earning below £10k or
above say £200k, i.e. in the tails.
A random variable X that is normally distributed with mean μ and variance σ² is denoted
X ~ N(μ, σ²). The probability of getting a value for X that is within a standard deviation of
the mean is about 68%, within 2 standard deviations is about 95% and within 3 about 99.7%.
[Figure: a normal density with the points μ − 3σ, μ − 2σ, μ − σ, μ + σ, μ + 2σ and μ + 3σ marked on the X axis.]
Its density function is

f(x) = (1 / (σ√(2π))) exp( −(x − μ)² / (2σ²) )

where σ > 0, −∞ < μ < ∞ and −∞ < x < ∞. A special case of the normal is the Standard
Normal distribution, which is a normal distribution with a specific mean value of 0 and a
specific variance of 1. If X ~ N(μ, σ²) then Z = (X − μ)/σ ~ N(0, 1). This is an extremely useful
result and is one we will refer to again. We will encounter other distributions in this course,
like t and F distributions, and these are all built upon normal or standard normal distributions
in some way.
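As a rough check of these figures (an illustrative aside, assuming Python with scipy is available rather than the Eviews software used in the module), the sketch below computes the probabilities of lying within 1, 2 and 3 standard deviations of the mean, and confirms that standardising maps N(μ, σ²) probabilities onto the standard normal.

# The 68%/95%/99.7% rule and the standardisation Z = (X - mu)/sigma
from scipy.stats import norm

mu, sigma = 100.0, 15.0              # an arbitrary illustrative N(mu, sigma^2) variable
X = norm(loc=mu, scale=sigma)

for k in (1, 2, 3):
    p = X.cdf(mu + k * sigma) - X.cdf(mu - k * sigma)
    print(k, round(p, 4))            # 0.6827, 0.9545, 0.9973

x = 120.0
z = (x - mu) / sigma                 # standardised value
print(X.cdf(x), norm.cdf(z))         # identical probabilities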
The basic idea is therefore that, given we do not know the true value of the thing that we are
interested in (mean income of people in England in the above example), we need to find a way
of estimating it, and we need this estimated value to be accurate and representative.
Therefore we need (i) the sample of data that we use to be a representative sample, and (ii)
an estimator that has good properties. Both of these are essential if we are
to have any confidence in the estimate that we obtain. For example, suppose I collect data on
the income of a sample of people that live in England but I limit my sample to be people in
my family, and I happen to be the daughter of a millionaire. When I calculate mean income I
arrive at a value somewhere in the millions. Is this representative and accurate? It’s doubtful.
The poor data collection technique has led to an estimate that is very strongly upward biased.
I know this is an extreme example, but it highlights my point. Clearly the people we choose to
be in our sample are extremely important. Luckily we have agencies that do this kind of thing
for us and they collect data from people who live all over the country with all sorts of
backgrounds.
In technical statistical terms we want our sample to be a random sample. The set of data
observations that we collect for our variable X, denoted X₁, ..., X_n, is a random sample, of size
n, if each observation Xᵢ has been drawn independently from the same distribution, so that
each Xᵢ has the same distribution. These are known as i.i.d. random variables, which means
independently and identically distributed. (Many statistical packages nowadays have a
random number generator function, which will generate random samples of data drawn from
particular distributions).
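For example (a minimal sketch, assuming Python with numpy is available), a random number generator can be used to draw an i.i.d. sample; the population used here is a hypothetical N(25000, 5000²) "income" distribution chosen purely for illustration.

# Drawing an i.i.d. random sample of size n = 1000
import numpy as np

rng = np.random.default_rng(seed=42)      # seeded only to make the draws reproducible
sample = rng.normal(loc=25_000, scale=5_000, size=1000)

print(sample[:5])                         # the first few simulated 'incomes'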
1.6 Estimation
Even if we use a proper random sample, our estimate may still be inaccurate if the estimator
that we use has poor properties. We will look at what properties our estimators should
possess in a bit. Let’s first consider what we mean by an estimator. An estimator is simply an
equation that involves the values from the sample of data. When you plug the data into the
estimator equation you get a single value as your estimate. In our example of calculating mean
income, an estimator we could use is the sample mean. This is given by
X̄ = (1/n) Σ_{i=1}^{n} Xᵢ.
So X̄ is the sample mean estimator, whereby you add up the values in your sample, the Xᵢs,
and divide by the number of observations, n, to arrive at a value for X̄. This is your estimated
value. This is the standard way in which you calculate a mean or an average for a set of data
(an alternative measure of central tendency is the median).
If you were interested in estimating the variance of the variable, or whether the variable was
correlated with another variable, then you could use the following sample moment
estimators:
Sample variance: S_X² = (1/(n−1)) Σ_{i=1}^{n} (Xᵢ − X̄)² = (1/(n−1)) ( Σ_{i=1}^{n} Xᵢ² − nX̄² )

Sample covariance: S_XY = (1/(n−1)) Σ_{i=1}^{n} (Xᵢ − X̄)(Yᵢ − Ȳ) = (1/(n−1)) ( Σ_{i=1}^{n} XᵢYᵢ − nX̄Ȳ )

Sample correlation coefficient: r = S_XY / (S_X S_Y) = Σ (Xᵢ − X̄)(Yᵢ − Ȳ) / √( Σ (Xᵢ − X̄)² Σ (Yᵢ − Ȳ)² )
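The sketch below (again assuming Python with numpy, and using simulated data rather than any dataset from the module) applies these sample moment formulas directly and checks them against the built-in functions; note that ddof=1 gives the n − 1 divisor used above.

# Sample mean, variance, covariance and correlation from the formulas above
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(10, 2, size=200)
Y = 3 + 0.5 * X + rng.normal(0, 1, size=200)   # Y built to co-vary with X

n = len(X)
Xbar, Ybar = X.mean(), Y.mean()

S2_X = ((X - Xbar) ** 2).sum() / (n - 1)                 # sample variance
S_XY = ((X - Xbar) * (Y - Ybar)).sum() / (n - 1)         # sample covariance
S2_Y = ((Y - Ybar) ** 2).sum() / (n - 1)
r = S_XY / np.sqrt(S2_X * S2_Y)                          # sample correlation

print(S2_X, X.var(ddof=1))                  # agree
print(S_XY, np.cov(X, Y, ddof=1)[0, 1])     # agree
print(r, np.corrcoef(X, Y)[0, 1])           # agree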
It is important to realise that, because estimators are functions of the random sample, for
which each observation X i has been drawn from the same distribution, then the estimators
are themselves random variables and hence have distributions. We call these sampling
distributions. As we have been interested in the mean estimator, let’s continue with this and
consider what the sampling distribution of this estimator might be. Assume that the sample
Xᵢ has come from a normal distribution with a mean μ and a variance σ², i.e. Xᵢ ~ N(μ, σ²).
What is the distribution of X̄? Well it turns out that it is also normally distributed with the
same mean as the underlying observations but a variance of σ²/n, i.e. the variance is scaled
down by the sample size n. Hence

X̄ ~ N(μ, σ²/n)
An even stronger result than this, called the central limit theorem, states that if the sample
size n is large, i.e. we have a lot of observations in our sample, then X̄ ~ N(μ, σ²/n) (approximately), even if
the underlying random sample has been picked from a distribution that is not a normal
distribution.
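This can be seen in a small simulation (an illustrative Python/numpy aside, not part of the module's empirical work): drawing many samples of size n from a non-normal population and computing X̄ each time produces a sampling distribution with mean μ and variance close to σ²/n.

# Sampling distribution of the sample mean when the population is uniform on [0, 10]
# (mu = 5, sigma^2 = 100/12), illustrating the central limit theorem.
import numpy as np

rng = np.random.default_rng(0)
n, replications = 50, 20_000

means = rng.uniform(0, 10, size=(replications, n)).mean(axis=1)

print(means.mean())    # close to mu = 5
print(means.var())     # close to sigma^2 / n = (100/12)/50, i.e. about 0.167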
Having discussed what an estimator is and looked at some particular moment estimators, we
should consider what properties we desire from our estimators. The issue for the
econometrician/statistician is that they are trying to estimate a value for a parameter whose
true value is unknown to us (if it was known to us then why would you bother estimating it?!).
For example, there is some true value for the mean income of people that live in England, but
we do not know what it is, so we collect a sample of data and estimate it, using an estimator
like the sample mean X . This will provide us with a point estimate, i.e. a single value that
represents our best guess at what the mean income value is. How do we know that this value
is close to the true but unknown value? The answer is that we don't. But so long as we use a
proper random sample of data and an estimator that exhibits properties that make it likely
that our estimates are close, then we should be OK. These properties are important and we
will return to them on many occasions.
i. Unbiasedness: θ̂ is an unbiased estimator of θ if E(θ̂) = θ. This means that the
sampling distribution of the estimator θ̂ is centered on θ. Hence on average the
estimator will yield the true value.
ii. Efficiency: Unbiasedness alone is not adequate to indicate a good estimator. Not only
do we want the sampling distribution of the estimator to be centered around the true
but unknown value θ, we want as much of the mass of the distribution to be around
this point, i.e. the variance must be small. This means that the estimator is more precise
and is hence more likely to estimate the true value. If, amongst all unbiased estimators
of θ, the estimator θ̂ has the smallest variance, it is said to be the most efficient or
the best and is called a "Best Unbiased Estimator". The diagram below shows how a smaller
variance can make an estimator more precise.
[Figure: the sampling distributions of two unbiased estimators, θ̂ and θ̃, both centered on θ; the distribution of θ̃ is more spread out.]
Remember that the area under the curve represents the probability that the value for
the estimator lies in a certain range. Notice that both estimators are unbiased (their
distributions are centered around θ) but the variance of θ̃ is greater because it is more
spread out. It is more likely therefore to estimate a value far away from the true value
when the variance of the estimator is higher. In this case, θ̂ is more efficient and a
better estimator than θ̃.
iii. Consistency: This is a large sample, or asymptotic, property. It means that as the size of
the sample gets bigger, i.e. as n → ∞, the variance of the estimator tends to 0 (becomes
more accurate) until the density collapses to a single spike at θ. If an estimator is
consistent it therefore means that if we could increase the sample size indefinitely, we
would estimate the true value. Although it is not possible to have an infinite sample
size, some estimators actually have variances that decrease very quickly when the
sample size increases, so that the size does not have to be too large to obtain an
accurate estimate. Of course, an estimator without this property is no good, as this
means that even if we had all the information available, the estimator still cannot
estimate the correct value.
It is easy to see, given we know that the variance of X̄ is σ²/n, that as the sample size
gets larger, the variance of this estimator gets smaller, and the variance tends to 0 as
n → ∞. We can therefore say that the sample mean is a consistent estimator of the true
mean μ.
Suppose we hypothesise that the mean income μ is £25k. When we estimate μ, we might obtain
(i) a value for X̄ that is different from £25k because the actual value of μ is different
from £25k; or
(ii) a value for X̄ that is different from £25k even though the actual value of μ is £25k.
Essentially we are asking whether our estimated value is “sufficiently” far away from our
hypothesised value to suggest that the hypothesis is incorrect.
The null hypothesis is denoted H₀ and takes the form

H₀: μ = μ*

In our example, this would be H₀: μ = £25k, i.e. we hypothesise that the true mean is £25k.
The sample evidence will either reject or not reject this null, against an alternative hypothesis,
denoted H₁. We shall consider three types:

H₁: μ ≠ μ*
H₁: μ > μ*
H₁: μ < μ*
The first is a two-sided hypothesis (because we consider either side of the hypothesised
value, i.e. values above and below μ*), and the others are one-sided hypotheses (either above
or below). To test the null against one of these alternatives we use sample data to develop a
test statistic and we use decision rules to help us decide whether the sample evidence
supports or rejects the null.
To show how this works we will use the sample mean example. First let’s consider how we
can develop a test statistic. We know that X̄ ~ N(μ, σ²/n). Let’s suppose that we want to test
the hypothesis H₀: μ = £25k, i.e. that the true mean income value is £25k. Under the null
hypothesis, meaning when the null is correct, the distribution of the estimator X̄ is centered
on £25k, i.e. X̄ ~ N(£25k, σ²/n), and the distribution will look as follows:
[Figure: the density of X̄ centered on £25,000, with a small area A shaded above the value X*.]
From this picture you can see that if the true value of mean income is £25k, then the
probability that our estimator, X̄, will produce a value above the amount X* is very small, i.e.
area A. Interpreting this we can say that if our estimated value for X̄ is greater than X*, then this is
evidence against the null hypothesis that mean income is £25k, because it is so unlikely that
you would estimate such a high value for X̄ if the null were true. This is the basis of
hypothesis testing – does the sample evidence suggest that the null is appropriate?
The problem with the above approach is that we don’t actually know the position of the
distribution because we do not know the variance of X̄ (we do not know σ²). We need to
adjust things a bit. If we turn the distribution of X̄ into a standard normal by subtracting the
mean and dividing by the standard deviation, we get

Z = (X̄ − μ) / √(σ²/n) ~ N(0, 1)
If we replace σ² by its sample estimator, given above by S², then the standard normal
distribution becomes a t distribution (we will not prove that this is the case in this module).
Hence

t = (X̄ − μ) / √(S²/n) ~ t_{n−1}
I have called the statistic t because it has a t distribution. A t distribution looks very similar to
a normal distribution, but it is always centred on 0 and has fatter tails;
[Figure: a t distribution centered on 0, with critical values −t_c and t_c marked; each tail beyond the critical values has area α/2 and the central area is 1 − α.]
In the above diagram, the area under the distribution, which equals 1 in total, has been
divided up in such a way that the area above t_c and below −t_c equals α in total and the
area in between is 1 − α.
If the null hypothesis is true, i.e. that μ = 25,000, then the t statistic below has the
distribution:

t = (X̄ − 25,000) / √(S²/n) ~ t_{n−1}
Any value of X̄ that produces a value for this t statistic in the tails of this distribution
is treated as evidence against the null hypothesis (such a t statistic would be so unlikely to
come about if the null were true that, if we do obtain one, it suggests that the null is not true).
So how do you perform such a test in practice? Let’s assume that the alternative hypothesis
is H₁: μ ≠ 25,000 (a two-sided alternative).
i. In the example t statistic above you will see that, on replacing X̄ and S² with the values
obtained for these estimates and plugging in the value of the sample size n,
you get a value for the t statistic.
ii. If you look at the distribution diagram above, you will see that we need to choose a
value for α, which dictates how the distribution is divided. The term α is called the
significance level and usually we set this at 5%, i.e. α = 0.05, and this then dictates the
values of t_c and −t_c (these are called the critical values and are obtained from a t
distribution table) as t_{n−1}^{0.025} and −t_{n−1}^{0.025}. This means that, if the null is true, there is only a
2.5% probability that the t statistic lies above t_c and a 2.5% probability that it lies below −t_c.
iii. If the t statistic you calculate is greater than t_c or less than −t_c,
then you reject the null hypothesis, otherwise you do not reject the null hypothesis.
If the alternative hypothesis was H₁: μ > 25,000 then you would find the critical value for
which there is 5% in the upper tail only (rather than 2.5% in both upper and lower tails as
above), and the decision rule is that if t > t_{n−1}^{0.05} then you reject the null, otherwise you do not
reject. Conversely, if the alternative was H₁: μ < 25,000 then you would find the critical value
for which there is 5% in the lower tail only, and the decision rule is that if t < −t_{n−1}^{0.05} then you
reject the null, otherwise you do not reject. Whether you do two-sided or one-sided tests is
often dictated by the economic theory that you are testing.
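As an illustrative sketch (assuming Python with numpy and scipy; the income data here are simulated, not real), the two-sided test of H₀: μ = 25,000 at the 5% level can be carried out as follows.

# Two-sided t test of H0: mu = 25000 against H1: mu != 25000 at the 5% level
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(3)
income = rng.normal(26_000, 6_000, size=200)   # hypothetical sample of incomes

n = len(income)
Xbar = income.mean()
S2 = income.var(ddof=1)                        # sample variance with n-1 divisor

t_stat = (Xbar - 25_000) / np.sqrt(S2 / n)
t_crit = t.ppf(0.975, df=n - 1)                # the critical value t_{n-1}^{0.025}

print(t_stat, t_crit)
print("reject H0" if abs(t_stat) > t_crit else "do not reject H0")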
A Type I error is the error of rejecting a hypothesis when it is true. It is possible that a true
null could be rejected by mistake with probability α.
A Type II error is the error of not rejecting a false hypothesis. We will denote the probability
of making such an error as β.
These errors are unavoidable and are an intrinsic part of hypothesis testing. Of course we
would like to minimise the chances of making either error, but it is not possible to minimise
them both simultaneously. Statisticians and econometricians usually take the approach that
making a Type I error is worse than making a Type II error. Hence they try to reduce the
probability of making a Type I error by keeping the level of significance α to a low value,
usually 0.01 (or 1%), 0.05 (5%) or 0.1 (10%). Hence if we perform a test at the 5% level of
significance, we are effectively stating that we are prepared to accept at most a 5% probability
of committing a Type I error.
The size of a test refers to the probability of making a Type I error, i.e. α. So the size of a test
is the probability of rejecting a true hypothesis. The power of a test refers to the probability
of not committing a Type II error. This is therefore the probability of rejecting a false
hypothesis, i.e. doing the right thing.
As an example, we can find a confidence interval for the mean μ of a variable X. We know
that the sample mean X̄ is a point estimator of μ and that a t statistic related to this
estimator (shown earlier) is t = (X̄ − μ)/√(S²/n) ~ t_{n−1}. To define a (1 − α)×100% confidence interval we
need to find the cut-off points ±t_c such that P(−t_c ≤ t ≤ t_c) = 1 − α (see the diagram
above). Substituting the t statistic from above we get

P( −t_c ≤ (X̄ − μ)/√(S²/n) ≤ t_c ) = 1 − α

which can be rearranged to give

P( X̄ − t_c √(S²/n) ≤ μ ≤ X̄ + t_c √(S²/n) ) = 1 − α.
The interpretation is that there is a 100(1 − α)% (95% if α = 0.05) probability that the
random interval ( X̄ − t_c √(S²/n), X̄ + t_c √(S²/n) ) contains μ. Suppose we have a sample of
observations of size 100 and the sample mean is 5.13 and sample variance is 1.76. The 95%
confidence interval for μ is therefore

( 5.13 − t_c √(1.76/100), 5.13 + t_c √(1.76/100) )
The value for t_c is the value for t_{99}^{0.025} from a t distribution table and is found to be about 1.984.
Therefore there is a 95% probability that the interval (4.867, 5.393) contains μ. So with a
point estimate you get a single value as an estimate, but with a confidence band you get a
range of values in which you have a certain degree of confidence that the true value lies.
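The numbers above can be reproduced with a few lines of code (a sketch assuming Python with scipy; changing 0.975 to 0.95 gives the 90% interval discussed next).

# 95% confidence interval for mu when n = 100, sample mean 5.13, sample variance 1.76
import numpy as np
from scipy.stats import t

n, xbar, s2 = 100, 5.13, 1.76

t_c = t.ppf(0.975, df=n - 1)                  # about 1.984
half_width = t_c * np.sqrt(s2 / n)

print(xbar - half_width, xbar + half_width)   # roughly (4.867, 5.393)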
What if we were only interested in having 90% confidence? The interval would then be (4.910,
5.350) because the t critical value has decreased to 1.660. You will see that the interval has
shrunk, which helps to say a bit more about the true value of μ. But we have less confidence
in that interval. It would be nice to be able to say that we are 99.99% confident that μ lies in
the interval 5.125 to 5.135. This would be incredibly accurate with a high level of confidence
in a very tight range. But this is impossible because there is a trade-off between the size of the interval
and confidence in that interval. The more confidence in the range we have, the larger is the
interval. The extreme example is to look at the interval in which we have 100% confidence.
The interval here would be −∞ to ∞ and so we would say that we are 100% confident that
the true value for μ lies between −∞ and ∞, which doesn’t help at all. Of course we are
100% sure of this. So, in order to get a reasonably sized interval, we are forced to lose some
confidence in this interval.
Once we have calculated our confidence band, given there is such a high probability that the
true value for μ lies within it, any hypothesised value outside this interval must be rejected.
So if we had hypothesised μ = 5.5, we can reject this based on the fact that there is only a
5% chance that a value for μ would be outside the range (4.867, 5.393).
APPENDIX
Properties of Expectations
If a, b are constants:
• E (a) = a
• E ( aX ) = aE ( X )
• E ( aX + b ) = E ( aX ) + E ( b ) = aE ( X ) + b
• E ( X + Y ) = E ( X ) + E (Y )
• In general E(XY) ≠ E(X)E(Y), unless X and Y are independent, in which case
E(XY) = E(X)E(Y)
Properties of Variance
• var(b) = 0 because b is a constant and therefore its value does not change
  Proof: var(b) = E[(b − E(b))²] = E[(b − b)²] = E(0) = 0
• var(X + b) = var(X)
• var(aX) = a² var(X)
• var(aX + b) = a² var(X)
• var(X + Y) = E[(X + Y − E(X) − E(Y))²]
  = E[(X + Y − μ_X − μ_Y)²]
  = E[(X − μ_X)² + (Y − μ_Y)² + 2(X − μ_X)(Y − μ_Y)]
  = var(X) + var(Y) + 2cov(X, Y)
• cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]
  = E(XY) − μ_Y E(X) − μ_X E(Y) + μ_Y μ_X
  = E(XY) − μ_Y μ_X − μ_X μ_Y + μ_Y μ_X
  = E(XY) − μ_X μ_Y
Properties of Covariance
• cov(a + bX, c + dY) = E[(a + bX − E(a + bX))(c + dY − E(c + dY))]
  = E[(a + bX − a − bE(X))(c + dY − c − dE(Y))] = E[(bX − bμ_X)(dY − dμ_Y)]
  = E[b(X − μ_X) · d(Y − μ_Y)] = bd E[(X − μ_X)(Y − μ_Y)]
  = bd cov(X, Y)
• cov(X, X) = E[(X − μ_X)(X − μ_X)] = var(X)
Section 2: Bivariate Linear Regression
For the moment we will be concerned only with bivariate econometric models, which means
that we are analysing simple models that involve only two variables. We want to reveal more
about the relationship between the two economic variables. The simple consumption function
just mentioned fits into this category because the model comprises just two variables, income
and consumption. We make use of economic theory to indicate the causation in the
relationship, i.e. is it changes in the value of X that cause changes in the value of Y or vice
versa. For notational purposes we usually indicate the explanatory variable (also called the
regressor or exogenous variable) by X and the dependent variable (also called the regressand
or endogenous variable) by Y so that the causation runs from X to Y, hence we believe that
changes in economic variable X cause movements in economic variable Y. We are also only
interested in linear econometric models, which here implies that the equations represent
straight line relationships between the variables X and Y. Linearity in regression analysis is
actually broader than this and we’ll discuss this shortly.
Let’s look again at the simple consumption function, at the microeconomic level, such that we
consider the relationship between individuals’ consumption and income behaviour. Clearly,
the more an individual earns, the more they can spend. So, the relationship is a positive one
and might be represented with the following diagram:
[Figure: an upward-sloping line showing the positive relationship between X (income) and Y (consumption).]
An economic theory suggesting that the variation in some economic variable Y (in the above
example this is consumption) depends linearly upon the variability in another economic
variable X (income in this example), can be written as
Y = β₁ + β₂X (1)

This is just the equation of a straight line where β₁ is the intercept on the Y axis (i.e. the value
of Y when X is equal to 0) and β₂ is the slope of the line, which represents how much variable
Y changes when the values of variable X change. This equation should attempt to mimic the
behaviour of the economic system that it is representing. But we know that the economy
would not move in such an exact way as this. There might be unobservable factors that
influence Y that we simply cannot include in the equation and hence the line simply mimics
the general relationship between the variables. For example, suppose β₁ = 2 and β₂ = 0.75,
such that Y = 2 + 0.75X. If X = 17 then the equation suggests that Y = 2 + 0.75(17) = 14.75,
but obviously not everyone who earns £17,000 will spend £14,750.
So, to add some realism to the equation we add what we call an error term or random
disturbance term, often denoted as ε, to the equation, i.e.

Y = β₁ + β₂X + ε. (2)
Equation (2) is what we would call a regression model. In our consumption function example,
the error term takes account of the fact that randomness in human behaviour prevents all
income and consumption values from the population from sitting exactly along the line. They
may be above it or below it and hence ε may be positive or negative in value to reflect this.
The error also accounts for any factors that influence Y but are not included in the equation.
These may be factors that we cannot easily measure or observe.
We are very interested in what ε “looks like” and we will concentrate on this later in this
section. The best way to deal with the error term is to treat it as a random variable; after all
it does account for random behaviour that cannot be quantified or easily modelled. The
properties of this random error are crucially important as we will see later.
In practice we do not know the values of the parameters β₁ and β₂, and therefore we do not know the position of the true regression line. So what is the way forward
here?
Well, we collect data on the variables of interest, X and Y, i.e. income and consumption in our
example, and we estimate values for β₁ and β₂. The plot below shows a possible sample of
data on the income and consumption values of different individuals from the population:

[Scatter plot: each point shows one individual's income (X) and consumption (Y); the cloud of points slopes upwards.]
Each point represents the income and consumption values of an individual in the sample. You
can see the general upward trend in the relationship between income and consumption,
indicating that people with higher incomes also tend to have higher consumption values. This
is the positive relationship showing up in the data. So, now we have the data, we also need a
statistical technique that will allow us to estimate values for β₁ and β₂. We’ll next consider
ways in which this might be achieved.
Going back to the issue of linear models, a model is defined as linear if it is linear in the
parameters. This means that the model’s parameters do not appear as exponents or products
of other parameters. If the model contains squares or products of variables, we would still
refer to this as a linear regression if it is still linear in parameters. So, as examples,

Y = β₁ + β₂X + β₃X² + ε

is linear in the parameters (even though it contains X²), whereas

Y = β₁ + β₂X₁ + β₃X₂ + β₂β₃X₃ + ε

is not, because it contains the product of parameters β₂β₃.
To glean any information about this economic relationship we have to use some kind of
estimation technique to estimate the values of β₁ and β₂. We will denote these estimated
values as β̂₁ and β̂₂. So, β₁ and β₂ are the true parameters whose values are unknown and
β̂₁ and β̂₂ are estimates of these parameters. The main issue for the econometrician is how
to get these estimates. We know already that to do estimation we need data and we need a
statistical procedure. Combining the two will produce estimated values for the parameters β₁
and β₂. An important question is whether the estimates we obtain are accurate. This relies
on us using good quality data and an appropriate estimation method. There are issues such as
the fact that different model specifications may require different techniques, or it may be that
the data we collect is not exactly in the form that the model specifies and this may affect the
properties of the model and hence suggest a certain type of estimation procedure. But for
now we will assume that the data are i.i.d. and we will consider the situation where it is
appropriate to apply the most basic of estimation techniques.
Assume that we have a set of data that represents the income and consumption of a sample
of people, shown in the earlier plot. We want to find the line that “best” fits through this plot
of data. Once we have found this “best” line then we have found our estimates of β₁ and β₂,
i.e. β̂₁ and β̂₂, where β̂₁ is the value where the line crosses the Y axis and β̂₂ is the slope of
the line.
Now that we are relating the regression model (2) to the sample of data observations on
income and consumption, we can show that for each individual i in the sample:
Yᵢ = β₁ + β₂Xᵢ + εᵢ (3)

where i = 1, ..., n. The subscript i indicates that we are looking at data on individuals and there
are n individuals in the data set. We say that the sample is of size n. In our example this is 28
as there are 28 data points in the plot. If we replace the parameters with their estimates (we’ll
discuss shortly how to get these estimates), we get

Yᵢ = β̂₁ + β̂₂Xᵢ + ε̂ᵢ

This is the estimated regression line and the term ε̂ᵢ is called the residual, which is essentially
the estimated version of the error term εᵢ. The residual gives us the distance that each data
point in the sample lies away from the estimated regression line. Using Excel, we find that the
best fitting line for our sample of data is

Ŷᵢ = 0.7841 + 0.7765Xᵢ

As you can see, some individuals are above the line, some below, some close to it, some not
so close. But you can see that it’s not possible to fit a single straight line through all points.
You can see that the estimates are β̂₁ = 0.7841 and β̂₂ = 0.7765.
The important question for the moment is how to actually find the best fitting line, i.e. how
did Excel come up with the line in the diagram above? We need a mathematical/statistical
criterion with which to do this. Obviously we want the residuals to be as small as possible, i.e.
we want observations to be as close to the line as possible. The example lines below are pretty
bad at fulfilling such a criterion and clearly do not represent the general relationship between
the two variables X and Y. For one of the lines all of the residuals would be negative because
all of the points lie below it. The other line has a negative slope.
So, what can we do? What if we consider adding up all of the residual values and choose the
line that gives the lowest absolute summed value, i.e. find the values of β̂₁ and β̂₂ that give

S = Σ_{i=1}^{28} ε̂ᵢ.
The problem with this criterion is that a line like the downward sloping one above is likely to
produce the lowest value for S. This is because all of the positive and negative residuals would
cancel each other out and the sum would be close to 0. We clearly do not want to use a
criterion that chooses a downward sloping line for a set of data that is clearly trending
upwards.
The criterion that works the best and which is used most often by econometricians is to
minimise the sum of squared residuals, i.e.
S = Σ_{i=1}^{28} ε̂ᵢ².
That way, negative residuals, when squared, would become positive and would no longer
cancel out with the squares of the positive residuals when summed together. The process that
finds estimates based on this criterion is called Ordinary Least Squares estimation or OLS for
short. This is an extremely common estimation technique for econometricians and is often the
basis for other techniques, when OLS itself is not appropriate (we will discuss some cases of
this later in the course). Let’s look at OLS in more detail.
To find the values of β̂₁ and β̂₂ that minimise the sum of squared residuals, we need to

min S = min Σ_{i=1}^{n} ε̂ᵢ² = min Σ_{i=1}^{n} (Yᵢ − β̂₁ − β̂₂Xᵢ)².
We know that to find the maximum or minimum of a function we need to differentiate and
set to 0. This gives what we call the first-order conditions. So
∂S/∂β̂₁ = −2 Σ_{i=1}^{n} (Yᵢ − β̂₁ − β̂₂Xᵢ) = 0 (4)

∂S/∂β̂₂ = −2 Σ_{i=1}^{n} Xᵢ (Yᵢ − β̂₁ − β̂₂Xᵢ) = 0 (5)
Solving these two equations simultaneously gives

β̂₁ = Ȳ − β̂₂X̄ (6)

β̂₂ = Σ (Xᵢ − X̄)(Yᵢ − Ȳ) / Σ (Xᵢ − X̄)² = ( Σ XᵢYᵢ − nX̄Ȳ ) / ( Σ Xᵢ² − nX̄² ). (7)
These are the OLS estimators of β₁ and β₂ and as you can see they are simply equations that
involve our data, X and Y. We can now see how we combine data with a statistical technique
to get estimates of the unknown parameters in the regression model. We have our data on
variables X and Y. We have the statistical technique of OLS and this gives us the formulae with
which we can use the data in order to get our estimates, i.e. input the values of our dataset,
X and Y , into our estimator equations (6) and (7), and out pop two numbers, one an estimate
of β₁, the other an estimate of β₂. In the example in the plots above, Excel used OLS to
calculate the values β̂₁ = 0.7841 and β̂₂ = 0.7765. The 28 data points on consumption and
income (the Y and X variables) were plugged into equations (6) and (7) to produce these
estimates.
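The same calculation is easy to reproduce outside Excel or Eviews. The sketch below (assuming Python with numpy; the 28 observations are simulated here, since the original sample is not reproduced in these notes) applies equations (6) and (7) directly and checks the result against a built-in fitting routine.

# OLS estimates from equations (6) and (7) on simulated (income, consumption) data
import numpy as np

rng = np.random.default_rng(7)
n = 28
X = rng.uniform(5, 50, size=n)                    # 'income'
Y = 1.0 + 0.75 * X + rng.normal(0, 2, size=n)     # 'consumption' plus random error

Xbar, Ybar = X.mean(), Y.mean()
beta2_hat = ((X - Xbar) * (Y - Ybar)).sum() / ((X - Xbar) ** 2).sum()   # equation (7)
beta1_hat = Ybar - beta2_hat * Xbar                                     # equation (6)

print(beta1_hat, beta2_hat)
print(np.polyfit(X, Y, 1))    # returns (slope, intercept); should agree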
So this has introduced the concept of estimation and the technique of OLS. We now know one
method for obtaining estimates of the parameters in a regression model. With these
estimated values we can make statements about the economic relationship under
investigation. In the example above we can state that the MPC is 0.7765. What does this
mean? It means that as income goes up by £1, our consumption goes up by roughly 78p. We have
29
J.S.Ercolani Section 2: Bivariate Linear Regression
managed therefore to provide something quantitative to the economic theory posited by the
regression model.
The next question is, how do we know that this estimate is an accurate measure of the real
but unknown MPC, denoted as β₂? We don’t know β₂, so cannot say whether our estimate
of it, 0.7765, is close or far away from it. Much of the answer to this question is based on the
quality of the estimator we have used, in this case OLS. We need to know whether the OLS
estimator has “good properties” and we need to know what conditions have to be satisfied in
order for our estimator to have these good properties.
The properties that we are interested in are, again, unbiasedness and efficiency. Remember
that these properties are about the mean and variance of the sampling distribution of our
estimators. Unbiasedness is about where the centre of the distribution is and efficiency relates
to the variance of the estimator. But as yet we haven’t mentioned anything about our
estimators, β̂₁ and β̂₂, having sampling distributions. So it’s worth making the following point:
NOTE: From equation (3) we can see that the dependent variable is a function of the random
disturbance and therefore Y can be treated itself as a random variable. Further, from
equations (6) and (7) we can see that the OLS estimators are functions of this dependent
variable. This means that our estimators are also random variables (akin to the sample mean
being random in the Section 1 notes) and hence our estimators have sampling distributions.
So, are we able to say that OLS has these good properties? The answer is that yes, OLS does
have these good properties, but ONLY if certain conditions are satisfied. We call these the
classical linear regression assumptions, and they are stated below. Some of these conditions
are quite strong, meaning that they may be hard to satisfy in some regression models or for
some economic theories, in which case OLS may no longer have the good properties that we
desire. But what we can say is that IF these conditions are satisfied then OLS is the best method
of estimation that we can use, i.e. it will provide the most accurate estimate possible. So what
are these assumptions?
Classical Linear Regression Assumptions
CLRM1: the model is linear in the parameters, Yᵢ = β₁ + β₂Xᵢ + εᵢ;
CLRM2: the error term has a zero mean, E(εᵢ) = 0 for all i;
CLRM3: the regressor X is non-stochastic (fixed in repeated samples), or at least uncorrelated with the error term;
CLRM4: the errors are homoskedastic, i.e. var(εᵢ) = σ² for all i;
CLRM5: the errors are uncorrelated with each other, i.e. cov(εᵢ, εⱼ) = 0 for i ≠ j.
The OLS estimators have properties that are established in the Gauss-Markov Theorem, which
states:
Given the assumptions of the classical linear regression model, amongst all linear unbiased
estimators, the OLS estimators have the minimum variance, i.e. they are Best Linear Unbiased
Estimators or B.L.U.E.
the properties of OLS. Later in the course we will examine the relevance of some of these
assumptions and look at how the OLS estimators are affected when the assumptions are
violated.
since Σ (Xᵢ − X̄) = Σ Xᵢ − nX̄ = nX̄ − nX̄ = 0. Therefore

β̂₂ = Σ (Xᵢ − X̄)Yᵢ / Σ (Xᵢ − X̄)² = Σ wᵢYᵢ, where wᵢ = (Xᵢ − X̄) / Σ (Xᵢ − X̄)².

Substituting Yᵢ = β₁ + β₂Xᵢ + εᵢ and using the facts that Σ wᵢ = 0 and Σ wᵢXᵢ = 1, this becomes
β̂₂ = β₂ + Σ wᵢεᵢ. Taking expectations,

E(β̂₂) = β₂ + Σ wᵢ E(εᵢ) = β₂

given the CLRM2 and 3 assumptions. Hence we have unbiasedness. Unbiasedness of the
intercept term can also be established using a similar proof.
We now need to establish the mean and variance of our OLS estimators. By the Gauss-Markov
theorem, OLS is unbiased. This implies that the means of their sampling distributions are β₁
and β₂ respectively, because unbiasedness implies that E(β̂₁) = β₁ and E(β̂₂) = β₂. Their
variances are given by the following equations
σ₁² = var(β̂₁) = σ² Σ Xᵢ² / ( n Σ (Xᵢ − X̄)² ) (6*)

σ₂² = var(β̂₂) = σ² / Σ (Xᵢ − X̄)² (7*)

Hence β̂₁ ~ N(β₁, σ₁²) and β̂₂ ~ N(β₂, σ₂²).
To see where (7*) comes from, note that

var(β̂₂) = var(β₂ + Σ wᵢεᵢ) = var(Σ wᵢεᵢ) = Σ wᵢ² var(εᵢ)

This last step comes about because of CLRM5, and if we further impose CLRM4 we get

var(β̂₂) = σ² Σ wᵢ²

and because Σ wᵢ² = 1 / Σ (Xᵢ − X̄)², then var(β̂₂) = σ² / Σ (Xᵢ − X̄)².
We have now discussed the methodology behind OLS, we have derived the equations for the
OLS estimators (for a bivariate model), we have established that OLS has good properties
under certain assumptions and we have found that OLS estimators are normally distributed.
Now we are in a position to move on to other important statistics we should consider when
doing regression analysis.
in some texts it is written as r² in the context of bivariate models. We can derive the
expression for R² in the following way:

Yᵢ − Ȳ = (Ŷᵢ − Ȳ) + ε̂ᵢ, or yᵢ = ŷᵢ + ε̂ᵢ

where the lower case letters denote mean-adjusted variables, i.e. yᵢ = Yᵢ − Ȳ. By squaring and
summing this function it can be shown that

Σ yᵢ² = Σ ŷᵢ² + Σ ε̂ᵢ²

where Σ yᵢ² = TSS (Total Sum of Squares), Σ ŷᵢ² = ESS (Explained Sum of Squares) and
Σ ε̂ᵢ² = RSS (Residual Sum of Squares), and hence

TSS = ESS + RSS
In words this means that the total variation of the actual Y values around their mean is equal
to the sum of the total variation of the estimated Y values around the mean and the residual
variation of Y.
If the regression line fits through the points very well we would expect the residual variation
to be small. In the extreme case all points lie exactly on the line so that the RSS = 0 , but this
is very unlikely to occur. The R2 value tells us how much of the total variation of Y is
attributable to the regression line so that
R² = ESS/TSS = 1 − RSS/TSS
How do we interpret the R² value? It is the case that 0 ≤ R² ≤ 1. The closer this value is to 1,
the better the fit of the regression line and the closer to 0, the worse the fit and hence the
higher the residual variation. We therefore would always be happier if the coefficient has a
value close to 1.
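Continuing the simulated example used earlier (a Python/numpy sketch, not the module's Excel exercise), the sums of squares and R² can be computed directly and the two expressions for R² checked against each other.

# TSS, ESS, RSS and R^2 for a bivariate OLS fit on simulated data
import numpy as np

rng = np.random.default_rng(7)
n = 28
X = rng.uniform(5, 50, size=n)
Y = 1.0 + 0.75 * X + rng.normal(0, 2, size=n)

b2 = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
b1 = Y.mean() - b2 * X.mean()
Y_hat = b1 + b2 * X
resid = Y - Y_hat

TSS = ((Y - Y.mean()) ** 2).sum()
ESS = ((Y_hat - Y.mean()) ** 2).sum()
RSS = (resid ** 2).sum()

print(TSS, ESS + RSS)              # TSS = ESS + RSS
print(ESS / TSS, 1 - RSS / TSS)    # the two expressions for R^2 agree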
Recall that

β̂₁ ~ N(β₁, σ₁²) and β̂₂ ~ N(β₂, σ₂²)

Suppose that we wish to test that β₂ is equal to some value β₂* and that there are no
theoretical suggestions to help us specify the direction in which the parameter should deviate
under the alternative. We would choose therefore to perform a two-sided test where the null
and alternative hypotheses are specified as

H₀: β₂ = β₂*
H₁: β₂ ≠ β₂* (8)
There may however be some a priori evidence to suggest a particular direction for the
alternative. For example, economic theory suggests that the MPC is positive, so that if we
wanted to test whether the MPC is equal to 0 we could choose an alternative hypothesis in
which the relevant parameter is greater than 0. To test a hypothesis like (8), we can use the t
test technique that we analysed in Section 1. We specify a test statistic and then compare this
value to the relevant critical value from an appropriate distribution table.
We know that β̂₂ ~ N( β₂, σ² / Σ (Xᵢ − X̄)² ). From this we can create a variable that has a standard
normal distribution as follows:

Z = (β̂₂ − β₂) / √( σ² / Σ (Xᵢ − X̄)² ) ~ N(0, 1) (9)
The parameter σ² in (9) is unknown and needs to be estimated. We are not going to show
how to derive such an estimator, so take as given that an unbiased estimator of the error
variance is

σ̂² = Σ_{i=1}^{n} ε̂ᵢ² / (n − k)

where k represents the number of unknown parameters in the regression model, so in this
case k = 2. On replacing σ² in (9) with its estimator σ̂², this changes the distribution of the
variable. We usually call the new variable t because it has a t distribution as shown here:
t = (β̂₂ − β₂) / √( σ̂² / Σ (Xᵢ − X̄)² ) ~ t_{n−2}

This statistic has a t_{n−2} distribution, where n − 2 is called the degrees of freedom. Now we
can state that, if the null hypothesis is true, i.e. that β₂ = β₂*, then

t = (β̂₂ − β₂*) / √( σ̂² / Σ (Xᵢ − X̄)² ) ~ t_{n−2}

or, written more compactly,

t = (β̂₂ − β₂*) / s.e.(β̂₂) ~ t_{n−2} (10)

where s.e.(β̂₂) denotes the standard error of the estimator and is the square root of the
variance of β̂₂.
To perform the test we therefore calculate the test statistic from (10) which must then be
compared to a critical value from the t distribution table. Following a set of decision rules we
can decide whether to reject or not reject the null hypothesis. On choosing a significance level
(usually 5% so α = 0.05), and given the degrees of freedom n − 2, the critical value is easily
found. The decision is made via the following rules:
• H₁: β₂ ≠ β₂*: if |t| > t_{n−2}^{α/2}, reject the null in favour of the alternative hypothesis;
• H₁: β₂ > β₂*: if t > t_{n−2}^{α}, reject the null in favour of the alternative hypothesis;
• H₁: β₂ < β₂*: if t < −t_{n−2}^{α}, reject the null in favour of the alternative hypothesis.
A test of special interest to econometricians is the test of significance. It is used so often that
statistical software packages automatically produce the test statistic alongside the coefficient
estimates for the parameters. It is called the test of significance because the null hypothesis
is that the parameter is equal to the specific value 0:

H₀: β₂ = 0 against H₁: β₂ ≠ 0

Under the null hypothesis the following test statistic has a t distribution

t = β̂₂ / s.e.(β̂₂) ~ t_{n−2}

If we cannot reject the null then we are effectively saying that β₂ = 0 and that the regression
model should actually be written as

Yᵢ = β₁ + εᵢ
and therefore variable X is not a significant determinant of Y, i.e. any change in the value of
variable X has no impact on variable Y. If we reject the null then X is a significant determinant
of Y and the model is
Yi = 1 + 2 X i + i .
Of course this test can also be performed on the other parameter β₁. In Section 3 we will
consider multiple regression models in which more X variables appear on the right-hand side
of the equation and this widens the scope for many more forms of hypothesis to be tested.
Confidence intervals for the regression parameters can be constructed in the same way as in
Section 1. Since

t = (β̂ᵢ − βᵢ) / s.e.(β̂ᵢ) ~ t_{n−2}

we can say with (1 − α)×100% confidence that the interval ( β̂ᵢ − t_c s.e.(β̂ᵢ), β̂ᵢ + t_c s.e.(β̂ᵢ) )
contains the true value for βᵢ.
Consider the following regression of consumption on income:

CONSₜ = β₁ + β₂INCₜ + εₜ

where the data are for the UK over the time period 1955 to 2010, i.e. t = 1955, ..., 2010, such
that we have 56 observations of data. The dependent variable is consumption, here denoted
CONS, and the regressor is income, denoted INC. The table of Eviews output is below. We have
not covered many of the statistics in this table yet, but let’s interpret as much as we can.
The table tells us what the dependent variable is, the estimation procedure, sample time span
and number of observations. Then we have 5 columns.
• The 1st tells us the name of each variable on the right hand side of the equation, in this
case C which is the constant term in the regression and INC.
• The 2nd contains the OLS parameter estimates. So here we have that β̂₁ = −4264.223 and
β̂₂ = 0.93844.
• The 3rd is the standard error of each estimate, e.g. s.e.(β̂₂) = 0.009947.
• The 4th is the t statistic of significance, i.e. the test of each parameter being equal to 0.
You should see that this column is the result of dividing the 2nd column by the 3rd column
ˆi
because t = ( ) . The critical value is roughly equal to 2, so these suggest that we reject
s .e. ˆi
the null that the variable is insignificant. Hence income is a statistically significant
determinant of consumption (it would have been surprising to get anything different).
• The 5th is the P value, which measures the probability, under the t distribution, of lying above the t statistic value. The values here imply that the t statistics are so large that the area above them in the t distribution is so small it cannot be read to 4 decimal places.
This again implies that the variables are significant. This column is useful because it avoids
the need to find the critical value in a distribution table. If you are doing a 5% two-sided
test then you compare this value to 0.05. If the P value is less than 0.05 then you reject
the null.
At the bottom of the table is the goodness of fit statistic, which is quite high at 0.994. This
suggests that the model fits the data well, even in a simple bivariate regression. This is because
income is a very important variable and the main determinant of how much we spend.
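As a rough illustration of how such a table of output is produced, the following Python sketch estimates a bivariate regression on simulated consumption and income series (the data are artificial, not the UK data described above) and prints the same columns: coefficients, standard errors, t statistics, P values and the goodness of fit.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 56
income = 300000 + 10000 * np.arange(n) + rng.normal(0, 5000, n)   # artificial income series
cons = -4000 + 0.9 * income + rng.normal(0, 8000, n)              # artificial consumption

X = sm.add_constant(income)          # adds the constant term C
results = sm.OLS(cons, X).fit()

print(results.params)    # column 2: coefficient estimates
print(results.bse)       # column 3: standard errors
print(results.tvalues)   # column 4: t statistics (coefficient / standard error)
print(results.pvalues)   # column 5: P values
print(results.rsquared)  # goodness of fit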
Section 3: Multiple Linear Regression

The multiple linear regression model extends the bivariate model to several explanatory variables and is written
$$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \cdots + \beta_k X_{ki} + \varepsilon_i, \quad \text{for } i = 1, \ldots, n \qquad (1)$$
where there now exist k-1 explanatory variables and k parameters to estimate. So what are
the differences between this model and the bivariate model, can we estimate the parameters
in the same way and do these estimates have the same properties as before? What about
hypothesis testing? In this section we will analyse these issues and point out where differences
lie between the multiple and the bivariate case.
Think of (1) as an extension to the bivariate model; we just have more factors to analyse. The term $\beta_1$ still represents the intercept or constant term, but the $\beta_j$ for $j = 2, \ldots, k$ are now interpreted as partial slope coefficients. This means that, for example, $\beta_2$ measures the change in the mean value of Y per unit change in $X_2$, ceteris paribus (whilst holding the values of the other explanatory variables constant). Or, we could say that given the other variables are in the model, the parameter $\beta_2$ measures the additional explanatory power of variable $X_2$. We can therefore analyse how much of the variation in Y is directly attributable to $X_2$, how much to $X_3$ etc.
To give a good example (given in Koop pg 44-46), we have a model that tries to explain how
house prices are determined. Let Y be house prices in £s, P is the size of the plot in square
feet, BD is the number of bedrooms, BT is the number of bathrooms and F is the number of
floors, so the regression is
$$Y_i = \beta_1 + \beta_2 P_i + \beta_3 BD_i + \beta_4 BT_i + \beta_5 F_i + \varepsilon_i$$
In this example you would expect all parameters to be estimated with positive values because
each house feature should increase the price of the house. Suppose that $\hat{\beta}_2 = 48.36$. How do
we interpret this value? Can we simply state that houses with bigger plots are worth more?
Well not strictly because there will be some exceptions, a derelict house on a large plot is
unlikely to be more expensive than a luxury house on a smaller plot. What we can say is that
for houses that are comparable in other respects the one on the bigger plot will be worth
more. Or more precisely for this example, an extra square foot raises the price of a house by
£48.36, ceteris paribus, or alternatively, for houses with the same number of bedrooms,
bathrooms and floors, an extra square foot will increase the price by £48.36.
How are the parameters of a multiple regression estimated? Consider OLS estimation of the three-variable model

$$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \varepsilon_i, \quad \text{for } i = 1, \ldots, n$$

for which we have the following minimisation problem, i.e. to minimise S where

$$S = \sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \sum_{i=1}^{n} \left( Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_{2i} - \hat{\beta}_3 X_{3i} \right)^2$$

Three derivatives need to be found, $\partial S / \partial \hat{\beta}_1$, $\partial S / \partial \hat{\beta}_2$ and $\partial S / \partial \hat{\beta}_3$. Each equation is set to zero and solved. The results are the following estimators:

$$\hat{\beta}_2 = \frac{\sum y_i x_{2i} \sum x_{3i}^2 - \sum y_i x_{3i} \sum x_{2i} x_{3i}}{\sum x_{2i}^2 \sum x_{3i}^2 - \left( \sum x_{2i} x_{3i} \right)^2}$$

$$\hat{\beta}_3 = \frac{\sum y_i x_{3i} \sum x_{2i}^2 - \sum y_i x_{2i} \sum x_{2i} x_{3i}}{\sum x_{2i}^2 \sum x_{3i}^2 - \left( \sum x_{2i} x_{3i} \right)^2}$$
where the lower case letters represent deviations from means, e.g. $x_{2i} = X_{2i} - \bar{X}_2$ and $y_i = Y_i - \bar{Y}$. You can
easily imagine how difficult this becomes when we add more variables into the model. We are
lucky that we have computer software packages that have these procedures programmed in
to them, so that we do not have to worry about calculating these equations by hand.
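As an illustration, the following sketch applies the deviation-from-means formulas above to simulated data (the variables and the true parameter values 2, 0.5 and -1.2 are made up) and recovers estimates close to those values.

import numpy as np

rng = np.random.default_rng(1)
n = 100
X2 = rng.normal(10, 2, n)
X3 = rng.normal(5, 1, n)
Y = 2 + 0.5 * X2 - 1.2 * X3 + rng.normal(0, 1, n)

y, x2, x3 = Y - Y.mean(), X2 - X2.mean(), X3 - X3.mean()   # deviations from means

den = (x2 ** 2).sum() * (x3 ** 2).sum() - (x2 * x3).sum() ** 2
b2 = ((y * x2).sum() * (x3 ** 2).sum() - (y * x3).sum() * (x2 * x3).sum()) / den
b3 = ((y * x3).sum() * (x2 ** 2).sum() - (y * x2).sum() * (x2 * x3).sum()) / den
b1 = Y.mean() - b2 * X2.mean() - b3 * X3.mean()            # intercept from the sample means

print(b1, b2, b3)   # should be close to 2, 0.5 and -1.2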
Now although the procedure for deriving the OLS estimators is the same here as in the
bivariate model, do these estimators still have the same properties of unbiasedness and
efficiency? That is, are they still B.L.U.E.? As with the 2-variable model, this depends upon a set of assumptions. The classical linear regression assumptions are still appropriate in the multiple regression model, but with minor modifications. Assumption CLRM2 needs to be modified to now hold for all explanatory variables in the model, so each regressor is non-stochastic. We
must also add a new assumption:
• CLRM7: No exact collinearity exists between any of the explanatory variables. This means
that there should not be an exact linear relationship between any regressors.
Under the original CLRM assumptions plus this extra one, the OLS estimators of the
parameters in multiple linear regression models are indeed B.L.U.E.
The reasoning behind this new assumption should be explained. Suppose we wish to estimate
the parameters in the regression model
$$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + \varepsilon_i$$
Let's say that one regressor in the model can be expressed as an exact linear function of
another, e.g. $X_{3i} = 5 - 3X_{2i}$. This would cause problems for the OLS estimation of some of the
parameters. This relationship implies that we can write the model as
$$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 (5 - 3X_{2i}) + \beta_4 X_{4i} + \varepsilon_i$$
$$Y_i = (\beta_1 + 5\beta_3) + (\beta_2 - 3\beta_3) X_{2i} + \beta_4 X_{4i} + \varepsilon_i$$
So the model that we are really estimating only has two regressors rather than three, i.e.
$$Y_i = \gamma_1 + \gamma_2 X_{2i} + \gamma_3 X_{4i} + \varepsilon_i$$
where $\gamma_1 = \beta_1 + 5\beta_3$, $\gamma_2 = \beta_2 - 3\beta_3$ and $\gamma_3 = \beta_4$. We can therefore estimate only the three $\gamma$ parameters, not the four $\beta$ parameters. We therefore cannot assess the individual effects here of $X_2$ or $X_3$ on $Y$.
Exact collinearity, also called pure multicollinearity, is an extreme case and rare. But it is often
the case that regressors can be highly (not exactly) correlated with each other, which itself
brings about estimation problems. This concept of multicollinearity will be explored in a bit
more depth later.
Under these assumptions the OLS estimators are unbiased, meaning that their sampling distributions have means equal to the true but unknown values, i.e. $E(\hat{\beta}_j) = \beta_j$. Consider the three-variable model again; the variances are much more complicated than they were in the bivariate model and are given by

$$\sigma_{\hat{\beta}_2}^2 = \text{var}(\hat{\beta}_2) = \frac{\sigma^2 \sum x_{3i}^2}{\sum x_{2i}^2 \sum x_{3i}^2 - \left( \sum x_{2i} x_{3i} \right)^2}$$

$$\sigma_{\hat{\beta}_3}^2 = \text{var}(\hat{\beta}_3) = \frac{\sigma^2 \sum x_{2i}^2}{\sum x_{2i}^2 \sum x_{3i}^2 - \left( \sum x_{2i} x_{3i} \right)^2}$$

Remember that we need to estimate the error variance $\sigma^2$ and in a three-variable model this is given by

$$\hat{\sigma}^2 = \frac{\sum \hat{\varepsilon}_i^2}{n-3}$$
Notice the change in the denominator of this estimator compared to its equivalent in the bivariate case, where we divided by $n-2$. In a general k-variable model like (1), this estimator is $\hat{\sigma}^2 = \sum \hat{\varepsilon}_i^2 / (n-k)$.
These variance estimators will also allow you to appreciate the complexities involved in
including more regressors in a model. This is why econometricians tend to analyse multiple
regressions using the matrix form of the model (this is not covered in this module).
One use of this statistic is as a way of helping to choose between different economic models,
i.e. between models with different variables on the right-hand side. So long as the models
have the same dependent variable, one would, in general, prefer the model with a higher R2
value (although on its own this is not enough to choose between models). However, there is
a problem with doing this. The problem with the R2 statistic is that it will always increase in
value when more explanatory variables are included. Therefore one should be wary of
comparing one model with another on the basis of their R2 values. Even if the variables that
you add to the model are not important or relevant to the economic theory, the R2 value will
always increase, making it look as if these variables are important in helping to describe the
variation in the dependent variable. For example, in our consumption function example, we
could run a regression of consumption on income, prices and interest rates, and then run a
regression of consumption on income, prices, interest rates and rainfall in the UK. You would
find that the second regression produces a higher R2 even though rainfall is unlikely to have
an effect on our consumption patterns.
So how can we properly compare two models with the same dependent variable but a
different number of explanatory variables? This can be done using the adjusted R2 often
denoted R 2 . This statistic essentially penalises the inclusion of more explanatory variables. The
statistic is calculated as follows

$$\bar{R}^2 = 1 - (1 - R^2) \frac{n-1}{n-k}$$
It is therefore more appropriate to compare two models with the same dependent variable
on the basis of their R 2 values than their R2 values. The value of the R 2 will increase only
when the extra variables added have something important to add to the analysis.
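A small sketch of the adjusted R-squared calculation, using hypothetical values, shows how the (n-1)/(n-k) term penalises an extra regressor that adds little explanatory power.

def adjusted_r2(r2, n, k):
    """R-bar-squared = 1 - (1 - R^2)(n - 1)/(n - k)."""
    return 1 - (1 - r2) * (n - 1) / (n - k)

# Hypothetical example: adding a third regressor raises R^2 only slightly,
# but the adjusted R^2 falls, signalling that the extra variable adds little.
print(adjusted_r2(0.900, n=50, k=3))   # model with 2 regressors
print(adjusted_r2(0.901, n=50, k=4))   # model with 3 regressors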
Hypothesis tests about a single parameter take the same form as before, for example

$$H_0: \beta_j = \beta_j^* \quad \text{against} \quad H_1: \beta_j \neq \beta_j^*$$
where the subscript j can be any number from 1 to k. The test is performed in exactly the same
way as in the bivariate model. From knowledge of the sampling distribution of the estimator
of this parameter we can calculate a t statistic and base our reject/not reject decision upon a
comparison of this statistic to a critical value from a t table. Therefore we calculate
$$t = \frac{\hat{\beta}_j - \beta_j^*}{s.e.(\hat{\beta}_j)} \sim t_{n-k}$$
Notice however that the degrees of freedom parameter has changed. This is because the
number of parameters that we must estimate before performing the test has increased. Also, in calculating $s.e.(\hat{\beta}_j)$, we must estimate $\sigma^2$, which is now done using the estimator $\hat{\sigma}^2 = \sum \hat{\varepsilon}_i^2 / (n-k)$. The critical value that we use is also dependent upon the form of the alternative
hypothesis, i.e. whether we are doing a one or a two-tailed test. In the example above we are
doing a two-tailed test. Note that each of the parameters can be tested in the same way.
The test of significance can still be applied to each parameter, i.e. the test that the parameter
is equal to 0. Suppose we are interested in testing whether 3 = 0 . What we are testing is
whether, given the presence of the other variables in the regression, variable X 3 has any
additional explanatory power. If we find that we cannot reject the null that 3 = 0 , then we
conclude that the variable X 3 is not a significant determinant of the dependent variable Y.
The test statistic in this case is $t = \hat{\beta}_3 / s.e.(\hat{\beta}_3)$.
The t statistic can also be used to test slightly more complicated forms of restriction.
Sometimes econometricians may want to test a linear combination of the parameters, for
example that the sum of parameters equals 1, or that one parameter is equal in value to
another etc. For example, suppose you want to test whether $\beta_2 = \beta_3$, i.e. that the additional explanatory power of variable $X_2$ is exactly the same as that of $X_3$. As an example, maybe you are interested in the factors that affect how much we earn, and you want to test whether the marginal impact on earnings of doing a degree is the same as that of doing on-the-job training.
Under the classical assumptions the OLS estimators are normally distributed,

$$\hat{\beta}_2 \sim N(\beta_2, \sigma_{\hat{\beta}_2}^2) \quad \text{and} \quad \hat{\beta}_3 \sim N(\beta_3, \sigma_{\hat{\beta}_3}^2)$$

then

$$\hat{\beta}_2 - \hat{\beta}_3 \sim N\!\left(\beta_2 - \beta_3,\ \sigma_{\hat{\beta}_2}^2 + \sigma_{\hat{\beta}_3}^2 - 2\sigma_{\hat{\beta}_2 \hat{\beta}_3}\right)$$

$$Z = \frac{\hat{\beta}_2 - \hat{\beta}_3 - (\beta_2 - \beta_3)}{\sqrt{\sigma_{\hat{\beta}_2}^2 + \sigma_{\hat{\beta}_3}^2 - 2\sigma_{\hat{\beta}_2 \hat{\beta}_3}}} \sim N(0,1)$$

where $\sigma_{\hat{\beta}_2 \hat{\beta}_3}$ denotes the covariance between the two estimators. On replacing the denominator with its estimated values we can say that, under the null where $\beta_2 - \beta_3 = 0$,

$$t = \frac{\hat{\beta}_2 - \hat{\beta}_3}{\sqrt{\hat{\sigma}_{\hat{\beta}_2}^2 + \hat{\sigma}_{\hat{\beta}_3}^2 - 2\hat{\sigma}_{\hat{\beta}_2 \hat{\beta}_3}}} \sim t_{n-k}$$

or more simply

$$t = \frac{\hat{\beta}_2 - \hat{\beta}_3}{s.e.(\hat{\beta}_2 - \hat{\beta}_3)} \sim t_{n-k}$$
This statistic is then compared to the appropriate critical value from a t table.
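The following sketch computes this t statistic from hypothetical estimates, variances and a covariance; in practice these quantities would be read from the estimated covariance matrix of the OLS estimators.

import numpy as np
from scipy import stats

b2_hat, b3_hat = 0.40, 0.25          # hypothetical estimates
var_b2, var_b3 = 0.010, 0.012        # hypothetical estimated variances
cov_b2_b3 = 0.004                    # hypothetical estimated covariance
n, k = 60, 4

se_diff = np.sqrt(var_b2 + var_b3 - 2 * cov_b2_b3)   # s.e.(beta2_hat - beta3_hat)
t_stat = (b2_hat - b3_hat) / se_diff
t_crit = stats.t.ppf(0.975, n - k)                   # 5% two-sided critical value

print(t_stat, t_crit)
print("reject H0" if abs(t_stat) > t_crit else "do not reject H0")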
Sometimes we want to test several restrictions jointly. Suppose, for example, that we wish to test

$$H_0: \beta_2 = \beta_4 = 0$$
$$H_1: \beta_2 \neq 0 \ \text{and/or}\ \beta_4 \neq 0$$

This is a test of the joint significance of $\beta_2$ and $\beta_4$. The test involves estimating two models, an unrestricted and a restricted model, and comparing the RSS (residual sum of squares), $\sum \hat{\varepsilon}_i^2$, from both. The unrestricted model is the original, in this case

$$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + \beta_5 X_{5i} + \cdots + \beta_k X_{ki} + \varepsilon_i$$

The restricted model is the one that results from imposing the restrictions under the null onto the unrestricted model. In this example the restrictions are that both $\beta_2$ and $\beta_4$ are equal to 0. On imposing these restrictions we get the model

$$Y_i = \beta_1 + \beta_3 X_{3i} + \beta_5 X_{5i} + \cdots + \beta_k X_{ki} + \varepsilon_i$$
The test statistic in this case is not a t but an F statistic, i.e. it has an F distribution. The statistic itself is calculated from the following equation

$$F = \frac{(RSS_R - RSS_U)/q}{RSS_U/(n-k)} \sim F_{q,\, n-k} \qquad (2)$$
where RSSR and RSSU are the residual sums of squares from the restricted and unrestricted
models respectively and q represents the number of restrictions that we are imposing, which
in the above case is 2 ( 2 = 0 and 4 = 0 ). As with the t testing procedures that we have
looked at, we must compare the test statistic with a critical value, this time from an F
distribution table, not a t distribution. The decision rules at the % level of significance are
as follows:3 If F Fq,n − k then reject the null hypothesis;
otherwise do not reject the null.
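A minimal sketch of this F test, using hypothetical residual sums of squares:

from scipy import stats

rss_u = 120.0        # RSS from the unrestricted model (hypothetical)
rss_r = 150.0        # RSS from the restricted model (hypothetical)
n, k, q = 80, 6, 2   # observations, parameters in unrestricted model, restrictions

F = ((rss_r - rss_u) / q) / (rss_u / (n - k))      # equation (2)
F_crit = stats.f.ppf(0.95, q, n - k)               # 5% critical value from F(q, n-k)

print(F, F_crit)
print("reject H0" if F > F_crit else "do not reject H0")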
[Figure: the F distribution, with the rejection region of size $\alpha$ lying above the critical value $F^c$ and the non-rejection region of size $1-\alpha$ below it.]
You may be wondering what the difference is between doing an F test of the restriction $\beta_2 = \beta_4 = 0$ and two separate t tests, one for $\beta_2 = 0$ and the other for $\beta_4 = 0$. In other words, what is the difference between testing the joint significance of $X_2$ and $X_4$ and testing their individual significance. The former tests whether both $\beta_2$ and $\beta_4$ are 0. The latter tests whether $\beta_2$ is 0, when $\beta_4$ is free to be whatever it is estimated to be, and vice versa. One
important feature is a consequence of multicollinearity. If the two regressors X 2 and X 4 are
highly correlated with each other, then individual t tests of the significance of each and a joint
F test of joint significance are likely to lead to different conclusions. The strong correlation of
the two variables means that when we do a t test on X 2 , given X 4 is already in the model,
X 2 is likely to look insignificant because its additional explanatory power above that of X 4 is
negligible. And vice versa, when we do a t test on X 4 , given X 2 is already in the model, X 4 is
likely to look insignificant because its additional explanatory power above that of X 2 is
negligible. Hence on the basis of t tests they both appear insignificant, and the econometrician
may exclude them from the model on that basis. However, an F test of joint significance may
suggest that at least one of them is significant. It may be that you are interested in how
interest rates affect a particular economic variable, maybe growth. There are many different
types of interest rate that you could choose from and maybe you include two of them. Interest
rates are clearly important in this setting and an F test of their joint significance will reflect
this. But because the two interest rate variables are likely to be highly correlated, they will
both look insignificant on the basis of t tests. The presence of one interest rate variable means
that the additional explanatory power of the other is minimal. The solution here would be to
include only one interest rate variable, the second is redundant given it just mimics the
movements of the first.
Now let’s consider a more complicated set of restrictions. Suppose we wish to test
$$H_0: \beta_2 + \beta_3 = 1 \ \text{and}\ \beta_4 = \beta_5$$
$$H_1: \beta_2 + \beta_3 \neq 1 \ \text{and/or}\ \beta_4 \neq \beta_5$$

from regression model (1). We again perform the test using an F statistic and a comparison of restricted and unrestricted models as we did above. The problem now is that the restrictions are not as simple to impose as the previous set, where we were simply setting parameters equal to zero. Here, to get the restricted model, we need to either replace $\beta_4$ with $\beta_5$ or replace $\beta_5$ with $\beta_4$, and we need to either replace $\beta_2$ with $1 - \beta_3$ or $\beta_3$ with $1 - \beta_2$. All would be correct and produce the same result. Let's do the latter combination of each restriction, so that we get

$$Y_i = \beta_1 + \beta_2 X_{2i} + (1 - \beta_2) X_{3i} + \beta_4 X_{4i} + \beta_4 X_{5i} + \cdots + \beta_k X_{ki} + \varepsilon_i$$
$$Y_i - X_{3i} = \beta_1 + \beta_2 (X_{2i} - X_{3i}) + \beta_4 (X_{4i} + X_{5i}) + \cdots + \beta_k X_{ki} + \varepsilon_i$$

so that the restricted model that we should estimate has the form

$$Z_i = \beta_1 + \beta_2 W_{1i} + \beta_4 W_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$$

where $Z_i = Y_i - X_{3i}$, $W_{1i} = X_{2i} - X_{3i}$ and $W_{2i} = X_{4i} + X_{5i}$.
A joint test of special interest is the test of the overall significance of the regression:

$$H_0: \beta_2 = \beta_3 = \cdots = \beta_k = 0$$
$$H_1: \text{at least one } \beta_j \neq 0$$
This is a test that all k-1 slope parameters (doesn’t include the constant parameter) are jointly
equal to 0, i.e. that none of the variables in the model are important in determining Y. The test
is performed in the same way, using an F statistic for which the restricted and unrestricted
models are compared. The restricted model in this case has the form
$$Y_i = \beta_1 + \varepsilon_i$$

The test statistic is the same as above with $q = k - 1$, so that it has a $F_{k-1,\,n-k}$ distribution.
An alternative way of performing the test of overall significance is to use the coefficient of determination. Note that $\beta_2 = \beta_3 = \cdots = \beta_k = 0$ is equivalent to stating that $R^2 = 0$. So we are also testing

$$H_0: R^2 = 0$$
$$H_1: R^2 > 0$$

$$F = \frac{R^2/(k-1)}{(1 - R^2)/(n-k)} \sim F_{k-1,\,n-k}$$
It is important to note that this particular test statistic involving the R2 is only relevant when
testing overall significance of the regression. It would be wrong to use it to test, for example,
the null that we looked at before, $\beta_2 = 0$ and $\beta_4 = 0$. We have to use the statistic that involves
the RSS in this case.
As a worked example, suppose we model food expenditure Y as a function of personal income Z, taxes T and prices P. Because the data are observed over time (annually in fact), we will use the subscript t for time rather than i. The model is therefore

$$Y_t = \beta_1 + \beta_2 Z_t + \beta_3 T_t + \beta_4 P_t + \varepsilon_t$$
The estimated regression with the sample of size 25 annual observations (1991-2015) is
The values in round parentheses are standard errors and those in square brackets are t ratios
of the null that the relevant parameter equals 0. You should note that the coefficient divided
by the standard error gives the t ratio. Notice that all variables are statistically significant at
the 5% level and the R2 is close to 1. These are good results.
An alternative theory is that food expenditure depends only upon income. If this is the case
our model is
$$Y_t = \beta_1 + \beta_2 Z_t + \varepsilon_t$$
This model is a restricted version of the first, with the restrictions $\beta_3 = \beta_4 = 0$ imposed. Estimating both models gives $RSS_U = 65.380$ and $RSS_R = 188.405$. We can therefore formally test the hypothesis

$$H_0: \beta_3 = \beta_4 = 0$$
$$H_1: \text{at least one} \neq 0$$
$$F = \frac{(188.405 - 65.380)/2}{65.380/21} = 19.76 > F_{2,21}^{0.05} = 3.47$$
This means that we reject the null hypothesis. The alternative theory is therefore invalid.
IMPORTANT: Notice that the estimated coefficients for $\beta_1$ and $\beta_2$ in the two models are very different from each other. The reason for this is that in the second model, important variables have been excluded. This has led to a bias in the remaining estimated parameters, i.e. the estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ in the second model no longer have the unbiasedness property because of the omission of T and P.
Notice from the first estimated regression that the parameters on Z and T are very similar in size but opposite in sign, i.e. $\hat{\beta}_2 \approx -\hat{\beta}_3$ (0.113 against $-0.115$). There may be some economic reasoning for this, so let's formally test the hypothesis

$$H_0: \beta_2 + \beta_3 = 0 \quad \text{against} \quad H_1: \beta_2 + \beta_3 \neq 0$$

using the F statistic (although this could be done using a t test where the t statistic would be $t = (\hat{\beta}_2 + \hat{\beta}_3)\big/\sqrt{\hat{\sigma}_{\hat{\beta}_2}^2 + \hat{\sigma}_{\hat{\beta}_3}^2 + 2\hat{\sigma}_{\hat{\beta}_2\hat{\beta}_3}} \sim t_{n-k}$). To get the restricted model we need to impose the restriction $\beta_3 = -\beta_2$ onto the unrestricted model:

$$Y_t = \beta_1 + \beta_2 Z_t - \beta_2 T_t + \beta_4 P_t + \varepsilon_t$$
$$Y_t = \beta_1 + \beta_2 (Z_t - T_t) + \beta_4 P_t + \varepsilon_t$$
$$F = \frac{(65.398 - 65.380)/1}{65.380/21} = 0.01 < F_{1,21}^{0.05} = 4.3$$
We therefore do not reject the null here, which implies that instead of including the variables
Z and T separately, we should include them as Z-T. What is the economic rationale for this
variable transformation? Well the term Z-T is personal income less tax, i.e. personal disposable
income. Notice also that the number of restrictions here is 1 not 2. Although two
parameters are involved, there is only one restriction.
3.5 Multicollinearity
Multicollinearity is a problem that can occur with multiple regressions. We mentioned it at
the start of this section and discussed the concept of collinearity or pure multicollinearity. This
is the extreme situation in which an explanatory variable is exactly related to other
explanatory variables in the model. We looked at the consequences of this, showing that we
cannot estimate the parameters in the model, only combinations of them. In this section we
want to investigate the effects on estimation when explanatory variables are highly
correlated, but not exactly correlated.
• With multicollinear variables, the OLS estimators are still unbiased. It is important to
remember however, that unbiasedness is what we call a repeated sample property. It
means that if we had many samples of data then the estimators would, on average, estimate the correct but unknown values, i.e. $E(\hat{\beta}_j) = \beta_j$. In economics, we only ever have
the chance to work with one sample of real data for any given empirical problem. So we
get just one estimate for each parameter.
• The presence of multicollinearity does not affect the property of minimum variance for
the OLS estimators, i.e. they still have minimum variance amongst all linear unbiased
estimators. But just because they provide the smallest variances, does not imply that they
will provide small variances, they could still be the smallest and yet quite large. The larger
the variance the less precise the estimator.
• We say that multicollinearity is a sample phenomenon. This means that even if economic
theory does not suggest that two variables will be highly correlated, in the particular
sample of data that we have, the values may give a high correlation coefficient. This can
happen with time series data, where variables tend to increase in value over time, which
can make the variables look as though they are related, but there is no economic reason
why they should be.
• The main practical consequence is that the variances, and hence the standard errors, of the estimators of the collinear variables' parameters tend to be large. Large standard errors mean small t statistics, so we are likely to be unable to reject the null hypotheses that the parameters are equal to 0 and hence conclude that variables are insignificant.
• Even though you may find several insignificant variables in the model, the R2 value will
still be high, suggesting that the model is significant and that the regression model fits the
data well. The t statistics and R2 therefore seem to contradict each other.
• You may find the coefficients are estimated to have values that are the wrong sign to what
the theory predicts.
• If some of the data values were to change by a small amount, the OLS estimates would
change considerably. This means that the estimators are sensitive to the data and are said
to be unstable.
How can we detect multicollinearity? There are several informal checks:
• We could look at the sample correlation coefficients between the explanatory variables, one pair at a time, so if you had three explanatory variables in your model (call them X, Y and Z) you could check the correlation between (X, Y), (X, Z) and (Y, Z).
• We could run additional regressions where we take one explanatory variable and regress
it on the other Xs, and we do this for each of the explanatory variables. So if your model
contains three X variables, you run three extra regressions, i.e.
$$X_{1i} = \alpha_1 + \alpha_2 X_{2i} + \alpha_3 X_{3i} + v_{1i}$$
$$X_{2i} = \alpha_1 + \alpha_2 X_{1i} + \alpha_3 X_{3i} + v_{2i}$$
$$X_{3i} = \alpha_1 + \alpha_2 X_{1i} + \alpha_3 X_{2i} + v_{3i}$$
These regressions will tell you which variables are related to the rest, through the size and statistical significance of the $R^2$ value from each of the estimated regressions (both of these checks are illustrated in the sketch at the end of this list).
• You could consider dropping one or more of the problem variables, i.e. some of those that
are collinear. This may get rid of a multicollinearity problem, but unfortunately it could
cause another problem. This is due to that fact that in formulating our econometric models
we include variables that economic theory states to be important. By excluding variables
from the equation we may be mis-specifying the model. This in itself causes the estimates
of the remaining variables to be biased. So, you're stuck between a rock and a hard place!
• Given that multicollinearity is a sample problem, it may be eradicated if we use a different
sample of data. Of course, we are likely to be very restricted in this respect as good data
can be hard to come by. But if it is possible to increase the sample size by either increasing
the number of years in the sample or, in the case of cross-sectional data, including more
individuals or countries in the analysis, then this could reduce the scale of the problem.
• It is possible that changing the functional form of the model could help, e.g. there may be
multicollinearity present in a log-linear model that does not appear in a purely linear form.
• If the empirical study that you are interested in is the focus of previous literature, then it
may be possible to use results from these studies to help with a multicollinearity problem.
For example, if empirical studies have already been done in your chosen area of research
then you could use the relevant estimated values from them. By replacing the coefficients
of the problem variables with these previously estimated values, it should be possible to
estimate the remaining parameters in your model with more precision. The problem of
course is that the information you utilise may itself be incorrect. It may hold for the sample of observations used in that study but not be relevant for yours.
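Here is a sketch of the first two diagnostics in the list above (pairwise correlations and auxiliary regressions), applied to simulated data in which two regressors are deliberately made almost collinear. The variable names are illustrative only.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100
X1 = rng.normal(size=n)
X2 = 0.95 * X1 + 0.05 * rng.normal(size=n)   # X2 is almost a copy of X1
X3 = rng.normal(size=n)
X = np.column_stack([X1, X2, X3])

print(np.corrcoef(X, rowvar=False))          # pairwise correlation matrix of the regressors

# Auxiliary regressions: regress each X on the remaining Xs and inspect the R^2.
for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    r2 = sm.OLS(X[:, j], sm.add_constant(others)).fit().rsquared
    print(f"R^2 from regressing X{j+1} on the other regressors: {r2:.3f}")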
Another issue is the choice of functional form. OLS could, for example, be used to estimate

$$Y_i = \beta_1 + \beta_2 X_{1i} + \beta_3 X_{1i}^2 + \beta_4 X_{1i} X_{2i} + \varepsilon_i$$
because it is still linear in parameters, even though it is not linear in the variables. In fact, OLS
could be used to estimate the parameters in, for example
$$Y_i = \beta_1 + \beta_2 f_1(X_{1i}, X_{2i}) + \beta_3 f_2(X_{1i}, X_{2i}) + \varepsilon_i$$
where $f_1(\cdot)$ and $f_2(\cdot)$ are any non-linear functions of the variables. However, OLS could not be used to estimate the parameters directly from, for example,

$$Y_i = \beta_1 X_{1i}^{\beta_2} X_{2i}^{\beta_3} \varepsilon_i \qquad (3)$$

or

$$Y_i = \beta_1 e^{\beta_2 X_{1i} + \beta_3 X_{2i} + \varepsilon_i} \qquad (4)$$

because these are non-linear in the parameters. These models can however be transformed in such a way that OLS does become applicable. The transformation involves taking logarithms of the variables. It suffices to note the following properties of natural logs:
1. $\ln e = 1$
2. $\ln(AB) = \ln A + \ln B$
3. $\ln(A^b) = b \ln A$
4. $d \ln Y = dY/Y$
Using the second and third rules here allows us to re-write (3) as
$$\ln Y_i = \alpha + \beta_2 \ln X_{1i} + \beta_3 \ln X_{2i} + \ln \varepsilon_i \qquad (5)$$
and (4) as
$$\ln Y_i = \alpha + \beta_2 X_{1i} + \beta_3 X_{2i} + \varepsilon_i \qquad (6)$$

(where $\alpha = \ln \beta_1$) and both equations are now linear in the parameters. Model (5) is called a log-linear model because it is linear in the logarithms of all the variables. Model (6) is called a semi-log model because only the dependent variable has been log transformed. So long as the disturbance terms in (5) and (6) satisfy the classical assumptions, the OLS estimators of $\alpha$ (from which one can obtain $\beta_1 = e^{\alpha}$), $\beta_2$ and $\beta_3$ are B.L.U.E.
A feature of model (5) that is often taken advantage of by applied economists is that the slope
parameters are interpreted as elasticities. To see this we make use of the fourth log rule. The
parameter $\beta_2$ in (5) is

$$\beta_2 = \frac{d \ln Y}{d \ln X_1} = \frac{dY/Y}{dX_1/X_1} = \frac{dY}{dX_1} \cdot \frac{X_1}{Y}$$
This is the definition of an elasticity. This model is often called the constant elasticity model
because the $\beta_2$ and $\beta_3$ parameters are constant. Going back to our consumption function example, an alternative functional form could be given by $Y_i = \beta_1 X_i^{\beta_2} \varepsilon_i$, which would be transformed to $\ln Y_i = \alpha + \beta_2 \ln X_i + u_i$, where Y is consumption, X is income and the $\beta_2$
parameter is now interpreted as the income elasticity (this is the specification of the
consumption function in Computer Practical 3). Remember though that in the linear
consumption function 2 was the MPC. So depending upon how the model is formulated, the
interpretation of the parameters is different.
To decide which functional form is better, i.e. linear or log-linear, might be a matter of
economic theory dictating the correct form, or a matter of empirics. We could firstly plot the
data on X against the data on Y to see what the scatter of observations looks like. This can only
give a rough idea of the relationship, but obviously if the observations seem to follow a curve
rather than a straight line, a linear regression is inappropriate. You are not going to get
accurate results if you try to fit a straight line through a set of data that do not follow a straight
line. If however you transform X and Y into logarithms and plot x = ln X against y = ln Y , and
find that the scatter follows a straight line, then maybe the log-linear model is appropriate. Choosing
on the basis of the highest R2 value is a bad idea. You should not compare R2 or R 2 values
for models that have different dependent variables. This is the case here because one model
has dependent variable Y and the other has dependent variable lnY .
Section 4: Classical Assumption Violations
The classical assumptions are actually rather strong in some contexts. In models in which the
data are observed over time, we find that the errors tend to be autocorrelated, violating
CLRM5. In models where the data are cross-sectional, we find that the errors are often
heteroskedastic, violating CLRM4.
For example, suppose that the variable follows a first-order autoregressive process (AR(1)).
This is where the variable is modelled as a function of itself from the previous time period,
$$X_t = \rho X_{t-1} + \varepsilon_t$$

where $-1 < \rho < 1$ and $\rho$ is a parameter called the autocorrelation coefficient. Let's first look at the variance of this variable:

$$\text{var}(X_t) = \text{var}(\rho X_{t-1} + \varepsilon_t) = \rho^2\, \text{var}(X_{t-1}) + \sigma^2 \quad \Rightarrow \quad \text{var}(X_t) = \frac{\sigma^2}{1 - \rho^2}$$
Note: this proof has assumed that X t is homoskedastic, i.e. the variance is the same no matter
what time period we are in ( var ( X t ) = var ( X t −1 ) ). Now let’s look at the covariance between
the variable and its first lag:
$$\text{cov}(X_t, X_{t-1}) = E(X_t X_{t-1}) = E\big((\rho X_{t-1} + \varepsilon_t) X_{t-1}\big) = E(\rho X_{t-1}^2 + \varepsilon_t X_{t-1}) = \frac{\rho \sigma^2}{1 - \rho^2}$$

So, there is a non-zero first-order autocorrelation. We can also see that there is a non-zero second-order autocorrelation:

$$\text{cov}(X_t, X_{t-2}) = E(X_t X_{t-2}) = E\big((\rho X_{t-1} + \varepsilon_t) X_{t-2}\big) = E\big((\rho(\rho X_{t-2} + \varepsilon_{t-1}) + \varepsilon_t) X_{t-2}\big)$$
$$= E(\rho^2 X_{t-2}^2 + \rho \varepsilon_{t-1} X_{t-2} + \varepsilon_t X_{t-2}) = \frac{\rho^2 \sigma^2}{1 - \rho^2}$$

and in general

$$\text{cov}(X_t, X_{t-j}) = E(X_t X_{t-j}) = \frac{\rho^j \sigma^2}{1 - \rho^2}$$

Given that the autocorrelation coefficient is less than 1 in absolute value, $\rho^j \to 0$ as $j \to \infty$. Hence as we look at autocorrelations further and further into the past, the
correlation gets smaller. This makes sense - a variable is more likely to be correlated with itself
in the near past than the distant past.
The following are plots of a simulated process X that I generated according to the AR(1) above, to exhibit no autocorrelation ($\rho = 0$), positive autocorrelation ($\rho > 0$) and negative autocorrelation ($\rho < 0$). When $\rho = 0$, the plot shows a fairly random series with no discernible pattern or predictability.
When $\rho = 0.9$ there is strong positive autocorrelation. Here, the pattern is smoother, with runs of positive and negative values, and looks a bit like a cycle. With $\rho < 0$ the series instead tends to flip sign from one period to the next, giving a jagged, zig-zag pattern.
This sub-section has shown you what autocorrelated variables in general look like. For
econometricians we are not particularly worried about autocorrelation in our variables, unless
the autocorrelation arises in the error terms of our regression models. So if the X variable
above is a regressor in a model then the autocorrelation is not a problem. If it is the
disturbance from a regression model then this is a concern. The rest of this section therefore
concentrates on autocorrelated errors.
Consider the regression model

$$Y_t = \beta_1 + \beta_2 X_{2t} + \cdots + \beta_k X_{kt} + \varepsilon_t \qquad (1)$$

Previously we assumed that the error $\varepsilon$ satisfied the classical assumptions and we showed that the OLS estimators of $\beta_1, \ldots, \beta_k$ are BLUE. Now let's assume that $\varepsilon$ satisfies all of the classical assumptions except for the "no autocorrelation" assumption (CLRM5 in these notes) that states that $\text{cov}(\varepsilon_t, \varepsilon_s) = 0$ for all $t \neq s$. Hence the error term now violates this
assumption, so that
$$\text{cov}(\varepsilon_t, \varepsilon_s) \neq 0 \quad \text{for } t \neq s$$
This means that the error is correlated with itself in different time periods. Here we are only
going to consider what we call first-order autocorrelation, which arises when the error term
can be written as an AR(1) process, i.e.
$$\varepsilon_t = \rho \varepsilon_{t-1} + u_t \qquad (2)$$

where $\rho$ is the autocorrelation coefficient and $u_t$ is itself an error term (we assume that the
error u satisfies all of the classical assumptions). Using the information from section 4.1.1, we
know that

$$\text{var}(\varepsilon_t) = \frac{\sigma_u^2}{1 - \rho^2}, \qquad \text{cov}(\varepsilon_t, \varepsilon_{t-1}) = \frac{\rho \sigma_u^2}{1 - \rho^2}, \qquad \text{cov}(\varepsilon_t, \varepsilon_{t-2}) = \frac{\rho^2 \sigma_u^2}{1 - \rho^2}$$

and in general $\text{cov}(\varepsilon_t, \varepsilon_{t-j}) = \rho^j \sigma_u^2 / (1 - \rho^2)$.
Unless $\rho = 0$, the classical assumption is clearly violated here. We know that in order for OLS estimators to be B.L.U.E, autocorrelation must not exist in $\varepsilon_t$. So we are interested in understanding:
(i) the problems that autocorrelation creates for the properties of OLS estimators,
(ii) ways in which we can test for the presence of autocorrelation and
(iii) what remedial measures we can take when autocorrelation is present.
We will assume here that this is the only assumption to be violated, i.e. that we have
autocorrelation but everything else is OK.
• The OLS estimators still have the good property of being unbiased. We know this because
when we proved the OLS estimators were unbiased back in Section 2, we did not need to
use the “no autocorrelation” assumption, i.e. it doesn’t matter whether the errors are
autocorrelated or not, OLS will still be unbiased.
• When the errors are AR(1), the equations for the variances of the OLS estimators are
incorrect. For example, look at the derivation of $\text{var}(\hat{\beta}_2)$ in Section 2. Notice that it
includes the term $\text{var}\!\left(\sum (X_t - \bar{X}) \varepsilon_t\right)$, which we condensed to $\sigma^2 \sum (X_t - \bar{X})^2$. But this could only be done because we were under the classical assumptions and could impose homoskedastic and non-autocorrelated errors. In this section we can still impose homoskedasticity but we have autocorrelation. This means that

$$\text{var}\!\left(\sum (X_t - \bar{X}) \varepsilon_t\right) = \text{var}\big((X_1 - \bar{X})\varepsilon_1 + (X_2 - \bar{X})\varepsilon_2 + \cdots + (X_n - \bar{X})\varepsilon_n\big)$$
$$= E\big((X_1 - \bar{X})\varepsilon_1 + (X_2 - \bar{X})\varepsilon_2 + \cdots + (X_n - \bar{X})\varepsilon_n\big)^2$$
$$= E\big((X_1 - \bar{X})^2\varepsilon_1^2\big) + E\big((X_2 - \bar{X})^2\varepsilon_2^2\big) + \cdots + E\big((X_n - \bar{X})^2\varepsilon_n^2\big) + E\big((X_1 - \bar{X})(X_2 - \bar{X})\varepsilon_1\varepsilon_2\big) + \cdots$$
$$+ E\big((X_1 - \bar{X})(X_n - \bar{X})\varepsilon_1\varepsilon_n\big) + E\big((X_2 - \bar{X})(X_1 - \bar{X})\varepsilon_2\varepsilon_1\big) + \cdots + E\big((X_2 - \bar{X})(X_n - \bar{X})\varepsilon_2\varepsilon_n\big) + \cdots$$

It is the cross-product terms that make this different, and this is a consequence of the autocorrelated errors. The OLS variance can be shown to be

$$\text{var}(\hat{\beta}_2) = \frac{\sigma^2}{\sum (X_t - \bar{X})^2}\Big[1 + 2\rho (X_1 - \bar{X})(X_2 - \bar{X}) + 2\rho^2 (X_1 - \bar{X})(X_3 - \bar{X}) + \cdots \Big]$$
There are many terms inside this bracket that I have not defined. But it serves to show that
if we use the variance estimator from Section 2 in the situation when the errors are AR(1),
we are excluding the whole of the bracketed term and would hence get the wrong
estimated variance.
Remember that when we do hypothesis testing, say a simple t-test, this uses the standard
error of the estimate (square root of its variance) in the denominator. If this standard error
is estimated incorrectly then the t-test will be wrong and we may make incorrect decisions
about whether our variables are significant or not. So, if we use OLS and ignore
autocorrelation in the errors we could seriously affect the inferences we make about the
parameters and variables in our models.
• Even if we use the OLS estimator and the correctly adjusted variance above, OLS is still not
the best estimator as it does not have the smallest variance. There is another estimator,
called a Generalised Least Squares or GLS estimator that provides a smaller variance than
OLS when the errors are autocorrelated. We will discuss this estimator shortly. So, OLS is
inefficient, it is no longer the best, i.e. no longer BLUE.
• Durbin-Watson (DW) test
The DW test assumes that the error follows the AR(1) process

$$\varepsilon_t = \rho \varepsilon_{t-1} + u_t$$

The null hypothesis for this test is that there is no autocorrelation, which in the above equation means that $\rho = 0$. Hence the DW test is a test of the following

$$H_0: \rho = 0$$
$$H_1: \rho \neq 0$$
The test statistic is

$$DW = \frac{\sum_{t=2}^{n} (\hat{\varepsilon}_t - \hat{\varepsilon}_{t-1})^2}{\sum_{t=1}^{n} \hat{\varepsilon}_t^2} \approx 2(1 - \hat{\rho})$$

where the $\hat{\varepsilon}_t$ terms are the residuals obtained from the OLS estimation of the linear regression model of interest, and $\hat{\rho}$ is the estimated value of the coefficient of autocorrelation. Most
econometric software packages automatically compute the DW statistic for you. Once you
have the statistic, you need to know what to do with it, i.e. how to use it to decide whether
you reject or do not reject the null hypothesis. This is similar to the testing procedures that
we have looked at before, where we compare the test statistic to a critical value. Here we use
a DW distribution table. The difference with this particular test is that we have two critical
values, which we denote d L and dU , where the L and U represent the Lower and Upper values
that we read from the table. We base our decision on where the DW statistic falls on the
following line (it can only take a value between 0 and 4):
The regions are interpreted as follows: if $0 < DW < d_L$ there is evidence of positive autocorrelation (reject $H_0$); if $d_L \leq DW \leq d_U$ the test is inconclusive; if $d_U < DW < 4 - d_U$ we do not reject $H_0$; if $4 - d_U \leq DW \leq 4 - d_L$ the test is again inconclusive; and if $4 - d_L < DW < 4$ there is evidence of negative autocorrelation (reject $H_0$).
o The inconclusive regions mean that we cannot make a decision about whether there is any
autocorrelation if the DW statistic lies in this region. In most cases these regions will be
small.
o It tests only for 1st order autocorrelation, i.e. where the error term is written as a first order
autoregressive process as in (2). However, there are many different ways that the error
could be represented and still exhibit autocorrelation. For example, it could have 2nd order
autocorrelation, in which we could write the error as a second order autoregression, AR(2),
$\varepsilon_t = \rho_1 \varepsilon_{t-1} + \rho_2 \varepsilon_{t-2} + u_t$. If the error does in fact take this form, the DW is not an appropriate
test.
o The test is invalid if one of the regressors in our regression model is the lagged dependent
variable, Yt −1 . In this case we would need to use Durbin’s h statistic.
• Durbin’s h test
If the model that you are estimating contains the lag of the dependent variable, as below
$$Y_t = \beta_1 + \beta_2 X_{2t} + \cdots + \beta_k X_{kt} + \gamma Y_{t-1} + \varepsilon_t$$
then the DW statistic is not appropriate. As yet we haven't come across the use of lagged
variables or the concept of including a lag of the dependent on the right-hand side of the
equation. This form of dynamic model is very important in time series econometrics and you
will consider such models in the latter weeks of this module.
The DW test in these circumstances might find evidence of no autocorrelation when in fact
there is autocorrelation present in the errors, which is obviously no good to us. The test that
we should use here is the Durbin's h statistic. The null hypothesis that we are testing is the
same as in the DW test but the statistic is calculated from
$$h = \left(1 - \frac{DW}{2}\right)\sqrt{\frac{n}{1 - n\,\text{var}(\hat{\gamma})}} = \hat{\rho}\sqrt{\frac{n}{1 - n\,\text{var}(\hat{\gamma})}} \sim N(0,1)$$

where n is the sample size and $\text{var}(\hat{\gamma})$ is the variance of the OLS estimator of the parameter
on the lagged dependent variable. It does not matter how many regressors or how many lags
of the dependent variable that we include in the model, the h statistic is the same. Hence we
always use the variance of the coefficient on the first lag of Y. Because this statistic has a
standard normal distribution, the critical value that we use to compare to the value for h is
taken from a standard normal table. If $|h| > h_c$ then you would reject the null hypothesis of no autocorrelation.
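The following sketch computes both the DW statistic and Durbin's h from the residuals of a simulated dynamic regression. All of the data are artificial, and calling the lag coefficient gamma simply mirrors the notation used above.

import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 1 + 0.5 * x[t] + 0.4 * y[t - 1] + rng.normal(scale=0.5)

# OLS of y_t on a constant, x_t and y_{t-1}
X = np.column_stack([np.ones(n - 1), x[1:], y[:-1]])
Y = y[1:]
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ beta

dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)          # Durbin-Watson statistic
rho_hat = 1 - dw / 2

sigma2_hat = np.sum(resid ** 2) / (len(Y) - X.shape[1])
var_gamma = (sigma2_hat * np.linalg.inv(X.T @ X))[2, 2]        # var of coefficient on y_{t-1}
T = len(Y)
h = rho_hat * np.sqrt(T / (1 - T * var_gamma))                 # Durbin's h (needs T*var < 1)

print("DW =", round(dw, 3), " h =", round(h, 3))               # compare |h| to 1.96 at 5%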
Now let's consider remedial measures, starting with GLS. Suppose that our model is

$$Y_t = \beta_1 + \beta_2 X_t + \varepsilon_t \qquad (3)$$

and we have performed the appropriate tests, which show that the errors are autocorrelated and that they can be written as $\varepsilon_t = \rho \varepsilon_{t-1} + u_t$. We assume here that the error $u_t$ satisfies all
of the classical assumptions. To use GLS, we firstly transform the model by doing the following:
$$Y_t - \rho Y_{t-1} = \beta_1 (1 - \rho) + \beta_2 (X_t - \rho X_{t-1}) + \varepsilon_t - \rho \varepsilon_{t-1}$$

or

$$Y_t^* = \beta_1^* + \beta_2 X_t^* + u_t$$

where $Y_t^*$ and $X_t^*$ are the quasi-differences, $\beta_1^* = \beta_1 (1 - \rho)$ and the new error term is $u_t = \varepsilon_t - \rho \varepsilon_{t-1}$.
The crucial thing to notice here is that the error term in the transformed model is $u_t$, which satisfies the classical assumptions (given our original AR(1) error $\varepsilon_t = \rho \varepsilon_{t-1} + u_t$). Hence if we make these transformations to our data on Y and X, we can estimate the transformed model using OLS, because we have not violated the assumptions. Hence the OLS estimates of $\beta_1^*$ and $\beta_2$ will be B.L.U.E. We can determine $\beta_1$ once we know $\beta_1^*$ (and $\rho$). Note that the procedure can be
used in multiple regression models with autocorrelated errors, but we must remember to take
quasi-differences of all of the explanatory variables.
In summary, GLS is essentially OLS applied to a model written in quasi-differences in which the
errors are not autocorrelated. You will notice however that to perform GLS estimation, you need to know the value of $\rho$ so that the data can be transformed into the quasi-differences. In most practical situations this value is unknown. We therefore need a method that is similar to GLS in set-up but has an added step in which $\rho$ is estimated as well. This is the Cochrane-Orcutt iterative procedure:
1. Estimate the parameters of the original model (3) using OLS, so that the estimated model is $\hat{Y}_t = \hat{\beta}_1 + \hat{\beta}_2 X_t$, with residuals $\hat{\varepsilon}_t = Y_t - \hat{Y}_t$.
2. Using the observations on the residuals, use OLS to estimate the parameter $\rho$ in the AR model

$$\hat{\varepsilon}_t = \rho \hat{\varepsilon}_{t-1} + u_t$$
3. Use this estimated value $\hat{\rho}$ to transform the variables into their quasi-differences, $Y_t^* = Y_t - \hat{\rho} Y_{t-1}$ and $X_t^* = X_t - \hat{\rho} X_{t-1}$.
4. Estimate by OLS the transformed model

$$Y_t^* = \beta_1^* + \beta_2 X_t^* + u_t$$

where $u_t$ satisfies the classical assumptions.
5. Use the residuals from this model to repeat step 2, so that we get another (second) estimate of $\rho$, denoted $\hat{\hat{\rho}}$.
6. Repeat step 3 using $\hat{\hat{\rho}}$, so that $Y_t^{**} = Y_t - \hat{\hat{\rho}} Y_{t-1}$ etc., and use OLS again to estimate the parameters in a model that regresses $Y_t^{**}$ on $X_t^{**}$.
7. Continue repeating these steps until the consecutive estimates for the model parameters and $\rho$ change by a very small amount. When this happens we say that the estimates have converged.
This is why the procedure is said to be iterative, because we repeat the steps until
convergence. Once convergence has occurred, these are the final estimates of our unknown
parameters.
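A sketch of the iterative procedure on simulated data with an AR(1) error (the true values rho = 0.7, beta_1 = 1 and beta_2 = 2 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.normal(size=n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.7 * eps[t - 1] + rng.normal(scale=0.5)
y = 1.0 + 2.0 * x + eps

def ols(X, Y):
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return b

beta = ols(np.column_stack([np.ones(n), x]), y)        # step 1: OLS on the original model
rho = 0.0
for _ in range(20):                                    # iterate steps 2-6 until convergence
    resid = y - beta[0] - beta[1] * x
    rho_new = ols(resid[:-1, None], resid[1:])[0]      # step 2: regress e_t on e_{t-1}
    y_star = y[1:] - rho_new * y[:-1]                  # step 3: quasi-differences
    x_star = x[1:] - rho_new * x[:-1]
    b = ols(np.column_stack([np.ones(n - 1), x_star]), y_star)   # step 4: OLS on transformed model
    beta_new = np.array([b[0] / (1 - rho_new), b[1]])  # recover beta_1 from beta_1*
    if abs(rho_new - rho) < 1e-6 and np.allclose(beta_new, beta, atol=1e-6):
        break                                          # step 7: convergence
    rho, beta = rho_new, beta_new

print("rho_hat =", round(rho, 3), " beta_hat =", np.round(beta, 3))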
Autocorrelation can also appear because the model is mis-specified. Suppose the true model is

$$Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \varepsilon_t$$

where $\varepsilon$ is the error term. Suppose instead that we exclude $X_3$ and estimate

$$Y_t = \beta_1 + \beta_2 X_{2t} + v_t$$

The error in this mis-specified model, $v_t = \beta_3 X_{3t} + \varepsilon_t$, absorbs the omitted variable, and if $X_3$ is itself autocorrelated (as most economic time series are) then $v_t$ will appear autocorrelated.
This should however be viewed as false or artificial autocorrelation because it has arisen out
of model mis-specification rather than because the errors are specifically of the form given in
(2).
It is probably better to make sure that any autocorrelation present in the error term is not
artificial, before attempting GLS or Cochrane-Orcutt. The GLS and Cochrane-Orcutt
procedures are reserved for autocorrelation that arises because the error term is inherently
in the form of an AR process. They are not really appropriate if the autocorrelation is borne
out of model mis-specification. This is often caused by an inappropriate dynamic structure in
the model, i.e. not enough lags of the variables are included.
So suppose that we wish to estimate model (3) and find that the error is autocorrelated. A
method that applied econometricians usually try first is to add a lagged dependent variable
into the regression model, i.e.
$$Y_t = \beta_1 + \beta_2 X_t + \beta_3 Y_{t-1} + \varepsilon_t$$
and estimate this model using OLS. If this resolves the autocorrelation problem, checked by
using Durbin's h statistic, then there is no need to use GLS or Cochrane-Orcutt.
Now consider heteroskedasticity in the multiple regression model

$$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \cdots + \beta_k X_{ki} + \varepsilon_i \qquad (1)$$
The assumption of homoskedasticity states that $\text{var}(\varepsilon_i) = \sigma^2$ for all i, which means that the variance of each $\varepsilon_i$ is the same, i.e. $\text{var}(\varepsilon_1) = \text{var}(\varepsilon_2) = \cdots = \text{var}(\varepsilon_n) = \sigma^2$ (we have changed back to the i subscript here because heteroskedastic errors tend to be a problem in cross-sectional models). This condition must hold in order for the OLS estimators of the $\beta$ parameters in model (1) to be B.L.U.E. If this condition is violated then the errors are said to be heteroskedastic, which can be represented by

$$\text{var}(\varepsilon_i) = \sigma_i^2$$
This implies that $\text{var}(\varepsilon_1) = \sigma_1^2$, $\text{var}(\varepsilon_2) = \sigma_2^2$, ..., $\text{var}(\varepsilon_n) = \sigma_n^2$, i.e. each error term is
allowed to have a different variance, which is in clear contradiction to the classical
assumption.
To show you how heteroskedasticity can take effect in cross-sectional studies, let's look at an
example. Consider a bivariate model in which we analyse the effect that income X has on our
savings Y. We would expect the relationship to be upward sloping to show that as income
increases, savings increase. If this model exhibits homoskedastic or constant variance errors,
then the distributions of the observations around the regression line at each X value would be
dispersed to the same degree. Represented on a graph we have
[Figure: savings plotted against income, with a density sketched around the regression line at each income level; under homoskedasticity each density has the same spread.]
Under the homoskedasticity assumption, the spread of savings at each income level will be
the same. This is represented by the distributions in the graph having the same dispersions or
variance. But in the real world is this likely to be the case? Isn't it more likely that as income
increases, we would observe a greater spread of savings? People on low incomes tend to save
less because they have little money left over after they have bought all of the necessities for
living. Hence the spread of savings for all those earning low incomes would be small. However,
looking at the savings patterns of people who earn a lot, you would find that some of them
are likely to spend most of it and hence save only a small amount and others will save a lot.
This is because human behaviour is random, so we wouldn't expect everyone to do the same
thing under the same conditions. Hence in actuality the graph is more likely to look like this:
[Figure: savings plotted against income, with densities around the regression line whose spread increases as income increases, illustrating heteroskedastic errors.]
Notice how the spread of the distribution increases as income increases. This would be the
case with heteroskedastic errors. Hence, in this scenario we are more likely to observe
heteroskedastic errors than homoskedastic errors.
So we can see how heteroskedasticity can arise quite naturally in cross-sectional studies. The
form of heteroskedasticity in the error discussed above was related to the explanatory
variable, because as income increased the variance of the error increased. We could therefore
write this relationship as
$$\text{var}(\varepsilon_i) = \sigma_i^2 = \sigma^2 f(X_i)$$

i.e. the variance of the error is a function of the X variable. For example, if $\text{var}(\varepsilon_i) = \sigma^2 X_i$ then a plot of the error terms against X may look like this
[Figure: scatter of the errors $\varepsilon_i$ against X, with the points fanning out around zero as X increases.]
[Figure: a second scatter of the errors against X, again showing a spread around zero that widens as X increases.]
Of course any form of non-constant variance can represent heteroskedasticity, not just those
considered above. As with autocorrelation, we can inspect plots of the error terms to get a
rough impression of whether heteroskedasticity is present. This again would involve
estimating the regression model, like (1) and plotting the residuals.
This section will take on a very similar structure to the section on autocorrelation. This is
because we want to ask the same kinds of questions, i.e. what are the problems caused by
heteroskedasticity for OLS estimation, how can we test for heteroskedasticity and what
measures can we take to deal with it?
• The OLS estimators still have the good property of being unbiased. Again when we proved
the unbiasedness of OLS in Section 2, we did not need to refer to the homoskedasticity
assumption, i.e. it doesn’t matter whether the errors are heteroskedastic, OLS is still
unbiased.
• When the errors are heteroskedastic, the equations for the variances of the OLS estimators
are incorrect. Notice that in the variance derivation in Section 2, the term $\text{var}(\varepsilon_i)$ was replaced with $\sigma^2$ as we assumed homoskedastic errors. Adjusting the variance equation to account for heteroskedastic errors leaves us with the equation
$$\text{var}(\hat{\beta}_2) = \frac{\sum (X_i - \bar{X})^2 \sigma_i^2}{\left( \sum (X_i - \bar{X})^2 \right)^2}$$
• White’s test
We considered above the situation in which the error variance was dependent upon the
explanatory variable X. In a multiple regression model it is possible for the error variance to
be a function of all of the explanatory variables, i.e.
$$\text{var}(\varepsilon_i) = \sigma_i^2 = \sigma^2 f(X_{1i}, X_{2i}, \ldots, X_{ki})$$
In this test the function f (.) includes all variables, their squared values and their cross
products. The idea is that we estimate a regression that relates the error variance to this f (.)
function. Of course we do not observe these variances, i.e. $\sigma_i^2$ is unobservable, so we use the squared residuals $\hat{\varepsilon}_i^2$ as a proxy. To show you how this works, assume that we want to
estimate the following model and we want to test for heteroskedasticity in the errors
$$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \varepsilon_i$$
We estimate this model using OLS so that we can obtain the residuals $\hat{\varepsilon}_i$. We then run another regression, using the squared residuals as the dependent variable and including the regressors, their squares and their cross product, as follows

$$\hat{\varepsilon}_i^2 = \alpha_1 + \alpha_2 X_{2i} + \alpha_3 X_{3i} + \alpha_4 X_{2i}^2 + \alpha_5 X_{3i}^2 + \alpha_6 X_{2i} X_{3i} + u_i \qquad (4)$$
The White’s test is a test of the null hypothesis of homoskedasticity against the alternative of
heteroskedasticity. In model (4), homoskedasticity occurs if all of the coefficients on the Xs
are zero. Hence the null hypothesis takes the form
$$H_0: \alpha_2 = \alpha_3 = \alpha_4 = \alpha_5 = \alpha_6 = 0$$
$$H_1: \text{at least one} \neq 0$$
To see why the restrictions under the null imply homoskedasticity, when we impose them on
(4) we get
$$\hat{\varepsilon}_i^2 = \alpha_1 + u_i \qquad (5)$$
Hence the variances of the residuals under the null are constant, i.e. homoskedastic. If at least
one of the coefficient parameters is non-zero then we have heteroskedastic errors. For
example, if $\alpha_3 \neq 0$ then

$$\hat{\varepsilon}_i^2 = \alpha_1 + \alpha_3 X_{3i} + u_i$$
such that the error variance changes with the values of $X_{3i}$, i.e. the errors are heteroskedastic.
But how do we test this null hypothesis? Well you should recognise the form of the null
hypothesis from the section on hypothesis testing in multiple regression models. It is
effectively a test of the overall significance of regression (4) for which the test statistic is an F
statistic. Only for such a test of overall significance can you use both types of F statistic, i.e.
$$F = \frac{R^2/(m-1)}{(1 - R^2)/(n-m)} \sim F_{m-1,\,n-m}$$
where R2 is the goodness of fit value from the estimated regression (4), n is the sample size
and m is the number of parameters to be estimated in (4). In the above example, m = 6 . Or
one can use
$$F = \frac{(RSS_R - RSS_U)/(m-1)}{RSS_U/(n-m)} \sim F_{m-1,\,n-m}$$
where the unrestricted model is (4) and the restricted model is (5). As in all F testing
procedures, if $F > F^c$, where $F^c$ is the appropriate critical value, then we would reject the null hypothesis. In this case it would imply that the errors were heteroskedastic.
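A sketch of White's test on simulated data in which the error standard deviation is deliberately made proportional to one of the regressors, so the null of homoskedasticity should be rejected:

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(6)
n = 200
X2 = rng.uniform(1, 10, n)
X3 = rng.normal(size=n)
eps = rng.normal(size=n) * X2            # heteroskedastic: sd proportional to X2
Y = 1 + 0.5 * X2 + 0.5 * X3 + eps

resid = sm.OLS(Y, sm.add_constant(np.column_stack([X2, X3]))).fit().resid

# Auxiliary regression (4): squared residuals on levels, squares and cross product
W = sm.add_constant(np.column_stack([X2, X3, X2**2, X3**2, X2 * X3]))
aux = sm.OLS(resid**2, W).fit()

m = W.shape[1]                            # number of parameters in (4), here 6
F = (aux.rsquared / (m - 1)) / ((1 - aux.rsquared) / (n - m))
F_crit = stats.f.ppf(0.95, m - 1, n - m)
print(F, F_crit, "reject homoskedasticity" if F > F_crit else "do not reject")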
A simpler variant replaces the full set of regressors in the auxiliary regression with the squared fitted values from the original model:

$$\hat{\varepsilon}_i^2 = \alpha_1 + \alpha_2 \hat{Y}_i^2 + u_i \qquad (6)$$

Although it does not look like it at first glance, this regression is actually similar to that used in White's test. Suppose our original model has two explanatory variables as it did in the above sub-section, i.e. $Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \varepsilon_i$. Then $\hat{Y}_i^2$ implicitly contains the variables $X_{2i}$, $X_{3i}$, $X_{2i}^2$, $X_{3i}^2$ and $X_{2i} X_{3i}$.
Again the test is of the null of homoskedasticity and the alternative of heteroskedasticity,
which for this form of model amounts to testing
$$H_0: \alpha_2 = 0$$
$$H_1: \alpha_2 \neq 0$$
To see why the null represents homoskedasticity here, impose the restriction on (6) to find
that $\hat{\varepsilon}_i^2 = \alpha_1 + u_i$. This is of the same form as the restricted version in White's test.
We have spent a long time on testing hypotheses of this sort, that is tests of a single
parameter, and therefore you should immediately realise that we can actually use a simple t
statistic. Hence we have
$$t = \frac{\hat{\alpha}_2}{s.e.(\hat{\alpha}_2)} \sim t_{n-2}$$
and we compare this value to the critical value from a t table. This is a two-sided test so we
would reject the null and conclude that the errors are heteroskedastic if $|t| > t_{n-2}^{\alpha/2}$.
• Goldfeld-Quandt test
Both the White’s test and its variant are useful if the form of heteroskedasticity is unknown,
i.e. the econometrician suspects that the errors are heteroskedastic but is not sure of what
form the heteroskedasticity takes and which X variables are causing the problem.
The Goldfeld-Quandt test however is useful if it is known that the variance of the disturbance
term changes with the value of a particular regressor $X_i$. Consider estimating the model $Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \varepsilon_i$ where the econometrician has some suspicion that the error
variance is changing and that it depends upon the values of X 2i . To run this test, the data
must be re-ordered in ascending order of the variable upon which the variance is thought to
depend, in this case the X 2i variable. The data are then split into two groups of size n1 and n2
which correspond to small values of $X_{2i}$ and large values of $X_{2i}$ respectively. Usually $n_1 + n_2 < n$ because some middle observations are left out - this leaves a clear distinction between the samples involving small and large values.
Once the data have been ordered and the subsamples created, two separate regressions are
estimated by OLS, using the different samples. These are of the form

$$Y_i^{(a)} = \beta_1^{(a)} + \beta_2^{(a)} X_{2i}^{(a)} + \beta_3^{(a)} X_{3i}^{(a)} + \varepsilon_i^{(a)} \qquad \text{and} \qquad Y_i^{(b)} = \beta_1^{(b)} + \beta_2^{(b)} X_{2i}^{(b)} + \beta_3^{(b)} X_{3i}^{(b)} + \varepsilon_i^{(b)}$$

where the superscripts denote the two regressions with the different samples, a and b. The
estimated variances of the disturbances are obtained from each regression. We will denote
these as
$$\hat{\sigma}_1^2 = \frac{\sum \hat{\varepsilon}_i^{2(a)}}{n_1 - k} \qquad \text{and} \qquad \hat{\sigma}_2^2 = \frac{\sum \hat{\varepsilon}_i^{2(b)}}{n_2 - k}$$
where k is the number of parameters in the regressions which in the above example is 3. The
idea is that a comparison of these values should indicate whether or not the variance of the
error term is different in the two sub-samples. The null hypothesis is again of homoskedasticity
and the alternative of heteroskedasticity, i.e.
$$H_0: \sigma_1^2 = \sigma_2^2$$
$$H_1: \sigma_2^2 > \sigma_1^2$$

The test statistic is

$$F_{GQ} = \frac{\hat{\sigma}_2^2}{\hat{\sigma}_1^2} \sim F_{n_2 - k,\, n_1 - k}$$
As with all F tests, if $F_{GQ} > F^c$ we reject the null hypothesis in favour of the alternative, i.e., we
reject homoskedasticity in favour of heteroskedasticity.
The choice of sub-sample size n1 and n2 are somewhat arbitrary. It is usual for n1 = n2 and to
make them greater than a third of the total sample size, with observations missing in the
middle. Notice what happens to the F statistic when n1 = n2 ,
$$F_{GQ} = \frac{\hat{\sigma}_2^2}{\hat{\sigma}_1^2} = \frac{\sum \hat{\varepsilon}_i^{2(b)}}{\sum \hat{\varepsilon}_i^{2(a)}} = \frac{RSS_b}{RSS_a}$$
Another thing to note is that the form of the test statistic assumes that the variance is
increasing with $X_{2i}$, such that $\hat{\sigma}_1^2 < \hat{\sigma}_2^2$ and the statistic is greater than 1. If this is not the case and the variance is decreasing in the X variable, the test should still be set up so that the statistic is greater than 1, i.e. $F_{GQ} = \hat{\sigma}_1^2 / \hat{\sigma}_2^2$, and the alternative hypothesis is that $\sigma_2^2 < \sigma_1^2$.
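A sketch of the Goldfeld-Quandt test on simulated data; the sub-sample sizes and the number of dropped middle observations are arbitrary choices.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 150
X2 = rng.uniform(1, 10, n)
X3 = rng.normal(size=n)
Y = 1 + 0.5 * X2 + 0.5 * X3 + rng.normal(size=n) * X2   # variance rises with X2

order = np.argsort(X2)                    # re-order the data by X2
Y, X2, X3 = Y[order], X2[order], X3[order]
k = 3
n1 = n2 = 60                              # sub-sample sizes; 30 middle observations dropped

def rss(y, x2, x3):
    X = np.column_stack([np.ones(len(y)), x2, x3])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ b) ** 2)

rss_a = rss(Y[:n1], X2[:n1], X3[:n1])                 # small-X2 sub-sample
rss_b = rss(Y[-n2:], X2[-n2:], X3[-n2:])              # large-X2 sub-sample

F_gq = (rss_b / (n2 - k)) / (rss_a / (n1 - k))        # sigma2_hat_2 / sigma2_hat_1
F_crit = stats.f.ppf(0.95, n2 - k, n1 - k)
print(F_gq, F_crit, "reject homoskedasticity" if F_gq > F_crit else "do not reject")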
It is sometimes the case that transforming the variables into logarithms can transform a
heteroskedastic error into a homoskedastic error. If this doesn't work, or perhaps
heteroskedasticity is a problem when your model is already specified in logs, then an
alternative procedure is required. Here are a couple of suggestions.
Suppose we have estimated the model

$$Y_i = \beta_1 + \beta_2 X_i + \varepsilon_i$$
but have found heteroskedasticity in the errors and that the heteroskedasticity is of the
following form
$$\text{var}(\varepsilon_i) = \sigma_i^2 = \sigma^2 Z_i^{\lambda}$$

where $\lambda$ is just a number. I am not being specific about the variable Z; it could be the explanatory variable but it may be some other variable. Whatever Z is, it is obvious that the variance changes as the values of Z change. It is possible to get rid of the heteroskedasticity from the error term by dividing the regression equation by the square root of the form of Z that appears in the variance function, so that

$$\frac{Y_i}{\sqrt{Z_i^{\lambda}}} = \beta_1 \frac{1}{\sqrt{Z_i^{\lambda}}} + \beta_2 \frac{X_i}{\sqrt{Z_i^{\lambda}}} + v_i$$

where $v_i = \varepsilon_i / \sqrt{Z_i^{\lambda}}$. It is not obvious at first glance how this has solved the problem. We need
to analyse the variance of the error more closely and we need to know a little bit about how
the variance operator works. The only property that we need to understand is that if
$$\text{var}(c \varepsilon_i) = c^2\, \text{var}(\varepsilon_i)$$

where c is a non-random term. Applying this rule to the error in the transformed model we
have
$$\text{var}(v_i) = \text{var}\!\left(\frac{\varepsilon_i}{\sqrt{Z_i^{\lambda}}}\right) = \left(\frac{1}{\sqrt{Z_i^{\lambda}}}\right)^2 \text{var}(\varepsilon_i) = \frac{1}{Z_i^{\lambda}}\, \sigma^2 Z_i^{\lambda} = \sigma^2$$
Hence although the variance of $\varepsilon$ is heteroskedastic, that of v is not. This transformed model can therefore safely be estimated using OLS. For example, if $\text{var}(\varepsilon_i) = \sigma^2 Z_i^2$ then the appropriate transformation would be to divide the regression model by $Z_i$.
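A sketch of this transformation with Z taken to be the regressor itself and lambda = 2, so the model is divided through by X. The data are simulated and the true coefficients are arbitrary.

import numpy as np

rng = np.random.default_rng(8)
n = 200
X = rng.uniform(1, 10, n)
eps = rng.normal(size=n) * X               # sd proportional to X, i.e. var = sigma^2 * X^2
Y = 1 + 2 * X + eps

w = np.sqrt(X ** 2)                        # sqrt(Z_i^lambda) with Z = X and lambda = 2
Y_t, c_t, X_t = Y / w, 1 / w, X / w        # transformed variables

D = np.column_stack([c_t, X_t])            # note: no separate constant column is added
b, *_ = np.linalg.lstsq(D, Y_t, rcond=None)
print("beta1_hat =", round(b[0], 3), " beta2_hat =", round(b[1], 3))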
An alternative is to continue using OLS but to correct its variance estimator. Recall that with heteroskedastic errors the variance of the OLS slope estimator in the bivariate model is

$$\text{var}(\hat{\beta}_2) = \frac{\sum (X_i - \bar{X})^2 \sigma_i^2}{\left( \sum (X_i - \bar{X})^2 \right)^2} \qquad (7)$$
Therefore if we use the normal variance estimator $\widehat{\text{var}}(\hat{\beta}_2) = \hat{\sigma}^2 \big/ \sum (X_i - \bar{X})^2$, where $\hat{\sigma}^2 = \sum \hat{\varepsilon}_i^2 / (n-2)$, this
would be a biased estimator of the true variance and hypothesis tests and confidence intervals
will be incorrect. White shows that a consistent estimator of the correct variance is
$$\widehat{\text{var}}(\hat{\beta}_2) = \frac{\sum (X_i - \bar{X})^2 \hat{\varepsilon}_i^2}{\left( \sum (X_i - \bar{X})^2 \right)^2} \qquad (8)$$
which simply replaces the variance $\sigma_i^2$ by the squared residuals $\hat{\varepsilon}_i^2$. This being a consistent
estimator of the true OLS variance parameter means that as the sample size increases, this
variance estimator (8) will tend to its true value. It must be remembered however, that even
though this heteroskedasticity-consistent variance estimator is better than the usual OLS
variance estimator, it is still not efficient. For efficient estimation one would do better with a
WLS type estimator.
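A sketch comparing the usual OLS variance estimator with the heteroskedasticity-consistent estimator (8) for the bivariate slope, on simulated data:

import numpy as np

rng = np.random.default_rng(9)
n = 300
X = rng.uniform(1, 10, n)
Y = 1 + 2 * X + rng.normal(size=n) * X     # heteroskedastic errors

xd = X - X.mean()
b2 = np.sum(xd * (Y - Y.mean())) / np.sum(xd ** 2)
b1 = Y.mean() - b2 * X.mean()
resid = Y - b1 - b2 * X

usual_var = (np.sum(resid ** 2) / (n - 2)) / np.sum(xd ** 2)       # assumes homoskedasticity
white_var = np.sum(xd ** 2 * resid ** 2) / (np.sum(xd ** 2) ** 2)  # equation (8)

print("usual s.e. =", round(np.sqrt(usual_var), 4),
      " White s.e. =", round(np.sqrt(white_var), 4))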