EC400 Statistics Lecture Notes 2021
Contents
1 Probability theory 4
1.1 Basic algebra of sets and Venn Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 The Probability Function - Review of well known properties . . . . . . . . . . . . . . . . 9
1.3 Conditional probability and Bayes' formula . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Combinatorics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Random variables 15
2.1 Probability distribution function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Cumulative distribution function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Survival and hazard function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Expectations of a random variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4.1 Mean and variance: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4.2 Higher order moments and existence of moments . . . . . . . . . . . . . . . . . . 24
2.4.3 Moment generating function and characteristic function . . . . . . . . . . . . . . 27
2.5 Percentiles and mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4 Some special distributions 47
4.1 Normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Chi-squared, t and F distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3 Bernoulli, binomial and Poisson distributions . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4 Some other distributions (not discussed in 2018) . . . . . . . . . . . . . . . . . . . . . . 51
4.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7 Hypothesis testing 76
7.1 Classical testing procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.1.1 Type of errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7.1.2 Significance level and power of a test . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.2 Test of the mean (variance known) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.2.1 The z-statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.2.2 The p-value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.2.3 Power of the test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.3 Test of the mean (variance unknown) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7.3.1 The t-statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7.4 Test of the variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.5 Hypothesis testing and confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . 84
8.4 Properties of OLS estimators in CLM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
8.4.1 Unbiased, efficient (BLUE), consistent . . . . . . . . . . . . . . . . . . . . . . . . 90
8.5 Derivation OLS estimator using Matrix notation (Non-examinable) . . . . . . . . . . . . 91
8.6 Statistical Inference in CLM under normality . . . . . . . . . . . . . . . . . . . . . . . . 92
8.6.1 The t-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.6.2 The F-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
8.7 Gauss-Markov violations - brief summary . . . . . . . . . . . . . . . . . . . . . . . . . . 94
1 Probability theory
Read LM Chapter 2.
The starting point for studying probability is the definition of four key terms: experiment, sample outcome, sample space, and event.
– The latter three are carry-overs from classical set theory, and give us a familiar mathematical framework within which to work.
– The former is what provides the conceptual mechanism for casting real-world phenomena into probabilistic terms.
Experiment: any procedure that (1) can be repeated, theoretically, an infinite number of times and (2) has a well-defined set of possible outcomes.
Sample outcome ω – a particular draw from the sample space (relate to random variable)
Event – any designated collection of sample outcomes, including individual outcomes, the entire sample space, and the null set.
Let us consider a particular example (for other examples see LM, Ch 2):
Associated with events defined on a sample space are several operations collectively referred to as the algebra of sets. These are rules that govern the ways in which one event can be combined with another.
Algebra of Sets:
– Let A and B be any two events defined over the sample space Ω
The intersection of A and B, written A ∩ B, is the event whose outcomes belong to both A and B. If A ∩ B = ∅ then A and B are mutually exclusive.
The union of A and B, written A ∪ B, is the event whose outcomes belong to either A or B or both.
The complement of A, written A^C, is the event consisting of all outcomes in Ω other than those contained in A.
– The notions of unions and intersections can easily be extended to more than two events.
Next we will want to assign a probability to an experiment's outcome - or, more generally, to an event.
Kolmogorov showed that the following axioms are necessary and sufficient for characterizing the probability function P.
Given the outcome of the experiment, ω, all events in F that contain the selected outcome are said to have occurred.
If this experiment were to be repeated an infinite number of times, the relative frequencies of occurrence of each of the events would coincide with the probabilities prescribed by the function P.
1.1 Basic algebra of sets and Venn Diagrams
The intersection of A and B, written A ∩ B, is the event whose outcomes belong to both A and B. If A ∩ B = ∅ then A and B are mutually exclusive.
The union of A and B, written A ∪ B, is the event whose outcomes belong to either A or B or both.
The complement of A, written A^C, is the event consisting of all outcomes in Ω other than those contained in A.
[Venn diagrams: mutually exclusive events; B ⊂ A; and a mutually exclusive and exhaustive partition of A into B and C]
Example: Tossing a six-faced "fair" die. The sample space is Ω = {1, 2, 3, 4, 5, 6}, each number being an outcome.
A few results:
The following results can easily be seen with the aid of Venn diagrams. Figure 1 illustrates the first one.
1. (A ∪ B)^C = A^C ∩ B^C
2. (A ∩ B)^C = A^C ∪ B^C
3. for any events A, B1, B2, ..., Bn:
A ∩ (B1 ∪ B2 ∪ ... ∪ Bn) = (A ∩ B1) ∪ (A ∩ B2) ∪ ... ∪ (A ∩ Bn)
Figure 1: (A ∪ B)^C = A^C ∩ B^C (Venn diagram of A and B in Ω)
Examples of a σ-algebra:
F = {∅, Ω} forms a σ-algebra
– F contains the empty set: ∅ ∈ F
– F is closed under complements: ∅^c = Ω ∈ F and Ω^c = ∅ ∈ F
– F is closed under countable unions: ∅ ∪ Ω = Ω ∈ F
– F is closed under countable intersections: ∅ ∩ Ω = ∅ ∈ F
With Ω = {a, b, c, d}, F = {∅, {a, b}, {c, d}, {a, b, c, d}} forms a σ-algebra
– F contains the empty set: ∅ ∈ F
– F is closed under complements: e.g., {a, b}^c = {c, d} ∈ F
– F is closed under countable unions: e.g., {a, b} ∪ {c, d} = {a, b, c, d} ∈ F
– F is closed under countable intersections: e.g., {a, b} ∩ {c, d} = ∅ ∈ F
4. for any event A: P(A^C) = 1 − P(A)
5. for any events A and B:
P(A) = P(A ∩ B) + P(A ∩ B^C)
6. for exhaustive, mutually exclusive events B1, B2, ..., Bn, and for any event A:
P(A) = P(A ∩ B1) + P(A ∩ B2) + ... + P(A ∩ Bn) = Σ_{i=1}^n P(A ∩ Bi)
Exercise: A household survey of electric appliances found that 75% of houses have radios (R), 65% have irons (I), 55% have electric toasters (T), 50% have both an iron and a radio (I ∩ R), 40% have (R ∩ T), 30% have (I ∩ T), and 20% have all three. Find the probability that a household has at least one of these appliances.
– Solution:
P(R ∪ I ∪ T) = P(R) + P(I) + P(T) − P(R ∩ I) − P(R ∩ T) − P(I ∩ T) + P(R ∩ I ∩ T)
= .75 + .65 + .55 − .5 − .4 − .3 + .2 = .95
– Solution:
P(A^C ∪ B^C) = P((A ∩ B)^C) = 1 − P(A ∩ B) = .8
Exercise: Suppose you and I are investors and part of our capital is invested in Thailand. I know that you've just received some information about Thailand, but I do not know what it is. It may be B, "things are good", or B^C, "things are not good". Let the event A be "you take the money out of Thailand", and A^C "you keep the money in Thailand". Say that I know the probability of the event A given B and given B^C, as well as the probability of B. I see that you chose A. What is the probability of B given A?
Theorem Bayes' Theorem (simple form): For any events A and B with P(A) > 0,
P(B|A) = P(B) · P(A|B) / P(A)
Theorem Bayes' Theorem (general form): If B1, B2, ..., Bn form a partition of the sample space S, then
P(Bj|A) = P(A|Bj) P(Bj) / Σ_{i=1}^n P(A|Bi) P(Bi)   for each j = 1, 2, ..., n
If we know P(A|Bj) for all j, the theorem allows us to compute P(Bj|A).
Bayesian analysis: the P(Bj) are referred to as prior probabilities, and the P(Bj|A) as posterior probabilities.
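As a quick numerical illustration of the general form (not from LM; the priors and likelihoods below are made-up values), Bayes' rule can be computed directly in Python:

# Bayes' theorem (general form): posterior P(Bj | A) from priors P(Bj) and likelihoods P(A | Bj)
def posterior(priors, likelihoods):
    joint = [p * l for p, l in zip(priors, likelihoods)]   # P(A and Bj)
    total = sum(joint)                                     # P(A), by the law of total probability
    return [j / total for j in joint]

# Hypothetical example: P(B) = 0.3, P(B^C) = 0.7, P(A|B) = 0.9, P(A|B^C) = 0.2
print(posterior([0.3, 0.7], [0.9, 0.2]))   # [0.658..., 0.341...], i.e. P(B|A) is about 0.66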
1.4 Independence
The independence of (non-empty) events A and B, P(A ∩ B) = P(A)P(B), is equivalent to P(A|B) = P(A):
– The probability of a given event A remains the same regardless of the outcome of a second event B.
Example: Die-tossing experiment with sample space S = {1, 2, 3, 4, 5, 6}. Consider the following events:
A = {1, 2, 3} "the number tossed is ≤ 3"
B = {2, 4, 6} "the number tossed is even"
C = {1, 2} "the number tossed is a 1 or a 2"
D = {1, 6} "the number tossed doesn't start with the letters 'f' or 't'"
A few results:
2. P(A^C|B) = 1 − P(A|B)
3. if A ⊂ B then P(A|B) = P(A ∩ B)/P(B) = P(A)/P(B), and P(B|A) = 1
4. If A and B are independent events, then A^C and B are independent events, A and B^C are independent events, and A^C and B^C are independent events.
1.5 Combinatorics
Counting ordered sequences: the multiplication rule. If operation A can be performed in m different ways and operation B in n different ways, the sequence (operation A, operation B) can be performed in m × n different ways.
– Rolling a die twice yields 6 × 6 = 36 possible outcomes.
– If an operation Ai can be performed in ni ways, i = 1, 2, ..., k, respectively, then the ordered sequence (operation A1, operation A2, ..., operation Ak) can be performed in n1 × n2 × ... × nk ways.
Counting permutations (when the objects are all distinct): The number of permutations of length k that can be formed from a set of n distinct elements, repetitions not allowed, is denoted by nPk:
nPk = n(n − 1)···(n − k + 1) = n!/(n − k)!
Counting permutations (when the objects are not all distinct): The number of ways to arrange n objects, n1 being of one kind, n2 of a second kind, ..., and nr of an r-th kind, is
n!/(n1! n2! ··· nr!), where Σ_{i=1}^r ni = n
Counting combinations: The number of ways to form combinations of size k from a set of n distinct objects, repetitions not allowed, is denoted by the symbols (n choose k) or nCk, where
(n choose k) = nCk = n!/(k!(n − k)!)
– (n choose k), k = 0, ..., n, are commonly referred to as binomial coefficients
– Pascal's triangle allows us to easily obtain the binomial coefficients (see p.110 in Larsen and Marx)
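As a small illustration of these counting rules (a minimal sketch using Python's standard library; the specific numbers are arbitrary):

import math

print(6 * 6)            # multiplication rule: two die rolls give 36 ordered outcomes
print(math.perm(6, 3))  # 6P3 = 6!/3! = 120 ordered sequences of length 3
print(math.comb(6, 3))  # 6C3 = 6!/(3!3!) = 20 combinations of size 3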
1.6 Exercises
Exercise: If P(A) = 1/6 and P(B) = 5/12, and P(A|B) + P(B|A) = 7/10, find P(A ∩ B).
– Solution:
P(B|A) = P(A ∩ B)/P(A) = 6 P(A ∩ B)
P(A|B) = P(A ∩ B)/P(B) = (12/5) P(A ∩ B)
→ (6 + 12/5) P(A ∩ B) = 7/10
→ P(A ∩ B) = 1/12
Exercise: Three dice have probabilities p, q and r, respectively, of throwing a "6". One of the dice is chosen at random and thrown (each is equally likely to be chosen). A "6" appeared. What is the probability that the die chosen was the first one?
– Solution: The event "a 6 is thrown" is denoted by "6".
P(die 1 | "6") = P((die 1) ∩ ("6")) / P("6")
= P("6" | die 1) P(die 1) / P("6")
= p · (1/3) / P("6")
→ P(die 1 | "6") = p · (1/3) / [(p + q + r) · (1/3)] = p/(p + q + r)
Exercise: Identical twins come from the same egg and hence are of the same sex. Fraternal twins have a 50-50 chance of being the same sex. Among twins, the probability of a fraternal set is p and of an identical set is q = 1 − p. If the next set of twins are of the same sex, what is the probability that they are identical?
– Solution: Let A be the event "the next set of twins are of the same sex", and let B be the event "the next set of twins are identical". We are given:
P(A|B) = 1, P(A|B^C) = .5
P(B) = q, P(B^C) = p = 1 − q.
Then P(B|A) = P(A ∩ B)/P(A)
But P(A ∩ B) = P(A|B) P(B) = q,
and P(A ∩ B^C) = P(A|B^C) P(B^C) = .5p
Thus, P(A) = P(A ∩ B) + P(A ∩ B^C) = q + .5p = q + .5(1 − q) = .5(1 + q),
and P(B | A) = q / (.5(1 + q))
Exercise: Let events A and B be independent. Find the probability, in terms of P(A) and P(B), that exactly one of the events A and B occurs.
Since A and B are independent, it follows that A and B^C are also independent, as are B and A^C.
Then P((A ∩ B^C) ∪ (B ∩ A^C))
= P(A) P(B^C) + P(B) P(A^C)
= P(A)(1 − P(B)) + P(B)(1 − P(A))
= P(A) + P(B) − 2P(A)P(B).
2 Random variables
See also Greene Appendix C.
In the previous chapter we introduced the sample space Ω, which may be quite tedious to describe if the elements of Ω are not numbers.
With the help of random variables we formulate a rule, or a set of rules, by which the elements ω of Ω may be represented by numbers x, or ordered pairs of numbers (x1, x2), or, more generally, n-tuples of numbers (x1, x2, ..., xn).
– Example: The random experiment may be the toss of a coin and Ω = {H, T}. We may define X such that X(ω) = 0 if ω = T and X(ω) = 1 if ω = H.
– A random variable X is a function that carries the probability from a sample space to a space of real numbers.
Random variables can be scalar (univariate) or vectors (multivariate)
– Let capital letters (X) denote the random variable and small letters (x) a particular realiza-
tion.
– We should distinguish discrete and continuous random variables
A discrete random variable can take on values from a finite or countably infinite sequence only.
Example: Suppose I'll toss a coin until the first head occurs. Related to this experiment we may think of the following two examples of random variables:
– Random variable X
X = 1 if the first head occurs on an even-numbered toss
X = 0 if the first head occurs on an odd-numbered toss;
– Random variable Y
Y = n, where n is the number of the toss on which the first head occurs.
– Both X and Y are discrete random variables, where X can take on only the values 0 or 1, and Y can take on any positive integer value.
– X and Y are based on the same sample space – the sample points are sequences of tail coin flips ending with a head coin flip:
Ω = {H, TH, TTH, TTTH, TTTTH, ...}.
X(H) = 0 (a head on flip one, an odd-numbered flip),
X(TH) = 1,
X(TTH) = 0, ... and so on.
Y(H) = 1 (first head on flip 1),
Y(TH) = 2,
Y(TTH) = 3,
Y(TTTH) = 4, ... and so on.
A continuous random variable can assume numerical values from an interval of real numbers (e.g., the set of real numbers ℝ).
Simple examples are the weight and height of a person or household income.
For a continuous random variable, the probability of a particular outcome is zero.
[Figure: pdf f(x); P(a ≤ X ≤ b) is the area under f(x) between a and b]
1. f(x) ≥ 0 for all x
2. ∫_{−∞}^{∞} f(x)dx = 1
A mixed (discrete and continuous) random variable has some points with non-zero probability mass, and a continuous p.d.f. elsewhere.
– The sum of the probabilities at the discrete points of probability plus the integral of the
density function on the continuous region for X must be 1.
Example: X has probability of .5 at X = 0, and X is a continuous random variable on the interval
(0,1) with density function f (x) = x for 0 < x < 1, and X has no density or probability elsewhere.
Note that:
P(X = 0) + ∫_0^1 f(x)dx = .5 + ∫_0^1 x dx = .5 + .5 = 1.
A discrete random variable with probability function f(x) has a c.d.f. equalling F(x) = Σ_{w ≤ x} f(w), where F(x) is a "step function" (it has a jump at each point with non-zero probability, while remaining constant until the next jump).
A continuous random variable X with density function f(x) has a distribution function F(x) = ∫_{−∞}^x f(t)dt. F(x) is a continuous, differentiable, non-decreasing function such that (d/dx)F(x) = F'(x) = f(x).
If X has a mixed distribution, then F (x) is continuous except at the points of non-zero probability
mass, where F (x) will have a jump.
2. If X has a mixed distribution, then P(X = t) will be non-zero for some value(s) of t, and P(a < X < b) will not always be equal to P(a ≤ X ≤ b) (they will not be equal if X has a non-zero probability mass at either a or b).
3. f(x) may be defined piecewise, meaning that f(x) is defined by a different algebraic formula on different intervals.
4. A continuous random variable may have two or more different, but equivalent, p.d.f.'s, but the difference in the p.d.f.'s would only occur at a finite (or countably infinite) number of points. The c.d.f. of a random variable of any type is always unique to that random variable.
2.2.1 Examples
X = number turning up when tossing one fair die,
so X has probability function f_X(x) = P[X = x] = 1/6 for x = 1, 2, 3, 4, 5, 6. X is a discrete random variable.
F_X(x) = P[X ≤ x] =
0 if x < 1
1/6 if 1 ≤ x < 2
2/6 if 2 ≤ x < 3
3/6 if 3 ≤ x < 4
4/6 if 4 ≤ x < 5
5/6 if 5 ≤ x < 6
1 if x ≥ 6
Z has a mixed distribution on the interval [0, 1). Z has probability of .5 at Z = 0, Z has density function f_Z(z) = z for 0 < z < 1, and Z has no density or probability elsewhere.
Then
F_Z(z) =
0 if z < 0
.5 if z = 0
.5 + z²/2 if 0 < z < 1
1 if z ≥ 1
Exercise: A die is loaded in such a way that the probability of the face with j dots turning up is
proportional to j for j = 1; 2; 3; 4; 5; 6: What is the probability, in one roll of the die, that an even
number of dots will turn up?
– Solution: Let X denote the random variable representing the number of dots that appears when the die is rolled once. Then P[X = k] = Rk for k = 1, 2, 3, 4, 5, 6, where R is the proportionality constant. Since the sum of all the probabilities of points that can occur must be 1, it follows that
R[1 + 2 + 3 + 4 + 5 + 6] = 1,
so that R = 1/21.
Then, P[X is even] = P[X = 2] + P[X = 4] + P[X = 6] = (2 + 4 + 6)/21 = 12/21 = 4/7.
– Solution: X is a discrete random variable that can take on an integer value of 1 or more. The probability function for X is the probability of x − 1 successive odd tosses followed by an even toss:
f(x) = P[X = x] = (1/2)^x
Then
Exercise: The continuous random variable X has density function f(x) = 3 − 48x² for −.25 ≤ x ≤ .25 (and f(x) = 0 elsewhere). Find P[1/8 ≤ X ≤ 5/16].
– Solution:
P[.125 ≤ X ≤ .3125] = P[.125 ≤ X ≤ .25]
since there is no density for X at points greater than .25. The probability is
∫_{.125}^{.25} (3 − 48x²)dx = 5/32.
Exercise: Suppose that the continuous random variable X has the cumulative distribution function F(x) = 1/(1 + e^{−x}) for −∞ < x < ∞. Find X's density function.
– Solution: The density function for a continuous random variable is the first derivative of the cumulative distribution function. The density function of X is
f(x) = F'(x) = e^{−x}/(1 + e^{−x})²
Exercise: X is a random variable for which P[X ≤ x] = 1 − e^{−x} for x ≥ 1, and P[X ≤ x] = 0 for x < 1. Which of the following statements is true?
A) P[X = 2] = 1 − e^{−2} and P[X = 1] = 1 − e^{−1}
B) P[X = 2] = 1 − e^{−2} and P[X ≤ 1] = 1 − e^{−1}
C) P[X = 2] = 1 − e^{−2} and P[X < 1] = 1 − e^{−1}
D) P[X < 2] = 1 − e^{−2} and P[X < 1] = 1 − e^{−1}
E) P[X < 2] = 1 − e^{−2} and P[X = 1] = 1 − e^{−1}
Exercise: A continuous random variable X has a density function
f(x) = 2x for 0 < x < 1/2; (4 − 2x)/3 for 1/2 ≤ x < 2; 0 elsewhere.
Find P[.25 < X ≤ 1.25].
– Solution:
P[.25 < X ≤ 1.25] = ∫_{.25}^{1.25} f(x)dx = ∫_{.25}^{.5} 2x dx + ∫_{.5}^{1.25} (4 − 2x)/3 dx = 3/4
Note that since X is a continuous random variable, the probability P[.25 ≤ X < 1.25] would be the same as P[.25 < X ≤ 1.25]. This is an example of a density function defined piecewise.
More examples can be found in LM Chapter 3.3 and 3.4.
2.4 Expectations of a random variable
See also LM Chapter 3.5 and 3.6.
The expected value of a random variable X is E[X] = Σ_x x f(x) in the discrete case, and ∫_{−∞}^{∞} x f(x)dx in the continuous case (in practice, the interval of integration is the interval of non-zero density for X).
– It is denoted by E[X], μ_X or μ.
– It is also called the expectation of X, or the mean of X.
– Example (continuous): If the pdf of Y is given by f(y) = 1 for y ∈ [1, 2] and f(y) = 0 otherwise,
E[Y] = ∫_1^2 y · 1 dy = 1.5
Theorem Jensen's inequality: If g is a function and X is a random variable such that g''(x) = ∂²g(x)/∂x² ≥ 0 at all points x with non-zero density or probability for X, then:
E[g(X)] ≥ g(E[X]),
with strict inequality if g''(x) > 0.
Graphically, the function below is convex (g 00 > 0). Therefore, the expected value of Y = g(X) is
bigger than g(E(X)).
– Later we will discuss how to obtain the distribution of Y = g(X) given that we know the
distribution of X:
The expected value provides us with important information about a random variable. But higher
order moments are also very relevant.
Example: Suppose you have two investment opportunities: A yields on average 4% a year, B has an expected return of 5% (both in British Pounds). Is B better than A?
Answer: Maybe. An important issue is: how risky are those investments? If A is a bond issued by the British Treasury and B was issued by a state bank in Argentina, B will yield more on average, but the possibility that you will lose money is also substantially higher...
The variance of X measures the dispersion of X. It is denoted by Var[X], σ²_X or σ².
Var[X] = E[(X − E(X))²]
= Σ (x − μ)² f(x) if X is discrete
= ∫_{−∞}^{∞} (x − μ)² f(x)dx if X is continuous
How to calculate the variance (with μ = E(X)):
– Discrete case:
E[(X − μ)²] = Σ (x − μ)² f(x)
E[X²] − μ² = Σ x² f(x) − μ²
– Continuous case:
E[(X − μ)²] = ∫_{−∞}^{∞} (x − μ)² f(x)dx
E[X²] − μ² = ∫_{−∞}^{∞} x² f(x)dx − μ²
Exercise: Suppose E[X] = 2. Compute the variance in the following three cases:
(i) Pr(X = 2) = 1.
(ii) Pr(X = 1) = Pr(X = 3) = 1/2.
(iii) Pr(X = 0) = Pr(X = 4) = 1/2.
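A quick way to check the three cases numerically (a minimal sketch; the probability function is passed in as lists of values and probabilities):

def variance(values, probs):
    mean = sum(v * p for v, p in zip(values, probs))                 # E[X]
    return sum((v - mean) ** 2 * p for v, p in zip(values, probs))   # E[(X - mu)^2]

print(variance([2], [1.0]))           # (i)   0.0
print(variance([1, 3], [0.5, 0.5]))   # (ii)  1.0
print(variance([0, 4], [0.5, 0.5]))   # (iii) 4.0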
The coefficient of excess kurtosis is defined to be:
E[(X − μ)⁴]/σ⁴ − 3
[Figure: two pdfs centred at μ, one with high kurtosis (sharp peak, fat tails) and one with low kurtosis]
Exercise: Suppose E[X] = 2. Compute the coefficient of excess kurtosis in the following three cases:
(i) Pr(X = 2) = 1.
(ii) Pr(X = 1) = 1/4, Pr(X = 2) = 1/2, Pr(X = 3) = 1/4.
(iii) Pr(X = 0) = 1/16, Pr(X = 2) = 7/8, Pr(X = 4) = 1/16.
– (i) E[(X − μ)⁴] = 0. Since Var(X) = 0, the excess kurtosis coefficient is not determined.
(ii) E[(X − E(X))⁴] = (1/4)(1 − 2)⁴ + (1/2)(2 − 2)⁴ + (1/4)(3 − 2)⁴ = 1/2
E[(X − μ)²] = (1/4)(−1)² + (1/4)(1)² = 1/2
So the excess kurtosis coefficient is (1/2)/(1/2)² − 3 = −1
(iii) E[(X − E(X))⁴] = (1/16)(0 − 2)⁴ + (7/8)(2 − 2)⁴ + (1/16)(4 − 2)⁴ = 2
E[(X − μ)²] = (1/16)(−2)² + (1/16)(2)² = 1/2
So the excess kurtosis coefficient is 2/(1/2)² − 3 = 5
– Relative to the standard normal distribution, (iii) has fatter tails (positive coefficient), while (ii) has thinner tails (negative coefficient)
– An existence result: If the k-th moment of a random variable exists, all moments of order less than k exist.
Example: The mean might not exist (it might be +∞ or −∞). Consider the continuous random variable X with p.d.f.:
f(x) = 1/x² for x ≥ 1; 0 otherwise.
It is a pdf since:
∫_1^∞ (1/x²) dx = [−1/x]_1^∞ = 1
Its expected value is:
∫_1^∞ x · (1/x²) dx = ∫_1^∞ (1/x) dx = [log(x)]_1^∞ = +∞
Proof of the existence result (LM, page 201): Let f(y) be the pdf of a continuous random variable Y. As E(Y^k) exists:
∫_{−∞}^{∞} |y|^k f(y)dy < ∞
– The t distribution is symmetric and bell shaped (like the normal distribution). The degrees of freedom (v) determine what moments exist:
All moments of order v or higher do not exist.
The mean only exists when v > 1 (otherwise undefined); skewness (kurtosis) is only defined when v > 3 (v > 4).
– The tails when v = 5 are thicker, and the peak sharper, compared with v = ∞ (the N(0, 1) limit): t(5) is leptokurtic (peaked).
2.4.3 Moment generating function and characteristic function
For the random variable X, with probability density function f(x), if the function
M(t) = E[e^{tX}] (= ∫ e^{tx} f(x)dx)
exists for t in a neighbourhood of 0, it is called the moment generating function (MGF) of X.
By rewriting the function to be integrated as the product of a constant and the pdf of a normal with mean μ + σ²t and variance σ², the answer follows.
∫ (1/√(2πσ²)) e^{−(1/(2σ²))[x − (μ + tσ²)]²} · e^{μt + σ²t²/2} dx
= e^{μt + σ²t²/2} ∫ (1/√(2πσ²)) e^{−(1/(2σ²))[x − (μ + tσ²)]²} dx = e^{μt + σ²t²/2}
(the remaining integral equals 1).
This recognizes the property of a pdf which guarantees ∫ f(x)dx = 1.
Using M(t) = exp(μt + σ²t²/2), we can show that M'(0) = E(X) = μ and M''(0) = E(X²) = μ² + σ²:
M'(t) = (μ + σ²t) exp(μt + σ²t²/2) and M'(0) = μ = E(X)
M''(t) = σ² exp(μt + σ²t²/2) + (μ + σ²t)² exp(μt + σ²t²/2) and M''(0) = σ² + μ²
A useful feature of MGFs is that if X and Y are independent, then the MGF of X + Y is M_X(t)M_Y(t):
E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] =_{indep} E[e^{tX}] E[e^{tY}]
– Useful result when considering the distribution of sums of random variables. More later.
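The two derivatives above can also be checked symbolically (a sketch using sympy, applied to the N(μ, σ²) MGF derived above):

import sympy as sp

t, mu, sigma = sp.symbols('t mu sigma', positive=True)
M = sp.exp(mu * t + sigma**2 * t**2 / 2)     # MGF of N(mu, sigma^2)
M1 = sp.diff(M, t).subs(t, 0)                # M'(0) = E(X)
M2 = sp.diff(M, t, 2).subs(t, 0)             # M''(0) = E(X^2)
print(sp.simplify(M1))                       # mu
print(sp.simplify(M2 - M1**2))               # sigma**2, i.e. Var(X)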
While there are many distributions whose MGF does not exist, every distribution has a unique
characteristic function.
– Characteristic functions is a fundamental tool in proofs of central limit theorems. More later.
The 100p-th percentile of the distribution of X is a point c_p such that P[X ≤ c_p] ≥ p and P[X ≥ c_p] ≥ 1 − p.
Median: the 50-th percentile of a distribution is referred to as the median of the distribution: it is the point M for which P[X ≤ M] = .5. Half of the distribution probability is to the left of M and half is to the right.
The mode of a distribution: The mode is any point m at which the probability or density function f(x) is maximized.
The distribution of the random variable X is said to be symmetric about the point c if f(c + t) = f(c − t) for any value of t. In this case the median (and the mean, if it exists) equals c.
2.6 Exercises
Exercise: The time between consecutive eruptions of the volcano Mauna Loa follows the pdf
f(t) = 0.027 e^{−0.027t}
where t is given in months. What is the average time between consecutive eruptions of the volcano?
Integrating by parts, we have:
E(T) = ∫_0^∞ t · 0.027 e^{−0.027t} dt = [−t e^{−0.027t}]_0^∞ + ∫_0^∞ e^{−0.027t} dt
= 0 + [−(1/0.027) e^{−0.027t}]_0^∞
= 1/0.027 − 0
≈ 37 months
Exercise: If the pdf of Y is given by f(y) = 1 for y ∈ [1, 2] and f(y) = 0 otherwise, what is Var(Y)?
– Solution:
E(Y²) = ∫_1^2 y² dy = [y³/3]_1^2 = 8/3 − 1/3 = 7/3
As E(Y) = 3/2:
Var(Y) = E(Y²) − [E(Y)]² = 7/3 − (3/2)² = 1/12
Exercise: Let X equal the number of tosses of a fair die until the first "1" appears. Find E[X].
– Solution: X is a discrete random variable that can take on an integer value ≥ 1. The probability that the first 1 appears on the x-th toss is f(x) = (5/6)^{x−1}(1/6) for x ≥ 1 (x − 1 tosses that are not 1, followed by a 1). This is the probability function of X.
Then
E[X] = Σ_{k=1}^∞ k f(k) = Σ_{k=1}^∞ k (5/6)^{k−1}(1/6)
= (1/6)[1 + 2(5/6) + 3(5/6)² + ...]
We use the general increasing geometric series relation 1 + 2r + 3r² + ... = 1/(1 − r)², so that
E[X] = (1/6) · 1/(1 − 5/6)² = 6.
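The value E[X] = 6 can also be checked numerically by truncating the infinite sum at a large number of terms (a minimal sketch; 2000 terms is an arbitrary cut-off):

p = 1 / 6
approx = sum(k * (1 - p) ** (k - 1) * p for k in range(1, 2000))
print(approx)   # approximately 6.0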
Exercise: The continuous random variable X has p.d.f. f(x) = 1 − |x| for −1 < x < 1 (and 0 elsewhere). Find Var[X].
– Solution: The density of X is symmetric about 0 (since f(x) = f(−x)), so that E[X] = 0.
This can be verified directly:
E[X] = ∫_{−1}^{1} x(1 − |x|)dx = ∫_{−1}^{0} x(1 + x)dx + ∫_{0}^{1} x(1 − x)dx = −1/6 + 1/6 = 0
Then
Var[X] = E[X²] − (E[X])² = E[X²]
= ∫_{−1}^{1} x²(1 − |x|)dx = ∫_{−1}^{0} x²(1 + x)dx + ∫_{0}^{1} x²(1 − x)dx = 1/6
Exercise: The continuous random variable X has p.d.f. f(x) = (1/2)e^{−|x|} for −∞ < x < ∞. Find the 87.5-th percentile of the distribution.
– Solution: The 87.5-th percentile is the number b for which 0.875 = P[X ≤ b] = ∫_{−∞}^{b} f(x)dx = ∫_{−∞}^{b} (1/2)e^{−|x|}dx. This distribution is symmetric about 0, since f(−x) = f(x), so the mean and median are both 0. Thus b > 0, and so
∫_{−∞}^{b} (1/2)e^{−|x|}dx = ∫_{−∞}^{0} (1/2)e^{−|x|}dx + ∫_{0}^{b} (1/2)e^{−|x|}dx
= .5 + ∫_{0}^{b} (1/2)e^{−x}dx = .5 + (1/2)(1 − e^{−b})
= .875
→ b = −ln(.25) = ln 4
– Everything can be generalized to general multivariate random variables
By using vectors, e.g., f (x) = f (x1 ; :::; xn )
Useful results
P(x1 < X ≤ x2, y1 < Y ≤ y2) = F(x2, y2) − F(x2, y1) − F(x1, y2) + F(x1, y1)
To visualize this result, note that F(x2, y2) gives the probability of all points in (−∞, x2] × (−∞, y2]. Subtracting F(x2, y1) and F(x1, y2), we end up with the probability in the rectangle we are interested in, (x1, x2] × (y1, y2], minus the probability in (−∞, x1] × (−∞, y1], because this was subtracted twice (look at the figure).
[Figure: the rectangle (x1, x2] × (y1, y2] in the (x, y) plane]
To obtain the marginal distribution from the joint density, we need to sum or integrate out the
other variable(s)
The marginal probability function or marginal density function of X is
f_X(x) = Σ_y f(x, y) in the discrete case; ∫_{−∞}^{∞} f(x, y)dy in the continuous case
– Extension to the multivariate setting: f_{X1}(x1) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} f(t)dt2 ··· dtn.
The marginal cumulative distribution of X can be found from the joint distribution F(x, y) as:
F_X(x) = lim_{y→∞} F(x, y).
– Recall the definition of conditional probability:
P[B|A] = P[B ∩ A]/P[A]
The density/probability function of jointly distributed variables X and Y can be written in the form
f(x, y) = f_{Y|X}(y|x) · f_X(x)
– The points in red are the ones with higher probability density; the points in black have probability density close to 0. We see that for higher values of X, higher values of Y are more likely. When X is low, Y is more likely to be low.
Figure 2: Joint distribution of X and Y
Figure 3: Marginal and conditional distributions of X
3.5.1 Covariance and correlation
Covariance between X and Y:
Cov[X, Y] = E[(X − μ_X)(Y − μ_Y)] = E[XY] − E[X]E[Y]
The covariance will indicate the direction of covariation of X and Y. Its magnitude depends on the scales of measurement, unlike the correlation coefficient.
Zero covariance versus independence: independence implies zero covariance, but zero covariance does not in general imply independence.
– The exception is when X and Y are jointly normally distributed; then zero covariance implies independence!
The bivariate normal distribution is given by:
f_{XY}(x, y) = 1/(2π σ_x σ_y √(1 − ρ²_{xy})) exp(−1/(2(1 − ρ²_{xy})) [((x − μ_x)/σ_x)² + ((y − μ_y)/σ_y)² − 2ρ_{xy}((x − μ_x)/σ_x)((y − μ_y)/σ_y)])
– Solution
(i) Pr(XY = 1) = 1/4, Pr(XY = 3) = 1/2, Pr(XY = 9) = 1/4.
Thus E[XY] = (1/4)·1 + (1/2)·3 + (1/4)·9 = 4, E[X]E[Y] = 4, so Cov[X, Y] = σ_{X,Y} = 0.
(ii) Pr(XY = 1) = 1/2 and Pr(XY = 9) = 1/2.
Thus E[XY] = (1/2)·1 + (1/2)·9 = 5, Cov[X, Y] = 5 − 4 = 1, and ρ_{XY} = 1
(iii) Pr(XY = 3) = 1.
Thus E[XY] = 3, Cov[X, Y] = 3 − 4 = −1, and ρ_{XY} = −1
– In case (iii) the correlation is negative (high X corresponds with low Y) whereas in (ii) it is positive (high X corresponds with high Y)
Exercise: X and Y are discrete random variables which are jointly distributed with the following probability function f(x, y):
f(x, y)        x = −1    x = 0    x = 1
y = −1          1/6       1/9      1/9
y = 0           1/9        0       1/6
y = 1           1/18      1/9      1/6
Find E[XY].
– Solution: Recall E[XY] = Σ_x Σ_y x y f(x, y):
E[XY] = (−1)(1)(1/18) + (−1)(0)(1/9) + (−1)(−1)(1/6)
+ (0)(1)(1/9) + (0)(0)(0) + (0)(−1)(1/9)
+ (1)(1)(1/6) + (1)(0)(1/6) + (1)(−1)(1/9)
= 1/6
– The diagonal contains the variances associated with each element in the vector X: σ_i² = E[(Xi − μ_i)²]
If X1, ..., Xn are independent with E(Xi) = 0 and Var(Xi) = σ²:
E(X) = 0 and Var(X) = σ² I_n (scalar covariance matrix)
– Note that a scalar covariance matrix does not guarantee independence (joint normality required!)
3.5.3 Mean and variance of sums of random variables
The expected value of a sum of two random variables is:
E[X + Y] = E[X] + E[Y]
and, generalizing,
E[Σ_{i=1}^n Xi] = Σ_{i=1}^n E[Xi]
The variance of a sum of two random variables is Var[X + Y] = Var[X] + Var[Y] + 2Cov[X, Y] and, generalizing,
Var[Σ_{i=1}^n Xi] = Σ_{i=1}^n Var[Xi] + Σ_{i=1}^n Σ_{j≠i} Cov(Xi, Xj)
– Proof (in bivariate setting): Using the fact that Var(Z) = E(Z²) − E(Z)², we get
Var[X + Y] = E[(X + Y)²] − (E[X + Y])²
= E[X² + 2XY + Y²] − (E[X] + E[Y])²
= E[X²] + E[2XY] + E[Y²] − (E[X])² − 2E[X]E[Y] − (E[Y])²
= Var[X] + Var[Y] + 2 Cov[X, Y]
If X and Y are uncorrelated (Cov[X, Y] = 0), this simplifies to
Var[X + Y] = Var[X] + Var[Y]
For constants a, b and c,
Var[aX + bY + c] = a²Var[X] + b²Var[Y] + 2ab Cov[X, Y]
Important case: Consider the sample mean X̄ = (Σ_{i=1}^n Xi)/n. Assume that the observations are independent (Cov[Xi, Xj] = 0 for i ≠ j) and Var[Xi] = σ². Then
Var(X̄) = Var((1/n) Σ_{i=1}^n Xi) =_{indep} (1/n²) Σ_{i=1}^n Var[Xi] = (1/n²) · n σ² = σ²/n
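This result is easy to check by simulation (a sketch using numpy; the sample size, variance and number of replications are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
n, sigma, reps = 25, 2.0, 100_000
means = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)   # 100,000 sample means of size n
print(means.var(), sigma**2 / n)                              # both close to 0.16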
It is useful to add some matrix algebra to the above discussion. Let X be an n × 1 vector of random variables.
For an n × 1 vector of constants a and a scalar b, the scalar Y = a'X + b has
E[Y] = a'E[X] + b and
Var[Y] = a' Var[X] a = Σ_{i=1}^n Σ_{j=1}^n ai aj σij
For an m × n matrix of constants A and an m × 1 vector b, the vector Y = AX + b has
E[Y] = A E(X) + b and
Var[Y] = A Var[X] A'
Var(Y) is a covariance matrix, defined as E((Y − EY)(Y − EY)').
3.5.4 Conditional mean and variance
The conditional expectation of Y given X = x is
E[Y|X = x] = Σ_y y f_{Y|X}(y|X = x) in the discrete case; ∫_{−∞}^{∞} y f_{Y|X}(y|X = x)dy in the continuous case
The conditional variance of Y given X = x is
Var[Y|X = x] = Σ_y (y − E[Y|X = x])² f_{Y|X}(y|X = x) in the discrete case; ∫_{−∞}^{∞} (y − E[Y|X = x])² f_{Y|X}(y|X = x)dy in the continuous case
= E[Y²|X = x] − (E[Y|X = x])²
– The conditional variance is called the scedastic function and, like the regression function, is typically a function of x.
– The case where V ar(Y jX = x) does not vary with x is called homoskedasticity.
Theorem Law of iterated expectations. Let h(X; Y ) be a function of two random variables
E (Y ) = EX [E (Y jX = x)]
– Var(Y) = E[Y²] − E(Y)²
= E_X[E(Y²|X = x)] − (E_X[E(Y|X = x)])²  using the theorem
– We add and subtract E_X[E(Y|X = x)²] and rearrange to yield:
Var(Y) = E_X[E(Y²|X = x) − E(Y|X = x)²] + E_X[E(Y|X = x)²] − (E_X[E(Y|X = x)])²
= E_X(Var(Y|X = x)) + Var_X(E(Y|X = x))
– If E(Y|X = x) does not depend on x we get the simpler result: Var(Y) = E_X[Var(Y|X = x)]
Mean independence: E[Y|X = x] = E[Y]
– While we may have mean independence, this does not guarantee that, e.g., Var[Y|X = x] does not depend on x. Independence, f_{Y|X=x} = f_Y, involves all moments!
– Since X is fixed when considering E(YX|X = x), we can take X outside the expectation and consider
E(YX) = E_X(X E(Y|X = x)) = E_X(X E(Y)) = E(X)E(Y)  under mean independence
3.6 Exercises
Exercise: If f(x, y) = K(x² + y²) is the density function for the joint distribution of the continuous random variables X and Y defined over the unit square bounded by the points (0,0), (1,0), (1,1) and (0,1), find K.
– Solution: The (double) integral of the density function over the region of density must be 1, so that
1 = ∫_0^1 ∫_0^1 K(x² + y²)dy dx = K · (2/3)
→ K = 3/2.
Exercise: The cumulative distribution function for the joint distribution of the continuous random variables X and Y is F(x, y) = (.2)(3x³y + 2x²y²), for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. Find f(1/2, 1/2).
– Solution:
f(x, y) = ∂²F(x, y)/∂x∂y = (.2)(9x² + 8xy)
→ f(1/2, 1/2) = (.2)(9/4 + 2) = 17/20.
Exercise: Continuous random variables X and Y have a joint distribution with density function f(x, y) = 3(2 − 2x − y)/2 in the region bounded by y = 0, x = 0 and y = 2 − 2x. Find the density function for the marginal distribution of X for 0 < x < 1.
– Solution: X must be in the interval (0, 1) and Y must be in the interval (0, 2). It is good to draw the region on which the density is nonzero:
[Figure: the triangular region 0 < y < 2 − 2x, 0 < x < 1]
Since f_X(x) = ∫_{−∞}^{∞} f(x, y)dy, we note that given a value of x in (0, 1), the possible values of y (with non-zero density for f(x, y)) must satisfy 0 < y < 2 − 2x, so that
f_X(x) = ∫_0^{2−2x} f(x, y)dy = ∫_0^{2−2x} 3(2 − 2x − y)/2 dy = 3(1 − x)²
Exercise: Suppose that X and Y are independent continuous random variables with the following
density functions - fX (x) = 1 for 0 < x < 1 and fY (y) = 2y for 0 < y < 1. Find P [Y < X] :
– Solution: Since X and Y are independent, the density function of the joint distribution of
X and Y is
f (x; y) = fX (x) fY (y) = 2y
and is defined on the unit square.
P[Y < X] = ∫_0^1 ∫_0^x 2y dy dx = ∫_0^1 x² dx = 1/3.
Exercise: Continuous random variables X and Y have a joint distribution with a density function f(x, y) = x² + xy/3 for 0 < x < 1 and 0 < y < 2. Find P[X > 1/2 | Y > 1/2].
P[X > 1/2 | Y > 1/2] = P[(X > 1/2) ∩ (Y > 1/2)] / P[Y > 1/2]
P[(X > 1/2) ∩ (Y > 1/2)] = ∫_{1/2}^{1} ∫_{1/2}^{2} (x² + xy/3) dy dx = 43/64
P[Y > 1/2] = ∫_{1/2}^{2} f_Y(y)dy = ∫_{1/2}^{2} [∫_0^1 f(x, y)dx] dy = ∫_{1/2}^{2} ∫_0^1 (x² + xy/3) dx dy = 13/16
→ P[X > 1/2 | Y > 1/2] = (43/64)/(13/16) = 43/52.
Exercise: Continuous random variables X and Y have a joint distribution with density function f(x, y) = (π/2)(sin(πy/2)) e^{−x} for 0 < x < ∞ and 0 < y < 1. Find P[X > 1 | Y = 1/2].
P[X > 1 | Y = 1/2] = P[(X > 1) ∩ (Y = 1/2)] / f_Y(1/2)
P[(X > 1) ∩ (Y = 1/2)] = ∫_1^∞ f(x, 1/2)dx = ∫_1^∞ (π/2)(sin(π/4)) e^{−x} dx = (π 2^{1/2}/4) e^{−1}
f_Y(1/2) = ∫_0^∞ f(x, 1/2)dx = ∫_0^∞ (π/2)(sin(π/4)) e^{−x} dx = π 2^{1/2}/4
→ P[X > 1 | Y = 1/2] = e^{−1}
Exercise: X is a continuous random variable with density function f_X(x) = x + 1/2 for 0 < x < 1. X is also jointly distributed with the continuous random variable Y, and the conditional density function of Y given X = x is
f_{Y|X}(y|X = x) = (x + y)/(x + 1/2)
for 0 < x < 1 and 0 < y < 1. Find f_Y(y) for 0 < y < 1.
– Solution: f(x, y) = f_{Y|X}(y|X = x) · f_X(x) = x + y.
Then
f_Y(y) = ∫_0^1 f(x, y)dx = ∫_0^1 (x + y)dx = y + 1/2
Exercise: The coefficient of correlation between random variables X and Y is 1/3, and σ²_X = a, σ²_Y = 4a. The random variable Z is defined to be Z = 3X − 4Y, and it is found that σ²_Z = 114. Find a.
– Solution:
σ²_Z = Var[Z] = 9Var[X] + 16Var[Y] − 2(3)(4)Cov[X, Y]
Since Cov[X, Y] = ρ_{XY} σ_X σ_Y = (1/3)√a · 2√a = 2a/3:
114 = σ²_Z = 9a + 16(4a) − 24(2a/3) = 57a
→ a = 2
Exercise: Suppose that X has a continuous distribution with p.d.f. f_X(x) = 2x on the interval (0, 1) and f_X(x) = 0 elsewhere. Suppose that Y is a continuous random variable such that the conditional distribution of Y given X = x is uniform on the interval (0, x). Find the mean and variance of Y.
– Solution: We are given f_X(x) = 2x for 0 < x < 1 and f_{Y|X}(y | X = x) = 1/x for 0 < y < x. Then,
f(x, y) = f(y|x) f_X(x) = (1/x) · 2x = 2 for 0 < x < 1 and 0 < y < x.
The unconditional (marginal) distribution of Y has p.d.f.
f_Y(y) = ∫_{−∞}^{∞} f(x, y)dx = ∫_y^1 2 dx = 2(1 − y) for 0 < y < 1
(and f_Y(y) is 0 elsewhere). Then
E[Y] = ∫_0^1 y · 2(1 − y)dy = 1/3,
E[Y²] = ∫_0^1 y² · 2(1 − y)dy = 1/6,
and
Var[Y] = E[Y²] − (E[Y])² = 1/6 − (1/3)² = 1/18.
Exercise: Given n independent random variables X1, X2, ..., Xn each having the same variance σ², and defining U = 2X1 + X2 + ... + X_{n−1} and V = X2 + X3 + ... + 2Xn, find the coefficient of correlation between U and V.
– Solution:
ρ_UV = Cov[U, V] / (σ_U σ_V),
σ²_U = (4 + 1 + 1 + ... + 1)σ² = (n + 2)σ² = σ²_V.
Since the X's are independent, if i ≠ j then Cov[Xi, Xj] = 0. Then, noting that Cov[W, W] = Var[W], we have
Cov[U, V] = Cov[2X1, X2] + Cov[2X1, X3] + ... + Cov[X_{n−1}, 2Xn]
= Var[X2] + Var[X3] + ... + Var[X_{n−1}]
= (n − 2)σ²
Then ρ_UV = (n − 2)σ²/((n + 2)σ²) = (n − 2)/(n + 2).
4 Some special distributions
4.1 Normal distribution
See also LM Chapter 4.3 (with exercises).
Univariate. The pdf of a normal random variable X with mean μ and variance σ² is
f(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)) for −∞ < x < ∞.
– This is typically denoted: X ~ N(μ, σ²)
If X ~ N(μ, σ²), then the random variable Z = (X − μ)/σ ~ N(0, 1)
– The standard normal pdf is often denoted by φ(z) and its cdf by Φ(z)
– Tables of values of Φ(z) may be found in most statistics textbooks
– Using this notation
f(x) = (1/σ) φ((x − μ)/σ) and F(x) = Φ((x − μ)/σ)
Multivariate. The pdf of jointly normal random variables X1, ..., Xn with mean (vector) μ and covariance (matrix) Σ is
f(x) = (2π)^{−n/2} (det Σ)^{−1/2} exp(−(1/2)(x − μ)'Σ^{−1}(x − μ)),
or, if Σ = σ²Ω,
f(x) = (2πσ²)^{−n/2} (det Ω)^{−1/2} exp(−(1/(2σ²))(x − μ)'Ω^{−1}(x − μ)).
– When X ~ N(0, σ²I) this simplifies to
f(x) = (2πσ²)^{−n/2} exp(−(1/(2σ²)) Σ_{i=1}^n x_i²)
= f(x1)···f(xn), where the Xi are i.i.d. N(0, σ²)
= Π_{i=1}^n f(xi)
This shows that uncorrelatedness in the presence of joint normality yields independence.
1. If X has a multivariate (joint) normal distribution, then all marginals are normal:
X ~ N(μ, Σ) → Xi ~ N(μi, σi²).
2. All linear transformations of X are normal
Exercise: If for a certain normal random variable X, P[X < 500] = .5 and P[X > 650] = .0227, find the standard deviation of X.
– Solution: The normal distribution is symmetric about its mean, with P[X < μ] = .5. Thus, for this normal X, μ = 500. Then,
P[X > 650] = P[(X − 500)/σ > 150/σ] = .0227
Since (X − 500)/σ has a standard normal distribution, it follows from the table for the standard normal distribution that 150/σ = 2.00 and σ = 75.
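The numbers in this exercise can be checked with scipy's normal distribution functions (a sketch; scipy is assumed to be available):

from scipy.stats import norm

print(1 - norm.cdf(650, loc=500, scale=75))   # ~0.0228, matching P[X > 650]
print(150 / norm.ppf(1 - 0.0227))             # ~75, the implied standard deviation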
2
Below will we see that the , student-t and F distributions are derived from the normal distrib-
ution.
The 2 (n) density function has a single parameter, n, the degrees of freedom. It takes only positive
values and is skewed to the right.
2
If X (n); then E(X) = n and Var(X) = 2n:
Many test statistics we will consider in our econometrics courses have an (asymptotic) Chi-squared
distribution, with degrees of freedom typically given by the number of restrictions.
Useful results on quadratic forms (proofs given in Econometrics courses):
– Suitable quadratic forms of vectors of normal random variables can be shown to have a chi-squared distribution.
2. If the n-dimensional vector W ~ N(0, Var(W)) (i.e., the Wi are not necessarily independent), then
W' Var(W)^{−1} W ~ χ²(n)
(we can define a symmetric matrix Var(W)^{1/2} so that Var(W)^{1/2} Var(W)^{1/2} = Var(W)).
– If W ~ N(0, σ²) is scalar, this equals W²/σ² ~ χ²(1)
Z'AZ ~ χ²(rank(A))
Student-t Distribution
If Z is a N(0, 1) random variable and X is χ²(n) and is independent of Z, then
Z/√(X/n) ~ t(n)
The t(n) density function has a single parameter n, the degrees of freedom. It has the same shape as the normal distribution but has thicker tails and a sharper peak (leptokurtic).
F Distribution
If X is a χ²(m) random variable and Y is a χ²(n) random variable and X and Y are independent, then
(X/m)/(Y/n) ~ F(m, n)
The F(m, n) density function has two parameters, m and n. It is positive and skewed to the right.
If F1 ~ F(n, k), then F2 = 1/F1 ~ F(k, n):
Pr(F1 < a) = Pr(1/F1 > 1/a) = Pr(F2 > 1/a) = 1 − Pr(F2 < 1/a)
Important examples
Example: Let X1, X2, ..., Xn be a random sample of size n from N(μ, σ²); then
√n(X̄ − μ)/S_X ~ t(n − 1)
where X̄ ~ N(μ, σ²/n) and S²_X = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)².
Proof requires us to show (i) √n(X̄ − μ)/σ ~ N(0, 1), (ii) (n − 1)S²_X/σ² ~ χ²(n − 1), and (iii) independence of X̄ and S²_X.
The following decomposition shows the required formulation:
√n(X̄ − μ)/S_X = [√n(X̄ − μ)/σ] / √[(n − 1)S²_X/σ² / (n − 1)]
– The numerator is N(0, 1), the denominator is √(χ²(n − 1)/(n − 1)), and the independence yields the result.
Example: Let S²_X and S²_Y be the sample variances from mutually independent samples of sizes m and n respectively drawn from normal distributions. Let X and Y have population variances σ²_X and σ²_Y respectively. Then
(σ²_Y S²_X)/(σ²_X S²_Y) ~ F(m − 1, n − 1)
Proof uses the facts that (i) (m − 1)S²_X/σ²_X ~ χ²(m − 1), (ii) (n − 1)S²_Y/σ²_Y ~ χ²(n − 1), and their independence.
The following decomposition shows the required formulation:
(σ²_Y S²_X)/(σ²_X S²_Y) = [(m − 1)S²_X/σ²_X / (m − 1)] / [(n − 1)S²_Y/σ²_Y / (n − 1)]
4.3 Bernoulli, binomial and Poisson distributions
These are well known distributions for discrete random variables:
Bernoulli Distribution
X describes the outcome of a trial, where X takes the value 1 with probability p and 0 with probability 1 − p:
f(x) = p^x (1 − p)^{1−x}, x = 0, 1
Binomial Distribution: X is the number of successes in n independent Bernoulli trials, each with success probability p:
f(x) = (n choose x) p^x (1 − p)^{n−x} for x = 0, 1, ..., n
– (n choose x) is a binomial coefficient (combinatorics) and is equal to n!/(x!(n − x)!). Pascal's triangle.
– The moments (E[X] = np, Var[X] = np(1 − p)) are not surprising if you see that X = X1 + X2 + ... + Xn, where the Xi are independent Bernoulli random variables.
Discrete Distributions
Uniform distribution: A die toss, a coin flip... The random variable X may assume values from 1 to N (N ≥ 1 is an integer) and all realizations are equally likely. The probability function is
f(x) = 1/N for x = 1, 2, ..., N; 0 otherwise.
– The mean:
E[X] = (N + 1)/2
since E(X) = 1·(1/N) + 2·(1/N) + ... + N·(1/N) = (1/N) Σ_{i=1}^N i = (1/N) · N(N + 1)/2
– The variance:
Var[X] = (N² − 1)/12
since E(X²) = 1²·(1/N) + 2²·(1/N) + ... + N²·(1/N) = (1/N) Σ_{i=1}^N i² = (1/N) · N(N + 1)(2N + 1)/6
and Var(X) = E(X²) − E(X)².
Geometric distribution with parameter p ∈ [0, 1]: a single trial of an experiment results in either success with probability p, or failure with probability 1 − p. The experiment is performed with successive independent trials until the first success occurs. If X represents the number of failures until the first success, then X is a discrete random variable that can be 0, 1, 2, 3, .... X is said to have a geometric distribution with parameter p. (LM Chapter 4.4)
f(x) = (1 − p)^x p for x = 0, 1, 2, 3, ...
– Mean:
E[X] = (1 − p)/p
– Variance:
Var[X] = (1 − p)/p²
– The geometric distribution has the lack of memory property:
P[X = n + k | X ≥ n] = P[X = k]
The likelihood of the occurrence of the event depends only on p: given p, history does not matter.
Negative binomial distribution with parameters r and p (r > 0 and 0 < p ≤ 1). If r is an integer, then the negative binomial random variable X can be interpreted as the number of failures until the r-th success occurs when successive independent trials of an experiment are performed for which the probability of success in a single particular trial is p (the distribution is defined even if r is not an integer). (LM Chapter 4.5) We have:
f(x) = (r + x − 1 choose x) p^r (1 − p)^x for x = 0, 1, 2, 3, ...
E[X] = r(1 − p)/p, and Var[X] = r(1 − p)/p²
Continuous Distributions
Uniform Distribution. In this case, X may take any value on the interval (a, b) and all points in the interval are equally likely:
f(x) = 1/(b − a) for a < x < b; 0 otherwise.
– And we have:
E[X] = (a + b)/2, and Var[X] = (b − a)²/12
This is a symmetric distribution about the mean = median = (a + b)/2
Exponential Distribution. Consider a Poisson event (e.g., the birth of babies in England). What is the distribution of the time until the next Poisson event? When will the next English baby be born? This is described by the exponential distribution.
f(x) = λe^{−λx} for x > 0; 0 otherwise.
–
E[X] = 1/λ, and Var[X] = 1/λ²,
F(x) = 1 − e^{−λx} for x ≥ 0,
and
P[X > x] = e^{−λx},
E[X^k] = ∫_0^∞ x^k λe^{−λx} dx = k!/λ^k
– Suppose that independent random variables Y1, Y2, ..., Yn have exponential distributions with means 1/λ1, 1/λ2, ..., 1/λn (parameters λ1, λ2, ..., λn) respectively. Let Y = min{Y1, Y2, ..., Yn}. Then Y has an exponential distribution with mean 1/(λ1 + λ2 + ... + λn).
Proof:
P[Y > y] = P[Yi > y for all i = 1, 2, ..., n]
= P[(Y1 > y) ∩ (Y2 > y) ∩ ... ∩ (Yn > y)]
= P[Y1 > y] P[Y2 > y] ··· P[Yn > y]
using independence of the Yi's: = (e^{−λ1 y})(e^{−λ2 y}) ··· (e^{−λn y}) = e^{−(λ1 + λ2 + ... + λn)y}. The c.d.f. of Y is then
F_Y(y) = P[Y ≤ y] = 1 − P[Y > y] = 1 − e^{−(λ1 + λ2 + ... + λn)y}
and the p.d.f. of Y is
f_Y(y) = F'_Y(y) = (λ1 + λ2 + ... + λn) e^{−(λ1 + λ2 + ... + λn)y}
Pareto distribution with parameters x0 and α:
E[X] = αx0/(α − 1), and
Var[X] = αx0²/((α − 2)(α − 1)²).
Gamma distribution with parameters α > 0 and λ > 0:
f(x) = λ^α x^{α−1} e^{−λx} / Γ(α) for x > 0; 0 otherwise,
where Γ(α) is the gamma function defined for α > 0 to be Γ(α) = ∫_0^∞ y^{α−1} e^{−y} dy.
(LM Chapter 4.6)
– The exponential distribution with parameter λ is a special case of the gamma distribution with α = 1 (and the same λ).
Beta distribution with parameters a and b:
E[X] = a/(a + b), and Var[X] = ab/((a + b)²(a + b + 1)).
4.5 Exercises
Exercise: An English player has a 75% probability of scoring a goal from a penalty shot. 5 English players will try to score in a penalty shoot-out. What is the probability distribution of the number of goals?
– Solution: We assume that each of the 5 attempts is an independent event. Then:
f_X(k) = P(X = k) = (n choose k) p^k (1 − p)^{n−k}
f_X(0) = (5 choose 0)(0.75)⁰(0.25)⁵ = 1 × 0.0010 = 0.1%
f_X(1) = (5 choose 1)(0.75)¹(0.25)⁴ = 5 × 0.0029 = 1.5%
f_X(2) = (5 choose 2)(0.75)²(0.25)³ = 10 × 0.0088 = 8.8%
f_X(3) = (5 choose 3)(0.75)³(0.25)² = 10 × 0.0264 = 26.4%
f_X(4) = (5 choose 4)(0.75)⁴(0.25)¹ = 5 × 0.0791 = 39.6%
f_X(5) = (5 choose 5)(0.75)⁵(0.25)⁰ = 1 × 0.2373 = 23.7%
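These probabilities can be reproduced with scipy's binomial pmf (a sketch; scipy is assumed to be available):

from scipy.stats import binom

for k in range(6):
    print(k, round(binom.pmf(k, 5, 0.75), 4))   # 0.001, 0.0146, 0.0879, 0.2637, 0.3955, 0.2373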
Exercise: If X is the number of 6's that turn up when 72 ordinary dice are independently thrown, find the expected value of X².
– Solution: X has a binomial distribution with n = 72 and p = 1/6. Then E[X] = np = 12 and Var[X] = np(1 − p) = 10. Since Var[X] = E[X²] − (E[X])², E[X²] = 10 + 12² = 154.
Exercise: The number of hits, X, per baseball game has a Poisson distribution. If the probability of a no-hit game is 1/10,000, find the probability of having 4 or more hits in a particular game.
– Solution: P[X = 0] = λ⁰e^{−λ}/0! = e^{−λ} = 1/10,000 → λ = ln 10,000.
P[X ≥ 4] = 1 − (P[X = 0] + P[X = 1] + P[X = 2] + P[X = 3])
= 1 − (λ⁰e^{−λ}/0! + λ¹e^{−λ}/1! + λ²e^{−λ}/2! + λ³e^{−λ}/3!)
= 1 − (1/10,000)(1 + ln 10,000 + (ln 10,000)²/2 + (ln 10,000)³/6)
= .9817
Exercise: Suppose that X has a uniform distribution on the interval (0, a), where a > 0. The pdf of X is given by f(x) = 1/a for 0 < x < a. Find P[X > X²].
Example: The random variable T has an exponential distribution with cumulative distribution function P[T ≤ t] = 1 − e^{−λt}, where 1/λ is the mean of T. Find the value of λ for which P[T ≤ 2] = 2P[T > 4], and provide Var[T].
P[T ≤ 2] = 2P[T > 4]
1 − e^{−2λ} = 2e^{−4λ}
2e^{−4λ} + e^{−2λ} − 1 = 0
so, with u = e^{−2λ}, 2u² + u − 1 = (2u − 1)(u + 1) = 0, giving e^{−2λ} = 1/2 and λ = (ln 2)/2.
Var[T] = 1/λ² = 4/(ln 2)²
5 Distributions of functions of random variables
5.1 The distribution of a function of a random variable
The χ², Student-t, and F are examples of distributions of functions of random variables.
Another example we used is that linear combinations of normal random variables are normal random variables.
While we discussed how to obtain the expected value of a function of a random variable, e.g., E[X²] and E[e^{tX}], we now discuss how to obtain the distribution of this function of random variables.
Ad 3. If Y = u(X), X and Y are continuous, and u(X) is a continuous monotonic function of X (one-to-one), then
g(y) = f(v(y)) |v'(y)|
where v is the inverse x = v(y) and v'(y) = dv(y)/dy
F_Y(y) = P[Y ≤ y] = P[u(X) ≤ y]
Important applications:
– If X ~ N(μ, σ²) then Y = (X − μ)/σ ~ N(0, 1).
Note: −∞ < y < ∞
Use g(y) = f(v(y)) |v'(y)|.
Here f(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)), v(y) = σy + μ and v'(y) = σ.
Substituting gives:
g(y) = σ f(σy + μ) = σ (1/√(2πσ²)) exp(−(σy)²/(2σ²)) = (1/√(2π)) exp(−y²/2)
where the latter is the pdf of a N(0, 1) random variable as required.
– If X ~ N(μ, σ²) then Y = e^X has a lognormal distribution with parameters μ and σ² (log(Y) ~ N(μ, σ²)).
Use g(y) = f(v(y)) |v'(y)| and recognize that y > 0.
Here v(y) = log(y) and v'(y) = 1/y, so
g(y) = (1/(y√(2πσ²))) exp(−(log y − μ)²/(2σ²)) if y > 0; 0 otherwise
E(Y) = E(e^{1·X}) = M_X(1) = exp(μ + σ²/2). In accordance with Jensen's inequality, since g(x) = e^x has g'' > 0,
E(Y) = E(g(X)) = exp(μ + σ²/2) ≥ exp(μ) = g(E(X))
– If Y is U[0, 1] then X = F^{−1}(Y) has Pr(X ≤ x) = F(x).
Useful for simulations: If we want to draw random numbers from a distribution with CDF F(x), this result ensures that we can simply draw uniform U[0, 1] random numbers, y, and evaluate F^{−1}(y).
By definition
Pr(X ≤ x) = Pr(F^{−1}(Y) ≤ x)
With the invertibility of the CDF (and F'(x) = f(x) ≥ 0),
Pr(X ≤ x) = Pr(Y ≤ F(x)) = F(x)
where the latter equality uses the fact that the CDF of a U[0, 1] random variable is given by Pr(Y ≤ y) = y.
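A minimal sketch of this simulation idea, using the exponential distribution (where F^{−1}(y) = −log(1 − y)/λ; the parameter value and sample size below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
lam = 0.5
u = rng.uniform(size=100_000)    # U[0,1] draws
x = -np.log(1 - u) / lam         # apply the inverse CDF of the exponential
print(x.mean(), 1 / lam)         # both close to 2, the exponential mean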
Exercise: Let Y = 2X where X is uniformly distributed between 0 and 1: f(x) = 1 and Pr(0.5 < X < 0.6) = 0.10. Find the pdf of Y.
Exercise: The random variable X has an exponential distribution with a mean of 1. The random variable Y is defined to be Y = 2 ln X. Find f_Y(y), the p.d.f. of Y.
– Solution:
F_Y(y) = P[Y ≤ y] = P[2 ln X ≤ y] = P[X ≤ e^{y/2}] = 1 − e^{−e^{y/2}}
f_Y(y) = F'_Y(y) = d/dy (1 − e^{−e^{y/2}}) = (1/2) e^{y/2} e^{−e^{y/2}}
Alternatively, since Y = 2 ln X (y = u(x) = 2 ln x, and ln is a strictly increasing function with inverse x = v(y) = e^{y/2}), and X = e^{Y/2}, it follows that
f_Y(y) = f_X(e^{y/2}) |d/dy e^{y/2}| = e^{−e^{y/2}} · (1/2) e^{y/2}
The above result must be modified for a joint distribution. Suppose here that X1 and X2 have a joint distribution f_X(x1, x2) and that Y1 and Y2 are two monotonic functions of X1 and X2:
Y1 = u1(X1, X2), Y2 = u2(X1, X2)  ⟺  X1 = v1(Y1, Y2), X2 = v2(Y1, Y2)
then
g_Y(y1, y2) = f_X(v1(y1, y2), v2(y1, y2)) abs(J)
where the Jacobian is
J = det [ ∂v1/∂y1  ∂v1/∂y2 ; ∂v2/∂y1  ∂v2/∂y2 ]
– The Jacobian must be nonzero for the transformation to exist.
Example. Let
Y1 = X1, Y2 = X1 + X2  ⟺  X1 = Y1, X2 = Y2 − Y1
– Here J = det [ 1  0 ; −1  1 ] = 1.
– g_Y(y1, y2) = f_X(y1, y2 − y1) · 1
– If we integrate out Y1 from this joint density, then we get the marginal distribution of Y2 = X1 + X2:
g_{Y2}(y2) = ∫ f_X(y1, y2 − y1) · 1 dy1
With Y = X1 + X2, the above result allows us to say that the density function of Y = X1 + X2 is:
f_Y(y) = ∫ f_X(x1, y − x1) dx1
If X1 and X2 are continuous random variables with joint density function f(x1, x2), then the density function of Y = X1 + X2 is
f_Y(y) = ∫_{−∞}^{∞} f(x1, y − x1)dx1.
– This result may be used to show that if X1 and X2 are jointly normal rv's, then X1 + X2 is also normally distributed with mean μ1 + μ2 and variance σ1² + σ2² + 2σ12! The direct proof is really tedious.
– Given independence, it is easier to prove this using the moment generating function (as shown before).
– In this form f_Y(y) is the convolution of f_{X1} and f_{X2}, also denoted (f_{X1} * f_{X2})(y)
Important multivariate applications
– X1, ..., Xn is a random sample from N(μ, σ²); then
Σ_{i=1}^n Xi ~ N(nμ, nσ²)  ⟹  X̄ = (1/n) Σ_{i=1}^n Xi ~ N(μ, σ²/n)
– X1, ..., Xn is a random sample from χ²(1); then
Σ_{i=1}^n Xi ~ χ²(n)
5.4 Exercises
Exercise: Suppose that X and Y are independent discrete integer-valued random variables with X uniformly distributed on the integers 1 to 5, and Y having the following probability function -
Let Z = X + Y. Find P[Z = 5].
– Solution: Using the fact that f_X(x) = .2 for x = 1, 2, 3, 4, 5, and the convolution method for independent discrete random variables, we have
f_Z(5) = Σ_{i=0}^{5} f_X(i) f_Y(5 − i)
= (0)(0) + (.2)(0) + (.2)(.2) + (.2)(0) + (.2)(.5) + (.2)(.2)
= .20
Exercise: X is uniformly distributed on the even integers x = 0, 2, 4, ..., 22. The probability function of X is f(x) = 1/12 for each even integer x from 0 to 22. Find E[X].
Exercise: X1 and X2 are independent exponential random variables each with a mean of 1. Find P[X1 + X2 < 1].
Note that X1 and X2 take on only non-negative values [0, ∞). Since we evaluate the random variable X2 at y − X1, we realize that to define f_Y(y) we need to ensure y − X1 ≥ 0, or X1 ≤ y. Therefore
f_Y(y) = ∫_0^y e^{−x1} e^{−(y − x1)} dx1 = y e^{−y} for y > 0,
so P[X1 + X2 < 1] = ∫_0^1 y e^{−y} dy = 1 − 2e^{−1} ≈ 0.26.
Exercise: The birth weight of males is normally distributed with mean 6 pounds 10 ounces, standard deviation 1 pound. For females, the mean weight is 7 pounds 2 ounces with standard deviation 12 ounces. Given two independent male/female births, find the probability that the baby boy outweighs the baby girl.
– Solution: Let random variables X and Y denote the boy's weight and girl's weight, respectively. Then W = X − Y has a normal distribution with mean 6 10/16 − 7 2/16 = −1/2 lb and variance σ²_X + σ²_Y = 1 + 9/16 = 25/16. Then,
P[X > Y] = P[X − Y > 0]
= P[(W − (−1/2))/[25/16]^{1/2} > (1/2)/[25/16]^{1/2}]
= P[Z > .4],
where Z has a standard normal distribution (W was standardized). Referring to the standard normal table, this probability is about 0.34.
If the exact distribution is unknown, we may want to rely on a result that holds when the sample size is sufficiently large: the asymptotic distribution. Central Limit Theorem.
– If we draw X1, ..., Xn from an unknown distribution with mean μ and variance σ², the distribution of X̄ can be approximated well by a normal distribution with mean μ and variance σ²/n, i.e., X̄ is approximately N(μ, σ²/n)
Theorem Lindeberg-Levy CLT: If X1, ..., Xn are a random sample from a probability distribution with finite mean μ and finite variance σ², then
√n(X̄ − μ) →d N(0, σ²)
"√n(X̄ − μ) has a N(0, σ²) limiting distribution"
– √n(X̄ − μ) →d N(0, σ²) is a convergence in distribution result (more about this later).
This result ensures that, with n sufficiently large, we can approximate the distribution of √n(X̄ − μ) by N(0, σ²) for given n, or
√n(X̄ − μ) is approximately N(0, σ²)
– Let us demonstrate this by considering the sum of independent Bernoulli random variables (toss of a coin):
A fair coin is tossed n times. What is the probability distribution of the number (sum) of heads?
In the graphs, we considered the limiting distribution of Y = Σ_{i=1}^n Xi, where {Xi}_{i=1}^n are n independent Bernoulli random variables (e.g., tosses of a coin), Xi ∈ {0, 1}.
– Let p be the probability of success (heads); then E(Xi) = p and Var(Xi) = p(1 − p).
– As n increases, the graph suggests that the normal distribution is indeed a good approximation of the distribution of the sum (average) of independent Bernoulli random variables.
Given that we can derive E(Σ_{i=1}^n Xi) and Var(Σ_{i=1}^n Xi), needed to fully characterize the normal distribution, we can state
Σ_{i=1}^n Xi is approximately N(np, np(1 − p))
– Exact distribution is true for any given sample size, approximation is only reasonable for
large sample sizes.
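A small simulation along the lines of the coin-tossing demonstration (a sketch using numpy; n, p and the number of replications are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 100, 0.5, 50_000
sums = rng.binomial(1, p, size=(reps, n)).sum(axis=1)   # number of heads in each of 50,000 experiments
print(sums.mean(), n * p)                               # ~50 = np
print(sums.var(), n * p * (1 - p))                      # ~25 = np(1-p)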
– The goal is to learn from a sample of observations something about the population from
which the data was drawn.
We assume there is an unknown process that generates the data (sample) that can be described
by a distribution function or probability density function (Data Generating Process).
We will say that a sample of n observations on one or more variables denoted {X1, X2, ..., Xn} or {Xi}_{i=1}^n is a random sample if the n observations are drawn independently from the same population, or probability distribution f_X(xi; θ).
– We also denote this as: {Xi}_{i=1}^n is i.i.d. (independent, identically distributed)
– The vector θ contains one or more unknown parameters.
Statistic: Any function which can be computed from the data in a sample X = {X1, ..., Xn}
Estimator: A statistic that is intended to serve as a basis for learning about an unknown quantity (parameter θ) is called an estimator. Typically denoted by θ̂.
Sampling distribution:
– Estimators are random variables, so that if another sample were drawn under identical conditions, different values would be obtained.
– The probability function of the estimator is called the sampling distribution: it specifies how the realizations of our estimator will vary under repeated sampling.
– Under random sampling, we would expect that descriptive statistics (such as the sample mean, sample variance and sample covariances) will mimic those of their population counterparts, although not perfectly. The precise manner in which these quantities reflect the population values defines the sampling distribution of our estimator.
Variance: $S_X^2 = \frac{1}{N-1}\sum_{i=1}^N (X_i - \bar X)^2$ is the sample analogue of $\sigma_X^2 = E\left((X_i - \mu_X)^2\right)$: $\text{SampleVar}(X_i)$.
Covariance: $S_{XY} = \frac{1}{N-1}\sum_{i=1}^N (X_i - \bar X)(Y_i - \bar Y)$ is the sample analogue of $\sigma_{XY} = E\left((X_i - \mu_X)(Y_i - \mu_Y)\right)$: $\text{SampleCov}(X_i, Y_i)$.
6.2.1 Sampling distribution of sample mean and variance
Given a random sample $X_1,\dots,X_n$ i.i.d. $N(\mu,\sigma^2)$:
– The sampling distribution of the (suitably rescaled) estimator $S_X^2$ for $\sigma^2$ is
$$(n-1)\frac{S_X^2}{\sigma^2} \sim \chi^2_{n-1}$$
$(n-1)\frac{S_X^2}{\sigma^2} = \sum_{i=1}^n\left(\frac{X_i - \bar X}{\sigma}\right)^2$, whereas $\sum_{i=1}^n\left(\frac{X_i - \mu}{\sigma}\right)^2 \sim \chi^2(n)$ with each $\left(\frac{X_i - \mu}{\sigma}\right)^2 \sim \chi^2(1)$. Rather than obtaining a $\chi^2(n)$ (independence), we lose one degree of freedom because we used $\bar X$ in place of $\mu$.
Formal proof would use a suitable quadratic form in normal random variables (Chapter 4).
Unbiasedness
$\hat\theta$ is unbiased if $E(\hat\theta) = \theta$
$\text{Bias}(\hat\theta) = E(\hat\theta) - \theta$
– If samples of size n are drawn independently, then the average value of our estimates will tend to equal $\theta$.
Efficiency
– Efficient Unbiasedness
We need to acknowledge that there are many unbiased estimators that make poor use of the data.
– Mean Squared Error Efficiency
$MSE(\hat\theta) = E(\hat\theta - \theta)^2 = Var(\hat\theta) + \text{Bias}(\hat\theta)^2$ allows for a trade-off between bias and variance.
Important efficiency results we will come across in our econometrics courses:
– For n large enough, we'll be interested to see whether such estimators are consistent and have an asymptotic distribution.
These sufficient conditions ensure convergence in mean square to $\theta$, which is stronger than the convergence in probability requirement.
$X_1, X_2,\dots,X_n$ are n independent random variables with the same distribution, mean $\mu$ and variance $\sigma^2$ (i.i.d.).
– Show $\bar X = \frac1n\sum_{i=1}^n X_i$ is a consistent estimator of $\mu$.
– Proof using sufficient conditions:
$E(\bar X) = \mu$ and $Var(\bar X) = \sigma^2/n$.
As $n\to\infty$: $\lim_{n\to\infty} E(\bar X) = \mu$ and $\lim_{n\to\infty} Var(\bar X) = 0$.
These are sufficient conditions that guarantee that the sample mean $\bar X$ converges to the population mean $E(X_i)$, so:
$$\text{plim}\ \frac1n\sum_{i=1}^n X_i = \mu$$
In fact, we imposed stronger conditions than are needed for its consistency! (sufficiency)
An alternative method for proving consistency is based on Laws of Large Numbers (LLN).
– LLNs give conditions under which sample averages converge to their population counterparts.
Theorem (Khinchine's Weak Law of Large Numbers, WLLN): If $X_1,\dots,X_n$ are a random sample from a probability distribution with finite mean $\mu$, then
$$\text{plim}(\bar X) = \text{plim}\ \frac1n\sum X_i = \mu$$
$\text{plim}\ \bar X = E(X_i)$, so consistency is established!
– Requiring $Var(X_i) = \sigma^2 < \infty$ is indeed stronger than needed in the above example!
– The proof does not require us to derive $Var(\bar X)$; we just need to look at plims of averages! (plim is a nice operator – nicer than the expectation operator)
E.g., for the Classical Simple Linear Regression Model, $\hat\beta = \frac{\frac1n\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\frac1n\sum_i (X_i - \bar X)^2}$, and
$$\text{plim}\ \hat\beta = \frac{\text{plim}\ \frac1n\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\text{plim}\ \frac1n\sum_i (X_i - \bar X)^2} = \frac{Cov(X_i, Y_i)}{Var(X_i)} = \beta.$$
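A small illustrative Python sketch of this plim result: for an assumed simple regression with true slope $\beta = 2$, the OLS slope computed from growing samples settles near the true value. The data-generating choices (standard normal regressor and errors, $\beta = 2$) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 2.0                                   # assumed true slope, for illustration

for n in (50, 500, 5000, 50000):
    x = rng.normal(size=n)                   # regressor
    y = beta * x + rng.normal(size=n)        # y = beta*x + error
    # Sample covariance over sample variance, as in the plim argument above
    beta_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    print(n, round(beta_hat, 4))
```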
plim operator and Slutsky's Theorem
– The probability limit operator exhibits some nice intuitive properties: if $X_n$ and $Y_n$ are random variables with $\text{plim}\ X_n = a$ and $\text{plim}\ Y_n = b$, then
$\text{plim}(X_n + Y_n) = a + b$
$\text{plim}(X_n Y_n) = ab$
$\text{plim}(X_n / Y_n) = a/b$ provided $b\neq 0$
– We will then need to assume that there is a distribution we can use that approximates its distribution arbitrarily well for a sufficiently large sample.
– This is associated with the asymptotic property of convergence in distribution (more details in the last Chapter).
We say that the estimator $\hat\theta$ of $\theta$ has an asymptotic normal distribution if the distribution of $\hat\theta$ can be approximated by a normal distribution (assumes the sample size is large).
Central Limit Theorems provide the necessary regularity conditions.
$Y_i = C_i(\theta) + \varepsilon_i$ with $E(\varepsilon_i) = 0$ and $C_i(\theta) = E(Y_i)$.
Choose $\theta$ so as to minimize the distance, defined as
$$D(Y, C) = \sum_{i=1}^n d(Y_i, C_i)$$
Details:
– Least squares estimator: $\hat\theta_{OLS} = \arg\min_\theta \sum_{i=1}^n (Y_i - \theta)^2$. Then
$$\frac{dD}{d\theta} = \sum_{i=1}^n 2(Y_i - \theta)(-1) = 0 \implies \sum_{i=1}^n Y_i = \sum_{i=1}^n\theta \implies \sum_{i=1}^n Y_i = n\theta \implies \theta = \frac{\sum_{i=1}^n Y_i}{n} \implies \hat\theta_{OLS} = \bar Y.$$
– Least absolute deviation estimator: $\hat\theta_{LAD} = \arg\min_\theta \sum_{i=1}^n |Y_i - \theta|$, and
$$\hat\theta_{LAD} = Y_{MED},$$
where $Y_{MED}$ is the sample median (the median minimizes the average absolute distance to all points; a small numerical sketch follows below).
This gets more interesting if the $C_i$'s are not the same for all i.
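As an illustrative check, the brief Python sketch below evaluates the two distance criteria on a small made-up sample over a grid of candidate $\theta$ values, confirming that the squared-error criterion is minimized near the sample mean and the absolute-error criterion near the sample median.

```python
import numpy as np

y = np.array([1.0, 2.0, 2.5, 4.0, 10.0])               # made-up sample
theta_grid = np.linspace(0, 12, 12001)                  # candidate values of theta

sse = ((y[:, None] - theta_grid) ** 2).sum(axis=0)      # sum of squared deviations
sad = np.abs(y[:, None] - theta_grid).sum(axis=0)       # sum of absolute deviations

print("argmin SSE:", theta_grid[sse.argmin()], " sample mean:", y.mean())
print("argmin SAD:", theta_grid[sad.argmin()], " sample median:", np.median(y))
```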
Exercise: Find the MLE of $\theta$, given a random sample $Y_1,\dots,Y_n$, each drawn from the p.d.f.
$$f(y_i) = \begin{cases}\theta^{y_i}(1-\theta)^{1-y_i} & y_i = 0, 1\\ 0 & \text{otherwise}\end{cases}$$
Write down the log-likelihood:
$$\ln L(\theta) = \sum_{i=1}^n y_i\ln\theta + \left(n - \sum_{i=1}^n y_i\right)\ln(1-\theta)$$
Exercise: Let X be a normally distributed random variable, $X\sim N(\mu,\sigma^2)$. Using a random sample of n observations, obtain the MLE estimators for $\mu$ and $\sigma^2$.
– Solution: In a random sample of n observations, the density of each observation is $f(x_i;\mu,\sigma^2)$. Since the n observations are independent, their joint density is:
$$f(x_1, x_2,\dots,x_n;\mu,\sigma^2) = \prod_{i=1}^n f(x_i;\mu,\sigma^2) = L(\mu,\sigma^2\,|\,x_1, x_2,\dots,x_n)$$
The first FOC (derivative of $\ln L$ with respect to $\mu$) yields $\hat\mu_{MLE} = \bar x$. Taking the derivative with respect to $\sigma^2$, we get $\frac{\partial\ln L}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i - \mu)^2$. This gives us the second FOC:
$$-\frac{n}{2\hat\sigma^2_{MLE}} + \frac{1}{2\hat\sigma^4_{MLE}}\sum_{i=1}^n (x_i - \hat\mu_{MLE})^2 = 0 \iff \frac{1}{\hat\sigma^2_{MLE}}\sum_{i=1}^n (x_i - \hat\mu_{MLE})^2 = n \implies \hat\sigma^2_{MLE} = \frac{\sum_{i=1}^n (x_i - \bar x)^2}{n}$$
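For illustration, the following Python sketch maximizes the normal log-likelihood numerically on a simulated sample and compares the result with the closed-form expressions $\bar x$ and $\sum_i (x_i - \bar x)^2/n$ derived above; the simulated parameter values ($\mu = 3$, $\sigma = 2$) are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=500)     # simulated sample, arbitrary mu=3, sigma=2

def neg_loglik(params):
    mu, log_sigma2 = params                      # parameterize sigma^2 via its log to keep it positive
    sigma2 = np.exp(log_sigma2)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

res = minimize(neg_loglik, x0=[0.0, 0.0])

print("numerical MLE:  mu =", res.x[0], " sigma^2 =", np.exp(res.x[1]))
print("closed form:    mu =", x.mean(), " sigma^2 =", ((x - x.mean()) ** 2).mean())
```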
Difficult exercise (may want to skip this one): Find the MLE of the scalars $\beta$ and $\sigma^2$, given that the data has the joint distribution $y\sim N(x\beta, \sigma^2\Omega)$, where $y = (y_1,\dots,y_n)'$ and $x = (x_1,\dots,x_n)'$. We assume $\Omega$ is known. [Allows for dependence between observations!]
– Solution: The joint density is now NOT the product of the marginals but instead (see Chapter 4):
$$f(y) = (2\pi\sigma^2)^{-n/2}(\det\Omega)^{-1/2}\exp\left(-\frac{1}{2\sigma^2}(y - x\beta)'\Omega^{-1}(y - x\beta)\right)$$
$$\ln L = -\frac n2\log(2\pi) - \frac n2\log\sigma^2 - \frac12\ln(\det\Omega) - \frac{1}{2\sigma^2}(y - x\beta)'\Omega^{-1}(y - x\beta)$$
$$= -\frac n2\log(2\pi) - \frac n2\log\sigma^2 - \frac12\ln(\det\Omega) - \frac{1}{2\sigma^2}\left(\beta^2 x'\Omega^{-1}x + y'\Omega^{-1}y - 2\beta\, x'\Omega^{-1}y\right)$$
after expanding the quadratic form (uses linear algebra).
The first FOC (we will discuss a related derivation in Chapter 8 in more detail):
$$\frac{\partial\ln L}{\partial\beta} := -\frac{1}{2\hat\sigma^2}\left(2x'\Omega^{-1}x\,\hat\beta - 2x'\Omega^{-1}y\right) = 0 \implies \hat\beta = \frac{x'\Omega^{-1}y}{x'\Omega^{-1}x}$$
(An estimator we call the GLS estimator of $\beta$ in our econometrics courses.)
The second FOC:
$$\frac{\partial\ln L}{\partial\sigma^2} := -\frac{n}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}(y - x\hat\beta)'\Omega^{-1}(y - x\hat\beta) = 0 \implies \hat\sigma^2 = \frac{(y - x\hat\beta)'\Omega^{-1}(y - x\hat\beta)}{n}$$
Exercise (tricky): Suppose that the random variables $Y_1, Y_2,\dots,Y_N$ are drawn from a uniform distribution so that $Y_i\sim U(0,\theta)$. We have that:
$$L(Y_1, Y_2,\dots,Y_N;\theta) = \left(\frac1\theta\right)^N \text{ for } 0\le Y_{MAX}\le\theta, \quad\text{and } 0 \text{ if any } Y_i\notin[0,\theta]$$
What is the MLE estimator of $\theta$?
– Solution: Since L is not twice continuously differentiable in $\theta$, we need to approach its maximization without differentiation. The solution is readily obtained with the aid of a picture of the likelihood:
[Figure: $L_N(\theta)$ is zero for $\theta < Y_{MAX}$ and decreasing in $\theta$ for $\theta\ge Y_{MAX}$, so it is maximized at $\theta = Y_{MAX}$.]
Therefore the MLE estimator is $\hat\theta = Y_{MAX}$. Successful maximization could not have been obtained by straightforward differentiation in this case.
Maximum Likelihood will provide different answers depending on the particular distribution we believe the data was generated from.
See also LM Chapter 5.2.
6.5.3 Method of moments estimator
The Method of Moments Estimator assumes that there are moment conditions
$$E(m(Y_i;\theta)) = 0.$$
– The MM estimator consists of choosing $\hat\theta_{MM}$ in such a way that the sample analogues of these moments are satisfied.
– $\hat\theta_{MM}$ solves
$$\frac1n\sum_{i=1}^n m(Y_i;\hat\theta_{MM}) = 0$$
If we want to estimate p parameters, we will need at least p moment conditions involving these parameters.
If there are more moments than parameters of interest, we will consider the Generalized Method of Moments Estimator, which optimally weights the sample moments.
Exercise: Let $Y_1,\dots,Y_n$ be a random sample drawn from a distribution with mean $\mu$ and variance $\sigma^2$. Obtain the MM estimator of $\theta = (\mu,\sigma^2)$.
– Solution: We need 2 moment conditions, involving $\mu$ and $\sigma^2$: $E(Y_i - \mu) = 0$ and $E(Y_i^2) - \mu^2 - \sigma^2 = 0$.
Recall $Var(Y) = E(Y^2) - [E(Y)]^2$.
The method of moments estimator is given by (sample analogues):
$$\frac1n\sum_{i=1}^n (Y_i - \hat\mu_{MM}) = 0$$
$$\frac1n\sum_{i=1}^n Y_i^2 - \hat\mu^2_{MM} - \hat\sigma^2_{MM} = 0$$
Together they yield $\hat\mu_{MME} = \frac1n\sum_{i=1}^n Y_i = \bar Y$ and $\hat\sigma^2_{MME} = \frac1n\sum_{i=1}^n Y_i^2 - \bar Y^2 = \frac1n\sum_{i=1}^n (Y_i - \bar Y)^2$.
Exercise: Again, suppose that the random variables $W_1, W_2,\dots,W_N$ are drawn from a uniform distribution so that $W_i\sim U(0,\theta)$. What is the MM estimator of $\theta$?
– Solution: We know that $E(W_i) = \frac{0+\theta}{2} = \frac\theta2$. Thus, the MME estimator is defined by the sample moment equation $\bar W = \frac{\hat\theta_{MME}}{2}$, i.e. $\hat\theta_{MME} = 2\bar W$.
– Implementation: Say $n = 7$ and $\{W_1, W_2,\dots,W_7\} = \{25000, 30000, 20000, 25000, 45000, 5000, 25000\}$. Then $\hat\theta_{MME} = 2\bar W = 2\times 25000 = 50000$. The $\hat\theta_{MLE}$ would be 45000 (see the sketch below).
– In fact, the MM estimator is not always valid: because $\theta$ is a boundary value of the support, nothing prevents $2\bar W$ from falling below the observed maximum $W_{MAX}$, a value the parameter cannot take.
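A tiny illustrative Python sketch reproducing the implementation numbers above: the method-of-moments estimate $2\bar W$ and the maximum-likelihood estimate $W_{MAX}$ for the seven listed observations.

```python
import numpy as np

w = np.array([25000, 30000, 20000, 25000, 45000, 5000, 25000])

theta_mm = 2 * w.mean()    # method of moments: theta = 2*E(W), with E(W) replaced by the sample mean
theta_mle = w.max()        # maximum likelihood: the sample maximum

print("MM estimate:", theta_mm)    # 50000.0
print("MLE estimate:", theta_mle)  # 45000
```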
6.6 Interval estimation
Regardless of the properties of an estimator, the estimate obtained will vary from sample to sample. A point estimate is a single number, and usually we want to know more. How confident can we be about our estimate (precision)?
– The expected temperature next Saturday at 4pm is 22C. Does it mean "something between 21C and 23C with very high probability" or "something between 15C and 29C with high probability"?
An interval estimate refers to a range of values such that we can expect this interval to contain the true parameter in some specified proportion of the samples, or with some desired level of confidence.
[Figure: sampling density of $\bar X$, centered at $\mu$ with standard deviation $\sigma/\sqrt n$; each tail beyond the interval has area $\alpha/2$.]
Confidence intervals
– To help us define the confidence interval, we first note
$$Z = \frac{\bar X - \mu}{\sigma/\sqrt n}\sim N(0,1)$$
$$\Pr\left(-z_{\alpha/2}\le Z\le z_{\alpha/2}\right) = 1-\alpha, \qquad\text{where } \Pr\left(Z > z_{\alpha/2}\right) = \alpha/2$$
$$\Pr\left(\bar X - z_{\alpha/2}\frac{\sigma}{\sqrt n}\le\mu\le\bar X + z_{\alpha/2}\frac{\sigma}{\sqrt n}\right) = 1-\alpha$$
That is, with $1-\alpha$ confidence,
$$\mu\in\left[\bar X - z_{\alpha/2}\frac{\sigma}{\sqrt n},\ \bar X + z_{\alpha/2}\frac{\sigma}{\sqrt n}\right]$$
How can we obtain the confidence interval if the distribution of X is not normal?
– Remember the Central Limit Theorem: as long as n is big enough, $\bar X$ will be approximately normally distributed, so the confidence interval is asymptotically correct!
– If the sample is small, then more attention needs to be paid to the distribution of X.
What if we don't know $\sigma^2$?
– To allow us to obtain the confidence interval for $\mu$, we need to replace $\sigma$ with $S$ in Z, and use
$$T_{n-1} = \frac{\bar X - \mu}{S/\sqrt n}$$
instead. This random variable has a Student t distribution with $n-1$ degrees of freedom.
As n becomes larger, the t distribution approaches the standard normal distribution and the confidence interval changes very little. But for smaller n, there is a significant difference. The t distribution is symmetric around the mean but has thicker tails than the standard normal distribution, so that more of its area falls within the tails. While there is only one standard normal distribution, there is a t distribution for each sample size.
– Like we did with the standard normal distribution, let's define $t_{\alpha/2,n-1}$ such that
$$\Pr\left(T_{n-1} > t_{\alpha/2,n-1}\right) = \alpha/2$$
$$\mu\in\left[\bar X - t_{\alpha/2,n-1}\frac{S}{\sqrt n},\ \bar X + t_{\alpha/2,n-1}\frac{S}{\sqrt n}\right]$$
– Comparing with the expression above: if we use S instead of $\sigma$, we replace the standard normal distribution with the Student t distribution with $n-1$ degrees of freedom. This interval typically is wider!
Exercise: A random sample of n = 10 flashlight batteries with a mean operating life $\bar X = 5$h and a sample standard deviation s = 1h is picked from a production line known to produce batteries with normally distributed operating lives. Find the 95% C.I. for the unknown mean of the working life of the entire population of batteries.
– Solution: We first find the value of $t_{0.025}$ so that 2.5% of the area is within each tail for $n-1 = 9$ degrees of freedom. This is obtained from the tables by moving down the column headed 0.025 to 9 df. The value we get is 2.262. Thus:
$$\mu = \bar X \pm 2.262\frac{s}{\sqrt n} = 5 \pm 2.262\frac{1}{\sqrt{10}} \simeq 5 \pm 2.262(0.316) \simeq 5 \pm 0.71$$
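A short illustrative Python sketch reproducing this interval, looking up the t critical value with scipy instead of a table:

```python
import numpy as np
from scipy import stats

n, xbar, s, alpha = 10, 5.0, 1.0, 0.05

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)   # 2.262 for 9 degrees of freedom
half_width = t_crit * s / np.sqrt(n)

print(f"t critical value: {t_crit:.3f}")
print(f"95% CI: [{xbar - half_width:.2f}, {xbar + half_width:.2f}]")   # roughly [4.28, 5.72]
```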
7 Hypothesis testing
7.1 Classical testing procedure
Hypothesis test: a rule that determines whether a particular value $\theta_0\in\Theta$ is consistent with the evidence of the sample.
– two sided: $H_A: \theta\neq\theta_0$
For example, I want to test whether the mean age of a student in this course is 25.
$H_0: \mu = 25$
$H_1: \mu\neq 25$
The test procedure is a rule, stated in terms of the data, that dictates whether the null hypothesis should be rejected or not.
The classical, or Neyman-Pearson, methodology involves partitioning the sample space into two regions.
– If the observed data (i.e., the test statistic) fall in the rejection region, then we reject $H_0$.
– The rejection region is defined by our willingness to commit a Type I error (significance level)!
1. Define $H_0$ and $H_1$.
2. Formulate a test statistic.
3. Partition the sample space into the rejection region and the acceptance region. How?
4. Reject $H_0$ if the test statistic falls in the rejection region; do not reject $H_0$ if the test statistic does not fall in the rejection region.
Since the sample is random, the test statistic is also random. The same test can lead to different conclusions in different samples. As such, there are two ways a procedure can be in error:
– Type I error: the procedure can lead to the rejection of the null when it is true.
– Type II error: the procedure can fail to reject the null when it is false.
7.1.2 Significance level and power of a test
Significance level ($\alpha$) = probability of a Type I error
– The level of significance is under the control of the analyst (typically set equal to 5%).
– We also call this the size of the test (under control by the analyst).
– Given $H_A$, this allows us to determine when we want to start rejecting $H_0$ (recognizing that with probability $\alpha$ we reject even though the null is true).
What is our willingness to erroneously reject a null?
– For a given significance level $\alpha$, we would like the power of our test to be as large as possible; in other words, the probability of a Type II error ($\beta$) should be as small as possible.
The power of a test is the ability of the test to reject when the null is false!
To ensure that our tests are powerful, we want to make use of efficient estimators!
– The power and the probability of a Type II error ($\beta$) are defined in terms of the alternative hypothesis and depend on the value of the parameter.
The power of the test $H_0: \mu = 50$ (mean age of EC400 students) when $\mu = 25$ (truth) should be close to 1, i.e., we should really reject $\mu = 50$ for any sample! The power of the test $H_0: \mu = 26$ is much smaller.
A petroleum company is searching for additives that may increase gas mileage (example from LM, section 6.2).
– They send 30 cars with the new additive on a road trip from Boston to Los Angeles.
Without the additive: $X\sim N(25, (2.4)^2)$ mpg.
With the new additive: $X\sim N(\mu, (2.4)^2)$ mpg.
– We want to know whether $\mu > 25$.
Hypotheses: $H_0: \mu = 25$, $H_1: \mu > 25$.
– Use $\bar X$ as an estimator for $\mu$. Reject when our estimate $\bar x$ is too big (relative to 25).
– The sampling distribution of $\bar X$ under $H_0$ will determine what it means for $\bar x$ to be too big.
Under $H_0$: $\bar X\sim N(25, \sigma^2/n)$
– More precisely, our willingness to commit a Type I error with probability $\alpha$ (significance level) will tell us when to reject.
We should reject when $\bar x > x^*$, where $x^*$ is defined as
$$\Pr(\bar X > x^*) = \alpha$$
[Figure: density of $\bar X$ under $H_0$, centered at $\mu = 25$ with standard deviation $\sigma/\sqrt n$; the area to the right of $x^*$ equals $\alpha$.]
How to find $x^*$?
$$z_\alpha = \frac{x^* - 25}{\sigma/\sqrt n}$$
– Under $H_0$: $\frac{\bar X - 25}{\sigma/\sqrt n} = Z\sim N(0,1)$, so use the N(0,1) table to obtain $z_\alpha$ with $\Pr(Z > z_\alpha) = \alpha$.
– It is a statistic, as it does not contain any unknown parameters, and it has a distribution for which we can easily find the critical values!
$Z\sim N(0,1)$ under $H_0$
7.2.2 The p-value
Given n, $\sigma$ and $\bar x$, we could have asked the question: which levels of significance ($\alpha$) would lead us to reject $H_0$?
Definition: The p-value provides the lowest level of significance ($\alpha$) at which we would reject the null.
– After calculating the Z-statistic for our sample, $z = \frac{\bar x - 25}{\sigma/\sqrt n}$, the p-value is defined by:
$$P(Z > z) = 1 - \Phi(z)$$
In the example, $z = \frac{\bar x - 25}{2.4/\sqrt{30}} = 2.97$. From the standard normal table, $\Phi(2.97) = 0.9985$, so the p-value is $1 - \Phi(2.97) = 0.15\%$.
As we had decided to reject $H_0$ at the 5% level, we observe that the p-value is smaller than 5%. We would also have rejected the hypothesis when $\alpha = 1\%$!
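An illustrative Python sketch of this p-value calculation, using the z-statistic of 2.97 quoted above:

```python
from scipy import stats

z = 2.97                               # z-statistic from the gas-mileage example
p_value = 1 - stats.norm.cdf(z)        # one-sided p-value, P(Z > z)

print(f"p-value = {p_value:.4f}")      # about 0.0015, i.e. 0.15%
```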
Now we report the analogous results when we consider the two-sided alternative
$H_0: \mu = 25$, $H_1: \mu\neq 25$.
Again, use $\bar X$ as our estimator of $\mu$.
Here, we would like to reject $H_0$ if $\bar x$ is too high or too low! $\Pr(\bar X > x_H) = \alpha/2$ and $\Pr(\bar X < x_L) = \alpha/2$.
Probability distribution of $\bar X$:
$$\frac{\bar X - 25}{\sigma/\sqrt n}\sim N(0,1) \text{ under } H_0$$
$$\Pr(\bar X > x_H) = \Pr\left(\frac{\bar X - 25}{\sigma/\sqrt n} > \frac{x_H - 25}{\sigma/\sqrt n}\right) = \alpha/2, \qquad 1 - \Phi(z_H) = \alpha/2$$
$$\Pr(\bar X < x_L) = \Pr\left(\frac{\bar X - 25}{\sigma/\sqrt n} < \frac{x_L - 25}{\sigma/\sqrt n}\right) = \alpha/2, \qquad \Phi(z_L) = \alpha/2$$
– $z_H$ solves $\Phi(z_H) = 1 - \alpha/2$ and $z_L$ solves $\Phi(z_L) = \alpha/2$.
Note that $z_H = -z_L = z_{\alpha/2}$ (normal distribution centered around zero; define $z_{\alpha/2} > 0$).
– $x_L = 25 - z_{\alpha/2}\,\sigma/\sqrt n$, $x_H = 25 + z_{\alpha/2}\,\sigma/\sqrt n$
As before, the Z-statistic for testing the mean of a normal population with known variance is
$$Z = \frac{\bar X - 25}{\sigma/\sqrt n}\sim N(0,1) \text{ under } H_0$$
$H_0: \mu = 25 = \mu_0$
$H_1: \mu = \mu_1$ (say $\mu_1 > \mu_0$, i.e. one-sided)
Our test: reject if $Z = \frac{\bar X - 25}{\sigma/\sqrt n} > z_\alpha$ (say $\alpha = 5\%$).
What is the power of this test when the true mean is $\mu_1 = 26$?
– To compute this, we need to realize that if the true mean is $\mu_1$, Z no longer has a standard normal distribution!
$$\Pr\left(\frac{\bar X - 25}{\sigma/\sqrt n} > z_\alpha \,\Big|\, \mu = \mu_1\right) = \Pr\left(\frac{\bar X - 26}{\sigma/\sqrt n} > z_\alpha + \frac{25 - 26}{\sigma/\sqrt n}\right)$$
Since under $H_1$: $\frac{\bar X - 26}{\sigma/\sqrt n}\sim N(0,1)$, we can obtain the power by
$$1 - \Phi\left(z_\alpha + \frac{25 - 26}{2.4/\sqrt{30}}\right) = 1 - \Phi\left(1.645 + \frac{25 - 26}{2.4/\sqrt{30}}\right) \approx 74\%$$
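An illustrative Python sketch of this power calculation, using the numbers of the example ($\alpha = 5\%$, $\sigma = 2.4$, $n = 30$, $\mu_1 = 26$):

```python
import numpy as np
from scipy import stats

alpha, sigma, n = 0.05, 2.4, 30
mu0, mu1 = 25.0, 26.0

z_alpha = stats.norm.ppf(1 - alpha)             # 1.645, one-sided critical value
shift = (mu0 - mu1) / (sigma / np.sqrt(n))      # mean shift expressed in standard errors
power = 1 - stats.norm.cdf(z_alpha + shift)

print(f"power = {power:.2%}")                   # about 74%
```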
– When $\sigma$ is unknown, we need to realize that the Z-statistic can no longer be computed, as it contains the unknown quantity $\sigma$: it is not a valid test statistic.
– We need to replace $\sigma$ with an estimator of the standard deviation:
$$S_X = \sqrt{\frac{1}{n-1}\sum_{i=1}^n\left(X_i - \bar X\right)^2}$$
$SE(\bar X) = S_X/\sqrt n$ is the standard error of the estimator $\bar X$ (square root of the estimated variance of $\bar X$).
The resulting statistic, $\frac{\bar X - \mu_0}{S_X/\sqrt n}$, has a Student t distribution with $n-1$ degrees of freedom.
If n is big, the Student t distribution is very close to the standard normal distribution. But for small n, there is a significant difference: the Student t distribution has thicker tails.
Lemma: Let $X_1, X_2,\dots,X_n$ be a random sample of size n from $N(\mu,\sigma^2)$. Then
$$\frac{\sqrt n(\bar X - \mu)}{S_X}\sim t(n-1)$$
where $\bar X\sim N(\mu,\sigma^2/n)$ and $S_X^2 = \frac{1}{n-1}\sum_{i=1}^n\left(X_i - \bar X\right)^2$.
Proof: Rewrite
$$\frac{\sqrt n(\bar X - \mu)}{S_X} = \frac{\sqrt n(\bar X - \mu)/\sigma}{\sqrt{\underbrace{(n-1)S_X^2/\sigma^2}_{\chi^2(n-1)}\Big/(n-1)}} = \frac{N(0,1)}{\sqrt{\chi^2(n-1)/(n-1)}}$$
Using the independence of $\bar X$ and $S_X^2$ (not shown here), we have the required formulation of a $t(n-1)$ random variable.
Depending on whether $\sigma^2$ is known or not, we obtain critical values from $N(0,1)$ or $t_{n-1}$. Given the alternative hypothesis (one-sided or two-sided), we define a rejection region where we will be rejecting $H_0$.
A one-sided test has the benefit of improved power.
See also LM Chapter 7.2-7.4.
– The estimator on which we base our test of $\sigma^2$ is $S_X^2$.
– The test statistic is given by
$$\chi^2 = \frac{(n-1)S_X^2}{\sigma_0^2}\sim\chi^2_{n-1} \text{ under } H_0$$
7.5 Hypothesis testing and confidence intervals
Rather than reporting point estimates $\hat\theta$, we may want to report a confidence interval for the parameter(s) of interest.
Definition: A $100(1-\alpha)\%$ confidence interval for $\theta$ is an interval $\left[\text{lower}(\hat\theta),\ \text{upper}(\hat\theta)\right]$ such that
$$\Pr\left(\text{lower}(\hat\theta)\le\theta\le\text{upper}(\hat\theta)\right) = 1-\alpha$$
– Often the confidence interval is given by
$$\left[\hat\theta - C_{\alpha/2}\,SE(\hat\theta),\ \hat\theta + C_{\alpha/2}\,SE(\hat\theta)\right]$$
with $C_{\alpha/2}$ the critical value taken from the appropriate distribution, and $SE(\hat\theta)$ an (estimate of) the standard deviation of $\hat\theta$.
If a hypothesized value of the parameter does not fall in this range of plausible values, then the data are not consistent with the hypothesis, and it should be rejected.
8 The classical linear regression model (2018: mostly self-study)
8.1 Multiple linear regression model
Study the causal relationship between a dependent variable, y, and one or more independent variables, $x_1,\dots,x_k$:
$$y = x_1\beta_1 + \dots + x_k\beta_k + \varepsilon$$
– Underlying economic theory will specify the dependent and independent variables in the model, and $\varepsilon$ is a random disturbance.
– Given a sample $\{(y_i, x_{i1},\dots,x_{ik})\}_{i=1}^n$, the objective is to estimate the unknown parameters, study theoretical propositions, and use the model to predict the variable y.
How to proceed depends on the assumptions we are happy to make concerning the stochastic process that generated our data.
– $\beta_j$ provides the causal marginal effect the explanatory variable $x_j$ has on the conditional expectation of y, ceteris paribus.
– For statistical inference, it is usual to add the assumption that the disturbances come from a normal distribution (exact hypothesis testing, t and F tests).
GM + normality renders OLS the MVUE estimator!
Our model has n observations, so let's stack them:
$$\begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_n\end{pmatrix} = \begin{pmatrix} x_1'\beta + \varepsilon_1\\ x_2'\beta + \varepsilon_2\\ \vdots\\ x_n'\beta + \varepsilon_n\end{pmatrix}, \quad\text{i.e. } y = X\beta + \varepsilon$$
A2 No Perfect Collinearity
A3 Zero conditional mean of the errors: $E(\varepsilon_i|X) = 0$
This assumption guarantees that we can interpret the regression of y on X as the conditional mean of y: $E(y_i|X) = x_{i1}\beta_1 + \dots + x_{ik}\beta_k$.
Omission of relevant factors (in $\varepsilon$) that are correlated with any of $x_1,\dots,x_k$ forms a violation of this assumption.
Any correlation between the errors and regressors violates A3! (Example: measurement error in the explanatory variables.)
– When we have random sampling (cross-sectional) we only need to worry about correlation between $\varepsilon_i$ and characteristics of individual i.
– In time series data (where random sampling is quite unreasonable) we need to worry about correlation between $\varepsilon_t$ and future values of the explanatory variables $x_{t+1}, x_{t+2},\dots$ as well.
A4 Homoskedasticity and nonautocorrelation
$Var(\varepsilon_i|x_1,\dots,x_k) = \sigma^2$ and $Cov(\varepsilon_i,\varepsilon_j|x_1,\dots,x_k) = 0$ for all $i\neq j$; together with A3 this gives $Var(\varepsilon|X) = E(\varepsilon\varepsilon'|X) = \sigma^2 I$.
The presence of heteroskedasticity ($\sigma_i^2\neq\sigma^2$) and dependence among disturbances (time, spatial dependence) are commonplace in economics.
Concerning our GM assumptions, we may want to distinguish two data generating processes for the regressors:
– Fixed (non-stochastic) regressors
Under repeated sampling the regressors remain fixed, as would be the case in an experiment.
If X is fixed, $E(\varepsilon_i|X) = E(\varepsilon_i)$, and A3 reduces to $E(\varepsilon_i) = 0$. Similarly, A4 reduces to $Var(\varepsilon_i) = \sigma^2$ and $Cov(\varepsilon_i,\varepsilon_j) = 0$, $i\neq j$.
– Random (stochastic) regressors
In obtaining a new sample $\{(y_i, x_{i1},\dots,x_{ik})\}_{i=1}^n$ it is difficult in general to control (keep fixed) the x's!
If $\varepsilon$ and X are (mean) independent, we also have $E(\varepsilon_i|X) = E(\varepsilon_i)$!
The assumption $E(\varepsilon_i|X) = 0$ is important for establishing the unbiasedness of our OLS estimator! OLS may still be consistent if this assumption fails (as long as $E(x_i\varepsilon_i) = 0$).
8.3 Estimation
8.3.1 Minimum distance: ordinary least squares
$$y_i = \beta_0 + x_i\beta_1 + \varepsilon_i, \quad E(\varepsilon_i|X) = 0 \text{ and } E(\varepsilon\varepsilon'|X) = \sigma^2 I$$
OLS estimates $\beta_0$ and $\beta_1$ by minimizing the sum of squares of the vertical distances from the data points to the fitted regression line.
[Figure: scatter plot of the data with the fitted line; the residual $\hat\varepsilon_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i)$ is the vertical distance from the point $(x_i, y_i)$ to the line.]
– Given $\{(x_i, y_i)\}_{i=1}^n$, it determines the straight line that minimizes:
$$S(\beta_0,\beta_1) = \sum_{i=1}^n\left[y_i - (\beta_0 + \beta_1 x_i)\right]^2$$
$$\hat\varepsilon_i = y_i - \hat y_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i)$$
$$\hat\beta_0 = \bar y - \hat\beta_1\bar x$$
$$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} = \frac{\text{SampleCov}(x_i, y_i)}{\text{SampleVar}(x_i)}$$
The (unbiased) estimator of the error variance $\sigma^2$, here (2 parameters):
$$s^2 = \frac{\sum_{i=1}^n\hat\varepsilon_i^2}{n-2} = \frac{RSS}{n-2}$$
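A minimal Python sketch, for illustration, computing these OLS quantities on simulated data; the true parameter values ($\beta_0 = 1$, $\beta_1 = 0.5$) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.8, size=n)   # assumed DGP: beta0 = 1, beta1 = 0.5

beta1_hat = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
beta0_hat = y.mean() - beta1_hat * x.mean()

resid = y - (beta0_hat + beta1_hat * x)
s2 = (resid ** 2).sum() / (n - 2)                   # unbiased estimator of the error variance

print(beta0_hat, beta1_hat, s2)
```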
Important property of OLS residuals: the residuals are orthogonal to the regressors, with $\hat\varepsilon_i = y_i - \hat\beta_0 - \hat\beta_1 x_i$.
– This property is in line with the classical linear regression assumption: errors and regressors are uncorrelated!
$\frac1n\sum_{i=1}^n\hat\varepsilon_i = 0$ is the sample analogue of $E(\varepsilon_i) = 0$
$\frac1n\sum_{i=1}^n\hat\varepsilon_i x_i = 0$ is the sample analogue of $E(\varepsilon_i x_i) = 0$
– The MM estimator estimates $\beta_0$, $\beta_1$ and $\sigma^2$ by enforcing the sample analogues of these population moments:
$$\begin{cases}\frac1n\sum_{i=1}^n\hat\varepsilon_i = 0\\[4pt] \frac1n\sum_{i=1}^n\hat\varepsilon_i x_i = 0\\[4pt] \frac1n\sum_{i=1}^n\hat\varepsilon_i^2 = \hat\sigma^2_{MM}\end{cases} \quad\text{where } \hat\varepsilon_i = y_i - \hat\beta_{0,MM} - \hat\beta_{1,MM}x_i$$
The (conditional) joint density of the observations $y_1, y_2,\dots,y_n$ (which defines the likelihood function) is:
$$L = \prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\tfrac12\left(\left[y_i - \beta_0 - \beta_1 x_i\right]/\sigma\right)^2\right)$$
– The MLE estimator estimates $\beta_0$, $\beta_1$ and $\sigma^2$ by maximizing the (log-)likelihood:
$$\ln L(\beta_0,\beta_1,\sigma^2) = -\frac n2\ln(2\pi) - \frac n2\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n\left(y_i - (\beta_0 + \beta_1 x_i)\right)^2$$
$$\frac{\partial\ln L}{\partial\beta_0}:\ \frac{1}{\hat\sigma^2}\sum_{i=1}^n\left(y_i - (\hat\beta_0 + \hat\beta_1 x_i)\right) = 0, \quad\text{i.e. } \frac{1}{\hat\sigma^2}\sum_{i=1}^n\hat\varepsilon_i = 0$$
$$\frac{\partial\ln L}{\partial\beta_1}:\ \frac{1}{\hat\sigma^2}\sum_{i=1}^n\left(y_i - (\hat\beta_0 + \hat\beta_1 x_i)\right)x_i = 0, \quad\text{i.e. } \frac{1}{\hat\sigma^2}\sum_{i=1}^n\hat\varepsilon_i x_i = 0$$
$$\frac{\partial\ln L}{\partial\sigma^2}:\ -\frac{n}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}\sum_{i=1}^n\left(y_i - (\hat\beta_0 + \hat\beta_1 x_i)\right)^2 = 0, \quad\text{i.e. } -\frac{n}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}\sum_{i=1}^n\hat\varepsilon_i^2 = 0$$
with $\hat\varepsilon_i = y_i - \hat\beta_{0,MLE} - \hat\beta_{1,MLE}x_i$
– The first two FOCs are the same as those from OLS ⇒ $\hat\beta_{MLE} = \hat\beta_{OLS}$
– The last FOC yields $\hat\sigma^2_{MLE} = \frac1n\sum_{i=1}^n\hat\varepsilon_i^2 = \frac{RSS}{n} = \hat\sigma^2_{MM}$
– These MLE estimates critically depend on the joint normality assumption. Other MLE estimates would result if a different distributional assumption was made.
8.4 Properties of OLS estimators in CLM
8.4.1 Unbiased, efficient (BLUE), consistent
Consider $\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}$ in the simple linear regression model.
Let X be fixed: A1-A4 ensure that $Var(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2} = \frac{\sigma^2}{(n-1)s_X^2}$
– By definition, writing $\hat\beta_1 = \beta_1 + \sum_i d_i\varepsilon_i$ with $d_i = (x_i - \bar x)\big/\sum_j (x_j - \bar x)^2$:
$$Var(\hat\beta_1) = Var\left(\beta_1 + \sum_{i=1}^n d_i\varepsilon_i\right) = \sum_{i=1}^n Var(d_i\varepsilon_i) + \sum_{i\neq j}Cov(d_i\varepsilon_i, d_j\varepsilon_j) = \sum_{i=1}^n d_i^2 Var(\varepsilon_i) + \sum_{i\neq j}d_i d_j Cov(\varepsilon_i,\varepsilon_j)$$
A4 yields $Var(\hat\beta_1) = \sigma^2\sum_{i=1}^n d_i^2$, and plugging the definition of $d_i$ in gives the desired answer.
– If X is stochastic: A1-A4 ensure that $Var(\hat\beta_1) = E_X\left[\frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}\right]$
This uses the relation between conditional and unconditional variance. Note $Var(E_X(\hat\beta_1|X)) = Var(\beta_1) = 0$.
– The precision of our estimate of $\beta_1$ is enhanced by
a larger sample size (n),
more variability of the x regressors ($s_X^2$),
and a smaller error variance ($\sigma^2$)!
X fixed: If $\lim_{n\to\infty}\frac1n\sum_{i=1}^n (x_i - \bar x)^2 = \sigma_x^2 > 0$, then $Var(\hat\beta_1) = \frac{\sigma^2/n}{\frac1n\sum_{i=1}^n (x_i - \bar x)^2}\to 0$ as $n\to\infty$, while $E(\hat\beta_1) = \beta_1$ (sufficient conditions for consistency).
$\frac{(n-2)s^2}{\sigma^2}\sim\chi^2_{n-2}$, and $\hat\beta_1$ and $s^2$ are independent.
– We need to solve the system of k first order conditions $\frac{\partial S(\beta)}{\partial\beta_1} = 0,\ \dots,\ \frac{\partial S(\beta)}{\partial\beta_k} = 0$ for $\hat\beta_1,\dots,\hat\beta_k$:
$$\frac{\partial S(\beta)}{\partial\beta}\bigg|_{\hat\beta} = \begin{pmatrix}\frac{\partial S(\beta)}{\partial\beta_1}\\ \vdots\\ \frac{\partial S(\beta)}{\partial\beta_k}\end{pmatrix}_{\hat\beta} = 0$$
– A natural extension of the FOCs of the simple linear regression model yields:
$$\begin{pmatrix}\sum_{i=1}^n x_{i1}\hat\varepsilon_i\\ \vdots\\ \sum_{i=1}^n x_{ik}\hat\varepsilon_i\end{pmatrix} = 0 \quad\text{with } \hat\varepsilon_i = y_i - x_i'\hat\beta$$
In matrix notation, this equals $X'\hat\varepsilon = 0$ with $\hat\varepsilon = y - X\hat\beta$:
$$X'\left(y - X\hat\beta\right) = 0 \implies X'X\hat\beta = X'y \overset{A2}{\implies} \hat\beta = (X'X)^{-1}X'y$$
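A short illustrative numpy sketch of this matrix formula on simulated data; the chosen coefficients are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # include an intercept column
beta_true = np.array([1.0, -2.0, 0.5])                            # arbitrary true coefficients
y = X @ beta_true + rng.normal(size=n)

# beta_hat = (X'X)^{-1} X'y, computed via a linear solve rather than an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_hat)                       # should be close to beta_true
print(X.T @ (y - X @ beta_hat))       # residuals are orthogonal to the regressors (about zero)
```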
Let us consider the vector of derivatives $\frac{\partial S(\beta)}{\partial\beta}$ directly:
$$S(\beta) = (y - X\beta)'(y - X\beta) = y'y - \beta'X'y - y'X\beta + \beta'X'X\beta = y'y - 2\beta'X'y + \beta'X'X\beta$$
since $\beta'X'y = (y'X\beta)'$ is a scalar.
– $\frac{\partial(y'y)}{\partial\beta} = 0$
– $\frac{\partial(-2\beta'X'y)}{\partial\beta} = -2X'y$
Let us simplify: with $z = X'y$, $\frac{\partial(\beta'z)}{\partial\beta} = \frac{\partial\left(\sum_{j=1}^k\beta_j z_j\right)}{\partial\beta} = \begin{pmatrix}z_1\\ \vdots\\ z_k\end{pmatrix} = z$
– $\frac{\partial(\beta'X'X\beta)}{\partial\beta} = 2X'X\beta$
Let us simplify: for a $k\times k$ matrix Z,
$$\frac{\partial(\beta'Z\beta)}{\partial\beta} = \frac{\partial\left(\sum_{i=1}^k\sum_{j=1}^k\beta_i z_{ij}\beta_j\right)}{\partial\beta} = \begin{pmatrix}\sum_{j=1}^k z_{1j}\beta_j + \sum_{i=1}^k\beta_i z_{i1}\\ \vdots\\ \sum_{j=1}^k z_{kj}\beta_j + \sum_{i=1}^k\beta_i z_{ik}\end{pmatrix} = \begin{pmatrix}\sum_{j=1}^k\beta_j(z_{1j} + z_{j1})\\ \vdots\\ \sum_{j=1}^k\beta_j(z_{kj} + z_{jk})\end{pmatrix}$$
If Z is symmetric, $\frac{\partial(\beta'Z\beta)}{\partial\beta} = \begin{pmatrix}2\sum_{j=1}^k z_{1j}\beta_j\\ \vdots\\ 2\sum_{j=1}^k z_{kj}\beta_j\end{pmatrix} = 2Z\beta$
$$\frac{\partial S(\beta)}{\partial\beta}\bigg|_{\hat\beta} = -2X'y + 2X'X\hat\beta = 0$$
$H_0: \beta_1 = 5$ against $H_A: \beta_1\neq 5$
– If we assume $\sigma^2$ is known, the test statistic we use is
$$z = \frac{\hat\beta_1 - 5}{\text{Stdev}(\hat\beta_1)}\sim N(0,1) \text{ under } H_0, \qquad \text{Stdev}(\hat\beta_1) = \sigma\Big/\sqrt{\textstyle\sum_{i=1}^n (x_i - \bar x)^2}$$
– We should reject the hypothesis $\beta_1 = 5$ if 5 does not lie in the corresponding confidence interval.
– When $\sigma^2$ has to be estimated, the confidence interval typically is wider than $\left[\hat\beta_1 - z_{\alpha/2}\,\text{Stdev}(\hat\beta_1),\ \hat\beta_1 + z_{\alpha/2}\,\text{Stdev}(\hat\beta_1)\right]$, and recognizes the imprecision associated with the estimation of $\sigma^2$.
– A well known test for such joint linear restrictions is the F-test, which compares the fit of the unrestricted and restricted models. Under $H_0$:
$$F = \frac{(RRSS - URSS)/\#\text{restrictions}}{URSS/\text{df of unrestricted model}}\sim F_{\#\text{restrictions},\ \text{df of unrestricted model}}$$
When we minimize the residual sum of squares subject to restrictions, we typically incur a loss $(RRSS - URSS)$.
This test determines whether this loss is significant, in which case we would reject the validity of these restrictions!
If our model is $y_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i$, then to obtain RRSS under the restrictions $\beta_2 = 5$ and $\beta_3 = 1$ we impose them and regress $y_i - 5x_{i2} - x_{i3} = \beta_1 + \varepsilon_i$. Its RSS is called RRSS; #restrictions = 2; df of unrestricted model = $n - 3$.
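An illustrative Python sketch of this restricted-versus-unrestricted F-test for the restrictions $\beta_2 = 5$, $\beta_3 = 1$; the simulated data-generating process (which satisfies the null) is purely hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 100
x2, x3 = rng.normal(size=n), rng.normal(size=n)
y = 3.0 + 5.0 * x2 + 1.0 * x3 + rng.normal(size=n)     # DGP satisfies the restrictions

def rss(y, X):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return ((y - X @ b) ** 2).sum()

X_unres = np.column_stack([np.ones(n), x2, x3])
urss = rss(y, X_unres)                                  # unrestricted RSS
rrss = rss(y - 5 * x2 - x3, np.ones((n, 1)))            # restricted: regress y - 5*x2 - x3 on a constant

q, df = 2, n - 3                                        # number of restrictions, unrestricted df
F = ((rrss - urss) / q) / (urss / df)
p_value = 1 - stats.f.cdf(F, q, df)
print(F, p_value)
```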
Statistical Inference
The use of the t and F tests as specified above critically depends on the GM assumptions and normality of the errors.
– The t-test, e.g. for $H_0: \beta_1 = 5$, makes use of $SE(\hat\beta_1)$, which is obtained using the formula derived under our GM assumptions.
– Violation of GM invalidates our usual test statistics.
If we do not want to assume normality of the errors, we will want to rely on a suitable CLT:
$$\hat\beta_1\stackrel{a}{\sim} N\left(\beta_1,\ \sigma^2\Big/\textstyle\sum_{i=1}^n (x_i - \bar x)^2\right)$$
$$z = \frac{\hat\beta_1 - 5}{SE(\hat\beta_1)}\stackrel{a}{\sim} N(0,1) \text{ under } H_0 \text{ (asymptotic t-test)}$$
– For joint linear restrictions (discussed in our econometrics courses), we will use an asymptotic $\chi^2$ test, with degrees of freedom given by the number of restrictions.
Autocorrelation – commonplace in time series data. In the presence of autocorrelation, A.4 fails. (Suggested background reading: Wooldridge, Chapter 12: 12.1-12.3 and 12.5)
– Assuming that all other GM assumptions are still satisfied: OLS still has good properties (consistent).
In time series data, we typically prefer to use $E(\varepsilon_t|x_t) = 0$ (weak exogeneity), which will not allow us to obtain unbiasedness!
– Nevertheless, the usual standard errors of OLS are incorrect in the presence of autocorrelation and/or heteroskedasticity – t tests and confidence intervals are invalid.
– We will need to use robust SE's (HAC) to make them valid!
Important: NO need to be explicit about the form of heteroskedasticity and/or autocorrelation.
– The OLS estimator is no longer efficient. There might be a better estimator!
Important: To regain efficiency we DO need to specify the form of autocorrelation (e.g., stationary AR(1)).
– Autocorrelation in the presence of lagged dependent variables will cause a violation of A.3!!
– Common tests for autocorrelation are the Breusch-Godfrey (LM) test and the Durbin-Watson test.
– Specific (weakly dependent) autocorrelation patterns:
Autoregressive process of order p: AR(p)
(For AR processes to be weakly dependent we cannot have unit roots (persistence, strong dependence) or explosive roots)
Moving average error process of order q: MA(q), or
ARMA(p,q), which is a combination of an AR(p) and an MA(q)
– Endogeneity can arise for a number of reasons: omitted relevant variables, measurement errors in the regressors, lagged dependent variables in the presence of autocorrelation in the error term, and simultaneity.
– This is a serious violation (of A3), as it renders the OLS estimator biased and inconsistent.
– Intuitively: OLS imposes sample conditions on the residuals, e.g. $\sum_{i=1}^n\hat\varepsilon_i x_i = 0$, that are unreasonable if $E(\varepsilon_i x_i)\neq 0$.
– Solution: look for an instrumental variable, $z_i$, that is valid, $E(\varepsilon_i z_i) = 0$, and relevant, $Cov(x_i, z_i)\neq 0$.
The IV estimator, which can also be seen as a method of moments estimator, has desirable large sample properties (e.g., consistent and asymptotically normal).
The IV estimator can be computed using 2SLS.
– Identification
If we have exactly as many instruments as we need (exact identification), IV and 2SLS are identical.
If we have more instruments than we need (over identification), then 2SLS provides a way to use the optimal instrument.
We cannot estimate the parameters if we have fewer instruments than we need (under identification); our parameters are not identified.
– To establish the consistency of our estimators we relied on the convergence in probability concept:
$$\text{plim}\ \hat\theta_n = \theta \quad\text{or}\quad \hat\theta_n\stackrel{p}{\to}\theta$$
– To enable us to conduct hypothesis testing, convergence in distribution is relevant if we cannot establish the exact sampling distributions of our estimators:
$$z_n = \sqrt n\left(\hat\theta_n - \theta\right)\stackrel{d}{\to} f(z)$$
Let $\{x_n, n = 1, 2,\dots\}$ be a sequence of r.v.'s and x another r.v. defined on a common probability space (x can be a constant).
Definition (Convergence in Probability): We say that $x_n\stackrel{p}{\to}x$ if $\forall\varepsilon > 0$, $\lim_{n\to\infty}\Pr\left(|x_n - x| > \varepsilon\right) = 0$.
Definition (Weak consistency): Suppose that we have an unknown parameter $\theta$, and based on a sample of n observations we estimate it by $\hat\theta_n$. Then $\hat\theta_n$ is weakly consistent if $\hat\theta_n - \theta\stackrel{p}{\to}0$.
Definition (Almost Sure Convergence): We say that $x_n\stackrel{a.s.}{\to}x$ if
$$\Pr\left\{\lim_{n\to\infty}|x_n - x| = 0\right\} = 1.$$
Definition (Strong consistency): $\hat\theta_n$ is strongly consistent if $\hat\theta_n\stackrel{a.s.}{\to}\theta$.
Intuitively, the distribution of $x_n$ collapses to a spike at plim $x_n$.
– With $z = x_n - x$, the result follows as $E|x_n - x|^2\to 0$.
Let us denote by $F_X(x) = \Pr(X\le x)$ the probability distribution function (DF) of X, and by $\phi_X(t) = E[e^{it'X}]$ its characteristic function (CF).
Definition: We say that $X_n$ converges to X in distribution ($X_n\stackrel{d}{\to}X$) if $\lim_{n\to\infty}F_{X_n}(x) = F_X(x)$ at every continuity point of $F_X(\cdot)$.
The following theorem indicates why characteristic functions are useful in proofs of central limit theorems.
Theorem:
$$X_n\stackrel{d}{\to}X \iff \phi_{X_n}(t)\to\phi_X(t)\ \ \forall t$$
Two fundamental theoretical results are the foundations for applying these convergence ideas to sampling distributions of estimators.
Instead of reviewing these here, I will provide important results we often make use of in our Econometrics courses:
– Slutsky Theorem
– Continuous Mapping Theorem
– Cramer Convergence Theorem
– Delta Method
– Stochastic Order of Magnitude
If plim $W_n = \Omega$ (nonsingular), then plim$(W_n^{-1}) = \Omega^{-1}$ (matrix inverse rule).
9.3 Continuous Mapping Theorem
Theorem (Continuous Mapping Theorem): Let $X_n$ and X be $k\times 1$ vectors. Let g be a continuous function on the domain of X. Then,
$$X_n\stackrel{d}{\to}X \implies g(X_n)\stackrel{d}{\to}g(X).$$
If we know how to obtain the distribution of the random variable g(X) (Chapter 4), then we can use this result to describe the limiting distribution of $g(X_n)$.
– Examples
$X_n\stackrel{d}{\to}X\sim N(0,1) \implies X_n^2\stackrel{d}{\to}\chi^2(1)$
$X_n\stackrel{d}{\to}X\sim N(0,I_k) \implies X_n'X_n\stackrel{d}{\to}\chi^2_k$
$\sqrt n\left(\bar X - \mu\right)\stackrel{d}{\to}N(0,\sigma^2) \implies \frac{n(\bar X - \mu)^2}{\sigma^2}\stackrel{d}{\to}\chi^2(1)$
$\sqrt n\left(\bar X - \mu\right)\stackrel{d}{\to}N(0,\Sigma) \implies n\left(\bar X - \mu\right)'\Sigma^{-1}\left(\bar X - \mu\right)\stackrel{d}{\to}\chi^2(k)$
A convenient device used to prove joint convergence results is given by a related theorem:
If $X_n$ has a limiting distribution and plim$(X_n - Y_n) = 0$, then $Y_n$ has the same limiting distribution as $X_n$.
9.5 Delta Method
The Delta Method provides a convenient application of the Continuous Mapping Theorem.
– In a multivariate setting, where $\sqrt n(x_n - x)\stackrel{d}{\to}N(0,\Sigma)$, we obtain
$$\sqrt n\left(f(x_n) - f(x)\right)\stackrel{d}{\to}N\left(0,\ \frac{\partial f(x)}{\partial x'}\,\Sigma\,\frac{\partial f(x)}{\partial x}\right).$$
To prevent degeneracy we require that $\frac{\partial f(x)}{\partial x'}$ has full rank.
– The result is particularly useful when we require SE's of functions of parameters.
Example:
– If $\begin{pmatrix}\bar x\\ \bar y\end{pmatrix}\stackrel{p}{\to}\begin{pmatrix}\mu_x\\ \mu_y\end{pmatrix}$ and $\sqrt n\left(\begin{pmatrix}\bar x\\ \bar y\end{pmatrix} - \begin{pmatrix}\mu_x\\ \mu_y\end{pmatrix}\right)\stackrel{d}{\to}N(0,\Sigma)$, then
– $\frac{\bar x}{\bar y}\stackrel{p}{\to}\frac{\mu_x}{\mu_y}$
Clearly, by Slutsky: $\text{plim}\ \frac{\bar x}{\bar y} = \frac{\text{plim}\ \bar x}{\text{plim}\ \bar y} = \frac{\mu_x}{\mu_y}$
– $\sqrt n\left(\frac{\bar x}{\bar y} - \frac{\mu_x}{\mu_y}\right)\stackrel{d}{\to}N\left(0,\ \begin{pmatrix}\frac{1}{\mu_y} & -\frac{\mu_x}{\mu_y^2}\end{pmatrix}\Sigma\begin{pmatrix}\frac{1}{\mu_y}\\ -\frac{\mu_x}{\mu_y^2}\end{pmatrix}\right)$
This uses $\frac{\partial(x/y)}{\partial x} = \frac1y$ and $\frac{\partial(x/y)}{\partial y} = -\frac{x}{y^2}$, where $y\neq 0$.
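An illustrative Python sketch checking the delta-method variance for the ratio $\bar x/\bar y$ by Monte Carlo; the bivariate normal data-generating process (means 2 and 4, the covariance matrix, and the sample sizes) is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([2.0, 4.0])                       # arbitrary means (mu_y != 0)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])      # arbitrary covariance of one observation
n, reps = 500, 2000

# Monte Carlo distribution of sqrt(n) * (xbar/ybar - mu_x/mu_y)
draws = rng.multivariate_normal(mu, Sigma, size=(reps, n))
ratios = draws[:, :, 0].mean(axis=1) / draws[:, :, 1].mean(axis=1)
mc_var = n * ratios.var()

# Delta-method variance using the gradient (1/mu_y, -mu_x/mu_y^2)
grad = np.array([1 / mu[1], -mu[0] / mu[1] ** 2])
delta_var = grad @ Sigma @ grad

print("Monte Carlo variance:", round(mc_var, 4))
print("Delta method variance:", round(delta_var, 4))
```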
9.6 Order of a sequence and Stochastic Order of Magnitude
We would like to define the rate at which a sequence converges or diverges in terms of the order of the sequence:
– Order $n^\delta$
A sequence $c_n$ is of order $n^\delta$, denoted $O(n^\delta)$, if and only if $\lim_{n\to\infty} n^{-\delta}c_n$ is a finite nonzero constant.
– Order less than $n^\delta$
A sequence $c_n$ is of order less than $n^\delta$, denoted $o(n^\delta)$, if and only if $\lim_{n\to\infty} n^{-\delta}c_n$ equals zero.
Example: Consider the variance of the mean, $\sigma^2/n := \sigma_n^2$. It converges to zero as long as $\sigma^2$ is a finite constant.
– Here $\sigma_n^2 = O(n^{-1})$
The above notation deals with the convergence of sequences of ordinary numbers or sequences of random variables. In the latter setting we typically use $O_p$ and $o_p$ notation instead:
$Y_n\stackrel{p}{\to}0 \iff Y_n = o_p(1)$
$X_n\stackrel{p}{\to}X \iff X_n - X = o_p(1)$
If $X_n\stackrel{d}{\to}X$, then $X_n = O_p(1)$.