
LECTURE NOTES

EC400: Probability and Statistical Inference


Marcia Schafgans
September 2021

Contents
1 Probability theory
1.1 Basic algebra of sets and Venn Diagrams
1.2 The Probability Function - Review of well known properties
1.3 Conditional probability and Bayes' formula
1.4 Independence
1.5 Combinatorics
1.6 Exercises

2 Random variables
2.1 Probability distribution function
2.2 Cumulative distribution function
2.2.1 Examples
2.3 Survival and hazard function
2.4 Expectations of a random variable
2.4.1 Mean and variance
2.4.2 Higher order moments and existence of moments
2.4.3 Moment generating function and characteristic function
2.5 Percentiles and mode
2.6 Exercises

3 Joint, Marginal and conditional distributions
3.1 Joint distribution
3.2 Marginal distribution
3.3 Conditional distribution
3.4 Independence of random variables
3.5 Expectations in a joint distribution
3.5.1 Covariance and correlation
3.5.2 Covariance matrix
3.5.3 Mean and variance of sums of random variables
3.5.4 Conditional mean and variance
3.5.5 Conditional vs unconditional moments (Law of iterated expectations)
3.5.6 Independence vs Conditional mean independence vs Uncorrelatedness
3.6 Exercises

4 Some special distributions
4.1 Normal distribution
4.2 Chi-squared, t and F distributions
4.3 Bernoulli, binomial and Poisson distributions
4.4 Some other distributions (not discussed in 2018)
4.5 Exercises

5 Distributions of functions of random variables
5.1 The distribution of a function of a random variable
5.2 The distribution of a function of bivariate random variables
5.3 The distribution of a sum of random variables
5.4 Exercises
5.5 The limiting distribution of a sum of independent random variables (CLT)

6 Estimation and Inference
6.1 Samples and Random sampling
6.2 Statistics as Estimators – Sampling distributions
6.2.1 Sampling distribution of sample mean and variance
6.3 Finite sample criteria of estimators
6.3.1 Unbiasedness, Efficiency
6.4 Asymptotic properties of estimators
6.4.1 Consistency (LLN)
6.4.2 Asymptotic normality (CLT)
6.5 Methods of estimation
6.5.1 Minimum distance estimator
6.5.2 Maximum likelihood estimator
6.5.3 Method of moments estimator
6.6 Interval estimation

7 Hypothesis testing
7.1 Classical testing procedure
7.1.1 Types of errors
7.1.2 Significance level and power of a test
7.2 Test of the mean (variance known)
7.2.1 The z-statistic
7.2.2 The p-value
7.2.3 Power of the test
7.3 Test of the mean (variance unknown)
7.3.1 The t-statistic
7.4 Test of the variance
7.5 Hypothesis testing and confidence intervals

8 The classical linear regression model (2018: mostly self-study)
8.1 Multiple linear regression model
8.1.1 Multiple Linear Regression Model – Matrix Notation (Non-examinable)
8.2 Gauss-Markov assumptions
8.3 Estimation
8.3.1 Minimum distance: ordinary least squares
8.3.2 Method of moments
8.3.3 Maximum likelihood
8.4 Properties of OLS estimators in CLM
8.4.1 Unbiased, efficient (BLUE), consistent
8.5 Derivation of the OLS estimator using matrix notation (Non-examinable)
8.6 Statistical Inference in CLM under normality
8.6.1 The t-test
8.6.2 The F-test
8.7 Gauss-Markov violations - brief summary

9 Large-Sample Distribution Theory (Non-examinable)
9.1 Modes of Convergence
9.2 Slutsky's Theorem
9.3 Continuous Mapping Theorem
9.4 Cramer's Convergence Theorem
9.5 Delta Method
9.6 Order of a sequence and Stochastic Order of Magnitude

1 Probability theory
Read LM Chapter 2.
The starting point for studying probability is the definition of four key terms: experiment, sample
outcome, sample space, and event.

– The latter three are carry-overs from classical set theory, and give us a familiar mathematical
framework within which to work.
– The former is what provides the conceptual mechanism for casting real-world phenomena
into probabilistic terms.

Experiment: any procedure that (1) can be repeated, theoretically, an infinite number of times
and (2) has a well-defined set of possible outcomes.

– Example: rolling a pair of dice


– Obtaining a sample of n observations from a population

Sample space Ω – an arbitrary non-empty set (all possible outcomes).

Sample outcome ω – a particular draw from the sample space (relates to the notion of a random variable).
Event – any designated collection of sample outcomes, including individual outcomes, the entire
sample space, and the null set.

Let us consider a particular example (for other examples see LM, Ch 2):

– Experiment: a procedure that has a well-defined set of outcomes.

Example: Winner of Wimbledon 2018.
– Sample outcome: outcome of a random experiment.
Example: Andy Murray wins
– Sample space: Ω, the collection of all sample outcomes.
Ω = {Nadal, Djokovic, Federer, Murray, ...}
– Event: a collection of sample outcomes, or a subset of the sample space.
Event E: A Spanish player wins (E = {Nadal, Verdasco, Ferrer, Almagro, Ferrero, Lopez, ...})
Event F: A seeded Spanish player wins (F = {Nadal, Verdasco, Ferrer})
Event G: Federer wins (G = {Federer})
Events E and G are mutually exclusive.
F is a subset of E, F ⊂ E

Associated with events defined on a sample space are several operations collectively referred to
as the algebra of sets. These are rules that govern the ways in which one event can be combined
with another.
Algebra of Sets:

– Let A and B be any two events defined over the sample space Ω.
The intersection of A and B, written A ∩ B, is the event whose outcomes belong to both
A and B. If A ∩ B = ∅ then A and B are mutually exclusive.
The union of A and B, written A ∪ B, is the event whose outcomes belong to either A
or B or both.
The complement of A, written A^C, is the event consisting of all outcomes in Ω other
than those contained in A.
– The notions of unions and intersections can easily be extended to more than two events.

Next we will want to assign a probability to an experiment's outcome - or, more generally, to an
event.

– Let A be any event defined over the sample space Ω.

– P(A) will denote the probability of A, and P is referred to as the probability function.
P(A) ∈ [0, 1].

Kolmogorov showed that the following axioms are necessary and sufficient for characterizing the
probability function P:

1. Let A be any event defined over Ω. Then P(A) ≥ 0.

2. P(Ω) = 1
3. Let A and B be any two mutually exclusive events defined over Ω; then P(A ∪ B) = P(A) + P(B)
4. When Ω has an infinite number of members: Let A1, A2, ... be events defined over Ω. If
Ai ∩ Aj = ∅ for each i ≠ j, then P(∪_{i=1}^∞ Ai) = ∑_{i=1}^∞ P(Ai).

Objective: We want to describe our experiment by means of a probability space, a triple (Ω, F, P)
consisting of

– The sample space Ω

– The σ-algebra F (also called σ-field) – a set of subsets of Ω, called events (each event is
a set containing zero or more outcomes), such that
F contains the empty set: ∅ ∈ F
F is closed under complements: If A ∈ F, then also A^C = Ω\A ∈ F
F is closed under countable unions: If Ai ∈ F for i = 1, 2, ..., then also (∪_i Ai) ∈ F
A σ-algebra is a type of algebra
– The probability measure P: F → [0, 1] – a function on F that assigns probabilities to the
events

Given the outcome of the experiment, ω, all events in F that contain the selected outcome are
said to have occurred.
If this experiment were to be repeated an infinite number of times, the relative frequencies of
occurrence of each of the events would coincide with the probabilities prescribed by the function
P.
1.1 Basic algebra of sets and Venn Diagrams
The intersection of A and B, written A ∩ B, is the event whose outcomes belong to both A and
B. If A ∩ B = ∅ then A and B are mutually exclusive.
The union of A and B, written A ∪ B, is the event whose outcomes belong to either A or B or
both.
The complement of A, written A^C, is the event consisting of all outcomes in Ω other than those
contained in A.

Exhaustive events B1, B2, ..., Bn: If B1 ∪ B2 ∪ ... ∪ Bn = Ω, the entire sample space.

A is a subset of B, denoted A ⊆ B, if the occurrence of event A implies that event B has occurred.
Partition of event A: Events C1, C2, ..., Cn form a partition of event A if A = ∪_{i=1}^n Ci and the
Ci's are mutually exclusive.

[Venn diagrams: mutually exclusive events; B ⊂ A; events that are mutually exclusive and exhaustive; a partition of A]
Example: Tossing a six-faced "fair" die. The sample space is Ω = {1, 2, 3, 4, 5, 6}, each number being
an outcome.

Outcomes {1} and {2} are mutually exclusive

Outcomes 1 to 6 are exhaustive
The event of tossing an even number is {2, 4, 6}.
If A = {1, 2, 3} and B = {2, 4, 6} then A ∪ B = {1, 2, 3, 4, 6} and A ∩ B = {2}.
If A = {1, 2, 3}, then A^C = {4, 5, 6}.
If D = {2} and B = {2, 4, 6}, then D ⊂ B.
The events E = {2, 4} and F = {6} form a partition of the event {2, 4, 6} (see the code sketch below).

A few results:

The following results can be easily seen with the aid of Venn diagrams. Figure 1 illustrates the first
one.

1. (A ∪ B)^C = A^C ∩ B^C
2. (A ∩ B)^C = A^C ∪ B^C
3. For any events A, B1, B2, ..., Bn:

A ∩ (B1 ∪ B2 ∪ ... ∪ Bn) = (A ∩ B1) ∪ ... ∪ (A ∩ Bn)

A ∪ (B1 ∩ B2 ∩ ... ∩ Bn) = (A ∪ B1) ∩ ... ∩ (A ∪ Bn)

4. If B1, B2, ..., Bn are exhaustive events (∪_{i=1}^n Bi = Ω), then for any event A:

A ∩ (B1 ∪ B2 ∪ ... ∪ Bn) = A

5. For any event A, A ∪ A^C = Ω and A ∩ A^C = ∅.

6. If A ⊆ B then A ∪ B = B and A ∩ B = A.

[Figure 1: Venn diagram illustrating (A ∪ B)^C = A^C ∩ B^C]

Examples of a σ-algebra:
F = {∅, Ω} forms a σ-algebra
– F contains the empty set: ∅ ∈ F
– F is closed under complements: ∅^C = Ω ∈ F and Ω^C = ∅ ∈ F
– F is closed under countable unions: ∅ ∪ Ω = Ω ∈ F
– F is closed under countable intersections: ∅ ∩ Ω = ∅ ∈ F
With Ω = {a, b, c, d}, F = {∅, {a, b}, {c, d}, {a, b, c, d}} forms a σ-algebra
– F contains the empty set: ∅ ∈ F
– F is closed under complements: e.g., {a, b}^C = {c, d} ∈ F
– F is closed under countable unions: e.g., {a, b} ∪ {c, d} = Ω ∈ F
– F is closed under countable intersections: e.g., {a, b} ∩ {c, d} = ∅ ∈ F

1.2 The Probability Function - Review of well known properties

If A is any event defined on a sample space Ω, then P(A) will denote the probability of A, and P
is referred to as the probability function.
It is a mapping from a set (i.e., an event) to a number.

Example: Tossing a "fair" die.

Each of the six faces has the same chance of 1/6.
Probability function P(j) = 1/6 for j = 1, 2, 3, 4, 5, 6.
The event "an even number is tossed" is A = {2, 4, 6}, and has probability

P(A) = 1/6 + 1/6 + 1/6 = 1/2

A few well known properties:

1. P(∅) = 0
2. If A ⊆ B then P(A) ≤ P(B)
3. For any events A and B,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

(since P(A) + P(B) counts P(A ∩ B) twice, see Figure 4).

[Figure 4: Venn diagram of two overlapping events A and B]

4. For any event A, P(A^C) = 1 − P(A)
5. For any events A and B:
P(A) = P(A ∩ B) + P(A ∩ B^C)
6. For exhaustive, mutually exclusive events B1, B2, ..., Bn, and for any event A:
P(A) = P(A ∩ B1) + P(A ∩ B2) + ... + P(A ∩ Bn) = ∑_{i=1}^n P(A ∩ Bi)

Exercise: A household survey of electric appliances found that 75% of houses have radios (R),
65% have irons (I), 55% have electric toasters (T), 50% have both a radio and an iron (R ∩ I),
40% have R ∩ T, 30% have I ∩ T, and 20% have all three. Find the probability that a household
has at least one of these appliances.

– Solution:
P(R ∪ I ∪ T) = P(R) + P(I) + P(T) − P(R ∩ I) − P(R ∩ T) − P(I ∩ T) + P(R ∩ I ∩ T)
= 0.75 + 0.65 + 0.55 − 0.5 − 0.4 − 0.3 + 0.2 = 0.95
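A quick numerical check of this inclusion-exclusion calculation (a sketch, not part of the original notes):

# Inclusion-exclusion for three events, using the survey figures above.
P_R, P_I, P_T = 0.75, 0.65, 0.55
P_RI, P_RT, P_IT = 0.50, 0.40, 0.30
P_RIT = 0.20

P_union = P_R + P_I + P_T - P_RI - P_RT - P_IT + P_RIT
print(round(P_union, 2))   # 0.95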

Exercise: Let P(A ∩ B) = 0.2, P(A) = 0.6, and P(B) = 0.5. Find P(A^C ∪ B^C).

– Solution:
P(A^C ∪ B^C) = P((A ∩ B)^C)
= 1 − P(A ∩ B)
= 0.8

1.3 Conditional probability and Bayes' formula

Sometimes we may already know that a certain event A has taken place, and that occurrence may
have a bearing on the probability we are trying to find.
Conditional probability of event B given event A: If P(A) > 0,

P(B|A) = P(B ∩ A) / P(A)

[Figure 5: Venn diagram of overlapping events A and B]

and by rewriting we obtain
P(B ∩ A) = P(A)·P(B|A) (= P(B)·P(A|B))

Exercise: Suppose you and I are investors and part of our capital is invested in Thailand. I know
that you've just received some information about Thailand, but I do not know what it is. It may be
B, "things are good", or B^C, "things are not good". Let the event A be "you take the money out
of Thailand", and A^C "you keep the money in Thailand". Say that I know the probability of the
event A given B and given B^C, as well as the probability of B. I see that you chose A. What is
the probability of B given A?

– Solution: Using the conditional probability formula, we can write:

P(B|A) = P(B ∩ A) / P(A)
= P(B)·P(A|B) / P(A)
= P(B)·P(A|B) / [P(B)·P(A|B) + P(B^C)·P(A|B^C)]

Theorem (Bayes' Theorem, simple form): For any events A and B with P(A) > 0,

P(B|A) = P(B)·P(A|B) / P(A)

Theorem (Bayes' Theorem, general form): If B1, B2, ..., Bn form a partition of the sample space Ω,
then

P(Bj|A) = P(A|Bj)·P(Bj) / ∑_{i=1}^n P(A|Bi)·P(Bi)   for each j = 1, 2, ..., n

If we know P(A|Bj) for all j, the theorem allows us to compute P(Bj|A).
Bayesian analysis: the P(Bj) are referred to as prior probabilities, and the P(Bj|A) as posterior proba-
bilities.
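The general form of Bayes' theorem is easy to code. The sketch below (not part of the original notes) uses illustrative priors and likelihoods loosely inspired by the Thailand example above; the numbers 0.7, 0.3, 0.1 and 0.8 are assumptions chosen only to show the mechanics.

# Posterior probabilities P(B_j | A) for a partition B_1, ..., B_n.
def posterior(priors, likelihoods):
    """priors: P(B_j); likelihoods: P(A | B_j). Returns P(B_j | A)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)                    # P(A) by the law of total probability
    return [j / total for j in joint]

# Illustrative numbers: prior P(B) = 0.7, P(B^C) = 0.3; P(A|B) = 0.1, P(A|B^C) = 0.8.
print(posterior([0.7, 0.3], [0.1, 0.8]))  # [~0.226, ~0.774]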

1.4 Independence
The independence of (non-empty) events A and B is equivalent to

P(A|B) = P(A) or P(B|A) = P(B)

– The probability of a given event A remains the same regardless of the outcome of a second
event B.

A and B are independent events if P(A ∩ B) = P(A)·P(B).

– This result follows directly from Bayes' Theorem:

P(B|A) = P(B)·P(A|B) / P(A) = P(B)·P(A) / P(A) = P(B)

Mutually independent events A1, A2, ..., An satisfy:

P(A1 ∩ A2 ∩ ... ∩ An) = P(A1)·P(A2)···P(An) = ∏_{i=1}^n P(Ai)
Example: Die-tossing experiment with sample space Ω = {1, 2, 3, 4, 5, 6}. Consider the following
events:
A = {1, 2, 3} "the number tossed is ≤ 3"
B = {2, 4, 6} "the number tossed is even"
C = {1, 2} "the number tossed is a 1 or a 2"
D = {1, 6} "the number tossed doesn't start with the letters 'f' or 't'".

The conditional probability of A given B is

P(A|B) = P({1, 2, 3} ∩ {2, 4, 6}) / P({2, 4, 6})
= P({2}) / P({2, 4, 6})
= (1/6) / (1/2) = 1/3

Events A and B are not independent, since:

1/6 = P(A ∩ B) ≠ P(A)·P(B) = (1/2)·(1/2) = 1/4
(or alternatively, events A and B are not independent since P(A|B) ≠ P(A))
P(A|C) = 1 ≠ 1/2 = P(A), so that A and C are not independent
P(B|C) = 1/2 = P(B), so B and C are independent.
(alternatively, P(B ∩ C) = P({2}) = 1/6 = (1/2)·(1/3) = P(B)·P(C))
A and B are both independent of D.

A few results:

1. If P(A1 ∩ A2 ∩ ... ∩ A_{n−1}) > 0, then

P(A1 ∩ A2 ∩ ... ∩ An) = P(A1)·P(A2|A1)·P(A3|A1 ∩ A2) ··· P(An|A1 ∩ A2 ∩ ... ∩ A_{n−1})

2. P(A^C|B) = 1 − P(A|B)
3. If A ⊆ B then P(A|B) = P(A ∩ B)/P(B) = P(A)/P(B), and P(B|A) = 1

4. If A and B are independent events, then A^C and B are independent events, A and B^C are
independent events, and A^C and B^C are independent events.

1.5 Combinatorics
Counting ordered sequences: the multiplication rule. If operation A can be performed in m dif-
ferent ways and operation B in n different ways, the sequence (operation A, operation B) can be
performed in m·n different ways.

– Rolling a die twice yields 6 × 6 possible outcomes.
– If an operation Ai, i = 1, ..., k, can be performed in ni ways, i = 1, 2, ..., k respectively, then
the ordered sequence (operation A1, operation A2, ..., operation Ak) can be performed in
n1·n2···nk ways.

Counting permutations (when the objects are all distinct): The number of permutations of length
k that can be formed from a set of n distinct elements, repetitions not allowed, is denoted by nPk:

nPk = n(n − 1)···(n − k + 1) = n!/(n − k)!

– The number of ways to permute an entire set of n distinct objects is nPn = n!

Counting permutations (when the objects are not all distinct): The number of ways to arrange n
objects, n1 being of one kind, n2 of a second kind, ..., and nr of an r-th kind is

n!/(n1! n2! ··· nr!),   where ∑_{i=1}^r ni = n

Counting combinations: The number of ways to form combinations of size k from a set of n
distinct objects, repetitions not allowed, is denoted by the symbols (n choose k) or nCk, where

(n choose k) = nCk = n!/(k!(n − k)!)

– The (n choose k), k = 0, ..., n, are commonly referred to as binomial coefficients
– Pascal's triangle allows us to easily obtain the binomial coefficients (see p. 110 in Larsen and
Marx)

LM Chapter 2.7: provides many combinatorial probability problems.
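The counting formulas above are available in Python's standard library (3.8+). A minimal sketch, not part of the original notes; the word "statistics" is just an illustrative example of objects that are not all distinct.

import math

n, k = 5, 3
print(math.perm(n, k))      # nPk = n!/(n-k)! = 60
print(math.comb(n, k))      # nCk = n!/(k!(n-k)!) = 10
print(math.factorial(n))    # number of ways to permute all n distinct objects = 120

# Arrangements of n objects that are not all distinct: n!/(n1! n2! ... nr!).
# E.g. the letters of "statistics": 3 s's, 3 t's, 1 a, 2 i's, 1 c (n = 10).
counts = [3, 3, 1, 2, 1]
print(math.factorial(sum(counts)) // math.prod(math.factorial(c) for c in counts))  # 50400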

1.6 Exercises
Exercise: If P(A) = 1/6 and P(B) = 5/12, and P(A|B) + P(B|A) = 7/10, find P(A ∩ B).

– Solution:
P(B|A) = P(A ∩ B)/P(A) = 6·P(A ∩ B)
P(A|B) = P(A ∩ B)/P(B) = (12/5)·P(A ∩ B)
→ (6 + 12/5)·P(A ∩ B) = 7/10
→ P(A ∩ B) = 1/12

Exercise: Three dice have probabilities p, q, r, respectively, of throwing a "6". One of the
dice is chosen at random and thrown (each is equally likely to be chosen). A "6" appeared. What
is the probability that the die chosen was the first one?

– Solution: The event "a 6 is thrown" is denoted by "6".

P(die 1 | "6") = P((die 1) ∩ ("6")) / P("6")
= P("6" | die 1)·P(die 1) / P("6")
= p·(1/3) / P("6")

P("6") = P("6" ∩ (die 1)) + P("6" ∩ (die 2)) + P("6" ∩ (die 3))
= P("6" | die 1)·P(die 1) + P("6" | die 2)·P(die 2) + P("6" | die 3)·P(die 3)
= p·(1/3) + q·(1/3) + r·(1/3)
= (p + q + r)/3

→ P(die 1 | "6") = p·(1/3) / [(p + q + r)·(1/3)] = p / (p + q + r)

Exercise: Identical twins come from the same egg and hence are of the same sex. Fraternal twins
have a 50-50 chance of being the same sex. Among twins, the probability of a fraternal set is p and
an identical set is q = 1 p. If the next set of twins are of the same sex, what is the probability
that they are identical?

– Solution: Let A be the event "the next set of twins are of the same sex", and let B be the
event "the next set of twins are identical". We are given:
P(A|B) = 1, P(A|B^C) = 0.5
P(B) = q, P(B^C) = p = 1 − q.
Then P(B|A) = P(A ∩ B) / P(A).
But P(A ∩ B) = P(A|B)·P(B) = q,
and P(A ∩ B^C) = P(A|B^C)·P(B^C) = 0.5p.
Thus, P(A) = P(A ∩ B) + P(A ∩ B^C)
= q + 0.5p = q + 0.5(1 − q)
= 0.5(1 + q),
and P(B|A) = q / (0.5(1 + q))

Exercise: Let events A and B be independent. Find the probability, in terms of P(A) and P(B),
that exactly one of the events A and B occurs.

– Solution: P(exactly one of A and B) = P((A ∩ B^C) ∪ (B ∩ A^C)). Since A ∩ B^C and B ∩ A^C
are mutually exclusive, it follows that

P(exactly one of A and B) = P(A ∩ B^C) + P(B ∩ A^C)

Since A and B are independent, it follows that A and B^C are also independent, as are B
and A^C.
Then P((A ∩ B^C) ∪ (B ∩ A^C))
= P(A)·P(B^C) + P(B)·P(A^C)
= P(A)·(1 − P(B)) + P(B)·(1 − P(A))
= P(A) + P(B) − 2·P(A)·P(B).

This result is easily seen with the aid of a Venn diagram.

For more examples and exercises see LM Chapter 2.

2 Random variables
See also Greene Appendix C.

In the previous chapter we introduced the sample space Ω, which may be quite tedious to describe
if the elements of Ω are not numbers.
With the help of random variables we formulate a rule, or a set of rules, by which the elements
ω of Ω may be represented by numbers x, or ordered pairs of numbers (x1, x2) or, more generally,
n-tuples of numbers (x1, x2, ..., xn).

– Example: The random experiment may be the toss of a coin and Ω = {H, T}. We may define
X such that X(ω) = 0 if ω = T and X(ω) = 1 if ω = H.
– A random variable X is a function that carries the probability from a sample space to a
space of real numbers:

Pr(X ∈ A) = P(C), where C = {ω : ω ∈ Ω and X(ω) ∈ A}

A random variable is a real-valued function whose domain is the sample space Ω

– We can compute it for any given sample.

– The value depends on the particular outcome we happen to observe
Random variables can be scalar (univariate) or vectors (multivariate)

– Let capital letters (X) denote the random variable and small letters (x) a particular realiza-
tion.
– We should distinguish discrete and continuous random variables

A discrete random variable can take on values from a finite or countably infinite sequence only.

Example: Suppose I toss a coin until the first head occurs. Related to this experiment we may
think of the following two examples of random variables:

– Random variable X
X = 1 if the first head occurs on an even-numbered toss
X = 0 if the first head occurs on an odd-numbered toss;
– Random variable Y
Y = n, where n is the number of the toss on which the first head occurs.
– Both X and Y are discrete random variables, where X can take on only the values 0 or 1,
and Y can take on any positive integer value.
– X and Y are based on the same sample space – the sample points are sequences of tail coin
flips ending with a head coin flip:
Ω = {H, TH, TTH, TTTH, TTTTH, ...}.
X(H) = 0 (a head on flip one, an odd-numbered flip),
X(TH) = 1,
X(TTH) = 0, ... and so on.
Y(H) = 1 (first head on flip 1),
Y(TH) = 2,
Y(TTH) = 3,
Y(TTTH) = 4, ... and so on.

A continuous random variable can assume numerical values from an interval of real numbers (e.g.,
the set of real numbers ℝ).

Simple examples are the weight and height of a person or household income.

2.1 Probability distribution function


A listing of the values x taken by a random variable X and their associated probabilities is a
probability function, f(x).
For a discrete random variable, the probability function equals f(x) = P(X = x).

– The probability function must satisfy

(i) 0 ≤ f(x) ≤ 1 for all x, and
(ii) ∑_x f(x) = 1.

– Given a set A of real numbers, P(X ∈ A) = ∑_{x ∈ A} f(x).

For a continuous random variable, the probability of a particular outcome is zero.

– The probability density function (pdf) assigns positive probabilities to intervals in
the range of X.

– The pdf is defined so that f(x) ≥ 0 and

1. P(X ∈ [a, b]) = P(a ≤ X ≤ b) is defined to be equal to ∫_a^b f(x) dx.

[Figure: a pdf f(x) with the area between a and b shaded]

2. ∫_{−∞}^{∞} f(x) dx = 1

– For a continuous random variable, P(X = a) = 0 (non-zero probabilities only exist over an
interval, not at a single point). Thus, for a continuous random variable X, P(a < X < b) =
P(a ≤ X < b) = P(a < X ≤ b) = P(a ≤ X ≤ b) (it is irrelevant whether or not the end-
points are included).

Example: Suppose that X has density function

f(x) = 2x for 0 < x < 1, and f(x) = 0 otherwise

Then f satisfies the requirements for a density function, since

∫_{−∞}^{∞} f(x) dx = ∫_0^1 2x dx = 1,

and, for example,

P(0.2 < X < 0.5) = ∫_{0.2}^{0.5} 2x dx = x² |_{0.2}^{0.5} = 0.21
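A quick numerical check of this example (a sketch, assuming SciPy is available):

from scipy.integrate import quad

f = lambda x: 2 * x
total, _ = quad(f, 0, 1)        # should integrate to 1
prob, _ = quad(f, 0.2, 0.5)     # P(0.2 < X < 0.5)
print(round(total, 6), round(prob, 6))   # 1.0  0.21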

A mixed (discrete and continuous) random variable has some points with non-zero probability
mass, and with a continuous p.d.f elsewhere.

– The sum of the probabilities at the discrete points of probability plus the integral of the
density function on the continuous region for X must be 1.

Example: X has probability of 0.5 at X = 0, and X is a continuous random variable on the interval
(0, 1) with density function f(x) = x for 0 < x < 1, and X has no density or probability elsewhere.
Note that:

P(X = 0) + ∫_0^1 f(x) dx = 0.5 + ∫_0^1 x dx = 0.5 + 0.5 = 1.

2.2 Cumulative distribution function

For any random variable X, the probability that X is less than or equal to a is denoted F(a).
F(x) = P(X ≤ x) denotes the cumulative distribution function of a random variable X.

– From the definition of the cdf, P(a < X ≤ b) = F(b) − F(a)

The cdf must satisfy the following properties

(i) 0 ≤ F(x) ≤ 1 for all x
(ii) lim_{x→−∞} F(x) = 0
(iii) lim_{x→∞} F(x) = 1
(iv) If x > y then F(x) ≥ F(y)

A discrete random variable with probability function f(x) has a cdf equal to F(x) = ∑_{w ≤ x} f(w), where
F(x) is a "step function" (it has a jump at each point with non-zero probability, while remaining
constant until the next jump).
A continuous random variable X with density function f(x) has a distribution function F(x) =
∫_{−∞}^{x} f(t) dt. F(x) is a continuous, differentiable, non-decreasing function such that dF(x)/dx =
F'(x) = f(x).
If X has a mixed distribution, then F(x) is continuous except at the points of non-zero probability
mass, where F(x) will have a jump.

Results and formulas for random variables

1. For a continuous random variable X, P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b) = P(a ≤ X ≤ b)
(it is irrelevant whether or not the endpoints are included).

For a continuous random variable, P(X = a) = 0 (non-zero probabilities only exist over an
interval, not at a single point).

2. If X has a mixed distribution, then P(X = t) will be non-zero for some value(s) of t, and P(a < X < b)
will not always be equal to P(a ≤ X ≤ b) (they will not be equal if X has a non-zero probability
mass at either a or b).
3. f(x) may be defined piecewise, meaning that f(x) is defined by a different algebraic formula on
different intervals.
4. A continuous random variable may have two or more different, but equivalent, pdfs; the
difference in the pdfs would only occur at a finite (or countably infinite) number of points. The
cdf of a random variable of any type is always unique to that random variable.

2.2.1 Examples
X = number turning up when tossing one fair die,
so X has probability function f_X(x) = P(X = x) = 1/6 for x = 1, 2, 3, 4, 5, 6. X is a discrete
random variable, with cdf

F_X(x) = P(X ≤ x) = 0 for x < 1; k/6 for k ≤ x < k + 1 (k = 1, 2, 3, 4, 5); and 1 for x ≥ 6.

Y is a continuous random variable on the interval (0, 1) with density function

f_Y(y) = 3y² for 0 < y < 1, and f_Y(y) = 0 elsewhere.

Then F_Y(y) = 0 for y < 0; y³ for 0 ≤ y < 1; and 1 for y ≥ 1.

Z has a mixed distribution on the interval [0, 1). Z has probability of 0.5 at Z = 0, and Z has
density function f_Z(z) = z for 0 < z < 1, and Z has no density or probability elsewhere.
Then F_Z(z) = 0 for z < 0; 0.5 at z = 0; 0.5 + z²/2 for 0 < z < 1; and 1 for z ≥ 1.

Exercise: A die is loaded in such a way that the probability of the face with j dots turning up is
proportional to j for j = 1, 2, 3, 4, 5, 6. What is the probability, in one roll of the die, that an even
number of dots will turn up?

– Solution: Let X denote the random variable representing the number of dots that appears
when the die is rolled once. Then P(X = k) = R·k for k = 1, 2, 3, 4, 5, 6, where R is the
proportionality constant. Since the sum of all the probabilities of points that can occur must
be 1, it follows that

R·(1 + 2 + 3 + 4 + 5 + 6) = 1, so that R = 1/21.

Then,

P(even number of dots turns up) = P(2) + P(4) + P(6) = (2 + 4 + 6)/21 = 12/21 = 4/7.
Exercise: An ordinary single die is tossed repeatedly until the first even number turns up. The
random variable X is defined to be the number of the toss on which the first even number turns
up. Find the probability that X is an even number.

– Solution: X is a discrete random variable that can take on an integer value of 1 or more.
The probability function for X is the probability of x − 1 successive odd tosses followed by
an even toss:

f(x) = P(X = x) = (1/2)^x

Then

P(X is even) = P(2) + P(4) + P(6) + ...
= (1/2)² + (1/2)⁴ + (1/2)⁶ + ...
= (1/2)² / (1 − (1/2)²) = 1/3.

Exercise: The continuous random variable X has density function f(x) = 3 − 48x² for −0.25 ≤
x ≤ 0.25 (and f(x) = 0 elsewhere). Find P(1/8 ≤ X ≤ 5/16).

– Solution:
P(0.125 ≤ X ≤ 0.3125) = P(0.125 ≤ X ≤ 0.25),
since there is no density for X at points greater than 0.25. The probability is

∫_{0.125}^{0.25} (3 − 48x²) dx = 5/32.

Exercise: Suppose that the continuous random variable X has the cumulative distribution func-
tion F(x) = 1/(1 + e^{−x}) for −∞ < x < ∞. Find X's density function.

– Solution: The density function for a continuous random variable is the first derivative of
the cumulative distribution function. The density function of X is

f(x) = F'(x) = e^{−x} / (1 + e^{−x})²
Exercise: X is a random variable for which P(X ≤ x) = 1 − e^{−x} for x ≥ 1, and P(X ≤ x) = 0
for x < 1. Which of the following statements is true?
A) P(X = 2) = 1 − e^{−2} and P(X = 1) = 1 − e^{−1}
B) P(X = 2) = 1 − e^{−2} and P(X ≤ 1) = 1 − e^{−1}
C) P(X = 2) = 1 − e^{−2} and P(X < 1) = 1 − e^{−1}
D) P(X < 2) = 1 − e^{−2} and P(X < 1) = 1 − e^{−1}
E) P(X < 2) = 1 − e^{−2} and P(X = 1) = 1 − e^{−1}

– Solution: Since P(X ≤ x) = 1 − e^{−x} for x ≥ 1, it follows that P(X ≤ 1) = 1 − e^{−1}. But
P(X ≤ x) = 0 if x < 1, and thus P(X < 1) = 0, so that P(X = 1) = 1 − e^{−1} (since
P(X ≤ 1) = P(X < 1) + P(X = 1)). This eliminates answers C and D. Since the distribution
function for X is continuous (and differentiable) for x > 1, it follows that P(X = x) = 0 for
x > 1. This eliminates answers A, B and C. The answer is E. This is an example of a random variable with a
mixed distribution (a point of probability at 1, and a continuous distribution for X > 1).

Exercise: A continuous random variable X has density function

f(x) = 2x for 0 < x < 1/2; (4 − 2x)/3 for 1/2 ≤ x < 2; 0 elsewhere.

Find P(0.25 < X ≤ 1.25).
– Solution:

P(0.25 < X ≤ 1.25) = ∫_{0.25}^{1.25} f(x) dx
= ∫_{0.25}^{0.5} 2x dx + ∫_{0.5}^{1.25} (4 − 2x)/3 dx
= 3/4

Note that since X is a continuous random variable, the probability P(0.25 ≤ X < 1.25) would
be the same as P(0.25 < X ≤ 1.25). This is an example of a density function defined piecewise.
More examples can be found in LM Chapter 3.3 and 3.4.

2.3 Survival and hazard function

While the pdf f(x) and cdf F(x) are common ways to formulate the distribution of a random
variable, for continuous random variables related functions may be of interest:
– Survival function:
S(x) = Pr(X ≥ x) = 1 − F(x)
– Hazard function:
h(x) = f(x) / (1 − F(x))
– Relation between them: h(x) = −d ln S(x)/dx

S(x) and h(x) are useful when studying the duration of a spell.

The hazard function is a conditional probability per unit of time:

h(x) = lim_{t↓0} P(x ≤ X ≤ x + t | X ≥ x) / t

– Using the definition of conditional probability,
P(x ≤ X ≤ x + t | X ≥ x) = P(x ≤ X ≤ x + t) / P(X ≥ x)
– By the definition of the cumulative distribution function (cdf),
P(x ≤ X ≤ x + t) / P(X ≥ x) = [F(x + t) − F(x)] / [1 − F(x)]
– Therefore:
h(x) = lim_{t↓0} [F(x + t) − F(x)]/t · 1/(1 − F(x)) = f(x) / (1 − F(x)),
as the limit defines the derivative of the cdf, which equals the pdf.
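As an illustration (not part of the original notes), the exponential duration model with rate lam has a constant hazard; the sketch below computes S(x) and h(x) directly from the formulas above, with lam = 0.5 as an arbitrary illustrative value.

import math

lam = 0.5
f = lambda x: lam * math.exp(-lam * x)   # pdf of an exponential duration
F = lambda x: 1 - math.exp(-lam * x)     # cdf
S = lambda x: 1 - F(x)                   # survival function
h = lambda x: f(x) / S(x)                # hazard function

for x in (0.1, 1.0, 5.0):
    print(round(S(x), 4), round(h(x), 4))   # the hazard stays at 0.5 for every x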

2.4 Expectations of a random variable
See also LM Chapter 3.5 and 3.6.

2.4.1 Mean and variance

The expected value of a random variable X is a "suitably weighted average" over the range
of values that X can take, or the "center" of the distribution.

– It is denoted by E[X], or μ_X or μ.
– It is also called the expectation of X, or the mean of X.

Definition: The mean or expected value of a random variable is

E[X] = ∑_x x·f(x) if X is discrete;  E[X] = ∫_{−∞}^{∞} x·f(x) dx if X is continuous

(in practice, the interval of integration is the interval of non-zero density for X).

The expected value of g(X), where g is some function, is

E[g(X)] = ∑_x g(x)·f(x) if X is discrete;  E[g(X)] = ∫_{−∞}^{∞} g(x)·f(x) dx if X is continuous

– Example (discrete): X is the result of one toss of a fair die, then

E[X] = 1·(1/6) + 2·(1/6) + ... + 6·(1/6) = 7/2.

– Example (continuous): If the pdf of Y is given by f(y) = 1 for y ∈ [1, 2] and f(y) = 0
otherwise,

E[Y] = ∫_1^2 y·1 dy = 1.5

For any constants a and b:

E(a + b·X) = a + b·E(X)

Importantly though, in general, E(g(X)) ≠ g(E(X)); for example, E(X²) ≠ (E(X))².

Theorem (Jensen's inequality): If g is a function and X is a random variable such that g''(x) ≥ 0 at
all points x with non-zero density or probability for X, then:

E[g(X)] ≥ g(E[X]) when g''(x) ≥ 0,

with strict inequality if g''(x) > 0.

The inequality reverses if g'' ≤ 0.

– Example: E(log(X)) ≤ log(E(X))
Graphically, the function below is convex (g 00 > 0). Therefore, the expected value of Y = g(X) is
bigger than g(E(X)).

– Later we will discuss how to obtain the distribution of Y = g(X) given that we know the
distribution of X:

The expected value provides us with important information about a random variable. But higher
order moments are also very relevant.

Example: Suppose you have two investment opportunities: A yields on average 4% a year, B has an
expected return of 5% (both in British Pounds). Is B better than A?
Answer : Maybe. An important issue is: how risky are those investments? If A is a bond issued by the
British Treasury and B was issued by a state bank in Argentina, B will yield more on average, but the
possibility that you will lose money is also substantially higher...

The variance of X measures the dispersion of X. It is denoted by Var[X], σ²_X or σ².

Definition: The variance of a random variable is

Var[X] = E[(X − E(X))²]
= ∑_x (x − μ)²·f(x) if X is discrete;  = ∫_{−∞}^{∞} (x − μ)²·f(x) dx if X is continuous

– The variance is always ≥ 0, and σ = √(Var[X]) is called the standard deviation.

– We can obtain the following simplifying result:

Var(X) = E(X²) − (E(X))²

Proof: Var[X] = E[(X − E(X))²] = E[X² − 2X·E(X) + (E(X))²]
= E(X²) − 2·E(X)·E(X) + (E(X))² = E(X²) − (E(X))²

If a and b are constants, then Var[aX + b] = a²·Var[X] (prove!)

How to calculate the variance (with μ = E(X)):

– Discrete case:
E[(X − μ)²] = ∑_x (x − μ)²·f(x)
E(X²) − μ² = ∑_x x²·f(x) − μ²

– Continuous case:
E[(X − μ)²] = ∫_{−∞}^{∞} (x − μ)²·f(x) dx
E(X²) − μ² = ∫_{−∞}^{∞} x²·f(x) dx − μ²

Exercise: Suppose E[X] = 2. Compute the variance in the following three cases:

(i) Pr(X = 2) = 1.
(ii) Pr(X = 1) = Pr(X = 3) = 1/2.
(iii) Pr(X = 0) = Pr(X = 4) = 1/2.

– (i) Var[X] = 0, σ_X = 0.
(ii) E[(X − E(X))²] = (1/2)·(1 − 2)² + (1/2)·(3 − 2)² = 1,
or E(X²) = 1²·(1/2) + 3²·(1/2) = 5 and Var(X) = E(X²) − (E(X))² = 5 − 2² = 1.
So σ_X = 1.
(iii) E[(X − E(X))²] = (1/2)·(0 − 2)² + (1/2)·(4 − 2)² = 4,
or E(X²) = 0²·(1/2) + 4²·(1/2) = 8 and Var(X) = E(X²) − (E(X))² = 8 − 2² = 4.
So σ_X = 2.
– The spread is highest in case (iii).
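The three calculations above are easy to reproduce numerically; a minimal sketch (not part of the original notes):

# Mean and variance of a discrete random variable from its probability function.
def mean_var(values, probs):
    mu = sum(v * p for v, p in zip(values, probs))
    var = sum((v - mu) ** 2 * p for v, p in zip(values, probs))
    return mu, var

print(mean_var([2], [1.0]))             # (2.0, 0.0)
print(mean_var([1, 3], [0.5, 0.5]))     # (2.0, 1.0)
print(mean_var([0, 4], [0.5, 0.5]))     # (2.0, 4.0)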

2.4.2 Higher order moments and existence of moments

Two other measures often used to describe a probability distribution are skewness and kurtosis.
The coefficient of skewness: If the mean of random variable X is μ and the variance is σ²,
then the coefficient of skewness is defined to be:

E[(X − μ)³] / σ³

It informs us about the asymmetry of a distribution.

– Under symmetry, E[(X − μ)³]/σ³ = 0.

– The skewness is positive if the "long tail" is in the positive direction.

– Mean > Median indicates a positively skewed distribution; Mean < Median indicates a
negatively skewed distribution.
The coefficient of excess kurtosis is defined to be:

E[(X − μ)⁴] / σ⁴ − 3

It informs us about the thickness of the tails of the distribution.

– Under normality E[(X − μ)⁴] = 3σ⁴, so that E[(X − μ)⁴]/σ⁴ − 3 = 0.

[Figure: two pdfs centred at μ, one with high kurtosis and one with low kurtosis]

Exercise: Suppose E[X] = 2. Compute the coefficient of excess kurtosis in the following three
cases:
(i) Pr(X = 2) = 1.
(ii) Pr(X = 1) = 1/4, Pr(X = 2) = 1/2, Pr(X = 3) = 1/4.
(iii) Pr(X = 0) = 1/16, Pr(X = 2) = 7/8, Pr(X = 4) = 1/16.

– (i) E[(X − μ)⁴] = 0. Since Var(X) = 0, the excess kurtosis coefficient is not determined.
(ii) E[(X − E(X))⁴] = (1/4)·(1 − 2)⁴ + (1/2)·(2 − 2)⁴ + (1/4)·(3 − 2)⁴ = 1/2
E[(X − μ)²] = (1/4)·(−1)² + (1/4)·(1)² = 1/2
So the excess kurtosis coefficient is (1/2)/(1/2)² − 3 = −1.
(iii) E[(X − E(X))⁴] = (1/16)·(0 − 2)⁴ + (7/8)·(2 − 2)⁴ + (1/16)·(4 − 2)⁴ = 2
E[(X − μ)²] = (1/16)·(−2)² + (1/16)·(2)² = 1/2
So the excess kurtosis coefficient is 2/(1/2)² − 3 = 5.
– Relative to the standard normal distribution, (iii) has fatter tails (positive coefficient), while
(ii) has thinner tails (negative coefficient).
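The same calculations can be checked numerically; a minimal sketch (not part of the original notes):

# Central moments and excess kurtosis of a discrete random variable.
def central_moment(values, probs, k):
    mu = sum(v * p for v, p in zip(values, probs))
    return sum((v - mu) ** k * p for v, p in zip(values, probs))

def excess_kurtosis(values, probs):
    var = central_moment(values, probs, 2)
    return central_moment(values, probs, 4) / var ** 2 - 3

print(excess_kurtosis([1, 2, 3], [0.25, 0.5, 0.25]))     # -1.0  (case ii)
print(excess_kurtosis([0, 2, 4], [1/16, 7/8, 1/16]))     # 5.0   (case iii)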

The r-th moment of X is E[X^r], with r ≥ 1 an integer.

The r-th central moment of X about the mean is E[(X − μ)^r].
So, the variance is the 2nd central moment of X about its mean, etc.

Not all moments of a random variable need to exist!

– An existence result: If the k-th moment of a random variable exists, all moments of order
less than k exist.
Example: The mean might not exist (it might be +∞ or −∞). Consider the continuous random variable
X with pdf:

f(x) = 1/x² for x ≥ 1, and f(x) = 0 otherwise.

It is a pdf, as:

∫_1^∞ (1/x²) dx = [−1/x]_1^∞ = 1

Its expected value is:

∫_1^∞ x·(1/x²) dx = ∫_1^∞ (1/x) dx = [log(x)]_1^∞ = +∞

Proof of the existence result (LM, page 201): Let f(y) be the pdf of a continuous random variable
Y. As E(Y^k) exists:

∫_{−∞}^{∞} |y|^k·f(y) dy < ∞

Let 1 ≤ j < k. Then:

∫_{−∞}^{∞} |y|^j·f(y) dy = ∫_{|y|≤1} |y|^j·f(y) dy + ∫_{|y|>1} |y|^j·f(y) dy
≤ ∫_{|y|≤1} f(y) dy + ∫_{|y|>1} |y|^j·f(y) dy
≤ 1 + ∫_{|y|>1} |y|^j·f(y) dy
≤ 1 + ∫_{|y|>1} |y|^k·f(y) dy < ∞

Example: Let X have a Student-t distribution with v degrees of freedom.

– The distribution is symmetric and bell shaped (like the normal distribution). The degrees of freedom
(v) determine which moments exist:
All moments of order v or higher do not exist.
The mean only exists when v > 1 (otherwise undefined); skewness (kurtosis) is only defined
when v > 3 (v > 4).
– The tails when v = 5 are thicker and the peak higher compared to v = ∞ (the N(0, 1) limit): t(5) is
leptokurtic (peaked).
2.4.3 Moment generating function and characteristic function
For the random variable X, with probability density function f(x), if the function

M(t) = E(e^{tX}) (= ∫ e^{tx} f(x) dx)

exists, then it is the moment generating function (MGF).

– See also LM Chapter 3.12 (with exercises)

– The moment generating function is unique and completely determines the distribution of the
random variable; thus if two random variables have the same moment generating function,
they have the same distribution.
– Evaluating derivatives of the moment generating function at 0 yields the moments of the
random variable:

E[X^r] = d^r M(t)/dt^r evaluated at t = 0
– Example: The moment generating function of X ~ N(μ, σ²) is given by M(t) = exp(μt + σ²t²/2).
The derivation of the MGF is as follows:

E(e^{tX}) = ∫ e^{tx} · (1/√(2πσ²)) · e^{−(x−μ)²/(2σ²)} dx
= ∫ (1/√(2πσ²)) · e^{−(x² − 2μx + μ²)/(2σ²) + tx} dx
= ∫ (1/√(2πσ²)) · e^{−(x² − 2(μ + σ²t)x + μ²)/(2σ²)} dx

By rewriting the function to be integrated as the product of a constant and the pdf of a
normal with mean μ + σ²t and variance σ², the answer follows:

= e^{μt + σ²t²/2} · ∫ (1/√(2πσ²)) · e^{−[x − (μ + σ²t)]²/(2σ²)} dx = e^{μt + σ²t²/2},

where the remaining integral equals 1. This recognizes the property of a pdf which guarantees ∫ f(x) dx = 1.

Using M(t) = exp(μt + σ²t²/2), we can show that M'(0) = E(X) = μ and M''(0) = E(X²) = μ² + σ²:

M'(t) = (μ + σ²t)·exp(μt + σ²t²/2), so M'(0) = μ = E(X).
M''(t) = σ²·exp(μt + σ²t²/2) + (μ + σ²t)²·exp(μt + σ²t²/2), so M''(0) = σ² + μ².

A useful feature of MGFs is that if X and Y are independent, then the MGF of X + Y is
M_X(t)·M_Y(t):

E(e^{t(X+Y)}) = E(e^{tX}·e^{tY}) = E(e^{tX})·E(e^{tY}), using independence.

– This is a useful result when considering the distribution of sums of random variables. More later.
While there are many distributions whose MGF does not exist, every distribution has a unique
characteristic function:

φ(t) = E(e^{itX}), where i is the imaginary number, such that i² = −1.

– The characteristic function is a fundamental tool in proofs of central limit theorems. More later.
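The "derivatives of the MGF at 0 give the moments" result for the normal example above can be verified symbolically; a sketch assuming SymPy is available (not part of the original notes):

import sympy as sp

t, mu, sigma = sp.symbols('t mu sigma', positive=True)
M = sp.exp(mu * t + sigma**2 * t**2 / 2)     # MGF of N(mu, sigma^2)

m1 = sp.diff(M, t).subs(t, 0)                # first derivative at t = 0 -> E[X]
m2 = sp.diff(M, t, 2).subs(t, 0)             # second derivative at t = 0 -> E[X^2]
print(sp.simplify(m1))                       # mu
print(sp.simplify(m2))                       # mu**2 + sigma**2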

2.5 Percentiles and mode

Percentiles of a distribution: If 0 < p < 1, then the 100·p-th percentile of the distribution of
X is the number c_p which satisfies both of the following inequalities:

P(X ≤ c_p) ≥ p and P(X ≥ c_p) ≥ 1 − p.

– Interquartile range: c_{0.75} − c_{0.25} (a measure of spread, used e.g. as a measure of inequality)

– For a continuous random variable, it is sufficient to find the c_p for which P(X ≤ c_p) = p.
– Percentiles are used when determining critical values when performing hypothesis tests.

Median: the 50-th percentile of a distribution is referred to as the median of the distri-
bution – it is the point M for which P(X ≤ M) = 0.5. Half of the distribution's probability is to
the left of M and half is to the right.

The mode of a distribution: The mode is any point m at which the probability or density
function f(x) is maximized.

The distribution of the random variable X is said to be symmetric about the point c if
f(c + t) = f(c − t) for any value of t. In this case

– The expected value of X is c

– The median of X is c, and
– Any odd-order central moment about the mean is 0, i.e. E[(X − μ)^k] = 0 if k ≥ 1 is an
odd integer.

2.6 Exercises
Exercise: The time between consecutive eruptions of the volcano Mauna Loa follows the pdf:
f(t) = 0.027·e^{−0.027·t}
where t is given in months. What is the average time between consecutive eruptions of the volcano?

– Solution: The average is given by:

E(T) = ∫_0^∞ t·0.027·e^{−0.027·t} dt

Integrating by parts, we have:

E(T) = [−t·e^{−0.027·t}]_0^∞ + ∫_0^∞ e^{−0.027·t} dt
= 0 + [−(1/0.027)·e^{−0.027·t}]_0^∞
= 1/0.027 − 0
≈ 37 months

Exercise: If the pdf of Y is given by f(y) = 1 for y ∈ [1, 2] and f(y) = 0 otherwise, what is
Var(Y)?

– Solution:

E(Y²) = ∫_1^2 y² dy = [y³/3]_1^2 = (8 − 1)/3 = 7/3

As E(Y) = 3/2:

Var(Y) = E(Y²) − (E(Y))²
= 7/3 − (3/2)²
= 1/12
Exercise: Let X equal the number of tosses of a fair die until the first "1" appears. Find E[X].

– Solution: X is a discrete random variable that can take on any integer value ≥ 1. The prob-
ability that the first 1 appears on the x-th toss is f(x) = (5/6)^{x−1}·(1/6) for x ≥ 1 (x − 1 tosses
that are not 1, followed by a 1). This is the probability function of X.
Then

E[X] = ∑_{k=1}^∞ k·f(k) = ∑_{k=1}^∞ k·(5/6)^{k−1}·(1/6)
= (1/6)·[1 + 2·(5/6) + 3·(5/6)² + ...]

We use the general increasing geometric series relation 1 + 2r + 3r² + ... = 1/(1 − r)², so that

E[X] = (1/6)·1/(1 − 5/6)² = 6.

Exercise: A continuous random variable X has density function

f(x) = 1 − |x| if |x| < 1, and f(x) = 0 elsewhere.

Find Var[X].

– Solution: The density of X is symmetric about 0 (since f(x) = f(−x)), so that E[X] = 0.
This can be verified directly:

E[X] = ∫_{−1}^{1} x·(1 − |x|) dx
= ∫_{−1}^{0} x·(1 + x) dx + ∫_0^1 x·(1 − x) dx
= −1/6 + 1/6 = 0

Then

Var[X] = E(X²) − (E[X])² = E(X²)
= ∫_{−1}^{1} x²·(1 − |x|) dx
= ∫_{−1}^{0} x²·(1 + x) dx + ∫_0^1 x²·(1 − x) dx = 1/12 + 1/12 = 1/6
Exercise: The continuous random variable X has pdf f(x) = (1/2)·e^{−|x|} for −∞ < x < ∞. Find
the 87.5-th percentile of the distribution.

– Solution: The 87.5-th percentile is the number b for which 0.875 = P(X ≤ b) = ∫_{−∞}^{b} f(x) dx =
∫_{−∞}^{b} (1/2)·e^{−|x|} dx. This distribution is symmetric about 0, since f(−x) = f(x), so the mean
and median are both 0. Thus b > 0, and so

∫_{−∞}^{b} (1/2)·e^{−|x|} dx = ∫_{−∞}^{0} (1/2)·e^{−|x|} dx + ∫_0^b (1/2)·e^{−x} dx
= 0.5 + (1/2)·(1 − e^{−b})
= 0.875
→ b = −ln(0.25) = ln 4

3 Joint, Marginal and conditional distributions


See also LM Chapter 3.7, 3.8 and 3.11 (with exercises)

3.1 Joint distribution


The joint density function for two random variables X and Y, denoted f(x, y), is defined so that

P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∑_{a ≤ x ≤ b} ∑_{c ≤ y ≤ d} f(x, y) if X and Y are discrete
P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d f(x, y) dy dx if X and Y are continuous

where f(x, y) ≥ 0 and

∑_x ∑_y f(x, y) = 1 if X and Y are discrete
∫_x ∫_y f(x, y) dy dx = 1 if X and Y are continuous
– Everything can be generalized to general multivariate random variables,
by using vectors, e.g., f(x) = f(x1, ..., xn).

The cumulative distribution function describes the probability of a joint event

F(x, y) = P[(X ≤ x) ∩ (Y ≤ y)]
= ∑_{s ≤ x} ∑_{t ≤ y} f(s, t) if X and Y are discrete
= ∫_{−∞}^{x} ∫_{−∞}^{y} f(s, t) dt ds if X and Y are continuous

– For continuous random variables

∂²F(x, y)/∂x∂y = f(x, y)

– Extension to multivariate continuous random variables, using vectors:

F(x) = ∫_{−∞}^{x_n} ∫_{−∞}^{x_{n−1}} ... ∫_{−∞}^{x_1} f(t) dt_1 dt_2 ... dt_n

Useful results

– lim_{x→−∞} F(x, y) = lim_{y→−∞} F(x, y) = 0

– P[(x1 < X ≤ x2) ∩ (y1 < Y ≤ y2)]
= F(x2, y2) − F(x2, y1) − F(x1, y2) + F(x1, y1)
= ∫_{x1}^{x2} ∫_{y1}^{y2} f(x, y) dy dx in the continuous setting

To visualize this result, note that F(x2, y2) considers the probability of all points in (−∞, x2] ×
(−∞, y2]. Subtracting F(x2, y1) and F(x1, y2), we end up with the probability in the region we are
interested in, (x1, x2] × (y1, y2], minus the probability in (−∞, x1] × (−∞, y1], because this was
subtracted twice; adding F(x1, y1) corrects for this (see the figure below).
[Figure: the rectangle (x1, x2] × (y1, y2] in the (x, y) plane]

3.2 Marginal distribution


From the joint distribution of X and Y, we can obtain the probability distribution of X (without
reference to Y). That is the marginal distribution of X.

To obtain the marginal distribution from the joint density, we need to sum or integrate out the
other variable(s).
The marginal probability function or marginal density function of X is

f_X(x) = ∑_y f(x, y) in the discrete case;  f_X(x) = ∫_{−∞}^{∞} f(x, y) dy in the continuous case

– Extension to the multivariate setting: f_{X1}(x1) = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} f(t) dt_2 ... dt_n.

The marginal cumulative distribution of X can be found from the joint distribution F(x, y)
as:

F_X(x) = lim_{y→∞} F(x, y).

– Observe that F(x, y) = Pr((X ≤ x) ∩ (Y ≤ y)) with y → ∞ is nothing else than Pr(X ≤ x) ≡
F_X(x), since Y is allowed to take all possible values.

3.3 Conditional distribution


Conditioning and the use of conditional distributions play an important role in econometric mod-
elling.
Suppose X and Y have joint density/probability function f(x, y), and the density/probability
function of the marginal distribution of X is f_X(x). The conditional density function of Y given
X = x is

f_{Y|X}(y|X = x) = f(x, y) / f_X(x), if f_X(x) ≠ 0.
– Recall the definition of conditional probability:

P(B|A) = P(B ∩ A) / P(A)

– "A slice of the joint distribution, suitably rescaled."

The density/probability function of jointly distributed variables X and Y can be written in the
form

f(x, y) = f_{Y|X}(y|X = x)·f_X(x)
= f_{X|Y}(x|Y = y)·f_Y(y)
Figure 2 shows the joint distribution of X and Y .

– The points in red are the ones with higher probability density, the points in black have
probability density close to 0. We see for higher values of X, higher values of Y are more
likely. When X is low, Y is more likely to be low.

Figure 3 shows the marginal distribution of X and the distribution of X conditional on Y = 1.

– The marginal distribution of X is symmetric around 2, as it considers the probability density


of X for all possible values of Y . The distribution of X conditional on Y = 1 has its mode
at X = 1 and can be seen as a slice of the joint, when Y = 1 and X varies between 0 and 4:

3.4 Independence of random variables


X and Y with cumulative distribution functions F_X(x) and F_Y(y) are independent if F(x, y) can
be factored in the form

F(x, y) = F_X(x)·F_Y(y) for all (x, y)

If X and Y are independent:

– The joint pdf is the product of the marginals:

f(x, y) = f_X(x)·f_Y(y)

– The conditional pdf is the same as the unconditional pdf:

f_{Y|X}(y|X = x) = f_Y(y);  f_{X|Y}(x|Y = y) = f_X(x)

– E(g(X)·h(Y)) = E(g(X))·E(h(Y))

3.5 Expectations in a joint distribution


For any function g(x, y),

E(g(X, Y)) = ∑_x ∑_y g(x, y)·f(x, y) in the discrete case
E(g(X, Y)) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y)·f(x, y) dy dx in the continuous case

See also LM Chapter 3.9 (with exercises).

[Figure 2: Joint distribution of X and Y]
[Figure 3: Marginal and conditional distributions of X]

The covariance between X and Y is a special case:
3.5.1 Covariance and correlation
Covariance between X and Y:

Cov[X, Y] = E[(X − E[X])·(Y − E[Y])]
= E(XY − X·E[Y] − Y·E[X] + E[X]·E[Y])
= E[XY] − E[X]·E[Y]

– Analogy: Var[X] = E[X²] − (E[X])². (Indeed, Cov[X, X] = Var[X].)
– Clearly, Cov[X, Y] = Cov[Y, X].
– If X and Y are independent ⇒ Cov[X, Y] = 0:
Cov(X, Y) = E(XY) − E(X)·E(Y);
as independence ensures E(XY) = E(X)·E(Y), the result follows.

The covariance will indicate the direction of covariation of X and Y. Its magnitude depends on
the scales of measurement, unlike the correlation coefficient.

Coefficient of correlation between X and Y:

ρ(X, Y) = ρ_{XY} = Cov[X, Y] / (σ_X·σ_Y),  with −1 ≤ ρ_{XY} ≤ 1,

where σ_X and σ_Y are the standard deviations of X and Y respectively.
Zero covariance versus independence

While independence implies uncorrelatedness, the reverse is not necessarily true:

X and Y independent ⇒ Cov(X, Y) = 0

X and Y independent ⇍ Cov(X, Y) = 0

– An exception to this is when X and Y are jointly normally distributed; then zero covariance implies
independence!
The bivariate normal distribution is given by:

f_{XY}(x, y) = 1/(2π·σ_x·σ_y·√(1 − ρ²_{xy})) ·
exp{ −1/(2(1 − ρ²_{xy})) · [ ((x − μ_x)/σ_x)² + ((y − μ_y)/σ_y)² − 2ρ_{xy}·((x − μ_x)/σ_x)·((y − μ_y)/σ_y) ] }

If ρ_{xy} = 0, note f_{XY}(x, y) = f_X(x)·f_Y(y), i.e. X and Y are independent!

Exercise: Suppose Pr(X = 1) = Pr(X = 3) = Pr(Y = 1) = Pr(Y = 3) = 1/2. Then E[X] = 2,
E[Y] = 2, σ_X = σ_Y = 1. The covariance and correlation depend on whether Y is more or less
likely to be equal to 3 given that X = 3. Consider the covariance/correlation in the following
three cases:
(i) X and Y are independent.
(ii) Either X = Y = 3 or X = Y = 1.
(iii) X = 3 if Y = 1 and X = 1 if Y = 3.

– Solution
(i) Pr(XY = 1) = 1/4, Pr(XY = 3) = 1/2, Pr(XY = 9) = 1/4.
Thus E[XY] = (1/4)·1 + (1/2)·3 + (1/4)·9 = 4 = E[X]·E[Y], so Cov[X, Y] = ρ_{X,Y} = 0.
(ii) Pr(XY = 1) = 1/2 and Pr(XY = 9) = 1/2.
Thus E[XY] = (1/2)·1 + (1/2)·9 = 5, Cov[X, Y] = 5 − 4 = 1, and ρ_{XY} = 1.
(iii) Pr(XY = 3) = 1.
Thus E[XY] = 3, Cov[X, Y] = 3 − 4 = −1, and ρ_{XY} = −1.

– In case (iii) the correlation is negative (high X corresponds with low Y), whereas in (ii) it is
positive (high X corresponds with high Y).
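The three cases above can be reproduced directly from the joint probabilities; a minimal sketch (not part of the original notes):

# Covariance and correlation of a discrete joint distribution given as ((x, y), prob) pairs.
def cov_corr(pairs):
    Ex  = sum(x * p for (x, _), p in pairs)
    Ey  = sum(y * p for (_, y), p in pairs)
    Exy = sum(x * y * p for (x, y), p in pairs)
    Vx  = sum((x - Ex) ** 2 * p for (x, _), p in pairs)
    Vy  = sum((y - Ey) ** 2 * p for (_, y), p in pairs)
    cov = Exy - Ex * Ey
    return cov, cov / (Vx ** 0.5 * Vy ** 0.5)

indep    = [((x, y), 0.25) for x in (1, 3) for y in (1, 3)]   # case (i)
same     = [((1, 1), 0.5), ((3, 3), 0.5)]                     # case (ii)
opposite = [((3, 1), 0.5), ((1, 3), 0.5)]                     # case (iii)
for case in (indep, same, opposite):
    print(cov_corr(case))    # (0.0, 0.0), (1.0, 1.0), (-1.0, -1.0)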

Exercise: X and Y are discrete random variables which are jointly distributed with the following
probability function f(x, y):

              X = −1    X = 0    X = 1
  Y = 1        1/18      1/9      1/6
  Y = 0        1/9       0        1/6
  Y = −1       1/6       1/9      1/9

Find E[XY].

– Solution: Recall: E[XY] = ∑_x ∑_y x·y·f(x, y)

E[XY] = (−1)(1)(1/18) + (−1)(0)(1/9) + (−1)(−1)(1/6)
+ (0)(1)(1/9) + (0)(0)(0) + (0)(−1)(1/9)
+ (1)(1)(1/6) + (1)(0)(1/6) + (1)(−1)(1/9)
= 1/6
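The table computation can be checked with exact fractions; a minimal sketch (not part of the original notes):

from fractions import Fraction as F

f = {(-1, 1): F(1, 18), (0, 1): F(1, 9),  (1, 1): F(1, 6),
     (-1, 0): F(1, 9),  (0, 0): F(0),     (1, 0): F(1, 6),
     (-1, -1): F(1, 6), (0, -1): F(1, 9), (1, -1): F(1, 9)}

print(sum(f.values()))                                # 1, so f is a valid joint pmf
print(sum(x * y * p for (x, y), p in f.items()))      # E[XY] = 1/6
fX = {x: sum(p for (xx, _), p in f.items() if xx == x) for x in (-1, 0, 1)}
print(fX)                                             # marginal pmf of X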

3.5.2 Covariance matrix


In the multivariate setting, we define the covariance matrix of the n × 1 random vector X, with
mean vector E(X) = μ, as

Var(X) = E[(X − E(X))(X − E(X))'],

an n × n matrix with the variances σ_1², ..., σ_n² on the diagonal and the covariances σ_ij (i ≠ j) off
the diagonal.

– With X = (X_1, ..., X_n)' and μ = (μ_1, ..., μ_n)', we have X − E(X) = (X_1 − μ_1, ..., X_n − μ_n)'.

Using matrix manipulation we get that (X − E(X))(X − E(X))' is the n × n matrix whose
(i, j)-th element is (X_i − μ_i)(X_j − μ_j); taking expectations element by element gives the
result above.

– The diagonal contains the variances associated with each element in the vector X:
σ_i² = E[(X_i − μ_i)²]

– The off-diagonal contains the covariances: σ_ij = E[(X_i − μ_i)(X_j − μ_j)]

– By dividing σ_ij by σ_i·σ_j we obtain the correlation matrix.

If X_1, ..., X_n are independent with E(X_i) = 0 and Var(X_i) = σ²:
E(X) = 0 and Var(X) = σ²·I_n (a scalar covariance matrix)

– Note that a scalar covariance matrix does not guarantee independence (joint normality re-
quired!)
3.5.3 Mean and variance of sums of random variables
The expected value of a sum of two random variables is:

E[X + Y] = E[X] + E[Y]

and, generalizing,

E[∑_{i=1}^n X_i] = ∑_{i=1}^n E[X_i]

The variance of a sum of two random variables is

Var[X + Y] = Var[X] + Var[Y] + 2·Cov[X, Y]

and, generalizing,

Var[∑_{i=1}^n X_i] = ∑_{i=1}^n Var[X_i] + ∑_{i=1}^n ∑_{j≠i} Cov(X_i, X_j)

– Proof (in the bivariate setting): Using the fact that Var(Z) = E(Z²) − (E(Z))², we get

Var[X + Y] = E[(X + Y)²] − (E[X + Y])²
= E[X² + 2XY + Y²] − (E[X] + E[Y])²
= E[X²] + E[2XY] + E[Y²] − (E[X])² − 2·E[X]·E[Y] − (E[Y])²
= Var[X] + Var[Y] + 2·Cov[X, Y]

– If X and Y are independent, then:

Var[X + Y] = Var[X] + Var[Y]

and, generalizing, if X_1, X_2, ..., X_n are mutually independent random variables:

Var[∑_{i=1}^n X_i] = ∑_{i=1}^n Var[X_i]

We can extend these results to linear combinations of random variables:

– For any X, Y and constants a, b, and c:

E[aX + bY + c] = a·E[X] + b·E[Y] + c

– For any X, Y and constants a, b, and c:

Var[aX + bY + c] = a²·Var[X] + b²·Var[Y] + 2ab·Cov[X, Y]
Important case: Consider the sample mean X̄ = (∑_{i=1}^n X_i)/n. Assume that the observations
are independent (Cov[X_i, X_j] = 0 for i ≠ j) and Var[X_i] = σ². Then

Var(X̄) = Var((1/n)·∑_{i=1}^n X_i) = (1/n²)·∑_{i=1}^n Var[X_i] = (1/n²)·n·σ² = σ²/n

The standard deviation of X̄ is:

Stdev(X̄) = σ/√n

The variance and standard deviation of the sample mean are smaller when the sample size is
larger.
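A small simulation sketch of Var(X̄) = σ²/n, assuming NumPy is available (not part of the original notes; the normal draws and the values of σ², n are illustrative):

import numpy as np

rng = np.random.default_rng(0)
sigma2, n, reps = 4.0, 25, 100_000
samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
xbar = samples.mean(axis=1)              # one sample mean per replication
print(xbar.var(), sigma2 / n)            # both close to 0.16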

It is useful to add some matrix algebra to the above discussion. Let X be an n × 1 vector of
random variables.

– Consider the scalar random variable Y = a'X + b_0 = a_1·X_1 + ... + a_n·X_n + b_0.

For any constant vector a and scalar b_0:

E[Y] = a'·E[X] + b_0 and
Var[Y] = a'·Var[X]·a = ∑_{i=1}^n ∑_{j=1}^n a_i·a_j·σ_ij

– Consider the vector of random variables Y = AX + b.

For any constant matrix A and vector b:

E[Y] = A·E(X) + b and
Var[Y] = A·Var[X]·A'

– To show the latter result note:

E(Y) = E(AX + b) = A·E(X) + b uses the linearity of the expectation operator and
the fact that A and b are not random.

Var(Y) is a covariance matrix, which is defined as E[(Y − EY)(Y − EY)'].

Once we replace Y = AX + b, we get

Var(Y) = E[(AX + b − A·EX − b)(AX + b − A·EX − b)']
= E[(AX − A·EX)(AX − A·EX)']
= E[A·(X − EX)·(A·(X − EX))'] = E[A·(X − EX)(X − EX)'·A']
= A·Var(X)·A'

The earlier result is simply a special case with A = a'.
3.5.4 Conditional mean and variance
The conditional expectation of Y given X = x is

    E[Y | X = x] = Σ_y y f_{Y|X}(y | X = x)                  in the discrete case
                 = ∫_{−∞}^{∞} y f_{Y|X}(y | X = x) dy         in the continuous case

– The conditional mean function E(Y jX) is called the regression of Y on X


A random variable can always be written as

    Y = E(Y|X) + (Y − E(Y|X))
      = E(Y|X) + ε

The conditional variance of Y given X = x is

    Var[Y | X = x]
      = Σ_y (y − E[Y | X = x])² f_{Y|X}(y | X = x)                 in the discrete case
      = ∫_{−∞}^{∞} (y − E[Y | X = x])² f_{Y|X}(y | X = x) dy        in the continuous case
      = E[Y² | X = x] − (E[Y | X = x])²

  – The conditional variance is called the scedastic function and, like the regression, it is typically a function of x.
– The case where V ar(Y jX = x) does not vary with x is called homoskedasticity.

3.5.5 Conditional vs unconditional moments (Law of iterated expectations)


When considering expectations of functions of random variables, the law of iterated expectations is very
useful.

Theorem Law of iterated expectations. Let h(X; Y ) be a function of two random variables

E(h(X; Y )) = EX [E [h(X; Y )jX = x]]

assuming these expectations exist.

This result allows us to consider the stochastic nature of X and Y sequentially.

– Above, we initially ignore the stochastic nature of X by conditioning on it.


Note that, typically, E [h(X; Y )jX = x] is a function of x:
– We deal with the stochastic nature of X afterwards.
The notation EX [ ] indicates the expectation over the values of X:

Conditional and Unconditional mean

E (Y ) = EX [E (Y jX = x)]

Conditional and Unconditional variance:

V ar(Y ) = EX [V ar(Y jX = x)] + V arX [E(Y jX = x)]

  – Var(Y) = E(Y²) − E(Y)²
           = E_X[E(Y² | X = x)] − (E_X[E(Y | X = x)])²      (using the theorem)

  – We add and subtract E_X[(E(Y | X = x))²] and rearrange to yield:

    Var(Y) = { E_X[E(Y² | X = x)] − E_X[(E(Y | X = x))²] } + { E_X[(E(Y | X = x))²] − (E_X[E(Y | X = x)])² }
           =            E_X[Var(Y | X = x)]               +              Var_X[E(Y | X = x)]

  – If E(Y | X = x) does not depend on x we get the simpler result: Var(Y) = E_X[Var(Y | X = x)]
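A simulation sketch of the law of iterated expectations and the variance decomposition; the model Y | X = x ~ N(2x, 1) with X ~ N(0, 1) is purely an illustrative assumption:

```python
# Check E(Y) = E_X[E(Y|X)] and Var(Y) = E_X[Var(Y|X)] + Var_X[E(Y|X)] by simulation.
import numpy as np

rng = np.random.default_rng(2)
N = 1_000_000
X = rng.normal(size=N)
Y = 2 * X + rng.normal(size=N)        # E(Y|X) = 2X,  Var(Y|X) = 1

print(Y.mean(), (2 * X).mean())       # both ≈ 0
print(Y.var())                        # ≈ 5
print(1 + np.var(2 * X))              # E_X[Var(Y|X)] + Var_X[E(Y|X)] = 1 + 4 = 5
```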

3.5.6 Independence vs Conditional mean independence vs Uncorrelatedness


We say that Y is mean independent of X if E [Y jX = x] does not depend on x; so that

E [Y jX = x] = E [Y ]

– This is a much weaker assumption than assuming independence between Y and X:

Independence implies mean independence:


    E[Y | X = x] = ∫_{−∞}^{∞} y f_{Y|X}(y | X = x) dy  =_indep  ∫_{−∞}^{∞} y f_Y(y) dy = E[Y]

Mean independence does not imply independence.

– While we may have mean independence, this does not guarantee that, e.g., V ar [Y jX = x]
does not depend on x: Independence: fY jX=x = fY involves all moments!

Mean independence implies uncorrelatedness: Need to show E (Y X) = E (Y ) E (X).

– Say E(Y jX = x) = E (Y ) : Using the law of iterated expectation

E (Y X) = EX (E(Y XjX = x))

– Since X is …xed when considering E(Y XjX = x); we can take X outside the expectation
and consider

E (Y X) = EX (XE(Y jX = x))
= EX (XE(Y ))

  – Realize that E(Y) is a fixed number, to obtain the desired result:

    E(YX) = E(Y)·E_X(X) = E(Y)·E(X)

Uncorrelatedness does not imply mean independence or independence (exception: normality)

42
3.6 Exercises
Exercise: If f (X; Y ) = K(X 2 + Y 2 ) is the density function for the joint distribution of the
continuous random variables X and Y de…ned over the unit square bounded by the points (0,0),
(1,0), (1,1) and (0,1), …nd K:
– Solution: The (double) integral of the density function over the region of density must be 1,
so that
    1 = ∫₀¹ ∫₀¹ K(x² + y²) dy dx = K·(2/3)   →   K = 3/2.
Exercise: The cumulative distribution function for the joint distribution of the continuous random variables X and Y is F(x, y) = (0.2)(3x³y + 2x²y²), for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. Find f(1/2, 1/2).
  – Solution:

    f(x, y) = ∂²F(x, y)/∂x∂y = (0.2)(9x² + 8xy)
    → f(1/2, 1/2) = 17/20.
Exercise: Continuous random variables X and Y have a joint distribution with density function f(x, y) = 3(2 − 2x − y)/2 in the region bounded by y = 0, x = 0 and y = 2 − 2x. Find the density function for the marginal distribution of X for 0 < x < 1.
  – Solution: X must be in the interval (0, 1) and Y must be in the interval (0, 2). It is good to draw the region on which the density is nonzero (the triangle with vertices (0, 0), (1, 0) and (0, 2)).

    Since f_X(x) = ∫_{−∞}^{∞} f(x, y) dy, we note that given a value of x in (0, 1), the possible values of y (with non-zero density for f(x, y)) must satisfy 0 < y < 2 − 2x, so that

    f_X(x) = ∫₀^{2−2x} f(x, y) dy = ∫₀^{2−2x} 3(2 − 2x − y)/2 dy = 3(1 − x)²

43
Exercise: Suppose that X and Y are independent continuous random variables with the following
density functions - fX (x) = 1 for 0 < x < 1 and fY (y) = 2y for 0 < y < 1. Find P [Y < X] :

– Solution: Since X and Y are independent, the density function of the joint distribution of
X and Y is
f (x; y) = fX (x) fY (y) = 2y
and is de…ned on the unit square.
    P[Y < X] = ∫₀¹ ∫₀ˣ 2y dy dx = 1/3.

Exercise: Continuous random variables X and Y have a joint distribution with a density function f(x, y) = x² + xy/3 for 0 < x < 1 and 0 < y < 2. Find P(X > 1/2 | Y > 1/2).

  – Solution: Using the definition of conditional probability

    P(X > 1/2 | Y > 1/2) = P[(X > 1/2) ∩ (Y > 1/2)] / P(Y > 1/2)

    P[(X > 1/2) ∩ (Y > 1/2)] = ∫_{1/2}^{2} ∫_{1/2}^{1} (x² + xy/3) dx dy = 43/64

    P(Y > 1/2) = ∫_{1/2}^{2} f_Y(y) dy = ∫_{1/2}^{2} [ ∫₀¹ f(x, y) dx ] dy
               = ∫_{1/2}^{2} ∫₀¹ (x² + xy/3) dx dy = 13/16

    → P(X > 1/2 | Y > 1/2) = (43/64)/(13/16) = 43/52.

Exercise: Continuous random variables X and Y have a joint distribution with density function f(x, y) = (π/2) sin(πy/2) e^{−x} for 0 < x < ∞ and 0 < y < 1. Find P(X > 1 | Y = 1/2).

  – Solution: Using the definition of conditional probability

    P(X > 1 | Y = 1/2) = P[(X > 1) ∩ (Y = 1/2)] / f_Y(1/2)

    P[(X > 1) ∩ (Y = 1/2)] = ∫₁^∞ f(x, 1/2) dx = ∫₁^∞ (π/2) sin(π/4) e^{−x} dx = (√2 π/4) e^{−1}

    f_Y(1/2) = ∫₀^∞ f(x, 1/2) dx = ∫₀^∞ (π/2) sin(π/4) e^{−x} dx = √2 π/4

    → P(X > 1 | Y = 1/2) = e^{−1}

Exercise: X is a continuous random variable with density function fX (x) = x + 12 for 0 < x < 1.
X is also jointly distributed with the continuous random variable Y , and the conditional density
function of Y given X = x is
x+y
fY jX (yjX = x) =
x + 21
for 0 < x < 1 and 0 < y < 1: Find fY (y) for 0 < y < 1:

– Solution:

    f(x, y) = f(y|x)·f_X(x) = [(x + y)/(x + 1/2)]·(x + 1/2) = x + y

    Then

    f_Y(y) = ∫₀¹ f(x, y) dx = y + 1/2

Exercise: The coefficient of correlation between random variables X and Y is 1/3, and σ²_X = a, σ²_Y = 4a. The random variable Z is defined to be Z = 3X − 4Y, and it is found that σ²_Z = 114. Find a.

  – Solution:

    σ²_Z = Var[Z] = 9 Var[X] + 16 Var[Y] − 2(3)(4) Cov[X, Y]

    Since

    Cov[X, Y] = ρ[X, Y]·σ_X·σ_Y = (1/3)·a^{1/2}·(4a)^{1/2} = 2a/3,

    it follows that

    114 = σ²_Z = 9a + 16(4a) − 24·(2a/3) = 57a   →   a = 2

45
Exercise: Suppose that X has a continuous distribution with p.d.f fX (x) = 2x on the interval
(0; 1) and fX (x) = 0 elsewhere. Suppose that Y is a continuous random variable such that the
conditional distribution of Y given X = x is uniform on the interval (0; x). Find the mean and
variance of Y:
  – Solution: We are given f_X(x) = 2x for 0 < x < 1 and f_{Y|X}(y | X = x) = 1/x for 0 < y < x. Then,

    f(x, y) = f(y|x)·f_X(x) = (1/x)·2x = 2   for 0 < x < 1 and 0 < y < x.

    The unconditional (marginal) distribution of Y has p.d.f.

    f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx = ∫_y^1 2 dx = 2(1 − y)   for 0 < y < 1

    (and f_Y(y) is 0 elsewhere). Then

    E[Y] = ∫₀¹ y·2(1 − y) dy = 1/3,
    E[Y²] = ∫₀¹ y²·2(1 − y) dy = 1/6,

    and

    Var[Y] = E[Y²] − (E[Y])² = 1/6 − (1/3)² = 1/18.
Exercise: Given n independent random variables X₁, X₂, …, Xₙ each having the same variance σ², and defining U = 2X₁ + X₂ + … + X_{n−1} and V = X₂ + X₃ + … + 2Xₙ, find the coefficient of correlation between U and V.

  – Solution:

    ρ_UV = Cov[U, V] / (σ_U σ_V),

    σ²_U = (4 + 1 + 1 + … + 1)σ² = (n + 2)σ² = σ²_V.

    Since the X's are independent, if i ≠ j then Cov[Xᵢ, Xⱼ] = 0. Then, noting that Cov[W, W] = Var[W], we have

    Cov[U, V] = Cov[2X₁, X₂] + Cov[2X₁, X₃] + … + Cov[X_{n−1}, 2Xₙ]
              = Var[X₂] + Var[X₃] + … + Var[X_{n−1}]
              = (n − 2)σ²

    Then ρ_UV = (n − 2)σ² / ((n + 2)σ²) = (n − 2)/(n + 2).

46
4 Some special distributions
4.1 Normal distribution
See also LM Chapter 4.3 (with exercises).
Univariate. The pdf of a normal random variable X with mean μ and variance σ² is

    f(x) = (1/√(2πσ²)) exp( −(x − μ)²/(2σ²) )   for −∞ < x < ∞.

  – This result is typically denoted: X ~ N(μ, σ²)

If X ~ N(μ, σ²), then the random variable Z = (X − μ)/σ ~ N(0, 1)

  – The standard normal density is often denoted by φ(z) and its cdf by Φ(z)
  – Tables of values of Φ(z) may be found in most statistics textbooks
  – Using this notation

    f(x) = (1/σ) φ((x − μ)/σ)   and   F(x) = Φ((x − μ)/σ)

Multivariate. The pdf of jointly normal random variables X₁, …, Xₙ with mean (vector) μ and covariance (matrix) Σ is

    f(x) = (2π)^{−n/2} (det Σ)^{−1/2} exp( −½ (x − μ)′ Σ⁻¹ (x − μ) ),

    or, if Σ = σ²Ω,

    f(x) = (2πσ²)^{−n/2} (det Ω)^{−1/2} exp( −(1/(2σ²)) (x − μ)′ Ω⁻¹ (x − μ) ).

  – When X ~ N(0, σ²I) this simplifies to

    f(x) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σᵢ₌₁ⁿ xᵢ² )
         = f(x₁)···f(xₙ), where the Xᵢ are i.i.d. N(0, σ²)
         = ∏ᵢ₌₁ⁿ f(xᵢ)

This shows that uncorrelatedness in the presence of joint normality yields independence.

Important Properties of Normal Distribution

1. If X has a multivariate (joint) normal distribution, then all marginals are normal:
    X ~ N(μ, Σ)  →  Xᵢ ~ N(μᵢ, σᵢ²).

47
2. All linear transformations of X are normal

Y = c + BX; with c a constant vector and B a constant matrix


    Y ~ N(c + Bμ, BΣB′)
  – Recall, E(Y) = c + BE(X) and Var(Y) = B Var(X) B′
3. If X1 and X2 are jointly normal, then uncorrelatedness implies independence of X1 and
X2 :
– Proof: Show that bivariate normal distribution with 12 = 0 is identical to the product
of the marginals of X1 and X2 !
4. If X₁ and X₂ are bivariate normal, then the conditional distributions are normal as well:

    X₁ | X₂ = x₂ ~ N( μ₁ + (σ₁₂/σ₂²)(x₂ − μ₂),  σ₁²(1 − ρ₁₂²) )

Exercise: If for a certain normal random variable X, P[X < 500] = 0.5 and P[X > 650] = 0.0227, find the standard deviation of X.

  – Solution: The normal distribution is symmetric about its mean, with P[X < μ] = 0.5. Thus, for this normal X, μ = 500. Then,

    P[X > 650] = 0.0227 = P[ (X − 500)/σ > 150/σ ]

    Since (X − 500)/σ has a standard normal distribution, it follows from the table for the standard normal distribution that 150/σ = 2.00 and σ = 75.
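The same exercise can be solved with scipy's inverse normal CDF instead of the statistical table (a minimal sketch, using the numbers from the exercise):

```python
# Solve P[X > 650] = 0.0227 for sigma, given mu = 500.
from scipy.stats import norm

mu, tail_prob = 500, 0.0227
z = norm.ppf(1 - tail_prob)        # ≈ 2.00
sigma = (650 - mu) / z
print(z, sigma)                    # ≈ 2.0, 75
```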

Below we will see that the χ², student-t and F distributions are derived from the normal distribution.

4.2 Chi-squared, t and F distributions


Chi-squared Distribution
If Z₁, Z₂, …, Zₙ are n independent N(0, 1) random variables, then

    Σᵢ₌₁ⁿ Zᵢ² ~ χ²(n)

  – If Wᵢ are independent N(0, σ²) variables, then Σᵢ₌₁ⁿ (Wᵢ/σ)² ~ χ²(n).
    Standardizing yields that Wᵢ/σ is standard normal, from which the result automatically follows.

The 2 (n) density function has a single parameter, n, the degrees of freedom. It takes only positive
values and is skewed to the right.
If X ~ χ²(n), then E(X) = n and Var(X) = 2n.

Many test statistics we will consider in our econometrics courses have an (asymptotic) Chi-squared
distribution, with degrees of freedom typically given by the number of restrictions.

48
Useful results on quadratic forms (proofs given in Econometrics courses):

– Suitable quadratic forms of vectors of normal random variables can be shown to have a Chi-
squared distribution.

1. If the n-dimensional vector Z ~ N(0, I), then Z′Z ~ χ²(n)

2. If the n-dimensional vector W ~ N(0, Var(W)) (i.e., the Wᵢ not necessarily independent), then

    W′ Var(W)⁻¹ W ~ χ²(n)

  – To see this result, we initially consider a standardization of the vector W:

    (Var(W))^{−1/2} W = Z ~ N(0, I)

    (we can define a symmetric matrix Var(W)^{1/2} so that Var(W)^{1/2} Var(W)^{1/2} = Var(W)).

  – Given this we note

    W′ Var(W)⁻¹ W = W′ Var(W)^{−1/2} Var(W)^{−1/2} W = (Var(W)^{−1/2} W)′ (Var(W)^{−1/2} W) = Z′Z

    whence the result follows.

  – If W ~ N(0, σ²) is scalar, it equals W²/σ² ~ χ²(1)

3. If the n-dimensional vector Z ~ N(0, I) and A is a symmetric and idempotent matrix, then

    Z′AZ ~ χ²(rank(A))

    where rank(A) = tr(A) = number of unit eigenvalues of A.

Student-t Distribution

If Z is a N(0, 1) random variable and X is χ²(n) and is independent of Z, then

    Z / √(X/n) ~ t(n)

The t(n) density function has a single parameter n, the degrees of freedom. It has the same shape as the normal distribution but has thicker tails and sharper peaks (leptokurtic).

If X ~ t(n), then E(X) = 0 (n > 1) and Var(X) = n/(n − 2) (n > 2).

If n is large: t(n) ≈ N(0, 1); t(n) → N(0, 1) as n → ∞.
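A simulation sketch of this construction (the degrees of freedom and replication count are illustrative choices):

```python
# Build T = Z / sqrt(X/n) from independent N(0,1) and chi2(n) draws and
# compare its upper quantiles with scipy's t(n) distribution.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(3)
n, reps = 5, 500_000
Z = rng.normal(size=reps)
X = rng.chisquare(df=n, size=reps)
T = Z / np.sqrt(X / n)

qs = [0.9, 0.95, 0.99]
print(np.quantile(T, qs))     # simulated quantiles
print(t.ppf(qs, df=n))        # exact t(5) quantiles
```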

49
F Distribution
If X is a χ²(m) random variable and Y is a χ²(n) random variable, and X and Y are independent, then

    (X/m) / (Y/n) ~ F(m, n)

The F(m, n) density function has two parameters m and n. It is positive and skewed to the right.
If F₁ ~ F(n, k), then F₂ = 1/F₁ ~ F(k, n):

    Pr(F₁ < a) = Pr(1/F₁ > 1/a) = Pr(F₂ > 1/a) = 1 − Pr(F₂ < 1/a)

Useful observation: If T t(m) then T 2 F (1; m):

Important examples
Example Let X₁, X₂, …, Xₙ be a random sample of size n from N(μ, σ²); then

    √n (X̄ − μ) / S_X ~ t(n − 1)

    where X̄ ~ N(μ, σ²/n) and S²_X = (1/(n − 1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)².

    The proof requires us to show (i) √n(X̄ − μ)/σ ~ N(0, 1), (ii) (n − 1)S²_X/σ² ~ χ²(n − 1), and (iii) independence of X̄ and S²_X.

    The following decomposition shows the required formulation:

    √n (X̄ − μ)/S_X = [ √n (X̄ − μ)/σ ] / √( [(n − 1)S²_X/σ²] / (n − 1) )

  – The numerator is N(0, 1), the denominator is √( χ²(n − 1)/(n − 1) ), and the independence yields the result.

Example Let S²_X and S²_Y be the sample variances from mutually independent samples of sizes m and n respectively, drawn from normal distributions. Let X and Y have population variances σ²_X and σ²_Y respectively:

    (σ²_Y S²_X) / (σ²_X S²_Y) ~ F(m − 1, n − 1)

    The proof uses the facts that (i) (m − 1)S²_X/σ²_X ~ χ²(m − 1), (ii) (n − 1)S²_Y/σ²_Y ~ χ²(n − 1), and their independence.

    The following decomposition shows the required formulation:

    (σ²_Y S²_X) / (σ²_X S²_Y) = [ ((m − 1)S²_X/σ²_X) / (m − 1) ] / [ ((n − 1)S²_Y/σ²_Y) / (n − 1) ]

  – The numerator is χ²(m − 1)/(m − 1), the denominator is χ²(n − 1)/(n − 1), and the independence yields the result.

50
4.3 Bernoulli, binomial and Poisson distributions
These are well known distributions for discrete random variables:

Bernoulli Distribution

X describes the outcome of a trial, where X takes the value 1 with probability p and 0 with
probability 1 p :
    f(x) = pˣ (1 − p)^{1−x},   x = 0, 1

E(X) = Pr(X = 1) = p and V ar(X) = p(1 p)


  – E.g., Var(X) = E(X²) − [E(X)]², where E(X²) = 0²·Pr(X = 0) + 1²·Pr(X = 1) = p

Distribution plays an important role in binary choice models (Logit/Probit).

Binomial Distribution (LM Chapter 3.2)

X describes the number of successes after n independent trials

    f(x) = C(n, x) pˣ (1 − p)^{n−x}   for x = 0, 1, …, n

  – C(n, x) = (n choose x) is a binomial coefficient (combinatorics) and is equal to n!/(x!(n − x)!). Pascal's triangle.

E(X) = np and V ar(X) = np(1 p)

  – The moments are not surprising if you see that X = X₁ + X₂ + … + Xₙ, where the Xᵢ are independent Bernoulli random variables.

Poisson distribution (LM Chapter 4.2)

X describes the number of events in a given period


    f(x) = e^{−λ} λˣ / x!   for x = 0, 1, 2, …,   with λ = E(X)

  – With λ = np:  f(x) = lim_{n→∞} C(n, x) pˣ (1 − p)^{n−x}
– Distribution has seen a wide use in econometrics in, e.g., modelling patents, crime and
demand for health services.

4.4 Some other distributions (not discussed in 2018)


You are not expected to remember the form of these distributions for the exam. In fact: any distribution
other than the normal distribution will be provided in the exam.

51
Discrete Distributions

Uniform distribution: A die toss, a coin ‡ip... The random variable X may assume values
from 1 to N (N 1 is an integer) and all realizations are equally likely. The probability function
is
    f(x) = 1/N   for x = 1, 2, …, N;   0 otherwise.

  – The mean:

    E[X] = (N + 1)/2
    (E(X) = 1·(1/N) + 2·(1/N) + … + N·(1/N) = (1/N) Σᵢ₌₁ᴺ i = (1/N)·½N(N + 1))

  – The variance:

    Var[X] = (N² − 1)/12
    (E(X²) = 1²·(1/N) + 2²·(1/N) + … + N²·(1/N) = (1/N) Σᵢ₌₁ᴺ i² = (1/N)·(1/6)N(N + 1)(2N + 1);
    Var(X) = E(X²) − E(X)².)

Geometric distribution with parameter p 2 [0; 1]: a single trial of an experiment results in either
success with probability p, or failure with probability 1 p:The experiment is performed with
successive independent trials until the …rst success occurs. If X represents the number of failures
until the …rst success, then X is a discrete random variable that can be 0; 1; 2; 3; :::: X is said to
have a geometric distribution with parameter p. (LM Chapter 4.4)
f (x) = (1 p)x p for x = 0; 1; 2; 3; :::
  – Mean:     E[X] = (1 − p)/p
  – Variance: Var[X] = (1 − p)/p²
  – The geometric distribution has the lack of memory property:

    P[X = n + k | X ≥ n] = P[X = k].
The likelihood of the occurrence of the event depends only on p — given p, history does not
matter.

Negative binomial distribution with parameters r and p (r > 0 and 0 < p 1). If r is an
integer, then the negative binomial random variable X can be interpreted as being the number
of failures until the rth success occurs when successive independent trials of an experiment are
performed for which the probability of success in a single particular trial is p (the distribution is
de…ned even if r is not an integer). (LM Chapter 4.5) We have:
    f(x) = C(r + x − 1, x) pʳ (1 − p)ˣ   for x = 0, 1, 2, 3, …

    E[X] = r(1 − p)/p,   and   Var[X] = r(1 − p)/p²

52
Continuous Distributions

Uniform Distribution. In this case, X may take any value on the interval (a; b) and all points
in the interval are equally likely
    f(x) = 1/(b − a)   for a < x < b;   0 otherwise.

  – And we have:

    E[X] = (a + b)/2,   and   Var[X] = (b − a)²/12;

    this is a symmetric distribution about the mean = median = (a + b)/2.

Exponential Distribution. Consider a Poisson event (e.g., the birth of babies in England).
What is the distribution of the time of the next Poisson event? When will the next English baby
be born? This is described by the Exponential distribution.
    f(x) = λ e^{−λx}   for x > 0;   0 otherwise.

    E[X] = 1/λ,   and   Var[X] = 1/λ²;
    F(x) = 1 − e^{−λx}   for x ≥ 0;

    and

    P[X > x] = e^{−λx},
    E[Xᵏ] = ∫₀^∞ xᵏ λ e^{−λx} dx = k!/λᵏ

– The exponential distribution has the lack of memory property

for x; y > 0; P [X > x + y j X > x] = P [X > y]

    Proof: Suppose that X has an exponential distribution with parameter λ. Then

    P[X > x + y | X > x] = P[(X > x + y) ∩ (X > x)] / P[X > x]
                         = P[X > x + y] / P[X > x] = e^{−λ(x+y)} / e^{−λx} = e^{−λy}

    and P[X > y] = e^{−λy}.

In our example: if we expect one baby to be born in England at every 32 seconds, we


expect that the next baby will be born in 32 seconds, regardless of whether the youngest
English fellow was born 1, 2, 10 or 50 seconds ago.

53
  – Suppose that independent random variables Y₁, Y₂, …, Yₙ have exponential distributions with means 1/λ₁, 1/λ₂, …, 1/λₙ (parameters λ₁, λ₂, …, λₙ) respectively. Let Y = min{Y₁, Y₂, …, Yₙ}. Then Y has an exponential distribution with mean 1/(λ₁ + λ₂ + … + λₙ).

    Proof:

    P[Y > y] = P[Yᵢ > y for all i = 1, 2, …, n]
             = P[(Y₁ > y) ∩ (Y₂ > y) ∩ … ∩ (Yₙ > y)]
             = P[Y₁ > y]·P[Y₂ > y]···P[Yₙ > y]
             = (e^{−λ₁y})(e^{−λ₂y})···(e^{−λₙy}) = e^{−(λ₁+λ₂+…+λₙ)y},

    using independence of the Yᵢ's. The c.d.f. of Y is then

    F_Y(y) = P[Y ≤ y] = 1 − P[Y > y] = 1 − e^{−(λ₁+λ₂+…+λₙ)y}

    and the p.d.f. of Y is

    f_Y(y) = F_Y′(y) = (λ₁ + λ₂ + … + λₙ) e^{−(λ₁+λ₂+…+λₙ)y},

    which is the p.d.f. of an exponential distribution with parameter λ₁ + λ₂ + … + λₙ.

    Intuition: The event Y is "either X₁ or X₂ happens". If X₁ and X₂ are Poisson events with parameters λ₁ and λ₂, then Y is also a Poisson event (no memory, constant rate) and the rate at which it occurs is λ₁ + λ₂.

Logistic distribution: The cdf for a logistic random variable is denoted by

    F(x) = Λ(x) = 1/(1 + e^{−x}),   −∞ < x < ∞,

    and its density is

    f(x) = Λ(x)(1 − Λ(x)).

    Like the normal distribution it is symmetric but has thicker tails.

    E(X) = 0,   Var(X) = π²/3

Pareto distribution with parameters α, x₀ > 0:

    f(x) = α x₀^α / x^{α+1}   for x > x₀;   0 otherwise.

    E[X] = α x₀ / (α − 1),   and
    Var[X] = α x₀² / ((α − 2)(α − 1)²).

54
Gamma distribution with parameters α > 0 and λ > 0:

    f(x) = λ^α x^{α−1} e^{−λx} / Γ(α)   for x > 0;   0 otherwise,

    where Γ(α) is the gamma function defined for α > 0 to be Γ(α) = ∫₀^∞ y^{α−1} e^{−y} dy.
    (LM Chapter 4.6)

  – When n is a positive integer, Γ(n) = (n − 1)!

    E[X] = α/λ,   and   Var[X] = α/λ²

  – The exponential distribution with parameter λ is a special case of the gamma distribution with α = 1 (and the same λ).

Beta distribution with parameters a, b > 0:

    f(x) = (1/B(a, b)) x^{a−1} (1 − x)^{b−1}   for 0 < x < 1;   0 otherwise,

    where the beta function is

    B(a, b) = ∫₀¹ x^{a−1} (1 − x)^{b−1} dx.

    E[X] = a/(a + b),   and   Var[X] = ab / ((a + b)²(a + b + 1)).

4.5 Exercises
Exercise: An English player has 75% probability of scoring a goal from a penalty shot. 5 English
players will try to score in a penalty shoot-out. What is the distribution of probability of goals?

  – Solution: We assume that the 5 attempts are independent events. Then X ~ binomial(n = 5, p = 0.75):

    f_X(k) = P(X = k) = C(n, k) pᵏ (1 − p)^{n−k}

    f_X(0) = C(5,0)(0.75)⁰(0.25)⁵ = 1·(0.0010) = 0.1%
    f_X(1) = C(5,1)(0.75)¹(0.25)⁴ = 5·(0.0029) = 1.5%
    f_X(2) = C(5,2)(0.75)²(0.25)³ = 10·(0.0088) = 8.8%
    f_X(3) = C(5,3)(0.75)³(0.25)² = 10·(0.0264) = 26.4%
    f_X(4) = C(5,4)(0.75)⁴(0.25)¹ = 5·(0.0791) = 39.6%
    f_X(5) = C(5,5)(0.75)⁵(0.25)⁰ = 1·(0.2373) = 23.7%
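The same probabilities can be reproduced with scipy (a minimal sketch, using the n and p from the exercise):

```python
# Binomial(5, 0.75) pmf for the penalty shoot-out example.
from scipy.stats import binom

n, p = 5, 0.75
for k in range(n + 1):
    print(k, round(binom.pmf(k, n, p), 4))
# 0.001, 0.0146, 0.0879, 0.2637, 0.3955, 0.2373
```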

Exercise: If X is the number of 6’s that turn up when 72 ordinary dice are independently thrown,
…nd the expected value of X 2 :

  – Solution: X has a binomial distribution with n = 72 and p = 1/6. Then E[X] = np = 12, and Var[X] = np(1 − p) = 10. Since Var[X] = E[X²] − (E[X])², E[X²] = 10 + 12² = 154.

55
Exercise: The number of hits, X, per baseball game, has a Poisson distribution. If the probability of a no-hit game is 1/10,000, find the probability of having 4 or more hits in a particular game.

  – Solution: P[X = 0] = e^{−λ}λ⁰/0! = e^{−λ} = 1/10,000  →  λ = ln 10,000.

    P[X ≥ 4] = 1 − (P[X = 0] + P[X = 1] + P[X = 2] + P[X = 3])
             = 1 − ( e^{−λ}λ⁰/0! + e^{−λ}λ¹/1! + e^{−λ}λ²/2! + e^{−λ}λ³/3! )
             = 1 − ( 1/10,000 + ln 10,000/10,000 + (ln 10,000)²/(2·10,000) + (ln 10,000)³/(6·10,000) )
             = 0.9817
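A one-line numerical check of this answer (a sketch using scipy's Poisson CDF):

```python
# P[X >= 4] for a Poisson with lambda = ln(10,000).
import numpy as np
from scipy.stats import poisson

lam = np.log(10_000)
print(1 - poisson.cdf(3, lam))    # ≈ 0.9817
```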

Exercise: Suppose that X has a uniform distribution on the interval (0, a), where a > 0. The pdf of X is given by f(x) = 1/a for 0 < x < a. Find P[X > X²].

  – Solution: If a ≤ 1, then X > X² is always true, so that P[X > X²] = 1. If a > 1, then X > X² only if X < 1, which has probability

    P[X < 1] = ∫₀¹ f(x) dx = ∫₀¹ (1/a) dx = 1/a.

    Thus, P[X > X²] = min[1, 1/a].

Example: The random variable T has an exponential distribution with cumulative distribution function P[T ≤ t] = 1 − e^{−λt}, where 1/λ is the mean of T. Find the value of λ for which P[T ≤ 2] = 2·P[T > 4] and provide Var[T].

  – Solution: Using the CDF, we know

    P[T ≤ 2] = 2·P[T > 4]
    1 − e^{−2λ} = 2e^{−4λ}
    2e^{−4λ} + e^{−2λ} − 1 = 0

    Calling x = e^{−2λ}, we get 2x² + x − 1 = 0. Solving the quadratic equation results in x ∈ {1/2, −1}. We ignore the negative root, so that e^{−2λ} = 1/2, and λ = ½ ln 2. Then,

    Var[T] = 1/λ² = 4/(ln 2)²

56
5 Distributions of functions of random variables
5.1 The distribution of a function of random variable
The χ², student-t, and F are examples of distributions of functions of random variables.
Another example we used is that linear combinations of normal random variables are normal random variables.

While we discussed how to obtain the expected value of a function of a random variable, e.g., E[X²] and E[e^{tX}], we now discuss how to obtain the distribution of such a function of random variables.

We distinguish three types of transformation:

1. one discrete random variable is transformed into another


2. a continuous random variable is transformed into a discrete one
3. one continuous random variable is transformed into another one.

Ad 1. If Y = u(X), where X and Y are discrete random variables, then

    g(y) = Σ_{x: u(x)=y} f(x)

Example. If Y = 1 if X = 1; 3; 5 and Y = 2 if X = 2; 4; 6 then

g(1) = f (1) + f (3) + f (5); g(2) = f (2) + f (4) + f (6)

Ad 2. If Y = u(X) and X is continuous and Y discrete, then

    g(y) = ∫_{x: u(x)=y} f(x) dx

Example. If Y = 0 if X < a, Y = 1 if a ≤ X < b, and Y = 2 if X ≥ b, then

    g(0) = ∫_{−∞}^{a} f(x) dx,   g(1) = ∫_{a}^{b} f(x) dx,   g(2) = ∫_{b}^{∞} f(x) dx

Ad 3. If Y = u(X) and X and Y are continuous and u(X) is a continuous monotonic function of X (one-to-one):

    g(y) = f(v(y)) |v′(y)|

    where v is the inverse, x = v(y), and v′(y) = dv(y)/dy.

    This is called the change of variable method, and |v′(y)| is the Jacobian.

    We can obtain this by first considering the cdf of Y:

    F_Y(y) = P[Y ≤ y] = P[u(X) ≤ y]

    If u is strictly increasing, this implies

    F_Y(y) = P[X ≤ v(y)] = F_X(v(y))

    Since g(y) ≡ F_Y′(y) = F_X′(v(y))·v′(y) = f(v(y))·v′(y), we have established the result.

57
Important applications:
  – If X ~ N(μ, σ²) then Y = (X − μ)/σ ~ N(0, 1).
    Note: −∞ < y < ∞
    Use g(y) = f(v(y))|v′(y)|.
    Here f(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)), v(y) = σy + μ and v′(y) = σ.
    Substituting gives:

    g(y) = σ·f(σy + μ) = σ·(1/√(2πσ²)) exp(−(σy + μ − μ)²/(2σ²)) = (1/√(2π)) exp(−y²/2)

    where the latter is the pdf of an N(0, 1) random variable, as required.

  – If X ~ N(μ, σ²) then Y = e^X has a lognormal distribution with parameters μ and σ² (log(Y) ~ N(μ, σ²)).
    Use g(y) = f(v(y))|v′(y)| and recognize that y > 0.
    Here v(y) = log(y) and v′(y) = 1/y, so

    g(y) = (1/(y√(2πσ²))) exp(−(log y − μ)²/(2σ²))   if y > 0;   0 otherwise

    E(Y) = E(e^{1·X}) = MGF_X(1) = exp(μ + σ²/2). In accordance with Jensen's inequality, since g″ > 0,

    E(Y) = E(g(X)) = exp(μ + σ²/2) ≥ exp(μ) = g(E(X))

    (Graphically, with X ~ N(1, 1); figure not reproduced here.)

  – If Y is U[0, 1] then X = F^{−1}(Y) has Pr(X ≤ x) = F(x).
    Useful for simulations: If we want to draw random numbers from a distribution with CDF F(x), this result ensures that we can simply draw uniform U[0, 1] random numbers, y, and evaluate F^{−1}(y).
    By definition

    Pr(X ≤ x) = Pr(F^{−1}(Y) ≤ x)

    With the invertibility of the CDF (and F′(x) = f(x) ≥ 0),

    Pr(X ≤ x) = Pr(Y ≤ F(x)) = F(x)

    where the latter equality uses the fact that the CDF of a U[0, 1] random variable is given by Pr(Y ≤ y) = y.
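A sketch of inverse-CDF sampling in practice, for an exponential distribution with an illustrative rate λ (for which F^{−1}(y) = −ln(1 − y)/λ):

```python
# Draw U[0,1] numbers and push them through the inverse CDF of an exponential(lambda).
import numpy as np

rng = np.random.default_rng(4)
lam, n = 2.0, 1_000_000
u = rng.uniform(size=n)                 # U[0,1] draws
x = -np.log(1 - u) / lam                # F^{-1}(u)

print(x.mean(), 1 / lam)                            # sample mean ≈ 1/lambda
print((x <= 0.5).mean(), 1 - np.exp(-lam * 0.5))    # empirical vs exact CDF at 0.5
```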

58
Exercise: Let Y = 2X where X is uniformly distributed between 0 and 1: f (X) = 1 and
Pr(0:5 < X < 0:6) = 0:10: Find the pdf of Y:

– Solution: Pr(1 < Y < 1:2) = 0:10:


    Y is uniformly distributed between 0 and 2, i.e., f(Y) = 1/2.
    Since u(x) = 2x, v(y) = y/2 and v′(y) = 1/2:

    f_Y(y) = f_X(y/2)·|1/2| = 1·(1/2) = 1/2   for y ∈ [0, 2]

Exercise: The random variable X has an exponential distribution with a mean of 1. The random
variable Y is de…ned to be Y = 2 ln X. Find fY (y), the p.d.f. of Y:

  – Solution:

    F_Y(y) = P[Y ≤ y] = P[2 ln X ≤ y] = P[X ≤ e^{y/2}] = 1 − e^{−e^{y/2}}

    f_Y(y) = F_Y′(y) = d/dy (1 − e^{−e^{y/2}}) = (1/2) e^{y/2} e^{−e^{y/2}}

    Alternatively, since Y = 2 ln X (y = u(x) = 2 ln x, and ln is a strictly increasing function with inverse x = v(y) = e^{y/2}), and X = e^{Y/2}, it follows that

    f_Y(y) = f_X(e^{y/2}) |d(e^{y/2})/dy| = e^{−e^{y/2}} · (1/2) e^{y/2}

See also LM Chapter 3.8 (with exercises)

5.2 The distribution of a function of bivariate random variables


Recall: If Y = u(X) and X and Y are continuous and u (X) is a continuous monotonic function
of X (one-to-one)
g(y) = f (v(y)) jv 0 (y)j
where v is the inverse x = v(y) and v 0 (y) = dv(y)=dy

The above result must be modi…ed for a joint distribution. Suppose here that X1 and X2 have a
joint distribution fX (x1 ; x2 ) and that Y1 and Y2 are two monotonic functions of X1 and X2

Y1 = u1 (X1 ; X2 ) X1 = v1 (Y1 ; Y2 )
,
Y2 = u2 (X1 ; X2 ) X2 = v2 (Y1 ; Y2 )

then

    g_Y(y₁, y₂) = f_X( v₁(y₁, y₂), v₂(y₁, y₂) )·abs(J)

    where

    Jacobian = J = det [ ∂v₁/∂y₁   ∂v₁/∂y₂ ]
                       [ ∂v₂/∂y₁   ∂v₂/∂y₂ ]

59
  – The Jacobian must be nonzero for the transformation to exist.

Example. Let

    Y₁ = X₁                  X₁ = Y₁
    Y₂ = X₁ + X₂      ⟺      X₂ = Y₂ − Y₁

  – Here J = det [ 1 0 ; −1 1 ] = 1.
  – g_Y(y₁, y₂) = f_X(y₁, y₂ − y₁)·1
  – If we integrate out Y₁ from this joint density, then we get the marginal distribution of Y₂ = X₁ + X₂:

    g_{Y₂}(y₂) = ∫ f_X(y₁, y₂ − y₁)·1 dy₁

    With Y = X₁ + X₂, the above result allows us to say that the density function of Y = X₁ + X₂ is:

    f_Y(y) = ∫ f_X(x₁, y − x₁) dx₁

5.3 The distribution of a sum of random variables


If X₁ and X₂ are discrete integer-valued random variables with joint probability function f(x₁, x₂), then for an integer y

    P[X₁ + X₂ = y] = Σ_{x₁=−∞}^{∞} f(x₁, y − x₁)

– When X1 = x1 then X2 = y x1 in order for X1 + X2 = y

If X₁ and X₂ are continuous random variables with joint density function f(x₁, x₂), then the density function of Y = X₁ + X₂ is

    f_Y(y) = ∫_{−∞}^{∞} f(x₁, y − x₁) dx₁.

  – This result may be used to show that if X₁ and X₂ are jointly normal rv's, then X₁ + X₂ is also normally distributed with mean μ₁ + μ₂ and variance σ₁² + σ₂² + 2σ₁₂! The proof is really tedious.
  – Given independence, it is easier to prove this using the Moment Generating Function (as shown before).

With independence, the pdf of Y = X₁ + X₂ simplifies to

    f_Y(y) = Σ_{x₁=−∞}^{∞} f_{X₁}(x₁) f_{X₂}(y − x₁)          in the discrete case,
    f_Y(y) = ∫_{−∞}^{∞} f_{X₁}(x₁) f_{X₂}(y − x₁) dx₁          in the continuous case.

  – In this form f_Y(y) is the convolution of f_{X₁} and f_{X₂}, also denoted as (f_{X₁} * f_{X₂})(y).
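A simulation sketch of the convolution result for the sum of two independent exponential(1) variables, whose density y·e^{−y} is derived in the exercises of Section 5.4 below:

```python
# Empirical check: for Y = X1 + X2 with X1, X2 ~ iid exponential(1),
# P(Y < 1) should equal 1 - 2*exp(-1).
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
y = rng.exponential(1.0, n) + rng.exponential(1.0, n)

print((y < 1).mean())          # ≈ 0.2642
print(1 - 2 * np.exp(-1))
```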

60
Important multivariate applications

  – X₁, …, Xₙ is a random sample from N(μ, σ²); then

    Σᵢ₌₁ⁿ Xᵢ ~ N(nμ, nσ²)   ⟹   X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ ~ N(μ, σ²/n)

  – X₁, …, Xₙ is a random sample from χ²(1); then

    Σᵢ₌₁ⁿ Xᵢ ~ χ²(n)

5.4 Exercises
Exercise: Suppose that X and Y are independent discrete integer-valued random variables with
X uniformly distributed on the integers 1 to 5, and Y having the following probability function -

fY (0) = :3; fY (1) = :5; fY (3) = :2:

Let Z = X + Y: Find P [Z = 5] :

  – Solution: Using the fact that f_X(x) = 0.2 for x = 1, 2, 3, 4, 5, and the convolution method for independent discrete random variables, we have

    f_Z(5) = Σᵢ₌₀⁵ f_X(i) f_Y(5 − i)
           = (0)(0) + (0.2)(0) + (0.2)(0.2) + (0.2)(0) + (0.2)(0.5) + (0.2)(0.3)
           = 0.20

Exercise: X is uniformly distributed on the even integers x = 0; 2; 4; :::; 22. The probability
1
function of X is f (x) = 12 for each even integer x from 0 to 22. Find E[X]:

  – Solution: Consider the transformation Y = (X + 2)/2. The random variable Y is distributed on the points Y = 1, 2, …, 12, with probability function f_Y(y) = 1/12 for each integer y from 1 to 12. Thus, Y has the discrete uniform distribution, and E[Y] = (12 + 1)/2 = 13/2. Since Y = (X + 2)/2, we can use rules for expectation to get E[Y] = (E[X] + 2)/2, so that E[X] = 2E[Y] − 2 = 11.

Exercise: X1 and X2 are independent exponential random variables each with a mean of 1. Find
P [X1 + X2 < 1] :

  – Solution: Using the convolution method, the density function of Y = X₁ + X₂ is

    f_Y(y) = ∫₀^y f_{X₁}(t) f_{X₂}(y − t) dt = ∫₀^y e^{−t} e^{−(y−t)} dt = y e^{−y}.

    Note that X₁ and X₂ take on only non-negative values [0, ∞). Since we evaluate the random variable X₂ at y − X₁, we realize that to define f_Y(y) we need to ensure y − X₁ ≥ 0, or X₁ ≤ y! Therefore

    P[X₁ + X₂ < 1] = P[Y < 1] = ∫₀¹ y e^{−y} dy = [ −y e^{−y} − e^{−y} ]₀¹ = 1 − 2e^{−1}

    (the last integral required integration by parts).

Exercise: The birth weight of males is normally distributed with mean 6 pounds, 10 ounces,
standard deviation 1 pound. For females, the mean weight is 7 pounds, 2 ounces with standard
deviation 12 ounces. Given two independent male/female births, …nd the probability that the
baby boy outweighs the baby girl.

  – Solution: Let random variables X and Y denote the boy's weight and girl's weight, respectively. Then W = X − Y has a normal distribution with mean 6 10/16 − 7 2/16 = −1/2 lb and variance σ²_X + σ²_Y = 1 + 9/16 = 25/16. Then,

    P[X > Y] = P[X − Y > 0]
             = P[ (W − (−1/2))/(25/16)^{1/2} > (0 − (−1/2))/(25/16)^{1/2} ]
             = P[Z > 0.4],

    where Z has a standard normal distribution (W was standardized). Referring to the standard normal table, this probability is 0.34.
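The same probability computed directly from the distribution of W = X − Y (a sketch using the numbers in the exercise):

```python
# P[X > Y] = P[W > 0] for W ~ N(-1/2, 25/16).
import numpy as np
from scipy.stats import norm

mu_w = (6 + 10/16) - (7 + 2/16)       # = -0.5 lb
sd_w = np.sqrt(1**2 + (12/16)**2)     # = 1.25 lb
print(1 - norm.cdf(0, loc=mu_w, scale=sd_w))   # ≈ 0.3446
```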

5.5 The limiting distribution of a sum of independent random variables (CLT)
Above we showed what the exact distribution of a sum of random variables is when the distribution
from which the random variables are drawn is known.
If we do not know the distribution from which the sample is drawn, the exact distribution of the
sum (average) will be unknown.
  – If we draw X₁, …, Xₙ from an unknown distribution with mean μ and variance σ², all we know is E(X̄) = μ and Var(X̄) = σ²/n.

If the exact distribution is unknown, we may want to rely on a result that holds when the sample size is sufficiently large: an asymptotic distribution. Central Limit Theorem.

  – If we draw X₁, …, Xₙ from an unknown distribution with mean μ and variance σ², the distribution of X̄ can be approximated well by a normal distribution with mean μ and variance σ²/n, also denoted X̄ ~ᵃ N(μ, σ²/n).

62
Theorem Lindeberg–Levy CLT: If X₁, …, Xₙ are a random sample from a probability distribution with finite mean μ and finite variance σ², then

    √n (X̄ − μ) →_d N(0, σ²)

    "√n (X̄ − μ) has a N(0, σ²) limiting distribution"

  – √n (X̄ − μ) →_d N(0, σ²) is a convergence in distribution result (more about this later).
    This result ensures that, with n sufficiently large, we can approximate the distribution of √n (X̄ − μ) by N(0, σ²) for given n, or

    √n (X̄ − μ) ~ᵃ N(0, σ²)

    By linear transformations, this yields

    (X̄ − μ) ~ᵃ N(0, σ²/n)   or   X̄ ~ᵃ N(μ, σ²/n)
In practical terms: the theorem states that sums of random variables, regardless of their form,
will tend to be normally distributed.

  – Let us demonstrate this by considering the sum of independent Bernoulli random variables (the toss of a coin):
A fair coin is tossed n times: What is the probability distribution of the number (sum) of
heads?

In the graphs (not reproduced here), we considered the limiting distribution of Y = Σᵢ₌₁ⁿ Xᵢ, where {Xᵢ}ᵢ₌₁ⁿ are n independent Bernoulli random variables (e.g., tosses of a coin), Xᵢ ∈ {0, 1}.

  – Let p be the probability of success (heads); then E(Xᵢ) = p and Var(Xᵢ) = p(1 − p).
  – As n increases, the graphs suggest that the normal distribution is indeed a good approximation of the distribution of the sum (average) of independent Bernoulli random variables (see the simulation sketch after this list).
    Given that we can derive E(Σᵢ₌₁ⁿ Xᵢ) and Var(Σᵢ₌₁ⁿ Xᵢ), needed to fully characterize the normal distribution, we can state

    Σᵢ₌₁ⁿ Xᵢ ~ᵃ N(np, np(1 − p))

    Using a linear transformation, the normal distribution is also a good approximation of the sample average:

    X̄ ~ᵃ N(p, p(1 − p)/n)

    In fact, because we know the distribution of Xᵢ, we even know what the exact distribution is:

    Y = Σᵢ₌₁ⁿ Xᵢ ~ binomial(n, p)

– Exact distribution is true for any given sample size, approximation is only reasonable for
large sample sizes.
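A sketch of the coin-toss illustration: the standardized sum of n Bernoulli(p) draws is compared with its N(0, 1) approximation (the values of n, p and the replication count are illustrative assumptions):

```python
# As n grows, the standardized binomial sum behaves like a standard normal.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
p, reps = 0.5, 200_000
for n in (5, 30, 200):
    sums = rng.binomial(n, p, size=reps)                 # Y = sum of n coin tosses
    z = (sums - n * p) / np.sqrt(n * p * (1 - p))        # standardized sum
    print(n, (z <= 1.0).mean(), norm.cdf(1.0))           # empirical vs normal CDF at 1
```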

See also LM Chapter 4.3 –Central Limit Theorem (with exercises)

6 Estimation and Inference


See also Greene Appendix C.

6.1 Samples and Random sampling


The classical theory of statistical inference centers on using the sampled data e¤ectively and is
based on the properties of samples and sampling distribution.

– The goal is to learn from a sample of observations something about the population from
which the data was drawn.

We assume there is an unknown process that generates the data (sample) that can be described
by a distribution function or probability density function (Data Generating Process).

– Frequentist interpretation: A sample is one particular outcome of a statistical experiment


where we draw repeatedly from this distribution.
– We will consider three types of data in our econometric courses
cross section: observations are all drawn at the same point in time
time series: observations are drawn at a number of points in time
panel : cross-sectional units observed over time.

64
We will say that a sample of n observations on one or more variables denoted fX1 ; X2 ; :::; Xn g
n
or fXi gi=1 is a random sample if the n observations are drawn independently from the same
population, or probability distribution fX (xi ; )
n
– We also denote this as fXi gi=1 is i.i.d. (independent, identically distributed)
– The vector contains one or more unknown parameters.

6.2 Statistics as Estimators –Sampling distributions


We need to describe a rule or strategy for using the data to allow us to infer the value of this
parameter or parameters.

Statistic: Any function which can be computed from the data in a sample X = fX1 ; :::; Xn g

– It does not involve any unknown quantities.


– Depending on the sample drawn it will have a di¤erent outcome (it is a random variable)

Estimator: A statistic that is intended to serve as a basis for learning about an unknown quantity
(parameter ) is called an estimator. Typically denoted by b:

– Example: the average of a random sample on food expenditures in the UK is an estimator


of the mean of food expenditures in the UK
– An estimator is a random variable - depending on the sample it will take a di¤erent outcome
Any realization from the estimator is called an estimate.

Sampling distribution:

– Estimators are random variables, so that if another sample were drawn under identical con-
ditions, di¤erent values would be obtained.
– The probability function of the estimator is called the sampling distribution: it speci…es how
the realizations of our estimator will vary under repeated sampling.
– Under random sampling, we would expect that descriptive statistics (such as the sample
mean, sample variance and sample covariances) will mimic that of their population coun-
terparts, although not perfectly. The precise manner in which these quantities re‡ect the
population values de…nes the sampling distribution of our estimator

    Population moments:                                Possible estimator (sample moments):

    Mean:        μ_X = E(Xᵢ)                           X̄ = (1/N) Σᵢ₌₁ᴺ Xᵢ                            (sample average)
    Variance:    σ²_X = E((Xᵢ − μ_X)²)                 S²_X = (1/(N−1)) Σᵢ₌₁ᴺ (Xᵢ − X̄)²               (sample variance of Xᵢ)
    Covariance:  σ_XY = E((Xᵢ − μ_X)(Yᵢ − μ_Y))        S_XY = (1/(N−1)) Σᵢ₌₁ᴺ (Xᵢ − X̄)(Yᵢ − Ȳ)        (sample covariance of Xᵢ, Yᵢ)

65
6.2.1 Sampling distribution of sample mean and variance
Given a random sample X₁, …, Xₙ i.i.d. N(μ, σ²):

  – The sampling distribution of our estimator X̄ for μ is

    X̄ ~ N(μ, σ²/n)

    Any linear combination of jointly normal random variables is normally distributed.
    We showed E(X̄) = μ and Var(X̄) = σ²/n.

  – The sampling distribution of the (suitably rescaled) estimator S²_X for σ² is

    (n − 1) S²_X / σ² ~ χ²(n − 1)

    (n − 1) S²_X/σ² = Σᵢ₌₁ⁿ ((Xᵢ − X̄)/σ)², while Σᵢ₌₁ⁿ ((Xᵢ − μ)/σ)² ~ χ²(n) with each ((Xᵢ − μ)/σ)² ~ χ²(1) (independence). Rather than obtaining a χ²(n), we lose one degree of freedom because we used X̄ in place of μ.
    A formal proof would use a suitable quadratic form in normal random variables (Chapter 4).

6.3 Finite sample criteria of estimators


6.3.1 Unbiasedness, Efficiency

Let θ̂ be an estimator of θ that uses a sample of size n.

Unbiasedness

    θ̂ is unbiased if E(θ̂) = θ

    Bias(θ̂) = E(θ̂) − θ

  – If samples of size n are drawn independently, then the average value of our estimates will tend to equal θ.

Efficiency

  – Efficient Unbiasedness

    If θ̂₁ and θ̂₂ are two unbiased estimators of θ, then θ̂₁ is more efficient if

    Var(θ̂₁) < Var(θ̂₂)

    We need to acknowledge that there are many unbiased estimators that make poor use of the data.

  – Mean Squared Error Efficiency

    If θ̂₁ and θ̂₂ are two estimators of θ, then θ̂₁ is more efficient if

    MSE(θ̂₁) < MSE(θ̂₂)

    This allows for a trade-off between bias and variance:

    MSE(θ̂) ≡ E[θ̂ − θ]² = Var(θ̂) + Bias²(θ̂)

    There may be biased estimators that are more precise!

Important e¢ ciency results in econometrics we will come across in our econometrics courses:

– Best Linear Unbiased Estimator (BLUE)


In the Classical Linear Regression Model, the OLS estimator is BLUE.
Among all linear, unbiased estimators the OLS estimator has the lowest variance
– Best Unbiased Estimator (BUE or MVUE)
In the Classical Linear Regression Model with normal disturbances, the OLS(=MLE)
estimator is BUE.
Among all unbiased estimators it has the lowest variance – given by the Cramer-Rao
lower-bound

See also LM Chapter 5.4 and 5.5.

6.4 Asymptotic properties of estimators


In many cases, the …rst two moments of estimators may not be known or may not even exist
(unbiased, variance?) and/or the sampling distributions for the estimator may be unknown.

– For n large enough, we’ll be interested to see whether such estimators are consistent and
have an asymptotic distribution .

6.4.1 Consistency (LLN)


Consistency: An estimator θ̂ of θ is consistent if, when the sample size increases, θ̂ gets "closer" to θ:

    plim θ̂ = θ   or   θ̂ →ᵖ θ

  – Consistency is a convergence in probability result (more details in the last Chapter).

    Sufficient condition for consistency:

    lim_{n→∞} E(θ̂) = θ (asymptotically unbiased)   and   lim_{n→∞} Var(θ̂) = 0

    These sufficient conditions ensure convergence in mean square to θ, which is stronger than the convergence in probability requirement.

X₁, X₂, …, Xₙ are n independent random variables with the same distribution, mean μ and variance σ² (i.i.d.).

  – Show X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ is a consistent estimator of μ.

  – Proof using sufficient conditions:

    E(X̄) = μ and Var(X̄) = σ²/n.
    As n → ∞:  lim_{n→∞} E(X̄) = μ  and  lim_{n→∞} Var(X̄) = 0.

    These are sufficient conditions that guarantee that the sample mean, X̄, converges to the population mean E(Xᵢ), so:

    plim( (1/n) Σᵢ₌₁ⁿ Xᵢ ) = μ

    In fact, we imposed stronger conditions than are needed for its consistency! (sufficiency)

An alternative method for proving consistency is based on Laws of large numbers (LLN).

– LLNs give conditions under which sample averages converge to their population counterparts.

Theorem Khinchine's Weak Law of Large Numbers (WLLN): If X₁, …, Xₙ are a random sample from a probability distribution with finite mean μ, then

    plim(X̄) = plim( (1/n) Σᵢ Xᵢ ) = μ

Similar regularity conditions exist such that, e.g.,

    plim( (1/n) Σᵢ Xᵢ² ) = E(Xᵢ²)   or   plim( (1/n) Σᵢ Xᵢεᵢ ) = E(Xᵢεᵢ)

  – Alternative LLNs exist for cases where the Xᵢ (εᵢ) are not i.i.d. Relaxing the dependence or heterogeneity restrictions will often involve strengthening moment restrictions. See Greene, Appendix D.

Return to the question whether X̄ is a consistent estimator of μ!

  – By the WLLN, with X₁, …, Xₙ i.i.d. random variables, as long as EXᵢ = μ is finite,

    plim X̄ = E(Xᵢ) = μ

    so consistency is established!

  – Requiring Var(Xᵢ) = σ² < ∞ indeed is stronger than needed in the above example!

  – The proof does not require us to derive Var(X̄); we just need to look at plims of averages! (plim is a nice operator – nicer than the expectation operator.)

    E.g., for the Classical Simple Linear Regression Model,

    β̂ = [ (1/n) Σᵢ (Xᵢ − X̄)(Yᵢ − Ȳ) ] / [ (1/n) Σᵢ (Xᵢ − X̄)² ],

    plim β̂ = plim( (1/n) Σᵢ (Xᵢ − X̄)(Yᵢ − Ȳ) ) / plim( (1/n) Σᵢ (Xᵢ − X̄)² ) = Cov(Xᵢ, Yᵢ)/Var(Xᵢ) = β.

68
plim operator and Slutsky’s Theorem

– The probability limits operator exhibits some nice intuitive properties: If Xn and Yn are
random variables with plim Xn = a and plim Yn = b then

plim(Xn + Yn ) = a + b
plim(Xn Yn ) = ab
plim(Xn =Yn ) = a=b provided b 6= 0

– Slutsky Theorem. For a continuous function g(Xn ) that is not a function of n

plim g(Xn ) = g(plim Xn ):


    Example: as plim X̄ = μ, we can directly show that X̄² consistently estimates μ²:

    plim X̄² = (plim X̄)² = μ²

6.4.2 Asymptotic normality (CLT)


If we do not know the …nite sample distribution of our estimator we cannot perform hypothesis
testing!

– We will then need to assume that there is a distribution we can use that approximates its
distribution arbitrarily well for su¢ ciently large sample.
– This is associated with the asymptotic property of convergence in distribution (more details
in last Chapter)
We say that the estimator b of has an asymptotic normal distribution, if the distribution
of ^ can be approximated by a normal distribution (assumes the sample size is large)
Central Limit Theorems provide the necessary regularity conditions

6.5 Methods of estimation


6.5.1 Minimum distance estimator
Suppose that the available data {Y₁, …, Yₙ} is generated by the following process:

    Yᵢ = Cᵢ(θ) + εᵢ   with E(εᵢ) = 0,

    with Cᵢ(θ) = E(Yᵢ).

Choose θ so as to minimize the distance, defined as

    D(Y, C) = Σᵢ₌₁ⁿ d(Yᵢ, Cᵢ)

  – Distance could be defined by squared differences (Least Squares) or absolute differences.

  – If Cᵢ(θ) = θ we define

    θ̂_OLS = arg min_θ Σᵢ₌₁ⁿ (Yᵢ − θ)²     ⟹ (FOC)  θ̂ = Ȳ

    θ̂_LAD = arg min_θ Σᵢ₌₁ⁿ |Yᵢ − θ|      (robust to outliers)

69
Details:

  – Least squares estimator: θ̂_OLS = arg min_θ Σᵢ₌₁ⁿ (Yᵢ − θ)². Then

    dD/dθ = Σᵢ₌₁ⁿ 2(Yᵢ − θ)(−1) = 0
    ⟹ Σᵢ₌₁ⁿ Yᵢ = Σᵢ₌₁ⁿ θ = nθ
    ⟹ θ = (Σᵢ₌₁ⁿ Yᵢ)/n   ⟹ θ̂_OLS = Ȳ.

  – Least absolute deviation estimator: θ̂_LAD = arg min_θ Σᵢ₌₁ⁿ |Yᵢ − θ|, and

    θ̂_LAD = Y_MED

    where Y_MED is the sample median (the median minimizes the average distance to all points).
This gets more interesting if the Ci ’s are not the same for all i.
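A small numerical sketch of the two minimum-distance criteria; the data values are illustrative, and the numerical minimizers agree with the mean and the median:

```python
# Minimize the sum of squared deviations (OLS) and of absolute deviations (LAD).
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1.0, 2.0, 2.5, 3.0, 10.0])     # note the outlier at 10

ols = minimize_scalar(lambda t: np.sum((y - t) ** 2)).x
lad = minimize_scalar(lambda t: np.sum(np.abs(y - t)),
                      bounds=(y.min(), y.max()), method="bounded").x

print(ols, y.mean())        # ≈ 3.7  (pulled by the outlier)
print(lad, np.median(y))    # ≈ 2.5  (robust to the outlier)
```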

6.5.2 Maximum likelihood estimator


The MLE assumes that the available data {Y₁, …, Yₙ} is generated from a specific joint density f(Y₁, …, Yₙ; θ), with θ the parameter(s) of interest.

  – Likelihood function: the joint p.d.f. of the sample expressed as a function of the parameters:

    L(θ) = f(y₁, …, yₙ; θ) =_{i.i.d.} ∏ᵢ₌₁ⁿ f(yᵢ; θ)

Choose θ so as to maximize the likelihood of observing our sample:

    θ̂_MLE = arg max_θ L(θ)   or equivalently   θ̂_MLE = arg max_θ ln L(θ)

  – Obtain θ̂_MLE from the FOC:

    ∂ ln L/∂θ |_{θ̂_MLE} = 0

Exercise: Find the MLE of θ, given a random sample Y₁, …, Yₙ, each drawn from the p.d.f.

    f(yᵢ) = θ^{yᵢ} (1 − θ)^{1−yᵢ}   for yᵢ = 0, 1;   0 otherwise.

  – Solution: Construct the joint p.d.f. (independence):

    f(y₁, …, yₙ; θ) = θ^{y₁}(1 − θ)^{1−y₁} · θ^{y₂}(1 − θ)^{1−y₂} ··· θ^{yₙ}(1 − θ)^{1−yₙ}
                    = θ^{Σᵢ yᵢ} (1 − θ)^{n − Σᵢ yᵢ}  ≡  L(θ)

    Write down the log-likelihood:

    ln L(θ) = (Σᵢ₌₁ⁿ yᵢ) ln θ + (n − Σᵢ₌₁ⁿ yᵢ) ln(1 − θ)

    Obtain θ̂_MLE from the FOC:

    ∂ ln L/∂θ |_{θ̂_MLE} = (Σᵢ yᵢ)/θ̂_MLE − (n − Σᵢ yᵢ)/(1 − θ̂_MLE) = 0   ⟹   θ̂_MLE = ȳ
@ bM LE 1 bM LE

Exercise: Let X be a normally distributed random variable, X ~ N(μ, σ²). Using a random sample of n observations, obtain the MLE estimators for μ and σ².

  – Solution: In a random sample of n observations, the density of each observation is f(xᵢ; μ, σ²). Since the n observations are independent, their joint density is:

    f(x₁, x₂, …, xₙ; μ, σ²) = ∏ᵢ₌₁ⁿ f(xᵢ; μ, σ²) = L(μ, σ² | x₁, x₂, …, xₙ)

    The probability density function of the normal distribution is:

    f(xᵢ; μ, σ²) = (1/(2πσ²)^{1/2}) e^{−(xᵢ − μ)²/(2σ²)}

    So the likelihood function is:

    L(μ, σ² | x₁, x₂, …, xₙ) = ∏ᵢ₌₁ⁿ (1/(2πσ²)^{1/2}) e^{−(xᵢ − μ)²/(2σ²)}

    Taking logs, we get:

    ln L(μ, σ² | x₁, …, xₙ) = −(n/2) ln(σ²) − (n/2) ln(2π) − (1/(2σ²)) Σᵢ₌₁ⁿ (xᵢ − μ)²
    The ML estimators are the values of μ and σ² that maximize the log-likelihood function.

    Taking the derivative with respect to μ, we get ∂ ln L/∂μ = (2/(2σ²)) Σᵢ₌₁ⁿ (xᵢ − μ). This gives us the first FOC:

    (2/(2σ̂²_MLE)) Σᵢ₌₁ⁿ (xᵢ − μ̂_MLE) = 0
    ⟺ Σᵢ₌₁ⁿ (xᵢ − μ̂_MLE) = 0   or   μ̂_MLE = (Σᵢ₌₁ⁿ xᵢ)/n = x̄

    Taking the derivative with respect to σ², we get ∂ ln L/∂σ² = −n/(2σ²) + (1/(2(σ²)²)) Σᵢ₌₁ⁿ (xᵢ − μ)². This gives us the second FOC:

    −n/(2σ̂²_MLE) + (1/(2σ̂⁴_MLE)) Σᵢ₌₁ⁿ (xᵢ − μ̂_MLE)² = 0
    ⟺ (1/σ̂²_MLE) Σᵢ₌₁ⁿ (xᵢ − μ̂_MLE)² = n
    ⟹ σ̂²_MLE = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)²
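A sketch confirming these closed-form solutions by maximizing the log-likelihood numerically (the simulated sample is an illustrative assumption):

```python
# Numerical MLE for N(mu, sigma^2) vs the closed-form mu_hat = x_bar, sigma2_hat = (1/n) sum (x_i - x_bar)^2.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
x = rng.normal(loc=3.0, scale=2.0, size=500)

def neg_loglik(params):
    mu, log_sigma2 = params                 # parametrize sigma^2 > 0 via its log
    sigma2 = np.exp(log_sigma2)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

res = minimize(neg_loglik, x0=[0.0, 0.0])
mu_hat, sigma2_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, x.mean())
print(sigma2_hat, x.var())                  # np.var uses the 1/n divisor
```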

71
Difficult exercise (may want to skip this one): Find the MLE of the scalars β and σ², given that the data has the joint distribution y ~ N(xβ, σ²Ω), where y = (y₁, …, yₙ)′ and x = (x₁, …, xₙ)′. We assume Ω is known. [Allows for dependence between observations!]

  – Solution: The joint density is now NOT the product of the marginals but instead (see Chapter 4):

    f(y) = (2πσ²)^{−n/2} (det Ω)^{−1/2} exp( −(1/(2σ²)) (y − xβ)′ Ω⁻¹ (y − xβ) )

    ln L = −(n/2) log(2π) − (n/2) log σ² − (1/2) ln(det Ω) − (1/(2σ²)) (y − xβ)′ Ω⁻¹ (y − xβ)
         = −(n/2) log(2π) − (n/2) log σ² − (1/2) ln(det Ω) − (1/(2σ²)) ( β² x′Ω⁻¹x + y′Ω⁻¹y − 2β x′Ω⁻¹y )

    after expanding the quadratic form (uses linear algebra).

    The first FOC (we will discuss a related derivation in Chapter 8 in more detail):
    ∂ ln L/∂β := −(1/(2σ̂²)) ( 2x′Ω⁻¹x β̂ − 2x′Ω⁻¹y ) = 0   ⟹   β̂ = (x′Ω⁻¹x)⁻¹ x′Ω⁻¹y
    (An estimator we call the GLS estimator of β in our econometrics courses.)

    The second FOC:
    ∂ ln L/∂σ² := −n/(2σ̂²) + (1/(2σ̂⁴)) (y − xβ̂)′Ω⁻¹(y − xβ̂) = 0   ⟹   σ̂² = (y − xβ̂)′Ω⁻¹(y − xβ̂)/n

Exercise (tricky): Suppose that the random variables Y₁, Y₂, …, Y_N are drawn from a uniform distribution so that Yᵢ ~ U(0, θ). We have that:

    L(Y₁, Y₂, …, Y_N; θ) = (1/θ)^N   for 0 ≤ Y_MAX ≤ θ;   0 if any Yᵢ ∉ [0, θ]

    What is the MLE estimator of θ?

  – Solution: Since L is not twice continuously differentiable in θ, we need to approach its maximization without differentiation. The solution is readily obtained with the aid of a picture of the likelihood L_N(θ): it is zero for θ < Y_MAX and strictly decreasing in θ for θ ≥ Y_MAX.

    Therefore the MLE estimator is θ̂ = Y_MAX. Successful maximization could not have been obtained by straightforward differentiation in this case.

Maximum Likelihood will provide di¤erent answers depending on the particular distribution we
believe the data was generated from.
See also LM Chapter 5.2

72
6.5.3 Method of moments estimator
The Method of Moments Estimator assumes that there are moments

    E(m(Y; θ)) = 0

associated with the random sample Y₁, …, Yₙ.

  – The MM estimator consists of choosing θ̂_MM in such a way that the sample analogues of these moments are satisfied.

  – θ̂_MM solves

    (1/n) Σᵢ₌₁ⁿ m(Yᵢ; θ̂_MM) = 0
If we want to estimate p parameters, we will need at least p moment conditions involving
these parameters.
If there are more moments than parameter of interest, we will consider the Generalized
Method of Moments Estimator, which optimally weights the sample moments.

Exercise: Let Y₁, …, Yₙ be a random sample drawn from a distribution with mean μ and variance σ². Obtain the MM estimator of θ = (μ, σ²).

  – Solution: We need 2 moment conditions, involving μ and σ²:

    E(Y) = μ            or    E(Y − μ) = 0
    E(Y²) = σ² + μ²           E(Y² − σ² − μ²) = 0

    Recall Var(Y) = E(Y²) − [E(Y)]².

    The method of moments estimator is given by (sample analogues):

    (1/n) Σᵢ₌₁ⁿ (Yᵢ − μ̂_MM) = 0
    (1/n) Σᵢ₌₁ⁿ Yᵢ² − σ̂²_MM − μ̂²_MM = 0

    Together they yield μ̂_MME = (1/n) Σᵢ₌₁ⁿ Yᵢ = Ȳ and σ̂²_MME = (1/n) Σᵢ₌₁ⁿ Yᵢ² − Ȳ² = (1/n) Σᵢ₌₁ⁿ (Yᵢ − Ȳ)².

Exercise: Again, suppose that the random variables W₁, W₂, …, W_N are drawn from a uniform distribution so that Wᵢ ~ U(0, θ). What is the MM estimator of θ?

  – Solution: We know that E(Wᵢ) = (0 + θ)/2 = θ/2. Thus, the MME estimator is defined by the sample moment equation: W̄ = θ̂_MME/2.
  – Implementation: Say n = 7, and {W₁, W₂, …, W₇} = {25000, 30000, 20000, 25000, 45000, 5000, 25000}. Then θ̂_MME = 2W̄ = 2·25000 = 50000. The θ̂_MLE would be 45000.
  – In fact, the MM estimator is not valid here because θ is a boundary value.

See also LM Chapter 5.2

73
6.6 Interval estimation
Regardless of the properties of an estimator, the estimate obtained will vary from sample to sample.

A point estimate is a single number, and usually we want to know more. How con…dent can we
be about our estimate (precision)?

– The expected temperature next Saturday at 4pm is 22C. Does it mean “something between
21C and 23C with very high probability” or “something between 15C and 29C with high
probability”?

An interval estimate refers to a range of values such that we can expect this interval to contain
the true parameter in some speci…ed proportion of the samples, or with some desired level of
con…dence level

Let us assume that X is a normally distributed variable, that is, X ~ N(μ, σ²).

Say we want to construct a confidence interval for μ, assuming that σ² is known.

  – The confidence interval is based on our estimator X̄ for μ, where we know that

    X̄ ~ N(μ, σ²/n)

    [Figure (confidence intervals): sampling density of X̄, with standard deviation σ/√n, centred at μ and with probability α/2 in each tail.]
  – To help us define the confidence interval, we first note

    Z = (X̄ − μ)/(σ/√n) ~ N(0, 1)

    Let us define z_{α/2} such that:

    Pr(Z > z_{α/2}) = 1 − Φ(z_{α/2}) = α/2

    with Z ~ N(0, 1). Note that, by symmetry, Pr(Z < −z_{α/2}) = α/2.

    Then

    Pr( −z_{α/2} ≤ (X̄ − μ)/(σ/√n) ≤ z_{α/2} ) = 1 − α

  – Rewriting the above equation by isolating μ in the centre of the inequalities gives:

    Pr( X̄ − z_{α/2}·σ/√n ≤ μ ≤ X̄ + z_{α/2}·σ/√n ) = 1 − α

74
That is, with 1 − α confidence,

    μ ∈ ( X̄ − z_{α/2}·σ/√n ,  X̄ + z_{α/2}·σ/√n )

How can we obtain the con…dence interval if the distribution of X is not normal?

– Remember the Central Limit Theorem: as long as n is big enough, X will be approximately
normally distributed, so the con…dence interval is asymptotically correct!
– If the sample is small, then more attention needs to be paid to the distribution of X:

What if we don't know σ²?

  – We need to use an estimator of the variance

    S² = (1/(n − 1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)²

  – To allow us to obtain the confidence interval for μ, we need to replace σ² with S² in Z, and use

    T_{n−1} = (X̄ − μ)/(S/√n)

    instead. This random variable has a Student t distribution with n − 1 degrees of freedom.
    As n becomes larger, the t distribution approaches the standard normal distribution and the confidence interval will barely change. But for smaller values of n, there is a significant difference. The t-distribution is symmetric around the mean but has thicker tails than the standard normal distribution, so that more of its area falls within the tails. While there is only one standard normal distribution, there is a t distribution for each sample size.
  – Like we did with the standard normal distribution, let's define t_{α/2, n−1} such that:

    Pr( T_{n−1} > t_{α/2, n−1} ) = α/2

    Note that, by symmetry, again Pr( T_{n−1} < −t_{α/2, n−1} ) = α/2.

  – We can write:

    Pr( −t_{α/2, n−1} ≤ (X̄ − μ)/(S/√n) ≤ t_{α/2, n−1} ) = 1 − α
    Pr( X̄ − t_{α/2, n−1}·S/√n ≤ μ ≤ X̄ + t_{α/2, n−1}·S/√n ) = 1 − α

    That is, with 1 − α confidence,

    μ ∈ ( X̄ − t_{α/2, n−1}·S/√n ,  X̄ + t_{α/2, n−1}·S/√n )

75
  – Comparing with the expression above: if we use S instead of σ, we replace the standard normal distribution with the Student t distribution with n − 1 degrees of freedom. This interval typically is larger!

Exercise: A random sample of n = 10 flashlight batteries with a mean operating life X̄ = 5h and a sample standard deviation s = 1h is picked from a production line known to produce batteries with normally distributed operating lives. Find the 95% C.I. for the unknown mean of the working life of the entire population of batteries.

  – Solution: We first find the value of t_{0.025} so that 2.5% of the area is within each tail for n − 1 = 9 degrees of freedom. This is obtained from the tables by moving down the column headed 0.025 to 9 df. The value we get is 2.262. Thus:

    μ = X̄ ± 2.262·(s/√n) = 5 ± 2.262·(1/√10) ≈ 5 ± 2.262(0.316) ≈ 5 ± 0.71

    and μ is between 4.29 and 5.71 with 95% confidence.
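The same interval computed with scipy's t quantile instead of the table (a sketch using the numbers in the exercise):

```python
# 95% confidence interval for the mean with unknown variance (t with n-1 df).
import numpy as np
from scipy.stats import t

n, xbar, s, alpha = 10, 5.0, 1.0, 0.05
t_crit = t.ppf(1 - alpha / 2, df=n - 1)        # ≈ 2.262
half_width = t_crit * s / np.sqrt(n)
print(t_crit, (xbar - half_width, xbar + half_width))   # ≈ (4.28, 5.72)
```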

See also LM Chapter 5.3

7 Hypothesis testing
7.1 Classical testing procedure
Hypothesis test: a rule that determines whether a particular value θ₀ ∈ Θ is consistent with the evidence of the sample.

  – Θ is the parameter space – the set of reasonable parameter values.
The Null Hypothesis H₀: θ = θ₀

The Alternative Hypothesis can be

  – one sided: H_A: θ > θ₀ (or H_A: θ < θ₀)

  – two sided: H_A: θ ≠ θ₀

For example, I want to test whether the mean age of a student in this course is 25.

    H₀: μ = 25
    H₁: μ ≠ 25

The test procedure is a rule, stated in terms of the data, that dictates whether the null hypothesis
should be rejected or not.

The classical, or Neyman-Pearson, methodology involves partitioning the sample space into two
regions.

76
  – If the observed data (i.e., the test statistic) fall in the rejection region, then we reject H₀;

  – If the test statistic falls in the acceptance region, then we do not reject H₀.

– The rejection region is de…ned by our willingness to commit a type I error (signi…cance level)!

The classical testing procedure therefore consists of the following steps:

1. De…ne H0 and H1 .
2. Formulate a test statistic

A test statistic is a random variable which


– is computable from the data and does not comprise any unknown quantities.
– has a well de…ned distribution needed to de…ne the rejection region.

3. Partition the sample space into the rejection region and the acceptance region. How?
4. Reject H₀ if the test statistic falls in the rejection region; do not reject H₀ if the test statistic does not fall in the rejection region.

Do not use the terminology: "accept H0 "!

See also LM Chapter 6.2

7.1.1 Type of errors


The rejection region is de…ned by our willingness to commit a type I error!

Since the sample is random, the test statistic is also random. The same test can lead to di¤erent
conclusions in di¤erent samples. As such, there are two ways a procedure can be in error:

– Type I error: The procedure can lead to the rejection of the null when it is true.
– Type II error: The procedure can fail to reject the null when it is false.

                       Do not reject H₀          Reject H₀
    H₀ true                   ✓                  Type I error
    H₀ false             Type II error                ✓

See also LM Chapter 6.4 (with exercises)

77
7.1.2 Significance level and power of a test

Significance level (α) = Probability of Type I error

  – The level of significance is under the control of the analyst (typically set equal to 5%).
  – We also call this the size of the test (under control by the analyst).
  – Given H_A, this allows us to determine when we want to start rejecting H₀ (recognizing that, when the null is true, we reject it only with probability α).
    What is our willingness to erroneously reject a null?

Power of a test = 1 − Probability of Type II error

  – For a given significance level α, we would like the power of our test to be as large as possible; in other words, we want β (the probability of a Type II error) to be as small as possible.
    The power of a test is the ability of the test to reject when the null is false!
    To ensure that our tests are powerful, we want to make use of efficient estimators!
  – The power and the probability of a Type II error (β) are defined in terms of the alternative hypothesis and depend on the value of the parameter.
    The power of the test H₀: μ = 50 (mean age of EC400 students) when μ = 25 (truth) should be close to 1, i.e., we should really reject μ = 50 for any sample! The power of the test H₀: μ = 26 is much smaller.

7.2 Test of the mean (variance known)


Let us …rst consider a one-sided test of the mean

A petroleum company is searching for additives that may increase gas mileage (example from LM
- section 6.2)

– They send 30 cars with the new additive on a road trip from Boston to Los Angeles.
Without the additive: X N (25; (2:4)2 ) mpg.
    With the new additive: X ~ N(μ, (2.4)²) mpg.
  – We want to know whether μ > 25.
    Hypotheses: H₀: μ = 25,  H₁: μ > 25.
  – Use X̄ as an estimator for μ. Reject when our estimate x̄ is too big (relative to 25).

  – The sampling distribution of X̄ under H₀ will determine what it means for x̄ to be too big.
    Under H₀:  X̄ ~ N(25, σ²/n)

– More precisely, our willingness to commit a Type I error with probability (signi…cance
level), will tell us when to reject.

We should reject when x̄ > x*, where x* is defined by

      Pr(X̄ > x*) = α

  [Figure: pdf of X̄ under H0, with mean μ and standard deviation σ/√n; the rejection region lies to the right of x*.]

How to find x*?

– Pr(X̄ > x*) equals Pr( (X̄ − 25)/(σ/√n) > (x* − 25)/(σ/√n) ) = Pr( (X̄ − 25)/(σ/√n) > z* ), with

      z* = (x* − 25)/(σ/√n)

– Under H0: (X̄ − 25)/(σ/√n) = Z ~ N(0, 1), so we use the N(0, 1) table to obtain z* from

      Pr(Z > z*) = α

  With α = 5%, the statistical table of N(0, 1) yields z* = 1.645.

– Given z*, we can compute x* = 25 + z* · σ/√n.
  With α = 5%, x* = 25 + 1.645 · 2.4/√30 = 25.72.

7.2.1 The z-statistic


The above reveals that when testing the mean of a normal population with known variance we should use the Z statistic

      Z = (X̄ − 25)/(σ/√n)

– It is a statistic, as it does not contain any unknown parameters, and it has a distribution for which we can easily find the critical values!

      Z ~ N(0, 1) under H0

– Reject H0 if z > z_α, where Φ(z_α) = 1 − α.


– In the example (the 30 cars with the additive averaged x̄ = 26.3 mpg)

      z = (26.3 − 25)/(2.4/√30) = 2.97

  which is greater than 1.645, so we reject H0 at the 5% level of significance.
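As an illustration, the calculation above can be reproduced numerically. The following is a minimal sketch in Python (assuming numpy/scipy are available); the numbers are those of the gas-mileage example, and the variable names are ours.

    # One-sided z-test of H0: mu = 25 against H1: mu > 25, sigma known (illustrative sketch)
    import math
    from scipy.stats import norm

    mu0, sigma, n, xbar, alpha = 25.0, 2.4, 30, 26.3, 0.05
    z_crit = norm.ppf(1 - alpha)                      # critical value z_alpha = 1.645
    z_stat = (xbar - mu0) / (sigma / math.sqrt(n))    # observed z = 2.97
    x_star = mu0 + z_crit * sigma / math.sqrt(n)      # rejection threshold x* = 25.72
    print(z_crit, z_stat, x_star, z_stat > z_crit)    # True: reject H0 at the 5% level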

7.2.2 The p-value
Given n, σ and x̄, we could have asked the question: which levels of significance (α) would lead us to reject H0?

Definition  The p-value provides the lowest level of significance (α) at which we would reject the null.

– After calculating the Z-statistic for our sample, z = (x̄ − 25)/(σ/√n), the p-value is defined by:

      P(Z > z) = 1 − Φ(z)

  In the example, z = (26.3 − 25)/(2.4/√30) = 2.97. From the standard normal table Φ(2.97) = 0.9985, so the p-value is

      1 − Φ(2.97) = 0.15%

  As we had decided to reject H0 at the 5% level, we observe that the p-value is smaller than 5%. We would also have rejected the hypothesis when α = 1%!

Now we report the analogous results when we consider the two-sided alternative

      H0: μ = 25, H1: μ ≠ 25

Again, use X̄ as our estimator of μ.

Here, we would like to reject H0 if x̄ is too high or too low! Pr(X̄ > xH) = α/2 and Pr(X̄ < xL) = α/2.

– To find xH and xL, we proceed as before.

  [Figure: probability distribution of X̄ under H0.]

– We standardize the distribution of X̄ under the null, to facilitate obtaining xH and xL:

      (X̄ − 25)/(σ/√n) ~ N(0, 1) under H0

      Pr(X̄ > xH) = Pr( (X̄ − 25)/(σ/√n) > (xH − 25)/(σ/√n) ) = 1 − Φ(zH) = α/2
      Pr(X̄ < xL) = Pr( (X̄ − 25)/(σ/√n) < (xL − 25)/(σ/√n) ) = Φ(zL) = α/2

– zH solves Φ(zH) = 1 − α/2 and zL solves Φ(zL) = α/2.
  Note that zH = −zL = z_{α/2} (normal distribution centered around zero; we define z_{α/2} > 0).

– xL = 25 − z_{α/2} · σ/√n,  xH = 25 + z_{α/2} · σ/√n

As before, the Z-statistic for testing the mean of a normal population with known variance is

      Z = (X̄ − 25)/(σ/√n) ~ N(0, 1) under H0

– Now reject H0 if z > z_{α/2} or if z < −z_{α/2}, where Φ(z_{α/2}) = 1 − α/2.

– In the example

      z = (26.3 − 25)/(2.4/√30) = 2.97

  which is greater than 1.96 (α = 5%), so we reject H0.

The p-value is defined by:

      2 P(Z > |z|);  Example: 2 (1 − Φ(2.97)) = 0.30%

– If H0 is true, Pr(|X̄ − 25| > 1.3) = 0.30%.
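A minimal sketch of the two-sided calculation (same example, scipy assumed available); it returns the one-sided and two-sided p-values of 0.15% and 0.30%.

    # Two-sided p-value: p = 2 * P(Z > |z|) (illustrative sketch)
    import math
    from scipy.stats import norm

    mu0, sigma, n, xbar = 25.0, 2.4, 30, 26.3
    z = (xbar - mu0) / (sigma / math.sqrt(n))         # 2.97
    p_one_sided = 1 - norm.cdf(z)                     # about 0.0015 (0.15%)
    p_two_sided = 2 * (1 - norm.cdf(abs(z)))          # about 0.0030 (0.30%)
    print(p_one_sided, p_two_sided)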

Summing up (see also LM Chapter 6.2):

Let z_α be such that Pr(Z > z_α) = α and, by symmetry, Pr(Z < −z_α) = α. The null hypothesis is H0: μ = μ0. The variance of the distribution is known.
If the alternative hypothesis is:

– H1: μ > μ0, we reject H0 at the α level of significance if z > z_α.
– H1: μ < μ0, we reject H0 at the α level of significance if z < −z_α.
– H1: μ ≠ μ0, we reject H0 at the α level of significance if z > z_{α/2} or z < −z_{α/2}.

7.2.3 Power of the test


We recall the power of the test is defined as Pr(Reject H0 | H0 is false).

Consider the example (assume σ² known):

      H0: μ = 25 = μ0
      H1: μ = μ1 (say μ1 > μ0, i.e. one-sided)

Our test: reject if Z = (X̄ − 25)/(σ/√n) > z_α (say α = 5%).

The power of the test depends on the true value of μ, μ1:

      Power(μ1) = Pr( (X̄ − 25)/(σ/√n) > z_α | μ = μ1 )

– To compute this, we need to realize that if the true mean is μ1, Z no longer has a standard normal distribution! With μ1 = 26:

      Pr( (X̄ − 25)/(σ/√n) > z_α | μ = μ1 ) = Pr( (X̄ − 26)/(σ/√n) > z_α + (25 − 26)/(σ/√n) )

  Since under H1: (X̄ − 26)/(σ/√n) ~ N(0, 1), we can obtain the power by

      1 − Φ( z_α + (25 − 26)/(2.4/√30) ) = 1 − Φ( 1.645 + (25 − 26)/(2.4/√30) ) = 74%

  (A numerical sketch of this calculation follows at the end of this subsection.)

The power of the test is affected by

– The precision of our estimator, i.e., the variance of the distribution of X̄ (σ²/n).
  A more precise estimator (lower variance) increases the test statistic, i.e., makes it easier to reject a null (so also when the null is false).
– The particular alternative we consider.
  The further the truth lies from the hypothesis we are testing, the more likely we are to reject the null!
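The numerical sketch referred to above: the power of the one-sided 5% test of H0: μ = 25 when the true mean is μ1 = 26 (Python, scipy assumed available).

    # Power of the one-sided z-test when the true mean is mu1 (illustrative sketch)
    import math
    from scipy.stats import norm

    mu0, mu1, sigma, n, alpha = 25.0, 26.0, 2.4, 30, 0.05
    z_crit = norm.ppf(1 - alpha)
    shift = (mu0 - mu1) / (sigma / math.sqrt(n))      # (25 - 26)/(2.4/sqrt(30))
    power = 1 - norm.cdf(z_crit + shift)              # about 0.74
    print(power)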

7.3 Test of the mean (variance unknown)


7.3.1 The t-statistic
We have assumed that σ is known. In practice, that's very rarely the case. What if we do not know σ?

– We need to realize that the Z-statistic can no longer be computed, as it contains the unknown quantity σ: not a valid test statistic.
– We need to replace σ with an estimator of the standard deviation:

      S_X = sqrt( (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄)² )

– Our new test statistic becomes

      T = (X̄ − 25)/(S_X/√n) = (X̄ − 25)/SE(X̄) ~ t(n−1) under H0

  SE(X̄) is the standard error of the estimator X̄ (square root of the estimated variance of X̄).
  It has a Student t distribution with n − 1 degrees of freedom.
  If n is big, the Student t distribution is very close to the standard normal distribution. But for smaller values of n, there is a significant difference: the Student t distribution has thicker tails.

Lemma  Let X1, X2, ..., Xn be a random sample of size n from N(μ, σ²). Then

      √n (X̄ − μ) / S_X ~ t(n−1)

where X̄ ~ N(μ, σ²/n) and S_X² = (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄)².

Proof: Rewrite

      √n (X̄ − μ) / S_X = [ √n (X̄ − μ)/σ ] / sqrt( [(n−1) S_X²/σ²] / (n−1) )

where the numerator is N(0, 1) and (n−1) S_X²/σ² ~ χ²(n−1).
Using the independence of X̄ and S_X² (not shown here), we have the required formulation of a t(n−1) random variable.

Test of the mean – Summary

Let X1, ..., Xn be drawn from N(μ, σ²).

– Consider the null hypothesis H0: μ = μ0, and use X̄ as our estimator for μ.
– If σ² is known, the test statistic is

      Z = (X̄ − μ0)/(σ/√n) ~ N(0, 1) under H0

– If σ² is unknown, the test statistic is

      T = (X̄ − μ0)/(S/√n) = (X̄ − μ0)/SE(X̄) ~ t(n−1) under H0

Depending on whether σ² is known or not, we obtain critical values from N(0, 1) or t(n−1). Given the alternative hypothesis (one-sided or two-sided) we define a rejection region where we will reject H0.
A one-sided test has the benefit of improved power.
See also LM Chapter 7.2-7.4.
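A minimal sketch of the t-test when σ² is unknown; the data vector below is hypothetical, used only to make the snippet self-contained.

    # One-sided t-test of H0: mu = mu0 with unknown variance (illustrative sketch)
    import numpy as np
    from scipy.stats import t

    x = np.array([24.1, 27.3, 26.8, 25.9, 28.2, 26.0, 25.5, 27.1])   # hypothetical sample
    mu0, alpha = 25.0, 0.05
    n, xbar, s = len(x), x.mean(), x.std(ddof=1)      # s uses the n-1 divisor
    t_stat = (xbar - mu0) / (s / np.sqrt(n))
    t_crit = t.ppf(1 - alpha, df=n - 1)               # critical value from t(n-1)
    print(t_stat, t_crit, t_stat > t_crit)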

7.4 Test of the variance


Let X1, ..., Xn be drawn from N(μ, σ²). Consider the null hypothesis H0: σ² = σ0².

– The estimator on which we base our test is S_X².
– The test statistic is given by

      χ² = (n − 1) S_X² / σ0² ~ χ²(n−1) under H0

  It has a Chi-squared distribution with n − 1 degrees of freedom.

– If H1: σ² > σ0², reject H0 at significance level α if χ² > χ²_α, where Pr( χ²(n−1) > χ²_α ) = α.
– If H1: σ² < σ0², reject H0 at significance level α if χ² < χ²_{1−α}.
– If H1: σ² ≠ σ0², reject H0 at significance level α if χ² < χ²_{1−α/2} or χ² > χ²_{α/2}.

See also LM Chapter 7.5.
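A minimal sketch of the variance test (two-sided), using the same hypothetical data as above; the hypothesized value σ0² = 4 is an assumption of the illustration.

    # Chi-squared test of H0: sigma^2 = sigma0^2 (illustrative sketch)
    import numpy as np
    from scipy.stats import chi2

    x = np.array([24.1, 27.3, 26.8, 25.9, 28.2, 26.0, 25.5, 27.1])   # hypothetical sample
    sigma0_sq, alpha = 4.0, 0.05
    n, s_sq = len(x), x.var(ddof=1)
    stat = (n - 1) * s_sq / sigma0_sq                 # ~ chi2(n-1) under H0
    lo, hi = chi2.ppf(alpha / 2, df=n - 1), chi2.ppf(1 - alpha / 2, df=n - 1)
    print(stat, (lo, hi), stat < lo or stat > hi)     # True means reject H0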

7.5 Hypothesis testing and confidence intervals

Rather than reporting point estimates, θ̂, we may want to report a confidence interval for the parameter(s) of interest.

Definition  A 100(1 − α)% confidence interval for θ is an interval [ lower(θ̂), upper(θ̂) ] such that

      Pr( lower(θ̂) ≤ θ ≤ upper(θ̂) ) = 1 − α

– Often the confidence interval is given by

      [ θ̂ − C_{α/2} SE(θ̂), θ̂ + C_{α/2} SE(θ̂) ]

  with C_{α/2} the critical value taken from the appropriate distribution, and SE(θ̂) an (estimate of) the standard deviation of θ̂.

If a hypothesized value of the parameter does not fall in this range of plausible values, then the data are not consistent with the hypothesis, and it should be rejected.

Example: confidence interval of the mean of a normal population

– X1, ..., Xn randomly drawn from N(μ, σ²), σ² not known. Specify the (1 − α)·100% confidence interval for μ.
– The estimator on which the C.I. is based: X̄ ~ N(μ, σ²/n).
– Use (X̄ − μ)/(S_X/√n) ~ t(n−1), and Student-t tables to specify t_{α/2} such that

      Pr( −t_{α/2} < (X̄ − μ)/(S_X/√n) < t_{α/2} ) = 1 − α

– The confidence interval of μ is obtained by rewriting:

      Pr( X̄ − t_{α/2} S_X/√n ≤ μ ≤ X̄ + t_{α/2} S_X/√n ) = 1 − α

  That is, with 1 − α confidence, μ ∈ [ X̄ − t_{α/2} S_X/√n, X̄ + t_{α/2} S_X/√n ].
  For a particular sample we can obtain this confidence interval!

Let us look at our empirical example:

      H0: μ = 25, H1: μ ≠ 25
      The sample size is n = 30, and x̄ = 26.3.

– When σ is known (σ = 2.4):

      [ x̄ − z_{α/2} σ/√n, x̄ + z_{α/2} σ/√n ] = [ 26.3 − 1.96 · 2.4/√30, 26.3 + 1.96 · 2.4/√30 ]
                                              = [25.4, 27.2]

  The confidence interval shows, e.g., that H0: μ = 25 needs to be rejected at significance level 5%, since 25 does not lie in the 95% confidence interval.

– When σ² is unknown:

      [ x̄ − t_{α/2, n−1} s/√n, x̄ + t_{α/2, n−1} s/√n ] = [ 26.3 − 2.045 s_X/√30, 26.3 + 2.045 s_X/√30 ]

  This interval typically is larger, as it needs to account for the fact that σ² needs to be estimated as well!
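A minimal sketch of both confidence intervals in the example (Python, scipy assumed available); in the unknown-variance case the sample standard deviation s = 2.6 is hypothetical, since the example does not report it.

    # 95% confidence intervals for the mean, known and unknown sigma (illustrative sketch)
    import math
    from scipy.stats import norm, t

    n, xbar, alpha = 30, 26.3, 0.05
    sigma = 2.4                                       # known-variance case
    z = norm.ppf(1 - alpha / 2)                       # 1.96
    ci_z = (xbar - z * sigma / math.sqrt(n), xbar + z * sigma / math.sqrt(n))   # [25.4, 27.2]

    s = 2.6                                           # hypothetical sample standard deviation
    tc = t.ppf(1 - alpha / 2, df=n - 1)               # 2.045
    ci_t = (xbar - tc * s / math.sqrt(n), xbar + tc * s / math.sqrt(n))
    print(ci_z, ci_t)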

8 The classical linear regression model (2018: mostly self-study)
8.1 Multiple linear regression model
Study the causal relationship between a dependent variable, y, and one or more independent variables, x1, ..., xk:

      y = x1 β1 + ... + xk βk + ε

– Underlying economic theory will specify the dependent and independent variables in the model; ε is a random disturbance.

– Given a sample {(yi, xi1, ..., xik)}_{i=1}^{n}, the objective is to estimate the unknown parameters, study theoretical propositions, and use the model to predict the variable y.
  How to proceed depends on the assumptions we are happy to make concerning the stochastic process that generated our data.

Gauss-Markov Assumptions: Classical Linear Regression model

Under our GM assumptions:

– βj provides the causal marginal effect the explanatory variable xj has on the conditional expectation of y, ceteris paribus.

  Economic theory should guide us whether this is indeed a causal relationship.

– OLS provides the BLUE estimator for β1, ..., βk.

– For statistical inference, it is usual to add the assumption that the disturbances come from a normal distribution (exact hypothesis testing, t and F tests).
  GM + normality renders OLS the MVUE estimator!

8.1.1 Multiple Linear Regression Model –Matrix Notation (Non-examinable)


Our multiple linear regression model assumes that our sample {(yi, xi1, ..., xik)}_{i=1}^{n} satisfies:

      yi = β1 xi1 + β2 xi2 + ... + βk xik + εi

The ith observation of the model can be rewritten as:

      yi = xi'β + εi,   where β = (β1, β2, ..., βk)'  and  xi = (xi1, xi2, ..., xik)'

– β is a vector of (fixed) parameters.
– xi is a vector of characteristics of individual i.
  NB. xi'β = β'xi (scalar).

Our model has n observations, so let us stack them:

      (y1, y2, ..., yn)' = (x1'β + ε1, x2'β + ε2, ..., xn'β + εn)'

– As β appears for all observations, this can be simplified as

      y = Xβ + ε
      (n×1) = (n×k)(k×1) + (n×1)

  with X the n×k matrix whose ith row is xi'.

– y and ε are the n×1 dimensional vectors y = (y1, ..., yn)' and ε = (ε1, ..., εn)'.
– X is like an Excel spreadsheet. The rows represent the individual observations while the columns represent the various explanatory variables in our model.

8.2 Gauss-Markov assumptions


A1 Linear in the parameters

      yi = xi1 β1 + ... + xik βk + εi

  Using matrix notation, this may be simplified as y = Xβ + ε, where β is a k-dimensional vector and X is an n × k dimensional matrix whose columns denote the k explanatory variables.
  In the simple linear regression model yi = β1 + xi β2 + εi, the first column of X contains ones, and the second column of X contains the regressor xi.

A2 No Perfect Collinearity

  There are no exact linear relationships among the independent variables.
  Equivalent to the assumption that X has full column rank.
  – This ensures that X'X is invertible.
  – In the simple linear regression model: Σ_{i=1}^{n} (xi − x̄)² > 0.

A3 Exogeneity (strict) of the independent variables

      E(εi | x1, x2, ..., xk) = 0,  i.e.,  E(εi | X) = 0

  This assumption guarantees that we can interpret the regression of y on X as the conditional mean of y: E(yi | X) = xi1 β1 + ... + xik βk.
  Omission of relevant factors (in ε) that are correlated with any of x1, ..., xk forms a violation of this assumption.
  Any correlation between the errors and regressors violates A3! (Example: measurement error of explanatory variables.)
  – When we have random sampling (cross-sectional) we only need to worry about correlation between εi and characteristics of individual i.
  – In time series data (where random sampling is quite unreasonable) we need to worry about correlation between εt and future values of the explanatory variables xt+1, xt+2, ... as well.

A4 Homoskedasticity and nonautocorrelation

      Var(εi | x1, ..., xk) = σ²  and  Cov(εi, εj | x1, ..., xk) = 0 for all i ≠ j;
      together with A3 this gives Var(ε | X) = E(εε' | X) = σ² I

  The presence of heteroskedasticity (σi² ≠ σ²) and dependence among disturbances (time, spatial dependence) are commonplace in economics.

Concerning our GM assumptions, we may want to distinguish two data generating processes for the regressors:

– Fixed (non-stochastic) regressors
  Under repeated sampling the regressors remain fixed, as would be the case in an experiment.
  If X is fixed, E(εi | X) = E(εi), and A3 reduces to E(εi) = 0. Similarly A4 reduces to Var(εi) = σ² and Cov(εi, εj) = 0 for i ≠ j.
– Random (stochastic) regressors
  In obtaining a new sample {(yi, xi1, ..., xik)}_{i=1}^{n} it is difficult in general to control (keep fixed) the x's!
  If ε and X are (mean) independent, we also have E(εi | X) = E(εi)!

The assumption E(εi | X) = 0 is important for establishing the unbiasedness of our OLS estimator! OLS may still be consistent if this assumption fails (as long as E(xi εi) = 0).

8.3 Estimation
8.3.1 Minimum distance: ordinary least squares
      yi = β0 + xi β1 + εi,   E(εi | X) = 0 and E(εε' | X) = σ² I

OLS estimates β0 and β1 by minimizing the sum of squares of the vertical distances from the data points to the fitted regression line.

  [Figure: scatter of the data with the fitted regression line; the residual ε̂i = yi − (β̂0 + β̂1 xi) is the vertical distance from (xi, yi) to the line.]

– Given {(xi, yi)}_{i=1}^{n}, it determines the straight line that minimizes:

      S(β0, β1) = Σ_{i=1}^{n} [yi − (β0 + β1 xi)]²

The fitted values are

      ŷi = β̂0 + β̂1 xi

and the residuals are defined as:

      ε̂i = yi − ŷi = yi − (β̂0 + β̂1 xi)

The first order conditions with respect to β0 and β1 are given by

      ∂S/∂β0 :  −2 Σ_{i=1}^{n} (yi − (β̂0 + β̂1 xi)) = 0
      ∂S/∂β1 :  −2 Σ_{i=1}^{n} (yi − (β̂0 + β̂1 xi)) xi = 0

– Solving with respect to β̂0 and β̂1 yields (show!)

      β̂0 = ȳ − β̂1 x̄
      β̂1 = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)²  =  SampleCov(xi, yi) / SampleVar(xi)

The (unbiased) estimator of the error variance σ², here (2 parameters),

      s² = Σ_{i=1}^{n} ε̂i² / (n − 2) = RSS / (n − 2)

Important property of OLS residuals: the residuals are orthogonal to the regressors!

– We observe this from rewriting the FOC as:

      ∂S/∂β0 :  −2 Σ_{i=1}^{n} ε̂i = 0
      ∂S/∂β1 :  −2 Σ_{i=1}^{n} ε̂i xi = 0

  with ε̂i = yi − β̂0 − β̂1 xi.

– This property is in line with the classical linear regression assumption: errors and regressors are uncorrelated!

      (1/n) Σ_{i=1}^{n} ε̂i = 0  is the sample analogue of  E(εi) = 0
      (1/n) Σ_{i=1}^{n} ε̂i xi = 0  is the sample analogue of  E(εi xi) = 0
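A minimal sketch of the closed-form OLS estimates for the simple regression, on simulated data (the true coefficients 1 and 2 and the error scale are assumptions of the illustration); it also checks the orthogonality of residuals and regressors.

    # Closed-form OLS for y = b0 + b1*x + e (illustrative sketch on simulated data)
    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # SampleCov(x, y) / SampleVar(x)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    s2 = resid @ resid / (n - 2)                      # unbiased estimator of sigma^2
    print(b0, b1, s2, resid.sum(), (resid * x).sum()) # last two are (numerically) zero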

8.3.2 Method of moments


This estimator looks directly at the GM assumptions:

      E(εi) = 0,  E(xi εi) = 0  and  E(εi²) = σ²

where εi = yi − β0 − β1 xi.

– The MM estimator estimates β0, β1 and σ² by enforcing the sample analogues of these population moments:

      (1/n) Σ_{i=1}^{n} ε̂i = 0
      (1/n) Σ_{i=1}^{n} ε̂i xi = 0          where ε̂i = yi − β̂0,MM − β̂1,MM xi
      (1/n) Σ_{i=1}^{n} ε̂i² = σ̂²_MM

– The first two conditions are the same as the FOC of OLS ⇒ β̂_MM = β̂_OLS.
– The last condition determines the MM estimator for σ².
  We use exactly as many moment conditions as we have unknown parameters!

8.3.3 Maximum likelihood


To use the MLE procedure to estimate β0, β1 and σ² we need to be able to determine the full (conditional) density of the data. Say

A5  ε | X ~ N(0, σ² I)   (normally distributed errors)

Given A5, conditional on X:

– yi is independent normal with mean β0 + β1 xi and variance σ².

The (conditional) joint density of the observations y1, y2, y3, ..., yn (which defines the likelihood function) is:

      L = Π_{i=1}^{n} (1/√(2πσ²)) exp( −½ ([yi − β0 − β1 xi]/σ)² )

– The MLE estimator estimates β0, β1 and σ² by maximizing the (log-)likelihood:

      ln L(β0, β1, σ²) = −(n/2) ln(2π) − (n/2) ln σ² − (1/(2σ²)) Σ_{i=1}^{n} (yi − (β0 + β1 xi))²

The first order conditions are:

      ∂/∂β0 :  (1/σ̂²) Σ_{i=1}^{n} (yi − (β̂0 + β̂1 xi)) = 0        or  (1/σ̂²) Σ_{i=1}^{n} ε̂i = 0
      ∂/∂β1 :  (1/σ̂²) Σ_{i=1}^{n} (yi − (β̂0 + β̂1 xi)) xi = 0     or  (1/σ̂²) Σ_{i=1}^{n} ε̂i xi = 0
      ∂/∂σ² :  −n/(2σ̂²) + (1/(2σ̂⁴)) Σ_{i=1}^{n} (yi − (β̂0 + β̂1 xi))² = 0    or  −n/(2σ̂²) + (1/(2σ̂⁴)) Σ_{i=1}^{n} ε̂i² = 0

  with ε̂i = yi − β̂0,MLE − β̂1,MLE xi.

– The first two FOC are the same as those from OLS ⇒ β̂_MLE = β̂_OLS.
– The last FOC yields σ̂²_MLE = (1/n) Σ_{i=1}^{n} ε̂i² = RSS/n = σ̂²_MM.
– These MLE estimates critically depend on the joint normality assumption. Other MLE estimates would result if a different distributional assumption was made.
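A minimal sketch confirming the MLE result numerically: maximizing the normal log-likelihood gives the OLS coefficients and σ̂²_MLE ≈ RSS/n (data simulated as in the OLS sketch above; scipy assumed available).

    # Numerical MLE of (b0, b1, sigma) under normal errors (illustrative sketch)
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    n = 200
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

    def neg_loglik(theta):
        b0, b1, log_sigma = theta
        sigma = np.exp(log_sigma)                     # keeps sigma positive
        r = y - b0 - b1 * x
        return 0.5 * n * np.log(2 * np.pi * sigma**2) + 0.5 * (r @ r) / sigma**2

    res = minimize(neg_loglik, x0=np.zeros(3))
    b0_mle, b1_mle, sigma_mle = res.x[0], res.x[1], np.exp(res.x[2])
    print(b0_mle, b1_mle, sigma_mle**2)               # approximately RSS/n from the OLS fit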

8.4 Properties of OLS estimators in CLM
8.4.1 Unbiased, efficient (BLUE), consistent
Consider β̂1 = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)² in the simple linear regression model.

A1–A3 ensure β̂1 is unbiased (need to show E[β̂1] = β1):

– A2 ensures β̂1 is well defined (no perfect collinearity).

– Using A1, we can rewrite our estimator as

      β̂1 = β1 + Σ_{i=1}^{n} (xi − x̄) εi / Σ_{i=1}^{n} (xi − x̄)²  =  β1 + Σ_{i=1}^{n} di εi    with   di = (xi − x̄) / Σ_{j=1}^{n} (xj − x̄)²

– Next we take expectations:

  If X is fixed (non-stochastic), the di are fixed and

      E[β̂1] = β1 + Σ_{i=1}^{n} di E(εi) = β1   by A3

  If X is stochastic, we need to use the law of iterated expectations. To take di out of E(·), we need to condition on all x's:

      E[β̂1 | X] = β1 + Σ_{i=1}^{n} di E(εi | X) = β1   by A3

  To conclude the unbiasedness: E[β̂1] = E_X[ E(β̂1 | X) ] = β1.

Let X be fixed: A1–A4 ensure that Var(β̂1) = σ² / Σ_{i=1}^{n} (xi − x̄)² = σ² / ((n − 1) s²_X).

– By definition

      Var(β̂1) = Var( β1 + Σ_{i=1}^{n} di εi ) = Σ_{i=1}^{n} Var(di εi) + Σ_{i≠j} Cov(di εi, dj εj)
               = Σ_{i=1}^{n} di² Var(εi) + Σ_{i≠j} di dj Cov(εi, εj)

  A4 yields Var(β̂1) = σ² Σ_{i=1}^{n} di², and plugging the definition of di in gives the desired answer.

– If X is stochastic: A1–A4 ensure that Var(β̂1) = E_X[ σ² / Σ_{i=1}^{n} (xi − x̄)² ].
  This uses the relation between conditional and unconditional variance. Note Var(E_X(β̂1 | X)) = Var(β1) = 0.

– The precision of our estimate of β1 is enhanced by
  a larger sample size (n),
  more variability of the x regressors (s²_X),
  and a smaller error variance (σ²)!

Consistent (need to show plim β̂1 = β1):

– Using sufficient conditions:

  X fixed: If lim_{n→∞} (1/n) Σ_{i=1}^{n} (xi − x̄)² = σ²_x > 0, then Var(β̂1) = (σ²/n) / ( (1/n) Σ_{i=1}^{n} (xi − x̄)² ) → 0 while E[β̂1] = β1.

– Using the law of large numbers:

  X stochastic: Recall β̂1 = SampleCov(xi, yi)/SampleVar(xi). By the WLLN, plim β̂1 = Cov(xi, yi)/Var(xi) (sample averages converge to the population analogues). As Cov(xi, yi) = Cov(xi, β0 + β1 xi + εi) = β1 Var(xi) + Cov(xi, εi): plim β̂1 = β1 + Cov(xi, εi)/Var(xi). Provided Var(xi) > 0, Cov(xi, εi) = 0 then ensures consistency of OLS!

By the Gauss-Markov Theorem: β̂1 is the BLUE of β1.

Sampling distribution. Adding A5 ensures

      β̂1 | X ~ N( β1, σ² / Σ_{i=1}^{n} (xi − x̄)² )
      (n − 2) s²/σ² ~ χ²(n−2)
      β̂1 and s² are independent

– The normal distribution of β̂1 follows directly as it is a linear combination of normal random variables: β̂1 = β1 + Σ di εi.
– The latter two results are important in the derivation of the well known t test (discussed in our econometrics courses).
– Even if A5 is not satisfied, these results are still valid asymptotically!
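A Monte Carlo sketch of these properties: with the regressors held fixed across replications, the simulated mean of β̂1 is (approximately) the true β1 and its simulated variance is close to σ²/Σ(xi − x̄)² (all design values below are assumptions of the illustration).

    # Sampling distribution of the OLS slope under fixed regressors (illustrative sketch)
    import numpy as np

    rng = np.random.default_rng(1)
    n, beta1, sigma = 50, 2.0, 1.0
    x = rng.normal(size=n)                            # regressors, kept fixed below
    sxx = ((x - x.mean())**2).sum()

    draws = []
    for _ in range(5000):
        y = 1.0 + beta1 * x + rng.normal(scale=sigma, size=n)
        draws.append(((x - x.mean()) * (y - y.mean())).sum() / sxx)

    draws = np.array(draws)
    print(draws.mean(), draws.var(), sigma**2 / sxx)  # mean ~ beta1; variances agree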

8.5 Derivation OLS estimator using Matrix notation (Non-examinable)


      yi = xi1 β1 + ... + xik βk + εi = xi'β + εi

Given {(xi, yi)}_{i=1}^{n}, we want to determine the vector β that minimizes:

      S(β) = Σ_{i=1}^{n} [yi − xi'β]² = (y − Xβ)'(y − Xβ)

– We need to solve the system of k first order conditions ∂S(β)/∂β1 = 0, ..., ∂S(β)/∂βk = 0 for β̂1, ..., β̂k:

      ∂S(β)/∂β |_{β̂} = ( ∂S(β)/∂β1, ..., ∂S(β)/∂βk )' |_{β̂} = 0

– A natural extension of the FOC of the simple linear regression model yields:

      ( Σ_{i=1}^{n} xi1 ε̂i, ..., Σ_{i=1}^{n} xik ε̂i )' = 0   with  ε̂i = yi − xi'β̂

  In matrix notation, this equals X'ε̂ = 0 with ε̂ = y − Xβ̂:

      X'(y − Xβ̂) = 0  ⇒  X'X β̂ = X'y  ⇒ (by A2)  β̂ = (X'X)⁻¹ X'y
Let us consider the vector of derivatives ∂S(β)/∂β directly:

      S(β) = (y − Xβ)'(y − Xβ)
           = (y' − β'X')(y − Xβ) = y'y − β'X'y − y'Xβ + β'X'Xβ
           = y'y − 2β'X'y + β'X'Xβ            since β'X'y = (y'Xβ)'

– ∂(y'y)/∂β = 0
– ∂(−2β'X'y)/∂β = −2X'y
  Let us simplify: ∂(β'z)/∂β = ∂( Σ_{j=1}^{k} βj zj )/∂β = (z1, ..., zk)' = z
– ∂(β'X'Xβ)/∂β = 2X'Xβ
  Let us simplify: ∂(β'Zβ)/∂β = ∂( Σ_{i=1}^{k} Σ_{j=1}^{k} βi zij βj )/∂β, whose ℓth element is Σ_{j=1}^{k} βj zℓj + Σ_{i=1}^{k} βi ziℓ = Σ_{j=1}^{k} βj (zℓj + zjℓ).
  If Z is symmetric, ∂(β'Zβ)/∂β = 2Zβ (with ℓth element 2 Σ_{j=1}^{k} zℓj βj).

– Combining these results yields the FOC:

      ∂S(β)/∂β |_{β̂} = −2X'y + 2X'X β̂ = 0
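A minimal sketch of the matrix formula β̂ = (X'X)⁻¹X'y on simulated data (the design and true coefficients are assumptions of the illustration); it also verifies X'ε̂ = 0.

    # Matrix OLS: solve X'X b = X'y (illustrative sketch on simulated data)
    import numpy as np

    rng = np.random.default_rng(2)
    n = 500
    X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
    beta = np.array([1.0, 0.5, -2.0])
    y = X @ beta + rng.normal(size=n)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # (X'X)^{-1} X'y
    resid = y - X @ beta_hat
    print(beta_hat, X.T @ resid)                      # residuals orthogonal to regressors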

8.6 Statistical Inference in CLM under normality


8.6.1 The t-test
For simplicity (and without loss of generality) let us treat the regressors as fixed and assume GM + normality.
We want to test the hypothesis

      H0: β1 = 5  against  HA: β1 ≠ 5

– If we assume σ² is known, the test statistic we use is

      z = (β̂1 − 5)/Stdev(β̂1) ~ N(0, 1) under H0,   with  Stdev(β̂1) = σ / sqrt( Σ_{i=1}^{n} (xi − x̄)² )

  We should reject if |z| > z_{α/2}, where α is our level of significance.

– If σ² is unknown, z is not a proper test statistic (we cannot compute it with the data as it depends on σ²). The test statistic we then use is

      t = (β̂1 − 5)/SE(β̂1) ~ t(n−2) under H0,   with  SE(β̂1) = s / sqrt( Σ_{i=1}^{n} (xi − x̄)² )

  We should reject if |t| > t_{α/2, n−2}, where α is our level of significance.

Alternatively, we can base our test on the (1 − α)·100% confidence interval

      [ β̂1 − t_{α/2} SE(β̂1), β̂1 + t_{α/2} SE(β̂1) ]

– We should reject the hypothesis β1 = 5 if 5 does not lie in this confidence interval.
– This confidence interval typically is wider than [ β̂1 − z_{α/2} Stdev(β̂1), β̂1 + z_{α/2} Stdev(β̂1) ], and recognizes the imprecision associated with the estimation of σ².

8.6.2 The F-test


In the multiple linear regression model, we may want to test various linear restrictions jointly, e.g.,

      H0: β2 = 5 and β3 = 1   against   HA: β2 ≠ 5 and/or β3 ≠ 1

– A well known test for such joint linear restrictions is the F-test, which compares the fit of the unrestricted and restricted model. Under H0:

      F = [ (RRSS − URSS) / #restrictions ] / [ URSS / df of unrestricted model ]  ~  F(#restrictions, df of unrestricted model)

  When we minimize the residual sum of squares subject to restrictions, we typically incur a loss (RRSS − URSS).
  This test determines whether this loss is significant, in which case we would reject the validity of these restrictions!
  If our model is yi = β1 + β2 xi2 + β3 xi3 + εi, then to obtain RRSS we impose the restrictions and regress yi − 5xi2 − xi3 = β1 + εi. Its RSS is called RRSS; #restrictions = 2; df of unrestricted model = n − 3.
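A minimal sketch of this F-test on simulated data in which H0 (β2 = 5, β3 = 1) holds by construction; the intercept and error distribution are assumptions of the illustration.

    # F-test of H0: b2 = 5 and b3 = 1 via restricted vs unrestricted RSS (illustrative sketch)
    import numpy as np
    from scipy.stats import f

    rng = np.random.default_rng(3)
    n = 200
    x2, x3 = rng.normal(size=n), rng.normal(size=n)
    y = 2.0 + 5.0 * x2 + 1.0 * x3 + rng.normal(size=n)    # H0 true in this simulation

    X = np.column_stack([np.ones(n), x2, x3])
    b_u = np.linalg.solve(X.T @ X, X.T @ y)
    urss = ((y - X @ b_u)**2).sum()

    y_r = y - 5.0 * x2 - x3                           # impose the restrictions
    rrss = ((y_r - y_r.mean())**2).sum()              # regression on a constant only

    q, df = 2, n - 3
    F = ((rrss - urss) / q) / (urss / df)
    print(F, f.ppf(0.95, q, df))                      # compare with 5% critical value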

Statistical Inference

The use of the t and F tests as specified above critically depends on the GM assumptions and normality of the errors.

– The t-test, e.g. for H0: β1 = 5, makes use of SE(β̂1), which is obtained using the formula derived under our GM assumptions.
– Violation of GM invalidates our usual test statistics.

If we do not want to assume normality of the errors, we will want to rely on a suitable CLT:

      β̂1 ~a N( β1, σ² / Σ_{i=1}^{n} (xi − x̄)² )

– For the single linear restriction H0: β1 = 5, we will use the Z test:

      z = (β̂1 − 5)/SE(β̂1) ~a N(0, 1) under H0   (asymptotic t-test)

– For joint linear restrictions (discussed in our econometrics courses), we will use an asymptotic χ² test, with degrees of freedom given by the number of restrictions.

8.7 Gauss-Markov violations - brief summary


Heteroskedasticity - commonplace in cross sectional data
(Suggested background reading: Wooldridge, Chapter 8: 8.1-8.4)

– In the presence of heteroskedasticity, A.4 fails.
– Assuming that all other GM assumptions are still satisfied: OLS still has good properties (unbiased, consistent).
  Nevertheless, the usual standard error of OLS is incorrect in the presence of heteroskedasticity – t tests and confidence intervals are invalid.
– We will need to use robust SE's (White) to make them valid! (A sketch follows at the end of this list.)
  Important: NO need to be explicit about the form of heteroskedasticity.
– The OLS estimator is no longer efficient. There might be a better estimator (WLS)!
  Important: To regain efficiency we DO need to specify the form of the variance as a function of explanatory variables.
– Common tests for heteroskedasticity are the Goldfeld-Quandt test, the Breusch-Pagan (LM) test and the White test.
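The sketch referred to above: White heteroskedasticity-robust standard errors computed by hand as (X'X)⁻¹ X' diag(ε̂i²) X (X'X)⁻¹, on simulated heteroskedastic data (the variance function used in the simulation is an assumption of the illustration).

    # White robust standard errors vs the usual OLS standard errors (illustrative sketch)
    import numpy as np

    rng = np.random.default_rng(4)
    n = 500
    x = rng.normal(size=n)
    e = rng.normal(size=n) * (0.5 + np.abs(x))        # heteroskedastic errors
    y = 1.0 + 2.0 * x + e

    X = np.column_stack([np.ones(n), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b

    V_usual = (resid @ resid / (n - 2)) * XtX_inv     # valid only under homoskedasticity
    V_white = XtX_inv @ (X.T * resid**2) @ X @ XtX_inv
    print(np.sqrt(np.diag(V_usual)), np.sqrt(np.diag(V_white)))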

Autocorrelation – commonplace in time series data. In the presence of autocorrelation, A.4 fails.
(Suggested background reading: Wooldridge, Chapter 12: 12.1-12.3 and 12.5)

– Assuming that all other GM assumptions are still satisfied: OLS still has good properties (consistent).
  In time series data, we typically prefer to use E(εt | xt) = 0 (weak exogeneity), which will not allow us to obtain unbiasedness!
– Nevertheless, the usual standard error of OLS is incorrect in the presence of autocorrelation and/or heteroskedasticity – t tests and confidence intervals are invalid.
– We will need to use robust SE's (HAC) to make them valid!
  Important: NO need to be explicit about the form of heteroskedasticity and/or autocorrelation.
– The OLS estimator is no longer efficient. There might be a better estimator!
  Important: To regain efficiency we DO need to specify the form of the autocorrelation (e.g., a stationary AR(1)).
– Autocorrelation in the presence of lagged dependent variables will cause a violation of A.3!!
– Common tests for autocorrelation are the Breusch-Godfrey (LM) test and the Durbin-Watson test.
– Specific (weakly dependent) autocorrelation patterns:
  Autoregressive process of order p: AR(p)
  (For AR processes to be weakly dependent we cannot have unit roots (persistence, strong dependence) or explosive roots.)
  Moving average error process of order q: MA(q), or
  ARMA(p,q), which is a combination of an AR(p) and an MA(q).

Endogeneity – Correlation between the errors and regressors
(Suggested background reading: Wooldridge, Chapter 15: 15.1-15.5 or Stock and Watson, Chapter 10)

– Endogeneity can arise for a number of reasons: omitted relevant variables, measurement errors in the regressors, lagged dependent variables in the presence of autocorrelation in the error term, and simultaneity.
– This is a serious violation of A.3, as it renders the OLS estimator biased and inconsistent.
– Intuitively: OLS imposes sample conditions on the residuals, e.g., Σ_{i=1}^{n} ε̂i xi = 0, that are unreasonable if E(εi xi) ≠ 0.
– Solution: Look for an instrumental variable, zi, that is valid, E(εi zi) = 0, and relevant, Cov(xi, zi) ≠ 0.
  The IV estimator, which can also be seen to be a method of moments estimator, is an estimator that has desirable large sample properties (e.g., consistent and asymptotically normal).
  The IV estimator can be computed using 2SLS.
– Identification
  If we have exactly as many instruments as we need (exact identification), IV and 2SLS are identical.
  If we have more instruments than we need (overidentification), then 2SLS provides a way to use the optimal instrument.
  We cannot estimate the parameters if we have fewer instruments than we need (underidentification); our parameters are not identified.

9 Large-Sample Distribution Theory (Non-examinable)


9.1 Modes of Convergence
See also Greene, Appendix D.
Convergence in probability and convergence in distribution play an important role in econometrics and statistical inference.

– To establish the consistency of our estimators we relied on the convergence in probability concept:

      plim θ̂n = θ   or   θ̂n →p θ

– To enable us to conduct hypothesis testing, convergence in distribution is relevant if we cannot establish the exact sampling distributions for our estimators:

      zn = √n (θ̂n − θ) →d f(z)

  with f(z) a well-defined distribution with a mean and a positive variance.

Let {xn, n = 1, 2, ...} be a sequence of r.v.'s and x another r.v. defined on a common probability space (x can be a constant).

Definition (Convergence in Probability)  We say that xn →p x if, for all ε > 0,

      lim_{n→∞} Pr{ |xn − x| > ε } = 0

Then xn converges in probability to x. We also write xn = x + op(1).

Definition (Weak consistency)  Suppose that we have an unknown parameter θ, and based on a sample of n observations we estimate it by θ̂n. Then θ̂n is weakly consistent if θ̂n − θ →p 0.

A stronger type of convergence is almost sure convergence.

Definition (Almost Sure Convergence)  We say that xn →a.s. x if

      Pr{ lim_{n→∞} |xn − x| = 0 } = 1.

– Intuitively, once the sequence xn becomes close to x, it stays close to x.

Definition (Strong consistency)  θ̂n is strongly consistent if θ̂n →a.s. θ.

Relationship between these two modes of convergence:

      xn →a.s. x  ⇒  xn →p x.

A special case of convergence in probability is given by

Definition (Convergence in mean square)  Assuming E[x²] < ∞,

      xn →mse x  iff  lim_{n→∞} E{ |xn − x|² } = 0.

– If x is a constant: the conditions require asymptotic unbiasedness, lim_{n→∞} E(xn) = x, and vanishing variance, lim_{n→∞} Var(xn) = 0.

  Intuitively, the distribution of xn collapses to a spike at plim xn.

By Markov's inequality (with r = 2, Chebychev's inequality):

      xn →mse x  ⇒  xn →p x.

Theorem (Markov's inequality)  For r > 0,

      Pr{ |z| > ε } ≤ E{ |z|^r } / ε^r,   for all ε > 0.

– With z = xn − x, the result follows as E{ |xn − x|² } → 0.

When considering convergence to a constant, we also have

      θ̂n →p θ  ⇔  θ̂n →d θ,  where the limiting distribution of θ̂n is a spike.

– Nevertheless, such a degenerate distribution is not very informative.

We would like to standardize it, such that, say,

      zn = √n (θ̂n − θ) →d f(z)

with f(z) a well-defined distribution with a mean and a positive variance.

– The Central Limit Theorem provides an application of this proposition.

Theorem (Lindeberg-Levy CLT)  If X1, ..., Xn is a random sample (i.i.d.) from a probability distribution with finite mean μ and finite variance σ², then

      √n (X̄n − μ) →d N(0, σ²)
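A small simulation sketch of the Lindeberg-Levy CLT: standardized means of exponential draws (mean 1, variance 1) behave approximately like N(0, 1) for moderate n (numpy/scipy assumed available; the exponential choice is just an illustration).

    # CLT check: sqrt(n)(Xbar - mu)/sigma for non-normal (exponential) data (sketch)
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(5)
    n, reps = 100, 10000
    samples = rng.exponential(scale=1.0, size=(reps, n))      # mu = 1, sigma = 1
    z = np.sqrt(n) * (samples.mean(axis=1) - 1.0) / 1.0
    print((z > 1.645).mean(), 1 - norm.cdf(1.645))            # tail frequencies roughly agree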

Let us denote by FX(x) = Pr(X ≤ x) the probability distribution function (DF) of X, and by φ_X(t) = E[e^{it'X}] its characteristic function (CF).

Definition  We say that Xn → X in distribution (Xn →d X) if lim_{n→∞} F_{Xn}(x) = FX(x) at every continuity point of FX(·).

We call FX(x) the limiting (or asymptotic) distribution of Xn.

The following theorem indicates why characteristic functions are useful in proofs of central limit theorems.

Theorem
      Xn →d X  ⇔  φ_{Xn}(t) → φ_X(t) for all t

Two fundamental theoretical results are the foundations for applying these convergence ideas to sampling distributions of estimators:

– Law of Large Numbers
  They state regularity conditions under which sample moments converge in probability to their population counterparts, e.g., the Khinchine WLLN.
– Central Limit Theorems
  They state regularity conditions under which an appropriately standardized (for mean and variance) sample average converges in distribution, e.g., the Lindeberg-Levy CLT.

Instead of reviewing these here, I will provide important results we often make use of in our Econometrics courses:

– Slutsky Theorem
– Continuous Mapping Theorem
– Cramer Convergence Theorem
– Delta Method
– Stochastic Order of Magnitude

9.2 Slutsky’s Theorem


Theorem (Slutsky Theorem)  For a continuous function g(Xn) that is not a function of n,

      plim g(Xn) = g(plim Xn).

Some implications of the Slutsky theorem are:

– If xn and yn are random variables with plim xn = c and plim yn = d, then

      plim(xn + yn) = c + d        (sum rule)
      plim(xn yn) = cd             (product rule)
      plim(xn / yn) = c/d          (ratio rule)

– If Wn is a matrix whose elements are random variables and if plim Wn = Ω, then

      plim(Wn⁻¹) = Ω⁻¹             (matrix inverse rule)

– If Xn and Yn are random matrices with plim Xn = A and plim Yn = B, then

      plim(Xn Yn) = AB             (matrix product rule)

9.3 Continuous Mapping Theorem

Theorem (Continuous Mapping Theorem)  Let Xn and X be k × 1 vectors. Let g be a continuous function on the domain of X. Then,

      Xn →d X  ⇒  g(Xn) →d g(X).

If we know how to obtain the distribution of the random variable g(X) (Chapter 4), then we can use this result to describe the limiting distribution of g(Xn).

– Examples

      Xn →d X ~ N(0, 1)    ⇒  Xn² →d χ²(1)
      Xn →d X ~ N(0, Ik)   ⇒  Xn'Xn →d χ²(k)
      √n (X̄ − μ) →d N(0, σ²)  ⇒  n (X̄ − μ)² / σ² →d χ²(1)
      √n (X̄ − μ) →d N(0, Σ)   ⇒  n (X̄ − μ)' Σ⁻¹ (X̄ − μ) →d χ²(k)

A convenient device used for proving joint convergence results is given by a related theorem.

Theorem (Cramer-Wold Theorem)  Let Xn and X be k × 1 vectors. Then

      Xn →d X  ⇔  λ'Xn →d λ'X  for all λ ∈ R^k

9.4 Cramer’s Convergence Theorem


Results that combine limiting distributions and probability limits:

1. Xn →d X, yn →p c  ⇒  Xn + yn →d X + c.
2. Xn →d X, yn →p c  ⇒  yn Xn →d cX.
3. Xn →d X, yn →p c ≠ 0  ⇒  yn⁻¹ Xn →d c⁻¹ X.

Theorem (Cramer's Convergence Theorem)  Let {Xn} be a sequence of (k × 1) vectors of r.v.'s, and assume that Xn = An Zn. Suppose

      An →p A (p.s.d.)  and
      Zn →d N(μ, Σ), that is, Σ^{−1/2}(Zn − μ) →d N(0, I);

then An Zn →d N(Aμ, AΣA').

If Xn has a limiting distribution and plim(Xn − Yn) = 0, then Yn has the same limiting distribution as Xn.

9.5 Delta Method
The Delta Method provides a convenient application of the Continuous Mapping Theorem.

Theorem (Delta Method)  Let xn be a sequence of r.v.'s with

      xn →p x
      √n (xn − x) →d N(0, σ²).

Let f be continuously differentiable. Then

      f(xn) →p f(x)                              (Slutsky Theorem)
      √n (f(xn) − f(x)) →d N(0, σ² (f'(x))²).

The proof relies on a Taylor expansion:

      f(xn) = f(x) + f'(x)(xn − x) + ½ f''(x*)(xn − x)²

      √n (f(xn) − f(x)) = f'(x) √n (xn − x) + (f''(x*)/(2√n)) [√n (xn − x)]²

  where the first term →d N(0, σ² (f'(x))²) and the second term is op(1).

– In a multivariate setting, where √n (xn − x) →d N(0, Σ), we obtain

      √n (f(xn) − f(x)) →d N( 0, (∂f(x)/∂x') Σ (∂f(x)/∂x) ).

  To prevent degeneracy we require that ∂f(x)/∂x' has full rank.
– The result is particularly useful when requiring SE's of functions of parameters.

Example:

– If (x̄, ȳ)' →p (μx, μy)' and √n ( (x̄, ȳ)' − (μx, μy)' ) →d N(0, Σ), then

– x̄/ȳ →p μx/μy
  Clearly, by Slutsky: plim (x̄/ȳ) = plim x̄ / plim ȳ = μx/μy.

– √n ( x̄/ȳ − μx/μy ) →d N( 0, (1/μy, −μx/μy²) Σ (1/μy, −μx/μy²)' )

  This uses: ∂(x/y)/∂x = 1/y and ∂(x/y)/∂y = −x/y², where y ≠ 0.
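A simulation sketch of the delta method for the ratio x̄/ȳ: the simulated standard deviation of the ratio is close to the analytic delta-method value (the means, variances and independence used below are assumptions of the illustration).

    # Delta method for f(x, y) = x/y (illustrative sketch)
    import numpy as np

    rng = np.random.default_rng(6)
    mu_x, mu_y, n, reps = 2.0, 4.0, 400, 5000
    ratios = [rng.normal(mu_x, 1.0, n).mean() / rng.normal(mu_y, 1.0, n).mean()
              for _ in range(reps)]

    grad = np.array([1 / mu_y, -mu_x / mu_y**2])      # gradient of x/y at (mu_x, mu_y)
    Sigma = np.eye(2)                                 # Var of sqrt(n)*(xbar, ybar)
    se_delta = np.sqrt(grad @ Sigma @ grad / n)
    print(np.std(ratios), se_delta)                   # both about 0.014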

9.6 Order of a sequence and Stochastic Order of Magnitude
We would like to define the rate at which a sequence converges or diverges in terms of the order of the sequence:

– Order n^δ
  A sequence cn is of order n^δ, denoted O(n^δ), if and only if plim n^{−δ} cn is a finite nonzero constant.
– Order less than n^δ
  A sequence cn is of order less than n^δ, denoted o(n^δ), if and only if plim n^{−δ} cn equals zero.

Example: Consider the variance of the mean, σ²/n := σ²_n. It converges to zero as long as σ² is a finite constant.

– Here σ²_n = O(n⁻¹).

The above notation deals with the convergence of sequences of ordinary numbers or sequences of random variables. In the latter setting we typically use Op and op notation instead.

Let fn be a nonstochastic sequence (such as n^δ).

Definition (Stochastic Order of Magnitude)

– xn = Op(fn) if and only if for all ε > 0 there exist c ≥ 0 and n0 > 0 such that

      Pr{ |xn/fn| > c } < ε,  for all n ≥ n0.

  We also say xn/fn is bounded in probability (or tight), or xn/fn = Op(1).

– xn = op(fn) if and only if xn/fn = op(1), i.e., xn/fn →p 0.

      Yn →p 0  ⇔  Yn = op(1)
      Xn →p X  ⇔  Xn − X = op(1)
      If Xn →d X  ⇒  Xn = Op(1)

Some useful results:

– Suppose Xn = op(an) and Yn = op(bn). Then

      Xn Yn = op(an bn)
      Xn + Yn = op(max(an, bn))
      |Xn|^r = op(an^r) for r > 0

– If Xn = op(an) and Yn = Op(bn), then Xn Yn = op(an bn).
