CH 1, 2, 3 Slides
Brief Overview of the Course
This course is about using data to measure causal effects.
Ideally, we would like an experiment
o What would be an experiment to estimate the effect of class size on standardized test scores?
But almost always we only have observational (nonexperimental) data.
o returns to education
o cigarette prices
o monetary policy
Most of the course deals with difficulties arising from using observational data to estimate causal effects
o confounding effects (omitted factors)
o simultaneous causality
o "correlation does not imply causation"
In this course you will:
Review of Probability and Statistics
(SW Chapters 2, 3)
The California Test Score Data Set
Variables:
5th grade test scores (Stanford-9 achievement test, combined math and reading), district average
Student-teacher ratio (STR) = no. of students in the district divided by no. of full-time equivalent teachers
Initial look at the data:
(You should already know how to interpret this table)
Do districts with smaller classes have higher test scores?
Scatterplot of test score v. student-teacher ratio
1. Estimation
\bar{Y}_{small} - \bar{Y}_{large} = \frac{1}{n_{small}} \sum_{i=1}^{n_{small}} Y_i - \frac{1}{n_{large}} \sum_{i=1}^{n_{large}} Y_i
= 657.4 - 650.0
= 7.4
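A minimal Python sketch of this estimator; the arrays here are hypothetical placeholders, not the actual California district data:

```python
import numpy as np

# Placeholder arrays; in the actual data set these would be the district
# test scores for the 238 small-STR and 182 large-STR districts.
scores_small = np.array([661.2, 655.8, 649.7, 662.3])
scores_large = np.array([648.1, 652.4, 647.9])

# Difference of sample means: Ybar_small - Ybar_large
diff = scores_small.mean() - scores_large.mean()
print(f"Ybar_small - Ybar_large = {diff:.1f}")
```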
2. Hypothesis testing
t = \frac{\bar{Y}_s - \bar{Y}_l}{\sqrt{\frac{s_s^2}{n_s} + \frac{s_l^2}{n_l}}} = \frac{\bar{Y}_s - \bar{Y}_l}{SE(\bar{Y}_s - \bar{Y}_l)}   (remember this?)
Compute the difference-of-means t-statistic:
Size     \bar{Y}    s_Y     n
small    657.4      19.4    238
large    650.0      17.9    182

t = \frac{657.4 - 650.0}{\sqrt{\frac{19.4^2}{238} + \frac{17.9^2}{182}}} = \frac{7.4}{1.83} = 4.05

|t| = 4.05 > 1.96, so reject (at the 5% significance level) the null hypothesis that the two means are the same.
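The same computation from the summary statistics in the table above, as a quick Python check (standard library only):

```python
import math

# Summary statistics from the table (small = low-STR districts)
ybar_s, s_s, n_s = 657.4, 19.4, 238
ybar_l, s_l, n_l = 650.0, 17.9, 182

se = math.sqrt(s_s**2 / n_s + s_l**2 / n_l)  # SE(Ybar_s - Ybar_l), approx. 1.83
t = (ybar_s - ybar_l) / se                   # approx. 4.05
print(f"SE = {se:.2f}, t = {t:.2f}")         # |t| > 1.96 -> reject H0 at 5%
```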
3. Confidence interval
(\bar{Y}_s - \bar{Y}_l) \pm 1.96 \, SE(\bar{Y}_s - \bar{Y}_l) = 7.4 \pm 1.96 \times 1.83 = (3.8, 11.0)

Two equivalent statements:
1. The 95% confidence interval for the difference in means doesn't include 0;
2. The hypothesis that the difference in means is 0 is rejected at the 5% level.
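A quick Python check of this interval, reusing the summary statistics above:

```python
import math

diff = 657.4 - 650.0                           # Ybar_s - Ybar_l
se = math.sqrt(19.4**2 / 238 + 17.9**2 / 182)  # SE of the difference, approx. 1.83
print(f"95% CI: ({diff - 1.96*se:.1f}, {diff + 1.96*se:.1f})")  # -> (3.8, 11.0)
```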
What comes next
Review of Statistical Theory
Population
The group or collection of all possible entities of interest
(school districts)
We will think of populations as infinitely large (\infty is an approximation to "very big")
Random variable Y
Numerical summary of a random outcome (district
average test score, district STR)
Population distribution of Y
(b) Moments of a population distribution: mean, variance,
standard deviation, covariance, correlation
Moments, ctd.
skewness = \frac{E[(Y - \mu_Y)^3]}{\sigma_Y^3}
= measure of asymmetry of a distribution
o skewness = 0: distribution is symmetric
o skewness > (<) 0: distribution has a long right (left) tail

kurtosis = \frac{E[(Y - \mu_Y)^4]}{\sigma_Y^4}
= measure of mass in tails
= measure of probability of large values
o kurtosis = 3: normal distribution
o kurtosis > 3: heavy tails ("leptokurtotic")
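A short Python sketch of these two moments using their sample analogues on simulated data (the helper names skewness and kurtosis are ours, not a library's):

```python
import numpy as np

def skewness(y):
    # sample analogue of E[(Y - mu)^3] / sigma^3
    return np.mean((y - y.mean())**3) / y.std()**3

def kurtosis(y):
    # sample analogue of E[(Y - mu)^4] / sigma^4
    return np.mean((y - y.mean())**4) / y.std()**4

rng = np.random.default_rng(0)
y = rng.standard_normal(100_000)
print(skewness(y), kurtosis(y))  # approx. 0 and 3 for normal data
```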
2 random variables: joint distributions and covariance
The covariance is a measure of the linear association between two random variables; so is the correlation.
The correlation coefficient is defined in terms of the covariance:

corr(X, Z) = \frac{cov(X, Z)}{\sqrt{var(X)\,var(Z)}} = \frac{\sigma_{XZ}}{\sigma_X \sigma_Z} = r_{XZ}

o -1 \le corr(X, Z) \le 1
o corr(X, Z) = 1 means perfect positive linear association
o corr(X, Z) = -1 means perfect negative linear association
o corr(X, Z) = 0 means no linear association
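A quick Python illustration, on simulated data, that the ratio definition above agrees with the built-in np.corrcoef:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000)
z = 0.5 * x + rng.standard_normal(1_000)  # built to be positively related to x

cov_xz = np.cov(x, z)[0, 1]                       # sample covariance
corr_xz = cov_xz / (x.std(ddof=1) * z.std(ddof=1))
print(corr_xz, np.corrcoef(x, z)[0, 1])           # the same number, two ways
```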
The correlation coefficient measures linear association
(c) Conditional distributions and conditional means
Conditional distributions
The distribution of Y, given value(s) of some other
random variable, X
Ex: the distribution of test scores, given that STR < 20
Conditional expectations and conditional moments
conditional mean = mean of conditional distribution
= E(Y|X = x) (important concept and notation)
conditional variance = variance of conditional distribution
Example: E(Test scores|STR < 20) = the mean of test
scores among districts with small class sizes
The difference in means is the difference between the means of two conditional distributions:

\Delta = E(Test scores | STR < 20) - E(Test scores | STR \ge 20)
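A sketch of this conditional-mean comparison in Python; str_ratio and testscr are hypothetical placeholder arrays standing in for the district-level variables:

```python
import numpy as np

# Placeholder data, not the actual California data set
str_ratio = np.array([19.5, 21.0, 18.2, 22.4, 19.9, 23.1])
testscr = np.array([661.2, 648.1, 665.3, 645.0, 657.8, 642.6])

small = str_ratio < 20
delta = testscr[small].mean() - testscr[~small].mean()
print(delta)  # sample analogue of E(Y|STR < 20) - E(Y|STR >= 20)
```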
Conditional mean, ctd.
(d) Distribution of a sample of data drawn randomly from a population: Y_1, …, Y_n
Distribution of Y_1, …, Y_n under simple random sampling

Because individuals #1 and #2 are selected at random, the value of Y_1 has no information content for Y_2. Thus:
o Y_1 and Y_2 are independently distributed
o Y_1 and Y_2 come from the same distribution, that is, Y_1, Y_2 are identically distributed
o That is, under simple random sampling, Y_1 and Y_2 are independently and identically distributed (i.i.d.)
o More generally, under simple random sampling, {Y_i}, i = 1, …, n, are i.i.d.
This framework allows rigorous statistical inferences about
moments of population distributions using a sample of data
from that population
1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence Intervals
Estimation
\bar{Y} is the natural estimator of the mean. But:
(a) What are the properties of \bar{Y}?
(b) Why should we use \bar{Y} rather than some other estimator?
o Y_1 (the first observation)
o maybe unequal weights, not the simple average
o median(Y_1, …, Y_n)
The starting point is the sampling distribution of \bar{Y}…
(a) The sampling distribution of \bar{Y}

\bar{Y} is a random variable, and its properties are determined by the sampling distribution of \bar{Y}:
o The individuals in the sample are drawn at random.
o Thus the values of (Y_1, …, Y_n) are random.
o Thus functions of (Y_1, …, Y_n), such as \bar{Y}, are random: had a different sample been drawn, they would have taken on a different value.
o The distribution of \bar{Y} over different possible samples of size n is called the sampling distribution of \bar{Y}.
o The mean and variance of \bar{Y} are the mean and variance of its sampling distribution, E(\bar{Y}) and var(\bar{Y}).
o The concept of the sampling distribution underpins all of econometrics.
The sampling distribution of \bar{Y}, ctd.

Example: Suppose Y takes on 0 or 1 (a Bernoulli random variable) with the probability distribution

Pr(Y = 0) = .22, Pr(Y = 1) = .78

Then
E(Y) = p \times 1 + (1 - p) \times 0 = p = .78
\sigma_Y^2 = E[Y - E(Y)]^2 = p(1 - p) [remember this?]
= .78 \times (1 - .78) = .1716

The sampling distribution of \bar{Y} depends on n.
Consider n = 2. The sampling distribution of \bar{Y} is
Pr(\bar{Y} = 0) = .22^2 = .0484
Pr(\bar{Y} = 1/2) = 2 \times .22 \times .78 = .3432
Pr(\bar{Y} = 1) = .78^2 = .6084
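A Python simulation sketch that recovers these three probabilities by drawing many samples of size n = 2:

```python
import numpy as np

rng = np.random.default_rng(7)
p, n = 0.78, 2

# Draw many samples of size n = 2 and tabulate the resulting values of Ybar
ybars = rng.binomial(1, p, size=(1_000_000, n)).mean(axis=1)
for v in (0.0, 0.5, 1.0):
    print(v, np.mean(ybars == v))  # approx. .0484, .3432, .6084
```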
The sampling distribution of \bar{Y} when Y is Bernoulli (p = .78):
Things we want to know about the sampling distribution:
The mean and variance of the sampling distribution of Y
General case, that is, for Y_i i.i.d. from any distribution, not just Bernoulli:

mean: E(\bar{Y}) = E\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(Y_i) = \frac{1}{n}\sum_{i=1}^{n} \mu_Y = \mu_Y

variance: var(\bar{Y}) = E[\bar{Y} - E(\bar{Y})]^2 = E\left[\frac{1}{n}\sum_{i=1}^{n}(Y_i - \mu_Y)\right]^2
so var(\bar{Y}) = E\left[\frac{1}{n}\sum_{i=1}^{n}(Y_i - \mu_Y)\right]^2
= E\left[\frac{1}{n}\sum_{i=1}^{n}(Y_i - \mu_Y)\right]\left[\frac{1}{n}\sum_{j=1}^{n}(Y_j - \mu_Y)\right]
= \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} E\left[(Y_i - \mu_Y)(Y_j - \mu_Y)\right]
= \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} cov(Y_i, Y_j)
= \frac{1}{n^2}\sum_{i=1}^{n} \sigma_Y^2    (by independence, cov(Y_i, Y_j) = 0 for i \ne j)
= \frac{\sigma_Y^2}{n}
Mean and variance of the sampling distribution of \bar{Y}, ctd.

E(\bar{Y}) = \mu_Y
var(\bar{Y}) = \frac{\sigma_Y^2}{n}

Implications:
1. \bar{Y} is an unbiased estimator of \mu_Y (that is, E(\bar{Y}) = \mu_Y)
2. var(\bar{Y}) is inversely proportional to n
o the spread of the sampling distribution is proportional to 1/\sqrt{n}
o thus the sampling uncertainty associated with \bar{Y} is proportional to 1/\sqrt{n} (larger samples mean less uncertainty, but square-root law)
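A Python simulation sketch of these two facts, using a deliberately non-normal (exponential) population with mean 1 and variance 1:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 50, 100_000
# Many samples of size n from an exponential population (mean 1, variance 1)
ybars = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

print(ybars.mean())  # approx. 1    : E(Ybar) = mu_Y (unbiased)
print(ybars.var())   # approx. 1/50 : var(Ybar) = sigma_Y^2 / n
```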
The sampling distribution of \bar{Y} when n is large
The Law of Large Numbers:
An estimator is consistent if the probability that it falls within an interval of the true population value tends to one as the sample size increases.

If (Y_1, …, Y_n) are i.i.d. and \sigma_Y^2 < \infty, then \bar{Y} is a consistent estimator of \mu_Y, that is,

Pr[|\bar{Y} - \mu_Y| < \varepsilon] \to 1 as n \to \infty,

which can be written \bar{Y} \xrightarrow{p} \mu_Y

(\bar{Y} \xrightarrow{p} \mu_Y means "\bar{Y} converges in probability to \mu_Y").

(The math: as n \to \infty, var(\bar{Y}) = \frac{\sigma_Y^2}{n} \to 0, which implies that Pr[|\bar{Y} - \mu_Y| < \varepsilon] \to 1.)
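A Python sketch of consistency in action for the Bernoulli example above (the tolerance eps = .05 is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
p, eps = 0.78, 0.05  # true mean and tolerance

for n in (10, 100, 1_000, 10_000):
    # Ybar = (number of 1s) / n, so n * Ybar is Binomial(n, p)
    ybars = rng.binomial(n, p, size=20_000) / n
    print(n, np.mean(np.abs(ybars - p) < eps))  # tends to 1 as n grows
```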
The Central Limit Theorem (CLT):
If (Y_1, …, Y_n) are i.i.d. and 0 < \sigma_Y^2 < \infty, then when n is large the distribution of \bar{Y} is well approximated by a normal distribution.

o \bar{Y} is approximately distributed N\left(\mu_Y, \frac{\sigma_Y^2}{n}\right) (normal distribution with mean \mu_Y and variance \sigma_Y^2/n)
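A Python sketch of the CLT for the Bernoulli example: standardized sample means land inside ±1.96 about 95% of the time, as they would under N(0, 1):

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 0.78, 100
ybars = rng.binomial(n, p, size=100_000) / n  # many sample means, each from n draws

# Standardize: (Ybar - mu_Y) / sqrt(sigma_Y^2 / n) should be ~ N(0, 1) for large n
z = (ybars - p) / np.sqrt(p * (1 - p) / n)
print(np.mean(np.abs(z) < 1.96))  # approx. .95, the N(0,1) two-sided coverage
```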
Summary: The Sampling Distribution of \bar{Y}

For Y_1, …, Y_n i.i.d. with 0 < \sigma_Y^2 < \infty,
o The exact (finite sample) sampling distribution of \bar{Y} has mean \mu_Y (\bar{Y} is an unbiased estimator of \mu_Y) and variance \sigma_Y^2/n
o Other than its mean and variance, the exact distribution of \bar{Y} is complicated and depends on the distribution of Y (the population distribution)
o When n is large, the sampling distribution simplifies:
o \bar{Y} \xrightarrow{p} \mu_Y (law of large numbers)
o \frac{\bar{Y} - E(\bar{Y})}{\sqrt{var(\bar{Y})}} is approximately distributed N(0, 1) (CLT)
(b) Why Use \bar{Y} To Estimate \mu_Y?

o \bar{Y} is unbiased: E(\bar{Y}) = \mu_Y
o \bar{Y} is consistent: \bar{Y} \xrightarrow{p} \mu_Y
o \bar{Y} is the "least squares" estimator of \mu_Y; \bar{Y} solves

\min_m \sum_{i=1}^{n} (Y_i - m)^2

To see this, differentiate with respect to m:

\frac{d}{dm}\sum_{i=1}^{n}(Y_i - m)^2 = \sum_{i=1}^{n}\frac{d}{dm}(Y_i - m)^2 = -2\sum_{i=1}^{n}(Y_i - m)

Setting the derivative to zero and solving gives m = \frac{1}{n}\sum_{i=1}^{n} Y_i = \bar{Y}.
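A numeric Python check that \bar{Y} minimizes the sum of squared deviations, using a brute-force grid search over m on a small made-up sample:

```python
import numpy as np

y = np.array([3.0, 7.0, 1.0, 4.0, 10.0])  # any small sample; mean is 5.0

# Evaluate the sum of squared deviations over a grid of candidate m values
m_grid = np.linspace(0.0, 10.0, 10_001)
ssr = ((y[:, None] - m_grid[None, :]) ** 2).sum(axis=0)
print(m_grid[ssr.argmin()], y.mean())  # both 5.0: Ybar minimizes the SSR
```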
Why Use \bar{Y} To Estimate \mu_Y, ctd.
1. The probability framework for statistical inference
2. Estimation
3. Hypothesis Testing
4. Confidence intervals
Hypothesis Testing
The hypothesis testing problem (for the mean): make a
provisional decision, based on the evidence at hand, whether
a null hypothesis is true, or instead that some alternative
hypothesis is true. That is, test
H_0: E(Y) = \mu_{Y,0} vs. H_1: E(Y) > \mu_{Y,0} (1-sided, >)
H_0: E(Y) = \mu_{Y,0} vs. H_1: E(Y) < \mu_{Y,0} (1-sided, <)
H_0: E(Y) = \mu_{Y,0} vs. H_1: E(Y) \ne \mu_{Y,0} (2-sided)
Some terminology for testing statistical hypotheses:
p-value = Pr_{H_0}\left[\,|\bar{Y} - \mu_{Y,0}| > |\bar{Y}^{act} - \mu_{Y,0}|\,\right]
= Pr_{H_0}\left[\left|\frac{\bar{Y} - \mu_{Y,0}}{\sigma_Y/\sqrt{n}}\right| > \left|\frac{\bar{Y}^{act} - \mu_{Y,0}}{\sigma_Y/\sqrt{n}}\right|\right]
= Pr_{H_0}\left[\left|\frac{\bar{Y} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right| > \left|\frac{\bar{Y}^{act} - \mu_{Y,0}}{\sigma_{\bar{Y}}}\right|\right]
= probability under the left and right N(0,1) tails,

where \sigma_{\bar{Y}} = the standard deviation of the sampling distribution of \bar{Y} = \sigma_Y/\sqrt{n}.
Calculating the p-value with \sigma_Y known:
Estimator of the variance of Y:

s_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = sample variance of Y

Fact:
If (Y_1, …, Y_n) are i.i.d. and E(Y^4) < \infty, then s_Y^2 \xrightarrow{p} \sigma_Y^2
Computing the p-value with \sigma_Y^2 estimated:

p-value = Pr_{H_0}\left[\left|\frac{\bar{Y} - \mu_{Y,0}}{\sigma_Y/\sqrt{n}}\right| > \left|\frac{\bar{Y}^{act} - \mu_{Y,0}}{\sigma_Y/\sqrt{n}}\right|\right]
\cong Pr_{H_0}\left[\left|\frac{\bar{Y} - \mu_{Y,0}}{s_Y/\sqrt{n}}\right| > \left|\frac{\bar{Y}^{act} - \mu_{Y,0}}{s_Y/\sqrt{n}}\right|\right] (large n)

so

p-value = Pr_{H_0}[\,|t| > |t^{act}|\,]   (\sigma_Y^2 estimated)
= probability under normal tails outside |t^{act}|,

where t = \frac{\bar{Y} - \mu_{Y,0}}{s_Y/\sqrt{n}} (the usual t-statistic).
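A Python sketch of this large-n p-value computation on a hypothetical simulated sample (scipy is assumed here, only for the standard normal tail probability):

```python
import numpy as np
from scipy import stats  # used only for the N(0,1) survival function

rng = np.random.default_rng(5)
y = rng.normal(loc=0.2, scale=1.0, size=200)  # hypothetical sample
mu_0 = 0.0                                    # value of the mean under H0

t = (y.mean() - mu_0) / (y.std(ddof=1) / np.sqrt(len(y)))
p_value = 2 * stats.norm.sf(abs(t))           # area in both N(0,1) tails
print(t, p_value)
```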
What is the link between the p-value and the significance
level?
At this point, you might be wondering...
What happened to the t-table and the degrees of freedom?
Comments on this recipe and the Student t-distribution
Comments on Student t distribution, ctd.
Comments on Student t distribution, ctd.
4. You might not know this. Consider the t-statistic testing
the hypothesis that two means (groups s, l) are equal:
t = \frac{\bar{Y}_s - \bar{Y}_l}{\sqrt{\frac{s_s^2}{n_s} + \frac{s_l^2}{n_l}}} = \frac{\bar{Y}_s - \bar{Y}_l}{SE(\bar{Y}_s - \bar{Y}_l)}

The assumption that Y is distributed N(\mu_Y, \sigma_Y^2) is rarely plausible in practice.
1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence intervals
Confidence Intervals
A 95% confidence interval for \mu_Y is an interval that contains the true value of \mu_Y in 95% of repeated samples.

\left\{\mu_Y: \left|\frac{\bar{Y} - \mu_Y}{s_Y/\sqrt{n}}\right| \le 1.96\right\} = \left\{\mu_Y: -1.96 \le \frac{\bar{Y} - \mu_Y}{s_Y/\sqrt{n}} \le 1.96\right\}
= \left\{\mu_Y: -1.96\frac{s_Y}{\sqrt{n}} \le \bar{Y} - \mu_Y \le 1.96\frac{s_Y}{\sqrt{n}}\right\}
= \left\{\mu_Y \in \left(\bar{Y} - 1.96\frac{s_Y}{\sqrt{n}},\; \bar{Y} + 1.96\frac{s_Y}{\sqrt{n}}\right)\right\}

This confidence interval relies on the large-n results that \bar{Y} is approximately normally distributed and s_Y^2 \xrightarrow{p} \sigma_Y^2.
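A Python simulation sketch of the "repeated samples" interpretation: the interval built this way covers the true \mu_Y in about 95% of simulated samples:

```python
import numpy as np

rng = np.random.default_rng(6)
mu_Y, n, reps = 5.0, 100, 10_000
covered = 0
for _ in range(reps):
    y = rng.normal(loc=mu_Y, scale=2.0, size=n)   # one sample of size n
    half = 1.96 * y.std(ddof=1) / np.sqrt(n)       # half-width of the 95% CI
    covered += y.mean() - half <= mu_Y <= y.mean() + half
print(covered / reps)  # approx. .95: coverage over repeated samples
```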
Summary:
From the two assumptions of:
(1) simple random sampling of a population, that is, {Y_i, i = 1, …, n} are i.i.d.
(2) 0 < E(Y^4) < \infty
Let's go back to the original policy question:
What is the effect on test scores of reducing STR by one
student/class?
Have we answered this question?