07 BiasAndRegression


CS109/Stat121/AC209/E-109

Data Science
Bias and Regression
Hanspeter Pfister, Joe Blitzstein, and Verena Kaynig

This Week
• HW1 due tonight at 11:59 pm (Eastern Time)

• HW2 posted soon


Census Data from the Current Population Survey (CPS)

“It is important to note that the CPS counts students


living in dormitories as living in their parents’ home.”

– Census Bureau, http://www.census.gov/prod/2013pubs/p20-570.pdf


Some Forms of Bias

• selection bias
• publication bias (file drawer problem)
• non-response bias
• length bias
1936 Presidential Election, Landon vs. FDR

Literary Digest predicted Landon would win with 370 electoral votes, based on a sample size of 2.4 million.

source: https://en.wikipedia.org/wiki/United_States_presidential_election,_1936
1936 Presidential Election, Landon vs. FDR

Literary Digest got responses from 2.3 million out of 10 million people surveyed.

To collect their sample, they used 3 readily available lists:


• readers of their magazine
• car registration list
• phone directory
Wald and the Bullet Holes
What about the unobserved planes? Missing data!

Longevity Study from Lombard (1835)

Profession         Average Longevity
-----------------  -----------------
chocolate makers   73.6
professors         66.6
clocksmiths        55.3
locksmiths         47.2
students           20.2

Sources: Lombard (1835), Wainer (1999), Stigler (2002)


Class Size Paradox

Why do so many schools boast a small average class size, yet so many students end up in huge classes?

Simple example: each student takes one course; suppose there is one course with 100 students and fifty courses with 2 students each.

Dean calculates: (100 + 50*2)/51 ≈ 3.92

Students calculate: (100*100 + 100*2)/200 = 51
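The two averages can be reproduced in a few lines (a minimal sketch using the slide's toy numbers):

```python
# Class size paradox: the dean averages over courses, while students
# experience a size-biased average over enrollments of the same data.
sizes = [100] + [2] * 50  # one 100-student course, fifty 2-student courses

dean_avg = sum(sizes) / len(sizes)                    # per-course average
student_avg = sum(s * s for s in sizes) / sum(sizes)  # per-student average

print(round(dean_avg, 2))  # 3.92
print(student_avg)         # 51.0
```

The student average weights each class size by how many students sit in it, which is exactly the length-biasing at work here.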


“About 10 percent of the 1.6 million inmates in
America’s prisons are serving life sentences;
another 11 percent are serving over 20 years.”

source: http://www.nytimes.com/2012/02/26/health/dealing-with-dementia-among-aging-criminals.html?pagewanted=all
Length-Biasing Paradox
How would you measure the average prison sentence?
Bias of an Estimator
The bias of an estimator θ̂ is how far off it is on average: bias(θ̂) = E(θ̂) − θ.

So why not just subtract off the bias?


Bias-Variance Tradeoff
one form: MSE(θ̂) = Var(θ̂) + bias²(θ̂)

Often a little bit of bias can make it possible to have a much lower MSE.

http://scott.fortmann-roe.com/docs/BiasVariance.html
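A small simulation of the tradeoff (the values μ, σ, n, and the shrinkage factor are illustrative choices, not from the slides): shrinking the sample mean toward zero introduces bias but cuts variance enough to lower the MSE here.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 1.0, 3.0, 10, 100_000

samples = rng.normal(mu, sigma, size=(trials, n))
xbar = samples.mean(axis=1)  # unbiased; MSE = Var = sigma^2/n = 0.9
shrunk = 0.5 * xbar          # biased toward 0; MSE = 0.25*0.9 + 0.5^2 = 0.475

mse_unbiased = np.mean((xbar - mu) ** 2)
mse_shrunk = np.mean((shrunk - mu) ** 2)
print(mse_unbiased, mse_shrunk)  # the biased estimator wins here
```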
Unbiased Estimation: Poisson Example
X ~ Pois(λ)

Goal: estimate e^(−2λ)

(−1)^X is the best (and only) unbiased estimator of e^(−2λ)

sensible?
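A quick simulation check (with an arbitrary λ of my choosing) that (−1)^X really does have mean e^(−2λ), even though an estimator that only ever takes the values +1 and −1 is absurd for a parameter in (0, 1):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1.5  # arbitrary choice for the check

x = rng.poisson(lam, size=200_000)
est = np.mean((-1.0) ** x)    # average of the "estimator" over many draws
print(est, np.exp(-2 * lam))  # both close to 0.0498
```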
Fisher Weighting
How should we combine independent, unbiased
estimators for a parameter into one estimator?

θ̂ = Σᵢ₌₁ᵏ wᵢ θ̂ᵢ

The weights should sum to one, but how should they be chosen?

wᵢ ∝ 1/Var(θ̂ᵢ)

(Inversely proportional to variance; why not SD?)
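A minimal implementation of inverse-variance weighting (the estimates and variances below are made-up inputs). The combined variance Σ wᵢ²Var(θ̂ᵢ) works out to 1/Σ(1/Var(θ̂ᵢ)), which is the smallest achievable, and is why we weight by 1/variance rather than 1/SD:

```python
import numpy as np

def fisher_combine(estimates, variances):
    """Combine independent unbiased estimates with weights w_i proportional to 1/Var_i."""
    w = 1.0 / np.asarray(variances, dtype=float)
    w /= w.sum()  # normalize so the weights sum to one
    combined = float(np.dot(w, estimates))
    combined_var = float(np.dot(w**2, variances))  # equals 1 / sum(1/Var_i)
    return combined, combined_var

est, var = fisher_combine([2.0, 2.4, 1.8], [1.0, 4.0, 0.5])
print(est, var)  # the low-variance estimate 1.8 gets the most weight
```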


Nate Silver Weighting Method
• Exponential decay based on recency of poll
• Sample size of poll
• Pollster rating

http://fivethirtyeight.com/features/how-the-fivethirtyeight-senate-forecast-model-works/
Multiple Testing, Bonferroni
How should we handle p-values
when testing multiple hypotheses?

For example, what if we are looking at diet (with 10 kinds of food) and disease (with 10 diseases)?

A simple, conservative approach is Bonferroni: divide the significance level by the number of hypotheses being tested.

https://en.wikipedia.org/wiki/Bonferroni_correction
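A sketch of the correction (the function name is mine):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_i iff p_i <= alpha/m; the union bound keeps the family-wise error rate at most alpha."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# In the 10-foods-by-10-diseases example, m = 100, so each of the 100
# tests needs p <= 0.05/100 = 0.0005 to be significant at overall level 0.05.
print(bonferroni_reject([0.001, 0.02, 0.04]))  # m = 3 here: [True, False, False]
```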
[Scatterplot: son's height (inches) vs. father's height (inches), both axes running from 58 to 80, with the regression line.]

plot from Freedman, data from Pearson-Lee


Regression Toward the Mean (RTTM)

Examples are everywhere...

Test scores
Sports
Inherited characteristics, e.g., heights
Traffic accidents at various sites
Daniel Kahneman Quote on RTTM
I had the most satisfying Eureka experience of my career while
attempting to teach flight instructors that praise is more effective
than punishment for promoting skill-learning....

[A flight instructor objected:] “On many occasions I have praised flight cadets for clean execution of some aerobatic maneuver, and in general when they try it again, they do worse. On the other hand, I have often screamed at cadets for bad execution, and in general they do better the next time. So please don’t tell us that reinforcement works and punishment does not...”

This was a joyous moment, in which I understood an important truth about the world: because we tend to reward others when they do well and punish them when they do badly, and because there is regression to the mean, it is part of the human condition that we are statistically punished for rewarding others and rewarded for punishing them.
Regression Paradox
y: child’s height (standardized)
x: parent’s height (standardized)

Regression line: predict y = rx; think of this as a weighted average of the parent’s height and the mean.

Now, what about predicting the parent’s height from the child’s height? Use x = y/r?

No: the regression line is x = ry; the r stays the same!
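This can be checked numerically (simulated standardized parent/child heights with correlation set to r = 0.5; the construction is my own):

```python
import numpy as np

rng = np.random.default_rng(0)
r, n = 0.5, 100_000
x = rng.standard_normal(n)                              # parent, standardized
y = r * x + np.sqrt(1 - r**2) * rng.standard_normal(n)  # child, corr(x, y) close to r

c = np.cov(x, y)[0, 1]
slope_y_on_x = c / np.var(x, ddof=1)  # predict child from parent
slope_x_on_y = c / np.var(y, ddof=1)  # predict parent from child
print(slope_y_on_x, slope_x_on_y)     # both close to r = 0.5, not 0.5 and 1/0.5
```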


Linear Model
often called “OLS” (ordinary least squares), but that puts
the focus on the procedure rather than the model.

y = Xβ + ε, where y is n×1, X is n×k, β is k×1, and ε is n×1
What’s linear about it?
y = Xβ + ε, where y is n×1, X is n×k, β is k×1, and ε is n×1

Linear refers to the fact that we’re taking linear combinations of the predictors. It is still linear if, e.g., we use both x and its square and its cube as predictors.
Sample Quantities vs. Population Quantities
sample version (think of x and y as data vectors):
β̂₁ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²
β̂₀ = ȳ − β̂₁x̄

population version (think of x and y as r.v.s):
y = β₀ + β₁x + ε
E(y) = β₀ + β₁E(x)
cov(y, x) = β₁ cov(x, x)
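The sample formulas translate directly into code (toy data of my own choosing):

```python
import numpy as np

def ols_line(x, y):
    """Return (b0, b1) from the sample formulas: b1 = Sxy/Sxx, b0 = ybar - b1*xbar."""
    xbar, ybar = x.mean(), y.mean()
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    b0 = ybar - b1 * xbar
    return b0, b1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 6.0])
b0, b1 = ols_line(x, y)
print(b0, b1)  # intercept close to 0.5, slope close to 1.4
```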
visualize regression as a projection

[Figure: y projected onto the column space of X; the residual is the component of y orthogonal to that space.]
or as a conditional expectation

[Figure: Y decomposed as E(Y|X) plus Y − E(Y|X), a projection onto the space of all functions of X.]
Gauss-Markov Theorem

Consider a linear model
y = Xβ + ε
where y is n by 1, X is an n by k matrix of covariates, β is a k by 1 vector of parameters, and the errors εⱼ are uncorrelated with equal variance, εⱼ ~ [0, σ²]. The errors do not need to be assumed to be Normally distributed.

Theorem 19.1. Under the above assumptions,
β̂ ≡ (X′X)⁻¹X′y
is BLUE (the Best Linear Unbiased Estimator).

What do we mean by best? Which loss function should we minimize? In this case, the “best” estimator is the one that minimizes the sum of squared errors. That’s why we call it the ordinary least squares estimator. For Normal errors, this is also the MLE.

Proof (beginning): Let β̃ be any linear unbiased estimator, i.e., β̃ = Ay for some matrix A.
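A numerical sketch of the estimator on simulated data (in practice a least-squares solver is preferred to forming (X′X)⁻¹ explicitly, but both give the same answer):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])  # n by k, k = 3
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.standard_normal(n)  # uncorrelated, equal-variance errors

beta_direct = np.linalg.solve(X.T @ X, X.T @ y)     # (X'X)^{-1} X'y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # same answer, stabler
print(beta_direct)  # close to the true [1, 2, -0.5]
```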
Residuals
y = Xβ̂ + e
mirrors
y = Xβ + ε

The residual vector e is orthogonal to all the columns of X.
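The orthogonality is easy to verify numerically (random data of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.standard_normal((50, 2))])
y = rng.standard_normal(50)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat  # residual vector
print(X.T @ e)        # close to [0, 0, 0]: e is orthogonal to every column of X
```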


Residual Plots

Always plot the residuals! (Plot residuals vs. fitted values, and residuals vs. each predictor variable.)

From Faraway (§7.5): if all is well, you should see constant variance in the vertical (ε̂) direction, and the scatter should be symmetric vertically about 0. Things to look for are heteroscedasticity (non-constant variance) and nonlinearity (which indicates some change in the model is necessary). These three cases are illustrated in Figure 7.5.
[Figure 7.5: three residuals-vs-fitted panels, titled “No problem”, “Heteroscedasticity”, and “Nonlinear”.]
Figure 7.5: Residuals vs. fitted plots. The first suggests no change to the current model, while the second shows non-constant variance and the third indicates some nonlinearity, which should prompt some change in the structural form of the model.

Faraway, http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf
“Explained” Variance
var(y) = var(Xβ̂) + var(e)

R² = var(Xβ̂)/var(y) = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)² / Σᵢ₌₁ⁿ (yᵢ − ȳ)²

R² measures goodness of fit, but it does not validate the model. Adding more predictors can only increase R².
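Both claims can be demonstrated in a few lines (simulated data; the extra predictor is pure noise):

```python
import numpy as np

def r_squared(X, y):
    """R^2 as explained variation over total variation (X must include an intercept column)."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat
    return np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(n)

r2 = r_squared(X, y)
r2_noise = r_squared(np.column_stack([X, rng.standard_normal(n)]), y)
print(r2, r2_noise)  # R^2 goes up even though the new predictor is noise
```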
