07 BiasAndRegression
Data Science
Bias and Regression
Hanspeter Pfister, Joe Blitzstein, and Verena Kaynig
This Week
• HW1 due tonight at 11:59 pm (Eastern Time)
• selection bias
• publication bias (file drawer problem)
• non-response bias
• length bias
1936 Presidential Election, Landon vs. FDR
source: https://en.wikipedia.org/wiki/United_States_presidential_election,_1936
???
What about the unobserved planes? Missing data!
Longevity Study from Lombard (1835)
Average age at death by profession:
• professors 66.6
• clocksmiths 55.3
• locksmiths 47.2
• students 20.2
source: http://www.nytimes.com/2012/02/26/health/dealing-with-dementia-among-aging-criminals.html?pagewanted=all
Length-Biasing Paradox
How would you measure the average prison sentence?
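A minimal simulation sketch of the length-biasing effect, under the assumption (made up for illustration) that sentence lengths are exponentially distributed: surveying prisoners who happen to be in prison on a given day oversamples long sentences, so the survey average is roughly E(L²)/E(L) rather than E(L).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sentence lengths (in years) for everyone admitted to prison;
# the exponential distribution is just an illustrative assumption.
sentences = rng.exponential(scale=3.0, size=100_000)

# Unbiased view: average over all admitted prisoners.
mean_by_admission = sentences.mean()

# Length-biased view: survey prisoners who are in prison on a given day.
# A prisoner serving a sentence of length L is "in residence" for L years,
# so the chance of being caught by the survey is proportional to L.
weights = sentences / sentences.sum()
surveyed = rng.choice(sentences, size=100_000, p=weights)
mean_by_survey = surveyed.mean()

print(f"mean sentence (by admission):      {mean_by_admission:.2f} years")
print(f"mean sentence (day-of-survey sample): {mean_by_survey:.2f} years")
# The survey average is roughly E(L^2)/E(L), about twice the true mean
# for an exponential distribution.
```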
Bias of an Estimator
The bias of an estimator θ̂ is how far off it is on average: bias(θ̂) = E(θ̂) − θ
http://scott.fortmann-roe.com/docs/BiasVariance.html
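As a concrete illustration of the definition (not from the slides), here is a quick Monte Carlo check of the bias of the two usual variance estimators, assuming a Normal population with σ² = 4 and a small sample size chosen for the demo:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 4.0          # true variance of the population (assumed Normal here)
n, trials = 10, 100_000

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, n))
var_mle = samples.var(axis=1, ddof=0)        # divide by n   (biased)
var_unbiased = samples.var(axis=1, ddof=1)   # divide by n-1 (unbiased)

# bias(theta_hat) = E(theta_hat) - theta, approximated by averaging over trials
print("bias of 1/n estimator:    ", var_mle.mean() - sigma2)        # ~ -sigma2/n = -0.4
print("bias of 1/(n-1) estimator:", var_unbiased.mean() - sigma2)   # ~ 0
```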
Unbiased Estimation: Poisson Example
X ∼ Pois(λ)
Goal: estimate e^(−2λ)
The only unbiased estimator based on X alone turns out to be (−1)^X. Is that sensible?
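A short simulation sketch of why unbiasedness alone can be a poor criterion; λ = 1.5 is an arbitrary choice for the demo:

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 1.5                     # an arbitrary true rate for the demo
target = np.exp(-2 * lam)     # the quantity we want to estimate, e^(-2*lambda)

X = rng.poisson(lam, size=1_000_000)

unbiased = (-1.0) ** X        # the only unbiased estimator based on a single X
plug_in = np.exp(-2 * X)      # the "natural" but biased plug-in estimator

print("target e^(-2*lambda):", target)
print("mean of (-1)^X:      ", unbiased.mean())   # matches the target on average...
print("mean of e^(-2X):     ", plug_in.mean())    # ...while the plug-in is biased.
# But (-1)^X only ever takes the values -1 and +1, which is absurd as an
# estimate of a quantity that lies in (0, 1).
```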
Fisher Weighting
How should we combine independent, unbiased
estimators for a parameter into one estimator?
θ̂ = Σ_{i=1}^{k} wᵢ θ̂ᵢ
The weights should sum to one, but how should they be chosen?
wᵢ ∝ 1 / Var(θ̂ᵢ)
http://fivethirtyeight.com/features/how-the-fivethirtyeight-senate-forecast-model-works/
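A sketch of Fisher weighting in code (the three variances are made up for the demo): each estimator is unbiased, and weighting by 1/Var gives the minimum-variance combination.

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 5.0                            # true parameter (arbitrary for the demo)
variances = np.array([0.5, 2.0, 8.0])  # variances of three independent unbiased estimators

# Fisher weights: proportional to 1/Var, normalized to sum to one
w = (1 / variances) / (1 / variances).sum()

trials = 200_000
estimates = theta + rng.normal(0, np.sqrt(variances), size=(trials, 3))

fisher = estimates @ w                 # inverse-variance weighted combination
naive = estimates.mean(axis=1)         # equal weights, for comparison

print("Fisher-weighted: mean %.3f, var %.3f" % (fisher.mean(), fisher.var()))
print("Equal-weighted:  mean %.3f, var %.3f" % (naive.mean(), naive.var()))
# Both combinations are unbiased, but the Fisher-weighted one has the smaller
# variance: 1 / sum(1/Var_i) ≈ 0.38 here, vs. sum(Var_i)/9 ≈ 1.17 for equal weights.
```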
Multiple Testing, Bonferroni
How should we handle p-values
when testing multiple hypotheses?
https://en.wikipedia.org/wiki/Bonferroni_correction
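A minimal sketch of the Bonferroni correction: with m = 20 true nulls tested at α = 0.05, the chance of at least one false rejection is about 1 − 0.95²⁰ ≈ 64%, while comparing each p-value to α/m keeps the familywise error rate at α. (The uniform p-values here are simulated, not from any real study.)

```python
import numpy as np

rng = np.random.default_rng(4)
m, alpha = 20, 0.05

# Suppose all 20 null hypotheses are true: the p-values are then Uniform(0, 1).
p_values = rng.uniform(size=m)

naive_rejections = np.sum(p_values < alpha)           # compare each p-value to alpha
bonferroni_rejections = np.sum(p_values < alpha / m)  # compare to alpha / m instead

print("rejections at alpha:    ", naive_rejections)       # often > 0 with no real effects
print("rejections at alpha / m:", bonferroni_rejections)  # familywise error rate held at alpha
```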
[Scatter plot: son's height (inches) vs. father's height (inches)]
Examples of regression to the mean:
• Test scores
• Sports
• Inherited characteristics, e.g., heights
• Traffic accidents at various sites
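A simulation sketch of regression to the mean for heights (the mean, SD, and correlation below are made-up round numbers, not Galton's data): sons of unusually tall fathers are still tall on average, but less extreme than their fathers.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical father/son heights (inches): same mean and SD, correlation ~0.5.
mu, sd, rho, n = 69.0, 2.7, 0.5, 100_000
father = rng.normal(mu, sd, size=n)
son = mu + rho * (father - mu) + rng.normal(0, sd * np.sqrt(1 - rho**2), size=n)

tall = father > mu + sd  # fathers more than one SD above average

print("avg height of tall fathers: %.1f in" % father[tall].mean())
print("avg height of their sons:   %.1f in" % son[tall].mean())
# The sons are still taller than average, but closer to the mean than their
# fathers: regression to the mean, with slope rho on the standardized scale.
```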
Daniel Kahneman Quote on RTTM
I had the most satisfying Eureka experience of my career while
attempting to teach flight instructors that praise is more effective
than punishment for promoting skill-learning....
y = Xβ + ε
(y is n×1, X is n×k, β is k×1, ε is n×1)
What's linear about it? The model is linear in the parameters β; the columns of X may themselves be nonlinear functions of the underlying variables.
y = β₀ + β₁x + ε
population version (think of x and y as r.v.s):
E(y) = β₀ + β₁ E(x)
cov(y, x) = β₁ cov(x, x), so β₁ = cov(x, y) / var(x)
visualize regression as a projection:
[diagram: y is projected onto the column space of X; the projection is ŷ and the residual is y − ŷ]
or as a conditional expectation:
[diagram: Y is projected onto the space of all functions of X; the projection is E(Y|X) and the residual is Y − E(Y|X)]
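A small numerical check of the projection picture (toy data, coefficients chosen arbitrarily): the OLS fit ŷ = Xβ̂ with β̂ = (X′X)⁻¹X′y is the projection of y onto the column space of X, so the residual is orthogonal to every column of X.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy data: n = 200 observations, k = 3 columns of X (including an intercept).
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])          # made-up coefficients for the demo
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# OLS via the normal equations: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ beta_hat      # projection of y onto the column space of X
e = y - y_hat             # residual, orthogonal to that column space

print("beta_hat:", np.round(beta_hat, 3))
print("X'e (should be ~0):", np.round(X.T @ e, 10))  # orthogonality of residuals
```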
n! ≈ n^n e^(−n) √(2πn)  (Stirling's approximation),
where the penultimate line uses that exp(−(x − n)²/(2n)) is small if x is far from n.
Gauss-Markov Theorem

Consider a linear model
y = Xβ + ε
where y is n by 1, X is an n by k matrix of covariates, β is a k by 1 vector of parameters, and the errors εⱼ are uncorrelated with equal variance, εⱼ ∼ [0, σ²]. The errors do not need to be assumed to be Normally distributed.

Theorem 19.1. Under the above assumptions,
β̂ ≡ (X′X)⁻¹ X′y
is BLUE (the Best Linear Unbiased Estimator).

19.2. What do we mean by best? Which loss function should we minimize? In this case, the “best” estimator is the one that minimizes the sum of squared errors. That's why we call it the ordinary least squares estimator. For Normal errors, this is also the MLE.

Proof. Let β̃ be any linear unbiased estimator, i.e. β̃ = Ay for some matrix A.
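A simulation sketch of what BLUE means in practice (toy design matrix and parameters, made up for the demo): any estimator of the form Ay with AX = I is linear and unbiased, and under uncorrelated equal-variance errors the OLS choice has the smallest variance.

```python
import numpy as np

rng = np.random.default_rng(7)

n, sigma = 100, 1.0
X = np.column_stack([np.ones(n), np.linspace(0, 1, n)])
beta = np.array([1.0, 2.0])                      # made-up true parameters

# Any estimator of the form A y with A X = I is linear and unbiased.
A_ols = np.linalg.solve(X.T @ X, X.T)            # OLS: (X'X)^{-1} X'
W = np.diag(rng.uniform(0.2, 5.0, size=n))       # arbitrary reweighting (not the true error structure)
A_other = np.linalg.solve(X.T @ W @ X, X.T @ W)  # also satisfies A X = I, hence unbiased

trials = 20_000
ols_est, other_est = [], []
for _ in range(trials):
    y = X @ beta + rng.normal(scale=sigma, size=n)   # uncorrelated, equal-variance errors
    ols_est.append(A_ols @ y)
    other_est.append(A_other @ y)
ols_est, other_est = np.array(ols_est), np.array(other_est)

print("mean of OLS estimates:   ", ols_est.mean(axis=0))     # both are unbiased for beta
print("mean of other estimates: ", other_est.mean(axis=0))
print("variance of OLS slope:   ", ols_est[:, 1].var())       # OLS has the smaller variance,
print("variance of other slope: ", other_est[:, 1].var())     # as the Gauss-Markov theorem promises
```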
Residuals
y = Xβ̂ + e
mirrors
y = Xβ + ε
Always plot the residuals! (Plot residuals vs. fitted values, and residuals vs. each predictor variable.)
If all is well, you should see constant variance in the vertical (ε̂) direction and the scatter should be symmetric vertically about 0. Things to look for are heteroscedasticity (non-constant variance) and nonlinearity (which indicates some change in the model is necessary). In Figure 7.5, these three cases are illustrated.
[Figure 7.5: Residuals vs. Fitted plots, three panels: “No problem”, “Heteroscedasticity”, “Nonlinear”. The first suggests no change to the current model, while the second shows non-constant variance and the third indicates some nonlinearity which should prompt some change in the structural form of the model.]
Faraway, http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf
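A matplotlib sketch of the recommended diagnostic plots (the data here are simulated; with real data, substitute your own response, fitted values, and predictors):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)

# Toy fit: simple linear model on made-up data.
x = rng.uniform(0, 1, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=200)

X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta_hat
resid = y - fitted

# Residuals vs. fitted values: look for non-constant variance and curvature.
fig, axes = plt.subplots(1, 2, figsize=(9, 4))
axes[0].scatter(fitted, resid, s=10)
axes[0].axhline(0, color="gray", lw=1)
axes[0].set(xlabel="fitted values", ylabel="residuals", title="Residuals vs. fitted")

# Residuals vs. a predictor: same idea, one plot per predictor.
axes[1].scatter(x, resid, s=10)
axes[1].axhline(0, color="gray", lw=1)
axes[1].set(xlabel="x", ylabel="residuals", title="Residuals vs. predictor")

plt.tight_layout()
plt.show()
```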
“Explained” Variance
var(y) = var(Xβ̂) + var(e)

R² = var(Xβ̂) / var(y) = Σ_{i=1}^{n} (ŷᵢ − ȳ)² / Σ_{i=1}^{n} (yᵢ − ȳ)²
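A quick numerical check of the variance decomposition and of R² (simulated data; with an intercept in the model, var(y) = var(ŷ) + var(e) holds exactly):

```python
import numpy as np

rng = np.random.default_rng(9)

# Toy OLS setup: y = X beta + noise, with an intercept column in X.
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat
e = y - y_hat

# Decomposition of variance: var(y) = var(y_hat) + var(e).
print("var(y):             ", y.var())
print("var(y_hat) + var(e):", y_hat.var() + e.var())

# R^2 as the fraction of variance "explained" by the fit.
r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
print("R^2:", r2)
```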