07 BiasAndRegression


CS109/Stat121/AC209/E-109

Data Science
Bias and Regression
Hanspeter Pfister, Joe Blitzstein, and Verena Kaynig

This Week
• HW1 due tonight at 11:59 pm (Eastern Time)

• HW2 posted soon


Census Data from the Current Population Survey (CPS)

“It is important to note that the CPS counts students


living in dormitories as living in their parents’ home.”

– Census Bureau, http://www.census.gov/prod/2013pubs/p20-570.pdf


Some Forms of Bias

• selection bias
• publication bias (file drawer problem)
• non-response bias
• length bias
1936 Presidential Election, Landon vs. FDR

Literary Digest predicted Landon would win with 370 electoral votes, based on a sample size of 2.4 million.

source: https://en.wikipedia.org/wiki/United_States_presidential_election,_1936
1936 Presidential Election, Landon vs. FDR

Literary Digest got responses from 2.3 million out of 10 million people surveyed.

To collect their sample, they used 3 readily available lists:


• readers of their magazine
• car registration list
• phone directory
Wald and the Bullet Holes
What about the unobserved planes? Missing data!

Longevity Study from Lombard (1835)

Profession         Average Longevity
-----------------  -----------------
chocolate makers   73.6
professors         66.6
clocksmiths        55.3
locksmiths         47.2
students           20.2

Sources: Lombard (1835), Wainer (1999), Stigler (2002)


Class Size Paradox

Why do so many schools boast a small average class size, yet so many students end up in huge classes?

Simple example: each student takes one course; suppose there is one course with 100 students and fifty courses with 2 students each.

Dean calculates: (100 + 50*2)/51 ≈ 3.92

Students calculate: (100*100 + 100*2)/200 = 51
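The two averages can be reproduced in a few lines (a minimal sketch using the slide's toy numbers):

```python
# Class size paradox: the dean averages over courses, while students
# experience a size-biased average over enrollments of the same data.
sizes = [100] + [2] * 50  # one 100-student course, fifty 2-student courses

dean_avg = sum(sizes) / len(sizes)                    # per-course average
student_avg = sum(s * s for s in sizes) / sum(sizes)  # per-student average

print(round(dean_avg, 2))  # 3.92
print(student_avg)         # 51.0
```

The student average weights each class size by how many students sit in it, which is exactly the length-biasing at work here.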


“About 10 percent of the 1.6 million inmates in
America’s prisons are serving life sentences;
another 11 percent are serving over 20 years.”

source: http://www.nytimes.com/2012/02/26/health/dealing-with-dementia-among-aging-criminals.html?pagewanted=all
Length-Biasing Paradox
How would you measure the average prison sentence?
Bias of an Estimator
The bias of an estimator θ̂ is how far off it is on average: bias(θ̂) = E(θ̂) − θ.

So why not just subtract off the bias?


Bias-Variance Tradeoff
one form: MSE(θ̂) = Var(θ̂) + bias²(θ̂)

Often a little bit of bias can make it possible to have a much lower MSE.

http://scott.fortmann-roe.com/docs/BiasVariance.html
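A small simulation of the tradeoff (the values μ, σ, n, and the shrinkage factor are illustrative choices, not from the slides): shrinking the sample mean toward zero introduces bias but cuts variance enough to lower the MSE here.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 1.0, 3.0, 10, 100_000

samples = rng.normal(mu, sigma, size=(trials, n))
xbar = samples.mean(axis=1)  # unbiased; MSE = Var = sigma^2/n = 0.9
shrunk = 0.5 * xbar          # biased toward 0; MSE = 0.25*0.9 + 0.5^2 = 0.475

mse_unbiased = np.mean((xbar - mu) ** 2)
mse_shrunk = np.mean((shrunk - mu) ** 2)
print(mse_unbiased, mse_shrunk)  # the biased estimator wins here
```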
Unbiased Estimation: Poisson Example
X ~ Pois(λ)

Goal: estimate e^(−2λ)

(−1)^X is the best (and only) unbiased estimator of e^(−2λ)

sensible?
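A quick simulation check (with an arbitrary λ of my choosing) that (−1)^X really does have mean e^(−2λ), even though an estimator that only ever takes the values +1 and −1 is absurd for a parameter in (0, 1):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1.5  # arbitrary choice for the check

x = rng.poisson(lam, size=200_000)
est = np.mean((-1.0) ** x)    # average of the "estimator" over many draws
print(est, np.exp(-2 * lam))  # both close to 0.0498
```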
Fisher Weighting
How should we combine independent, unbiased
estimators for a parameter into one estimator?

θ̂ = Σᵢ₌₁ᵏ wᵢ θ̂ᵢ

The weights should sum to one, but how should they be chosen?

wᵢ ∝ 1/Var(θ̂ᵢ)

(Inversely proportional to variance; why not SD?)
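A minimal implementation of inverse-variance weighting (the estimates and variances below are made-up inputs). The combined variance Σ wᵢ²Var(θ̂ᵢ) works out to 1/Σ(1/Var(θ̂ᵢ)), which is the smallest achievable, and is why we weight by 1/variance rather than 1/SD:

```python
import numpy as np

def fisher_combine(estimates, variances):
    """Combine independent unbiased estimates with weights w_i proportional to 1/Var_i."""
    w = 1.0 / np.asarray(variances, dtype=float)
    w /= w.sum()  # normalize so the weights sum to one
    combined = float(np.dot(w, estimates))
    combined_var = float(np.dot(w**2, variances))  # equals 1 / sum(1/Var_i)
    return combined, combined_var

est, var = fisher_combine([2.0, 2.4, 1.8], [1.0, 4.0, 0.5])
print(est, var)  # the low-variance estimate 1.8 gets the most weight
```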


Nate Silver Weighting Method
• Exponential decay based on recency of poll
• Sample size of poll
• Pollster rating

http://fivethirtyeight.com/features/how-the-fivethirtyeight-senate-forecast-model-works/
Multiple Testing, Bonferroni
How should we handle p-values
when testing multiple hypotheses?

For example, what if we are looking at diet (with 10 kinds of food) and disease (with 10 diseases)?

A simple, conservative approach is Bonferroni: divide the significance level by the number of hypotheses being tested.

https://en.wikipedia.org/wiki/Bonferroni_correction
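A sketch of the correction (the function name is mine):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_i iff p_i <= alpha/m; the union bound keeps the family-wise error rate at most alpha."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# In the 10-foods-by-10-diseases example, m = 100, so each of the 100
# tests needs p <= 0.05/100 = 0.0005 to be significant at overall level 0.05.
print(bonferroni_reject([0.001, 0.02, 0.04]))  # m = 3 here: [True, False, False]
```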
[Scatterplot: son's height (inches) vs. father's height (inches), both axes running from 58 to 80, with the regression line.]

plot from Freedman, data from Pearson-Lee


Regression Toward the Mean (RTTM)

Examples are everywhere...

Test scores
Sports
Inherited characteristics, e.g., heights
Traffic accidents at various sites
Daniel Kahneman Quote on RTTM
I had the most satisfying Eureka experience of my career while
attempting to teach flight instructors that praise is more effective
than punishment for promoting skill-learning....

[A flight instructor objected:] “On many occasions I have praised flight cadets for clean execution of some aerobatic maneuver, and in general when they try it again, they do worse. On the other hand, I have often screamed at cadets for bad execution, and in general they do better the next time. So please don’t tell us that reinforcement works and punishment does not...”

This was a joyous moment, in which I understood an important truth about the world: because we tend to reward others when they do well and punish them when they do badly, and because there is regression to the mean, it is part of the human condition that we are statistically punished for rewarding others and rewarded for punishing them.
Regression Paradox
y: child’s height (standardized)
x: parent’s height (standardized)

Regression line: predict y = rx; think of this as a weighted average of the parent’s height and the mean.

Now, what about predicting the parent’s height from the child’s height? Use x = y/r?

No: the regression line is x = ry; the r stays the same!
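This can be checked numerically (simulated standardized parent/child heights with correlation set to r = 0.5; the construction is my own):

```python
import numpy as np

rng = np.random.default_rng(0)
r, n = 0.5, 100_000
x = rng.standard_normal(n)                              # parent, standardized
y = r * x + np.sqrt(1 - r**2) * rng.standard_normal(n)  # child, corr(x, y) close to r

c = np.cov(x, y)[0, 1]
slope_y_on_x = c / np.var(x, ddof=1)  # predict child from parent
slope_x_on_y = c / np.var(y, ddof=1)  # predict parent from child
print(slope_y_on_x, slope_x_on_y)     # both close to r = 0.5, not 0.5 and 1/0.5
```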


Linear Model
often called “OLS” (ordinary least squares), but that puts
the focus on the procedure rather than the model.

y = Xβ + ε, where y is n×1, X is n×k, β is k×1, and ε is n×1
What’s linear about it?
y = Xβ + ε, where y is n×1, X is n×k, β is k×1, and ε is n×1

Linear refers to the fact that we’re taking linear combinations of the predictors. It is still linear if, e.g., we use both x and its square and its cube as predictors.
Sample Quantities vs. Population Quantities
sample version (think of x and y as data vectors):
β̂₁ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²
β̂₀ = ȳ − β̂₁x̄

population version (think of x and y as r.v.s):
y = β₀ + β₁x + ε
E(y) = β₀ + β₁E(x)
cov(y, x) = β₁ cov(x, x)
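The sample formulas translate directly into code (toy data of my own choosing):

```python
import numpy as np

def ols_line(x, y):
    """Return (b0, b1) from the sample formulas: b1 = Sxy/Sxx, b0 = ybar - b1*xbar."""
    xbar, ybar = x.mean(), y.mean()
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    b0 = ybar - b1 * xbar
    return b0, b1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 6.0])
b0, b1 = ols_line(x, y)
print(b0, b1)  # intercept close to 0.5, slope close to 1.4
```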
visualize regression as a projection

[Figure: y projected onto the column space of X; the residual is the component of y orthogonal to that space.]
or as a conditional expectation

[Figure: Y decomposed as E(Y|X) plus Y − E(Y|X), a projection onto the space of all functions of X.]
Gauss-Markov Theorem

Consider a linear model
y = Xβ + ε
where y is n by 1, X is an n by k matrix of covariates, β is a k by 1 vector of parameters, and the errors εⱼ are uncorrelated with equal variance, εⱼ ~ [0, σ²]. The errors do not need to be assumed to be Normally distributed.

Theorem 19.1. Under the above assumptions,
β̂ ≡ (X′X)⁻¹X′y
is BLUE (the Best Linear Unbiased Estimator).

What do we mean by best? Which loss function should we minimize? In this case, the “best” estimator is the one that minimizes the sum of squared errors. That’s why we call it the ordinary least squares estimator. For Normal errors, this is also the MLE.

Proof (beginning): Let β̃ be any linear unbiased estimator, i.e., β̃ = Ay for some matrix A.
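A numerical sketch of the estimator on simulated data (in practice a least-squares solver is preferred to forming (X′X)⁻¹ explicitly, but both give the same answer):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])  # n by k, k = 3
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.standard_normal(n)  # uncorrelated, equal-variance errors

beta_direct = np.linalg.solve(X.T @ X, X.T @ y)     # (X'X)^{-1} X'y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # same answer, stabler
print(beta_direct)  # close to the true [1, 2, -0.5]
```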
Residuals
y = Xβ̂ + e
mirrors
y = Xβ + ε

The residual vector e is orthogonal to all the columns of X.
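The orthogonality is easy to verify numerically (random data of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.standard_normal((50, 2))])
y = rng.standard_normal(50)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat  # residual vector
print(X.T @ e)        # close to [0, 0, 0]: e is orthogonal to every column of X
```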


Residual Plots

Always plot the residuals! (Plot residuals vs. fitted values, and residuals vs. each predictor variable.)

From Faraway (§7.5): if all is well, you should see constant variance in the vertical (ε̂) direction, and the scatter should be symmetric vertically about 0. Things to look for are heteroscedasticity (non-constant variance) and nonlinearity (which indicates some change in the model is necessary). These three cases are illustrated in Figure 7.5.
[Figure 7.5: three residuals-vs-fitted panels, titled “No problem”, “Heteroscedasticity”, and “Nonlinear”.]
Figure 7.5: Residuals vs. fitted plots. The first suggests no change to the current model, while the second shows non-constant variance and the third indicates some nonlinearity, which should prompt some change in the structural form of the model.

Faraway, http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf
“Explained” Variance
var(y) = var(Xβ̂) + var(e)

R² = var(Xβ̂)/var(y) = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)² / Σᵢ₌₁ⁿ (yᵢ − ȳ)²

R² measures goodness of fit, but it does not validate the model. Adding more predictors can only increase R².
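Both claims can be demonstrated in a few lines (simulated data; the extra predictor is pure noise):

```python
import numpy as np

def r_squared(X, y):
    """R^2 as explained variation over total variation (X must include an intercept column)."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat
    return np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(n)

r2 = r_squared(X, y)
r2_noise = r_squared(np.column_stack([X, rng.standard_normal(n)]), y)
print(r2, r2_noise)  # R^2 goes up even though the new predictor is noise
```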
