Model Selection-Handout PDF
Recall the linear model

    Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon.

In the lectures that follow, we consider some approaches for extending the linear model framework.
Why consider alternatives to least squares? Prediction accuracy: especially when p > n, to control the variance. Model interpretability: by removing irrelevant features, that is, by setting the corresponding coefficient estimates to zero, we can obtain a model that is more easily interpreted.
Subset Selection
Best subset and stepwise model selection procedures
Best Subset Selection
1. Let M0 denote the null model, which contains no
predictors. This model simply predicts the sample mean
for each observation.
2. For k = 1, 2, \ldots, p:
   (a) Fit all \binom{p}{k} models that contain exactly k predictors.
   (b) Pick the best among these \binom{p}{k} models, and call it M_k. Here best is defined as having the smallest RSS, or equivalently largest R^2.
3. Select a single best model from among M_0, \ldots, M_p using cross-validated prediction error, C_p, AIC, BIC, or adjusted R^2.
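As a concrete illustration, the enumeration in steps 1 and 2 can be sketched in a few lines of Python. The data here is synthetic and the helper `rss` is invented for this sketch; this is not a production implementation.

```python
# Best subset selection, a minimal sketch: for each size k, fit all C(p, k)
# least squares models and keep the one with smallest RSS.
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)  # only predictors 0 and 1 are truly active


def rss(X_sub, y):
    """RSS of a least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta) ** 2)


best = {}  # best[k] = (RSS, predictor indices) for the best k-predictor model M_k
for k in range(1, p + 1):
    # steps 2(a)-(b): fit all C(p, k) models, keep the one with smallest RSS
    best[k] = min((rss(X[:, list(c)], y), c) for c in combinations(range(p), k))

for k, (r, c) in best.items():
    print(k, c, round(r, 1))
```

Note that RSS necessarily decreases as k grows, which is why step 3 must use a criterion other than training RSS to pick the final model size.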
[Figure: RSS and R^2 for best subset selection, plotted against the number of predictors (up to 10). RSS decreases, and R^2 increases, monotonically with model size.]
Stepwise Selection
For computational reasons, best subset selection cannot be applied with very large p. Best subset selection may also suffer from statistical problems when p is large: the larger the search space, the higher the chance of finding models that look good on the training data, even though they might have no predictive power on future data. Thus an enormous search space can lead to overfitting and high variance of the coefficient estimates. For both of these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection.
In Detail
Forward Stepwise Selection
1. Let M0 denote the null model, which contains no
predictors.
2. For k = 0, \ldots, p-1:
   (a) Consider all p-k models that augment the predictors in M_k with one additional predictor.
   (b) Choose the best among these p-k models, and call it M_{k+1}. Here best is defined as having smallest RSS or highest R^2.
3. Select a single best model from among M_0, \ldots, M_p using cross-validated prediction error, C_p, AIC, BIC, or adjusted R^2.
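The greedy loop in forward stepwise selection can be sketched as follows. The data is synthetic and the helper `rss` is invented for this illustration; it is a sketch of the idea, not a library implementation.

```python
# Forward stepwise selection, a greedy sketch: start from the null model and
# add, at each step, the single predictor that most reduces RSS.
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 5
X = rng.normal(size=(n, p))
y = 3 * X[:, 2] + X[:, 4] + rng.normal(size=n)  # predictors 2 and 4 are truly active


def rss(cols):
    """RSS of a least squares fit with an intercept on the given columns."""
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta) ** 2)


selected = []  # M_0 is the null model
path = []      # the nested sequence of models M_1, M_2, ..., M_p
for _ in range(p):
    # consider all p - k models that add one predictor to M_k
    remaining = [j for j in range(p) if j not in selected]
    j_best = min(remaining, key=lambda j: rss(selected + [j]))
    selected.append(j_best)
    path.append(list(selected))

print(path)
```

Because each model extends the previous one, forward stepwise fits only 1 + p(p+1)/2 models instead of the 2^p considered by best subset selection.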
The computational advantage of forward stepwise selection over best subset selection is clear. However, it is not guaranteed to find the best possible model out of all 2^p models containing subsets of the p predictors.
# Variables   Best subset                       Forward stepwise
One           rating                            rating
Two           rating, income                    rating, income
Three         rating, income, student           rating, income, student
Four          cards, income, student, limit     rating, income, student, limit

The first four selected models for best subset selection and forward stepwise selection on the Credit data set. The first three models are identical but the fourth differs.
Backward Stepwise Selection

1. Let M_p denote the full model, which contains all p predictors.
2. For k = p, p-1, \ldots, 1:
   (a) Consider all k models that contain all but one of the predictors in M_k, for a total of k-1 predictors.
   (b) Choose the best among these k models, and call it M_{k-1}. Here best is defined as having smallest RSS or highest R^2.
3. Select a single best model from among M_0, \ldots, M_p using cross-validated prediction error, C_p, AIC, BIC, or adjusted R^2.
[Figure: C_p, BIC, and adjusted R^2 for the best model of each size, plotted against the number of predictors (up to 10).]
C_p and AIC

Mallow's C_p:

    C_p = \frac{1}{n}\left(\mathrm{RSS} + 2 d \hat{\sigma}^2\right),

where d is the total # of parameters used and \hat{\sigma}^2 is an estimate of the variance of the error \epsilon associated with each response measurement.

The AIC criterion is defined for a large class of models fit by maximum likelihood:

    \mathrm{AIC} = -2 \log L + 2 d,

where L is the maximized value of the likelihood function for the estimated model.

In the case of the linear model with Gaussian errors, maximum likelihood and least squares are the same thing, and C_p and AIC are equivalent. Prove this.
Details on BIC

    \mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\, d\, \hat{\sigma}^2\right).

Like C_p, the BIC will tend to take on a small value for a model with a low test error. Notice that BIC replaces the 2 d \hat{\sigma}^2 used by C_p with a \log(n)\, d\, \hat{\sigma}^2 term, where n is the number of observations. Since \log n > 2 for any n > 7, the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than C_p.
Adjusted R^2

For a least squares model with d variables, the adjusted R^2 statistic is calculated as

    \text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)},

where TSS is the total sum of squares. Unlike C_p, AIC, and BIC, for which a small value indicates a good model, a large value of adjusted R^2 indicates a model with a small test error.
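The criteria above are all cheap to compute from a single least squares fit. A minimal sketch, using synthetic data and taking \hat{\sigma}^2 from the fitted model's residuals (in practice it is usually estimated from the full model):

```python
# Computing C_p, BIC, and adjusted R^2 for a least squares fit,
# following the formulas on the slides above.
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
RSS = float(resid @ resid)
TSS = float(((y - y.mean()) ** 2).sum())
sigma2_hat = RSS / (n - d - 1)  # unbiased estimate of the error variance

Cp = (RSS + 2 * d * sigma2_hat) / n
BIC = (RSS + np.log(n) * d * sigma2_hat) / n
adj_R2 = 1 - (RSS / (n - d - 1)) / (TSS / (n - 1))

print(Cp, BIC, adj_R2)
```

Since n = 100 here, \log n > 2, so the BIC value exceeds C_p for the same model, reflecting its heavier penalty on model size.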
[Figure: cross-validation error plotted against the number of predictors (4 to 10) for the best model of each size.]
Shrinkage Methods
Ridge regression and Lasso
The subset selection methods use least squares to fit a linear model that contains a subset of the predictors. As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.
Ridge regression
Recall that the least squares fitting procedure estimates \beta_0, \beta_1, \ldots, \beta_p using the values that minimize

    \mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2.

In contrast, the ridge regression coefficient estimates \hat{\beta}^R are the values that minimize

    \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2,

where \lambda \ge 0 is a tuning parameter, to be determined separately.

As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small. However, the second term, \lambda \sum_j \beta_j^2, called a shrinkage penalty, is small when \beta_1, \ldots, \beta_p are close to zero, and so it has the effect of shrinking the estimates of \beta_j towards zero. The tuning parameter \lambda serves to control the relative impact of these two terms on the regression coefficient estimates.
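The ridge criterion has a closed-form minimizer, \hat{\beta}^R_\lambda = (X^\top X + \lambda I)^{-1} X^\top y. A minimal sketch on synthetic data, with predictors standardized and the intercept handled by centering:

```python
# Ridge regression via its closed form: beta_hat = (X'X + lambda*I)^{-1} X'y.
import numpy as np

rng = np.random.default_rng(3)
n, p = 80, 5
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(size=n)

Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize the predictors
yc = y - y.mean()                          # centering removes the intercept


def ridge(lam):
    """Closed-form ridge coefficients for penalty lambda = lam."""
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)


# As lambda grows, the l2 norm of the coefficient vector shrinks toward zero.
norms = [np.linalg.norm(ridge(lam)) for lam in (0.0, 10.0, 1000.0)]
print(norms)
```

At \lambda = 0 the ridge estimates coincide with least squares; as \lambda increases, the coefficient norm decreases monotonically, matching the shrinkage behaviour in the figure below.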
[Figure: standardized ridge regression coefficients for Income, Limit, Rating, and Student in the Credit data, plotted against \lambda and against \|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2.]
In the previous figure, \|\beta\|_2 denotes the \ell_2 norm of a coefficient vector, defined as \|\beta\|_2 = \sqrt{\sum_{j=1}^{p} \beta_j^2}.
Scaling of predictors: the ridge solutions can change substantially when a predictor is rescaled, so it is best to apply ridge regression after standardizing the predictors, using the formula

    \tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}}.
[Figure: on simulated data, squared bias, variance, and test mean squared error for ridge regression, plotted against \lambda and against \|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2.]
The Lasso

Ridge regression does have one obvious disadvantage: unlike subset selection, it includes all p predictors in the final model. The lasso is an alternative that overcomes this disadvantage; the lasso coefficient estimates \hat{\beta}^L_\lambda minimize

    \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|.

As with ridge regression, the lasso shrinks the coefficient estimates towards zero. However, the \ell_1 penalty forces some of the coefficient estimates to be exactly zero when \lambda is sufficiently large, so, much like best subset selection, the lasso performs variable selection.
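The variable-selection property can be seen directly with a small solver. A minimal sketch using coordinate descent and the soft-thresholding operator (a standard approach for the lasso, though not the only one); the data is synthetic and this is not a production solver:

```python
# Lasso via coordinate descent: each coordinate update applies the
# soft-thresholding operator S(z, gamma) = sign(z) * max(|z| - gamma, 0),
# which can set coefficients to exactly zero.
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 5
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized predictors
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=n)
y = y - y.mean()                           # centered response (no intercept)


def soft_threshold(z, gamma):
    return np.sign(z) * max(abs(z) - gamma, 0.0)


def lasso_cd(lam, n_iter=200):
    """Minimize (1/2n)||y - X beta||^2 + lam * ||beta||_1 by coordinate descent."""
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]  # partial residual for coord j
            beta[j] = soft_threshold(X[:, j] @ r_j / n, lam) / (X[:, j] @ X[:, j] / n)
    return beta


beta = lasso_cd(lam=0.5)
print(np.round(beta, 2))
```

With this penalty level, the coefficients of the irrelevant predictors are driven to exactly zero, while the two truly active coefficients survive in shrunken form.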
[Figure: standardized lasso coefficients for Income, Limit, Rating, and Student in the Credit data, plotted against \lambda and against \|\hat{\beta}^L_\lambda\|_1 / \|\hat{\beta}\|_1.]
One can show that the lasso and ridge regression coefficient estimates solve the problems

    \underset{\beta}{\text{minimize}} \; \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s

and

    \underset{\beta}{\text{minimize}} \; \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le s,

respectively.
[Figures: comparison of the lasso and ridge regression on simulated data; squared bias, variance, and test mean squared error plotted against \lambda and against R^2 on the training data.]
Conclusions

Neither ridge regression nor the lasso will universally dominate the other. In general, one might expect the lasso to perform better when the response is a function of only a relatively small number of predictors; however, the number of predictors related to the response is never known a priori, so a technique such as cross-validation can be used to determine which approach is better on a particular data set.
[Figures: cross-validation error and standardized lasso coefficients, plotted against \lambda and against \|\hat{\beta}^L_\lambda\|_1 / \|\hat{\beta}\|_1, illustrating selection of the tuning parameter by cross-validation.]
Dimension Reduction Methods

Let Z_1, Z_2, \ldots, Z_M represent M < p linear combinations of our original p predictors:

    Z_m = \sum_{j=1}^{p} \phi_{mj} X_j.    (1)

We can then fit the linear regression model

    y_i = \theta_0 + \sum_{m=1}^{M} \theta_m z_{im} + \epsilon_i, \qquad i = 1, \ldots, n,    (2)

using ordinary least squares. Notice that

    \sum_{m=1}^{M} \theta_m z_{im} = \sum_{m=1}^{M} \theta_m \sum_{j=1}^{p} \phi_{mj} x_{ij} = \sum_{j=1}^{p} \sum_{m=1}^{M} \theta_m \phi_{mj} x_{ij} = \sum_{j=1}^{p} \beta_j x_{ij},

where

    \beta_j = \sum_{m=1}^{M} \theta_m \phi_{mj}.    (3)

Hence model (2) can be thought of as a special case of the original linear regression model, with the \beta_j constrained to take the form (3).
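Taking the Z_m to be the first M principal components of X gives principal components regression (PCR). A minimal sketch on synthetic data, computing the loadings \phi_{mj} from the SVD of the centered predictor matrix and recovering \beta via the identity in (3):

```python
# Principal components regression: regress y on the first M principal
# component scores z_im, then map theta back to beta_j = sum_m theta_m phi_mj.
import numpy as np

rng = np.random.default_rng(5)
n, p, M = 100, 6, 2
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)                    # center the predictors
y = Xc @ rng.normal(size=p) + rng.normal(size=n)
yc = y - y.mean()                          # center the response

# principal component directions (loadings) from the SVD of centered X
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
phi = Vt[:M].T            # p x M matrix of loadings phi_mj
Z = Xc @ phi              # n x M matrix of scores z_im

theta, *_ = np.linalg.lstsq(Z, yc, rcond=None)  # least squares on the scores
beta_pcr = phi @ theta    # equation (3): beta_j = sum_m theta_m phi_mj

# fitted values agree whether computed from Z @ theta or Xc @ beta_pcr
print(np.allclose(Z @ theta, Xc @ beta_pcr))
```

The final check confirms the algebra above: fitting on the M scores is exactly a constrained linear model in the original p predictors.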
Pictures of PCA

[Figure: ad spending plotted against population for a number of cities, with the first principal component direction.]

[Figure: plots of the first principal component scores z_{i1} versus pop and ad. The relationships are strong.]
[Figure: plots of the second principal component scores z_{i2} versus pop and ad. The relationships are weak.]
[Figure: PCR was applied to two simulated data sets. The black, green, and purple lines correspond to squared bias, variance, and test mean squared error, respectively, plotted against the number of components. Left: simulated data from slide 32. Right: simulated data from slide 39.]
[Figure: PCR applied to the Credit data: standardized coefficients for Income, Limit, Rating, and Student, and cross-validation MSE, plotted against the number of components (up to 10).]
Summary

Model selection methods are an essential tool for data analysis, especially for big data sets involving many predictors. Research into methods that give sparsity, such as the lasso, is an especially hot area.