Introduction to Statistical Learning
(ISLR 2.1)
Yingbo Li
Clemson University
MATH 8050
Yingbo Li (Clemson) Intro to Statistical Learning MATH 8050 1 / 16
Outline
1 Why Estimate f
2 How to Estimate f
3 Trade-Off: Prediction Accuracy and Model Interpretability
4 Supervised vs Unsupervised Learning
5 Regression vs Classification
Why Estimate f
The Advertising data
For n = 200 different markets
Sales: sales of the product in this market (Y )
TV: advertising budget for TV (X1 )
Radio: advertising budget for radio (X2 )
Newspaper: advertising budget for newspaper (X3 )
Why Estimate f
We believe that there is a relationship between Y and X
Y : output variable, response
X = (X1 , X2 , X3 ): input variables, predictors
[Figure: scatterplots of Sales against the TV, Radio, and Newspaper advertising budgets]
Why Estimate f
Model the relationship between Y and X
The regression function
Y = f(X) + ε
f: unknown function
ε: random error with mean zero, i.e., E(ε) = 0.
In the Advertising example:
f (X1 , X2 , X3 ) = E(Y | X1 , X2 , X3 )
Statistical learning, and this course, are all about how to estimate f .
Why?
Prediction.
Inference.
Why Estimate f
Prediction
If we can get a good estimate for f, we can make accurate predictions for
the response Y , based on a new value of X.
For a new market, given the three media budgets, what are the sales?
We just want to predict sales, not to learn which medium is more important.
Suppose our estimate of f is f̂. Then for input X, the output Y is predicted as
Ŷ = f̂(X)
Mean squared error:
E(Y − Ŷ)² = E[f(X) − f̂(X)]² + Var(ε)
The first term is the reducible error; the second is the irreducible error.
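This decomposition can be checked numerically. The sketch below assumes a hypothetical true function f(x) = 2 + 3x, Gaussian errors with Var(ε) = 0.25, and a deliberately imperfect estimate f̂; none of these come from the Advertising data.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical true regression function (an assumption for this sketch)
    return 2.0 + 3.0 * x

n = 100_000
x = rng.uniform(0.0, 1.0, n)
eps = rng.normal(0.0, 0.5, n)        # random error, Var(eps) = 0.25
y = f(x) + eps

def f_hat(x):
    # A deliberately imperfect estimate of f
    return 2.1 + 2.9 * x

mse = np.mean((y - f_hat(x)) ** 2)               # E(Y - Yhat)^2
reducible = np.mean((f(x) - f_hat(x)) ** 2)      # E[f(X) - fhat(X)]^2
irreducible = 0.25                               # Var(eps), known by construction
print(mse, reducible + irreducible)              # the two sides nearly agree
```

Improving f̂ shrinks the reducible term toward zero, but the mean squared error can never fall below Var(ε).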
Why Estimate f
Inference
We are often interested in understanding the relationship between Y
and each of X1, . . . , Xp. For example:
1 Which predictors actually affect the response?
2 Is the relationship positive or negative?
3 Is the relationship a simple linear one or is it more complicated?
How much impact does the TV budget have on sales?
Which media generate the biggest boost in sales?
How to Estimate f
How to estimate f
Use the training data and a statistical method to estimate f .
We have observed a set of training data
{(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )},
where each xi = (xi,1 , xi,2 , . . . , xi,p )′ is a vector of p predictor values, and yi is a scalar response.
Statistical learning methods:
- parametric
- non-parametric
How to Estimate f
Income vs Education, Seniority
[Figure: 3D plot of Income as a function of Years of Education and Seniority]
How to Estimate f
Parametric methods
Parametric methods reduce the problem of estimating f to one of estimating a (finite)
set of parameters. A two-step, model-based approach:
1 Come up with a model (some functional form assumption about f ).
The most common example is a linear model.
f (X) = β0 + β1 X1 + β2 X2 + · · · + βp Xp
- We only need to estimate p + 1 parameters β0 , β1 , . . . , βp .
- Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true f (X).
2 Use the training data to fit the model.
Estimate the unknown parameters, yielding β̂0 , β̂1 , . . . , β̂p .
- The most common approach is ordinary least squares (OLS).
- We will see later that other approaches can be superior.
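As a sketch of step 2, OLS can be carried out with numpy's least-squares solver. The data below are simulated from hypothetical coefficients (an assumption for this illustration), not taken from the Advertising data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated training data with p = 2 predictors (hypothetical coefficients)
n, p = 200, 2
X = rng.uniform(0.0, 100.0, (n, p))
beta_true = np.array([5.0, 0.05, 0.10])          # beta_0, beta_1, beta_2
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0.0, 1.0, n)

# OLS: minimize ||y - X_design @ beta||^2
X_design = np.column_stack([np.ones(n), X])      # prepend an intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)                                  # estimates of beta_0, beta_1, beta_2
```

With n = 200 observations and only p + 1 = 3 parameters, the estimates land close to the true coefficients.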
How to Estimate f
A linear model f̂L (X) = β0 + β1 X gives a reasonable fit here.
A quadratic model f̂Q (X) = β0 + β1 X + β2 X² fits slightly better.
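A minimal sketch of this comparison, assuming hypothetical data simulated from a quadratic: both models are fit by least squares via numpy.polyfit, and the quadratic attains the lower training MSE.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data with genuine curvature (an assumption for this sketch)
x = np.linspace(0.0, 10.0, 200)
y = 1.0 + 0.5 * x + 0.2 * x**2 + rng.normal(0.0, 1.0, x.size)

mse = {}
for degree in (1, 2):                            # linear vs quadratic fit
    coefs = np.polyfit(x, y, degree)             # least-squares polynomial fit
    mse[degree] = np.mean((y - np.polyval(coefs, x)) ** 2)
print(mse)                                       # training MSE drops at degree 2
```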
How to Estimate f
A linear regression fit to the Income data
[Figure: a linear regression plane fit to the Income data over Years of Education and Seniority]
Income = β0 + β1 × Education + β2 × Seniority
How to Estimate f
Non-parametric methods
They do not make explicit assumptions about the functional form of f .
Advantage: they can accurately fit a wider range of possible shapes of f .
Disadvantage: they require a large n to obtain an accurate estimate.
[Figure: two thin-plate spline fits to the Income data]
A smooth thin-plate spline fit is flexible; a rough thin-plate spline fit overfits the training data.
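Thin-plate splines need a specialized fitting routine; as a dependency-light sketch of the non-parametric idea, the example below uses k-nearest-neighbors averaging (a different non-parametric method, used here only for illustration) on hypothetical data from a sine curve. No functional form for f is assumed in the fit.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical training data from a nonlinear f; the fit assumes nothing about its form
x_train = rng.uniform(0.0, 2.0 * np.pi, 300)
y_train = np.sin(x_train) + rng.normal(0.0, 0.2, x_train.size)

def knn_predict(x0, k=15):
    """Average the responses of the k training points nearest to x0."""
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[idx].mean()

x_grid = np.linspace(0.0, 2.0 * np.pi, 50)
y_hat = np.array([knn_predict(x0) for x0 in x_grid])
```

A small k gives a rough, overfit curve; a large k gives a smoother one — the same flexibility trade-off as the two spline fits above.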
Trade-Off: Prediction Accuracy and Model Interpretability
Some trade-offs
Prediction accuracy vs model interpretability
Linear models are easy to interpret; thin-plate splines are not.
Good fit vs over-fit
A model that overfits the training data may not predict well.
[Figure: trade-off between flexibility and interpretability. From high interpretability and low flexibility to low interpretability and high flexibility: Subset Selection and Lasso; Least Squares; Generalized Additive Models and Trees; Bagging, Boosting, and Support Vector Machines]
Supervised vs Unsupervised Learning
Supervised vs unsupervised learning
Supervised learning: both X and Y are available
Unsupervised learning: only X is available; there is no Y .
- Example: market segmentation, where we try to divide potential customers into groups based on their characteristics.
- A common approach is clustering.
[Figure: two clustering examples plotted in the (X1, X2) plane]
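The market-segmentation example can be sketched with a minimal K-means (Lloyd's algorithm) on hypothetical two-group data; only the inputs (X1, X2) are used, and no response Y appears anywhere.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two hypothetical customer groups in the (X1, X2) plane
group_a = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[8.0, 6.0], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# Minimal K-means (Lloyd's algorithm) with K = 2 clusters,
# initialized deterministically with one point from each end of the array
centers = X[[0, -1]].copy()
for _ in range(20):
    # Assign each point to its nearest center, then recompute the centers
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(centers)   # one center near each group mean
```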
Regression vs Classification
Regression vs classification
Regression: Y is continuous (quantitative).
- Predicting the value of the Dow in 6 months.
- Predicting the value of a given house based on various inputs.
Classification: Y is categorical (qualitative).
- Will the Dow be up (U) or down (D) in 6 months?
- Is this email spam or not?