ISLR
Parametric Methods
Parametric methods involve a two-step model-based approach.
1. Choose/assume a functional form for f(X)
2. Fit/train the model on the data
Non-parametric methods
Make no explicit assumption about what f(X) looks like
Problems with a qualitative (categorical) response are called classification problems; those with a
quantitative response are regression problems.
Mean Squared Error: MSE = (1/n) Σ_{i=1}^n (y_i − f̂(x_i))²
We are not that interested in minimizing training MSE; what we really want is accurate prediction
on new data, i.e. a small test MSE. Even though we minimize training MSE, and training MSE is
correlated with test MSE, the test MSE will generally still be larger.
See page 31 for graphs.
Flexibility = level of fit = degrees of freedom
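A minimal simulation sketch of this point (not from the book: the sinusoidal true f, the noise level and the sample sizes are made up), using polynomial degree as the measure of flexibility:

```python
# Sketch: training MSE keeps falling with flexibility, test MSE is U-shaped.
import numpy as np

rng = np.random.default_rng(0)

def f(x):                         # hypothetical "true" regression function
    return np.sin(2 * x)

x_train = rng.uniform(0, 3, 50)
y_train = f(x_train) + rng.normal(0, 0.3, 50)
x_test = rng.uniform(0, 3, 200)
y_test = f(x_test) + rng.normal(0, 0.3, 200)

for degree in [1, 3, 10, 15]:     # flexibility ~ polynomial degree
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    print(degree, round(train_mse, 3), round(test_mse, 3))
```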
Training error rate = (1/n) Σ_{i=1}^n I(y_i ≠ ŷ_i)
I(y_i ≠ ŷ_i) is an indicator variable: 1 when the prediction is wrong and 0 when ŷ_i = y_i. The equation
computes the fraction of incorrect classifications.
Bayes Method
The Bayes classifier assigns each observation to the class with the largest conditional probability
P(Y = j | X = x0) (for two classes, whichever side of 50% it falls on).
Bayes decision boundary - the curve along which the probability is exactly 50%
Bayes Error Rate:
1 − E( max_j P(Y = j | X) )
The Bayes classifier is considered the gold standard in classification, having the smallest possible error rate.
K-Nearest Neighbours
Takes every observation and compares first nearest K points. And then chooses the class
according to the largest fraction. Low K value - high flexibility (perfectly overfit), high K - low
flexibility. Need to choose the right K such that the variance is not high.
Plot error rate against 1/K (represents flexibility); test error should have a usual U-shape.
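A small sketch of this behaviour on simulated data (the data-generating process and the particular K values are assumptions, not from the book):

```python
# Sketch: K = 1 drives training error to zero (overfitting); test error
# is typically lowest at an intermediate K.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 1, 400) > 0).astype(int)  # noisy classes
X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]

for K in [1, 5, 25, 100]:
    knn = KNeighborsClassifier(n_neighbors=K).fit(X_train, y_train)
    train_err = np.mean(knn.predict(X_train) != y_train)
    test_err = np.mean(knn.predict(X_test) != y_test)
    print(K, round(train_err, 3), round(test_err, 3))
```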
3. Linear Regression
Residual: e_i = y_i − ŷ_i
Residual Sum of Squares:
RSS = e_1² + e_2² + ... + e_n²
The least squares method (and gradient descent) minimize RSS. The population regression line is the true, most accurate relationship between the features and the response, but it is not known.
If we trained many regression lines on different samples, their average would be the population regression line; a single regression line from a single sample is therefore an unbiased estimate. So how far off is
our regression line from the population regression line? We can answer that by finding the standard error (variance):
Var(μ̂) = SE(μ̂)² = σ²/n
(with analogous formulas for SE(β̂0)² and SE(β̂1)²)
In general we don't know σ², the variance of the errors that these standard errors depend on, but we can estimate it from the data using the Residual Standard Error:
RSE = √( RSS / (n − 2) )
RSE measures the lack of fit of the model, in the units of Y.
We can use SE to compute 95% confidence intervals, roughly β̂ ± 2·SE(β̂).
Null hypothesis - there is no relationship (β1 = 0). To decide whether the data support the alternative hypothesis or merely fail to reject the null, we compute a t-statistic, which measures how many standard errors β̂1 is away from 0:
t = (β̂1 − 0) / SE(β̂1)
SE(β̂1) is estimated from the residuals, i.e. the differences between each point and the fitted line (via the RSE), together with the spread of the x_i.
We compare the t-statistic with zero because if the coefficient is close to zero (with a relatively large SE), the t-statistic will be small, which means this feature
is not related to the response.
From the t-statistic we compute a p-value; if the p-value is small (below 5% or 1%, corresponding to t-statistics of roughly 2 and 2.75), we
conclude that there is a relationship between the predictor and the response.
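A hedged sketch of how these quantities are obtained in practice, assuming statsmodels is available (the simulated coefficients and noise level are arbitrary):

```python
# Sketch: simple linear regression -> coefficient SEs, 95% CIs, t-statistics,
# p-values, on simulated data with known beta0 = 1, beta1 = 2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 2, 100)

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.params)          # beta0_hat, beta1_hat
print(fit.bse)             # SE(beta0_hat), SE(beta1_hat)
print(fit.conf_int(0.05))  # roughly beta_hat +/- 2 * SE
print(fit.tvalues)         # t = beta_hat / SE(beta_hat)
print(fit.pvalues)         # small p-value -> reject the null hypothesis
```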
R2 Statistic
R² = (TSS − RSS) / TSS = 1 − RSS / TSS
Similarly to RSE, R² measures how well the predictors fit the response. However, in
contrast to RSE, it measures it in relative terms, on a scale from 0 to 1.
TSS = Total Sum of Squares = total variability of the data
RSS = Residual Sum of Squares = variability that is not explained by the trained model
TSS - RSS = Amount that is explained by the model
R2 = Proportion of variability in Y that is explained using X
In simple linear regression R² equals the squared correlation:
r = Cor(X, Y)
R² = r²
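A quick numerical check of R² = r² on simulated data (a sketch; the data-generating model is arbitrary):

```python
# Sketch: compute R^2 from TSS and RSS and compare it with the squared
# correlation between X and Y for a simple linear regression.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 3 * x + rng.normal(size=200)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

tss = np.sum((y - y.mean()) ** 2)   # total variability of the data
rss = np.sum((y - y_hat) ** 2)      # variability left unexplained
r = np.corrcoef(x, y)[0, 1]
print(round(1 - rss / tss, 6), round(r ** 2, 6))   # identical
```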
1. Is There a Relationship Between the Response and Predictors?
If we only looked at individual p-values to decide whether some feature (one of many) is related
to the output, the method would be flawed: with many predictors there is likely to be one feature
that correlates with the output purely by chance. You have to compute the F-statistic and its p-value
at the same time; individual p-values only tell you which individual predictors are significant.
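A small simulation of this pitfall (a sketch assuming statsmodels; n = 100 observations and p = 50 pure-noise predictors are arbitrary choices):

```python
# Sketch: with 50 predictors unrelated to y, a few individual p-values fall
# below 0.05 just by chance, while the overall F-test stays insignificant.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 50))      # pure noise predictors
y = rng.normal(size=100)            # response unrelated to X

fit = sm.OLS(y, sm.add_constant(X)).fit()
print((fit.pvalues[1:] < 0.05).sum())   # often a few "significant" predictors
print(fit.f_pvalue)                     # F-test p-value, usually large
```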
2. Deciding on Important Variables
There are 2^p possible models for p features.
Variable Selection:
Forward Selection - we start with no variables and then add, one at a time, the feature that gives the lowest
RSS (see the sketch after this list). This method can include variables early that later become redundant.
Backward Selection - we start with all the variables and then delete, one at a time, the variable with
the largest p-value, continuing until a stopping rule is reached. Cannot
be used if p > n.
Mixed Selection - we start with no variables, as in forward selection, and keep adding
them. Whenever the p-value of a variable already in the model rises above a certain threshold,
we delete that variable. Continue until every variable in the model has a small p-value and adding any variable outside the model would give a large p-value.
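Here is the forward-selection sketch referenced above, driven by RSS (not ISLR's own code; the helper rss_of and the simulated data are illustrative assumptions):

```python
# Sketch: greedily add the feature that lowers RSS the most at each step.
import numpy as np

def rss_of(X, y, cols):
    """RSS of a least squares fit on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta) ** 2)

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8))
y = 2 * X[:, 0] - 3 * X[:, 3] + rng.normal(size=200)   # only columns 0, 3 matter

selected, remaining = [], list(range(X.shape[1]))
for _ in range(4):                                     # add up to 4 variables
    best = min(remaining, key=lambda j: rss_of(X, y, selected + [j]))
    selected.append(best)
    remaining.remove(best)
    print(selected, round(rss_of(X, y, selected), 1))
```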
3. Model Fit
The most common numerical measures of model fit are RSE and R².
Adding variables that are only weakly associated with the response will still increase
R², but only by a small amount.
4. Predictions
Reducible Error - the coefficient estimates differ from the true population regression line; quantify this with
95% confidence intervals.
Model Bias - the linear model itself may be the wrong choice; address it by choosing a different model (learning technique).
Irreducible Error - even with the true f(X), the response still varies because of ε; prediction intervals account for this on top of the reducible error.
5. High Leverage Points
Observations with unusual x_i values have high leverage. For simple linear regression the leverage statistic is
h_i = 1/n + (x_i − x̄)² / Σ_{i'=1}^n (x_{i'} − x̄)²
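A tiny sketch of the leverage statistic on simulated data (the single extreme x value is contrived for illustration):

```python
# Sketch: compute h_i with the formula above; the point with the unusual
# x value gets by far the largest leverage.
import numpy as np

rng = np.random.default_rng(6)
x = np.append(rng.normal(0, 1, 50), 8.0)   # last point has an unusual x
h = 1 / len(x) + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
print(round(h.mean(), 4))                  # average leverage = (p + 1)/n = 2/51
print(h.argmax(), round(h.max(), 3))       # index 50, much larger than average
```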
6. Collinearity
Collinearity - when two or more predictors are related to each other.
When two variables are collinear, the model can put a wide range of weights on those
variables: the RSS stays near the same local minimum over a wide range of
values of these weights. Therefore the SE of those coefficients increases and subsequently
the t-statistic decreases. With a small t-statistic we tend to fail to reject the null hypothesis,
so we may wrongly conclude that the given predictor doesn't influence the output. Multicollinearity (collinearity involving three or more variables) typically occurs when data is collected without an experimental design.
In order to detect collinearity, look at the correlation matrix! It gives the correlations between pairs of variables. However, there can be collinearity between three or more variables, in
which case we need to compute the Variance Inflation Factor (VIF). VIF is the ratio of the
variance of a coefficient in the full model to the variance of the
same coefficient when that predictor is fitted on its own. VIF = 1 means no collinearity at all; VIF above 5 or
10 indicates a problematic amount of collinearity.
Two solutions to the problem of collinearity:
First: delete one of the variables.
Second: combine them into one variable (e.g. the average of the standardized versions).
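A sketch of a VIF computation, assuming statsmodels' variance_inflation_factor (the nearly collinear pair x1, x2 is simulated for illustration):

```python
# Sketch: VIF flags the nearly collinear predictors x1 and x2; x3 stays ~1.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(0, 0.1, 300)          # almost a copy of x1
x3 = rng.normal(size=300)                  # independent predictor

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for j in range(1, X.shape[1]):             # skip the intercept column
    print(j, round(variance_inflation_factor(X, j), 1))
```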
The value of K represents the bias-variance trade-off. With a small K (e.g. K = 1) we get high
flexibility, i.e. low bias and high variance. In comparison, a high K produces lower variance
and a much smoother fit.
If we use KNN regression when the true relationship is a straight line, KNN will only approach
the line and won't technically be as accurate as the actual linear regression model. Therefore,
non-parametric methods have a higher variance (not necessarily with a corresponding
reduction in bias) in comparison to parametric methods.
KNN might seem better than linear regression when the true function is unknown
and might be highly non-linear; however, that only works with a small number of features. In high-dimensional data (p > 4), linear regression outperforms KNN.
That happens because in high dimensions each observation might not have any other observation
close to it: with so many variables there is essentially a reduction in the sample size
(for non-parametric methods) as dimensions increase. This is called the curse of dimensionality. Generally,
parametric methods outperform non-parametric methods when there is a
low number of observations per feature.
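A rough simulation of the curse of dimensionality (a sketch; the linear truth in one coordinate, K = 10, and the sample sizes are assumptions):

```python
# Sketch: as noise dimensions are added, KNN regression deteriorates while
# linear regression stays close to the irreducible error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(8)
for p in [1, 2, 5, 20]:
    X_tr, X_te = rng.normal(size=(100, p)), rng.normal(size=(500, p))
    y_tr = 2 * X_tr[:, 0] + rng.normal(0, 1, 100)   # only feature 0 matters
    y_te = 2 * X_te[:, 0] + rng.normal(0, 1, 500)

    knn = KNeighborsRegressor(n_neighbors=10).fit(X_tr, y_tr)
    lin = LinearRegression().fit(X_tr, y_tr)
    knn_mse = np.mean((knn.predict(X_te) - y_te) ** 2)
    lin_mse = np.mean((lin.predict(X_te) - y_te) ** 2)
    print(p, round(knn_mse, 2), round(lin_mse, 2))
```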
4. Classification
Why don't we use Linear Regression instead?
In general, creating a dummy variable (with values 0, 1, 2, 3, ...) and applying linear
regression doesn't reflect the true qualitative response. The ordering of the
dummy variable implies that 2 lies between 1 and 3, but in reality the categories
might not be related at all!
Therefore, unless you have binary data (where you can just assume that a prediction > 0.5 means
that outcome) or a qualitative response that is already ordered (for example mild,
moderate, severe), you can't use linear regression.
For a binary output, linear regression gives exactly the same classification as Linear Discriminant
Analysis (LDA), covered later.
Another problem is that a linear fit can produce values outside [0, 1], which
creates a problem of interpretability. Therefore, we use the logistic function (S-shaped):
p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))
The following quantity is called the log-odds or logit:
log( p(X) / (1 − p(X)) ) = β0 + β1X
The coefficients are estimated by maximum likelihood, maximizing
ℓ(β0, β1) = Π_{i: y_i = 1} p(x_i) · Π_{i': y_{i'} = 0} (1 − p(x_{i'}))
In classification, when testing the null hypothesis (that a variable is not related
to the response), we use a z-statistic instead of a t-statistic.
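A hedged sketch of fitting a logistic regression by maximum likelihood and reading off the z-statistics, assuming statsmodels (the simulated coefficients are arbitrary):

```python
# Sketch: fit a logistic regression and inspect z-statistics and p-values.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
x = rng.normal(size=500)
p_true = 1 / (1 + np.exp(-(-1.0 + 2.0 * x)))   # logistic function of x
y = rng.binomial(1, p_true)

fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(fit.params)      # estimates of beta0, beta1
print(fit.tvalues)     # reported as z-statistics in the summary table
print(fit.pvalues)     # small p-value -> predictor related to the response
```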
Subset Selection
1. Make the null model, with no predictors, which simply predicts the mean of the data.
2. For each of the 2^p possible models, fit a least squares regression. For each number of
predictors k, find the best model, meaning the one that gives the smallest RSS or, equivalently, the largest R².
3. Select a single best model using cross-validation error, Cp, BIC or adjusted R².
In the second step we reduce the number of models under consideration from 2^p to p + 1 (the best model of each size, plus the null model).
Then in the third step we choose the one with the smallest estimated test error. This method can be applied
to classification too, where we compute the deviance instead of the RSS.
Even though subset selection is very appealing, it is very computationally expensive to implement when p is large: for p = 20 there are about a million models to consider. Don't do it when p is
more than about 35.
Backward Selection
Starts from the model with all p features included and then excludes the most
useless ones, one at a time. Cannot be used when p > n.
In order to estimate a test error we can either:
1. Adjust the training error (4 common approaches: Cp, Akaike Information Criterion (AIC),
Bayesian Information Criterion (BIC) and adjusted R²)
2. Approximate a test error with a validation or cross-validation set
Generally, validation is a better approach and can be used in a wider range of model selection
tasks. Validation was an issue in the past because it was too computationally expensive.
Ridge Regression
The ridge regression coefficients minimize
RSS + λ Σ_{j=1}^p β_j² = Σ_{i=1}^n (y_i − ŷ_i)² + λ Σ_{j=1}^p β_j²
where λ ≥ 0 is a tuning parameter.
Unlike plain least squares, where rescaling a feature simply rescales its weight, once the shrinkage penalty is added to the loss function the weights no longer adjust in that simple way. That's
why we need to make sure the features are scaled (standardized); otherwise the weights that are large
in value would be penalized more than those smaller in value.
x̃_ij = x_ij / √( (1/n) Σ_{i=1}^n (x_ij − x̄_j)² )
As the tuning parameter λ increases, the flexibility of the ridge regression fit decreases: the variance decreases and the bias increases.
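A minimal sketch of this shrinkage behaviour using scikit-learn's Ridge (scikit-learn calls the tuning parameter alpha; the data and the alpha grid are made up):

```python
# Sketch: standardize the features, then watch the ridge coefficients shrink
# toward (but never exactly to) zero as the penalty grows.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(10)
X = rng.normal(size=(200, 5)) * np.array([1, 10, 100, 1, 1])   # mixed scales
y = X[:, 0] + 0.5 * X[:, 3] + rng.normal(size=200)

Xs = StandardScaler().fit_transform(X)     # standardize before penalizing
for alpha in [0.01, 1, 100, 10000]:
    print(alpha, np.round(Ridge(alpha=alpha).fit(Xs, y).coef_, 3))
```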
Ridge regression is much faster than best subset selection and can be applied to cases with
a large number of features. In fact, it can be shown that the solutions for all values of
λ can be computed simultaneously, in roughly the same time as an ordinary least squares fit.
The Lasso
Ridge regression does have one shortcoming in comparison to best, forward and backward
subset selection: the unnecessary predictors are shrunk, but never to exactly zero.
Even though this might not harm the accuracy (predictive performance) of the model, it harms
the interpretability, since all p features are still included. The lasso is a way to overcome this.
We simply change the shrinkage penalty from l2 to l1 .
l1 penalty: ||β||_1 = Σ_{j=1}^p |β_j|
In comparison to the l2 penalty, the l1 penalty forces some of the weights all the way down to exactly zero when λ is large enough. Hence the lasso
pretty much performs variable selection, and the models are easier to interpret. The
Lasso yields sparse models.
Depending on the value of λ, some of the variables get dropped: the larger λ, the fewer
variables your model is left with.
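A small sketch contrasting lasso sparsity with ridge shrinkage (scikit-learn assumed; the alpha values are arbitrary and not tuned):

```python
# Sketch: the lasso zeroes out the irrelevant coefficients exactly;
# ridge only shrinks them toward zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(11)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)   # only 2 features matter

print(np.round(Lasso(alpha=0.1).fit(X, y).coef_, 3))   # mostly exact zeros
print(np.round(Ridge(alpha=10).fit(X, y).coef_, 3))    # small but nonzero
```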
Another Formulation for Ridge and Lasso
Equivalently, ridge regression minimizes RSS subject to Σ_j β_j² ≤ s, and the lasso minimizes RSS subject to Σ_j |β_j| ≤ s. If s is large enough, the constraint is not binding and the above models just yield the least squares solution.
Comparison of Ridge to Lasso
For the case where most of the features are related to the response: both generate a similar bias, but ridge regression gives slightly lower
variance than the lasso and therefore a lower MSE.
For the case when some of the features should be zero: Lasso definitely outperforms ridge
regression; it gives lower bias, variance and MSE.
You need to use cross-validation in order to know which technique is better: lasso or ridge.
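A minimal sketch of such a comparison using 5-fold cross-validation (scikit-learn assumed; the fixed alpha values are placeholders that would normally be tuned by cross-validation as well):

```python
# Sketch: compare ridge and lasso by cross-validated MSE on the same data.
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(12)
X = rng.normal(size=(200, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)   # sparse truth

def cv_mse(model):
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    return -scores.mean()

print(round(cv_mse(Ridge(alpha=1.0)), 3))
print(round(cv_mse(Lasso(alpha=0.1)), 3))   # lasso tends to win here (sparse truth)
```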