ESL: Chapter 1: 1.1 Introduction To Linear Regression
ESL: Chapter 1
Statistical learning is concerned with predicting an outcome given a set of data. The outcome can be
quantitative or categorical.
The types of learning that can be employed can further be broken into two categories:
1. Supervised learning where the outcome is present to guide the learning process. This could be thought
of as similar to classical conditioning.
2. Unsupervised learning, where there is no output data and the task is to use the data itself to draw conclusions about its structure.
ESL: Chapter 2
Supervised learning concerns two components: inputs and outputs. An alternative is to call these predictor and response variables respectively.
The response variables can be quantitative measurements or qualitative information. An example of the latter is the Iris data set, which is available in R. This dichotomy has led to two names for the general prediction tasks: regression refers to predicting quantitative outputs, while classification refers to predicting qualitative outputs. A third type of variable is ordered categorical data, with categories such as small, medium, or large, where there is an ordering but no natural numerical distance between categories.
Notation:
$X$ will denote the predictor variable.
$Y$ will be the quantitative response variable.
$\hat{Y}$ will be the model's estimate of the response variable.
$G$ and $\hat{G}$ represent the qualitative response variable and its estimate for classification problems.
Training data will be denoted as $(x_i, y_i)$ for regression problems or $(x_i, g_i)$ for classification problems, where $1 \le i \le N$.
1.1 Linear Regression and Least Squares
The linear model predicts the output via

$$\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p} X_j \hat{\beta}_j \qquad (1)$$

$\hat{\beta}_0$ refers to the intercept of the model and represents the bias. (1) can be thought of as an inner product between a vector of predictor variables $X$ and a vector of regression coefficients $\hat{\beta}$, where $X$ is a $p$-length column vector. If the first element of $X$ is set to the constant 1, and $\hat{\beta}$ is given a corresponding first entry $\hat{\beta}_0$, the constant term is absorbed and the regression model can be written compactly as

$$\hat{Y} = X^T \hat{\beta} \qquad (2)$$
Fitting the model involves minimizing the squared residuals. We seek the $\hat{\beta}$ such that the estimated line, plane, or hyperplane is as close as possible, in terms of Euclidean distance, to the actual values of the response variables.
In terms of an optimization problem, this is:

$$\min_{\beta \in \mathbb{R}^{p+1}} RSS(\beta) = \min_{\beta \in \mathbb{R}^{p+1}} \sum_{i=1}^{N} \left(y_i - x_i^T \beta\right)^2 \qquad (3)$$

which can be written in matrix form as

$$RSS(\beta) = (Y - X\beta)^T (Y - X\beta) \qquad (4)$$
Where $X$ is an $N \times p$ matrix with an input vector in each row (with a 1 in the first position when an intercept is included), and $Y$ is the $N$-length vector of response variables.
Since this is an optimization problem, we can identify the $\beta$ which minimizes the criterion by differentiating the RSS function above and setting the derivative equal to zero to find the extremum:
$$RSS(\beta) = (Y - X\beta)^T (Y - X\beta) \qquad (5)$$

$$\frac{\partial RSS(\beta)}{\partial \beta} = -X^T (Y - X\beta) - \left[(Y - X\beta)^T X\right]^T \qquad (6)$$

$$= -2X^T (Y - X\beta) \qquad (7)$$
By taking (7) and setting it equal to zero, we get what are called the normal equations:

$$-2X^T (Y - X\beta) = 0 \qquad (8)$$

Provided $X^T X$ is nonsingular, solving for $\beta$ gives the unique least squares estimate

$$\hat{\beta} = (X^T X)^{-1} X^T Y \qquad (9)$$
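As a concrete illustration of (4) and (9), here is a minimal NumPy sketch that builds a toy data set and solves the normal equations (the data and the use of np.linalg.solve are my own illustrative choices, not part of the text):

```python
import numpy as np

# Toy data: N observations, p predictors (illustrative values only).
rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # prepend the constant 1 for the intercept
beta_true = np.array([2.0, 0.5, -1.0, 3.0])
Y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Normal equations (9): beta_hat = (X^T X)^{-1} X^T Y.
# np.linalg.solve is used instead of forming the inverse explicitly, for numerical stability.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

rss = np.sum((Y - X @ beta_hat) ** 2)  # the RSS of equation (4)
print(beta_hat, rss)
```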
1.2 KNN
ESL gives a quick example showing that linear regression is not always appropriate for classifying data, due to the number of assumptions we make about the decision boundary. Instead, a simpler method can be used which makes far fewer assumptions about the data.
KNN is a method of classification which works on the principle that, after taking a group of training data $T$, we should classify an observation $x_i$ from a test set $B$ according to its $k$ nearest observations in $T$. Typically the metric used is Euclidean distance, though other measures of distance are often used. The neighborhood $N_k(x)$ is the set of the $k$ training observations closest to $x$.
Measuring the quality of a KNN fit by its error on the training data is not meaningful in the way it is for regression, since $k = 1$ always gives the lowest possible SSE (zero) on the training set.
Some variants of KNN include kernel methods, which use a weight that decreases smoothly to zero with distance from the target point instead of the 0/1 weighting that standard KNN uses. Additionally, for high-dimensional problems, other distance metrics may be used.
Regression is also an option when applying KNN. In this case we take the values of the $k$ nearest neighbors and use their average in order to estimate the associated response variable:

$$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i \qquad (10)$$
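A minimal sketch of KNN regression following (10); the toy data and the brute-force distance computation are illustrative assumptions rather than anything prescribed in the text:

```python
import numpy as np

def knn_regress(x0, X_train, y_train, k):
    """Estimate Y at x0 by averaging the responses of the k nearest training points (eq. 10)."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distance to every training point
    neighborhood = np.argsort(dists)[:k]           # indices of the k closest points, i.e. N_k(x0)
    return y_train[neighborhood].mean()

# Toy usage with a noisy one-dimensional target (illustrative only).
rng = np.random.default_rng(1)
X_train = rng.uniform(0, 1, size=(200, 1))
y_train = np.sin(2 * np.pi * X_train[:, 0]) + rng.normal(scale=0.1, size=200)
print(knn_regress(np.array([0.25]), X_train, y_train, k=15))
```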
1.3 Statistical Decision Theory
Let $X \in \mathbb{R}^p$ be an input vector, $Y \in \mathbb{R}$ be an output variable, and let there be a joint probability distribution $Pr(X, Y)$. Our goal is to identify a function $f(X)$ which predicts $Y$ given $X$. To do this, a loss function $L(Y, f(X))$ is defined which penalizes prediction errors. In the case of linear regression, this would be the residual sum of squares (squared error loss).
A note on probability theory before moving on: recall that a joint distribution factors into a conditional and a marginal,

$$Pr(X, Y) = Pr(Y|X)\, Pr(X) \qquad (11)$$
The expected squared prediction error (EPE) is the expected value of the loss $L(Y, f(X))$:

$$EPE(f) = E\left[(Y - f(X))^2\right] \qquad (12)$$

$$= \int [y - f(x)]^2\, Pr(dx, dy) \qquad (13)$$

Using (11) produces:

$$EPE(f) = \int\!\!\int [y - f(x)]^2\, Pr(y|x)\, Pr(x)\, dy\, dx \qquad (14)$$

$$= E_X\, E_{Y|X}\!\left([Y - f(X)]^2 \mid X\right) \qquad (15)$$

It therefore suffices to minimize the EPE pointwise at each $x$:

$$f(x) = \operatorname*{argmin}_c\; E_{Y|X}\!\left([Y - c]^2 \mid X = x\right) \qquad (16)$$

and the minimizer of squared error loss is the conditional expectation, also known as the regression function:

$$f(x) = E(Y \mid X = x) \qquad (17)$$
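As a quick numerical check of (15)-(17), the following Monte Carlo sketch shows that the constant minimizing the expected squared error at a fixed $x_0$ is (approximately) the conditional mean; the joint distribution used here is purely an illustrative assumption:

```python
import numpy as np

# Monte Carlo check that the conditional mean minimizes expected squared error at a point.
# The joint distribution (X uniform, Y = sin(2*pi*X) + Gaussian noise) is an illustrative assumption.
rng = np.random.default_rng(2)
x0 = 0.3
y_given_x0 = np.sin(2 * np.pi * x0) + rng.normal(scale=0.5, size=100_000)  # draws from Pr(Y | X = x0)

candidates = np.linspace(-2, 2, 401)
expected_sq_error = [np.mean((y_given_x0 - c) ** 2) for c in candidates]
best_c = candidates[np.argmin(expected_sq_error)]
print(best_c, y_given_x0.mean())  # the minimizer is (approximately) E(Y | X = x0)
```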
KNN attempts to implement this framework directly using the training data. For each data point $x$, an average is taken over all $y_i$ in the neighborhood of the $k$ points closest to $x$. The simplest way to go about this is to take that average as the estimate of $f(x)$:

$$\hat{f}(x) = \text{Ave}\left(y_i \mid x_i \in N_k(x)\right) \qquad (18)$$

Note that (18) converges to (17) if we assume some regularity of the joint distribution and let $N, k \to \infty$ with $k/N \to 0$. However, this is never the case in practice, so the choice of $k$, and the use of KNN in general, must be done carefully.
1.4 Curses of Dimensionality
The first example concerns how large a neighborhood must be to capture a given fraction of the data. For inputs uniformly distributed in a $p$-dimensional unit hypercube, a hypercubical neighborhood that captures a fraction $r$ of the observations has expected edge length

$$s(r, p) = r^{1/p} \qquad (19)$$

For instance, if we want to include 1% of the points we get the following results as we increase the dimensionality:

$$s(0.01, 2) = 0.1, \qquad s(0.01, 10) = 0.631, \qquad s(0.01, 50) = 0.912$$

In 50 dimensions the neighborhood must cover more than 90% of the range of each input just to capture 1% of the data, so such a neighborhood is no longer local.
The second example of the curse of dimensionality is that all sample points become closer to the edge of the sample as we increase the dimension. This can be illustrated by the following scenario:
Consider the case where we have $N$ data points uniformly distributed in a $p$-dimensional unit ball centered at the origin. In this case the median distance from the origin to the closest data point is

$$md(p, N) = \left(1 - \left(\frac{1}{2}\right)^{1/N}\right)^{1/p} \qquad (20)$$

As we can see, the larger the number of dimensions, the closer to the boundary of the ball the nearest data point tends to lie.
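Both (19) and (20) are easy to evaluate numerically; a small sketch (the choices $r = 0.01$ and $N = 500$ are arbitrary illustrations, not values from the text):

```python
import numpy as np

def edge_length(r, p):
    """Expected edge length of a sub-cube capturing a fraction r of uniform data in p dimensions (eq. 19)."""
    return r ** (1.0 / p)

def median_nearest_distance(p, N):
    """Median distance from the origin to the closest of N uniform points in the unit p-ball (eq. 20)."""
    return (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)

for p in (2, 10, 50):
    print(p, edge_length(0.01, p), median_nearest_distance(p, N=500))
```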
A third example of the curse of dimensionality concerns the sampling density, the number of samples per unit distance. In the case of a unit hypercube, each sample occupies roughly a volume $1/N$ of the domain, so the typical spacing between samples is proportional to $(1/N)^{1/p}$ and the sampling density is proportional to $N^{1/p}$. Maintaining a given sampling density therefore requires $N$ to grow exponentially with the dimension $p$.
For a training set $T$ it is possible to compute the expected prediction error at a test point $x_0 = 0$, in the context of a 1-nearest-neighbor rule. In this case we find the mean squared error (MSE) of the estimate of $f(x_0)$:

$$MSE(x_0) = E_T\left[\left(f(x_0) - \hat{y}_0\right)^2\right]$$

where $\hat{y}_0$ is the 1-nearest-neighbor estimate at $x_0$.
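A Monte Carlo sketch of this MSE for the 1-nearest-neighbor rule at the origin; the smooth target $f(x) = \exp(-8\|x\|^2)$ and the sample sizes are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

def one_nn_mse_at_origin(p, N=1000, reps=200):
    """Monte Carlo estimate of MSE(x0) at x0 = 0 for the 1-NN estimate of a smooth target f."""
    f = lambda x: np.exp(-8.0 * np.sum(x ** 2, axis=-1))   # assumed target function
    errors = []
    for _ in range(reps):
        X = rng.uniform(-1, 1, size=(N, p))                # training inputs uniform in [-1, 1]^p
        y = f(X)                                           # noiseless responses
        nearest = np.argmin(np.sum(X ** 2, axis=1))        # training point closest to the origin
        errors.append((f(np.zeros(p)) - y[nearest]) ** 2)
    return np.mean(errors)

for p in (1, 2, 5, 10):
    print(p, one_nn_mse_at_origin(p))
```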
1.5 Function Approximation
Regression can be considered a particular instance of an inverse problem. Given a set of observations $(x_i, y_i)$ whose inputs occupy $p$-dimensional Euclidean space, we hope to identify the function $f(x)$ satisfying

$$y_i = f(x_i) + \varepsilon \qquad (21)$$

where $\varepsilon$ is an unspecified noise term.
With regression, we use a set of linear basis expansions in order to approximate the inverse solution $f(x)$:

$$f_\theta(x) = \sum_{k=1}^{K} h_k(x)\, \theta_k \qquad (22)$$

In this case the $h_k$ are a set of $K$ basis functions which transform the input vector $x$. These might be polynomial expressions (as in polynomial regression) or trigonometric functions.
Continuing with this framework, the more general form of the RSS function that we attempt to minimize is given by:

$$RSS(\theta) = \sum_{i=1}^{N} \left(y_i - f_\theta(x_i)\right)^2 \qquad (23)$$
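As an illustration of (22) and (23), here is a minimal sketch that fits a polynomial basis expansion by least squares (the degree, the toy data, and the basis choice are my own assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=80)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=x.size)

# Basis functions h_k(x) = x^k for k = 0, ..., K-1 (a polynomial basis).
K = 6
H = np.vander(x, K, increasing=True)        # N x K matrix with H[i, k] = h_k(x_i)

# Minimize RSS(theta) = sum_i (y_i - f_theta(x_i))^2 by least squares, as in (23).
theta, *_ = np.linalg.lstsq(H, y, rcond=None)

f_hat = lambda x_new: np.vander(np.atleast_1d(x_new), K, increasing=True) @ theta
print(theta, f_hat(0.5))
```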
In this case the increase in generality introduces a problem: By generalizing the optimization problem
that we try to solve, the guarantee of a unique solution is lost. This becomes an ill-posed problem.
1.6 Classes of Restricted Estimators
To tailor a better solution for $f(x)$, the optimization criterion can be modified to favor solutions with the particular properties we desire.
Regularization is the first example. ESL describes this as a roughness penalty. This takes the original RSS function and adds a regularization term $\lambda J(f)$, where $\lambda$ is the regularization parameter and $J(f)$ is some user-selected functional. The regularized RSS function is:

$$PRSS(f; \lambda) = RSS(f) + \lambda J(f) \qquad (24)$$
One example of $J$ is the cubic smoothing spline penalty. In this case, for a one-dimensional model this would be:

$$PRSS(f; \lambda) = RSS(f) + \lambda \int [f''(x)]^2\, dx \qquad (25)$$

Often, the purpose of $J(f)$ is to impose smoothness on the fitted function. The selection of $\lambda$ is also important: if it is zero we end up with the standard RSS, while if it is too large the regularization term dominates the optimization problem and the fit is forced to be extremely smooth regardless of the data.
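A minimal sketch of the penalized criterion (24)-(25), using a discrete second-difference matrix in place of the integral of $[f''(x)]^2$; this discretization, the grid, and the $\lambda$ values are illustrative assumptions, not the book's smoothing-spline algorithm:

```python
import numpy as np

def smooth_fit(y, lam):
    """Minimize ||y - f||^2 + lam * ||D2 f||^2, a discrete analogue of the roughness penalty in (25).

    f is the vector of fitted values on an evenly spaced grid and D2 is the second-difference operator.
    """
    n = y.size
    D2 = np.diff(np.eye(n), n=2, axis=0)                    # (n - 2) x n second-difference matrix
    return np.linalg.solve(np.eye(n) + lam * D2.T @ D2, y)  # closed-form minimizer of the quadratic criterion

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
f_smooth = smooth_fit(y, lam=50.0)  # large lambda: the penalty dominates and the fit is very smooth
f_rough = smooth_fit(y, lam=0.0)    # lambda = 0: the criterion reduces to the ordinary RSS and f = y
print(f_smooth[:5], f_rough[:5])
```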
Kernel methods are a way to weight the observations in the neighborhood of a particular point $x_0$. This is done using a kernel function $K_\lambda(x_0, x)$ which assigns a weight to each point $x$ in the region around $x_0$ according to its distance from $x_0$. One example ESL gives is the Gaussian kernel:

$$K_\lambda(x_0, x) = \frac{1}{\lambda} \exp\left[-\frac{\|x - x_0\|^2}{2\lambda}\right] \qquad (26)$$
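A minimal sketch of a kernel-weighted local average built from the Gaussian kernel (26); the toy data and the bandwidth $\lambda$ are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(x0, X, lam):
    """Kernel weights K_lambda(x0, x) from equation (26)."""
    sq_dist = np.sum((X - x0) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * lam)) / lam

def kernel_weighted_average(x0, X_train, y_train, lam):
    """Estimate f(x0) as the kernel-weighted average of the training responses."""
    w = gaussian_kernel(x0, X_train, lam)
    return np.sum(w * y_train) / np.sum(w)

rng = np.random.default_rng(6)
X_train = rng.uniform(0, 1, size=(200, 1))
y_train = np.sin(2 * np.pi * X_train[:, 0]) + rng.normal(scale=0.1, size=200)
print(kernel_weighted_average(np.array([0.25]), X_train, y_train, lam=0.01))
```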
1.7 Maximum Likelihood Estimation
Instead of using RSS, which says we should minimize the sum of squared errors of our model against the training data, maximum likelihood estimation can be used. This alternative is the key to logistic regression and works as follows:
A set of $N$ observations governed by the probability density function $Pr_\theta(y)$ is taken. The log-likelihood of the sample is:

$$L(\theta) = \sum_{i=1}^{N} \log Pr_\theta(y_i) \qquad (27)$$
By selecting $\theta$ such that the above expression is maximized, we may be able to approximate the correct model.
Consider the case where we have the additive error model $Y = f_\theta(X) + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$. If we assume that the conditional likelihood is

$$Pr(Y|X, \theta) = N(f_\theta(X), \sigma^2) \qquad (28)$$
we recover the least squares criterion discussed earlier. In this case, the log-likelihood of the data is:

$$L(\theta) = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - f_\theta(x_i)\right)^2 \qquad (29)$$

The only term involving $\theta$ is the sum of squares, so maximizing $L(\theta)$ is equivalent to minimizing the RSS.
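A small sketch evaluating the log-likelihood (29) for a toy linear model, confirming that the maximizing $\theta$ coincides with the least squares solution; the model $f_\theta(x) = \theta_0 + \theta_1 x$ and the data are assumptions made for illustration:

```python
import numpy as np

def gaussian_log_likelihood(theta, x, y, sigma, f):
    """Log-likelihood (29) for the additive Gaussian error model Y = f_theta(X) + eps."""
    resid = y - f(x, theta)
    N = y.size
    return -0.5 * N * np.log(2 * np.pi) - N * np.log(sigma) - np.sum(resid ** 2) / (2 * sigma ** 2)

# Toy linear model f_theta(x) = theta[0] + theta[1] * x (an illustrative choice).
f = lambda x, theta: theta[0] + theta[1] * x
rng = np.random.default_rng(7)
x = rng.uniform(0, 1, size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.2, size=50)

# The theta maximizing (29) is the theta minimizing the sum of squares, i.e. the
# least squares solution from the normal equations (9).
X = np.column_stack([np.ones_like(x), x])
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_ls, gaussian_log_likelihood(theta_ls, x, y, sigma=0.2, f=f))
```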
A more complicated likelihood arises when we seek to categorize a qualitative response $G$ with several classes. The multinomial likelihood models the regression function $Pr(G|X)$ through the conditional probability of each class given $X$,

$$Pr(G = \mathcal{G}_k \mid X = x) = p_{k,\theta}(x), \qquad k = 1, \dots, K \qquad (30)$$

indexed by the parameter vector $\theta$. In this case the log-likelihood becomes:

$$L(\theta) = \sum_{i=1}^{N} \log p_{g_i,\theta}(x_i) \qquad (31)$$

Maximizing this gives the values of $\theta$ that best conform with the data in terms of likelihood.
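A minimal sketch of the multinomial log-likelihood (31), using a softmax of linear scores for $p_{k,\theta}(x)$ as in multinomial logistic regression; the softmax parameterization and the toy data are my own assumptions:

```python
import numpy as np

def softmax(scores):
    """Class probabilities p_{k,theta}(x) from linear scores, as in multinomial logistic regression."""
    scores = scores - scores.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def multinomial_log_likelihood(theta, X, g, K):
    """Log-likelihood (31): sum_i log p_{g_i, theta}(x_i), with p given by a softmax of X @ theta."""
    probs = softmax(X @ theta.reshape(X.shape[1], K))
    return np.sum(np.log(probs[np.arange(X.shape[0]), g]))

# Toy usage: N points, two features plus an intercept, K = 3 classes (all illustrative).
rng = np.random.default_rng(8)
N, K = 150, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
g = rng.integers(0, K, size=N)          # observed class labels g_i
theta0 = np.zeros(X.shape[1] * K)       # all-zero parameters give uniform class probabilities
print(multinomial_log_likelihood(theta0, X, g, K))
```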