ESL: Chapter 1: 1.1 Introduction To Linear Regression


Abstract

ESL: Chapter 1
Statistical learning is concerned with predicting an outcome given a set of data. The outcome can be
quantitative or categorical.
The types of learning that can be employed can further be broken into two categories:
1. Supervised learning where the outcome is present to guide the learning process. This could be thought
of as similar to classical conditioning.
2. Unsupervised learning where there is no output data, and the task is to find structure in the data and draw conclusions from it.

ESL: Chapter 2

Supervised learning concerns two components: inputs and outputs. An alternative is to call these predictor
and response variables respectively.
The response variables can be quantitative measurements or qualitative information. An example of the latter is the Iris data set, which is available in R. This dichotomy has led to two names for the general prediction tasks: regression refers to quantitative output prediction, while classification refers to qualitative output prediction. A third variable type is ordered categorical, with ordered categories such as small, medium, or large.
Notation:
X will denote the predictor variable.
Y will denote the response variable.
\hat{Y} will be the model's estimate of the response variable.
G and \hat{G} represent the response variable and its estimate for classification problems.
Training data will be denoted as (x_i, y_i) for regression problems or (x_i, g_i) for classification problems, where 1 \le i \le N.

1.1 Introduction to Linear Regression

The simple case of linear regression is:


\hat{y}_i = \beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j \qquad (1)

\beta_0 refers to the intercept of the model and represents the bias. (1) can be thought of as an inner product between a vector of predictor variables X and a vector of regression coefficients \beta. Here X is a (p+1)-length column vector whose first element is a 1, and \beta is a (p+1)-length column vector whose corresponding first entry is \beta_0. The leading 1 in X takes the constant term of the regression model into account:

\hat{Y} = X^T \beta \qquad (2)

Fitting the model involves minimizing the squared residual terms. We seek to identify \beta such that
the estimated line, plane, or hyperplane is as close as possible, in terms of Euclidean distance, to the actual
values of the response variables.
In terms of an optimization problem, this is:
\min_{\beta \in \mathbb{R}^{p+1}} RSS(\beta) = \min_{\beta \in \mathbb{R}^{p+1}} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 \qquad (3)

RSS(\beta) = (Y - X\beta)^T (Y - X\beta) \qquad (4)

Where X is an N \times p matrix with an input vector in each row (with a leading 1 included when the model has an intercept), and Y is an N-length vector made up of the
response variables.
Since this is an optimization problem, we can identify the \beta which solves it by differentiating the RSS function above and setting the derivative equal to zero to find the extremum:

\frac{\partial}{\partial\beta} RSS(\beta) = \frac{\partial}{\partial\beta} (Y - X\beta)^T (Y - X\beta) \qquad (5)

= -X^T(Y - X\beta) - X^T(Y - X\beta) \qquad (6)

= -2X^T(Y - X\beta) \qquad (7)

By taking (7) and setting it equal to zero, we get what is called the normal equations:
-2X^T(Y - X\beta) = 0 \qquad (8)

Assuming that X^T X is nonsingular, the solution to (8) is:

\hat{\beta} = (X^T X)^{-1} X^T Y \qquad (9)

The fitted value is then given by \hat{y}_i = \hat{y}(x_i) = x_i^T \hat{\beta}.
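As a quick illustration of (9), here is a minimal numpy sketch; the data, dimensions, and coefficient values are made up for the example:

import numpy as np

# Synthetic data: N observations, p predictors, plus a leading column of ones for beta_0.
rng = np.random.default_rng(0)
N, p = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
true_beta = np.array([2.0, 0.5, -1.0, 3.0])
Y = X @ true_beta + rng.normal(scale=0.1, size=N)      # additive noise model

# Normal equations solution (9); np.linalg.solve avoids forming an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
fitted = X @ beta_hat                                  # y_hat_i = x_i^T beta_hat

Solving the linear system directly is numerically preferable to computing (X^T X)^{-1} and multiplying, though both implement the same formula.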

1.2 KNN

ESL gives a quick example showing that linear regression is not appropriate for classifying data, due to the number
of assumptions that we make about the decision boundary. Instead, a simpler method can be used
which makes far fewer assumptions about the data.

KNN is a method of classification which works on the principle that, given a set of training data
T, we classify an observation x from a test set B according to its K nearest training observations.
Typically the metric used is Euclidean distance, though other measures of distance are often used. The
neighborhood of K observations is denoted N_k(x) and is the region around x containing
those K observations.

Measuring the quality of a KNN fit by its training error is not as meaningful as it is for regression, since K = 1
would always give the lowest training SSE.

Some variants of KNN include kernel methods, which use a weight that decreases smoothly to
zero with distance from the target point, instead of the binary weighting which standard KNN uses.
Additionally, for high-dimensional problems, other distance metrics may be used.

Regression is also an option when applying KNN. In this case we take the values of the K nearest
neighbors, and use their average in order to estimate the associated response variable:

\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i \qquad (10)
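A minimal sketch of (10), assuming Euclidean distance and a plain unweighted average; the function and variable names are illustrative:

import numpy as np

def knn_regress(x0, X_train, y_train, k=5):
    # Average the responses of the k nearest (Euclidean) neighbors of x0, as in (10).
    dists = np.linalg.norm(X_train - x0, axis=1)   # distances to every training point
    nearest = np.argsort(dists)[:k]                # indices of the k closest observations, N_k(x0)
    return y_train[nearest].mean()

The classification version is identical except that the neighbors vote: the predicted class is the most frequent g_i among the k nearest points.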

1.3 Statistical Decision Theory

Let X \in \mathbb{R}^p be an input vector, Y \in \mathbb{R} be an output variable, and let there be a joint probability distribution
function \Pr(X, Y). Our goal is to identify a function f(X) which predicts Y given X. To do this, a loss
function is defined which is tailored to penalizing prediction errors. In the case of linear regression, this
would be the residual sum of squares.
A note on probability theory before moving on: recall that a joint density factors through the conditional distribution as

\Pr(X, Y) = \Pr(Y \mid X)\Pr(X) \qquad (11)

The loss function L(Y, f(X)) can be defined as:

L(Y, f(X)) = (Y - f(X))^2 \qquad (12)

The expected squared prediction error (EPE) is the expected value of L(Y, f(X)):

EPE(f) = E(Y - f(X))^2 \qquad (13)

= \int [y - f(x)]^2 \Pr(dx, dy) \qquad (14)

Using (11) produces:

= E_X E_{Y|X}\big([Y - f(X)]^2 \mid X\big) \qquad (15)

(15) can be minimized pointwise. We substitute c for f(X), and solve:

\min_{c} E_{Y|X}\big([Y - c]^2 \mid X = x\big) \qquad (16)

which has the solution:

f(x) = E(Y \mid X = x) \qquad (17)
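The minimization in (16) can be carried out directly; expanding the square and differentiating with respect to c gives:

E_{Y|X}\big([Y - c]^2 \mid X = x\big) = E(Y^2 \mid X = x) - 2c\,E(Y \mid X = x) + c^2

\frac{d}{dc}\,E_{Y|X}\big([Y - c]^2 \mid X = x\big) = -2E(Y \mid X = x) + 2c = 0 \quad\Rightarrow\quad c = E(Y \mid X = x)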

KNN attempts to implement this framework directly using training data. For each data point x, an
average is taken over all y_i in its neighborhood of the k points closest to x. The simplest way to go about this is
to simply take that average as the estimate of f(x):

\hat{f}(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x)) \qquad (18)

Note that (18) would converge to (17) if we assume some regularity of the underlying function, and let N, k \to \infty such that k/N \to 0. However, this is never the case in practice, so the choice of k, and the use of KNN, must be done
carefully.

1.4 Curses of Dimensionality

The first example of the curse of dimensionality is that, as the number of dimensions increases, a neighborhood that captures a fixed fraction of the data points must take up a growing portion of the range of each input.
To illustrate, consider the case where we have a unit hypercube in p dimensions. Next, assume that we
want the neighborhood about a particular point to correspond to a fraction r of the unit volume. In this
case, each side will become disproportionately larger, governed by the formula:
s(r, p) = r^{1/p} \qquad (19)

For instance, if we want to include 1% of the points we would get the following results as we increased
the dimensionality:
s(.01, 2) = .1
s(.01, 10) = .631
s(.01, 50) = .912
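A quick Python check of (19) for these values (an illustration only):

# Edge length s(r, p) = r**(1/p) needed to capture a fraction r of a unit hypercube's volume.
def edge_length(r, p):
    return r ** (1.0 / p)

for p in (2, 10, 50):
    print(p, round(edge_length(0.01, p), 3))   # 0.1, 0.631, 0.912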

The second example of the curse of dimensionality is that all sample points become closer to the
edge of the sample as we increase the dimensions. This can be illustrated by the following scenario:
Consider the case where we have N data points uniformly distributed in a p-dimensional unit ball which
is centered at the origin. In this case the median distance from the origin to the closest data point is:

md(p, N) = \left(1 - \left(\tfrac{1}{2}\right)^{1/N}\right)^{1/p} \qquad (20)

Consider the case where N = 500 for the following values of p:

md(2, 500) = .0372
md(10, 500) = .518
md(50, 500) = .877

As we can see, the larger the number of dimensions we use, the closer to the boundary the data
points will be.
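Equation (20) can likewise be checked numerically (again just an illustration):

# Median distance from the origin to the nearest of N uniform points in the unit p-ball, equation (20).
def median_nearest_distance(p, N):
    return (1 - 0.5 ** (1.0 / N)) ** (1.0 / p)

for p in (2, 10, 50):
    print(p, round(median_nearest_distance(p, 500), 3))   # 0.037, 0.518, 0.877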

A third example of the curse of dimensionality concerns the sampling density. The sampling
density is the number of samples per unit distance. In the case of a unit hypercube, each sample takes
up a volume of roughly 1/N of the domain, so the spacing between samples scales like N^{-1/p} and the sampling density is proportional to N^{1/p}.
For a training set T it is possible to compute the expected prediction error at a test point
x_0 = 0, in the context of a 1-nearest-neighbor rule. In this case we find the mean squared error (MSE)
of the estimate of f(x_0):

MSE(x_0) = E_T[f(x_0) - \hat{y}_0]^2

Adding and subtracting E_T(\hat{y}_0) gives:

= E_T[f(x_0) - E_T(\hat{y}_0) + E_T(\hat{y}_0) - \hat{y}_0]^2
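Expanding the square, the cross term vanishes because \hat{y}_0 - E_T(\hat{y}_0) has mean zero, leaving the familiar bias-variance decomposition:

MSE(x_0) = E_T[\hat{y}_0 - E_T(\hat{y}_0)]^2 + [E_T(\hat{y}_0) - f(x_0)]^2 = \mathrm{Var}_T(\hat{y}_0) + \mathrm{Bias}^2(\hat{y}_0)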

1.5 Digging Deeper Into Regression

Regression can be considered to be a particular instance of an inverse problem. Given a set of observation
data with inputs x_i in p-dimensional Euclidean space and responses y_i, we hope to identify the function f(x) satisfying:

y_i = f(x_i) + \varepsilon \qquad (21)

where \varepsilon is an unspecified noise term.

With regression, we use a set of linear basis expansions in order to approximate the inverse
solution f(x):

f_\theta(x) = \sum_{k=1}^{K} h_k(x)\theta_k \qquad (22)

In this case, the h_k are a set of functions which transform the input vector x. These might be polynomial
expressions (as in polynomial regression), or trigonometric functions.
Continuing with this framework, the more general form of the RSS function that we attempt to minimize
is given by:

RSS(\theta) = \sum_{i=1}^{N} (y_i - f_\theta(x_i))^2 \qquad (23)
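As an illustrative sketch of (22) and (23), assume one-dimensional inputs and polynomial basis functions h_k(x) = x^k (the data below is synthetic); the minimizing coefficients can then be found by ordinary least squares:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)   # noisy observations y_i = f(x_i) + eps

K = 5
H = np.vander(x, K + 1, increasing=True)        # columns are the basis functions h_k(x) = x**k
theta, *_ = np.linalg.lstsq(H, y, rcond=None)   # minimizes RSS(theta) from equation (23)
f_hat = H @ theta                               # fitted values at the training inputs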

In this case the increase in generality introduces a problem: By generalizing the optimization problem
that we try to solve, the guarantee of a unique solution is lost. This becomes an ill-posed problem.

1.6 Tweaks to Regression Models

To tailor a better solution for f(x), the optimization problem can be modified so that the solution has particular
properties which we desire.

Regularization is the first example. ESL describes this as a roughness penalty. This takes the original
RSS function and adds a regularization term \lambda J(f), where \lambda is the regularization parameter and J(f) is
some user-selected functional. The regularized RSS function is:

PRSS(f; \lambda) = RSS(f) + \lambda J(f) \qquad (24)

One example of this is the cubic smoothing spline functional. In this case, for a one-dimensional model
this would be:

PRSS(f; \lambda) = RSS(f) + \lambda \int [f''(x)]^2 \, dx \qquad (25)

Often, the purpose of J(f) is to impose smoothness on the function. The selection of \lambda is also important: if it is zero we end up with the standard RSS, while if it is too large, the regularization term dominates the
optimization problem and we end up with something completely different.
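As a simple concrete sketch, take J to be a squared-coefficient penalty on the basis expansion of Section 1.5 (a stand-in for the spline penalty above, chosen here because it has a closed-form solution):

import numpy as np

# PRSS(theta; lam) = ||y - H theta||^2 + lam * ||theta||^2, minimized in closed form.
def penalized_fit(H, y, lam):
    K = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(K), H.T @ y)

Setting lam = 0 recovers ordinary least squares, while a very large lam shrinks the coefficients toward zero, illustrating how the penalty term can come to dominate the fit.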

Kernel Methods are a way to weight the observations in the neighborhood of a particular point x_0.
This is done using a kernel function K_\lambda(x_0, x) which assigns a weight to each point x according to its distance from x_0. One example ESL gives is the Gaussian kernel:

K_\lambda(x_0, x) = \frac{1}{\lambda}\exp\left[-\frac{\|x - x_0\|^2}{2\lambda}\right] \qquad (26)
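A minimal sketch of using (26) for kernel-weighted averaging (a Nadaraya-Watson style estimate; the bandwidth value is arbitrary):

import numpy as np

def gaussian_kernel(x0, x, lam=0.5):
    # Gaussian kernel weight from equation (26).
    return np.exp(-np.sum((x - x0) ** 2, axis=-1) / (2 * lam)) / lam

def kernel_regress(x0, X_train, y_train, lam=0.5):
    # Kernel-weighted average of the training responses around x0.
    w = gaussian_kernel(x0, X_train, lam)
    return np.sum(w * y_train) / np.sum(w)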

1.7 Maximum Likelihood Estimation - An Alternative to RSS

Instead of using RSS, which says that we should minimize the sum of squared errors of our model against
the training data, maximum likelihood estimation can be used. This alternative is the basis of logistic
regression and works as follows:
A set of N observations governed by the probability density function \Pr_\theta(y) is taken. The log-likelihood
of the sample is:

L(\theta) = \sum_{i=1}^{N} \log \Pr_\theta(y_i) \qquad (27)

By selecting \theta such that the above expression is maximized, we may be able to approximate the correct
model.
Consider the case where we have the additive error model Y = f_\theta(X) + \varepsilon, where \varepsilon \sim N(0, \sigma^2). If we assume
that the conditional likelihood is

\Pr(Y \mid X, \theta) = N(f_\theta(X), \sigma^2) \qquad (28)

we get the least squares criterion discussed earlier. In this case, the log-likelihood of the data is:

L(\theta) = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - f_\theta(x_i))^2 \qquad (29)
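A small sketch of (29); for fixed \sigma the only term that depends on the fitted values is the scaled RSS, which is why maximizing this likelihood reproduces least squares:

import numpy as np

def gaussian_log_likelihood(y, f_x, sigma):
    # Log-likelihood (29) for the additive Gaussian error model.
    N = len(y)
    rss = np.sum((y - f_x) ** 2)
    return -N / 2 * np.log(2 * np.pi) - N * np.log(sigma) - rss / (2 * sigma ** 2)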
A more complicated setting arises when we are seeking to categorize several different kinds of responses
G. The multinomial likelihood for the regression function \Pr(G \mid X) may be governed by:

\Pr(G = G_k \mid X = x) = p_{k,\theta}(x), \qquad k = 1, \ldots, K \qquad (30)

and has the conditional probability of each class given a particular X indexed by the parameter vector
\theta. In this case the log-likelihood becomes:
L(\theta) = \sum_{i=1}^{N} \log p_{g_i,\theta}(x_i) \qquad (31)

Maximizing this gives the values of \theta which best conform with the data in terms of likelihood.
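A sketch of (31), assuming a softmax parameterization for p_{k,\theta}(x) with classes encoded as integer indices (the notes do not fix the form of p_{k,\theta}; softmax is the choice made by multinomial logistic regression):

import numpy as np

def class_probs(x, Theta):
    # Softmax class probabilities p_{k,theta}(x); Theta holds one weight vector per class.
    scores = Theta @ x
    e = np.exp(scores - scores.max())     # subtract the max for numerical stability
    return e / e.sum()

def multinomial_log_likelihood(X, g, Theta):
    # Log-likelihood (31): sum over observations of log p_{g_i, theta}(x_i).
    return sum(np.log(class_probs(x, Theta)[gi]) for x, gi in zip(X, g))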
