

NPTEL

Video Course on Machine Learning

Professor Carl Gustaf Jansson, KTH

Week 4 Inductive Learning based on Symbolic Representations and Weak Theories

Video 4.4 Instance Based Learning Part 2


Distance-Weighted Nearest Neighbor Algorithm
The k-nearest neighbor classifier can be viewed as assigning the k nearest neighbors a
weight of 1 and all others a weight of 0.

This can be generalized to weighted nearest neighbor classifiers, in which the ith
nearest neighbor is assigned a weight wi, i = 1..k.

A typical weight is the inverse of the squared distance: wi = 1 / d(xq, xi)^2.

The weighted k-NN algorithm can be used both for classification and regression:
• In weighted k-NN classification, the output is a class membership. A query instance
(xq) is assigned the class label most common among its k nearest neighbors, where the
'vote' of each neighbor i is weighted with wi.
• In weighted k-NN regression, the output is the property value for the query instance (xq).
This value is the weighted sum of the property values of its k nearest neighbors divided by
the sum of the weights.

In the weighted approach one can extend the k-nearest neighbor method from the k nearest
neighbors to all data items. Keeping to the k nearest elements is called a local weighted
method, while the extension to all data items is called a global weighted method.
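
As an illustrative sketch, the distance-weighted k-NN rule above can be written in Python roughly as follows (the function names and the use of NumPy arrays as inputs are assumptions of this sketch, not part of the lecture):

import numpy as np

def nearest_weights(X_train, x_q, k=3, eps=1e-12):
    # weights w_i = 1 / d(x_q, x_i)^2 for the k nearest training instances
    d = np.linalg.norm(X_train - x_q, axis=1)
    idx = np.argsort(d)[:k]
    return idx, 1.0 / (d[idx] ** 2 + eps)

def weighted_knn_classify(X_train, y_train, x_q, k=3):
    # weighted vote: the class label with the largest summed weight wins
    idx, w = nearest_weights(X_train, x_q, k)
    votes = {}
    for i, wi in zip(idx, w):
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + wi
    return max(votes, key=votes.get)

def weighted_knn_regress(X_train, y_train, x_q, k=3):
    # weighted average of the neighbors' property values
    idx, w = nearest_weights(X_train, x_q, k)
    return np.dot(w, np.asarray(y_train)[idx]) / np.sum(w)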
Locally Weighted Regression
The nearest neighbor approaches approximate a target function for a single query instance (xq).
Locally weighted regression (LWR) is an extension of this approach. It constructs an explicit
approximation of the target function over a local region surrounding xq. The approximation may
be a linear function, a quadratic function, etc.

• The term local is motivated by the fact that the approximation is based only on data
near xq.
• The term weighted is motivated by the fact that the contributions of the training instances
are weighted based on their distance from xq. The weights are defined by a so-called kernel
function. One can say that the kernel function moderates the original distance measure.
• The term regression is motivated by the fact that we aim at approximating real-valued
functions.
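
A minimal Python sketch of locally weighted regression with a linear local model; the Gaussian kernel with bandwidth tau, the weighted least-squares formulation and the names are assumptions of the sketch:

import numpy as np

def locally_weighted_regression(X, y, x_q, tau=1.0):
    # weight each training instance by a Gaussian kernel of its distance to x_q
    d = np.linalg.norm(X - x_q, axis=1)
    w = np.exp(-(d ** 2) / (2 * tau ** 2))
    # fit a local linear model by weighted least squares: beta = (Xb^T W Xb)^-1 Xb^T W y
    Xb = np.hstack([np.ones((len(X), 1)), X])          # add a bias column
    W = np.diag(w)
    beta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y
    # evaluate the local linear approximation at the query point
    return np.hstack([1.0, x_q]) @ beta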
Kernel functions in Nonparametric Statistics
A kernel is a window function. When the argument is a distance measure, one can say that
the kernel function is a moderation of the original distance measure.

A kernel is a non-negative, real-valued, integrable function K.

For most applications, it is common to define the function to satisfy two additional requirements:
- K integrates to 1 over the whole real line: ∫ K(x) dx = 1
- K(x) = K(-x) for all x (symmetry).

Several types of kernel functions are commonly used: uniform, triangle, Epanechnikov,
quartic (bi-weight), tri-weight, Gaussian, and cosine.
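
For illustration, a few of these kernels written out in Python (a sketch; the one-dimensional form and the variable names are assumptions):

import numpy as np

def uniform(u):       return 0.5 * (np.abs(u) <= 1)
def triangle(u):      return (1 - np.abs(u)) * (np.abs(u) <= 1)
def epanechnikov(u):  return 0.75 * (1 - u ** 2) * (np.abs(u) <= 1)
def gaussian(u):      return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

# each function is non-negative, symmetric (K(u) = K(-u)) and integrates to 1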
Kernel methods as a kind of instance-based learners
Kernel functions are also useful in a kind of weighted instance-based learners called
kernel methods. The kernel function here serves as a similarity function.

This kind of method typically computes a classification as a weighted sum of similarities:

cq = sign( Sum_{i=1..n} wi * ci * k(xi, xq) )

where
• xi and ci are the feature vector and class label (+1 or -1) for training instance i
• cq is the predicted class label (+1 or -1) for the unlabeled input xq
• the wi are the weights for the training examples
• k is the kernel, i.e. the function that measures similarity between any pair of instances
• the sum ranges over the n labeled examples in the classifier's training set
• the sign function determines whether the predicted classification comes out positive
or negative (+1 or -1).
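
A small Python sketch of this weighted sum of similarities, assuming a Gaussian (RBF) kernel as the similarity function (the kernel choice and the names are assumptions of the sketch):

import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # similarity between two instances, decreasing with their squared distance
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_classify(X, c, w, x_q, kernel=rbf_kernel):
    # c_q = sign( sum_{i=1..n} w_i * c_i * k(x_i, x_q) ), with labels c_i in {+1, -1}
    s = sum(w[i] * c[i] * kernel(X[i], x_q) for i in range(len(X)))
    return 1 if s >= 0 else -1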
Structure for the rest of the lecture

Instance Based Learning:
1a. Binary Linear Classifier
1b. Binary Non-Linear Classifier
2. Kernel Methods
3. Support Vector Machine
4. The Kernel Trick


Binary Linear (and Non-Linear) Classifiers

Binary Linear Classifiers are classifiers which are both
• binary (they distinguish between two categories) and
• linear (the classification is based on a linear function of the inputs).

Each data item is characterized by a vector x of features i = 1..n. We will refer to the
features of an instance x as ai(x).

Associated with each instance is a binary-valued class label: c(x).

The goal of a Binary Linear Classifier is to find a hyperplane (line, plane, ...) that
separates the instances of the two categories.

The property of the instance space required for such a hyperplane to be found is called
linear separability. Obviously, the linear classification techniques can only handle
linearly separable cases.

Techniques that can handle non-linear situations are called non-linear classifier
techniques.
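
As a minimal illustration in Python (the weight vector w, bias b and example numbers below are assumptions, not from the lecture):

import numpy as np

def linear_classify(w, b, x):
    # the hyperplane is the set of points where w . x + b = 0;
    # the predicted class depends on which side of it x lies
    return 1 if np.dot(w, x) + b >= 0 else -1

# example: the line x1 + x2 = 1 separates (2, 2) -> +1 from (0, 0) -> -1
print(linear_classify(np.array([1.0, 1.0]), -1.0, np.array([2.0, 2.0])))
print(linear_classify(np.array([1.0, 1.0]), -1.0, np.array([0.0, 0.0])))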
Hyperplane
A hyperplane in an n-dimensional Euclidean space is an (n-1)-dimensional subset of that
space that divides the space into two disconnected parts. In 2 and 3 dimensions,
hyperplanes are lines and planes, respectively.
Support Vector Machine

A Support Vector Machine is a kind of machine learning algorithm which:

• Operates in a supervised mode with pre-classified examples
• Can also operate in an incremental mode
• Is an instance of a non-probabilistic binary linear classifier system, where binary
means that it classifies instances into two classes and linear means that the instance
space has to be linearly separable
• Can be used to handle non-linear problems if the original instance space is
transformed into a linearly separable one
• Manages a flexible representation of the class boundaries
• Contains mechanisms to handle overfitting
• Has a single global minimum which can be found in polynomial time.
It is popular because:
• it is easy to use
• it often has good generalization performance
• the same algorithm solves a variety of problems with little tuning.
Support Vector Machine
A Support Vector Machine (SVM) performs binary linear
classification by finding the optimal hyperplane that
separates the two classes of instances.

A hyperplane in an n-dimensional Euclidean space is a flat, (n-1)-dimensional subset of
that space that divides the space into two disconnected parts. In the two-dimensional
space a hyperplane is a line dividing the plane in two parts.

New instances are mapped into the instance space and predicted to belong to one of the
classes based on which side of the hyperplane they are positioned.

Support Vector Machines can in the default case only be applied to linearly separable
instance spaces.
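
As a hedged sketch of this idea using scikit-learn's SVC (the toy data and parameter values are assumptions; scikit-learn itself is not part of the lecture):

import numpy as np
from sklearn.svm import SVC

# four linearly separable training instances with labels -1 / +1
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel='linear', C=1e6)   # a large C approximates a hard margin
clf.fit(X, y)

# new instances are classified by the side of the hyperplane they fall on
print(clf.predict([[0.5, 0.5], [3.5, 2.5]]))   # expected: [-1  1]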
Informal outline of the Support
Vector Machine algorithm

A small subset of instances in the borderline region between the main instance space
regions for the two classes are chosen as cornerstones for the analysis. These are called
'Support Vectors'.

The SVM algorithm aims at maximizing the margins around the separating hyperplane, i.e.
maximizing the distances between the target hyperplane and the chosen closest support
vectors.

The optimal (maximum margin) hyperplane is fully specified by the chosen support vectors.

The optimization problem can be expressed as a quadratic programming problem that can be
solved by standard methods.
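
In the same spirit, a sketch of how a fitted linear SVM exposes the chosen support vectors and the maximum-margin hyperplane (the data and parameters are again assumptions of the sketch):

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # the solver handles the quadratic program internally

print(clf.support_vectors_)                    # the instances chosen as support vectors
w, b = clf.coef_[0], clf.intercept_[0]         # separating hyperplane: w . x + b = 0
print(2.0 / np.linalg.norm(w))                 # width of the maximal margin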
Applying an SVM to non-linear and non-feature-vector data cases
As the SVM only handles linearly separable instance spaces with instances expressed in
fixed-length, real-valued feature vector form, all problems that we want to approach with
an SVM have to be transformed into such linearly separable spaces, typically of higher
dimensionality than the original.

We have two important cases:

1. Non-linearly separable spaces of feature vector instances

2. Instance spaces with the raw instances expressed in other forms

The further discussion will illustrate the first case.
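
A brief sketch of the first case using a non-linear (RBF) kernel in scikit-learn; the XOR-style data and the parameter values are assumptions of the sketch:

import numpy as np
from sklearn.svm import SVC

# XOR-like data: not linearly separable in the original two-dimensional space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

clf = SVC(kernel='rbf', gamma=2.0, C=1e6).fit(X, y)   # implicit mapping to a higher-dimensional space
print(clf.predict(X))                                  # expected: [-1  1  1 -1]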


The Kernel Trick
The most natural way to transform a non-linear feature space into an equivalent linear
space is to make the target space more high-dimensional.

However, the computation of instance coordinates in the high-dimensional target space is
often more complex and costly.

Luckily enough, the SVM algorithm only needs a distance/similarity measure between
instances positioned in a linearly separable space.

If we choose the distance/similarity function in the target space (the kernel function K)
in a clever way, we can potentially express it in terms of the coordinates from the
original instance space.

The clever choice is to define the kernel K in terms of the inner product of the mappings
of the coordinates in the original space.
Example
Consider a two-dimensional input space and a three-dimensional target space with the
following feature mapping onto the linearly separable three-dimensional space:

X := F(x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)
Z := F(z1, z2) -> (z1^2, z2^2, sqrt(2)*z1*z2)

K(X, Z) = the kernel or similarity function in the three-dimensional space
= the inner product of X and Z in that space: <X, Z>
= <(x1^2, x2^2, sqrt(2)*x1*x2), (z1^2, z2^2, sqrt(2)*z1*z2)>
= x1^2*z1^2 + x2^2*z2^2 + 2*x1*x2*z1*z2 = (x1*z1 + x2*z2)^2
= the square of the inner product between the corresponding vectors in the
two-dimensional space.
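
A short numeric check of this identity in Python (the particular points x and z are arbitrary choices, not from the lecture):

import numpy as np

def phi(v):
    # explicit mapping from the 2D input space to the 3D target space
    v1, v2 = v
    return np.array([v1 ** 2, v2 ** 2, np.sqrt(2) * v1 * v2])

def poly_kernel(x, z):
    # kernel computed directly in the original 2D space
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(np.dot(phi(x), phi(z)))   # inner product in the 3D space: 121.0
print(poly_kernel(x, z))        # (x . z)^2 computed in 2D:      121.0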
Extended use of Support Vector
Machines (SVM)
As described, SVM handles linear binary classification problems and, by transformation of
the instance space, also the non-linear cases.

SVM can be modified and combined to handle multiple-class problems. The strategy here is
to reduce the single multiclass problem into multiple binary classification problems.

SVM can also be modified to handle regression while still keeping most of the key
properties of the algorithm.
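
A hedged sketch with scikit-learn of both extensions: a one-vs-rest reduction of a three-class problem to binary SVMs, and support vector regression (the toy data and parameters are assumptions of the sketch):

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC, SVR

# multiclass: one 3-class problem reduced to three binary SVM problems (one-vs-rest)
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])
ovr = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)
print(ovr.predict([[0.5, 0.5], [5.0, 5.5], [9.5, 0.5]]))   # expected: [0 1 2]

# regression: support vector regression (SVR) keeps the same kernel machinery
Xr = np.linspace(0, 1, 20).reshape(-1, 1)
yr = 3.0 * Xr.ravel() + 1.0
svr = SVR(kernel='linear', C=10.0).fit(Xr, yr)
print(svr.predict([[0.5]]))                                # roughly 2.5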
NPTEL

Video Course on Machine Learning

Professor Carl Gustaf Jansson, KTH

Thanks for your attention!

The next lecture 4.5 will be on the topic:

Cluster Analysis
