

NPTEL

Video Course on Machine Learning

Professor Carl Gustaf Jansson, KTH

Week 4 Inductive Learning based on Symbolic Representations and Weak Theories

Video 4.4 Instance Based Learning Part 2


Distance-Weighted Nearest Neighbor Algorithm
The k-nearest neighbor classifier can be viewed as assigning the k nearest neighbors a
weight of 1 and all others a weight of 0.

This can be generalized to weighted nearest neighbor classifiers, in which the ith
nearest neighbor is assigned a weight wi, i = 1..k.

A typical weight is the inverse of the squared distance: wi = 1 / d(xq, xi)^2.

The weighted k-NN algorithm can be used both for classification and regression:
• In weighted k-NN classification, the output is a class membership. A query instance
(xq) is assigned the class label most common among its k nearest neighbors, where the
'vote' of each neighbor i is weighted with wi.
• In weighted k-NN regression, the output is the property value for the query instance (xq).
This value is the weighted sum of the property values of its k nearest neighbors divided by
the sum of the weights.

In the weighted approach one can extend the k-nearest neighbor method from the k nearest
neighbors to all data items. Keeping to the k nearest elements is called a local weighted
method, while the extension to all data items is called a global weighted method.
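
As an illustrative sketch, the distance-weighted k-NN rule above can be written in Python roughly as follows (the function names and the use of NumPy arrays as inputs are assumptions of this sketch, not part of the lecture):

import numpy as np

def nearest_weights(X_train, x_q, k=3, eps=1e-12):
    # weights w_i = 1 / d(x_q, x_i)^2 for the k nearest training instances
    d = np.linalg.norm(X_train - x_q, axis=1)
    idx = np.argsort(d)[:k]
    return idx, 1.0 / (d[idx] ** 2 + eps)

def weighted_knn_classify(X_train, y_train, x_q, k=3):
    # weighted vote: the class label with the largest summed weight wins
    idx, w = nearest_weights(X_train, x_q, k)
    votes = {}
    for i, wi in zip(idx, w):
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + wi
    return max(votes, key=votes.get)

def weighted_knn_regress(X_train, y_train, x_q, k=3):
    # weighted average of the neighbors' property values
    idx, w = nearest_weights(X_train, x_q, k)
    return np.dot(w, np.asarray(y_train)[idx]) / np.sum(w)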
Locally Weighted Regression
The nearest neighbor approaches approximate a target function for a single query instance (xq).
Locally weighted regression (LWR) is an extension of this approach. It constructs an explicit
approximation of the target function over a local region surrounding xq. The approximation may
be a linear function, a quadratic function, etc.

• The term local is motivated by the fact that the approximation is based only on data
near xq.
• The term weighted is motivated by the fact that the contributions of the training instances
are weighted based on their distance from xq. The weights are defined by a so-called kernel
function. One can say that the kernel function moderates the original distance measure.
• The term regression is motivated by the fact that we aim at approximating real-valued
functions.
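
A minimal Python sketch of locally weighted regression with a linear local model; the Gaussian kernel with bandwidth tau, the weighted least-squares formulation and the names are assumptions of the sketch:

import numpy as np

def locally_weighted_regression(X, y, x_q, tau=1.0):
    # weight each training instance by a Gaussian kernel of its distance to x_q
    d = np.linalg.norm(X - x_q, axis=1)
    w = np.exp(-(d ** 2) / (2 * tau ** 2))
    # fit a local linear model by weighted least squares: beta = (Xb^T W Xb)^-1 Xb^T W y
    Xb = np.hstack([np.ones((len(X), 1)), X])          # add a bias column
    W = np.diag(w)
    beta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y
    # evaluate the local linear approximation at the query point
    return np.hstack([1.0, x_q]) @ beta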
Kernel functions in Nonparametric Statistics
A kernel is a window function. When the argument is a distance measure, one can say that
the kernel function is a moderation of the original distance measure.

A kernel is a non-negative, real-valued, integrable function K.

For most applications, it is common to define the function to satisfy two additional requirements:
- K integrates to 1 over the whole real line: ∫ K(x) dx = 1
- K(x) = K(-x) for all x (symmetry).

Several types of kernel functions are commonly used: uniform, triangle, Epanechnikov,
quartic (bi-weight), tri-weight, Gaussian, and cosine.
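
For illustration, a few of these kernels written out in Python (a sketch; the one-dimensional form and the variable names are assumptions):

import numpy as np

def uniform(u):       return 0.5 * (np.abs(u) <= 1)
def triangle(u):      return (1 - np.abs(u)) * (np.abs(u) <= 1)
def epanechnikov(u):  return 0.75 * (1 - u ** 2) * (np.abs(u) <= 1)
def gaussian(u):      return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

# each function is non-negative, symmetric (K(u) = K(-u)) and integrates to 1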
Kernel methods as a kind of instance-based learners
Kernel functions are also useful in a kind of weighted instance-based learners called
kernel methods. The kernel function here serves as a similarity function.

This kind of method typically computes a classification as a weighted sum of similarities:

cq = sign( Sum_{i=1..n} wi * ci * k(xi, xq) )

where
• xi and ci are the feature vector and class label (+1 or -1) for training instance i
• cq is the predicted class label (+1 or -1) for the unlabeled input xq
• the wi are the weights for the training examples
• k is the kernel, i.e. the function that measures similarity between any pair of instances
• the sum ranges over the n labeled examples in the classifier's training set
• the sign function determines whether the predicted classification comes out positive
or negative (+1 or -1).
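
A small Python sketch of this weighted sum of similarities, assuming a Gaussian (RBF) kernel as the similarity function (the kernel choice and the names are assumptions of the sketch):

import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # similarity between two instances, decreasing with their squared distance
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_classify(X, c, w, x_q, kernel=rbf_kernel):
    # c_q = sign( sum_{i=1..n} w_i * c_i * k(x_i, x_q) ), with labels c_i in {+1, -1}
    s = sum(w[i] * c[i] * kernel(X[i], x_q) for i in range(len(X)))
    return 1 if s >= 0 else -1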
Structure for the rest of the lecture

Instance Based Learning:
1a. Binary Linear Classifier
1b. Binary Non-Linear Classifier
2. Kernel Methods
3. Support Vector Machine
4. The Kernel Trick


Binary Linear (and Non-Linear) Classifiers

Binary Linear Classifiers are classifiers which are both
• binary (they distinguish between two categories) and
• linear (the classification is based on a linear function of the inputs).

Each data item is characterized by a vector x of features i = 1..n. We will refer to the
features of an instance x as ai(x).

Associated with each instance is a binary-valued class label: c(x).

The goal of a Binary Linear Classifier is to find a hyperplane (line, plane, ...) that
separates the instances of the two categories.

The property of the instance space required for such a hyperplane to be found is called
linear separability. Obviously, the linear classification techniques can only handle
linearly separable cases.

Techniques that can handle non-linear situations are called non-linear classifier
techniques.
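
As a minimal illustration in Python (the weight vector w, bias b and example numbers below are assumptions, not from the lecture):

import numpy as np

def linear_classify(w, b, x):
    # the hyperplane is the set of points where w . x + b = 0;
    # the predicted class depends on which side of it x lies
    return 1 if np.dot(w, x) + b >= 0 else -1

# example: the line x1 + x2 = 1 separates (2, 2) -> +1 from (0, 0) -> -1
print(linear_classify(np.array([1.0, 1.0]), -1.0, np.array([2.0, 2.0])))
print(linear_classify(np.array([1.0, 1.0]), -1.0, np.array([0.0, 0.0])))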
Hyperplane
A hyperplane in an n-dimensional Euclidean space is an (n-1)-dimensional subset of that
space that divides the space into two disconnected parts. In 2 and 3 dimensions,
hyperplanes are lines and planes, respectively.
Support Vector Machine

A Support Vector Machine is a kind of machine learning algorithm which:

• Operates in a supervised mode with pre-classified examples
• Can also operate in an incremental mode
• Is an instance of a non-probabilistic binary linear classifier system, where binary
means that it classifies instances into two classes and linear means that the instance
space has to be linearly separable
• Can be used to handle non-linear problems if the original instance space is
transformed into a linearly separable one
• Manages a flexible representation of the class boundaries
• Contains mechanisms to handle overfitting
• Has a single global minimum which can be found in polynomial time.
It is popular because:
• it is easy to use
• it often has good generalization performance
• the same algorithm solves a variety of problems with little tuning.
Support Vector Machine
A Support Vector Machine (SVM) performs binary linear
classification by finding the optimal hyperplane that
separates the two classes of instances.

A hyperplane in an n-dimensional Euclidean space is a flat, (n-1)-dimensional subset of
that space that divides the space into two disconnected parts. In the two-dimensional
space a hyperplane is a line dividing the plane in two parts.

New instances are mapped into the instance space and predicted to belong to one of the
classes based on which side of the hyperplane they are positioned.

Support Vector Machines can in the default case only be applied to linearly separable
instance spaces.
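
As a hedged sketch of this idea using scikit-learn's SVC (the toy data and parameter values are assumptions; scikit-learn itself is not part of the lecture):

import numpy as np
from sklearn.svm import SVC

# four linearly separable training instances with labels -1 / +1
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel='linear', C=1e6)   # a large C approximates a hard margin
clf.fit(X, y)

# new instances are classified by the side of the hyperplane they fall on
print(clf.predict([[0.5, 0.5], [3.5, 2.5]]))   # expected: [-1  1]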
Informal outline of the Support
Vector Machine algorithm

A small subset of instances in the borderline region between the main instance space
regions for the two classes are chosen as cornerstones for the analysis. These are called
'Support Vectors'.

The SVM algorithm aims at maximizing the margins around the separating hyperplane, i.e.
maximizing the distances between the target hyperplane and the chosen closest support
vectors.

The optimal (maximum margin) hyperplane is fully specified by the chosen support vectors.

The optimization problem can be expressed as a quadratic programming problem that can be
solved by standard methods.
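
In the same spirit, a sketch of how a fitted linear SVM exposes the chosen support vectors and the maximum-margin hyperplane (the data and parameters are again assumptions of the sketch):

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # the solver handles the quadratic program internally

print(clf.support_vectors_)                    # the instances chosen as support vectors
w, b = clf.coef_[0], clf.intercept_[0]         # separating hyperplane: w . x + b = 0
print(2.0 / np.linalg.norm(w))                 # width of the maximal margin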
Applying an SVM to non-linear and non-feature-vector data cases
As the SVM only handles linearly separable instance spaces with instances expressed in
fixed-length, real-valued feature vector form, all problems that we want to approach with
an SVM have to be transformed into such linearly separable spaces, typically of higher
dimensionality than the original.

We have two important cases:

1. Non-linearly separable spaces of feature vector instances

2. Instance spaces with the raw instances expressed in other forms

The further discussion will illustrate the first case.
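
A brief sketch of the first case using a non-linear (RBF) kernel in scikit-learn; the XOR-style data and the parameter values are assumptions of the sketch:

import numpy as np
from sklearn.svm import SVC

# XOR-like data: not linearly separable in the original two-dimensional space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

clf = SVC(kernel='rbf', gamma=2.0, C=1e6).fit(X, y)   # implicit mapping to a higher-dimensional space
print(clf.predict(X))                                  # expected: [-1  1  1 -1]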


The Kernel Trick
The most natural way to transform a non-linear feature space into an equivalent linear
space is to make the target space more high-dimensional.

However, the computation of instance coordinates in the high-dimensional target space is
often more complex and costly.

Luckily enough, the SVM algorithm only needs a distance/similarity measure between
instances positioned in a linearly separable space.

If we choose the distance/similarity function in the target space (the kernel function K)
in a clever way, we can potentially express it in terms of the coordinates from the
original instance space.

The clever choice is to define the kernel K in terms of the inner product of the mappings
of the coordinates in the original space.
Example
Consider a two-dimensional input space and a three-dimensional target space with the
following feature mapping onto the linearly separable three-dimensional space:

X := F(x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)
Z := F(z1, z2) -> (z1^2, z2^2, sqrt(2)*z1*z2)

K(X, Z) = the kernel or similarity function in the three-dimensional space
= the inner product of X and Z in that space: <X, Z>
= <(x1^2, x2^2, sqrt(2)*x1*x2), (z1^2, z2^2, sqrt(2)*z1*z2)>
= x1^2*z1^2 + x2^2*z2^2 + 2*x1*x2*z1*z2 = (x1*z1 + x2*z2)^2
= the square of the inner product between the corresponding vectors in the
two-dimensional space.
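
A short numeric check of this identity in Python (the particular points x and z are arbitrary choices, not from the lecture):

import numpy as np

def phi(v):
    # explicit mapping from the 2D input space to the 3D target space
    v1, v2 = v
    return np.array([v1 ** 2, v2 ** 2, np.sqrt(2) * v1 * v2])

def poly_kernel(x, z):
    # kernel computed directly in the original 2D space
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(np.dot(phi(x), phi(z)))   # inner product in the 3D space: 121.0
print(poly_kernel(x, z))        # (x . z)^2 computed in 2D:      121.0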
Extended use of Support Vector
Machines (SVM)
As described, SVM handles linear binary classification problems and, by transformation of
the instance space, also the non-linear cases.

SVM can be modified and combined to handle multiple-class problems. The strategy here is
to reduce the single multiclass problem into multiple binary classification problems.

SVM can also be modified to handle regression while still keeping most of the key
properties of the algorithm.
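
A hedged sketch with scikit-learn of both extensions: a one-vs-rest reduction of a three-class problem to binary SVMs, and support vector regression (the toy data and parameters are assumptions of the sketch):

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC, SVR

# multiclass: one 3-class problem reduced to three binary SVM problems (one-vs-rest)
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])
ovr = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)
print(ovr.predict([[0.5, 0.5], [5.0, 5.5], [9.5, 0.5]]))   # expected: [0 1 2]

# regression: support vector regression (SVR) keeps the same kernel machinery
Xr = np.linspace(0, 1, 20).reshape(-1, 1)
yr = 3.0 * Xr.ravel() + 1.0
svr = SVR(kernel='linear', C=10.0).fit(Xr, yr)
print(svr.predict([[0.5]]))                                # roughly 2.5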
NPTEL

Video Course on Machine Learning

Professor Carl Gustaf Jansson, KTH

Thanks for your attention!

The next lecture 4.5 will be on the topic:

Cluster Analysis
