Lecture 09: SVM Intro, Kernel Trick (Updated)


Support Vector Machines

Dr. M. Aksam Iftikhar
Assistant Professor, CS
SVM: History & Motivation
2

o Support Vector Machine (SVM) is a supervised learning algorithm
developed by Vladimir Vapnik. It was first presented in 1992,
introduced by Boser, Guyon and Vapnik at COLT-92.
o Vapnik is said to have mentioned the idea as early as 1979 in one of
his papers, but its major development took place in the 1990s.
o SVMs entered the mainstream because of their exceptional performance
in handwritten digit recognition.
SVM: Applications
3

o Text and image classification
o Handwriting recognition
o Data mining
o Bioinformatics
o Medicine and biosequence analysis
o and even stock market prediction
SVM: Algorithm
4
SVM: Problem Definition - Linear separable case
5

We are given a set of n points (vectors): x1, x2,…,xn, such


that xi is a vector of length m, and each belong to one of
two classes we label them by “+1” and “-1”.
So our training set is:
( x1 , y1 ), ( x2 , y2 ),....( xn , yn ) i xi  R m , yi  {1,  1}
We want to find a separating hyperplane w.x + b = 0
that separates these points into the two classes.
“The positives” (class “+1”) and So the decision
function will be
“The negatives” (class “-1”). f(x) = sign(w.x + b)
(Assuming that they are linearly
separable)
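To make this concrete, here is a minimal Python/numpy sketch of the decision function (an illustration of mine, not from the slides; w and b are hand-picked rather than learned):

import numpy as np

def decision_function(X, w, b):
    """Classify each row of X with f(x) = sign(w·x + b).

    X : (n, m) array of samples, w : (m,) weight vector, b : scalar bias.
    """
    return np.sign(X @ w + b)

# Toy example: 2-D points and a hand-picked hyperplane x1 + x2 - 1 = 0
X = np.array([[2.0, 1.0], [0.0, 0.0], [-1.0, 0.5]])
w = np.array([1.0, 1.0])
b = -1.0
print(decision_function(X, w, b))  # -> [ 1. -1. -1.]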
Linear Separators
6

Binary classification can be viewed as the task of


separating classes in feature space:
wTx + b = 0
wTx + b > 0
wTx + b < 0

f(x) = sign(wTx + b)

But There are many possibilities


for such hyperplanes !!
SVM: Separating Hyperplane
7
yi 1
yi  1

Which one
should we
choose!

Yes, There are many possible separating hyperplanes


It could be this one or this or this or may be….!
SVM: Choosing a Separating Hyperplane
8

Suppose we choose the hyperplane (seen below) that is close to some
sample xi.
Now suppose we have a new point x' that should be in class "-1" and
is close to xi. Using our classification function
f(x) = sign(w·x + b), this point is misclassified!

[Figure: a hyperplane passing close to xi, with the new point x' falling on the wrong side]

Poor generalization! (Poor performance on unseen data)
SVM: Choosing a Separating Hyperplane
9

The hyperplane should be as far as possible from any sample point.
This way, new data that is close to the old samples will be
classified correctly.

[Figure: the same points xi and x', now correctly separated by a hyperplane far from both classes]

Good generalization!
SVM: Choosing a Separating Hyperplane
10

The SVM idea is to maximize the distance between the hyperplane and
the closest sample points.
For the optimal hyperplane:
the distance to the closest negative point = the distance to the
closest positive point.
SVM: Choosing a Separating Hyperplane
11
SVM's goal is to maximize the margin, which is twice the
perpendicular distance "d" between the separating hyperplane and the
closest sample.

[Figure: the separating hyperplane with its margin, showing the distance d on each side and the closest sample xi]

Why is it the best?
 Robust to outliers, as we saw, and thus strong generalization
ability.
 It has proved itself to have better performance on test data, both
in practice and in theory.
SVM: Support Vectors
12

Support vectors are the samples closest to the separating hyperplane.
Oh! So this is where the name came from!

[Figure: margin diagram with the closest samples on each margin boundary labelled "These are Support Vectors"]
Maximum Margin Classification 13

Maximizing the margin is also good according to intuition.
It implies that only the support vectors matter; the other training
examples are ignorable.
Finding the SVM hyperplane
14

• Determining the hyperplane is equivalent to finding the optimal
parameters w and b.
• The SVM hyperplane is determined by solving a dual optimization
problem based on Lagrange multipliers, which is not discussed here in
detail.
• However, it is important to note that this optimization problem
requires computing the inner products xiTxj between all pairs of
training examples (see the sketch below).
• Remember, the inner product (dot product) of two vectors is simply
the sum of the products of their corresponding components.
• We will return to this point later in the topic of the Kernel trick.
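As a small illustrative sketch (not part of the original slides), the matrix of all pairwise inner products, the Gram matrix, can be computed as follows, assuming the training examples are stacked as rows of X:

import numpy as np

# Toy training set: n = 4 examples, m = 2 features each
X = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [3.0, 0.0],
              [-2.0, 1.5]])

# Gram matrix G[i, j] = xi · xj = sum_k xi[k] * xj[k]
G = X @ X.T            # shape (n, n)

# The same entry computed explicitly for i = 0, j = 1
print(G[0, 1], np.dot(X[0], X[1]))  # both give 1.0*0.5 + 2.0*(-1.0) = -1.5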
SVM: Limitations
15

Limitations of Linear SVM:
 Doesn't work well on non-linearly separable data.
 Noise (outlier) problem.

But it can deal with non-linear classes with a nice tweak.
SVM: Non Linear case
16
Key idea: map our points with a mapping function φ(x) to a space of
sufficiently high dimension so that they become separable by a
hyperplane in the new higher-dimensional feature space.
o Input space: the space where the points xi are located.
o Feature space: the space of φ(xi) after the transformation, where
φ(.) is the transformation function.
For example, a non-linearly separable case in one dimension becomes
separable after mapping the data to two-dimensional space with
φ(x) = (x, x2).

[Figure: 1-D points on the x axis that no single threshold can split; after mapping with φ(x) = (x, x2) they can be separated by a line in the (x, x2) plane]
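A minimal numeric sketch of this idea (my own illustration, not from the slides): 1-D points that no single threshold can separate become linearly separable after the mapping φ(x) = (x, x²):

import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.where(np.abs(x) > 1.0, 1, -1)   # class +1 on the outside, -1 in the middle

# No single threshold on x separates the classes, but after phi(x) = (x, x^2)
phi = np.column_stack([x, x**2])

# the line x^2 = 1 (i.e. w = (0, 1), b = -1) separates them perfectly
w, b = np.array([0.0, 1.0]), -1.0
pred = np.sign(phi @ w + b)
print(np.array_equal(pred, y))  # True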
Interlude: Illustration of a hyperplane
17

• A hyperplane refers to a D-dimensional plane in a (D+1)-dimensional
space.
• E.g. an SVM boundary line is a hyperplane in a 2D input space
(see figure below). Similarly, in a 3D input space, the SVM boundary
will be a 2-dimensional hyperplane, and so on.
• This terminology is especially used for 4D and higher-dimensional
input spaces, for which visualizing the input space is not possible.

[Figure: a boundary line (1-D hyperplane) separating points in the 2-D (x, x2) plane]
Input space vs Feature space
18

• Earlier, we referred to a space induced by features (feature
vectors) as a feature space.
• In this lecture on SVM, the space before mapping to a higher
dimension is referred to as the input space, and after the mapping,
it's called the feature space.
SVM: Non Linear
19

An illustration of the algorithm in 2D input space:


Non-linear SVMs: Feature spaces
20

General idea: the original feature space can always be mapped to


some higher-dimensional feature space where the training set is
separable:
Φ: x → φ(x)
Limitations of non-linear mapping
21

1. The new feature space (after mapping) may be very high
dimensional, and working in the high-dimensional space is
computationally expensive.
• Remember, in order to compute w (and b), we need to compute the
inner products of the feature vectors of the training examples,
i.e. φ(xi)Tφ(xj) for all i, j.
• Computing the inner products in the higher-dimensional space
increases the computational complexity to a great extent (see the
sketch below).
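To get a feel for the blow-up (an illustrative calculation of mine, not from the slides), the number of monomial features of degree at most p in m input variables is C(m+p, p), which grows quickly:

from math import comb

# Number of monomial features of degree <= p built from m input variables
def poly_feature_count(m, p):
    return comb(m + p, p)

for m in (2, 10, 100):
    print(m, poly_feature_count(m, 3))   # 10, 286, 176851 features for a cubic map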
Limitations of non-linear mapping
22

2. How do we know the mapping function, i.e. what will be the new set
of features in the higher-dimensional space? E.g. for a single
feature x, the additional features could be x2, sqrt(x), sin(x), etc.
• Finding the correct mapping function is not easy.

Both of the above limitations are addressed in SVM classification by
applying a simple tweak – the Kernel Trick.
23

The Kernel Trick


The Kernel Trick – intuition
24

• To compute the inner products of vectors in the higher-dimensional
space, we can apply the kernel trick.
• Using this trick, we can compute the inner products (in the higher
dimension) without actually going into the higher dimension.
• Remember, it's the inner products of the vectors that we need in
the higher dimension, not the vectors themselves.
• This trick is known as the Kernel trick, as it is applied with the
help of a kernel function.
SVM: Kernel Function
25

• A kernel function is defined as a function of the input space that
can be written as a dot product of two feature vectors in the
expanded feature space:
K(xi, xj) = φ(xi)Tφ(xj)
• We noted earlier that determining the hyperplane in linear SVM
involves the dot product of input vectors, i.e. xiTxj.
• Now, we only need to compute K(xi, xj) (which uses the input space)
and don't need to perform the computations in the higher-dimensional
feature space explicitly. This is what is called the Kernel Trick.
The “Kernel Trick” – example
26

• Let us understand the kernel trick with the help of an example.
• Assume a non-linearly separable case, where the original input
vectors are 2-dimensional, i.e. x = [x1, x2].
• In order to apply SVM, the input space is transformed into a
(higher) 6-dimensional feature space according to the following
mapping function:
• φ(x) = [1, x1², √2·x1x2, x2², √2·x1, √2·x2], i.e. every data point
x = [x1, x2] is mapped into the high-dimensional space via the
transformation Φ: x → φ(x).
The “Kernel Trick” – example (contd.)
27

• Now, determining the hyperplane in this higher dimension requires
computing the dot product of the feature vectors, i.e. φ(xi)Tφ(xj)
for all i, j. Therefore,
• K(xi, xj) = φ(xi)Tφ(xj), where φ(x) = [1, x1², √2·x1x2, x2², √2·x1, √2·x2]
• Expanding the product:
K(xi, xj) = [1, xi1², √2·xi1xi2, xi2², √2·xi1, √2·xi2]T [1, xj1², √2·xj1xj2, xj2², √2·xj1, √2·xj2]
          = 1 + xi1²xj1² + 2·xi1xj1xi2xj2 + xi2²xj2² + 2·xi1xj1 + 2·xi2xj2
          = (1 + xiTxj)²
• The final expression is a function of the original input space, i.e.
xi = [xi1, xi2], and not of the transformed higher-dimensional feature
space.
The “Kernel Trick” – example (contd.)
28

• In other words, if we compute (1 + xiTxj)² in the input space, this
is equivalent to finding the inner products of the vectors in the
higher-dimensional feature space, i.e.
• K(xi, xj) = (1 + xiTxj)² = φ(xi)Tφ(xj)
• Thus, a kernel function implicitly maps data to a high-dimensional
feature space (without the need to compute each φ(x) explicitly).
• In this case, the kernel function is K(xi, xj) = (1 + xiTxj)².
• Note that choosing the kernel function also relieves us from the
problem of choosing the feature set in the higher dimension, i.e. we
don't need to engineer features in the higher dimension (a small
numeric check of this identity is sketched below).
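As a quick numeric sanity check (my own sketch, not part of the slides), the kernel value computed in the input space matches the explicit dot product in the 6-dimensional feature space:

import numpy as np

def phi(x):
    """Explicit 6-D feature map from the example: x = [x1, x2]."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2)*x1*x2, x2**2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2])

def poly_kernel(xi, xj):
    """Kernel computed purely in the 2-D input space."""
    return (1.0 + np.dot(xi, xj)) ** 2

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])

print(poly_kernel(xi, xj))            # 4.0  ->  (1 + 3 - 2)^2
print(np.dot(phi(xi), phi(xj)))       # 4.0  ->  same value, computed in 6-D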
The “Kernel Trick”
29

• The kernel function K(xi, xj) = (1 + xiTxj)² is a special case of a
more general class of kernel functions, called polynomial kernels
(more about this on the next slide).
• This is not the only choice of kernel function.
• In fact, we can choose from different kernel functions, including
the polynomial kernel, the sigmoid kernel, the Gaussian kernel, etc.
Examples of Kernel Functions 30

Linear kernel: K(xi, xj) = xiTxj

Polynomial kernel of power p: K(xi, xj) = (1 + xiTxj)^p

Gaussian kernel: K(xi, xj) = exp(−||xi − xj||² / 2σ²)
(Also called the RBF kernel; can lift to an infinite-dimensional space)

Two-layer perceptron (sigmoid) kernel: K(xi, xj) = tanh(β0·xiTxj + β1)
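An illustrative numpy sketch of these kernels (my own code, not from the slides; the parameter values are arbitrary defaults):

import numpy as np

def linear_kernel(xi, xj):
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, p=2):
    return (1.0 + np.dot(xi, xj)) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * np.dot(xi, xj) + beta1)

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
for k in (linear_kernel, polynomial_kernel, gaussian_kernel, sigmoid_kernel):
    print(k.__name__, k(xi, xj))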
SVM: Kernel Issues
31

How do we know which kernel to use?
This is a good question and actually still an open question. Many
researchers have been working on this issue, but we still don't have
a firm answer; it is one of the weaknesses of SVM. Generally, we have
to test each kernel for a particular problem (a sketch of such a
search is given below).
How do we verify that rising to a higher dimension using a specific
kernel will map the data to a space in which it is linearly
separable?
Even though rising to a higher dimension increases the likelihood
that the data will be separable, we cannot guarantee it.
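One practical way to "test each kernel" is a simple cross-validated grid search; below is a hedged sketch using scikit-learn (the dataset and parameter grid are illustrative choices of mine, not from the slides):

from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Try several kernels (and a couple of hyperparameters) with 5-fold CV
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "degree": [2, 3], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "gamma": [0.1, 1, 10], "C": [0.1, 1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)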
SVM: Kernel Issues
32

We saw that the Gaussian Radial Basis Function kernel lifts the data
to an infinite-dimensional space, so our data is always separable in
that space. So why don't we always use this kernel?
First of all, we have to decide which σ to use in this kernel:
K(xi, xj) = exp(−||xi − xj||² / 2σ²)
Secondly, a strong kernel, which lifts the data to infinite
dimension, may sometimes lead us to the severe problem of
overfitting (see the sketch below).
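A brief illustration of this trade-off (my own sketch; in scikit-learn the RBF width is controlled through gamma = 1/(2σ²) rather than σ directly):

from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

# Small gamma = wide Gaussian (smoother boundary); large gamma = narrow
# Gaussian that can memorise the training set and overfit.
for gamma in (0.1, 1.0, 100.0):
    clf = SVC(kernel="rbf", gamma=gamma)
    train_acc = clf.fit(X, y).score(X, y)
    cv_acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"gamma={gamma}: train={train_acc:.2f}, cv={cv_acc:.2f}")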
SVM: Kernel Issues
35
o In addition to the above problems, another problem is that
sometimes the points are linearly separable but the margin is low.

All these problems lead us to a compromise solution, the Soft Margin:
A soft margin in a Support Vector Machine (SVM) is a technique that
allows SVMs to classify data that is not linearly separable by
allowing some mistakes.
The goal is to keep the margin wide enough so that the other points
can still be classified correctly (a brief sketch is given below).
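In practice the softness of the margin is controlled by a penalty parameter, usually called C; here is a minimal scikit-learn sketch (my own illustration, on an arbitrary toy dataset):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs: not perfectly linearly separable
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

# Small C -> wide, tolerant margin (more misclassified training points);
# large C -> narrow margin that tries hard to classify every point.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: support vectors={len(clf.support_)}, "
          f"train accuracy={clf.score(X, y):.2f}")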