An Idiot's Guide to Support Vector Machines (SVMs)
R. Berwick, Village Idiot
SVMs: A New
Generation of Learning Algorithms
Pre 1980:
Almost all learning methods learned linear decision surfaces.
Linear learning methods have nice theoretical properties
1980s
Decision trees and NNs allowed efficient learning of nonlinear decision surfaces
Little theoretical basis and all suffer from local minima
1990s
Efficient learning algorithms for non-linear functions based
on computational learning theory developed
Nice theoretical properties.
Key Ideas
Two independent developments within last decade
New efficient separability of non-linear regions using
kernel functions: a generalization of "similarity" to
new kinds of similarity measures based on dot products
Use of quadratic optimization problem to avoid local
minimum issues with neural nets
The resulting learning algorithm is an optimization
algorithm rather than a greedy search
Organization
Basic idea of support vector machines: just like 1-layer or multi-layer neural nets
Optimal hyperplane for linearly separable
patterns
Extend to patterns that are not linearly
separable by transformations of the original data to
map into a new space: the kernel function
SVM algorithm for pattern recognition
Support Vectors
Support vectors are the data points that lie closest
to the decision surface (or hyperplane)
They are the data points most difficult to classify
They have direct bearing on the optimum location
of the decision surface
We can show that the optimal hyperplane stems
from the function class with the lowest
capacity (= the number of independent features/parameters
we can twiddle) [note: this is extra material not
covered in the lectures; you don't have to know
this]
[Figure: the support vectors lie on the edges of the margin; the classifier maximizes that margin.]
Separation by Hyperplanes
Assume linear separability for now (we will relax this
later)
in 2 dimensions, can separate by a line
in higher dimensions, need hyperplanes
2-D Case
Support Vectors: Input vectors that just touch the boundary of the
margin (street), circled below; there are 3 of them (or, rather, the
tips of those vectors)
[Figure: the three circled support vectors touch the gutters of the street, the planes w0·x + b0 = +1 and w0·x + b0 = −1; d marks half the street width.]
Here, we have shown the actual support vectors, v1, v2, v3, instead of
just the 3 circled points at the tail ends of the support vectors. d
denotes 1/2 of the street width
Definitions
Define the hyperplanes H such that:
w·xi + b ≥ +1 when yi = +1
w·xi + b ≤ −1 when yi = −1
H1 and H2 are the planes:
H1: w·xi + b = +1
H2: w·xi + b = −1
[Figure: H1 and H2 bracket the separating plane H0 (where w·xi + b = 0); d+ and d− are the distances from H0 to the closest positive and closest negative points.]
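To make these definitions concrete, here is a minimal numpy sketch with a hand-picked (hypothetical) w, b and four toy points; it checks the combined constraint yi(w·xi + b) ≥ 1 and uses the standard fact that the street width d+ + d− works out to 2/‖w‖:

```python
import numpy as np

# hypothetical hand-picked separating hyperplane: w.x + b = 0 with w = (1, 0), b = -3
w = np.array([1.0, 0.0])
b = -3.0

# toy points; the first and third lie exactly on the gutters H1 and H2
X = np.array([[4.0, 1.0], [5.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
y = np.array([+1, +1, -1, -1])

margins = y * (X @ w + b)
print(margins)                   # all >= 1, so the H1/H2 constraints hold; the 1's are support vectors
print(2 / np.linalg.norm(w))     # street width d+ + d- = 2 / ||w|| = 2 (standard result)
```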
[Figure: the paraboloid f(x, y) = 2 + x² + 2y² flattened into its contour lines, cut by the constraint line x + y = 1.]
Example: minimize the paraboloid f(x, y) = 2 + x² + 2y² s.t. x + y = 1
Intuition: find the intersection of the two functions f, g at
a tangent point (intersection = both constraints
satisfied; tangent = derivative is 0); this will be a
min (or max) for f s.t. the constraint g is satisfied
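As a worked check of this example, here is a minimal sketch using sympy (the variable names are mine): setting the derivatives of the Lagrangian to zero and enforcing the constraint gives x = 2/3, y = 1/3 with multiplier 4/3.

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = 2 + x**2 + 2*y**2            # the paraboloid to minimize
g = x + y - 1                    # constraint g(x, y) = 0

L = f - lam * g                  # Lagrangian L = f - lam * g
stationary = [sp.diff(L, v) for v in (x, y, lam)]   # dL/dlam = 0 enforces g = 0
sol = sp.solve(stationary, (x, y, lam), dict=True)[0]
print(sol)                       # {x: 2/3, y: 1/3, lam: 4/3}
print(f.subs(sol))               # constrained minimum value: 8/3
```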
Two constraints
1. Parallel normal constraint (= gradient constraint
on f, g s.t. solution is a max, or a min)
2. g(x)=0 (solution is on the constraint line as well)
We now recast these by combining f, g as the new
Lagrangian function, by introducing new "slack" variables (the
Lagrange multipliers), denoted a here (more usually denoted
α or λ in the literature)
"f ( p ) = "! g ( p )
g ( x) = 0
Or, combining these two as the Langrangian L &
requiring derivative of L be zero:
L(x, a) = f (x) ! ag(x)
"(x, a) = 0
At a solution p
The constraint line g and the contour lines of f must
be tangent
If they are tangent, their gradient vectors
(perpendiculars) are parallel
The gradient of g is perpendicular to the constraint line g = 0
(it points in the direction of steepest ascent), and likewise the
gradient of f is perpendicular to the contour lines of f
So at the tangent point the gradient of f must be in the same
direction as the gradient of g
In general
min f(x) s.t. g(x) = 0: the gradient condition ∇f(x) = λ∇g(x)
must hold together with the constraint condition g(x) = 0
Lagrangian Formulation
So in the SVM problem the Lagrangian is
min LP = ½‖w‖² − Σi ai yi (xi·w + b) + Σi ai    (sums run over the l training points, i = 1, …, l)
From the property that the derivatives at the minimum are zero:
∂LP/∂w = 0  gives  w = Σi ai yi xi
∂LP/∂b = 0  gives  Σi ai yi = 0
w = Σi ai yi xi and Σi ai yi = 0
By substituting for w and b back in the original eqn we can get rid of the
dependence on w and b.
Note first that we already now have our answer for what the weights w
must be: they are a linear combination of the training inputs and the
training outputs, xi and yi, and the values of a. We will now solve for the
a's by differentiating the dual problem wrt a, and setting it to zero. Most
of the a's will turn out to have the value zero. The non-zero a's will
correspond to the support vectors
Primal problem:
min LP = ½‖w‖² − Σi ai yi (xi·w + b) + Σi ai
s.t. ∀i, ai ≥ 0
(recall w = Σi ai yi xi)

Dual problem:
max LD(ai) = Σi ai − ½ Σi Σj ai aj yi yj (xi·xj)
s.t. Σi ai yi = 0 and ai ≥ 0
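To see the dual in action, here is a minimal sketch (hypothetical four-point data set; scipy's general-purpose SLSQP solver stands in for the specialized QP solvers real SVM packages use) that maximizes LD subject to Σi ai yi = 0 and ai ≥ 0, then recovers w = Σi ai yi xi and b from the support vectors:

```python
import numpy as np
from scipy.optimize import minimize

# hypothetical, linearly separable toy data
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_ij = y_i y_j (x_i . x_j)

def neg_dual(a):                                 # maximize L_D  <=>  minimize -L_D
    return -a.sum() + 0.5 * a @ Q @ a

constraints = [{'type': 'eq', 'fun': lambda a: a @ y}]   # sum_i a_i y_i = 0
bounds = [(0.0, None)] * len(y)                          # a_i >= 0 (hard margin)

res = minimize(neg_dual, np.zeros(len(y)), bounds=bounds, constraints=constraints)
a = res.x
w = (a * y) @ X                                  # w = sum_i a_i y_i x_i
sv = a > 1e-6                                    # most a_i are 0; the nonzero ones mark the support vectors
b = np.mean(y[sv] - X[sv] @ w)                   # from y_i (w . x_i + b) = 1 on the support vectors
print(np.round(a, 3), w, round(b, 3))
```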
The same dual problem, now with an upper bound C on the multipliers:
max LD(ai) = Σi ai − ½ Σi Σj ai aj yi yj (xi·xj)
s.t. Σi ai yi = 0 and 0 ≤ ai ≤ C
(The bound C comes from the soft-margin version of the problem, where mistakes are allowed at a penalty; in the hard-margin case the constraint is simply ai ≥ 0.)
w = Σi ai yi xi
Remember: most of the weights wi, i.e., the a's, will be zero
Only the support vectors (on the gutters or margin) will have nonzero
weights, or a's; this reduces the dimensionality of the solution
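The same structure shows up in off-the-shelf solvers. In scikit-learn's SVC (a sketch with hypothetical random blobs), dual_coef_ stores exactly the nonzero ai·yi, support_vectors_ stores the corresponding xi, and multiplying them reproduces w:

```python
import numpy as np
from sklearn.svm import SVC

# hypothetical well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel='linear', C=10.0).fit(X, y)

print(clf.support_vectors_.shape[0], "support vectors out of", len(X))  # only a handful
w = clf.dual_coef_ @ clf.support_vectors_        # w = sum_i (a_i y_i) x_i over the support vectors only
print(np.allclose(w, clf.coef_))                 # True: matches the solver's own weight vector
```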
LD(ai) = Σi ai − ½ Σi Σj ai aj yi yj (xi·xj)
s.t. Σi ai yi = 0 and ai ≥ 0
The claim is that this function will be maximized if we give nonzero values to a's that
correspond to the support vectors, i.e., those that matter in fixing the maximum width
margin (street). Well, consider what this looks like. Note first from the constraint
condition that all the a's are positive. Now let's think about a few cases.
Case 1. If two features xi, xj are completely dissimilar, their dot product is 0, and they don't
contribute to L.
Case 2. If two features xi, xj are completely alike, their dot product is 1 (maximal). There are 2 subcases.
Subcase 1: both xi and xj predict the same output value yi (either +1 or −1). Then yi yj
is always 1, and the value of ai aj yi yj xi·xj will be positive. But this would decrease the
value of L (since it would subtract from the first term sum). So, the algorithm downgrades
similar feature vectors that make the same prediction.
Subcase 2: xi and xj make opposite predictions about the output value yi (i.e., one is
+1, the other −1), but are otherwise very closely similar: then the product ai aj yi yj xi·xj is
negative and we are subtracting it, so this adds to the sum, maximizing it. These are precisely
the examples we are looking for: the critical ones that tell the two classes apart.
But are we done???
Transformation to separate
[Figure: points labeled x and o that cannot be separated by a line in the original space become linearly separable after each point is mapped through Φ: x → Φ(x), o → Φ(o).]
Non-Linear SVMs
The idea is to gain linear separation by mapping the data to
a higher dimensional space
The following set can't be separated by a linear function,
but can be separated by a quadratic one
(x − a)(x − b) = x² − (a + b)x + ab
[Figure: points on a line; the points lying between a and b form one class, the rest the other.]
So if we map x → {x², x}
we gain linear separation
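To make this concrete, here is a minimal sketch (assuming a = −1, b = 1 and a few hand-picked 1-D points, all hypothetical) showing that the classes cannot be split by any threshold on x alone, but become separable by a line after the map x → (x, x²):

```python
import numpy as np

# hypothetical 1-D data: class +1 lies between a = -1 and b = 1, class -1 outside
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([-1, -1, +1, +1, +1, -1, -1])

# no single threshold on x separates the classes (+1 is sandwiched between -1's),
# but (x - a)(x - b) < 0  <=>  x^2 - (a + b)x + ab < 0 is linear in the pair (x, x^2)
phi = np.column_stack([x, x ** 2])

a, b = -1.0, 1.0
w = np.array([-(a + b), 1.0])      # coefficients of x and x^2
bias = a * b
scores = phi @ w + bias            # equals (x - a)(x - b)
print(np.sign(-scores) == y)       # all True: a linear rule in (x, x^2) recovers the labels
```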
What if the decision function is not linear? What transform Φ would separate these?
[Figure: classes labeled =+1 and =−1 that are not linearly separable; a radial transform Φ separates them.]
Remember the function we want to optimize: LD = Σi ai − ½ Σi Σj ai aj yi yj (xi·xj), where (xi·xj) is the
dot product of the two feature vectors. If we now transform to Φ, instead of computing this
dot product (xi·xj) we will have to compute (Φ(xi)·Φ(xj)). But how can we do this? This is
expensive and time consuming (suppose Φ is a quartic polynomial, or worse, we don't know the
function explicitly). Well, here is the neat thing:
If there is a "kernel function" K such that K(xi, xj) = Φ(xi)·Φ(xj), then we do not need to know
or compute Φ at all!! That is, the kernel function defines inner products in the transformed space.
Or, it defines similarity in the transformed space.
Non-linear SVMs
So, the function we end up optimizing is:
LD = Σi ai − ½ Σi Σj ai aj yi yj K(xi, xj)
Kernel example: The polynomial kernel
K(xi,xj) = (xixj + 1)p, where p is a tunable parameter
Note: Evaluating K only requires one addition and one exponentiation
more than the original dot product
x"y
2! 2
24
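As a quick sanity check, here is a minimal sketch (hypothetical 2-D vectors, p = 2) showing that the polynomial kernel computes the same number as an explicit dot product in the higher-dimensional feature space Φ(x) = (1, √2·x1, √2·x2, x1², √2·x1x2, x2²):

```python
import numpy as np

def poly_kernel(x, z, p=2):
    # K(x, z) = (x . z + 1)^p: one addition and one exponentiation beyond the dot product
    return (x @ z + 1) ** p

def phi(x):
    # explicit feature map whose inner product reproduces (x . z + 1)^2 for 2-D inputs
    x1, x2 = x
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])   # hypothetical example vectors
z = np.array([3.0, -1.0])
print(poly_kernel(x, z))   # 4.0 : (1*3 + 2*(-1) + 1)^2 = 2^2
print(phi(x) @ phi(z))     # 4.0 : same value, without the SVM ever forming phi
```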
Examples of kernel functions:
Polynomial learning machine: K(x, xi) = (xᵀxi + 1)^p; the power p is specified a priori by the user
Radial-basis function (RBF): K(x, xi) = exp(−‖x − xi‖² / (2σ²)); the width σ² is specified a priori
Sigmoid (two-layer perceptron): K(x, xi) = tanh(β0 xᵀxi + β1); it's the sigmoid
transform (for neural nets), so SVMs subsume neural nets! (but w/o their problems)
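For reference, a minimal sketch of how these three kernels map onto scikit-learn's SVC (the parameter names gamma, coef0, and degree are sklearn's; sklearn writes the RBF kernel as exp(−gamma·‖x − xi‖²), so gamma plays the role of 1/(2σ²); the hyperparameter values below are hypothetical):

```python
from sklearn.svm import SVC

p, sigma, beta0, beta1 = 3, 1.0, 0.01, -1.0   # hypothetical hyperparameter choices

poly_svm = SVC(kernel='poly', degree=p, gamma=1.0, coef0=1.0)    # (x.xi + 1)^p
rbf_svm = SVC(kernel='rbf', gamma=1.0 / (2 * sigma ** 2))        # exp(-||x - xi||^2 / (2 sigma^2))
sigmoid_svm = SVC(kernel='sigmoid', gamma=beta0, coef0=beta1)    # tanh(beta0 x.xi + beta1)
```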
[Figure: Gaussian (radial-basis) kernel example.]
Overfitting by SVM
Every point is a support vector: too much freedom to bend to fit the
training data, so no generalization.
In fact, SVMs have an automatic way to avoid such issues, but we
won't cover it here; see the book by Vapnik, 1995. (We add a
penalty function for mistakes made after training by over-fitting: recall
that if one over-fits, then one will tend to make errors on new data.
This penalty function can be put into the quadratic programming problem
directly. You don't need to know this for this course.)
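A minimal sketch of the first bullet, using scikit-learn's SVC on hypothetical noisy data: an overly flexible kernel (a very large RBF gamma, i.e., a very narrow Gaussian) typically makes almost every training point a support vector and memorizes the training set, while a saner kernel width together with the soft-margin penalty C keeps the solution sparse and better behaved on held-out data. The exact numbers will vary with the random data.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# hypothetical overlapping 2-class data (the noise guarantees no perfect separation)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.5, (200, 2)), rng.normal(1, 1.5, (200, 2))])
y = np.array([-1] * 200 + [1] * 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for gamma in (0.1, 100.0):          # kernel width: reasonable vs. absurdly narrow
    clf = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X_tr, y_tr)
    print(f"gamma={gamma}: {len(clf.support_)} of {len(X_tr)} points are support vectors, "
          f"train acc={clf.score(X_tr, y_tr):.2f}, test acc={clf.score(X_te, y_te):.2f}")
```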