An Idiot's Guide to Support Vector Machines (SVMs)
R. Berwick, Village Idiot
SVMs: A New
Generation of Learning Algorithms
Pre 1980:
Almost all learning methods learned linear decision surfaces.
Linear learning methods have nice theoretical properties
1980s
Decision trees and NNs allowed efficient learning of nonlinear decision surfaces
Little theoretical basis and all suffer from local minima
1990s
Efficient learning algorithms for non-linear functions based
on computational learning theory developed
Nice theoretical properties.
Key Ideas
Two independent developments within last decade
New efficient separability of non-linear regions using
kernel functions: a generalization of "similarity" to
new kinds of similarity measures based on dot products
Use of quadratic optimization problem to avoid local
minimum issues with neural nets
The resulting learning algorithm is an optimization
algorithm rather than a greedy search
Organization
Basic idea of support vector machines: just like 1-layer or multi-layer neural nets
Optimal hyperplane for linearly separable
patterns
Extend to patterns that are not linearly
separable by transformations of the original data to
map into a new space: the kernel function
SVM algorithm for pattern recognition
Support Vectors
Support vectors are the data points that lie closest
to the decision surface (or hyperplane)
They are the data points most difficult to classify
They have direct bearing on the optimum location
of the decision surface
We can show that the optimal hyperplane stems
from the function class with the lowest
capacity (= the number of independent features/parameters
we can twiddle) [note: this is extra material not
covered in the lectures; you don't have to know
this]
[Figure: the support vectors lie on the edges of the margin; the classifier maximizes that margin.]
Separation by Hyperplanes
Assume linear separability for now (we will relax this
later)
in 2 dimensions, can separate by a line
in higher dimensions, need hyperplanes
2-D Case
Support Vectors: Input vectors that just touch the boundary of the
margin (street), circled below; there are 3 of them (or, rather, the
tips of those vectors)
[Figure: the three circled support vectors touch the gutters of the street, the planes w0·x + b0 = +1 and w0·x + b0 = −1; d marks half the street width.]
Here, we have shown the actual support vectors, v1, v2, v3, instead of
just the 3 circled points at the tail ends of the support vectors. d
denotes 1/2 of the street width
Definitions
Define the hyperplanes H such that:
w·xi + b ≥ +1 when yi = +1
w·xi + b ≤ −1 when yi = −1
H1 and H2 are the planes:
H1: w·xi + b = +1
H2: w·xi + b = −1
[Figure: H1 and H2 bracket the separating plane H0 (where w·xi + b = 0); d+ and d− are the distances from H0 to the closest positive and closest negative points.]
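To make these definitions concrete, here is a minimal numpy sketch with a hand-picked (hypothetical) w, b and four toy points; it checks the combined constraint yi(w·xi + b) ≥ 1 and uses the standard fact that the street width d+ + d− works out to 2/‖w‖:

```python
import numpy as np

# hypothetical hand-picked separating hyperplane: w.x + b = 0 with w = (1, 0), b = -3
w = np.array([1.0, 0.0])
b = -3.0

# toy points; the first and third lie exactly on the gutters H1 and H2
X = np.array([[4.0, 1.0], [5.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
y = np.array([+1, +1, -1, -1])

margins = y * (X @ w + b)
print(margins)                   # all >= 1, so the H1/H2 constraints hold; the 1's are support vectors
print(2 / np.linalg.norm(w))     # street width d+ + d- = 2 / ||w|| = 2 (standard result)
```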
[Figure: the paraboloid f(x, y) = 2 + x² + 2y² flattened into its contour lines, cut by the constraint line x + y = 1.]
Example: minimize the paraboloid f(x, y) = 2 + x² + 2y² s.t. x + y = 1
Intuition: find the intersection of the two functions f, g at
a tangent point (intersection = both constraints
satisfied; tangent = derivative is 0); this will be a
min (or max) for f s.t. the constraint g is satisfied
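As a worked check of this example, here is a minimal sketch using sympy (the variable names are mine): setting the derivatives of the Lagrangian to zero and enforcing the constraint gives x = 2/3, y = 1/3 with multiplier 4/3.

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = 2 + x**2 + 2*y**2            # the paraboloid to minimize
g = x + y - 1                    # constraint g(x, y) = 0

L = f - lam * g                  # Lagrangian L = f - lam * g
stationary = [sp.diff(L, v) for v in (x, y, lam)]   # dL/dlam = 0 enforces g = 0
sol = sp.solve(stationary, (x, y, lam), dict=True)[0]
print(sol)                       # {x: 2/3, y: 1/3, lam: 4/3}
print(f.subs(sol))               # constrained minimum value: 8/3
```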
Two constraints
1. Parallel normal constraint (= gradient constraint
on f, g s.t. solution is a max, or a min)
2. g(x)=0 (solution is on the constraint line as well)
We now recast these by combining f, g as the new
Lagrangian function, by introducing new "slack" variables (the
Lagrange multipliers), denoted a here (more usually denoted
α or λ in the literature)
"f ( p ) = "! g ( p )
g ( x) = 0
Or, combining these two as the Langrangian L &
requiring derivative of L be zero:
L(x, a) = f (x) ! ag(x)
"(x, a) = 0
At a solution p
The constraint line g and the contour lines of f must
be tangent
If they are tangent, their gradient vectors
(perpendiculars) are parallel
The gradient of g is perpendicular to the constraint line g = 0
(it points in the direction of steepest ascent), and likewise the
gradient of f is perpendicular to the contour lines of f
So at the tangent point the gradient of f must be in the same
direction as the gradient of g
In general
min f(x) s.t. g(x) = 0: the gradient condition ∇f(x) = λ∇g(x)
must hold together with the constraint condition g(x) = 0
Lagrangian Formulation
So in the SVM problem the Lagrangian is
min LP = ½‖w‖² − Σi ai yi (xi·w + b) + Σi ai    (sums run over the l training points, i = 1, …, l)
From the property that the derivatives at the minimum are zero:
∂LP/∂w = 0  gives  w = Σi ai yi xi
∂LP/∂b = 0  gives  Σi ai yi = 0
w = Σi ai yi xi and Σi ai yi = 0
By substituting for w and b back in the original eqn we can get rid of the
dependence on w and b.
Note first that we already now have our answer for what the weights w
must be: they are a linear combination of the training inputs and the
training outputs, xi and yi, and the values of a. We will now solve for the
a's by differentiating the dual problem wrt a, and setting it to zero. Most
of the a's will turn out to have the value zero. The non-zero a's will
correspond to the support vectors
Primal problem:
min LP = ½‖w‖² − Σi ai yi (xi·w + b) + Σi ai
s.t. ∀i, ai ≥ 0
(recall w = Σi ai yi xi)

Dual problem:
max LD(ai) = Σi ai − ½ Σi Σj ai aj yi yj (xi·xj)
s.t. Σi ai yi = 0 and ai ≥ 0
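To see the dual in action, here is a minimal sketch (hypothetical four-point data set; scipy's general-purpose SLSQP solver stands in for the specialized QP solvers real SVM packages use) that maximizes LD subject to Σi ai yi = 0 and ai ≥ 0, then recovers w = Σi ai yi xi and b from the support vectors:

```python
import numpy as np
from scipy.optimize import minimize

# hypothetical, linearly separable toy data
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_ij = y_i y_j (x_i . x_j)

def neg_dual(a):                                 # maximize L_D  <=>  minimize -L_D
    return -a.sum() + 0.5 * a @ Q @ a

constraints = [{'type': 'eq', 'fun': lambda a: a @ y}]   # sum_i a_i y_i = 0
bounds = [(0.0, None)] * len(y)                          # a_i >= 0 (hard margin)

res = minimize(neg_dual, np.zeros(len(y)), bounds=bounds, constraints=constraints)
a = res.x
w = (a * y) @ X                                  # w = sum_i a_i y_i x_i
sv = a > 1e-6                                    # most a_i are 0; the nonzero ones mark the support vectors
b = np.mean(y[sv] - X[sv] @ w)                   # from y_i (w . x_i + b) = 1 on the support vectors
print(np.round(a, 3), w, round(b, 3))
```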
The same dual problem, now with an upper bound C on the multipliers:
max LD(ai) = Σi ai − ½ Σi Σj ai aj yi yj (xi·xj)
s.t. Σi ai yi = 0 and 0 ≤ ai ≤ C
(The bound C comes from the soft-margin version of the problem, where mistakes are allowed at a penalty; in the hard-margin case the constraint is simply ai ≥ 0.)
w = Σi ai yi xi
Remember: most of the weights wi, i.e., the a's, will be zero
Only the support vectors (on the gutters or margin) will have nonzero
weights, or a's; this reduces the dimensionality of the solution
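The same structure shows up in off-the-shelf solvers. In scikit-learn's SVC (a sketch with hypothetical random blobs), dual_coef_ stores exactly the nonzero ai·yi, support_vectors_ stores the corresponding xi, and multiplying them reproduces w:

```python
import numpy as np
from sklearn.svm import SVC

# hypothetical well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel='linear', C=10.0).fit(X, y)

print(clf.support_vectors_.shape[0], "support vectors out of", len(X))  # only a handful
w = clf.dual_coef_ @ clf.support_vectors_        # w = sum_i (a_i y_i) x_i over the support vectors only
print(np.allclose(w, clf.coef_))                 # True: matches the solver's own weight vector
```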
LD(ai) = Σi ai − ½ Σi Σj ai aj yi yj (xi·xj)
s.t. Σi ai yi = 0 and ai ≥ 0
The claim is that this function will be maximized if we give nonzero values to a's that
correspond to the support vectors, i.e., those that matter in fixing the maximum width
margin (street). Well, consider what this looks like. Note first from the constraint
condition that all the a's are positive. Now let's think about a few cases.
Case 1. If two features xi, xj are completely dissimilar, their dot product is 0, and they don't
contribute to L.
Case 2. If two features xi, xj are completely alike, their dot product is 1 (maximal). There are 2 subcases.
Subcase 1: both xi and xj predict the same output value yi (either +1 or −1). Then yi yj
is always 1, and the value of ai aj yi yj xi·xj will be positive. But this would decrease the
value of L (since it would subtract from the first term sum). So, the algorithm downgrades
similar feature vectors that make the same prediction.
Subcase 2: xi and xj make opposite predictions about the output value yi (i.e., one is
+1, the other −1), but are otherwise very closely similar: then the product ai aj yi yj xi·xj is
negative and we are subtracting it, so this adds to the sum, maximizing it. These are precisely
the examples we are looking for: the critical ones that tell the two classes apart.
But are we done???
Transformation to separate
[Figure: points labeled x and o that cannot be separated by a line in the original space become linearly separable after each point is mapped through Φ: x → Φ(x), o → Φ(o).]
Non-Linear SVMs
The idea is to gain linear separation by mapping the data to
a higher dimensional space
The following set can't be separated by a linear function,
but can be separated by a quadratic one
(x − a)(x − b) = x² − (a + b)x + ab
[Figure: points on a line; the points lying between a and b form one class, the rest the other.]
So if we map x → {x², x}
we gain linear separation
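To make this concrete, here is a minimal sketch (assuming a = −1, b = 1 and a few hand-picked 1-D points, all hypothetical) showing that the classes cannot be split by any threshold on x alone, but become separable by a line after the map x → (x, x²):

```python
import numpy as np

# hypothetical 1-D data: class +1 lies between a = -1 and b = 1, class -1 outside
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([-1, -1, +1, +1, +1, -1, -1])

# no single threshold on x separates the classes (+1 is sandwiched between -1's),
# but (x - a)(x - b) < 0  <=>  x^2 - (a + b)x + ab < 0 is linear in the pair (x, x^2)
phi = np.column_stack([x, x ** 2])

a, b = -1.0, 1.0
w = np.array([-(a + b), 1.0])      # coefficients of x and x^2
bias = a * b
scores = phi @ w + bias            # equals (x - a)(x - b)
print(np.sign(-scores) == y)       # all True: a linear rule in (x, x^2) recovers the labels
```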
What if the decision function is not linear? What transform Φ would separate these?
[Figure: classes labeled =+1 and =−1 that are not linearly separable; a radial transform Φ separates them.]
Remember the function we want to optimize: LD = Σi ai − ½ Σi Σj ai aj yi yj (xi·xj), where (xi·xj) is the
dot product of the two feature vectors. If we now transform to Φ, instead of computing this
dot product (xi·xj) we will have to compute (Φ(xi)·Φ(xj)). But how can we do this? This is
expensive and time consuming (suppose Φ is a quartic polynomial, or worse, we don't know the
function explicitly). Well, here is the neat thing:
If there is a "kernel function" K such that K(xi, xj) = Φ(xi)·Φ(xj), then we do not need to know
or compute Φ at all!! That is, the kernel function defines inner products in the transformed space.
Or, it defines similarity in the transformed space.
Non-linear SVMs
So, the function we end up optimizing is:
LD = Σi ai − ½ Σi Σj ai aj yi yj K(xi, xj)
Kernel example: The polynomial kernel
K(xi,xj) = (xixj + 1)p, where p is a tunable parameter
Note: Evaluating K only requires one addition and one exponentiation
more than the original dot product
x"y
2! 2
24
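As a quick sanity check, here is a minimal sketch (hypothetical 2-D vectors, p = 2) showing that the polynomial kernel computes the same number as an explicit dot product in the higher-dimensional feature space Φ(x) = (1, √2·x1, √2·x2, x1², √2·x1x2, x2²):

```python
import numpy as np

def poly_kernel(x, z, p=2):
    # K(x, z) = (x . z + 1)^p: one addition and one exponentiation beyond the dot product
    return (x @ z + 1) ** p

def phi(x):
    # explicit feature map whose inner product reproduces (x . z + 1)^2 for 2-D inputs
    x1, x2 = x
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])   # hypothetical example vectors
z = np.array([3.0, -1.0])
print(poly_kernel(x, z))   # 4.0 : (1*3 + 2*(-1) + 1)^2 = 2^2
print(phi(x) @ phi(z))     # 4.0 : same value, without the SVM ever forming phi
```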
Examples of kernel functions:
Polynomial learning machine: K(x, xi) = (xᵀxi + 1)^p; the power p is specified a priori by the user
Radial-basis function (RBF): K(x, xi) = exp(−‖x − xi‖² / (2σ²)); the width σ² is specified a priori
Sigmoid (two-layer perceptron): K(x, xi) = tanh(β0 xᵀxi + β1); it's the sigmoid
transform (for neural nets), so SVMs subsume neural nets! (but w/o their problems)
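For reference, a minimal sketch of how these three kernels map onto scikit-learn's SVC (the parameter names gamma, coef0, and degree are sklearn's; sklearn writes the RBF kernel as exp(−gamma·‖x − xi‖²), so gamma plays the role of 1/(2σ²); the hyperparameter values below are hypothetical):

```python
from sklearn.svm import SVC

p, sigma, beta0, beta1 = 3, 1.0, 0.01, -1.0   # hypothetical hyperparameter choices

poly_svm = SVC(kernel='poly', degree=p, gamma=1.0, coef0=1.0)    # (x.xi + 1)^p
rbf_svm = SVC(kernel='rbf', gamma=1.0 / (2 * sigma ** 2))        # exp(-||x - xi||^2 / (2 sigma^2))
sigmoid_svm = SVC(kernel='sigmoid', gamma=beta0, coef0=beta1)    # tanh(beta0 x.xi + beta1)
```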
[Figure: Gaussian (radial-basis) kernel example.]
Overfitting by SVM
Every point is a support vector: too much freedom to bend to fit the
training data, so no generalization.
In fact, SVMs have an automatic way to avoid such issues, but we
won't cover it here; see the book by Vapnik, 1995. (We add a
penalty function for mistakes made after training by over-fitting: recall
that if one over-fits, then one will tend to make errors on new data.
This penalty function can be put into the quadratic programming problem
directly. You don't need to know this for this course.)
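A minimal sketch of the first bullet, using scikit-learn's SVC on hypothetical noisy data: an overly flexible kernel (a very large RBF gamma, i.e., a very narrow Gaussian) typically makes almost every training point a support vector and memorizes the training set, while a saner kernel width together with the soft-margin penalty C keeps the solution sparse and better behaved on held-out data. The exact numbers will vary with the random data.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# hypothetical overlapping 2-class data (the noise guarantees no perfect separation)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.5, (200, 2)), rng.normal(1, 1.5, (200, 2))])
y = np.array([-1] * 200 + [1] * 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for gamma in (0.1, 100.0):          # kernel width: reasonable vs. absurdly narrow
    clf = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X_tr, y_tr)
    print(f"gamma={gamma}: {len(clf.support_)} of {len(X_tr)} points are support vectors, "
          f"train acc={clf.score(X_tr, y_tr):.2f}, test acc={clf.score(X_te, y_te):.2f}")
```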