Introduction to Support Vector Machines
Edited from Andrew Moore (CMU) and Martin Law (MSU)
History of SVM
◼ SVM is related to statistical learning theory [1]
◼ SVM was first introduced in 1992 [2]
[1] V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer, 1999.
[2] B.E. Boser et al. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144-152, Pittsburgh, 1992.
Linear Classifiers
Estimation: f(x, w, b) = sign(w · x + b)
w: weight vector
x: data vector
(Figure: an input x is fed to the classifier f, producing the estimated label y_est; one marker denotes +1, the other denotes -1, and a candidate linear boundary separates them.)
Linear Classifiers
f(x, w, b) = sign(w · x + b)
(Figure: several different linear boundaries, each separating the two classes.)
Any of these would be fine...
...but which is best?
Linear Classifiers
f(x, w, b) = sign(w · x + b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
f(x, w, b) = sign(w · x + b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
This is the simplest kind of SVM (called an LSVM).
Linear SVM
Maximum Margin
f(x, w, b) = sign(w · x + b)
Support Vectors are those datapoints that the margin pushes up against.
Linear SVM
Why Maximum Margin?
f(x, w, b) = sign(w · x + b)
Intuitively, the maximum margin boundary is the safest choice: if we have made a small error in the location of the boundary, it gives us the least chance of causing a misclassification.
Support Vectors are those datapoints that the margin pushes up against.
How to calculate the distance from a point to a line?
(Figure: a point x and the line w·x + b = 0, with its normal vector w.)
x – data vector
w – normal vector of the boundary
b – scale value
◼ http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
◼ In our two-dimensional case the line is w1*x1 + w2*x2 + b = 0.
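A hedged reconstruction of the distance expression this slide builds toward, following the MathWorld reference above:

\[
d\big(x_0,\ \{x : w \cdot x + b = 0\}\big) = \frac{|w \cdot x_0 + b|}{\lVert w \rVert}
\]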
Estimate the Margin
(Figure: a data point x at some distance from the boundary w·x + b = 0, whose normal vector is w.)
x – data vector
w – normal vector of the boundary
b – scale value
◼ The margin is determined by the distance from the closest data points to the boundary w·x + b = 0.
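A hedged sketch of the margin expression, assuming the usual canonical scaling in which the closest points satisfy |w · x_i + b| = 1:

\[
m = \frac{2}{\lVert w \rVert}
\]

Maximizing the margin m is therefore equivalent to minimizing ||w||.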
Large-margin Decision Boundary
◼ The decision boundary should be as far away from the data of both classes as possible.
◼ We should maximize the margin, m.
(Figure: Class 1 and Class 2 separated by a decision boundary with margin m.)
Finding the Decision Boundary
◼ Let {x1, ..., xn} be our data set and let yi ∈ {1, -1} be the class label of xi.
◼ The decision boundary should classify all points correctly.
◼ The decision boundary can be found by solving the constrained optimization problem sketched below.
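A hedged reconstruction of the standard hard-margin problem this slide refers to:

\[
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n
\]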
Recap of Constrained Optimization
◼ Suppose we want to minimize f(x) subject to g(x) = 0.
◼ A necessary condition for x0 to be a solution is the Lagrangian condition sketched below.
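A hedged sketch of that necessary condition, writing α for the Lagrange multiplier as on the later slides:

\[
\frac{\partial}{\partial x}\Big( f(x) + \alpha\, g(x) \Big)\Big|_{x = x_0} = 0,
\qquad g(x_0) = 0
\]

For an equality constraint, α may take either sign.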
Recap of Constrained Optimization
◼ The case of an inequality constraint gi(x) ≤ 0 is similar, except that the Lagrange multiplier αi should be non-negative.
◼ If x0 is a solution to the constrained optimization problem, the condition sketched below must hold.
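A hedged sketch of the standard (KKT) stationarity condition being referred to:

\[
\min_{x} f(x)\ \ \text{s.t.}\ \ g_i(x) \le 0
\quad\Longrightarrow\quad
\exists\, \alpha_i \ge 0 \ \text{such that}\ \
\frac{\partial}{\partial x}\Big( f(x) + \sum_i \alpha_i\, g_i(x) \Big)\Big|_{x = x_0} = 0
\]

The function f(x) + Σi αi gi(x) is the Lagrangian of the problem.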
Back to the Original Problem
◼ The Lagrangian of the margin-maximization problem is sketched below.
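A hedged reconstruction of that Lagrangian, with one multiplier αi ≥ 0 per constraint yi(w · xi + b) ≥ 1:

\[
L(w, b, \alpha) = \tfrac{1}{2}\lVert w \rVert^2 + \sum_{i=1}^{n} \alpha_i \big( 1 - y_i\,(w \cdot x_i + b) \big)
\]

Setting the gradient of L with respect to w and b to zero gives w = Σi αi yi xi and Σi αi yi = 0.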
The Dual Problem
◼ If we substitute w = Σi αi yi xi back into the Lagrangian, we obtain a function of the αi only (sketched below).
◼ Note that the condition ∂L/∂b = 0 gives Σi αi yi = 0.
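A hedged reconstruction of the resulting dual objective:

\[
W(\alpha) = \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j)
\]

which is maximized over α subject to αi ≥ 0 and Σi αi yi = 0.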
The Dual Problem
◼ The new objective function is in terms of the αi only.
◼ It is known as the dual problem: if we know w, we know all αi; if we know all αi, we know w.
◼ The original problem is known as the primal problem.
The Dual Problem
◼ w can be recovered from the optimal αi by the expression below.
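A hedged sketch of the recovery step:

\[
w = \sum_{i=1}^{n} \alpha_i\, y_i\, x_i
\]

Only the points with αi > 0 contribute to this sum; b can then be obtained from any such point via yi(w · xi + b) = 1.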
Characteristics of the Solution
◼ Many of the αi are zero (see the next slide for an example).
◼ w is a linear combination of a small number of data points.
◼ This "sparse" representation can be viewed as a form of data compression.
◼ The xi with non-zero αi are called support vectors; restricting the sum to them, we can write w = Σj αj yj xj.
◼ For testing with a new data point z:
◼ Compute wTz + b = Σj αj yj (xj · z) + b and classify z as class 1 if the sum is positive, and class 2 otherwise.
◼ Note: w need not be formed explicitly, as the sketch below illustrates.
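A minimal numpy sketch (an illustration with hypothetical numbers, not taken from the slides) of that last point: the decision value for a new point z is accumulated directly from the support vectors, so w is never formed.

```python
import numpy as np

def svm_decision(z, sv_x, sv_y, sv_alpha, b):
    """w . z + b = sum_j alpha_j * y_j * (x_j . z) + b, computed without forming w."""
    return sum(a * yj * np.dot(xj, z) for a, yj, xj in zip(sv_alpha, sv_y, sv_x)) + b

# Hypothetical support vectors in 2-D, only to illustrate the call
# (note they satisfy sum_j alpha_j * y_j = 0):
sv_x = [np.array([1.0, 1.0]), np.array([2.0, 0.0]), np.array([0.0, 2.0])]
sv_y = [+1, -1, -1]
sv_alpha = [0.8, 0.5, 0.3]
b = 0.5

z = np.array([1.5, 0.5])
label = 1 if svm_decision(z, sv_x, sv_y, sv_alpha, b) > 0 else -1
print(label)  # +1 for this hypothetical setup
```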
A Geometrical Interpretation
(Figure: the Class 1 and Class 2 points with their multipliers. Most points have αi = 0, e.g. α2 = α3 = α4 = α5 = α7 = α9 = α10 = 0; the points lying on the margin have α1 = 0.8, α8 = 0.6, and α6 = 1.4, and are the support vectors.)
Allowing errors in our solutions
◼ We allow "errors" ξi in classification; they are based on the output of the discriminant function wTx + b.
◼ ξi approximates the error of a misclassified sample.
(Figure: Class 1 and Class 2 with a few points falling inside the margin or on the wrong side of the boundary.)
Soft Margin Hyperplane
◼ If we are to minimize Σi ξi, each ξi can be computed from the slack constraints ξi ≥ 1 - yi(wTxi + b), ξi ≥ 0.
◼ We want to minimize ½ wTw + C Σi ξi.
◼ C: tradeoff parameter between the error and the margin (the allowed misclassified samples).
◼ The optimization problem becomes the one sketched below.
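A hedged reconstruction of the standard soft-margin formulation described above:

\[
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\,(w^{\top} x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0
\]

In the corresponding dual, the only change is that the multipliers become box-constrained: 0 ≤ αi ≤ C.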
Extension to Non-linear Decision Boundary
◼ So far, we have only considered a large-margin classifier with a linear decision boundary.
◼ How to generalize it to become nonlinear?
◼ Key idea: apply a transformation φ(·) that maps the data into a higher-dimensional feature space, and find a linear boundary there.
Transforming the Data (c.f. DHS Ch. 5)
(Figure: data points in the input space are mapped by φ(·) into a feature space where the two classes become linearly separable.)
Problems with linear SVM
(Figure: a dataset in R^d, labeled +1 and -1, that is not linearly separable.)
Non-linear SVM 1
(Figure: the same +1/-1 data after the feature map φ, where a linear boundary now separates the classes.)
The Kernel Trick
◼ Recall the SVM optimization problem: the data points appear only through inner products xi · xj, so we can replace every inner product with a kernel K(xi, xj) = φ(xi) · φ(xj) without ever computing φ explicitly (see the sketch below).
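A hedged sketch of the kernelized (soft-margin) dual problem obtained by that substitution:

\[
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j)
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0
\]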
An Example for φ(·) and K(·, ·)
◼ Suppose φ(·) is given as in the sketch below; the corresponding kernel can then be evaluated entirely in the original input space.
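A hedged example (the classic degree-2 mapping, used here as an assumed stand-in for the slide's φ): for x = (x1, x2), take

\[
\varphi(x) = \big(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2\big)
\]

A direct calculation then gives

\[
\varphi(x) \cdot \varphi(y) = (1 + x_1 y_1 + x_2 y_2)^2 = (1 + x \cdot y)^2 = K(x, y),
\]

so the degree-2 polynomial kernel evaluates the inner product in the 6-dimensional feature space without forming φ explicitly.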
More on Kernel Functions
◼ Not all similarity measures can be used as kernel functions, however.
◼ The kernel function needs to satisfy Mercer's condition, i.e., the function must be "positive semi-definite".
◼ This implies that the n by n kernel matrix, in which the (i,j)-th entry is K(xi, xj), is always positive semi-definite.
◼ This also means that the optimization problem is convex and can be solved in polynomial time!
Examples of Kernel Functions
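A hedged reference list of standard kernel functions (the parameter symbols d, σ, κ, and θ are illustrative):

\[
\text{Polynomial of degree } d: \quad K(x, y) = (x \cdot y + 1)^d
\]
\[
\text{Gaussian (RBF) with width } \sigma: \quad K(x, y) = \exp\!\big(-\lVert x - y \rVert^2 / (2\sigma^2)\big)
\]
\[
\text{Sigmoid:} \quad K(x, y) = \tanh(\kappa\, x \cdot y + \theta) \quad \text{(a valid Mercer kernel only for some } \kappa, \theta)
\]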
Non-linear SVMs: Feature spaces
Φ: x → φ(x)
(Figure: the original input space is mapped by Φ to a feature space where the classes are linearly separable.)
Example
◼ Suppose we have 5 one-dimensional data points:
◼ x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2, i.e. y1=1, y2=1, y3=-1, y4=-1, y5=1.
◼ We use the polynomial kernel of degree 2: K(x, y) = (xy + 1)^2.
◼ C is set to 100.
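As an illustration (not part of the original slides), here is a minimal sketch of how this toy problem could be handed to an off-the-shelf solver; scikit-learn's polynomial kernel is (gamma · x·y + coef0)^degree, so gamma=1, coef0=1, degree=2 reproduces K(x, y) = (xy + 1)^2.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])  # the five 1-D points
y = np.array([1, 1, -1, -1, 1])                    # class labels

# C=100 as on the slide; the kernel settings reproduce (xy + 1)^2.
clf = SVC(C=100, kernel="poly", degree=2, gamma=1.0, coef0=1.0)
clf.fit(X, y)

print(clf.support_)    # indices of the support vectors
print(clf.dual_coef_)  # alpha_i * y_i for each support vector
print(clf.intercept_)  # the bias term b
```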
Example
◼ By using a QP solver, we get
◼ α1=0, α2=2.5, α3=0, α4=7.333, α5=4.833
◼ Note that the constraints are indeed satisfied (the non-zero αi are all below C = 100, and Σi αi yi = 2.5 - 7.333 + 4.833 = 0).
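A minimal numpy sketch (an illustration, not from the original slides) that takes these αi, recovers b from a support vector using yk f(xk) = 1, and evaluates the resulting discriminant f(z) = Σi αi yi K(xi, z) + b:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([1, 1, -1, -1, 1])
alpha = np.array([0.0, 2.5, 0.0, 7.333, 4.833])  # values reported by the QP solver

K = lambda u, v: (u * v + 1) ** 2                # polynomial kernel of degree 2

print(np.dot(alpha, y))                          # constraint check: sum_i alpha_i y_i ~ 0

# Recover b from the support vector x2 = 2 (alpha_2 > 0), where y_2 * f(x_2) = 1.
k = 1
b = y[k] - np.sum(alpha * y * K(x, x[k]))        # ~9 with these rounded alphas

# Discriminant; classify z as class 1 if f(z) > 0 and as class 2 otherwise.
f = lambda z: np.sum(alpha * y * K(x, z)) + b
print([round(f(z), 2) for z in x])               # the support vectors land near +/-1
```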
Example
(Figure: the resulting classifier on the 1-D data points 1, 2, 4, 5, 6.)
Suppose we're in 1-dimension
What would SVMs do with this data?
(Figure: 1-D data points of both classes along the x axis, with the origin marked x = 0.)
Suppose we're in 1-dimension
(Figure: the 1-D maximum margin classifier on the same axis (x = 0 marked), with the positive "plane" and the negative "plane" on either side of the boundary.)
Harder 1-dimensional dataset
(Figure: a 1-D dataset, with x = 0 marked, in which the two classes are not linearly separable.)
Harder 1-dimensional dataset
Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too.
(Figure: the same 1-D data with x = 0 marked.)
z_k = (x_k, x_k^2)
(Figure: each point x_k is mapped to 2-D as z_k = (x_k, x_k^2), making the classes linearly separable.)
Example
(Figure: the 1-D example data points at 1, 2, 4, 5, 6.)
Degree of Polynomial Features
(Figure: the effect of the polynomial degree on the feature mapping, with axes labeled x^1 and x^2.)
Choosing the Kernel Function
◼ Probably the trickiest part of using an SVM.
◼ In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a common first choice.
Software
◼ A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
◼ Some implementations (such as LIBSVM) can handle multi-class classification.
◼ SVMLight is among the earliest implementations of SVM.
◼ Several Matlab toolboxes for SVM are also available.
Summary: Steps for Classification
◼ Prepare the pattern matrix.
◼ Select the kernel function to use.
◼ Select the parameters of the kernel function and the value of C.
◼ You can use the values suggested by the SVM software, or you can set apart a validation set to determine the values of the parameters.
◼ Execute the training algorithm and obtain the αi.
◼ Unseen data can be classified using the αi and the support vectors.
Conclusion
◼ SVM is a useful alternative to neural networks.
◼ Two key concepts of SVM: maximize the margin and the kernel trick.
◼ Many SVM implementations are available on the web for you to try on your data set.
Resources
◼ http://www.kernel-machines.org/
◼ http://www.support-vector.net/
◼ http://www.support-vector.net/icml-tutorial.pdf
◼ http://www.kernel-machines.org/papers/tutorial-nips.ps.gz
◼ http://www.clopinet.com/isabelle/Projects/SVM/applist.html
Appendix: Distance from a point to a line
Distance and margin
◼ A point P on the line through P1 = (x1, y1) and P2 = (x2, y2) can be written as x = x1 + u (x2 - x1), y = y1 + u (y2 - y1).
◼ The distance from a point P3 to the line is d = |P3 - P|, where P is the foot of the perpendicular from P3 (see the sketch below).
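A hedged completion of this derivation, following the MathWorld reference cited earlier (with P3 = (x3, y3)):

\[
u = \frac{(x_3 - x_1)(x_2 - x_1) + (y_3 - y_1)(y_2 - y_1)}{\lVert P_2 - P_1 \rVert^2}
\]

\[
d = \lVert P_3 - P \rVert = \frac{\big| (x_2 - x_1)(y_1 - y_3) - (x_1 - x_3)(y_2 - y_1) \big|}{\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}}
\]

Applied to the boundary w · x + b = 0, this reduces to d = |w · x0 + b| / ||w||, the expression used in the margin estimate.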