Support Vector Machines: Some Slides Adapted From

Download as pdf or txt
Download as pdf or txt
You are on page 1of 54

Support Vector

Machines
Some slides adapted from

•Aliferis & Tsamardinos, Vanderbilt University


http://discover1.mc.vanderbilt.edu/discover/public/ml_tutorial_ol
d/index.html

•Rong Jin, Language Technology Institute


www.contrib.andrew.cmu.edu/~jin/ir_proj/svm.ppt
1
ABDBM © Ron Shamir
Support Vector Machines
• Decision surface: a hyperplane in feature
space
• One of the most important tools in the
machine learning toolbox
• In a nutshell:
– map the data to a predetermined very high-
dimensional space via a kernel function
– Find the hyperplane that maximizes the margin
between the two classes
– If data are not separable - find the hyperplane
that maximizes the margin and minimizes the
(weighted average of the) misclassifications

ABDBM © Ron Shamir 2


Support Vector Machines
• Three main ideas:
1. Define what an optimal hyperplane is (taking into account
that it needs to be computed efficiently): maximize
margin
2. Generalize to non-linearly separable problems: have a
penalty term for misclassifications
3. Map data to high dimensional space where it is easier to
classify with linear decision surfaces: reformulate
problem so that data are mapped implicitly to this space

ABDBM © Ron Shamir 3


Support Vector Machines
• Three main ideas:
1. Define what an optimal hyperplane is (taking into account
that it needs to be computed efficiently): maximize
margin
2. Generalize to non-linearly separable problems: have a
penalty term for misclassifications
3. Map data to high dimensional space where it is easier to
classify with linear decision surfaces: reformulate
problem so that data are mapped implicitly to this space

ABDBM © Ron Shamir 4


Which Separating Hyperplane
to Use?
Var1

Var2
ABDBM © Ron Shamir 5
Maximizing the Margin
Var1
IDEA 1: Select the
separating
hyperplane that
maximizes the
margin!

Margin
Width

Margin
Width
Var2
ABDBM © Ron Shamir 6
Support Vectors
Var1

Support Vectors

Margin
Width
Var2
ABDBM © Ron Shamir 7
Setting Up the Optimization
Problem
Var1

The width of the


margin is:
2k
w
 
 w x  b  k So, the problem is:
w
2k
max
  w
w  x  b  k k Var2
k s.t. (w  x  b)  k , x of class 1
  (w  x  b)  k , x of class 2
w x  b  0

ABDBM © Ron Shamir 8


Setting Up the Optimization
Problem
Var1

Scaling w, b so that
k=1, the problem
becomes:

2
max
 w x  b  1 w
w s.t. (w  x  b)  1, x of class 1
(w  x  b)  1, x of class 2
w  x  b  1 1 Var2
1
 
w x  b  0

ABDBM © Ron Shamir 9


Setting Up the Optimization
Problem
• If class 1 corresponds to 1 and class 2
corresponds to -1, we can rewrite
(w  xi  b)  1, xi with yi  1
(w  xi  b)  1, xi with yi  1

• as
yi (w  xi  b)  1, xi

• So the problem becomes:


2 1 2
max min w
w or 2
s.t. yi (w  xi  b)  1, xi s.t. yi (w  xi  b)  1, xi

ABDBM © Ron Shamir 10


Linear, Hard-Margin SVM Formulation
• Find w,b that solve min w
1 2
2
s.t. yi (w  xi  b)  1, xi
• Quadratic program: quadratic objective, linear
(in)equality constraints
• Problem is convex  there is a unique global
minimum value (when feasible)
• There is also a unique minimizer, i.e. w and b
values that provide the minimum
• No solution if the data are not linearly separable
• Objective is PD  polynomial-time soln
• Very efficient soln with modern optimization
software (handles 1000s of constraints and
training instances).
ABDBM © Ron Shamir 11
Lagrange multipliers

1
Minimize (w )  || w || 2   i yi (w  x i  b)    i
w,b , 2
s.t. 1  0,...,  l  0

• Convex quadratic programming problem


• Duality theory applies!

ABDBM © Ron Shamir 12


Dual Space

• Dual Problem
1
Maximize F (Λ)  Λ  1  Λ  DΛ
2
Λy  0
subject to
Λ0
where Λ  ( 1 , 2 ,...,l ), y  ( y1 , y 2 ,..., y n ), Dij  yi y j x i  x j
• Representation for w
l
w   i yi x i
• Decision function i 1

l
f ( x )  sign (  yi i ( x  x i )  b)
i 1

ABDBM © Ron Shamir 13


Comments
l
• Representation of vector w w   i yi x i
i 1
– Linear combination of examples xi
• # parameters = # examples
– i: the importance of each examples
• Only the points closest to the bound have i0
• Core of the algorithm: xx’
– Both matrix D and decision function require the knowledge
of xx’ Dij  yi y j xi  x j
(More on this soon)

ABDBM © Ron Shamir 14


Support Vector Machines
• Three main ideas:
1. Define what an optimal hyperplane is (taking into account
that it needs to be computed efficiently): maximize
margin
2. Generalize to non-linearly separable problems: have a
penalty term for misclassifications
3. Map data to high dimensional space where it is easier to
classify with linear decision surfaces: reformulate
problem so that data are mapped implicitly to this space

ABDBM © Ron Shamir 15


Non-Linearly Separable Data

Var1 Introduce slack


i variables i

Allow some
j instances to fall
within the margin,
but penalize them
 w x  b  1
w

w  x  b  1 1 Var2
1
 
w x  b  0

ABDBM © Ron Shamir 16


Formulating the Optimization
Problem
Constraint becomes :

Var1 yi (w  xi  b)  1  i , xi
i i  0

Objective function
j penalizes for
misclassified instances
and those within the
 w x  b  1 margin
w
1
min w  C  i
2
w  x  b  1 1 Var2
2
1 i
 
w x  b  0
C trades-off margin width
& misclassifications
ABDBM © Ron Shamir 17
Linear, Soft-Margin SVMs
1
min w  C  i yi (w  xi  b)  1  i , xi
2

2 i i  0
• Algorithm tries to keep i at zero while maximizing
margin
• Alg does not minimize the no. of misclassifications
(NP-complete problem) but the sum of distances
from the margin hyperplanes
• Other formulations use i2 instead
• C: penalty for misclassification
• As C, we get closer to the hard-margin solution

ABDBM © Ron Shamir 18


Dual Space
• Dual Problem
1
Maximize F (Λ)  Λ 1  Λ  DΛ
2
Λy  0
subject to Λ  0
Λ  C1
where Λ  ( 1 , 2 ,...,l ), y  ( y1 , y 2 ,..., y n ), Dij  yi y j x i  x j

• Only difference: upper bound C on i


• Representation for w
l
w   i yi x i
• Decision function i 1

l
f ( x )  sign (  yi i ( x  x i )  b)
i 1

ABDBM © Ron Shamir 19


Comments
• Param C
– Controls the range of i  avoids over emphasizing
some examples
– i (C - i) = 0 (“complementary slackness”)
– C can be extended to be case-dependent
• Weight i
– i < C  i = 0  i-th example is correctly classified 
not quite important
– i = C  i can be nonzero  i-th training example may
be misclassified  very important

ABDBM © Ron Shamir 20


Robustness of Soft vs Hard
Margin SVMs

Var1 Var1

i

i

  Var2
Var2 w x  b  0
 
w x  b  0

Soft Margin SVM Hard Margin SVM

ABDBM © Ron Shamir 21


Soft vs Hard Margin SVM
• Soft-Margin always has a solution
• Soft-Margin is more robust to outliers
– Smoother surfaces (in the non-linear case)
• Hard-Margin does not require to guess the cost
parameter (requires no parameters at all)

ABDBM © Ron Shamir 22


Support Vector Machines
• Three main ideas:
1. Define what an optimal hyperplane is (taking into account
that it needs to be computed efficiently): maximize
margin
2. Generalize to non-linearly separable problems: have a
penalty term for misclassifications
3. Map data to high dimensional space where it is easier to
classify with linear decision surfaces: reformulate
problem so that data are mapped implicitly to this space

ABDBM © Ron Shamir 23


Disadvantages of Linear
Decision Surfaces
Var1

Var2
ABDBM © Ron Shamir 24
Advantages of Non-Linear
Surfaces
Var1

Var2
ABDBM © Ron Shamir 25
Linear Classifiers in High-
Dimensional Spaces
Constructed
Var1
Feature 2

Var2
Constructed
Feature 1
Find function (x) to map to
a different space

ABDBM © Ron Shamir 26


Mapping Data to a High-
Dimensional Space
• Find function (x) to map to a different
space, then SVM formulation becomes:
1
w  C  i s.t. yi ( w   ( xi )  b)  1  i , xi
2
min
2 i i  0

• Data appear as (x), weights w are now


weights in the new space
• Explicit mapping expensive if (x) is very
high dimensional
• Can we solve the problem without explicitly
mapping the data ?
ABDBM © Ron Shamir 27
The Dual of the SVM Formulation

• Original SVM formulation 1 2


– n inequality constraints min w  C i
– n positivity constraints
w ,b 2 i

– n number of  variables
s.t. yi ( w   ( x )  b)  1   i , xi
• The (Wolfe) dual of this i  0
problem
– one equality constraint 1
– n positivity constraints min   i j yi y j ( ( xi )   ( x j ))    i
ai 2
– n number of  variables i, j i

(Lagrange multipliers)
– Objective function more s.t. C   i  0, xi
complicated
 y
i
i i 0

• But: Data only appear as


(xi)  (xj)

ABDBM © Ron Shamir 28


The Kernel Trick
• (xi)t  (xj) means: map data into new space, then
take the inner product of the new vectors
• Suppose we can find a function such that: K(xi , xj) =
(xi)t  (xj) , i.e., K is the inner product of the images
of the data
•  For training, no need to explicitly map the data into
the high-dimensional space to solve the optimization
problem
• How do we classify without explicitly mapping the new
instances? Turns out

sgn( wx  b)  sgn(   i yi K ( xi , x )  b)
i

where b solves  j ( y j   i yi K ( xi , x j )  b  1)  0,
i

for any j with  j  0


ABDBM © Ron Shamir 29
Examples of Kernels
• Assume we measure x1,x2 and we use the
mapping:
 : x1 , x2  {x12 , x22 , 2 x1 x2 , 2 x1 , 2 x2 ,1}

• Consider the function:


K ( x, z )  ( x  z  1) 2
• Then:
• 𝜙 𝑥 t 𝜙 𝑧 = 𝑥12 𝑧12 + 𝑥22 𝑧22 + 2𝑥1 𝑥2 𝑧1 𝑧2 +
2𝑥1 𝑧1 + 2𝑥2 𝑧2 + 1 = 𝑥1 𝑧1 + 𝑥2 𝑧2 + 1 2 =
𝑥 ⋅ 𝑧 + 1 2 = 𝐾(𝑥, 𝑧)

ABDBM © Ron Shamir 30


Polynomial and Gaussian Kernels
K ( x, z )  ( x  z  1) p

is called the polynomial kernel of degree p.


• For p=2, with 7,000 genes using the kernel once:
inner product with 7,000 terms, squaring
• Mapping explicitly to the high-dimensional space:
calculating ~50,000,000 new features for both
training instances, then taking the inner product of
that (another 50,000,000 terms to sum)
• In general, using the Kernel trick provides huge
computational savings over explicit mapping!
• Another common option: Gaussian kernel (maps to l
dimensional space with l=no of training points):
K ( x, z )  exp( x  z / 2 2 )

ABDBM © Ron Shamir 31


The Mercer Condition
• Is there a mapping (x) for any symmetric function
K(x,z)? No
• The SVM dual formulation requires calculation K(xi , xj)
for each pair of training instances. The matrix Gij =
K(xi,xj) is called the Gram matrix
• Theorem (Mercer 1908): There is a feature space (x)
iff the Kernel is such that G is positive-semi definite
• Recall: M PSD iff z≠0 zTMz>0 iff M has non-negative eigenvalues

ABDBM © Ron Shamir 32


Support Vector Machines
• Three main ideas:
1. Define what an optimal hyperplane is (taking into account
that it needs to be computed efficiently): maximize
margin
2. Generalize to non-linearly separable problems: have a
penalty term for misclassifications
3. Map data to high dimensional space where it is easier to
classify with linear decision surfaces: reformulate
problem so that data are mapped implicitly to this space

ABDBM © Ron Shamir 33


Complexity
(for one implementation, Burges 98)
Notation: l training pts of dimension d, N support
vectors (Nl)
• When most SVs are not at the upper bound:
– O(N3+N2l+Ndl) if N<<l
– O(N3+nl+Ndl) if N~l
• When most SVs are at the upper bound:
– O(N2 + Ndl) if N<<l
– O(dl2) if N~l

ABDBM © Ron Shamir 34


Other Types of Kernel Methods
• SVMs that perform regression
• SVMs that perform clustering
• -Support Vector Machines: maximize
margin while bounding the number of margin
errors
• Leave One Out Machines: minimize the
bound of the leave-one-out error
• SVM formulations that allow different cost
of misclassification for different classes
• Kernels suitable for sequences of strings,
or other specialized kernels
ABDBM © Ron Shamir 35
Feature Selection with SVMs
• Recursive Feature Elimination
– Train a linear SVM
– Remove the x% of variables with the lowest
weights (those variables affect classification
the least)
– Retrain the SVM with remaining variables and
repeat until classification quality is reduced
• Very successful
• Other formulations exist where minimizing
the number of variables is folded into the
optimization problem
• Similar algs for non-linear SVMs
• Quite successful
ABDBM © Ron Shamir 36
Why do SVMs Generalize?
• Even though they map to a very high-dimensional
space
– They have a very strong bias in that space
– The solution has to be a linear combination of the training
instances
• Large theory on Structural Risk Minimization
providing bounds on the error of an SVM
– Typically the error bounds too loose to be of practical use

ABDBM © Ron Shamir 37


Conclusions
• SVMs formulate learning as a mathematical
program taking advantage of the rich theory in
optimization
• SVM uses kernels to map indirectly to extremely
high dimensional spaces
• SVMs are extremely successful, robust, efficient,
and versatile, and have a good theoretical basis

ABDBM © Ron Shamir 39


Vladimir Vapnik
• Vladimir Naumovich Vapnik is one of the main developers of
Vapnik–Chervonenkis theory. He was born in the Soviet
Union. He received his master's degree in mathematics at
the Uzbek State University, Samarkand, Uzbek SSR in
1958 and Ph.D in statistics at the Institute of Control
Sciences, Moscow in 1964. He worked at this institute from
1961 to 1990 and became Head of the Computer Science
Research Department.
• At the end of 1990, he moved to the USA and joined the
Adaptive Systems Research Department at AT&T Bell Labs
in Holmdel, New Jersey. The group later became the Image
Processing Research Department of AT&T Laboratories
when AT&T spun off Lucent Technologies in 1996. Vapnik
Left AT&T in 2002 and joined NEC Laboratories in
Princeton, New Jersey, where he currently works in the
Machine Learning group. He also holds a Professor of
Computer Science and Statistics position at Royal Holloway,
University of London since 1995, as well as an Adjunct
Professor position at Columbia University, New York City
since 2003. He was inducted into the U.S. National Academy
of Engineering in 2006. He received the 2008 Paris
Kanellakis Award.
• While at AT&T, Vapnik and his colleagues developed the
theory of the support vector machine. They demonstrated
its performance on a number of problems of interest to the
machine learning community, including handwriting
recognition.
http://en.wikipedia.org/wiki/Vladimir_Vapnik
ABDBM © Ron Shamir 40
Suggested Further Reading
• http://www.kernel-machines.org/tutorial.html
• http://www.svms.org/tutorials/ - many tutorials
• C. J. C. Burges. "A Tutorial on Support Vector Machines for
Pattern Recognition." Knowledge Discovery and Data Mining,
2(2), 1998.
• E. Osuna, R. Freund, and F. Girosi. "Support vector machines:
Training and applications." Technical Report AIM-1602, MIT A.I.
Lab., 1996.
• P.H. Chen, C.-J. Lin, and B. Schölkopf. A tutorial on nu -support
vector machines. 2003.
• N. Cristianini. ICML'01 tutorial, 2001.
• K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An
introduction to kernel-based learning algorithms. IEEE Neural
Networks, 12(2):181-201, May 2001. (PDF)
• B. Schölkopf. SVM and kernel methods, 2001. Tutorial given at the
NIPS Conference.
• Hastie, Tibshirani, Friedman, The Elements of Statistical Learning,
Springel 2001

ABDBM © Ron Shamir 41


Analysis of microarray GE data
using SVM

Brown, Grundy, Lin, Cristianini,


Sugnet, Furey, Ares Jr., Haussler

PNAS 97(1) 262-7 (2000)

ABDBM © Ron Shamir 42


Data
• Expression patterns of n=2467 annotated
yeast genes over m=79 different conditions
• Six gene functional classes: 5 related to
transcript levels, tricarboxylic acid (TCA) cycle, respiration,
cytoplasmic ribosomes, proteasome, histones, and 1 unrelated
(control) helix-turn-helix proteins.
• For gene x, condition i:
– Ei level of x in tested condition
– Ri level of x in reference condition
• Normalized pattern (X1,…,Xm) of gene x:
Xi= log(Ei/Ri)/(klog2(Ek/Rk))0.5

ABDBM © Ron Shamir 43


Goal
• Classify genes based on gene expression
• Tried SVM and other classifiers

1/||w||

ABDBM © Ron Shamir www.contrib.andrew.cmu.edu/~jin/ir_proj/44


Kernel functions used
• Simplest : K(X,Y)=X•Y+1 (dot product; linear
kernel)
• Kernel of degree d: K(X,Y)=(X•Y+1)d
• Radial basis (Gaussian) kernel:
exp(-||X-Y||2/22)
• n+ / n- : no. of positive / negative examples
• Problem: n+ << n-
• Overcoming imbalance: modify K’s diagonal:
Kij=K(Xi,Xj)+c/n+ for positive ex,
Kij=K(Xi,Xj)+c/n- for negative ex

ABDBM © Ron Shamir 48


Measuring performance
True
+ -
Classifier
+ TP FP
- FN TN
• The imbalance problem: very few positives
• Performance of method M: C(M) =FP+2FN
• C(N) = cost of classifying all as negatives
• S(M) =C(N)-C(M) (how much we save by the
classifier).
• 3-way cross validation: 2/3 learn, 1/3 test
ABDBM © Ron Shamir 49
Results – TCA class
Method FP FN TP TN S(M)
D-p-1-SVM 18 5 12 2,432 6
D-p-2-SVM 7 9 8 2,443 9
D-p-3-SVM 4 9 8 2,446 12
Radial-SVM 5 9 8 2,445 11
Parzen 4 12 5 2,446 6
FLD 9 10 7 2,441 5
C4.5 7 17 0 2,443 -7
MOC1 3 16 1 2,446 -1

D-p-i-SVM: dot product kernel, degree i


Other methods used: Parzen windows, Fisher linear discriminant,
ABDBM © Ron Shamir C4.5+MOC1: decision trees 50
Results: Ribo Class
Method FP FN TP TN S(M)
D-p-1-SVM 14 2 119 2,332 224
D-p-2-SVM 9 2 119 2,337 229
D-p-3-SVM 7 3 118 2,339 229
Radial-SVM 6 5 116 2,340 226
Parzen 6 8 113 2,340 220
FLD 15 5 116 2,331 217
C4.5 31 21 100 2,315 169
MOC1 26 26 95 2,320 164

ABDBM © Ron Shamir 51


Results: Summary
• SVM outperformed the other methods
• Either high-dim dot-product or Gaussian
kernels worked best
• Insensitive to specific cost weighting
• Consistently misclassified genes require
special attention
• Does not always reflect protein levels and
post-translational modifications
• Can use classifiers for functional
annotation

ABDBM © Ron Shamir 52


David Haussler

ABDBM © Ron Shamir 53


Gene Selection via the
BAHSIC Family of Algorithms
Le Song, Justin Bedo,
Karsten M. Borgwardt, Arthur Gretton, Alex Smola
ISMB 07

ABDBM © Ron Shamir 54


Testing
• 15 two-class datasets (mostly cancer), 2K-25K
genes, 50-300 samples
• 10-fold cross validation
• Selected the 10 top features according to each
method
– pc=Pearson’s correlation, snr=signal-to-noise ratio,
pam=shrunken centroid, t=t-statistics, m-t = moderated t-
statistics, lods=B-statistics, lin=centroid, RBF= SVM w
Gaussian kernel, rfe=SVM recursive feature elimination,
l1=l1 norm SVM, mi=mutual information)
• Selection method: RFE: Train, remove 10% of
features that are least relevant, repeat.

ABDBM © Ron Shamir 55


Classification Overlap btw the 10 genes Linear kernel has best overall
error % selected in each fold performance

L2
dist
from
best

56

# times alg was best


ABDBM © Ron Shamir
Multiclass datasets
• In a similar comparison on 13 multiclass datasets,
linear kernel was again best.

ABDBM © Ron Shamir 58


Rules of thumb
• Always apply the linear kernel for general
purpose gene selection
• Apply a Gaussian Kernel if nonlinear effects
are present, such as multimodality or
complementary effects of different genes
• Not a big surprise, given the high dimension
of microarray datasets, but point driven
home by broad experimentation.

ABDBM © Ron Shamir 59

You might also like