Lecture 10
Support Vector Machines
Geoffrey Hinton
2008
Is preprocessing cheating?
It's cheating if we use a carefully designed set of task-specific, hand-coded features and then claim that the
learning algorithm solved the whole problem.
The really hard bit is done by designing the features.
It's not cheating if we learn the non-linear preprocessing.
This makes learning much more difficult and much more
interesting (e.g. backpropagation after pre-training)
It's not cheating if we use a very big set of non-linear
features that is task-independent.
Support Vector Machines do this.
They have a clever way to prevent overfitting (first half of
lecture)
They have a very clever way to use a huge number of
features without requiring nearly as much computation as
seems to be necessary (second half of lecture).
An example of VC dimension
Suppose our model class is a hyperplane.
In 2-D, we can find a plane (i.e. a line) to deal with any labeling of three points: a 2-D hyperplane shatters 3 points.
The number of parameters is not the whole story: a model like f(x) = a sin(bx) has only two parameters but can shatter arbitrarily many points, so its VC dimension is unbounded.
A bound on the generalization gap in terms of the VC dimension h: with probability 1 − p,
E_test ≤ E_train + sqrt( (h + h log(2N/h) − log(p/4)) / N ),
where N is the number of training cases.
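To get a feel for the bound, here is a minimal sketch (in Python; my own illustration, with made-up values of N, h and p, not numbers from the lecture) that plugs assumed values into the reconstructed formula.

```python
import math

def vc_bound_gap(N, h, p):
    """Upper bound on (test error - training error) from the VC bound:
    sqrt((h + h*log(2N/h) - log(p/4)) / N), holding with probability 1 - p."""
    return math.sqrt((h + h * math.log(2 * N / h) - math.log(p / 4)) / N)

# Illustrative values (assumptions, not from the lecture):
# 10,000 training cases, VC dimension 100, bound allowed to fail with probability 0.05.
print(vc_bound_gap(N=10_000, h=100, p=0.05))   # roughly 0.25
```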
A Bayesian Interpretation
Using the maximum margin separator often
gives a pretty good approximation to using all
separators weighted by their posterior
probabilities.
The constraints when we allow slack variables ξ^c on the training cases:
w·x^c + b ≥ +1 − ξ^c   for positive cases,
w·x^c + b ≤ −1 + ξ^c   for negative cases,
with ξ^c ≥ 0 for all c,
and ||w||²/2 + λ Σ_c ξ^c as small as possible.
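Eliminating the slacks gives the equivalent unconstrained objective ||w||²/2 + λ Σ_c max(0, 1 − y^c(w·x^c + b)), because the optimal ξ^c is exactly that hinge value. Below is a minimal sketch (my own illustration in Python/NumPy, not code from the lecture; the data and step sizes are made up) that minimizes this objective by subgradient descent.

```python
import numpy as np

def train_soft_margin_svm(X, y, lam=1.0, lr=0.01, epochs=200):
    """Subgradient descent on  ||w||^2 / 2 + lam * sum_c max(0, 1 - y^c (w.x^c + b)),
    which is the slack-variable objective with the xi^c eliminated."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)          # y^c (w.x^c + b) for every case c
        violated = margins < 1             # cases whose slack xi^c would be > 0
        # Subgradient of the objective with respect to w and b.
        grad_w = w - lam * (y[violated][:, None] * X[violated]).sum(axis=0)
        grad_b = -lam * y[violated].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny made-up 2-D dataset; labels must be +1 / -1.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_soft_margin_svm(X, y)
print(np.sign(X @ w + b))                  # reproduces y on this easy data
```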
[Figure: a planar separator in a 20-D feature space, projected back to the original 2-D space.]
The kernel trick makes your brain hurt when you first
learn about it, but it's actually very simple.
[Figure: points x^a and x^b in the low-D input space map to φ(x^a) and φ(x^b) in the high-D feature space. Letting the kernel do the work: K(x^a, x^b) = φ(x^a) · φ(x^b), i.e. the kernel computes the high-D scalar product without ever computing φ.]
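To make the trick concrete, here is a small sketch (my own illustration, not from the lecture) that checks numerically that the degree-2 polynomial kernel (x·y + 1)² equals the ordinary scalar product in an explicit 6-dimensional feature space for 2-D inputs.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (x.y + 1)^2 on 2-D inputs:
    the expansion 1 + 2*x1*y1 + 2*x2*y2 + x1^2*y1^2 + 2*x1*x2*y1*y2 + x2^2*y2^2
    is a scalar product between two vectors of this form."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     x1 ** 2,
                     np.sqrt(2) * x1 * x2,
                     x2 ** 2])

def K(x, y):
    """The kernel works directly in the low-D space."""
    return (np.dot(x, y) + 1) ** 2

a = np.array([0.3, -1.2])
b = np.array([2.0, 1.0])
print(K(a, b), np.dot(phi(a), phi(b)))   # the two numbers agree
```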
To classify a test case x^test, sum the kernel over the set SV of support vectors:
bias + Σ_{s ∈ SV} w^s K(x^test, x^s) > 0,
where w^s is the learned weight on support vector x^s.
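Here is a hedged sketch (assuming scikit-learn and the Gaussian RBF kernel; not code from the lecture) showing that a fitted SVM's decision value really is bias + Σ_{s∈SV} w^s K(x^test, x^s), with the w^s stored as signed dual coefficients.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 2-D data, labels +1 / -1.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + 2, rng.randn(20, 2) - 2])
y = np.array([1] * 20 + [-1] * 20)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

x_test = np.array([0.5, 1.0])

# Recompute the decision value by hand from the stored support vectors.
k = np.exp(-gamma * np.sum((clf.support_vectors_ - x_test) ** 2, axis=1))
by_hand = clf.dual_coef_[0] @ k + clf.intercept_[0]

print(by_hand, clf.decision_function(x_test.reshape(1, -1))[0])   # the two values agree
```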
Polynomial:  K(x, y) = (x·y + 1)^p
Gaussian radial basis function:  K(x, y) = exp(−||x − y||² / 2σ²)
Neural net:  K(x, y) = tanh(k x·y − δ)
(p, σ, k and δ are parameters that the user must choose.)
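For reference, a brief sketch (assuming scikit-learn; the parameter values are placeholders of mine) of how these kernels and their user-chosen parameters map onto SVC's arguments.

```python
from sklearn.svm import SVC

p, sigma = 3, 1.0          # placeholder values for the user-chosen parameters

# Polynomial (x.y + 1)^p: scikit-learn computes (gamma * x.y + coef0)^degree.
poly_svm = SVC(kernel="poly", degree=p, gamma=1.0, coef0=1.0)

# Gaussian RBF exp(-||x - y||^2 / 2 sigma^2): gamma plays the role of 1 / (2 sigma^2).
rbf_svm = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2))

# Neural-net / sigmoid kernel tanh(k x.y - delta): gamma is k and coef0 is -delta.
sigmoid_svm = SVC(kernel="sigmoid", gamma=0.5, coef0=-1.0)
```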
Performance
Support Vector Machines work very well in practice.
The user must choose the kernel function and its
parameters, but the rest is automatic.
The test performance is very good.
They can be expensive in time and space for big datasets:
The computation of the maximum-margin hyperplane scales with the square of the number of training cases.
We need to store all the support vectors.
SVMs are very good if you have no idea about what
structure to impose on the task.
The kernel trick can also be used to do PCA in a much
higher-dimensional space, thus giving a non-linear version
of PCA in the original space.
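On that last point, a minimal sketch (assuming scikit-learn's KernelPCA; my own illustration, not from the lecture) of non-linear PCA via the kernel trick.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Made-up data on a noisy circle; kernel PCA finds non-linear components
# by doing ordinary PCA in the implicit high-D RBF feature space.
rng = np.random.RandomState(0)
angles = rng.uniform(0, 2 * np.pi, size=200)
X = np.c_[np.cos(angles), np.sin(angles)] + 0.05 * rng.randn(200, 2)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=2.0)
Z = kpca.fit_transform(X)      # non-linear principal components of the original 2-D data
print(Z.shape)                 # (200, 2)
```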
A hybrid approach
If we use a neural net to define the features, maybe we
can use convex optimization for the final layer of weights
and then backpropagate derivatives to learn the kernel.
The convex optimization is quadratic in the number of
training cases. So this approach works best when most of
the data is unlabelled.
Unsupervised pre-training can then use the unlabelled
data to learn features that are appropriate for the
domain.
The final convex optimization can use these features
as well as possible and also provide derivatives that
allow them to be fine-tuned.
This seems better than just trying lots of kernels and
selecting the best ones (which is the current method).
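A rough sketch of this architecture (entirely my own illustration: a fixed random non-linear layer stands in for pre-trained features, and the fine-tuning step is omitted): non-linear features feed a convex final stage.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)

# Made-up labelled data.
X = rng.randn(200, 10)
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Stand-in for features learned by unsupervised pre-training:
# here just a fixed random non-linear layer.
W1 = rng.randn(10, 50)
b1 = rng.randn(50)
features = np.tanh(X @ W1 + b1)

# Convex optimization for the final layer of weights.
final_layer = LinearSVC(C=1.0).fit(features, y)
print(final_layer.score(features, y))
```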
GP on top-level features:                   17.9   12.7   11.2
GP on top-level features with fine-tuning:  15.2    7.2    6.4