Lec3-The Kernel Trick
1 Support Vectors

1.1 Solving the dual problem
Last time we transformed our original max-margin problem into its dual version:

\[
\max_{\alpha \geq 0} \; W(\alpha) = \sum_i \alpha_i \;-\; \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j \, y_i y_j \, x_i^T x_j
\qquad \text{s.t.} \quad \sum_i \alpha_i y_i = 0
\]
This is a quadratic program, which we can solve using a number of techniques. The easiest (but inefficient) way is gradient projection; or we can simply hand it to Matlab. Obtaining the optimal $\alpha$ gives us the vector $w$ we really care about, using the simple relation we showed earlier:
\[
w = \sum_i \alpha_i y_i x_i
\]
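To make this concrete, here is a minimal sketch in Python (the lecture itself just hands the QP to Matlab): it solves the dual with scipy's general-purpose constrained optimizer on a hypothetical toy data set and then recovers $w$ from the relation above. The data, starting point, and solver choice are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable 2-D data (hypothetical example, not from the lecture).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Matrix with entries y_i y_j x_i^T x_j, built from the Gram matrix.
G = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(alpha):
    # Negative of W(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j,
    # because scipy minimizes rather than maximizes.
    return 0.5 * alpha @ G @ alpha - alpha.sum()

constraints = {"type": "eq", "fun": lambda a: a @ y}   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * n                             # alpha_i >= 0

res = minimize(neg_dual, x0=np.zeros(n), bounds=bounds, constraints=constraints)
alpha = res.x

# Recover the weight vector: w = sum_i alpha_i y_i x_i
w = (alpha * y) @ X
print("alpha =", alpha)
print("w =", w)
```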
However, originally we wanted $w$ and $b$, because at the end of the day we want to classify new data points $x_{\text{new}}$ by testing:

\[
w^T x_{\text{new}} + b \geq 0
\]

So how do we get $b$? To answer this we need to digress to introduce the KKT (Karush-Kuhn-Tucker) conditions for optimality.
1.2 The KKT conditions

Consider a constrained problem of the form

\[
\min_x \; f(x) \qquad \text{s.t.} \quad g(x) \leq 0,
\]

with Lagrange multiplier $\lambda \geq 0$ attached to the constraint. At the optimum the KKT conditions hold; the one we will use is complementary slackness:

\[
\lambda \, g(x) = 0,
\]

i.e. either the multiplier is zero, or the constraint is active.
1.3 Why does this matter to us?

What do the KKT conditions imply in our case? Here the Lagrange multipliers are the $\alpha_i$, and

\[
g(x_i) = 1 - y_i (w^T x_i + b).
\]

Now, at the optimum, pick an $i$ such that $\alpha_i > 0$. Then the KKT conditions (complementary slackness) imply $g(x_i) = 0$, so:
\[
y_i (w^T x_i + b) = 1, \qquad \text{i.e.} \qquad b = y_i - w^T x_i
\]

(since $y_i \in \{-1, +1\}$, dividing by $y_i$ is the same as multiplying by it).
Any $i$ such that $\alpha_i > 0$ will do, but in practice it is better (for numerical stability) to average the result
obtained using several of them.
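Continuing the earlier sketch, recovering $b$ amounts to averaging $y_i - w^T x_i$ over the indices with $\alpha_i > 0$; the tolerance below is an assumed cutoff for deciding which multipliers count as nonzero.

```python
import numpy as np

# Continuing from the sketch above (alpha, w, X, y already computed).
tol = 1e-6
support = alpha > tol                      # indices of the support vectors
b = np.mean(y[support] - X[support] @ w)   # average of y_i - w^T x_i

def classify(x_new):
    # Predict +1 or -1 by testing the sign of w^T x_new + b.
    return np.sign(w @ x_new + b)

print("b =", b)
print(classify(np.array([2.5, 1.0])))
```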
The corresponding $x_i$'s are called the support vectors, since they support the hyperplanes on both sides of the margin. And this is why the max-margin classifier is also called a support vector machine (SVM).
However when people refer to SVM, they generally refer to the enhanced and more general version that we
will now describe.
2 The Kernel Trick

All the algorithms we have described so far use the data only through inner products. Because of this, they can be made non-linear in a very general way. Let's start with an example.

2.1 Example
Clearly, the data on the left in figure 1 is not linearly separable. Yet if we map it to a three-dimensional space using

\[
\phi : \mathbb{R}^2 \to \mathbb{R}^3, \qquad (x_1, x_2) \mapsto \left( x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2 \right)
\]

and if we try to linearly separate the mapped data, our decision boundaries will be hyperplanes in $\mathbb{R}^3$, of the form $w^T z + b = 0$, i.e. as a function of $x$ they will be of the form

\[
w_1 x_1^2 + \sqrt{2}\, w_2\, x_1 x_2 + w_3 x_2^2 + b = 0.
\]

Recall that the dual problem uses the data only through the matrix of inner products

\[
K = \begin{pmatrix}
x_1^T x_1 & x_1^T x_2 & \cdots \\
x_2^T x_1 & \ddots & \\
\vdots & &
\end{pmatrix}
= X X^T \in \mathbb{R}^{n \times n},
\qquad \text{where} \quad
X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix} \in \mathbb{R}^{n \times d}.
\]

After the mapping, this matrix becomes

\[
K = \begin{pmatrix}
\phi(x_1)^T \phi(x_1) & \phi(x_1)^T \phi(x_2) & \cdots \\
\phi(x_2)^T \phi(x_1) & \ddots & \\
\vdots & &
\end{pmatrix}.
\]
Let's write these inner products. Let $r = \phi(a)$ and $s = \phi(b)$ be the vectors in $\mathbb{R}^3$ corresponding to $a$ and $b$ respectively. Then

\[
\langle r, s \rangle = r_1 s_1 + r_2 s_2 + r_3 s_3 = a_1^2 b_1^2 + 2\, a_1 a_2 b_1 b_2 + a_2^2 b_2^2 = \langle a, b \rangle^2.
\]
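As a quick sanity check (a sketch with arbitrary example vectors, not part of the lecture), we can verify numerically that the explicit mapping and the squared inner product agree:

```python
import numpy as np

def phi(x):
    # Explicit feature map R^2 -> R^3 from the example above.
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

a = np.array([1.0, 3.0])   # arbitrary example points
b = np.array([-2.0, 0.5])

lhs = phi(a) @ phi(b)      # inner product in the mapped space
rhs = (a @ b) ** 2         # the same quantity computed in the original space
print(lhs, rhs)            # both equal <a, b>^2
```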
So instead of mapping our data via $\phi$ and then computing the inner product, we can do it all in one operation, leaving the mapping completely implicit. In fact, in the end we don't even need to know $\phi$; all we need to know is how to compute the modified inner product. And because "modified inner product" is a long name, we will call it a kernel, $K(x, y)$. We will also call $K$ the kernel matrix, because it contains the value of the kernel for every pair of data points, thus using the same letter for both the function and its matrix.
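For instance, here is a short sketch (with an assumed quadratic kernel matching the example above) of building the kernel matrix directly from a kernel function, without ever forming $\phi(x)$:

```python
import numpy as np

def quadratic_kernel(x, y):
    # K(x, y) = (x . y)^2, the kernel implicitly defined by the map phi above.
    return (x @ y) ** 2

def kernel_matrix(X, kernel):
    # Kernel matrix: entry (i, j) is kernel(x_i, x_j).
    n = len(X)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
K = kernel_matrix(X, quadratic_kernel)
print(K)
```

In the dual, this matrix simply takes the place of $XX^T$; the optimization itself is unchanged.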
Because the kernel seems to be the object of interest, and not the mapping $\phi$, we would like to characterize kernels without reference to this mere intermediate. Mercer's Theorem gives us just that.
2.2 Mercer's Theorem

A function $K(x, y)$ is a kernel, i.e. there exists a mapping $\phi$ such that $K(x, y) = \langle \phi(x), \phi(y) \rangle$, if and only if $K$ is symmetric and positive semi-definite or, equivalently, if for any finite set of points $x_1, \dots, x_n$ the matrix

\[
\begin{pmatrix}
K(x_1, x_1) & K(x_1, x_2) & \cdots \\
K(x_2, x_1) & \ddots & \\
\vdots & &
\end{pmatrix}
\]

is symmetric and positive semi-definite.
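In practice (a sketch, not from the lecture), the matrix condition can be checked numerically on a sample of points by inspecting the eigenvalues of the kernel matrix; the quadratic kernel and the random sample below are assumptions for illustration.

```python
import numpy as np

def is_psd(K, tol=1e-10):
    # A symmetric matrix is positive semi-definite iff all its eigenvalues
    # are >= 0 (up to a small numerical tolerance).
    eigenvalues = np.linalg.eigvalsh(K)
    return bool(np.all(eigenvalues >= -tol))

# Check the quadratic kernel K(x, y) = (x . y)^2 on some random points.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
K = (X @ X.T) ** 2          # elementwise square of the Gram matrix
print(is_psd(K))            # True: it is a valid kernel
```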
Therefore you can either explicitly map the data with a $\phi$ and take the dot product, or you can take any kernel and use it right away, without knowing or caring what $\phi$ looks like. For example:
1. Spectrum Kernel: count the number of substrings two strings have in common. It is a kernel since it is a dot product between vectors of indicators of all the substrings.
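As a concrete sketch of this last example (an assumed fixed-length $k$-substring variant, using counts rather than 0/1 indicators; the lecture does not fix these details):

```python
from collections import Counter

def substring_counts(s, k):
    # Count vector over all length-k substrings of s (the explicit feature map).
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s, t, k=3):
    # Dot product of the two count vectors: sum over shared substrings
    # of (count in s) * (count in t).
    cs, ct = substring_counts(s, k), substring_counts(t, k)
    return sum(cs[sub] * ct[sub] for sub in cs if sub in ct)

print(spectrum_kernel("GATTACA", "ATTACCA", k=3))   # 3 shared length-3 substrings
```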