Lec3-The Kernel Trick

Download as pdf or txt
Download as pdf or txt
You are on page 1of 4

CS281B/Stat241B: Advanced Topics in Learning & Decision Making

The Kernel Trick


Lecturer: Michael I. Jordan

1
1.1

Scribe: Romain Thibaux

Support Vectors
Solving the dual problem

Last time we transformed our original max-margin problem to its dual version:
max
0

s.t.

() =

1 XX
i j yi yj xTi xj
2 i j

i yi = 0

This is a quadratic program which we can solve using a number of techniques. The easiest (but innefficient)
way is gradient projection. Or simply give it to Matlab. Obtaining the optimal will give us the vector
we really care about using the simple relation we showed earlier:
X
=
i yi xi
i

However originally we wanted and b because at the end of the day we want to classify new data points
xnew by testing:
T xnew + b 0

So how do we get b? To answer this we need to digress to introduce the KKT conditions for optimality
(Karush Kuhn Tucker).

1.2

The KKT conditions

Recall that for a general convex optimization problem


min
s.t.

f (x)
g(x) 0

we introduced the Lagrangian:


L(x, ) = f (x) T g(x)
where T g(x) acts as a force atracting x to the feasible set, away from the non-feasible set. At optimality,
the solution satisfies the KKT conditions:
1

The Kernel Trick

Figure 1: Transforming the data can make it linearly separable


both x and are feasible
i i g(xi ) = 0
These conditions make intuitive sense, in particular if we imagine a single constraint. The unconstrained
optimum is either inside the feasible set, or outside. If it is inside, then the constrained and unconstrained
optima are the same. x is atracted to this optimum, therefore there is no need to introduce a force to
convince it to stay in the feasible set, so i = 0. If it is outside, then x will want to escape the feasible
set. We will need a non-zero i to retain it, but x will go as far as possible towards the non-feasible set.
Therefore it will be on the boundary, where g(xi ) = 0. So one of the two must be zero.

1.3

The Support Vectors

Why does this matter to us ? What do the KKT conditions imply in our case ?
In our case our Lagrange multiplier is , and
g(xi ) = 1 yi ( T xi + b)
Now, at the optimum, pick an i such that i > 0. Then the KKT conditions imply g(xi ) = 0 so:

i.e.

yi ( T xi + b) = 1
b = yi T xi

Any i such that i > 0 will do, but in practice it is better (for numerical stability) to average the result
obtained using several of them.
The corresponding xi s are called the support vectors since they support the hyperplanes on both sides
of the margin. And this is why the max-margin classifier is also called support vector machine (SVM).
However when people refer to SVM, they generally refer to the enhanced and more general version that we
will now describe.

The Kernel Trick

The Kernel Trick

All the algorithms we have described so far use the data only through inner products. Because of this, they
can be made non-linear in a very general way. Lets start by an example:

2.1

Example

Clearly, the data on the left in figure 1 is not linearly separable. Yet if we map it to a three-dimensional
space using
:

<2
(x1 , x2 ) 7

<3

(z1 , z2 , z3 ) = (x21 , 2x1 x2 , x22 )

and if we try to linearly separate the mapped data, our decision boundaries will be hyperplanes in <3 , of
the form T z + b = 0, i.e. as a function of x they will be of the form

1 x21 + 2 2x1 x2 + 3 x22 = 0


which is the equation of an ellipse. So thats interesting, we can use our linear algorithm on a transformed
version of the data to get a non-linear algorithm with no effort !
But look more closely at what the algorithm is doing. All we use is the Gram Matrix K of the data, in the
sense that once we know K, we can throw away our original data.
T

x1 x1 xT1 x2
T

..

.
x
x
K =
= XX T
1
2

..
.
nn
T
x1

where
X = ...
xTn

nd

X, containing all the data, is called the design matrix.


What happens in our example when we first map our data via some function ? The Gram Matrix is now
that of the zi s:

(x1 )T (x1 ) (x1 )T (x2 )

..
T

.
K=
(x2 ) (x1 )

..
.
Lets write these inner products. Let r and s be vectors in <3 corresponding to a and b respectively.
hr, si = r1 s1 + r2 s2 + r3 s3
= a21 b21 + 2a1 a2 b1 b2 + a22 b22
2

= ha, bi

So instead of mapping our data via and computing the inner product, we can do it in one operation,
leaving the mapping completely implicit. In fact in the end we dont even need to know , all we need to
know is how to compute the modified inner product. And because modified inner product is a long name,
we will call it a kernel, K(x, y). We will also call K the kernel matrix because it contains the value of
the kernel for every pair of data points, thus using same letter both for the function and its matrix.
Because the kernel seems to be the object of interest, and not the mapping , we would like to characterize
kernels without this mere intermediate. Mercers Theorem gives us just that.

The Kernel Trick

2.2

Mercers Theorem

A symmetric function K(x, y) can be expressed as an inner product


K(x, y) = h(x), (y)i
for some if and only if K(x, y) is positive semidefinite, i.e.
Z
K(x, y)g(x)g(y)dxdy 0

or, equivalently:

K(x1 , x1 )

K(x2 , x1 )

..
.

K(x1 , x2 )
..
.

is psd for any collection {x1 . . . xn }

Therefore you can either explicitly map the data with a and take the dot product, or you can take any
kernel and use it right away, without knowing nor caring what looks like. For example:
1

Gaussian Kernel: K(x, y) = e 2 ||xy||

Spectrum Kernel: count the number of substrings in common. It is a kernel since it is a dot product
between vectors of indicators of all the substrings.

You might also like