SVM and Kernels

The document discusses kernels and the kernel trick in support vector machines. It explains that kernels allow computing the dot product of data points mapped into a higher dimensional feature space without explicitly performing the mapping. This is done through kernel functions that evaluate the similarity between pairs of data points. Common kernel functions include polynomial kernels and Gaussian radial basis function kernels. The kernel trick allows support vector machines to operate in infinite dimensional feature spaces while maintaining computational efficiency.


Kernels and the Kernel Trick

Martin Hofmann

Reading Club "Support Vector Machines"


Optimization Problem

maximize:

$W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i \cdot x_j \rangle$

subject to $\alpha_i \geq 0$, $i = 1, \dots, m$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$

data not linearly separable in input space

map into some feature space where the data is linearly separable
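As a minimal illustration (not from the slides), the dual objective can be evaluated directly against the matrix of inner products; the function name dual_objective and the numpy formulation are assumptions of this sketch.

```python
import numpy as np

def dual_objective(alpha, X, y):
    """W(alpha) = sum_i alpha_i - 1/2 * sum_{i,j} alpha_i alpha_j y_i y_j <x_i . x_j>."""
    G = X @ X.T                # matrix of plain inner products <x_i . x_j>
    v = alpha * y              # elementwise products alpha_i * y_i
    return alpha.sum() - 0.5 * v @ G @ v
```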


Mapping Example
map data points into feature space with some function $\Phi$
e.g.:

$\Phi : \mathbb{R}^2 \to \mathbb{R}^3$
$(x_1, x_2) \mapsto (z_1, z_2, z_3) := (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$

hyperplane $\langle w \cdot z \rangle = 0$, as a function of $x$:

$w_1 x_1^2 + w_2 \sqrt{2}\,x_1 x_2 + w_3 x_2^2 = 0$
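A short sketch of this explicit mapping and of the hyperplane evaluated in input space; the names phi and hyperplane are illustrative, not from the slides.

```python
import numpy as np

def phi(x):
    """Explicit feature map (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def hyperplane(w, x):
    """Evaluate w1*x1^2 + w2*sqrt(2)*x1*x2 + w3*x2^2; the decision surface is where this is 0."""
    return w @ phi(x)
```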


Kernel Trick
solve maximisation problem using mapped data points

$W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle \Phi(x_i) \cdot \Phi(x_j) \rangle$

Dual representation of the hyperplane (from the primal Lagrangian):

$f(x) = \langle w \cdot x \rangle + b = \sum_i \alpha_i y_i \langle x_i \cdot x \rangle + b$, with $w = \sum_i \alpha_i y_i x_i$

weight vector represented only by data points
only inner products of data points necessary, no coordinates
kernel function $K(x_i, x_j) = \langle \Phi(x_i) \cdot \Phi(x_j) \rangle$
explicit mapping $\Phi$ not necessary any more
possible to operate in any n-dimensional feature space
complexity independent of the feature space
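A sketch of the same dual objective and decision function with the inner products replaced by a kernel, so no feature-space coordinates are ever computed; the precomputed matrix K, the kernel argument, and the function names are assumptions of this sketch.

```python
import numpy as np

def dual_objective_kernel(alpha, K, y):
    """W(alpha) with <Phi(x_i) . Phi(x_j)> replaced by the precomputed kernel matrix K."""
    v = alpha * y
    return alpha.sum() - 0.5 * v @ K @ v

def decision_function(x, X, y, alpha, b, kernel):
    """f(x) = sum_i alpha_i y_i K(x_i, x) + b -- w and Phi never appear explicitly."""
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b
```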


Example Kernel Trick


$\vec{x} = (x_1, x_2)$
$\vec{z} = (z_1, z_2)$

$K(x, z) = \langle \vec{x} \cdot \vec{z} \rangle^2$

$K(x, z) = \langle \vec{x} \cdot \vec{z} \rangle^2$
$= (x_1 z_1 + x_2 z_2)^2$
$= x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2$
$= \langle (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2) \cdot (z_1^2, \sqrt{2}\,z_1 z_2, z_2^2) \rangle$
$= \langle \Phi(\vec{x}) \cdot \Phi(\vec{z}) \rangle$

mapping function fused in $K$

implicit $\Phi(\vec{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$
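A quick numerical check of this identity; the sample values are chosen only for illustration.

```python
import numpy as np

def phi(x):
    # implicit feature map recovered from expanding (x1*z1 + x2*z2)^2
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = (x @ z) ** 2           # kernel evaluated in input space
rhs = phi(x) @ phi(z)        # inner product in feature space
assert np.isclose(lhs, rhs)  # both equal 1.0: (1*3 + 2*(-1))^2 = 1
```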


Typical Kernels
Polynomial Kernel

$K(x, z) = (\langle x \cdot z \rangle + \theta)^d$, for $d \geq 0$

Radial Basis Function (Gaussian Kernel)

$K(x, z) = e^{-\frac{\|x - z\|^2}{2\sigma^2}}$, where $\|x\| := \sqrt{\langle x \cdot x \rangle}$

(Sigmoid Kernel)

$K(x, z) = \tanh(\kappa \langle x \cdot z \rangle + \theta)$

Inverse multi-quadric

$K(x, z) = \frac{1}{\sqrt{\|x - z\|^2 / (2\sigma^2) + c^2}}$
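The four kernels above written as plain numpy functions; this is a sketch, and the default hyperparameter values are arbitrary assumptions.

```python
import numpy as np

def poly_kernel(x, z, theta=1.0, d=2):
    return (x @ z + theta) ** d

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, z, kappa=1.0, theta=0.0):
    return np.tanh(kappa * (x @ z) + theta)

def inv_multiquadric_kernel(x, z, sigma=1.0, c=1.0):
    return 1.0 / np.sqrt(np.sum((x - z) ** 2) / (2 * sigma ** 2) + c ** 2)
```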


Typical Kernels Cont.


Kernels for Sets: for two sets $A = \{x_1, \dots, x_N\}$ and $A' = \{x'_1, \dots, x'_{N'}\}$

$K_s(A, A') = \sum_{i=1}^{N} \sum_{j=1}^{N'} k(x_i, x'_j)$

where $k(x_i, x'_j)$ is a kernel on the elements of $A$ and $A'$
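A minimal sketch of this set kernel, summing an element-level kernel over all pairs; set_kernel and element_kernel are illustrative names, not from the slides.

```python
def set_kernel(A, B, element_kernel):
    """K_s(A, B) = sum over all pairs (a, b) of element_kernel(a, b)."""
    return sum(element_kernel(a, b) for a in A for b in B)

# e.g. set_kernel(A, B, rbf_kernel), with the RBF kernel sketched above
```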

Kernels for strings (Spectral Kernels) and trees

no one-fits-all kernel
model search and cross-validation in practice
a low-degree polynomial or an RBF kernel is a good initial try


Kernel Properties

Symmetry

$K(x, z) = \langle \Phi(x) \cdot \Phi(z) \rangle = \langle \Phi(z) \cdot \Phi(x) \rangle = K(z, x)$

Cauchy-Schwarz Inequality

$K(x, z)^2 = \langle \Phi(x) \cdot \Phi(z) \rangle^2 \leq \|\Phi(x)\|^2 \|\Phi(z)\|^2 = \langle \Phi(x) \cdot \Phi(x) \rangle \langle \Phi(z) \cdot \Phi(z) \rangle = K(x, x) K(z, z)$
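Both properties can be spot-checked numerically, e.g. for the Gaussian kernel; this is a sketch with arbitrary sample points.

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

assert np.isclose(rbf(x, z), rbf(z, x))                  # symmetry
assert rbf(x, z) ** 2 <= rbf(x, x) * rbf(z, z) + 1e-12   # Cauchy-Schwarz bound
```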


Making Kernels from Kernels

create complex Kernels by combining simpler ones


Closure Properties:

$K(x, z) = c \cdot K_1(x, z)$
$K(x, z) = c + K_1(x, z)$
$K(x, z) = K_1(x, z) + K_2(x, z)$
$K(x, z) = K_1(x, z) \cdot K_2(x, z)$
$K(x, z) = f(x)\,f(z)$

if $K_1$ and $K_2$ are kernels, $f : X \to \mathbb{R}$, and $c > 0$
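These closure rules translate directly into higher-order functions; a sketch in which all names are assumptions.

```python
def scale(k, c):
    """c * K1(x, z), valid for c > 0."""
    return lambda x, z: c * k(x, z)

def add(k1, k2):
    """K1(x, z) + K2(x, z)."""
    return lambda x, z: k1(x, z) + k2(x, z)

def multiply(k1, k2):
    """K1(x, z) * K2(x, z)."""
    return lambda x, z: k1(x, z) * k2(x, z)

def from_function(f):
    """f(x) * f(z) for any real-valued f."""
    return lambda x, z: f(x) * f(z)

# e.g. combined = add(scale(poly_kernel, 0.5), rbf_kernel)  # still a valid kernel
```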


Gram Matrix

Kernel function as similarity measure between input objects


Gram Matrix (Similarity/Kernel Matrix) represents the similarities between input vectors

let $V = \{\vec{v}_1, \dots, \vec{v}_n\}$ be a set of input vectors; then the Gram Matrix $K$ is defined as:

$K = \begin{pmatrix} \langle \Phi(\vec{v}_1) \cdot \Phi(\vec{v}_1) \rangle & \dots & \langle \Phi(\vec{v}_1) \cdot \Phi(\vec{v}_n) \rangle \\ \langle \Phi(\vec{v}_2) \cdot \Phi(\vec{v}_1) \rangle & \dots & \vdots \\ \vdots & & \vdots \\ \langle \Phi(\vec{v}_n) \cdot \Phi(\vec{v}_1) \rangle & \dots & \langle \Phi(\vec{v}_n) \cdot \Phi(\vec{v}_n) \rangle \end{pmatrix}$

$K$ is symmetric and positive semi-definite (non-negative eigenvalues)
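A sketch of building a Gram matrix and checking both properties; the kernel choice, data, and names are arbitrary assumptions.

```python
import numpy as np

def gram_matrix(X, kernel):
    """K[i, j] = kernel(v_i, v_j) for every pair of input vectors."""
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

X = np.random.default_rng(1).normal(size=(5, 2))
K = gram_matrix(X, lambda a, b: (a @ b + 1.0) ** 2)   # degree-2 polynomial kernel

assert np.allclose(K, K.T)                      # symmetric
assert np.all(np.linalg.eigvalsh(K) >= -1e-9)   # non-negative eigenvalues (PSD)
```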


Mercer's Theorem
assume:

finite input space $X = \{x_1, \dots, x_n\}$
symmetric function $K(x, z)$ on $X$
Gram Matrix $K = (K(x_i, x_j))_{i,j=1}^{n}$

since $K$ is symmetric, there exists an orthogonal matrix $V$ s.t. $K = V \Lambda V'$

$\Lambda$ diagonal, containing the eigenvalues $\lambda_t$ of $K$
eigenvectors $v_t = (v_{ti})_{i=1}^{n}$ as columns of $V$

assume all eigenvalues are non-negative and let the feature mapping be

$\Phi : x_i \mapsto \big( \sqrt{\lambda_t}\, v_{ti} \big)_{t=1}^{n} \in \mathbb{R}^n, \quad i = 1, \dots, n$

then

$\langle \Phi(x_i) \cdot \Phi(x_j) \rangle = \sum_{t=1}^{n} \lambda_t v_{ti} v_{tj} = (V \Lambda V')_{ij} = K_{ij} = K(x_i, x_j)$
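The construction can be replayed numerically: eigendecompose a Gram matrix and rebuild it from the induced feature vectors. A sketch; the polynomial kernel, data, and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 3))
K = (X @ X.T + 1.0) ** 2                  # Gram matrix of a degree-2 polynomial kernel

lam, V = np.linalg.eigh(K)                # eigenvalues lambda_t, eigenvectors as columns of V
Phi = V * np.sqrt(np.clip(lam, 0, None))  # row i is Phi(x_i) = (sqrt(lambda_t) * v_ti)_t

assert np.allclose(Phi @ Phi.T, K)        # <Phi(x_i) . Phi(x_j)> reproduces K_ij
```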


Mercer's Theorem Cont.

every Gram Matrix is symmetric and positive semi-definite

every symmetric positive semi-definite matrix can be regarded as a Kernel Matrix, i.e. as an inner product matrix in some space

a diagonal matrix satisfies Mercer's criteria, but is not a good Gram Matrix:
self-similarity dominates between-sample similarity
it represents mutually orthogonal samples

generalization exists for infinite input spaces

eigenvectors of the data in feature space can be used to detect directions of maximum variance: kernel principal component analysis
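A minimal sketch of kernel PCA along these lines: double-center the Gram matrix, eigendecompose, and keep the leading components. The function name and details are assumptions, not the slides' algorithm.

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Project samples onto the top principal directions in feature space, given Gram matrix K."""
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # center the data in feature space
    lam, V = np.linalg.eigh(Kc)                          # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:n_components]           # indices of the largest eigenvalues
    return V[:, idx] * np.sqrt(np.clip(lam[idx], 0.0, None))
```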


Summary

a kernel calculates the dot product of mapped data points without the explicit mapping function
represented by a symmetric, positive semi-definite Gram Matrix
fuses information about data and kernel
standard kernels chosen in practice via cross-validation
every similarity matrix satisfying Mercer's criteria can be used as a kernel
ongoing research on estimating the Kernel Matrix from available data
