
Kernel Models
From Intuition to Application
Motivation & Problem Statement
Linear models like Logistic Regression struggle with data
that is not linearly separable, such as complex shapes or
XOR patterns. This limitation makes them less effective in
real-world applications like image classification or
bioinformatics. To overcome this, we need a mechanism
that can handle non-linear relationships while retaining
computational efficiency. This is where kernel models
come in, allowing us to transform data into higher
dimensions, making it linearly separable.
Figure: linearly separable data (a single line separates the classes) vs. data that is not linearly separable (no linear decision boundary works).
Intuition Behind Non-Linearity

Consider classifying cats and dogs using only weight and height. If these two features overlap significantly, a straight line cannot effectively separate them. However, if we could transform this 2D feature space into a higher dimension—say, weight² and height²—the data points might become separable. This is the core intuition behind using kernels to project data into higher-dimensional spaces.
1D to 2D Transformation

Figure: 1D data with the two classes interleaved along x1 is not linearly separable, but adding a second feature x2 = x1² makes it separable in 2D.

What if there's a 2D input space that's not linearly separable?
2D to 3D Transformation
Pipeline:

Low-Dimensional Data → Kernel → High-Dimensional Data → Model (say, SVM) → Output

Test data passes through the same pipeline before being fed to the model.
Types of Kernel Models:

• Support Vector Machines (SVM)
• Kernel Ridge Regression (KRR)
• Kernel Principal Component Analysis (KPCA)
• Gaussian Processes
• Kernel Density Estimation
Support Vector Machines
• Definition:
SVMs are max-margin classifiers that use
kernel functions to separate data into
different classes.
• SVMs find a hyperplane or line that
maximizes the distance between classes in
an N-dimensional space. They are based on
statistical learning theory and use support
vectors and margins to find the optimal
separating hyperplane.
• Advantages:
• Perform well with noisy or outlier-prone
data.
• Effective in high-dimensional spaces.
SVM using Kernel
• How It Works:
• Data is transformed into a
higher-dimensional space
or a new feature space
using a kernel.
• The optimal hyperplane is
found to separate the
classes.
• New data is classified
based on this hyperplane.
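A minimal sketch of this procedure with scikit-learn; the make_moons data, the RBF kernel, and the parameter values are illustrative assumptions, not part of the slides.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data that is not linearly separable in 2D
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The kernel implicitly maps the data to a higher-dimensional space,
# where the maximum-margin hyperplane is found
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X_train, y_train)

# New data is classified relative to that hyperplane
print("Test accuracy:", clf.score(X_test, y_test))
```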



Kernel Ridge Regression
•Definition:
A regression model that uses kernels to
predict continuous variables.

•Advantages:
•Can capture non-linear relationships in data.
•Its prediction coincides with the mean prediction of Gaussian process regression, which additionally provides uncertainty estimates.
•Suitable for high-dimensional datasets.



Kernel Ridge Regression
• How It Works:
• Data is mapped into a higher-dimensional space using a kernel function.
• Ridge regression is applied in that space to predict the output.
• The model returns a point prediction for each input.
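A minimal sketch with scikit-learn's KernelRidge; the toy sine data, the RBF kernel, and the parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Noisy 1D regression data
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(100, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(100)

# Ridge regression in the kernel-induced feature space
model = KernelRidge(kernel="rbf", gamma=0.5, alpha=1.0).fit(X, y)
print(model.predict([[2.5]]))  # point prediction at x = 2.5
```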



Kernel Principal Component Analysis
(KPCA)
• KPCA uses kernel methods to map
the original input vectors into a
high-dimensional feature space,
where linear PCA is then
calculated. This results in a
nonlinear PCA in the original space.
• How It Works:
• Data is mapped into a
higher-dimensional space
via a kernel.
• PCA is applied to identify
principal components.
• The data is transformed into a lower-dimensional representation.
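A minimal sketch comparing linear PCA and KPCA with scikit-learn; the concentric-circles data and the gamma value are illustrative assumptions.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two classes arranged as concentric circles: not linearly separable
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear PCA works in the original space
X_pca = PCA(n_components=2).fit_transform(X)

# KPCA applies PCA in the kernel-induced feature space,
# which amounts to a nonlinear PCA in the original space
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)
print(X_pca.shape, X_kpca.shape)
```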
PCA vs KPCA



Kernels helped us to Restructure Data!!



How to map?
Kernel Functions
A kernel function transforms the training data so that a non-linear decision surface in the original space corresponds to a linear decision boundary in a higher-dimensional space.
Basically, it returns the inner product between two points in that feature space.

Let’s discuss in detail….

1) Linear Kernel Function
• It is used when the data
is linearly separable.
• K(x1, x2) = x1 · x2
Q) When to use it?
Linear kernels work best when the data is (close to) linearly separable, especially for high-dimensional datasets with many features, and are often used for text classification. They are also the natural choice for linear models such as SVMs and logistic regression.

Example: Text classification, or scenarios with a large number of features relative to the number of samples.
Linear Kernel Function
• Linear kernels are of the type
k(x, y) = xᵀ·y,
where x and y are two vectors.

Therefore k(x, y) = φ(x)·φ(y) = xᵀ·y
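A quick numeric check of this identity; the vectors are arbitrary illustrations.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([2.0, 3.0])
print(np.dot(x, y))  # k(x, y) = x^T y = 1*2 + 2*3 = 8.0
```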

2. Polynomial Kernel
• It is used when the data is not
linearly separable.
• K(x1, x2) = (x1 · x2 + 1)^d

How it works?
The polynomial kernel generates new
features by combining existing features
using polynomials. It looks at the given
features of input samples, as well as
combinations of those features, to
determine their similarity.

Example: Datasets where the decision boundary is curved or polynomial in shape.


Polynomial Kernel Function

• Polynomial kernels are of the type
k(x, y) = (xᵀ·y)^q,
where x and y are two vectors.
• This is called a homogeneous kernel.
• Here, q is the degree of the polynomial.
• If q = 2, it is called a quadratic kernel.

Polynomial Kernel Function
• For inhomogeneous kernels, this is given as:
k(x, y) = (c + xᵀ·y)^q,
where x and y are two vectors.
• Here c is a constant and q is the degree of the polynomial.
• If c is zero and the degree is one, the polynomial kernel reduces to a linear kernel.
• The degree q should be chosen carefully, as a higher degree may lead to overfitting.

Application
Q) Consider two data points x = (1, 2) and y = (2, 3) with c = 1.
Apply the linear, homogeneous, and inhomogeneous kernels.

Solution:
The kernel is given by k(x, y) = (xᵀ·y)^q.
If q = 1, it is called the linear kernel:

k(x, y) = xᵀ·y = 1·2 + 2·3 = 8
Solution:
The kernel is given by k(x, y) = (xᵀ·y)^q.
If q = 2, it is called the homogeneous (quadratic) kernel:

k(x, y) = (xᵀ·y)² = 8² = 64

Solution:
The kernel is given by k(x, y) = (c + xᵀ·y)^q.
If q = 2 and c = 1, it is called the inhomogeneous kernel:

k(x, y) = (1 + 8)² = 81
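A short script to verify these values; x = (1, 2) is assumed here, matching the RBF example on a later slide.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([2.0, 3.0])

dot = np.dot(x, y)          # x^T y = 1*2 + 2*3 = 8
print(dot)                  # linear kernel: 8.0
print(dot ** 2)             # homogeneous quadratic kernel (q = 2): 64.0
print((1.0 + dot) ** 2)     # inhomogeneous kernel (c = 1, q = 2): 81.0
```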

3. Gaussian Kernel Radial Basis
Function (RBF):
• k(xi, xj) = exp(-γ‖xi - xj‖²)
• It is a popular radial basis
function, and is used in a variety
of learning architectures,
including: Spatial statistics,
Dynamical system identification,
Gaussian processes for machine
learning, and Classification of
object existence
• Example: Image classification or
datasets with clusters.
Application
Q) Consider two data points x = (1, 2) and y = (2, 3) with σ = 1.
Apply the RBF kernel and find its value for these points.

Substitute the values of x and y into the RBF kernel.

The squared distance between the points (1, 2) and (2, 3) is:
‖x - y‖² = (1 - 2)² + (2 - 3)² = 2

Then k(x, y) = exp(-‖x - y‖² / (2σ²)) = exp(-1) ≈ 0.368
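A short check of this value with NumPy; it assumes the common convention γ = 1 / (2σ²), which gives exp(-1) ≈ 0.368 here (texts that use γ = 1/σ² would get exp(-2) instead).

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([2.0, 3.0])
sigma = 1.0

sq_dist = np.sum((x - y) ** 2)            # (1-2)^2 + (2-3)^2 = 2
k = np.exp(-sq_dist / (2 * sigma ** 2))   # exp(-1)
print(sq_dist, k)                         # 2.0 0.3678...
```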
Gaussian Kernel RBF:
• Radial Basis Functions (RBFs) or Gaussian kernels are extremely useful in SVMs. The RBF function is:
k(x, y) = exp(-γ‖x - y‖²)

• Here, γ is an important parameter. If γ is small, the RBF behaves similarly to a linear SVM; if γ is large, the kernel is influenced by more support vectors.
• The RBF performs the dot product in R∞ (an infinite-dimensional feature space), and therefore it is highly effective at separating classes and is often used.
4. Sigmoid Kernel:
• This function is equivalent to the activation used in a two-layer perceptron model of a neural network, where it acts as the activation function for artificial neurons:
σ(x) = 1 / (1 + e^(-x))
• The sigmoid kernel has been proposed theoretically for Support Vector Machines (SVMs) because it originates from neural networks, but until now it has not been widely used in practice.
• Example: Problems where data has binary features or decision boundaries are non-linear.
How to Choose the Right Kernel
Function in SVM?
• 1. Understand the problem - Understand the type of data, features,
and the complexity of the relationship between the features
• 2. Choose a simple kernel function - Start with the linear kernel
function as it serves as the baseline for comparison with the complex
kernel functions.
• 3. Test different kernel functions - Test the polynomial, RBF, and other kernels, and compare their performance.



• 4. Tune the parameters - Experiment with different parameter values & choose the values that deliver the best performance (see the sketch after this list).
• 5. Use domain knowledge - Based on the type of data, use domain knowledge & choose the right type of kernel for your data set.
• 6. Consider computational complexity - Estimate the computational cost & the resources that would be required for larger data sets.
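A minimal sketch of steps 3 and 4 using k-fold cross-validation with scikit-learn's GridSearchCV; the make_moons data and the parameter grid are illustrative assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy non-linear data
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# Candidate kernels and their parameters
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.1, 1, 10]},
    {"kernel": ["poly"], "C": [0.1, 1, 10], "degree": [2, 3]},
]

# 5-fold cross-validation over every kernel/parameter combination
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```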




Is it all we need to
know to solve real
world problems?

Problems with the discussed
approach:
1. Curse of Dimensionality
2. Slow and ineffective Computation
3. Mapping Data to Higher-Dimensional Spaces



Curse of Dimensionality
• Mapping data to higher dimensions increases the complexity and
computational burden of machine learning models. This problem
is known as the curse of dimensionality.
• As the number of dimensions increases, the amount of data
needed to generalize accurately also increases exponentially.
• Additionally, storing and computing the coordinates in high-
dimensional spaces becomes infeasible.



Kernel Trick
• The kernel trick offers a way to bypass this problem by
implicitly performing the mapping into higher-
dimensional spaces without ever computing the
coordinates explicitly. Instead of transforming the data into
high-dimensional space, the kernel trick uses a function
(the kernel function) to compute the inner product
between pairs of data points as if they were mapped to a
higher-dimensional space.



Computational
Efficiency
• Computing and storing high-dimensional feature vectors
can be infeasible for large datasets.
• Instead of explicitly mapping data points to a higher-
dimensional space, the kernel trick leverages kernel
functions to compute the inner products between pairs of
data points directly in this higher-dimensional space.
• This approach allows us to perform complex
transformations and find non-linear decision boundaries
efficiently.
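As a small illustration with scikit-learn: for the RBF kernel the mapped feature vectors are infinite-dimensional and could never be stored, yet the full matrix of pairwise inner products is cheap to compute directly from the original data (the data size and gamma below are arbitrary).

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# 1000 points in 20 dimensions
X = np.random.default_rng(0).normal(size=(1000, 20))

# 1000 x 1000 Gram matrix of feature-space inner products,
# computed without ever building the feature vectors
K = rbf_kernel(X, gamma=0.1)
print(K.shape)
```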



• When talking about kernels in machine learning, most
likely the first thing that comes into your mind is the
support vector machines (SVM) model because the kernel
trick is widely used in the SVM model to bridge linearity
and non-linearity.




What is the
kernel trick? Let’s
dive deep…

Why is it important to use
the kernel trick?
As the earlier 2D-to-3D illustration shows, if we find a way to map the data from 2-dimensional space to 3-dimensional space, we will be able to find a decision surface that clearly divides the different classes. The obvious approach to this data transformation is to map every data point to the higher dimension (in this case, 3 dimensions), find the boundary, and make the classification.

That sounds alright. However, as the number of dimensions grows, computations within that space become more and more expensive. This is where the kernel trick comes in. It allows us to operate in the original feature space without computing the coordinates of the data in the higher-dimensional space.
Let’s look at an example:

Say we have a point in the xy-plane; we map it as follows:

φ(x, y) = (x², √2·xy, y²)

Our kernel function accepts inputs in the original lower-dimensional space and returns the dot product of the transformed vectors in the higher-dimensional space. There are also theorems which guarantee the existence of such kernel functions under certain conditions.
The kernel trick for the 2nd-degree polynomial uses exactly this map, and we visualized the transformation in 3-D in a previous figure. The transformed vectors have coordinates that are functions of the two components x1 and x2, so their dot product involves only x1 and x2 as well. The kernel function likewise takes inputs x1 and x2 and returns a real number, and the dot product always returns a real number too.
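A quick numeric check of this idea, assuming the degree-2 polynomial kernel k(x, z) = (xᵀ·z)²: the dot product of the explicitly mapped vectors equals the kernel value computed directly in 2D (the points below are arbitrary).

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D input: (v1^2, sqrt(2)*v1*v2, v2^2)."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))   # inner product in the mapped 3D space
implicit = np.dot(x, z) ** 2        # kernel trick: stay in the original 2D space
print(explicit, implicit)           # both 121.0
```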



Let’s look at another example:

Here x and y are two data points in 3 dimensions. Let's assume that we need to map x and y to a 9-dimensional space: we would have to compute all 9 mapped coordinates for each point and then their dot product to get the final result, which is just a scalar. The computational complexity in this case is O(n²).
However, if we use the kernel function, denoted k(x, y), instead of doing the complicated computations in the 9-dimensional space, we reach the same result within the 3-dimensional space by calculating the dot product of x-transpose and y (and squaring it for this kernel). The computational complexity in this case is O(n).
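A quick check of this 3D-to-9D example: the outer-product feature map has 9 components, and its dot product equals (xᵀ·y)² computed directly in 3D (the points below are arbitrary).

```python
import numpy as np

def phi9(v):
    """Map a 3D vector to the 9 pairwise products v_i * v_j."""
    return np.outer(v, v).ravel()

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

print(np.dot(phi9(x), phi9(y)))   # explicit computation in 9 dimensions
print(np.dot(x, y) ** 2)          # kernel trick in the original 3 dimensions
```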

In essence, what the kernel trick does for us is offer a more efficient and less expensive way to work with data as if it had been transformed into higher dimensions. That said, the application of the kernel trick is not limited to the SVM algorithm. Any computation involving the dot product xᵀ·y can utilize the kernel trick.
• The kernel trick sounds like a “perfect” plan. However, one
critical thing to keep in mind is that when we map data to
a higher dimension, there are chances that we may
overfit the model.
• Thus choosing the right kernel function (including the right
parameters) and regularization are of great importance.



• The kernel function here is the polynomial kernel k(a, b) = (aᵀ·b)².
• The ultimate benefit of the kernel trick is that the objective
function we are optimizing to fit the higher dimensional
decision boundary only includes the dot product of the
transformed feature vectors. Therefore, we can just
substitute these dot product terms with the kernel
function, and we don’t even use ϕ(x).
Note:
• Remember, our data is only linearly separable as the vectors ϕ(x) in the higher-dimensional space, and we are finding the optimal separating hyperplane in that space without ever having to calculate, or even really know anything about, ϕ(x).
Using Kernels
 Kernels can turn many linear models into nonlinear models
 Recall that k(x, z) = φ(x)ᵀφ(z) represents a dot product in some high-dim feature space
 Important: Any ML model/algo in which, during training and test, inputs only appear as dot products can be "kernelized"
 Just replace each term of the form xᵀ·z by k(x, z) = φ(x)ᵀφ(z)
Using Kernels
 Most ML models/algos can be easily kernelized,
e.g.,
 Distance based methods, Perceptron, SVM,
linear regression, etc.
 Many of the unsupervised learning algorithms
too can be kernelized (e.g., K-means
clustering, Principal Component Analysis, etc. -
will see later)
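For instance, a distance-based method can be kernelized because the squared distance in feature space needs only kernel evaluations: ‖φ(x) - φ(z)‖² = k(x, x) - 2·k(x, z) + k(z, z). A minimal sketch, assuming an RBF kernel and arbitrary points:

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def feature_space_sq_dist(a, b, k):
    """||phi(a) - phi(b)||^2 expressed purely through kernel evaluations."""
    return k(a, a) - 2 * k(a, b) + k(b, b)

x = np.array([1.0, 2.0])
z = np.array([2.0, 3.0])
print(feature_space_sq_dist(x, z, rbf))  # distance in feature space, no phi needed
```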
General Tips
•Start Simple: Begin with a linear kernel for quick
results and to understand data behavior.
•Use RBF as a Default: RBF is generally a good first
choice for non-linear problems if no prior information is
available.
•Cross-Validation: Use techniques like k-fold cross-
validation to test different kernels and select the one
with the best performance.
•Check Overfitting: Polynomial and Sigmoid kernels
can overfit small datasets. Use regularization to manage
complexity.
Thank you
Shaini Mohanty
Ayush Kumar Sinha
Madhvendra Kumar Patel
