Support Vector Machine


DTSC6014001 – MACHINE LEARNING

SUPPORT VECTOR MACHINE


SESSION 10

SUBJECT MATTER EXPERT


Lili Ayu Wulandhari, S.Si., M.Sc., Ph.D.
Learning Outcomes
LO1: Identify machine learning problems and tasks
LO2: Explain fundamental concepts of machine learning
LO3: Construct machine learning models for a given problem using Python
LO4: Analyze the results of the machine learning modeling process
OUTLINE

• Maximal margin classifier
• Support vector classifiers
• Support vector machines
• Support vector regression
MAXIMAL MARGIN CLASSIFIER
INTRODUCTION

Support vector machine (SVM) is:

• One of the most powerful, versatile, and popular machine learning models

• Capable of performing linear or nonlinear classification, regression, and even outlier detection

• Well-suited for classification of complex small or medium-sized datasets
MAXIMAL MARGIN CLASSIFIER
INTRODUCTION

• SVM is a family of classification rules that contains both parametric (e.g., linear) and nonparametric (e.g., kernel-based) methods.

• It can be viewed as a generalization of linear decision boundaries for classification.

• In particular, SVM produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space.

• We will mainly discuss three approaches: the maximal margin classifier, the support vector classifier, and the support vector machine. People often loosely refer to these methods collectively as support vector machines.
SVM TERMINOLOGY

1. Hyperplane: The hyperplane is the decision boundary used to separate the data points of different classes in a feature space. In the case of linear classification, it is a linear equation, i.e. wx + b = 0.
2. Support Vectors: Support vectors are the data points closest to the hyperplane, which play a critical role in deciding the hyperplane and the margin.
3. Margin: The margin is the distance between the support vectors and the hyperplane. The main objective of the support vector machine algorithm is to maximize the margin; a wider margin generally indicates better classification performance.
4. Kernel: The kernel is the mathematical function used in SVM to map the original input data points into high-dimensional feature spaces, so that a separating hyperplane can be found even if the data points are not linearly separable in the original input space. Some common kernel functions are linear, polynomial, radial basis function (RBF), and sigmoid.
5. Hard Margin: The maximum-margin hyperplane, or hard-margin hyperplane, is a hyperplane that properly separates the data points of different categories without any misclassifications.
6. Soft Margin: When the data is not perfectly separable or contains outliers, SVM permits a soft-margin technique. The soft-margin formulation introduces a slack variable for each data point, which relaxes the strict margin requirement and permits some misclassifications or violations. It finds a compromise between widening the margin and reducing violations.
7. C: The regularisation parameter C balances margin maximisation against the penalty for misclassification. A larger value of C imposes a stricter penalty on margin violations and misclassified points, giving a narrower margin with fewer violations, while a smaller C tolerates more violations in exchange for a wider margin (see the sketch after this list).
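A minimal sketch tying the terminology above to code. It assumes scikit-learn's SVC and synthetic data from make_blobs, neither of which appears in these slides:

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two well-separated clusters stand in for a linearly separable dataset.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear", C=1.0)   # kernel and regularisation parameter C
clf.fit(X, y)

print(clf.coef_, clf.intercept_)    # w and b of the hyperplane wx + b = 0
print(clf.support_vectors_)         # the support vectors closest to the hyperplane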
MAXIMAL MARGIN CLASSIFIER
MAXIMAL MARGIN CLASSIFIER
SEPARATING HYPERPLANE
• We know that logistic regression estimates linear decision boundaries in classification problems. This idea is the basis for support vector classifiers.

• Let us take a look at the following artificially generated dataset. The dataset contains two predictors, and the observations come from two classes (blue circles and orange triangles).
o The two classes are well separated, and a straight line can be used for classification.
o This situation is called linearly separable.
MAXIMAL MARGIN CLASSIFIER
SEPARATING HYPERPLANE

• Formally, in a linearly separable case, there exist w and b such that the line wx + b = 0 perfectly separates the two classes.

• If we code the response as y = +1 or y = -1 for the two classes, then for a data point with predictor values x_i and response y_i, we have the relation y_i (wx_i + b) > 0.

• In other words, with p predictors x_1, ..., x_p, we will use a linear combination of the form f(x) = b + w_1 x_1 + ... + w_p x_p.
MAXIMAL MARGIN CLASSIFIER
SEPARATING HYPERPLANE

• One crucial property of a separating hyperplane is that y_i (wx_i + b) > 0 for every training observation i.

• When such a separating hyperplane exists, we can construct a natural classifier: for a new observation x*, the predicted class is sign(wx* + b).

• In other words, a test observation is assigned a class depending on which side of the hyperplane it is located.
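A minimal sketch of this classification rule, using a hypothetical weight vector w and intercept b rather than fitted values:

import numpy as np

w = np.array([2.0, -1.0])   # hypothetical hyperplane weights
b = 0.5                     # hypothetical intercept

def predict(x):
    # Assign the class by the sign of wx + b, i.e. by the side of the hyperplane.
    return 1 if np.dot(w, x) + b > 0 else -1

print(predict(np.array([1.0, 0.0])))    # wx + b = 2.5 > 0, so class +1
print(predict(np.array([-2.0, 1.0])))   # wx + b = -4.5 < 0, so class -1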
MAXIMAL MARGIN CLASSIFIER
MAXIMAL MARGIN

• One possible way to find such a separating hyperplane is to minimize the distance of misclassified points to the decision boundary.

• If a response with y_i = +1 is misclassified, then wx_i + b < 0, and the opposite holds for a misclassified response with y_i = -1.

• Also, the magnitude of wx_i + b tells us how far the observation is located from the boundary.

• Thus, we can minimize

D(w, b) = - Σ_{i in M} y_i (wx_i + b)

with respect to w and b, where M indexes the set of misclassified observations.


MAXIMAL MARGIN CLASSIFIER
MAXIMAL MARGIN

• However, when the problem is linearly separable, there are infinitely many such separating hyperplanes.

• In this perfectly separable case, given any separating hyperplane, we can define the margin as the minimal distance from the observations to the hyperplane.

• The optimal classification rule is the line that maximizes the margin around the separating line.

• Such a classifier is called the maximal margin classifier.

(Figure: Example of multiple separating hyperplanes for the same data.)
MAXIMAL MARGIN CLASSIFIER
MAXIMAL MARGIN

• Formally, we consider the optimization problem:

maximize M with respect to w and b, subject to d_i ≥ M for all i = 1, ..., n,

where M is the margin and d_i = y_i (wx_i + b) / ||w|| is the distance between the hyperplane and the i-th training point.

• It can be shown that the optimization problem above is equivalent to:

minimize (1/2) ||w||^2 subject to y_i (wx_i + b) ≥ 1 for all i = 1, ..., n,

with respect to w and b.
MAXIMAL MARGIN CLASSIFIER
MAXIMAL MARGIN
• Once we obtain the estimators ŵ and b̂, the optimal separating hyperplane is ŵx + b̂ = 0.
• Thus, our classification rule is as follows: for a new observation x*, predict the class sign(ŵx* + b̂).

• In the figure, the optimal separating line is shown as the solid red line, the closest points to the line are circled, and the separation between the classes is shown using the dashed black lines.
• Notice that there are a few points (circled) that are closest, and equidistant, to the red separating line.
MAXIMAL MARGIN CLASSIFIER
MAXIMAL MARGIN
• These three points lie along the dashed lines indicating the width of the margin; that is, these points satisfy the condition y_i (ŵx_i + b̂) = 1.

• These points are called the support vectors for this problem. It can be shown that the support vectors alone are enough to fully define the optimal classification rule.

• This property of the maximal margin classifier is important in the development of the support vector classifier and the support vector machine.
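A minimal sketch of this property, assuming scikit-learn's SVC with a very large C (which approximates the maximal margin classifier) on separable synthetic data: refitting on the support vectors alone reproduces essentially the same hyperplane.

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.8, random_state=1)

full = SVC(kernel="linear", C=1e6).fit(X, y)           # near hard-margin fit
sv = full.support_                                     # indices of the support vectors

refit = SVC(kernel="linear", C=1e6).fit(X[sv], y[sv])  # fit on support vectors only

print(full.coef_, full.intercept_)    # the two hyperplanes should agree
print(refit.coef_, refit.intercept_)  # up to numerical tolerance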
SUPPORT VECTOR CLASSIFIERS
SUPPORT VECTOR CLASSIFIERS
SUPPORT VECTOR / SOFT-MARGIN CLASSIFIER

• When the two classes are not linearly separable (see figure), we will not be able to find a line that entirely separates the groups, i.e. the maximal margin classifier cannot be computed.
o Thus, the two classes cannot be classified exactly.
o We can, however, generalize the ideas to develop a classification rule that almost separates the classes.

• To do so, we allow a few points to fall on the wrong side of the margin or the separating hyperplane.
SUPPORT VECTOR CLASSIFIERS
SUPPORT VECTOR / SOFT-MARGIN CLASSIFIER

• In the support vector classifier, each data point i is given a slack variable ξ_i ≥ 0 that allows individual data points to be on the wrong side of the margin or the separating hyperplane.

• The slack variable ξ_i quantifies where the i-th observation is located relative to the hyperplane and the margin:

o If ξ_i = 0, then the i-th data point is on the correct side of the margin.
o If ξ_i > 0, then the i-th data point is on the wrong side of the margin (it violates the margin).

o If ξ_i > 1, then the i-th data point is on the wrong side of the hyperplane.
SUPPORT VECTOR CLASSIFIERS
SUPPORT VECTOR / SOFT-MARGIN CLASSIFIER
• The support vector classifier then attempts to maximize the margin such that Σ_i ξ_i ≤ C, for a pre-specified constant C ≥ 0.
o C controls the number and severity of the violations to the margin and to the hyperplane that can be tolerated by the classifier.

• Formally, we solve the optimization problem:

maximize M with respect to w, b, and ξ_1, ..., ξ_n, subject to the constraints

d_i ≥ M (1 − ξ_i), ξ_i ≥ 0, and Σ_i ξ_i ≤ C,

where d_i = y_i (wx_i + b) / ||w|| and C is a nonnegative tuning parameter.


SUPPORT VECTOR CLASSIFIERS
SUPPORT VECTOR / SOFT-MARGIN CLASSIFIER
• As before, we can write an equivalent optimization problem:

minimize (1/2) ||w||^2 + C Σ_i ξ_i with respect to w, b, and ξ_1, ..., ξ_n, subject to y_i (wx_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i.

Note that in this penalized form the constant C multiplies the total violation, so a larger C means a stricter penalty on violations (the convention used in most software and in the terminology list earlier), whereas in the budget form above a larger C means more tolerance.

• Once we obtain the estimators ŵ and b̂, the estimated hyperplane is ŵx + b̂ = 0 and our classification rule is as follows: for a new observation x*, predict the class sign(ŵx* + b̂).

• The classification boundary is wx + b = 0, and the two margins are wx + b = 1 and wx + b = −1.
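A minimal sketch of the soft-margin classifier on overlapping synthetic data (assumed), using scikit-learn's SVC: decision_function returns wx + b, the boundary is where it equals 0, the margins are where it equals ±1, and the slack of each point is max(0, 1 − y(wx + b)).

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

scores = clf.decision_function(X)              # wx + b for every training point
y_pm = 2 * y - 1                               # recode the {0, 1} labels as {-1, +1}
slack = np.maximum(0.0, 1.0 - y_pm * scores)   # slack variable of each point

print(np.sum(slack > 0), "points violate the margin (nonzero slack)")
print(len(clf.support_), "support vectors")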


SUPPORT VECTOR CLASSIFIERS
SUPPORT VECTOR / SOFT-MARGIN CLASSIFIER
• As with the maximal margin classifier, the classifier is affected only by the observations that lie on the margin or violate the margin.

• Data points that lie strictly on the correct side of the margin do not affect the support vector classifier at all.

• In this case, data points that fall directly on the margin, or on the wrong side of the margin for their class, are known as support vectors.

• The circled points in the figure are the support vectors.
SUPPORT VECTOR MACHINES
SUPPORT VECTOR MACHINES
INCORPORATING NONLINEAR TERMS
• The support vector classifier described so far finds linear boundaries in the input feature space.
o Often linear effects of the covariates are not enough for a classification problem.
o We might want to incorporate nonlinear terms (e.g., square or cubic terms).

• In general, we can incorporate other nonlinear transformations as features. Once the basis functions are selected, the procedure is the same as before.

• We fit the support vector classifier using the transformed input features and produce a function that is nonlinear in the original predictors (see the sketch below).
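A minimal sketch of this idea, assuming synthetic make_moons data: expand the inputs with polynomial basis functions by hand and fit an ordinary linear support vector classifier in the enlarged feature space.

from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Degree-3 polynomial terms play the role of the nonlinear basis functions.
model = make_pipeline(
    PolynomialFeatures(degree=3),
    StandardScaler(),
    LinearSVC(C=1.0, max_iter=10000),
)
model.fit(X, y)
print(model.score(X, y))   # training accuracy of the linear fit in the expanded space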
SUPPORT VECTOR MACHINES
KERNEL TRICK

• Support vector machines (SVMs) generalize support vector classifiers by including nonlinear features in a specific way that allows us to add many such features as well as a high number of variables.

• Without going into the mathematics, SVM does so using the so-called kernel trick, i.e. by specifying a kernel function that controls which nonlinear features to include in the classifier.

• The solution to the support vector classifier problem can be represented as

f(x) = b + Σ_{i=1}^n α_i <x, x_i>,

where <·, ·> denotes the inner product and α_1, ..., α_n are coefficients, one per training observation.
SUPPORT VECTOR MACHINES
KERNEL TRICK
• To estimate α_1, ..., α_n and b, it can be shown that we only need all the pairwise inner products <x_i, x_i'> of the training data.
o Many of the resulting α_i are zero. The observations for which α_i is nonzero are called the support vectors.

• Therefore, for general nonlinear features h(x) = (h_1(x), ..., h_M(x)), the classifier can be computed using only the inner products <h(x_i), h(x_i')> and <h(x*), h(x_i)>.

• In fact, we need not specify the transformation h at all, but require only knowledge of the kernel function

K(x, x') = <h(x), h(x')>

that computes inner products in the transformed space.
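A minimal sketch of this point, assuming scikit-learn and synthetic data: passing a precomputed Gram matrix of ordinary inner products to SVC gives the same predictions as the built-in linear kernel, showing that only the pairwise inner products are needed.

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=80, centers=2, random_state=0)

gram = X @ X.T                                          # K(x_i, x_j) = <x_i, x_j>
precomp = SVC(kernel="precomputed", C=1.0).fit(gram, y)
linear = SVC(kernel="linear", C=1.0).fit(X, y)

# For prediction, only the inner products between new points and the training
# points are required; here we reuse the training Gram matrix.
print(np.array_equal(precomp.predict(gram), linear.predict(X)))   # should print True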
SUPPORT VECTOR MACHINES
POPULAR KERNEL FUNCTIONS
• Some popular choices for the kernel function in the SVM are:
o Linear: K(x, x') = <x, x'>
o Polynomial of degree d: K(x, x') = (1 + <x, x'>)^d
o Radial basis function: K(x, x') = exp(−γ ||x − x'||^2)

• For example, using the "linear" or "quadratic" (d = 2 polynomial) kernel will result in linear or quadratic classification boundaries, respectively.

• On the other hand, using a radial basis kernel captures other nonlinear features.

(Figure: Simulated two-class data with linear (blue dashed), quadratic (black dash-dotted) and radial basis (red solid) decision boundaries.)
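A minimal sketch comparing the kernels above on synthetic concentric-circles data (assumed), where a linear boundary cannot work but quadratic and radial basis boundaries can:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

kernels = [
    ("linear", {}),
    ("poly", {"degree": 2, "coef0": 1.0}),   # a degree-2 polynomial kernel
    ("rbf", {"gamma": "scale"}),             # radial basis kernel
]
for name, params in kernels:
    clf = SVC(kernel=name, C=1.0, **params).fit(X, y)
    # The linear kernel cannot separate the circles; the nonlinear kernels can.
    print(name, clf.score(X, y))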
SUPPORT VECTOR REGRESSION
SUPPORT VECTOR REGRESSION
MODEL

• SVM can be applied to regression problems with a quantitative response as well.

• Let us start with a linear regression model of the form

f(x) = wx + b,

where w and b are coefficients to be estimated from the training data.
SUPPORT VECTOR REGRESSION
MODEL
• Support vector regression solves the following problem:

minimize Σ_i V_ε(y_i − f(x_i)) + (λ/2) ||w||^2 with respect to w and b,

where the loss function has the form

V_ε(r) = 0 if |r| < ε, and V_ε(r) = |r| − ε otherwise.

• The penalty term (λ/2) ||w||^2 is called a shrinkage penalty. Here λ, a tuning parameter, controls the relative impact of the two terms on the regression coefficient estimates.
o For large λ, the quadratic penalty term dominates the criterion, and the resulting estimates approach zero.
o When λ = 0, there is no penalty, and the coefficients are chosen purely to minimize the ε-insensitive loss on the training data (analogous to ordinary least squares when the loss is squared error).
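A minimal sketch of the role of the shrinkage, assuming scikit-learn's SVR on synthetic data. Note that SVR is parameterised with a constant C multiplying the loss term rather than a λ multiplying the penalty, so C roughly plays the role of 1/λ: a very small C corresponds to heavy shrinkage.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X.ravel() + rng.normal(scale=0.2, size=100)   # noisy line with slope 0.5

for C in (0.001, 1.0, 100.0):
    reg = SVR(kernel="linear", C=C, epsilon=0.1).fit(X, y)
    # Heavy shrinkage (small C, i.e. large lambda) pulls the slope toward zero;
    # larger C recovers a slope close to 0.5.
    print(C, reg.coef_)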
SUPPORT VECTOR REGRESSION
MODEL

• The loss function V_ε ignores errors of size less than ε (see figure).

• It can be shown that the solution function has the form

f(x) = Σ_{i=1}^n (α_i* − α_i) <x, x_i> + b,

where α_i and α_i* are nonnegative constants. Typically, only a subset of the (α_i* − α_i) values are nonzero, and the associated data values are called the support vectors.

(Figure: The loss function for support vector regression with epsilon value 1. The blue dashed line is the usual squared error loss.)
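A minimal sketch of the ε-insensitive loss above, written directly in Python with ε = 1 as in the figure, alongside the squared error loss for comparison:

import numpy as np

def eps_insensitive(residual, eps=1.0):
    # V_eps(r) = 0 if |r| < eps, and |r| - eps otherwise.
    return np.maximum(np.abs(residual) - eps, 0.0)

r = np.array([-3.0, -0.5, 0.0, 0.5, 2.0])
print(eps_insensitive(r))   # [2.  0.  0.  0.  1.]  -- small errors are ignored
print(r ** 2)               # [9.    0.25  0.    0.25  4.  ]  -- squared error loss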
SUPPORT VECTOR REGRESSION
MODEL

• As was the case in support vector classification, the solution depends on the input values only through the inner products <x_i, x_i'>.

• Thus, we can generalize the method to richer spaces by defining an appropriate inner product and specifying the corresponding kernel function.
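A minimal sketch of this kernelised extension, assuming scikit-learn's SVR with a radial basis kernel on synthetic nonlinear data:

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)   # noisy sine curve

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(reg.score(X, y))                                  # R^2 of the nonlinear fit
print(len(reg.support_), "of", len(X), "points are support vectors")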
CODE
HTTPS://DRIVE.GOOGLE.COM/DRIVE/FOLDERS/1EORYNTWXWBWPRRKVDY2SEKUIUNOMXTBZ?
USP=SHARING
REFERENCES

• HANDS-ON MACHINE LEARNING WITH SCIKIT-LEARN, KERAS, AND TENSORFLOW: CONCEPTS, TOOLS, AND TECHNIQUES TO BUILD INTELLIGENT SYSTEMS BY GERON.

• AN INTRODUCTION TO STATISTICAL LEARNING BY JAMES, WITTEN, HASTIE, AND TIBSHIRANI.

• THE ELEMENTS OF STATISTICAL LEARNING BY HASTIE, TIBSHIRANI, AND FRIEDMAN.

July 2023
