Support Vector Machine


DTSC6014001 – MACHINE LEARNING

SUPPORT VECTOR MACHINE


SESSION 10

SUBJECT MATTER EXPERT


Lili Ayu Wulandhari, S.Si., M.Sc., Ph.D.
Learning Outcomes
LO1: Identify machine learning problems and tasks
LO2: Explain fundamental concepts of machine learning
LO3: Construct machine learning models for a given problem using Python
LO4: Analyze the results of the machine learning modeling process
OUTLINE

• Maximal margin classifier
• Support vector classifiers
• Support vector machines
• Support vector regression
MAXIMAL MARGIN CLASSIFIER
INTRODUCTION

Support vector machine (SVM) is:

• One of the most powerful, versatile, and popular machine learning models

• Capable of performing linear or nonlinear classification, regression, and even outlier detection

• Well-suited for classification of complex small or medium-sized datasets
MAXIMAL MARGIN CLASSIFIER
INTRODUCTION

• SVM is a family of classification rules that contains both parametric (e.g., linear) and nonparametric (e.g., kernel-based) methods.

• It can be viewed as a generalization of linear decision boundaries for classification.

• In particular, SVM produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space.

• We will mainly discuss three approaches: the maximal margin classifier, the support vector classifier, and the support vector machine. People often loosely refer to these methods collectively as support vector machines.
SVM TERMINOLOGY

1. Hyperplane: The hyperplane is the decision boundary used to separate the data points of different classes in a feature space. In the case of linear classification, it is a linear equation, i.e. wx + b = 0.
2. Support Vectors: Support vectors are the data points closest to the hyperplane, which play a critical role in deciding the hyperplane and the margin.
3. Margin: The margin is the distance between the support vectors and the hyperplane. The main objective of the support vector machine algorithm is to maximize the margin; a wider margin generally indicates better classification performance.
4. Kernel: The kernel is the mathematical function used in SVM to map the original input data points into high-dimensional feature spaces, so that a separating hyperplane can be found even if the data points are not linearly separable in the original input space. Some common kernel functions are linear, polynomial, radial basis function (RBF), and sigmoid.
5. Hard Margin: The maximum-margin hyperplane, or hard-margin hyperplane, is a hyperplane that properly separates the data points of different categories without any misclassifications.
6. Soft Margin: When the data is not perfectly separable or contains outliers, SVM permits a soft-margin technique. The soft-margin formulation introduces a slack variable for each data point, which relaxes the strict margin requirement and permits some misclassifications or violations. It finds a compromise between widening the margin and reducing violations.
7. C: The regularisation parameter C balances margin maximisation against the penalty for misclassification. A larger value of C imposes a stricter penalty on margin violations and misclassified points, giving a narrower margin with fewer violations, while a smaller C tolerates more violations in exchange for a wider margin (see the sketch after this list).
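A minimal sketch tying the terminology above to code. It assumes scikit-learn's SVC and synthetic data from make_blobs, neither of which appears in these slides:

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two well-separated clusters stand in for a linearly separable dataset.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear", C=1.0)   # kernel and regularisation parameter C
clf.fit(X, y)

print(clf.coef_, clf.intercept_)    # w and b of the hyperplane wx + b = 0
print(clf.support_vectors_)         # the support vectors closest to the hyperplane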
MAXIMAL MARGIN CLASSIFIER
MAXIMAL MARGIN CLASSIFIER
SEPARATING HYPERPLANE
• We know that logistic regression estimates linear decision boundaries in classification problems. This idea is the basis for support vector classifiers.

• Let us take a look at the following artificially generated dataset. The dataset contains two predictors, and the observations come from two classes (blue circles and orange triangles).
o The two classes are well separated, and a straight line can be used for classification.
o This situation is called linearly separable.
MAXIMAL MARGIN CLASSIFIER
SEPARATING HYPERPLANE

• Formally, in a linearly separable case, there exist w and b such that the line wx + b = 0 perfectly separates the two classes.

• If we code the response as y = +1 or y = -1 for the two classes, then for a data point with predictor values x_i and response y_i, we have the relation y_i (wx_i + b) > 0.

• In other words, with p predictors x_1, ..., x_p, we will use a linear combination of the form f(x) = b + w_1 x_1 + ... + w_p x_p.
MAXIMAL MARGIN CLASSIFIER
SEPARATING HYPERPLANE

• One crucial property of a separating hyperplane is that y_i (wx_i + b) > 0 for every training observation i.

• When such a separating hyperplane exists, we can construct a natural classifier: for a new observation x*, the predicted class is sign(wx* + b).

• In other words, a test observation is assigned a class depending on which side of the hyperplane it is located.
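A minimal sketch of this classification rule, using a hypothetical weight vector w and intercept b rather than fitted values:

import numpy as np

w = np.array([2.0, -1.0])   # hypothetical hyperplane weights
b = 0.5                     # hypothetical intercept

def predict(x):
    # Assign the class by the sign of wx + b, i.e. by the side of the hyperplane.
    return 1 if np.dot(w, x) + b > 0 else -1

print(predict(np.array([1.0, 0.0])))    # wx + b = 2.5 > 0, so class +1
print(predict(np.array([-2.0, 1.0])))   # wx + b = -4.5 < 0, so class -1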
MAXIMAL MARGIN CLASSIFIER
MAXIMAL MARGIN

• One possible way to find such a separating hyperplane is to minimize the distance of misclassified points to the decision boundary.

• If a response with y_i = +1 is misclassified, then wx_i + b < 0, and the opposite holds for a misclassified response with y_i = -1.

• Also, the magnitude of wx_i + b tells us how far the observation is located from the boundary.

• Thus, we can minimize

D(w, b) = - Σ_{i in M} y_i (wx_i + b)

with respect to w and b, where M indexes the set of misclassified observations.


MAXIMAL MARGIN CLASSIFIER
MAXIMAL MARGIN

• However, when the problem is linearly separable, there are infinitely many such separating hyperplanes.

• In this perfectly separable case, given any separating hyperplane, we can define the margin as the minimal distance from the observations to the hyperplane.

• The optimal classification rule is the line that maximizes the margin around the separating line.

• Such a classifier is called the maximal margin classifier.

(Figure: Example of multiple separating hyperplanes for the same data.)
MAXIMAL MARGIN CLASSIFIER
MAXIMAL MARGIN

• Formally, we consider the optimization problem:

maximize M with respect to w and b, subject to d_i ≥ M for all i = 1, ..., n,

where M is the margin and d_i = y_i (wx_i + b) / ||w|| is the distance between the hyperplane and the i-th training point.

• It can be shown that the optimization problem above is equivalent to:

minimize (1/2) ||w||^2 subject to y_i (wx_i + b) ≥ 1 for all i = 1, ..., n,

with respect to w and b.
MAXIMAL MARGIN CLASSIFIER
MAXIMAL MARGIN
• Once we obtain the estimators ŵ and b̂, the optimal separating hyperplane is ŵx + b̂ = 0.
• Thus, our classification rule is as follows: for a new observation x*, predict the class sign(ŵx* + b̂).

• In the figure, the optimal separating line is shown as the solid red line, the closest points to the line are circled, and the separation between the classes is shown using the dashed black lines.
• Notice that there are a few points (circled) that are closest, and equidistant, to the red separating line.
MAXIMAL MARGIN CLASSIFIER
MAXIMAL MARGIN
• These three points lie along the dashed lines indicating the width of the margin; that is, these points satisfy the condition y_i (ŵx_i + b̂) = 1.

• These points are called the support vectors for this problem. It can be shown that the support vectors alone are enough to fully define the optimal classification rule.

• This property of the maximal margin classifier is important in the development of the support vector classifier and the support vector machine.
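A minimal sketch of this property, assuming scikit-learn's SVC with a very large C (which approximates the maximal margin classifier) on separable synthetic data: refitting on the support vectors alone reproduces essentially the same hyperplane.

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.8, random_state=1)

full = SVC(kernel="linear", C=1e6).fit(X, y)           # near hard-margin fit
sv = full.support_                                     # indices of the support vectors

refit = SVC(kernel="linear", C=1e6).fit(X[sv], y[sv])  # fit on support vectors only

print(full.coef_, full.intercept_)    # the two hyperplanes should agree
print(refit.coef_, refit.intercept_)  # up to numerical tolerance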
SUPPORT VECTOR CLASSIFIERS
SUPPORT VECTOR CLASSIFIERS
SUPPORT VECTOR / SOFT-MARGIN CLASSIFIER

• When the two classes are not linearly separable (see figure), we will not be able to find a line that entirely separates the groups, i.e. the maximal margin classifier cannot be computed.
o Thus, the two classes cannot be classified exactly.
o We can, however, generalize the ideas to develop a classification rule that almost separates the classes.

• To do so, we allow a few points to fall on the wrong side of the margin or the separating hyperplane.
SUPPORT VECTOR CLASSIFIERS
SUPPORT VECTOR / SOFT-MARGIN CLASSIFIER

• In the support vector classifier, each data point i is given a slack variable ξ_i ≥ 0 that allows individual data points to be on the wrong side of the margin or the separating hyperplane.

• The slack variable ξ_i quantifies where the i-th observation is located relative to the hyperplane and the margin:

o If ξ_i = 0, then the i-th data point is on the correct side of the margin.
o If ξ_i > 0, then the i-th data point is on the wrong side of the margin (it violates the margin).

o If ξ_i > 1, then the i-th data point is on the wrong side of the hyperplane.
SUPPORT VECTOR CLASSIFIERS
SUPPORT VECTOR / SOFT-MARGIN CLASSIFIER
• The support vector classifier then attempts to maximize the margin such that Σ_i ξ_i ≤ C, for a pre-specified constant C ≥ 0.
o C controls the number and severity of the violations to the margin and to the hyperplane that can be tolerated by the classifier.

• Formally, we solve the optimization problem:

maximize M with respect to w, b, and ξ_1, ..., ξ_n, subject to the constraints

d_i ≥ M (1 − ξ_i), ξ_i ≥ 0, and Σ_i ξ_i ≤ C,

where d_i = y_i (wx_i + b) / ||w|| and C is a nonnegative tuning parameter.


SUPPORT VECTOR CLASSIFIERS
SUPPORT VECTOR / SOFT-MARGIN CLASSIFIER
• As before, we can write an equivalent optimization problem:

minimize (1/2) ||w||^2 + C Σ_i ξ_i with respect to w, b, and ξ_1, ..., ξ_n, subject to y_i (wx_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i.

Note that in this penalized form the constant C multiplies the total violation, so a larger C means a stricter penalty on violations (the convention used in most software and in the terminology list earlier), whereas in the budget form above a larger C means more tolerance.

• Once we obtain the estimators ŵ and b̂, the estimated hyperplane is ŵx + b̂ = 0 and our classification rule is as follows: for a new observation x*, predict the class sign(ŵx* + b̂).

• The classification boundary is wx + b = 0, and the two margins are wx + b = 1 and wx + b = −1.
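A minimal sketch of the soft-margin classifier on overlapping synthetic data (assumed), using scikit-learn's SVC: decision_function returns wx + b, the boundary is where it equals 0, the margins are where it equals ±1, and the slack of each point is max(0, 1 − y(wx + b)).

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

scores = clf.decision_function(X)              # wx + b for every training point
y_pm = 2 * y - 1                               # recode the {0, 1} labels as {-1, +1}
slack = np.maximum(0.0, 1.0 - y_pm * scores)   # slack variable of each point

print(np.sum(slack > 0), "points violate the margin (nonzero slack)")
print(len(clf.support_), "support vectors")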


SUPPORT VECTOR CLASSIFIERS
SUPPORT VECTOR / SOFT-MARGIN CLASSIFIER
• As with the maximal margin classifier, the classifier is affected only by the observations that lie on the margin or violate the margin.

• Data points that lie strictly on the correct side of the margin do not affect the support vector classifier at all.

• In this case, data points that fall directly on the margin, or on the wrong side of the margin for their class, are known as support vectors.

• The circled points in the figure are the support vectors.
SUPPORT VECTOR MACHINES
SUPPORT VECTOR MACHINES
INCORPORATING NONLINEAR TERMS
• The support vector classifier described so far finds linear boundaries in the input feature space.
o Often linear effects of the covariates are not enough for a classification problem.
o We might want to incorporate nonlinear terms (e.g., square or cubic terms).

• In general, we can incorporate other nonlinear transformations as features. Once the basis functions are selected, the procedure is the same as before.

• We fit the support vector classifier using the transformed input features and produce a function that is nonlinear in the original predictors (see the sketch below).
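A minimal sketch of this idea, assuming synthetic make_moons data: expand the inputs with polynomial basis functions by hand and fit an ordinary linear support vector classifier in the enlarged feature space.

from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Degree-3 polynomial terms play the role of the nonlinear basis functions.
model = make_pipeline(
    PolynomialFeatures(degree=3),
    StandardScaler(),
    LinearSVC(C=1.0, max_iter=10000),
)
model.fit(X, y)
print(model.score(X, y))   # training accuracy of the linear fit in the expanded space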
SUPPORT VECTOR MACHINES
KERNEL TRICK

• Support vector machines (SVMs) generalize support vector classifiers by including nonlinear features in a specific way that allows us to add many such features as well as a high number of variables.

• Without going into the mathematics, SVM does so using the so-called kernel trick, i.e. by specifying a kernel function that controls which nonlinear features to include in the classifier.

• The solution to the support vector classifier problem can be represented as

f(x) = b + Σ_{i=1}^n α_i <x, x_i>,

where <·, ·> denotes the inner product and α_1, ..., α_n are coefficients, one per training observation.
SUPPORT VECTOR MACHINES
KERNEL TRICK
• To estimate α_1, ..., α_n and b, it can be shown that we only need all the pairwise inner products <x_i, x_i'> of the training data.
o Many of the resulting α_i are zero. The observations for which α_i is nonzero are called the support vectors.

• Therefore, for general nonlinear features h(x) = (h_1(x), ..., h_M(x)), the classifier can be computed using only the inner products <h(x_i), h(x_i')> and <h(x*), h(x_i)>.

• In fact, we need not specify the transformation h at all, but require only knowledge of the kernel function

K(x, x') = <h(x), h(x')>

that computes inner products in the transformed space.
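A minimal sketch of this point, assuming scikit-learn and synthetic data: passing a precomputed Gram matrix of ordinary inner products to SVC gives the same predictions as the built-in linear kernel, showing that only the pairwise inner products are needed.

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=80, centers=2, random_state=0)

gram = X @ X.T                                          # K(x_i, x_j) = <x_i, x_j>
precomp = SVC(kernel="precomputed", C=1.0).fit(gram, y)
linear = SVC(kernel="linear", C=1.0).fit(X, y)

# For prediction, only the inner products between new points and the training
# points are required; here we reuse the training Gram matrix.
print(np.array_equal(precomp.predict(gram), linear.predict(X)))   # should print True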
SUPPORT VECTOR MACHINES
POPULAR KERNEL FUNCTIONS
• Some popular choices for the kernel function in the SVM are:
o Linear: K(x, x') = <x, x'>
o Polynomial of degree d: K(x, x') = (1 + <x, x'>)^d
o Radial basis function: K(x, x') = exp(−γ ||x − x'||^2)

• For example, using the "linear" or "quadratic" (d = 2 polynomial) kernel will result in linear or quadratic classification boundaries, respectively.

• On the other hand, using a radial basis kernel captures other nonlinear features.

(Figure: Simulated two-class data with linear (blue dashed), quadratic (black dash-dotted) and radial basis (red solid) decision boundaries.)
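A minimal sketch comparing the kernels above on synthetic concentric-circles data (assumed), where a linear boundary cannot work but quadratic and radial basis boundaries can:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

kernels = [
    ("linear", {}),
    ("poly", {"degree": 2, "coef0": 1.0}),   # a degree-2 polynomial kernel
    ("rbf", {"gamma": "scale"}),             # radial basis kernel
]
for name, params in kernels:
    clf = SVC(kernel=name, C=1.0, **params).fit(X, y)
    # The linear kernel cannot separate the circles; the nonlinear kernels can.
    print(name, clf.score(X, y))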
SUPPORT VECTOR REGRESSION
SUPPORT VECTOR REGRESSION
MODEL

• SVM can be applied to regression problems with a quantitative response as well.

• Let us start with a linear regression model of the form

f(x) = wx + b,

where w and b are coefficients to be estimated from the training data.
SUPPORT VECTOR REGRESSION
MODEL
• Support vector regression solves the following problem:

minimize Σ_i V_ε(y_i − f(x_i)) + (λ/2) ||w||^2 with respect to w and b,

where the loss function has the form

V_ε(r) = 0 if |r| < ε, and V_ε(r) = |r| − ε otherwise.

• The penalty term (λ/2) ||w||^2 is called a shrinkage penalty. Here λ, a tuning parameter, controls the relative impact of the two terms on the regression coefficient estimates.
o For large λ, the quadratic penalty term dominates the criterion, and the resulting estimates approach zero.
o When λ = 0, there is no penalty, and the coefficients are chosen purely to minimize the ε-insensitive loss on the training data (analogous to ordinary least squares when the loss is squared error).
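A minimal sketch of the role of the shrinkage, assuming scikit-learn's SVR on synthetic data. Note that SVR is parameterised with a constant C multiplying the loss term rather than a λ multiplying the penalty, so C roughly plays the role of 1/λ: a very small C corresponds to heavy shrinkage.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X.ravel() + rng.normal(scale=0.2, size=100)   # noisy line with slope 0.5

for C in (0.001, 1.0, 100.0):
    reg = SVR(kernel="linear", C=C, epsilon=0.1).fit(X, y)
    # Heavy shrinkage (small C, i.e. large lambda) pulls the slope toward zero;
    # larger C recovers a slope close to 0.5.
    print(C, reg.coef_)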
SUPPORT VECTOR REGRESSION
MODEL

• The loss function V_ε ignores errors of size less than ε (see figure).

• It can be shown that the solution function has the form

f(x) = Σ_{i=1}^n (α_i* − α_i) <x, x_i> + b,

where α_i and α_i* are nonnegative constants. Typically, only a subset of the (α_i* − α_i) values are nonzero, and the associated data values are called the support vectors.

(Figure: The loss function for support vector regression with epsilon value 1. The blue dashed line is the usual squared error loss.)
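A minimal sketch of the ε-insensitive loss above, written directly in Python with ε = 1 as in the figure, alongside the squared error loss for comparison:

import numpy as np

def eps_insensitive(residual, eps=1.0):
    # V_eps(r) = 0 if |r| < eps, and |r| - eps otherwise.
    return np.maximum(np.abs(residual) - eps, 0.0)

r = np.array([-3.0, -0.5, 0.0, 0.5, 2.0])
print(eps_insensitive(r))   # [2.  0.  0.  0.  1.]  -- small errors are ignored
print(r ** 2)               # [9.    0.25  0.    0.25  4.  ]  -- squared error loss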
SUPPORT VECTOR REGRESSION
MODEL

• As was the case in support vector classification, the solution depends on the input values only through the inner products <x_i, x_i'>.

• Thus, we can generalize the method to richer spaces by defining an appropriate inner product and specifying the corresponding kernel function.
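A minimal sketch of this kernelised extension, assuming scikit-learn's SVR with a radial basis kernel on synthetic nonlinear data:

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)   # noisy sine curve

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(reg.score(X, y))                                  # R^2 of the nonlinear fit
print(len(reg.support_), "of", len(X), "points are support vectors")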
CODE
HTTPS://DRIVE.GOOGLE.COM/DRIVE/FOLDERS/1EORYNTWXWBWPRRKVDY2SEKUIUNOMXTBZ?
USP=SHARING
REFERENCES

• HANDS-ON MACHINE LEARNING WITH SCIKIT-LEARN, KERAS, AND TENSORFLOW: CONCEPTS, TOOLS, AND TECHNIQUES TO BUILD INTELLIGENT SYSTEMS BY GERON.

• AN INTRODUCTION TO STATISTICAL LEARNING BY JAMES, WITTEN, HASTIE, AND TIBSHIRANI.

• THE ELEMENTS OF STATISTICAL LEARNING BY HASTIE, TIBSHIRANI, AND FRIEDMAN.

July 2023
