Unit-III - SVM


DEPARTMENT OF STATISTICS & OPERATIONS RESEARCH

AMU, ALIGARH - 202002, U.P. (INDIA)

MACHINE LEARNING (DSM 2002)


M.SC. II SEMESTER (DATA SCIENCE)
2022-23

DR ZAHID AHMED ANSARI


2

UNIT-III
Support Vector Machines
(SVMs)
Dr. Zahid Ahmed Ansari 5/10/2023
3

OVERVIEW

• Overview
• Separating Hyperplane
• Maximal Margin Classifier
• Support Vector Classifier (SVC):
• linear classification and
• classification with non-linear decision boundaries,
• SVM versus SVC
• SVM with more than 2 classes:
• One-versus-One and
• One-versus- All case,
• Kernel Functions

Dr. Zahid Ahmed Ansari 5/10/2023


HYPERPLANE

• A hyperplane divides a d-dimensional space into two parts; the hyperplane itself is a (d−1)-dimensional subspace.
• In the diagram, the green 2-dimensional hyperplane separates the two classes, red and blue, that live in the three-dimensional space.
• In general, data in d dimensions is split into two parts by a (d−1)-dimensional hyperplane.
• For instance, a point (0-D) divides a line (1-D) into two parts, a line (1-D) divides a plane (2-D) into two parts, and a plane (2-D) divides a three-dimensional space into two parts.
5

MAXIMAL MARGIN CLASSIFIER


• This classifier is designed specifically for linearly separable data, i.e., data that can be separated by a hyperplane.
• But what is linearly separable data?

• Linearly and non-linearly separable data are illustrated in the diagram above.
• Linearly separable data is distributed in such a way that it can be classified with a straight line or, more generally, a hyperplane.
• Non-linearly separable data, on the other hand, cannot be separated by a simple straight line and requires a more complex classifier.
6

THERE CAN BE AN INFINITE NO OF HYPERPLANES

• However, as shown in the diagram below, there can be an infinite number of


hyperplanes that will classify the linearly separable classes.

Dr. Zahid Ahmed Ansari 5/10/2023


7

MAXIMAL-MARGIN CLASSIFIER
• How do we choose the hyperplane that we really need?
• The Maximal-Margin Classifier chooses the optimal hyperplane based on the maximum margin.
• The dotted lines parallel to the hyperplane in the following diagram are the margins, and the distance between these two dotted lines is the maximum margin.
• Each margin passes through the points of its class that are nearest to the hyperplane; these points lie at a perpendicular (90°) distance from the hyperplane and are referred to as "support vectors". The support vectors are shown as circles in the diagram below.
• Because this classifier chooses the hyperplane with the maximum margin, it is known as the Maximal-Margin Classifier.
8

DRAWBACKS

• The Maximal Margin Classifier relies heavily on the support vectors and changes whenever the support vectors change; as a result, it tends to overfit.
• It cannot be used for data that is not linearly separable. Since the majority of real-world data is non-linear, this classifier is often impractical.
• The Maximal Margin Classifier is also known as a "Hard Margin Classifier" because it prevents any misclassification and ensures that no point crosses the margin. This hard margin is what makes it prone to overfitting.
• The "Support Vector Classifier", an extension of the Maximal Margin Classifier, was introduced to address these problems.

Dr. Zahid Ahmed Ansari 5/10/2023


9

SUPPORT VECTOR CLASSIFIER


• The Support Vector Classifier is an extension of the Maximal Margin Classifier.
• It is less sensitive to individual data points because it allows some of them to be misclassified.
• It is also known as the "Soft Margin Classifier": it defines a budget within which misclassification is allowed.
• Some points may therefore fall on the wrong side, as shown in the following diagram.
• In this scenario, the points inside the margin and on the margin are referred to as "support vectors",
• whereas in the Maximal-Margin Classifier only the points exactly on the margins were the support vectors.

5/10/2023
10

THE INFLUENCE OF C’S VALUE ON THE MARGIN

• The margin widens as the budget for misclassification increases, and narrows as the budget decreases.
• While building the model we use a hyperparameter called "Cost", denoted by "C". Cost is the inverse of the budget: as the budget increases, the cost decreases, and vice versa.
• The influence of C’s value on the margin is depicted in the diagram below. When the value is small, for example C = 1, the margin widens, while when the value is high the margin narrows down.

Small ‘C’ value --> large budget --> wide margin --> allows more misclassification
Large ‘C’ value --> small budget --> narrow margin --> allows less misclassification
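• A minimal sketch of this effect (assuming scikit-learn's SVC, which is the convention used later in these notes): with a linear kernel the distance between the two margin lines is 2/||w||, so a small C should give a wider margin and more support vectors.

# Sketch: effect of C on margin width and number of support vectors (scikit-learn assumed)
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)
for C in (0.01, 1, 100):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])      # distance between the two margin lines
    n_sv = len(clf.support_vectors_)               # points on or inside the margin
    print(f"C={C:<6} margin width={margin:.2f}  support vectors={n_sv}")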
5/10/2023
11

DRAWBACK

• This classifier can only perform linear classification; it becomes ineffective when the decision boundary is non-linear.
• Note the difference:
Maximal Margin Classifier --> Hard Margin Classifier
Support Vector Classifier --> Soft Margin Classifier
• Both the Maximal Margin Classifier and the Support Vector Classifier are restricted to data that can be separated linearly.

Dr. Zahid Ahmed Ansari 5/10/2023


12

SUPPORT VECTOR MACHINES

• Support Vector Machines are an extension of the Soft Margin Classifier that can also perform non-linear classification by using a kernel. As a result, this algorithm performs well on the majority of real-world problems, since real-world data is mostly not linearly separable and requires a more complex classifier.
• Kernel: it transforms non-linearly separable data from a lower-dimensional to a higher-dimensional space to facilitate linear classification, as illustrated in the figure below. We use the kernel-based technique because separation can be simpler in higher dimensions.
• The kernel transforms the data from lower to higher dimensions using mathematical functions.

Dr. Zahid Ahmed Ansari 5/10/2023


13

SEPARATION MAY BE EASIER IN HIGHER DIMENSIONS

Dr. Zahid Ahmed Ansari 5/10/2023


14

SUPPORT VECTOR MACHINE ALGORITHM


• Support Vector Machine or SVM is one of the most
popular Supervised Learning algorithms, which is used for
Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine
Learning.
• The goal of the SVM algorithm is to create the best line or
decision boundary that can segregate n-dimensional
space into classes so that we can easily put the new data
point in the correct category in the future. This best
decision boundary is called a hyperplane.
• SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed the Support Vector Machine.
• Consider the below diagram in which there are two
different categories that are classified using a decision
boundary or hyperplane:
Dr. Zahid Ahmed Ansari
15

SVM BASED CLASSIFICATION EXAMPLE


• Suppose we see a strange cat that also has some features of a dog. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm.
• We first train the model with many images of cats and dogs so that it can learn their different features, and then we test it on this strange creature.
• The SVM creates a decision boundary between the two classes (cat and dog) using the extreme cases (support vectors), so it considers the extreme examples of cats and dogs.
• On the basis of the support vectors, it will classify the new example as a cat. Consider the below diagram.
• The SVM algorithm can be used for face detection, image classification, text categorization, etc.

5/10/2023
16

TYPES OF SVM

• SVM can be of two types:


• Linear SVM: used for linearly separable data. If a dataset can be classified into two classes using a single straight line, the data is termed linearly separable and the classifier used is called a Linear SVM classifier.
• Non-linear SVM: used for non-linearly separable data. If a dataset cannot be classified using a straight line, the data is termed non-linear and the classifier used is called a Non-linear SVM classifier.

Dr. Zahid Ahmed Ansari 5/10/2023


17

SVM: IMPORTANT CONCEPTS

• The followings are important concepts in SVM −


• Hyperplane − There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of the SVM.
• The dimension of the hyperplane depends on the number of features in the dataset: if there are 2 features, the hyperplane is a straight line; if there are 3 features, it is a 2-dimensional plane.
• We always create the hyperplane that has the maximum margin, i.e., the maximum distance to the nearest data points.
• Support Vectors − The data points or vectors that are closest to the hyperplane and that affect its position are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.

Dr. Zahid Ahmed Ansari 5/11/2023


18

SVM: IMPORTANT CONCEPTS


• The followings are important concepts in SVM −
• Margin − the gap between the two lines drawn through the closest data points of the different classes. It can be calculated as the perpendicular distance from the hyperplane to the support vectors. A large margin is considered a good margin and a small margin a bad one.
• Maximal Margin and Optimal Hyperplane:
• The SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the points of each class that are closest to the boundary. These points are called support vectors.
• The distance between these vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.

Dr. Zahid Ahmed Ansari 5/11/2023


19

LINEAR SVM

• The working of the SVM algorithm can be


understood by using an example. Suppose we
have a dataset that has two tags (green and
blue), and the dataset has two features x1 and
x2. We want a classifier that can classify the
pair(x1, x2) of coordinates in either green or
blue. Consider the below image:

Dr. Zahid Ahmed Ansari 5/11/2023


20

LINEAR SVM

• Since this is a 2-D space, we can easily separate these two classes just by using a straight line. But there can be multiple lines that can separate these classes. Consider the below image:

Dr. Zahid Ahmed Ansari 5/11/2023


21

LINEAR SVM

• Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the points of each class that are closest to the boundary. These points are called support vectors. The distance between these vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.

Dr. Zahid Ahmed Ansari 5/11/2023


22

NON-LINEAR SVM

• If data is linearly arranged, then we can


separate it by using a straight line, but for non-
linear data, we cannot draw a single straight
line. Consider the below image:

Dr. Zahid Ahmed Ansari 5/11/2023


23

NON-LINEAR SVM

• To separate these data points, we need to add one more dimension. For linear data we used the two dimensions x and y, so for non-linear data we add a third dimension z, calculated as:
z = x² + y²
• By adding the third dimension, the sample space becomes as shown in the image below:
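• A short sketch of this idea (synthetic concentric-circle data and scikit-learn assumed): adding the feature z = x² + y² lets a plain linear classifier separate the two rings.

# Sketch: adding z = x^2 + y^2 makes circular data linearly separable (assumptions: make_circles, LinearSVC)
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
print("2-D accuracy:", LinearSVC(max_iter=10000).fit(X, y).score(X, y))    # poor: no straight line works
z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)
X3 = np.hstack([X, z])                                                     # add the third dimension z
print("3-D accuracy:", LinearSVC(max_iter=10000).fit(X3, y).score(X3, y))  # near perfect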

Dr. Zahid Ahmed Ansari 5/11/2023


24

NON-LINEAR SVM

• So now, SVM will divide the datasets into


classes in the following way. Consider the
below image:

Dr. Zahid Ahmed Ansari 5/11/2023


25

NON-LINEAR SVM

• Since we are now in 3-D space, the separating surface looks like a plane parallel to the x-y plane. If we convert it back to 2-D space at z = 1, it becomes:
• Hence we get a circle of radius 1 as the decision boundary for the non-linear data.

Dr. Zahid Ahmed Ansari 5/11/2023


26

SVM KERNELS

• In practice, the SVM algorithm is implemented with a kernel that transforms the input data space into the required form.
• A kernel function is a method used to take data as input and transform it into the required form for processing.
• The term "kernel" refers to the set of mathematical functions used in the Support Vector Machine that provide a window to manipulate the data.
• The kernel function transforms the training data so that a non-linear decision surface becomes a linear decision boundary in a higher-dimensional space. Essentially, it returns the inner product between two points in a suitable feature space.
• The kernel takes a low-dimensional input space and transforms it into a higher-dimensional space.
• In simple words, the kernel converts non-separable problems into separable problems by adding more dimensions.
• It makes SVM more powerful, flexible and accurate.

Dr. Zahid Ahmed Ansari 5/11/2023


27

RULES OF KERNEL FUNCTIONS


• There are certain rules that kernel functions must follow.
• These rules are the deciding factors for which kernel should be implemented for classification.
• One such rule is the moving window classifier, also called the window function. It is defined as:
• fn(x) = 1, if ∑ I(||x − xi|| ≤ h) I(yi = 1) > ∑ I(||x − xi|| ≤ h) I(yi = 0)
fn(x) = 0, otherwise.
• Here the summations run from i = 1 to n, I(·) is the indicator function, and 'h' is the width of the window. This rule assigns weight to the points xi at a fixed distance from 'x'.
• 'xi' are the points near 'x'.
• It is essential that the weights vary smoothly with the distance from x; such smooth weight functions are the kernel functions.
• The kernel (weight) function is represented as K: Rᵈ → R.
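• As an illustration only (not a library function), the moving-window rule above can be written directly in NumPy: count the class-1 and class-0 training points within distance h of x and predict 1 when the class-1 count is larger.

# Toy NumPy sketch of the moving window classifier fn(x) described above
import numpy as np

def window_classifier(x, X_train, y_train, h):
    inside = np.linalg.norm(X_train - x, axis=1) <= h      # points within the window of width h
    return int(np.sum(inside & (y_train == 1)) > np.sum(inside & (y_train == 0)))

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.2, 0.9]])
y_train = np.array([0, 0, 1, 1])
print(window_classifier(np.array([1.1, 1.0]), X_train, y_train, h=0.5))    # -> 1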

Dr. Zahid Ahmed Ansari 5/11/2023


28

TYPES OF KERNELS IN SVM

• The following are some of the types of kernels functions


used by SVM:
• Polynomial Kernel Function
• Gaussian Radial Basis Function (RBF) Kernel
• Sigmoid Kernel Function
• Linear Kernel Function
• Hyperbolic Tangent Kernel Function
• Graph Kernel Function
• String Kernel Function
• Tree Kernel Function

Dr. Zahid Ahmed Ansari 5/11/2023


29

POLYNOMIAL KERNEL

• The polynomial kernel is a general representation of kernels with degree greater than one. It is useful in image processing.
• There are two types:
• Homogeneous polynomial kernel function:
• K(xi, xj) = (xi · xj)ᵈ, where '·' is the dot product of the two vectors and d is the degree of the polynomial.
• Inhomogeneous polynomial kernel function:
• K(xi, xj) = (xi · xj + c)ᵈ, where c is a constant.
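• As a quick numerical check (xi, xj, d and c below are arbitrary example values), both forms can be computed directly with NumPy:

# Sketch: homogeneous and inhomogeneous polynomial kernels in NumPy
import numpy as np

xi, xj = np.array([1.0, 2.0]), np.array([3.0, 0.5])
d, c = 2, 1.0
print(np.dot(xi, xj) ** d)         # homogeneous:   (xi . xj)^d     -> 16.0
print((np.dot(xi, xj) + c) ** d)   # inhomogeneous: (xi . xj + c)^d -> 25.0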

Dr. Zahid Ahmed Ansari 5/11/2023


30

GAUSSIAN RBF KERNEL FUNCTION

• RBF stands for radial basis function. This kernel is used when there is no prior knowledge about the data.
• The RBF kernel, the one most commonly used in SVM classification, maps the input space into an infinite-dimensional space.
• It is represented as:
• K(xi, xj) = exp(−γ ||xi − xj||²)
• Here gamma (γ > 0) is typically chosen between 0 and 1 and must be specified manually in the learning algorithm. A commonly used default value of gamma is 0.1.
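• A minimal sketch (scikit-learn and a toy moons dataset assumed) computing one RBF kernel value by hand and fitting an RBF-kernel SVC with the default gamma mentioned above:

# Sketch: Gaussian RBF kernel value and an RBF-kernel SVC (scikit-learn, toy data assumed)
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

xi, xj, gamma = np.array([1.0, 2.0]), np.array([2.0, 0.0]), 0.1
print(np.exp(-gamma * np.sum((xi - xj) ** 2)))     # K(xi, xj) = exp(-gamma * ||xi - xj||^2)

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
clf = SVC(kernel='rbf', gamma=0.1).fit(X, y)
print("training accuracy:", clf.score(X, y))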

Dr. Zahid Ahmed Ansari 5/11/2023


31

SIGMOID KERNEL FUNCTION

• The sigmoid kernel function can be used as a proxy for neural networks. The equation is:
• K(xi, xj) = tanh(α xi · xj + c)
• It takes the inputs and maps them to values between 0 and 1 so that they can be separated by a simple straight line.

Dr. Zahid Ahmed Ansari 5/11/2023


32

LINEAR KERNEL

• This kernel is the most basic form of kernel in SVM. The equation is:
• K(xi, xj) = xi · xj + c
• where xi · xj is the dot product between any two observations xi and xj, and c is a constant.

Dr. Zahid Ahmed Ansari 5/11/2023


33

HYPERBOLIC TANGENT KERNEL FUNCTION

• This is also used in neural networks. The equation is:
• K(xi, xj) = tanh(k xi · xj + c)

Dr. Zahid Ahmed Ansari 5/11/2023


34

GRAPH KERNEL FUNCTION

• This kernel is used to compute the inner product on graphs. Graph kernels measure the similarity between pairs of graphs. They contribute to areas like bioinformatics, chemoinformatics, etc.

Dr. Zahid Ahmed Ansari 5/11/2023


35

STRING KERNEL FUNCTION

• This kernel operates on the basis of strings. It is mainly used in areas


like text classification. They are very useful in text mining, genome analysis, etc.

Dr. Zahid Ahmed Ansari 5/11/2023


36

TREE KERNEL FUNCTION

• This kernel is associated with tree-structured data.
• It compares data represented as trees and helps the SVM distinguish between them.
• This is helpful for language classification and is used in areas like NLP.

Dr. Zahid Ahmed Ansari 5/11/2023


37

SVM Classification Example

Dr. Zahid Ahmed Ansari 5/11/2023


38

SVM BASED CLASSIFICATION


• Here we will use the same dataset user_data.
#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable


x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

Dr. Zahid Ahmed Ansari 5/11/2023


39

# Splitting the dataset into training and test set.


from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
• After executing the above code, we will pre-process the data. The code will give
the dataset as:

Dr. Zahid Ahmed Ansari 5/11/2023


40

Dr. Zahid Ahmed Ansari 5/11/2023


41

• The scaled output for the test set will be:

Dr. Zahid Ahmed Ansari 5/11/2023


42

FITTING THE SVM CLASSIFIER TO THE TRAINING


SET
• Now the training set will be fitted to the SVM classifier. To create the SVM classifier, we will import the SVC class from the sklearn.svm library.
from sklearn.svm import SVC # "Support vector classifier"
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)
• In the above code, we have used kernel='linear', as here we are creating
SVM for linearly separable data. However, we can change it for non-linear
data. And then we fitted the classifier to the training dataset(x_train, y_train)

Dr. Zahid Ahmed Ansari 5/11/2023


43

OUTPUT

Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)
• The model performance can be altered by changing the value of
C(Regularization factor), gamma, and kernel.

Dr. Zahid Ahmed Ansari 5/11/2023


44

PREDICTING THE TEST SET RESULT

• Now, we will predict the output for test set. For this, we will create a new
vector y_pred. Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)
• After getting the y_pred vector, we can compare the result of y_pred and
y_test to check the difference between the actual value and predicted
value.

Dr. Zahid Ahmed Ansari 5/11/2023


45

• Output: Below is the output for the


prediction of the test set:

Dr. Zahid Ahmed Ansari 5/11/2023


46

CREATING THE CONFUSION MATRIX

• Now we will check the performance of the SVM classifier: how many incorrect predictions it makes compared to the Logistic Regression classifier. To create the confusion matrix, we need to import the confusion_matrix function from the sklearn.metrics module. After importing the function, we call it and store the result in a new variable cm. The function takes two parameters, y_true (the actual values) and y_pred (the values predicted by the classifier). Below is the code for it:

#Creating the Confusion matrix


from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

Dr. Zahid Ahmed Ansari 5/11/2023


47

OUTPUT: CONFUSION MATRIX

• As we can see in the above output image, there are 66 + 24 = 90 correct predictions and 8 + 2 = 10 incorrect predictions. Therefore we can say that our SVM model improved as compared to the Logistic Regression model.

Dr. Zahid Ahmed Ansari 5/11/2023


48

VISUALIZING THE TRAINING SET RESULT


from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
49

OUTPUT

• By executing the above code, we will get the


output as:
• As we can see, the above output is appearing
similar to the Logistic regression output. In the
output, we got the straight line as hyperplane
because we have used a linear kernel in the
classifier. And we have also discussed above
that for the 2d space, the hyperplane in SVM is
a straight line

Dr. Zahid Ahmed Ansari 5/11/2023


50

VISUALIZING THE TEST SET RESULT


from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
51

OUTPUT

• By executing the above code, we will get the


output as:
• As we can see in the above output image, the
SVM classifier has divided the users into two
regions (Purchased or Not purchased). Users
who purchased the SUV are in the red region
with the red scatter points. And users who did
not purchase the SUV are in the green region
with green scatter points. The hyperplane has divided the data into the two classes, Purchased and Not Purchased.

Dr. Zahid Ahmed Ansari 5/11/2023


52

SVM Classification Example 2

Dr. Zahid Ahmed Ansari 5/11/2023


53

SVM CLASSIFIER EXAMPLE 2

• The following is an example for creating an SVM classifier by using kernels.


We will be using iris dataset from scikit-learn.
• We will start by importing following packages −
import pandas as pd
import numpy as np
from sklearn import svm, datasets
import matplotlib.pyplot as plt

• Now, we need to load the input data −


iris = datasets.load_iris()

Dr. Zahid Ahmed Ansari 5/11/2023


54

• From this dataset, we take the first two features as follows −

X = iris.data[:, :2]
y = iris.target
• Next, we will plot the SVM boundaries with the original data as follows −
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max / x_min) / 100
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
X_plot = np.c_[xx.ravel(), yy.ravel()]
• Now, we need to provide the value of the regularization parameter as follows −
C = 1.0
• Next, the SVM classifier object can be created as follows −
svc_classifier = svm.SVC(kernel='linear', C=C).fit(X, y)
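• The output shown on the next slide comes from a plotting step that is not reproduced here; a sketch of that step (mirroring the rbf-kernel snippet that follows, variable names taken from the code above) would be:

# Sketch of the omitted plotting step for the linear-kernel classifier
Z = svc_classifier.predict(X_plot)
Z = Z.reshape(xx.shape)
plt.figure(figsize=(15, 5))
plt.subplot(121)
plt.contourf(xx, yy, Z, cmap=plt.cm.tab10, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('Support Vector Classifier with linear kernel')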

Dr. Zahid Ahmed Ansari 5/11/2023


55

OUTPUT
• Text(0.5, 1.0, 'Support Vector Classifier with linear kernel')

Dr. Zahid Ahmed Ansari 5/11/2023


56

EXAMPLE

• For creating an SVM classifier with the rbf kernel, we can change the kernel to rbf as follows −
svc_classifier = svm.SVC(kernel='rbf', gamma='auto', C=C).fit(X, y)
Z = svc_classifier.predict(X_plot)
Z = Z.reshape(xx.shape)
plt.figure(figsize=(15, 5))
plt.subplot(121)
plt.contourf(xx, yy, Z, cmap=plt.cm.tab10, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('Support Vector Classifier with rbf kernel')

Dr. Zahid Ahmed Ansari 5/11/2023


57

OUTPUT
• Text(0.5, 1.0, 'Support Vector Classifier with rbf kernel')
• We set the value of gamma to 'auto', but you can also provide a value between 0 and 1.

Dr. Zahid Ahmed Ansari 5/11/2023


58

PROS AND CONS OF SVM CLASSIFIERS

• Pros of SVM classifiers


• SVM classifiers offer great accuracy and work well in high-dimensional spaces. Because they use only a subset of the training points (the support vectors), they also use very little memory.
• Cons of SVM classifiers
• They have high training time and hence are, in practice, not suitable for large datasets. Another disadvantage is that SVM classifiers do not work well with overlapping classes.

Dr. Zahid Ahmed Ansari 5/11/2023


59

Support Vector Machine


Classification

Dr. Zahid Ahmed Ansari 5/12/2023


60

SVM

• A Support Vector Machine (SVM) is a very powerful and versatile Machine


Learning model, capable of performing linear or nonlinear classification,
regression, and even outlier detection.
• It is one of the most popular models in Machine Learning, and anyone
interested in Machine Learning should have it in their toolbox.
• SVMs are particularly well suited for classification of complex but small- or
medium-sized datasets.

Dr. Zahid Ahmed Ansari 5/12/2023


61

LINEAR SVM CLASSIFICATION

• The following Scikit-Learn code loads the iris dataset, scales the features, and then trains a linear SVM model (using the LinearSVC class with C = 1 and the hinge loss function) to detect Iris-Virginica flowers. The resulting model is represented on the right of the Figure.

Dr. Zahid Ahmed Ansari 5/12/2023


62

LINEAR SVM CLASSIFICATION


import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]                    # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)   # Iris-Virginica
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)
# Then, as usual, you can use the model to make predictions:
svm_clf.predict([[5.5, 1.7]])
# array([ 1.])
• Unlike Logistic Regression classifiers, SVM classifiers do not output probabilities for each class.
63

LINEAR SVM CLASSIFICATION

• Alternatively, you could use the SVC class, using SVC(kernel="linear", C=1), but it is much slower, especially with
large training sets, so it is not recommended.
• Another option is to use the SGDClassifier class, with SGDClassifier(loss="hinge", alpha=1/(m*C)). This applies
regular Stochastic Gradient Descent to train a linear SVM classifier.
• It does not converge as fast as the LinearSVC class, but it can be useful to handle huge datasets that do not fit in
memory (out-of-core training), or to handle online classification tasks.
• The LinearSVC class regularizes the bias term, so you should center the training set first by subtracting its mean.
This is automatic if you scale the data using the StandardScaler.
• Moreover, make sure you set the loss hyperparameter to "hinge", as it is not the default value. Finally, for better
performance you should set the dual hyperparameter to False, unless there are more features than training
instances

Dr. Zahid Ahmed Ansari 5/13/2023


64

NONLINEAR SVM CLASSIFICATION


• Although linear SVM classifiers are efficient and work surprisingly well in many cases, many datasets are not
even close to being linearly separable.
• One approach to handling nonlinear datasets is to add more features, such as polynomial features; in some
cases this can result in a linearly separable dataset.
• Consider the left plot in Figure: it represents a simple dataset with just one feature x1. This dataset is not
linearly separable, as you can see.
• But if you add a second feature x₂ = (x₁)², the resulting 2D dataset is perfectly linearly separable.
• The Figure shows the features added to make the dataset linearly separable.

Dr. Zahid Ahmed Ansari 5/12/2023


65

NONLINEAR SVM CLASSIFICATION

• To implement this idea using Scikit-Learn, you can create a Pipeline containing a PolynomialFeatures
transformer, followed by a StandardScaler and a LinearSVC. Let’s test this on the moons dataset (Figure):
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=100, noise=0.15)   # the moons dataset (X, y were not defined in the original snippet)
polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge"))
])
polynomial_svm_clf.fit(X, y)

Dr. Zahid Ahmed Ansari 5/13/2023


66

POLYNOMIAL KERNEL
• Adding polynomial features is simple to implement and can work great with all sorts of Machine Learning
algorithms (not just SVMs), but at a low polynomial degree it cannot deal with very complex datasets, and with
a high polynomial degree it creates a huge number of features, making the model too slow.
• Fortunately, when using SVMs you can apply an almost miraculous mathematical technique called the kernel trick (explained in a moment). It makes it possible to get the same result as if you had added many polynomial features, even with very high-degree polynomials, without actually having to add them.
• So there is no combinatorial explosion of the number of features, since you don't actually add any features. This trick is implemented by the SVC class. Let's test it on the moons dataset:
from sklearn.svm import SVC
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5))
])
poly_kernel_svm_clf.fit(X, y)

5/12/2023
67

POLYNOMIAL KERNEL

• This code trains an SVM classifier using a 3rd-degree polynomial kernel. It is represented on the left of the Figure. On the right is another SVM classifier using a 10th-degree polynomial kernel.
• Obviously, if your model is overfitting, you might
want to reduce the polynomial degree.
Conversely, if it is underfitting, you can try
increasing it.
• The hyperparameter coef0 controls how much
the model is influenced by high-degree
polynomials versus low-degree polynomials.
• A common approach to find the right
hyperparameter values is to use grid search. It is
often faster to first do a very coarse grid search,
then a finer grid search around the best values
found. Having a good sense of what each
hyperparameter actually does can also help you
search in the right part of the hyperparameter
space 5/12/2023
68

ADDING SIMILARITY FEATURES


• Another technique to tackle nonlinear problems is to add
features computed using a similarity function that
measures how much each instance resembles a
particular landmark. For example, let’s take the one-
dimensional dataset discussed earlier and add two
landmarks to it at x1 = –2 and x1 = 1 (see the left plot in
Figure).
• Next, let’s define the similarity function to be the Gaussian Radial Basis Function (RBF) with γ = 0.3:
ϕγ(x, ℓ) = exp(−γ ∥x − ℓ∥²)
• It is a bell-shaped function varying from 0 (very far away from the landmark) to 1 (at the landmark).
• Now we are ready to compute the new features. For example, let’s look at the instance x₁ = –1: it is located at a distance of 1 from the first landmark and 2 from the second landmark. Therefore, its new features are x₂ = exp(–0.3 × 1²) ≈ 0.74 and x₃ = exp(–0.3 × 2²) ≈ 0.30.
• The plot on the right of the Figure shows the transformed dataset (dropping the original features). As you can see, it is now linearly separable.
• You may wonder how to select the landmarks. The simplest approach is to create a landmark at the location of each and every instance in the dataset.
• This creates many dimensions and thus increases the chances that the transformed training set will be linearly separable.
• The downside is that a training set with m instances and n features gets transformed into a training set with m instances and m features (assuming you drop the original features). If your training set is very large, you end up with an equally large number of features.
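• A quick NumPy check of the two feature values computed above for the instance x₁ = –1 with landmarks at –2 and 1 and γ = 0.3:

# Sketch: RBF similarity features for x1 = -1, landmarks at -2 and 1, gamma = 0.3
import numpy as np

gamma, x1 = 0.3, -1.0
landmarks = np.array([-2.0, 1.0])
print(np.exp(-gamma * (x1 - landmarks) ** 2))   # approximately [0.74, 0.30]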
69

GAUSSIAN RBF KERNEL


• Just like the polynomial features method, the similarity features method can be useful with any Machine
Learning algorithm, but it may be computationally expensive to compute all the additional features, especially
on large training sets. However, once again the kernel trick does its SVM magic:
• it makes it possible to obtain a similar result as if you had added many similarity features, without actually
having to add them. Let’s try the Gaussian RBF kernel using the SVC class:
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
])
rbf_kernel_svm_clf.fit(X, y)

Dr. Zahid Ahmed Ansari 5/12/2023


70

SVM CLASSIFIER USING GAUSSIAN RBF KERNEL


• This model is represented on the bottom left of
Figure .
• The other plots show models trained with different
values of hyperparameters gamma (γ) and C.
• Increasing gamma makes the bell-shape curve
narrower (see the left plot of Figure), and as a
result each instance’s range of influence is
smaller: the decision boundary ends up being
more irregular, wiggling around individual
instances.
• Conversely, a small gamma value makes the bell-
shaped curve wider, so instances have a larger
range of influence, and the decision boundary
ends up smoother.
• So γ acts like a regularization hyperparameter: if
your model is overfitting, you should reduce it, and
if it is underfitting, you should increase it (similar to
the C hyperparameter)
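• A short sketch of this trade-off (moons data and scikit-learn assumed, echoing the pipeline above) fitting RBF-kernel SVCs over a small grid of gamma and C:

# Sketch: training accuracy of RBF-kernel SVCs for a small grid of gamma and C (moons data assumed)
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
for gamma in (0.1, 5):
    for C in (0.001, 1000):
        clf = Pipeline([("scaler", StandardScaler()),
                        ("svm_clf", SVC(kernel="rbf", gamma=gamma, C=C))]).fit(X, y)
        # large gamma and large C -> wiggly boundary (overfitting); small values -> smoother boundary
        print(f"gamma={gamma:<4} C={C:<6} training accuracy={clf.score(X, y):.2f}")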
5/18/2023
71

MAY TRY OTHER KERNELS

• Other kernels exist but are used much more rarely. For example, some kernels are specialized for specific data
structures.
• String kernels are sometimes used when classifying text documents or DNA sequences (e.g., using the string
subsequence kernel or kernels based on the Levenshtein distance).
• With so many kernels to choose from, how can you decide which one to use? As a rule of thumb, you should
always try the linear kernel first (remember that LinearSVC is much faster than SVC(kernel="linear")), especially
if the training set is very large or if it has plenty of features.
• If the training set is not too large, you should try the Gaussian RBF kernel as well; it works well in most cases.
• Then if you have spare time and computing power, you can also experiment with a few other kernels using
cross-validation and grid search, especially if there are kernels specialized for your training set’s data structure.

Dr. Zahid Ahmed Ansari 5/12/2023


72

COMPUTATIONAL COMPLEXITY
• The LinearSVC class is based on the liblinear library, which implements an optimized algorithm for linear SVMs. It does not support the kernel trick, but it scales almost linearly with the number of training instances and the number of features: its training time complexity is roughly O(m × n).
• The algorithm takes longer if you require a very high precision. This is controlled by the tolerance hyperparameter ϵ (called tol in Scikit-Learn). In most classification tasks, the default tolerance is fine.
• The SVC class is based on the libsvm library, which implements an algorithm that supports the kernel trick. The training time complexity is usually between O(m² × n) and O(m³ × n). Unfortunately, this means that it gets dreadfully slow when the number of training instances gets large (e.g., hundreds of thousands of instances).
• This algorithm is perfect for complex but small or medium training sets. However, it scales well with the number of features, especially with sparse features (i.e., when each instance has few nonzero features).
• In this case, the algorithm scales roughly with the average number of nonzero features per instance. Table 5-1 compares Scikit-Learn’s SVM classification classes.

Class         | Time complexity          | Out-of-core support | Scaling required | Kernel trick
LinearSVC     | O(m × n)                 | No                  | Yes              | No
SGDClassifier | O(m × n)                 | Yes                 | Yes              | No
SVC           | O(m² × n) to O(m³ × n)   | No                  | Yes              | Yes
73

Support Vector Regression


Algorithm

Dr. Zahid Ahmed Ansari 5/13/2023


74

SUPPORT VECTOR REGRESSION


• As we mentioned earlier, the SVM algorithm is quite
versatile: not only does it support linear and nonlinear
classification, but it also supports linear and nonlinear
regression.
• The trick is to reverse the objective: instead of trying to
fit the largest possible street between two classes while
limiting margin violations, SVM Regression tries to fit as
many instances as possible on the street while limiting
margin violations (i.e., instances off the street).
• The width of the street is controlled by a
hyperparameter ϵ. Figure 5-10 shows two linear SVM
Regression models trained on some random linear data,
one with a large margin (ϵ = 1.5) and the other with a
small margin (ϵ =0.5).
• Adding more training instances within the margin does
not affect the model’s predictions; thus, the model is
said to be ϵ-insensitive.

5/13/2023
75

SUPPORT VECTOR REGRESSION


• You can use Scikit-Learn’s LinearSVR class to
perform linear SVM Regression. The following
code produces the model represented on the left
of Figure (the training data should be scaled and
centered first):
from sklearn.svm import LinearSVR
svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)
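• The snippet above assumes that X and y already exist; one way to generate "some random linear data" like that in the figure (an assumption, not the original data) is:

# Sketch: random linear data for the LinearSVR example above (assumed, not the original dataset)
import numpy as np
from sklearn.svm import LinearSVR

np.random.seed(42)
X = 2 * np.random.rand(50, 1)
y = (4 + 3 * X + np.random.randn(50, 1)).ravel()
svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)
print(svm_reg.predict([[1.0]]))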
• To tackle nonlinear regression tasks, you can use
a kernelized SVM model. For example, Figure
shows SVM Regression on a random quadratic
training set, using a 2nd-degree polynomial kernel.
• There is little regularization on the left plot (i.e., a
large C value), and much more regularization on
the right plot (i.e., a small C value).

5/13/2023
76

SUPPORT VECTOR REGRESSION

• The following code produces the model represented on the left of previous Figure using Scikit-Learn’s
SVR class (which supports the kernel trick).
• The SVR class is the regression equivalent of the SVC class, and the LinearSVR class is the
regression equivalent of the LinearSVC class.
• The LinearSVR class scales linearly with the size of the training set (just like the LinearSVC class),
while the SVR class gets much too slow when the training set grows large (just like the SVC class).
from sklearn.svm import SVR
svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)
• SVMs can also be used for outlier detection; see Scikit-Learn’s documentation for more details.

Dr. Zahid Ahmed Ansari 5/13/2023


77

Support Vector Regression


Algorithm

Dr. Zahid Ahmed Ansari 5/12/2023


78

INTRODUCTION TO SUPPORT VECTOR


REGRESSION (SVR)

• Support Vector Regression (SVR) is a type of machine learning algorithm used for regression
analysis. The goal of SVR is to find a function that approximates the relationship between the
input variables and a continuous target variable, while minimizing the prediction error.
• Unlike Support Vector Machines (SVMs) used for classification tasks, SVR seeks to find a
hyperplane that best fits the data points in a continuous space. This is achieved by mapping
the input variables to a high-dimensional feature space and finding the hyperplane that
maximizes the margin (distance) between the hyperplane and the closest data points, while
also minimizing the prediction error.
• SVR can handle non-linear relationships between the input variables and the target variable
by using a kernel function to map the data to a higher-dimensional space. This makes it a
powerful tool for regression tasks where there may be complex relationships between the
input variables and the target variable.
• Support Vector Regression (SVR) uses the same principle as SVM, but for regression
problems.

Dr. Zahid Ahmed Ansari 5/12/2023


79

THE IDEA BEHIND SUPPORT VECTOR


REGRESSION
• The problem of regression is to find a
function that approximates mapping from
an input domain to real numbers on the
basis of a training sample. So let’s now dive
deep and understand how SVR works
actually.
• Consider these two red lines as the decision
boundary and the green line as the
hyperplane. Our objective, when we are
moving on with SVR, is to basically
consider the points that are within the
decision boundary line. Our best fit line is
the hyperplane that has a maximum number
of points.
Dr. Zahid Ahmed Ansari 5/12/2023
80

SVR
• The first thing to understand is the decision boundary (the red lines above). Consider these lines as being at some distance, say 'a', from the hyperplane, i.e., the lines drawn at distance '+a' and '−a' from the hyperplane. This 'a' is referred to as epsilon.
• Assuming the equation of the hyperplane is:
Y = wx + b (equation of hyperplane)
• then the equations of the decision boundaries become:
wx + b = +a
wx + b = −a
• Thus, any hyperplane that satisfies our SVR should satisfy:
−a < Y − (wx + b) < +a
• Our main aim here is to choose a decision boundary at distance 'a' from the original hyperplane such that the data points closest to the hyperplane, the support vectors, lie within that boundary.
• Hence, we take only those points that are within the decision boundary and have the least error rate, i.e., are within the margin of tolerance. This gives us a better-fitting model.

5/13/2023
81

IMPLEMENTING SVR IN PYTHON

• In this section, we’ll understand the use of


Support Vector Regression with the help of
a dataset.
• Here, we have to predict the salary of an
employee given a few independent
variables. A classic HR analytics project!

Dr. Zahid Ahmed Ansari 5/13/2023


82

STEP 1: IMPORT LIBRARIES & READ DATASET

#import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Read the dataset


dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

Dr. Zahid Ahmed Ansari 5/13/2023


83

STEP 2: FEATURE SCALING


• A real-world dataset contains features that vary in magnitude, units, and range.
• It is suggested to perform normalization when the scale of a feature is irrelevant or misleading.
• Feature scaling helps to normalize the data within a particular range. Many estimator classes apply feature scaling automatically, but the SVR class does not, so we perform feature scaling ourselves in Python.

from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y.reshape(-1, 1)).ravel()   # StandardScaler expects a 2-D array
5/13/2023
84

STEP 3: FITTING SVR TO THE DATASET

from sklearn.svm import SVR


regressor = SVR(kernel = 'rbf')
regressor.fit(X, y)
• The kernel is the most important parameter here.
• There are many types of kernels – linear, Gaussian RBF, etc. The choice depends on the dataset.

5/13/2023
85

STEP 4. PREDICTING A NEW RESULT

y_pred = regressor.predict(sc_X.transform([[6.5]]))   # scale the input before predicting
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1))

• So, the prediction for a position level of 6.5 will be about 170,370.

Dr. Zahid Ahmed Ansari 5/13/2023


86

STEP 6. VISUALIZING THE SVR RESULTS

#Step 6. Visualizing the SVR results (for higher resolution and smoother curve)
X_grid = np.arange(min(X), max(X), 0.01) #this step required because data is feature scaled.
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Truth or Bluff (SVR)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

Dr. Zahid Ahmed Ansari 5/13/2023


87

OUTPUT

• This is what we get as output- the best


fit line that has a maximum number of
points. Quite accurate!

Dr. Zahid Ahmed Ansari 5/13/2023


88

WHAT IS SVM REGRESSION?

• SVM regression or Support Vector Regression (SVR) is a machine learning algorithm


used for regression analysis.
• It is different from traditional linear regression methods as it finds a hyperplane that
best fits the data points in a continuous space, instead of fitting a line to the data
points.
The SVR algorithm aims to find the hyperplane that passes through as many data
points as possible within a certain distance, called the margin.
• This approach helps to reduce the prediction error and allows SVR to handle non-
linear relationships between input variables and the target variable using a kernel
function.
• As a result, SVM regression is a powerful tool for regression tasks where there may
be complex relationships between the input variables and the target variable.

Dr. Zahid Ahmed Ansari 5/13/2023


89

DIFFERENCE BETWEEN SVM AND SVR

• SVM (Support Vector Machines) is a classification algorithm that separates data


points into different classes with a hyperplane while minimizing the misclassification
error.
• On the other hand, SVR (Support Vector Regression) is a regression algorithm that
finds a hyperplane that best fits data points in a continuous space while minimizing
the prediction error.
• SVM is used for categorical target variables, while SVR is used for continuous target
variables

Dr. Zahid Ahmed Ansari 5/13/2023


90

APPLICATIONS OF SVM REGRESSION

• SVM regression or Support Vector Regression (SVR) has a wide range of applications
in various fields.
• It is commonly used in finance for predicting stock prices, in engineering for
predicting machine performance, and in bioinformatics for predicting protein
structures.
• SVM regression is also used in natural language processing for text classification and
sentiment analysis.
• Additionally, it is used in image processing for object recognition and in healthcare
for predicting medical outcomes.
• Overall, SVM regression is a versatile algorithm that can be used in many domains
for making accurate predictions.

Dr. Zahid Ahmed Ansari 5/13/2023


91

Multi-class Classification with


SVM
Dr. Zahid Ahmed Ansari 5/13/2023
92

BINARY AND MULTI-CLASS CLASSIFICATION

• Binary classification tasks are those where examples are assigned exactly one of two classes.
• Binary Classification: classification tasks with two classes.
• Multi-class classification tasks are those where examples are assigned exactly one of more than two classes.
• Multi-class Classification: classification tasks with more than two classes.
• Some algorithms are designed for binary classification problems. Examples include:
• Logistic Regression
• Perceptron
• Support Vector Machines
• As such, they cannot be used for multi-class classification tasks, at least not directly.

Dr. Zahid Ahmed Ansari 5/13/2023


93

BINARY CLASSIFIERS FOR MULTI-CLASS


CLASSIFICATION
• Heuristic methods can be used to split a multi-class classification problem
into multiple binary classification datasets and train a binary classification
model each.
• Two examples of these heuristic methods include:
• One-vs-Rest (OvR)
• One-vs-One (OvO)
• Let’s take a closer look at each.

Dr. Zahid Ahmed Ansari 5/13/2023


94

ONE-VS-REST FOR MULTI-CLASS


CLASSIFICATION
• One-vs-rest (OvR for short, also referred to as One-vs-All or OvA) is a heuristic method for using
binary classification algorithms for multi-class classification.
• It involves splitting the multi-class dataset into multiple binary classification problems. A binary
classifier is then trained on each binary classification problem and predictions are made using
the model that is the most confident.
• For example, given a multi-class classification problem with examples for each class ‘red,’ ‘blue,’
and ‘green‘.
• This could be divided into three binary classification datasets as follows:
• Binary Classification Problem 1: red vs [blue, green]
• Binary Classification Problem 2: blue vs [red, green]
• Binary Classification Problem 3: green vs [red, blue]
• A possible downside of this approach is that it requires one model to be created for each class.
For example, three classes requires three models. This could be an issue for large datasets (e.g.
millions of rows), slow models (e.g. neural networks), or very large numbers of classes (e.g.
hundreds of classes).
5/13/2023
95

ONE-VERSUS-THE-REST

• The obvious approach is to use a one-versus-the-rest approach (also called one-vs-all), in


which we train C binary classifiers, fc(x), where the data from class c is treated as positive,
and the data from all the other classes is treated as negative.
• This approach requires that each model predicts a class membership probability or a
probability-like score. The argmax of these scores (class index with the largest score) is then
used to predict a class.
• This approach is commonly used for algorithms that naturally predict numerical class
membership probability or score, such as:
• Logistic Regression
• Perceptron
• As such, the implementation of these algorithms in the scikit-learn library implements the
OvR strategy by default when using these algorithms for multi-class classification.
Dr. Zahid Ahmed Ansari 5/13/2023
96

ONE-VERSUS-THE-REST

• We can demonstrate this with an example on a 3-class classification problem using


the LogisticRegression algorithm. The strategy for handling multi-class classification
can be set via the “multi_class” argument and can be set to “ovr” for the one-vs-rest
strategy.
• The complete example of fitting a logistic regression model for multi-class
classification using the built-in one-vs-rest strategy is listed below.

Dr. Zahid Ahmed Ansari 5/18/2023


97

ONE-VERSUS-THE-REST

# logistic regression for multi-class classification using built-in one-vs-rest


from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5,
n_classes=3, random_state=1)
# define model
model = LogisticRegression(multi_class='ovr')
# fit model
model.fit(X, y)
# make predictions
yhat = model.predict(X)

Dr. Zahid Ahmed Ansari 5/18/2023


98

ONE-VERSUS-THE-REST

• The scikit-learn library also provides a separate OneVsRestClassifier class that


allows the one-vs-rest strategy to be used with any classifier.
• This class can be used to use a binary classifier like Logistic Regression or
Perceptron for multi-class classification, or even other classifiers that natively
support multi-class classification.
• It is very easy to use and requires that a classifier that is to be used for binary
classification be provided to the OneVsRestClassifier as an argument.
• The example below demonstrates how to use the OneVsRestClassifier class with
a LogisticRegression class used as the binary classification model.

Dr. Zahid Ahmed Ansari 5/18/2023




100

ONE-VERSUS-THE-REST
# logistic regression for multi-class classification using a one-vs-rest
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3,
random_state=1)
# define model
model = LogisticRegression()
# define the ovr strategy
ovr = OneVsRestClassifier(model)
# fit model
ovr.fit(X, y)
# make predictions
yhat = ovr.predict(X)
5/18/2023
101

ONE-VS-ONE FOR MULTI-CLASS


CLASSIFICATION
• One-vs-One (OvO for short) is another heuristic method for using binary classification algorithms for multi-
class classification.
• Like one-vs-rest, one-vs-one splits a multi-class classification dataset into binary classification problems.
Unlike one-vs-rest that splits it into one binary dataset for each class, the one-vs-one approach splits the
dataset into one dataset for each class versus every other class.
• For example, consider a multi-class classification problem with four classes: 'red', 'blue', 'green', and 'yellow'. This could be divided into six binary classification datasets as follows:
• Binary Classification Problem 1: red vs. blue
• Binary Classification Problem 2: red vs. green
• Binary Classification Problem 3: red vs. yellow
• Binary Classification Problem 4: blue vs. green
• Binary Classification Problem 5: blue vs. yellow
• Binary Classification Problem 6: green vs. yellow
• This is significantly more datasets, and in turn, models than the one-vs-rest strategy described in the
previous section.

Dr. Zahid Ahmed Ansari 5/13/2023


102

ONE-VS-ONE

• The formula for calculating the number of binary datasets, and in turn, models, is as follows:
(NumClasses * (NumClasses – 1)) / 2
• We can see that for four classes, this gives us the expected value of six binary classification problems:
(NumClasses * (NumClasses – 1)) / 2
(4 * (4 – 1)) / 2
(4 * 3) / 2
12 / 2
6
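• The same count in one line of Python:

n_classes = 4
print(n_classes * (n_classes - 1) // 2)   # -> 6 one-vs-one binary models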
• Each binary classification model may predict one class label and the model with the most predictions or
votes is predicted by the one-vs-one strategy.
• An alternative is to introduce K(K − 1)/2 binary discriminant functions, one for every possible pair of classes.
This is known as a one-versus-one classifier. Each point is then classified according to a majority vote
amongst the discriminant functions.
5/13/2023
103

ONE-VS-ONE

• Similarly, if the binary classification models predict a numerical class membership, such as a
probability, then the argmax of the sum of the scores (class with the largest sum score) is
predicted as the class label.

• Classically, this approach is suggested for support vector machines (SVM) and related
kernel-based algorithms. This is believed because the performance of kernel methods does
not scale in proportion to the size of the training dataset and using subsets of the training
data may counter this effect.

• The support vector machine implementation in the scikit-learn is provided by the SVC class
and supports the one-vs-one method for multi-class classification problems. This can be
achieved by setting the “decision_function_shape” argument to ‘ovo‘.

Dr. Zahid Ahmed Ansari 5/13/2023


104

SVM FOR MULTI-CLASS CLASSIFICATION


USING THE ONE-VS-ONE METHOD
• The example below demonstrates SVM for multi-class classification using the one-vs-one
method.
# SVM for multi-class classification using built-in one-vs-one
from sklearn.datasets import make_classification
from sklearn.svm import SVC
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5,
n_classes=3, random_state=1)
# define model
model = SVC(decision_function_shape='ovo')
# fit model
model.fit(X, y)
# make predictions
yhat = model.predict(X)
5/13/2023
105

THANK YOU!

Dr. Zahid Ahmed Ansari 5/10/2023
