UNIT-III
Support Vector Machines
(SVMs)
Dr. Zahid Ahmed Ansari
OVERVIEW
• Overview
• Separating Hyperplane
• Maximal Margin Classifier
• Support Vector Classifier (SVC):
• linear classification and
• classification with non-linear decision boundaries,
• SVM versus SVC
• SVM with more than 2 classes:
• One-versus-One and
• One-versus-All case,
• Kernel Functions
• Linearly and non-linearly separable data are illustrated in the diagram above.
• Linearly separable data is data distributed in such a way that it can be classified with a straight line or a hyperplane.
• Non-linearly separable data, on the other hand, is data that cannot be separated with a simple straight line; it requires a more complex classifier.
MAXIMAL-MARGIN CLASSIFIER
• How do we choose the hyperplane that we really need?
• The Maximal-Margin Classifier chooses the optimal hyperplane based on the maximum margin.
• The dotted lines parallel to the hyperplane in the following diagram are the margins, and the distance between these dotted lines is the maximum margin.
• A margin passes through the points of each class nearest to the hyperplane; these points lie at a perpendicular (90°) distance from the hyperplane and are referred to as “Support Vectors”. Support vectors are shown by circles in the diagram below.
• This classifier chooses the hyperplane with the maximum margin, which is why it is known as the Maximal-Margin Classifier.
DRAWBACKS
• The Maximal-Margin Classifier is heavily reliant on the support vectors and changes as the support vectors change. As a result, it tends to overfit.
• It cannot be used for data that is not linearly separable. Since the majority of real-world data is non-linear, this classifier is of limited use in practice.
• The Maximal-Margin Classifier is also known as a “Hard Margin Classifier” because it prevents misclassification and ensures that no point crosses the margin. This hard margin is what makes it prone to overfitting.
• An extension of the Maximal-Margin Classifier, the “Support Vector Classifier”, was introduced to address these problems.
Small ‘C’ value → large budget → wide margin → allows more misclassification
Large ‘C’ value → small budget → narrow margin → allows less misclassification
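• To see this trade-off in code, the following is a minimal sketch (not from the slides) that fits scikit-learn's SVC with a small and a large C on illustrative toy data; the dataset and variable names are assumptions.
# Illustrative sketch: effect of C on a soft-margin (linear) SVC.
# Small C = large budget for violations = wide margin; large C = narrow margin.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)
soft_clf = SVC(kernel="linear", C=0.01).fit(X, y)   # small C: wide margin, more violations allowed
hard_clf = SVC(kernel="linear", C=100.0).fit(X, y)  # large C: narrow margin, fewer violations allowed
# A wider margin typically has more support vectors on or inside it.
print("support vectors (C=0.01):", len(soft_clf.support_))
print("support vectors (C=100):", len(hard_clf.support_))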
DRAWBACK
• This classifier can only perform linear classification. It becomes ineffective when the decision boundary is non-linear.
• Note the difference:
Maximal Margin Classifier → Hard Margin Classifier
Support Vector Classifier → Soft Margin Classifier
• However, both Maximal-Margin Classifiers and Support Vector Classifiers are restricted to data that can be separated linearly.
• Support Vector Machines are an extension of the soft margin classifier. By using a kernel, they can also be used for non-linear classification. As a result, this algorithm performs well on the majority of real-world problems, since real-world data is mostly non-linearly separable and requires more complex classifiers.
• Kernel: A kernel transforms non-linearly separable data from a lower-dimensional space to a higher-dimensional space to facilitate linear classification, as illustrated in the figure below. We use kernel-based techniques to separate non-linear data because the separation can be simpler in higher dimensions.
• The kernel maps the data from lower to higher dimensions using mathematical functions; a simple sketch of this idea is shown below.
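• The following is a small illustrative sketch (not from the slides) of the idea: 1-D data that no single threshold can separate becomes linearly separable after mapping each point x to (x, x²).
# Illustrative sketch: mapping non-linearly separable 1-D data into 2-D.
import numpy as np
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = (np.abs(x) > 1.5).astype(int)     # class 1 on the outside, class 0 in the middle
# No single threshold on x separates the two classes in 1-D.
phi = np.column_stack([x, x ** 2])    # map each point to (x, x^2)
# In this 2-D space, the horizontal line x2 = 2.25 (a linear boundary) separates the classes.
print(phi)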
TYPES OF SVM
LINEAR SVM
NON-LINEAR SVM
SVM KERNELS
• In practice, the SVM algorithm is implemented with a kernel that transforms the input data space into the required form.
• A kernel function is a method that takes data as input and transforms it into the form required for processing.
• The term “kernel” refers to a set of mathematical functions used in Support Vector Machines that provide a window through which to manipulate the data.
• The kernel function transforms the training data so that a non-linear decision surface becomes a linear decision boundary in a higher-dimensional space. In effect, it returns the inner product between two points in that feature space.
• A kernel takes a low-dimensional input space and transforms it into a higher-dimensional space.
• In simple words, the kernel converts non-separable problems into separable problems by adding more dimensions.
• This makes SVM more powerful, flexible, and accurate. The inner-product view of a kernel is sketched in code below.
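• As a small illustrative sketch (not from the slides), the following NumPy code shows that a degree-2 polynomial kernel value equals an inner product in a higher-dimensional feature space, computed without ever building that space explicitly (the kernel trick).
# Kernel value vs. explicit inner product in the expanded feature space.
import numpy as np
def phi(v):
    # Explicit degree-2 polynomial feature map for a 2-D point.
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])
a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])
explicit = phi(a) @ phi(b)       # inner product in the 6-D feature space
kernel = (a @ b + 1.0) ** 2      # degree-2 polynomial kernel on the 2-D inputs
print(explicit, kernel)          # both print 25.0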
LINEAR KERNEL
• This kernel is one-dimensional in nature and is the most basic form of kernel in SVM. The equation is:
• K(xᵢ, xⱼ) = xᵢ · xⱼ + c
• where xᵢ · xⱼ is the dot product between any two observations xᵢ and xⱼ, and c is an optional constant.
• Closely related graph kernels compute this inner product on graphs: they measure the similarity between pairs of graphs and contribute to areas such as bioinformatics and cheminformatics.
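• As a minimal illustration (not from the slides), the linear kernel can be computed directly with NumPy:
# Computing K(xi, xj) = xi . xj + c for two observations.
import numpy as np
def linear_kernel(xi, xj, c=0.0):
    return np.dot(xi, xj) + c
xi = np.array([1.0, 2.0, 3.0])
xj = np.array([0.5, -1.0, 2.0])
print(linear_kernel(xi, xj))     # 1*0.5 + 2*(-1) + 3*2 = 4.5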
#importing libraries and dataset
import pandas as pd
data_set = pd.read_csv('user_data.csv')
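• The intermediate steps (feature extraction, train/test split, feature scaling, and fitting the linear-kernel classifier) are not shown on the slides; a minimal sketch consistent with the output below might look like this. The column names Age, EstimatedSalary, and Purchased are assumptions.
# Sketch of the omitted steps (column names are assumptions).
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
x = data_set[['Age', 'EstimatedSalary']].values   # independent variables
y = data_set['Purchased'].values                  # dependent variable
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
# Feature scaling
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
# Fitting the SVM classifier with a linear kernel to the training set
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)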
OUTPUT
Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)
• The model performance can be altered by changing the value of C (the regularization factor), gamma, and the kernel.
• Now we will predict the output for the test set. For this, we will create a new vector y_pred. Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)
• After getting the y_pred vector, we can compare the result of y_pred and
y_test to check the difference between the actual value and predicted
value.
• Now we will check the performance of the SVM classifier: how many incorrect predictions it makes compared to the logistic regression classifier. To create the confusion matrix, we need to import the confusion_matrix function from the sklearn library. After importing the function, we will call it and store the result in a new variable cm. The function takes two parameters, y_true (the actual values) and y_pred (the values returned by the classifier). Below is the code for it:
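• The slides describe this code without showing it; a minimal version consistent with the description is:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)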
OUTPUT
• Text(0.5, 1.0, 'Support Vector Classifier with linear kernel')
EXAMPLE
• For creating SVM classifier with rbf kernel, we can change the kernel to rbf as follows −
svc_classifier = svm.SVC(kernel='rbf', gamma='auto', C=C).fit(X, y)
Z = svc_classifier.predict(X_plot)
Z = Z.reshape(xx.shape)
plt.figure(figsize=(15, 5))
plt.subplot(121)
plt.contourf(xx, yy, Z, cmap = plt.cm.tab10, alpha = 0.3)
plt.scatter(X[:, 0], X[:, 1], c = y, cmap = plt.cm.Set1)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('Support Vector Classifier with rbf kernel')
OUTPUT
• Text(0.5, 1.0, 'Support Vector Classifier with rbf kernel')
• We set the value of gamma to 'auto', but you can also provide an explicit numeric value, for example between 0 and 1.
SVM
• The following Scikit-Learn code loads the iris dataset, scales the features, and
then trains a linear SVM model (using the LinearSVC class with C = 0.1 and
the hinge loss function) to detect Iris-Virginica flowers. The resulting model is
represented on the right of Figure.
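• The code itself is not reproduced on the slide; a sketch matching this description (load iris, keep petal length/width, scale, and train LinearSVC with C = 0.1 and the hinge loss to detect Iris-Virginica) could be:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]                   # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # 1 if Iris-Virginica, else 0
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=0.1, loss="hinge")),
])
svm_clf.fit(X, y)
print(svm_clf.predict([[5.5, 1.7]]))          # predict the class of a new flower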
• Alternatively, you could use the SVC class, using SVC(kernel="linear", C=1), but it is much slower, especially with
large training sets, so it is not recommended.
• Another option is to use the SGDClassifier class, with SGDClassifier(loss="hinge", alpha=1/(m*C)). This applies
regular Stochastic Gradient Descent to train a linear SVM classifier.
• It does not converge as fast as the LinearSVC class, but it can be useful to handle huge datasets that do not fit in
memory (out-of-core training), or to handle online classification tasks.
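• As a rough illustration (not from the slides; it reuses X and y from the sketch above, and m is the number of training instances):
# Training a linear SVM with plain Stochastic Gradient Descent instead of LinearSVC.
from sklearn.linear_model import SGDClassifier
m = len(X)      # number of training instances
C = 0.1
sgd_clf = SGDClassifier(loss="hinge", alpha=1 / (m * C))
sgd_clf.fit(X, y)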
• The LinearSVC class regularizes the bias term, so you should center the training set first by subtracting its mean.
This is automatic if you scale the data using the StandardScaler.
• Moreover, make sure you set the loss hyperparameter to "hinge", as it is not the default value. Finally, for better
performance you should set the dual hyperparameter to False, unless there are more features than training
instances.
• To implement this idea using Scikit-Learn, you can create a Pipeline containing a PolynomialFeatures
transformer, followed by a StandardScaler and a LinearSVC. Let’s test this on the moons dataset (Figure):
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Generate the moons dataset (parameters are illustrative)
X, y = make_moons(n_samples=100, noise=0.15)

polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge"))
])
polynomial_svm_clf.fit(X, y)
POLYNOMIAL KERNEL
• Adding polynomial features is simple to implement and can work great with all sorts of Machine Learning
algorithms (not just SVMs), but at a low polynomial degree it cannot deal with very complex datasets, and with
a high polynomial degree it creates a huge number of features, making the model too slow.
• Fortunately, when using SVMs you can apply an almost miraculous mathematical technique called the kernel
trick (it is explained in a moment). It makes it possible to get the same result as if you added many polynomial
features, even with very high-degree polynomials, without actually having to add them.
• So there is no combinatorial explosion of the number of features since you don’t actually add any features. This
trick is implemented by the SVC class. Let’s test it on the moons dataset:
from sklearn.svm import SVC
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5))
])
poly_kernel_svm_clf.fit(X, y)
• Other kernels exist but are used much more rarely. For example, some kernels are specialized for specific data
structures.
• String kernels are sometimes used when classifying text documents or DNA sequences (e.g., using the string
subsequence kernel or kernels based on the Levenshtein distance).
• With so many kernels to choose from, how can you decide which one to use? As a rule of thumb, you should
always try the linear kernel first (remember that LinearSVC is much faster than SVC(kernel="linear")), especially
if the training set is very large or if it has plenty of features.
• If the training set is not too large, you should try the Gaussian RBF kernel as well; it works well in most cases.
• Then if you have spare time and computing power, you can also experiment with a few other kernels using
cross-validation and grid search, especially if there are kernels specialized for your training set’s data structure.
COMPUTATIONAL COMPLEXITY
• The LinearSVC class is based on the liblinear library, which implements an optimized algorithm for linear SVMs. It does not support the kernel trick, but it scales almost linearly with the number of training instances and the number of features: its training time complexity is roughly O(m × n).
• The algorithm takes longer if you require a very high precision. This is controlled by the tolerance hyperparameter ϵ (called tol in Scikit-Learn). In most classification tasks, the default tolerance is fine.
• The SVC class is based on the libsvm library, which implements an algorithm that supports the kernel trick. The training time complexity is usually between O(m² × n) and O(m³ × n). Unfortunately, this means it gets dreadfully slow when the number of training instances gets large (e.g., hundreds of thousands of instances).
• This algorithm is perfect for complex but small or medium training sets. However, it scales well with the number of features, especially with sparse features (i.e., when each instance has few nonzero features).
• In this case, the algorithm scales roughly with the average number of nonzero features per instance. Table 5-1 compares Scikit-Learn’s SVM classification classes.
• The following code produces the model represented on the left of previous Figure using Scikit-Learn’s
SVR class (which supports the kernel trick).
• The SVR class is the regression equivalent of the SVC class, and the LinearSVR class is the
regression equivalent of the LinearSVC class.
• The LinearSVR class scales linearly with the size of the training set (just like the LinearSVC class),
while the SVR class gets much too slow when the training set grows large (just like the SVC class).
from sklearn.svm import SVR

# X and y here are a (noisy) non-linear regression dataset, e.g. quadratic data
svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)
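• For completeness, a brief sketch of the LinearSVR counterpart mentioned above (the epsilon value is illustrative, not taken from the slides):
# Linear SVM regression; epsilon controls the width of the insensitive margin.
from sklearn.svm import LinearSVR
svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)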
• SVMs can also be used for outlier detection; see Scikit-Learn’s documentation for more details.
• Support Vector Regression (SVR) is a type of machine learning algorithm used for regression
analysis. The goal of SVR is to find a function that approximates the relationship between the
input variables and a continuous target variable, while minimizing the prediction error.
• Unlike Support Vector Machines (SVMs) used for classification tasks, SVR seeks to find a
hyperplane that best fits the data points in a continuous space. This is achieved by mapping
the input variables to a high-dimensional feature space and finding the hyperplane that
maximizes the margin (distance) between the hyperplane and the closest data points, while
also minimizing the prediction error.
• SVR can handle non-linear relationships between the input variables and the target variable
by using a kernel function to map the data to a higher-dimensional space. This makes it a
powerful tool for regression tasks where there may be complex relationships between the
input variables and the target variable.
• Support Vector Regression (SVR) uses the same principle as SVM, but for regression
problems.
SVR
• The first thing to understand is the decision boundary (the red dashed lines in the figure above). Consider these lines as being at some distance, say ‘a’, from the hyperplane; they are the lines drawn at distance ‘+a’ and ‘-a’ from the hyperplane. This ‘a’ is usually referred to as epsilon.
• Assuming the equation of the hyperplane is:
Y = wx + b (equation of the hyperplane)
• the equations of the decision boundaries become:
wx + b = +a
wx + b = -a
• Thus, any hyperplane that satisfies our SVR should satisfy:
-a < Y - (wx + b) < +a
• Our main aim is to choose a decision boundary at distance ‘a’ from the original hyperplane such that the data points closest to the hyperplane (the support vectors) are within that boundary.
• Hence, we take only those points that are within the decision boundary and have the least error, i.e. are within the margin of tolerance. This gives us a better-fitting model. The optimization problem this corresponds to is sketched below.
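• Written as an optimization problem, this is the standard ε-insensitive formulation of SVR (a simplification without slack variables; here ε plays the role of ‘a’ above):
minimize (1/2) ||w||²
subject to |yᵢ - (w·xᵢ + b)| ≤ ε, for every training point i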
#import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
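• The data-loading, scaling, and model-fitting steps between these imports and the prediction below are not shown; a minimal sketch consistent with the later code (the file name and column positions follow the classic 'Position_Salaries' example and are assumptions) is:
# Sketch of the omitted steps (dataset name and columns are assumptions).
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values      # position level
y = dataset.iloc[:, 2:3].values      # salary
# Feature scaling (SVR does not scale features automatically)
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y).ravel()
# Fitting SVR with an RBF kernel to the dataset
regressor = SVR(kernel='rbf')
regressor.fit(X, y)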
#Step 5. Predicting a new result (predict expects a 2-D array, and the input must be scaled)
y_pred = regressor.predict(sc_X.transform([[6.5]]))
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1))
#Step 6. Visualizing the SVR results (for higher resolution and smoother curve)
X_grid = np.arange(min(X), max(X), 0.01) #this step required because data is feature scaled.
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Truth or Bluff (SVR)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
OUTPUT
• SVM regression or Support Vector Regression (SVR) has a wide range of applications
in various fields.
• It is commonly used in finance for predicting stock prices, in engineering for
predicting machine performance, and in bioinformatics for predicting protein
structures.
• SVM regression is also used in natural language processing for text classification and
sentiment analysis.
• Additionally, it is used in image processing for object recognition and in healthcare
for predicting medical outcomes.
• Overall, SVM regression is a versatile algorithm that can be used in many domains
for making accurate predictions.
• Binary classification tasks are those in which each example is assigned exactly one of two classes.
• Binary Classification: Classification tasks with two classes.
• Multi-class classification tasks are those in which each example is assigned exactly one of more than two classes.
• Multi-class Classification: Classification tasks with more than two classes.
• Some algorithms are designed for binary classification problems. Examples include:
• Logistic Regression
• Perceptron
• Support Vector Machines
• As such, they cannot be used for multi-class classification tasks, at least not directly.
ONE-VERSUS-THE-REST
# logistic regression for multi-class classification using a one-vs-rest
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3,
random_state=1)
# define model
model = LogisticRegression()
# define the ovr strategy
ovr = OneVsRestClassifier(model)
# fit model
ovr.fit(X, y)
# make predictions
yhat = ovr.predict(X)
ONE-VS-ONE
• The formula for calculating the number of binary datasets, and in turn, models, is as follows:
(NumClasses * (NumClasses – 1)) / 2
• We can see that for four classes, this gives us the expected value of six binary classification problems:
(NumClasses * (NumClasses – 1)) / 2
(4 * (4 – 1)) / 2
(4 * 3) / 2
12 / 2
6
• Each binary classification model may predict one class label and the model with the most predictions or
votes is predicted by the one-vs-one strategy.
• An alternative is to introduce K(K − 1)/2 binary discriminant functions, one for every possible pair of classes.
This is known as a one-versus-one classifier. Each point is then classified according to a majority vote
amongst the discriminant functions.
ONE-VS-ONE
• Similarly, if the binary classification models predict a numerical class membership, such as a
probability, then the argmax of the sum of the scores (class with the largest sum score) is
predicted as the class label.
• Classically, this approach is suggested for support vector machines (SVMs) and related kernel-based algorithms, because the performance of kernel methods does not scale well with the size of the training dataset, and using subsets of the training data (one per pair of classes) may counter this effect.
• The support vector machine implementation in the scikit-learn is provided by the SVC class
and supports the one-vs-one method for multi-class classification problems. This can be
achieved by setting the “decision_function_shape” argument to ‘ovo‘.
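• A minimal sketch of this, reusing the synthetic dataset from the one-vs-rest example (the snippet itself is illustrative):
# SVM for multi-class classification using the built-in one-vs-one strategy
from sklearn.datasets import make_classification
from sklearn.svm import SVC
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, n_classes=3, random_state=1)
# decision_function_shape='ovo' selects the one-vs-one decision function
model = SVC(decision_function_shape='ovo')
model.fit(X, y)
yhat = model.predict(X)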
THANK YOU!