SVM
Introduction
• Some of you may have selected hyperplane B, as it has a higher margin than A. But here is the catch: SVM selects the hyperplane that classifies the classes accurately before maximizing the margin. Here, hyperplane B has a classification error, while A has classified all points correctly. Therefore, the right hyperplane is A.
Convex Hull
• In the case of linearly separable data, the
MMH is as far away as possible from the outer
boundaries of the two groups of data points.
These outer boundaries are known as the
convex hull. The MMH is then the
perpendicular bisector of the shortest line
between the two convex hulls. Sophisticated
computer algorithms that use a technique
known as quadratic optimization are capable
of finding the maximum margin in this way.
Hyperplane in n dimensional space
• w · x + b = 0
• w is a vector of n weights, that is, {w1, w2, ...,
wn}, and b is a single number known as the
bias. The bias is conceptually equivalent to the
intercept term in the slope-intercept form
discussed in Regression Methods.
• The goal of the process is to find a set of weights that specify two hyperplanes:
w · x + b ≥ +1
w · x + b ≤ −1
• These hyperplanes are specified such that all the
points of one class fall above the first hyperplane
and all the points of the other class fall beneath
the second hyperplane.
• This is possible so long as the data are linearly
separable.
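• As a minimal sketch (the weight vector and bias below are made up for illustration, not learned from data), classifying a point reduces to checking the sign of w · x + b:

import numpy as np

# Hypothetical weights and bias, for illustration only.
w = np.array([2.0, -1.0])
b = -3.0

points = np.array([[4.0, 1.0],   # lands on the positive side
                   [1.0, 2.0]])  # lands on the negative side

# Evaluate w . x + b for each point; the sign gives the predicted class.
scores = points @ w + b
labels = np.where(scores >= 0, +1, -1)
print(scores)   # [ 4. -3.]
print(labels)   # [ 1 -1]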
• Vector geometry defines the distance between these two planes as:
2 / ||w||
• Here, ||w|| indicates the Euclidean norm (the distance from the origin to vector w). Because ||w|| is in the denominator, to maximize the distance we need to minimize ||w||. The task is typically re-expressed as a set of constraints, as follows:
min (1/2) ||w||^2, subject to yi (w · xi + b) ≥ 1 for every data point (xi, yi)
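• A minimal sketch, assuming a tiny made-up, linearly separable dataset: fit a linear SVM with scikit-learn, read off w and b, and compute the margin width 2 / ||w||:

import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (values chosen only for illustration).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A large C approximates the hard-margin MMH described above.
clf = SVC(kernel='linear', C=1e6).fit(X, y)

w = clf.coef_[0]                          # learned weight vector
b = clf.intercept_[0]                     # learned bias
margin_width = 2.0 / np.linalg.norm(w)    # distance between the two hyperplanes
print(w, b, margin_width)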
Scenario 4
As mentioned earlier, the one star at the other end is like an outlier for the star class. The SVM algorithm has a feature to ignore outliers and find the hyperplane that has the maximum margin. Hence, we can say that SVM classification is robust to outliers.
Find the hyperplane to segregate two classes (Scenario 5): Nonlinear spaces
Strengths:
• Can be used for classification or numeric prediction problems
• Not overly influenced by noisy data and not very prone to overfitting
• May be easier to use than neural networks, particularly due to the existence of several well-supported SVM algorithms
• Gaining popularity due to its high accuracy and high-profile wins in data mining competitions
Weaknesses:
• Finding the best model requires testing various combinations of kernels and model parameters
• Can be slow to train, particularly if the input dataset has a large number of features or examples
• Results in a complex black box model that is difficult, if not impossible, to interpret
• Kernel functions, in general, are of the following form:
K(xi, xj) = ϕ(xi) · ϕ(xj)
• The function denoted by the Greek letter phi,
that is, ϕ(x), is a mapping of the data into
another space. Therefore, the general kernel
function applies some transformation to the
feature vectors xi and xj and combines them
using the dot product, which takes two
vectors and returns a single number.
• The linear kernel does not transform the data at all. Therefore, it can be expressed simply as the dot product of the features:
K(xi, xj) = xi · xj
• The polynomial kernel of degree d adds a simple nonlinear transformation of the data:
K(xi, xj) = (xi · xj + 1)^d
• The sigmoid kernel results in an SVM model somewhat analogous to a neural network using a sigmoid activation function. The Greek letters kappa and delta are used as kernel parameters:
K(xi, xj) = tanh(κ xi · xj − δ)
• The Gaussian RBF kernel is similar to an RBF neural network. The RBF kernel performs well on many types of data and is thought to be a reasonable starting point for many learning tasks:
K(xi, xj) = exp(−||xi − xj||² / (2σ²))
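• As a small sketch, the four kernels above can be written directly in NumPy as functions of two feature vectors (the parameter values here are arbitrary choices, not recommendations):

import numpy as np

def linear_kernel(xi, xj):
    # No transformation: plain dot product of the feature vectors.
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, d=3):
    # Degree-d polynomial transformation.
    return (np.dot(xi, xj) + 1) ** d

def sigmoid_kernel(xi, xj, kappa=0.5, delta=1.0):
    # Analogous to a sigmoid activation in a neural network.
    return np.tanh(kappa * np.dot(xi, xj) - delta)

def rbf_kernel(xi, xj, sigma=1.0):
    # Gaussian radial basis function.
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2 * sigma ** 2))

xi = np.array([1.0, 2.0])
xj = np.array([2.0, 0.5])
print(linear_kernel(xi, xj), polynomial_kernel(xi, xj),
      sigmoid_kernel(xi, xj), rbf_kernel(xi, xj))

• Note that scikit-learn's SVC also accepts a callable as its kernel, but that callable must operate on whole data matrices and return the Gram matrix, not on a single pair of vectors as above.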
• The fit depends heavily on the concept to be learned, as well as the amount of training data and the relationships among the features.
• A bit of trial and error is required, training and evaluating several SVMs on a validation dataset. That said, in many cases the choice of kernel is arbitrary, as the performance may vary only slightly.
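• One common way to carry out this trial and error is a cross-validated grid search over kernels and parameters. A minimal sketch with scikit-learn (the grid values here are illustrative only):

from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

# Try every kernel with a few values of C and gamma, using 5-fold CV.
param_grid = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'C': [0.1, 1, 10],
    'gamma': ['scale', 0.1, 1],
}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X_train, y_train)

print(search.best_params_)           # best kernel/parameter combination found
print(search.score(X_test, y_test))  # accuracy of that model on held-out data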
Multiclass SVM
• In this setting, the machine must classify an instance as exactly one of three or more classes, for example:
• Classifying a text as positive, negative, or neutral
• Determining the dog breed in an image
• Categorizing a news article as sports, politics, economics, or social
• In its simplest form, SVM does not support multiclass classification natively. It supports binary classification, separating data points into two classes. For multiclass classification, the same principle is applied after breaking the multiclass problem down into multiple binary classification problems.
• The idea is to map data points to a high-dimensional space to gain mutual linear separation between every pair of classes. This is called the One-to-One approach, which breaks the multiclass problem down into multiple binary classification problems: one binary classifier per pair of classes.
• Another approach is One-to-Rest. In that approach, the breakdown is one binary classifier per class.
• A single SVM does binary classification and can differentiate between two classes. So, according to the two breakdown approaches, to classify data points from an m-class data set:
– In the One-to-Rest approach, the classifier uses m SVMs. Each SVM predicts membership in one of the m classes.
– In the One-to-One approach, the classifier uses m(m−1)/2 SVMs, one per pair of classes.
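• A minimal sketch using scikit-learn's One-to-Rest and One-to-One wrappers around a binary SVM, counting how many binary classifiers each breakdown actually trains (the 10-class digits dataset is used here, so m = 10):

from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

ovr = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X, y)

print(len(ovr.estimators_))   # m = 10 binary SVMs
print(len(ovo.estimators_))   # m(m-1)/2 = 45 binary SVMs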
• Let's take an example of a 3-class classification problem with green, red, and blue classes, as in the following image:
• In the One-to-Rest approach, we need a hyperplane to separate a class from all the others at once. This means the separation takes all points into account, dividing them into two groups: a group for the class's points and a group for all other points. For example, the green line tries to maximize the separation between the green points and all other points at once:
https://www.baeldung.com/cs/svm-multiclass-classification
• In the One-to-One approach, we need a hyperplane to separate every pair of classes, neglecting the points of the third class. This means the separation takes into account only the points of the two classes in the current split. For example, the red-blue line tries to maximize the separation only between the blue and red points; it has nothing to do with the green points:
from sklearn import svm, datasets
import sklearn.model_selection as model_selection
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

# Load the iris dataset and keep only the first two features
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

# 80/20 train/test split
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, train_size=0.80, test_size=0.20, random_state=101)

# Train one SVM with an RBF kernel and one with a degree-3 polynomial kernel
rbf = svm.SVC(kernel='rbf', gamma=0.5, C=0.1).fit(X_train, y_train)
poly = svm.SVC(kernel='poly', degree=3, C=1).fit(X_train, y_train)

poly_pred = poly.predict(X_test)
rbf_pred = rbf.predict(X_test)

# Evaluate both models with accuracy and weighted F1 score
poly_accuracy = accuracy_score(y_test, poly_pred)
poly_f1 = f1_score(y_test, poly_pred, average='weighted')
print('Accuracy (Polynomial Kernel): ', "%.2f" % (poly_accuracy*100))
print('F1 (Polynomial Kernel): ', "%.2f" % (poly_f1*100))

rbf_accuracy = accuracy_score(y_test, rbf_pred)
rbf_f1 = f1_score(y_test, rbf_pred, average='weighted')
print('Accuracy (RBF Kernel): ', "%.2f" % (rbf_accuracy*100))
print('F1 (RBF Kernel): ', "%.2f" % (rbf_f1*100))
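• Note that scikit-learn's SVC handles the three iris classes above without any explicit wrapper: internally it applies the One-to-One (one-vs-one) breakdown described earlier.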