Chapter 6
Support Vector Machines
Abstract Classifying data is a common task in data-driven modeling. Using
support vector machines, we can separate classes of data by a hyperplane. A support
vector machine (SVM) is a concept for a set of related supervised learning methods
that analyze data and recognize patterns, and it is used for classification and
regression analysis. The formulation of SVM uses the structural risk minimization
principle, which has been shown to be superior to the traditional empirical risk
minimization principle used by conventional neural networks. This chapter briefly
presents the principles of classification and regression analysis by support vector
machines, along with the related MATLAB programs.
Keywords Support vector machines • Classification • Kernel function • Regression
6.1 Introduction
Classifying data is a common task in data-driven modeling. Using support vector
machines, we can separate classes of data by a hyperplane. A support vector
machine (SVM) is a concept for a set of related supervised learning methods that
analyze data and recognize patterns, and it is used for classification and regression
analysis. The support vector machine was developed by Vapnik in 1995 (Vapnik 1995). The
formulation of SVM uses the structural risk minimization principle, which has been
shown to be superior to the traditional empirical risk minimization principle used
by conventional neural networks (Burges 1998). In what follows, the application of
SVM is introduced briefly (Fig. 6.1).
6.2 Support Vector Machines for Classification
As presented in Chap. 5, models such as artificial neural networks divide the
decision space by a line, a plane, or a hyperplane, as shown in Fig. 6.1. The
difference between SVM and other classifiers is that SVM divides the decision space
in a way that minimizes the risk of classification, which means that the two classes
have the maximum possible distance from the classifier. As demonstrated in Fig. 6.1,
a classifier line might be inappropriate for dividing a two-dimensional space into
two classes (line a in Fig. 6.1). Line b divides the space in a proper manner;
however, it has a high risk of misclassification if new data points are added. Among
the others, line c divides the space so that the associated risk is minimized as far
as possible. Line b represents a classifier line obtained by neural networks, whereas
line c may represent a classifier line obtained by SVM.
Fig. 6.1 Different examples of classification for a wrong classifier (a), an unstable classifier (b), and a stable classifier (c)
Fig. 6.2 Schematic view of classification in the SVM approach
Fig. 6.3 Plot of the classification results in Example 6.1
There are many hyperplanes that might classify the data. One reasonable choice
as the best hyperplane is the one that represents the largest separation, or margin,
between the two classes. So SVM chooses the hyperplane so that the distance from
it to the nearest data point on each side is maximized. If such a hyperplane exists, it
is known as the maximum-margin hyperplane, and the defined linear classifier is
known as a maximum-margin classifier or, equivalently, the perceptron of optimal
stability.
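For instance, such a maximum-margin linear classifier can be fitted in a single call with MATLAB's fitcsvm; the following is a minimal sketch, assuming the Statistics and Machine Learning Toolbox is available and using toy data for illustration only:

% Fit a linear maximum-margin classifier on toy data (illustrative values).
X = [1 5; 2 4; 3 6; 6 1; 7 2; 8 1];   % observations, one per row
y = [1; 1; 1; -1; -1; -1];            % class labels (+1 / -1)
mdl = fitcsvm(X, y, 'KernelFunction', 'linear');
mdl.SupportVectors                    % the points that define the margin
predict(mdl, [4 3])                   % classify a new point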
To describe the algorithm of SVM, let us consider a two-dimensional classifi-
cation problem which is divided by a line as shown in Fig. 6.2. The equation of this
line is written as
$$W^T X + b = 0 \qquad (6.1)$$
where X is the vector of variables in the decision space, and W and b are the
parameters of the classifier.
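Once W and b are known, a point X is assigned to one class or the other according to the sign of $W^T X + b$; a minimal MATLAB sketch with assumed values:

% Classify a single point by the sign of W'*X + b (illustrative values).
W = [1; -1]; b = 0.5;            % assumed classifier parameters
X = [2; 1];                      % a point in the two-dimensional space
class = sign(W'*X + b)           % +1 -> one class, -1 -> the other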
In contrast to artificial neural networks, SVM considers a margin around the
classifier line, as shown in Fig. 6.2. Let us define the equations of the marginal
lines as
$$W^T X + b = 1 \qquad (6.2)$$
and
$$W^T X + b = -1 \qquad (6.3)$$
We know that the distance of a line from the origin is obtained by $|b|/\|W\|$.
Therefore, the distance between the upper marginal line and the classifier line
of Fig. 6.2 is obtained as
$$d = \frac{|b-1|}{\|W\|} - \frac{|b|}{\|W\|} = \frac{1}{\|W\|} \qquad (6.4)$$
So the width of the margin is
$$D = 2d = \frac{2}{\|W\|} \qquad (6.5)$$
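As a quick numeric illustration (with assumed values for W and b), these distances can be checked in MATLAB:

% Distance from the origin and margin width for an assumed classifier.
W = [3; 4]; b = 3;
dist_origin = abs(b)/norm(W)     % |b|/||W||  -> 0.6
margin = 2/norm(W)               % D = 2/||W|| (Eq. 6.5) -> 0.4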
The objective function of the SVM method becomes maximizing $2/\|W\|$ or,
equivalently,
$$\text{Min } L = \frac{1}{2} W^T W \qquad (6.6)$$
Subject to the following constraints,
$$\text{if } y = 1 \text{ then } W^T X + b \ge 1$$
$$\text{if } y = -1 \text{ then } W^T X + b \le -1 \qquad (6.7)$$
In the above constraints, y is the label representing each class. The two
constraints above can be combined into the form
$$\text{Minimize } L = \frac{1}{2} W^T W \qquad (6.8)$$
Subject to
$$y\left(W^T X + b\right) - 1 \ge 0 \qquad (6.9)$$
The above optimization problem corresponds to the “hard margin” approach,
in which a solid border is considered for the support vectors. We can also make the
support vectors more flexible by accepting an error ξ for each of the border lines.
The optimization then becomes
$$\text{Minimize } \frac{1}{2} W^T W + C \sum_{i=1}^{n} \xi_i, \quad i = 1, \ldots, n \qquad (6.10)$$
$$\text{Subject to: } y_i\left(W^T X_i + b\right) \ge 1 - \xi_i, \quad \forall i \in \{1, \ldots, n\}$$
$$\xi_i \ge 0, \quad \forall i \in \{1, \ldots, n\} \qquad (6.11)$$
where n is the number of observations.
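Because Eqs. 6.10 and 6.11 form a quadratic program, they can be passed directly to MATLAB's quadprog; the following is a minimal sketch with toy data, in which the stacking of the decision vector [w; b; ξ] is an illustrative choice rather than part of the original example:

% Solve the soft-margin primal (Eqs. 6.10-6.11) with quadprog.
% Decision vector z = [w; b; xi], where xi are the slack variables.
x = [1 2 3 6 7 8; 5 4 6 1 2 1];           % toy data, 2-by-n (illustrative)
y = [1 1 1 -1 -1 -1];                     % class labels +1 / -1
C = 10;                                   % penalty parameter
[d, n] = size(x);
H  = blkdiag(eye(d), 0, zeros(n));        % quadratic term: (1/2)*w'*w only
f  = [zeros(d+1,1); C*ones(n,1)];         % linear term: C*sum(xi)
A  = [-diag(y)*x', -y', -eye(n);          % -y_i*(w'*x_i + b) - xi_i <= -1
       zeros(n,d+1),    -eye(n)];         % -xi_i <= 0
bb = [-ones(n,1); zeros(n,1)];
z  = quadprog(H, f, A, bb);
w  = z(1:d); b = z(d+1);                  % recovered classifier parameters

Example 6.1 below instead solves the dual problem, which is the more common route because it extends naturally to kernel functions.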
The objective function of Eq. 6.8 can be changed to the following form by
inserting the constraints into the objective function:
$$\text{Minimize } L = \frac{1}{2} W^T W - \sum_i \alpha_i \left[ y_i \left( W^T X_i + b \right) - 1 \right], \quad i = 1, \ldots, n \qquad (6.12)$$
where α is the multiplier of the constraint when it is inserted into the objective
function.
The above equation is the primal problem of the SVM method. b and W can be
eliminated from the equation by taking derivatives of L:
$$\begin{cases} \dfrac{dL}{dW} = 0 \;\Rightarrow\; W - \displaystyle\sum_i \alpha_i y_i x_i = 0 \;\Rightarrow\; W = \displaystyle\sum_i \alpha_i y_i x_i \\[3mm] \dfrac{dL}{db} = 0 \;\Rightarrow\; \displaystyle\sum_i \alpha_i y_i = 0 \end{cases} \qquad (6.13)$$
The dual problem is then obtained as follows by inserting the equivalents
of b and W:
$$\text{Maximize } L_D = -\frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j\, x_i^T x_j + \sum_i \alpha_i, \quad i = 1, \ldots, n \qquad (6.14)$$
$$\text{Subject to } \sum_i \alpha_i y_i = 0, \quad \alpha_i \ge 0 \qquad (6.15)$$
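To see where Eq. 6.14 comes from, substitute $W = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$ from Eq. 6.13 into Eq. 6.12:
$$L = \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j\, x_i^T x_j - \sum_i\sum_j \alpha_i \alpha_j y_i y_j\, x_i^T x_j - b\sum_i \alpha_i y_i + \sum_i \alpha_i = -\frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j\, x_i^T x_j + \sum_i \alpha_i,$$
which is exactly $L_D$.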
The decision space variables (X) can be mapped to a higher-dimensional space
using a function ϕ. Applying this transformation, the dual problem becomes
$$\text{Maximize } \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \, k\!\left(x_i, x_j\right)$$
$$\text{Subject to: } \alpha_i \ge 0, \;\; \forall i \in \{1, \ldots, n\}, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0 \qquad (6.16)$$
where
$$k\!\left(x_i, x_j\right) = \phi(x_i) \cdot \phi\!\left(x_j\right) \qquad (6.17)$$
is called the kernel function.
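For example, in two dimensions the mapping $\phi(x) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right)$ gives
$$\phi(x_i)\cdot\phi(x_j) = \left(x_i^T x_j\right)^2,$$
so a degree-2 polynomial kernel evaluates the inner product in the three-dimensional feature space without ever computing $\phi$ explicitly.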
The kernel functions commonly used in SVM’s formulations are:
Linear:
$$k\!\left(x_i, x_j\right) = x_i^T x_j \qquad (6.18)$$
Polynomial:
$$k\!\left(x_i, x_j\right) = \left(\gamma\, x_i^T x_j + r\right)^{d}, \quad \gamma > 0 \qquad (6.19)$$
Table 6.1 Data presented for Example 6.1

Class A             Class B
7      15           9.1    10.6
16.3   13.4         6.7    7.3
9.7    20.5         10.3   11.4
11     17.1         10.6   13.2
17.6   19.5         7.7    8.6
16.1   18.7         9.9    6.8
10.4   17.9         10.7   7.1
14.7   18.4         9.6    9.2
9.9    17.8         11.5   11.4
18.1   13.7         8.8    11.6
17.5   13.1         9.7    12.6
11.8   13.8         10.2   11.3
16.9   15.6         10.1   7.6
14.9   22.2         9.8    10
14.8   13.2         8.3    9.7
Radial basis function (RBF):
$$k\!\left(x_i, x_j\right) = \exp\!\left(-\gamma \left\| x_i - x_j \right\|^2\right), \quad \gamma > 0 \qquad (6.20)$$
Sigmoid:
$$k\!\left(x_i, x_j\right) = \tanh\!\left(\gamma\, x_i^T x_j + r\right) \qquad (6.21)$$
where γ, r, and d are kernel parameters.
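These kernels can be written directly as MATLAB anonymous functions; the following is a minimal sketch with illustrative parameter values (the function names are not taken from the original text):

% Eqs. 6.18-6.21 as anonymous functions; gamma, r, and d are kernel parameters.
gamma = 0.5; r = 1; d = 2;                       % illustrative values
k_lin  = @(xi,xj) xi'*xj;                        % linear, Eq. 6.18
k_poly = @(xi,xj) (gamma*(xi'*xj) + r)^d;        % polynomial, Eq. 6.19
k_rbf  = @(xi,xj) exp(-gamma*norm(xi - xj)^2);   % RBF, Eq. 6.20
k_sig  = @(xi,xj) tanh(gamma*(xi'*xj) + r);      % sigmoid, Eq. 6.21
k_rbf([7; 15], [9.1; 10.6])                      % e.g., two points of Table 6.1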
Example 6.1: Linear Classification by SVM
Classify the data of Table 6.1.
Solution
% x is assumed to be a 2-by-n matrix holding the observations of Table 6.1 and
% y a 1-by-n vector of class labels (+1 for Class A, -1 for Class B).
n=numel(y);
ClassA=find(y==1);
ClassB=find(y==-1);
%% Development of SVM
%% Parameter C
C=10;
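% Hessian of the dual problem (Eq. 6.14): H(i,j) = y(i)*y(j)*x(:,i)'*x(:,j).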
H=zeros(n,n);
for i=1:n
    for j=i:n
        H(i,j)=y(i)*y(j)*x(:,i)'*x(:,j);
        H(j,i)=H(i,j);
    end
end
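% quadprog minimizes 0.5*alpha'*H*alpha + f'*alpha subject to Aeq*alpha = beq
% and lb <= alpha <= ub; with f = -ones this matches the dual of Eq. 6.14,
% and the bounds 0 <= alpha <= C reflect the soft-margin formulation.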
f=-ones(n,1);
Aeq=y;
beq=0;
lb=zeros(n,1);
ub=C*ones(n,1);
Alg='interior-point-convex';
options=optimset('Algorithm',Alg,...
    'Display','off',...
    'MaxIter',20);
alpha=quadprog(H,f,[],[],Aeq,beq,lb,ub,[],options)';
AlmostZero=(abs(alpha)<max(abs(alpha))/1e5);
alpha(AlmostZero)=0;
S=find(alpha>0 & alpha<C);
w=0;
for i=S
    w=w+alpha(i)*y(i)*x(:,i);
end
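% For margin support vectors, y(i)*(w'*x(:,i)+b) = 1, i.e. b = y(i)-w'*x(:,i);
% averaging over the indices in S gives a stable estimate of b.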
b=mean(y(S)-w'*x(:,S));
%% Plot
Line=@(x1,x2) w(1)*x1+w(2)*x2+b;
LineA=@(x1,x2) w(1)*x1+w(2)*x2+b+1;
LineB=@(x1,x2) w(1)*x1+w(2)*x2+b-1;
(continued)