Analytica Chimica Acta 703 (2011) 152–162
Support vector machines in water quality management
Kunwar P. Singh a,∗, Nikita Basant b,1, Shikha Gupta a
a Environmental Chemistry Division, CSIR-Indian Institute of Toxicology Research (Council of Scientific & Industrial Research), Post Box 80, Mahatma Gandhi Marg, Lucknow 226001, India
b Laboratory of Chemometrics and Drug Design, School of Pharmacy, Howard University, Washington, DC, USA
Article info
Article history:
Received 8 May 2011
Received in revised form 11 July 2011
Accepted 16 July 2011
Available online 23 July 2011
Keywords:
Support vector classification
Support vector regression
Kernel discriminant analysis
Kernel partial least squares
Water quality
Biochemical oxygen demand
Abstract
Support vector classification (SVC) and regression (SVR) models were constructed and applied to the
surface water quality data to optimize the monitoring program. The data set comprised 1500 water
samples representing 10 different sites monitored for 15 years. The objectives of the study were to classify
the sampling sites (spatial) and months (temporal) to group the similar ones in terms of water quality
with a view to reduce their number; and to develop a suitable SVR model for predicting the biochemical
oxygen demand (BOD) of water using a set of variables. The spatial and temporal SVC models rendered
grouping of 10 monitoring sites and 12 sampling months into the clusters of 3 each with misclassification
rates of 12.39% and 17.61% in training, 17.70% and 26.38% in validation, and 14.86% and 31.41% in test
sets, respectively. The SVR model predicted water BOD values in training, validation, and test sets with
reasonably high correlation (0.952, 0.909, and 0.907) with the measured values, and low root mean
squared errors of 1.53, 1.44, and 1.32, respectively. The values of the performance criteria parameters
suggested the adequacy of the constructed models and their good predictive capabilities. The SVC model achieved a data reduction of 92.5% for redesigning the future monitoring program and the SVR model provided a tool for the prediction of the water BOD using a set of a few measurable variables. The
performance of the nonlinear models (SVM, KDA, KPLS) was comparable and these performed relatively
better than the corresponding linear methods (DA, PLS) of classification and regression modeling.
1. Introduction
Surface water bodies in general, and rivers in particular, are
among the most vulnerable aquatic systems to contamination due
to their easy accessibility for the disposal of wastes. The hydrochemistry of the river systems is largely determined by a number
of factors, such as climatic conditions, soil-rock types and anthropogenic activities in the basin [1,2]. In India, almost all the surface
water bodies, including the streams and the river systems, are
grossly polluted, and the regulatory agencies have been making every effort to restore the health and ecology of these aquatic ecosystems. Consequently, a number of monitoring programs have been
initiated for generating baseline databases pertaining to different
lakes and river systems throughout the country with a view to
develop and implement appropriate pollution prevention, control
and restoration strategies [3]. Water quality monitoring programs
for a large number of water bodies at several strategic sites for
various characteristic variables are very expensive, time- and manpower-intensive and, hence, difficult to sustain over longer periods.
Moreover, the determination of the biochemical oxygen demand
(BOD) is very tedious and cost-intensive. BOD measures an approximate amount of bio-degradable organic matter present in the water
and serves as an indicator parameter for the extent of water pollution. BOD of an aquatic system is the foremost parameter needed
for assessment of the water quality as well as for development of
the management strategies for the protection of water resources
[4]. This warrants the need for a foolproof method for its determination. The currently available method for BOD determination is tedious and prone to measurement errors. The method is subject to various
complicating factors, such as the oxygen demand resulting from
the respiration of algae in the sample and the possible oxidation of
ammonia. Presence of toxic substances in sample may also affect
microbial activity leading to a reduction in the measured BOD value
[5]. The laboratory conditions for BOD determination usually differ from those in aquatic systems. Therefore, interpretation of BOD
results and their implications may be associated with large variations. Moreover, it is determined over a period of 5 days at a
constant temperature (20 ◦ C) maintained throughout the duration,
which is difficult to achieve in developing countries due to frequent power interruptions. Overall, the BOD measurement result is
associated with large uncertainties, thus making its estimate
largely unreliable [4,6]. Therefore, there is a need to modify and
redefine the water quality monitoring programs with a view to
minimize the number of the sampling sites and sampling frequencies under various networks, but still generating meaningful data,
without compromising the quality and interpretability [3]. In
addition, it is economically justified to develop methods which are
capable of estimating the tedious parameter, such as BOD based
on the knowledge of other, already identified variables. Pattern
classification method may help to classify the sites and months
on the basis of similarity and dissimilarity in terms of the observed
water quality, thus in grouping the identical water quality sites and
months under a monitoring program. This may provide guidelines
to select single representative sampling site and month from each
of the cluster. Similarly, regression modeling would allow developing relationship between the independent and dependent variables
and in predicting the difficult variable(s) using the simpler ones as
the input.
Several linear (discriminant analysis, partial least squares) and
nonlinear (artificial neural networks, kernel discriminant analysis,
kernel partial least squares, support vector machines) modeling
methods are now available for the classification and regression
problems [3,4,7–11]. The linear discriminant analysis (DA) and
partial least-squares (PLS) regression capture only linear relationships, and the artificial neural networks (ANNs) have some problems inherent to their architecture, such as overtraining, overfitting, network optimization, and reproducibility of the results, due
to random initialization of the networks and variation of stopping
criteria [11]. Kernel-based techniques are becoming more popular,
because in contrast to ANNs, they allow interpretation of the calibration models. In kernel-based methods the calibration is carried
out in space of nonlinearly transformed input data, the so-called
feature space, without actually carrying out the transformation. The
feature space is defined by the kernel function [12]. The kernel versions of the discriminant analysis (KDA) and partial least squares
(KPLS) algorithms have been described in literature [8–10,13].
The support vector machine (SVM), essentially a kernel-based procedure, is a relatively new machine learning method based on Vapnik–Chervonenkis (VC) theory [14] that has recently emerged as
one of the leading techniques for pattern classification and function
approximation [15]. SVM can simultaneously minimize estimation
errors and model dimensions. It has good generalization ability and
is less prone to over-fitting. SVMs adopt kernel functions which
make the original inputs linearly separable in mapped high dimensional feature space [16]. SVMs have successfully been applied for
the classification and regression problems in various research fields
[17–22].
Here, we have considered a large data set pertaining to the
surface water (Gomti river, India) quality monitored for thirteen
different variables each month over a period of fifteen years at
ten different sites spread over a distance of about 500 km. The
objectives of this study were to develop the SVM models for (1) classification of the sampling sites with a view to identify similar ones
in the monitoring network for reducing their number for the future
water quality monitoring; (2) classification of the sampling months
into the groups of seasons for reducing the annual sampling frequency; and (3) to predict the BOD of the river water using simple
measurable water quality variables. Further, the SVM classification
(temporal and spatial) and SVM regression modeling results were
compared with those obtained through the linear (DA, PLS) and
the corresponding nonlinear kernel-based (KDA, KPLS) modeling
approaches. This study has shown that the application of SVM classification and regression methods to the large water quality data
set successfully helped in optimization of the water quality monitoring program through minimizing the number of the sampling
sites, frequency, and parameters.
2. Materials and methods
2.1. Problem statement
For a given set of water samples (W), repeatedly collected over
a period of time (monthly) from a certain number of monitoring
sites spread within a geographical area, and characterized for a set
of properties (N), the task is to develop a function using the known
set of properties which could group the sampling sites as well as
the sampling months on the basis of similarities and dissimilarities.
This would help to reduce the sampling sites as well as the sampling
frequency for the future monitoring program.
Consider a set of water samples, W representing the spatial
and temporal water quality in terms of the measured water quality variables (N), such that; W = {x|x ∈ RN }. Each water sample is
represented as N-dimensional real vector and every particular
coordinate xi represents a value of some measured water quality
variable. Dimension N is equal to the number of measured variables used to describe each water sample. Let C = {c1, c2, . . ., cI} be
the set of I classes that correspond to some predefined water type
such as spatial or temporal. The function fc : W → C is called classification if for each xi ∈ W it holds that fc (xi ) = cj if xi belongs to the
class cj . In real situations, a limited set of labeled water samples (xi ,
yi), xi ∈ RN, yi ∈ C, i = 1, . . ., m, are available through the monitoring
studies, which form the training set for the classification problem
[17]. The machine learning approach using the labeled cases (water
samples) from the training set aims to find the function f̄c which
could be a good approximation of the real, unknown function fc .
Another relevant task is to estimate the value of an unknown
property for a particular water sample using any other set of properties available and known for the same sample. Consider a training set with m water samples (xi, yi), xi ∈ RN, yi ∈ R, i = 1, . . ., m, where yi is the known real value of the target property that one aims to estimate for water samples in W not included in the training set. The
function fr : W → R is called a regression if it estimates the value of
the target property given the values of other properties for any random sample x ∈ W [17]. Here, the machine learning approach aims
to find f̄r using the training cases and a specific learning method.
The basic aim of this study is to find the most accurate possible
classification f̄c (spatial and temporal) and regression f̄r from the
training data pertaining to the surface water quality monitored over
a long period of time (1995–2010) at ten different sites and using
the SVMs.
2.2. Brief theory of support vector machines
A detailed description of SVMs may be found elsewhere
[14,23,24], however, a brief account is given here. SVMs originally
developed for binary classification problems make use of the hyperplanes to define decision boundaries between the data points of
different classes [25]. Now, with the introduction of the ε-insensitive loss function, SVM has been extended to solve regression problems [15]. In the SVM approach, the original data points from the
input space are mapped into a high dimensional or even infinite dimensional feature space using a suitable kernel function,
where a maximal separating plane (SP) is constructed. Two parallel hyper-planes are constructed, one on each side of the SP that
separates the data. The SP maximizes the distance between the
two parallel hyper-planes. As a special property, SVM simultaneously minimizes the empirical classification error and maximizes
the geometric margin. It is assumed that larger the margin or distance between these parallel hyper-planes, the better will be the
generalization error of the classifier. In case of the SVM regression,
the goal of SVM is to find the optimal hyper-plane, from which the
distance to all the data points is minimum [15,16,26].
2.2.1. SVM classification (SVC)
SVM method deals with the binary classification model which
assumes that there are just two classes (C = {c1 , c2 }) and an object
(say water sample) belongs to one of these only.
For a set of N data vectors {xi , yi }, i = 1, . . ., N, yi ∈ {−1, 1}, xi ∈ Rn ,
where xi is the ith data vector that belongs to a binary class yi , if
the data are linearly separable, there exists a SP in the input space,
which can be expressed [16] as:

f(x) = w^T x + b = Σ_{j=1}^{n} wj xj + b = 0    (1)

where w ∈ Rn is a weight vector, b, the "bias" term, is a scalar, and T stands for the transpose operator. The parameters w and b define the location of the SP and are determined during the training process. In SVM, the training data points satisfying the constraints that f(xi) = 1 if yi = 1 and f(xi) = −1 if yi = −1 are called support vectors (SVs). Other training data points satisfy the inequalities that f(xi) > 1 if yi = 1 and f(xi) < −1 if yi = −1 [27]. Therefore, a complete form of the constraints for all training data can be expressed [16] as:

yi f(xi) = yi(w^T xi + b) ≥ 1,  for i = 1, 2, . . ., N    (2)

The distance between the planes crossing the points denoting the SVs (w^T x + b = −1 and w^T x + b = 1) is called the margin and it can be expressed as 2/||w||. The SP, the plane passing through the middle of these two planes (w^T x + b = 0), provides the largest margin value. The optimum SP may be achieved through maximization of the margin and minimization of the noise by introducing the slack variables ξi, as [16,27]:

min (1/2)||w||² + C Σ_{i=1}^{N} ξi    (3)

subject to yi(w^T xi + b) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, 2, . . ., N

where C is a positive constant and ξi represents the distance between the data point xi lying on the false class side and the margin of its virtual class. The optimization problem can be solved by introducing the Lagrange multipliers αi and βi as [28]:

L(w, b, ξ, α, β) = (1/2)||w||² + C Σ_{i=1}^{N} ξi − Σ_{i=1}^{N} αi(yi(w^T xi + b) − 1 + ξi) − Σ_{i=1}^{N} βi ξi    (4)

where α = (α1, . . ., αN)^T, β = (β1, . . ., βN)^T, and ξ = (ξ1, . . ., ξN)^T.

An optimal solution of Eq. (4) may be achieved by setting the derivatives of the Lagrange function with respect to w, b and ξ to zero, which yields:

L(α) = Σ_{i=1}^{N} αi − (1/2) Σ_{i,j=1}^{N} αi αj yi yj xi^T xj    (5)

under the constraints:

Σ_{i=1}^{N} yi αi = 0,  and  C ≥ αi ≥ 0,  i = 1, . . ., N    (6)

αi (yi(w^T xi + b) − 1) = 0,  i = 1, . . ., N    (7)

The solution yields the coefficients αi which are required to express w. Thus, w = Σ_{i=1}^{N} αi yi xi, and

b = (1/Nsv) Σ_{j=1}^{Nsv} (yj − w^T xj)    (8)

where Nsv is the number of SVs. From w and b, the linear decision function can be expressed as:

f(x) = sign( Σ_{i=1}^{N} αi yi (xi^T x) + b )    (9)

Here, if f(x) is positive, the new input data point x belongs to class 1 (yi = 1) and if f(x) is negative, x belongs to class 2 (yi = −1).

However, in case of nonlinear separation the original input data are projected into a high-dimensional feature space, ϕ: Rn → Rd, n ≪ d, i.e. x → ϕ(x), in which the input data can be linearly separated. In such a space, the dot product from Eq. (9) is transformed into ϕ(xi)·ϕ(x), and the non-linear function can be expressed as:

f(x) = sign( Σ_{i=1}^{N} αi yi (ϕ^T(xi)·ϕ(x)) + b )    (10)

A number of kernel functions are available for which it holds K(xi, xj) = ϕ^T(xi)·ϕ(xj); the kernel returns the dot product of the feature-space mappings of the original inputs. The non-linear decision function can then be written as:

f(x) = sign( Σ_{i=1}^{N} αi yi K(xi, x) + b )    (11)

The selection of the kernel function depends on the distribution of the data. However, the kernel function is generally selected through a "trial and error" test [29]. There are four common choices of kernel function: linear, polynomial, sigmoid, and radial basis function (RBF). The RBF is among the most common kernel functions employed in most applications and has the form [30]:

K(x, xi) = exp(−γ ||x − xi||²)    (12)

where the parameter γ controls the smoothness of the decision boundary in the feature space.

Here, we applied the SVM classification model with the RBF kernel to the surface water quality data with a view to achieve the spatial and temporal classifications.

2.2.2. SVM regression (SVR)
For a given regression problem, the goal of SVM is to find the optimal hyper-plane, from which the distance to all the data points is minimum. Consider a training data set (xi, yi), xi ∈ Rn, yi ∈ R, i = 1, . . ., m, where yi denotes the target property of an already known ith case. During the learning phase, the aim is to find the linear function f(x) = wx + b, w, x ∈ Rn, b ∈ R, for which the difference between the actual measured value yi and the estimated value f(xi) is at most equal to ε, i.e. |yi − f(xi)| ≤ ε, where ε defines the insensitive loss function [15,31].

If the error of estimation is taken into account by introducing the slack variables ξi and ξi*, as well as the penalty parameter C, the corresponding problem can be expressed [32,33] as:

min (1/2)||w||² + C Σ_{i=1}^{N} (ξi + ξi*)    (13)

subject to (yi − w^T xi − b ≤ ε + ξi,  w^T xi + b − yi ≤ ε + ξi*;  ξi, ξi* ≥ 0)
Transforming this quadratic programming problem to its corresponding dual optimization problem and introducing the kernel
function in order to achieve the non-linearity, yields the optimal regression function [31,32] as:

f(x) = Σ_{i=1}^{N} (αi − αi*) K(xi, x) + b    (14)

where αi and αi* (with 0 ≤ αi, αi* ≤ C) are the Lagrange multipliers and K(xi, x) represents the kernel function. In Eq. (14), the data points with nonzero αi and αi* values are the SVs.
The performance of SVMs for classification and regression
depends on the combination of several factors, such as the kernel
function type and its corresponding parameters, capacity parameter, C, and -insensitive loss function [15]. For the RBF kernel, the
most important parameter is the width of the RBF, which controls
the amplitude of the kernel function and, hence, the generalization
ability of SVM [34]. C is a regularization parameter that controls
the trade-off between maximizing the margin and minimizing the
training error. If C is too small, then insufficient stress will be placed
on fitting the training data. If C is too large, then the algorithm will
over fit the training data [35].
Here, we applied the SVM regression model with RBF kernel
function to the surface water quality data with a view to predict the
target variable, BOD, of the water samples using a set of independent water quality variables.
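As a concrete illustration of Eqs. (11), (12) and (14), the following minimal sketch (not the authors' code) sets up an RBF-kernel SVC and SVR with scikit-learn. The arrays X, y_class and y_bod are hypothetical stand-ins for the selected water quality variables, the class labels and the measured BOD; the C, γ and ε values are those quoted later in Section 3, reused here purely for illustration.

```python
# Minimal sketch (not the authors' code): RBF-kernel SVC and SVR with scikit-learn.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))             # 7 predictors (pH, T-Alk, Cl, PO4, COD, DO, ...)
y_class = rng.integers(0, 3, size=200)    # 3 classes (e.g. LP / GP / MP sites)
y_bod = rng.normal(6.0, 4.0, size=200)    # target property (BOD, mg L-1)

# Classification: decision function sign(sum_i alpha_i y_i K(x_i, x) + b), RBF kernel
svc = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=12.88, gamma=7.74))
svc.fit(X, y_class)

# Regression: f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b, epsilon-insensitive loss
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=51.08, epsilon=0.001, gamma=0.995))
svr.fit(X, y_bod)

print(svc.predict(X[:5]), svr.predict(X[:5]))
```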
2.3. Data set
The data set used here pertains to the surface water quality monitored each month at eight different sites during the first
ten years (1995–2005) and at ten different sites during the last
five years (2005–2010). The first three sites (S1–S3) are located
in the area of relatively low pollution (LP) and are upstream of
the Lucknow city. Other three sites (S4–S6) are in the region of
gross pollution (GP) as there are 26 wastewater drains and
two highly polluted tributaries emptying into the river in this stretch. The last four sites (S7–S10) are in the region of moderate pollution (MP) as the river recovers considerably along this
course [4]. The water quality parameters measured include the
water pH, total alkalinity (T-Alk, mg L−1 ), total hardness (T-Hard,
mg L−1 ), total solids (TS, mg L−1 ), ammonical nitrogen (NH4 –N,
mg L−1 ), nitrate nitrogen (NO3 –N, mg L−1 ), chloride (Cl, mg L−1 ),
phosphate (PO4 , mg L−1 ), potassium (K, mg L−1 ), sodium (Na,
mg L−1 ), dissolved oxygen (DO, mg L−1 ), chemical oxygen demand
(COD, mg L−1 ), and biochemical oxygen demand (BOD, mg L−1 ).
Details on analytical procedures are available elsewhere [3]. The
water quality data set has the dimension of 1500 samples × 13
variables.
2.4. Data processing
The data pertaining to the surface water quality monitored over
a long duration at sampling sites spread over a large geographical area may be contaminated by human and measurement errors.
Such erroneous data may behave as an outlier. An outlier is a
data point that is numerically distant from the majority of data
[16,36]. Using such a contaminated data set directly for pattern classification and predictive modeling will lead to unreliable results;
hence identifying outliers and conducting data cleaning become
important [16]. Moreover, some of the features in original data
set may have insignificant or no relevance with the response variable rendering these useless in pattern classification and predictive
modeling; hence implementing initial feature selection is also necessary [26,36]. Therefore, the removal of outliers and initial features
selection were implemented here.
The data were partitioned into three subsets: training, validation, and test, using the Kennard–Stone (K–S) approach. The K–S
algorithm designs the model set in such a way that the objects are
scattered uniformly around the training domain. Thus, all sources
of the data variance are included into the training model [37,38]. In
the present study, the complete data set (1500 samples × 13 variables) was partitioned as training set (900 samples × 13 variables);
and validation and test sets of 300 samples × 13 variables each. The training, validation and test sets thus comprised 60%, 20% and 20% of the samples, respectively.
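The K–S selection can be sketched as follows; this is an illustrative implementation assuming Euclidean distances on the measured variables, not the authors' code.

```python
# Minimal sketch of Kennard-Stone sample selection (illustrative implementation).
import numpy as np

def kennard_stone(X, n_select):
    """Return indices of n_select samples spread uniformly over the data domain."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    selected = list(np.unravel_index(np.argmax(dist), dist.shape))  # two most distant points
    remaining = [i for i in range(len(X)) if i not in selected]
    while len(selected) < n_select:
        # add the sample whose minimum distance to the selected set is largest
        d_min = dist[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining.pop(int(np.argmax(d_min))))
    return selected

X = np.random.rand(150, 7)              # hypothetical samples x variables
train_idx = kennard_stone(X, 90)        # ~60% of the samples for the training set
```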
2.4.1. Data cleaning
Outliers in training or validation sub-set affect the classification
as well as regression modeling results. The outliers in the training set may cause the SP to be falsely determined, which leads to the misclassification of validation data, while the outliers in the validation set are themselves misclassified, as they are located on the opposite class
side [16]. Similarly, in case of regression, the outliers may cause
higher prediction errors. Since the outliers are physically located far away from the normal data, they will rarely be classified or predicted correctly. Therefore, the outliers may be detected by implementing the classification or regression process several times with changed training and validation sets and recording the misclassified or mis-predicted cases for every single run. Data misclassified or mis-predicted most of the time may be taken as candidate outliers
[16].
Here, we repeated the classification and regression processes
several times with changed training and validation data sets and
identified the outliers which were misrepresented most of the
times. The final outliers were determined based on the removal
impacts of the candidate outliers on misclassification rate (MR) and
mean squared error (MSE) values. The data points whose removal
results in the reduction of MR and MSE values are treated as final
outliers. The final outliers were permanently removed from the
data set before performing the classification and regression modeling. A total of 53 samples were identified as the outliers and were
permanently removed from the final data set.
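A minimal sketch of this repeated-split screening is given below, assuming an RBF-kernel SVC as the classifier and scikit-learn's ShuffleSplit for the changed training/validation sets; the 80% flagging threshold and 25% validation fraction are illustrative assumptions, not values taken from the study.

```python
# Minimal sketch of repeated-split outlier screening (illustrative assumptions noted above).
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.svm import SVC

def candidate_outliers(X, y, n_runs=20, flag_fraction=0.8):
    miscounts = np.zeros(len(X))        # how often each sample is misclassified
    seen = np.zeros(len(X))             # how often each sample falls in the validation part
    for train, valid in ShuffleSplit(n_splits=n_runs, test_size=0.25,
                                     random_state=1).split(X):
        model = SVC(kernel="rbf").fit(X[train], y[train])
        miscounts[valid] += (model.predict(X[valid]) != y[valid])
        seen[valid] += 1
    rate = np.divide(miscounts, seen, out=np.zeros_like(miscounts), where=seen > 0)
    return np.where(rate >= flag_fraction)[0]   # indices of candidate outliers
```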
2.4.2. Initial feature selection
Removal of the irrelevant features from the data set is required
prior to the classification or regression modeling. Presence of
such variables would lead to misfit of the models, thus losing
the predictive ability [36]. Here, the initial feature selection was
performed using the multiple linear regression (MLR) method. Variables exhibiting a significant relationship with the target variable were retained while the others were dropped. The insignificant variables were then dropped one by one, and the misclassification rate (MR) and
prediction error in regression were recorded. Finally the pH, T-alk,
Cl, PO4 , COD, DO and BOD were retained for the purpose of modeling
of the water quality data set (Table 1).
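A minimal sketch of such an MLR-based screening is given below, assuming backward elimination on ordinary least-squares p-values with statsmodels; the 0.05 cut-off is an illustrative assumption, not a value taken from the study. X_df is a pandas DataFrame of candidate predictors and y the target variable.

```python
# Minimal sketch of MLR-based feature screening by backward elimination (assumption:
# statsmodels OLS p-values with an illustrative 0.05 significance cut-off).
import statsmodels.api as sm

def backward_elimination(X_df, y, alpha=0.05):
    features = list(X_df.columns)
    while features:
        model = sm.OLS(y, sm.add_constant(X_df[features])).fit()
        pvals = model.pvalues.drop("const")
        worst = pvals.idxmax()                  # least significant remaining predictor
        if pvals[worst] <= alpha:               # all remaining predictors significant
            break
        features.remove(worst)                  # drop it and refit
    return features
```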
Table 1
Basic statistics of the selected measured water quality variables in surface water, India (n = 1500).

Variable  Unit      Min     Max      Median   Mean     SD a     CV b
pH        –         6.02    9.03     8.27     8.23     0.34     4.14
T-Alk     mg L−1    42.67   366.67   216.7    204.55   53.91    26.36
Cl        mg L−1    0.21    53.94    8.33     9.75     6.34     65.01
PO4       mg L−1    0.00    9.90     0.26     0.51     0.81     159.21
COD       mg L−1    1.20    57.39    14.4     16.76    8.72     52.06
BOD       mg L−1    0.12    31.67    4.17     6.11     4.71     77.09
DO        mg L−1    0.00    9.97     6.37     5.72     2.52     44.12

a Standard deviation.
b Coefficient of variation.
Table 2a
Classification matrix for spatial classification of surface water by DA, KDA and SVC models.

Training set
Model  Actual class  Total cases  Predicted LP  Predicted GP  Predicted MP  Correct assignations  % Mis-classified  Sensitivity (%)  Specificity (%)  Accuracy (%)
DA     LP      291   246   5     40    246   15.46   73.65   91.49   84.59
DA     GP      304   15    261   28    261   14.14   96.67   92.75   93.97
DA     MP      268   73    4     191   191   28.73   73.75   87.25   83.20
DA     Total   863   –     –     –     –     19.11   –       –       –
KDA    LP      291   265   2     25    265   9.28    91.40   88.81   89.68
KDA    GP      304   16    271   17    271   10.86   89.14   98.75   95.37
KDA    MP      268   48    5     215   215   19.78   80.22   92.95   89.00
KDA    Total   863   –     –     –     –     13.30   –       –       –
SVC    LP      291   252   5     34    252   13.40   85.13   93.12   90.38
SVC    GP      304   5     287   12    287   5.59    94.40   96.95   96.06
SVC    MP      268   39    12    217   217   19.03   82.50   91.50   88.76
SVC    Total   863   –     –     –     –     12.39   –       –       –

Validation set
DA     LP      97    84    1     12    84    13.40   62.22   91.50   77.78
DA     GP      101   16    81    4     81    19.80   96.43   90.20   92.01
DA     MP      90    35    2     53    53    41.11   76.81   83.11   81.60
DA     Total   288   –     –     –     –     24.30   –       –       –
KDA    LP      97    84    3     13    84    14.43   84.00   84.74   84.48
KDA    GP      101   8     83    10    83    17.82   82.18   97.88   92.41
KDA    MP      90    21    2     67    67    25.56   74.44   88.56   84.19
KDA    Total   288   –     –     –     –     19.27   –       –       –
SVC    LP      97    88    1     8     88    9.27    73.94   94.67   86.11
SVC    GP      101   7     87    7     87    13.86   94.56   92.85   93.40
SVC    MP      90    24    4     62    62    31.11   80.51   86.72   85.06
SVC    Total   288   –     –     –     –     17.70   –       –       –

Test set
DA     LP      134   123   0     11    123   8.20    86.27   92.31   86.15
DA     GP      67    6     54    7     54    19.40   96.43   94.58   94.93
DA     MP      95    24    2     69    69    27.36   79.31   87.56   85.14
DA     Total   296   –     –     –     –     16.89   –       –       –
KDA    LP      134   116   4     14    116   13.43   86.57   88.27   87.50
KDA    GP      67    6     58    3     58    13.43   86.67   97.38   94.93
KDA    MP      95    13    2     80    80    15.79   84.21   91.54   89.19
KDA    Total   296   –     –     –     –     14.22   –       –       –
SVC    LP      134   119   0     15    119   11.19   83.21   90.19   86.82
SVC    GP      67    4     59    4     59    11.94   98.33   96.61   96.95
SVC    MP      95    20    1     74    74    22.10   79.56   89.65   86.48
SVC    Total   296   –     –     –     –     14.86   –       –       –
2.5. Modeling and performance criteria
The main aim of this study was to build SVM models for the
classification and regression problems pertaining to the surface
water quality with a view to achieve the reduction in number of
the sampling sites and sampling frequency for minimization of the
monitoring efforts; as well as to develop a tool for the prediction of
the water BOD levels using simple and directly measurable water
quality parameters as the input.
2.5.1. Optimization of SVM parameters
Similar to other multivariate calibration methods, the generalization performance of SVM models depends on a proper setting
of several parameters. These include the capacity parameter C, the ε-insensitive loss function ε, and the kernel-function-dependent parameter (γ) in SVM classification and regression models [39]. RBF is the most commonly used kernel in SVM, and the RBF width parameter (γ) reflects the distribution/range of x-values of the training data [32]. The parameter C determines the trade-off between the smoothness of the regression function and the amount up to which deviations larger than ε are tolerated. Since the highest αi and αi* values are by definition equal to C, the robustness of the regression model depends on the choice of the latter. Therefore, the choice of the C value influences the significance of the individual data points in the training set [40]. Hence, a proper choice of C in combination with ε might result in a well performing and robust regression
model, which is also insensitive to the presence of possible outliers
[39]. Here, the optimum value of C was determined through grid
search over a space of 0.01–50,000 with step size of 10-1.
The parameter ε regulates the radius of the ε tube around the
regression function and thus, the number of SVs that finally will
be selected to construct the regression function (leading to a sparse solution). A too large value of ε results in fewer SVs (more data points will fit in the ε tube) and, consequently, the resulting regression model may yield large prediction errors on unseen future data.
Since, ε is related to the amplitude of the noise present in the training set, and the exact contribution of the noise to real information
in a data set is usually unknown, ε was optimized in the range
of 0.001 and 0.2 [39]. A good combination of the two parameters
(C and ε) also prevents overtraining. To achieve this, an internal
cross-validation during construction of SVR models was performed.
The kernel function is used to map the input data into a high
dimensional feature space which is required to transform the nonlinear input space to a high-dimensional feature space where linear
regression is possible [41]. The mapping depends on the intrinsic
structure of the data, implying that the kernel type and parameters
need be optimized to approximate the ideal mapping [23,24]. In this
work, RBF kernel was used. Unlike the linear kernel, the RBF kernel can handle the case when the relation between class labels and
attributes is nonlinear. Besides, the linear kernel is a special case of
the RBF [42]. The RBF kernel has fewer tuning parameters than the
polynomial and sigmoid kernels [43], and it tends to give good per-
formance under general smoothness assumptions [44]. The γ value is important in the RBF model and can lead to under- or over-fitting in prediction. A very large value of γ may lead to over-fitting, as all the support vector distances are taken into account, while in case of a very small γ, the machine will ignore most of the support vectors, leading to failure in the trained point prediction (under-fitting) [45]. Here, the optimum value of the RBF kernel function parameter (γ)
was determined through the grid search over the space 0.001–20.
Since the model parameters exhibit an interaction, they need to be optimized simultaneously. Here, the parameters were optimized
using the grid and pattern searches over a wide space employing the v-fold cross-validation [46]. The grid search takes samples
from the space of the independent variables. For each sample, the
model prediction is computed and compared with the best value
found from the previous iterations. If the newly found value is
better than the previous one, the new results are stored. This process is repeated until the end of the iterations is reached. The grid
search technique is an unguided algorithm based on brute computing
power. Hence, it can be computationally more expensive than other
optimization techniques. The accuracy of grid search optimization
depends on the parameter range in combination with the chosen
interval size. A higher accuracy in optimal solution may be achieved
through increasing the parameter range and decreasing the step
size [35]. Optimized values of the parameters searched over large
ranges were repeatedly optimized over closer ranges with smaller
step sizes to achieve finely tuned values. In v-fold cross-validation,
the data in training set are divided into v subsets of equal size.
Subsequently one subset is tested using the model trained on the
remaining v-1 subsets. Thus, each instance of the whole training
set is predicted once. The cross-validation procedure prevents the
over-fitting problem [46].
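A minimal sketch of this combined grid search and v-fold cross-validation is given below, using scikit-learn's GridSearchCV for an RBF-kernel SVR. The grid bounds echo the parameter ranges quoted in the text, but the exact grid points and v = 10 are illustrative assumptions; a finer grid would normally be run around the best coarse point.

```python
# Minimal sketch of a coarse grid search over C, gamma and epsilon scored by
# v-fold cross-validation (illustrative grids; not the authors' exact settings).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {
    "C": np.logspace(-2, 4, 7),                 # coarse pass over 0.01 ... 10,000
    "gamma": np.logspace(-3, 1, 5),             # 0.001 ... 10
    "epsilon": [0.001, 0.01, 0.05, 0.1, 0.2],   # range 0.001 ... 0.2 as in the text
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                      cv=10, scoring="neg_mean_squared_error")
# search.fit(X_train, y_train)                  # X_train, y_train: training sub-set
# best = search.best_params_                    # refine with a finer grid around this point
```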
2.6. Linear and kernel discriminant analysis
Linear DA finds a linear projection of the data onto a onedimensional subspace such that the classes are well separated
according to a certain measure of separability. Since DA constructs a linear function, it is beyond its capability to separate data
with nonlinear structures. Such a problem of the nonlinearities in
the data may be overcome by use of the kernel trick in KDA, which
maps the input data into the kernel feature space, yielding a nonlinear discriminant function in the input space. The procedural details
of the DA and KDA techniques are available elsewhere [4,7,9,10].
Here, the DA and KDA were performed for the spatial and temporal classifications using the training, validation and test sets of the
water quality data as employed for the SVC modeling.
2.7. Linear and kernel partial least squares regression
PLS, a linear multivariate method for relating the process variables (X) with the responses (Y) can analyze strongly collinear
and noisy data. It maximizes the covariance between X and Y.
PLS reduces the dimension of the predictor variables by extracting latent variables (LVs). In PLS, the scaled X and Y matrices are
decomposed into their corresponding scores vectors and loadings
vectors. In an inner relation, the scores vector of the predictor is
linearly regressed against the scores vector of the dependent variables. However, it is crucial to determine the optimal number of the
LVs, and cross-validation is a reliable approach to test the predictive
significance of the selected model [13,47].
The KPLS is a nonlinear extension of linear PLS in which training samples are transformed into a feature space via a nonlinear
mapping through the kernel trick, and the PLS algorithm is then
implemented in the feature space. Nonlinear data structure in the
original space is most likely to be linear after high-dimensional
nonlinear mapping. Therefore, KPLS can efficiently compute LVs
in the feature space by means of integral operators and nonlinear
kernel function. The procedural details of PLS and KPLS regression
methods are available elsewhere [9,13,37,48].
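As a rough illustration of the kernel trick shared by KDA and KPLS, the sketch below approximates both by first mapping the inputs with an explicit RBF feature map (scikit-learn's Nystroem approximation) and then running ordinary LDA and PLS in that feature space. This mirrors the idea conceptually but is not the exact KDA/KPLS algorithm of refs. [8–10,13]; γ = 0.9 and five latent variables echo the KPLS settings reported in Section 3.4, while n_components = 200 is an assumption.

```python
# Rough approximation of KDA and KPLS via an explicit RBF feature map (Nystroem)
# followed by linear DA / PLS; not the exact kernel algorithms of refs. [8-10,13].
from sklearn.cross_decomposition import PLSRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline

kda_like = make_pipeline(Nystroem(kernel="rbf", gamma=0.9, n_components=200),
                         LinearDiscriminantAnalysis())
kpls_like = make_pipeline(Nystroem(kernel="rbf", gamma=0.9, n_components=200),
                          PLSRegression(n_components=5))
# kda_like.fit(X_train, class_labels); kpls_like.fit(X_train, bod_values)
```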
2.8. Model performance criteria
The performance of the classification models (DA, KDA, and SVC)
was assessed in terms of the misclassification rate (MR), sensitivity,
specificity and accuracy of prediction, computed as below [22]:
Sensitivity = TP / (TP + FN)    (15)

Specificity = TN / (TN + FP)    (16)

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (17)
where TP denotes the number of true positive, FP is the number
of false positive, FN is the number of false negative, and TN is the
number of true negative.
The performance of the regression (PLS, KPLS and SVR) models was evaluated in terms of the mean square error (MSE), root
mean square error (RMSE), bias, correlation coefficient (R), accuracy factor (Af ), and the Nash–Sutcliffe coefficient of efficiency (Ef )
computed for all the three (training, validation and test) sets. The
mean square error (MSE), used as the target error goal, is defined
as [49]:
MSE = (1/N) Σ_{i=1}^{N} (ypred,i − ymeas,i)²    (18)
where ypred,i and ymeas,i represent the predicted and measured values of the ith variable, and N represents the number of observations.
Table 2b
Classification error in spatial classification of surface water by DA, KDA and SVC models.

Training set
Model  Actual class  Total cases  Accuracy error (%)  Sensitivity error (%)  Specificity error (%)
DA     LP    291   15.41   26.35   8.51
DA     GP    304   6.03    3.33    7.25
DA     MP    268   16.8    26.25   12.75
KDA    LP    291   10.32   8.60    11.19
KDA    GP    304   4.63    10.86   1.25
KDA    MP    268   11.0    19.78   7.05
SVC    LP    291   9.62    14.87   6.88
SVC    GP    304   3.94    5.60    3.05
SVC    MP    268   11.24   17.50   8.50

Validation set
DA     LP    97    22.22   37.78   8.50
DA     GP    101   7.99    3.57    9.80
DA     MP    90    18.4    23.19   16.89
KDA    LP    97    15.52   16.00   15.26
KDA    GP    101   7.59    17.82   2.12
KDA    MP    90    15.81   25.56   11.44
SVC    LP    97    13.89   26.06   5.33
SVC    GP    101   6.60    5.44    7.15
SVC    MP    90    14.94   19.49   13.28

Test set
DA     LP    134   13.85   13.73   7.69
DA     GP    67    5.07    3.57    5.42
DA     MP    95    14.86   20.69   12.44
KDA    LP    134   12.50   13.43   11.73
KDA    GP    67    5.07    13.43   2.62
KDA    MP    95    10.81   15.79   8.46
SVC    LP    134   13.18   16.79   9.81
SVC    GP    67    3.05    1.67    3.39
SVC    MP    95    13.52   20.44   10.35
Table 3a
Classification matrix for temporal classification of surface water by DA, KDA and SVC models.

Training set
Model  Actual class  Total cases  Predicted Summer  Predicted Monsoon  Predicted Winter  Correct assignations  % Mis-classified  Sensitivity (%)  Specificity (%)  Accuracy (%)
DA     Summer    275   176   79    20    176   36.00   53.66   81.50   70.92
DA     Monsoon   287   118   146   23    146   49.12   61.34   77.44   73.00
DA     Winter    301   34    13    254   254   15.61   85.52   91.70   89.57
DA     Total     863   –     –     –     –     33.25   –       –       –
KDA    Summer    275   215   47    13    215   21.82   78.18   80.94   79.95
KDA    Monsoon   287   73    199   15    199   30.66   69.34   88.54   82.16
KDA    Winter    301   20    19    262   262   12.96   87.04   95.03   92.25
KDA    Total     863   –     –     –     –     21.81   –       –       –
SVC    Summer    275   233   29    13    233   15.27   73.50   92.30   85.39
SVC    Monsoon   287   75    194   18    194   32.40   83.98   85.28   84.93
SVC    Winter    301   9     8     284   284   5.64    90.15   96.95   94.50
SVC    Total     863   –     –     –     –     17.61   –       –       –

Validation set
DA     Summer    92    47    37    8     47    48.91   46.08   75.81   65.28
DA     Monsoon   96    39    50    7     50    47.91   54.95   76.65   69.79
DA     Winter    100   16    4     80    80    20.00   84.21   89.64   87.85
DA     Total     288   –     –     –     –     38.54   –       –       –
KDA    Summer    92    68    21    3     68    26.09   73.91   84.26   80.97
KDA    Monsoon   96    26    63    7     63    34.38   65.63   83.42   75.51
KDA    Winter    100   5     11    85    85    16.0    84.16   94.68   91.00
KDA    Total     288   –     –     –     –     25.49   –       –       –
SVC    Summer    92    69    16    7     69    25.00   63.88   87.22   78.47
SVC    Monsoon   96    32    58    6     58    39.58   70.73   81.55   78.47
SVC    Winter    100   7     8     85    85    15.00   86.73   92.10   90.27
SVC    Total     288   –     –     –     –     26.38   –       –       –

Test set
DA     Summer    121   82    31    8     82    32.23   57.75   74.68   66.55
DA     Monsoon   85    50    27    8     27    68.23   42.19   75.00   67.91
DA     Winter    90    10    6     74    74    17.77   82.22   92.23   89.19
DA     Total     296   –     –     –     –     38.17   –       –       –
KDA    Summer    121   87    28    6     87    28.10   71.90   85.14   79.73
KDA    Monsoon   85    21    58    6     58    31.76   68.24   84.83   80.07
KDA    Winter    90    5     4     81    81    11.11   90.00   94.17   92.91
KDA    Total     296   –     –     –     –     23.66   –       –       –
SVC    Summer    121   92    18    11    92    23.96   63.44   80.79   72.29
SVC    Monsoon   85    48    31    6     31    63.52   57.40   77.68   73.98
SVC    Winter    90    5     5     80    80    11.11   82.47   94.97   90.87
SVC    Total     296   –     –     –     –     31.41   –       –       –
The RMSE represents the error associated with the model and
was computed as:
RMSE = sqrt[ (1/N) Σ_{i=1}^{N} (ypred,i − ymeas,i)² ]    (19)
where ypred,i and ymeas,i represent the model computed and measured values of the variable, and N represents the number of
observations. The RMSE, a measure of the goodness-of-fit, best
describes an average measure of the error in predicting the dependent variable. However, it does not provide any information on
phase differences.
The bias or average value of residuals (non-explained difference)
between the measured and predicted values of the dependent variable represents the mean of all the individual errors and indicates
whether the model overestimates or underestimates the dependent variable. It is calculated as:

Bias = (1/N) Σ_{i=1}^{N} (ypred,i − ymeas,i)    (20)
The correlation coefficient (R) represents the percentage of variability that can be explained by the model and is calculated as:
R = [ Σ_{i=1}^{N} ymeas,i ypred,i − (1/N)(Σ_{i=1}^{N} ymeas,i)(Σ_{i=1}^{N} ypred,i) ] / sqrt{ [Σ_{i=1}^{N} y²meas,i − (1/N)(Σ_{i=1}^{N} ymeas,i)²] [Σ_{i=1}^{N} y²pred,i − (1/N)(Σ_{i=1}^{N} ypred,i)²] }    (21)

The accuracy factor (Af), a simple multiplicative factor indicating the spread of results about the prediction, is computed as [50]:

Af = 10^[ (1/N) Σ_{i=1}^{N} |log(ypred,i / ymeas,i)| ]    (22)

The larger the value of Af, the less accurate is the average estimate. A value of one indicates that there is perfect agreement between all the predicted and the measured values. Each performance criteria term described above conveys specific information regarding the predictive performance/efficiency of a specific model. Goodness-of-fit of the selected models was also checked through the analysis of the residuals.

The Nash–Sutcliffe coefficient of efficiency (Ef), an indicator of the model fit, is computed as [51]:

Ef = 1 − [ Σ_{i=1}^{N} (ypred,i − ymeas,i)² ] / [ Σ_{i=1}^{N} (ymeas,i − ȳmeas)² ]    (23)

where ȳmeas is the mean of the measured values. The Ef is a normalized measure (−∞ to 1) that compares the mean square error generated by a particular model simulation to the variance of the target output sequence. The Ef value of 1 indicates perfect model performance (the model perfectly simulates the target output), an Ef value of zero indicates that the model is, on average, performing only as good as the use of the mean target value as prediction, and an Ef value < 0 indicates an altogether questionable choice of the model [52].
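A minimal sketch computing the criteria of Eqs. (15)–(23) is given below (assumptions: base-10 logarithm in Af as in ref. [50], strictly positive measured and predicted values for Af, and hypothetical input arrays ymeas and ypred).

```python
# Minimal sketch of the performance criteria of Eqs. (15)-(23) (illustrative only).
import numpy as np

def classification_rates(tp, fp, tn, fn):
    return {"Sensitivity": tp / (tp + fn),                      # Eq. (15)
            "Specificity": tn / (tn + fp),                      # Eq. (16)
            "Accuracy": (tp + tn) / (tp + fp + tn + fn)}        # Eq. (17)

def regression_criteria(ymeas, ypred):
    ymeas, ypred = np.asarray(ymeas, float), np.asarray(ypred, float)
    mse = np.mean((ypred - ymeas) ** 2)                         # Eq. (18)
    rmse = np.sqrt(mse)                                         # Eq. (19)
    bias = np.mean(ypred - ymeas)                               # Eq. (20)
    r = np.corrcoef(ymeas, ypred)[0, 1]                         # Eq. (21)
    af = 10 ** np.mean(np.abs(np.log10(ypred / ymeas)))         # Eq. (22), needs y > 0
    ef = 1 - np.sum((ypred - ymeas) ** 2) / np.sum((ymeas - ymeas.mean()) ** 2)  # Eq. (23)
    return {"MSE": mse, "RMSE": rmse, "Bias": bias, "R": r, "Af": af, "Ef": ef}
```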
Table 3b
Classification error in temporal classification of surface water by DA, KDA and SVC models.

Training set
Model  Actual class  Total cases  Accuracy error (%)  Sensitivity error (%)  Specificity error (%)
DA     Summer    275   29.08   46.34   18.50
DA     Monsoon   287   27.00   38.66   22.56
DA     Winter    301   10.43   14.48   8.30
KDA    Summer    275   20.05   21.82   9.06
KDA    Monsoon   287   17.84   30.66   11.46
KDA    Winter    301   7.75    12.96   4.97
SVC    Summer    275   14.61   26.50   7.70
SVC    Monsoon   287   15.07   16.02   14.72
SVC    Winter    301   5.50    9.85    3.05

Validation set
DA     Summer    92    34.72   53.92   24.19
DA     Monsoon   96    30.21   45.05   23.35
DA     Winter    100   12.15   15.79   10.36
KDA    Summer    92    19.03   26.09   15.74
KDA    Monsoon   96    24.49   34.37   16.58
KDA    Winter    100   9.00    15.84   5.32
SVC    Summer    92    21.53   36.12   12.78
SVC    Monsoon   96    21.53   29.27   18.45
SVC    Winter    100   9.73    13.27   7.90

Test set
DA     Summer    121   33.45   42.25   25.32
DA     Monsoon   85    32.09   57.81   25.00
DA     Winter    90    10.81   17.78   7.77
KDA    Summer    121   20.27   28.10   14.86
KDA    Monsoon   85    19.93   31.76   15.17
KDA    Winter    90    7.09    10.00   5.83
SVC    Summer    121   20.71   36.56   19.21
SVC    Monsoon   85    26.02   42.60   22.32
SVC    Winter    90    9.13    17.53   5.03
3. Results and discussion

3.1. SVM classification

The SVC approach was used for the spatial and temporal classification of the surface water quality data. The complete water quality data set was divided into three sub-sets (training, validation, and test). Both the spatial and temporal SVCs were performed on data comprised of seven variables (pH, T-alk, Cl, PO4, COD, DO, BOD). Among the linear, polynomial, sigmoid, and RBF kernel functions, the latter was finally selected for the SVC model, as it yielded the lowest MSE.

In spatial classification, SVC was used for the differentiation between the three classes of sampling sites, viz. low-polluted (LP), moderately polluted (MP) and grossly polluted (GP), in the water quality monitoring network. The site was the category (dependent) variable, while all the other seven water quality variables constituted the independent set of variables. A two-step grid search method [53] with a 25% random validation was used to derive the SVC model parameters. First a coarse grid search was used to determine the best region of these three-dimensional grids and then a finer grid search was conducted to find the optimal parameters. The optimal values of the spatial SVC model parameters, C, ε and the kernel-dependent parameter (γ), were found as 12.88, 0.001, and 7.74, respectively, and the number of SVs was 551.

The spatial classification matrices (CMs) for the training, validation and test sets are presented in Table 2a. The spatial SVC rendered mean MRs of 12.39, 17.70, and 14.86% in the training, validation and test sets. Further, the results showed that the specificity of the classification was mostly higher than the sensitivity. The classification error ranged between 14% and 3% (Table 2b).

Similarly, the temporal SVC was used for the differentiation among the three distinct groups of months (summer, monsoon and winter seasons). The season was the category (dependent) variable, while all other seven water quality variables constituted the independent set of variables. The optimal values of the temporal SVC model parameters, C, ε and the kernel-dependent parameter (γ), were found to be 3.29, 0.001, and 4.17, respectively, and the number of SVs was 362. The temporal classification matrices (CMs) for the training, validation and test sets are presented in Table 3a. The temporal SVC rendered mean MRs of 17.61, 26.38, and 31.41% in the training, validation and test sets. Further, the results (Table 3a) showed that the specificity of the classification was always higher than the sensitivity. The classification error ranged between 23% and 5% (Table 3b).

Fig. 1 shows the importance of variables in spatial (Fig. 1a) and temporal (Fig. 1b) classification using the SVC models. It may be noted that T-alk followed by DO were the most important variables in seasonal classification, whereas DO followed by chloride and BOD were the most important variables in spatial classification of the surface water quality. This may be due to the fact that the seasonal variations in surface water quality in a region are largely due to the alkalinity, DO and chloride ions, whereas the anthropogenic pollution parameters such as DO, Cl, and BOD at different sites in a geographical area largely determine the water quality and account for its variation.

Fig. 1. Plot showing importance of the input variables in (a) spatial SVC model and (b) temporal SVC model.

Further, it is evident that the results of spatial classification are more precise as compared with the temporal classification. This may be attributed to the fact that the seasonal variations over a long period of time are more prominent, and the seasonal fluctuations over the study years and the consequent overlapping of months across
different seasons may have a significant influence on the resulting water quality in the study region.
However, the spatial SVM classification modeling successfully grouped the ten sampling sites into three groups, the low pollution (S1–S3), moderate pollution (S7–S10) and the gross pollution (S4–S6) sites, which may serve the water quality monitoring purpose in the study area, thus achieving a data reduction of 70%. The temporal SVC grouped the twelve monitoring months into three seasons (summer, monsoon and winter) for monitoring, and hence achieved a data reduction of 75%. The spatial and temporal SVC thus achieved an overall data reduction of 92.5%.
3.2. Linear and kernel discriminant analysis
DA and KDA were performed on the water quality data for the
spatial and temporal classifications using the training, validation
and test sets as employed for SVC. Similar to SVC, the site (spatial)
and the season (temporal) were the grouping (dependent) variables, while the selected water quality parameters constituted the
independent variables. In case of the KDA, RBF was used as the kernel function. An optimum value of the kernel function parameter
( ) was determined through generating several sets of the classification error for the training set. The best value of , thus, achieved
was 2.0, and it was then used to predict the classification of the
validation and test sub-sets.
The temporal and spatial classification results (Tables 2 and 3)
achieved through DA, KDA and SVC techniques for the training, validation, and test sets suggested that both the SVC and KDA models
performed relatively better than DA, suggesting that the variables
exhibited nonlinear relationships. However, the performances of
the SVC and KDA methods were comparable in the spatial and temporal classifications of the water quality. This may be attributed
to the fact that both these methods employ the kernel trick for
mapping of the input data, thus capturing the nonlinearities.
3.3. SVM regression
The SVR approach was used for predicting the BOD of surface
water using a set of simple and directly measurable water quality variables. The complete water quality data set was divided into three sub-sets (training, validation, and test). In SVR, BOD was the dependent variable, whereas the remaining six variables (pH, T-alk, Cl, PO4, COD, DO) constituted the set of independent variables. Among the linear, polynomial, sigmoid, and RBF kernel functions, the latter was finally selected for the SVR models as it yielded the lowest MSE. Moreover, the RBF kernels tend to give good performance under general smoothness assumptions [43,44]. A two-step grid search method [53] with a ten-fold cross-validation was used to derive the SVR model parameters. The optimal values of the SVR model parameters, C, ε and the kernel-dependent parameter (γ), were determined as 51.08, 0.001, and 0.995, respectively, and the number of SVs was 624.
parameters (MSE, RMSE, bias, R, Af , Ef ) as computed for the training, validation and test sets used for the model are presented in
Table 4. Fig. 2 shows the plots between the measured and the model
predicted values of BOD in training, validation and test set, respectively. For the BOD values predicted by the model, the correlation
coefficient (R) values (p < 0.001) for the training, validation and test
sets were 0.952, 0.909, and 0.907, respectively. The SVR predictions are precise if the R values are close to unity [4]. The respective
values of RMSE and bias for the three data sets are 1.53 and −0.06
for training, 1.44 and −0.05 for validation, and 1.32 and −0.10 for
testing, respectively. A closely followed pattern of variation in the
measured and model predicted BOD values (Fig. 2), a considerably
high correlation, and low values of MSE, RMSE, bias, along with the
values of Af and Ef closer to unity suggested a good fit of the model to the data set and its predictive ability for new future samples [48].

Fig. 2. Plot of measured versus SVR model predicted values of BOD in surface water (a) training, (b) validation, and (c) test sets.
Moreover, the model-predicted BOD values and the residuals
corresponding to the training, validation and test sets show almost
complete independence and random distribution (Fig. 3) with negligibly small correlations between them (Table 4). Residuals versus
predicted value plots can be more informative regarding model fitting to a data set. If the residuals appear to behave randomly (low
correlation), it suggests that the model fits the data well. On the
other hand, if non-random distribution is evident in the residuals,
the model does not fit the data adequately [48].
Further, all the six input variables participated in the SVR model for BOD prediction (figure not shown for brevity); COD had the highest contribution, followed by DO. Although the SVM does not necessarily represent physical meaning through the weights, it suggests that all the input variables have direct relevance to the dependent
Table 4
Values of the performance criteria parameters for PLS, KPLS and SVR models.

Model  Sub-set     MSE (mg L−1)  RMSE (mg L−1)  Bias (mg L−1)  Ef     Af     R      R*
PLS    Training    3.53   1.88   0.00    0.86   1.02   0.926   0.38
PLS    Validation  1.45   1.21   0.06    0.76   1.01   0.871   0.39
PLS    Test        2.29   1.51   0.09    0.78   1.03   0.884   0.46
KPLS   Training    3.02   1.74   0.00    0.88   1.30   0.937   0.00
KPLS   Validation  2.21   1.49   0.10    0.82   1.28   0.905   0.09
KPLS   Test        1.98   1.41   −0.11   0.80   1.28   0.894   −0.09
SVR    Training    2.33   1.53   −0.06   0.90   1.03   0.952   −0.01
SVR    Validation  2.09   1.44   −0.05   0.83   1.02   0.909   −0.03
SVR    Test        1.74   1.32   −0.10   0.80   1.02   0.907   0.11
R: correlation between predicted and measured values; R*: correlation between predicted values and residuals.
variables in water and play a role in determining the BOD levels. A
relatively higher contribution of the COD in the BOD model may be
attributed to the fact that the COD of water has direct bearing on
the BOD levels in water [4].
3.4. Linear and kernel PLS
First, a linear PLS model was built between the predictor
variables (X) and the response variable (y). On the basis of the
cross-validation results, five latent variables were included in the
model and it was then applied to the validation and test sets. A KPLS
model was developed with RBF. The kernel function was selected
on the basis of the RMSE value of the validation set. The kernel function parameter (γ) and the number of LVs in the feature space were determined on the basis of the minimum cross-validation error value. The optimum values of γ and the LVs were 0.9 and 5, respectively. These values were used as the optimal parameters in the
KPLS model.
The results pertaining to the performance criteria parameters
for the PLS, KPLS and SVR modeling approaches (Table 4) suggested
that both the KPLS and SVR models performed relatively better than
the PLS in predicting the water BOD levels, suggesting a nonlinear dependence of BOD on the independent variables. However, the performances of the KPLS and SVR models were comparable. Since both these models use a kernel function for mapping the input data into the feature space, they are capable of capturing the data nonlinearities and yield relatively better predictions.
4. Conclusion
The surface water quality pertaining to a large geographical area
covering the low, moderate and high pollution regions monitored
over a long duration of 15 years with several seasonal variations
and disturbances can be modeled as a function of selected water
quality variables using the SVMs approach, which performed relatively better than the linear (DA and PLS) models for classification
and regression. Relatively high number of SVs both in case of the
classification (spatial and temporal) and regression modeling suggests that the constructed SVC and SVR models actually used most
of the data points. It is also concluded that the SVC and SVR models can make future predictions possible. SVC achieved a data reduction of 92.5%, and the future water quality monitoring program may be redesigned accordingly without compromising the output quality. Further, the predictive tool for the water BOD envisaged in SVR, using a selected number of simple and directly measurable variables, may be used for predicting future trends in water quality. Thus,
the SVM based approaches helped in optimization of the water
quality monitoring program through reduction in number of the
sampling sites, frequency, and water quality parameters, and hence
in water quality management in a large geographical area with wide
seasonal variations.
Fig. 3. Plot of the SVR model predicted BOD values and residuals (a) training, (b) validation, and (c) test sets.

Acknowledgement

The authors thank the Director, Indian Institute of Toxicology Research, Lucknow for his keen interest in this work.

References
[1] K.P. Singh, A. Malik, V.K. Singh, N. Basant, S. Sinha, Anal. Chim. Acta 571 (2006)
248–259.
[2] K.P. Singh, A. Malik, V.K. Singh, Water Air Soil Pollut. 170 (2005) 383–404.
[3] K.P. Singh, A. Malik, D. Mohan, S. Sinha, Water Res. 38 (2004) 3980–3992.
[4] K.P. Singh, A. Basant, A. Malik, G. Jain, Ecol. Model. 220 (2009) 888–895.
[5] J.W. Einax, A. Aulinger, W.V. Tumpling, A. Prange, J. Anal. Chem. 363 (1999)
655–661.
[6] K.P. Singh, N. Basant, A. Malik, S. Sinha, G. Jain, Chemometr. Intell. Lab. Syst. 95
(2009) 18–30.
[7] K.P. Singh, A. Malik, D. Mohan, S. Sinha, V.K. Singh, Anal. Chim. Acta 532 (2005)
15–25.
[8] Y. Zhang, C. Ma, Chem. Eng. Sci. 66 (2011) 64–72.
[9] D.-S. Cao, Y.-Z. Liang, Q.-S. Xu, Q.-N. Hu, L.-X. Zhang, G.-H. Fu, Chemometr. Intell.
Lab. Syst. 107 (2011) 106–115.
[10] J. Lu, K.N. Plataniotis, A.N. Venetsanopoulos, J. Wang, Pattern Recogn. 38 (2005)
1788–1890.
[11] H. Li, Y. Liang, Q. Xu, Chemometr. Intell. Lab. Syst. 95 (2009) 188–198.
[12] D. Cozzolino, W.U. Cynkar, N. Shah, P. Smith, Food Res. Int 44 (2011)
181–186.
[13] S.H. Woo, C.O. Jeon, Y.S. Yun, H. Choi, C.S. Lee, D.S. Lee, J. Hazard. Mater. 161
(2009) 538–544.
[14] V.N. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, 1998.
[15] Y. Pan, J. Jiang, R. Wang, H. Cao, Chemometr. Intell. Lab. Syst. 92 (2008) 169–178.
[16] J. Qu, M.J. Zuo, Measurement 43 (2010) 781–791.
[17] M. Kovacevic, B. Bajat, B. Gajic, Geoderma 154 (2010) 340–347.
[18] A. Mucherino, P. Papajorgji, M.P. Paradalos, Oper. Res. 9 (2009) 121–140.
[19] B. Scholkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, V. Vapnik, IEEE
Trans. Signal Process. 45 (1997) 2758–2765.
[20] V. Vapnik, S. Golowich, A.J. Simola, Adv. Neural Inform. Process. Syst. 9 (1996)
281–287.
[21] K. Kavaklioglu, Appl. Energy 88 (2010) 368–375.
[22] T. Rumpf, A.K. Mahlein, U. Steiner, E.C. Oerke, H.W. Dehne, L. Plumer, Comput.
Electron. Agric. 74 (2010) 91–99.
[23] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and other
Kernel based Learning Methods, Cambridge, 2000.
[24] B. Scholkopf, A.J. Smola, Learning with Kernels, MIT Press, Cambridge, 2002.
[25] J. Luts, F. Ojeda, R. Van de Plas, B. De Moor, S. Van Huffel, J.A.K. Suykens,
Anal. Chim. Acta 665 (2010) 129–145.
[26] S.W. Lin, Z.J. Lee, S.C. Chen, T.Y. Tseng, Appl. Soft. Comput. 8 (2008) 1505–1512.
[27] S. Ekici, Expert Syst. Appl. 36 (2009) 9859–9868.
[28] J.F. Wang, X. Liu, Y.L. Liao, H.Y. Chen, W.X. Li, X.Y. Zheng, Biomed. Environ. Sci.
23 (2010) 167–172.
[29] A. Widodo, B.S. Yang, Mech. Syst. Signal Process. 21 (2007) 2560–2574.
[30] X. Xie, W.T. Liu, B. Tang, Remote Sens. Environ. 112 (2008) 1846–1855.
[31] V. Vapnik, The Nature of Statistical Learning Theory, 2nd ed., Berlin, Springer,
1999.
[32] V. Cherkassky, Y. Ma, Neural Networks 17 (2004) 113–126.
[33] C. Wu, X. Lv, X. Cao, Y. Mo, C. Chen, Int. J. Phys. Sci. 5 (2010) 2523–2527.
[34] R. Noori, A.R. Karbassi, K. Moghaddamnia, D. Han, M.H. Zokaei-Ashtiani, A.
Farokhnia, N. Ghafari Gousheh, J. Hydrol. 401 (2011) 177–189.
[35] J. Wang, H.Y. Du, H.X. Liu, X.J. Yao, Z.D. Hu, B.T. Fan, Talanta 73 (2007) 147–156.
[36] A.K.S. Jardine, D. Lin, D. Danjevic, Mech. Syst. Signal Process. 20 (2006)
1483–1510.
[37] N. Basant, S. Gupta, A. Malik, K.P. Singh, Chemometr. Intell. Lab. Syst. 104 (2010)
172–180.
[38] M. Daszykowski, S. Serneels, K. Kaczmarek, P. Van Espen, C. Croux, B. Walczak,
Chemometr. Intell. Lab. Syst. 85 (2007) 269–277.
[39] B. Ustun, W.J. Melssen, M. Oudenhuijzen, L.M.C. Buydens, Anal. Chim. Acta 544
(2005) 292–305.
[40] W.J. Wang, Z.B. Xu, W.Z. Lu, X.Y. Zhang, Neurocomputing 55 (2003) 643–663.
[41] C.H. Wu, G.H. Tzeng, R.H. Lin, Expert Syst. Appl. 36 (2009) 4725–4735.
[42] S.S. Keerthi, C.J. Lin, Neural Comput. 15 (2001) 1667–1689.
[43] X. Li, D. Lord, Y. Zhang, Y. Xie, Accident Anal. Prev. 40 (2008) 1611–1618.
[44] R. Noori, M.A. Abdoli, A. Ameri, M. Jalili-Ghazzizade, Environ. Prog. Sustain.
Energy 28 (2009) 249–258.
[45] D. Han, L. Chan, N. Zhu, J. Hydroinform 09.4 (2007) 267–276.
[46] C.W. Hsu, C.C. Chang, A Practical Guide to Support Vector Classification, 2003,
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
[47] K.P. Singh, A. Malik, V.K. Singh, D. Mohan, S. Sinha, Anal. Chim. Acta 550 (2005)
82–91.
[48] K.P. Singh, N. Basant, A. Malik, G. Jain, Anal. Chim. Acta 658 (2010) 1–11.
[49] C. Karul, S. Soyupak, A.F. Clesiz, N. Akbay, E. German, Ecol. Model. 134 (2000)
145–152.
[50] T. Ross, J. Appl. Biotechnol. 81 (1996) 501–508.
[51] J.E. Nash, I.V. Sutcliffe, J. Hydrol. 10 (1970) 282–290.
[52] S. Palani, S. Liong, P. Tkalich, Mar. Pollut. Bull. 56 (2008) 1586–1597.
[53] S.T. Chen, P.S. Yu, J. Hydrol. 340 (2007) 63–77.