Support Vector Machines: Theory and Applications
INTRODUCTION
Support Vector Machines (SVM) have recently been developed in the framework of statistical learning theory
(Vapnik, 1998; Cortes and Vapnik, 1995), and have been successfully applied to a number of problems, ranging
from time series prediction (Fernandez, 1999), to face recognition (Tefas et al., 1999), to biological data processing
for medical diagnosis (Veropoulos et al., 1999). Their theoretical foundations and their experimental success
encourage further research on their characteristics, as well as on their further use.
In this report we present a brief introduction to the theory and implementation of SVM, and we discuss the five
papers presented during the workshop. The report is organized as follows: section 2 presents the theoretical
foundations of SVM. A brief overview of statistical learning theory is given, drawing also on the discussion in
(Vayatis and Azencott, 1999). The mathematical formulation of SVM is then presented, and the theory behind the
implementation of SVM, as in (Trafalis, 1999), is briefly discussed. Section 3 summarizes the experimental work of
(Veropoulos et al., 1999), (Fernandez, 1999) and (Tefas et al., 1999) and the variations of the standard SVM
proposed in these papers. Finally, section 4 presents some conclusions and suggestions for future research.
with SVM classification. For more information we refer the reader to the literature.
For regression the loss function used is the so-called $\varepsilon$-insensitive loss function $|y - f(x)|_\varepsilon$, which is equal to
$|y - f(x)| - \varepsilon$ if $|y - f(x)| > \varepsilon$, and to 0 otherwise.
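For concreteness, the two loss functions used in the problems below can be written as short functions. The following is a minimal sketch in Python; the function names are ours, for illustration only:

```python
import numpy as np

def hinge_loss(y, f_x):
    """Soft-margin (hinge) loss |1 - y * f(x)|_+ used for SVM classification (labels y in {-1, +1})."""
    return np.maximum(0.0, 1.0 - y * f_x)

def eps_insensitive_loss(y, f_x, eps):
    """epsilon-insensitive loss |y - f(x)|_eps used for SVM regression."""
    return np.maximum(0.0, np.abs(y - f_x) - eps)
```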
To summarize, following the SLT ideas outlined above, for the given choices of the loss function and the
hypothesis spaces, SVM are learning machines that minimize the empirical error while taking into account the
"complexity" of the hypothesis space used, by also minimizing the RKHS norm of the solution $\|f\|_K^2$. In practice
SVM minimize a trade-off between empirical error and complexity of the hypothesis space. Formally this is done by
solving the following minimization problems:

SVM classification:

$$\min_{f} \ \|f\|_K^2 + C \sum_{i=1}^{l} |1 - y_i f(x_i)|_+ \qquad (1)$$
SVM regression:

$$\min_{f} \ \|f\|_K^2 + C \sum_{i=1}^{l} |y_i - f(x_i)|_\varepsilon \qquad (2)$$
where C is a so-called "regularization parameter" that controls the trade-off between empirical error and the
complexity of the hypothesis space used.
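To make the trade-off in (1) concrete: assuming, as is standard for SVM, that the solution is a kernel expansion $f(x) = \sum_j c_j K(x, x_j)$, so that $\|f\|_K^2 = c^\top K c$, the classification objective can be evaluated as in the sketch below (a minimal illustration under that assumption; variable names are ours). The minimization over f then reduces to a minimization over the finite-dimensional coefficient vector c.

```python
import numpy as np

def svm_classification_objective(c, K, y, C):
    """Value of objective (1) for f(x) = sum_j c_j K(x, x_j):
    f(x_i) = (K c)_i on the training points and ||f||_K^2 = c' K c."""
    f_x = K @ c                      # f evaluated at the l training points
    rkhs_norm_sq = c @ K @ c         # squared RKHS norm of the solution
    hinge = np.maximum(0.0, 1.0 - y * f_x)
    return rkhs_norm_sq + C * hinge.sum()
```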
Having discussed how SVM stem from the theory outlined above, we now turn to their actual implementation.
The next section briefly discusses how the minimization problems (1) and (2) can be solved, taking also into account
(Trafalis, 1999).
SVM IMPLEMENTATION
As mentioned above, training an SVM means solving problems (1) or (2). It turns out that both problems can be
rewritten as constrained quadratic programming (QP) problems. We present the QP formulation for SVM
classification, and regarding regression we refer the reader to the literature (Vapnik, 1998).
Problem (1) can be rewritten as follows:

SV classification:

$$\min_{f,\,\xi_i} \ \|f\|_K^2 + C \sum_{i=1}^{l} \xi_i \qquad (3)$$

subject to: $y_i f(x_i) \geq 1 - \xi_i$ and $\xi_i \geq 0$, for $i = 1, \ldots, l$.

This constrained QP problem is typically solved through its dual, in which one maximizes over the Lagrange multipliers $\alpha_i$:

$$\max_{\alpha} \ \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j) \qquad (4)$$

subject to: $0 \leq \alpha_i \leq C$ for $i = 1, \ldots, l$, and $\sum_{i=1}^{l} \alpha_i y_i = 0$.

The solution then has the form $f(x) = \sum_{i=1}^{l} \alpha_i y_i K(x, x_i) + b$, where only the training points with $\alpha_i > 0$ (the support vectors) contribute.
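As an illustration of what "training an SVM" amounts to computationally, the following is a minimal sketch that solves the dual QP (4) with a generic off-the-shelf solver (here the cvxopt package); this is our illustration, not the implementation used in any of the workshop papers:

```python
import numpy as np
from cvxopt import matrix, solvers

def train_svm_dual(K, y, C):
    """Solve the dual QP (4): maximize sum_i alpha_i - 1/2 alpha' (yy' * K) alpha
    subject to 0 <= alpha_i <= C and sum_i alpha_i y_i = 0.
    K is the l x l kernel matrix, y the labels in {-1, +1}."""
    l = len(y)
    P = matrix(np.outer(y, y) * K)                   # quadratic term of the (minimization form) QP
    q = matrix(-np.ones(l))                          # linear term; sign flipped to turn max into min
    G = matrix(np.vstack([-np.eye(l), np.eye(l)]))   # encodes the box constraints 0 <= alpha_i <= C
    h = matrix(np.hstack([np.zeros(l), C * np.ones(l)]))
    A = matrix(y.reshape(1, -1).astype(float))       # equality constraint sum_i alpha_i y_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])                        # the optimal multipliers alpha_i
```

In practice, decomposition methods such as SMO (Platt, 1998) or the primal-dual interior point methods of (Trafalis, 1999) are preferred for large data sets, since forming and solving the dense l-by-l QP directly becomes expensive.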
Typically in the literature SVM are trained by solving the dual optimization problem (4) ((Osuna et al., 1997),
(Vapnik, 1998), (Burges, 1998)). Trafalis (Trafalis, 1999) instead proposes primal-dual interior point methods (IPM)
for SVM training, which differ from the ones typically used.
In (Trafalis, 1999) the IPM discussed are also used to train learning machines other than SVM. In particular,
(Trafalis, 1999) shows how the proposed primal-dual IPM can be used to train Artificial Neural Networks (ANN),
which are typically trained using backpropagation. One of the main differences between ANN and SVM is that, as
mentioned in (Trafalis, 1999), while for ANN there can be many locally optimal solutions, for SVM there is only one
optimal solution for problem (3), since SVM are trained by solving a QP problem which has one global optimal
solution. This is one practical "advantage" of SVM when compared with ANN.
To deal with unbalanced training data, or to bias the classifier towards one of the two classes, (Veropoulos et al.,
1999) use a different regularization parameter for each class, replacing problem (3) with:

$$\min_{f,\,\xi_i} \ \|f\|_K^2 + C_1 \sum_{i \in \text{class 1}} \xi_i + C_2 \sum_{i \in \text{class 2}} \xi_i \qquad (5)$$

subject to the same constraints as in (3).
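In the dual, (5) only changes the upper bound of the box constraint on each $\alpha_i$ to the regularization parameter of its class. A minimal sketch, modifying the generic QP setup above accordingly (again our illustration, not the code of (Veropoulos et al., 1999)):

```python
import numpy as np
from cvxopt import matrix, solvers

def train_weighted_svm_dual(K, y, C1, C2):
    """Dual of problem (5): identical to (4) except that the box constraint becomes
    0 <= alpha_i <= C1 for points of class 1 (here taken as y_i = +1) and
    0 <= alpha_i <= C2 for points of class 2 (y_i = -1)."""
    l = len(y)
    C_per_sample = np.where(y == 1, C1, C2).astype(float)
    P = matrix(np.outer(y, y) * K)
    q = matrix(-np.ones(l))
    G = matrix(np.vstack([-np.eye(l), np.eye(l)]))
    h = matrix(np.hstack([np.zeros(l), C_per_sample]))   # per-class upper bounds on alpha_i
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])
```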
CONCLUSIONS
The report presented an overview of the theory of SVM in parallel with a summary of the papers presented in the
ACAI 99 workshop on "Support Vector Machines: theory and applications". Some of the important conclusions of
this report as well as of the workshop are summarized below:
(i) SVM are motivated through statistical learning theory. The theory characterizes the performance of learning
machines using bounds on their ability to predict future data. One of the papers in the workshop (Vayatis and
Azencott, 1999) presented new bounds on the performance of learning machines, and suggested a method to use
them experimentally in order to better understand the learning machines (including SVM).
(ii) SVM are trained by solving a constrained quadratic optimization problem. Among others, this implies that
there is a unique optimal solution for each choice of the SVM parameters. This is unlike other learning machines,
such as standard Neural Networks trained using backpropagation.
(iii) Primal-dual interior point optimization methods may be used to efficiently train SVM on large data sets, as
described in (Trafalis, 1999).
(iv) Training many local SVMs instead of a single global one can lead to significant improvement in the
performance of a learning machine, as shown in (Fernandez, 1999).
(v) SVM have been successfully used for medical diagnosis (Veropoulos et al., 1999). Methods for dealing with
unbalanced training data, or for biasing the performance of an SVM towards one of the classes during classification
were suggested and used in (Veropoulos et al., 1999).
(vi) An SVM using a kernel motivated by the Fisher Linear Discriminant was shown to outperform the standard
linear SVM for a face recognition task in (Tefas et al., 1999).
The ideas presented in the papers and discussed in the workshop suggest a number of future research directions:
from tuning the basic statistical learning theory results, to developing efficient training methods for SVM, to
designing variations of the standard SVM for practical usage. Some of the main issues regarding the design and use
of SVMs are, among others, the choice of the kernel of the SVM (as (Tefas et al., 1999) showed), and the choice of
the regularization parameter (as (Veropoulos et al., 1999) discussed). On the other hand, significant improvements in
the performance of SVM may be achieved if ensembles of SVMs are used (as in (Fernandez, 1999)).
Acknowledgements: The authors would like to thank the organizers of ACAI 99, and the co-organizers of the
workshop: Constantine Papageorgiou, Tomaso Poggio, and Ioannis Pitas.
REFERENCES
Bartlett P. and Shawe-Taylor J., “Generalization performance of support vector machines and other pattern
classifiers”, in C. Burges and B. Scholkopf (eds.), “Advances in Kernel Methods -- Support Vector Learning”, MIT
Press, 1998.
Bottou L. and Vapnik V., “Local learning algorithms”, Neural Computation, 4(6): 888--900, November 1992.
Burges C., “A tutorial on support vector machines for pattern recognition”, Data Mining and Knowledge
Discovery, Vol. 2, Kluwer Academic Publishers, Boston, 1998.
Cortes C. and Vapnik V., “Support vector networks”, Machine Learning, 20:1--25, 1995.
Duda R. and Hart P., "Pattern Classification and Scene Analysis", Wiley, New York 1973.
Evgeniou T., Pontil M., and Poggio T., “A unified framework for regularization networks and support vector
machines” A.I. Memo No. 1654, Artificial Intelligence Laboratory, MIT, 1999.
Fernandez R., “Predicting time series with a local support vector regression machine", ACAI99.
Osuna E., Freund R., and Girosi F., “Support Vector Machines: Training and Applications”, A.I. Memo No. 1602,
Artificial Intelligence Laboratory, MIT, 1997.
Platt J., “Fast training of Support Vector Machines using sequential minimal optimization”, in C. Burges and
B. Scholkopf (eds.), “Advances in Kernel Methods -- Support Vector Learning”, MIT Press, 1998.
Pontil M., Mukherjee S., and Girosi F., “On the noise model of Support Vector Machine regression” A.I. Memo,
MIT Artificial Intelligence Laboratory, 1998.
Tefas A., Kotropoulos C., and Pitas I., "Enhancing the performance of elastic graph matching for face
authentications by using Support Vector Machines", ACAI99.
Trafalis T., "Primal-dual optimization methods in neural networks and support vector machines training", ACAI99.
Vapnik V., “Statistical Learning Theory”, Wiley, New York, 1998.
Vapnik V. and Chervonenkis A., “On the uniform convergence of relative frequencies of events to their
probabilities”, Theory of Probability and its Applications, 17(2): 264--280, 1971.
Vayatis N. and Azencott R., "How to estimate the Vapnik-Chervonenkis Dimension of Support Vector Machines
through simulations", ACAI99.
Veropoulos K., Cristianini N., and Campbell C., "The Application of Support Vector Machines to Medical Decision
Support: A Case Study", ACAI99.
Wahba G., “Spline Models for Observational Data”, Series in Applied Mathematics, Vol. 59, SIAM, 1990.