
[MRG] ENH: Add SVDD to svm module #5899


Closed
wants to merge 12 commits into from
1 change: 1 addition & 0 deletions doc/modules/classes.rst
@@ -1241,6 +1241,7 @@ Estimators
svm.LinearSVR
svm.NuSVR
svm.OneClassSVM
svm.SVDD

.. autosummary::
:toctree: generated/
91 changes: 57 additions & 34 deletions doc/modules/outlier_detection.rst
@@ -1,8 +1,8 @@
.. _outlier_detection:

===================================================
=============================
Novelty and Outlier Detection
===================================================
=============================

.. currentmodule:: sklearn

@@ -49,37 +49,56 @@ In general, it is about learning a rough, close frontier delimiting
the contour of the initial observations distribution, plotted in
embedding :math:`p`-dimensional space. Then, if further observations
lie within the frontier-delimited subspace, they are considered as
coming from the same population than the initial
observations. Otherwise, if they lay outside the frontier, we can say
that they are abnormal with a given confidence in our assessment.

The One-Class SVM has been introduced by Schölkopf et al. for that purpose
and implemented in the :ref:`svm` module in the
:class:`svm.OneClassSVM` object. It requires the choice of a
kernel and a scalar parameter to define a frontier. The RBF kernel is
usually chosen although there exists no exact formula or algorithm to
set its bandwidth parameter. This is the default in the scikit-learn
implementation. The :math:`\nu` parameter, also known as the margin of
the One-Class SVM, corresponds to the probability of finding a new,
but regular, observation outside the frontier.
coming from the same population as the initial observations. Otherwise,
if they lie outside the frontier, we can say that they are abnormal with a
given confidence in our assessment.

There are two SVM-based approaches for that purpose:

1. :class:`svm.OneClassSVM` finds a hyperplane which separates the data from
the origin by the largest margin.
2. :class:`svm.SVDD` finds the sphere of minimum radius which encloses
   the data.

Both methods can implicitly work in a transformed high-dimensional space using
the kernel trick. :class:`svm.OneClassSVM` provides a :math:`\nu` parameter for
controlling the trade-off between the margin and the number of outliers during
training: it is an upper bound on the fraction of outliers in the training
set and the probability of finding a new, but regular, observation outside the
frontier. :class:`svm.SVDD` provides a similar parameter
:math:`C = 1 / (\nu l)`, where :math:`l` is the number of samples, such that
:math:`1/C` approximately equals the number of outliers in the training set.

Both methods are equivalent if a) the kernel depends only on the difference
between two vectors (the RBF kernel is one example), and b) :math:`C = 1 / (\nu l)`.
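
This equivalence can be checked numerically. Below is a minimal sketch, assuming
the :class:`svm.SVDD` estimator proposed in this pull request with the ``C``
parameter defined as above (parameter values are purely illustrative)::

    import numpy as np
    from sklearn import svm

    rng = np.random.RandomState(42)
    X = rng.randn(200, 2)               # training observations
    nu, l = 0.1, X.shape[0]

    # One-Class SVM with an RBF (translation-invariant) kernel
    oc_svm = svm.OneClassSVM(kernel='rbf', gamma=0.5, nu=nu).fit(X)

    # SVDD (from this PR) with C = 1 / (nu * l); with an RBF kernel both
    # models should flag (almost) the same points as outliers
    svdd = svm.SVDD(kernel='rbf', gamma=0.5, C=1.0 / (nu * l)).fit(X)

    agreement = np.mean(oc_svm.predict(X) == svdd.predict(X))
    print("fraction of identical predictions: %.3f" % agreement)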

.. topic:: References:
.. figure:: ../auto_examples/svm/images/plot_oneclass_001.png
:target: ../auto_examples/svm/plot_oneclass.html
:align: center
:scale: 75%

.. figure:: ../auto_examples/svm/images/plot_oneclass_vs_svdd_001.png
:target: ../auto_examples/svm/plot_oneclass_vs_svdd.html
:align: center
:scale: 75

* `Estimating the support of a high-dimensional distribution
<http://dl.acm.org/citation.cfm?id=1119749>`_ Schölkopf,
Bernhard, et al. Neural computation 13.7 (2001): 1443-1471.

.. topic:: Examples:

* See :ref:`example_svm_plot_oneclass.py` for visualizing the
frontier learned around some data by a
:class:`svm.OneClassSVM` object.
frontier learned around some data by :class:`svm.OneClassSVM`.
* See :ref:`example_svm_plot_oneclass_vs_svdd.py` for an illustration of
  the difference between the two approaches.

.. topic:: References:

* Bernhard Schölkopf et al, `Estimating the Support of a High-Dimensional
Distribution <http://dl.acm.org/citation.cfm?id=1119749>`_, Neural
computation 13.7 (2001): 1443-1471.
* David M. J. Tax and Robert P. W. Duin, `Support Vector Data Description
<http://dl.acm.org/citation.cfm?id=960109>`_, Machine Learning,
54(1):45-66, 2004.

.. figure:: ../auto_examples/svm/images/plot_oneclass_001.png
:target: ../auto_examples/svm/plot_oneclasse.html
:align: center
:scale: 75%


Outlier Detection
=================
@@ -131,7 +150,7 @@ This strategy is illustrated below.


Isolation Forest
----------------------------
----------------

One efficient way of performing outlier detection in high-dimensional datasets
is to use random forests.
@@ -187,7 +206,11 @@ results in these situations.
The examples below illustrate how the performance of the
:class:`covariance.EllipticEnvelope` degrades as the data is less and
less unimodal. The :class:`svm.OneClassSVM` works better on data with
multiple modes and :class:`ensemble.IsolationForest` performs well in every cases.
multiple modes and :class:`ensemble.IsolationForest` performs well in all
cases.

:class:`svm.SVDD` is not included in the comparison since it behaves the same
as :class:`svm.OneClassSVM` when the RBF kernel is used.
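
For reference, here is a minimal sketch of such a comparison on a bimodal toy
dataset (the datasets and parameters behind the figures below differ)::

    import numpy as np
    from sklearn.covariance import EllipticEnvelope
    from sklearn.ensemble import IsolationForest
    from sklearn.svm import OneClassSVM

    rng = np.random.RandomState(0)
    # two inlier clusters plus a few uniformly scattered outliers
    inliers = np.r_[0.3 * rng.randn(100, 2) + 2, 0.3 * rng.randn(100, 2) - 2]
    outliers = rng.uniform(low=-6, high=6, size=(20, 2))
    X = np.r_[inliers, outliers]

    detectors = {
        "Robust covariance": EllipticEnvelope(contamination=0.1),
        "One-Class SVM": OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1),
        "Isolation Forest": IsolationForest(max_samples=len(X), random_state=42),
    }
    for name, detector in detectors.items():
        # predict returns +1 for inliers and -1 for outliers
        n_flagged = (detector.fit(X).predict(X) == -1).sum()
        print("%s flagged %d points as outliers" % (name, n_flagged))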

.. |outlier1| image:: ../auto_examples/covariance/images/plot_outlier_detection_001.png
:target: ../auto_examples/covariance/plot_outlier_detection.html
@@ -213,20 +236,20 @@ multiple modes and :class:`ensemble.IsolationForest` performs well in every case
:class:`covariance.EllipticEnvelope` learns an ellipse, which
fits well the inlier distribution. The :class:`ensemble.IsolationForest`
performs as well.
- |outlier1|
- |outlier1|

*
*
- As the inlier distribution becomes bimodal, the
:class:`covariance.EllipticEnvelope` does not fit well the
inliers. However, we can see that both :class:`ensemble.IsolationForest`
and :class:`svm.OneClassSVM` have difficulty detecting the two modes,
and that the :class:`svm.OneClassSVM`
tends to overfit: because it has no model of the inliers, it
interprets a region where, by chance, some outliers are
clustered, as inliers.
- |outlier2|
clustered, as inliers.
- |outlier2|

*
*
- If the inlier distribution is strongly non-Gaussian, the
:class:`svm.OneClassSVM` is able to recover a reasonable
approximation as well as :class:`ensemble.IsolationForest`,
48 changes: 34 additions & 14 deletions doc/modules/svm.rst
@@ -325,32 +325,54 @@ floating point values instead of integer values::
* :ref:`example_svm_plot_svm_regression.py`



.. _svm_outlier_detection:

Density estimation, novelty detection
=======================================
Novelty and outlier detection
=============================

Support vector machines can be used to detect novelty and outliers in
unlabeled data sets: given a set of samples, they detect the soft boundary of
that set so as to classify new points as belonging to that set or not.

There are two SVM-based approaches to this problem:

One-class SVM is used for novelty detection, that is, given a set of
samples, it will detect the soft boundary of that set so as to
classify new points as belonging to that set or not. The class that
implements this is called :class:`OneClassSVM`.
1. :class:`OneClassSVM` finds a hyperplane which separates the data from
the origin by the largest margin.
2. :class:`SVDD` finds the sphere of minimum radius which encloses
   the data.

In this case, as it is a type of unsupervised learning, the fit method
will only take as input an array X, as there are no class labels.
Both methods can be tuned to trade off the number of outliers against the
margin (or radius) of the separating boundary.

See, section :ref:`outlier_detection` for more details on this usage.
See section :ref:`outlier_detection` for more details on their usage.
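
A minimal usage sketch (:class:`SVDD` is the estimator proposed in this pull
request; parameter values are purely illustrative)::

    from sklearn import svm

    X_train = [[0.0, 0.0], [0.1, -0.1], [0.9, 1.0], [1.0, 0.9]]
    X_new = [[0.05, 0.0], [3.0, 3.0]]

    # One-Class SVM: nu upper-bounds the fraction of training outliers
    clf = svm.OneClassSVM(kernel='rbf', gamma=2.0, nu=0.25).fit(X_train)
    print(clf.predict(X_new))   # +1 for inliers, -1 for outliers

    # SVDD (from this PR): 1/C roughly bounds the number of training outliers
    clf = svm.SVDD(kernel='rbf', gamma=2.0, C=1.0).fit(X_train)
    print(clf.predict(X_new))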

.. figure:: ../auto_examples/svm/images/plot_oneclass_001.png
:target: ../auto_examples/svm/plot_oneclass.html
:align: center
:scale: 75

.. figure:: ../auto_examples/svm/images/plot_oneclass_vs_svdd_001.png
:target: ../auto_examples/svm/plot_oneclass_vs_svdd.html
:align: center
:scale: 75

.. topic:: Examples:

* :ref:`example_svm_plot_oneclass.py`
* :ref:`example_applications_plot_species_distribution_modeling.py`
* See :ref:`example_svm_plot_oneclass.py` for visualizing the
frontier learned around some data by :class:`OneClassSVM`.
* See :ref:`example_svm_plot_oneclass_vs_svdd.py` for an illustration of
  the difference between the two approaches.
* :ref:`example_applications_plot_species_distribution_modeling.py`

.. topic:: References:

* Bernhard Schölkopf et al, `Estimating the Support of a High-Dimensional
Distribution <http://dl.acm.org/citation.cfm?id=1119749>`_, Neural
computation 13.7 (2001): 1443-1471.
* David M. J. Tax and Robert P. W. Duin, `Support Vector Data Description
<http://dl.acm.org/citation.cfm?id=960109>`_, Machine Learning,
54(1):45-66, 2004.



Complexity
@@ -707,5 +729,3 @@ computations. These libraries are wrapped using C and Cython.

- `LIBLINEAR -- A Library for Large Linear Classification
<http://www.csie.ntu.edu.tw/~cjlin/liblinear/>`_


21 changes: 12 additions & 9 deletions examples/applications/plot_outlier_detection_housing.py
@@ -19,7 +19,7 @@
able to focus on the main mode of the data distribution, it sticks to the
assumption that the data should be Gaussian distributed, yielding some biased
estimation of the data structure, but yet accurate to some extent.
The One-Class SVM algorithm
The One-Class SVM algorithm and Support Vector Data Description

First example
-------------
@@ -39,7 +39,7 @@
distribution: the location seems to be well estimated, although the covariance
is hard to estimate due to the banana-shaped distribution. Anyway, we can
get rid of some outlying observations.
The One-Class SVM is able to capture the real data structure, but the
The One-Class SVM and SVDD are able to capture the real data structure, but the
difficulty is to adjust their kernel bandwidth parameter so as to obtain
a good compromise between the shape of the data scatter matrix and the
risk of over-fitting the data.
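
One way to explore this bandwidth trade-off is to scan ``gamma`` and monitor
the fraction of training points flagged as outliers; a rough sketch on
synthetic stand-in data, not part of this example::

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.RandomState(0)
    X = rng.randn(500, 2)    # stand-in for two housing features

    # Too small a bandwidth under-fits the support, too large over-fits it;
    # watch how many training points end up flagged as outliers.
    for gamma in [0.001, 0.01, 0.05, 0.1, 1.0]:
        clf = OneClassSVM(nu=0.26, gamma=gamma).fit(X)
        frac = np.mean(clf.predict(X) == -1)
        print("gamma=%-6g  fraction flagged as outliers: %.3f" % (gamma, frac))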
@@ -52,7 +52,7 @@

import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM
from sklearn.svm import OneClassSVM, SVDD
import matplotlib.pyplot as plt
import matplotlib.font_manager
from sklearn.datasets import load_boston
@@ -67,8 +67,9 @@
contamination=0.261),
"Robust Covariance (Minimum Covariance Determinant)":
EllipticEnvelope(contamination=0.261),
"OCSVM": OneClassSVM(nu=0.261, gamma=0.05)}
colors = ['m', 'g', 'b']
"OCSVM": OneClassSVM(nu=0.261, gamma=0.05),
"SVDD": SVDD(kernel='rbf', gamma = 0.03, C=0.01)}
colors = ['m', 'g', 'b', 'y']
legend1 = {}
legend2 = {}

@@ -105,8 +106,9 @@
plt.ylim((yy1.min(), yy1.max()))
plt.legend((legend1_values_list[0].collections[0],
legend1_values_list[1].collections[0],
legend1_values_list[2].collections[0]),
(legend1_keys_list[0], legend1_keys_list[1], legend1_keys_list[2]),
legend1_values_list[2].collections[0],
legend1_values_list[3].collections[0]),
(legend1_keys_list[0], legend1_keys_list[1], legend1_keys_list[2], legend1_keys_list[3]),
loc="upper center",
prop=matplotlib.font_manager.FontProperties(size=12))
plt.ylabel("accessibility to radial highways")
@@ -122,8 +124,9 @@
plt.ylim((yy2.min(), yy2.max()))
plt.legend((legend2_values_list[0].collections[0],
legend2_values_list[1].collections[0],
legend2_values_list[2].collections[0]),
(legend2_values_list[0], legend2_values_list[1], legend2_values_list[2]),
legend2_values_list[2].collections[0],
legend2_values_list[3].collections[0]),
(legend2_keys_list[0], legend2_keys_list[1], legend2_keys_list[2], legend2_keys_list[3]),
loc="upper center",
prop=matplotlib.font_manager.FontProperties(size=12))
plt.ylabel("% lower status of the population")
12 changes: 7 additions & 5 deletions examples/covariance/plot_outlier_detection.py
@@ -49,8 +49,10 @@
classifiers = {
"One-Class SVM": svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05,
kernel="rbf", gamma=0.1),
"robust covariance estimator": EllipticEnvelope(contamination=.1),
"Isolation Forest": IsolationForest(max_samples=n_samples, random_state=rng)}
"Robust Covariance Estimator": EllipticEnvelope(contamination=0.1),
"Isolation Forest": IsolationForest(max_samples=n_samples,
random_state=rng)
}

# Compare given classifiers under given settings
xx, yy = np.meshgrid(np.linspace(-7, 7, 500), np.linspace(-7, 7, 500))
@@ -83,7 +85,6 @@
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
subplot = plt.subplot(1, 3, i + 1)
subplot.set_title("Outlier detection")
subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),
cmap=plt.cm.Blues_r)
a = subplot.contour(xx, yy, Z, levels=[threshold],
@@ -95,11 +96,12 @@
subplot.axis('tight')
subplot.legend(
[a.collections[0], b, c],
['learned decision function', 'true inliers', 'true outliers'],
['Decision function', 'True inliers', 'True outliers'],
prop=matplotlib.font_manager.FontProperties(size=11))
subplot.set_xlabel("%d. %s (errors: %d)" % (i + 1, clf_name, n_errors))
subplot.set_xlabel("%s (errors: %d)" % (clf_name, n_errors))
subplot.set_xlim((-7, 7))
subplot.set_ylim((-7, 7))
plt.suptitle("Outlier detection")
plt.subplots_adjust(0.04, 0.1, 0.96, 0.94, 0.1, 0.26)

plt.show()
4 changes: 2 additions & 2 deletions examples/svm/plot_oneclass.py
@@ -40,8 +40,8 @@
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.title("Novelty Detection")
plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), 0, 7), cmap=plt.cm.PuBu)
plt.title("Novelty Detection by One-class SVM")
plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), 0, 7), cmap=plt.cm.PuBu)
a = plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='darkred')
plt.contourf(xx, yy, Z, levels=[0, Z.max()], colors='palevioletred')
