
[MRG + 1] Novelty detection with LocalOutlierFactor #11569


Closed
wants to merge 23 commits into from
Commits
40f9fef
start novelty detection for lof
albertcthomas Feb 25, 2018
85532aa
use ValueError and always raise exception before return in methods
albertcthomas Apr 2, 2018
9465553
fix outlier detection common tests
albertcthomas Apr 2, 2018
2925a36
add tests for novelty ValueError
albertcthomas Apr 2, 2018
c1f7bf4
update plot_lof to plot_lof for outlier detection
albertcthomas Apr 8, 2018
2e3b0a3
[circle full] novelty detection example
albertcthomas Apr 8, 2018
e4bae2d
fix travis and outlier_detection example
albertcthomas Apr 8, 2018
8454951
add 2 components gaussian mixture with same variance in plot_anomaly_…
albertcthomas Apr 29, 2018
89e3126
[doc build] rm plot_outlier_detection and update outlier detection doc
albertcthomas Apr 29, 2018
c83be71
[doc build] sc in doc + add test for training scores
albertcthomas May 1, 2018
3bd477f
[doc build] sc and add table for LOF behavior in doc
albertcthomas May 1, 2018
bab882d
address review
albertcthomas May 25, 2018
7e84a4f
remove comma in plot docstring
albertcthomas Jun 19, 2018
1f5cb5a
move details on comparison between outlier detection estimators from …
albertcthomas Jun 19, 2018
c553263
sc in beginning of novelty and outlier detection doc + add distinctio…
albertcthomas Jun 19, 2018
3857938
sc in doc
albertcthomas Jun 23, 2018
d93bcf9
use properties to make method disappear depending on novelty value
albertcthomas Jun 28, 2018
8b12246
remove novelty=True statements in estimator_checks and use check_esti…
albertcthomas Jul 11, 2018
37e4bf0
add non regression tests checking availability of prediction methods …
albertcthomas Jul 16, 2018
1de858d
add entry in whatsnew
albertcthomas Jul 16, 2018
b733b22
pass on docstrings
agramfort Jul 16, 2018
6bbba29
remove decision_function in outlier examplea and use negative_outlier…
albertcthomas Jul 16, 2018
15f8e8e
update whatsnew
albertcthomas Jul 16, 2018
198 changes: 115 additions & 83 deletions doc/modules/outlier_detection.rst
@@ -12,13 +12,27 @@ belongs to the same distribution as existing observations (it is an
Often, this ability is used to clean real data sets. Two important
distinctions must be made:

:novelty detection:
The training data is not polluted by outliers, and we are interested in
detecting anomalies in new observations.

:outlier detection:
The training data contains outliers, and we need to fit the central
mode of the training data, ignoring the deviant observations.
The training data contains outliers which are defined as observations that
are far from the others. Outlier detection estimators thus try to fit the
regions where the training data is the most concentrated, ignoring the
deviant observations.

:novelty detection:
The training data is not polluted by outliers and we are interested in
detecting whether a **new** observation is an outlier. In this context an
outlier is also called a novelty.

Outlier detection and novelty detection are both used for anomaly
detection, where one is interested in detecting abnormal or unusual
observations. Outlier detection is then also known as unsupervised anomaly
detection and novelty detection as semi-supervised anomaly detection. In the
context of outlier detection, the outliers/anomalies cannot form a
dense cluster as available estimators assume that the outliers/anomalies are
located in low density regions. On the contrary, in the context of novelty
detection, novelties/anomalies can form a dense cluster as long as they are in
a low density region of the training data, considered as normal in this
context.

The scikit-learn project provides a set of machine learning tools that
can be used both for novelty and outlier detection. This strategy is
@@ -44,19 +58,55 @@ inliers::
estimator.decision_function(X_test)

Note that :class:`neighbors.LocalOutlierFactor` does not support
``predict`` and ``decision_function`` methods, as this algorithm is
purely transductive and is thus not designed to deal with new data.
``predict``, ``decision_function`` and ``score_samples`` methods by default
but only a ``fit_predict`` method, as this estimator was originally meant to
be applied for outlier detection. The scores of abnormality of the training
samples are accessible through the ``negative_outlier_factor_`` attribute.

If you really want to use :class:`neighbors.LocalOutlierFactor` for novelty
detection, i.e. predict labels or compute the score of abnormality of new
unseen data, you can instantiate the estimator with the ``novelty`` parameter
set to ``True`` before fitting the estimator. In this case, ``fit_predict`` is
not available.

.. warning:: **Novelty detection with Local Outlier Factor**

When ``novelty`` is set to ``True`` be aware that you must only use
``predict``, ``decision_function`` and ``score_samples`` on new unseen data
and not on the training samples as this would lead to wrong results.
The scores of abnormality of the training samples are always accessible
through the ``negative_outlier_factor_`` attribute.
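The default (outlier detection) behavior described above can be put into a short runnable sketch; the one-dimensional toy data below is illustrative only, not from the scikit-learn examples:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Four tightly packed inliers and one obvious outlier.
X = np.array([[0.0], [0.1], [0.2], [0.3], [10.0]])

lof = LocalOutlierFactor(n_neighbors=2)
labels = lof.fit_predict(X)  # 1 for inliers, -1 for outliers

# Abnormality of the *training* samples (lower means more abnormal):
scores = lof.negative_outlier_factor_
```

With the default ``novelty=False`` there is no ``predict`` to apply to new data; the scores above describe the training samples only.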


Overview of outlier detection methods
=====================================

A comparison of the outlier detection algorithms in scikit-learn. Local
Outlier Factor (LOF) does not show a decision boundary in black as it
has no predict method to be applied on new data when it is used for outlier
detection.

.. figure:: ../auto_examples/images/sphx_glr_plot_anomaly_comparison_001.png
:target: ../auto_examples/plot_anomaly_comparison.html
:align: center
:scale: 50

A comparison of the outlier detection algorithms in scikit-learn.

:class:`ensemble.IsolationForest` and :class:`neighbors.LocalOutlierFactor`
perform reasonably well on the data sets considered here.
The :class:`svm.OneClassSVM` is known to be sensitive to outliers and thus
does not perform very well for outlier detection. Finally,
:class:`covariance.EllipticEnvelope` assumes the data is Gaussian and learns
an ellipse. For more details on the different estimators refer to the example
:ref:`sphx_glr_auto_examples_plot_anomaly_comparison.py` and the sections
hereunder.

.. topic:: Examples:

* See :ref:`sphx_glr_auto_examples_plot_anomaly_comparison.py`
for a comparison of the :class:`svm.OneClassSVM`, the
:class:`ensemble.IsolationForest`, the
:class:`neighbors.LocalOutlierFactor` and
:class:`covariance.EllipticEnvelope`.
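The four estimators compared in the figure share the same ``fit_predict`` interface, so a minimal comparison can be sketched as follows; the data set and parameter values here are illustrative assumptions, not those of the example:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
# 95 Gaussian inliers plus 5 uniform outliers.
X = np.vstack([rng.normal(size=(95, 2)), rng.uniform(-6, 6, size=(5, 2))])

estimators = {
    "OneClassSVM": OneClassSVM(nu=0.05, gamma=0.1),
    "IsolationForest": IsolationForest(contamination=0.05, random_state=42),
    "LocalOutlierFactor": LocalOutlierFactor(contamination=0.05),
    "EllipticEnvelope": EllipticEnvelope(contamination=0.05, random_state=42),
}
# fit_predict returns 1 for inliers and -1 for outliers.
n_flagged = {name: int((est.fit_predict(X) == -1).sum())
             for name, est in estimators.items()}
```

With ``contamination=0.05`` the contamination-based estimators flag roughly 5% of the 100 samples; ``nu`` plays a similar role for the One-class SVM.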

Novelty Detection
=================
@@ -189,7 +239,7 @@ This strategy is illustrated below.
* See :ref:`sphx_glr_auto_examples_ensemble_plot_isolation_forest.py` for
an illustration of the use of IsolationForest.

* See :ref:`sphx_glr_auto_examples_covariance_plot_outlier_detection.py` for a
* See :ref:`sphx_glr_auto_examples_plot_anomaly_comparison.py` for a
comparison of :class:`ensemble.IsolationForest` with
:class:`neighbors.LocalOutlierFactor`,
:class:`svm.OneClassSVM` (tuned to perform like an outlier detection
@@ -237,20 +287,29 @@ where abnormal samples have different underlying densities.
The question is not how isolated the sample is, but how isolated it is
with respect to the surrounding neighborhood.
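This local aspect can be sketched as follows: two probe points lie at the same absolute distance from their respective cluster centers, but only the one next to the tight cluster is isolated relative to its neighborhood (the data is a hypothetical illustration):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
dense = rng.normal(loc=0.0, scale=0.1, size=(50, 2))    # tight cluster around (0, 0)
sparse = rng.normal(loc=10.0, scale=2.0, size=(50, 2))  # spread-out cluster around (10, 10)
probes = np.array([[1.0, 0.0],     # 1.0 away from the tight cluster center
                   [11.0, 10.0]])  # 1.0 away from the spread-out cluster center

X = np.vstack([dense, sparse, probes])
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)
probe_scores = lof.negative_outlier_factor_[-2:]
# probe_scores[0] is far lower (more abnormal): the same distance is huge
# relative to the tight cluster but ordinary for the spread-out one.
```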

When applying LOF for outlier detection, there are no ``predict``,
``decision_function`` and ``score_samples`` methods but only a ``fit_predict``
method. The scores of abnormality of the training samples are accessible
through the ``negative_outlier_factor_`` attribute.
Note that ``predict``, ``decision_function`` and ``score_samples`` can be used
on new unseen data when LOF is applied for novelty detection, i.e. when the
``novelty`` parameter is set to ``True``. See :ref:`novelty_with_lof`.


This strategy is illustrated below.

.. figure:: ../auto_examples/neighbors/images/sphx_glr_plot_lof_001.png
:target: ../auto_examples/neighbors/plot_lof.html
.. figure:: ../auto_examples/neighbors/images/sphx_glr_plot_lof_outlier_detection_001.png
:target: ../auto_examples/neighbors/plot_lof_outlier_detection.html
:align: center
:scale: 75%

.. topic:: Examples:

* See :ref:`sphx_glr_auto_examples_neighbors_plot_lof.py` for
an illustration of the use of :class:`neighbors.LocalOutlierFactor`.
* See :ref:`sphx_glr_auto_examples_neighbors_plot_lof_outlier_detection.py`
for an illustration of the use of :class:`neighbors.LocalOutlierFactor`.

* See :ref:`sphx_glr_auto_examples_covariance_plot_outlier_detection.py` for a
comparison with other anomaly detection methods.
* See :ref:`sphx_glr_auto_examples_plot_anomaly_comparison.py` for a
comparison with other anomaly detection methods.

.. topic:: References:

@@ -259,72 +318,45 @@ This strategy is illustrated below.
<http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf>`_
Proc. ACM SIGMOD

One-class SVM versus Elliptic Envelope versus Isolation Forest versus LOF
-------------------------------------------------------------------------

Strictly speaking, the One-class SVM is not an outlier-detection method,
but a novelty-detection method: its training set should not be
contaminated by outliers as it may fit them. That said, outlier detection
in high-dimension, or without any assumptions on the distribution of the
inlying data is very challenging, and a One-class SVM gives useful
results in these situations.

The examples below illustrate how the performance of the
:class:`covariance.EllipticEnvelope` degrades as the data is less and
less unimodal. The :class:`svm.OneClassSVM` works better on data with
multiple modes and :class:`ensemble.IsolationForest` and
:class:`neighbors.LocalOutlierFactor` perform well in all cases.

.. |outlier1| image:: ../auto_examples/covariance/images/sphx_glr_plot_outlier_detection_001.png
:target: ../auto_examples/covariance/plot_outlier_detection.html
:scale: 50%

.. |outlier2| image:: ../auto_examples/covariance/images/sphx_glr_plot_outlier_detection_002.png
:target: ../auto_examples/covariance/plot_outlier_detection.html
:scale: 50%

.. |outlier3| image:: ../auto_examples/covariance/images/sphx_glr_plot_outlier_detection_003.png
:target: ../auto_examples/covariance/plot_outlier_detection.html
:scale: 50%

.. list-table:: **Comparing One-class SVM, Isolation Forest, LOF, and Elliptic Envelope**
:widths: 40 60

*
- For an inlier mode that is well-centered and elliptic, the
:class:`svm.OneClassSVM` is not able to benefit from the
rotational symmetry of the inlier population. In addition, it
fits a bit the outliers present in the training set. On the
opposite, the decision rule based on fitting an
:class:`covariance.EllipticEnvelope` learns an ellipse, which
fits well the inlier distribution. The :class:`ensemble.IsolationForest`
and :class:`neighbors.LocalOutlierFactor` perform as well.
- |outlier1|

*
- As the inlier distribution becomes bimodal, the
:class:`covariance.EllipticEnvelope` does not fit the
inliers well. However, we can see that :class:`ensemble.IsolationForest`,
:class:`svm.OneClassSVM` and :class:`neighbors.LocalOutlierFactor`
have difficulty detecting the two modes,
and that the :class:`svm.OneClassSVM`
tends to overfit: because it has no model of inliers, it
interprets a region where, by chance, some outliers are
clustered as inliers.
- |outlier2|

*
- If the inlier distribution is strongly non-Gaussian, the
:class:`svm.OneClassSVM` is able to recover a reasonable
approximation as well as :class:`ensemble.IsolationForest`
and :class:`neighbors.LocalOutlierFactor`,
whereas the :class:`covariance.EllipticEnvelope` completely fails.
- |outlier3|
.. _novelty_with_lof:

.. topic:: Examples:
Novelty detection with Local Outlier Factor
===========================================

To use :class:`neighbors.LocalOutlierFactor` for novelty detection, i.e.
predict labels or compute the score of abnormality of new unseen data, you
need to instantiate the estimator with the ``novelty`` parameter
set to ``True`` before fitting the estimator::

from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(novelty=True)
lof.fit(X_train)

Note that ``fit_predict`` is not available in this case.

.. warning:: **Novelty detection with Local Outlier Factor**

When ``novelty`` is set to ``True`` be aware that you must only use
``predict``, ``decision_function`` and ``score_samples`` on new unseen data
and not on the training samples as this would lead to wrong results.
The scores of abnormality of the training samples are always accessible
through the ``negative_outlier_factor_`` attribute.
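A runnable sketch of these prediction methods on held-out data follows; the relation with ``offset_`` reflects scikit-learn's convention that ``decision_function`` is ``score_samples`` shifted so that 0 is the inlier/outlier threshold (data and parameters here are illustrative):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 2))
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)

X_test = np.array([[0.0, 0.0],   # central: expected to be predicted inlier (+1)
                   [5.0, 5.0]])  # far away: expected to be predicted outlier (-1)
pred = lof.predict(X_test)
scores = lof.score_samples(X_test)   # raw (negative) LOF scores of the test data
dec = lof.decision_function(X_test)  # shifted scores: negative means outlier
```

Note that all three calls are made on ``X_test``, never on ``X_train``, in line with the warning above.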

The behavior of LOF is summarized in the following table.

==================== ================================ =====================
Method Outlier detection Novelty detection
==================== ================================ =====================
`fit_predict` OK Not available
`predict` Not available Use only on test data
`decision_function` Not available Use only on test data
`score_samples` Use `negative_outlier_factor_` Use only on test data
==================== ================================ =====================
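The availability rules in the table can be checked directly, since unavailable methods are hidden on the instance depending on ``novelty`` (the hiding mechanism, a property raising ``AttributeError``, is an implementation detail):

```python
from sklearn.neighbors import LocalOutlierFactor

outlier_lof = LocalOutlierFactor()              # novelty=False (default)
novelty_lof = LocalOutlierFactor(novelty=True)

# Which of fit_predict / predict exist in each mode:
availability = {
    "outlier": [hasattr(outlier_lof, m) for m in ("fit_predict", "predict")],
    "novelty": [hasattr(novelty_lof, m) for m in ("fit_predict", "predict")],
}
# availability["outlier"] → [True, False]
# availability["novelty"] → [False, True]
```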


This strategy is illustrated below.

.. figure:: ../auto_examples/neighbors/images/sphx_glr_plot_lof_novelty_detection_001.png
:target: ../auto_examples/neighbors/plot_lof_novelty_detection.html
:align: center
:scale: 75%

* See :ref:`sphx_glr_auto_examples_covariance_plot_outlier_detection.py` for a
comparison of the :class:`svm.OneClassSVM` (tuned to perform like
an outlier detection method), the :class:`ensemble.IsolationForest`,
the :class:`neighbors.LocalOutlierFactor`
and a covariance-based outlier detection :class:`covariance.EllipticEnvelope`.
9 changes: 9 additions & 0 deletions doc/whats_new/v0.20.rst
@@ -827,6 +827,15 @@ Outlier Detection models
``raw_values`` parameter is deprecated as the shifted Mahalanobis distance
will be always returned in 0.22. :issue:`9015` by `Nicolas Goix`_.

- Novelty detection with :class:`neighbors.LocalOutlierFactor`:
Add a ``novelty`` parameter to :class:`neighbors.LocalOutlierFactor`. When
``novelty`` is set to True, :class:`neighbors.LocalOutlierFactor` can then
be used for novelty detection, i.e. predict on new unseen data. Available
prediction methods are ``predict``, ``decision_function`` and
``score_samples``. By default, ``novelty`` is set to ``False``, and only
the ``fit_predict`` method is available.
By :user:`Albert Thomas <albertcthomas>`.

Covariance

- The :func:`covariance.graph_lasso`, :class:`covariance.GraphLasso` and
129 changes: 0 additions & 129 deletions examples/covariance/plot_outlier_detection.py

This file was deleted.
