[MRG+2] LOF algorithm (Anomaly Detection) #5279

Merged: 18 commits into scikit-learn:master on Oct 25, 2016

Conversation

@ngoix (Contributor) commented Sep 16, 2015

@agramfort (Member)

thanks for the early PR

let me know when you need a review, i.e., when you've addressed the standard things (tests, example, some basic doc)

@jmschrei (Member)

I'd also be interested in reviewing this when you've moved past the WIP stage.

@ngoix (Contributor Author) commented Oct 9, 2015

I think it is ready for a first review, @agramfort @jmschrei!

__all__ = ["LOF"]


class LOFMixin(object):
Member:

why do you need a mixin here?

I would put all methods from Mixin into LOF class and make them private.

Member:

I agree. Mixins imply that multiple estimators will be using them.

Contributor Author:

Ok

@jmschrei (Member) commented Oct 9, 2015

I would like to see some more extensive unit tests, particularly in cases where the algorithm should fail (wrong dimensions or other incorrect types of data passed in). I'll be able to look more at the performance of the code once you merge the mixin with the other class, and change the API to always take in an X matrix.

@jmschrei (Member) commented Oct 9, 2015

I'd also like to see an example of it performing against one or more current algorithms, so that it is clear it is a valuable contribution.

@ngoix (Contributor Author) commented Oct 12, 2015

If you have a dataset X and want to remove outliers from it, you don't want to do

fit(X)
predict(X)

because then each sample is considered in its own neighbourhood: in predict(X), X is considered as 'new observations'.

What the user wants is:

for each x in X,
fit(X-{x})
predict(x)

which is allowed by

fit(X)
predict()

It is like looking for k-nearest-neighbors of points in a dataset X: you can do:

neigh = NearestNeighbors()
neigh.fit(X)
neigh.kneighbors()

which is different from

neigh = NearestNeighbors()
neigh.fit(X)
neigh.kneighbors(X)

I can make predict() have as signature

def predict(self, X):

and allow taking X=None as argument... Is that allowed?
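
A minimal sketch of that distinction with the existing NearestNeighbors API (the toy data here is made up for illustration):

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0], [0.1], [10.0]])
neigh = NearestNeighbors(n_neighbors=1).fit(X)

# No argument: each training sample is excluded from its own neighborhood.
dist_loo, _ = neigh.kneighbors()
# Passing X back in: the samples are treated as new queries, so each
# point finds itself at distance 0.
dist_self, _ = neigh.kneighbors(X)

print(dist_loo.ravel())   # [0.1  0.1  9.9]
print(dist_self.ravel())  # [0.  0.  0.]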

@agramfort (Member)

If you have a dataset X and want to remove outliers from it, you don't want to do

fit(X)
predict(X)

implementing a fit_predict(X) method is the way to go.
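
A sketch of what that could look like, assuming the private _predict() this PR later introduces (it scores the training set when called with no argument):

def fit_predict(self, X, y=None):
    """Fit the model to X and return its outlier labels, without
    considering the samples as their own neighbors."""
    return self.fit(X)._predict()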

@ngoix (Contributor Author) commented Oct 12, 2015

Ok, thanks!

@ngoix (Contributor Author) commented Oct 12, 2015

I merged the mixin with LOF class, changed the API and added a comparison example.
@agramfort @jmschrei what do you think?

clf.fit(X)
y_pred = clf.decision_function(X).ravel()

if clf_name=="Local Outlier Factor":
Member:

pep8
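
i.e., with single spaces around the comparison operator:

if clf_name == "Local Outlier Factor":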

@ngoix (Contributor Author) commented Oct 19, 2016

done!

@agramfort (Member)

@amueller want to take a final look?

for me it's good enough to merge

@amueller (Member) left a comment:

I think caching the LRD on the training set would be good (and would actually make the code easier to follow). I think predict and decision_function should either both be private or both public; I kinda tend towards both private, as making something public later is easier than hiding it.
The rest is mostly minor, though how to tune n_neighbors seems pretty important.

Returns
-------
lof_scores : array, shape (n_samples,)
The Local Outlier Factor of each input samples. The lower,
Member:

This seems to contradict the title of the docstring.

Contributor Author:

Yes, it is -lof_scores.

return is_inlier

def decision_function(self, X):
"""Opposite of the Local Outlier Factor of X (as bigger is better).
Member:

I think the docstring should be more explicit. Is low outlier or high outlier?
Actually, to be consistent with the other estimators, I think negative needs to be outlier.

Contributor Author:

I don't think so: for all the decision functions, bigger is better (large values correspond to inliers). For predictions, though, negative values (-1) correspond to outliers. (It's true that this is a bit odd.)
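
For reference, a sketch of that shared convention, using IsolationForest purely as an illustration (the toy data is made up):

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[0.0], [0.1], [0.2], [10.0]])  # last point is abnormal
clf = IsolationForest(random_state=0).fit(X)

scores = clf.decision_function(X)  # continuous: bigger is better (inlier)
labels = clf.predict(X)            # hard labels: +1 = inlier, -1 = outlier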

@@ -18,6 +18,9 @@
hence more adapted to large-dimensional settings, even if it performs
quite well in the examples below.

- using the Local Outlier Factor to measure the local deviation of a given
Member:

It's kinda odd that this example lives in this folder... but whatever..

Contributor Author:

Yes, very weird! It is the folder of the first outlier detection algorithm in scikit-learn.


# Avoid re-computing X_lrd if same parameters:
if not (np.all(distances_X == self._distances_fit_X_) *
np.all(self._neighbors_indices_fit_X_ == neighbors_indices_X)):
Member:

this == raises a deprecation warning

/home/andy/checkout/scikit-learn/sklearn/neighbors/lof.py:279: DeprecationWarning: elementwise == comparison failed; this will raise an error in the future.
np.all(self.neighbors_indices_fit_X == neighbors_indices_X)):

This means they have different sizes, I think. So I guess you should check the shape first?

Also happens for the line above.
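
One way to avoid the warning (a sketch: np.array_equal returns False on a shape mismatch instead of attempting an elementwise comparison):

# Avoid re-computing X_lrd if same parameters:
same_inputs = (np.array_equal(distances_X, self._distances_fit_X_) and
               np.array_equal(neighbors_indices_X,
                              self._neighbors_indices_fit_X_))
if not same_inputs:
    ...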

The question is not, how isolated the sample is, but how isolated it is
with respect to the surrounding neighborhood.

This strategy is illustrated below.
Member:

I don't feel that the example illustrates the point that was just made about the different densities. I'm fine to leave it as-is but I don't get a good idea of the global vs local. It would be nice to also illustrate a failure mode maybe?

Contributor Author:

No global vs local anymore!

# Avoid re-computing X_lrd if same parameters:
if not (np.all(distances_X == self._distances_fit_X_) *
np.all(self._neighbors_indices_fit_X_ == neighbors_indices_X)):
lrd = self._local_reachability_density(
Member:

It seems that lrd is "small" compared to _distances_fit_X_ and _neighbors_indices_fit_X_. Why not compute it in fit and store it once and for all? You are currently recomputing it on every call to _local_outlier_factor.
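
A sketch of that caching (the _lrd attribute name is hypothetical):

def fit(self, X, y=None):
    ...
    self._distances_fit_X_, self._neighbors_indices_fit_X_ = \
        self.kneighbors(n_neighbors=self.n_neighbors)
    # Compute the local reachability density of the training set once,
    # at fit time, and reuse it in every later scoring call.
    self._lrd = self._local_reachability_density(
        self._distances_fit_X_, self._neighbors_indices_fit_X_)
    return self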

Parameters
----------
distances_X : array, shape (n_query, self.n_neighbors)
Distances to the neighbors (in the training samples self._fit_X) of
Member:

I would put backticks around _fit_X to be safe ;)

Contributor Author:

Do you mean replacing self._fit_X by `self._fit_X` or ``self._fit_X``, or just by `_fit_X`? I don't understand the purpose...

Member:

I meant putting backticks around self._fit_X: a) for nicer highlighting, b) I'm not sure Sphinx will render the current version correctly because of the underscore. But I might be paranoid.
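
For example, the numpydoc line would become (double backticks are reST inline literals):

distances_X : array, shape (n_query, self.n_neighbors)
    Distances to the neighbors (in the training samples ``self._fit_X``)
    of each query point.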

score = clf.fit(X).outlier_factor_
assert_array_equal(clf._fit_X, X)

# Assert scores are good:
Member:

Assert smallest outlier score is greater than largest inlier score

Contributor Author:

ok
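
The tightened assertion could read as follows (a sketch; that the outliers occupy the last n_outliers rows of X is an assumption here):

# Assert smallest outlier score is greater than largest inlier score
# (for outlier_factor_, the higher the score, the more abnormal):
assert np.min(score[-n_outliers:]) > np.max(score[:-n_outliers])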

clf = neighbors.LocalOutlierFactor().fit(X_train)

# predict scores (the lower, the more normal)
y_pred = - clf.decision_function(X_test)
Member:

I would find it more natural to give the outliers the negative label. If you want to leave it like this, remove the space after the -

Contributor Author:

I agree, but this is to be consistent with OneClassSVM, EllipticEnvelope and IsolationForest.

distance between them. This works for Scipy's metrics, but is less
efficient than passing the metric name as a string.

Distance matrices are not supported.
Member:

I don't understand this comment.

@ngoix force-pushed the lof branch 4 times, most recently from 95036df to 662ed9c on October 24, 2016
clf = neighbors.LocalOutlierFactor().fit(X_train)

# predict scores (the lower, the more normal)
y_pred = -clf.decision_function(X_test)
Member:

I meant changing y_test to be [0] * 20 + [-1] * 20 and then remove the -
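
i.e., something along these lines (a sketch of the suggestion; the 20/20 split comes from the quoted labels):

y_test = np.array([0] * 20 + [-1] * 20)  # inliers first, then outliers

# With inliers carrying the larger label (0 > -1), decision_function
# scores (larger = more normal) already align with y_test, so the
# leading minus sign can go away:
y_score = clf.decision_function(X_test)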


return self

def _predict(self, X=None):
Member:

I would really like to be consistent. I don't think there's a good argument to have one but not the other. Not sure if the example is a strong enough point to make them both public.

for i, (clf_name, clf) in enumerate(classifiers.items()):
# fit the data and tag outliers
clf.fit(X)
scores_pred = clf.decision_function(X)
if clf_name == "Local Outlier Factor":
Member:

Wait, I don't understand this. Please elaborate.

Attributes
----------
outlier_factor_ : numpy array, shape (n_samples,)
The LOF of X. The lower, the more normal.
Member:

I don't know which comment of yours refers to which comment of mine.

For the first comment: Yes, I'd either do negative_outlier_factor or inlier_score or something generic?

For the second comment: The explanation of outlier_factor_ as an attribute says "The LOF of X" what is X? It's the training set this LocalOutlierFactor estimator was trained on, right?
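
As the commit log below records, the attribute ultimately became negative_outlier_factor_ and decision_function was made private; training-set usage then looks like this (X_train stands for whatever data was fit):

from sklearn.neighbors import LocalOutlierFactor

clf = LocalOutlierFactor(n_neighbors=20).fit(X_train)
# Scores of the training samples themselves; the lower, the more abnormal.
train_scores = clf.negative_outlier_factor_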

@amueller merged commit 788a458 into scikit-learn:master Oct 25, 2016
@amueller (Member)

thanks :)

@raghavrv (Member)

Hurray 🍻 Thanks @ngoix !!

@tguillemot (Contributor)

Yay 🍻 !!

@albertcthomas (Contributor)

Thanks @ngoix !!

@GaelVaroquaux (Member) commented Oct 25, 2016 via email

@agramfort (Member)

Congrats !

sergeyf pushed a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
* LOF algorithm

add tests and example

fix DeprecationWarning by reshape(1,-1) one-sample data

LOF with inheritance

lof and lof2 return same score

fix bugs

fix bugs

optimized and cosmit

rm lof2

cosmit

rm MixinLOF + fit_predict

fix travis - optimize pairwise_distance like in KNeighborsMixin.kneighbors

add comparison example + doc

LOF -> LocalOutlierFactor
cosmit

change LOF API:
-fit(X).predict() and fit(X).decision_function() do prediction on X without
 considering samples as their own neighbors (ie without considering X as a
 new dataset as does fit(X).predict(X))
-rm fit_predict() method
-add a contamination parameter s.t. predict returns a binary value like other
 anomaly detection algos

cosmit

doc + debug example

correction doc

pass on doc + examples

pep8 + fix warnings

first attempt at fixing API issues

minor changes

takes into account tguillemot advice

-remove pairwise_distance calculation as too heavy in memory
-add benchmarks

cosmit

minor changes + deals with duplicates

fix deprecation warnings

* factorize the two for loops

* take into account @albertthomas88 review and cosmit

* fix doc

* alex review + rebase

* make predict private add outlier_factor_ attribute and update tests

* make fit_predict take y argument

* fix benchmarks file

* update examples

* make decision_function public (rm X=None default)

* fix travis

* take into account tguillemot review + remove useless k_distance function

* fix broken links :meth:`kneighbors`

* cosmit

* whatsnew

* amueller review + remove _local_outlier_factor method

* add n_neighbors_ parameter the effective nb neighbors we use

* make decision_function private and negative_outlier_factor attribute
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017