[MRG+2] LOF algorithm (Anomaly Detection) #5279

Merged: 18 commits into scikit-learn:master on Oct 25, 2016

Conversation

@ngoix (Contributor) commented Sep 16, 2015

@agramfort (Member)

thanks for the early PR

let me know when you need a review, i.e., when you've addressed the standard things (tests, example, some basic doc)

@jmschrei (Member)

I'd also be interested in reviewing this when you've moved past the WIP stage.

@ngoix (Contributor Author) commented Oct 9, 2015

I think it is ready for a first review, @agramfort @jmschrei!

__all__ = ["LOF"]


class LOFMixin(object):
Member:

why do you need a mixin here?

I would put all methods from Mixin into LOF class and make them private.

Member:

I agree. Mixins imply that multiple estimators will be using them.

Contributor Author:

Ok

@jmschrei (Member) commented Oct 9, 2015

I would like to see some more extensive unit tests, particularly in cases where the algorithm should fail (wrong dimensions or other incorrect types of data passed in). I'll be able to look more at the performance of the code once you merge the mixin with the other class, and change the API to always take in an X matrix.

@jmschrei (Member) commented Oct 9, 2015

I'd also like to see an example of it performing against one or more current algorithms, so that it is clear it is a valuable contribution.

@ngoix (Contributor Author) commented Oct 12, 2015

If you have a dataset X and want to remove outliers from it, you don't want to do

fit(X)
predict(X)

because then each sample is considered in its own neighbourhood: in predict(X), X is considered as 'new observations'.

What the user wants is:

for each x in X,
fit(X-{x})
predict(x)

which is allowed by

fit(X)
predict()

It is like looking for k-nearest-neighbors of points in a dataset X: you can do:

neigh = NearestNeighbors()
neigh.fit(X)
neigh.kneighbors()

which is different from

neigh = NearestNeighbors()
neigh.fit(X)
neigh.kneighbors(X)

I can make predict() have as signature

def predict(self, X):

and allow taking X=None as argument... Is that allowed?
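
A minimal sketch of that distinction with the existing NearestNeighbors API (the toy data here is made up for illustration):

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0], [0.1], [10.0]])
neigh = NearestNeighbors(n_neighbors=1).fit(X)

# No argument: each training sample is excluded from its own neighborhood.
dist_loo, _ = neigh.kneighbors()
# Passing X back in: the samples are treated as new queries, so each
# point finds itself at distance 0.
dist_self, _ = neigh.kneighbors(X)

print(dist_loo.ravel())   # [0.1  0.1  9.9]
print(dist_self.ravel())  # [0.  0.  0.]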

@agramfort (Member)

If you have a dataset X and want to remove outliers from it, you don't want to do

fit(X)
predict(X)

implementing a fit_predict(X) method is the way to go.
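
A sketch of what that could look like, assuming the private _predict() this PR later introduces (it scores the training set when called with no argument):

def fit_predict(self, X, y=None):
    """Fit the model to X and return its outlier labels, without
    considering the samples as their own neighbors."""
    return self.fit(X)._predict()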

@ngoix (Contributor Author) commented Oct 12, 2015

Ok, thanks!

@ngoix (Contributor Author) commented Oct 12, 2015

I merged the mixin with LOF class, changed the API and added a comparison example.
@agramfort @jmschrei what do you think?

clf.fit(X)
y_pred = clf.decision_function(X).ravel()

if clf_name=="Local Outlier Factor":
Member:

pep8
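
i.e., with single spaces around the comparison operator:

if clf_name == "Local Outlier Factor":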

@ngoix (Contributor Author) commented Oct 19, 2016

done!

@agramfort (Member)

@amueller want to take a final look?

for me it's good enough to merge

@amueller (Member) left a comment:

I think caching the LRD on the training set would be good (and would actually make the code easier to follow). I think predict and decision_function should either both be private or both public; I kinda tend towards both private, as making something public later is easier than hiding it.
The rest is mostly minor, though how to tune n_neighbors seems pretty important.

Returns
-------
lof_scores : array, shape (n_samples,)
The Local Outlier Factor of each input samples. The lower,
Member:

This seems to contradict the title of the docstring.

Contributor Author:

Yes, it is -lof_scores.

return is_inlier

def decision_function(self, X):
"""Opposite of the Local Outlier Factor of X (as bigger is better).
Member:

I think the docstring should be more explicit. Is low outlier or high outlier?
Actually, to be consistent with the other estimators, I think negative needs to be outlier.

Contributor Author:

I don't think so: for all the decision functions, bigger is better (large values correspond to inliers). For predictions, though, negative values (-1) correspond to outliers. (It's true that this is a bit odd.)
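
For reference, a sketch of that shared convention, using IsolationForest purely as an illustration (the toy data is made up):

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[0.0], [0.1], [0.2], [10.0]])  # last point is abnormal
clf = IsolationForest(random_state=0).fit(X)

scores = clf.decision_function(X)  # continuous: bigger is better (inlier)
labels = clf.predict(X)            # hard labels: +1 = inlier, -1 = outlier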

@@ -18,6 +18,9 @@
hence more adapted to large-dimensional settings, even if it performs
quite well in the examples below.

- using the Local Outlier Factor to measure the local deviation of a given
Member:

It's kinda odd that this example lives in this folder... but whatever..

Contributor Author:

Yes, very weird! It is the folder of the first outlier detection algorithm in scikit-learn.


# Avoid re-computing X_lrd if same parameters:
if not (np.all(distances_X == self._distances_fit_X_) *
np.all(self._neighbors_indices_fit_X_ == neighbors_indices_X)):
Member:

this == raises a deprecation warning

/home/andy/checkout/scikit-learn/sklearn/neighbors/lof.py:279: DeprecationWarning: elementwise == comparison failed; this will raise an error in the future.
np.all(self.neighbors_indices_fit_X == neighbors_indices_X)):

This means they have different sizes, I think. So I guess you should check the shape first?

Also happens for the line above.
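
One way to avoid the warning (a sketch: np.array_equal returns False on a shape mismatch instead of attempting an elementwise comparison):

# Avoid re-computing X_lrd if same parameters:
same_inputs = (np.array_equal(distances_X, self._distances_fit_X_) and
               np.array_equal(neighbors_indices_X,
                              self._neighbors_indices_fit_X_))
if not same_inputs:
    ...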

The question is not, how isolated the sample is, but how isolated it is
with respect to the surrounding neighborhood.

This strategy is illustrated below.
Member:

I don't feel that the example illustrates the point that was just made about the different densities. I'm fine to leave it as-is but I don't get a good idea of the global vs local. It would be nice to also illustrate a failure mode maybe?

Contributor Author:

No global vs local anymore!

# Avoid re-computing X_lrd if same parameters:
if not (np.all(distances_X == self._distances_fit_X_) *
np.all(self._neighbors_indices_fit_X_ == neighbors_indices_X)):
lrd = self._local_reachability_density(
Member:

It seems that lrd is "small" compared to _distances_fit_X_ and _neighbors_indices_fit_X_. Why not compute it in fit and store it once and for all? You are currently recomputing it on every call to _local_outlier_factor.
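
A sketch of that caching (the _lrd attribute name is hypothetical):

def fit(self, X, y=None):
    ...
    self._distances_fit_X_, self._neighbors_indices_fit_X_ = \
        self.kneighbors(n_neighbors=self.n_neighbors)
    # Compute the local reachability density of the training set once,
    # at fit time, and reuse it in every later scoring call.
    self._lrd = self._local_reachability_density(
        self._distances_fit_X_, self._neighbors_indices_fit_X_)
    return self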

Parameters
----------
distances_X : array, shape (n_query, self.n_neighbors)
Distances to the neighbors (in the training samples self._fit_X) of
Member:

I would put backticks around _fit_X to be safe ;)

Contributor Author:

Do you mean replacing self._fit_X by `self._fit_X` or ``self._fit_X``, or just by `_fit_X`? I don't understand the purpose...

Member:

I meant putting backticks around self._fit_X: a) for nicer highlighting, b) I'm not sure Sphinx will render the current version correctly because of the underscore. But I might be paranoid.
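
For example, the numpydoc line would become (double backticks are reST inline literals):

distances_X : array, shape (n_query, self.n_neighbors)
    Distances to the neighbors (in the training samples ``self._fit_X``)
    of each query point.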

score = clf.fit(X).outlier_factor_
assert_array_equal(clf._fit_X, X)

# Assert scores are good:
Member:

Assert smallest outlier score is greater than largest inlier score

Contributor Author:

ok
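
The tightened assertion could read as follows (a sketch; that the outliers occupy the last n_outliers rows of X is an assumption here):

# Assert smallest outlier score is greater than largest inlier score
# (for outlier_factor_, the higher the score, the more abnormal):
assert np.min(score[-n_outliers:]) > np.max(score[:-n_outliers])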

clf = neighbors.LocalOutlierFactor().fit(X_train)

# predict scores (the lower, the more normal)
y_pred = - clf.decision_function(X_test)
Member:

I would find it more natural to give the outliers the negative label. If you want to leave it like this, remove the space after the -

Contributor Author:

I agree, but this is to be consistent with OneClassSVM, EllipticEnvelope and IsolationForest.

distance between them. This works for Scipy's metrics, but is less
efficient than passing the metric name as a string.

Distance matrices are not supported.
Member:

I don't understand this comment.

@ngoix force-pushed the lof branch 4 times, most recently from 95036df to 662ed9c on October 24, 2016
clf = neighbors.LocalOutlierFactor().fit(X_train)

# predict scores (the lower, the more normal)
y_pred = -clf.decision_function(X_test)
Member:

I meant changing y_test to be [0] * 20 + [-1] * 20 and then remove the -
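
i.e., something along these lines (a sketch of the suggestion; the 20/20 split comes from the quoted labels):

y_test = np.array([0] * 20 + [-1] * 20)  # inliers first, then outliers

# With inliers carrying the larger label (0 > -1), decision_function
# scores (larger = more normal) already align with y_test, so the
# leading minus sign can go away:
y_score = clf.decision_function(X_test)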


return self

def _predict(self, X=None):
Member:

I would really like to be consistent. I don't think there's a good argument to have one but not the other. Not sure if the example is a strong enough point to make them both public.

for i, (clf_name, clf) in enumerate(classifiers.items()):
# fit the data and tag outliers
clf.fit(X)
scores_pred = clf.decision_function(X)
if clf_name == "Local Outlier Factor":
Member:

Wait, I don't understand this. Please elaborate.

Attributes
----------
outlier_factor_ : numpy array, shape (n_samples,)
The LOF of X. The lower, the more normal.
Member:

I don't know which comment of yours refers to which comment of mine.

For the first comment: Yes, I'd either do negative_outlier_factor or inlier_score or something generic?

For the second comment: The explanation of outlier_factor_ as an attribute says "The LOF of X" what is X? It's the training set this LocalOutlierFactor estimator was trained on, right?
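
As the commit log below records, the attribute ultimately became negative_outlier_factor_ and decision_function was made private; training-set usage then looks like this (X_train stands for whatever data was fit):

from sklearn.neighbors import LocalOutlierFactor

clf = LocalOutlierFactor(n_neighbors=20).fit(X_train)
# Scores of the training samples themselves; the lower, the more abnormal.
train_scores = clf.negative_outlier_factor_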

@amueller merged commit 788a458 into scikit-learn:master Oct 25, 2016
@amueller (Member)

thanks :)

@raghavrv (Member)

Hurray 🍻 Thanks @ngoix !!

@tguillemot (Contributor)

Yay 🍻 !!

@albertcthomas (Contributor)

Thanks @ngoix !!

@GaelVaroquaux (Member) commented Oct 25, 2016 via email

@agramfort (Member)

Congrats !

sergeyf pushed a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
* LOF algorithm

add tests and example

fix DeprecationWarning by reshape(1,-1) one-sample data

LOF with inheritance

lof and lof2 return same score

fix bugs

fix bugs

optimized and cosmit

rm lof2

cosmit

rm MixinLOF + fit_predict

fix travis - optimize pairwise_distance like in KNeighborsMixin.kneighbors

add comparison example + doc

LOF -> LocalOutlierFactor
cosmit

change LOF API:
-fit(X).predict() and fit(X).decision_function() do prediction on X without
 considering samples as their own neighbors (ie without considering X as a
 new dataset as does fit(X).predict(X))
-rm fit_predict() method
-add a contamination parameter s.t. predict returns a binary value like other
 anomaly detection algos

cosmit

doc + debug example

correction doc

pass on doc + examples

pep8 + fix warnings

first attempt at fixing API issues

minor changes

takes into account tguillemot advice

-remove pairwise_distance calculation as too heavy in memory
-add benchmarks

cosmit

minor changes + deals with duplicates

fix deprecation warnings

* factorize the two for loops

* take into account @albertthomas88 review and cosmit

* fix doc

* alex review + rebase

* make predict private add outlier_factor_ attribute and update tests

* make fit_predict take y argument

* fix benchmarks file

* update examples

* make decision_function public (rm X=None default)

* fix travis

* take into account tguillemot review + remove useless k_distance function

* fix broken links :meth:`kneighbors`

* cosmit

* whatsnew

* amueller review + remove _local_outlier_factor method

* add n_neighbors_ parameter the effective nb neighbors we use

* make decision_function private and negative_outlier_factor attribute
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017