[MRG+2] LOF algorithm (Anomaly Detection) #5279
Conversation
Thanks for the early PR. Let me know when you need a review, i.e. when you have addressed the standard things (tests, an example, some basic doc).
I'd also be interested in reviewing this when you've moved past the WIP stage.
I think it is ready for a first review @agramfort @jmschrei!
__all__ = ["LOF"]

class LOFMixin(object):
Why do you need a mixin here? I would put all the methods from the mixin into the LOF class and make them private.
I agree. Mixins imply that multiple estimators will be using them.
Ok
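A minimal sketch of the suggested refactor (class and method names follow this PR; the bodies are elided):

```python
# Hedged sketch: the former LOFMixin methods become private helpers
# on the estimator itself, instead of a shared public mixin API.
class LOF:
    def fit(self, X, y=None):
        # fit the underlying k-nearest-neighbors structure on X
        ...
        return self

    def _local_reachability_density(self, distances_X, neighbors_indices):
        # private: an implementation detail rather than a reusable mixin
        ...
```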
I would like to see some more extensive unit tests, particularly for cases where the algorithm should fail (wrong dimensions or other incorrect types of data passed in). I'll be able to look more at the performance of the code once you merge the mixin into the other class and change the API to always take an X matrix.
I'd also like to see an example of it performing against one or more current algorithms, so that it is clear it is a valuable contribution.
If you have a dataset X and want to remove outliers from it, you don't want to do `clf.fit(X).predict(X)`, because then each sample is considered in its own neighbourhood: in `predict(X)`, X is treated as 'new observations'. What the user wants is `clf.fit(X).predict()`, which is allowed by taking X=None as the default. It is like looking for the k-nearest neighbors of points in a dataset X: you can do `neigh.fit(X).kneighbors()`, which is different from `neigh.fit(X).kneighbors(X)` (see the sketch below). I can make `predict` allow taking X=None as an argument... Is that allowed?
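To illustrate the distinction with scikit-learn's existing neighbors API (a small self-contained example, not code from this PR):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0], [0.1], [5.0]])
nn = NearestNeighbors(n_neighbors=1).fit(X)

# Passing X treats it as new observations: each training point comes
# back as its own nearest neighbor, at distance zero.
print(nn.kneighbors(X)[1].ravel())   # [0 1 2]

# Passing nothing (X=None) queries the training set itself, excluding
# each point from its own neighborhood.
print(nn.kneighbors()[1].ravel())    # [1 0 1]
```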
Implementing a fit_predict(X) method is the way to go.
Ok, thanks!
I merged the mixin into the LOF class, changed the API, and added a comparison example.
clf.fit(X)
y_pred = clf.decision_function(X).ravel()

if clf_name=="Local Outlier Factor":
pep8
done!
@amueller want to take a final look? For me it's good enough to merge.
I think caching the LRD on the training set would be good (and would actually make the code easier to follow). I think either `predict` and `decision_function` should both be private or neither; I kinda tend towards both, as making public is easier than hiding. The rest is mostly minor, though how to tune `n_neighbors` seems pretty important.
Returns
-------
lof_scores : array, shape (n_samples,)
    The Local Outlier Factor of each input sample. The lower,
This seems to contradict the title of the docstring.
Yes, what is actually returned is `-lof_scores`.
return is_inlier

def decision_function(self, X):
    """Opposite of the Local Outlier Factor of X (as bigger is better).
I think the docstring should be more explicit. Does low mean outlier, or does high mean outlier? Actually, to be consistent with the other estimators, I think negative needs to mean outlier.
I don't think so: for all the decision functions, bigger is better (large values correspond to inliers). For prediction, though, negative values (-1) correspond to outliers. (It's true that this is a bit odd.)
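For concreteness, a hedged sketch of that shared convention using IsolationForest, which already follows it (the toy dataset here is made up):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.r_[rng.randn(20, 2), [[6.0, 6.0]]]   # one obvious outlier last

clf = IsolationForest(random_state=0).fit(X)
scores = clf.decision_function(X)   # bigger is better: large = inlier
labels = clf.predict(X)             # +1 for inliers, -1 for outliers

print(scores.argmin())   # 20: the far-away point gets the lowest score
print(labels[20])        # -1: and it should receive the negative label
```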
@@ -18,6 +18,9 @@
hence more adapted to large-dimensional settings, even if it performs
quite well in the examples below.

- using the Local Outlier Factor to measure the local deviation of a given
It's kinda odd that this example lives in this folder... but whatever..
Yes, very weird! It is the folder of the first outlier detection algorithm in scikit-learn.
# Avoid re-computing X_lrd if same parameters:
if not (np.all(distances_X == self._distances_fit_X_) *
        np.all(self._neighbors_indices_fit_X_ == neighbors_indices_X)):
this `==` raises a deprecation warning:

/home/andy/checkout/scikit-learn/sklearn/neighbors/lof.py:279: DeprecationWarning: elementwise == comparison failed; this will raise an error in the future.
  np.all(self.neighbors_indices_fit_X == neighbors_indices_X)):

This means the arrays have different shapes, I think. So I guess you should check the shape first? The same also happens for the line above.
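A hedged sketch of that fix (the helper name is hypothetical); note that np.array_equal performs exactly this shape-then-values check:

```python
import numpy as np

def _same_array(a, b):
    # Compare shapes first so arrays of different sizes never reach the
    # elementwise ==, which is what triggers the DeprecationWarning.
    return a.shape == b.shape and bool(np.all(a == b))

a = np.arange(6).reshape(2, 3)
b = np.arange(8).reshape(2, 4)
print(_same_array(a, b))      # False, and no warning is raised
print(np.array_equal(a, b))   # False: the built-in equivalent
```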
The question is not how isolated the sample is, but how isolated it is with respect to the surrounding neighborhood.

This strategy is illustrated below.
I don't feel that the example illustrates the point that was just made about the different densities. I'm fine with leaving it as is, but I don't get a good idea of global vs. local. It would be nice to also illustrate a failure mode, maybe?
No global vs local anymore!
# Avoid re-computing X_lrd if same parameters:
if not (np.all(distances_X == self._distances_fit_X_) *
        np.all(self._neighbors_indices_fit_X_ == neighbors_indices_X)):
    lrd = self._local_reachability_density(
It seems that `lrd` is "small" compared to `_distances_fit_X_` and `_neighbors_indices_fit_X_`. Why not compute it in `fit` and store it once and for all? You are currently recomputing it on every call to `_local_outlier_factor`.
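A hedged outline of that suggestion (attribute and method names follow this PR; the elided parts stand in for the existing fit logic):

```python
class LocalOutlierFactorSketch:
    """Hypothetical outline: cache the training-set LRD in fit."""

    def fit(self, X, y=None):
        # ... fit the nearest-neighbors structure on X as before ...
        self._distances_fit_X_, self._neighbors_indices_fit_X_ = \
            self.kneighbors(n_neighbors=self.n_neighbors)
        # compute the local reachability density of the training samples
        # once and store it, instead of recomputing it on every call
        # to _local_outlier_factor
        self._lrd_fit_X_ = self._local_reachability_density(
            self._distances_fit_X_, self._neighbors_indices_fit_X_)
        return self
```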
Parameters
----------
distances_X : array, shape (n_query, self.n_neighbors)
    Distances to the neighbors (in the training samples self._fit_X) of
I would put backticks around `_fit_X` to be safe ;)
Do you mean replacing self._fit_X by `self._fit_X`, or just by `_fit_X`? I don't understand the purpose...
I meant putting backticks around `self._fit_X`: a) for nicer highlighting, b) I'm not sure sphinx will render the current version correctly because of the underscore. But I might be paranoid.
score = clf.fit(X).outlier_factor_
assert_array_equal(clf._fit_X, X)

# Assert scores are good:
Assert that the smallest outlier score is greater than the largest inlier score.
ok
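A hedged sketch of such a test, with made-up data and using the negative_outlier_factor_ attribute name the PR eventually settled on:

```python
import numpy as np
from sklearn import neighbors

rng = np.random.RandomState(42)
X = np.r_[rng.randn(20, 2), [[10.0, 10.0]]]   # one clear outlier at index 20

clf = neighbors.LocalOutlierFactor(n_neighbors=5).fit(X)
lof = -clf.negative_outlier_factor_   # raw LOF: the higher, the more abnormal

# the smallest outlier score must exceed the largest inlier score
assert lof[20] > lof[:20].max()
```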
clf = neighbors.LocalOutlierFactor().fit(X_train)

# predict scores (the lower, the more normal)
y_pred = - clf.decision_function(X_test)
I would find it more natural to give the outliers the negative label. If you want to leave it like this, remove the space after `-`.
I agree, but this is to be consistent with `OneClassSVM`, `EllipticEnvelope` and `IsolationForest`.
distance between them. This works for Scipy's metrics, but is less
efficient than passing the metric name as a string.

Distance matrices are not supported.
I don't understand this comment.
clf = neighbors.LocalOutlierFactor().fit(X_train)

# predict scores (the lower, the more normal)
y_pred = -clf.decision_function(X_test)
I meant changing `y_test` to be `[0] * 20 + [-1] * 20` and then removing the `-` (see the sketch below).
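A hedged sketch of that change with made-up data; note that in released scikit-learn, calling decision_function on new data requires constructing the estimator with novelty=True, an option that postdates this PR:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
X_train = rng.randn(100, 2)
X_test = np.r_[rng.randn(20, 2), rng.uniform(4, 6, size=(20, 2))]
y_test = np.array([0] * 20 + [-1] * 20)   # outliers carry the negative label

clf = LocalOutlierFactor(novelty=True).fit(X_train)
y_score = clf.decision_function(X_test)   # no leading minus needed anymore

# inliers (label 0) should now rank above outliers (label -1)
print(roc_auc_score(y_test == 0, y_score))   # expected to be close to 1.0
```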
return self

def _predict(self, X=None):
I would really like to be consistent: I don't think there's a good argument for having one public but not the other. I'm not sure the example is a strong enough point to make them both public.
for i, (clf_name, clf) in enumerate(classifiers.items()):
    # fit the data and tag outliers
    clf.fit(X)
    scores_pred = clf.decision_function(X)
    if clf_name == "Local Outlier Factor":
Wait, I don't understand this. Please elaborate.
Attributes
----------
outlier_factor_ : numpy array, shape (n_samples,)
    The LOF of X. The lower, the more normal.
I don't know which comment of yours refers to which comment of mine. For the first comment: yes, I'd either do `negative_outlier_factor` or `inlier_score` or something generic. For the second comment: the explanation of `outlier_factor_` as an attribute says "The LOF of X"; what is X? It's the training set this `LocalOutlierFactor` estimator was trained on, right?
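A tiny sketch confirming that reading, against the released API where the attribute ended up as negative_outlier_factor_: it holds one score per sample of the X passed to fit:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X_train = np.random.RandomState(0).randn(30, 2)
clf = LocalOutlierFactor(n_neighbors=10).fit(X_train)

# one entry per *training* sample: the "X" in the docstring is the fit data
print(clf.negative_outlier_factor_.shape)   # (30,)
```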
thanks :)
Hurray 🍻 Thanks @ngoix !!
Youpi 🍻 !!
Thanks @ngoix !!
Merged #5279.
Whoot!!
Thanks to everybody involved.
Congrats!
* LOF algorithm: add tests and example; fix DeprecationWarning by reshape(1, -1) on one-sample data; LOF with inheritance; lof and lof2 return the same score; fix bugs; optimize and cosmit; rm lof2; rm MixinLOF + fit_predict; fix travis; optimize pairwise_distance like in KNeighborsMixin.kneighbors; add comparison example + doc; LOF -> LocalOutlierFactor; change the LOF API: fit(X).predict() and fit(X).decision_function() do prediction on X without considering samples as their own neighbors (i.e. without considering X as a new dataset, as fit(X).predict(X) does); rm the fit_predict() method; add a contamination parameter so that predict returns a binary value like other anomaly detection algos; doc + debug example; pass on doc + examples; pep8 + fix warnings; first attempt at fixing API issues; take into account tguillemot's advice: remove the pairwise_distance calculation as too heavy in memory; add benchmarks; deal with duplicates; fix deprecation warnings
* factorize the two for loops
* take into account @albertthomas88 review and cosmit
* fix doc
* alex review + rebase
* make predict private; add outlier_factor_ attribute and update tests
* make fit_predict take a y argument
* fix benchmarks file
* update examples
* make decision_function public (rm X=None default)
* fix travis
* take into account tguillemot review + remove the useless k_distance function
* fix broken links to :meth:`kneighbors`
* cosmit
* whatsnew
* amueller review + remove the _local_outlier_factor method
* add n_neighbors_ attribute: the effective number of neighbors used
* make decision_function private and add the negative_outlier_factor_ attribute
Local Outlier Factor implementation.
Motivated by previous discussions: http://sourceforge.net/p/scikit-learn/mailman/message/32485020 and #4163.
Benchmarks of LOF:
https://github.com/ngoix/scikit-learn/blob/AD_benchmarks/benchmarks/bench_lof_vs_iforest.py