[MRG+2] Outlier detection algorithms API consistency #9015
Conversation
sklearn/neighbors/lof.py
Outdated
self.threshold_ = -scoreatpercentile(
    -self.negative_outlier_factor_, 100. * (1. - self.contamination))
if self.contamination is None:
    self.threshold_ = -1.1  # inliers score around 1.
1.5 is used in the paper in the experiments section, but it does not work on small datasets (the added tests break). I can put 1.5 and remove the added test for the case where contamination is None.
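As a side note on the "inliers score around 1" remark: in later scikit-learn releases the contamination='auto' mode fixes offset_ at -1.5, i.e. the 1.5 threshold from the paper. A quick hedged check (the data here is arbitrary):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = 0.3 * np.random.RandomState(0).randn(40, 2)
clf = LocalOutlierFactor(contamination='auto').fit(X)
print(clf.negative_outlier_factor_.mean())  # roughly -1 on inlier-only data
print(clf.offset_)                          # -1.5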
waiting for #9018 to be merged before dealing with the EllipticEnvelope API
sklearn/ensemble/iforest.py
Outdated
    return -scores

def score_samples(self, X):
    """Opposite of the anomaly score define in the original paper.
typo: "define" -> "defined"
sklearn/ensemble/iforest.py
Outdated
Returns
-------
scores : array of shape (n_samples,)
array, shape (n_samples,)
assert_greater(np.min(decision_func1[-2:]), np.max(decision_func1[:-2]))
assert_greater(np.min(decision_func2[-2:]), np.max(decision_func2[:-2]))
assert_array_equal(pred1, 6 * [1] + 2 * [-1])
assert_array_equal(pred2, 6 * [1] + 2 * [-1])
use a for loop over the contamination values to avoid the duplication (a sketch follows)
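A self-contained illustration of the suggested loop (IsolationForest and the planted-outlier data stand in for the test's actual fixtures, and the decision_function sign convention here may differ from the snippet above):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.r_[0.3 * rng.randn(6, 2), rng.uniform(low=4, high=5, size=(2, 2))]

for contamination in [0.2, 0.25]:
    clf = IsolationForest(contamination=contamination, random_state=0).fit(X)
    decision_func = clf.decision_function(X)
    pred = clf.predict(X)
    # the two planted outliers (last rows) get the lowest scores and label -1
    assert np.max(decision_func[-2:]) < np.min(decision_func[:-2])
    assert np.array_equal(pred, 6 * [1] + 2 * [-1])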
sklearn/neighbors/lof.py
Outdated
self.threshold_ = -scoreatpercentile(
    -self.negative_outlier_factor_, 100. * (1. - self.contamination))
if self.contamination is None:
    self.offset_ = 1.5  # inliers score around 1.
did you document the offset_ attribute in the docstrings?
I forgot, as threshold_ wasn't documented; I will do this.
sklearn/neighbors/lof.py
Outdated
else:  # we're in training
    return scores

def _score_samples(self, X):
why do you need this private function?
We want a score_samples method for every anomaly detection algorithm (returning the "raw" decision function as defined in the original papers). Here it has to be private for the same reason that decision_function and predict are private.
Besides what is written in the docstring, I think it would be good to add a comment in the code saying that predict is private because fit_predict(X) would be different from fit(X).predict(X), and that decision_function and score_samples are private because predict is private. I think this is the real reason behind making these methods private. Otherwise we could have extended the original paper approach (which can be done using the private methods). If someone dives into the code and wants to use the private methods, they should be aware of that, which is not mentioned in the current version of the code.
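For concreteness, a hedged illustration of that difference (this uses the novelty=True switch that later scikit-learn releases added; at the time of this PR the method was the private _predict discussed here):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[0.0], [0.1], [0.2], [5.0]])

# fit_predict scores each training point against its other neighbors in X
labels_train = LocalOutlierFactor(n_neighbors=2,
                                  contamination=0.25).fit_predict(X)

# predict treats its argument as new, unseen samples, so the labels for the
# very same X are not guaranteed to match labels_train
clf = LocalOutlierFactor(n_neighbors=2, contamination=0.25,
                         novelty=True).fit(X)
labels_new = clf.predict(X)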
We also need a score_samples for the OneClassSVM. A suggestion:
For IsolationForest, LocalOutlierFactor and EllipticEnvelope, all the work to compute the score of normality is currently done in decision_function; decision_function is then called in score_samples, and offset_ has to be used in both methods.
IMO it would maybe be better to do the opposite: do all the work to compute the score in score_samples and define decision_function by score_samples(X) - offset_. I find it more natural (and easier to understand) to compute the score first and threshold this score to obtain decision_function and predict, instead of computing decision_function and un-thresholding it to obtain the score. This would also involve the following changes (a sketch of the resulting pattern follows the list):
- For IsolationForest, you can call score_samples instead of decision_function in fit and compute offset_ from score_samples(X). decision_function would then be score_samples(X) - offset_. And I think that you don't need the following if statement anymore in decision_function:
      if hasattr(self, 'offset_'):  # means that we're in testing
          return -scores + self.offset_
      else:  # we're in training
          return -scores
- For LocalOutlierFactor, _decision_function would be _score_samples(X) - offset_ and, as in the current version, there would be no need to call score_samples nor decision_function in fit.
- For EllipticEnvelope, decision_function would be score_samples(X) - offset_ and, as in the current version, there would be no need to call score_samples nor decision_function in fit.
- For the OneClassSVM we might have to stick with the current solution for score_samples, i.e. score_samples(X) = decision_function(X) + offset_, as decision_function uses the LIBSVM interface?
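A minimal self-contained sketch of the proposed pattern (the class name, the placeholder scoring rule and the percentile-based offset are illustrative assumptions, not the PR's code):

import numpy as np

class DetectorSketch:
    def __init__(self, contamination=0.1):
        self.contamination = contamination

    def fit(self, X):
        # all scoring work lives in score_samples; fit only derives offset_
        # as the contamination-quantile of the raw training scores
        self.offset_ = np.percentile(self.score_samples(X),
                                     100. * self.contamination)
        return self

    def score_samples(self, X):
        # placeholder "raw" score (higher = more normal); a real detector
        # would compute the score from its original paper here
        return -np.abs(np.asarray(X, dtype=float)[:, 0])

    def decision_function(self, X):
        # thresholded score: positive = inlier, negative = outlier
        return self.score_samples(X) - self.offset_

    def predict(self, X):
        return np.where(self.decision_function(X) >= 0, 1, -1)

X = np.array([[0.1], [0.2], [-0.1], [8.0]])
print(DetectorSketch(contamination=0.25).fit(X).predict(X))  # [ 1  1  1 -1]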
sklearn/neighbors/tests/test_lof.py
Outdated
@@ -118,3 +123,14 @@ def test_n_neighbors_attribute():
                         "n_neighbors will be set to (n_samples - 1)",
                         clf.fit, X)
    assert_equal(clf.n_neighbors_, X.shape[0] - 1)


def test__score_samples():
test_score_samples
we test _score_samples, not score_samples, so I'm not sure of the convention...
I would have removed the duplicated _ as well, but it's not a big deal.
sklearn/neighbors/lof.py
Outdated
@@ -92,10 +92,11 @@ class LocalOutlierFactor(NeighborsBase, KNeighborsMixin, UnsupervisedMixin):
    metric_params : dict, optional (default=None)
        Additional keyword arguments for the metric function.

    contamination : float in (0., 0.5), optional (default=0.1)
    contamination : float in (0., 0.5), optional (default=None)
None --> 'auto'
sklearn/neighbors/lof.py
Outdated
The amount of contamination of the data set, i.e. the proportion
of outliers in the data set. When fitting this is used to define the
threshold on the decision function.
threshold on the decision function. If None, the decision function
None --> 'auto'
sklearn/neighbors/lof.py
Outdated
@@ -125,7 +126,7 @@ class LocalOutlierFactor(NeighborsBase, KNeighborsMixin, UnsupervisedMixin):
    """
    def __init__(self, n_neighbors=20, algorithm='auto', leaf_size=30,
                 metric='minkowski', p=2, metric_params=None,
                 contamination=0.1, n_jobs=1):
                 contamination=None, n_jobs=1):
None --> 'auto'
sklearn/neighbors/lof.py
Outdated
@@ -167,8 +168,9 @@ def fit(self, X, y=None):
    self : object
        Returns self.
    """
    if not (0. < self.contamination <= .5):
        raise ValueError("contamination must be in (0, 0.5]")
    if self.contamination is not None:
needs to be adapted to 'auto' instead of None
sklearn/neighbors/lof.py
Outdated
lrd_ratios_array = (self._lrd[_neighbors_indices_fit_X_] /
                    self._lrd[:, np.newaxis])

self.negative_outlier_factor_ = -np.mean(lrd_ratios_array, axis=1)

self.threshold_ = -scoreatpercentile(
    -self.negative_outlier_factor_, 100. * (1. - self.contamination))
if self.contamination is None:
self.contamination == 'auto'
sklearn/neighbors/lof.py
Outdated
if hasattr(self, 'offset_'):  # means that we're in testing
    return scores + self.offset_  # to make value 0 special
else:  # we're in training
I think this is useless here, as _decision_function does not seem to be called during fit (there is a check_is_fitted in _decision_function).
you're right, thanks!
@@ -219,7 +224,7 @@ def predict(self, X):
    """
add a check_is_fitted
done
sklearn/ensemble/iforest.py
Outdated
# abnormal) and subtract self.offset_ to make 0 be the threshold
# value for being an outlier.

if hasattr(self, 'offset_'):  # means that we're in testing
See my general review feedback about this if-else statement.
@ngoix why is this still WIP?
It's not, I'll make the change.
A few comments. We also need to make sure that removing the raw_values parameter in EllipticEnvelope, which will require a deprecation cycle, is OK. Otherwise LGTM.
    self.dist_, 100. * (1. - self.contamination))
return self

def decision_function(self, X, raw_values=False):
def decision_function(self, X):
    """Compute the decision function of the given observations.
if we remove the raw_values we need a deprecation warning
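The deprecation that shows up further down in this review takes roughly this shape (the sentinel default and the removal wording are assumptions, since the diff truncates the message):

import warnings

def decision_function(self, X, raw_values=None):  # None means "not passed"
    if raw_values is not None:
        warnings.warn("raw_values parameter is deprecated in 0.20 and will"
                      " be removed in a future release.", DeprecationWarning)
    ...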
else:
    transformed_mahal_dist = mahal_dist ** 0.33
    decision = self.threshold_ ** 0.33 - transformed_mahal_dist
negative_mahal_dist = self.score_samples(X)
we might need to double-check the examples using EllipticEnvelope, but if I remember correctly they should be OK
covariance/plot_outlier_detection.py and applications/plot_outlier_detection_housing.py seem OK.
values = self.decision_function(X, raw_values=True)
is_inlier[values <= self.threshold_] = 1
values = self.decision_function(X)
is_inlier[values > 0] = 1
else:
do we really need this? the contamination parameter is 0.1 by default
""" | ||
check_is_fitted(self, 'offset_') | ||
X = check_array(X) | ||
return -self.mahalanobis(X) | ||
|
||
def predict(self, X): | ||
"""Outlyingness of observations in X according to the fitted model. |
I would also change this docstring, because "outlyingness" could suggest that predict returns scores, whereas it returns labels.
assert_raises(NotFittedError, clf.decision_function, X)
clf.fit(X)
y_pred = clf.predict(X)
decision = clf.score_samples(X)
scores =
@@ -63,6 +63,13 @@ class EllipticEnvelope(MinCovDet):
    A mask of the observations that have been used to compute the
    robust estimates of location and shape.

    offset_ : float
        Offset used to define the decision function from the raw scores.
        We have the relation: decision_function = score_samples - offset_.
in the code it is: score_samples + offset_
corrected
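For reference, once the sign is fixed the documented relation can be checked directly on a fitted estimator (as it holds in later scikit-learn releases; the data here is arbitrary):

import numpy as np
from sklearn.covariance import EllipticEnvelope

X = np.random.RandomState(0).randn(50, 2)
clf = EllipticEnvelope(random_state=0).fit(X)
assert np.allclose(clf.decision_function(X),
                   clf.score_samples(X) - clf.offset_)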
sklearn/neighbors/lof.py
Outdated
@@ -118,14 +119,22 @@ class LocalOutlierFactor(NeighborsBase, KNeighborsMixin, UnsupervisedMixin):
    n_neighbors_ : integer
        The actual number of neighbors used for :meth:`kneighbors` queries.

    offset_ : float
        Offset used to define the decision function from the raw scores.
        We have the relation: decision_function = score_samples - offset_.
in the code it is: score_samples + offset_
corrected
sklearn/neighbors/tests/test_lof.py
Outdated
assert_array_equal(clf1._score_samples([[2., 2.]]),
                   clf1._decision_function([[2., 2.]]) - clf1.offset_)
assert_array_equal(clf2._score_samples([[2., 2.]]),
                   clf2._decision_function([[2., 2.]]) - clf2.offset_)
maybe assert equality of clf1.score_samples and clf2.score_samples
sklearn/svm/classes.py
Outdated
We have the relation: decision_function = score_samples - offset_.
The offset is equal to intercept_ and is provided for consistency
with other outlier detection algorithms such as LocalOutlierFactor,
IsolationForest and EllipticEnvelope.
I would remove "such as LocalOutlierFactor, IsolationForest and EllipticEnvelope". Otherwise, if a new outlier detection estimator is added to scikit-learn, we would need to update the list.
sklearn/svm/classes.py
Outdated
    Returns the decision function of the samples.
    """
    dec = self._decision_function(X)
    return dec

def score_samples(self, X):
    """Raw decision function of the samples.
I'm not very pleased with "Raw decision function". Maybe "Shifted decision function" or "scoring function"?
@ngoix as soon as you take my review into account, this PR will be ready for final reviews.
thanks @albertcthomas for the review
Ping @agramfort for another one?
A partial review. Could you please add an entry to what's new and be clear about any decision_function or predict behaviour that has changed.
"""Compute the decision function of the given observations. | ||
|
||
Parameters | ||
---------- | ||
X : array-like, shape (n_samples, n_features) | ||
|
||
raw_values : bool | ||
raw_values : bool, optional (default=None) |
I don't think "(default=None)" means anything. Remove it.
    return decision
# raw_values deprecation:
if raw_values is not None:
    warnings.warn("raw_values parameter is deprecated in 0.20 and will"
This is not tested, apparently.
test added
    return negative_mahal_dist - self.offset_

def score_samples(self, X):
    """Compute the Mahalanobis distances.
say negative here, please
""" | ||
check_is_fitted(self, 'threshold_') | ||
check_is_fitted(self, 'offset_') |
This doesn't need repeating if done in decision_function.
Should this definition of predict be provided by a mixin?
@jnothman do you mean a mixin for outlier detection estimators? I was thinking that all outlier detection estimators should have a fit_predict like clustering estimators, but they can't all have a predict unless we exclude LocalOutlierFactor, which has a private predict.
Fair enough. We can make the presence of predict conditional on the presence of decision_function. It's a bit ugly, but possible with a descriptor/property... (a sketch follows at the end of this thread)
I propose to create such a mixin in a separate PR, as this one is already a bit heavy
I will open a mixin PR once this PR is merged.
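A hedged sketch of the descriptor/property idea above (purely illustrative: scikit-learn's later OutlierMixin only provides fit_predict as fit(X).predict(X), which LocalOutlierFactor overrides):

import numpy as np

class OutlierMixinSketch:
    def fit_predict(self, X, y=None):
        # shared by all outlier detectors
        return self.fit(X).predict(X)

    @property
    def predict(self):
        # expose predict only when a public decision_function exists, so
        # hasattr(estimator, 'predict') is False otherwise
        if not hasattr(self, 'decision_function'):
            raise AttributeError("predict requires a public decision_function")

        def _predict(X):
            return np.where(self.decision_function(X) >= 0, 1, -1)

        return _predict

class WithDF(OutlierMixinSketch):
    def fit(self, X):
        return self

    def decision_function(self, X):
        return np.asarray(X, dtype=float)[:, 0]

print(hasattr(WithDF(), 'predict'))              # True
print(hasattr(OutlierMixinSketch(), 'predict'))  # False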
sklearn/neighbors/lof.py
Outdated
    raise ValueError("contamination must be in (0, 0.5]")
if self.contamination != "auto":
    if not (0. < self.contamination <= .5):
        raise ValueError("contamination must be in (0, 0.5]")
Not tested
test added
Please add the actual value of self.contamination in the error message:

ValueError("contamination must be in (0, 0.5], "
           "got: %f" % self.contamination)
Thanks @jnothman
@ngoix for the what's new, put it on one line; that will render as a tiny paragraph if the line is long. Don't itemize it.
Put it all in one paragraph. We can restructure later.
Resolves #8693
Resolves #8707

Following up the discussion in issue #8693:

- Make the IsolationForest and LocalOutlierFactor decision functions use contamination as EllipticEnvelope does. It is used to define an offset such that the threshold on the decision function for being an outlier is 0 (positive value = inlier, negative value = outlier).
- Add a score_samples method to OCSVM, iForest, LOF and EllipticEnvelope, enabling the user to access the raw score functions from the original papers. A new offset_ attribute links this new method to decision_function through "decision_function = score_samples - offset_".
- Clean EllipticEnvelope, fix docs, deprecate the raw_values parameter.
- Document why the LOF decision_function is private (mentioned in [MRG] Add decision_function api to LocalOutlierFactor #8707).
- Ravel the decision function of OCSVM so that it is indeed of shape (n_samples,).

A usage sketch of the resulting API follows.
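A hedged usage sketch of the API this PR converges on (behaviour checked against later scikit-learn releases; details at merge time may differ):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = rng.randn(100, 2)
clf = IsolationForest(random_state=42).fit(X)

raw = clf.score_samples(X)      # raw paper score (negated; higher = more normal)
dec = clf.decision_function(X)  # shifted so that 0 is the outlier threshold
assert np.allclose(dec, raw - clf.offset_)

pred = clf.predict(X)           # +1 = inlier, -1 = outlier
assert np.array_equal(pred, np.where(dec >= 0, 1, -1))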