
[MRG+2] Outlier detection algorithms API consistency #9015

Merged
merged 6 commits
Feb 5, 2018

Conversation

ngoix
Contributor

@ngoix ngoix commented Jun 6, 2017

Resolves #8693
Resolves #8707

Following up on the discussion in issue #8693:

  • Make IsolationForest and LocalOutlierFactor decision functions use contamination as EllipticEnvelope does. It is used to define an offset such that the threshold on the decision function for being an outlier is 0 (positive values are inliers, negative values are outliers).

  • Add a score_samples method to OCSVM, iForest, LOF and EllipticEnvelope, giving the user access to the raw score functions from the original papers. A new offset_ attribute links this new method to decision_function through "decision_function = score_samples - offset_" (see the sketch after this list).

  • Clean up EllipticEnvelope, fix the docs, and deprecate the raw_values parameter.

  • Document why LOF's decision_function is private (mentioned in [MRG] Add decision_function api to LocalOutlierFactor #8707).

  • Ravel the decision function of OCSVM so that it is indeed of shape (n_samples,).
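
As a rough illustration of the intended relationship between score_samples, offset_, decision_function and predict, here is a minimal sketch using IsolationForest (the data and the contamination value are placeholders; exact defaults may differ between versions):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.randn(100, 2)

clf = IsolationForest(contamination=0.1, random_state=0).fit(X)

# Raw scores as defined in the original paper: higher means more normal.
raw_scores = clf.score_samples(X)

# decision_function is the raw score shifted by offset_ so that 0 becomes
# the inlier/outlier threshold (positive = inlier, negative = outlier).
shifted = clf.decision_function(X)
assert np.allclose(shifted, raw_scores - clf.offset_)

# predict simply thresholds the decision function at 0.
assert np.array_equal(clf.predict(X), np.where(shifted < 0, -1, 1))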

self.threshold_ = -scoreatpercentile(
    -self.negative_outlier_factor_, 100. * (1. - self.contamination))
if self.contamination is None:
    self.threshold_ = -1.1  # inliers score around 1.
Contributor Author

1.5 is used in the experiments section of the paper, but it does not work on small datasets (the added tests break). I can put 1.5 and remove the added test for the case where contamination is None.

@ngoix ngoix changed the title Outlier detection algorithms API consistency [WIP] Outlier detection algorithms API consistency Jun 7, 2017
@ngoix
Contributor Author

ngoix commented Jun 8, 2017

Waiting for #9018 to be merged before dealing with the EllipticEnvelope API.

        return -scores

    def score_samples(self, X):
        """Opposite of the anomaly score define in the original paper.
Member

defined


Returns
-------
scores : array of shape (n_samples,)
Member

array, shape (n_samples,)

assert_greater(np.min(decision_func1[-2:]), np.max(decision_func1[:-2]))
assert_greater(np.min(decision_func2[-2:]), np.max(decision_func2[:-2]))
assert_array_equal(pred1, 6 * [1] + 2 * [-1])
assert_array_equal(pred2, 6 * [1] + 2 * [-1])
Member

Use a for loop over contamination to avoid the duplication (a possible refactor is sketched below).
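
For illustration, one possible shape of such a refactor (a sketch only: IsolationForest, the toy data and the contamination values are placeholders, not the actual test fixture):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = rng.randn(40, 2)
# six easy inliers near the training mode followed by two obvious outliers
X_test = np.r_[0.1 * rng.randn(6, 2), rng.uniform(low=6, high=8, size=(2, 2))]

for contamination in [0.1, 0.25]:
    clf = IsolationForest(contamination=contamination,
                          random_state=0).fit(X_train)
    decision_func = clf.decision_function(X_test)
    pred = clf.predict(X_test)
    # the outliers should score lower than the inliers and be labelled -1
    assert np.max(decision_func[-2:]) < np.min(decision_func[:-2])
    np.testing.assert_array_equal(pred, 6 * [1] + 2 * [-1])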

self.threshold_ = -scoreatpercentile(
    -self.negative_outlier_factor_, 100. * (1. - self.contamination))
if self.contamination is None:
    self.offset_ = 1.5  # inliers score around 1.
Member

did you document the offset_ attribute in the docstrings?

Contributor Author

I forgot, as threshold_ wasn't documented; I will do this.

        else:  # we're in training
            return scores

    def _score_samples(self, X):
Member

why do you need this private function?

Contributor Author

We want a score_samples method for every anomaly detection algorithm (returning the "raw" decision function as defined in the original papers).
Here it has to be private for the same reason that decision_function and predict are private.

Contributor

Besides what is written in the docstring, I think it would be good to add a comment in the code saying that predict is private because fit_predict(X) would differ from fit(X).predict(X), and that decision_function and score_samples are private because predict is private. I think this is the real reason these methods are private; otherwise we could have exposed the original paper's approach (which is still possible through the private methods). Anyone who dives into the code and wants to use the private methods should be aware of this, and the current version of the code does not mention it.

Contributor

@albertcthomas albertcthomas left a comment

We also need a score_samples for the OneClassSVM. A suggestion:

For IsolationForest, LocalOutlierFactor and EllipticEnvelope, all the work to compute the score of normality is currently done in decision_function. decision_function is then called in score_samples, and offset_ has to be used in both methods.

IMO it would maybe be better to do the opposite: do all the work to compute the score in score_samples and define decision_function as score_samples(X) - offset_. I find it more natural (and easier to understand) to compute the score first and threshold it to obtain decision_function and predict, rather than computing decision_function and un-thresholding it to obtain the score; a rough sketch of this layout follows the list below. This would also involve the following changes:

  • For IsolationForest you can call score_samples instead of decision_function in fit and compute offset_ from score_samples(X). decision_function would then be score_samples(X) - offset_, and I think you would no longer need the following if statement in decision_function:
if hasattr(self, 'offset_'):  # means that we're in testing
    return -scores + self.offset_
else:  # we're in training
    return -scores
  • For LocalOutlierFactor, _decision_function would be _score_samples(X) - offset_ and as in the current version, no need to call score_samples nor decision_function in fit.

  • For EllipticEnvelope, decision_function would be score_samples(X) - offset_ and as in the current version, no need to call score_samples nor decision_function in fit.

  • For the OneClassSVM we might have to stick with the current solution for score_samples, i.e., score_samples(X) = decision_function(X) + offset_, as decision_function uses the LIBSVM interface?
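
A minimal sketch of the layout suggested above, as a toy estimator rather than actual scikit-learn code (the scoring rule, the attribute names other than offset_, and the percentile-based offset are placeholders):

import numpy as np


class ToyAnomalyDetector:
    """Toy estimator: score_samples does all the scoring work,
    decision_function is simply score_samples(X) - offset_."""

    def __init__(self, contamination=0.1):
        self.contamination = contamination

    def fit(self, X):
        X = np.asarray(X)
        self.mean_ = X.mean(axis=0)  # placeholder "model"
        # offset_ is chosen so that roughly `contamination` of the training
        # samples end up with a negative decision_function.
        self.offset_ = np.percentile(self.score_samples(X),
                                     100. * self.contamination)
        return self

    def score_samples(self, X):
        # Placeholder raw score: higher means more normal.
        return -np.linalg.norm(np.asarray(X) - self.mean_, axis=1)

    def decision_function(self, X):
        # 0 is the inlier/outlier threshold.
        return self.score_samples(X) - self.offset_

    def predict(self, X):
        return np.where(self.decision_function(X) < 0, -1, 1)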

@@ -118,3 +123,14 @@ def test_n_neighbors_attribute():
"n_neighbors will be set to (n_samples - 1)",
clf.fit, X)
assert_equal(clf.n_neighbors_, X.shape[0] - 1)


def test__score_samples():
Contributor

test_score_samples

Contributor Author

We test _score_samples, not score_samples, so I'm not sure of the convention...

Member

I would have removed the duplicated _ as well, but it's not a big deal.

        else:  # we're in training
            return scores

    def _score_samples(self, X):
Contributor

Besides what is written in the docstring, I think it would be good to add a comment in the code saying that predict is private because fit_predict(X) would differ from fit(X).predict(X), and that decision_function and score_samples are private because predict is private. I think this is the real reason these methods are private; otherwise we could have exposed the original paper's approach (which is still possible through the private methods). Anyone who dives into the code and wants to use the private methods should be aware of this, and the current version of the code does not mention it.

@@ -92,10 +92,11 @@ class LocalOutlierFactor(NeighborsBase, KNeighborsMixin, UnsupervisedMixin):
metric_params : dict, optional (default=None)
Additional keyword arguments for the metric function.

contamination : float in (0., 0.5), optional (default=0.1)
contamination : float in (0., 0.5), optional (default=None)
Contributor

None --> 'auto'

The amount of contamination of the data set, i.e. the proportion
of outliers in the data set. When fitting this is used to define the
threshold on the decision function.
threshold on the decision function. If None, the decision function
Contributor

None --> 'auto'

@@ -125,7 +126,7 @@ class LocalOutlierFactor(NeighborsBase, KNeighborsMixin, UnsupervisedMixin):
"""
def __init__(self, n_neighbors=20, algorithm='auto', leaf_size=30,
metric='minkowski', p=2, metric_params=None,
contamination=0.1, n_jobs=1):
contamination=None, n_jobs=1):
Contributor

None --> 'auto'

@@ -167,8 +168,9 @@ def fit(self, X, y=None):
self : object
Returns self.
"""
if not (0. < self.contamination <= .5):
raise ValueError("contamination must be in (0, 0.5]")
if self.contamination is not None:
Contributor

needs to be adapted to 'auto' instead of None

lrd_ratios_array = (self._lrd[_neighbors_indices_fit_X_] /
                    self._lrd[:, np.newaxis])

self.negative_outlier_factor_ = -np.mean(lrd_ratios_array, axis=1)

self.threshold_ = -scoreatpercentile(
    -self.negative_outlier_factor_, 100. * (1. - self.contamination))
if self.contamination is None:
Contributor

self.contamination == 'auto'


if hasattr(self, 'offset_'):  # means that we're in testing
    return scores + self.offset_  # to make value 0 special
else:  # we're in training
Contributor

I think this is useless here as _decision_function does not seem to be called during fit (there is a check_is_fitted in _decision_function).

Contributor Author

you're right, thanks!

@@ -219,7 +224,7 @@ def predict(self, X):
"""
Contributor

@albertcthomas albertcthomas Jun 16, 2017

add a check_is_fitted

Contributor Author

done

# abnormal) and substract self.offset_ to make 0 be the threshold
# value for being an outlier.

if hasattr(self, 'offset_'): # means that we're in testing
Contributor

See my general review feedback about this if-else statement.

@agramfort
Member

@ngoix why is this still WIP?

@ngoix
Contributor Author

ngoix commented Jul 4, 2017

It's not; I'll make the change.

@ngoix ngoix changed the title [WIP] Outlier detection algorithms API consistency [MRG] Outlier detection algorithms API consistency Jul 4, 2017
Contributor

@albertcthomas albertcthomas left a comment

A few comments. We also need to make sure that removing the raw_values parameter in EllipticEnvelope, which will require a deprecation cycle, is OK. Otherwise LGTM.

self.dist_, 100. * (1. - self.contamination))
return self

def decision_function(self, X, raw_values=False):
def decision_function(self, X):
"""Compute the decision function of the given observations.
Contributor

if we remove the raw_values we need a deprecation warning

else:
transformed_mahal_dist = mahal_dist ** 0.33
decision = self.threshold_ ** 0.33 - transformed_mahal_dist
negative_mahal_dist = self.score_samples(X)
Contributor

we might need to double check the examples using EllipticEnvelope but if I remember correctly they should be ok

Contributor Author

covariance/plot_outlier_detection.py and applications/plot_outlier_detection_housing.py seem ok.

values = self.decision_function(X, raw_values=True)
is_inlier[values <= self.threshold_] = 1
values = self.decision_function(X)
is_inlier[values > 0] = 1
else:
Contributor

do we really need this? the contamination parameter is 0.1 by default

"""
check_is_fitted(self, 'offset_')
X = check_array(X)
return -self.mahalanobis(X)

def predict(self, X):
"""Outlyingness of observations in X according to the fitted model.
Contributor

@albertcthomas albertcthomas Jul 5, 2017

I would also change this docstring because "outlyingness" could mean that predict returns scores whereas it returns labels.

assert_raises(NotFittedError, clf.decision_function, X)
clf.fit(X)
y_pred = clf.predict(X)
decision = clf.score_samples(X)
Contributor

scores =

@@ -63,6 +63,13 @@ class EllipticEnvelope(MinCovDet):
A mask of the observations that have been used to compute the
robust estimates of location and shape.

offset_ : float
Offset used to define the decision function from the raw scores.
We have the relation: decision_function = score_samples - offset_.
Contributor

in the code it is: score_samples + offset_

Contributor Author

corrected

@@ -118,14 +119,22 @@ class LocalOutlierFactor(NeighborsBase, KNeighborsMixin, UnsupervisedMixin):
n_neighbors_ : integer
The actual number of neighbors used for :meth:`kneighbors` queries.

offset_ : float
Offset used to define the decision function from the raw scores.
We have the relation: decision_function = score_samples - offset_.
Contributor

in the code it is: score_samples + offset_

Contributor Author

corrected

assert_array_equal(clf1._score_samples([[2., 2.]]),
clf1._decision_function([[2., 2.]]) - clf1.offset_)
assert_array_equal(clf2._score_samples([[2., 2.]]),
clf2._decision_function([[2., 2.]]) - clf2.offset_)
Contributor

maybe assert equality of clf1.score_samples and clf2.score_samples

We have the relation: decision_function = score_samples - offset_.
The offset is equal to intercept_ and is provided for consistency
with other outlier detection algorithms such as LocalOutlierFactor,
IsolationForest and EllipticEnvelope.
Contributor

I would remove "such as LocalOutlierFactor, IsolationForest and EllipticEnvelope". Otherwise, if a new outlier detection estimator is added to scikit-learn, we would need to update the list.

Returns the decision function of the samples.
"""
dec = self._decision_function(X)
return dec

def score_samples(self, X):
"""Raw decision function of the samples.
Contributor

I'm not very pleased with "Raw decision function". Maybe "Shifted decision function" or "scoring function"?

@albertcthomas
Contributor

@ngoix as soon as you take my review into account, this PR will be ready for final reviews.

@ngoix
Contributor Author

ngoix commented Sep 6, 2017

thanks @albertcthomas for the review

@ngoix
Contributor Author

ngoix commented Sep 6, 2017

Ping @agramfort for another one?

Member

@jnothman jnothman left a comment

A partial review.

Could you please add an entry to what's new and be clear on any decision_function or predict behaviour that has changed.

"""Compute the decision function of the given observations.

Parameters
----------
X : array-like, shape (n_samples, n_features)

raw_values : bool
raw_values : bool, optional (default=None)
Member

I don't think "(default=None)" means anything. Remove it.

return decision
# raw_values deprecation:
if raw_values is not None:
warnings.warn("raw_values parameter is deprecated in 0.20 and will"
Member

This is not tested, apparently.

Contributor Author

test added

return negative_mahal_dist - self.offset_

def score_samples(self, X):
"""Compute the Mahalanobis distances.
Member

say negative here, please

"""
check_is_fitted(self, 'threshold_')
check_is_fitted(self, 'offset_')
Member

This doesn't need repeating if done in decision_function.

Should this definition of predict be provided by a mixin?

Contributor

@jnothman do you mean a mixin for outlier detection estimators? I was thinking that all outlier detection estimators should have a fit_predict like clustering estimators, but they can't all have a predict unless we exclude LocalOutlierFactor, which has a private predict.

Member

Fair enough. We can make the presence of predict conditional on the presence of decision_function. It's a bit ugly, but possible with a descriptor/property...
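
For illustration, one way the descriptor/property idea could look (purely a sketch, not the design that was merged; the mixin name and the predict convention are assumptions):

import numpy as np


class ConditionalPredictMixin:
    """Expose predict only on estimators that define decision_function.
    Accessing the property otherwise raises AttributeError, so
    hasattr(est, "predict") is False for estimators (like
    LocalOutlierFactor) that keep prediction private."""

    @property
    def predict(self):
        if not hasattr(self, "decision_function"):
            raise AttributeError(
                "predict requires a public decision_function")

        def _predict(X):
            # convention from this PR: negative decision_function = outlier
            return np.where(self.decision_function(X) < 0, -1, 1)

        return _predict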

Contributor Author

I propose to create such a mixin in a separate PR, as this one is already a bit heavy

Contributor

@albertcthomas albertcthomas Sep 12, 2017

I will open a mixin PR once this PR is merged.

raise ValueError("contamination must be in (0, 0.5]")
if self.contamination != "auto":
if not(0. < self.contamination <= .5):
raise ValueError("contamination must be in (0, 0.5]")
Member

Not tested

Contributor Author

test added

Member

Please add the actual value of self.contamination in the error message:

ValueError("contamination must be in (0, 0.5],"
           "got: %f" % self.contamination)

@ngoix
Contributor Author

ngoix commented Sep 11, 2017

Thanks @jnothman
Concerning what's new, should I add an "outlier detection methods" section, or add an entry for each algo in its respective section?

@agramfort
Member

@ngoix for the what's new, put it on one line; it will render as a tiny paragraph if the line is long. Don't itemize it.

@jnothman
Member

jnothman commented Sep 11, 2017 via email

@ngoix ngoix force-pushed the consistency_AD branch 3 times, most recently from 7da6313 to 8e06afe Compare September 12, 2017 08:22
ivannz added commits to ivannz/scikit-learn that referenced this pull request between Jul 25, 2019 and Sep 5, 2022.

Successfully merging this pull request may close these issues.

Lack of consistency for decision_function methods in outlier detection
7 participants