
[MRG+1] Novelty detection for LOF #10700


Merged: 24 commits into scikit-learn:master, Jul 19, 2018

Conversation

@albertcthomas (Contributor) commented Feb 25, 2018

Fixes #10191

What does this implement/fix? Explain your changes.

This PR will allow users to use LOF for novelty detection as suggested in this comment of PR #9015. An argument novelty=True or False is added in __init__:

  • if novelty=False (default) then lof.predict(X) raises ValueError('predict is not available when novelty=False, use fit_predict if you want to predict on training data. Use novelty=True if you want to use LOF for novelty detection and predict on new unseen data.')
    Similar error for lof.decision_function(X) and lof.score_samples(X)
  • if novelty=True, lof.fit_predict(X) raises ValueError('fit_predict is not available when novelty=True. Use novelty=False if you want to predict on the training set.'). But lof.predict(X), lof.decision_function(X) and lof.score_samples(X) are OK.

This way users can use LOF for novelty detection but by default, as novelty=False, users will get an error with an informative message saying to use novelty=True if they really want to predict on new data.
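For illustration, a minimal usage sketch of the behaviour described above (the data here is made up; only the calls listed above are assumed):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_train = rng.randn(100, 2)  # training data
X_test = rng.randn(20, 2)    # new, unseen data

# Outlier detection (default, novelty=False): only fit_predict is available.
lof = LocalOutlierFactor()
y_train_pred = lof.fit_predict(X_train)  # +1 for inliers, -1 for outliers
# lof.predict(X_test) would raise an informative error, as described above.

# Novelty detection (novelty=True): predict on new data, no fit_predict.
lof = LocalOutlierFactor(novelty=True).fit(X_train)
y_test_pred = lof.predict(X_test)
scores = lof.decision_function(X_test)
```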

Any other comments?

This PR also refactors the documentation on outlier detection: in particular, plot_outlier_detection.py is removed in favor of plot_anomaly_comparison.py.

@jnothman (Member) left a comment

So we have two big questions:

  • does LOF make sense as an inductive model?
  • if so, is it necessary to make two different modes?

It doesn't look like the fitting is any different in the inductive case from fit_predict. Is the main concern that prediction on the training data is not identical to the results of fit_predict? We have such cases with {fit_,}transform in the works, FWIW.

@albertcthomas (Contributor, Author)

The main concern is indeed that fit_predict would not be identical to fit.predict.

@jnothman (Member) commented Feb 26, 2018 via email

A reviewer (Member) commented on this code excerpt:

```python
            'considering the negative_outlier_factor_ attribute. Use '
            'novelty=True if you want to use LOF for novelty detection and '
            'compute score_samples for new unseen data.')
        raise NotImplementedError(msg)
```

I would structure the code differently.

```python
if not self.novelty:
    raise ...

...

return ...
```

It feels better to capture errors first and to end the function with a return.
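For concreteness, a minimal sketch of that structure applied to one of the methods touched by this PR (the _score_samples helper is hypothetical, and the error type follows the later discussion in this thread):

```python
def score_samples(self, X):
    # Capture the error case first...
    if not self.novelty:
        raise ValueError(
            'score_samples is not available when novelty=False. Use '
            'novelty=True if you want to use LOF for novelty detection.')

    scores = self._score_samples(X)  # hypothetical helper, for illustration

    # ...and end the function with a return rather than a raise.
    return scores
```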

@agramfort (Member)

I find this solution very elegant and rigorous. Using the id would not be a robust solution.
And we still don't break the contract that fit.predict == fit_predict unless the user explicitly asks for it.

A reviewer (Member) commented on this code excerpt:

```python
if self.novelty:
    msg = ('fit_predict is not available when novelty=True. Use '
           'novelty=False if you want to predict on the training set.')
    raise NotImplementedError(msg)
```

I don't think this is the right error type. It makes it seem like one day we will implement this case.

Another reviewer (Member) commented:

A TypeError would be more normal for a bad call, while a ValueError would be possible too. Neither is quite satisfying.

A third reviewer (Member) commented:

I would use ValueError
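Putting the two review points together, the guard would look like this (a sketch of what the reviewers converged on in this thread; the exact wording and error type in the merged code may differ):

```python
if self.novelty:
    msg = ('fit_predict is not available when novelty=True. Use '
           'novelty=False if you want to predict on the training set.')
    raise ValueError(msg)
```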

@jnothman (Member) commented Mar 4, 2018

Test failures. I'm happy with this solution. I'm still not sure we can maintain similar assurances for all Transformers, though.

A reviewer (Member) commented on this code excerpt:

```python
        'if you want to predict on training data. Use novelty=True if '
        'you want to use LOF for novelty detection and predict on new '
        'unseen data.')
    raise NotImplementedError(msg)
```

ValueError

As I said elsewhere, I think it more natural to finish a function with a return than a raise, though.

@albertcthomas (Contributor, Author) commented Mar 5, 2018

Thanks for your comments and reviews @jnothman and @agramfort. I think this is a good solution as some people are not ready to break fit.predict == fit_predict (for what I think are good reasons). I'm OK with ValueError: in a way the error is related to the value of the novelty parameter in this case. I will finish this PR, I just wanted to be sure about this before completely implementing the feature.

@jnothman (Member) commented Mar 5, 2018 via email

@agramfort (Member) commented Mar 5, 2018 via email

@albertcthomas (Contributor, Author)

Unless this kind of solution can be more generally applied (i.e. not just to LOF or outlier detection estimators), novelty seems good to me as well.

@albertcthomas (Contributor, Author) commented Apr 8, 2018

BTW something's strange in the CI tests... the example examples/covariance/plot_outlier_detection.py should fail as it is using the private method _decision_function, which I removed in the previous commits.

Running the example locally returns:

```
AttributeError: 'LocalOutlierFactor' object has no attribute '_decision_function'
```

@agramfort (Member) commented Apr 8, 2018 via email

@albertcthomas (Contributor, Author)

I also suggest removing this example (examples/covariance/plot_outlier_detection.py) since we now have examples/plot_anomaly_comparison.py. WDYT?

@albertcthomas (Contributor, Author)

For plot_lof_outlier_detection.py, I refitted the model with novelty=True in order to show the level sets of the decision_function, but that might confuse users. We should maybe remove this and only illustrate the use of the negative_outlier_factor_ attribute.
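For context, a rough sketch of the refitting approach described above (the data, grid bounds and styling are illustrative, not taken from the actual example):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor

X = np.random.RandomState(42).randn(100, 2)

# Refit a second estimator with novelty=True solely to expose decision_function.
clf = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X)

# Evaluate the decision function on a grid and draw its level sets.
xx, yy = np.meshgrid(np.linspace(-5, 5, 200), np.linspace(-5, 5, 200))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contour(xx, yy, Z, levels=[0], colors='black')
plt.scatter(X[:, 0], X[:, 1], s=10)
plt.show()
```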

@agramfort (Member) commented Apr 8, 2018 via email

@albertcthomas (Contributor, Author)

> ok but I would suggest to reuse the blob of text from the doc at the top.
> At least some parts.
> Check that you don't break any link in the doc.

@agramfort you are saying this about removing plot_outlier_detection.py?

@sklearn-lgtm

This pull request introduces 1 alert when merging 25e9362 into bfad4da - view on lgtm.com

new alerts:

  • 1 for Variable defined multiple times

Comment posted by lgtm.com

@agramfort (Member) commented Apr 8, 2018 via email

@albertcthomas (Contributor, Author)

I'm almost done here, I just need to fix a couple of typos in the doc, but CircleCI returns the following error. I don't know if it's related to this PR.

Unexpected failing examples:

```
/home/circleci/project/examples/gaussian_process/plot_gpr_co2.py failed leaving traceback:
Traceback (most recent call last):
  File "/home/circleci/project/examples/gaussian_process/plot_gpr_co2.py", line 75, in <module>
    data = fetch_mldata('mauna-loa-atmospheric-co2').data
  File "/home/circleci/project/sklearn/datasets/mldata.py", line 154, in fetch_mldata
    mldata_url = urlopen(urlname)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 500: Internal Server Error
```

@agramfort (Member) commented Apr 30, 2018 via email

@albertcthomas (Contributor, Author)

Yes that's what I already tried. I will try again this evening. Thanks!

@jnothman (Member) commented Apr 30, 2018 via email

@albertcthomas force-pushed the novelty_for_lof branch 2 times, most recently from fdeb437 to 5dd2e8c on May 2, 2018 at 17:45.
@albertcthomas (Contributor, Author)

Actually, @glemaitre what's the difference between @pytest.mark.filterwarnings and @ignore_warnings(category=DeprecationWarning)?

@albertcthomas (Contributor, Author)

CIs are green but I don’t know why we don’t have the codecov checks.

@agramfort (Member)

good to go from my end.

Anyone for MRG+2?

@GaelVaroquaux (Member)

I would have liked codecov to run. I don't understand why it's not running. I'd like to check the coverage of this PR.

@albertcthomas (Contributor, Author)

I restarted CIs to see if we can get code coverage.

@albertcthomas (Contributor, Author)

@GaelVaroquaux codecov is green

@glemaitre (Member)

> Actually, @glemaitre what's the difference between @pytest.mark.filterwarnings and @ignore_warnings(category=DeprecationWarning)?

With @pytest.mark.filterwarnings we can do more fine-grained ignoring.
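For illustration, a hypothetical pair of tests showing both decorators (the test bodies are made up; ignore_warnings lived in sklearn.utils.testing at the time of this PR):

```python
import warnings

import pytest
from sklearn.utils.testing import ignore_warnings


# The pytest marker targets one warning category (or even a message pattern)
# for this single test only.
@pytest.mark.filterwarnings('ignore::DeprecationWarning')
def test_with_marker():
    warnings.warn('old API', DeprecationWarning)


# The scikit-learn decorator silences the given category for everything the
# decorated function runs.
@ignore_warnings(category=DeprecationWarning)
def test_with_decorator():
    warnings.warn('old API', DeprecationWarning)
```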

@glemaitre (Member)

It looks good to me. I would however change the deprecation warning to a future warning for the outlier methods.
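A sketch of the suggested change (the warning message is illustrative):

```python
import warnings

# DeprecationWarning is hidden by default for end users:
warnings.warn('the default behaviour will change', DeprecationWarning)

# FutureWarning is shown by default, so users of the outlier methods see it:
warnings.warn('the default behaviour will change', FutureWarning)
```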

@glemaitre merged commit 4d0a262 into scikit-learn:master on Jul 19, 2018.
@albertcthomas (Contributor, Author)

Thanks @jnothman @agramfort @GaelVaroquaux and @glemaitre for the help and the reviews!
