
[MRG+1] Novelty detection for LOF #10700


Merged: 24 commits into scikit-learn:master, Jul 19, 2018

Conversation

@albertcthomas (Contributor) commented Feb 25, 2018

Fixes #10191

What does this implement/fix? Explain your changes.

This PR will allow users to use LOF for novelty detection as suggested in this comment of PR #9015. An argument novelty=True or False is added in __init__:

  • if novelty=False (default) then lof.predict(X) raises ValueError('predict is not available when novelty=False, use fit_predict if you want to predict on training data. Use novelty=True if you want to use LOF for novelty detection and predict on new unseen data.')
    Similar error for lof.decision_function(X) and lof.score_samples(X)
  • if novelty=True, lof.fit_predict(X) raises ValueError('fit_predict is not available when novelty=True. Use novelty=False if you want to predict on the training set.'). But lof.predict(X), lof.decision_function(X) and lof.score_samples(X) are OK.

This way users can use LOF for novelty detection but by default, as novelty=False, users will get an error with an informative message saying to use novelty=True if they really want to predict on new data.
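For illustration, a minimal usage sketch of the behaviour described above (the data here is made up; only the calls listed above are assumed):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_train = rng.randn(100, 2)  # training data
X_test = rng.randn(20, 2)    # new, unseen data

# Outlier detection (default, novelty=False): only fit_predict is available.
lof = LocalOutlierFactor()
y_train_pred = lof.fit_predict(X_train)  # +1 for inliers, -1 for outliers
# lof.predict(X_test) would raise an informative error, as described above.

# Novelty detection (novelty=True): predict on new data, no fit_predict.
lof = LocalOutlierFactor(novelty=True).fit(X_train)
y_test_pred = lof.predict(X_test)
scores = lof.decision_function(X_test)
```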

Any other comments?

This PR also refactors the documentation on outlier detection: in particular, plot_outlier_detection.py is removed in favor of plot_anomaly_comparison.py.

@jnothman (Member) left a comment

So we have two big questions:

  • does LOF make sense as an inductive model?
  • if so, is it necessary to make two different modes?

It doesn't look like the fitting is any different in the inductive case from fit_predict. Is the main concern that prediction on the training data is not identical to the results of fit_predict? We have such cases with {fit_,}transform in the works, FWIW.

@albertcthomas (Contributor, Author)

The main concern is indeed that fit_predict would not be identical to fit.predict.

@jnothman (Member) commented Feb 26, 2018 via email

A reviewer (Member) commented on this code excerpt:

```python
            'considering the negative_outlier_factor_ attribute. Use '
            'novelty=True if you want to use LOF for novelty detection and '
            'compute score_samples for new unseen data.')
        raise NotImplementedError(msg)
```

I would structure the code differently.

```python
if not self.novelty:
    raise ...

...

return ...
```

It feels better to capture errors first and to end the function with a return.
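For concreteness, a minimal sketch of that structure applied to one of the methods touched by this PR (the _score_samples helper is hypothetical, and the error type follows the later discussion in this thread):

```python
def score_samples(self, X):
    # Capture the error case first...
    if not self.novelty:
        raise ValueError(
            'score_samples is not available when novelty=False. Use '
            'novelty=True if you want to use LOF for novelty detection.')

    scores = self._score_samples(X)  # hypothetical helper, for illustration

    # ...and end the function with a return rather than a raise.
    return scores
```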

@agramfort (Member)

I find this solution very elegant and rigorous. Using the id would not be a robust solution.
And we still don't break the contract that fit.predict == fit_predict unless the user explicitly asks for it.

A reviewer (Member) commented on this code excerpt:

```python
if self.novelty:
    msg = ('fit_predict is not available when novelty=True. Use '
           'novelty=False if you want to predict on the training set.')
    raise NotImplementedError(msg)
```

I don't think this is the right error type. It makes it seem like one day we will implement this case.

Another reviewer (Member) commented:

A TypeError would be more normal for a bad call, while a ValueError would be possible too. Neither is quite satisfying.

A third reviewer (Member) commented:

I would use ValueError
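Putting the two review points together, the guard would look like this (a sketch of what the reviewers converged on in this thread; the exact wording and error type in the merged code may differ):

```python
if self.novelty:
    msg = ('fit_predict is not available when novelty=True. Use '
           'novelty=False if you want to predict on the training set.')
    raise ValueError(msg)
```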

@jnothman (Member) commented Mar 4, 2018

Test failures. I'm happy with this solution. I'm still not sure we can maintain similar assurances for all Transformers, though.

A reviewer (Member) commented on this code excerpt:

```python
        'if you want to predict on training data. Use novelty=True if '
        'you want to use LOF for novelty detection and predict on new '
        'unseen data.')
    raise NotImplementedError(msg)
```

ValueError

As I said elsewhere, I think it more natural to finish a function with a return than a raise, though.

@albertcthomas (Contributor, Author) commented Mar 5, 2018

Thanks for your comments and reviews @jnothman and @agramfort. I think this is a good solution as some people are not ready to break fit.predict == fit_predict (for what I think are good reasons). I'm OK with ValueError: in a way the error is related to the value of the novelty parameter in this case. I will finish this PR, I just wanted to be sure about this before completely implementing the feature.

@jnothman (Member) commented Mar 5, 2018 via email

@agramfort (Member) commented Mar 5, 2018 via email

@albertcthomas (Contributor, Author)

Unless this kind of solution can be more generally applied (i.e. not just to LOF or outlier detection estimators), novelty seems good to me as well.

@albertcthomas (Contributor, Author) commented Apr 8, 2018

BTW something's strange in the CI tests... the example examples/covariance/plot_outlier_detection.py should fail as it is using the private method _decision_function, which I removed in the previous commits.

Running the example locally returns:

```
AttributeError: 'LocalOutlierFactor' object has no attribute '_decision_function'
```

@agramfort (Member) commented Apr 8, 2018 via email

@albertcthomas (Contributor, Author)

I also suggest removing this example (examples/covariance/plot_outlier_detection.py) since we now have examples/plot_anomaly_comparison.py. WDYT?

@albertcthomas (Contributor, Author)

For plot_lof_outlier_detection.py, I refitted the model with novelty=True in order to show the level sets of the decision_function, but that might confuse users. We should maybe remove this and only illustrate the use of the negative_outlier_factor_ attribute.
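For context, a rough sketch of the refitting approach described above (the data, grid bounds and styling are illustrative, not taken from the actual example):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor

X = np.random.RandomState(42).randn(100, 2)

# Refit a second estimator with novelty=True solely to expose decision_function.
clf = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X)

# Evaluate the decision function on a grid and draw its level sets.
xx, yy = np.meshgrid(np.linspace(-5, 5, 200), np.linspace(-5, 5, 200))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contour(xx, yy, Z, levels=[0], colors='black')
plt.scatter(X[:, 0], X[:, 1], s=10)
plt.show()
```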

@agramfort (Member) commented Apr 8, 2018 via email

@albertcthomas (Contributor, Author)

> ok but I would suggest to reuse the blob of text from the doc at the top.
> At least some parts.
> Check that you don't break any link in the doc.

@agramfort you are saying this about removing plot_outlier_detection.py?

@sklearn-lgtm

This pull request introduces 1 alert when merging 25e9362 into bfad4da - view on lgtm.com

new alerts:

  • 1 for Variable defined multiple times

Comment posted by lgtm.com

@agramfort (Member) commented Apr 8, 2018 via email

@albertcthomas (Contributor, Author)

I'm almost done here, I just need to fix a couple of typos in the doc, but CircleCI returns the following error. I don't know if it's related to this PR.

Unexpected failing examples:

```
/home/circleci/project/examples/gaussian_process/plot_gpr_co2.py failed leaving traceback:
Traceback (most recent call last):
  File "/home/circleci/project/examples/gaussian_process/plot_gpr_co2.py", line 75, in <module>
    data = fetch_mldata('mauna-loa-atmospheric-co2').data
  File "/home/circleci/project/sklearn/datasets/mldata.py", line 154, in fetch_mldata
    mldata_url = urlopen(urlname)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 500: Internal Server Error
```

@agramfort (Member) commented Apr 30, 2018 via email

@albertcthomas (Contributor, Author)

Yes that's what I already tried. I will try again this evening. Thanks!

@jnothman (Member) commented Apr 30, 2018 via email

@albertcthomas force-pushed the novelty_for_lof branch 2 times, most recently from fdeb437 to 5dd2e8c on May 2, 2018 at 17:45.
@albertcthomas (Contributor, Author)

Actually, @glemaitre what's the difference between @pytest.mark.filterwarnings and @ignore_warnings(category=DeprecationWarning)?

@albertcthomas (Contributor, Author)

CIs are green but I don’t know why we don’t have the codecov checks.

@agramfort (Member)

good to go from my end.

Anyone for MRG+2?

@GaelVaroquaux (Member)

I would have liked codecov to run. I don't understand why it's not running. I'd like to check the coverage of this PR.

@albertcthomas (Contributor, Author)

I restarted CIs to see if we can get code coverage.

@albertcthomas (Contributor, Author)

@GaelVaroquaux codecov is green

@glemaitre (Member)

> Actually, @glemaitre what's the difference between @pytest.mark.filterwarnings and @ignore_warnings(category=DeprecationWarning)?

With @pytest.mark.filterwarnings we can do more fine-grained ignoring.
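For illustration, a hypothetical pair of tests showing both decorators (the test bodies are made up; ignore_warnings lived in sklearn.utils.testing at the time of this PR):

```python
import warnings

import pytest
from sklearn.utils.testing import ignore_warnings


# The pytest marker targets one warning category (or even a message pattern)
# for this single test only.
@pytest.mark.filterwarnings('ignore::DeprecationWarning')
def test_with_marker():
    warnings.warn('old API', DeprecationWarning)


# The scikit-learn decorator silences the given category for everything the
# decorated function runs.
@ignore_warnings(category=DeprecationWarning)
def test_with_decorator():
    warnings.warn('old API', DeprecationWarning)
```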

@glemaitre (Member)

It looks good to me. I would however change the deprecation warning to a future warning for the outlier methods.
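A sketch of the suggested change (the warning message is illustrative):

```python
import warnings

# DeprecationWarning is hidden by default for end users:
warnings.warn('the default behaviour will change', DeprecationWarning)

# FutureWarning is shown by default, so users of the outlier methods see it:
warnings.warn('the default behaviour will change', FutureWarning)
```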

@glemaitre merged commit 4d0a262 into scikit-learn:master on Jul 19, 2018.
@albertcthomas (Contributor, Author)

Thanks @jnothman @agramfort @GaelVaroquaux and @glemaitre for the help and the reviews!
