[MRG+1] Novelty detection for LOF #10700
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we have two big questions:
- does LOF make sense as an inductive model?
- if so, is it necessary to make two different modes?

It doesn't look like the fitting is any different in the inductive case from `fit_predict`. Is the main concern that prediction on the training data is not identical to the results of `fit_predict`? We have such cases with `{fit_,}transform` in the works, FWIW.
The main concern is indeed that prediction on the training data is not identical to the results of `fit_predict`.
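As an illustration, a minimal sketch (not code from this PR) of that discrepancy, assuming the `novelty` parameter this PR adds to `LocalOutlierFactor`:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X = rng.randn(100, 2)

# Outlier detection mode: labels computed for the training data itself.
labels_od = LocalOutlierFactor(novelty=False).fit_predict(X)

# Novelty detection mode: an inductive model, here queried on the same data.
clf = LocalOutlierFactor(novelty=True).fit(X)
labels_nd = clf.predict(X)

# The two label vectors need not agree: in fit_predict each training point
# is excluded from its own neighborhood, while predict treats X as new data
# whose neighbors are searched among all training points (itself included).
print((labels_od != labels_nd).sum(), "points labelled differently")
```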
I don't mind having such cases and just documenting them clearly. Maybe we need a name for such a thing so that we can stamp such estimators clearly with that name. Maybe we also need a warning when it looks like the user is passing the training data to predict... (by storing the training id(X) on the estimator, but it might be hard to check if only a subset is being passed in, be it a slice or certainly a copy).
sklearn/neighbors/lof.py (outdated):
```python
    'considering the negative_outlier_factor_ attribute. Use '
    'novelty=True if you want to use LOF for novelty detection and '
    'compute score_samples for new unseen data.')
raise NotImplementedError(msg)
```
I would structure the code differently:
```python
if not self.novelty:
    raise ...
...
return ...
```
It feels better to check for errors first and to end the function with a return.
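A hedged sketch of that structure; the message is paraphrased from the diff hunks quoted in this thread, and `_predict` is assumed to be the existing private helper in lof.py:

```python
def predict(self, X=None):
    """Predict the labels (1 inlier, -1 outlier) of X according to LOF."""
    # Raise early for the unsupported mode, then end with a single return.
    if not self.novelty:
        raise ValueError(
            'predict is not available when novelty=False, use fit_predict '
            'if you want to predict on training data. Use novelty=True if '
            'you want to use LOF for novelty detection and predict on new '
            'unseen data.')
    return self._predict(X)
```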
I find this solution very elegant and rigorous. Using the id would not be a robust solution.
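For illustration, a small sketch (hypothetical, not from the PR) of why an `id(X)`-based check would be fragile:

```python
import numpy as np

X = np.random.randn(10, 2)

# Any copy or slice of the training data gets a new id, so a stored
# id(X) would fail to recognise it as (part of) the training set.
print(id(X) == id(X.copy()))  # False, although the data is identical
print(id(X) == id(X[:5]))     # False for a slice/view as well
```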
sklearn/neighbors/lof.py
Outdated
if self.novelty: | ||
msg = ('fit_predict is not available when novelty=True. Use ' | ||
'novelty=False if you want to predict on the training set.') | ||
raise NotImplementedError(msg) |
I don't think this is the right error type. It makes it seem like one day we will implement this case.
A TypeError would be more normal for a bad call, while a ValueError would be possible too. None is quite satisfying.
I would use ValueError
Test failures. I'm happy with this solution. I'm still not sure we can maintain similar assurances for all Transformers, though.
sklearn/neighbors/lof.py (outdated):
```python
    'if you want to predict on training data. Use novelty=True if '
    'you want to use LOF for novelty detection and predict on new '
    'unseen data.')
raise NotImplementedError(msg)
```
ValueError. As I said elsewhere, I think it more natural to finish a function with a return than a raise, though.
Thanks for your comments and reviews @jnothman and @agramfort. I think this is a good solution, as some people are not ready to break `fit(X).predict(X) == fit_predict(X)` (for what I think are good reasons). I'm OK with ValueError. In a way the error is related to the value of the `novelty` parameter.
Are we sure about the name of the parameter?
That's fine with me.
Unless this kind of solution can be more generally applied (i.e. not just for LOF or outlier detection estimators), novelty seems good to me as well.
(force-pushed from bec8fa2 to 666c18a)
BTW something's strange in the CI tests... the example examples/covariance/plot_outlier_detection.py should fail as it is using the private method `_decision_function`. Running the example locally returns AttributeError: 'LocalOutlierFactor' object has no attribute '_decision_function'.
You can push a commit asking for a full rebuild of the doc. Try putting "[circle full]" in your commit message.
I also suggest removing this example (examples/covariance/plot_outlier_detection.py) since we now have examples/plot_anomaly_comparison.py. WDYT?
(force-pushed from 407f8e4 to 25e9362)
For plot_lof_outlier_detection.py, I refitted the model with `novelty=True` in order to show the level sets of the `decision_function`, but that might confuse users. We should maybe remove this and only illustrate the use of the `negative_outlier_factor_` attribute.
OK, but I would suggest reusing the blob of text from the doc at the top, at least some parts. Check that you don't break any link in the doc.
@agramfort you are saying this about removing plot_outlier_detection.py?
This pull request introduces 1 alert when merging 25e9362 into bfad4da (view on lgtm.com). Comment posted by lgtm.com.
Yes.
(force-pushed from 8128f66 to 147ea9d)
I'm almost done here, just need to fix a couple of typos in the doc, but CircleCI returns the following error. I don't know if it's related to this PR.

Unexpected failing examples:
```
/home/circleci/project/examples/gaussian_process/plot_gpr_co2.py failed leaving traceback:
Traceback (most recent call last):
  File "/home/circleci/project/examples/gaussian_process/plot_gpr_co2.py", line 75, in <module>
    data = fetch_mldata('mauna-loa-atmospheric-co2').data
  File "/home/circleci/project/sklearn/datasets/mldata.py", line 154, in fetch_mldata
    mldata_url = urlopen(urlname)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/home/circleci/miniconda/envs/testenv/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 500: Internal Server Error
```
Push to restart the build; it's likely a random failure.
Yes, that's what I already tried. I will try again this evening. Thanks!
mldata has been having some downtime lately...
(force-pushed from fdeb437 to 5dd2e8c)
Commits (messages truncated):
- …n between outlier and novelty detection for anomaly detection
- …mator in test_lof
- …depending on the novelty parameter
Actually, @glemaitre, what's the difference between …
(force-pushed from 5fc7ff8 to 645cb56)
CIs are green but I don't know why we don't have the codecov checks.
Good to go from my end. Anyone for MRG+2?
I would have liked codecov to run. I don't understand why it's not running. I'd like to check the coverage of this PR.
(force-pushed from 645cb56 to 9e59491)
I restarted CIs to see if we can get code coverage.
@GaelVaroquaux codecov is green.
We can make the ignoring more fine-grained.
It looks good to me. I would however change the DeprecationWarning to a FutureWarning for the outlier methods.
Thanks @jnothman, @agramfort, @GaelVaroquaux and @glemaitre for the help and the reviews!
Fixes #10191

What does this implement/fix? Explain your changes.

This PR will allow users to use LOF for novelty detection, as suggested in this comment of PR #9015. An argument `novelty=True` or `False` is added in `__init__`:

- if `novelty=False` (default), then `lof.predict(X)` raises `ValueError('predict is not available when novelty=False, use fit_predict if you want to predict on training data. Use novelty=True if you want to use LOF for novelty detection and predict on new unseen data.')`. A similar error is raised for `lof.decision_function(X)` and `lof.score_samples(X)`.
- if `novelty=True`, `lof.fit_predict(X)` raises `ValueError('fit_predict is not available when novelty=True. Use novelty=False if you want to predict on the training set.')`, but `lof.predict(X)`, `lof.decision_function(X)` and `lof.score_samples(X)` are OK.

This way users can use LOF for novelty detection, but by default, as `novelty=False`, they will get an error with an informative message saying to use `novelty=True` if they really want to predict on new data.

Any other comments?

This PR also refactors the documentation on outlier detection: in particular, plot_outlier_detection.py is removed in favor of plot_anomaly_comparison.py.
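For reference, a hedged usage sketch of the two modes described above, assuming a scikit-learn version with this PR merged (the exact exception type may differ from the quoted messages, so a generic Exception is caught):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_train = rng.randn(100, 2)
X_new = rng.randn(20, 2)

# Default outlier detection mode: only fit_predict is available.
lof = LocalOutlierFactor(novelty=False)
labels_train = lof.fit_predict(X_train)  # OK: -1 for outliers, 1 for inliers
try:
    lof.predict(X_new)  # not available when novelty=False
except Exception as exc:
    print(type(exc).__name__, exc)

# Novelty detection mode: predict / decision_function / score_samples
# work on new, unseen data, but fit_predict is disabled.
lof = LocalOutlierFactor(novelty=True).fit(X_train)
labels_new = lof.predict(X_new)        # OK
scores_new = lof.score_samples(X_new)  # OK
try:
    lof.fit_predict(X_train)  # not available when novelty=True
except Exception as exc:
    print(type(exc).__name__, exc)
```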