[MRG+1] Fix LOF and Isolation benchmarks #9798
Conversation
benchmarks/bench_lof.py
Outdated
```diff
@@ -101,7 +97,7 @@
 fit_time = time() - tstart
 tstart = time()

-scoring = -model.decision_function(X_test)  # the lower, the more normal
+scoring = -model._decision_function(X_test)  # the lower, the more normal
```
there is no public function that allows you to do the same?
Hm no... Actually it is maybe a pity to have a benchmark for LOF that needs to predict on a test set. I can change this benchmark to an outlier detection setting where we train on the whole dataset (without knowing the labels) and compute the ROC on this same training set (with the knowledge of the labels). This would only require the LOF scores on the training set, i.e. `negative_outlier_factor_`. WDYT?
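For illustration, a minimal sketch of that outlier detection setup on synthetic data (the data, the `n_neighbors` value and the variable names are placeholders, not the benchmark's actual code):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_auc_score

# Placeholder data: X is the full dataset, y the ground-truth labels (1 = anomaly).
rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = (rng.rand(1000) < 0.05).astype(int)

# Fit LOF on the whole dataset without using the labels...
model = LocalOutlierFactor(n_neighbors=20)
model.fit(X)

# ...then score the training samples themselves: the lower
# negative_outlier_factor_, the more abnormal the sample.
scoring = -model.negative_outlier_factor_
print("ROC AUC: %.3f" % roc_auc_score(y, scoring))
```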
```diff
 X = np.c_[X[:, :1], x1, x2, x3, X[:, 4:]]
-y = (y != 'normal.').astype(int)
+y = (y != b'normal.').astype(int)
```
Works with Python 2 and Python 3?
yes
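To make the `b'normal.'` change above concrete, here is a toy illustration (the label values are made up, mimicking the byte-string targets returned by fetch_kddcup99 on Python 3):

```python
import numpy as np

# Byte-string labels, as in the kddcup99 target on Python 3.
y = np.array([b'normal.', b'smurf.', b'normal.'], dtype=object)

# On Python 3, a bytes value never equals the unicode literal 'normal.',
# so every sample would wrongly be flagged as an anomaly:
print((y != 'normal.').astype(int))   # [1 1 1]

# Comparing against the bytes literal gives the intended labels, and it also
# works on Python 2, where b'...' and '...' are the same type:
print((y != b'normal.').astype(int))  # [0 1 0]
```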
Thanks for the review @agramfort.
Concerning LOF, standardizing the datasets (using …
LGTM, +1 for MRG.
I get the following error when running the isolation forest benchmark:
```
====== smtp ======
--- Fetching data...
--- Vectorizing data...
----- Target count values:
------ 0 -> 9568 occurrences
------ 1 -> 3 occurrences
----- Outlier ratio: 0.00031
--- Fitting the IsolationForest estimator...
--- Preparing the plot elements...
/home/ogrisel/code/scikit-learn/sklearn/metrics/ranking.py:547: UndefinedMetricWarning: No positive samples in y_true, true positive value should be meaningless
  UndefinedMetricWarning)
smtp (AUC: nan, train_time= 0.44s, test_time= 0.25s)
```
Indeed... thanks @ogrisel. We also have this warning on master. That's because there are only 3 outliers out of 9571 samples in the smtp dataset when …

One solution would be to remove the smtp dataset from the benchmark. If …

Note that we don't have this problem for the LOF benchmark anymore, as we train and test on the whole dataset. We wouldn't have the error/warning if the isolation forest benchmark were run in the same way as the LOF benchmark. We wouldn't have this problem either if the isolation benchmark were run in a novelty detection context (train the model on a subsample of normal instances only and test it on all the other instances).

Interestingly, for the same seed in …
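For reference, the nan AUC can be reproduced in isolation; this toy snippet assumes the benchmark computes the AUC from `roc_curve`/`auc` (the scores below are random):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# A subsample that happens to contain no positive (outlier) samples at all.
y_test = np.zeros(10, dtype=int)
scores = np.random.RandomState(0).rand(10)

# roc_curve emits UndefinedMetricWarning ("No positive samples in y_true...")
# and returns nan true positive rates, so the resulting AUC is nan.
fpr, tpr, _ = roc_curve(y_test, scores)
print(auc(fpr, tpr))  # nan
```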
+1 for passing random_state to make the warning deterministic.
I clarified the randomness in both benchmarks and thus made the warning deterministic for the Isolation Forest benchmark. I also added a comment about it in the docstring.
Performance is not convincing for LOF, even in an outlier detection context. I did not review the PR, but did you try playing with the threshold/offset parameter? And is there any reason why you changed percent10 to True? (It is not the case for other AD benchmarks, is it?)
percent10 was set to True for IsolationForest, so I also set it to True for LOF. Performance is not great either if percent10 is set to False for LOF.

LOF seems to perform well on the considered datasets in a novelty detection context; however, LOF is not meant to be used in this context. Showing that LOF can perform well in an outlier detection context would require considering new datasets (see @ngoix's extensive benchmark).

Changing the threshold parameter will not affect the performance, as we assess the whole scoring function with a ROC curve.
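A quick check of that last point on made-up scores: the ROC AUC depends only on the ranking of the scores, so shifting them by any constant (i.e. moving the threshold/offset) cannot change it:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(42)
y = rng.randint(0, 2, size=100)  # toy binary labels
scores = rng.rand(100)           # toy anomaly scores

# Adding a constant offset moves every threshold but not the ranking,
# so the ROC AUC is identical.
print(roc_auc_score(y, scores) == roc_auc_score(y, scores + 10.0))  # True
```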
You're right, my mistake!
But IMO benchmarks are not examples, they are made to compare performance. To do this, we need to choose a common setting for all AD algorithms (btw it is one of the reasons why we chose to implement predict and decision_function methods for LOF, even if private).
All in all, I think we should stick to the fix of #8638 in this PR.
benchmarks/bench_isolation_forest.py
Outdated
```diff
@@ -30,15 +41,14 @@ def print_outlier_ratio(y):
     print("----- Outlier ratio: %.5f" % (np.min(cnt) / len(y)))

-np.random.seed(1)
+SEED = 1
```
I'm more used to seeing `rng = np.random.RandomState(0)`, but I think it doesn't matter.
Don't use a new name for something we refer to as `random_state`. Use `random_state = 1` and pass `random_state=random_state` in function calls.

Btw @albertcthomas, have you tried to benchmark with and without shuffling the data? It looks like the data are not shuffled anymore with your changes.
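For reference, a minimal sketch of the suggested pattern; the `fetch_kddcup99` arguments below are illustrative and may not match the benchmark exactly:

```python
from sklearn.datasets import fetch_kddcup99
from sklearn.ensemble import IsolationForest

random_state = 1  # a single value reused everywhere, instead of np.random.seed(...)

# fetch_kddcup99 accepts shuffle, percent10 and random_state keyword arguments.
dataset = fetch_kddcup99(subset='SA', shuffle=True, percent10=True,
                         random_state=random_state)

# IsolationForest takes the same random_state so that its subsampling and
# tree construction are reproducible.
model = IsolationForest(random_state=random_state)
```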
Yes, I meant for LOF: you removed the shuffle argument, which is set to False by default I believe...
The LOF benchmark is not random anymore because the model is trained and tested on the same dataset, so there is no need to shuffle the datasets. Setting shuffle to True does not change the results. I made a comment about this in the docstring of the benchmark.
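A small sanity check of that point on synthetic data: since the model is fit and scored on the same samples, permuting (shuffling) the rows together with their labels leaves the resulting AUC unchanged:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = rng.randint(0, 2, size=200)  # toy labels, only used to compute the AUC
perm = rng.permutation(len(X))   # a shuffled order of the samples

auc_original = roc_auc_score(
    y, -LocalOutlierFactor().fit(X).negative_outlier_factor_)
auc_shuffled = roc_auc_score(
    y[perm], -LocalOutlierFactor().fit(X[perm]).negative_outlier_factor_)
print(np.isclose(auc_original, auc_shuffled))  # True
```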
Ok, I get it now!
About making one file per setting (outlier/anomaly/novelty), each comparing all outlier detection estimators: I'm +1 for this in a new PR.
Then you don't need any change in bench_lof.py...
The decision to make modifications to the LOF benchmark came after this discussion. We could use …
benchmarks/bench_lof.py
Outdated
```diff
 print(__doc__)

-np.random.seed(2)
+SEED = 2  # to control the random selection of anomalies in the SA dataset
```
SEED...
Codecov Report

```
@@           Coverage Diff           @@
##           master   #9798   +/-   ##
=======================================
  Coverage   96.16%   96.16%
=======================================
  Files         336      336
  Lines       62493    62493
=======================================
  Hits        60098    60098
  Misses       2395     2395
```

Continue to review full report at Codecov.
@albertcthomas I understand your point, but I think that unlike examples, benchmarks are not meant to guide the user but to compare the performance of the algorithms.
I don't think we should expose in examples or benchmarks something that is not publicly supported. You can put your benchmark in a public gist on GitHub if you want a public version of it.
Ok @agramfort, but then we also have to remove LOF from the outlier detection example as it uses …
```diff
-# Removed the shuttle dataset because as of 2017-03-23 mldata.org is down:
-# datasets = ['http', 'smtp', 'SA', 'SF', 'shuttle', 'forestcover']
 datasets = ['http', 'smtp', 'SA', 'SF', 'forestcover']
+# datasets available = ['http', 'smtp', 'SA', 'SF', 'shuttle', 'forestcover']
```
Now that the shuttle dataset is run by default, we can remove this comment.
I had a quick IRL conversation with @albertcthomas and I think we should convert those 2 benchmark scripts into a single example script with a …
Actually, we should just use the categorical encoder once it's ready.
Let's merge this, as it is already a good fix in itself, and the conversion into an example can be done in another PR.
What does this implement/fix? Explain your changes.
The fix for the benchmark of IsolationForest merged in #8638 leads to the following …, which does not correspond to what LabelBinarizer was doing, because the strings are viewed as iterables (and we obtain an array with 7 features instead of 3). This fix instead applies LabelBinarizer as follows: …
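The snippets from the original PR description are not reproduced above; as a rough illustration of the column-wise approach suggested by the visible diff (`X = np.c_[X[:, :1], x1, x2, x3, X[:, 4:]]`), with made-up toy data standing in for the kddcup99 columns:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Toy stand-in: column 0 is numeric, columns 1-3 are categorical
# (e.g. protocol_type, service, flag), column 4 is numeric again.
X = np.array([[0.1, 'tcp',  'http', 'SF',  3.0],
              [0.2, 'udp',  'dns',  'S0',  1.0],
              [0.3, 'icmp', 'smtp', 'REJ', 2.0]], dtype=object)

lb = LabelBinarizer()
x1 = lb.fit_transform(X[:, 1])  # binarize one categorical column at a time,
x2 = lb.fit_transform(X[:, 2])  # so each cell is treated as a single label
x3 = lb.fit_transform(X[:, 3])  # rather than as an iterable of characters
X = np.c_[X[:, :1], x1, x2, x3, X[:, 4:]]

print(X.shape)  # (3, 11): 1 numeric + three 3-column one-hot blocks + 1 numeric
```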
Any other comments?
This also fixes the benchmark of LocalOutlierFactor.
EDIT: The benchmark of LocalOutlierFactor was predicting on a test set whereas LocalOutlierFactor is not meant to predict on a test set. This is fixed by doing the benchmark in an outlier detection context only.