[MRG+1] Fix LOF and Isolation benchmarks #9798


Merged
merged 8 commits into scikit-learn:master on Oct 25, 2017

Conversation

albertcthomas
Contributor

@albertcthomas albertcthomas commented Sep 19, 2017

What does this implement/fix? Explain your changes.

The fix for the IsolationForest benchmark merged in #8638 leads to the following:

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.datasets import fetch_kddcup99
dataset = fetch_kddcup99(subset='SA', shuffle=True, percent10=True)
X = dataset.data
y = dataset.target
print(X.dtype)  # object

lb = MultiLabelBinarizer()
x1 = lb.fit_transform(X[:, 1])

print(np.unique(X[:,1]))  # [b'icmp' b'tcp' b'udp']
print(x1.shape)  # (100655, 7)

which does not correspond to what LabelBinarizer was doing, because MultiLabelBinarizer treats each string as an iterable of characters (so we obtain an array with 7 features instead of 3). This PR applies LabelBinarizer instead, as follows:

from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
x1 = lb.fit_transform(X[:, 1].astype(str))
print(x1.shape)  # (100655, 3)
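
For completeness, the remaining categorical columns of the SA subset get the same treatment before the feature matrix is rebuilt; a sketch continuing the snippet above (the last two lines appear in the PR diff below, the x2/x3 lines are assumed to mirror x1):

x2 = LabelBinarizer().fit_transform(X[:, 2].astype(str))  # 'service' column
x3 = LabelBinarizer().fit_transform(X[:, 3].astype(str))  # 'flag' column
X = np.c_[X[:, :1], x1, x2, x3, X[:, 4:]]
y = (y != b'normal.').astype(int)  # 1 for anomalies, 0 for normal connections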

Any other comments?

This also fixes the benchmark of LocalOutlierFactor.

EDIT: the LocalOutlierFactor benchmark was predicting on a test set, whereas LocalOutlierFactor is not meant to predict on a test set. This is fixed by running the benchmark in an outlier detection context only.

@@ -101,7 +97,7 @@
fit_time = time() - tstart
tstart = time()

-scoring = -model.decision_function(X_test) # the lower, the more normal
+scoring = -model._decision_function(X_test) # the lower, the more normal
Member

there is no public function that allows you to do the same?

Contributor Author

@albertcthomas albertcthomas Sep 20, 2017

Hm no... actually it is maybe a pity to have a benchmark for LOF that needs to predict on a test set...
I can change this benchmark to outlier detection, where we train on the whole dataset (without knowing the labels) and compute the ROC on this same training set (with the knowledge of the labels). This will only require LOF scores on the training set, i.e. negative_outlier_factor_. WDYT?
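
For illustration, a minimal sketch of that outlier-detection setup (assuming X is the vectorized data and y the binary labels built in the benchmark; variable names are illustrative):

from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_curve, auc

# Fit on the full dataset without using the labels (outlier detection setting).
model = LocalOutlierFactor(n_neighbors=20)
model.fit(X)

# LOF scores on the training set itself; no predict/decision_function needed.
scoring = -model.negative_outlier_factor_  # the higher, the more abnormal

# The labels are only used afterwards, to evaluate the ranking.
fpr, tpr, _ = roc_curve(y, scoring)
print("AUC: %0.3f" % auc(fpr, tpr))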

X = np.c_[X[:, :1], x1, x2, x3, X[:, 4:]]
-y = (y != 'normal.').astype(int)
+y = (y != b'normal.').astype(int)
Member

works with Python 2 and Python 3?

Contributor Author

yes

@albertcthomas
Contributor Author

thanks for the review @agramfort

@albertcthomas
Contributor Author

The LocalOutlierFactor benchmark is now done in an outlier detection context.
[figure: bench_lof]

@albertcthomas
Contributor Author

Concerning LOF, standardizing the datasets (using StandardScaler or RobustScaler) does not increase the performance in terms of ROC curve and AUC.

@agramfort agramfort changed the title [MRG] Fix LOF and Isolation benchmarks [MRG+1] Fix LOF and Isolation benchmarks Sep 21, 2017
@agramfort
Member

LGTM +1 for MRG

Member

@ogrisel ogrisel left a comment

I get the following error when running the isolation forest benchmark:

====== smtp ======
--- Fetching data...
--- Vectorizing data...
----- Target count values: 
------ 0 -> 9568 occurrences
------ 1 -> 3 occurrences
----- Outlier ratio: 0.00031
--- Fitting the IsolationForest estimator...
--- Preparing the plot elements...
/home/ogrisel/code/scikit-learn/sklearn/metrics/ranking.py:547: UndefinedMetricWarning: No positive samples in y_true, true positive value should be meaningless
  UndefinedMetricWarning)
smtp (AUC: nan, train_time= 0.44s, test_time= 0.25s)

@albertcthomas
Contributor Author

albertcthomas commented Sep 22, 2017

Indeed... thanks @ogrisel. We also have this warning on master. That's because there are only 3 outliers out of 9571 samples in the smtp dataset when percent10=True. Therefore there is a high chance that there is no outlier in the test set (when the model is trained on a dataset that can contain outliers). Of course this depends on the random seed set with np.random.seed at the beginning of the benchmark.

One solution would be to remove the smtp dataset from the benchmark. If percent10=False the problem is less likely to happen but may still happen depending on the random seed.

Note that we don't have this problem for the LOF benchmark anymore, as we train and test on the whole dataset. We wouldn't have the error/warning if the isolation forest benchmark were run similarly to the LOF benchmark. We wouldn't have this problem either if the isolation forest benchmark were run in a novelty detection context (train the model on a subsample of normal instances only and test it on all the other instances).

Interestingly, for the same seed in np.random.seed, if the benchmark starts with the 'smtp' dataset, the problem disappears (because the state of the RNG is not the same as when the benchmark starts with 'http'). For this reason it might be better to set the seed via the random_state parameter of the fetch_kddcup99, fetch_covtype and shuffle functions rather than setting it with np.random.seed at the beginning of the script. The warning/error would then appear (or not appear) whatever the order of the datasets.
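
A sketch of that suggestion (the random_state parameters of fetch_kddcup99, fetch_covtype and sklearn.utils.shuffle exist; the surrounding benchmark code is assumed):

from sklearn.datasets import fetch_kddcup99, fetch_covtype
from sklearn.utils import shuffle

random_state = 1  # one explicit seed instead of np.random.seed at the top

# Each fetcher is seeded independently, so the warning (or its absence) no
# longer depends on the order in which the datasets are processed.
dataset = fetch_kddcup99(subset='smtp', shuffle=True, percent10=True,
                         random_state=random_state)
X, y = dataset.data, dataset.target

dataset = fetch_covtype(shuffle=True, random_state=random_state)
# or, equivalently, shuffle explicitly:
# X, y = shuffle(dataset.data, dataset.target, random_state=random_state)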

@agramfort
Member

agramfort commented Sep 24, 2017 via email

@albertcthomas
Contributor Author

I clarified the randomness in both benchmarks and thus made the warning deterministic for the Isolation Forest benchmark. I also added a comment about it in the docstring.

@ngoix
Contributor

ngoix commented Sep 29, 2017

Performance is not convincing for LOF, even in an outlier detection context. I did not review the PR, but did you try playing with the threshold/offset parameter? And is there any reason why you changed percent10 to True? (It is not the case for the other AD benchmarks, is it?)

@albertcthomas
Contributor Author

percent10 was set to True for IsolationForest. I also set it to True for LOF. Performance is not great either if percent10 is set to False for LOF. LOF seems to be performing well on the considered datasets in a novelty detection context, however LOF is not meant to be used in this context. Showing that LOF can perform well in an outlier detection context would require considering new datasets (see @ngoix's extensive benchmark).

Changing the threshold parameter will not affect the performance as we assess the whole scoring function with a ROC curve.
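
A quick illustration of that point with toy data: the ROC curve assesses the whole ranking of the scores, so shifting them by an offset (i.e. moving the threshold) leaves the AUC unchanged.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=100)   # toy binary labels
scores = rng.randn(100)                # toy anomaly scores

# The AUC only depends on the ordering of the scores, not on their offset.
print(roc_auc_score(y_true, scores))
print(roc_auc_score(y_true, scores - 1.5))  # identical AUC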

@ngoix
Contributor

ngoix commented Sep 29, 2017

You're right, my mistake!

@ngoix
Contributor

ngoix commented Sep 29, 2017

But IMO benchmarks are not examples, they are made to compare performance. To do this, we need to choose a common setting for all AD algorithms (btw, this is one of the reasons why we chose to implement predict and decision_function methods for LOF, even if private).
An interesting change to the benchmarks would be to create a file per setting (outlier/anomaly/novelty detection), each file including all AD algorithms. But this PR is probably not the place to do it.

@ngoix
Contributor

ngoix commented Sep 29, 2017

All in all, I think we should stick to the fix of #8638 in this PR.

@@ -30,15 +41,14 @@ def print_outlier_ratio(y):
print("----- Outlier ratio: %.5f" % (np.min(cnt) / len(y)))


-np.random.seed(1)
+SEED = 1
Contributor

I'm more used to seeing rng = np.random.RandomState(0), but I think it doesn't matter.

Member

don't use a new name for something we refer to as random_state

use:

random_state = 1

and pass random_state=random_state in function calls.

@ngoix
Contributor

ngoix commented Sep 29, 2017

btw @albertcthomas, have you tried to benchmark with and without shuffling the data? It looks like the data are not shuffled anymore with your changes.

@albertcthomas
Contributor Author

albertcthomas commented Sep 30, 2017

It looks like they are shuffled when we change the SEED. For the 'http' dataset, with SEED = 1,

X[0] = [-2.3025850929940455, 5.313698468586339, 5.59508305773286]

with SEED = 12,

X[0] = [-2.3025850929940455, 5.656341400056065, 6.319148277696116]

Results of the whole benchmark with SEED = 1 for the shuffling and random_state=1 for IsolationForest,
[figure: bench_1]
with SEED = 12 for the shuffling and random_state=1 for IsolationForest

[figure: bench_12]

We can observe small differences in AUC between the two runs, which come from the fact that the datasets are indeed shuffled.

@ngoix
Contributor

ngoix commented Sep 30, 2017

Yes, I meant for LOF: you removed the shuffle argument, which is set to False by default I believe...

@albertcthomas
Contributor Author

The LOF benchmark is not random anymore because the model is trained and tested on the same dataset, so there is no need to shuffle the datasets. Setting shuffle to True does not change the results. I made a comment about this in the docstring of the benchmark.

@ngoix
Contributor

ngoix commented Sep 30, 2017

OK, I get it now!
So WDYT about my previous comments before this big "by the way" :)?

@albertcthomas
Contributor Author

About making one file per setting (outlier/anomaly/novelty), each comparing all outlier detection estimators? I'm +1 for this in a new PR.

@ngoix
Contributor

ngoix commented Oct 4, 2017

Then you don't need any change in bench_lof.py...

@albertcthomas
Contributor Author

albertcthomas commented Oct 9, 2017

The decision to make modifications to the LOF benchmark came after this discussion.

We could use _decision_function in the LOF benchmark as we also use _decision_function for LOF in the outlier detection examples (see here). But IMO this could cause confusion and give the wrong message to users, as in issue #9874 maybe...
@agramfort what do you think?


print(__doc__)

-np.random.seed(2)
+SEED = 2 # to control the random selection of anomalies in the SA dataset
Member

SEED...

@codecov

codecov bot commented Oct 10, 2017

Codecov Report

Merging #9798 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #9798   +/-   ##
=======================================
  Coverage   96.16%   96.16%           
=======================================
  Files         336      336           
  Lines       62493    62493           
=======================================
  Hits        60098    60098           
  Misses       2395     2395


@ngoix
Contributor

ngoix commented Oct 10, 2017

@albertcthomas I understand your point, but I think that unlike examples, benchmarks are not meant to guide the user but to compare algorithm performance.

@agramfort
Member

agramfort commented Oct 11, 2017 via email

@ngoix
Contributor

ngoix commented Oct 15, 2017

Ok @agramfort, but then we also have to remove LOF from the outlier detection example as it uses _decision_function.
Besides, we can't make the two different benchmarks we discussed above (outlier / anomaly detection), each benchmark containing all AD algorithms.

@agramfort
Member

just ran the bench locally. Worked great.

good to go @ogrisel or @lesteve ?

# Removed the shuttle dataset because as of 2017-03-23 mldata.org is down:
# datasets = ['http', 'smtp', 'SA', 'SF', 'shuttle', 'forestcover']
datasets = ['http', 'smtp', 'SA', 'SF', 'forestcover']
# datasets available = ['http', 'smtp', 'SA', 'SF', 'shuttle', 'forestcover']
Member

Now that the shuttle dataset is run by default, we can remove this comment.

@ogrisel
Member

ogrisel commented Oct 25, 2017

I had a quick IRL conversation with @albertcthomas and I think we should convert those 2 benchmark scripts into a single example script with a plt.subplot grid, one subplot per dataset with both LOF and IF ROC curves.
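
A rough sketch of what that combined example could look like (load_dataset is a hypothetical helper standing in for the fetching/vectorizing code of the current benchmarks; the rest uses the public scikit-learn API):

import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_curve, auc

datasets = ['http', 'smtp', 'SA', 'SF', 'forestcover']
fig, axes = plt.subplots(1, len(datasets), figsize=(4 * len(datasets), 4))

for ax, dat in zip(axes, datasets):
    X, y = load_dataset(dat)  # hypothetical helper wrapping the existing benchmark code
    iforest = IsolationForest(random_state=1).fit(X)
    lof = LocalOutlierFactor().fit(X)
    # Both models score the training set; the higher the score, the more abnormal.
    for name, scoring in [('IsolationForest', -iforest.decision_function(X)),
                          ('LOF', -lof.negative_outlier_factor_)]:
        fpr, tpr, _ = roc_curve(y, scoring)
        ax.plot(fpr, tpr, label='%s (AUC: %0.3f)' % (name, auc(fpr, tpr)))
    ax.set_title(dat)
    ax.legend(loc='lower right')
plt.show()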

@ogrisel
Member

ogrisel commented Oct 25, 2017

Also I think having to convert the categorical labels to strings to get LabelBinarizer to behave correctly is a bug. It should work with an array of object labels, which is a more memory-efficient representation. But this should probably be tackled in a separate PR.

Actually, we should just use the categorical encoder once it's ready.
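
For reference, the categorical encoder mentioned here later landed as part of OneHotEncoder (scikit-learn >= 0.20), which accepts string columns directly; a sketch of what the vectorization could then look like (reusing X from the snippet at the top of the PR):

from sklearn.preprocessing import OneHotEncoder

# With scikit-learn >= 0.20, OneHotEncoder handles string categories directly,
# so the per-column LabelBinarizer calls are no longer needed.
enc = OneHotEncoder()
x1 = enc.fit_transform(X[:, 1].astype(str).reshape(-1, 1)).toarray()
print(x1.shape)  # expected: (100655, 3)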

@ogrisel
Member

ogrisel commented Oct 25, 2017

Let's merge this as it is already a good fix in itself, and the conversion into an example can be done in another PR.

@ogrisel ogrisel merged commit 7f19dbe into scikit-learn:master Oct 25, 2017