[MRG+1] Fix LOF and Isolation benchmarks #9798


Merged
merged 8 commits into scikit-learn:master on Oct 25, 2017

Conversation

albertcthomas
Contributor

@albertcthomas albertcthomas commented Sep 19, 2017

What does this implement/fix? Explain your changes.

The fix for the IsolationForest benchmark merged in #8638 leads to the following:

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.datasets import fetch_kddcup99
dataset = fetch_kddcup99(subset='SA', shuffle=True, percent10=True)
X = dataset.data
y = dataset.target
print(X.dtype)  # object

lb = MultiLabelBinarizer()
x1 = lb.fit_transform(X[:, 1])

print(np.unique(X[:,1]))  # [b'icmp' b'tcp' b'udp']
print(x1.shape)  # (100655, 7)

which does not correspond to what LabelBinarizer was doing, because MultiLabelBinarizer treats each string as an iterable of characters (so we obtain an array with 7 features instead of 3). This PR applies LabelBinarizer instead, as follows:

from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
x1 = lb.fit_transform(X[:, 1].astype(str))
print(x1.shape)  # (100655, 3)
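
For completeness, the remaining categorical columns of the SA subset get the same treatment before the feature matrix is rebuilt; a sketch continuing the snippet above (the last two lines appear in the PR diff below, the x2/x3 lines are assumed to mirror x1):

x2 = LabelBinarizer().fit_transform(X[:, 2].astype(str))  # 'service' column
x3 = LabelBinarizer().fit_transform(X[:, 3].astype(str))  # 'flag' column
X = np.c_[X[:, :1], x1, x2, x3, X[:, 4:]]
y = (y != b'normal.').astype(int)  # 1 for anomalies, 0 for normal connections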

Any other comments?

This also fixes the benchmark of LocalOutlierFactor.

EDIT: the LocalOutlierFactor benchmark was predicting on a test set, whereas LocalOutlierFactor is not meant to predict on a test set. This is fixed by running the benchmark in an outlier detection context only.

@@ -101,7 +97,7 @@
fit_time = time() - tstart
tstart = time()

-scoring = -model.decision_function(X_test) # the lower, the more normal
+scoring = -model._decision_function(X_test) # the lower, the more normal
Member

there is no public function that allows you to do the same?

Contributor Author

@albertcthomas albertcthomas Sep 20, 2017

Hm no... actually it is maybe a pity to have a benchmark for LOF that needs to predict on a test set...
I can change this benchmark to outlier detection, where we train on the whole dataset (without knowing the labels) and compute the ROC on this same training set (with the knowledge of the labels). This will only require LOF scores on the training set, i.e. negative_outlier_factor_. WDYT?
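
For illustration, a minimal sketch of that outlier-detection setup (assuming X is the vectorized data and y the binary labels built in the benchmark; variable names are illustrative):

from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_curve, auc

# Fit on the full dataset without using the labels (outlier detection setting).
model = LocalOutlierFactor(n_neighbors=20)
model.fit(X)

# LOF scores on the training set itself; no predict/decision_function needed.
scoring = -model.negative_outlier_factor_  # the higher, the more abnormal

# The labels are only used afterwards, to evaluate the ranking.
fpr, tpr, _ = roc_curve(y, scoring)
print("AUC: %0.3f" % auc(fpr, tpr))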

X = np.c_[X[:, :1], x1, x2, x3, X[:, 4:]]
-y = (y != 'normal.').astype(int)
+y = (y != b'normal.').astype(int)
Member

works with Python 2 and Python 3?

Contributor Author

yes

@albertcthomas
Contributor Author

thanks for the review @agramfort

@albertcthomas
Contributor Author

The LocalOutlierFactor benchmark is now done in an outlier detection context.
[figure: bench_lof]

@albertcthomas
Contributor Author

Concerning LOF, standardizing the datasets (using StandardScaler or RobustScaler) does not increase the performance in terms of ROC curve and AUC.

@agramfort agramfort changed the title [MRG] Fix LOF and Isolation benchmarks [MRG+1] Fix LOF and Isolation benchmarks Sep 21, 2017
@agramfort
Member

LGTM +1 for MRG

Member

@ogrisel ogrisel left a comment

I get the following error when running the isolation forest benchmark:

====== smtp ======
--- Fetching data...
--- Vectorizing data...
----- Target count values: 
------ 0 -> 9568 occurrences
------ 1 -> 3 occurrences
----- Outlier ratio: 0.00031
--- Fitting the IsolationForest estimator...
--- Preparing the plot elements...
/home/ogrisel/code/scikit-learn/sklearn/metrics/ranking.py:547: UndefinedMetricWarning: No positive samples in y_true, true positive value should be meaningless
  UndefinedMetricWarning)
smtp (AUC: nan, train_time= 0.44s, test_time= 0.25s)

@albertcthomas
Contributor Author

albertcthomas commented Sep 22, 2017

Indeed... thanks @ogrisel. We also have this warning on master. That's because there are only 3 outliers out of 9571 samples in the smtp dataset when percent10=True. Therefore there is a high chance that there is no outlier in the test set (when the model is trained on a dataset that can contain outliers). Of course this depends on the random seed set with np.random.seed at the beginning of the benchmark.

One solution would be to remove the smtp dataset from the benchmark. If percent10=False the problem is less likely to happen but may still happen depending on the random seed.

Note that we don't have this problem for the LOF benchmark anymore, as we train and test on the whole dataset. We wouldn't have the error/warning if the isolation forest benchmark were run similarly to the LOF benchmark. We wouldn't have this problem either if the isolation forest benchmark were run in a novelty detection context (train the model on a subsample of normal instances only and test it on all the other instances).

Interestingly, for the same seed in np.random.seed, if the benchmark starts with the 'smtp' dataset, the problem disappears (because the state of the RNG is not the same as when the benchmark starts with 'http'). For this reason it might be better to set the seed via the random_state parameter of the fetch_kddcup99, fetch_covtype and shuffle functions rather than setting it with np.random.seed at the beginning of the script. The warning/error would then appear (or not appear) whatever the order of the datasets.
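
A sketch of that suggestion (the random_state parameters of fetch_kddcup99, fetch_covtype and sklearn.utils.shuffle exist; the surrounding benchmark code is assumed):

from sklearn.datasets import fetch_kddcup99, fetch_covtype
from sklearn.utils import shuffle

random_state = 1  # one explicit seed instead of np.random.seed at the top

# Each fetcher is seeded independently, so the warning (or its absence) no
# longer depends on the order in which the datasets are processed.
dataset = fetch_kddcup99(subset='smtp', shuffle=True, percent10=True,
                         random_state=random_state)
X, y = dataset.data, dataset.target

dataset = fetch_covtype(shuffle=True, random_state=random_state)
# or, equivalently, shuffle explicitly:
# X, y = shuffle(dataset.data, dataset.target, random_state=random_state)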

@agramfort
Member

agramfort commented Sep 24, 2017 via email

@albertcthomas
Contributor Author

I clarified the randomness in both benchmarks and thus made the warning deterministic for the Isolation Forest benchmark. I also added a comment about it in the docstring.

@ngoix
Contributor

ngoix commented Sep 29, 2017

Performance is not convincing for LOF, even in an outlier detection context. I did not review the PR, but did you try playing with the threshold/offset parameter? And is there any reason why you changed percent10 to True? (It is not the case for the other AD benchmarks, is it?)

@albertcthomas
Contributor Author

percent10 was set to True for IsolationForest. I also set it to True for LOF. Performance is not great either if percent10 is set to False for LOF. LOF seems to be performing well on the considered datasets in a novelty detection context, however LOF is not meant to be used in this context. Showing that LOF can perform well in an outlier detection context would require considering new datasets (see @ngoix's extensive benchmark).

Changing the threshold parameter will not affect the performance as we assess the whole scoring function with a ROC curve.
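
A quick illustration of that point with toy data: the ROC curve assesses the whole ranking of the scores, so shifting them by an offset (i.e. moving the threshold) leaves the AUC unchanged.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=100)   # toy binary labels
scores = rng.randn(100)                # toy anomaly scores

# The AUC only depends on the ordering of the scores, not on their offset.
print(roc_auc_score(y_true, scores))
print(roc_auc_score(y_true, scores - 1.5))  # identical AUC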

@ngoix
Contributor

ngoix commented Sep 29, 2017

You're right, my mistake!

@ngoix
Contributor

ngoix commented Sep 29, 2017

But IMO benchmarks are not examples, they are made to compare performance. To do this, we need to choose a common setting for all AD algorithms (btw, this is one of the reasons why we chose to implement predict and decision_function methods for LOF, even if private).
An interesting change to the benchmarks would be to create a file per setting (outlier/anomaly/novelty detection), each file including all AD algorithms. But this PR is probably not the place to do it.

@ngoix
Contributor

ngoix commented Sep 29, 2017

All in all, I think we should stick to the fix of #8638 in this PR.

@@ -30,15 +41,14 @@ def print_outlier_ratio(y):
print("----- Outlier ratio: %.5f" % (np.min(cnt) / len(y)))


-np.random.seed(1)
+SEED = 1
Contributor

I'm more used to seeing rng = np.random.RandomState(0), but I think it doesn't matter.

Member

don't use a new name for something we refer to as random_state

use:

random_state = 1

and pass random_state=random_state in function calls.

@ngoix
Contributor

ngoix commented Sep 29, 2017

btw @albertcthomas, have you tried to benchmark with and without shuffling the data? It looks like the data are not shuffled anymore with your changes.

@albertcthomas
Contributor Author

albertcthomas commented Sep 30, 2017

It looks like they are shuffled when we change the SEED. For the 'http' dataset, with SEED = 1,

X[0] = [-2.3025850929940455, 5.313698468586339, 5.59508305773286]

with SEED = 12,

X[0] = [-2.3025850929940455, 5.656341400056065, 6.319148277696116]

Results of the whole benchmark with SEED = 1 for the shuffling and random_state=1 for IsolationForest,
[figure: bench_1]
with SEED = 12 for the shuffling and random_state=1 for IsolationForest

[figure: bench_12]

We can observe small differences in AUC between the two runs, which come from the fact that the datasets are indeed shuffled.

@ngoix
Contributor

ngoix commented Sep 30, 2017

Yes, I meant for LOF: you removed the shuffle argument, which is set to False by default I believe...

@albertcthomas
Contributor Author

The LOF benchmark is not random anymore because the model is trained and tested on the same dataset, so there is no need to shuffle the datasets. Setting shuffle to True does not change the results. I made a comment about this in the docstring of the benchmark.

@ngoix
Contributor

ngoix commented Sep 30, 2017

OK, I get it now!
So WDYT about my previous comments before this big "by the way" :)?

@albertcthomas
Contributor Author

About making one file per setting (outlier/anomaly/novelty), each comparing all outlier detection estimators? I'm +1 for this in a new PR.

@ngoix
Contributor

ngoix commented Oct 4, 2017

Then you don't need any change in bench_lof.py...

@albertcthomas
Contributor Author

albertcthomas commented Oct 9, 2017

The decision to make modifications to the LOF benchmark came after this discussion.

We could use _decision_function in the LOF benchmark as we also use _decision_function for LOF in the outlier detection examples (see here). But IMO this could cause confusion and give the wrong message to users, as in issue #9874 maybe...
@agramfort what do you think?


print(__doc__)

-np.random.seed(2)
+SEED = 2 # to control the random selection of anomalies in the SA dataset
Member

SEED...

@codecov

codecov bot commented Oct 10, 2017

Codecov Report

Merging #9798 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #9798   +/-   ##
=======================================
  Coverage   96.16%   96.16%           
=======================================
  Files         336      336           
  Lines       62493    62493           
=======================================
  Hits        60098    60098           
  Misses       2395     2395


@ngoix
Contributor

ngoix commented Oct 10, 2017

@albertcthomas I understand your point, but I think that unlike examples, benchmarks are not meant to guide the user but to compare algorithm performance.

@agramfort
Member

agramfort commented Oct 11, 2017 via email

@ngoix
Contributor

ngoix commented Oct 15, 2017

Ok @agramfort, but then we also have to remove LOF from the outlier detection example as it uses _decision_function.
Besides, we can't make the two different benchmarks we discussed above (outlier / anomaly detection), each benchmark containing all AD algorithms.

@agramfort
Member

just ran the bench locally. Worked great.

good to go @ogrisel or @lesteve ?

# Removed the shuttle dataset because as of 2017-03-23 mldata.org is down:
# datasets = ['http', 'smtp', 'SA', 'SF', 'shuttle', 'forestcover']
datasets = ['http', 'smtp', 'SA', 'SF', 'forestcover']
# datasets available = ['http', 'smtp', 'SA', 'SF', 'shuttle', 'forestcover']
Member

Now that the shuttle dataset is run by default, we can remove this comment.

@ogrisel
Member

ogrisel commented Oct 25, 2017

I had a quick IRL conversation with @albertcthomas and I think we should convert those 2 benchmark scripts into a single example script with a plt.subplot grid, one subplot per dataset with both LOF and IF ROC curves.
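
A rough sketch of what that combined example could look like (load_dataset is a hypothetical helper standing in for the fetching/vectorizing code of the current benchmarks; the rest uses the public scikit-learn API):

import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_curve, auc

datasets = ['http', 'smtp', 'SA', 'SF', 'forestcover']
fig, axes = plt.subplots(1, len(datasets), figsize=(4 * len(datasets), 4))

for ax, dat in zip(axes, datasets):
    X, y = load_dataset(dat)  # hypothetical helper wrapping the existing benchmark code
    iforest = IsolationForest(random_state=1).fit(X)
    lof = LocalOutlierFactor().fit(X)
    # Both models score the training set; the higher the score, the more abnormal.
    for name, scoring in [('IsolationForest', -iforest.decision_function(X)),
                          ('LOF', -lof.negative_outlier_factor_)]:
        fpr, tpr, _ = roc_curve(y, scoring)
        ax.plot(fpr, tpr, label='%s (AUC: %0.3f)' % (name, auc(fpr, tpr)))
    ax.set_title(dat)
    ax.legend(loc='lower right')
plt.show()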

@ogrisel
Member

ogrisel commented Oct 25, 2017

Also I think having to convert the categorical labels to strings to get LabelBinarizer to behave correctly is a bug. It should work with an array of object labels, which is a more memory-efficient representation. But this should probably be tackled in a separate PR.

Actually, we should just use the categorical encoder once it's ready.
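
For reference, the categorical encoder mentioned here later landed as part of OneHotEncoder (scikit-learn >= 0.20), which accepts string columns directly; a sketch of what the vectorization could then look like (reusing X from the snippet at the top of the PR):

from sklearn.preprocessing import OneHotEncoder

# With scikit-learn >= 0.20, OneHotEncoder handles string categories directly,
# so the per-column LabelBinarizer calls are no longer needed.
enc = OneHotEncoder()
x1 = enc.fit_transform(X[:, 1].astype(str).reshape(-1, 1)).toarray()
print(x1.shape)  # expected: (100655, 3)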

@ogrisel
Member

ogrisel commented Oct 25, 2017

Let's merge this as it is already a good fix in itself, and the conversion into an example can be done in another PR.

@ogrisel ogrisel merged commit 7f19dbe into scikit-learn:master Oct 25, 2017