[MRG+1] Fix multi-label issues in IsolationForest benchmark #8638


Merged

TomDLT merged 2 commits into scikit-learn:master from hrjn:fix-iforest-benchmark

Apr 20, 2017

Contributor

hrjn commented Mar 23, 2017

Reference Issue

Fixes #8637

What does this implement/fix? Explain your changes.

In the previous version, running the benchmark under Python 3 with LabelBinarizer to encode the categorical features of the SA and SF datasets from kddcup99 led to the error described in #8637. Replacing it with MultiLabelBinarizer fixes the problem and allows the code to run on both Python 2 and 3.
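One plausible reading of the Python 3 failure, offered as an assumption rather than a confirmed diagnosis: if the categorical columns hold bytes values, each value is a sequence of integers under Python 3 rather than a string, so a check for "sequence of sequences" targets can mistake the column for the legacy multi-label format. The helper `looks_like_label_sequence` and the sample values below are hypothetical; this only mimics that kind of check and is not scikit-learn's code.

```python
from collections.abc import Sequence


def looks_like_label_sequence(value):
    """Hypothetical check: is `value` a non-string sequence of labels?"""
    return isinstance(value, Sequence) and not isinstance(value, str)


# Illustrative categorical values, assumed to be stored as bytes.
protocol_column = [b"tcp", b"udp", b"icmp"]

# Under Python 3, iterating over bytes yields integers, not characters.
print(list(protocol_column[0]))

# Every bytes value is flagged as a "sequence of labels" by the naive check.
print([looks_like_label_sequence(v) for v in protocol_column])

# A str value, by contrast, is treated as one ordinary label.
print(looks_like_label_sequence("tcp"))
```

Under Python 2, `bytes` is an alias of `str`, which would explain why the same benchmark code behaved differently there.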

Output obtained with new version:

Any other comments?

Additional minor cleaning/refactoring:

PEP8 compliance
added a with_scoring_hists flag (False by default) so that the score histograms are not all plotted, which could otherwise clutter the screen with figure windows
removed the shuttle dataset from the list (as of today, mldata.org still isn't back up)
added a short helper function to display the outlier ratio for each selected dataset
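A minimal sketch of what such an outlier-ratio helper could look like, using only the standard library. It assumes binary targets in which the label 1 marks an outlier (as in the `y_test == 1` histogram of the diff); the function `outlier_ratio` and the output format are illustrative and not taken from the benchmark.

```python
from collections import Counter


def outlier_ratio(y):
    """Return the fraction of samples labelled 1 (assumed to mean 'outlier')."""
    y = list(y)
    if not y:
        raise ValueError("y must contain at least one label")
    return Counter(y)[1] / len(y)


def print_outlier_ratio(y):
    """Print per-label counts and the outlier ratio for one dataset."""
    y = list(y)
    for label, count in sorted(Counter(y).items()):
        print("label %s -> %d occurrences" % (label, count))
    print("outlier ratio: %.5f" % outlier_ratio(y))


# Toy target vector: 2 outliers out of 8 samples.
print_outlier_ratio([0, 0, 0, 1, 0, 0, 1, 0])
```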

hrjn mentioned this pull request

IsolationForest benchmark doesn't accept legacy multi-label data representation #8637

Closed

Member

jnothman commented Mar 23, 2017

Thanks! I agree that the described fix is appropriate, but now that there are other changes you will need to be patient for a full review. I hope to look soon.

TomDLT approved these changes


Member

TomDLT left a comment

LGTM

benchmarks/bench_isolation_forest.py Outdated

-                  ax[2].hist(scoring[y_test == 1], bins, color='r',
-                             label='outliers')
-                  ax[2].legend(loc="lower right")
+                  y_pred = model.predict(X_test)

Member

TomDLT Mar 31, 2017 (edited)

You don't need to call predict before decision_function: y_pred is unused,
and the extra call artificially increases the predict time.
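The point about timing can be sketched as follows: only the call whose result is used sits between the two clock reads. `StandInModel` and `timed_scoring` are hypothetical names introduced for illustration so the snippet runs without scikit-learn; in the benchmark the model would be the fitted estimator.

```python
import time


class StandInModel:
    """Hypothetical stand-in exposing the two methods discussed above."""

    def __init__(self):
        self.predict_calls = 0

    def predict(self, X):
        self.predict_calls += 1
        return [1 if s >= 0 else -1 for s in self.decision_function(X)]

    def decision_function(self, X):
        # Toy score: values far from zero get a lower (more anomalous) score.
        return [0.5 - abs(x) for x in X]


def timed_scoring(model, X_test):
    """Time only decision_function, the single call whose output is used."""
    start = time.perf_counter()
    scoring = model.decision_function(X_test)
    elapsed = time.perf_counter() - start
    return scoring, elapsed


model = StandInModel()
scoring, predict_time = timed_scoring(model, [0.1, 0.2, 3.0])
print("predict() calls during timing:", model.predict_calls)
```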

TomDLT changed the title ~~[MRG] Fix multi-label issues in IsolationForest benchmark~~ [MRG+1] Fix multi-label issues in IsolationForest benchmark

TomDLT reviewed


benchmarks/bench_isolation_forest.py Outdated

+                      bins = np.linspace(-0.5, 0.5, 200)
+                      ax[0].hist(scoring, bins, color='black')
+                      ax[0].set_title('Decision function for %s dataset' % dat)
+                      ax[0].legend(loc="lower right")

Member

TomDLT Mar 31, 2017 (edited)

You can remove this line, since there is no label in this plot.
Doing so also removes the matplotlib warning.

glemaitre reviewed


benchmarks/bench_isolation_forest.py Outdated

+              ==========================================
+              A test of IsolationForest on classical anomaly detection datasets.
+              """

Member

glemaitre Apr 3, 2017

Could you put the description at the top of the file?

glemaitre reviewed


benchmarks/bench_isolation_forest.py Outdated

+              A test of IsolationForest on classical anomaly detection datasets.
+              """
+              print(__doc__)

Member

glemaitre Apr 3, 2017

You can put the print(__doc__) just below the import and before print_outlier_ratio(y)
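The layout requested in these two comments can be sketched as a file skeleton: module docstring first, then imports, then `print(__doc__)`, then the helper. The import and the helper body here are placeholders chosen so the snippet runs on its own; they are not the benchmark's actual contents.

```python
"""
A test of IsolationForest on classical anomaly detection datasets.
"""
from collections import Counter  # stand-in for the benchmark's real imports

# Printed once, right after the imports and before any helper definitions.
print(__doc__)


def print_outlier_ratio(y):
    """Placeholder body; the real helper lives in the benchmark file."""
    print(dict(Counter(y)))
```

The docstring must be the first statement in the file for `__doc__` to pick it up.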

jmschrei reviewed


benchmarks/bench_isolation_forest.py Outdated

               fig_roc, ax_roc = plt.subplots(1, 1, figsize=(8, 5))
+              # Set this to true for plotting score histograms for each dataset:
+              with_scoring_hists = False

Member

jmschrei Apr 3, 2017

To be consistent with the previous example, shouldn't this be set to True by default?

Contributor Author

hrjn Apr 4, 2017

I found it clearer with just the ROC curves, hence the False by default (but can be changed if required).

Contributor

ngoix Apr 19, 2017

Could you please rename it "with_decision_functions_histograms"?

Contributor Author

hrjn Apr 20, 2017

Done.

Contributor

ngoix commented Apr 19, 2017

LGTM

Harizo Rajaona and others added 2 commits

April 20, 2017 15:43


          Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic)

4e3f88b

Minor corrections after code review.

Minor corrections after 2nd code review.

Minor modif


          rerun CI

8d86ae9

TomDLT merged commit 195de6a into scikit-learn:master

Member

TomDLT commented Apr 20, 2017

Thanks @hrjn !

Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request


          [MRG+1] Fix multi-label issues in IsolationForest benchmark (scikit-learn#8638)

bb27ceb

* Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic)

Minor corrections after code review.

Minor corrections after 2nd code review.

Minor modif

* rerun CI

albertcthomas mentioned this pull request

[MRG+1] Fix LOF and Isolation benchmarks #9798

Merged
