Skip to content

[MRG+1] Fix multi-label issues in IsolationForest benchmark #8638

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 20, 2017
Merged

[MRG+1] Fix multi-label issues in IsolationForest benchmark #8638

merged 2 commits into from
Apr 20, 2017

Conversation

hrjn
Copy link
Contributor

@hrjn hrjn commented Mar 23, 2017

Reference Issue

Fixes #8637

What does this implement/fix? Explain your changes.

In previous version, using Python 3 and LabelBinarizer to encode categorical features from SA & SF datasets from kddcup99 led to the error described in #8637. Replacing it with MultipleLabelBinarizerfixes the problem and allows the code to run on both Python 2 and 3.

Output obtained with new version:
bench_py35

Any other comments?

Additional minor cleaning/refactoring:

  • PEP8 compliance
  • added a with_scoring_hists flag (set to False by default) to avoid plotting all score histograms (might potentially clog the screen with figure windows)
  • removed the shuttle dataset from the list (as of today, mldata.org still isn't back up)
  • added a short helper function to display the outlier ratio for each selected dataset

@jnothman
Copy link
Member

Thanks! I agree that the described fix is appropriate, but now that there are other changes you will need to be patient for a full review. I hope to look soon.

Copy link
Member

@TomDLT TomDLT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

ax[2].hist(scoring[y_test == 1], bins, color='r',
label='outliers')
ax[2].legend(loc="lower right")
y_pred = model.predict(X_test)
Copy link
Member

@TomDLT TomDLT Mar 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to call predict before decision_function, y_pred is unused,
and it artificially increases the predict time.

@TomDLT TomDLT changed the title [MRG] Fix multi-label issues in IsolationForest benchmark [MRG+1] Fix multi-label issues in IsolationForest benchmark Mar 31, 2017
bins = np.linspace(-0.5, 0.5, 200)
ax[0].hist(scoring, bins, color='black')
ax[0].set_title('Decision function for %s dataset' % dat)
ax[0].legend(loc="lower right")
Copy link
Member

@TomDLT TomDLT Mar 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove this line, since there is no label in this plot.
It will remove the matplotlib warning

==========================================

A test of IsolationForest on classical anomaly detection datasets.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you put the description on the top of the file


A test of IsolationForest on classical anomaly detection datasets.
"""
print(__doc__)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can put the print(__doc__) just below the import and before print_outlier_ratio(y)

fig_roc, ax_roc = plt.subplots(1, 1, figsize=(8, 5))

# Set this to true for plotting score histograms for each dataset:
with_scoring_hists = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be consistent with the previous example, shouldn't this be set to True by default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it clearer with just the ROC curves, hence the False by default (but can be changed if required).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please rename it "with_decision_functions_histograms"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@ngoix
Copy link
Contributor

ngoix commented Apr 19, 2017

LGTM

Harizo Rajaona and others added 2 commits April 20, 2017 15:43
…ges (mostly esthethic)

Minor corrections after code review.

Minor corrections after 2nd code review.

Minor modif
@TomDLT TomDLT merged commit 195de6a into scikit-learn:master Apr 20, 2017
@TomDLT
Copy link
Member

TomDLT commented Apr 20, 2017

Thanks @hrjn !

Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
…earn#8638)

* Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic)

Minor corrections after code review.

Minor corrections after 2nd code review.

Minor modif

* rerun CI
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
…earn#8638)

* Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic)

Minor corrections after code review.

Minor corrections after 2nd code review.

Minor modif

* rerun CI
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
…earn#8638)

* Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic)

Minor corrections after code review.

Minor corrections after 2nd code review.

Minor modif

* rerun CI
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
…earn#8638)

* Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic)

Minor corrections after code review.

Minor corrections after 2nd code review.

Minor modif

* rerun CI
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
…earn#8638)

* Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic)

Minor corrections after code review.

Minor corrections after 2nd code review.

Minor modif

* rerun CI
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
…earn#8638)

* Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic)

Minor corrections after code review.

Minor corrections after 2nd code review.

Minor modif

* rerun CI
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
…earn#8638)

* Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic)

Minor corrections after code review.

Minor corrections after 2nd code review.

Minor modif

* rerun CI
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

IsolationForest benchmark doesn't accept legacy multi-label data representation
6 participants