-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG+1] Fix multi-label issues in IsolationForest benchmark #8638
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks! I agree that the described fix is appropriate, but now that there are other changes you will need to be patient for a full review. I hope to look soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
benchmarks/bench_isolation_forest.py
Outdated
ax[2].hist(scoring[y_test == 1], bins, color='r', | ||
label='outliers') | ||
ax[2].legend(loc="lower right") | ||
y_pred = model.predict(X_test) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need to call predict
before decision_function
, y_pred
is unused,
and it artificially increases the predict time.
benchmarks/bench_isolation_forest.py
Outdated
bins = np.linspace(-0.5, 0.5, 200) | ||
ax[0].hist(scoring, bins, color='black') | ||
ax[0].set_title('Decision function for %s dataset' % dat) | ||
ax[0].legend(loc="lower right") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove this line, since there is no label in this plot.
It will remove the matplotlib warning
benchmarks/bench_isolation_forest.py
Outdated
========================================== | ||
|
||
A test of IsolationForest on classical anomaly detection datasets. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you put the description on the top of the file
benchmarks/bench_isolation_forest.py
Outdated
|
||
A test of IsolationForest on classical anomaly detection datasets. | ||
""" | ||
print(__doc__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can put the print(__doc__)
just below the import and before print_outlier_ratio(y)
benchmarks/bench_isolation_forest.py
Outdated
fig_roc, ax_roc = plt.subplots(1, 1, figsize=(8, 5)) | ||
|
||
# Set this to true for plotting score histograms for each dataset: | ||
with_scoring_hists = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be consistent with the previous example, shouldn't this be set to True by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found it clearer with just the ROC curves, hence the False
by default (but can be changed if required).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please rename it "with_decision_functions_histograms"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
LGTM |
…ges (mostly esthethic) Minor corrections after code review. Minor corrections after 2nd code review. Minor modif
Thanks @hrjn ! |
…earn#8638) * Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic) Minor corrections after code review. Minor corrections after 2nd code review. Minor modif * rerun CI
…earn#8638) * Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic) Minor corrections after code review. Minor corrections after 2nd code review. Minor modif * rerun CI
…earn#8638) * Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic) Minor corrections after code review. Minor corrections after 2nd code review. Minor modif * rerun CI
…earn#8638) * Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic) Minor corrections after code review. Minor corrections after 2nd code review. Minor modif * rerun CI
…earn#8638) * Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic) Minor corrections after code review. Minor corrections after 2nd code review. Minor modif * rerun CI
…earn#8638) * Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic) Minor corrections after code review. Minor corrections after 2nd code review. Minor modif * rerun CI
…earn#8638) * Fixed a legacy multi-label issue and added minor refactoring and changes (mostly esthethic) Minor corrections after code review. Minor corrections after 2nd code review. Minor modif * rerun CI
Reference Issue
Fixes #8637
What does this implement/fix? Explain your changes.
In previous version, using Python 3 and
LabelBinarizer
to encode categorical features from SA & SF datasets fromkddcup99
led to the error described in #8637. Replacing it withMultipleLabelBinarizer
fixes the problem and allows the code to run on both Python 2 and 3.Output obtained with new version:

Any other comments?
Additional minor cleaning/refactoring:
with_scoring_hists
flag (set toFalse
by default) to avoid plotting all score histograms (might potentially clog the screen with figure windows)