Skip to content

[MRG] Combining LOF and Isolation benchmarks #16606

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 57 commits into from
Apr 7, 2022

Conversation

MaiRajborirug
Copy link
Contributor

@MaiRajborirug MaiRajborirug commented Mar 2, 2020

What does this implement/fix? Explain your changes.
Create one example that combines both LOF and IF ROC curves, according to the last discussion in a merged PR #9798 (comment).

Other Comments
Also fixed some data input typos in the plot_annomaly_comparison.py

@MaiRajborirug
Copy link
Contributor Author

figure

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this PR be removing bench_lof.py and/or bench_isolation_forest.py?

@jnothman
Copy link
Member

jnothman commented Mar 2, 2020

Should this PR be removing bench_lof.py and/or bench_isolation_forest.py?

Please let us know when you've addressed this, or if you need help.

Currently this appears to have some lining errors

@MaiRajborirug
Copy link
Contributor Author

Should this PR be removing bench_lof.py and/or bench_isolation_forest.py?

Please let us know when you've addressed this, or if you need help.

Currently this appears to have some lining errors

Sorry about that. I just delete them.

@albertcthomas
Copy link
Contributor

The goal is to merge these two benchmarks into an example. Quoting @ogrisel's comment

I had a quick IRL conversation with @albertcthomas and I think we should convert those 2 benchmark scripts into a single example script with a plt.subplot grid, one subplot per dataset with both LOF and IF ROC curves.

@MaiRajborirug can you please move the merged benchmark to examples/ and change the name of script to match the usual names of the examples?

@MaiRajborirug
Copy link
Contributor Author

@albertcthomas, sure. Is the name plot_anomaly_bench.py good?

@albertcthomas
Copy link
Contributor

@albertcthomas, sure. Is the name plot_anomaly_bench.py good?

Looks good to me. I will try to review your PR soon.

@MaiRajborirug
Copy link
Contributor Author

@albertcthomas, just a followup comment. How was the review going?

Copy link
Contributor

@albertcthomas albertcthomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A first pass. This example will need to be referred to in the documentation on Novelty and Outlier detection.

@albertcthomas
Copy link
Contributor

You also need to make the tests pass: you have some flake8 errors that you need to fix.

@MaiRajborirug
Copy link
Contributor Author

You also need to make the tests pass: you have some flake8 errors that you need to fix.
Fixed it

A first pass. This example will need to be referred to in the documentation on Novelty and Outlier detection.
Could you tell me where is the document Novelty and Outlier detection link in Github? I only found it in the Sklearn website.

@MaiRajborirug
Copy link
Contributor Author

You also need to make the tests pass: you have some flake8 errors that you need to fix.

Fixed it. Thank you.

A first pass. This example will need to be referred to in the documentation on Novelty and Outlier detection.

Do you know where is the document Novelty and Outlier detection link in Github? I only found it in the Sklearn website. And should I adjust the document file as well?

Copy link
Contributor

@albertcthomas albertcthomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaiRajborirug
Copy link
Contributor Author

The documentation is located here: https://github.com/scikit-learn/scikit-learn/blob/master/doc/modules/outlier_detection.rst

I might need some help with inserting new .png file into the document. I can't find the folder ../auto_examples/ensemble/images/

Copy link
Contributor

@albertcthomas albertcthomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for you work @MaiRajborirug. There is a few things to fix/do in order for this to be good on my side :).

@MaiRajborirug
Copy link
Contributor Author

@albertcthomas does the changes look good? Please let me know if you want to adjust the document any further.

@MaiRajborirug
Copy link
Contributor Author

@albertcthomas, should we discuss the algorithm's performance in the example or not?

@albertcthomas
Copy link
Contributor

albertcthomas commented Mar 25, 2020

@albertcthomas, should we discuss the algorithm's performance in the example or not?

Yes. I would discuss how to read the curves, that a higher ROC curve means a better performance and that we are usually mostly interested in low values of the FPR. I am not sure that we really want to emphasize any difference in performance between Isolation Forest and LOF. IMO this example is to show how to compare two outlier detection estimators.

Copy link
Contributor Author

@MaiRajborirug MaiRajborirug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the section about interpreting the ROC curve. @albertcthomas, is it short and clear enough?

@MaiRajborirug
Copy link
Contributor Author

Thank you, I just updated the code as you suggested @albertcthomas .

@albertcthomas
Copy link
Contributor

You can see the rendered example here. Could you improve the rendering of the legends in the plot so that they can fit in the figure and do not have ' in their texts?

@MaiRajborirug
Copy link
Contributor Author

You can see the rendered example here. Could you improve the rendering of the legends in the plot so that they can fit in the figure and do not have ' in their texts?

I fixed the ' and shortened the plot label description

@MaiRajborirug
Copy link
Contributor Author

@albertcthomas do you think this example is ready?

@albertcthomas
Copy link
Contributor

It is also a bit sad that LOF is not performing well on these datasets

@MaiRajborirug
Copy link
Contributor Author

I agree that LOF doesn't perform well here. @albertcthomas do you have any further suggestions on this PR improvement?

@cmarmo
Copy link
Contributor

cmarmo commented Mar 30, 2022

The pull request says 'Merging is blocked' even though all checks have passed. Could someone tell me what happen?

The main branch is protected now. Only core-devs can merge... this is not really different from before for contributors but it is explicitly stated.
Thanks for synchronizing the pull request.
@ogrisel, @jeremiedbb, this is a very old PR with one approval needing a decision. Could it be possibly included in 1.1? Thanks!

@cmarmo
Copy link
Contributor

cmarmo commented Apr 2, 2022

@MaiRajborirug this pull request has been added to the list for the review sprint. Feel free to join if you think it will speed up the process. Thanks.

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @MaiRajborirug. We are currently moving our examples to use a notebook style (see #22406). Could you update your PR to follow these guidelines ? In particular, split the content in different cells with titles and move the imports in the cells where they are needed.

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's another pass @MaiRajborirug

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a couple more nitpicks on my side but looks good otherwise once the comments from @ArturoAmorQ.

It could also be interesting to check if we can give some insights about what makes the performance different between the 2 algorithms for different datasets. @ogrisel had the intuition that IForest could be more performant of datasets with heterogeneous, while LOF might be better with numeric features and rotational invariance for instance.

@MaiRajborirug
Copy link
Contributor Author

@jeremiedbb that is a very interesting topic to dive into, but let's work it in another PR. It might be better to use generated data for that experiment where we have full control of the datasets.

jeremiedbb and others added 2 commits April 7, 2022 17:20
jeremiedbb and others added 2 commits April 7, 2022 17:33
Co-authored-by: Arturo Amor <86408019+ArturoAmorQ@users.noreply.github.com>
@MaiRajborirug
Copy link
Contributor Author

@jeremiedbb @ArturoAmorQ how were the changes?

Copy link
Member

@ArturoAmorQ ArturoAmorQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @MaiRajborirug!

@jeremiedbb jeremiedbb merged commit 3b5f460 into scikit-learn:main Apr 7, 2022
@jeremiedbb
Copy link
Member

Thanks @MaiRajborirug and @albertcthomas and @ArturoAmorQ !

jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Apr 29, 2022
…#16606)

Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Arturo Amor <86408019+ArturoAmorQ@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

9 participants