Skip to content

[MRG] Fix LocalOutlierFactor's output for data with duplicated samples #28773

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
May 30, 2024

Conversation

HenriqueProj
Copy link
Contributor

Reference Issues/PRs

Fixes #27839

What does this implement/fix? Explain your changes.

Previously, when the dataset had values repeat more times than the algorithm's number of neighbors, it miscalculates the outliers.
Because the distance between the duplicated samples is 0, the local reachability density is equal to 1e10. This leads to values that are close to the duplicated values having a really low negative_outlier_factor_ (under -1e7), labeling them as outliers.

This fix checks if the minimum negative_outlier_factor_ is under -1e7 and, if so, raises the number of neighbors to the number of occurrences of the most frequent value + 1, also raising a warning.

Notes: Added a handle_duplicates variable, which allows developers to manually handle the duplicate values, if desired.
Also added a memory_limit variable to avoid creating memory errors for really large datasets, which can also be changed manually by developers.

Any other comments?

…cated samples

Previously, when the dataset had values repeat more times than the algorithm's number of neighbors, it miscalculates the outliers.
Because the distance between the duplicated samples is 0, the local reachability density is equal to 1e10. This leads to values that are close to the duplicated values having a really low negative outlier factor (under -1e7), labeling them as outliers.
This fix checks if the minimum negative outlier factor is under -1e7 and, if so, raises the number of neighbors to the number of occurrences of the most frequent value + 1, also raising a warning.
Notes: Added a handle_duplicates variable, which allows developers to manually handle the duplicate values, if desired. Also added a memory_limit variable to avoid creating memory errors for really large datasets, which can also be changed manually by developers.
Copy link

github-actions bot commented Apr 5, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 35a6519. Link to the linter CI: here

@ogrisel
Copy link
Member

ogrisel commented Apr 10, 2024

I think I don't like the recursive automatic change to neighbors.

Maybe we should instead just warn the user when we detect the problem with very negative outlier factor values and let the user re-fit the model with a larger value of n_neighbors by themselves.

Removed automatic change to neighbors number and changed the warning
Also changed the associated test, to catch the warning.
@HenriqueProj
Copy link
Contributor Author

@ogrisel Changed the code. Now it only raises a warning, as suggested.

Changed comment according to review

Co-authored-by: Tim Head <betatim@gmail.com>
@HenriqueProj
Copy link
Contributor Author

@betatim Is there anything more I can do?
Thank you.

Copy link
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @HenriqueProj. I think this may require a changelog entry as a Fix.

HenriqueProj and others added 3 commits May 29, 2024 15:57
Changed test description according to review

Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
@HenriqueProj
Copy link
Contributor Author

@OmarManzoor Made the changes, is everything ok? Thanks for the feedback!

Copy link
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestions and also the conflict needs to be resolved.

Copy link
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @HenriqueProj

@HenriqueProj
Copy link
Contributor Author

Thank you for all the feedback!

@OmarManzoor OmarManzoor enabled auto-merge (squash) May 30, 2024 13:44
@OmarManzoor OmarManzoor merged commit 546fd5f into scikit-learn:main May 30, 2024
28 checks passed
@jeremiedbb jeremiedbb mentioned this pull request Jul 2, 2024
11 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

LocalOutlierFactor might not work with duplicated samples
4 participants