[MRG] Fix LocalOutlierFactor's output for data with duplicated samples #28773
Conversation
Fix LocalOutlierFactor's output for data with duplicated samples.

Previously, when the dataset contained a value repeated more times than the algorithm's number of neighbors, the outliers were miscalculated. Because the distance between the duplicated samples is 0, their local reachability density is equal to 1e10. This gives points close to the duplicated values an extremely negative outlier factor (below -1e7), labeling them as outliers.

This fix checks whether the minimum negative outlier factor is below -1e7 and, if so, raises the number of neighbors to the number of occurrences of the most frequent value + 1, and also raises a warning.

Notes: Added a handle_duplicates variable, which allows developers to manually handle the duplicated values, if desired. Also added a memory_limit variable to avoid memory errors on very large datasets, which can also be changed manually by developers.
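For illustration only (not part of the PR), a minimal sketch that reproduces the behaviour described above; the dataset is made up, with one value duplicated more often than n_neighbors:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Illustrative data: 25 identical samples, one nearby point, and some
# spread-out points.
X = np.concatenate([
    np.full((25, 1), 0.0),                               # duplicated value
    [[0.1]],                                             # point close to the duplicates
    np.random.RandomState(0).uniform(1, 10, size=(20, 1)),
])

# Fewer neighbors than the number of duplicates: all neighbor distances of a
# duplicated sample are 0, so its local reachability density saturates.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)

print(lof.negative_outlier_factor_.min())  # far below -1e7
print(labels[25])                          # the point at 0.1 is labeled -1 (outlier)
```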
I think I don't like the recursive automatic change to neighbors. Maybe we should instead just warn the user when we detect the problem with very negative outlier factor values, and let the user re-fit the model with a larger value of n_neighbors.
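For illustration (not from the PR itself), a sketch of the user-side workflow this suggests; the exact warning category and message introduced by the PR are not shown in this thread, so the generic catch below is an assumption:

```python
import warnings
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Same kind of data as in the previous sketch: one value duplicated more
# often than n_neighbors, plus a nearby point and some spread-out points.
X = np.concatenate([
    np.full((25, 1), 0.0),
    [[0.1]],
    np.random.RandomState(0).uniform(1, 10, size=(20, 1)),
])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)

if caught:
    # Re-fit with n_neighbors larger than the number of duplicated samples,
    # as suggested above.
    labels = LocalOutlierFactor(n_neighbors=30).fit_predict(X)
```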
Removed the automatic change to the number of neighbors and changed the warning. Also changed the associated test to catch the warning.
@ogrisel Changed the code. Now it only raises a warning, as suggested.
Changed comment according to review
Co-authored-by: Tim Head <betatim@gmail.com>
@betatim Is there anything more I can do?
Thanks for the PR @HenriqueProj. I think this may require a changelog entry as a Fix.
Changed test description according to review
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
@OmarManzoor Made the changes, is everything ok? Thanks for the feedback!
Minor suggestions, and the merge conflict also needs to be resolved.
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
LGTM. Thanks @HenriqueProj
Thank you for all the feedback!
Reference Issues/PRs
Fixes #27839
What does this implement/fix? Explain your changes.
Previously, when the dataset contained a value repeated more times than the algorithm's number of neighbors, the outliers were miscalculated.

Because the distance between the duplicated samples is 0, their local reachability density is equal to 1e10. This gives points close to the duplicated values an extremely negative negative_outlier_factor_ (below -1e7), labeling them as outliers.

This fix checks whether the minimum negative_outlier_factor_ is below -1e7 and, if so, raises the number of neighbors to the number of occurrences of the most frequent value + 1, and also raises a warning.

Notes: Added a handle_duplicates variable, which allows developers to manually handle the duplicated values, if desired. Also added a memory_limit variable to avoid memory errors on very large datasets, which can also be changed manually by developers.
Any other comments?
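As a side note, a small arithmetic sketch of where the 1e10 and -1e7 figures above come from; it assumes the 1e-10 regularization term used in scikit-learn's local reachability density computation, which is an assumption to verify against the current source:

```python
import numpy as np

# All k reachability distances of a duplicated sample are 0, because every
# neighbor is an identical copy of the sample.
reach_dist = np.zeros(20)

# Local reachability density, assuming a 1e-10 regularization term in the
# denominator (as in scikit-learn's _local_reachability_density).
lrd = 1.0 / (reach_dist.mean() + 1e-10)
print(lrd)  # ~1e10

# A point near the duplicates has an lrd of roughly 1 / 0.1 = 10, so its
# outlier factor is about 1e10 / 10 = 1e9, i.e. a negative_outlier_factor_
# of about -1e9, well below the -1e7 threshold mentioned above.
```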