-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Refactor weighted percentile functions to avoid redundant sorting #30945
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Refactor weighted percentile functions to avoid redundant sorting #30945
Conversation
Thanks for opening this PR! It is still a draft, so I've not looked at the diff too closely yet. I wanted to raise one thing already though: how did you decide to do this work? The reason I am asking is that I think the new code is quite a bit more complex than what it replaced, so having some benchmarks or motivation for why making this change is worth it would be good to have. There is a TODO comment in the code that suggests making this change, but I think we need to remember/figure out again why that comment was put there. IMHO the comment itself isn't a good enough motivation |
Hi @betatim Thanks for your feedback! I'm new to open source and eager to contribute. I found this TODO and thought it would make a difference, but I see your point about complexity. I’d love any advice—not just on this PR but on open source in general. Looking forward to your thoughts! |
As the author of the TODO, I confirm that this refactoring is needed; otherwise we won't be able to generalize the use of this function throughout the code base whenever it's needed to fix sample_weight related problems (see #16298) without introducing significant performance regressions. |
Note that there is also a concurrent PR that refactors this function for a different purpose: Not sure which will be ready to merge first, but both are valuable and once one is merged, the other will need to be adapted accordingly. |
@Nujra40 there are a bunch of failing tests. Would you mind trying to fix the code to get them to pass before we start the review? Unless you need help to solve some of them, if so please ask in the comments:
See the continuous integration logs ("failing checks") for details. Furthermore, @lucyleeow's #29431 PR has just been merged in |
Hi @ogrisel I will look into it and get back to you if required. |
7b943ae
to
00d3ef9
Compare
Hi @ogrisel Need some help to figure out whats wrong. Is the tests not complete for my added parts of the code? |
It seems to be complaining about these 2 lines being untested: |
@@ -41,13 +43,17 @@ def _weighted_percentile(array, sample_weight, percentile_rank=50, xp=None): | |||
xp : array_namespace, default=None | |||
The standard-compatible namespace for `array`. Default: infer. | |||
|
|||
symmetrize : bool, default=False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit about this name - could we make it more consistent with how it's named elsewhere? I know this function is private and we ultimately would like to use the scipy quantile so we don't have to maintain our own, but it would still be nice if it was clearer what this method was equivalent to.
I think this quantile method is called "averaged_inverted_cdf" in numpy/scipy. And we were calling it _averaged_weighted_percentile
. I can understand "symmetrize" term but what about "average" or "average_inverted" etc ?
return result[0] if n_dim == 1 else result | ||
|
||
|
||
# TODO: refactor to do the symmetrisation inside _weighted_percentile to avoid | ||
# sorting the input array twice. | ||
def _averaged_weighted_percentile(array, sample_weight, percentile_rank=50, xp=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just replace instances of _averaged_weighted_percentile
with _weighted_percentile(symmetrize=True)
?
REF: Integrate symmetrization in _weighted_percentile to avoid double sorting
Description
This pull request refactors the computation of weighted percentiles by integrating symmetrization directly into the
_weighted_percentile
function. With this change, we avoid sorting the input array twice when computing the averaged weighted percentile. The following changes have been made:symmetrize
parameter to_weighted_percentile
that, when enabled, computes the averaged weighted percentile using both positive and negative arrays._averaged_weighted_percentile
to leverage the new symmetrization functionality.This refactor improves efficiency without altering the external API or behavior.
Please review and let me know if any adjustments are required.