Refactor weighted percentile functions to avoid redundant sorting #30945

Nujra40 · 2025-03-05T09:53:04Z

REF: Integrate symmetrization in _weighted_percentile to avoid double sorting

Description

This pull request refactors the computation of weighted percentiles by integrating symmetrization directly into the _weighted_percentile function. With this change, we avoid sorting the input array twice when computing the averaged weighted percentile. The following changes have been made:

Added a symmetrize parameter to _weighted_percentile that, when enabled, computes the averaged weighted percentile using both positive and negative arrays.
Updated _averaged_weighted_percentile to leverage the new symmetrization functionality.
Preserved the original functionality and all existing comments.
Ensured that the code complies with the scikit-learn contributing guidelines and passes all relevant tests.

This refactor improves efficiency without altering the external API or behavior.
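
To illustrate the idea described above, here is a minimal NumPy sketch for 1-D inputs. It is not the code in this PR: the names `weighted_percentile_1d` and `averaged_two_sorts` are hypothetical, and the sketch ignores edge cases that the real function has to handle (NaNs, leading zero weights, 2-D inputs, array API namespaces). It shows that both bounds of the averaged percentile can be read from one sorted cumulative-weight array, rather than sorting the array and its negation separately.

```python
import numpy as np


def weighted_percentile_1d(values, weights, percentile_rank=50.0, symmetrize=False):
    """Hypothetical 1-D sketch: weighted percentile using a single sort."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Sort once and accumulate the weights in sorted order.
    order = np.argsort(values, kind="stable")
    sorted_values = values[order]
    cum_weights = np.cumsum(weights[order])
    target = percentile_rank / 100.0 * cum_weights[-1]
    last = len(sorted_values) - 1

    # Lower bound: first value whose cumulative weight reaches the target.
    lower_idx = min(np.searchsorted(cum_weights, target, side="left"), last)
    lower = sorted_values[lower_idx]
    if not symmetrize:
        return lower

    # Upper bound: first value whose cumulative weight exceeds the target.
    # This plays the role of -percentile(-values, 100 - rank) without a second sort.
    upper_idx = min(np.searchsorted(cum_weights, target, side="right"), last)
    upper = sorted_values[upper_idx]
    return (lower + upper) / 2


def averaged_two_sorts(values, weights, percentile_rank=50.0):
    """Reference version that sorts twice: once for values, once for -values."""
    values = np.asarray(values, dtype=float)
    lower = weighted_percentile_1d(values, weights, percentile_rank)
    upper = -weighted_percentile_1d(-values, weights, 100.0 - percentile_rank)
    return (lower + upper) / 2
```

For `[1, 2, 3, 4]` with unit weights and rank 50, the lower bound is 2 and the upper bound is 3, so both versions return 2.5.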

Please review and let me know if any adjustments are required.

github-actions · 2025-03-05T09:54:28Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: f45abdb. Link to the linter CI: here

betatim · 2025-03-05T16:36:45Z

Thanks for opening this PR! It is still a draft, so I've not looked at the diff too closely yet. I wanted to raise one thing already though: how did you decide to do this work? The reason I am asking is that I think the new code is quite a bit more complex than what it replaced, so having some benchmarks or motivation for why making this change is worth it would be good to have.

There is a TODO comment in the code that suggests making this change, but I think we need to remember/figure out again why that comment was put there. IMHO the comment itself isn't a good enough motivation.
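
One way to gather the kind of evidence asked for here is a small timing script. The sketch below is only an assumption about what such a benchmark could look like: it times the sorting step in isolation (one `argsort` versus an `argsort` of the array plus one of its negation), not the full percentile computation, and the array size and repeat counts are arbitrary.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=200_000)  # arbitrary size chosen for illustration


def sort_once():
    # Cost of the single sort a merged implementation would need.
    return np.argsort(values)


def sort_twice():
    # Cost of sorting both the array and its negation.
    return np.argsort(values), np.argsort(-values)


# Take the best of several repeats to reduce timing noise.
t_once = min(timeit.repeat(sort_once, number=5, repeat=3))
t_twice = min(timeit.repeat(sort_twice, number=5, repeat=3))
print(f"one sort:  {t_once:.4f} s for 5 calls")
print(f"two sorts: {t_twice:.4f} s for 5 calls")
```

A real benchmark for this PR would time the two percentile functions themselves on realistic inputs, since sorting is only part of their cost.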

Nujra40 · 2025-03-05T16:45:16Z

Hi @betatim

Thanks for your feedback! I'm new to open source and eager to contribute. I found this TODO and thought it would make a difference, but I see your point about complexity. I’d love any advice—not just on this PR but on open source in general. Looking forward to your thoughts!

ogrisel · 2025-03-21T16:02:45Z

As the author of the TODO, I confirm that this refactoring is needed; otherwise we won't be able to generalize the use of this function throughout the code base whenever it's needed to fix sample_weight related problems (see #16298) without introducing significant performance regressions.

ogrisel · 2025-03-21T16:07:28Z

Note that there is also a concurrent PR that refactors this function for a different purpose:

Add array API support for _weighted_percentile #29431

Not sure which will be ready to merge first, but both are valuable and once one is merged, the other will need to be adapted accordingly.

ogrisel · 2025-04-04T12:15:13Z

@Nujra40 there are a bunch of failing tests. Would you mind trying to fix the code to get them to pass before we start the review? If you need help solving some of them, please ask in the comments:

FAILED tests/test_common.py::test_estimators[KBinsDiscretizer(quantile_method='averaged_inverted_cdf')-check_sample_weights_shape] - IndexError: index -1 is out of bounds for axis 0 with size 0
FAILED tests/test_common.py::test_estimators[KBinsDiscretizer(quantile_method='averaged_inverted_cdf')-check_sample_weights_not_overwritten] - IndexError: index -1 is out of bounds for axis 0 with size 0
FAILED tests/test_common.py::test_estimators[KBinsDiscretizer(quantile_method='averaged_inverted_cdf',subsample=None)-check_sample_weight_equivalence_on_dense_data] - AssertionError: 
FAILED preprocessing/tests/test_discretization.py::test_fit_transform[quantile-averaged_inverted_cdf-expected5-sample_weight5] - AssertionError: 
FAILED preprocessing/tests/test_discretization.py::test_fit_transform[quantile-averaged_inverted_cdf-expected6-sample_weight6] - AssertionError: 
FAILED preprocessing/tests/test_discretization.py::test_fit_transform[quantile-averaged_inverted_cdf-expected7-sample_weight7] - AssertionError: 
FAILED preprocessing/tests/test_discretization.py::test_fit_transform_n_bins_array[quantile-averaged_inverted_cdf-expected4-sample_weight4] - AssertionError: 
FAILED preprocessing/tests/test_discretization.py::test_fit_transform_n_bins_array[quantile-averaged_inverted_cdf-expected5-sample_weight5] - AssertionError: 
FAILED preprocessing/tests/test_discretization.py::test_fit_transform_n_bins_array[quantile-averaged_inverted_cdf-expected6-sample_weight6] - AssertionError: 
FAILED preprocessing/tests/test_discretization.py::test_kbinsdiscretizer_effect_sample_weight - AssertionError: 
FAILED utils/tests/test_stats.py::test_averaged_weighted_median - AssertionError
FAILED utils/tests/test_stats.py::test_averaged_weighted_percentile - AssertionError
FAILED utils/tests/test_stats.py::test_averaged_and_weighted_percentile - AssertionError

See the continuous integration logs ("failing checks") for details.
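
For readers unfamiliar with the `IndexError` message in the first two failures: NumPy raises exactly this error when an empty 1-D array is indexed with `-1`. The sketch below reproduces the message and shows one possible guard. It does not diagnose the actual cause in this PR, which is not shown in the thread, and `total_weight` is a hypothetical helper.

```python
import numpy as np


def total_weight(sample_weight):
    """Hypothetical helper: total of the weights, with an explicit empty check."""
    cumulative = np.cumsum(np.asarray(sample_weight, dtype=float))
    if cumulative.size == 0:
        # Fail early with a clearer message than the raw IndexError.
        raise ValueError("sample_weight must contain at least one element")
    return cumulative[-1]


# Reproduce the raw error: taking the last element of an empty array.
try:
    np.cumsum(np.array([]))[-1]
except IndexError as exc:
    message = str(exc)

print(message)
```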

Furthermore, @lucyleeow's #29431 PR has just been merged into main, so this PR's branch needs to be updated by merging the current main branch into it and resolving the conflicts before proceeding further. Please let us know if you need help.

Nujra40 · 2025-04-05T03:29:29Z

Hi @ogrisel I will look into it and get back to you if required.

Nujra40 · 2025-04-06T02:42:13Z

Hi @ogrisel, I need some help figuring out what's wrong. Are the tests incomplete for the parts of the code I added?

lucyleeow · 2025-04-07T01:13:08Z

@Nujra40

Looking at codecov: https://app.codecov.io/gh/scikit-learn/scikit-learn/pull/30945?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=checks&utm_campaign=pr+comments&utm_term=scikit-learn

It seems to be complaining about these 2 lines being untested:

lucyleeow · 2025-04-07T01:20:13Z

sklearn/utils/stats.py

@@ -41,13 +43,17 @@ def _weighted_percentile(array, sample_weight, percentile_rank=50, xp=None):
    xp : array_namespace, default=None
        The standard-compatible namespace for `array`. Default: infer.

+    symmetrize : bool, default=False


Nit about this name - could we make it more consistent with how it's named elsewhere? I know this function is private and we ultimately would like to use the scipy quantile so we don't have to maintain our own, but it would still be nice if it was clearer what this method was equivalent to.

I think this quantile method is called "averaged_inverted_cdf" in numpy/scipy. And we were calling it _averaged_weighted_percentile. I can understand the "symmetrize" term, but what about "average" or "average_inverted", etc.?
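
For reference, the naming mentioned above can be seen in NumPy's unweighted `np.percentile`, which accepts both `"inverted_cdf"` and `"averaged_inverted_cdf"` through its `method` keyword (assuming a NumPy version recent enough to have that keyword). This small sketch only shows how the two methods differ on unweighted data. It says nothing about how the weighted version in this PR is implemented.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# "inverted_cdf" returns an actual data point at the median position.
lower = np.percentile(data, 50, method="inverted_cdf")

# "averaged_inverted_cdf" averages the two neighbouring data points
# when the requested rank falls exactly between them.
averaged = np.percentile(data, 50, method="averaged_inverted_cdf")

print(lower, averaged)  # 2.0 2.5
```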

lucyleeow · 2025-04-07T01:20:17Z

sklearn/utils/stats.py

    return result[0] if n_dim == 1 else result


-# TODO: refactor to do the symmetrisation inside _weighted_percentile to avoid
-# sorting the input array twice.
 def _averaged_weighted_percentile(array, sample_weight, percentile_rank=50, xp=None):


Why not just replace instances of _averaged_weighted_percentile with _weighted_percentile(symmetrize=True)?

github-actions bot added the module:utils label Mar 5, 2025

Nujra40 marked this pull request as draft March 5, 2025 10:06

ogrisel mentioned this pull request Mar 21, 2025

Added sample weight handling to BinMapper under HGBT #29641

Open

7 tasks

Nujra40 marked this pull request as ready for review April 4, 2025 09:48

Nujra40 closed this Apr 5, 2025

Nujra40 force-pushed the refactor/weighted-percentile-symmetrization branch from 7b943ae to 00d3ef9 Compare April 5, 2025 17:08

Refactor weighted percentile functions to avoid redundant sorting

f45abdb

Nujra40 reopened this Apr 5, 2025

lucyleeow reviewed Apr 7, 2025
