
ENH improve _weighted_percentile to provide several interpolation #17768


Closed

Conversation

@glemaitre (Member) commented Jun 29, 2020:

closes #6189
closes #7662

Toward the completion of #17377
Next step #17775

Implement different interpolation strategies for _weighted_percentile
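
For context, these are the four interpolation strategies numpy exposes for the unweighted case (using the pre-1.22 interpolation keyword, as elsewhere in this thread), which this PR mirrors for _weighted_percentile:

import numpy as np

x = np.array([1, 2, 3, 4])
for interpolation in ("linear", "lower", "higher", "nearest"):
    # One value per strategy; they differ whenever the requested
    # percentile falls between two order statistics.
    print(interpolation, np.percentile(x, 50, interpolation=interpolation))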

@glemaitre changed the title from "[WIP] ENH improve _weighted_percentile to provide several interpolation" to "ENH improve _weighted_percentile to provide several interpolation" on Jun 29, 2020
@glemaitre (Member, Author) commented Jun 29, 2020:

@jnothman I started to split the original PR. Now it should be easier to review. Note, though, that this PR introduces functionality that we are not using yet.

@glemaitre (Member, Author) commented:

ping @ogrisel @lucyleeow since you reviewed the original work.
ping @NicolasHug @adrinjalali since it will be used in the GBRT as well, a bit later.

@NicolasHug (Member) left a comment:


Quick first pass



@pytest.mark.parametrize("percentile", [0, 25, 50, 75, 100])
def test_weighted_percentile_equivalence_weights_repeated_samples(percentile):

please add short comments describing the tests (here and elsewhere)
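
A sketch of the kind of leading comment being requested (the test body below is illustrative, not the PR's actual code; it assumes the interpolation keyword added in this PR):

import numpy as np
import pytest

from sklearn.utils.stats import _weighted_percentile

@pytest.mark.parametrize("percentile", [0, 25, 50, 75, 100])
def test_weighted_percentile_equivalence_weights_repeated_samples(percentile):
    # Integer sample weights should behave like repeating each sample
    # `weight` times and taking the unweighted percentile with the same
    # interpolation strategy.
    x = np.array([1.0, 2.0, 3.0, 4.0])
    sample_weight = np.array([1, 2, 3, 4])
    x_repeated = np.repeat(x, sample_weight)
    assert _weighted_percentile(
        x, sample_weight, percentile, interpolation="lower"
    ) == np.percentile(x_repeated, percentile, interpolation="lower")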

@adrinjalali (Member) left a comment:


Thanks @glemaitre

@adrinjalali (Member) commented:

I also mentioned it before: do we not want to have this in Cython?

@lucyleeow (Member) left a comment:


Sorry, I still have concerns about 'linear', but I can drop it. It's difficult to get the weighted percentile to have similar properties to the normal percentile.

P_w = (x - w) / (S_w - w),

where x in [1, N], w being the weight and S_w being the sum of the weights.
"""
@lucyleeow (Member) commented Jun 30, 2020:

I understand the desire to be consistent with the numpy implementation of C=0/C=1, but I do think extrapolating to weighted samples causes problems. (Copied over from the old PR.) It 'ignores' the last weight, so it doesn't give the same answer as np.percentile on the array repeated by its integer weights:

>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([1, 1, 1, 5]),
...                      percentile=50, interpolation="linear")
3.3
>>> np.percentile(np.array([1, 2, 3, 4, 4, 4, 4, 4]), 50, interpolation='linear')
4.0
>>> np.median(np.array([1, 2, 3, 4, 4, 4, 4, 4]))
4.0

'Ignoring' the last weight also means that these two give the same result:

>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([1, 1, 1, 5]),
...                      percentile=50, interpolation="linear")
3.3
>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([1, 1, 5, 5]),
...                      percentile=50, interpolation="linear")
3.3

So the weighted implementation of C=1 doesn't share the properties of its unit-weight counterpart. I am not sure when it would be desirable to use this implementation.
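
To make the 'ignored last weight' concrete, here is a sketch computing the adjusted percentile ranks with the docstring formula quoted above, reading x as the inclusive cumulative weight (consistent with the adjusted-percentile example later in this thread):

import numpy as np

def adjusted_percentile(w):
    # P_w_i = (C_i - w_i) / (S_w - w_i), with C_i the inclusive cumulative
    # weight of sample i and S_w the total weight.
    c = np.cumsum(w)
    return (c - w) / (w.sum() - w)

print(adjusted_percentile(np.array([1, 1, 1, 5])))  # ~[0, 0.143, 0.286, 1]
print(adjusted_percentile(np.array([1, 1, 5, 5])))  # ~[0, 0.091, 0.286, 1]

In both cases p=0.5 falls between the same ranks 2/7 and 1, so linear interpolation between the values 3 and 4 returns the same 3.3, whatever the last weight is.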

@glemaitre (Member, Author) commented:

The repeating issue should not be a big deal because we are not going to use linear interpolation where we need to keep this property.

@glemaitre (Member, Author) commented:

Regarding the case

>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([1, 1, 1, 5]),
...                      percentile=50, interpolation="linear")

the percentile requested is between 3 and 4, and it will be computed using linear interpolation. In this case it is weird, because intuition says the 4 should be more attractive. But this means that linear interpolation is probably not what we want for a weighted percentile; we would rather expect some non-linear interpolation.

@glemaitre (Member, Author) commented:

> I also mentioned it before: do we not want to have this in Cython?

We should benchmark to know if this is a bottleneck where it is used. If it is critical, then we can cythonize it. For the moment, we rely heavily on NumPy functions.

Comment on lines 162 to 182
def _weighted_interpolation(
    x, x_low, x_high, y_low, y_high, w_low, w_high,
):
    """Weighted interpolation between 2 values.

    If the weights are null, then it is equivalent to a linear
    interpolation.
    """
    with np.errstate(invalid="ignore"):
        percentile_value = (
            y_low + ((1 - w_low) * (x - x_low)) * (
                (y_high - y_low) / (
                    (1 - w_high) * (x_high - x) +
                    (1 - w_low) * (x - x_low)
                )
            )
        )
    if np.isnan(percentile_value):
        # case where the percentile is 100
        return y_high
    return percentile_value
@glemaitre (Member, Author) commented:

Somehow this weighted interpolation behaves as in the picture below.

[Figure 1: weighted interpolation curve between two neighbouring values]

While this is closer to the intuition one would have, I have no idea if it is mathematically correct.
Anybody with some background in interpolation? @jnothman @NicolasHug @thomasjpfan @jeremiedbb @lucyleeow ?
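
For concreteness, a small standalone check (a sketch, not part of the PR) showing that with null weights the expression above reduces to plain linear interpolation, while a heavy weight on the upper neighbour pulls the result toward it:

import numpy as np

def weighted_interp(x, x_low, x_high, y_low, y_high, w_low, w_high):
    # Same expression as _weighted_interpolation in the diff above.
    return y_low + ((1 - w_low) * (x - x_low)) * (
        (y_high - y_low)
        / ((1 - w_high) * (x_high - x) + (1 - w_low) * (x - x_low))
    )

print(weighted_interp(0.5, 0.0, 1.0, 0.0, 10.0, 0.0, 0.0))  # 5.0, i.e. linear
print(weighted_interp(0.5, 0.0, 1.0, 0.0, 10.0, 0.0, 0.8))  # ~8.33, pulled up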

Base automatically changed from master to main January 22, 2021 10:52
@lucyleeow (Member) commented:

I think this could be revived, as some good work has been done and a better implementation of the weighted percentile would be worthwhile.

The original aim was to add the percentile methods included in numpy.percentile ('lower', 'higher', 'nearest' and 'linear'). Note however that since version 1.22 numpy supports 13 different methods: the 9 methods described in the paper

R. J. Hyndman and Y. Fan, "Sample quantiles in statistical packages," The American Statistician, 50(4), pp. 361-365, 1996

(I've seen this paper referenced a fair bit in my reading and it seems like a good reference) and the 4 old discontinuous variations of 'linear'. 'linear' is described in the paper above and also on Wikipedia as C=1 (see below).

In the discussions above we have been using as reference the 3 'linear interpolation between closest ranks' methods (C=0, C=1/2, C=1) described on Wikipedia. The variants can be thought of as working like so: calculate the percentile ranks of all values in the sorted list, then linearly interpolate between each of these points. This plot, taken from that page, shows the difference between the 3 approaches on the ordered list {15, 20, 35, 40, 50}:

[Figure: percentile rank as a function of p for C=0, C=1/2 and C=1 on {15, 20, 35, 40, 50}]

Things to note:

  • all variants have the same value at p=0.5 (the median), which is a constraint of the method
  • there is the greatest difference near the margins (0 and 1): the percentile rank of the first value in the list is much higher for C=0 than C=1, and the percentile rank of the last value is much lower for C=0 than C=1 (see the sketch below)
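
For reference, a small sketch of those three rank definitions, using the p_i = (i - C) / (N + 1 - 2C) parametrization from the Wikipedia page:

import numpy as np

a = np.array([15, 20, 35, 40, 50])
n = len(a)
i = np.arange(1, n + 1)
for C in (0.0, 0.5, 1.0):
    # Percentile rank attached to each order statistic (rounded below).
    print(C, np.round((i - C) / (n + 1 - 2 * C), 3))
# C=0   -> [0.167 0.333 0.5   0.667 0.833]
# C=1/2 -> [0.1   0.3   0.5   0.7   0.9  ]
# C=1   -> [0.    0.25  0.5   0.75  1.   ]

Note how the middle value gets rank 0.5 in all three variants, while the ranks of the first and last values differ the most.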

What we have done here is extend the percentile options from numpy to the weighted percentile. Using the same ordered list ({15, 20, 35, 40, 50}) and the weights (4, 1, 1, 1, 4) - arbitrarily higher weights on the first and last values to highlight their effect:

a = np.array([15, 20, 35, 40, 50])
w = np.array([4, 1, 1, 1, 4])

Here is what our current implementation looks like:

[Figure: weighted percentile curves for the 'linear', 'lower', 'higher', 'nearest' and 'weighted' methods]

Notes:

  • the methods 'lower', 'higher' and 'nearest' are all discontinuous
  • the value at p=0.5 varies between the methods (I am not sure we would expect them to be the same?):

    linear:   35.0
    lower:    20
    higher:   35
    nearest:  35
    weighted: 35.0

np.median of the values repeated by their weights is 35.0.

np.median(np.array(np.repeat(a, w)))

Below is the plot again without the discontinuous methods, keeping only 'linear' and 'weighted':

[Figure: weighted percentile curves for the continuous 'linear' and 'weighted' methods only]

Thoughts:

  • I am hesitant about our new 'weighted' interpolation method, as I am not sure we want non-linear interpolation (all 9 methods described in the paper above interpolate linearly) and it is not an extension of an accepted percentile method that has been described elsewhere.
  • I am not sure about extending the old numpy methods ('nearest', 'lower' and 'higher') to the weighted percentile, as the numpy documentation seems to suggest they are now there just for backwards compatibility (from the docs: "NumPy method kept for backwards compatibility.").
  • I wonder if extending one of the other 9 methods mentioned in Hyndman & Fan to the weighted percentile would be better..?
  • When plotted, the 'linear' method for weights does look reasonable, but some unintuitive behaviours, mentioned previously, occur (mostly because of how percentile values are handled at the margins (0, 1)):
    • when using integer weights, the weighted percentile is not the same as the percentile of the list repeated by its integer weights, when 'extreme' weights are at the end (or start):
>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([1, 1, 1, 5]),
...                      percentile=50, interpolation="linear")
3.3
>>> np.percentile(np.array([1, 2, 3, 4, 4, 4, 4, 4]), 50, interpolation='linear')
4.0
>>> np.median(np.array([1, 2, 3, 4, 4, 4, 4, 4]))
4.0
  • the following return the same answers:
>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([1, 1, 1, 5]),
...                      percentile=50, interpolation="linear")
3.3
>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([1, 1, 5, 5]),
...                      percentile=50, interpolation="linear")
3.3
>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([5, 1, 1, 1]),
...                      percentile=50, interpolation="linear")
1.7
>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([5, 5, 1, 1]),
...                      percentile=50, interpolation="linear")
1.7
  • we do not ignore/mask zero-weighted values, e.g.:
>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([1, 0, 0, 1]),
...                        percentile=50, interpolation="linear")
2.0

this is because adjusted_percentile looks like this:

adjusted percentile:
[[0. ]
 [0.5]
 [0.5]
 [1. ]]


Thoughts welcome! cc those who have already looked at this PR: @glemaitre @ogrisel @adrinjalali @NicolasHug and @thomasjpfan

@lorentzenchr (Member) commented:

I'd like to point to the epic numpy PR numpy/numpy#9211.
While I reeeeeeeally would like to have weighted quantiles, I think we should ask ourselves where we need them and implement only what's needed. IMHO, the general and publicly available solution has its place in numpy, not scikit-learn.

@lucyleeow (Member) commented May 23, 2022:

Nice! I didn't know about this. I would tend to agree with you @lorentzenchr; my only concern is that this PR was started in 2016!
From a quick scroll: weights are implemented for all 13(!!) percentile methods supported by numpy (numpy/numpy#9211 (comment)), they seem to handle 0 weights well (numpy/numpy#9211 (comment)), and they sometimes expand to frequency weights. I need to dig into how they handle frequency weights vs probability weights (e.g., (1, 9) vs (0.1, 0.9)). It is certainly very complex, and it is probably nice to have another package handle it rather than us.

What do you think the chances of it getting merged, and the timeframe, are? There were some comments (numpy/numpy#9211 (comment)) about whether this is better placed in a stats package (e.g., scipy).

From my reading so far, potential options are:

  • wait for numpy to merge the above PR
  • implement their extension of the type 7 method (from Hyndman and Fan, equal to 'linear') to the weighted percentile (it seems to be the main method that gets extended to the weighted case) until their PR gets merged...

@lucyleeow (Member) commented May 30, 2022:

For reference, here are the situations I highlighted above, using the 'linear' implementation of the weighted quantile from the numpy PR numpy/numpy#9211:

>>> np.percentile([1,2,3,4], weights=[1,1,1,5], q=50, method='linear')
array(4.)
>>> np.percentile([1,2,3,4], weights=[1,1,5,5], q=50, method='linear')
array(3.)
>>> np.percentile([1,2,3,4], weights=[1,0,0,1], q=50, method='linear')
array(2.5)

Also, after discussion with @thomasjpfan, another potential solution is to take only the 'linear' weighted percentile and update our implementation to match the one in the numpy PR. A good point was raised that even when/if the PR is merged into numpy, it will take a while before our minimum numpy version allows us to use it. Also, considering the number of lines of code involved, it would not be trivial to extract the weighted numpy implementation and copy it into sklearn.

Edit:
I think this is due to how the weights are implemented; some discussion here: numpy/numpy#9211 (comment).
I will look into the difference between the 3 types of weights.

@lucyleeow (Member) commented:

Having a look at the current numpy PR implementation, it varies a bit from ours:

  • weights that are equal to 0 are made into NaN and ignored
  • any weights < 1 are normalized (afaict this is the only amendment made to weights):
    # Now check/renormalize weights if any is (0, 1)
    def _normalize(v):
        inds = v > 0
        if (v[inds] < 1).any():
            vec = v.copy()
            vec[inds] = vec[inds] / vec[inds].min()  # renormalization
            return vec
        else:
            return v
  • instead of calculating adjusted percentiles, then doing linear interpolation between the higher and lower percentile indices like we do in this PR:
    P_w_i = (x - w_i) / (S_w - w_i),
    where x in [1, N], w_i being the weight of sample i and S_w being the sum
    of the weights.

numpy does something more complicated:

        # Each array value occupies a range within weight space, bounded
        # by its left/right_weight_bound.  If a weight_space_index falls within
        # those two bounds, the quantile will just take on that array value.
        # If the weight_space_index falls between neighboring ranges, that
        # means its quantile value must fall between neighboring array values
        # as well, so the gamma (computed in weight space) is used to
        # interpolate between those neighboring array values.

The calculation of weight_space_index varies depending on the method; for the default 'linear' method it is get_virtual_index below:

    # To avoid some rounding issues, `(n-1) * quantiles` is preferred to
    # `_compute_virtual_index(n, quantiles, 1, 1)`.
    # They are mathematically equivalent.
    linear=dict(
        get_virtual_index=lambda n, quantiles: (n - 1) * quantiles,
        fix_gamma=lambda g, _: g,
        discrete_shortcut=None,
    ),

the left and right bounds are calculated thus:

            left_weight_bound = np.roll(wgts1d_cumsum, 1)
            left_weight_bound[0] = 0  # left-most weight bound fixed at 0
            right_weight_bound = wgts1d_cumsum - 1

(I don't really understand the reasoning behind the right bound...?) E.g., for the weight array [2, 3]:

    cum sum            = [2, 5]
    left weight bound  = [0, 2]
    right weight bound = [1, 4]
    weight bounds      = [0, 1, 2, 4]

This is then used to calculate 'gamma' (the fractional part of the virtual index), which is used to interpolate between neighbouring values.
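
For concreteness, reproducing that worked example with the lines quoted above:

import numpy as np

wgts1d_cumsum = np.cumsum([2, 3])              # [2, 5]
left_weight_bound = np.roll(wgts1d_cumsum, 1)
left_weight_bound[0] = 0                       # left-most bound fixed at 0
print(left_weight_bound)                       # [0, 2]
print(wgts1d_cumsum - 1)                       # right bounds: [1, 4]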

@adrinjalali (Member) commented:

Doesn't numpy/numpy#24254 (thanks to @lorentzenchr) partially solve this?

@lorentzenchr (Member) commented:

TLDR: Yes, let's close this, use the numpy 2.0 weighted quantile function (in the future), and mark the other methods as "won't implement".

That numpy PR adds weights support for np.quantile(x, method="inverted_cdf") only. It was very hard to convince them at all of the statistical meaningfulness of weighted quantiles. inverted_cdf was the only method that made sense to me to extend (averaged_inverted_cdf would also be possible), but for all the other methods it is pretty much unclear how to extend them to the weighted case. Literature is non-existent.
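
For reference, a quick sketch of the resulting API (requires numpy >= 2.0; weights are only accepted together with method="inverted_cdf"):

import numpy as np

x = np.array([1, 2, 3, 4])
w = np.array([1, 1, 1, 5])
# Smallest value whose weighted CDF reaches the requested quantile:
# here 4, since it carries 5 of the 8 units of weight.
print(np.quantile(x, 0.5, weights=w, method="inverted_cdf"))  # 4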

Successfully merging this pull request may close these issues:

  • Make _weighted_percentile more robust

7 participants