
ENH improve _weighted_percentile to provide several interpolation #17768


Closed

Conversation

@glemaitre (Member) commented Jun 29, 2020:

closes #6189
closes #7662

Toward the completion of #17377
Next step #17775

Implement different interpolation strategies for _weighted_percentile
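
For context, these are the four interpolation strategies numpy exposes for the unweighted case (using the pre-1.22 interpolation keyword, as elsewhere in this thread), which this PR mirrors for _weighted_percentile:

import numpy as np

x = np.array([1, 2, 3, 4])
for interpolation in ("linear", "lower", "higher", "nearest"):
    # One value per strategy; they differ whenever the requested
    # percentile falls between two order statistics.
    print(interpolation, np.percentile(x, 50, interpolation=interpolation))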

@glemaitre changed the title from "[WIP] ENH improve _weighted_percentile to provide several interpolation" to "ENH improve _weighted_percentile to provide several interpolation" on Jun 29, 2020
@glemaitre (Member, Author) commented Jun 29, 2020:

@jnothman I started to split the original PR. Now it should be easier to review. Note, though, that this PR introduces functionality that we are not using yet.

@glemaitre (Member, Author) commented:

ping @ogrisel @lucyleeow since you reviewed the original work.
ping @NicolasHug @adrinjalali since it will be used in the GBRT as well, a bit later.

@NicolasHug (Member) left a comment:


Quick first pass



@pytest.mark.parametrize("percentile", [0, 25, 50, 75, 100])
def test_weighted_percentile_equivalence_weights_repeated_samples(percentile):

please add short comments describing the tests (here and elsewhere)
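
A sketch of the kind of leading comment being requested (the test body below is illustrative, not the PR's actual code; it assumes the interpolation keyword added in this PR):

import numpy as np
import pytest

from sklearn.utils.stats import _weighted_percentile

@pytest.mark.parametrize("percentile", [0, 25, 50, 75, 100])
def test_weighted_percentile_equivalence_weights_repeated_samples(percentile):
    # Integer sample weights should behave like repeating each sample
    # `weight` times and taking the unweighted percentile with the same
    # interpolation strategy.
    x = np.array([1.0, 2.0, 3.0, 4.0])
    sample_weight = np.array([1, 2, 3, 4])
    x_repeated = np.repeat(x, sample_weight)
    assert _weighted_percentile(
        x, sample_weight, percentile, interpolation="lower"
    ) == np.percentile(x_repeated, percentile, interpolation="lower")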

@adrinjalali (Member) left a comment:


Thanks @glemaitre

@adrinjalali (Member) commented:

I also mentioned it before: do we not want to have this in Cython?

@lucyleeow (Member) left a comment:


Sorry, I still have concerns about 'linear', but I can drop it. It's difficult to get the weighted percentile to have similar properties to the normal percentile.

P_w = (x - w) / (S_w - w),

where x in [1, N], w being the weight and S_w being the sum of the weights.
"""
@lucyleeow (Member) commented Jun 30, 2020:

I understand the desire to be consistent with the numpy implementation of C=0/C=1, but I do think extrapolating to weighted samples causes problems. (Copied over from the old PR.) It 'ignores' the last weight, so it doesn't give the same answer as np.percentile on the array repeated by its integer weights:

>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([1, 1, 1, 5]),
...                      percentile=50, interpolation="linear")
3.3
>>> np.percentile(np.array([1, 2, 3, 4, 4, 4, 4, 4]), 50, interpolation='linear')
4.0
>>> np.median(np.array([1, 2, 3, 4, 4, 4, 4, 4]))
4.0

'Ignoring' the last weight also means that these two give the same result:

>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([1, 1, 1, 5]),
...                      percentile=50, interpolation="linear")
3.3
>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([1, 1, 5, 5]),
...                      percentile=50, interpolation="linear")
3.3

So the weighted implementation of C=1 doesn't share the properties of its unit-weight counterpart. I am not sure when it would be desirable to use this implementation.
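
To make the 'ignored last weight' concrete, here is a sketch computing the adjusted percentile ranks with the docstring formula quoted above, reading x as the inclusive cumulative weight (consistent with the adjusted-percentile example later in this thread):

import numpy as np

def adjusted_percentile(w):
    # P_w_i = (C_i - w_i) / (S_w - w_i), with C_i the inclusive cumulative
    # weight of sample i and S_w the total weight.
    c = np.cumsum(w)
    return (c - w) / (w.sum() - w)

print(adjusted_percentile(np.array([1, 1, 1, 5])))  # ~[0, 0.143, 0.286, 1]
print(adjusted_percentile(np.array([1, 1, 5, 5])))  # ~[0, 0.091, 0.286, 1]

In both cases p=0.5 falls between the same ranks 2/7 and 1, so linear interpolation between the values 3 and 4 returns the same 3.3, whatever the last weight is.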

@glemaitre (Member, Author) commented:

The repeating issue should not be a big deal because we are not going to use linear interpolation where we need to keep this property.

@glemaitre (Member, Author) commented:

Regarding the case

>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([1, 1, 1, 5]),
...                      percentile=50, interpolation="linear")

the percentile requested is between 3 and 4, and it will be computed using linear interpolation. In this case it is weird, because intuition says the 4 should be more attractive. But this means that linear interpolation is probably not what we want for a weighted percentile; we would rather expect some non-linear interpolation.

@glemaitre (Member, Author) commented:

> I also mentioned it before: do we not want to have this in Cython?

We should benchmark to know if this is a bottleneck where it is used. If it is critical, then we can cythonize it. For the moment, we rely heavily on NumPy functions.

Comment on lines 162 to 182
def _weighted_interpolation(
    x, x_low, x_high, y_low, y_high, w_low, w_high,
):
    """Weighted interpolation between 2 values.

    If the weights are null, then it is equivalent to a linear
    interpolation.
    """
    with np.errstate(invalid="ignore"):
        percentile_value = (
            y_low + ((1 - w_low) * (x - x_low)) * (
                (y_high - y_low) / (
                    (1 - w_high) * (x_high - x) +
                    (1 - w_low) * (x - x_low)
                )
            )
        )
    if np.isnan(percentile_value):
        # case where the percentile is 100
        return y_high
    return percentile_value
@glemaitre (Member, Author) commented:

Somehow this weighted interpolation behaves as in the picture below.

[Figure 1: weighted interpolation curve between two neighbouring values]

While this is closer to the intuition one would have, I have no idea if it is mathematically correct.
Anybody with some background in interpolation? @jnothman @NicolasHug @thomasjpfan @jeremiedbb @lucyleeow ?
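
For concreteness, a small standalone check (a sketch, not part of the PR) showing that with null weights the expression above reduces to plain linear interpolation, while a heavy weight on the upper neighbour pulls the result toward it:

import numpy as np

def weighted_interp(x, x_low, x_high, y_low, y_high, w_low, w_high):
    # Same expression as _weighted_interpolation in the diff above.
    return y_low + ((1 - w_low) * (x - x_low)) * (
        (y_high - y_low)
        / ((1 - w_high) * (x_high - x) + (1 - w_low) * (x - x_low))
    )

print(weighted_interp(0.5, 0.0, 1.0, 0.0, 10.0, 0.0, 0.0))  # 5.0, i.e. linear
print(weighted_interp(0.5, 0.0, 1.0, 0.0, 10.0, 0.0, 0.8))  # ~8.33, pulled up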

Base automatically changed from master to main January 22, 2021 10:52
@lucyleeow (Member) commented:

I think this could be revived, as some good work has been done and a better implementation of the weighted percentile would be worthwhile.

The original aim was to add the percentile methods included in numpy.percentile ('lower', 'higher', 'nearest' and 'linear'). Note however that since version 1.22 numpy supports 13 different methods: the 9 methods described in the paper

R. J. Hyndman and Y. Fan, "Sample quantiles in statistical packages," The American Statistician, 50(4), pp. 361-365, 1996

(I've seen this paper referenced a fair bit in my reading and it seems like a good reference) and the 4 old discontinuous variations of 'linear'. 'linear' is described in the paper above and also on Wikipedia as C=1 (see below).

In the discussions above we have been using as reference the 3 'linear interpolation between closest ranks' methods (C=0, C=1/2, C=1) described on Wikipedia. The variants can be thought of as working like so: calculate the percentile ranks of all values in the sorted list, then linearly interpolate between each of these points. This plot, taken from that page, shows the difference between the 3 approaches on the ordered list {15, 20, 35, 40, 50}:

[Figure: percentile rank as a function of p for C=0, C=1/2 and C=1 on {15, 20, 35, 40, 50}]

Things to note:

  • all variants have the same value at p=0.5 (the median), which is a constraint of the method
  • there is the greatest difference near the margins (0 and 1): the percentile rank of the first value in the list is much higher for C=0 than C=1, and the percentile rank of the last value is much lower for C=0 than C=1 (see the sketch below)
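
For reference, a small sketch of those three rank definitions, using the p_i = (i - C) / (N + 1 - 2C) parametrization from the Wikipedia page:

import numpy as np

a = np.array([15, 20, 35, 40, 50])
n = len(a)
i = np.arange(1, n + 1)
for C in (0.0, 0.5, 1.0):
    # Percentile rank attached to each order statistic (rounded below).
    print(C, np.round((i - C) / (n + 1 - 2 * C), 3))
# C=0   -> [0.167 0.333 0.5   0.667 0.833]
# C=1/2 -> [0.1   0.3   0.5   0.7   0.9  ]
# C=1   -> [0.    0.25  0.5   0.75  1.   ]

Note how the middle value gets rank 0.5 in all three variants, while the ranks of the first and last values differ the most.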

What we have done here is extend the percentile options from numpy to the weighted percentile. Using the same ordered list ({15, 20, 35, 40, 50}) and the weights (4, 1, 1, 1, 4) - arbitrarily higher weights on the first and last values to highlight their effect:

a = np.array([15, 20, 35, 40, 50])
w = np.array([4, 1, 1, 1, 4])

Here is what our current implementation looks like:

[Figure: weighted percentile curves for the 'linear', 'lower', 'higher', 'nearest' and 'weighted' methods]

Notes:

  • the methods 'lower', 'higher' and 'nearest' are all discontinuous
  • the value at p=0.5 varies between the methods (I am not sure we would expect them to be the same?):

    linear:   35.0
    lower:    20
    higher:   35
    nearest:  35
    weighted: 35.0

np.median of the values repeated by their weights is 35.0.

np.median(np.array(np.repeat(a, w)))

Below is the plot again without the discontinuous methods, keeping only 'linear' and 'weighted':

[Figure: weighted percentile curves for the continuous 'linear' and 'weighted' methods only]

Thoughts:

  • I am hesitant about our new 'weighted' interpolation method, as I am not sure we want non-linear interpolation (all 9 methods described in the paper above interpolate linearly) and it is not an extension of an accepted percentile method that has been described elsewhere.
  • I am not sure about extending the old numpy methods ('nearest', 'lower' and 'higher') to the weighted percentile, as the numpy documentation seems to suggest they are now there just for backwards compatibility (from the docs: "NumPy method kept for backwards compatibility.").
  • I wonder if extending one of the other 9 methods mentioned in Hyndman & Fan to the weighted percentile would be better..?
  • When plotted, the 'linear' method for weights does look reasonable, but some unintuitive behaviours, mentioned previously, occur (mostly because of how percentile values are handled at the margins (0, 1)):
    • when using integer weights, the weighted percentile is not the same as the percentile of the list repeated by its integer weights, when 'extreme' weights are at the end (or start):
>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([1, 1, 1, 5]),
...                      percentile=50, interpolation="linear")
3.3
>>> np.percentile(np.array([1, 2, 3, 4, 4, 4, 4, 4]), 50, interpolation='linear')
4.0
>>> np.median(np.array([1, 2, 3, 4, 4, 4, 4, 4]))
4.0
  • the following return the same answers:
>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([1, 1, 1, 5]),
...                      percentile=50, interpolation="linear")
3.3
>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([1, 1, 5, 5]),
...                      percentile=50, interpolation="linear")
3.3
>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([5, 1, 1, 1]),
...                      percentile=50, interpolation="linear")
1.7
>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([5, 5, 1, 1]),
...                      percentile=50, interpolation="linear")
1.7
  • we do not ignore/mask zero-weighted values, e.g.:
>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([1, 0, 0, 1]),
...                        percentile=50, interpolation="linear")
2.0

this is because adjusted_percentile looks like this:

adjusted percentile:
[[0. ]
 [0.5]
 [0.5]
 [1. ]]


Thoughts welcome! cc those who have already looked at this PR: @glemaitre @ogrisel @adrinjalali @NicolasHug and @thomasjpfan

@lorentzenchr (Member) commented:

I'd like to point to the epic numpy PR numpy/numpy#9211.
While I reeeeeeeally would like to have weighted quantiles, I think we should ask ourselves where we need them and implement only what's needed. IMHO, the general and publicly available solution has its place in numpy, not scikit-learn.

@lucyleeow (Member) commented May 23, 2022:

Nice! I didn't know about this. I would tend to agree with you @lorentzenchr; my only concern is that this PR was started in 2016!
From a quick scroll: weights are implemented for all 13(!!) percentile methods supported by numpy (numpy/numpy#9211 (comment)), they seem to handle 0 weights well (numpy/numpy#9211 (comment)), and they sometimes expand to frequency weights. I need to dig into how they handle frequency weights vs probability weights (e.g., (1, 9) vs (0.1, 0.9)). It is certainly very complex, and it is probably nice to have another package handle it rather than us.

What do you think the chances of it getting merged, and the timeframe, are? There were some comments (numpy/numpy#9211 (comment)) about whether this is better placed in a stats package (e.g., scipy).

From my reading so far, potential options are:

  • wait for numpy to merge the above PR
  • implement their extension of the type 7 method (from Hyndman and Fan, equal to 'linear') to the weighted percentile (it seems to be the main method that gets extended to the weighted case) until their PR gets merged...

@lucyleeow (Member) commented May 30, 2022:

For reference, here are the situations I highlighted above, using the 'linear' implementation of the weighted quantile from the numpy PR numpy/numpy#9211:

>>> np.percentile([1,2,3,4], weights=[1,1,1,5], q=50, method='linear')
array(4.)
>>> np.percentile([1,2,3,4], weights=[1,1,5,5], q=50, method='linear')
array(3.)
>>> np.percentile([1,2,3,4], weights=[1,0,0,1], q=50, method='linear')
array(2.5)

Also, after discussion with @thomasjpfan, another potential solution is to take only the 'linear' weighted percentile and update our implementation to match the one in the numpy PR. A good point was raised that even when/if the PR is merged into numpy, it will take a while before our minimum numpy version allows us to use it. Also, considering the number of lines of code involved, it would not be trivial to extract the weighted numpy implementation and copy it into sklearn.

Edit:
I think this is due to how the weights are implemented; some discussion here: numpy/numpy#9211 (comment).
I will look into the difference between the 3 types of weights.

@lucyleeow (Member) commented:

Having a look at the current numpy PR implementation, it varies a bit from ours:

  • weights that are equal to 0 are made into NaN and ignored
  • any weights < 1 are normalized (afaict this is the only amendment made to weights):
    # Now check/renormalize weights if any is (0, 1)
    def _normalize(v):
        inds = v > 0
        if (v[inds] < 1).any():
            vec = v.copy()
            vec[inds] = vec[inds] / vec[inds].min()  # renormalization
            return vec
        else:
            return v
  • instead of calculating adjusted percentiles, then doing linear interpolation between the higher and lower percentile indices like we do in this PR:
    P_w_i = (x - w_i) / (S_w - w_i),
    where x in [1, N], w_i being the weight of sample i and S_w being the sum
    of the weights.

numpy does something more complicated:

        # Each array value occupies a range within weight space, bounded
        # by its left/right_weight_bound.  If a weight_space_index falls within
        # those two bounds, the quantile will just take on that array value.
        # If the weight_space_index falls between neighboring ranges, that
        # means its quantile value must fall between neighboring array values
        # as well, so the gamma (computed in weight space) is used to
        # interpolate between those neighboring array values.

The calculation of weight_space_index varies depending on the method; for the default 'linear' method it is get_virtual_index below:

    # To avoid some rounding issues, `(n-1) * quantiles` is preferred to
    # `_compute_virtual_index(n, quantiles, 1, 1)`.
    # They are mathematically equivalent.
    linear=dict(
        get_virtual_index=lambda n, quantiles: (n - 1) * quantiles,
        fix_gamma=lambda g, _: g,
        discrete_shortcut=None,
    ),

the left and right bounds are calculated thus:

            left_weight_bound = np.roll(wgts1d_cumsum, 1)
            left_weight_bound[0] = 0  # left-most weight bound fixed at 0
            right_weight_bound = wgts1d_cumsum - 1

(I don't really understand the reasoning behind the right bound...?) E.g., for the weight array [2, 3]:

    cum sum            = [2, 5]
    left weight bound  = [0, 2]
    right weight bound = [1, 4]
    weight bounds      = [0, 1, 2, 4]

This is then used to calculate 'gamma' (the fractional part of the virtual index), which is used to interpolate between neighbouring values.
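
For concreteness, reproducing that worked example with the lines quoted above:

import numpy as np

wgts1d_cumsum = np.cumsum([2, 3])              # [2, 5]
left_weight_bound = np.roll(wgts1d_cumsum, 1)
left_weight_bound[0] = 0                       # left-most bound fixed at 0
print(left_weight_bound)                       # [0, 2]
print(wgts1d_cumsum - 1)                       # right bounds: [1, 4]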

@adrinjalali (Member) commented:

Doesn't numpy/numpy#24254 (thanks to @lorentzenchr) partially solve this?

@lorentzenchr (Member) commented:

TLDR: Yes, let's close this, use the numpy 2.0 weighted quantile function (in the future), and mark the other methods as "won't implement".

That numpy PR adds weights support for np.quantile(x, method="inverted_cdf") only. It was very hard to convince them at all of the statistical meaningfulness of weighted quantiles. inverted_cdf was the only method that made sense to me to extend (averaged_inverted_cdf would also be possible), but for all the other methods it is pretty much unclear how to extend them to the weighted case. Literature is non-existent.
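
For reference, a quick sketch of the resulting API (requires numpy >= 2.0; weights are only accepted together with method="inverted_cdf"):

import numpy as np

x = np.array([1, 2, 3, 4])
w = np.array([1, 1, 1, 5])
# Smallest value whose weighted CDF reaches the requested quantile:
# here 4, since it carries 5 of the 8 units of weight.
print(np.quantile(x, 0.5, weights=w, method="inverted_cdf"))  # 4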

Successfully merging this pull request may close these issues:

  • Make _weighted_percentile more robust

7 participants