ENH improve _weighted_percentile to provide several interpolation #17768
Conversation
@jnothman I started to split the original PR. Now it should be easier to review. Note that this PR introduces functionality which we are not using yet.

Ping @ogrisel @lucyleeow, since you reviewed the original work.
Quick first pass
@pytest.mark.parametrize("percentile", [0, 25, 50, 75, 100])
def test_weighted_percentile_equivalence_weights_repeated_samples(percentile):
Please add short comments describing the tests (here and elsewhere).
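For instance, a minimal sketch of such a test with a descriptive comment, assuming the `_weighted_percentile(array, sample_weight, percentile)` signature from `sklearn.utils.stats` (the parametrization above would apply the same idea at several percentiles):

```python
import numpy as np

from sklearn.utils.stats import _weighted_percentile


def test_weighted_percentile_equivalence_weights_repeated_samples():
    # An integer sample_weight should be equivalent to repeating each
    # sample `weight` times and using unit weights.
    x = np.array([1.0, 2.0, 3.0, 4.0])
    sample_weight = np.array([1, 1, 1, 5])
    x_repeated = np.repeat(x, sample_weight)

    percentile_weighted = _weighted_percentile(x, sample_weight, percentile=50)
    percentile_repeated = _weighted_percentile(
        x_repeated, np.ones_like(x_repeated), percentile=50
    )
    assert percentile_weighted == percentile_repeated
```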
Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>
Thanks @glemaitre
I also mentioned it before: do we not want to have this in Cython?
Sorry, I still have concerns about 'linear', but I can drop it. It's difficult to get the weighted percentile to have properties similar to the normal percentile.
    P_w = (x - w) / (S_w - w),

where x is in [1, N], w is the weight and S_w is the sum of the weights.
"""
I understand the desire to be consistent with the numpy implementation of C=0 and C=1, but I do think extrapolating to samples with weights causes problems. (Copied over from the old PR.) It 'ignores' the last weight, so it doesn't give the same answer as np.percentile with the array repeated by its integer weights:
>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([1, 1, 1, 5]),
... percentile=50, interpolation="linear")
3.3
>>> np.percentile(np.array([1, 2, 3, 4, 4, 4, 4, 4]), 50, interpolation='linear')
4.0
>>> np.median(np.array([1, 2, 3, 4, 4, 4, 4, 4]))
4.0
'Ignoring' the last weight also means that these two calls have the same result:
>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([1, 1, 1, 5]),
... percentile=50, interpolation="linear")
3.3
>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([1, 1, 5, 5]),
... percentile=50, interpolation="linear")
3.3
It (C=1) doesn't have the same properties as when implemented for unit weights. I am not sure when it would be desirable to use this implementation.
The repetition issue will not be a big deal, because we are not going to use the linear interpolation where this property needs to hold.
Regarding the case
>>> _weighted_percentile(np.array([1, 2, 3, 4]), sample_weight=np.array([1, 1, 1, 5]),
... percentile=50, interpolation="linear")
the requested percentile falls between 3 and 4 and will be computed using linear interpolation. In this case it is counter-intuitive, because we would expect the 4 to be more attractive. But then it means that linear interpolation is probably not what we want for the weighted percentile; we would instead expect a more non-linear interpolation.
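A minimal sketch of that point, assuming the docstring's rank formula P_w = (x - w) / (S_w - w) with x the cumulative weight:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])
weights = np.array([1.0, 1.0, 1.0, 5.0])

x = np.cumsum(weights)                            # [1, 2, 3, 8]
ranks = (x - weights) / (weights.sum() - weights)
print(ranks)                                      # [0., 0.143, 0.286, 1.]

# Percentile 50 falls between rank 0.286 (value 3) and rank 1.0 (value 4);
# linear interpolation between those two points yields 3.3.
print(np.interp(0.5, ranks, values))              # 3.3
```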
We should benchmark to know whether this is a bottleneck where it is used. If it is critical, then we can Cythonize it. For the moment, we rely heavily on NumPy functions.
sklearn/utils/stats.py
def _weighted_interpolation(
    x, x_low, x_high, y_low, y_high, w_low, w_high,
):
    """Weighted interpolation between 2 values.

    If the weights are null, then it is equivalent to a linear
    interpolation.
    """
    with np.errstate(invalid="ignore"):
        percentile_value = (
            y_low + ((1 - w_low) * (x - x_low)) * (
                (y_high - y_low) / (
                    (1 - w_high) * (x_high - x) +
                    (1 - w_low) * (x - x_low)
                )
            )
        )
    if np.isnan(percentile_value):
        # case where the percentile is 100
        return y_high
    return percentile_value
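For intuition, a small usage sketch with made-up inputs: null weights reduce to plain linear interpolation, while a heavier w_high pulls the result toward y_high:

```python
# Midpoint between (x_low=0, y_low=10) and (x_high=1, y_high=20).
print(_weighted_interpolation(
    x=0.5, x_low=0.0, x_high=1.0, y_low=10.0, y_high=20.0,
    w_low=0.0, w_high=0.0,
))  # 15.0, plain linear interpolation

print(_weighted_interpolation(
    x=0.5, x_low=0.0, x_high=1.0, y_low=10.0, y_high=20.0,
    w_low=0.0, w_high=0.5,
))  # ~16.67, pulled toward y_high by its larger weight
```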
This weighted interpolation behaves as in the picture below. While this is closer to the intuition one would have, I have no idea whether it is mathematically correct.
Anybody with some background in interpolation? @jnothman @NicolasHug @thomasjpfan @jeremiedbb @lucyleeow ?
I think this could be revived, as some good work has been done and a better implementation of weighted percentile would be worthwhile. The original aim was to add the percentile methods included in numpy.percentile ('lower', 'higher', 'nearest' and 'linear'). Note however that since version 1.22, numpy supports 13 different methods: the 9 described in Hyndman & Fan (1996), "Sample quantiles in statistical packages" (I've seen this paper referenced a fair bit in my reading and it seems like a good reference), plus the 4 old discontinuous variations of 'linear'. 'linear' is described in the paper above and also on Wikipedia as C=1 (see below).

In the discussions above we have been using as reference the 3 'linear interpolation between closest ranks' methods (C=0, C=1/2, C=1) described by Wikipedia. The variants can be thought of as working like so: calculate the percentile ranks of all values in the sorted list, then linearly interpolate between each of these points; see the sketch below. The plot taken from that page shows the difference between the 3 approaches on the ordered list {15, 20, 35, 40, 50}.
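A minimal sketch of the three conventions, assuming the general rank position x = (p/100)(N + 1 - 2C) + C from the Wikipedia page:

```python
import numpy as np

values = np.array([15.0, 20.0, 35.0, 40.0, 50.0])


def percentile_closest_ranks(values, p, C):
    # Rank position on the 1-indexed axis, clipped to [1, N].
    N = len(values)
    x = np.clip(p / 100 * (N + 1 - 2 * C) + C, 1, N)
    low = int(np.floor(x)) - 1            # 0-indexed lower neighbour
    high = min(low + 1, N - 1)
    frac = x - np.floor(x)
    return values[low] + frac * (values[high] - values[low])


for C in (0, 0.5, 1):
    print(C, percentile_closest_ranks(values, 40, C))
# C=0 -> 26.0, C=0.5 -> 27.5, C=1 -> 29.0
print(np.percentile(values, 40))          # 29.0, matching C=1 ('linear')
```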
What we have done here is extend the percentile options from numpy to the weighted percentile, using the same ordered list ({15, 20, 35, 40, 50}) and the weights (4, 1, 1, 1, 4), with arbitrarily higher weights on the first and last values to highlight the effect. Here is what our current implementation looks like: [plot omitted]

Below is the plot again with only the discontinuous methods, 'linear' and 'weighted': [plot omitted]
Thoughts welcome! cc those who have already looked at this PR: @glemaitre @ogrisel @adrinjalali @NicolasHug and @thomasjpfan
I'd like to point to the epic numpy PR numpy/numpy#9211.
Nice! I didn't know about this. I would tend to agree with you @lorentzenchr; my only concern is that this PR was started in 2016! What do you think the chances of it getting merged, and the timeframe, are? There were some comments (numpy/numpy#9211 (comment)) about whether this would be better placed in a stats package (e.g., scipy). From my limited reading there are a few potential options.
For reference, here are the situations I highlighted above, using the 'linear' implementation of weighted quantile from the numpy PR numpy/numpy#9211.
Also, after discussion with @thomasjpfan, another potential solution is to take only the 'linear' weighted percentile and update our implementation to match the one in the numpy PR. A good point was raised that even when/if the PR is merged into numpy, it will take a while before our minimum numpy version allows us to use it. Also, considering the number of lines of code involved, it would not be trivial to extract the numpy weighted implementation and copy it to sklearn.
Having a look at the current numpy PR implementation, it varies a bit from ours. We check and renormalize the weights if any is in (0, 1):
# Now check/renormalize weights if any is (0, 1)
def _normalize(v):
    inds = v > 0
    if (v[inds] < 1).any():
        vec = v.copy()
        vec[inds] = vec[inds] / vec[inds].min()  # renormalization
        return vec
    else:
        return v
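For illustration, a small usage sketch of that helper:

```python
import numpy as np

# Fractional weights are rescaled so the smallest positive weight becomes 1;
# zero weights are left untouched.
print(_normalize(np.array([0.5, 1.5, 0.0, 2.0])))  # [1. 3. 0. 4.]

# Weights that are already >= 1 are returned unchanged.
print(_normalize(np.array([1.0, 2.0, 3.0])))       # [1. 2. 3.]
```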
numpy does something more complicated. The virtual index is calculated with:

# To avoid some rounding issues, `(n-1) * quantiles` is preferred to
# `_compute_virtual_index(n, quantiles, 1, 1)`.
# They are mathematically equivalent.
linear=dict(
    get_virtual_index=lambda n, quantiles: (n - 1) * quantiles,
    fix_gamma=lambda g, _: g,
    discrete_shortcut=None,
),

and the left and right bounds are calculated thus:

left_weight_bound = np.roll(wgts1d_cumsum, 1)
left_weight_bound[0] = 0  # left-most weight bound fixed at 0
right_weight_bound = wgts1d_cumsum - 1

(I don't really understand the reasoning behind the right bound...?) E.g., see the sketch below for the weight array [2, 3]. These bounds are then used to calculate 'gamma' (the fractional part of the virtual index), which is used to interpolate between neighbouring values.
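A minimal sketch of those bounds for the weight array [2, 3], assuming `wgts1d_cumsum` is the cumulative sum of the 1-d weights:

```python
import numpy as np

weights = np.array([2, 3])
wgts1d_cumsum = np.cumsum(weights)        # [2, 5]

left_weight_bound = np.roll(wgts1d_cumsum, 1)
left_weight_bound[0] = 0                  # [0, 2]
right_weight_bound = wgts1d_cumsum - 1    # [1, 4]

# Sample i spans [left_weight_bound[i], right_weight_bound[i]] on the
# virtual index axis of the equivalent repeated array
# (sample 0 at indices 0-1, sample 1 at indices 2-4).
print(left_weight_bound, right_weight_bound)
```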
Doesn't numpy/numpy#24254 (thanks to @lorentzenchr) partially solve this?
TLDR: Yes, let's close this, use the numpy 2.0 weighted quantile function (in the future), and mark the other methods as "won't implement". That numpy PR adds weights support for np.quantile (method="inverted_cdf" only).
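For reference, a minimal sketch of that numpy 2.0 API; weights are only accepted with method="inverted_cdf":

```python
import numpy as np  # requires numpy >= 2.0

a = np.array([1, 2, 3, 4])
w = np.array([1, 1, 1, 5])

# Weighted quantile via the inverted CDF method.
print(np.quantile(a, 0.5, weights=w, method="inverted_cdf"))     # 4

# Consistent with repeating each sample by its integer weight.
print(np.quantile(np.repeat(a, w), 0.5, method="inverted_cdf"))  # 4
```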
Closes #6189
Closes #7662
Toward the completion of #17377
Next step: #17775

Implement different interpolation strategies for _weighted_percentile.