MNT _weighted_percentile supports np.nan values #29034


Merged
merged 52 commits into scikit-learn:main
Mar 4, 2025

Conversation

StefanieSenger
Contributor

@StefanieSenger StefanieSenger commented May 16, 2024

What does this implement/fix? Explain your changes.

This PR adds support for nan input into _weighted_percentile, which is used within SplineTransformer, for instance.

Any other comments?

This is for @ogrisel since we had talked about that:

This does not deliver results equivalent to np.nanpercentile in the current dev version, because they differed before: _weighted_percentile finds the next lower percentile, while np.nanpercentile seems to compute its percentiles differently (closest?). The more variance there is in the sample_weights, the more pronounced this effect becomes.

Here is an example without any nans and without weights:

import numpy as np
from sklearn.utils.stats import _weighted_percentile

percentile = 66
arr2D = np.array([[3, 30], [1, 20], [3, 10]])
weights2D = np.array([[1, 1], [1, 1], [1, 1]])
print(_weighted_percentile(arr2D, weights2D, percentile))
print(np.nanpercentile(arr2D, percentile, weights=weights2D, axis=0, method="inverted_cdf"))

output:

[3 20]
[3 30]

While setting percentile = 67 leads to the same results.

Edit: What I wrote above is actually not true. The differing output was due to not setting the sample weights of nan values to 0 beforehand. Now that this is done, we seem to get the same results.
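As a standalone illustration of that fix, here is a toy pure-NumPy sketch (the helper name and the inverted-CDF details are mine, not sklearn's actual implementation): zeroing the sample weights of nan entries makes a weighted inverted-CDF percentile agree with np.nanpercentile.

```python
import numpy as np

def weighted_percentile_1d(values, weights, percentile):
    """Toy inverted-CDF weighted percentile; not sklearn's actual code.

    Nan entries are neutralized by zeroing their weights; np.argsort
    then pushes them to the end where they carry no probability mass.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float).copy()
    weights[np.isnan(values)] = 0  # the fix described in the edit above
    order = np.argsort(values)     # nans sort to the end
    cdf = np.cumsum(weights[order])
    if cdf[-1] == 0:
        return np.nan  # all-nan (or zero total weight) input
    idx = np.searchsorted(cdf, percentile / 100 * cdf[-1])
    return values[order][idx]

arr = np.array([3.0, 1.0, np.nan, 3.0])
print(weighted_percentile_1d(arr, np.ones_like(arr), 66))  # 3.0
print(np.nanpercentile(arr, 66, method="inverted_cdf"))    # 3.0
```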


github-actions bot commented May 16, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 76e9882.

@StefanieSenger StefanieSenger changed the title ENH _weighted_percentile supports np.nan values MNT _weighted_percentile supports np.nan values May 17, 2024
Member

@ogrisel ogrisel left a comment


Some early feedback.

"One feature contains too many NaN values. This error should "
"actually either raise within the API, or there needs to be some "
"validation here before to make sure it cannot happen."
)
Member


I think it's fine to return np.nan valued percentile (maybe with a warning?) if some array is all-nan valued.

Contributor Author

@StefanieSenger StefanieSenger May 24, 2024


I agree and have removed this from here (and also repaired the buggy loop).

But some further thoughts, though I am not sure how important they are:

The reason I had put it here was that I had tested this on the branch from the nans in SplineTransformer PR where we add test_spline_transformer_handles_all_nans (quite a special case anyway) and this made it raise in a not very informative way.

I think the error needs to be handled somehow, but within SplineTransformer, so that _weighted_percentile can stay flexible in case it's used differently in another context at one point.

For the other cases, we could raise a warning similar to how it's done in numpy:

/home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/numpy/lib/_nanfunctions_impl.py:1623: RuntimeWarning: All-NaN slice encountered return fnb._ureduce(a, ...

But should we add a warning within a private method? I'm a bit hesitant, as this could be confusing for users who might not be aware that _weighted_percentile is used internally. Maybe this warning should instead be raised within the realm of whatever API method is using it, if need be. Candidates might be AbsoluteError.fit_intercept_only(), PinballLoss.fit_intercept_only(), HuberLoss.fit_intercept_only() and KBinsDiscretizer.fit(), and maybe I should add similar all-nan-column tests and see if they raise or if we want to add a warning. Does that make sense?

Contributor Author


I have looked at this again and went through all the places where _weighted_percentile gets used.

The losses I mentioned above (only internal use) as well as set_huber_delta() and median_absolute_error() find a percentile from within y_true or y_true - some_other_1darray, where I think it should be pretty obvious to the user if all values were nan. This is why I think it's okay without a warning.

KBinsDiscretizer and SplineTransformer use it on the whole X within their fit methods and I think we could add a warning there for all-nan-columns if this kind of validation is not already done somewhere else.

Member


When _weighted_percentile is called by the loss functions, we can expect that there will never be any nan values in y_true, because the public method of the estimator (typically .fit) always checks that there are no nan values and raises an exception if needed before calling _weighted_percentile:

>>> from sklearn.datasets import make_regression
>>> import numpy as np
>>> from sklearn.linear_model import QuantileRegressor
>>> X, y = make_regression()
>>> y[0] = np.nan
>>> QuantileRegressor().fit(X, y)
Traceback (most recent call last):
  Cell In[14], line 1
    QuantileRegressor().fit(X, y)
  File ~/code/scikit-learn/sklearn/base.py:1514 in wrapper
    return fit_method(estimator, *args, **kwargs)
  File ~/code/scikit-learn/sklearn/linear_model/_quantile.py:163 in fit
    X, y = self._validate_data(
  File ~/code/scikit-learn/sklearn/base.py:650 in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File ~/code/scikit-learn/sklearn/utils/validation.py:1286 in check_X_y
    y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
  File ~/code/scikit-learn/sklearn/utils/validation.py:1308 in _check_y
    _assert_all_finite(y, input_name="y", estimator_name=estimator_name)
  File ~/code/scikit-learn/sklearn/utils/validation.py:123 in _assert_all_finite
    _assert_all_finite_element_wise(
  File ~/code/scikit-learn/sklearn/utils/validation.py:172 in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input y contains NaN.

and similarly for public scoring functions that rely on those losses:

>>> from sklearn.metrics import mean_pinball_loss
>>> mean_pinball_loss(y, y)
Traceback (most recent call last):
  Cell In[16], line 1
    mean_pinball_loss(y, y)
  File ~/code/scikit-learn/sklearn/utils/_param_validation.py:213 in wrapper
    return func(*args, **kwargs)
  File ~/code/scikit-learn/sklearn/metrics/_regression.py:318 in mean_pinball_loss
    y_type, y_true, y_pred, multioutput = _check_reg_targets(
  File ~/code/scikit-learn/sklearn/metrics/_regression.py:112 in _check_reg_targets
    y_true = check_array(y_true, ensure_2d=False, dtype=dtype)
  File ~/code/scikit-learn/sklearn/utils/validation.py:1056 in check_array
    _assert_all_finite(
  File ~/code/scikit-learn/sklearn/utils/validation.py:123 in _assert_all_finite
    _assert_all_finite_element_wise(
  File ~/code/scikit-learn/sklearn/utils/validation.py:172 in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input contains NaN.

Still, we might want to add a nan_policy parameter to _weighted_percentile with "raise" and "omit" as possible options, as scipy does.

This would make the use of _weighted_percentile within our code base more explicit: loss functions would call it with nan_policy="raise" (the default) while transformers such as SplineTransformer that actually need nan support would call it with nan_policy="omit".

WDYT?

Contributor Author

@StefanieSenger StefanieSenger Jun 5, 2024


I think that the occasion when we would want to raise (all values of a column being nans) is such an edge case that we should not have a param in _weighted_percentile for it.

Having something like scipy's nan_policy would be a larger design decision for the whole project, and having it for our little purpose seems like overkill to me, also because np.nanpercentile will be the way forward at some point anyway.

For the other cases, when there are only a few nan values, users can express their intent in SplineTransformer.fit() with handle_missing='zeros' (once #28043 is merged). If they don't explicitly express this wish, the fit method would raise earlier, so there would be no reason to deselect _weighted_percentile's nan-handling behaviour.

KBinsDiscretizer, which is a special case of SplineTransformer, currently cannot be used with nan values. If it also gets nan support, it would probably work like the former.

And it seems to me there is no other method or function that would implicitly (without user intent) allow nans AND use _weighted_percentile internally.

Does my reasoning make sense?

Member


> Having something like scipy's nan_policy would be a larger design decision for the whole project, and having it for our little purpose seems like overkill to me, also because np.nanpercentile will be the way forward at some point anyway.

If we introduce this just for private methods like _weighted_percentile, then it's fine. But indeed, these kinds of errors are better raised by public methods, because they are more likely to be able to output error messages with more informative contextual details than a generic private method would.

So ok for not introducing nan_policy as part of this PR.

Contributor Author


Thank you, I'm relieved. And sorry if I expressed myself a bit strong there.

Contributor Author

@StefanieSenger StefanieSenger left a comment


I am addressing your comments in my latest push, @ogrisel, thank you.

That test now fails because it is an actual test. Through it I became aware that we also need to mask out the nans when calculating the percentile_idx. Simply ignoring nans within the indexing tasks is not enough. I am still trying to figure out how to do that using masked arrays, and am pushing the current state of my attempts.

Edit: I now found out what I did wrong before: I had created the nan mask on the input array, but it should have been created on the sorted one. With the corrected nan mask we can indeed calculate percentile_idx, and using a masked array is not necessary anymore, it seems. I fixed the errors below and all the tests pass.
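The pitfall described in that edit can be shown in a few lines (illustrative only):

```python
import numpy as np

# The nan mask must be computed on the *sorted* values, not in input order:
# argsort moves nans to the end, so a mask taken beforehand points at the
# wrong positions once the array is sorted.
a = np.array([3.0, np.nan, 1.0])
sorted_a = a[np.argsort(a)]             # [1.0, 3.0, nan]

mask_input_order = np.isnan(a)          # [False, True, False] -> wrong positions
mask_sorted_order = np.isnan(sorted_a)  # [False, False, True] -> correct
```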

@StefanieSenger StefanieSenger marked this pull request as ready for review May 29, 2024 13:22
Member

@ogrisel ogrisel left a comment


Thanks @StefanieSenger for the PR.

I think the code could be made a bit easier to follow and a bit more efficient with the following suggestions but otherwise, this LGTM.

I realize that this code should be easily adaptable to work with the array API (I think). This would unblock several items of the list in #26024 that require a quantile or percentile, which is not yet standardized in the array API spec (data-apis/array-api#795). Let's do that in a follow-up PR.

@ogrisel
Member

ogrisel commented Jun 14, 2024

Note that what is done in this PR is one way to achieve the results that we want but it is not the only way.

The alternative to post-filtering nan percentile values would be to instead call the percentile computation on individual 1d sorted arrays by trimming the nans, which should all be at the end of the 1d arrays. But since there is a variable number of nans per 1d array, we would not be able to process the trimmed arrays as a single 2D stacked data structure of sorted values.

I think the current code in this PR is a less invasive change, however the 1d decomposition I suggest above would be more efficient when the data has many nans (because post-processing would do many iterations in this case).

StefanieSenger and others added 2 commits June 15, 2024 00:20
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Contributor Author

@StefanieSenger StefanieSenger left a comment


> Note that what is done in this PR is one way to achieve the results that we want but it is not the only way.
>
> The alternative to post-filtering nan percentile values would be to instead call the percentile computation on individual 1d sorted arrays by trimming the nans, which should all be at the end of the 1d arrays. But since there is a variable number of nans per 1d array, we would not be able to process the trimmed arrays as a single 2D stacked data structure of sorted values.
>
> I think the current code in this PR is a less invasive change, however the 1d decomposition I suggest above would be more efficient when the data has many nans (because post-processing would do many iterations in this case).

Yes, I was seeing this problem and couldn't think of a way to resolve it (after experimenting with masked arrays). There would be no easy way to put the taken-apart array back together while keeping the rest of the order intact, or at least I couldn't think of one.

It would have helped me to talk about strategies in the beginning.

Since you have approved this PR, I assume that despite being slower it's still alright as it is?

@ogrisel
Member

ogrisel commented Jun 19, 2024

> Since you have approved this PR, I assume that despite being slower it's still alright as it is?

I am not so sure. I would love to have a second opinion on this matter. Maybe it would help to measure the time it takes to call _weighted_percentile on an array with int(1e7) random values with 0 occurrence of np.nan vs 50% vs all nans. For instance using %timeit in an IPython or Jupyter notebook session.

@StefanieSenger
Contributor Author

StefanieSenger commented Jun 20, 2024

> I am not so sure. I would love to have a second opinion on this matter. Maybe it would help to measure the time it takes to call _weighted_percentile on an array with int(1e7) random values with 0 occurrence of np.nan vs 50% vs all nans. For instance using %timeit in an IPython or Jupyter notebook session.

I see. I have made this performance test in a Jupyter notebook (I hope it's correct) and the results look like this:

import time
import numpy as np
from sklearn.utils.stats import _weighted_percentile

def test_performance(X, sample_weight):
    start = time.time()
    _weighted_percentile(X, sample_weight)
    stop = time.time()

    return stop - start

###### test without nans #####
rng = np.random.RandomState(42)
X = rng.rand(100, 100) #int(1e7) is fulfilled here, I believe
sample_weight = np.ones_like(X)

res = 0
for i in range(10000):
    res += test_performance(X, sample_weight)

res
# 4.266921281814575

###### test with nans #####
X[rng.rand(*X.shape) < 0.5] = np.nan

res = 0
for i in range(10000):
    res += test_performance(X, sample_weight)

res
# 5.107046365737915

It is slower when it has to sort out the percentile across 100 columns in the end. I think the number of columns would influence the result strongly. I don't really have a feeling for determining if this performance drop is acceptable or not. But maybe this can help another maintainer form an opinion. Please tell me if I should push the notebook somewhere (but there is no more code than this).

StefanieSenger and others added 2 commits June 20, 2024 12:45
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@ogrisel
Member

ogrisel commented Jun 28, 2024

Thanks for the analysis, but I meant to run it on a large 1D array with int(1e7) elements to be able to probe the scalability. I tried that in the following, but it can be too slow, so I adapted the benchmark to run on 1 million (int(1e6)) elements instead.

Here is an adapted benchmark:

# %%
import time
import numpy as np
from sklearn.utils.stats import _weighted_percentile


def test_performance(func, *args, n_calls=10, **kwargs):
    start = time.time()
    for _ in range(n_calls):
        func(*args, **kwargs)
    stop = time.time()

    return (stop - start) / n_calls

# %%
rng = np.random.RandomState(42)
X = rng.rand(int(1e6))
sample_weight = np.ones_like(X)
test_performance(_weighted_percentile, X, sample_weight, n_calls=5)
# 0.10989542007446289

# %%
X_few_nan = X.copy()
X_few_nan[rng.rand(*X.shape) < 0.1] = np.nan
test_performance(_weighted_percentile, X_few_nan, sample_weight, n_calls=5)
# 0.10001721382141113

# %%
X_half_nan = X.copy()
X_half_nan[rng.rand(*X.shape) < 0.5] = np.nan
test_performance(_weighted_percentile, X_half_nan, sample_weight, n_calls=5)
# 0.07429156303405762

# %%
X_many_nan = X.copy()
X_many_nan[rng.rand(*X.shape) < 0.9] = np.nan
test_performance(_weighted_percentile, X_many_nan, sample_weight, n_calls=5)
# 0.04319744110107422

# %%
X_all_nan = np.full_like(X, np.nan)
test_performance(_weighted_percentile, X_all_nan, sample_weight, n_calls=5)
# 5.420448541641235

So interestingly, the duration can decrease when adding nan values to X. One would need to confirm with a profiler such as viztracer, but I suspect this happens because argsorting with many repeated values (the nans) is easier and hence faster, and the speed gained in sorting is larger than the extra cost of post-processing the nan percentiles.

However, it's quite catastrophic for the "all nans" edge case with a 50x slow-down...

So based on those results I would rather not merge the PR as it is and instead implement one of the following two approaches:

  • detect all nans columns and skip the nan-post processing for them while keeping the rest of the PR,
  • refactor the code of _weighted_percentile to implement per-column filtering of the nan values of X before calling np.argsort on each of the filtered columns of X. This way, no nan-specific post-processing would be needed.

I have the feeling that the second would be more maintainable / readable.

Feel free to open a second PR if you want to explore both options and compare their performance.
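The second option can be sketched roughly as follows (a toy per-column implementation; names and details are illustrative, not the actual _weighted_percentile code):

```python
import numpy as np

def percentile_per_column(X, sample_weight, percentile=50):
    """Toy per-column inverted-CDF weighted percentile that filters nans
    before argsort, so no nan-specific post-processing is needed."""
    results = []
    for col, w in zip(X.T, sample_weight.T):
        keep = ~np.isnan(col)
        col, w = col[keep], w[keep]
        if col.size == 0:
            results.append(np.nan)  # all-nan column handled up front
            continue
        order = np.argsort(col)  # nan-free, so no masking afterwards
        cdf = np.cumsum(w[order])
        idx = np.searchsorted(cdf, percentile / 100 * cdf[-1])
        results.append(col[order][idx])
    return np.array(results)

X = np.array([[1.0, np.nan], [2.0, np.nan], [3.0, np.nan]])
print(percentile_per_column(X, np.ones_like(X), percentile=50))
```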

@StefanieSenger
Contributor Author

Now that #30661 has been merged, all the tests pass here.

@lorentzenchr, is this PR ready for merge now?

@lorentzenchr
Member

> @lorentzenchr, is this PR ready for merge now?

I'll approve when a test vs numpy's weighted quantile function is added.

@StefanieSenger
Contributor Author

> I'll approve when a test vs numpy's weighted quantile function is added.

Oh yes, sorry I forgot that.
I will ping you once it's added.

@StefanieSenger
Contributor Author

StefanieSenger commented Jan 20, 2025

I've added tests that I think fit what you were thinking of. Would you have a look, @lorentzenchr ?

@lorentzenchr
Member

I'll soon have a final look.

),
],
)
def test_weighted_percentile_like_numpy_quantile(percentile, arr, weights):
Member


For this test and the next one, I would prefer data generated in a manner similar to test_weighted_percentile_nan_filtered, i.e. larger, more diverse data.

Contributor Author


Yes, I agree. I have addressed this in my latest commit.

(With the larger data, my feeling would be that the parametrisation for percentile is not necessary anymore. Leaving this for you to judge.)

Member

@lorentzenchr lorentzenchr left a comment


Apart from the above comment, LGTM.
@StefanieSenger Thanks for your endurance.

Note that meanwhile scipy has its own array API compatible quantile function, see scipy/scipy#22352.

@ogrisel
Member

ogrisel commented Feb 28, 2025

> Note that meanwhile scipy has its own array API compatible quantile function, see scipy/scipy#22352.

Interesting. I think we might want to consolidate our work there in the medium term. In the short term, I think it's worth merging this PR as is, once the tests have been updated as suggested in #29034 (review).

As far as I understand, what we are missing from scipy's implementation are the following:

@ogrisel ogrisel enabled auto-merge (squash) March 3, 2025 17:23
@ogrisel
Member

ogrisel commented Mar 3, 2025

I resynced this PR with main and marked for auto-merge if the tests are still green. Thanks all!

@ogrisel ogrisel merged commit d6334e1 into scikit-learn:main Mar 4, 2025
31 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Missing value and nan support Mar 4, 2025
@StefanieSenger StefanieSenger deleted the _weighted_percentile branch March 4, 2025 08:41