MNT `_weighted_percentile` supports np.nan values #29034

Conversation
Some early feedback.
sklearn/utils/stats.py (outdated)

```python
        "One feature contains too many NaN values. This error should "
        "actually either raise within the API, or there needs to be some "
        "validation here before to make sure it cannot happen."
    )
```
I think it's fine to return an `np.nan`-valued percentile (maybe with a warning?) if some array is all-nan valued.
I agree and have removed this from here (and also repaired the buggy loop).

But some further thoughts, though I am not sure how important they are:

The reason I had put it here was that I had tested this on the branch from the nans-in-`SplineTransformer` PR, where we add `test_spline_transformer_handles_all_nans` (quite a special case anyway), and this made it raise in a not very informative way. I think the error needs to be handled somehow, but within `SplineTransformer`, so that `_weighted_percentile` can stay flexible in case it's used differently in another context at some point.

For the other cases, we could raise a warning similar to how it's done in numpy:

```
/home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/numpy/lib/_nanfunctions_impl.py:1623: RuntimeWarning: All-NaN slice encountered return fnb._ureduce(a, ...
```
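For reference, that warning can be reproduced directly:

```python
import numpy as np

# numpy emits "RuntimeWarning: All-NaN slice encountered" and returns nan
np.nanpercentile(np.full(5, np.nan), 50)
```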
But should we add a warning within a private method? I'm a bit hesitant, as this could be confusing for users who might not be aware that `_weighted_percentile` is used internally. Maybe this warning should instead be raised within the realm of whatever API method is using it, if need be. Candidates might be `AbsoluteError.fit_intercept_only()`, `PinballLoss.fit_intercept_only()`, `HuberLoss.fit_intercept_only()`, and `KBinsDiscretizer.fit()`, and maybe I should add similar all-nan-column tests and see if they raise or if we want to add a warning. Does that make sense?
I have looked at this again and went through all the places where `_weighted_percentile` gets used.

The losses I mentioned above (only internal use), as well as `set_huber_delta()` and `median_absolute_error()`, find a percentile within `y_true` or `y_true - some_other_1darray`, where I think it should be pretty obvious to the user if all values were nan. This is why I think it's okay without a warning.

`KBinsDiscretizer` and `SplineTransformer` use it on the whole `X` within their fit methods, and I think we could add a warning there for all-nan columns if this kind of validation is not already done somewhere else.
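Such a check could look roughly like this (a hypothetical sketch; the helper name and message are made up, not existing scikit-learn code):

```python
import warnings
import numpy as np

def _warn_all_nan_columns(X):
    # warn about columns that consist entirely of nan values,
    # since their computed percentiles can only be nan
    all_nan = np.isnan(X).all(axis=0)
    if all_nan.any():
        warnings.warn(
            f"Features {np.flatnonzero(all_nan).tolist()} contain only NaN "
            "values; their percentiles will be NaN as well.",
            RuntimeWarning,
        )
```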
When `_weighted_percentile` is called by the loss functions, we can expect that there will never be any nan values in `y_true`, because the public method of the estimator (typically `.fit`) always checks that there are no nan values and raises an exception if needed before calling `_weighted_percentile`:
```python
>>> from sklearn.datasets import make_regression
>>> import numpy as np
>>> from sklearn.linear_model import QuantileRegressor
>>> X, y = make_regression()
>>> y[0] = np.nan
>>> QuantileRegressor().fit(X, y)
Traceback (most recent call last):
  Cell In[14], line 1
    QuantileRegressor().fit(X, y)
  File ~/code/scikit-learn/sklearn/base.py:1514 in wrapper
    return fit_method(estimator, *args, **kwargs)
  File ~/code/scikit-learn/sklearn/linear_model/_quantile.py:163 in fit
    X, y = self._validate_data(
  File ~/code/scikit-learn/sklearn/base.py:650 in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File ~/code/scikit-learn/sklearn/utils/validation.py:1286 in check_X_y
    y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
  File ~/code/scikit-learn/sklearn/utils/validation.py:1308 in _check_y
    _assert_all_finite(y, input_name="y", estimator_name=estimator_name)
  File ~/code/scikit-learn/sklearn/utils/validation.py:123 in _assert_all_finite
    _assert_all_finite_element_wise(
  File ~/code/scikit-learn/sklearn/utils/validation.py:172 in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input y contains NaN.
```
and similarly for public scoring functions that rely on those losses:
```python
>>> from sklearn.metrics import mean_pinball_loss
>>> mean_pinball_loss(y, y)
Traceback (most recent call last):
  Cell In[16], line 1
    mean_pinball_loss(y, y)
  File ~/code/scikit-learn/sklearn/utils/_param_validation.py:213 in wrapper
    return func(*args, **kwargs)
  File ~/code/scikit-learn/sklearn/metrics/_regression.py:318 in mean_pinball_loss
    y_type, y_true, y_pred, multioutput = _check_reg_targets(
  File ~/code/scikit-learn/sklearn/metrics/_regression.py:112 in _check_reg_targets
    y_true = check_array(y_true, ensure_2d=False, dtype=dtype)
  File ~/code/scikit-learn/sklearn/utils/validation.py:1056 in check_array
    _assert_all_finite(
  File ~/code/scikit-learn/sklearn/utils/validation.py:123 in _assert_all_finite
    _assert_all_finite_element_wise(
  File ~/code/scikit-learn/sklearn/utils/validation.py:172 in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input contains NaN.
```
Still, we might want to add a `nan_policy` parameter to `_weighted_percentile` with `"raise"` and `"omit"` as possible options, as scipy does. This would make the use of `_weighted_percentile` within our code base more explicit: loss functions would call it with `nan_policy="raise"` (the default), while transformers such as `SplineTransformer` that actually need nan support would call it with `nan_policy="omit"`.
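For illustration, a minimal sketch of what that could look like (hypothetical signature, loosely following scipy's `nan_policy` convention; not the actual sklearn code):

```python
import numpy as np

def _weighted_percentile(array, sample_weight, percentile=50, nan_policy="raise"):
    # hypothetical sketch of the proposed parameter, not the real signature
    nan_mask = np.isnan(array)
    if nan_policy == "raise" and nan_mask.any():
        raise ValueError("array contains NaN values")
    elif nan_policy == "omit":
        # zero out the weights of nan entries so they can never be selected
        sample_weight = np.where(nan_mask, 0.0, sample_weight)
    ...  # proceed with the usual weighted percentile computation
```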
WDYT?
I think that the occasion when we would want to raise (all values of a column being nan) is such an edge case that we should not have a parameter in `_weighted_percentile` for it.

Having something like scipy's `nan_policy` would be a larger design decision for the whole project, and having it for our little purpose seems like overkill to me, also because `np.nanpercentile` will be the future at some point anyway.

For the other cases, when there are only a few nan values, users can express their intent in `SplineTransformer.fit()` with `handle_missing='zeros'` (once #28043 is merged). If they don't explicitly express this wish, the fit method would raise earlier, so there would be no reason to deselect `_weighted_percentile`'s nan-handling behaviour.

`KBinsDiscretizer`, which is a special case of `SplineTransformer`, currently cannot be used with nan values. If it also gets nan support, it would probably be handled like in the former.

And it seems to me there is no other method or function that would implicitly (without user intent) allow nans AND use `_weighted_percentile` internally.

Does my reasoning make sense?
> Having something like scipy's nan_policy would be a larger design decision for the whole project and having it for our little purpose seems like overkill to me, also because np.nanpercentile will be the future at some point anyway.

If we introduce this just for private methods like `_weighted_percentile`, then it's fine. But indeed, this kind of error is better raised by public methods, because they are more likely to be able to output error messages with more informative contextual details than a generic private method would.

So ok for not introducing `nan_policy` as part of this PR.
Thank you, I'm relieved. And sorry if I expressed myself a bit strongly there.
I am addressing your comments in my latest push, @ogrisel, thank you.

That test now fails because it is an actual test. Through it I became aware that we also need to mask out the nans for calculating the `percentile_idx`; simply ignoring nans within the indexing tasks is not enough. I am still trying to figure out how to do that using masked arrays, pushing the current state of my attempts.

Edit: I now found out what I did wrong before: I had created the nan mask on the input array, but it should have been created on the sorted one. With the corrected nan mask we can indeed calculate `percentile_idx`, and using a masked array is not necessary anymore, it seems. I fixed the errors below and all the tests pass.
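The gist of the fix, as a simplified 1d sketch (not the PR's exact code):

```python
import numpy as np

a = np.array([2.0, np.nan, 1.0, np.nan, 3.0])
order = np.argsort(a)            # nans sort to the end
sorted_a = a[order]

mask_wrong = np.isnan(a)         # aligned with the input order
mask_right = np.isnan(sorted_a)  # aligned with the sorted array that the
                                 # percentile index is computed on
```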
Thanks @StefanieSenger for the PR.

I think the code could be made a bit easier to follow and a bit more efficient with the following suggestions, but otherwise this LGTM.

I realize that this code should be easily adaptable to work with the array API (I think). This could be done in a follow-up PR. It would unblock several items of the list in #26024 that require `quantile` or `percentile`, which is not yet standardized in the array API spec (data-apis/array-api#795). Let's do that in a follow-up PR.

Note that what is done in this PR is one way to achieve the results that we want, but it is not the only way. The alternative to post-filtering nan percentile values would be to instead call the percentile computation on individual 1d sorted arrays, trimming the nans that should all be at the end of the 1d arrays. But since there is a variable number of nans per 1d array, we would not be able to process the trimmed arrays as a single 2d stacked data structure of sorted values. I think the current code in this PR is a less invasive change; however, the 1d decomposition I suggest above would be more efficient when the data has many nans (because post-processing would do many iterations in this case).
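A rough sketch of that 1d decomposition, assuming 2d `X` and `sample_weight` with at least one non-nan value per column (hypothetical helper, not the PR's code):

```python
import numpy as np

def _percentile_per_column(X, sample_weight, percentile=50):
    # sort each column, trim the trailing nans, then compute the
    # weighted percentile on the variable-length remainder
    result = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j])          # nans sort to the end
        col, w = X[order, j], sample_weight[order, j]
        n_valid = np.count_nonzero(~np.isnan(col))
        col, w = col[:n_valid], w[:n_valid]  # trim the trailing nans
        cdf = np.cumsum(w)
        idx = np.searchsorted(cdf, percentile / 100.0 * cdf[-1])
        result[j] = col[min(idx, n_valid - 1)]
    return result
```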
> Note that what is done in this PR is one way to achieve the results that we want, but it is not the only way.
>
> The alternative to post-filtering nan percentile values would be to instead call the percentile computation on individual 1d sorted arrays, trimming the nans that should all be at the end of the 1d arrays. But since there is a variable number of nans per 1d array, we would not be able to process the trimmed arrays as a single 2d stacked data structure of sorted values.
>
> I think the current code in this PR is a less invasive change; however, the 1d decomposition I suggest above would be more efficient when the data has many nans (because post-processing would do many iterations in this case).
Yes, I saw this problem and couldn't think of a way to resolve it (after experimenting with masked arrays). There would be no easy way to put the taken-apart array back together while keeping the rest of the order in check, or at least I couldn't think of any.

It would have helped me to talk about strategies in the beginning.

Since you have approved this PR, I assume that despite being slower it's still alright as it is?
> Thank you, I'm relieved. And sorry if I expressed myself a bit strongly there.

I am not so sure. I would love to have a second opinion on this matter. Maybe it would help to measure the time it takes to call `_weighted_percentile` on a large array with `int(1e7)` entries.
I see, I have made this performance test in a jupyter notebook (I hope it's correct) and the results look like this:

```python
import time
import numpy as np
from sklearn.utils.stats import _weighted_percentile

def test_performance(X, sample_weight):
    start = time.time()
    _weighted_percentile(X, sample_weight)
    stop = time.time()
    return stop - start

###### test without nans #####
rng = np.random.RandomState(42)
X = rng.rand(100, 100)  # int(1e7) is fulfilled here, I believe
sample_weight = np.ones_like(X)

res = 0
for i in range(10000):
    res += test_performance(X, sample_weight)
res
# 4.266921281814575

###### test with nans #####
X[rng.rand(*X.shape) < 0.5] = np.nan

res = 0
for i in range(10000):
    res += test_performance(X, sample_weight)
res
# 5.107046365737915
```

It is slower when it has to sort out the percentile across 100 columns in the end. I think the number of columns would influence the result strongly. I don't really have a feeling for determining whether this performance drop is acceptable or not, but maybe this can help another maintainer form an opinion. Please tell me if I should push the notebook somewhere (but there is no more code than this).
Thanks for the analysis, but I meant to run it on a large 1D array. Here is an adapted benchmark:

```python
# %%
import time
import numpy as np
from sklearn.utils.stats import _weighted_percentile

def test_performance(func, *args, n_calls=10, **kwargs):
    start = time.time()
    for _ in range(n_calls):
        func(*args, **kwargs)
    stop = time.time()
    return (stop - start) / n_calls

# %%
rng = np.random.RandomState(42)
X = rng.rand(int(1e6))
sample_weight = np.ones_like(X)
test_performance(_weighted_percentile, X, sample_weight, n_calls=5)
# 0.10989542007446289

# %%
X_few_nan = X.copy()
X_few_nan[rng.rand(*X.shape) < 0.1] = np.nan
test_performance(_weighted_percentile, X_few_nan, sample_weight, n_calls=5)
# 0.10001721382141113

# %%
X_half_nan = X.copy()
X_half_nan[rng.rand(*X.shape) < 0.5] = np.nan
test_performance(_weighted_percentile, X_half_nan, sample_weight, n_calls=5)
# 0.07429156303405762

# %%
X_many_nan = X.copy()
X_many_nan[rng.rand(*X.shape) < 0.9] = np.nan
test_performance(_weighted_percentile, X_many_nan, sample_weight, n_calls=5)
# 0.04319744110107422

# %%
X_all_nan = np.full_like(X, np.nan)
test_performance(_weighted_percentile, X_all_nan, sample_weight, n_calls=5)
# 5.420448541641235
```

So interestingly, the duration can decrease when adding nan values to X. One would need to confirm with a profiler such as viztracer, for instance, but I suspect that happens because argsorting with many repeated values (the nans) is easier and hence faster, and the gain in speed obtained by sorting is larger than the extra cost of post-processing the nan percentiles.

However, it's quite catastrophic for the "all nans" edge case, with a 50x slow-down... So based on those results, I would rather not merge the PR as it is and instead implement one of the following two approaches:

I have the feeling that the second would be more maintainable / readable. Feel free to open a second PR if you want to explore both options and compare their performance.
Now after #30661 got merged, all the tests pass here. @lorentzenchr, is this PR ready for merge now?

I'll approve when a test vs numpy's weighted quantile function is added.

Oh yes, sorry, I forgot that.

I've added tests that I think fit what you were thinking of. Would you have a look, @lorentzenchr?
I'll soon have a final look. |
sklearn/utils/tests/test_stats.py (outdated)

```python
        ),
    ],
)
def test_weighted_percentile_like_numpy_quantile(percentile, arr, weights):
```
For this test and the next one, I would prefer data generated in a manner similar to `test_weighted_percentile_nan_filtered`, i.e. larger, more diverse data.
Yes, I agree. I have addressed this in my latest commit.

(With the larger data, my feeling would be that the parametrisation for `percentile` is not necessary anymore. Leaving this for you to judge.)
Apart from the above comment, LGTM. @StefanieSenger, thanks for your endurance.

Note that meanwhile scipy has its own array API compatible quantile function, see scipy/scipy#22352.
Interesting. I think we might want to consolidate our work there in the medium term. In the short term, I think it's worth merging this PR as is, once the tests have been updated as suggested in #29034 (review). As far as I understand, what we are missing from scipy's implementation is the following:
I resynced this PR with `main`.
What does this implement/fix? Explain your changes.

This PR adds support for `nan` input into `_weighted_percentile`, which is used within `SplineTransformer`, for instance.

Any other comments?

This is for @ogrisel, since we had talked about that:

This does not deliver results equivalent to `np.nanpercentile` in the current dev version, because they were different before: `_weighted_percentile` finds the next lower percentile, while `np.nanpercentile` seems to determine its percentiles differently (closest?). The more variance in the `sample_weights`, the more important this effect gets.
, the more important the effect of this gets.Here an example without any nans and without weights:
output:
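A minimal sketch of the kind of comparison meant (the array, the percentile value, and the uniform weights here are illustrative assumptions, not the original snippet, assuming the `_weighted_percentile(array, sample_weight, percentile)` signature):

```python
import numpy as np
from sklearn.utils.stats import _weighted_percentile

rng = np.random.RandomState(0)
x = rng.rand(10)
sample_weight = np.ones_like(x)  # uniform weights stand in for "no weights"

# _weighted_percentile returns the next lower observed value, ...
print(_weighted_percentile(x, sample_weight, percentile=66))
# ... while np.percentile interpolates linearly between observations
# by default, so the two results generally differ slightly
print(np.percentile(x, 66))
```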
While setting `percentile = 67` led to the same results.

Edit: What I wrote is actually not true. The differing output was due to not setting the sample weights of nan values to 0 beforehand. Now that this is done, we get the same results, it seems.