-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
BUG: quantile should error when weights are all zeros #28595
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a bunch of folks tried to fix this at the same time. This PR deals with more cases than the others, but handling NaN/Inf is a bit messier maybe. The other two PRs recently opened are:
- fix quantile all zeros error #28597
- BUG: bugfix for #28589 (quantile should error when weights are all zeros) #28594
I provided initial reviews to each since they are from first-time contributors, but we should probably consolidate one way or another.
numpy/lib/_function_base_impl.py
Outdated
raise ValueError("At least one weight must be non-zero") | ||
if weights.dtype != object: | ||
if np.any(np.isinf(weights)): | ||
raise ValueError("Weights must be non-infinite") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: non-infinite-> finite probably
with pytest.raises(ValueError) as ex: | ||
a = np.quantile(arr, q, weights=wgt, method=m) | ||
assert "Weights must be non-infinite" in str(ex) | ||
wgt[i] = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably sufficient to just check a single index rather than all of them. For the message check, I think we usually prefer using the match
builtin to pytest.raises
these days.
wgt[i] = np.inf | ||
with pytest.raises(ValueError) as ex: | ||
a = np.quantile(arr, q, weights=wgt, method=m) | ||
assert "Weights must be non-infinite" in str(ex) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could parametrize the test to check one weights array with a few NaNs and another with just 1, but probably don't need these two exhaustive loops
wgt[i] = np.nan | ||
with pytest.raises(ValueError) as ex: | ||
a = np.quantile(arr, q, weights=wgt, method=m) | ||
assert "At least one weight is nan" in str(ex) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also probably just parametrize over np.nan
and np.inf
+ expected message above for concision (and same comment about not needing the exhaustive loops
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, will do
I'll try to make it better, but with Also, when I started working on the issue the other requests weren't open yet lol |
Is there anything I need to do to fix the tests? I'm kinda confused as to why they are failing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments inline, but also a more philosophical one here. Python usually avoids "Look Before You Leap" (LBYL), in favour of "Easier to Ask Forgiveness than Permission" (EAFP), and I worry a bit when seeing many checks for possible failure modes, which make the code more complex, and also makes things slower for the common case where the user puts in sensible numbers.
For quantile
, there is already a check against negative weights, but those are quite important, because they would not lead to a failure, but still give a wrong result.
The case is a little different for inf
and all-zero - those give RuntimeWarning
, so users could know something is amiss. But nan
gives no warning (indeed, nan
are allowed in the values...).
Anyway, maybe overall protecting the user is worth it, but we should at least try to make the performance penalty as small as possible, hence my suggestions inline.
But looking closer, there may be another option that would be faster: the weights get transferred down via various functions to _quantile
. There, one sees:
numpy/numpy/lib/_function_base_impl.py
Lines 4892 to 4895 in a9a9bf8
# We use the weights to calculate the empirical cumulative | |
# distribution function cdf | |
cdf = weights.cumsum(axis=0, dtype=np.float64) | |
cdf /= cdf[-1, ...] # normalization to 1 |
So, at this point we have two benefits: any object array are already turned into float, and we only have to check the last element to see whether there are any problems, rather than scan the whole array. So, one could add in between these two lines,
if not np.all(np.isfinite(cdf[-1, ...]):
raise ValueError(...)
For the check for all-zero, one can include it a little below, inside another branch:
if np.any(cdf[0, ...] == 0):
# might as well guard against all-zero here.
if np.any(cdf[-1, ...] == 0):
raise ValueError(...)
cdf[cdf == 0] = -1
The above should be better for performance, though I can see the argument for just having nicely validated data to start with.
numpy/lib/_function_base_impl.py
Outdated
@@ -4535,8 +4535,26 @@ def quantile(a, | |||
if axis is not None: | |||
axis = _nx.normalize_axis_tuple(axis, a.ndim, argname="axis") | |||
weights = _weights_are_valid(weights=weights, a=a, axis=axis) | |||
if weights.dtype != object: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A general comment: I think these checks should happen inside _weights_ar_valid
- this will ensure they are used for percentile
as well.
@@ -4142,7 +4142,25 @@ def test_closest_observation(self): | |||
assert_equal(4, np.quantile(arr[0:9], q, method=m)) | |||
assert_equal(5, np.quantile(arr, q, method=m)) | |||
|
|||
|
|||
|
|||
@pytest.mark.parametrize(["err_msg", "weight"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd parametrize over np.quantile
and np.percentile
as well - they should have the same errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if you pass in a list rather than an array, you could parametrize over dtype=float
and dtype=object
, to make this a little more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
numpy/lib/_function_base_impl.py
Outdated
@@ -4535,8 +4535,26 @@ def quantile(a, | |||
if axis is not None: | |||
axis = _nx.normalize_axis_tuple(axis, a.ndim, argname="axis") | |||
weights = _weights_are_valid(weights=weights, a=a, axis=axis) | |||
if weights.dtype != object: | |||
if np.any(np.isinf(weights)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another general comment: as written, the common case has to go through a lot of checks. I think it would be better to optimize for the common case, and not worry too much about distinguishing failure cases. E.g., you can do just one evaluation with:
if not np.all(np.isfinite(weights)):
raise ValueError("weights must be finite.")
numpy/lib/_function_base_impl.py
Outdated
# Since np.isinf and np.isnan do not work in dtype object arrays | ||
# Also, dtpye object arrays with np.nan in them break <, > and == opperators | ||
# This specific handling had to be done (Can be improved) | ||
elif weights.dtype == object: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this loop can still give unexpected errors, because you are here counting on object arrays to be turned into their values as scalars. E.g.,
np.isnan(np.array([1.,None,np.inf])[1])
# TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
This will be an uninformative error!
I think we have two choices: just not check for object
dtype, or convert to float before checking (and then passing on it that conversion fails).
numpy/lib/_function_base_impl.py
Outdated
if np.any(weights < 0): | ||
raise ValueError("Weights must be non-negative.") | ||
elif np.all(weights == 0): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here again we could ensure the common case remains fast by doing:
if np.any(weights <= 0):
raise ValueError("weights must be non-negative and cannot be all zero.")
# or, more explicit error messages,
if np.all(weights == 0):
raise ValueError("At least one weight must be non-zero.")
else:
raise ValueError("Weights must be non-negative.")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to keep this inline:
The issue with this is that some of the weights might be 0, but none of them are negative. So it would raise an error even though it shouldn't
You're right, I was too sloppy in writing this, the else
should be elif np.any(weights <0)
so that the case of some weights 0 falls through (slowly, but better than making all cases slow!).
p.s. Given this, I'd probably swap the order, i.e.,
if np.any(weights <= 0):
# Do these checks guarded by the above `if` to avoid slowing down the common case.
if np.any(weights < 0):
raise ValueError("Weights must be non-negative.")
elif np.all(weights == 0):
raise ValueError("At least one weight must be non-zero.")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Tontonio3 I don't see how you responded to this suggestion. Please make sure all reviewer feedback is addressed before requesting re-review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ngoldbaum Your're right, I forgot to implement this
The issue is that |
The issue with this is that some of the weights might be 0, but none of them are negative. So it would raise an error even though it shouldn't |
I'd love to do that, the issue is that everything breaks with |
Duh, I thought I had checked that, but I now see I was wrong. Sorry about that! |
Yes, indeed, this is partially why I suggested to do the check much further down, when the object arrays are turned into float. |
@mhvk I've implemented your suggestions. Although now if you have a The Improved testing commit is to save the old error handing. |
@seberg @ngoldbaum is there anything else that needs to be done to close this issue? |
The linter still needs to be fixed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good to me. Some minor comments inline.
if np.any(weights <= 0): | ||
if np.any(weights < 0): | ||
raise ValueError("Weights must be non-negative.") | ||
elif np.all(weights == 0): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't you just say else
here?
q = 0.5 | ||
arr = [1, 2, 3, 4] | ||
wgts = np.array(weight, dtype=dty) | ||
with pytest.raises(err): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might as well use match
here too
Fixes issue #28589
np.quantile
now raises errors if:np.nan
np.inf