BUG: Buttress handling of extreme values in randint #8846
Conversation
```python
for dt in self.itype:
    lbnd = 0 if dt is np.bool_ else np.iinfo(dt).min
    ubnd = 2 if dt is np.bool_ else np.iinfo(dt).max + 1
```
Would `np.iinfo(np.bool_)` be a reasonable thing for numpy to contain? (EDIT: #8849)
Perhaps, as it does not work at the moment:

```python
>>> np.iinfo(np.bool_)
...
ValueError: Invalid integer data type.
```
So, one other problem to address is generating
Of course you could just ask for a direct 64-bit integer, but it would be nice to cover the case where a user API wants to generate up to some variable without a special case. Thoughts?
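Since `np.iinfo(np.bool_)` raises, the special case above could be factored into a small helper; this is an illustrative sketch (`integer_bounds` is not a numpy API), mirroring the `lbnd`/`ubnd` logic in the test:

```python
import numpy as np

def integer_bounds(dt):
    # Hypothetical helper, not part of numpy: np.iinfo raises ValueError
    # for np.bool_, so booleans need the same special case as in the test.
    if dt is np.bool_:
        return 0, 2
    info = np.iinfo(dt)
    return info.min, info.max + 1
```

This keeps the bool special case in one place instead of repeating it at every call site.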
@eric-wieser:
@gfyoung: Good catch on the typo. I meant generate numbers in
numpy/random/mtrand/mtrand.pyx (outdated)
```python
ilow = int(low)
ihigh = int(high)

if ilow < lowbnd:
```
Related issue that is the root cause here: #8851
Indeed, a lot of "quirks" (i.e. bugs) arose when I started working on this PR in general.
Actually, I think I'd be happy with this patch if you add a TODO comment that this is a workaround for gh-8851.
I have no problems doing that.
@eric-wieser : Got around to looking at this again, and I'm trying to refresh myself on what exactly is blocking this PR from being merged. Let me respond to the suggestions you provided above:
Main blocker is that I want someone else to weigh in here, and that this raises a bunch of other issues*. I agree that
But I think this is in some senses a workaround - we should actually just fix the relational operators here, rather than add a workaround. Would that fix this bug?

*For instance:
This fails because we already get an overflow before we make it to
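The overflow being described here is what the Python-int conversion in the patch sidesteps: Python ints have arbitrary precision, so comparisons and arithmetic on them cannot overflow or silently promote to float. A minimal sketch, with illustrative extreme values:

```python
import numpy as np

# Extreme bounds that trip up mixed numpy uint64/int64 scalar operations.
low = np.int64(np.iinfo(np.int64).min)
high = np.uint64(np.iinfo(np.uint64).max)

# Converting to Python ints first keeps everything exact.
ilow = int(low)
ihigh = int(high)

assert ilow < ihigh              # exact comparison, no float promotion
assert ihigh - 1 == 2**64 - 2    # exact subtraction even at the extreme
```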
Okay. Fair enough. Two points in response:
I like the idea of adding those comparison ufunc loops. Adding new ufunc loops is pretty straightforward. It might complicate someone's theoretical table of how dtype promotion works, but for the comparison operators, I'm okay with that. :-)
```python
except Exception as e:
    raise AssertionError("No error should have been raised, "
                         "but one was with the following "
                         "message:\n\n%s" % str(e))
```
This is a little weird - better to leave the raw exception so that the tests produce a useful stack trace, I think.
I did this because I wanted to know which element in the for-loop, if any, failed. No stack trace is going to tell you that AFAICT.
Maybe we need an `assert_no_exception` or some such.
I don't think `assert_no_exception` would be all that useful - wouldn't that go on every single function call otherwise? Based on @gfyoung's reasoning, parameterized test cases would solve this.
> I did this because I wanted to know which element in the for-loop, if any, failed.

In what way does this do a better job of doing that, given that `dt` is not used in the message?
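One way to get a per-element failure message that actually identifies the failing dtype is to include the loop variable in the re-raised message. A sketch, using a hypothetical `assert_no_error` helper and `np.iinfo` as a stand-in for the call under test:

```python
import numpy as np

def assert_no_error(func, dt):
    # Hypothetical test helper: re-raise with the dtype in the message so
    # the failing iteration of a for-loop is identifiable.
    try:
        return func(dt)
    except Exception as e:
        raise AssertionError("no error expected for dtype %s, but got: %s"
                             % (np.dtype(dt).name, e))

for dt in [np.int8, np.uint8, np.int64, np.uint64]:
    assert_no_error(np.iinfo, dt)  # stand-in for the randint call under test
```

Parameterized test cases achieve the same thing by putting the dtype in the test name instead of the message.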
```diff
     raise ValueError("low >= high")

     with self.lock:
-        ret = randfunc(low, high - 1, size, self.state_address)
+        ret = randfunc(ilow, ihigh - 1, size, self.state_address)
```
If we claim that everything should work if comparisons get fixed, then this line doesn't need to change - right?
That is correct. The following lines will not be needed:

```python
ilow = int(low)
ihigh = int(high)
```
Let me be more clear: I think we should continue to pass `low` and `high` on this line - there is no reason to change to `ilow` and `ihigh` here.
That's not the case:

```python
>>> uint64_max = np.uint64(np.iinfo(np.uint64).max)
>>> uint64_max - 1
1.8446744073709552e+19
```

This PR could not demonstrate more clearly just how fragile `numpy`'s operators are when handling large `uint64` numbers.
Darn, you're right. In that case, the comment above saying "remove these lines" should also point out that we need them to work around subtraction as well. Or just add a comment to that effect next to the subtraction.

Alternatively, `np.subtract(high, 1, dtype=dtype)` would probably be safe here too.
Fair enough. In fact, even your suggestion is not safe:

```python
>>> uint64_max = np.uint64(np.iinfo(np.uint64).max)
>>>
>>> # correct but bad for randint
>>> np.subtract(uint64_max, 1, dtype=np.int64)
-2
>>>
>>> np.subtract(uint64_max, 1, dtype=None)  # oops
1.8446744073709552e+19
```
Not sure what you meant by `# correct but bad for randint`. You're in for a bad time if you ask for an upper bound that doesn't come close to fitting in the container type. What does passing `dtype=None` into randint do normally?
Actually, looking at those examples again, I realize that those aren't valid in the context of `randint`. The only time we could do this subtraction is if `dtype=np.uint64`, in which case the subtraction actually works correctly.

So ignore everything I just said in the previous comment above. I would still rather use `high - 1` (as the current workaround) over `np.subtract`, as that will make it clear to us later that we need to patch this once #8851 is fixed.

`dtype=None` is not valid (per the docs) and will error as an invalid `dtype`.
`np.subtract` has the benefit of extending to broadcasting more naturally.

I'm happy with the current workaround, provided there's a comment indicating that the subtraction too is a workaround.
A comment has been appended to the TODO I wrote earlier.
What is the status of this?
IINM, @eric-wieser has OK'ed the changes, but he wanted feedback from other maintainers before merging this PR.
@eric-wieser Seems reasonable to me. I'll merge this later if you don't beat me to it.
My main qualm with this patch is that we lock ourselves into behaviour that will be difficult to maintain in #6938.
@eric-wieser: What behavior are you referring to?
@eric-wieser Looks like it just fixes a bug in the current version. What am I missing?
That once we have broadcasting, casting the results to
I guess my concern comes down to the fact that at some level this is a workaround for an underlying bug (uint64/int64 operations), which leaves us two options in future:
True that this is a workaround in some respects, but that patch should not affect the numbers generated by this function. I don't see how correctly implementing
Worst case scenario, we decide that
Let's put this in, with the assumption that in the unlikely event supporting this becomes a burden, we can just go through a deprecation cycle.
Fair enough. If that does end up happening, I'd have no problems putting together the patches to deprecate given the unlikelihood. |
I suppose the other option is to raise an error for the bad combinations. |
Thanks @gfyoung . |
In `randint`, we perform integer comparisons to check whether the bounds are violated. With numpy signed integers, those comparisons can be prone to overflow bugs at the extremes. This PR makes the function robust against those bugs.

cc @eric-wieser
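As a rough sketch of the idea (illustrative names only, not numpy's actual internals), the bound validation the PR hardens boils down to something like:

```python
import numpy as np

def check_bounds(low, high, dt=np.int64):
    # Hypothetical sketch: convert to Python ints first (workaround for
    # gh-8851), since mixed numpy uint64/int64 operations promote to
    # float64 and lose precision at the extremes.
    ilow, ihigh = int(low), int(high)
    lowbnd = 0 if dt is np.bool_ else int(np.iinfo(dt).min)
    highbnd = 2 if dt is np.bool_ else int(np.iinfo(dt).max) + 1
    if ilow < lowbnd:
        raise ValueError("low is out of bounds for %s" % np.dtype(dt).name)
    if ihigh > highbnd:
        raise ValueError("high is out of bounds for %s" % np.dtype(dt).name)
    if ilow >= ihigh:
        raise ValueError("low >= high")
    return ilow, ihigh
```

With Python ints carrying the comparison and the `high - 1` subtraction, the extreme values of every supported dtype, including the full `uint64` range, are handled exactly.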