BUG: Buttress handling of extreme values in randint #8846
Conversation
```python
for dt in self.itype:
    lbnd = 0 if dt is np.bool_ else np.iinfo(dt).min
    ubnd = 2 if dt is np.bool_ else np.iinfo(dt).max + 1
```
Would `np.iinfo(np.bool_)` be a reasonable thing for numpy to contain? (EDIT: #8849)
Perhaps, as it does not work at the moment:

```python
>>> np.iinfo(np.bool_)
...
ValueError: Invalid integer data type.
```
So, one other problem to address is generating
Of course you could just ask for a direct 64-bit integer, but it would be nice to cover the case where a user API wants to generate up to some variable without a special case. Thoughts?
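Since `np.iinfo(np.bool_)` raises, the special case above could be factored into a small helper; this is an illustrative sketch (`integer_bounds` is not a numpy API), mirroring the `lbnd`/`ubnd` logic in the test:

```python
import numpy as np

def integer_bounds(dt):
    # Hypothetical helper, not part of numpy: np.iinfo raises ValueError
    # for np.bool_, so booleans need the same special case as in the test.
    if dt is np.bool_:
        return 0, 2
    info = np.iinfo(dt)
    return info.min, info.max + 1
```

This keeps the bool special case in one place instead of repeating it at every call site.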
@eric-wieser:
@gfyoung: Good catch on the typo. I meant generate numbers in
numpy/random/mtrand/mtrand.pyx (outdated)
```python
ilow = int(low)
ihigh = int(high)

if ilow < lowbnd:
```
Related issue that is the root cause here: #8851
Indeed, a lot of "quirks" (i.e. bugs) arose when I started working on this PR in general.
Actually, I think I'd be happy with this patch if you add a TODO comment that this is a workaround for gh-8851.
I have no problems doing that.
@eric-wieser : Got around to looking at this again, and I'm trying to refresh myself on what exactly is blocking this PR from being merged. Let me respond to the suggestions you provided above:
Main blocker is that I want someone else to weigh in here, and that this raises a bunch of other issues*. I agree that
But I think this is in some senses a workaround - we should actually just fix the relational operators here, rather than add a workaround. Would that fix this bug?

*For instance:
This fails because we already get an overflow before we make it to
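The overflow being described here is what the Python-int conversion in the patch sidesteps: Python ints have arbitrary precision, so comparisons and arithmetic on them cannot overflow or silently promote to float. A minimal sketch, with illustrative extreme values:

```python
import numpy as np

# Extreme bounds that trip up mixed numpy uint64/int64 scalar operations.
low = np.int64(np.iinfo(np.int64).min)
high = np.uint64(np.iinfo(np.uint64).max)

# Converting to Python ints first keeps everything exact.
ilow = int(low)
ihigh = int(high)

assert ilow < ihigh              # exact comparison, no float promotion
assert ihigh - 1 == 2**64 - 2    # exact subtraction even at the extreme
```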
Okay. Fair enough. Two points in response:
I like the idea of adding those comparison ufunc loops. Adding new ufunc loops is pretty straightforward. It might complicate someone's theoretical table of how dtype promotion works, but for the comparison operators, I'm okay with that. :-)
```python
except Exception as e:
    raise AssertionError("No error should have been raised, "
                         "but one was with the following "
                         "message:\n\n%s" % str(e))
```
This is a little weird - better to leave the raw exception so that the tests produce a useful stack trace, I think.
I did this because I wanted to know which element in the for-loop, if any, failed. No stack trace is going to tell you that AFAICT.
Maybe we need an `assert_no_exception` or some such.
I don't think `assert_no_exception` would be all that useful - wouldn't that go on every single function call otherwise? Based on @gfyoung's reasoning, parameterized test cases would solve this.
> I did this because I wanted to know which element in the for-loop, if any, failed.

In what way does this do a better job of doing that, given that `dt` is not used in the message?
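One way to get a per-element failure message that actually identifies the failing dtype is to include the loop variable in the re-raised message. A sketch, using a hypothetical `assert_no_error` helper and `np.iinfo` as a stand-in for the call under test:

```python
import numpy as np

def assert_no_error(func, dt):
    # Hypothetical test helper: re-raise with the dtype in the message so
    # the failing iteration of a for-loop is identifiable.
    try:
        return func(dt)
    except Exception as e:
        raise AssertionError("no error expected for dtype %s, but got: %s"
                             % (np.dtype(dt).name, e))

for dt in [np.int8, np.uint8, np.int64, np.uint64]:
    assert_no_error(np.iinfo, dt)  # stand-in for the randint call under test
```

Parameterized test cases achieve the same thing by putting the dtype in the test name instead of the message.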
```diff
     raise ValueError("low >= high")

     with self.lock:
-        ret = randfunc(low, high - 1, size, self.state_address)
+        ret = randfunc(ilow, ihigh - 1, size, self.state_address)
```
If we claim that everything should work if comparisons get fixed, then this line doesn't need to change - right?
That is correct. The following lines will not be needed:

```python
ilow = int(low)
ihigh = int(high)
```
Let me be more clear: I think we should continue to pass `low` and `high` on this line - there is no reason to change to `ilow` and `ihigh` here.
That's not the case:

```python
>>> uint64_max = np.uint64(np.iinfo(np.uint64).max)
>>> uint64_max - 1
1.8446744073709552e+19
```

This PR could not demonstrate more clearly just how fragile `numpy`'s operators are when handling large `uint64` numbers.
Darn, you're right. In that case, the comment above saying "remove these lines" should also point out that we need them to work around subtraction as well. Or just add a comment to that effect next to the subtraction.

Alternatively, `np.subtract(high, 1, dtype=dtype)` would probably be safe here too.
Fair enough. In fact, even your suggestion is not safe:

```python
>>> uint64_max = np.uint64(np.iinfo(np.uint64).max)
>>>
>>> # correct but bad for randint
>>> np.subtract(uint64_max, 1, dtype=np.int64)
-2
>>>
>>> np.subtract(uint64_max, 1, dtype=None)  # oops
1.8446744073709552e+19
```
Not sure what you meant by `# correct but bad for randint`. You're in for a bad time if you ask for an upper bound that doesn't come close to fitting in the container type. What does passing `dtype=None` into randint do normally?
Actually, looking at those examples again, I realize that those aren't valid in the context of `randint`. The only time we could do this subtraction is if `dtype=np.uint64`, in which case the subtraction actually works correctly.

So ignore everything I just said in the previous comment above. I would still rather use `high - 1` (as the current workaround) over `np.subtract`, as that will make it clear to us later that we need to patch this once #8851 is fixed.

`dtype=None` is not valid (per the docs) and will error as an invalid `dtype`.
`np.subtract` has the benefit of extending to broadcasting more naturally.

I'm happy with the current workaround, provided there's a comment indicating that the subtraction too is a workaround.
A comment has been appended to the TODO I wrote earlier.
What is the status of this?
IINM, @eric-wieser has OK'ed the changes, but he wanted feedback from other maintainers before merging this PR.
@eric-wieser Seems reasonable to me. I'll merge this later if you don't beat me to it.
My main qualm with this patch is that we lock ourselves into behaviour that will be difficult to maintain in #6938.
@eric-wieser: What behavior are you referring to?
@eric-wieser Looks like it just fixes a bug in the current version. What am I missing?
That once we have broadcasting, casting the results to
I guess my concern comes down to the fact that at some level this is a workaround for an underlying bug (uint64/int64 operations), which leaves us two options in future:
True that this is a workaround in some respects, but that patch should not affect the numbers generated by this function. I don't see how correctly implementing
Worst case scenario, we decide that
Let's put this in, with the assumption that in the unlikely event supporting this becomes a burden, we can just go through a deprecation cycle.
Fair enough. If that does end up happening, I'd have no problems putting together the patches to deprecate given the unlikelihood. |
I suppose the other option is to raise an error for the bad combinations. |
Thanks @gfyoung . |
In `randint`, we perform integer comparisons to check whether the bounds are violated. With numpy signed integers, those comparisons can be prone to overflow bugs at the extremes. This PR makes the function robust against those bugs.

cc @eric-wieser
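As a rough sketch of the idea (illustrative names only, not numpy's actual internals), the bound validation the PR hardens boils down to something like:

```python
import numpy as np

def check_bounds(low, high, dt=np.int64):
    # Hypothetical sketch: convert to Python ints first (workaround for
    # gh-8851), since mixed numpy uint64/int64 operations promote to
    # float64 and lose precision at the extremes.
    ilow, ihigh = int(low), int(high)
    lowbnd = 0 if dt is np.bool_ else int(np.iinfo(dt).min)
    highbnd = 2 if dt is np.bool_ else int(np.iinfo(dt).max) + 1
    if ilow < lowbnd:
        raise ValueError("low is out of bounds for %s" % np.dtype(dt).name)
    if ihigh > highbnd:
        raise ValueError("high is out of bounds for %s" % np.dtype(dt).name)
    if ilow >= ihigh:
        raise ValueError("low >= high")
    return ilow, ihigh
```

With Python ints carrying the comparison and the `high - 1` subtraction, the extreme values of every supported dtype, including the full `uint64` range, are handled exactly.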