Allow for larger int64s in random.randint #6824

gfyoung · 2015-12-12T04:57:47Z

Addresses enhancement requested in #6812 in which attempts to generate random, large 64-bit integers was causing overflow issues in Windows. This PR allows such generation to be done.

njsmith · 2015-12-12T05:04:06Z

I don't remember whether we consider it legit to use long long as a way to specify int64 -- does anyone know?

Could you add a test? And either run that test on Windows and report back, or if you don't have convenient access to a Windows box then speak up and we'll hassle @jreback or someone into doing it?

njsmith · 2015-12-12T05:07:41Z

Also, as a tip for the future: it's nice to write something like Fixes gh-6812 in your commit message, because (1) this documents which bug is being fixed, so someone reading the git log later can find it for reference, and (2) if you write Close gh-XXX or Fixes gh-XXX in your commit message, then github is smart enough to automatically close the original bug when the fix gets merged; otherwise someone has to remember to go back and close it manually.

(You can also write #6812 instead of gh-6812 -- their effect is identical -- but the latter is kinda nice because it will remain unambiguous if we someday change bug trackers, and strings like #6812 become ambiguous between the old bug tracker's numbering and the new bug tracker's numbering. This has already happened once...)

jaimefrio · 2015-12-12T05:10:43Z

For this to work, I think you need to make diff an unsigned long long, but that would require making changes to rk_interval as well.

gfyoung · 2015-12-12T06:54:20Z

@jaimefrio : Ah, I think you are right. However, if we propagate, I think that would also require changing rk_random and rk_ulong, but I do not fully understand the internals of the former. Might someone be able to explain?

Secondly, does anyone have an idea as to why only one build managed to pass the test I wrote, but the other builds failed? Does it have to do with what was discussed in my first paragraph?

njsmith · 2015-12-12T07:03:07Z

I haven't analyzed in detail, but the reason that that one build passed almost certainly has something to do with that being the one build that we run in 32-bit mode (so for that build long = int = 32 bit, for all the other builds long = 64 bit, int = 32 bit).

gfyoung · 2015-12-12T07:43:54Z

Okay. That seems to make sense. From what I understand, that seems to be related to the comments made by @jaimefrio. I think before I can make any further changes (to the C code most likely), I'll need to have a better understanding of the internals though.

charris · 2015-12-16T18:55:33Z

Close and reopen to restart tests.

gfyoung · 2015-12-16T20:26:32Z

@charris : Thanks!

I'm still working on understanding this random number generation thing. If anyone has any input/ideas they would like to contribute, that would be great!

gfyoung · 2015-12-17T23:14:29Z

Undid a good portion of my changes and left just the unit test because all of these long to long long conversions are confusing me (there are a lot more than I had expected). Need a fresh start! At least I know the unit test is correct for the builds where long is 64-bit.

gfyoung · 2015-12-18T23:54:03Z

Many long long conversion attempts later, I finally got Windows (32-bit) to pass the test I wrote. Hopefully I didn't break anything else though. Travis will soon tell me.

gfyoung · 2015-12-19T09:04:11Z

Travis is finally happy! 😄 Please review carefully! I did a lot of recasting all over the place, so I could have inadvertently broken a function that Travis failed to catch, much like the issue I was trying to fix in this PR.

pv · 2015-12-21T09:22:21Z

I think this is wrong (iow, the change of datatype in the rng code should not be made): this is the 32-bit Mersenne twister generator --- it does not become 64-bit just by increasing the data type. The 64-bit MT is a different rng.

The rng code should not be changed. The issue is only in the randint / rk_interval methods, and only those need modification.

pv · 2015-12-21T09:41:12Z

numpy/random/mtrand/randomkit.c

 rk_ulong(rk_state *state)
 {
-#if ULONG_MAX <= 0xffffffffUL
+#if ULLONG_MAX <= 0xffffffffULL
    return rk_random(state);
 #else
    return (rk_random(state) << 32) | (rk_random(state));


To get this part to work with long long / int64 (after reverting the change of datatype in the rng core), you need a cast: ((unsigned long long)rk_random(state) << 32) | ...

pv · 2015-12-21T09:54:58Z

The generator only produces 32-bit numbers: changing the data types in rk_random does not change this.

rk_random does not need to return long long in order for rk_interval to return long long. What's done there is calling rk_ulong when the required integers are larger than 32 bit. This is necessary because the generator produces only 32-bit numbers, even if sizeof(long) is larger than that.

gfyoung · 2015-12-21T09:57:48Z

The maximum long in 32-bit Python on my machine is np.iinfo(np.int32).max because passing anything higher as the upper bound in randint gives the OverflowError. This value turns out to be significantly less than RK_MAX. Thus, if we maintain long as the data type for elements in the key array and therefore for variables in rk_random, that would contradict the documentation for rk_random, which says it generates random numbers between 0 and RK_MAX inclusive, if I'm not mistaken.

rkern · 2015-12-21T10:16:35Z

rk_random() returns an unsigned long which does indeed generate numbers up through RK_MAX. The problem with RandomState.randint() is that it takes signed longs for the bounds and does signed long arithmetic. You only need to modify randint(), not the core PRNG.

gfyoung · 2015-12-21T10:18:36Z

Okay, but how are such values generated if long is unable to take on values up to RK_MAX, which is the case on my machine for example?

rkern · 2015-12-21T10:19:23Z

long is signed. unsigned long is unsigned.

gfyoung · 2015-12-21T10:21:23Z

I am referring to the rk_random method. I'm not understanding how it can generate values up to RK_MAX inclusive if none of the variables (which are all long) in rk_random are never anything remotely close to that.

rkern · 2015-12-21T10:23:49Z

They aren't; they are unsigned long. https://github.com/numpy/numpy/blob/master/numpy/random/mtrand/randomkit.c#L211-L214

gfyoung · 2015-12-21T10:30:51Z

Oh right, right! My bad. For some reason I kept reading unsigned long as long in your replies @rkern , a good indicator that my eyes have been staring at the computer screen for too long now.

gfyoung · 2015-12-21T22:46:36Z

@rkern , @pv: I reverted those changes to the rng, and Travis seems to be happy with them. If there's anything else that needs reverting, let me know!

pv · 2015-12-22T09:18:36Z

numpy/random/mtrand/mtrand.pyx

@@ -703,13 +703,13 @@ cdef class RandomState:

        """
        cdef ndarray state "arrayObject_state"
-        state = <ndarray>np.empty(624, np.uint)
+        state = <ndarray>np.empty(624, np.uint64)


Needs to be reverted

pv · 2015-12-22T09:23:52Z

This changes the return type and random stream on 32-bit platforms, which can be considered backward incompatible. I'd suggest to continue instead the approach in the other PR --- adding a dtype parameter seems a better way to do this.

gfyoung · 2015-12-22T09:35:57Z

Adding the dtype parameter only I don't believe addresses the issue in this PR though because we still need to recast lo, hi, and rv as long long to prevent the OverflowError. Thus, to ensure that no such errors are generated, the return types I think should be changed in this PR to fix the issue brought up. My other PR can add in the dtype and address the "backwards incompatibility" issue you are bringing up here as well as the original issue for that PR. From an organizational standpoint, that seems to make sense in my head. How does that sound?

gfyoung · 2015-12-23T03:04:40Z

@pv: Reverted more of the code, and Travis is still happy. What are your thoughts about what I proposed for the PR organization?

rkern · 2015-12-23T11:08:22Z

You can't change rk_long and rk_ulong like this. They must return long and unsigned long respectively; it's in the contract specified by their name, much less backwards compatibility. You can add a new rk_uint64 if you like and make an rk_uint64_interval that uses it. I do recommend using a specific-bitwidth type instead of unsigned long long. In retrospect, I think it was a mistake to use longs everywhere in the API. In my defense, 64-bit systems were still rare at the time, and I did not have a 64-bit system to experiment on to learn what the issues are.

Making RandomState.randint() use these new functions is a little more complicated. It might have to wait for the method-versioning in #5911 to go in since it will change the stream. You may be able to get away with checking the range of the inputs; if they fit into the range served by rk_interval, use it, otherwise use rk_uint64_interval (under the theory that you would get an OverflowError in this conditions with the current code). I'm not sure about the cross-platform ramifications of that, but it's entirely possible that randint() is already not cross-platform. If rk_uint64_interval is implemented correctly (i.e. matches what rk_interval does on 64-bit-long systems even when using a 32-bit-long system), it's also possible that it will restore cross-platformness.

I do apologize, but this is unfortunately not the most new-contributor-friendly part of the code to start with. There are subtle and strict backwards-compatibility requirements that are unique to the PRNG subsystem.

gfyoung · 2015-12-23T11:43:57Z

@rkern : No worries. Given the subtleties though you have brought up, it would have been preferable if they had been brought up earlier. That may have avoided a lot of the confusion that transpired over the past week or so with this PR.

gfyoung · 2015-12-23T13:13:29Z

My alternative idea was to rename the functions to describe what they are doing as a result of my changes (e.g. ulong --> ulonglong). However, if the interest is to maintain the old code as much as possible, then building out identical methods for 32-bit long, while undesirable due to the boilerplate, seems to be the only way to go as you proposed. I'm not sure how the PR you mentioned would impact this choice.

It also occurs to me (and perhaps this is a stupid question, but I haven't found a convincing answer online): why is it not possible to run the 32-bit Mersenne Twister algorithm with 64-bit integers? As long as the internal values are constrained to having the upper 32 bits = 0, I think the numbers generated will be identical because it's shifting / bit-wise operations with numbers whose values also have the upper 32 bits = 0?

I do believe you are correct in saying that randint is cross-platform, but only for a certain range of inputs ([np.iinfo(np.int32).min, np.iinfo(np.int32).max]). Beyond that, 32-bit long architectures will crash, while 64-bit long ones will not. From a memory standpoint, if I'm not getting my semantics incorrect, any attempt to accommodate 32-bit long architecture for 64-bit inputs will already be breaking cross-platform uniformity.

rkern · 2015-12-23T13:41:27Z

It is possible to modify the MT code that way, but it doesn't solve any problem. It won't make randint() work for large numbers. You still need to assemble two draws from the MT core PRNG to assemble a full-range uint64.

As for the cross-platformness, this is what I meant by "subtle" backwards compatibility requirements. If the code currently raises an exception with certain inputs, then we do allow changes that will make those inputs work.

gfyoung · 2015-12-23T20:55:02Z

Could you define what you mean by "large"? Actually, as the current implementation stands, you actually cannot generate numbers up to np.iinfo(np.uint64).max. In fact, on 64-bit Python, I cannot input anything above np.iinfo(np.int64).max, as I will get the OverflowError described in the issue.

Actually, in light of this, I don't even think my changes will suffice to take in inputs like as large as np.iinfo(np.uint64).max Is that the type of inputs randint should be able to take?

rkern · 2015-12-23T21:35:18Z

I meant that it doesn't change anything with respect to the problem that you are trying to fix. The types internally used by the core PRNG are simply irrelevant. The problem that you seem to want to fix can be fixed solely within the confines of randint() and perhaps some new functions.

By "large", I meant np.iinfo(np.int64).max on 32-bit systems. randint() thus far has been defined to work on signed integers, not unsigned ones. That should probably continue to be the case for backwards compatibility reasons. If the user specifically requests an unsigned dtype via #6869, then you can go ahead and generate numbers up through np.iinfo(whatever_requested_dtype).max.

gfyoung · 2015-12-23T21:43:47Z

Okay, that makes sense. Building out new functions should not be too bad then I think since it is essentially boilerplate with different types (long long instead of long).

gfyoung · 2015-12-24T01:25:01Z

@rkern : Refactored with the new methods, and Travis is happy.

Addresses bug in random integer generation in which the bounds for the generation are stored as longs. For 32-bit Python, longs are only 32-bit. Random integer generation has to acommodate integers up to 64-bit. Thus, if a 33-bit integer is passed in for one of the bounds, an OverflowError is thrown. This commit adds '64-bit' versions of relevant random integer generation methods that are called if the original randint method overflows. Closes gh-6812.

homu · 2015-12-28T22:31:37Z

☔ The latest upstream changes (presumably #6885) made this pull request unmergeable. Please resolve the merge conflicts.

gfyoung · 2015-12-28T22:42:06Z

@homu : I'm actually going to refrain from rebasing until we can figure out exactly what's going on with this random integer generation thing. It may require just restarting from scratch given the refactoring I did in #6885.

rkern · 2015-12-28T23:04:06Z

@gfyoung @homu is just a bot. You don't have to explain yourself to it. ;-)

gfyoung · 2015-12-29T00:18:00Z

Whoops! Did not know that. 😄

Though the comment does seem relevant nonetheless given the discussion going on.

charris · 2016-01-01T19:27:49Z

@gfyoung I'm going top close this as #6910 seems to be up and running. Just needs tests now.

gfyoung · 2016-01-01T21:27:30Z

@charris : Would you mind adding my test from this PR to yours? That test addresses the issue brought up in #6812.

charris · 2016-01-01T21:36:16Z

@gfyoung I'll be adding tests at the limits for all the types, so that will get covered in the process.

gfyoung · 2016-01-01T21:39:09Z

@charris : Oh, okay. Cool, that works. Thanks!

charris added 01 - Enhancement component: numpy.random labels Dec 12, 2015

charris closed this Dec 16, 2015

charris reopened this Dec 16, 2015

gfyoung mentioned this pull request Dec 18, 2015

Stricter Argument Checking for Flatten Methods #6823

Merged

gfyoung mentioned this pull request Dec 20, 2015

Add dtypes for random number generation #6869

Closed

pv reviewed Dec 21, 2015
View reviewed changes

pv reviewed Dec 22, 2015
View reviewed changes

charris mentioned this pull request Dec 23, 2015

ERROR: test_mem_overlap.test_diophantine_fuzz #6881

Closed

gfyoung mentioned this pull request Dec 24, 2015

ENH: Expand Argument Range of random_integers #6885

Merged

gfyoung mentioned this pull request Dec 30, 2015

BUG: Broadcast Arguments in Random Integer Generation #6902

Closed

charris mentioned this pull request Dec 30, 2015

ENH: Add dtype argument to random.randint. #6910

Merged

charris closed this Jan 1, 2016

gfyoung deleted the rand_long_overflow branch January 1, 2016 21:29

ryanpeach mentioned this pull request Oct 12, 2017

Allow multiple versions of RandomState: init' work. #5911

Closed

Uh oh!

Allow for larger int64s in random.randint #6824

Allow for larger int64s in random.randint #6824

Uh oh!

Conversation

gfyoung commented Dec 12, 2015

Uh oh!

njsmith commented Dec 12, 2015

Uh oh!

njsmith commented Dec 12, 2015

Uh oh!

jaimefrio commented Dec 12, 2015

Uh oh!

gfyoung commented Dec 12, 2015

Uh oh!

njsmith commented Dec 12, 2015

Uh oh!

gfyoung commented Dec 12, 2015

Uh oh!

charris commented Dec 16, 2015

Uh oh!

gfyoung commented Dec 16, 2015

Uh oh!

gfyoung commented Dec 17, 2015

Uh oh!

gfyoung commented Dec 18, 2015

Uh oh!

gfyoung commented Dec 19, 2015

Uh oh!

pv commented Dec 21, 2015

Uh oh!

pv Dec 21, 2015

Choose a reason for hiding this comment

Uh oh!

pv commented Dec 21, 2015

Uh oh!

gfyoung commented Dec 21, 2015

Uh oh!

rkern commented Dec 21, 2015

Uh oh!

gfyoung commented Dec 21, 2015

Uh oh!

rkern commented Dec 21, 2015

Uh oh!

gfyoung commented Dec 21, 2015

Uh oh!

rkern commented Dec 21, 2015

Uh oh!

gfyoung commented Dec 21, 2015

Uh oh!

gfyoung commented Dec 21, 2015

Uh oh!

pv Dec 22, 2015

Choose a reason for hiding this comment

Uh oh!

pv commented Dec 22, 2015

Uh oh!

gfyoung commented Dec 22, 2015

Uh oh!

gfyoung commented Dec 23, 2015

Uh oh!

rkern commented Dec 23, 2015

Uh oh!

gfyoung commented Dec 23, 2015

Uh oh!

gfyoung commented Dec 23, 2015

Uh oh!

rkern commented Dec 23, 2015

Uh oh!

gfyoung commented Dec 23, 2015

Uh oh!

rkern commented Dec 23, 2015

Uh oh!

gfyoung commented Dec 23, 2015

Uh oh!

gfyoung commented Dec 24, 2015

Uh oh!

homu commented Dec 28, 2015

Uh oh!

gfyoung commented Dec 28, 2015