Skip to content

Allow for larger int64s in random.randint #6824

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Allow for larger int64s in random.randint #6824

wants to merge 1 commit into from

Conversation

gfyoung
Copy link
Contributor

@gfyoung gfyoung commented Dec 12, 2015

Addresses enhancement requested in #6812 in which attempts to generate random, large 64-bit integers was causing overflow issues in Windows. This PR allows such generation to be done.

@njsmith
Copy link
Member

njsmith commented Dec 12, 2015

I don't remember whether we consider it legit to use long long as a way to specify int64 -- does anyone know?

Could you add a test? And either run that test on Windows and report back, or if you don't have convenient access to a Windows box then speak up and we'll hassle @jreback or someone into doing it?

@njsmith
Copy link
Member

njsmith commented Dec 12, 2015

Also, as a tip for the future: it's nice to write something like Fixes gh-6812 in your commit message, because (1) this documents which bug is being fixed, so someone reading the git log later can find it for reference, and (2) if you write Close gh-XXX or Fixes gh-XXX in your commit message, then github is smart enough to automatically close the original bug when the fix gets merged; otherwise someone has to remember to go back and close it manually.

(You can also write #6812 instead of gh-6812 -- their effect is identical -- but the latter is kinda nice because it will remain unambiguous if we someday change bug trackers, and strings like #6812 become ambiguous between the old bug tracker's numbering and the new bug tracker's numbering. This has already happened once...)

@jaimefrio
Copy link
Member

For this to work, I think you need to make diff an unsigned long long, but that would require making changes to rk_interval as well.

@gfyoung
Copy link
Contributor Author

gfyoung commented Dec 12, 2015

@jaimefrio : Ah, I think you are right. However, if we propagate, I think that would also require changing rk_random and rk_ulong, but I do not fully understand the internals of the former. Might someone be able to explain?

Secondly, does anyone have an idea as to why only one build managed to pass the test I wrote, but the other builds failed? Does it have to do with what was discussed in my first paragraph?

@njsmith
Copy link
Member

njsmith commented Dec 12, 2015

I haven't analyzed in detail, but the reason that that one build passed almost certainly has something to do with that being the one build that we run in 32-bit mode (so for that build long = int = 32 bit, for all the other builds long = 64 bit, int = 32 bit).

@gfyoung
Copy link
Contributor Author

gfyoung commented Dec 12, 2015

Okay. That seems to make sense. From what I understand, that seems to be related to the comments made by @jaimefrio. I think before I can make any further changes (to the C code most likely), I'll need to have a better understanding of the internals though.

@charris
Copy link
Member

charris commented Dec 16, 2015

Close and reopen to restart tests.

@charris charris closed this Dec 16, 2015
@charris charris reopened this Dec 16, 2015
@gfyoung
Copy link
Contributor Author

gfyoung commented Dec 16, 2015

@charris : Thanks!

I'm still working on understanding this random number generation thing. If anyone has any input/ideas they would like to contribute, that would be great!

@gfyoung
Copy link
Contributor Author

gfyoung commented Dec 17, 2015

Undid a good portion of my changes and left just the unit test because all of these long to long long conversions are confusing me (there are a lot more than I had expected). Need a fresh start! At least I know the unit test is correct for the builds where long is 64-bit.

@gfyoung
Copy link
Contributor Author

gfyoung commented Dec 18, 2015

Many long long conversion attempts later, I finally got Windows (32-bit) to pass the test I wrote. Hopefully I didn't break anything else though. Travis will soon tell me.

@gfyoung
Copy link
Contributor Author

gfyoung commented Dec 19, 2015

Travis is finally happy! 😄 Please review carefully! I did a lot of recasting all over the place, so I could have inadvertently broken a function that Travis failed to catch, much like the issue I was trying to fix in this PR.

@pv
Copy link
Member

pv commented Dec 21, 2015

I think this is wrong (iow, the change of datatype in the rng code should not be made): this is the 32-bit Mersenne twister generator --- it does not become 64-bit just by increasing the data type. The 64-bit MT is a different rng.

The rng code should not be changed. The issue is only in the randint / rk_interval methods, and only those need modification.

rk_ulong(rk_state *state)
{
#if ULONG_MAX <= 0xffffffffUL
#if ULLONG_MAX <= 0xffffffffULL
return rk_random(state);
#else
return (rk_random(state) << 32) | (rk_random(state));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To get this part to work with long long / int64 (after reverting the change of datatype in the rng core), you need a cast: ((unsigned long long)rk_random(state) << 32) | ...

@pv
Copy link
Member

pv commented Dec 21, 2015

The generator only produces 32-bit numbers: changing the data types in rk_random does not change this.

rk_random does not need to return long long in order for rk_interval to return long long. What's done there is calling rk_ulong when the required integers are larger than 32 bit. This is necessary because the generator produces only 32-bit numbers, even if sizeof(long) is larger than that.

@gfyoung
Copy link
Contributor Author

gfyoung commented Dec 21, 2015

The maximum long in 32-bit Python on my machine is np.iinfo(np.int32).max because passing anything higher as the upper bound in randint gives the OverflowError. This value turns out to be significantly less than RK_MAX. Thus, if we maintain long as the data type for elements in the key array and therefore for variables in rk_random, that would contradict the documentation for rk_random, which says it generates random numbers between 0 and RK_MAX inclusive, if I'm not mistaken.

@rkern
Copy link
Member

rkern commented Dec 21, 2015

rk_random() returns an unsigned long which does indeed generate numbers up through RK_MAX. The problem with RandomState.randint() is that it takes signed longs for the bounds and does signed long arithmetic. You only need to modify randint(), not the core PRNG.

@gfyoung
Copy link
Contributor Author

gfyoung commented Dec 21, 2015

Okay, but how are such values generated if long is unable to take on values up to RK_MAX, which is the case on my machine for example?

@rkern
Copy link
Member

rkern commented Dec 21, 2015

long is signed. unsigned long is unsigned.

@gfyoung
Copy link
Contributor Author

gfyoung commented Dec 21, 2015

I am referring to the rk_random method. I'm not understanding how it can generate values up to RK_MAX inclusive if none of the variables (which are all long) in rk_random are never anything remotely close to that.

@rkern
Copy link
Member

rkern commented Dec 21, 2015

@gfyoung
Copy link
Contributor Author

gfyoung commented Dec 21, 2015

Oh right, right! My bad. For some reason I kept reading unsigned long as long in your replies @rkern , a good indicator that my eyes have been staring at the computer screen for too long now.

@gfyoung
Copy link
Contributor Author

gfyoung commented Dec 21, 2015

@rkern , @pv: I reverted those changes to the rng, and Travis seems to be happy with them. If there's anything else that needs reverting, let me know!

@@ -703,13 +703,13 @@ cdef class RandomState:

"""
cdef ndarray state "arrayObject_state"
state = <ndarray>np.empty(624, np.uint)
state = <ndarray>np.empty(624, np.uint64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be reverted

@pv
Copy link
Member

pv commented Dec 22, 2015

This changes the return type and random stream on 32-bit platforms, which can be considered backward incompatible. I'd suggest to continue instead the approach in the other PR --- adding a dtype parameter seems a better way to do this.

@gfyoung
Copy link
Contributor Author

gfyoung commented Dec 22, 2015

Adding the dtype parameter only I don't believe addresses the issue in this PR though because we still need to recast lo, hi, and rv as long long to prevent the OverflowError. Thus, to ensure that no such errors are generated, the return types I think should be changed in this PR to fix the issue brought up. My other PR can add in the dtype and address the "backwards incompatibility" issue you are bringing up here as well as the original issue for that PR. From an organizational standpoint, that seems to make sense in my head. How does that sound?

@gfyoung
Copy link
Contributor Author

gfyoung commented Dec 23, 2015

@pv: Reverted more of the code, and Travis is still happy. What are your thoughts about what I proposed for the PR organization?

@rkern
Copy link
Member

rkern commented Dec 23, 2015

You can't change rk_long and rk_ulong like this. They must return long and unsigned long respectively; it's in the contract specified by their name, much less backwards compatibility. You can add a new rk_uint64 if you like and make an rk_uint64_interval that uses it. I do recommend using a specific-bitwidth type instead of unsigned long long. In retrospect, I think it was a mistake to use longs everywhere in the API. In my defense, 64-bit systems were still rare at the time, and I did not have a 64-bit system to experiment on to learn what the issues are.

Making RandomState.randint() use these new functions is a little more complicated. It might have to wait for the method-versioning in #5911 to go in since it will change the stream. You may be able to get away with checking the range of the inputs; if they fit into the range served by rk_interval, use it, otherwise use rk_uint64_interval (under the theory that you would get an OverflowError in this conditions with the current code). I'm not sure about the cross-platform ramifications of that, but it's entirely possible that randint() is already not cross-platform. If rk_uint64_interval is implemented correctly (i.e. matches what rk_interval does on 64-bit-long systems even when using a 32-bit-long system), it's also possible that it will restore cross-platformness.

I do apologize, but this is unfortunately not the most new-contributor-friendly part of the code to start with. There are subtle and strict backwards-compatibility requirements that are unique to the PRNG subsystem.

@gfyoung
Copy link
Contributor Author

gfyoung commented Dec 23, 2015

@rkern : No worries. Given the subtleties though you have brought up, it would have been preferable if they had been brought up earlier. That may have avoided a lot of the confusion that transpired over the past week or so with this PR.

@gfyoung
Copy link
Contributor Author

gfyoung commented Dec 23, 2015

My alternative idea was to rename the functions to describe what they are doing as a result of my changes (e.g. ulong --> ulonglong). However, if the interest is to maintain the old code as much as possible, then building out identical methods for 32-bit long, while undesirable due to the boilerplate, seems to be the only way to go as you proposed. I'm not sure how the PR you mentioned would impact this choice.

It also occurs to me (and perhaps this is a stupid question, but I haven't found a convincing answer online): why is it not possible to run the 32-bit Mersenne Twister algorithm with 64-bit integers? As long as the internal values are constrained to having the upper 32 bits = 0, I think the numbers generated will be identical because it's shifting / bit-wise operations with numbers whose values also have the upper 32 bits = 0?

I do believe you are correct in saying that randint is cross-platform, but only for a certain range of inputs ([np.iinfo(np.int32).min, np.iinfo(np.int32).max]). Beyond that, 32-bit long architectures will crash, while 64-bit long ones will not. From a memory standpoint, if I'm not getting my semantics incorrect, any attempt to accommodate 32-bit long architecture for 64-bit inputs will already be breaking cross-platform uniformity.

@rkern
Copy link
Member

rkern commented Dec 23, 2015

It is possible to modify the MT code that way, but it doesn't solve any problem. It won't make randint() work for large numbers. You still need to assemble two draws from the MT core PRNG to assemble a full-range uint64.

As for the cross-platformness, this is what I meant by "subtle" backwards compatibility requirements. If the code currently raises an exception with certain inputs, then we do allow changes that will make those inputs work.

@gfyoung
Copy link
Contributor Author

gfyoung commented Dec 23, 2015

Could you define what you mean by "large"? Actually, as the current implementation stands, you actually cannot generate numbers up to np.iinfo(np.uint64).max. In fact, on 64-bit Python, I cannot input anything above np.iinfo(np.int64).max, as I will get the OverflowError described in the issue.

Actually, in light of this, I don't even think my changes will suffice to take in inputs like as large as np.iinfo(np.uint64).max Is that the type of inputs randint should be able to take?

@rkern
Copy link
Member

rkern commented Dec 23, 2015

I meant that it doesn't change anything with respect to the problem that you are trying to fix. The types internally used by the core PRNG are simply irrelevant. The problem that you seem to want to fix can be fixed solely within the confines of randint() and perhaps some new functions.

By "large", I meant np.iinfo(np.int64).max on 32-bit systems. randint() thus far has been defined to work on signed integers, not unsigned ones. That should probably continue to be the case for backwards compatibility reasons. If the user specifically requests an unsigned dtype via #6869, then you can go ahead and generate numbers up through np.iinfo(whatever_requested_dtype).max.

@gfyoung
Copy link
Contributor Author

gfyoung commented Dec 23, 2015

Okay, that makes sense. Building out new functions should not be too bad then I think since it is essentially boilerplate with different types (long long instead of long).

@gfyoung
Copy link
Contributor Author

gfyoung commented Dec 24, 2015

@rkern : Refactored with the new methods, and Travis is happy.

Addresses bug in random integer generation in which
the bounds for the generation are stored as longs.
For 32-bit Python, longs are only 32-bit. Random
integer generation has to acommodate integers up to
64-bit. Thus, if a 33-bit integer is passed in for
one of the bounds, an OverflowError is thrown. This
commit adds '64-bit' versions of relevant random
integer generation methods that are called if the
original randint method overflows.

Closes gh-6812.
@homu
Copy link
Contributor

homu commented Dec 28, 2015

☔ The latest upstream changes (presumably #6885) made this pull request unmergeable. Please resolve the merge conflicts.

@gfyoung
Copy link
Contributor Author

gfyoung commented Dec 28, 2015

@homu : I'm actually going to refrain from rebasing until we can figure out exactly what's going on with this random integer generation thing. It may require just restarting from scratch given the refactoring I did in #6885.

@rkern
Copy link
Member

rkern commented Dec 28, 2015

@gfyoung @homu is just a bot. You don't have to explain yourself to it. ;-)

@gfyoung
Copy link
Contributor Author

gfyoung commented Dec 29, 2015

Whoops! Did not know that. 😄

Though the comment does seem relevant nonetheless given the discussion going on.

@charris
Copy link
Member

charris commented Jan 1, 2016

@gfyoung I'm going top close this as #6910 seems to be up and running. Just needs tests now.

@charris charris closed this Jan 1, 2016
@gfyoung
Copy link
Contributor Author

gfyoung commented Jan 1, 2016

@charris : Would you mind adding my test from this PR to yours? That test addresses the issue brought up in #6812.

@gfyoung gfyoung deleted the rand_long_overflow branch January 1, 2016 21:29
@charris
Copy link
Member

charris commented Jan 1, 2016

@gfyoung I'll be adding tests at the limits for all the types, so that will get covered in the process.

@gfyoung
Copy link
Contributor Author

gfyoung commented Jan 1, 2016

@charris : Oh, okay. Cool, that works. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants