Skip to content

ENH: Add dtype argument to random.randint. #6910

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jan 3, 2016

Conversation

charris
Copy link
Member

@charris charris commented Dec 30, 2015

This is a WIP, in particular, the examples need fixing, tests need to be added, and a decision should be made whether to access these functions through a dtype argument in the current random_integers function. If the latter is chosen rk_functions for uint8, uint16, and maybe bool need to be added for efficiency, on the other hand, less new documentation needs to be written ;)

The following dtypes are now available from the new dtype argument added to randint.

* np.bool,
* np.int8, np.uint8,
* np.int16, np.uint16,
* np.int32, np.uint32,
* np.int64, np.uint64,
* np.int_ (long), np.intp

Tests for all the new functionality are needed, but things seem to work. It should also be almost trivial to template the code added to mtrant.pyx, the 9 added functions only differ in using different types.

When merged #6812 will be closed and the integer part of #6790 will be done..

Alternate PR's addressing this are #6824 and #6869, but I like this one as far as it goes.

@charris charris added this to the 1.11.0 release milestone Dec 30, 2015
@gfyoung
Copy link
Contributor

gfyoung commented Dec 30, 2015

There seems to be a lot of boilerplate. Is it possible to condense? Especially with other issues like broadcasting (see #6902) or dtype (see #6869), it seems like this could easily blow up with so many functions implementing nearly the same thing but with 32-bit or 64-bit int.

Also, is there a way to test / demonstrate the issue with the random stream that is apparently rampant in the current implementation of random_integers? That would be quite useful (at least for my sake, given that I have been looking at this for so long) to illustrate why these changes are necessary.

"""
random_int32(low, high=None, size=None)

Return random npy_intp integers between `low` and `high`, inclusive.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int32?

@charris
Copy link
Member Author

charris commented Dec 30, 2015

@gfyoung You can implement both the dtype and broadcasting using the added functions. You need both 32 and 64 to implement both intp and int with the same functions, the same functions added to randomkit (which uses unsigned) can also be used for unsigned. To do a efficient job with smaller integer types more functions need to be added to randomkit for efficient implementation, i.e., there are two int16 in one int32 sample.

Not clear on what you consider boiler plate besides the documentation...

@gfyoung
Copy link
Contributor

gfyoung commented Dec 30, 2015

@charris : Regarding dtype and broadcasting, that actually wasn't my question. I'm not worried about ability to implement. No, no, I know both are certainly possible, broadcasting to say the least given how many times it has been done already in mtrand.pyx.

My concern was that won't you have to implement everything at least twice now in your versions? Maybe I'm getting my semantics / definitions wrong, but your two functions are boilerplate given how similar they are to each other.

Also, you say "you need" to implement the functions separately for 32-bit and 64-bit. Again, is there a way to test or illustrate the shortcomings? Also, why is it that you are using the 32-bit Mersenne Twister for npy_uint64? From what I remember, @rkern said on my PR that this should not be done, and that there is a completely separate algorithm for 64-bit.

I unfortunately can't pull down the branch right now to try it myself, but how does the change address the failed method call in #6812 with np.iinfo(np.int64).max? Do you have to call random_int64, or does random_intp suffice?

@rkern
Copy link
Member

rkern commented Dec 30, 2015

That's not what I said. What I said was that just changing unsigned long to unsigned long long in the core PRNG would not help you. Instead, I suggested exactly what @charris implemented here, to create rk_uint64 that generates a uint64 from two draws from the core PRNG.

@charris
Copy link
Member Author

charris commented Dec 30, 2015

@gfyoung You need 64 bit random numbers in order to generate 64 bit random numbers. OTOH, it is inefficient to use 64 bit numbers to generate 32 random numbers. The two functions are similar, but not the same. Note that rk_random_uint64 will use 32 or 64 bit random samples depending on the size of the range, and we need to preserve that in order to retain backwards compatibility.

It would help if you would make specific comments about specific lines/functions. It it difficult to respond to generalities.

@gfyoung
Copy link
Contributor

gfyoung commented Dec 31, 2015

@rkern: Oh, sorry! You're right. Memory failed me there. By the way in the original code (i.e master branch), doesn't rk_ulong do exactly that, which is called from rk_interval, which in turn is called by random_integers? If not, then how was it then that my added test in #6824 passed both on my machine (32-bit) and on 64-bit Linux on Travis?

@charris: Apologies that I haven't been able to give more specific pointers. While I realize this is WIP, my comments have been directed at the overall organization and structure of the code. My concern was directed towards the future additions of dtype and broadcasting, which would have to be added in both functions. I'm just trying to understand why you organized the code the way you did, if that's okay.

Also, if you could address the question regarding the np.iinfo(np.int64).max call, that would be great as well!

@charris
Copy link
Member Author

charris commented Dec 31, 2015

@gfyoung If we go the dtype route, it only needs to be added to one function, which will call the appropriate others, which may be hidden (or not). But we do need the others, as they will all differ in lower level details.

I don't see an np.iinfo(np.int64).max call here, what are you referring to?

msg = 'Unimplemented for sizeof(long) == %d' % sizeof(long)
raise RuntimeError(msg)

def random_intp(self, low, high=None, size=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this just be random_integers instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. One is C long, the other is Py_ssize_t and the two are not always the same, that is why there is an error on windows 64 where C long is 32 bits and Py_ssize_t is 64 bits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this function prevents the overflow error, and on remaining platforms (and if there are more, please correct me on this!), long is the same size of Py_ssize_t. Given that the difference is significant on few platforms, I would think that the difference between the two is minimal and therefore confusing for users. Thus, surfacing only one would make sense, wouldn't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really. They have different uses, np.intp is the indexing integer, and we need to keep np.int as is for backwards compatibility. Python 2 had two integer types, int, and long, not the same as the C types of the same name. The int type was implemented as C long, which was efficient, and the later long type has arbitrary precision. Python 3 has only the long type, but we need to carry along the historic crud so as to not break old code/results.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that makes sense. Could that be noted somewhere? That would be good to know.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have a point, in that the way things are implemented the returned values would not change for the same ranges if only np.intp were used, but the array type would, doubling in memory footprint on Windows 64, and the returned scalar values would be of long python type for python 2.7 on 64 bit Windows.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true. I wonder then if you could put logic in random_integers to check the OS and Python version and then call the appropriate one (either the np.intp or long versions) to maintain backwards compatibility? Then the user wouldn't have to worry / think about which one to call. This would require abstracting the current code in random_integers into say a function called rand_int (placeholder, this name may not work so well as it would get confusing with randint; though IMHO the names random_integers and randint might need to be refactored to give a better idea of what they do in the future).

@gfyoung
Copy link
Contributor

gfyoung commented Dec 31, 2015

@charris : Ah, I see. Just add it to random_integers and then branch out to the appropriate calls in randomkit.c. For efficiency purposes, that's fair.

The np.iinfo(np.int64).max call I'm referring to is in the first comment of @jreback issue #6812, which technically is not the failing mem_overlap test failure as you described in your PR description.

@charris charris force-pushed the add-64-bit-random-int branch from 4d4a461 to 6204145 Compare December 31, 2015 02:12
else:
lo = low
hi = high

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could wrap this if...else block into a try...except block to provide a useful message about OverflowError issues that we are seeing in #6812, something like:

try:
... # if-else block
except OverflowError:
    raise OverflowError(("Boundary values are too large for int32.
                          Please call random_int64 instead."))

@rkern
Copy link
Member

rkern commented Dec 31, 2015

Generally speaking, the high-inclusive random_integers(low, high) is considered an API mistake. It's only here because it was in Numeric's RandomArray. I'd prefer to expand only randint() to other dtypes instead.

@gfyoung
Copy link
Contributor

gfyoung commented Dec 31, 2015

@rkern : That's interesting that you point that out. What happens if people wanted to generate numbers that included the maximum possible integer (e.g. np.iinfo(np.int64).max)?

@rkern
Copy link
Member

rkern commented Dec 31, 2015

That just means that we have to be careful about dealing with the high argument that is input, but it should be manageable.

@gfyoung
Copy link
Contributor

gfyoung commented Dec 31, 2015

@rkern : Oh, that was actually more of a theoretical question than a question directly related to the changes here. In any case, I'm not sure that I follow your comment regarding only implementing dtype for just randint and not random_integers since randint calls random_integers currently.

As an aside, perhaps once all of the refactoring has been done, it could be worthwhile to deprecate randint and random_integers with just one function that accepts a is_half_open argument defaulted to True. I'm not sure it's within the scope of this PR, but just throwing that out there.

@rkern
Copy link
Member

rkern commented Dec 31, 2015

It's the other way around. random_integers() calls randint().

@gfyoung
Copy link
Contributor

gfyoung commented Dec 31, 2015

@rkern : They have been reversed (#6885). Also, I had a question for you about the two draws from the PRNG, which I think got lost in the discussion, so I'll paste it here:

By the way in the original code (i.e master branch), doesn't rk_ulong do exactly that, which is called from rk_interval, which in turn is called by random_integers? If not, then how was it then that my added test in #6824 passed both on my machine (32-bit) and on 64-bit Linux on Travis?

@charris
Copy link
Member Author

charris commented Dec 31, 2015

@rkern OK, things can be fixed up. To be clear, randint is the preferred function. ISTM that in that case we should deprecate random_integers or at least document that randint should be used instead. Perhaps the proposed dtype argument should only be added to randint as well to encourage that.

@rkern
Copy link
Member

rkern commented Dec 31, 2015

@gfyoung rk_ulong() does make two draws when unsigned long is 64 bits wide. That's what that #if ULONG_MAX <= 0xffffffffUL test is controlling.

@charris charris force-pushed the add-64-bit-random-int branch from 6204145 to b3b3749 Compare December 31, 2015 23:19
@njsmith
Copy link
Member

njsmith commented Jan 1, 2016

Regarding the nitty-gritty details of the various method semantics: I do think it would indeed be nice if there were some way to sample a uniform uint64, though it's a bit tricky to get the API right because it's a weird corner case that most users probably won't care about. I also agree that the current random_integers/randint situation is confusing. An additional extra-confusing aspect that hasn't been mentioned yet in this thread is that the stdlib random module also has functions for sampling from both half-open and closed intervals, and it also uses the name randint for one of those functions, but they do it backwards from us -- their randint is our random_integers, and our randint is their randrange.

Proposal: soft deprecate both randint and random_integers (by "soft deprecate" I mean: document that you shouldn't use them, mayyyybe emit a warning when they're used, but don't make any plans to actually remove them), and replace them both with a new method randrange([start,] stop, [step, closed=False, dtype=int]). Rationale: the randrange name is more transparent to users who already know range and more consistent with the stdlib; the new closed= kwarg (as suggested by @gfyoung) provides an elegant way to sample from the full fixed-width integer range without cluttering up the top-level documentation and forcing users to understand the subtleties of how the half-open convention interacts with fixed-width integers just to get started.

If that sounds good to y'all then someone should post it to the list I guess...

Two things that might benefit from further refinement: (a) the closed= name is somewhat jargony for parts of our user base, maybe there's something even better? (include_max=?), (b) do we want to take advantage of the addition of this new method to clean up the default integer type?

@rkern
Copy link
Member

rkern commented Jan 1, 2016

I'd suggest fixing the technical issues with the existing function of interest first and then do the renaming for clarity separately.

As for the rangerange() proposal itself, I'd actually omit step= as a foolish consistency with random.randrange() which is itself a foolishly consistent with range(). This should be a fairly primitive function, so making it do more things with more options is a pessimization in terms of usability, IMO. As for closed=, there isn't a problem with specifying the full range with the open-interval convention; the caller is not restricted to passing in fixed-width integers.

@charris charris force-pushed the add-64-bit-random-int branch from d901fce to 6815e47 Compare January 2, 2016 03:20
@gfyoung
Copy link
Contributor

gfyoung commented Jan 2, 2016

@charris : There are still several documentation issues that should probably be fixed (see the comments I made that have not been collapsed yet in an outdated diff info). OTOH cool to see that there was a speed-up here.

@charris charris force-pushed the add-64-bit-random-int branch 4 times, most recently from 62643ed to 7eb88a0 Compare January 2, 2016 14:11
Random ndarrays of the following types can now be generated:

* np.bool,
* np.int8, np.uint8,
* np.int16, np.uint16,
* np.int32, np.uint32,
* np.int64, np.uint64,
* np.int_ (long), np.intp

The specification is by precision rather than by C type. Hence, on some
platforms np.int64 may be a `long` instead of `long long` even if the
specified dtype is `long long` because the two may have the same
precision. The resulting type depends on which c type numpy uses for the
given precision. The byteorder specification is also ignored, the
generated arrays are always in native byte order.

The dtype of the result could be made more explicit if desired without
changing the user visible results.
@charris charris force-pushed the add-64-bit-random-int branch from 7eb88a0 to 25b6e63 Compare January 2, 2016 19:30
* check exceptions
* check extreme bounds are reachable
* check that all values are in the specified bounds
* check repeatability of sequences

More exact statistical tests would be nice, but that is another
project.
The default randint function returns a C long type which does not have
enough range to test indexes on Windows 64. The fix here is to use
specify a np.intp dtype for the randint call now that we have that
option.

Closes numpy#6812.
Cython generated C code contains the number '-2147483648L', which
leads to a warning on 32 bit platforms:

"Warning: this decimal constant is unsigned only in ISO C90"

See the discussion at http://stackoverflow.com/questions/9941261/

The compiled code seems to run correctly despite the warning and
if there are problems, they should turn up in the nose testing.
AppVeyor failures were filtered out until numpy issues were fixed.
The last issues were the test_mem_overlap failures on 64 bit windows,
and those are fixed.
@charris charris force-pushed the add-64-bit-random-int branch from 25b6e63 to bc2a97b Compare January 3, 2016 03:05
@charris
Copy link
Member Author

charris commented Jan 3, 2016

An indication of just how much a jump backward in a loop costs.

In [13]: timeit np.random.randint(0, 128, size=10**6, dtype=np.uint64)
100 loops, best of 3: 3.8 ms per loop

In [14]: timeit np.random.randint(0, 65, size=10**6, dtype=np.uint64)
100 loops, best of 3: 15.5 ms per loop

If the jump overhead was zero, then the second timing would be 2x the first, but as there is a jump back before the second draw that has to account for the extra time. That suggests that the speed could be increased by perhaps about a factor of two by unrolling the loop a bit.

@charris
Copy link
Member Author

charris commented Jan 3, 2016

OK, I'm going to put this in so that the appveyor failures can show.

The specification is by precision rather than by C type. Hence, on some
platforms np.int64 may be a `long` instead of `long long` even if the
specified dtype is `long long` because the two may have the same
precision. The resulting type depends on which c type numpy uses for the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, OK, I'll fix that in the notes revisions coming up.

charris added a commit that referenced this pull request Jan 3, 2016
ENH: Add dtype argument to random.randint.
@charris charris merged commit 9fb9288 into numpy:master Jan 3, 2016
@charris charris deleted the add-64-bit-random-int branch January 3, 2016 22:29
@@ -908,6 +1219,13 @@ cdef class RandomState:
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn. Default is None, in which case a
single value is returned.
dtype : dtype, optional
Desired dtype of the result. All dtypes are determined by their
name, i.e., 'int64', 'int`, etc, so byteorder is not available
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quotes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants