ENH: Add dtype argument to random.randint. #6910

charris · 2015-12-30T21:58:17Z

This is a WIP, in particular, the examples need fixing, tests need to be added, and a decision should be made whether to access these functions through a dtype argument in the current random_integers function. If the latter is chosen rk_functions for uint8, uint16, and maybe bool need to be added for efficiency, on the other hand, less new documentation needs to be written ;)

The following dtypes are now available from the new dtype argument added to randint.

* np.bool,
* np.int8, np.uint8,
* np.int16, np.uint16,
* np.int32, np.uint32,
* np.int64, np.uint64,
* np.int_ (long), np.intp

Tests for all the new functionality are needed, but things seem to work. It should also be almost trivial to template the code added to mtrant.pyx, the 9 added functions only differ in using different types.

When merged #6812 will be closed and the integer part of #6790 will be done..

~~Alternate PR's addressing this are #6824 and #6869, but I like this one as far as it goes.~~

gfyoung · 2015-12-30T22:48:12Z

There seems to be a lot of boilerplate. Is it possible to condense? Especially with other issues like broadcasting (see #6902) or dtype (see #6869), it seems like this could easily blow up with so many functions implementing nearly the same thing but with 32-bit or 64-bit int.

Also, is there a way to test / demonstrate the issue with the random stream that is apparently rampant in the current implementation of random_integers? That would be quite useful (at least for my sake, given that I have been looking at this for so long) to illustrate why these changes are necessary.

rkern · 2015-12-30T22:53:27Z

numpy/random/mtrand/mtrand.pyx

+        """
+        random_int32(low, high=None, size=None)
+
+        Return random npy_intp integers between `low` and `high`, inclusive.


charris · 2015-12-30T22:59:20Z

@gfyoung You can implement both the dtype and broadcasting using the added functions. You need both 32 and 64 to implement both intp and int with the same functions, the same functions added to randomkit (which uses unsigned) can also be used for unsigned. To do a efficient job with smaller integer types more functions need to be added to randomkit for efficient implementation, i.e., there are two int16 in one int32 sample.

Not clear on what you consider boiler plate besides the documentation...

gfyoung · 2015-12-30T23:39:09Z

@charris : Regarding dtype and broadcasting, that actually wasn't my question. I'm not worried about ability to implement. No, no, I know both are certainly possible, broadcasting to say the least given how many times it has been done already in mtrand.pyx.

My concern was that won't you have to implement everything at least twice now in your versions? Maybe I'm getting my semantics / definitions wrong, but your two functions are boilerplate given how similar they are to each other.

Also, you say "you need" to implement the functions separately for 32-bit and 64-bit. Again, is there a way to test or illustrate the shortcomings? Also, why is it that you are using the 32-bit Mersenne Twister for npy_uint64? From what I remember, @rkern said on my PR that this should not be done, and that there is a completely separate algorithm for 64-bit.

I unfortunately can't pull down the branch right now to try it myself, but how does the change address the failed method call in #6812 with np.iinfo(np.int64).max? Do you have to call random_int64, or does random_intp suffice?

rkern · 2015-12-30T23:50:03Z

That's not what I said. What I said was that just changing unsigned long to unsigned long long in the core PRNG would not help you. Instead, I suggested exactly what @charris implemented here, to create rk_uint64 that generates a uint64 from two draws from the core PRNG.

charris · 2015-12-30T23:52:07Z

@gfyoung You need 64 bit random numbers in order to generate 64 bit random numbers. OTOH, it is inefficient to use 64 bit numbers to generate 32 random numbers. The two functions are similar, but not the same. Note that rk_random_uint64 will use 32 or 64 bit random samples depending on the size of the range, and we need to preserve that in order to retain backwards compatibility.

It would help if you would make specific comments about specific lines/functions. It it difficult to respond to generalities.

gfyoung · 2015-12-31T00:17:36Z

@rkern: Oh, sorry! You're right. Memory failed me there. By the way in the original code (i.e master branch), doesn't rk_ulong do exactly that, which is called from rk_interval, which in turn is called by random_integers? If not, then how was it then that my added test in #6824 passed both on my machine (32-bit) and on 64-bit Linux on Travis?

@charris: Apologies that I haven't been able to give more specific pointers. While I realize this is WIP, my comments have been directed at the overall organization and structure of the code. My concern was directed towards the future additions of dtype and broadcasting, which would have to be added in both functions. I'm just trying to understand why you organized the code the way you did, if that's okay.

Also, if you could address the question regarding the np.iinfo(np.int64).max call, that would be great as well!

charris · 2015-12-31T00:32:31Z

@gfyoung If we go the dtype route, it only needs to be added to one function, which will call the appropriate others, which may be hidden (or not). But we do need the others, as they will all differ in lower level details.

I don't see an np.iinfo(np.int64).max call here, what are you referring to?

gfyoung · 2015-12-31T00:46:54Z

numpy/random/mtrand/mtrand.pyx

+            msg = 'Unimplemented for sizeof(long) == %d' % sizeof(long)
+            raise RuntimeError(msg)
+
+    def random_intp(self, low, high=None, size=None):


Can this just be random_integers instead?

No. One is C long, the other is Py_ssize_t and the two are not always the same, that is why there is an error on windows 64 where C long is 32 bits and Py_ssize_t is 64 bits.

But this function prevents the overflow error, and on remaining platforms (and if there are more, please correct me on this!), long is the same size of Py_ssize_t. Given that the difference is significant on few platforms, I would think that the difference between the two is minimal and therefore confusing for users. Thus, surfacing only one would make sense, wouldn't it?

Not really. They have different uses, np.intp is the indexing integer, and we need to keep np.int as is for backwards compatibility. Python 2 had two integer types, int, and long, not the same as the C types of the same name. The int type was implemented as C long, which was efficient, and the later long type has arbitrary precision. Python 3 has only the long type, but we need to carry along the historic crud so as to not break old code/results.

Okay, that makes sense. Could that be noted somewhere? That would be good to know.

You have a point, in that the way things are implemented the returned values would not change for the same ranges if only np.intp were used, but the array type would, doubling in memory footprint on Windows 64, and the returned scalar values would be of long python type for python 2.7 on 64 bit Windows.

That is true. I wonder then if you could put logic in random_integers to check the OS and Python version and then call the appropriate one (either the np.intp or long versions) to maintain backwards compatibility? Then the user wouldn't have to worry / think about which one to call. This would require abstracting the current code in random_integers into say a function called rand_int (placeholder, this name may not work so well as it would get confusing with randint; though IMHO the names random_integers and randint might need to be refactored to give a better idea of what they do in the future).

gfyoung · 2015-12-31T00:57:35Z

@charris : Ah, I see. Just add it to random_integers and then branch out to the appropriate calls in randomkit.c. For efficiency purposes, that's fair.

The np.iinfo(np.int64).max call I'm referring to is in the first comment of @jreback issue #6812, which technically is not the failing mem_overlap test failure as you described in your PR description.

gfyoung · 2015-12-31T04:25:58Z

numpy/random/mtrand/mtrand.pyx

+        else:
+            lo = low
+            hi = high
+


I wonder if we could wrap this if...else block into a try...except block to provide a useful message about OverflowError issues that we are seeing in #6812, something like:

try: ... # if-else block except OverflowError: raise OverflowError(("Boundary values are too large for int32. Please call random_int64 instead."))

rkern · 2015-12-31T10:29:00Z

Generally speaking, the high-inclusive random_integers(low, high) is considered an API mistake. It's only here because it was in Numeric's RandomArray. I'd prefer to expand only randint() to other dtypes instead.

gfyoung · 2015-12-31T10:35:48Z

@rkern : That's interesting that you point that out. What happens if people wanted to generate numbers that included the maximum possible integer (e.g. np.iinfo(np.int64).max)?

rkern · 2015-12-31T10:41:49Z

That just means that we have to be careful about dealing with the high argument that is input, but it should be manageable.

gfyoung · 2015-12-31T10:48:33Z

@rkern : Oh, that was actually more of a theoretical question than a question directly related to the changes here. In any case, I'm not sure that I follow your comment regarding only implementing dtype for just randint and not random_integers since randint calls random_integers currently.

As an aside, perhaps once all of the refactoring has been done, it could be worthwhile to deprecate randint and random_integers with just one function that accepts a is_half_open argument defaulted to True. I'm not sure it's within the scope of this PR, but just throwing that out there.

rkern · 2015-12-31T10:59:53Z

It's the other way around. random_integers() calls randint().

gfyoung · 2015-12-31T11:07:10Z

@rkern : They have been reversed (#6885). Also, I had a question for you about the two draws from the PRNG, which I think got lost in the discussion, so I'll paste it here:

By the way in the original code (i.e master branch), doesn't rk_ulong do exactly that, which is called from rk_interval, which in turn is called by random_integers? If not, then how was it then that my added test in #6824 passed both on my machine (32-bit) and on 64-bit Linux on Travis?

charris · 2015-12-31T13:29:20Z

@rkern OK, things can be fixed up. To be clear, randint is the preferred function. ISTM that in that case we should deprecate random_integers or at least document that randint should be used instead. Perhaps the proposed dtype argument should only be added to randint as well to encourage that.

rkern · 2015-12-31T15:51:23Z

@gfyoung rk_ulong() does make two draws when unsigned long is 64 bits wide. That's what that #if ULONG_MAX <= 0xffffffffUL test is controlling.

njsmith · 2016-01-01T01:18:34Z

Regarding the nitty-gritty details of the various method semantics: I do think it would indeed be nice if there were some way to sample a uniform uint64, though it's a bit tricky to get the API right because it's a weird corner case that most users probably won't care about. I also agree that the current random_integers/randint situation is confusing. An additional extra-confusing aspect that hasn't been mentioned yet in this thread is that the stdlib random module also has functions for sampling from both half-open and closed intervals, and it also uses the name randint for one of those functions, but they do it backwards from us -- their randint is our random_integers, and our randint is their randrange.

Proposal: soft deprecate both randint and random_integers (by "soft deprecate" I mean: document that you shouldn't use them, mayyyybe emit a warning when they're used, but don't make any plans to actually remove them), and replace them both with a new method randrange([start,] stop, [step, closed=False, dtype=int]). Rationale: the randrange name is more transparent to users who already know range and more consistent with the stdlib; the new closed= kwarg (as suggested by @gfyoung) provides an elegant way to sample from the full fixed-width integer range without cluttering up the top-level documentation and forcing users to understand the subtleties of how the half-open convention interacts with fixed-width integers just to get started.

If that sounds good to y'all then someone should post it to the list I guess...

Two things that might benefit from further refinement: (a) the closed= name is somewhat jargony for parts of our user base, maybe there's something even better? (include_max=?), (b) do we want to take advantage of the addition of this new method to clean up the default integer type?

rkern · 2016-01-01T01:37:05Z

I'd suggest fixing the technical issues with the existing function of interest first and then do the renaming for clarity separately.

As for the rangerange() proposal itself, I'd actually omit step= as a foolish consistency with random.randrange() which is itself a foolishly consistent with range(). This should be a fairly primitive function, so making it do more things with more options is a pessimization in terms of usability, IMO. As for closed=, there isn't a problem with specifying the full range with the open-interval convention; the caller is not restricted to passing in fixed-width integers.

gfyoung · 2016-01-02T03:41:56Z

@charris : There are still several documentation issues that should probably be fixed (see the comments I made that have not been collapsed yet in an outdated diff info). OTOH cool to see that there was a speed-up here.

Random ndarrays of the following types can now be generated: * np.bool, * np.int8, np.uint8, * np.int16, np.uint16, * np.int32, np.uint32, * np.int64, np.uint64, * np.int_ (long), np.intp The specification is by precision rather than by C type. Hence, on some platforms np.int64 may be a `long` instead of `long long` even if the specified dtype is `long long` because the two may have the same precision. The resulting type depends on which c type numpy uses for the given precision. The byteorder specification is also ignored, the generated arrays are always in native byte order. The dtype of the result could be made more explicit if desired without changing the user visible results.

* check exceptions * check extreme bounds are reachable * check that all values are in the specified bounds * check repeatability of sequences More exact statistical tests would be nice, but that is another project.

The default randint function returns a C long type which does not have enough range to test indexes on Windows 64. The fix here is to use specify a np.intp dtype for the randint call now that we have that option. Closes numpy#6812.

Cython generated C code contains the number '-2147483648L', which leads to a warning on 32 bit platforms: "Warning: this decimal constant is unsigned only in ISO C90" See the discussion at http://stackoverflow.com/questions/9941261/ The compiled code seems to run correctly despite the warning and if there are problems, they should turn up in the nose testing.

AppVeyor failures were filtered out until numpy issues were fixed. The last issues were the test_mem_overlap failures on 64 bit windows, and those are fixed.

charris · 2016-01-03T21:46:10Z

An indication of just how much a jump backward in a loop costs.

In [13]: timeit np.random.randint(0, 128, size=10**6, dtype=np.uint64)
100 loops, best of 3: 3.8 ms per loop

In [14]: timeit np.random.randint(0, 65, size=10**6, dtype=np.uint64)
100 loops, best of 3: 15.5 ms per loop

If the jump overhead was zero, then the second timing would be 2x the first, but as there is a jump back before the second draw that has to account for the extra time. That suggests that the speed could be increased by perhaps about a factor of two by unrolling the loop a bit.

charris · 2016-01-03T22:21:59Z

OK, I'm going to put this in so that the appveyor failures can show.

rkern · 2016-01-03T22:29:28Z

doc/release/1.11.0-notes.rst

+  The specification is by precision rather than by C type. Hence, on some
+  platforms np.int64 may be a `long` instead of `long long` even if the
+  specified dtype is `long long` because the two may have the same
+  precision. The resulting type depends on which c type numpy uses for the


Heh, OK, I'll fix that in the notes revisions coming up.

ENH: Add dtype argument to random.randint.

rkern · 2016-01-03T22:30:56Z

numpy/random/mtrand/mtrand.pyx

@@ -908,6 +1219,13 @@ cdef class RandomState:
            Output shape.  If the given shape is, e.g., ``(m, n, k)``, then
            ``m * n * k`` samples are drawn.  Default is None, in which case a
            single value is returned.
+        dtype : dtype, optional
+            Desired dtype of the result. All dtypes are determined by their
+            name, i.e., 'int64', 'int`, etc, so byteorder is not available


charris added 01 - Enhancement component: numpy.random labels Dec 30, 2015

charris added this to the 1.11.0 release milestone Dec 30, 2015

rkern reviewed Dec 30, 2015
View reviewed changes

gfyoung reviewed Dec 31, 2015
View reviewed changes

charris force-pushed the add-64-bit-random-int branch from 4d4a461 to 6204145 Compare December 31, 2015 02:12

gfyoung reviewed Dec 31, 2015
View reviewed changes

charris force-pushed the add-64-bit-random-int branch from 6204145 to b3b3749 Compare December 31, 2015 23:19

charris force-pushed the add-64-bit-random-int branch from d901fce to 6815e47 Compare January 2, 2016 03:20

charris force-pushed the add-64-bit-random-int branch 4 times, most recently from 62643ed to 7eb88a0 Compare January 2, 2016 14:11

charris force-pushed the add-64-bit-random-int branch from 7eb88a0 to 25b6e63 Compare January 2, 2016 19:30

charris added 6 commits January 2, 2016 20:05

TST: Add tests for new randint functionality.

5aca879

* check exceptions * check extreme bounds are reachable * check that all values are in the specified bounds * check repeatability of sequences More exact statistical tests would be nice, but that is another project.

TST,BUG: Fix use of randint in test_mem_overlap.py.

51726e5

The default randint function returns a C long type which does not have enough range to test indexes on Windows 64. The fix here is to use specify a np.intp dtype for the randint call now that we have that option. Closes numpy#6812.

BUG: Get rid of C++ style comment in multiarray_tests.c.src

5132ce0

TST: Enable AppVeyor failure to be reported.

07fa8cc

AppVeyor failures were filtered out until numpy issues were fixed. The last issues were the test_mem_overlap failures on 64 bit windows, and those are fixed.

DOC: Document new randint dtype parameter in 1.11.0 release notes.

bc2a97b

charris force-pushed the add-64-bit-random-int branch from 25b6e63 to bc2a97b Compare January 3, 2016 03:05

rkern reviewed Jan 3, 2016
View reviewed changes

charris added a commit that referenced this pull request Jan 3, 2016

Merge pull request #6910 from charris/add-64-bit-random-int

9fb9288

ENH: Add dtype argument to random.randint.

charris merged commit 9fb9288 into numpy:master Jan 3, 2016

charris deleted the add-64-bit-random-int branch January 3, 2016 22:29

rkern reviewed Jan 3, 2016
View reviewed changes

This was referenced Jan 3, 2016

Faster random.shuffle via static typing. #6776

Closed

BUG: Broadcast Arguments in Random Integer Generation #6902

Closed

This was referenced Jan 3, 2016

DEP: Deprecate random_integers #6931

Merged

The arguments of randint() and random_integers() do not broadcast. #6745

Closed

ENH: Allow Randint to Broadcast Arguments #6938

Closed

rkern mentioned this pull request Jul 11, 2016

random.random_integers fails if upper bound is 2**31 or higher #4085

Closed

rth mentioned this pull request Jan 3, 2018

randint output have different dtype on Linux and Windows #10317

Closed

drsteve mentioned this pull request Jun 7, 2019

Test failure: test_poppy.AssocTests.testAssocBootstrapRegress spacepy/spacepy#67

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Add dtype argument to random.randint. #6910

ENH: Add dtype argument to random.randint. #6910

charris commented Dec 30, 2015

gfyoung commented Dec 30, 2015

rkern Dec 30, 2015

charris commented Dec 30, 2015

gfyoung commented Dec 30, 2015

rkern commented Dec 30, 2015

charris commented Dec 30, 2015

gfyoung commented Dec 31, 2015

charris commented Dec 31, 2015

gfyoung Dec 31, 2015

charris Dec 31, 2015

gfyoung Dec 31, 2015

charris Dec 31, 2015

gfyoung Dec 31, 2015

charris Dec 31, 2015

gfyoung Dec 31, 2015

gfyoung commented Dec 31, 2015

gfyoung Dec 31, 2015

rkern commented Dec 31, 2015

gfyoung commented Dec 31, 2015

rkern commented Dec 31, 2015

gfyoung commented Dec 31, 2015

rkern commented Dec 31, 2015

gfyoung commented Dec 31, 2015

charris commented Dec 31, 2015

rkern commented Dec 31, 2015

njsmith commented Jan 1, 2016

rkern commented Jan 1, 2016

gfyoung commented Jan 2, 2016

charris commented Jan 3, 2016

charris commented Jan 3, 2016

rkern Jan 3, 2016

charris Jan 3, 2016

rkern Jan 3, 2016

ENH: Add dtype argument to random.randint. #6910

ENH: Add dtype argument to random.randint. #6910

Conversation

charris commented Dec 30, 2015

gfyoung commented Dec 30, 2015

Choose a reason for hiding this comment

charris commented Dec 30, 2015

gfyoung commented Dec 30, 2015

rkern commented Dec 30, 2015

charris commented Dec 30, 2015

gfyoung commented Dec 31, 2015

charris commented Dec 31, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gfyoung commented Dec 31, 2015

Choose a reason for hiding this comment

rkern commented Dec 31, 2015

gfyoung commented Dec 31, 2015

rkern commented Dec 31, 2015

gfyoung commented Dec 31, 2015

rkern commented Dec 31, 2015

gfyoung commented Dec 31, 2015

charris commented Dec 31, 2015

rkern commented Dec 31, 2015

njsmith commented Jan 1, 2016

rkern commented Jan 1, 2016

gfyoung commented Jan 2, 2016

charris commented Jan 3, 2016

charris commented Jan 3, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment