-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
ENH: Add dtype argument to random.randint. #6910
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There seems to be a lot of boilerplate. Is it possible to condense? Especially with other issues like broadcasting (see #6902) or Also, is there a way to test / demonstrate the issue with the random stream that is apparently rampant in the current implementation of |
""" | ||
random_int32(low, high=None, size=None) | ||
|
||
Return random npy_intp integers between `low` and `high`, inclusive. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int32
?
@gfyoung You can implement both the dtype and broadcasting using the added functions. You need both 32 and 64 to implement both intp and int with the same functions, the same functions added to randomkit (which uses unsigned) can also be used for unsigned. To do a efficient job with smaller integer types more functions need to be added to randomkit for efficient implementation, i.e., there are two int16 in one int32 sample. Not clear on what you consider boiler plate besides the documentation... |
@charris : Regarding My concern was that won't you have to implement everything at least twice now in your versions? Maybe I'm getting my semantics / definitions wrong, but your two functions are boilerplate given how similar they are to each other. Also, you say "you need" to implement the functions separately for 32-bit and 64-bit. Again, is there a way to test or illustrate the shortcomings? Also, why is it that you are using the 32-bit Mersenne Twister for I unfortunately can't pull down the branch right now to try it myself, but how does the change address the failed method call in #6812 with |
That's not what I said. What I said was that just changing |
@gfyoung You need 64 bit random numbers in order to generate 64 bit random numbers. OTOH, it is inefficient to use 64 bit numbers to generate 32 random numbers. The two functions are similar, but not the same. Note that It would help if you would make specific comments about specific lines/functions. It it difficult to respond to generalities. |
@rkern: Oh, sorry! You're right. Memory failed me there. By the way in the original code (i.e @charris: Apologies that I haven't been able to give more specific pointers. While I realize this is WIP, my comments have been directed at the overall organization and structure of the code. My concern was directed towards the future additions of Also, if you could address the question regarding the |
@gfyoung If we go the dtype route, it only needs to be added to one function, which will call the appropriate others, which may be hidden (or not). But we do need the others, as they will all differ in lower level details. I don't see an |
msg = 'Unimplemented for sizeof(long) == %d' % sizeof(long) | ||
raise RuntimeError(msg) | ||
|
||
def random_intp(self, low, high=None, size=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this just be random_integers
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. One is C long
, the other is Py_ssize_t
and the two are not always the same, that is why there is an error on windows 64 where C long
is 32 bits and Py_ssize_t
is 64 bits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this function prevents the overflow error, and on remaining platforms (and if there are more, please correct me on this!), long
is the same size of Py_ssize_t
. Given that the difference is significant on few platforms, I would think that the difference between the two is minimal and therefore confusing for users. Thus, surfacing only one would make sense, wouldn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really. They have different uses, np.intp
is the indexing integer, and we need to keep np.int
as is for backwards compatibility. Python 2 had two integer types, int
, and long
, not the same as the C types of the same name. The int
type was implemented as C long
, which was efficient, and the later long
type has arbitrary precision. Python 3 has only the long
type, but we need to carry along the historic crud so as to not break old code/results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, that makes sense. Could that be noted somewhere? That would be good to know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have a point, in that the way things are implemented the returned values would not change for the same ranges if only np.intp
were used, but the array type would, doubling in memory footprint on Windows 64, and the returned scalar values would be of long
python type for python 2.7 on 64 bit Windows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is true. I wonder then if you could put logic in random_integers
to check the OS and Python version and then call the appropriate one (either the np.intp
or long
versions) to maintain backwards compatibility? Then the user wouldn't have to worry / think about which one to call. This would require abstracting the current code in random_integers
into say a function called rand_int
(placeholder, this name may not work so well as it would get confusing with randint
; though IMHO the names random_integers
and randint
might need to be refactored to give a better idea of what they do in the future).
@charris : Ah, I see. Just add it to The |
4d4a461
to
6204145
Compare
else: | ||
lo = low | ||
hi = high | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could wrap this if...else
block into a try...except
block to provide a useful message about OverflowError
issues that we are seeing in #6812, something like:
try:
... # if-else block
except OverflowError:
raise OverflowError(("Boundary values are too large for int32.
Please call random_int64 instead."))
Generally speaking, the |
@rkern : That's interesting that you point that out. What happens if people wanted to generate numbers that included the maximum possible integer (e.g. |
That just means that we have to be careful about dealing with the |
@rkern : Oh, that was actually more of a theoretical question than a question directly related to the changes here. In any case, I'm not sure that I follow your comment regarding only implementing As an aside, perhaps once all of the refactoring has been done, it could be worthwhile to deprecate |
It's the other way around. |
@rkern : They have been reversed (#6885). Also, I had a question for you about the two draws from the PRNG, which I think got lost in the discussion, so I'll paste it here: By the way in the original code (i.e master branch), doesn't |
@rkern OK, things can be fixed up. To be clear, |
@gfyoung |
6204145
to
b3b3749
Compare
Regarding the nitty-gritty details of the various method semantics: I do think it would indeed be nice if there were some way to sample a uniform Proposal: soft deprecate both If that sounds good to y'all then someone should post it to the list I guess... Two things that might benefit from further refinement: (a) the |
I'd suggest fixing the technical issues with the existing function of interest first and then do the renaming for clarity separately. As for the |
d901fce
to
6815e47
Compare
@charris : There are still several documentation issues that should probably be fixed (see the comments I made that have not been collapsed yet in an outdated diff info). OTOH cool to see that there was a speed-up here. |
62643ed
to
7eb88a0
Compare
Random ndarrays of the following types can now be generated: * np.bool, * np.int8, np.uint8, * np.int16, np.uint16, * np.int32, np.uint32, * np.int64, np.uint64, * np.int_ (long), np.intp The specification is by precision rather than by C type. Hence, on some platforms np.int64 may be a `long` instead of `long long` even if the specified dtype is `long long` because the two may have the same precision. The resulting type depends on which c type numpy uses for the given precision. The byteorder specification is also ignored, the generated arrays are always in native byte order. The dtype of the result could be made more explicit if desired without changing the user visible results.
7eb88a0
to
25b6e63
Compare
* check exceptions * check extreme bounds are reachable * check that all values are in the specified bounds * check repeatability of sequences More exact statistical tests would be nice, but that is another project.
The default randint function returns a C long type which does not have enough range to test indexes on Windows 64. The fix here is to use specify a np.intp dtype for the randint call now that we have that option. Closes numpy#6812.
Cython generated C code contains the number '-2147483648L', which leads to a warning on 32 bit platforms: "Warning: this decimal constant is unsigned only in ISO C90" See the discussion at http://stackoverflow.com/questions/9941261/ The compiled code seems to run correctly despite the warning and if there are problems, they should turn up in the nose testing.
AppVeyor failures were filtered out until numpy issues were fixed. The last issues were the test_mem_overlap failures on 64 bit windows, and those are fixed.
25b6e63
to
bc2a97b
Compare
An indication of just how much a jump backward in a loop costs.
If the jump overhead was zero, then the second timing would be 2x the first, but as there is a jump back before the second draw that has to account for the extra time. That suggests that the speed could be increased by perhaps about a factor of two by unrolling the loop a bit. |
OK, I'm going to put this in so that the appveyor failures can show. |
The specification is by precision rather than by C type. Hence, on some | ||
platforms np.int64 may be a `long` instead of `long long` even if the | ||
specified dtype is `long long` because the two may have the same | ||
precision. The resulting type depends on which c type numpy uses for the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh, OK, I'll fix that in the notes revisions coming up.
ENH: Add dtype argument to random.randint.
@@ -908,6 +1219,13 @@ cdef class RandomState: | |||
Output shape. If the given shape is, e.g., ``(m, n, k)``, then | |||
``m * n * k`` samples are drawn. Default is None, in which case a | |||
single value is returned. | |||
dtype : dtype, optional | |||
Desired dtype of the result. All dtypes are determined by their | |||
name, i.e., 'int64', 'int`, etc, so byteorder is not available |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quotes.
This is a WIP, in particular, the examples need fixing, tests need to be added, and a decision should be made whether to access these functions through a dtype argument in the currentrandom_integers
function. If the latter is chosen rk_functions for uint8, uint16, and maybe bool need to be added for efficiency, on the other hand, less new documentation needs to be written ;)The following dtypes are now available from the new
dtype
argument added to randint.Tests for all the new functionality are needed, but things seem to work. It should also be almost trivial to template the code added to mtrant.pyx, the 9 added functions only differ in using different types.
When merged #6812 will be closed and the integer part of #6790 will be done..
Alternate PR's addressing this are #6824 and #6869, but I like this one as far as it goes.