
ENH: Allow Randint to Broadcast Arguments #6938


Closed
wants to merge 1 commit into from

Conversation

Contributor

@gfyoung gfyoung commented Jan 5, 2016

#6902, take two. Addresses issue #6745.

@gfyoung
Contributor Author

gfyoung commented Jan 5, 2016

  1. I can't seem to build numpy off master anymore because it keeps complaining about not being able to find the "Advapi32" library in any directories (an empty list). Is anyone else having this issue, or is it just me? I scanned the file in my System32 folder and ran sfc /scannow as Administrator, and everything came back clean.

  2. As we are starting fresh from BUG: Broadcast Arguments in Random Integer Generation #6902, I'll pose the question again: should we allow a function call like the one below?

np.random.randint(5, [10, None])
  3. I will be adding tests and updating documentation / release notes later. I'm hoping to get feedback on the overall organization of my current changes for now.

@njsmith
Member

njsmith commented Jan 5, 2016

I can't seem to build numpy off master anymore because it keeps complaining about not being able to find the "Advapi32" library in any directories (an empty list).

No idea, sorry :-( Very few devs use Windows; there's a reasonable chance that you're the only one watching this PR who does. You might try the mailing list; that has a broader readership...

should we allow a function call like [...] np.random.randint(5, [10, None])

No. You should think about the signature as being randint(low=0, high); the fact that None even appears as a possible value is a technical detail of how we hack that argument signature into working in a language that doesn't ordinarily allow optional arguments to precede mandatory arguments.
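A minimal sketch of that signature hack, using the stdlib random module rather than numpy's actual source:

import random

def randint(low, high=None, size=None):
    # high=None is only a sentinel meaning "a single bound was given";
    # reinterpret that bound as `high`, with `low` defaulting to 0.
    if high is None:
        low, high = 0, low
    if size is None:
        return random.randrange(low, high)
    return [random.randrange(low, high) for _ in range(size)]

randint(5)         # one draw from [0, 5)
randint(2, 10, 3)  # three draws from [2, 10)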

try:
    test = int(high)
    high_array = np.array([high])
except TypeError:
    high_array = np.array(high)
Member

I think these should just be high_array = np.array(high) and similarly for low -- wrapping integers into a list before passing to array is neither necessary nor desirable. (I guess in this case it doesn't actually cause any harm, because we already special-cased the scalar/scalar case above, so we know that on this path at least one of the inputs has >1 dimension and an array with shape (1,) will broadcast against any array with >1 dimension. But if we just call high = np.array(high), then not only is the code simpler but this branch becomes correct in general, even if unprotected by the scalar checks above.)
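A quick demonstration of why the plain call suffices, using nothing beyond numpy's standard broadcasting rules:

>>> import numpy as np
>>> np.array(5).shape                 # a scalar becomes a 0-d array
()
>>> np.array([5, 10]).shape
(2,)
>>> np.broadcast(np.array(5), np.array([5, 10])).shape  # 0-d broadcasts
(2,)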

Contributor Author

Ah, that is indeed true. I was a little wary myself about these inner try...except blocks. Thanks for pointing that out. Will make things a lot simpler! :)

@njsmith
Member

njsmith commented Jan 5, 2016

In fact it would probably be a good idea to benchmark the code with and without the scalar special case; I guess it might well be worth it in this case, but in general I am wary of such fast-paths because they add a lot of code and maintenance complexity and don't always help as much as you'd think...

@gfyoung
Contributor Author

gfyoung commented Jan 5, 2016

@njsmith : Regarding np.random.randint(5, [10, None]): sounds good. Just wanted to double check. Regarding my issue building numpy, I sent out an email about it. Hopefully, I can get some help. Google will have to be my guide FTTB.

@gfyoung
Contributor Author

gfyoung commented Jan 5, 2016

True, I did add a ton of code that is quite similar to what was already written in #6910. However, I was hoping to try to keep most of those changes intact FTTB. Also, it seemed consistent with what other functions did like RandomState.hypergeometric. Though if there's a faster and more condensed way of doing this, do let me know!

else:
    array = <ndarray>np.empty(size, np.bool_)
    array_data = <npy_bool *>PyArray_DATA(array)
    multi = <broadcast>PyArray_MultiIterNew(3, <void *>array, <void *>lo, <void *>hi)
Member

Huh, this is not at all how I expected the size argument to interact with the parameter arrays. I expected that if the parameter arrays broadcast to (2, 3), and size=(4, 5), you'd get an output with shape (2, 3, 4, 5).

If this is how random size arguments work in general then I guess there's nothing to be done, but wanted to double check...
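For reference, a quick check of how size interacts with broadcast parameters in the existing methods (hypergeometric, the model being followed here):

>>> import numpy as np
>>> ngood = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
>>> np.random.hypergeometric(ngood, 10, 5, size=(2, 3)).shape
(2, 3)
>>> np.random.hypergeometric(ngood, 10, 5, size=(4, 2, 3)).shape
(4, 2, 3)
>>> # size=(2, 3, 4, 5) raises: the parameters do not broadcast against it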

Contributor Author

Oh, interesting. I had not considered that behavior. I was actually trying to follow in the footsteps of RandomState.hypergeometric.

Member

Sorry, never mind about this -- now that I'm at a proper keyboard and can play around, I see that all the random methods use this (IMO suboptimal) interpretation of size=, so randint should do the same for consistency.

(What I find particularly confusing is that it means that size=None and size=() are totally different. OTOH if size acted 'outside' of the core sampling operation, so that the parameters were broadcast against each other to define a single multidimensional "sample", and then size specified how many of these samples to take and tile together, then the default would just be size=(). Also it would be easier to work with. But obviously the advantages here are too small to be worth breaking backcompat or consistency over. It might be worth going through and adding a new kwarg like repeat= that has the desired semantics without breaking compatibility, but it's obviously not super urgent and would be a different PR in any case.)
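A quick illustration of the size=None / size=() asymmetry described above:

>>> import numpy as np
>>> type(np.random.randint(1, 10, size=None))   # a plain Python scalar
<class 'int'>
>>> np.random.randint(1, 10, size=()).shape     # a 0-d array
()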

@njsmith
Member

njsmith commented Jan 5, 2016

Right, so, it may well be that the way you've written this is the best way, even with the code duplication between the scalar and array versions. But what I'm saying is that we should make that decision based on whether or not the code duplication actually gets us some valuable benefit. Probably "it's substantively faster in important cases" -- that's the usual one. OTOH "It lets us avoid touching existing code", and "it makes the code more similar to existing code" aren't very good reasons IMHO. I'm just saying we should check our assumptions while implementing things.

@gfyoung
Contributor Author

gfyoung commented Jan 6, 2016

@njsmith : Fair enough. I'll try to do some benchmarking once I can get these tests fixed.

@everyone : Could someone explain why / how I get the following output? This is why I'm getting test failures at the moment:

>>> import numpy as np
>>> high = np.array([np.iinfo(np.int64).max + 1])
>>> low = np.array([np.iinfo(np.int64).max])
>>> high
array([9223372036854775808], dtype=uint64)
>>> low
array([9223372036854775807], dtype=int64)
>>> high[0] > low[0]
False
>>> high[0] == low[0]
True
>>> high[0] > np.uint64(low[0])
True

FYI, 64-bit Ubuntu VM with Python 2.7.11 (using that FTTB until I can get my Windows issue resolved).

@njsmith
Member

njsmith commented Jan 6, 2016

Yeah, when you do uint64 > int64, it's trying to cast both to a common type, and since there's no integer type whose range is a superset of both uint64 and int64, it goes to float64 instead. And then in float64, 2**63 - 1 isn't exactly representable and gets rounded up to 2**63, so the two values compare equal rather than >.

This is a bit of an ugly mess, and it's not at all clear that the cast-to-float64 thing is the right design (maybe it should just error? I dunno), but that's how it works now.
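The promotion can be checked directly:

>>> import numpy as np
>>> np.promote_types(np.int64, np.uint64)   # no common integer type
dtype('float64')
>>> float(2**63 - 1) == float(2**63)        # both round to 2.0**63
True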


@gfyoung
Contributor Author

gfyoung commented Jan 6, 2016

Oh, okay. Good to know. I'll try to figure out some nice way to work around that FTTB. BTW, just for future reference, where is the code that does all that casting / comparison you just described? I might want to poke around there afterwards.

@gfyoung
Contributor Author

gfyoung commented Jan 6, 2016

Actually, it occurs to me now that this is a bug in randint as currently implemented in master, as demonstrated below:

>>> import numpy as np
>>> low = np.int64(np.iinfo(np.int64).max)
>>> high = np.uint64(np.iinfo(np.int64).max + 1)
>>> np.random.randint(low, high)
Traceback (most recent call last):
...
ValueError: low >= high
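Continuing the session above, one possible workaround: Python ints are arbitrary precision, so converting the bounds before comparing sidesteps the lossy float64 promotion:

>>> int(low) < int(high)   # exact comparison, no float64 promotion
True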

@njsmith
Member

njsmith commented Jan 6, 2016

array > array dispatches to the ndarray comparison operator, which in turn calls the ufunc np.greater. The casting logic is then the standard casting logic that's built into the generic ufunc machinery, somewhat scattered around the umath/ directory.
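That is, the comparison above is literally the np.greater ufunc, and its promotion can be inspected directly:

>>> import numpy as np
>>> a = np.array([2**63 - 1], dtype=np.int64)
>>> b = np.array([2**63], dtype=np.uint64)
>>> np.greater(b, a)          # same operation as b > a
array([False])
>>> np.result_type(a, b)      # the promotion behind the surprise
dtype('float64')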


@gfyoung
Contributor Author

gfyoung commented Jan 6, 2016

@njsmith : Thanks! Has this complication been brought up before? Otherwise, I'll go log it as a potential issue.

@njsmith
Member

njsmith commented Jan 6, 2016

It's certainly been discussed on and off, but possibly only in passing in situations like this... I don't know if there's a central issue for it. It's one of those squirrelly things where it's not even clear whether or not there is a bug, or what the next step is.


multi = <broadcast>PyArray_MultiIterNew(3, <void *>array, <void *>lo, <void *>hi)
if (multi.size != PyArray_SIZE(array)):
    raise ValueError("size is not compatible with inputs")
with nogil:
Member

I think you can avoid duplicating this whole block of code, and put it after the if-else block, simply by making array the third, instead of the first, argument to PyArray_MultiIterNew.

If we modified, which we should, PyArray_MultiIterNew to also take multi-iters in, like np.broadcast does, it may even be possible to make this a little nicer, but that's a whole different story...
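A Python-level sketch of that restructuring (names lo, hi, out are hypothetical, and np.broadcast stands in for PyArray_MultiIterNew): create the output in either branch, then do one broadcast after the if/else with the output as the last argument:

import numpy as np

lo, hi = np.array([1, 2, 3]), np.array(10)
size = None  # or an explicit output shape

if size is None:
    out = np.empty(np.broadcast(lo, hi).shape, dtype=np.bool_)
else:
    out = np.empty(size, dtype=np.bool_)

# one broadcast covers both paths; the Cython code would call
# PyArray_MultiIterNew(3, lo, hi, out) here instead
multi = np.broadcast(lo, hi, out)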

Contributor Author

I'll give that a shot once I can get a working implementation down!

@jaimefrio
Member

I'm also +1 on investigating ways of merging the scalar and array versions into a single one. Having 8 almost identical functions is already a big enough maintainability problem without making it twice as big.

@gfyoung
Contributor Author

gfyoung commented Jan 6, 2016

@everyone: Indeed, I am totally in favor of condensing this code. Once I can get a working implementation down, that's certainly the next step for this PR!

@gfyoung
Contributor Author

gfyoung commented Jan 8, 2016

Okay, tests are passing now! I had to do a major overhaul to condense the code and get it to work properly. @jaimefrio : your suggestion to use np.broadcast actually fixed the weird corner cases the C API for some reason could not handle. In any case, I'm going to do some further condensing later, but it would be great if I could get some more feedback on what has been done now that Travis and Appveyor are happy.

@jaimefrio
Member

It seems your changes to the private functions' signature broke a ton of code that was accessing them directly.

@rkern
Member

rkern commented Jan 8, 2016

I don't think anything outside of numpy.random is accessing them directly. It's just randint().

@gfyoung
Contributor Author

gfyoung commented Jan 8, 2016

Whoops. I forgot to put is_scalar in randint's call to rand_func. Not sure if that will fix all the test failures though. In any case, at least it's a separate commit, so if it all blows up, I'll have something to fall back on. Not sure how necessary the is_scalar parameter is. I could easily determine it within each rand_func, though it seems redundant in light of what I do in randint.

@gfyoung
Contributor Author

gfyoung commented Jan 8, 2016

Massive segfaulting on Travis - resetting to try again

@gfyoung
Contributor Author

gfyoung commented Jan 9, 2016

So I've tried condensing the code, but whatever I've done, it's certainly slowed down Travis. At least no seg-faulting though! Besides perhaps reverting to C API calls OR creating a fast-track for the purely scalar inputs, does anyone have suggestions for speeding the code up?

@gfyoung
Contributor Author

gfyoung commented Jan 9, 2016

Also, my changes have broken several tests. Suggestions about where my code is going wrong are also welcome (I'm not quite sure what is going wrong at the moment), especially with regards to the repeatability test @charris wrote.

@gfyoung
Contributor Author

gfyoung commented Jan 9, 2016

I just ran some timeit tests, and it does seem like the fast-track (@njsmith) is indeed much faster. I ran python -m timeit -n 100 "import numpy as np;np.random.randint(0, 20, size=1000000)" and got the following:

master branch:
100 loops, best of 3: 27.4 msec per loop

My branch (with scalar / array_like computations all condensed into one method):
100 loops, best of 3: 356 msec per loop

Thus, I think I will roll back my most recent commit but port over the changes I made to the documentation as well as the tests, if that's okay with everyone.

@jaimefrio
Member

I seem to recall this was discussed somewhere else, but cannot find exactly where... Say I have a low array of shape (3,), and a high array of shape (4,), and I want to get 5 samples from each of the 3 * 4 = 12 possible combinations. Should I call it like this:

np.random.randint(low[:, np.newaxis], high, (5,)) # 5 instead of (5,) should also work

Or do I have to do

np.random.randint(low[:, np.newaxis], high, (3, 4, 5))
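For the record, with the size semantics discussed above (as eventually implemented in a recent numpy), neither spelling works as written; the number of samples has to extend the broadcast shape on the left:

>>> import numpy as np
>>> low = np.arange(3)         # shape (3,)
>>> high = np.arange(10, 14)   # shape (4,)
>>> np.random.randint(low[:, np.newaxis], high, size=(5, 3, 4)).shape
(5, 3, 4)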

@homu
Contributor

homu commented Apr 29, 2017

☔ The latest upstream changes (presumably #9015) made this pull request unmergeable. Please resolve the merge conflicts.

@charris
Member

charris commented May 7, 2017

I think this will work as is, but don't want to put it in without modification. The reason is that I think it can be made more efficient for the broadcasting case, and, once it goes in, we will need to keep it pretty much the same in order to maintain backward compatibility of the generated sequences. The improvements I have in mind are:

  1. Treat the case where low and high have size one as the scalar case, as long as the shapes are compatible. That would be more efficient for the array case and also ensure that the generated sequence would be the same in those cases, probably the result of least surprise.
  2. Put the loop for the non-scalar case into the called C function. I think that will require new rk_* functions. In the true broadcasting case, we might also want to pass the masks as well as the rng and off so that the masks do not need regeneration each time through the inner loop, although I suppose that could be done inside the function itself if storage for the masks is passed. One could then loop through the inner rng loop until the output is filled. Note that, because random integers are acquired from the rng in 32 bit chunks with left over bits discarded, putting the loop in the called function instead of using multiple function calls will change the generated sequence.
  3. The generated sequences should be checked in a few broadcasting cases to help ensure that they remain the same (see the sketch after this list).
  4. The lack of an easy full range option is a bit bothersome. Not sure what to do about that, either live with it or what.
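A minimal sketch of the stream-stability check suggested in point 3, assuming the broadcasting behavior this PR adds; in a real test the reference draws would be captured once and hard-coded:

import numpy as np

def sample():
    np.random.seed(1234)
    return np.random.randint([1, 1, 1], [10, 20, 30], size=(2, 3))

# Capture the reference output once; any later change to how the rng is
# consumed in the broadcasting path will make this comparison fail.
expected = sample()
assert np.array_equal(sample(), expected)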

I also have a quick question: what determines the integer type, the dtype or the low/high type?

@gfyoung
Contributor Author

gfyoung commented May 7, 2017

@charris : To answer your question about what determines the integer type: it's dtype (you implemented that yourself IINM). As to the other questions, I'll address your points one at a time:

  1. I'm not sure I follow you here. Could you clarify what you mean by this?

  2. Okay, I think I understand. I can take a stab at transferring that for-loop code into C.

  3. I'm not sure I fully understand this statement. Could you clarify what you're suggesting? It seems like you're suggesting an additional test in which we check the consistency of numbers generated under different broadcasting conditions, but I'm not sure.

  4. I in fact actually made an attempt to do this back in ENH: Make 'low' optional in randint #7151 but received major push-back for it.

@eric-wieser
Member

eric-wieser commented May 7, 2017

I in fact actually made an attempt to do this back in #7151 but received major push-back for it.

That PR is referring to [intmin, intmax), whereas I think @charris is concerned about [intmin, intmax] or [intmin, intmax+1), which you address for only the scalar case in #8846. There doesn't seem to be an easy way to extend this to the broadcast case.

@gfyoung
Copy link
Contributor Author

gfyoung commented May 7, 2017

@eric-wieser : Fair enough, but if I were to modify the PR to just set highbnd = intmax + 1, the patch is essentially the same. The question is whether or not such a change is desirable.

@charris charris modified the milestones: 1.14.0 release, 1.13.0 release May 7, 2017
@charris
Member

charris commented May 7, 2017

@gfyoung For the size one case, I think low=[1], high=[3] should produce the same sequence as low=1, high=3 when the output shape is the same and compatible. For the scalar case the loop filling the output is in the rk_* function, which will produce a different sequence than if the rk_* function is called multiple times to fill the output array.

@gfyoung
Contributor Author

gfyoung commented May 7, 2017

Okay...but we do want to differentiate how we return the data (scalar or array-like), which is why we treat [1] and [3] differently in the first place. I suppose then we would have to check if we had array-like inputs to determine the return type?

Also, I'm not sure I fully follow your concern here about sequence generation:

>>> import numpy as np
>>> randint = np.random.randint
>>>
>>> np.random.seed(123456789)
>>> for _ in range(10):
...     print(randint(1, 500))
185
157
307
474
347
286
493
264
251
229
>>>
>>> np.random.seed(123456789)
>>> for _ in range(10):
...     print(randint([1], [500]))
array([185])
array([157])
array([307])
array([474])
array([347])
array([286])
array([493])
array([264])
array([251])
array([229])

@eric-wieser
Member

I'm not sure I fully follow your concern here about sequence generation:

Can you try that with other dtypes?

@gfyoung
Contributor Author

gfyoung commented May 7, 2017

@eric-wieser : See this script I wrote: generate.txt

Feel free to change LOW, HIGH, and DTYPE as you please. I didn't see any differences.

@charris
Member

charris commented May 7, 2017

The options I see for dealing with full range bounds are to add a closed option or to require the user to use a type of sufficient range.

@gfyoung
Contributor Author

gfyoung commented May 7, 2017

The options I see for dealing with full range bounds are to add a closed option or to require the user to use a type of sufficient range.

Is there any reason why just specifying dtype (with no bounds) can't allow us to generate the full range? It's not necessary to add closed because you can explicitly pass in the full range [intmin, intmax + 1), and this just adds unnecessary args to the signature IMO.
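Concretely, the explicit full-range spelling already works today, since high is exclusive; the question is whether a bare dtype should imply it:

>>> import numpy as np
>>> info = np.iinfo(np.int8)
>>> np.random.randint(info.min, info.max + 1, size=5, dtype=np.int8).dtype
dtype('int8')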

@charris
Member

charris commented May 7, 2017

You need to use a smaller dtype. To illustrate the difference between single and multiple calls:

In [1]: randint = np.random.randint

In [2]: np.random.seed(123456789)

In [3]: a = randint(1, 20, 10, dtype=np.int8)

In [4]: np.random.seed(123456789)

In [5]: b = np.array([randint(1, 20, dtype=np.int8) for _ in range(10)])

In [6]: a
Out[6]: array([ 8,  9, 15, 19, 14,  9,  4, 18, 19,  3], dtype=int8)

In [7]: b
Out[7]: array([ 8, 15, 19,  4, 18,  4, 13,  8, 15,  5], dtype=int8)

Note that only the first elements agree, because both cases start with the same 32 bit int, but in the first case each 32 bit int supplies 4 bytes, whereas in the second case each 32 bit int supplies only one.

@gfyoung
Contributor Author

gfyoung commented May 7, 2017

@charris : What you're showing is not the same invocation as before (in fact, you can see this on master IINM). Perhaps this for-loop inconsistency should be patched before this PR?

@charris
Member

charris commented May 7, 2017

Is there any reason why just specifying dtype (with no bounds) can't allow us to generate the full range?

That would work for the scalar case, but not be backwards compatible. It is the broadcast case that makes things difficult because of the conversion to arrays.

@gfyoung
Contributor Author

gfyoung commented May 7, 2017

That would work for the scalar case, but not be backwards compatible.

I'm not sure I follow why that is the case. We're allowing users to not have to pass in a parameter that used to be required.

@homu
Contributor

homu commented May 9, 2017

☔ The latest upstream changes (presumably #9026) made this pull request unmergeable. Please resolve the merge conflicts.

@homu
Contributor

homu commented May 18, 2017

☔ The latest upstream changes (presumably #9106) made this pull request unmergeable. Please resolve the merge conflicts.

Adds functionality for randint to broadcast
arguments, regardless of dtype. Also automates
randint helper generation to make this section
of the codebase more manageable.

Closes gh-6745.
@gfyoung
Contributor Author

gfyoung commented Jul 19, 2017

I was hoping to be able to return to this at some point, but I've gotten too caught up in other (repository) work. Closing for now. If others would like to provide assistance to push this through or would like to pick it up themselves, they are more than welcome!

For reference, here is the comment from which anyone should start to see where the PR needs to go.

@gfyoung gfyoung closed this Jul 19, 2017
@gfyoung gfyoung deleted the rand_int_arg_broadcast branch July 19, 2017 08:16
bashtage added a commit to bashtage/numpy that referenced this pull request Aug 31, 2017
Remove old rand int helpers
Add tests for array versions
Remove mask generator

xref numpy#6938
xref numpy#6902
xref numpy#6745