ENH: Allow Randint to Broadcast Arguments #6938
Conversation
No idea, sorry :-( Very few devs use Windows; there's a reasonable chance that you're the only one watching this PR who does. You might try the mailing list; that has a broader readership...
No. You should think about the signature as being
numpy/random/mtrand/mtrand.pyx (outdated):

        test = int(high)
        high_array = np.array([high])
    except TypeError:
        high_array = np.array(high)
I think these should just be high_array = np.array(high), and similarly for low -- wrapping integers into a list before passing to array is neither necessary nor desirable. (I guess in this case it doesn't actually cause any harm, because we already special-cased the scalar/scalar case above, so we know that on this path at least one of the inputs has >1 dimension, and an array with shape (1,) will broadcast against any array with >1 dimension. But if we just call high = np.array(high), then not only is the code simpler, but this branch becomes correct in general, even if unprotected by the scalar checks above.)
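(A quick illustration of the broadcasting point, not from the patch itself -- np.broadcast just reports the broadcast shape:)

    import numpy as np

    low = np.zeros((2, 3), dtype=np.int64)
    print(np.broadcast(low, np.array(7)).shape)            # (2, 3): a 0-d array broadcasts with anything
    print(np.broadcast(low, np.array([7])).shape)          # (2, 3): shape (1,) also happens to work here
    print(np.broadcast(np.array(7), np.array(9)).shape)    # (): the bare form stays correct for two scalars
    print(np.broadcast(np.array([7]), np.array(9)).shape)  # (1,): the wrapped form adds a spurious axis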
Ah, that is indeed true. I was a little wary myself about these inner try...except blocks. Thanks for pointing that out. Will make things a lot simpler! :)
In fact it would probably be a good idea to benchmark the code with and without the scalar special case; I guess it might well be worth it in this case, but in general I am wary of such fast-paths because they add a lot of code and maintenance complexity and don't always help as much as you'd think...
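(A rough sketch of how such a benchmark could look; it assumes a build with this PR applied so that randint accepts array arguments, and the numbers will of course vary:)

    import timeit

    setup = "import numpy as np; rs = np.random.RandomState(12345)"

    # Scalar bounds would hit the proposed fast path.
    t_scalar = timeit.timeit("rs.randint(0, 1000)", setup=setup, number=100000)

    # 0-d array bounds force the broadcasting path for the same logical work.
    t_array = timeit.timeit("rs.randint(np.array(0), np.array(1000))",
                            setup=setup, number=100000)

    print("scalar args: %.3fs, array args: %.3fs" % (t_scalar, t_array))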
@njsmith : Regarding the
True, I did add a ton of code that is quite similar to what was already written in #6910. However, I was hoping to try to keep most of those changes intact FTTB. Also, it seemed consistent with what other functions did like
numpy/random/mtrand/mtrand.pyx (outdated):

    else:
        array = <ndarray>np.empty(size, np.bool_)
        array_data = <npy_bool *>PyArray_DATA(array)
        multi = <broadcast>PyArray_MultiIterNew(3, <void *>array, <void *>lo, <void *>hi)
Huh, this is not at all how I expected the size argument to interact with the parameter arrays. I expected that if the parameter arrays broadcast to (2, 3), and size=(4, 5), you'd get an output with shape (2, 3, 4, 5).
If this is how random size arguments work in general then I guess there's nothing to be done, but wanted to double check...
Oh, interesting. I had not considered that behavior. I was actually trying to follow in the footsteps of RandomState.hypergeometric.
Sorry, never mind about this -- now that I'm at a proper keyboard and can play around, I see that all the random methods use this (IMO suboptimal) interpretation of size=, so randint should do the same for consistency.

(What I find particularly confusing is that it means that size=None and size=() are totally different. OTOH if size acted 'outside' of the core sampling operation, so that the parameters were broadcast against each other to define a single multidimensional "sample", and then size specified how many of these samples to take and tile together, then the default would just be size=(). Also it would be easier to work with. But obviously the advantages here are too small to be worth breaking backcompat or consistency over. It might be worth going through and adding a new kwarg like repeat= that has the desired semantics without breaking compatibility, but it's obviously not super urgent and would be a different PR in any case.)
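(To make the existing convention concrete, here is roughly how size= behaves for a distribution whose parameters already broadcast, e.g. normal; treat this as a sketch of current behavior rather than part of the patch:)

    import numpy as np

    loc = np.zeros((2, 3))
    print(np.random.normal(loc).shape)                  # (2, 3): size=None means "the broadcast shape"
    print(np.random.normal(loc, size=(4, 2, 3)).shape)  # (4, 2, 3): size must contain the broadcast shape
    # np.random.normal(loc, size=(2, 3, 4, 5))          # ValueError, not the composed (2, 3, 4, 5) shape
    print(np.random.normal(0.0, size=()).shape)         # (): a 0-d array, whereas size=None returns a float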
Right, so, it may well be that the way you've written this is the best way, even with the code duplication between the scalar and array versions. But what I'm saying is that we should make that decision based on whether or not the code duplication actually gets us some valuable benefit. Probably "it's substantively faster in important cases" -- that's the usual one. OTOH "it lets us avoid touching existing code" and "it makes the code more similar to existing code" aren't very good reasons IMHO. I'm just saying we should check our assumptions while implementing things.
@njsmith : Fair enough. I'll try to do some benchmarking once I can get these tests fixed. @everyone : Could someone explain why / how I get the following output? This is why I'm getting test failures at the moment:
FYI, 64-bit Ubuntu VM with Python 2.7.11 (using that FTTB until I can get my Windows issue resolved).
Yeah, when you do uint64 > int64, it's trying to cast both to a common type.
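(For anyone hitting the same failure, a small demonstration of the promotion being described; this reflects the NumPy behavior under discussion here and is worth re-checking on newer releases:)

    import numpy as np

    print(np.promote_types(np.uint64, np.int64))   # float64: the "common" type used for the comparison

    hi = np.array([2**63], dtype=np.uint64)        # 9223372036854775808
    lo = np.array([2**63 - 1], dtype=np.int64)     # 9223372036854775807
    print(hi > lo)                                 # [False]: both values round to the same float64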
Oh, okay. Good to know. I'll try to figure out some nice way to work around that FTTB. BTW, just for future reference, where is the code that does all that casting / comparison you just described? I might want to poke around there afterwards.
Actually, it occurs to me now that that is a bug in
array > array dispatches to the ndarray comparison operator, which in turn
@njsmith : Thanks! Has this complication been brought up before? Otherwise, I'll go log it as a potential issue.
It's certainly been discussed on and off, but possibly only in passing in
numpy/random/mtrand/mtrand.pyx (outdated):

        multi = <broadcast>PyArray_MultiIterNew(3, <void *>array, <void *>lo, <void *>hi)
        if (multi.size != PyArray_SIZE(array)):
            raise ValueError("size is not compatible with inputs")
        with nogil:
I think you can avoid duplicating this whole block of code, and put it after the if-else block, simply by making array the third, instead of the first, argument to PyArray_MultiIterNew.

If we modified PyArray_MultiIterNew (which we should) to also take multi-iters in, like np.broadcast does, it may even be possible to make this a little nicer, but that's a whole different story...
I'll give that a shot once I can get a working implementation down!
I'm also +1 on investigating ways of merging the scalar and array versions into a single one. Having 8 almost identical functions is already a big enough maintainability problem without making it twice as big.
@everyone: Indeed, I am totally in favor of condensing this code. Once I can get a working implementation down, that's certainly the next step for this PR!
Okay, tests are passing now! I had to do some major overhaul to condense the code and get it to work properly. @jaimefrio : your suggestion to use
It seems your changes to the private functions' signatures broke a ton of code that was accessing them directly.
I don't think anything outside of
Whoops. I forgot to put
Massive segfaulting on Travis - resetting to try again
So I've tried condensing the code, but whatever I've done, it's certainly slowed down Travis. At least no segfaulting though! Besides perhaps reverting to C API calls or creating a fast track for the purely scalar inputs, does anyone have suggestions for speeding the code up?
Also, my changes have broken several tests. Suggestions about where my code is going wrong are also welcome (I'm not quite sure what is going wrong at the moment), especially with regard to the repeatability test @charris wrote.
I just ran some benchmarks, and my branch (with scalar / array_like computations all condensed into one method) is slower. Thus, I think I will roll back my most recent commit but port over the changes I made to the documentation as well as the tests, if that's okay with everyone.
I seem to recall this was discussed somewhere else, but cannot find exactly where... Say I have

    np.random.randint(low[:, np.newaxis], high, (5,))  # 5 instead of (5,) should also work

Or do I have to do

    np.random.randint(low[:, np.newaxis], high, (3, 4, 5))
☔ The latest upstream changes (presumably #9015) made this pull request unmergeable. Please resolve the merge conflicts.
I think this will work as is, but don't want to put it in without modification. The reason is that I think it can be made more efficient for the broadcasting case, and, once it goes in, we will need to keep it pretty much the same in order to maintain backward compatibility of the generated sequences. The improvements I have in mind are:
I also have a quick question: what determines the integer type, the
@charris : To answer your question about what determines the integer type, it's
That PR is referring to
@eric-wieser : Fair enough, but if I were to modify the PR to just set
@gfyoung For the size one case, I think
Okay...but we do want to differentiate how we return the data (scalar or array-like), which is why we treat

Also, I'm not sure I fully follow your concern here about sequence generation:

>>> import numpy as np
>>> randint = np.random.randint
>>>
>>> np.random.seed(123456789)
>>> for _ in range(10):
... print(randint(1, 500))
185
157
307
474
347
286
493
264
251
229
>>>
>>> np.random.seed(123456789)
>>> for _ in range(10):
... print(randint([1], [500]))
array([185])
array([157])
array([307])
array([474])
array([347])
array([286])
array([493])
array([264])
array([251])
array([229])
Can you try that with other dtypes?
@eric-wieser : See this script I wrote: generate.txt. Feel free to change
The options I see for dealing with full range bounds are to add a
Is there any reason why just specifying
You need to use a smaller dtype. To illustrate the difference between single and multiple calls:
Note that only the first elements agree, because both cases start with the same 32-bit int, but in the first case each 32-bit int supplies 4 bytes, whereas in the second case each 32-bit int supplies only one.
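(A hedged way to reproduce that comparison with a small dtype; which elements actually agree depends on how the implementation buffers the 32-bit draws:)

    import numpy as np

    np.random.seed(0)
    single = np.random.randint(0, 200, size=8, dtype=np.uint8)        # one vectorized call

    np.random.seed(0)
    multiple = np.array([np.random.randint(0, 200, dtype=np.uint8)
                         for _ in range(8)])                          # eight scalar calls

    print(single)
    print(multiple)
    # If the vectorized path packs several uint8 draws into each 32-bit output of
    # the underlying generator while each scalar call starts from a fresh 32-bit
    # draw, only the leading element(s) of the two results will agree.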
@charris : What you're showing is not the same invocation as before (in fact, you can see this on
That would work for the scalar case, but not be backwards compatible. It is the broadcast case that makes things difficult because of the conversion to arrays.
I'm not sure I follow why that is the case. We're allowing users to not have to pass in a parameter that used to be required.
☔ The latest upstream changes (presumably #9026) made this pull request unmergeable. Please resolve the merge conflicts.
☔ The latest upstream changes (presumably #9106) made this pull request unmergeable. Please resolve the merge conflicts.
Adds functionality for randint to broadcast arguments, regardless of dtype. Also automates randint helper generation to make this section of the codebase more manageable. Closes gh-6745.
I was hoping to be able to return to this at some point, but I've gotten too caught up in other (repository) work. Closing for now. If others would like to provide assistance to push this through or would like to pick it up themselves, they are more than welcome! For reference, here is the comment from which anyone should start to see where the PR needs to go.
Remove old randint helpers. Add tests for array versions. Remove mask generator. xref numpy#6938, xref numpy#6902, xref numpy#6745
#6902, take two. Addresses issue #6745.