-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
ENH: random: Add multivariate_hypergeometric function. #8056
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
A description of the distribution on wikipedia: https://en.wikipedia.org/wiki/Hypergeometric_distribution#Multivariate_hypergeometric_distribution |
We usually have a check for repeatability. Here, with the two different implementations, we need at least two. It isn't possible to try all the different color combinations, but maybe pick something simple and a long sequence. I used a hash of the sequences for some checks, but attention needs to be paid to little/big endian so that veiwed as bytes the sequences are the same. See line 179 in |
For small sample lengths (< 256) you could just convert the output to |
numpy/random/mtrand/mtrand.pyx
Outdated
ncolors = len(colors) | ||
shape = _shape_from_size(size, ncolors) | ||
samples = np.zeros(shape, dtype=np.int) | ||
samples_ptr = <long*> PyArray_DATA(samples) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think np.int_
is the approved type for long
.
This will likely have conflicts with #6938, which I want to merge first. EDIT: I also want to make a closer check of the details, current comments are first look. |
No problem. And thanks for the prompt "first look"! |
I note that |
In an inline comment, @charris asked about the rule used to choose between the two algorithms. With
I found that algorithm 1 (iterated calls to I also experimented with even larger values of I used only one computer for this experimentation, so I don't know how the result would be affected by using a different system. I can reexpose the individual algorithms if anyone wants to experiment further. I guess we could expose both algorithms permanently, with, say, a |
7544896
to
3b6a49a
Compare
I pushed an update that addresses most of the comments so far. I left the two algorithms as (private) python functions. If anyone wants to experiment with the performance of the two algorithms, they are available as |
Just an additional data point for the tuning (from a moderately aged Callisto processor running Python 3.5 on Ubuntu 16.04 with ATLAS). I used your script to gather the data and this bit to process it. For this data, the heuristic is correct in 96% of cases, that's not bad, though by changing 12 to 19 you can get close to 99%. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the nitpicks, feel free to ignore them, I haven't found anything more significant in the whole thing :-)
@wrwrwr All nits worth picking. :) I'll fix them when I update again. Thanks for testing the heuristic. I'll experiment some more with 12 vs. 19 (or some other value) before updating the code. |
Another bit. Couldn't we speed up cases with the number of samples close to the total number of items using symmetry (possibly just by adding a flag and some subtractions to algorithm 2)? For example, |
☔ The latest upstream changes (presumably #8219) made this pull request unmergeable. Please resolve the merge conflicts. |
3b6a49a
to
87c1dab
Compare
That's fine. I have ended up rewriting the hypergeometric distribution functions in private functions to be used in |
c444d9a
to
e0c7ea1
Compare
I have updated the code with a faster version of the "marginals" method. This should not be merged yet. I am investigating the behavior of the method when it is given extremely large arguments. |
87c0508
to
3c303b7
Compare
The test failure on "Shippable" appears to be a bug that is not directly related to this pull request (thought possibly triggered because not all the tests set a random seed). The test that is failing is
Those |
3c303b7
to
2a7a561
Compare
I made a lot of changes since the last review, and the branch for 1.16 is imminent, so I won't be surprised if this gets pushed off to the next release. @charris, just letting you know that you don't have to feel guilty when you do it. 😃 This pull request includes a new C implementation of the (univariate) hypergeometric distribution that is used in the "marginals" method of the multivariate hypergeometric distribution. I went back to Stadlober's thesis and paper to better understand what was going on, and I added a lot more documentation in the function I have limited the "marginals" method to The loss of precision could be alleviated by reformulating the calculation of
that computes With the latest faster version of the
|
@WarrenWeckesser So are you happy with this in its current state? |
@charris, I just noticed that the code in the function |
Pushed off to 1.17. Let me know when you think this is ready. |
No problem. I ran some tests for ranges of the inputs that I hadn't tried before, and I noticed some bias in the "big" tests that I've been running. Now I need to determine if that issue is inherent in the HRUA algorithm, or if it is a limitation of (or bug in) my test code. |
@bashtage is this something we should consider before the 1.17 release, or can it be added afterward? |
Is there a target date for the release of 1.17? |
As must this PR is only about multivariate_hypergeometric, it could wait until later. However, I think @WarrenWeckesser is also planning on improving hypergeometric to work better in edge cases. Any changes to hypergeometric really should get in before 1.17 since these will change the stream (and so might require a warning cycle). |
With randomgen now merged, it makes sense to break this pull request into two parts. The first part is the rewrite of the univariate hypergeometric code. That is now in #13761. The second part will be to add the new multivariate hypergeometric distribution, but that might have to wait until 1.18. |
@WarrenWeckesser Closing this then. |
I assume there will be a new PR with the second part. |
@charris wrote
Yes. If the branching of 1.17 is postponed because of the ongoing discussion of the new randomgen code, it might be possible to get |
Add a function that draws from a multivariate hypergeometric distribution.