-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
BUG: random: Avoid bad behavior of hypergeometric with very large arguments. #11475
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…uments. The changes in this pull request restrict the arguments ngood and nbad of numpy.random.hypergeometric to be less than 2**30. This "fix" (in the sense of the old joke "Doctor, it hurts when I do *this*") makes the following problems impossible to encounter: * Python hangs with these calls (see numpygh-11443): np.random.hypergeometric(2**62-1, 2**62-1, 26, size=2) np.random.hypergeometric(2**62-1, 2**62-1, 11, size=12) * Also reported in numpygh-11443: the distribution of the variates of np.random.hypergeometric(ngood, nbad, nsample) is not correct when nsample >= 10 and ngood + nbad is sufficiently large. I don't have a rigorous bound, but a histogram of the p-values of 1500 runs of a G-test with size=5000000 in each run starts to look suspicious when ngood + nbad is approximately 2**37. When the sum is greater than 2**38, the distribution is clearly wrong, and, as reported in the pull request numpygh-9834, the following call returns all zeros: np.random.hypergeometric(2**55, 2**55, 10, size=20) * The code does not check for integer overflow of ngood + nbad, so the following call np.random.hypergeometric(2**62, 2**62, 1) on a system where the default integer is 64 bit results in `ValueError: ngood + nbad < nsample`. By restricting the two arguments to be less than 2**30, the values are well within the space of inputs for which I have no evidence of the function generating an incorrect distribution, and the overflow problem is avoided for both 32-bit and 64-bit systems. I replaced the test for hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2) > 0 with hypegeometric(2**30 - 1, 2**30 - 1, 2**30 - 1) > 0. The test was a regression test for a bug in which the function would enter an infinite loop when the arguments were sufficiently large. The regression test for the fix included the call hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2) on systems with 64 bit integers. This call is now disallowed, so I add a call with the maximum allowed values of ngood and nbad.
Here are a couple plots showing that the function |
The failure looks like it might be related, if rare. I think this would be fine according to our current policy, but I might suggest making the appropriate modifications over in randomgen's new |
It looks like the new test somehow occasionally puts the code into an infinite loop. I'll see if I can reproduce that behavior.
In the post-NEP-19 world, this change should be part of the legacy A side effect of working on #8056 is that I am rewriting the code underlying |
I don't see a problem with restricting the parameter values if the results are clearly wrong for some values. |
In a post-NEP-19 world, we do not change |
@rkern Perhaps a warning then? Think of it as motivation for moving to the new generators. If the NEP doesn't lay out how to handle new found bugs in the old implementation, then we should add that. |
Perfectly happy with a warning. I don't particularly object to an error in this case, either, just wanted to raise the consideration. |
@WarrenWeckesser any chance you could port these to @mattip randomgen branch? This version is happy to accept improvements, even if they break the stream. |
@WarrenWeckesser That's right. That sounds fine. I think it is a good idea to get things in early (in particular before a release) so that the old way is never released. |
@bashtage, the changes I have in mind for the hypergeometric distribution are much more substantial than what is in this pull request. I implemented a new version of the (univariate) hypergeometric distribution as part of the multivariate hypergeometric distribution proposed in #8056, but it is not exposed as part of the public API in that PR (couldn't change the stream). It might make more sense to not try to get the extensive rewrite of the hypergeometric distribution into @mattip's branch, because (I think) reviewing the changes will be nontrivial, and it looks like there are more fundamental issues still to be resolved before the randomgen branch is merged into numpy. But I'm willing to try it if you and mattip prefer doing it that way. I do agree that it would be best to get the new implementation in before the next release of numpy, to minimize the changes in behavior between releases. |
If you think porting it is easy, then there is no need to get it ready. If not, then it might help it is was ready to submit a PR against the randomgen branch once it is merged. |
@WarrenWeckesser now (before 1.17 in particular) would be the ideal time to make these improvements to |
@bashtage, I will get back to this later this week. The changes I have in mind are to clean up the the hypergeometric distribution and add the multivariate hypergeometric distribution. It's great to see the new randomgen code finally merged! |
The new implementation of |
As far as I know legacy is no change, no bug fix.
…On Sat, Jun 15, 2019, 21:48 Warren Weckesser ***@***.***> wrote:
Closed #11475 <#11475>.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#11475?email_source=notifications&email_token=ABKTSRMLVVM3LKZP7D53WQTP2VITFA5CNFSM4FH23FNKYY3PNVWWK3TUL52HS4DFWZEXG43VMVCXMZLOORHG65DJMZUWGYLUNFXW5KTDN5WW2ZLOORPWSZGOR75OZ6I#event-2415586553>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABKTSROGT66CENXT4FCI4PLP2VITFANCNFSM4FH23FNA>
.
|
The changes in this pull request restrict the arguments ngood and nbad
of numpy.random.hypergeometric to be less than 2**30. This "fix" (in the
sense of the old joke "Doctor, it hurts when I do this") makes the
following problems impossible to encounter:
Python hangs with these calls (see BUG: random: Problems with hypergeometric with ridiculously large arguments. #11443):
Also reported in BUG: random: Problems with hypergeometric with ridiculously large arguments. #11443: the distribution of the variates of
is not correct when nsample >= 10 and ngood + nbad is sufficiently large.
I don't have a rigorous bound, but a histogram of the p-values of 1500
runs of a G-test with size=5000000 in each run starts to look suspicious
when ngood + nbad is approximately
2**37
. When the sum is greater than2**38
, the distribution is clearly wrong, and, as reported in the pullrequest WIP/BUG: Replace the C function rk_hypergeometric_hyp() #9834, the following call returns all zeros:
The code does not check for integer overflow of ngood + nbad, so the
following call
on a system where the default integer is 64 bit results in
ValueError: ngood + nbad < nsample
.By restricting the two arguments to be less than
2**30
, the values arewell within the space of inputs for which I have no evidence of the
function generating an incorrect distribution, and the overflow problem
is avoided for both 32-bit and 64-bit systems.
I replaced the test for
hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2) > 0
with
hypergeometric(2**30 - 1, 2**30 - 1, 2**30 - 1) > 0
. The test was aregression test for a bug in which the function would enter an infinite
loop when the arguments were sufficiently large. The regression test for
the fix included the call
hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2)
on systems with 64 bit integers. This call is now disallowed, so I add a
call with the maximum allowed values of ngood and nbad.