BUG: random: Avoid bad behavior of hypergeometric with very large arguments. #11475

WarrenWeckesser · 2018-07-02T05:37:15Z

The changes in this pull request restrict the arguments ngood and nbad
of numpy.random.hypergeometric to be less than 2**30. This "fix" (in the
sense of the old joke "Doctor, it hurts when I do this") makes the
following problems impossible to encounter:

Python hangs with these calls (see BUG: random: Problems with hypergeometric with ridiculously large arguments. #11443):
```
np.random.hypergeometric(2**62-1, 2**62-1, 26, size=2)
np.random.hypergeometric(2**62-1, 2**62-1, 11, size=12)
```
Also reported in BUG: random: Problems with hypergeometric with ridiculously large arguments. #11443: the distribution of the variates of
```
np.random.hypergeometric(ngood, nbad, nsample)
```
is not correct when nsample >= 10 and ngood + nbad is sufficiently large.
I don't have a rigorous bound, but a histogram of the p-values of 1500
runs of a G-test with size=5000000 in each run starts to look suspicious
when ngood + nbad is approximately 2**37. When the sum is greater than
2**38, the distribution is clearly wrong, and, as reported in the pull
request WIP/BUG: Replace the C function rk_hypergeometric_hyp() #9834, the following call returns all zeros:
```
np.random.hypergeometric(2**55, 2**55, 10, size=20)
```
The code does not check for integer overflow of ngood + nbad, so the
following call
```
np.random.hypergeometric(2**62, 2**62, 1)
```
on a system where the default integer is 64 bit results in
ValueError: ngood + nbad < nsample.

By restricting the two arguments to be less than 2**30, the values are
well within the space of inputs for which I have no evidence of the
function generating an incorrect distribution, and the overflow problem
is avoided for both 32-bit and 64-bit systems.

I replaced the test for hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2) > 0
with hypergeometric(2**30 - 1, 2**30 - 1, 2**30 - 1) > 0. The test was a
regression test for a bug in which the function would enter an infinite
loop when the arguments were sufficiently large. The regression test for
the fix included the call hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2)
on systems with 64 bit integers. This call is now disallowed, so I add a
call with the maximum allowed values of ngood and nbad.

…uments. The changes in this pull request restrict the arguments ngood and nbad of numpy.random.hypergeometric to be less than 2**30. This "fix" (in the sense of the old joke "Doctor, it hurts when I do *this*") makes the following problems impossible to encounter: * Python hangs with these calls (see numpygh-11443): np.random.hypergeometric(2**62-1, 2**62-1, 26, size=2) np.random.hypergeometric(2**62-1, 2**62-1, 11, size=12) * Also reported in numpygh-11443: the distribution of the variates of np.random.hypergeometric(ngood, nbad, nsample) is not correct when nsample >= 10 and ngood + nbad is sufficiently large. I don't have a rigorous bound, but a histogram of the p-values of 1500 runs of a G-test with size=5000000 in each run starts to look suspicious when ngood + nbad is approximately 2**37. When the sum is greater than 2**38, the distribution is clearly wrong, and, as reported in the pull request numpygh-9834, the following call returns all zeros: np.random.hypergeometric(2**55, 2**55, 10, size=20) * The code does not check for integer overflow of ngood + nbad, so the following call np.random.hypergeometric(2**62, 2**62, 1) on a system where the default integer is 64 bit results in `ValueError: ngood + nbad < nsample`. By restricting the two arguments to be less than 2**30, the values are well within the space of inputs for which I have no evidence of the function generating an incorrect distribution, and the overflow problem is avoided for both 32-bit and 64-bit systems. I replaced the test for hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2) > 0 with hypegeometric(2**30 - 1, 2**30 - 1, 2**30 - 1) > 0. The test was a regression test for a bug in which the function would enter an infinite loop when the arguments were sufficiently large. The regression test for the fix included the call hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2) on systems with 64 bit integers. This call is now disallowed, so I add a call with the maximum allowed values of ngood and nbad.

WarrenWeckesser · 2018-07-02T05:53:55Z

Here are a couple plots showing that the function hypergeometric generates incorrect distributions when ngood and nbad are sufficiently large. The plots show histograms of the p-values of 1500 G-tests. Each G-test is computed with a sample size of 5000000. The histograms should be approximately uniform, with the value in each bin approximately 75. The dashed line in each plot is at 75. (1500 p-values and 20 histogram bins gives the expected count of 75 p-values per bin.) The top three histograms are for nsample=12 and the bottom three are for nsample=25. In both cases, the last plot in the group (where ngood=2**38 and nbad=2**37) clearly shows that the distribution of the p-values is not uniform.

rkern · 2018-07-02T08:02:13Z

The failure looks like it might be related, if rare.

I think this would be fine according to our current policy, but I might suggest making the appropriate modifications over in randomgen's new RandomGenerator instead, where the integers are always int64_t on all platforms. This close to making the switch, I'm inclined to leave RandomState alone, bug traps and all.

WarrenWeckesser · 2018-07-02T15:39:29Z

The failure looks like it might be related, if rare.

It looks like the new test somehow occasionally puts the code into an infinite loop. I'll see if I can reproduce that behavior.

I might suggest making the appropriate modifications over in randomgen's new RandomGenerator instead, where the integers are always int64_t on all platforms. This close to making the switch, I'm inclined to leave RandomState alone, bug traps and all.

In the post-NEP-19 world, this change should be part of the legacy numpy.random.RandomState. It doesn't change the stream of variates--it just restricts the arguments of hypergeometric to those where the function behaves correctly (well, that's what the change is supposed to do, anyway 😄). Admittedly, the cases that it rules out are pretty extreme, and not likely to appear in the wild. If those cases seem rare enough that we don't expect anyone to ever encounter them, we can just close this PR without merging. If randomgen has code that will eventually become the legacy RandomState, I'd be happy to make the change there, but I don't think that part of the NEP has shown up in randomgen yet.

A side effect of working on #8056 is that I am rewriting the code underlying hypergeometric, and I'd like to get the rewrite of hypergeometric into the new RandomGenerator. Something like the change in this pull request might still be necessary, but the upper bounds on the arguments might be different. It would make sense to get the upper bounds on the arguments figured out as part of the rewrite, rather than using the value 2**30 that is used in this pull request, so I don't think it is worthwhile making just this change in RandomGenerator.

charris · 2018-07-02T17:19:48Z

I don't see a problem with restricting the parameter values if the results are clearly wrong for some values.

rkern · 2018-07-02T23:25:23Z

In a post-NEP-19 world, we do not change RandomState at all: "It is intended that this class will no longer be modified, except to keep it working when numpy internals change." Perhaps I should have been clearer that this means not fixing correctness bugs.

charris · 2018-07-02T23:36:28Z

@rkern Perhaps a warning then? Think of it as motivation for moving to the new generators. If the NEP doesn't lay out how to handle new found bugs in the old implementation, then we should add that.

rkern · 2018-07-03T02:20:43Z

Perfectly happy with a warning. I don't particularly object to an error in this case, either, just wanted to raise the consideration.

bashtage · 2019-04-10T15:12:11Z

@WarrenWeckesser any chance you could port these to @mattip randomgen branch? This version is happy to accept improvements, even if they break the stream.

WarrenWeckesser · 2019-04-11T05:14:49Z

@bashtage, so the plan is to get @mattip's branch into shape, and then merge that into numpy? I have a couple other improvements to the hypergeometric distribution in pull requests that I closed. I planned on getting back to them once NEP-19 was implemented.

bashtage · 2019-04-11T06:27:22Z

@WarrenWeckesser That's right. That sounds fine. I think it is a good idea to get things in early (in particular before a release) so that the old way is never released.

WarrenWeckesser · 2019-04-11T09:56:25Z

@bashtage, the changes I have in mind for the hypergeometric distribution are much more substantial than what is in this pull request. I implemented a new version of the (univariate) hypergeometric distribution as part of the multivariate hypergeometric distribution proposed in #8056, but it is not exposed as part of the public API in that PR (couldn't change the stream). It might make more sense to not try to get the extensive rewrite of the hypergeometric distribution into @mattip's branch, because (I think) reviewing the changes will be nontrivial, and it looks like there are more fundamental issues still to be resolved before the randomgen branch is merged into numpy. But I'm willing to try it if you and mattip prefer doing it that way.

I do agree that it would be best to get the new implementation in before the next release of numpy, to minimize the changes in behavior between releases.

bashtage · 2019-04-11T20:23:21Z

If you think porting it is easy, then there is no need to get it ready. If not, then it might help it is was ready to submit a PR against the randomgen branch once it is merged.

bashtage · 2019-05-28T23:00:09Z

@WarrenWeckesser now (before 1.17 in particular) would be the ideal time to make these improvements to np.random.Generator, which was merged today, if you have time.

WarrenWeckesser · 2019-05-28T23:22:10Z

@bashtage, I will get back to this later this week. The changes I have in mind are to clean up the the hypergeometric distribution and add the multivariate hypergeometric distribution.

It's great to see the new randomgen code finally merged!

WarrenWeckesser · 2019-06-15T20:48:31Z

The new implementation of hypergeometric limits the values of ngood and nbad to less than 10**9. I'm closing this issue, but if anyone thinks it is important to add these limits to the legacy version, just say so.

bashtage · 2019-06-15T21:02:42Z

As far as I know legacy is no change, no bug fix.

…

On Sat, Jun 15, 2019, 21:48 Warren Weckesser ***@***.***> wrote: Closed #11475 <#11475>. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#11475?email_source=notifications&email_token=ABKTSRMLVVM3LKZP7D53WQTP2VITFA5CNFSM4FH23FNKYY3PNVWWK3TUL52HS4DFWZEXG43VMVCXMZLOORHG65DJMZUWGYLUNFXW5KTDN5WW2ZLOORPWSZGOR75OZ6I#event-2415586553>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABKTSROGT66CENXT4FCI4PLP2VITFANCNFSM4FH23FNA> .

WarrenWeckesser closed this Jul 2, 2018

WarrenWeckesser reopened this Jul 2, 2018

charris added 00 - Bug component: numpy.random labels Jul 2, 2018

rkern mentioned this pull request Jul 3, 2018

NEP: clarify bugfix policy for legacy RandomState. #11488

Merged

mattip mentioned this pull request Jun 6, 2019

ENH: tracking issue for merging randomgen into numpy #13164

Closed

16 tasks

WarrenWeckesser closed this Jun 15, 2019

Uh oh!

BUG: random: Avoid bad behavior of hypergeometric with very large arguments. #11475

BUG: random: Avoid bad behavior of hypergeometric with very large arguments. #11475

Uh oh!

Conversation

WarrenWeckesser commented Jul 2, 2018

Uh oh!

WarrenWeckesser commented Jul 2, 2018

Uh oh!

rkern commented Jul 2, 2018

Uh oh!

WarrenWeckesser commented Jul 2, 2018

Uh oh!

charris commented Jul 2, 2018

Uh oh!

rkern commented Jul 2, 2018

Uh oh!

charris commented Jul 2, 2018

Uh oh!

rkern commented Jul 3, 2018

Uh oh!

bashtage commented Apr 10, 2019

Uh oh!

WarrenWeckesser commented Apr 11, 2019

Uh oh!

bashtage commented Apr 11, 2019

Uh oh!

WarrenWeckesser commented Apr 11, 2019

Uh oh!

bashtage commented Apr 11, 2019

Uh oh!

bashtage commented May 28, 2019

Uh oh!

WarrenWeckesser commented May 28, 2019

Uh oh!

WarrenWeckesser commented Jun 15, 2019

Uh oh!

bashtage commented Jun 15, 2019 via email

Uh oh!

Uh oh!