Skip to content

BUG: random: Avoid bad behavior of hypergeometric with very large arguments. #11475

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

WarrenWeckesser
Copy link
Member

The changes in this pull request restrict the arguments ngood and nbad
of numpy.random.hypergeometric to be less than 2**30. This "fix" (in the
sense of the old joke "Doctor, it hurts when I do this") makes the
following problems impossible to encounter:

  • Python hangs with these calls (see BUG: random: Problems with hypergeometric with ridiculously large arguments. #11443):

    np.random.hypergeometric(2**62-1, 2**62-1, 26, size=2)
    np.random.hypergeometric(2**62-1, 2**62-1, 11, size=12)
    
  • Also reported in BUG: random: Problems with hypergeometric with ridiculously large arguments. #11443: the distribution of the variates of

    np.random.hypergeometric(ngood, nbad, nsample)
    

    is not correct when nsample >= 10 and ngood + nbad is sufficiently large.
    I don't have a rigorous bound, but a histogram of the p-values of 1500
    runs of a G-test with size=5000000 in each run starts to look suspicious
    when ngood + nbad is approximately 2**37. When the sum is greater than
    2**38, the distribution is clearly wrong, and, as reported in the pull
    request WIP/BUG: Replace the C function rk_hypergeometric_hyp() #9834, the following call returns all zeros:

    np.random.hypergeometric(2**55, 2**55, 10, size=20)
    
  • The code does not check for integer overflow of ngood + nbad, so the
    following call

    np.random.hypergeometric(2**62, 2**62, 1)
    

    on a system where the default integer is 64 bit results in
    ValueError: ngood + nbad < nsample.

By restricting the two arguments to be less than 2**30, the values are
well within the space of inputs for which I have no evidence of the
function generating an incorrect distribution, and the overflow problem
is avoided for both 32-bit and 64-bit systems.

I replaced the test for hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2) > 0
with hypergeometric(2**30 - 1, 2**30 - 1, 2**30 - 1) > 0. The test was a
regression test for a bug in which the function would enter an infinite
loop when the arguments were sufficiently large. The regression test for
the fix included the call hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2)
on systems with 64 bit integers. This call is now disallowed, so I add a
call with the maximum allowed values of ngood and nbad.

…uments.

The changes in this pull request restrict the arguments ngood and nbad
of numpy.random.hypergeometric to be less than 2**30.  This "fix" (in the
sense of the old joke "Doctor, it hurts when I do *this*") makes the
following problems impossible to encounter:

* Python hangs with these calls (see numpygh-11443):

      np.random.hypergeometric(2**62-1, 2**62-1, 26, size=2)
      np.random.hypergeometric(2**62-1, 2**62-1, 11, size=12)

* Also reported in numpygh-11443: the distribution of the variates of

      np.random.hypergeometric(ngood, nbad, nsample)

  is not correct when nsample >= 10 and ngood + nbad is sufficiently large.
  I don't have a rigorous bound, but a histogram of the p-values of 1500
  runs of a G-test with size=5000000 in each run starts to look suspicious
  when ngood + nbad is approximately 2**37.  When the sum is greater than
  2**38, the distribution is clearly wrong, and, as reported in the pull
  request numpygh-9834, the following call returns all zeros:

      np.random.hypergeometric(2**55, 2**55, 10, size=20)

* The code does not check for integer overflow of ngood + nbad, so the
  following call

      np.random.hypergeometric(2**62, 2**62, 1)

  on a system where the default integer is 64 bit results in
  `ValueError: ngood + nbad < nsample`.

By restricting the two arguments to be less than 2**30, the values are
well within the space of inputs for which I have no evidence of the
function generating an incorrect distribution, and the overflow problem
is avoided for both 32-bit and 64-bit systems.

I replaced the test for hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2) > 0
with hypegeometric(2**30 - 1, 2**30 - 1, 2**30 - 1) > 0.  The test was a
regression test for a bug in which the function would enter an infinite
loop when the arguments were sufficiently large.  The regression test for
the fix included the call hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2)
on systems with 64 bit integers.  This call is now disallowed, so I add a
call with the maximum allowed values of ngood and nbad.
@WarrenWeckesser
Copy link
Member Author

Here are a couple plots showing that the function hypergeometric generates incorrect distributions when ngood and nbad are sufficiently large. The plots show histograms of the p-values of 1500 G-tests. Each G-test is computed with a sample size of 5000000. The histograms should be approximately uniform, with the value in each bin approximately 75. The dashed line in each plot is at 75. (1500 p-values and 20 histogram bins gives the expected count of 75 p-values per bin.) The top three histograms are for nsample=12 and the bottom three are for nsample=25. In both cases, the last plot in the group (where ngood=2**38 and nbad=2**37) clearly shows that the distribution of the p-values is not uniform.

gtest-n12
gtest-n25

@rkern
Copy link
Member

rkern commented Jul 2, 2018

The failure looks like it might be related, if rare.

I think this would be fine according to our current policy, but I might suggest making the appropriate modifications over in randomgen's new RandomGenerator instead, where the integers are always int64_t on all platforms. This close to making the switch, I'm inclined to leave RandomState alone, bug traps and all.

@WarrenWeckesser
Copy link
Member Author

The failure looks like it might be related, if rare.

It looks like the new test somehow occasionally puts the code into an infinite loop. I'll see if I can reproduce that behavior.

I might suggest making the appropriate modifications over in randomgen's new RandomGenerator instead, where the integers are always int64_t on all platforms. This close to making the switch, I'm inclined to leave RandomState alone, bug traps and all.

In the post-NEP-19 world, this change should be part of the legacy numpy.random.RandomState. It doesn't change the stream of variates--it just restricts the arguments of hypergeometric to those where the function behaves correctly (well, that's what the change is supposed to do, anyway 😄). Admittedly, the cases that it rules out are pretty extreme, and not likely to appear in the wild. If those cases seem rare enough that we don't expect anyone to ever encounter them, we can just close this PR without merging. If randomgen has code that will eventually become the legacy RandomState, I'd be happy to make the change there, but I don't think that part of the NEP has shown up in randomgen yet.

A side effect of working on #8056 is that I am rewriting the code underlying hypergeometric, and I'd like to get the rewrite of hypergeometric into the new RandomGenerator. Something like the change in this pull request might still be necessary, but the upper bounds on the arguments might be different. It would make sense to get the upper bounds on the arguments figured out as part of the rewrite, rather than using the value 2**30 that is used in this pull request, so I don't think it is worthwhile making just this change in RandomGenerator.

@charris
Copy link
Member

charris commented Jul 2, 2018

I don't see a problem with restricting the parameter values if the results are clearly wrong for some values.

@rkern
Copy link
Member

rkern commented Jul 2, 2018

In a post-NEP-19 world, we do not change RandomState at all: "It is intended that this class will no longer be modified, except to keep it working when numpy internals change." Perhaps I should have been clearer that this means not fixing correctness bugs.

@charris
Copy link
Member

charris commented Jul 2, 2018

@rkern Perhaps a warning then? Think of it as motivation for moving to the new generators. If the NEP doesn't lay out how to handle new found bugs in the old implementation, then we should add that.

@rkern
Copy link
Member

rkern commented Jul 3, 2018

Perfectly happy with a warning. I don't particularly object to an error in this case, either, just wanted to raise the consideration.

@bashtage
Copy link
Contributor

@WarrenWeckesser any chance you could port these to @mattip randomgen branch? This version is happy to accept improvements, even if they break the stream.

@WarrenWeckesser
Copy link
Member Author

@bashtage, so the plan is to get @mattip's branch into shape, and then merge that into numpy? I have a couple other improvements to the hypergeometric distribution in pull requests that I closed. I planned on getting back to them once NEP-19 was implemented.

@bashtage
Copy link
Contributor

@WarrenWeckesser That's right. That sounds fine. I think it is a good idea to get things in early (in particular before a release) so that the old way is never released.

@WarrenWeckesser
Copy link
Member Author

@bashtage, the changes I have in mind for the hypergeometric distribution are much more substantial than what is in this pull request. I implemented a new version of the (univariate) hypergeometric distribution as part of the multivariate hypergeometric distribution proposed in #8056, but it is not exposed as part of the public API in that PR (couldn't change the stream). It might make more sense to not try to get the extensive rewrite of the hypergeometric distribution into @mattip's branch, because (I think) reviewing the changes will be nontrivial, and it looks like there are more fundamental issues still to be resolved before the randomgen branch is merged into numpy. But I'm willing to try it if you and mattip prefer doing it that way.

I do agree that it would be best to get the new implementation in before the next release of numpy, to minimize the changes in behavior between releases.

@bashtage
Copy link
Contributor

If you think porting it is easy, then there is no need to get it ready. If not, then it might help it is was ready to submit a PR against the randomgen branch once it is merged.

@bashtage
Copy link
Contributor

@WarrenWeckesser now (before 1.17 in particular) would be the ideal time to make these improvements to np.random.Generator, which was merged today, if you have time.

@WarrenWeckesser
Copy link
Member Author

@bashtage, I will get back to this later this week. The changes I have in mind are to clean up the the hypergeometric distribution and add the multivariate hypergeometric distribution.

It's great to see the new randomgen code finally merged!

@WarrenWeckesser
Copy link
Member Author

The new implementation of hypergeometric limits the values of ngood and nbad to less than 10**9. I'm closing this issue, but if anyone thinks it is important to add these limits to the legacy version, just say so.

@bashtage
Copy link
Contributor

bashtage commented Jun 15, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants