WIP/BUG: Replace the C function rk_hypergeometric_hyp() #9834

WarrenWeckesser · 2017-10-07T17:40:25Z

Note: This pull request cannot be merged in its current state. It changes the
stream of random numbers generated by numpy.random.hypergeometric()
when the argument nsample is 10 or less.

Before a change such as this can be made, an API for selecting a version of
the random number generator must be developed.

When the arguments 'ngood' or 'nbad' are ridiculously large and 'nsample'
is 10 or less, the function 'numpy.random.hypergeometric' generates invalid
samples.

For example,

In [77]: np.random.hypergeometric(2**55, 2**55, 10, size=20)
Out[77]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

The problem is this code from the C function rk_hypergeometric_hyp:

        Y -= (long)floor(U + Y/(d1 + K));

'Y' is a double, and the expression on the right (after the cast to long)
is either 0 or 1. The problem is that when Y is too big, the ULP for Y
is greater than 1, so that code doesn't change Y, regardless of the value
of the right side.

I have rewritten the function without using floating point values.

With the changes in this pull request:

In [6]: np.random.seed(8675309)

In [7]: np.random.hypergeometric(2**55, 2**55, 10, size=20)
Out[7]: array([5, 2, 6, 6, 4, 6, 5, 5, 3, 6, 8, 6, 4, 5, 5, 5, 6, 2, 6, 7])

The new function is also faster. Here is an example using the new
implementation:

In [1]: import numpy as np

In [2]: np.__version__
Out[2]: '1.14.0.dev0+7dfc89a'

In [3]: %timeit r = np.random.hypergeometric(1000, 2500, 9, size=5000000)
478 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit r = np.random.hypergeometric(10, 5, 9, size=5000000)
556 ms ± 7.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using the existing implementatin in the numpy master branch:

In [1]: import numpy as np

In [2]: np.__version__
Out[2]: '1.14.0.dev0+6af2c42'

In [3]: %timeit r = np.random.hypergeometric(1000, 2500, 9, size=5000000)
692 ms ± 3.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit r = np.random.hypergeometric(10, 5, 9, size=5000000)
679 ms ± 3.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

When the arguments 'good' or 'bad' are ridiculously large and 'sample' is 10 or less, the function 'numpy.random.hypergeometric' generates invalid samples. For example, In [77]: np.random.hypergeometric(2**55, 2**55, 10, size=20) Out[77]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) The problem is this code from the C function `rk_hypergeometric_hyp`: Y -= (long)floor(U + Y/(d1 + K)); 'Y' is a double, and the expression on the right (after the cast to long) is either 0 or 1. The problem is that when Y is too big, the ULP for Y is greater than 1, so that code doesn't change Y, regardless of the value of the right side. I have rewritten the function without using floating point values. The new function is also faster.

…uments. The changes in this pull request restrict the arguments ngood and nbad of numpy.random.hypergeometric to be less than 2**30. This "fix" (in the sense of the old joke "Doctor, it hurts when I do *this*") makes the following problems impossible to encounter: * Python hangs with these calls (see numpygh-11443): np.random.hypergeometric(2**62-1, 2**62-1, 26, size=2) np.random.hypergeometric(2**62-1, 2**62-1, 11, size=12) * Also reported in numpygh-11443: the distribution of the variates of np.random.hypergeometric(ngood, nbad, nsample) is not correct when nsample >= 10 and ngood + nbad is sufficiently large. I don't have a rigorous bound, but a histogram of the p-values of 1500 runs of a G-test with size=5000000 in each run starts to look suspicious when ngood + nbad is approximately 2**37. When the sum is greater than 2**38, the distribution is clearly wrong, and, as reported in the pull request numpygh-9834, the following call returns all zeros: np.random.hypergeometric(2**55, 2**55, 10, size=20) * The code does not check for integer overflow of ngood + nbad, so the following call np.random.hypergeometric(2**62, 2**62, 1) on a system where the default integer is 64 bit results in `ValueError: ngood + nbad < nsample`. By restricting the two arguments to be less than 2**30, the values are well within the space of inputs for which I have no evidence of the function generating an incorrect distribution, and the overflow problem is avoided for both 32-bit and 64-bit systems. I replaced the test for hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2) > 0 with hypegeometric(2**30 - 1, 2**30 - 1, 2**30 - 1) > 0. The test was a regression test for a bug in which the function would enter an infinite loop when the arguments were sufficiently large. The regression test for the fix included the call hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2) on systems with 64 bit integers. This call is now disallowed, so I add a call with the maximum allowed values of ngood and nbad.

WarrenWeckesser · 2018-07-02T15:08:07Z

I'm closing this PR. When NEP 019 is implemented (assuming that it will be accepted), we can fix this in the RandomGenerator class.

Summary of the changes: * Move the functions random_hypergeometric_hyp, random_hypergeometric_hrua and random_hypergeometric from distributions.c to legacy-distributions.c. These are now the legacy implementation of hypergeometric. * Add the files logfactorial.c and logfactorial.h, containing the function logfactorial(int64_t k). * Add the files random_hypergeometric.c and random_hypergeometric.h, containing the function random_hypergeometric (the new implementation of the hypergeometric distribution). See more details below. * Fix two tests in numpy/random/tests/test_generator_mt19937.py that used values returned by the hypergeometric distribution. The new implementation changes the stream, so those tests needed to be updated. * Remove another test obviated by an added constraint on the arguments of hypergeometric. Details of the rewrite: If you carefully step through the old function rk_hypergeometric_hyp(), you'll see that the end result is basically the same as the new function hypergeometric_sample(), but the new function accomplishes the result with just integers. The floating point calculations in the old code caused problems when the arguments were extremely large (explained in more detail in the unmerged pull request numpy#9834). The new version of hypergeometric_hrua() is a new translation of Stadlober's ratio-of-uniforms algorithm for the hypergeometric distribution. It fixes a mistake in the old implementation that made the method less efficient than it could be (see the details in the unmerged pull request numpy#11138), and uses a faster function for computing log(k!). The HRUA algorithm suffers from loss of floating point precision when the arguments are *extremely* large (see the comments in github issue 11443). To avoid these problems, the arguments `ngood` and `nbad` of hypergeometric must be less than 10**9. This constraint obviates an existing regression test that was run on systems with 64 bit long integers, so that test was removed.

WarrenWeckesser force-pushed the fix-hypergeom branch 2 times, most recently from bbe52ab to 5f08ed0 Compare October 7, 2017 18:33

WarrenWeckesser force-pushed the fix-hypergeom branch from 5f08ed0 to 5044e98 Compare October 7, 2017 18:41

WarrenWeckesser mentioned this pull request Oct 26, 2017

ENH: random: Add multivariate_hypergeometric function. #8056

Closed

charris added 01 - Enhancement 51 - In progress component: numpy.random labels Oct 26, 2017

WarrenWeckesser mentioned this pull request Jun 28, 2018

BUG: random: Problems with hypergeometric with ridiculously large arguments. #11443

Closed

WarrenWeckesser mentioned this pull request Jul 2, 2018

BUG: random: Avoid bad behavior of hypergeometric with very large arguments. #11475

Closed

WarrenWeckesser closed this Jul 2, 2018

WarrenWeckesser mentioned this pull request Jun 12, 2019

MAINT: random: Rewrite the hypergeometric distribution. #13761

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP/BUG: Replace the C function rk_hypergeometric_hyp() #9834

WIP/BUG: Replace the C function rk_hypergeometric_hyp() #9834

WarrenWeckesser commented Oct 7, 2017 •

edited

Loading

WarrenWeckesser commented Jul 2, 2018

WIP/BUG: Replace the C function rk_hypergeometric_hyp() #9834

WIP/BUG: Replace the C function rk_hypergeometric_hyp() #9834

Conversation

WarrenWeckesser commented Oct 7, 2017 • edited Loading

WarrenWeckesser commented Jul 2, 2018

WarrenWeckesser commented Oct 7, 2017 •

edited

Loading