Skip to content

WIP/BUG: Replace the C function rk_hypergeometric_hyp() #9834

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

WarrenWeckesser
Copy link
Member

@WarrenWeckesser WarrenWeckesser commented Oct 7, 2017

Note: This pull request cannot be merged in its current state. It changes the
stream of random numbers generated by numpy.random.hypergeometric()
when the argument nsample is 10 or less.

Before a change such as this can be made, an API for selecting a version of
the random number generator must be developed.


When the arguments 'ngood' or 'nbad' are ridiculously large and 'nsample'
is 10 or less, the function 'numpy.random.hypergeometric' generates invalid
samples.

For example,

In [77]: np.random.hypergeometric(2**55, 2**55, 10, size=20)
Out[77]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

The problem is this code from the C function rk_hypergeometric_hyp:

        Y -= (long)floor(U + Y/(d1 + K));

'Y' is a double, and the expression on the right (after the cast to long)
is either 0 or 1. The problem is that when Y is too big, the ULP for Y
is greater than 1, so that code doesn't change Y, regardless of the value
of the right side.

I have rewritten the function without using floating point values.

With the changes in this pull request:

In [6]: np.random.seed(8675309)

In [7]: np.random.hypergeometric(2**55, 2**55, 10, size=20)
Out[7]: array([5, 2, 6, 6, 4, 6, 5, 5, 3, 6, 8, 6, 4, 5, 5, 5, 6, 2, 6, 7])

The new function is also faster. Here is an example using the new
implementation:

In [1]: import numpy as np

In [2]: np.__version__
Out[2]: '1.14.0.dev0+7dfc89a'

In [3]: %timeit r = np.random.hypergeometric(1000, 2500, 9, size=5000000)
478 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit r = np.random.hypergeometric(10, 5, 9, size=5000000)
556 ms ± 7.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using the existing implementatin in the numpy master branch:

In [1]: import numpy as np

In [2]: np.__version__
Out[2]: '1.14.0.dev0+6af2c42'

In [3]: %timeit r = np.random.hypergeometric(1000, 2500, 9, size=5000000)
692 ms ± 3.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit r = np.random.hypergeometric(10, 5, 9, size=5000000)
679 ms ± 3.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@WarrenWeckesser WarrenWeckesser force-pushed the fix-hypergeom branch 2 times, most recently from bbe52ab to 5f08ed0 Compare October 7, 2017 18:33
When the arguments 'good' or 'bad' are ridiculously large and 'sample'
is 10 or less, the function 'numpy.random.hypergeometric' generates invalid
samples.

For example,

    In [77]: np.random.hypergeometric(2**55, 2**55, 10, size=20)
    Out[77]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

The problem is this code from the C function `rk_hypergeometric_hyp`:

            Y -= (long)floor(U + Y/(d1 + K));

'Y' is a double, and the expression on the right (after the cast to long)
is either 0 or 1.  The problem is that when Y is too big, the ULP for Y
is greater than 1, so that code doesn't change Y, regardless of the value
of the right side.

I have rewritten the function without using floating point values.

The new function is also faster.
WarrenWeckesser added a commit to WarrenWeckesser/numpy that referenced this pull request Jul 2, 2018
…uments.

The changes in this pull request restrict the arguments ngood and nbad
of numpy.random.hypergeometric to be less than 2**30.  This "fix" (in the
sense of the old joke "Doctor, it hurts when I do *this*") makes the
following problems impossible to encounter:

* Python hangs with these calls (see numpygh-11443):

      np.random.hypergeometric(2**62-1, 2**62-1, 26, size=2)
      np.random.hypergeometric(2**62-1, 2**62-1, 11, size=12)

* Also reported in numpygh-11443: the distribution of the variates of

      np.random.hypergeometric(ngood, nbad, nsample)

  is not correct when nsample >= 10 and ngood + nbad is sufficiently large.
  I don't have a rigorous bound, but a histogram of the p-values of 1500
  runs of a G-test with size=5000000 in each run starts to look suspicious
  when ngood + nbad is approximately 2**37.  When the sum is greater than
  2**38, the distribution is clearly wrong, and, as reported in the pull
  request numpygh-9834, the following call returns all zeros:

      np.random.hypergeometric(2**55, 2**55, 10, size=20)

* The code does not check for integer overflow of ngood + nbad, so the
  following call

      np.random.hypergeometric(2**62, 2**62, 1)

  on a system where the default integer is 64 bit results in
  `ValueError: ngood + nbad < nsample`.

By restricting the two arguments to be less than 2**30, the values are
well within the space of inputs for which I have no evidence of the
function generating an incorrect distribution, and the overflow problem
is avoided for both 32-bit and 64-bit systems.

I replaced the test for hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2) > 0
with hypegeometric(2**30 - 1, 2**30 - 1, 2**30 - 1) > 0.  The test was a
regression test for a bug in which the function would enter an infinite
loop when the arguments were sufficiently large.  The regression test for
the fix included the call hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2)
on systems with 64 bit integers.  This call is now disallowed, so I add a
call with the maximum allowed values of ngood and nbad.
@WarrenWeckesser
Copy link
Member Author

I'm closing this PR. When NEP 019 is implemented (assuming that it will be accepted), we can fix this in the RandomGenerator class.

WarrenWeckesser added a commit to WarrenWeckesser/numpy that referenced this pull request Jun 14, 2019
Summary of the changes:

* Move the functions random_hypergeometric_hyp, random_hypergeometric_hrua
  and random_hypergeometric from distributions.c to legacy-distributions.c.
  These are now the legacy implementation of hypergeometric.
* Add the files logfactorial.c and logfactorial.h,
  containing the function logfactorial(int64_t k).
* Add the files random_hypergeometric.c and random_hypergeometric.h,
  containing the function random_hypergeometric (the new implementation
  of the hypergeometric distribution).  See more details below.
* Fix two tests in numpy/random/tests/test_generator_mt19937.py that
  used values returned by the hypergeometric distribution.  The
  new implementation changes the stream, so those tests needed to
  be updated.
* Remove another test obviated by an added constraint on the arguments
  of hypergeometric.

Details of the rewrite:

If you carefully step through the old function rk_hypergeometric_hyp(),
you'll see that the end result is basically the same as the new function
hypergeometric_sample(), but the new function accomplishes the result with
just integers.  The floating point calculations in the old code caused
problems when the arguments were extremely large (explained in more detail
in the unmerged pull request numpy#9834).

The new version of hypergeometric_hrua() is a new translation of
Stadlober's ratio-of-uniforms algorithm for the hypergeometric
distribution.  It fixes a mistake in the old implementation that made the
method less efficient than it could be (see the details in the unmerged
pull request numpy#11138), and uses a faster
function for computing log(k!).

The HRUA algorithm suffers from loss of floating point precision when
the arguments are *extremely* large (see the comments in github issue
11443).  To avoid these problems, the arguments `ngood` and `nbad` of
hypergeometric must be less than 10**9.  This constraint obviates an
existing regression test that was run on systems with 64 bit long
integers, so that test was removed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants