BUG: random: Fix formula in rk_hypergeometric_hrua #11138


Closed

Conversation

WarrenWeckesser
Member

The HRUA algorithm is described in Stadlober's thesis "Sampling from
Poisson, binomial and hypergeometric distributions: ratio of uniforms
as a simple and fast alternative."  The pseudo-code for the algorithm
begins on page 82, where the parameter c is defined to be

    c = sqrt((N - n)*n*p*q/(N - 1) + 1/2)

`N` is the population size, `n` is the sample size, `p` = `M/N` is the
fraction of the first type (i.e. the fraction of "good" values or
successes) and `q` is `1 - p`.

The C code takes advantage of the symmetry and uses

    m = min(sample, popsize - sample);

as the sample size to be computed, where `popsize` is (not surprisingly)
the population size (i.e. `good + bad`), and `sample` is the requested
sample size.

The formula for `c` above is translated in the C code as

    d7 = sqrt((double)(popsize - m) * sample * d4 * d5 / (popsize - 1) + 0.5);

where `d4` and `d5` are `p` and `q`, resp.

The problem is that `sample` should have been replaced by `m` in that
formula. It should be

    d7 = sqrt((double)(popsize - m) * m * d4 * d5 / (popsize - 1) + 0.5);

I haven't thoroughly analyzed the implications of this mistake.  It will
only have an effect when `sample` is greater than half of `popsize`.
(When `sample` is less than or equal to half of `popsize`, `m == sample`.)
A few tests of the expected values of the distribution don't show
any obvious biases, but I haven't run it through a rigorous statistical
testing procedure.
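How much the mistake inflates `c` can be checked directly. A quick sketch, using the `good = 10`, `bad = 190`, `sample = 189` case from the timing comparison (all names here are just illustrative Python, not the C code):

```python
from math import sqrt

popsize, sample, good = 200, 189, 10
m = min(sample, popsize - sample)   # m = 11
p = good / popsize                  # fraction of "good" values
q = 1.0 - p

# Buggy formula: uses `sample` where `m` belongs.
c_buggy = sqrt((popsize - m) * sample * p * q / (popsize - 1) + 0.5)
# Corrected formula.
c_fixed = sqrt((popsize - m) * m * p * q / (popsize - 1) + 0.5)
```

For this case `c_buggy` comes out roughly three times larger than `c_fixed`; when `sample <= popsize/2` the two formulas agree, since `m == sample` there.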

It appears that it simply means that the bound for the rejection
method is too big, so the acceptance ratio of the method is much
lower than it should be.  This can be seen in a test such as:

    In [9]: %timeit np.random.hypergeometric(10, 190, 189, size=2500000)
    2.07 s ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [10]: %timeit np.random.hypergeometric(10, 190, 11, size=2500000)
    1.14 s ± 1.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Those two computations should take the same amount of time.
After fixing the formula, they do:

    In [4]: %timeit np.random.hypergeometric(10, 190, 189, size=2500000)
    1.14 s ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [5]: %timeit np.random.hypergeometric(10, 190, 11, size=2500000)
    1.14 s ± 694 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
@WarrenWeckesser
Member Author

This changes the stream of values produced by numpy.random.hypergeometric(), so the test for repeatability will fail.

@bashtage
Contributor

Can you verify the problem using a Q-Q or similar plot? If this is indeed a bug then it should be fixable despite breaking the stream.

@WarrenWeckesser
Member Author

WarrenWeckesser commented May 22, 2018

The mistake is pretty clear; at this point, I don't think I'll spend more time on numerical experiments related to it. Instead, I'm continuing to look into the derivation of the HRUA algorithm. Once I understand that better, it should be clear whether or not the mistake would introduce any kind of bias. If the only effect is to make the code slower than it could be, then there is no urgent need to fix it in numpy, and hypergeometric can remain in its petrified state. :)

@WarrenWeckesser
Member Author

WarrenWeckesser commented May 23, 2018

A bit more reading confirms what I observed. The mistake makes the shape parameter of the "table mountain hat" function used in the ratio-of-uniforms method bigger than necessary, so it doesn't cover the hypergeometric distribution's scaled PMF as tightly as it could. And that means that the rejection method is not as efficient as it could be. The variates that are generated, however, are not wrong.

So there is no urgent need to merge this pull request. It can wait until numpy implements some form of versioning for the random module.
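This is the usual rejection-sampling trade-off: as long as the hat dominates the target, the accepted variates are exact, and an oversized hat only lowers the acceptance rate. A minimal sketch (a flat hat over a toy PMF, not the actual table-mountain hat) illustrates the point:

```python
import random

def rejection_sample(pmf, hat_height):
    """Draw one index k with probability proportional to pmf[k],
    by rejection under a flat hat of height `hat_height`.
    The hat must dominate every pmf entry; an unnecessarily tall
    hat does not bias the result, it only wastes proposals.
    Returns (k, number_of_proposals)."""
    trials = 0
    while True:
        trials += 1
        k = random.randrange(len(pmf))            # uniform proposal
        if random.random() * hat_height < pmf[k]: # accept w.p. pmf[k]/hat
            return k, trials

pmf = [0.1, 0.2, 0.4, 0.2, 0.1]
random.seed(12345)
tight = sum(rejection_sample(pmf, 0.4)[1] for _ in range(10_000))
loose = sum(rejection_sample(pmf, 1.2)[1] for _ in range(10_000))
# A hat three times taller needs about three times as many proposals,
# but the distribution of the accepted values is the same.
```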

@WarrenWeckesser
Member Author

I'm closing this PR. When NEP 019 is implemented (assuming that it will be accepted), we can fix this in the RandomGenerator class.

WarrenWeckesser added a commit to WarrenWeckesser/numpy that referenced this pull request Jun 14, 2019
Summary of the changes:

* Move the functions random_hypergeometric_hyp, random_hypergeometric_hrua
  and random_hypergeometric from distributions.c to legacy-distributions.c.
  These are now the legacy implementation of hypergeometric.
* Add the files logfactorial.c and logfactorial.h,
  containing the function logfactorial(int64_t k).
* Add the files random_hypergeometric.c and random_hypergeometric.h,
  containing the function random_hypergeometric (the new implementation
  of the hypergeometric distribution).  See more details below.
* Fix two tests in numpy/random/tests/test_generator_mt19937.py that
  used values returned by the hypergeometric distribution.  The
  new implementation changes the stream, so those tests needed to
  be updated.
* Remove another test obviated by an added constraint on the arguments
  of hypergeometric.

Details of the rewrite:

If you carefully step through the old function rk_hypergeometric_hyp(),
you'll see that the end result is basically the same as the new function
hypergeometric_sample(), but the new function accomplishes the result with
just integers.  The floating point calculations in the old code caused
problems when the arguments were extremely large (explained in more detail
in the unmerged pull request numpy#9834).
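The integer-only approach can be sketched as follows (a hypothetical Python rendering of the idea, not the actual C code): draw from the urn one item at a time, tracking only integer counts of the remaining good items and the remaining total.

```python
import random

def hypergeometric_sample_sketch(good, bad, sample, rng=random):
    """Count the number of 'good' items in a draw of `sample` items
    without replacement, using only integer arithmetic.
    (Symmetries that let the real implementation cut the work, such
    as drawing min(sample, total - sample) items, are omitted here.)"""
    remaining_good = good
    remaining_total = good + bad
    count = 0
    for _ in range(sample):
        # Each remaining item is equally likely to be drawn next, so the
        # next item is 'good' with probability remaining_good/remaining_total.
        if rng.randrange(remaining_total) < remaining_good:
            remaining_good -= 1
            count += 1
        remaining_total -= 1
    return count
```

A sanity check: the sample mean over many draws should approach `sample * good / (good + bad)`.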

The new version of hypergeometric_hrua() is a new translation of
Stadlober's ratio-of-uniforms algorithm for the hypergeometric
distribution.  It fixes a mistake in the old implementation that made the
method less efficient than it could be (see the details in the unmerged
pull request numpy#11138), and uses a faster
function for computing log(k!).
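For reference, log(k!) is log Gamma(k + 1), so the simplest stand-in for such a function in Python is `math.lgamma` (the dedicated C implementation in the pull request is faster for this use, but `lgamma` is a convenient sketch for checking results):

```python
from math import lgamma, log

def logfactorial(k):
    """log(k!) for a nonnegative integer k, via log Gamma(k + 1)."""
    if k < 0:
        raise ValueError("k must be nonnegative")
    return lgamma(k + 1)

# Quick check against a direct sum of logs.
assert abs(logfactorial(20) - sum(log(i) for i in range(1, 21))) < 1e-10
```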

The HRUA algorithm suffers from loss of floating point precision when
the arguments are *extremely* large (see the comments in github issue
11443).  To avoid these problems, the arguments `ngood` and `nbad` of
hypergeometric must be less than 10**9.  This constraint obviates an
existing regression test that was run on systems with 64 bit long
integers, so that test was removed.