BUG: random: Fix formula in rk_hypergeometric_hrua #11138
Conversation
The HRUA algorithm is described in Stadlober's thesis "Sampling from Poisson, binomial and hypergeometric distributions: ratio of uniforms as a simple and fast alternative." The pseudo-code for the algorithm begins on page 82, where the parameter c is defined to be

```
c = sqrt((N - n)*n*p*q/(N - 1) + 1/2)
```

`N` is the population size, `n` is the sample size, `p` = `M/N` is the fraction of the first type (i.e. the fraction of "good" values or successes) and `q` is `1 - p`.

The C code takes advantage of the symmetry and uses

```c
m = min(sample, popsize - sample);
```

as the sample size to be computed, where `popsize` is (not surprisingly) the population size (i.e. `good + bad`), and `sample` is the requested sample size.

The formula for `c` above is translated in the C code as

```c
d7 = sqrt((double)(popsize - m) * sample * d4 * d5 / (popsize - 1) + 0.5);
```

where `d4` and `d5` are `p` and `q`, respectively. The problem is that `sample` should have been replaced by `m` in that formula. It should be

```c
d7 = sqrt((double)(popsize - m) * m * d4 * d5 / (popsize - 1) + 0.5);
```

I haven't thoroughly analyzed the implications of this mistake. It will only have an effect when `sample` is greater than half of `popsize`. (When `sample` is less than or equal to half of `popsize`, `m == sample`.) A few tests of the expected values of the distribution don't show any obvious biases, but I haven't run it through a rigorous statistical testing procedure.

It appears that the mistake simply means that the bound for the rejection method is too big, so the acceptance ratio of the method is much lower than it should be. This can be seen in a test such as:

```
In [9]: %timeit np.random.hypergeometric(10, 190, 189, size=2500000)
2.07 s ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit np.random.hypergeometric(10, 190, 11, size=2500000)
1.14 s ± 1.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Those two computations should take the same amount of time. After fixing the formula, they do:

```
In [4]: %timeit np.random.hypergeometric(10, 190, 189, size=2500000)
1.14 s ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit np.random.hypergeometric(10, 190, 11, size=2500000)
1.14 s ± 694 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
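To get a rough sense of how much slack the wrong formula adds in the timing example above, here is a small Python sketch (an editorial illustration, not part of the patch) that evaluates the `c` formula with `sample` and with `m` for `good = 10`, `bad = 190`, `sample = 189`:

```python
from math import sqrt

# Parameters from the %timeit example above:
# np.random.hypergeometric(10, 190, 189)
good, bad, sample = 10, 190, 189

popsize = good + bad                  # N = 200
m = min(sample, popsize - sample)     # symmetry trick used in the C code: m = 11
p = good / popsize                    # fraction of "good" values (d4)
q = 1.0 - p                           # d5

# Bound parameter as currently computed (uses `sample`)...
c_buggy = sqrt((popsize - m) * sample * p * q / (popsize - 1) + 0.5)
# ...and as Stadlober's formula actually requires (uses `m`).
c_fixed = sqrt((popsize - m) * m * p * q / (popsize - 1) + 0.5)

print(c_buggy, c_fixed)  # the buggy bound is roughly three times larger here
```

The larger bound means the ratio-of-uniforms step rejects more candidate points, which is consistent with the slowdown seen in the timings above.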
This changes the stream of values produced by
Can you verify the problem using a Q-Q or similar plot? If this is indeed a bug then it should be fixable despite breaking the stream.
The mistake is pretty clear; at this point, I don't think I'll spend more time on numerical experiments related to it. Instead, I'm continuing to look into the derivation of the HRUA algorithm. Once I understand that better, it should be clear whether or not the mistake would introduce any kind of bias. If the only effect is to make the code slower than it could be, then there is no urgent need to fix it in numpy, and
A bit more reading confirms what I observed. The mistake makes the shape parameter of the "table mountain hat" function used in the ratio-of-uniforms method bigger than necessary, so it doesn't cover the hypergeometric distribution's scaled PMF as tightly as it could. And that means that the rejection method is not as efficient as it could be. The variates that are generated, however, are not wrong. So there is no urgent need to merge this pull request. It can wait until numpy implements some form of versioning for the random module.
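For anyone who does want to run the kind of numerical check suggested earlier, here is one possible sketch (editorial, not part of the PR). It uses a direct comparison against the exact PMF from `scipy.stats.hypergeom` rather than a Q-Q plot, for a parameter set where `sample > popsize/2` and the loose bound is therefore in play; if the analysis above is right, the empirical frequencies should match the exact distribution to within sampling noise even with the unfixed formula.

```python
import numpy as np
from scipy.stats import hypergeom

good, bad, nsample = 10, 190, 189   # sample > popsize/2, so the loose bound is used
popsize = good + bad

samples = np.random.hypergeometric(good, bad, nsample, size=1_000_000)

# Empirical frequency of each possible count of "good" draws.
support = np.arange(max(0, nsample - bad), min(good, nsample) + 1)
observed = np.array([(samples == k).mean() for k in support])

# Exact probabilities; scipy's parametrization is hypergeom(M=popsize, n=good, N=nsample).
expected = hypergeom(popsize, good, nsample).pmf(support)

# Largest discrepancy between observed and exact frequencies; should be
# on the order of sampling noise for this many draws.
print(np.max(np.abs(observed - expected)))
```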
I'm closing this PR. When NEP 019 is implemented (assuming that it will be accepted), we can fix this in the
Summary of the changes:

* Move the functions `random_hypergeometric_hyp`, `random_hypergeometric_hrua` and `random_hypergeometric` from distributions.c to legacy-distributions.c. These are now the legacy implementation of hypergeometric.
* Add the files logfactorial.c and logfactorial.h, containing the function `logfactorial(int64_t k)`.
* Add the files random_hypergeometric.c and random_hypergeometric.h, containing the function `random_hypergeometric` (the new implementation of the hypergeometric distribution). See more details below.
* Fix two tests in numpy/random/tests/test_generator_mt19937.py that used values returned by the hypergeometric distribution. The new implementation changes the stream, so those tests needed to be updated.
* Remove another test obviated by an added constraint on the arguments of hypergeometric.

Details of the rewrite:

If you carefully step through the old function `rk_hypergeometric_hyp()`, you'll see that the end result is basically the same as the new function `hypergeometric_sample()`, but the new function accomplishes the result with just integers. The floating point calculations in the old code caused problems when the arguments were extremely large (explained in more detail in the unmerged pull request numpy#9834).

The new version of `hypergeometric_hrua()` is a new translation of Stadlober's ratio-of-uniforms algorithm for the hypergeometric distribution. It fixes a mistake in the old implementation that made the method less efficient than it could be (see the details in the unmerged pull request numpy#11138), and uses a faster function for computing log(k!).

The HRUA algorithm suffers from loss of floating point precision when the arguments are *extremely* large (see the comments in github issue 11443). To avoid these problems, the arguments `ngood` and `nbad` of hypergeometric must be less than 10**9. This constraint obviates an existing regression test that was run on systems with 64-bit long integers, so that test was removed.
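As a user-level illustration of that constraint, here is a hedged sketch (not taken from the PR itself). It assumes the rewritten sampler is the one reached through `numpy.random.default_rng()` and that arguments of 10**9 or more are rejected with a `ValueError`; the exact exception type and message are assumptions.

```python
import numpy as np

# Assumed entry point for the new (non-legacy) implementation.
rng = np.random.default_rng(12345)

# Ordinary use: 5 draws with ngood=10, nbad=190, nsample=189.
print(rng.hypergeometric(10, 190, 189, size=5))

# The rewrite limits ngood and nbad to values below 10**9 because HRUA loses
# floating point precision for extremely large arguments. The ValueError here
# is an assumption about how that constraint is reported.
try:
    rng.hypergeometric(2 * 10**9, 10, 5)
except ValueError as exc:
    print("rejected:", exc)
```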