MAINT: random: Rewrite the hypergeometric distribution. #13761
Conversation
Force-pushed from 5314214 to 213456e
It is hard for me to assess the quality of the modified code. Do we have any proof-of-correctness test for showing that a large random sample's CDF matches the expected theoretical one?
@mattip wrote:
I thought about that. Ideally, a thorough reviewer would get a copy of Stadlober's papers and do a line-by-line inspection of the code to ensure that it matches the HRUA algorithm. That's a heavy load for a reviewer, and I'm not sure who has the time, interest or background to do it. Slightly less burdensome is to compare the new HRUA implementation to the previous version.

Another approach is for me to build a compelling case that the code is demonstrably correct. I can do that with a bunch of examples that exercise the new code with increasingly rigorous demands on the statistical behavior of the method, including a chi-square test of the histogram of a large sample against the expected histogram. I've done that in my scratch work, in an assortment of scripts and ipython sessions, but I need to get those better organized before putting them somewhere public. I will be working on that today.

In the meantime, a couple of quick tests are to (1) compute a few large samples and verify that the mean of the sample approaches the expected mean of the distribution as the sample size increases, and (2) plot the histogram of a large sample and visually compare it to the expected histogram based on the PMF. Neither of those is a proof of correctness, but they can confirm that the code is not totally wrong.

Here's an example of checking the mean:
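The snippet itself isn't reproduced above; a minimal sketch of this kind of mean check, assuming the new `Generator.hypergeometric` API and using illustrative parameter values (not the ones from the original comment), could look like:

```python
import numpy as np

# Illustrative parameters (not the values from the original comment).
ngood, nbad, nsample = 1500, 3000, 500

# Expected mean of hypergeometric(ngood, nbad, nsample).
expected_mean = nsample * ngood / (ngood + nbad)

rng = np.random.default_rng()
for size in [10_000, 100_000, 1_000_000]:
    sample = rng.hypergeometric(ngood, nbad, nsample, size=size)
    print(f"size={size:>9d}  sample mean={sample.mean():.4f}  expected={expected_mean:.4f}")
```

As the sample size grows, the sample mean should settle toward the expected mean; a persistent bias would point to a problem.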
Here I compute and plot the histogram and the expected histogram of a large sample.
In the plot, the blue diamonds are the sample histogram, and the black dots show the expected histogram.
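The plot itself isn't shown here; a sketch of how such a comparison can be generated, assuming `scipy` and `matplotlib` are available (parameter values are again illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import hypergeom

# Illustrative parameters.
ngood, nbad, nsample = 300, 500, 120
size = 1_000_000

rng = np.random.default_rng()
sample = rng.hypergeometric(ngood, nbad, nsample, size=size)

# Sample histogram: how many times each value occurred.
values, counts = np.unique(sample, return_counts=True)

# Expected histogram: size * PMF.  scipy's hypergeom uses the (M, n, N)
# parametrization: M = total items, n = good items, N = number drawn.
k = np.arange(values.min(), values.max() + 1)
expected = size * hypergeom.pmf(k, ngood + nbad, ngood, nsample)

plt.plot(values, counts, 'd', color='tab:blue', label='sample histogram')
plt.plot(k, expected, 'k.', label='expected histogram')
plt.xlabel('k')
plt.ylabel('count')
plt.legend()
plt.show()
```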
I put a file containing some code for testing … The two main functions in the file are … The other main function, … Here's an ipython session showing them in use:
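The session and the actual file aren't reproduced above; the following is only a rough sketch of the kind of G-test check described in this thread, with an illustrative helper name (`gtest_hypergeom`) and illustrative parameters, using `scipy.stats.power_divergence` with the log-likelihood-ratio statistic:

```python
import numpy as np
from scipy.stats import hypergeom, power_divergence

def gtest_hypergeom(ngood, nbad, nsample, size, rng):
    # Hypothetical helper (not from the actual test file): draw a sample,
    # histogram it, and run a G-test against the expected frequencies.
    sample = rng.hypergeometric(ngood, nbad, nsample, size=size)
    lo, hi = sample.min(), sample.max()
    k = np.arange(lo, hi + 1)
    observed = np.bincount(sample - lo, minlength=len(k))
    expected = size * hypergeom.pmf(k, ngood + nbad, ngood, nsample)
    # Rescale so the expected counts sum to the observed total; a more
    # careful test would also pool bins with very small expected counts.
    expected *= observed.sum() / expected.sum()
    stat, p = power_divergence(observed, expected, lambda_='log-likelihood')
    return p

rng = np.random.default_rng()
for ngood, nbad, nsample in [(100, 200, 50), (1000, 1000, 100), (10**6, 10**6, 1000)]:
    p = gtest_hypergeom(ngood, nbad, nsample, size=1_000_000, rng=rng)
    print(ngood, nbad, nsample, p)
```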
All those p-values look reasonable. If the code is correct, and if the asymptotic assumptions of the G-test are satisfied, then the p-value should be uniformly distributed on the interval [0, 1]. That means there will occasionally be small p-values in the above table, and these do not necessarily mean something is wrong. On the other hand, many tiny values would indicate a problem. If there is one that seems suspiciously small, those parameters can be tested again multiple times with
Those values look reasonable.
A few comments about the C code:
I think one of the changes is not allowed.
Force-pushed from accf268 to 189e459
Summary of the changes:

* Move the functions random_hypergeometric_hyp, random_hypergeometric_hrua and random_hypergeometric from distributions.c to legacy-distributions.c. These are now the legacy implementation of hypergeometric.
* Add the files logfactorial.c and logfactorial.h, containing the function logfactorial(int64_t k).
* Add the files random_hypergeometric.c and random_hypergeometric.h, containing the function random_hypergeometric (the new implementation of the hypergeometric distribution). See more details below.
* Fix two tests in numpy/random/tests/test_generator_mt19937.py that used values returned by the hypergeometric distribution. The new implementation changes the stream, so those tests needed to be updated.
* Remove another test obviated by an added constraint on the arguments of hypergeometric.

Details of the rewrite:

If you carefully step through the old function rk_hypergeometric_hyp(), you'll see that the end result is basically the same as the new function hypergeometric_sample(), but the new function accomplishes the result with just integers (see the sketch below). The floating point calculations in the old code caused problems when the arguments were extremely large (explained in more detail in the unmerged pull request numpy#9834).

The new version of hypergeometric_hrua() is a new translation of Stadlober's ratio-of-uniforms algorithm for the hypergeometric distribution. It fixes a mistake in the old implementation that made the method less efficient than it could be (see the details in the unmerged pull request numpy#11138), and uses a faster function for computing log(k!).

The HRUA algorithm suffers from loss of floating point precision when the arguments are *extremely* large (see the comments in GitHub issue 11443). To avoid these problems, the arguments `ngood` and `nbad` of hypergeometric must be less than 10**9. This constraint obviates an existing regression test that was run on systems with 64 bit long integers, so that test was removed.
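For illustration only, here is a sketch in Python of the integer-only counting idea mentioned above. It is not the actual C code of hypergeometric_sample(); a real implementation would also exploit the symmetries of the distribution to keep the loop short.

```python
import random

def hypergeometric_count_sketch(good, bad, nsample, rng=random):
    # Simulate drawing `nsample` items without replacement from an urn with
    # `good` good items and `bad` bad items, counting the good items drawn.
    # Only integer arithmetic and integer comparisons are used.
    remaining_good = good
    remaining_total = good + bad
    count = 0
    for _ in range(nsample):
        # A uniform integer in [0, remaining_total) is < remaining_good with
        # probability remaining_good / remaining_total, which is exactly the
        # chance that the next item drawn is a good one.
        if rng.randrange(remaining_total) < remaining_good:
            count += 1
            remaining_good -= 1
        remaining_total -= 1
    return count
```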
Force-pushed from 189e459 to b2d2b67
@bashtage, I pushed an update with your suggested changes:
Note that the …
The logs for the "LGTM analysis: C/C++" failure don't explain what failed (at least, I can't find a clear error message in there). Am I missing something obvious?
@WarrenWeckesser The LGTM C/C++ check has been acting up lately and is under investigation.
LGTM.
Thanks @WarrenWeckesser, @bashtage