MAINT: random: Rewrite the hypergeometric distribution. #13761


Merged (1 commit) Jun 15, 2019

Conversation

@WarrenWeckesser (Member) commented Jun 12, 2019

Summary of the changes:

  • Move the functions random_hypergeometric_hyp, random_hypergeometric_hrua
    and random_hypergeometric from distributions.c to legacy-distributions.c.
    These are now the legacy implementation of hypergeometric.
  • Add the files logfactorial.c and logfactorial.h,
    containing the function logfactorial(int64_t k).
  • Add the files random_hypergeometric.c and random_hypergeometric.h,
    containing the function random_hypergeometric (the new implementation
    of the hypergeometric distribution). See more details below.
  • Fix two tests in numpy/random/tests/test_generator_mt19937.py that
    used values returned by the hypergeometric distribution. The
    new implementation changes the stream, so those tests needed to
    be updated.
  • Remove another test obviated by an added constraint on the arguments
    of hypergeometric.

Details of the rewrite:

If you carefully step through the old function rk_hypergeometric_hyp(),
you'll see that the end result is essentially the same as that of the new
function hypergeometric_sample(), but the new function gets there using
only integer arithmetic. The floating point calculations in the old code
caused problems when the arguments were extremely large (explained in more
detail in the unmerged pull request #9834).
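
As a rough illustration of that approach, here is a Python sketch of an
integer-only urn simulation. This is not the C code itself (details such
as the symmetry tricks in the real implementation are omitted), and the
helper rng is assumed to be a numpy Generator:

import numpy as np

def urn_hypergeometric(rng, good, bad, sample):
    # Draw `sample` items one at a time, tracking only integer counts.
    # Assumes 0 <= sample <= good + bad.
    remaining_good = good
    remaining_total = good + bad
    count_good = 0
    for _ in range(sample):
        # rng.integers(n) is uniform on {0, ..., n-1}, so this draw is
        # "good" with probability remaining_good / remaining_total.
        if rng.integers(remaining_total) < remaining_good:
            count_good += 1
            remaining_good -= 1
        remaining_total -= 1
    return count_good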

The new version of hypergeometric_hrua() is a fresh translation of
Stadlober's ratio-of-uniforms algorithm for the hypergeometric
distribution. It fixes a mistake in the old implementation that made the
method less efficient than it could be (see the details in the unmerged
pull request #11138), and it uses a faster function for computing log(k!).
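
For reference, here is a Python sketch of the table-plus-Stirling idea
behind a fast log(k!); the cutoff of 126 and the number of series terms
are assumptions, not necessarily what the C code uses:

import math

# Exact values for small k; the asymptotic series takes over beyond the table.
_HALFLN2PI = 0.5 * math.log(2.0 * math.pi)
_TABLE = [math.lgamma(k + 1) for k in range(126)]

def logfactorial(k):
    if k < len(_TABLE):
        return _TABLE[k]
    # Stirling's series:
    # log(k!) ~ (k + 1/2)log(k) - k + log(2*pi)/2 + 1/(12k) - 1/(360k^3) + ...
    return ((k + 0.5) * math.log(k) - k + _HALFLN2PI
            + (1.0 / k) * (1.0 / 12.0 - 1.0 / (360.0 * k * k)))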

The HRUA algorithm suffers from loss of floating point precision when
the arguments are extremely large (see the comments in github issue
#11443). To avoid these problems, the arguments ngood and nbad of
hypergeometric must each be less than 10**9. This constraint obviates an
existing regression test that was run on systems with 64-bit long
integers, so that test was removed.
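
A quick sketch of the constraint in use (np.random.default_rng() is just
current-API shorthand for constructing a Generator; the exact error
message is not quoted here):

import numpy as np

rng = np.random.default_rng()
rng.hypergeometric(10**8, 10**8, 100)  # accepted
try:
    rng.hypergeometric(10**9, 10, 5)   # ngood is not less than 10**9
except ValueError as exc:
    print(exc)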

@mattip (Member) left a comment:

It is hard for me to assess the quality of the modified code. Do we have any proof-of-correctness test for showing that a large random sample's CDF matches the expected theoretical one?

@WarrenWeckesser (Member, Author)

@mattip wrote

It is hard for me to assess the quality of the modified code. Do we have any proof-of-correctness test for showing that a large random sample's CDF matches the expected theoretical one?

I thought about that. Ideally, a thorough reviewer would get a copy of Stadlober's papers and do a line-by-line inspection of the code to ensure that it matches the HRUA algorithm. That's a heavy load for a reviewer, and I'm not sure who has the time, interest or background to do it.

Slightly less burdensome is to compare the new HRUA to the previous version in rk_hypergeometric_hrua (now in the legacy code). It is the same algorithm, but because I was referring back to the original papers to understand the implementation and verify that it was correct, I found it much easier to use variable names that matched Stadlober's papers. That should make it easier for any future maintainers, too.

Another approach is for me to build a compelling case that the code is demonstrably correct. I can do that with a bunch of examples that exercise the new code with increasingly rigorous demands on the statistical behavior of the method, including a chi-square test of the histogram of a large sample against the expected histogram. I've done that in my scratch work, in an assortment of scripts and ipython sessions, but I need to get those better organized before putting them somewhere public. I will be working on that today.

In the meantime, a couple of quick tests are to (1) compute a few large samples, and verify that the mean of the sample approaches the expected mean of the distribution as the sample size increases; and (2) plot the histogram of a large sample and visually compare it to the expected histogram based on the PMF. Neither of those is a proof of correctness, but they can confirm that the code is not totally wrong.

Here's an example of checking the mean:

In [20]: import numpy as np

In [21]: gen = np.random.Generator()

In [22]: good = 15

In [23]: bad = 45

In [24]: nsample = 20

In [25]: nsample * (good/(good + bad))  # population mean of the distribution
Out[25]: 5.0

In [26]: for size in [100, 1000, 10000, 100000, 1000000, 10000000]:
    ...:     s = gen.hypergeometric(good, bad, nsample, size=size)
    ...:     print(s.mean())
    ...:
5.22
4.99
4.9749
5.002
5.002137
4.9995735

Here I compute and plot the histogram and the expected histogram of a large sample. mpsci.distributions.hypergeometric.pmf is used to compute the PMF. (scipy.stats.hypergeom.pmf could be used, but I don't have scipy installed in my current numpy development environment.)

In [48]: import matplotlib.pyplot as plt

In [49]: gen = np.random.Generator()

In [50]: good = 15

In [51]: bad = 45

In [52]: nsample = 20

In [53]: size = 10000000

In [54]: s = gen.hypergeometric(good, bad, nsample, size=size)

In [55]: h = np.bincount(s, minlength=good+1)  # histogram

In [56]: k = np.arange(good+1)  # support (in this case)

In [57]: from mpsci.distributions import hypergeometric

In [58]: expected = [size*float(hypergeometric.pmf(m, good+bad, good, nsample)) for m in k]

In [59]: plt.plot(k, h, 'cd', alpha=0.3, markersize=7)
Out[59]: [<matplotlib.lines.Line2D at 0x124115e48>]

In [60]: plt.plot(k, expected, 'k.', alpha=0.75, markersize=3)
Out[60]: [<matplotlib.lines.Line2D at 0x123128588>]

In the plot, the blue diamonds are the sample histogram, and the black dots show the expected histogram:

[Figure_1: histogram comparison plot]

@WarrenWeckesser (Member, Author)

WarrenWeckesser commented Jun 13, 2019

I put a file containing some code for testing hypergeometric in a gist on github: https://gist.github.com/WarrenWeckesser/26d7ef31e52561a695b5efad854f17bf

The two main functions in the file are hypergeometric_wide_range_test(size=100_000_000, rng=None) and hypergeometric_pvalues(good, bad, sample, size, ntrials, rng=None). The first function generates a wide range of parameter sets (good, bad, sample); for each parameter set, it draws 100000000 variates (by default) and computes both the sample histogram and the expected histogram. Bins whose expected count is below 100 are aggregated. The sample and expected histograms are then compared using the G-test to compute a p-value. The results of all the tests are printed in a table. The first three columns are good, bad and sample, and the last column is the p-value. (The intervening columns give some details of the aggregation of the low expected values.)

The other main function, hypergeometric_pvalues(good, bad, sample, size, ntrials, rng=None), does repeated G-tests for the same parameter values.
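
For readers unfamiliar with the G-test, here is a minimal Python sketch
of the comparison (the gist's implementation differs in details such as
how the low-expected-count bins are aggregated):

import numpy as np
from scipy.stats import chi2

def g_test_pvalue(observed, expected):
    # Under the null hypothesis, G is asymptotically chi-square
    # distributed with len(observed) - 1 degrees of freedom.
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    g = 2.0 * np.sum(observed * np.log(observed / expected))
    return chi2.sf(g, len(observed) - 1)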

Here's an ipython session showing them in use:

In [1]: run compute_pvalues_for_assorted_samples.py                                                                                                                 

In [2]: from numpy.random import Generator, Xoshiro256                                                                                                              

In [3]: seed = abs(hash("plate of shrimp"))                                                                                                                         

In [4]: gen = np.random.Generator(Xoshiro256(seed))                                                                                                                 

In [5]: hypergeometric_wide_range_test(rng=gen)                                                                                                                     
                   hypergeometric distribution test
                         size = 100000000

                                    len           total low  g-test
      good        bad     nsample support   nlow   expected    len   p-value
         6         27          3      4        0      0.000      4   0.885811
         4          9          6      5        0      0.000      5   0.260972
        19          1         16      2        0      0.000      2   0.360859
        21        158        131     22        6     61.384     16   0.254341
         6       1343        394      7        0      0.000      7   0.756573
         2       1363        696      3        0      0.000      3   0.430773
         1      11377       1979      2        0      0.000      2   0.587718
        13      19995       1340     14        6     37.670      8   0.148194
         2      84080      42338      3        0      0.000      3   0.093160
         1     121648       5405      2        0      0.000      2   0.160055
        10     552645     130278     11        1     52.973     10   0.143860
         8    1058111      32532      9        3      2.241      6   0.248078
        29          6          1      2        0      0.000      2   0.975053
        17          3          3      4        0      0.000      4   0.114197
         5        177          5      6        2     56.276      4   0.823513
        15         10          6      7        0      0.000      7   0.588383
        13        846         34     14        7     11.668      7   0.749650
        59       1582         45     46       35     19.791     11   0.516174
        83      11179       6207     84       42    196.633     43   0.454819
       269      13419       4091    270      201    301.646     70   0.444135
        99      13010       1145    100       75     85.989     25   0.619522
        38      37714       7188     39       18     70.969     21   0.313536
        13     503884      58318     14        4      8.831     10   0.354573
       388     716447     305386    389      300    446.491     90   0.499288
       682          5         27      6        1      6.423      5   0.514032
       197          7         53      8        0      0.000      8   0.102122
        62        233        152     63       30    150.178     34   0.587657
      2800         67       1202     68       30     95.619     38   0.963229
      1144       2069        636    637      538    451.111    100   0.037327
       861        331         25     26        7     56.348     19   0.836489
      2466        208        579    209      155    167.867     55   0.704305
      1763      39606        662    663      615    236.782     49   0.781564
      2558      41742       6980   2559     2399    756.969    161   0.631420
       945     670936      37614    946      880    249.158     67   0.653672
       894    3093159     815516    895      776    595.764    120   0.290478
       243    1708794     333550    244      186    203.972     59   0.707746
      8715          3       2476      4        0      0.000      4   0.014482
     40987         14       6246     15        4     23.623     11   0.724169
      3761         93       2611     94       52    150.667     43   0.981594
      5247        104          2      3        0      0.000      3   0.873658
     11833        551       4543    552      451    464.825    102   0.369602
     10572        693        573    574      521    162.498     54   0.106560
      7780       2571        660    661      563    476.204     99   0.049029
      1923      10153         13     14        3      9.289     11   0.893499
      5455      94142      71695   5456     5176   1406.942    281   0.156760
     14418      38184       5287   5288     5021   1416.575    268   0.767730
     21098     598581     126118  21099    20616   2644.748    484   0.214580
      6322    3037501     199729   6323     6148    847.590    176   0.658101
     88970          9      37019     10        0      0.000     10   0.335287
     36072         11       2983     12        4     28.409      8   0.579486
    159483        236      55795    237      169    282.215     69   0.656032
     89131         19      26124     20        3      7.726     17   0.820100
     47416        696      23813    697      579    630.381    119   0.044259
    366817         45     157819     46       15    100.263     32   0.451664
     26382        842       1379    843      785    290.531     59   0.793285
     54511       2734       6356   2735     2591    686.559    145   0.439886
    203928     287197      53751  53752    52878   5046.277    875   0.781830
     77845     136219        225    226      159    286.936     68   0.768440
     19708    2428382    1026104  19709    19134   3078.298    576   0.833665
    106761      88198       3950   3951     3682   1390.429    270   0.182850
    115213          9      16785     10        1      2.949      9   0.182563
    116103          8       8260      9        2      6.901      7   0.734816
    729793        304     399011    305      225    362.574     81   0.559511
    414719        306      97283    307      239    375.541     69   0.365590
    797151        825      36942    826      769    177.137     58   0.931600
    253021        595      53758    596      505    448.180     92   0.146083
     64641       2116      17683   2117     1939    821.769    179   0.007157
   2735937      24350     207418  24351    24000   1818.906    352   0.453351
   3332989       3450     209772   3451     3323    681.173    129   0.924802
    495363     222787     441121 222788   221295   9151.143   1494   0.212660
   4492837    2337980    4862820 1967998  1963928  29056.593   4071   0.907815
   1271277    1459851     495701 495702   493302  15882.020   2401   0.199247

All those p-values look reasonable. If the code is correct, and if the asymptotic assumptions of the G-test are satisfied, then the p-values should be uniformly distributed on the interval [0, 1]. That means there will occasionally be small p-values in the above table, and these do not necessarily mean something is wrong. On the other hand, many tiny values would indicate a problem.
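
One way to quantify "look reasonable" is to test the last column of the
table against a uniform distribution; a sketch, with only the first few
p-values shown:

import numpy as np
from scipy.stats import kstest

# Under the null hypothesis, the p-values are draws from Uniform(0, 1),
# so a large KS p-value here is consistent with correct sampling code.
pvalues = np.array([0.885811, 0.260972, 0.360859, 0.254341])  # etc.
print(kstest(pvalues, 'uniform'))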

If there is one that seems suspiciously small, those parameters can be tested again multiple times with hypergeometric_pvalues(good, bad, sample, size, ntrials, rng=None). For example, the row in the above table with good=64641, bad=2116 and sample=17683 shows a p-value of 0.007 (cue the Bond villain theme...). Here are some more tests for that parameter set:

In [6]: hypergeometric_pvalues(64641, 2116, 17683, 100000000, 10)                                                                                                   
Out[6]: 
[0.8191117004811211,
 0.282999545102632,
 0.28633452497537415,
 0.137801147739903,
 0.36847388486618193,
 0.16345315121742077,
 0.5434198448355886,
 0.5235508667177059,
 0.0242750275554843,
 0.43094414166451833]

Those values look reasonable.

@WarrenWeckesser (Member, Author)

A few comments about the C code:

  • The file distributions.c is now 1741 lines. If I kept the hypergeometric distribution in there, it would be close to 2000 lines. My preference is for shorter files, so I put the new implementation of hypergeometric in its own file, random_hypergeometric.c. Any new distributions added later could also go in separate files, so if and when the multivariate hypergeometric distribution is added, it would be in its own file. Of course, if there is a strong objection to this, it can all go in distributions.c.
  • I put the function logfactorial (a replacement for loggam when the argument is an integer) in its own file. This might be useful elsewhere.

@charris added this to the 1.17.0 release milestone Jun 14, 2019
@bashtage (Contributor) left a comment:

I think one of the changes is not allowed.

@WarrenWeckesser force-pushed the random-hypergeometric branch 2 times, most recently from accf268 to 189e459 on June 14, 2019 at 19:31
@WarrenWeckesser force-pushed the random-hypergeometric branch from 189e459 to b2d2b67 on June 14, 2019 at 20:14
@WarrenWeckesser (Member, Author)

@bashtage, I pushed an update with your suggested changes:

  • random_poisson_ptrs is back to using loggam instead of logfactorial.
  • The limits on the arguments ngood and nbad are mentioned in the Parameters section.
  • In the Notes, where these limits are also discussed, the binomial
    distribution is suggested as an alternative (a quick sketch of that
    fallback follows).
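
A quick sketch of that fallback (the parameter values here are arbitrary;
the approximation assumes nsample is small relative to ngood + nbad, so
the draws are nearly independent):

import numpy as np

rng = np.random.default_rng()
ngood, nbad, nsample = 10**10, 10**10, 1000
# hypergeometric would reject these arguments; binomial with
# p = ngood/(ngood + nbad) is a close stand-in for huge populations.
s = rng.binomial(nsample, ngood / (ngood + nbad), size=5)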

Note that the static qualifier of loggam was removed, because it is now used in both distributions.c and in legacy-distributions.c.

@WarrenWeckesser (Member, Author)

The logs for the "LGTM analysis: C/C++" failure don't explain what failed (at least, I can't find a clear error message in there). Am I missing something obvious?

@charris (Member) commented Jun 14, 2019

@WarrenWeckesser The LGTM C/C++ check has been acting up lately and is under investigation.

@bashtage (Contributor)

LGTM.

@mattip merged commit 6b1590d into numpy:master Jun 15, 2019
@mattip (Member)

mattip commented Jun 15, 2019

Thanks @WarrenWeckesser, @bashtage
