ENH: random: Add the multivariate hypergeometric distribution. #13794
(force-pushed from d221568 to 85d360c)
This is the updated version of #8056.
(force-pushed from 85d360c to e9d9742)
The refguide check doesn't like the blank line in the output of this example in the docstring:

The error reported in the refguide check on the azure macOS build is:

It looks like the checker ignores the part of the array that occurs after the blank line. Is this a known issue? Is there a work-around?

EDIT: Worked around: I removed the blank line from the docstring.
(force-pushed from 9e01f26 to c838dcb)
LGTM C/C++ passed. 🎉 🎆 😀 I don't know if that is just luck or if the LGTM folks have fixed something. Anyway, it's nice to see the green check mark.
(force-pushed from e350e65 to 570ea89)

(force-pushed from 07ec3c3 to f0753d7)
I've run some validation scripts for the multivariate hypergeometric distribution. For a given set of parameters and sample size, the basic calculation is to compute the expected counts for each point in the support of the distribution, and then do a G-test to compare that to the observed counts in a sample of that size. Under the usual statistical assumptions of the G-test, we expect the p-value to be uniformly distributed on [0, 1].

I put the scripts in my mvhg_validation repository. The two main scripts are "check1.py" and "check2.py". "check1.py" handles smaller inputs; "check2.py" can handle bigger input values (but the inputs still can't be too big if you want the calculation to complete in a reasonable amount of time). "check2.py" aggregates points in the support where the expected count is less than 100, grouping points so that the sum of the expected counts in each group is at least 100. I've been running "check2.py" so that it generates one to three thousand p-values, and it plots a histogram of those p-values; a visual inspection can confirm that their distribution looks uniform.

Here's a typical output printed by "check1.py":

And here are several plots generated by "check2.py":
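The validation scripts themselves aren't reproduced in this thread. As a rough sketch of the calculation described above (exact pmf over the enumerated support, observed counts from simulated draws, G statistic), with hypothetical helper names `mvhg_pmf`, `support`, and `g_test_stat`:

```python
import math
import random
from collections import Counter

def mvhg_pmf(colors, k):
    """P(counts == k): product of per-color binomials over C(total, nsample)."""
    return (math.prod(math.comb(c, ki) for c, ki in zip(colors, k))
            / math.comb(sum(colors), sum(k)))

def support(colors, nsample):
    """All count vectors k with 0 <= k[i] <= colors[i] and sum(k) == nsample."""
    if len(colors) == 1:
        return [(nsample,)] if nsample <= colors[0] else []
    out = []
    for k0 in range(min(colors[0], nsample) + 1):
        out.extend((k0,) + rest for rest in support(colors[1:], nsample - k0))
    return out

def g_test_stat(colors, nsample, nreps, rng):
    """Draw nreps samples (here via random.sample on a colored pool) and
    compute the G statistic against the exact expected counts."""
    pool = [i for i, c in enumerate(colors) for _ in range(c)]
    obs = Counter()
    for _ in range(nreps):
        draw = rng.sample(pool, nsample)
        obs[tuple(draw.count(i) for i in range(len(colors)))] += 1
    g = 0.0
    for k in support(colors, nsample):
        e = nreps * mvhg_pmf(colors, k)
        o = obs.get(k, 0)
        if o:
            g += 2.0 * o * math.log(o / e)
    return g
```

A p-value would then come from the chi-squared survival function (e.g. `scipy.stats.chi2.sf`) with degrees of freedom one less than the number of support points, fewer when low-expected-count cells are grouped as "check2.py" does.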
(force-pushed from f0753d7 to b0b9c33)
One more p-value histogram, using:
This is ready for review. (Reminder: this is the updated version of #8056.)

A friendly request: could someone remove the "WIP" tag from this pull request? It is ready for review. I know that the folks who would likely review this are focused on finishing the randomgen update, so I don't expect a review to happen soon, but I want to be sure that someone who does find time doesn't pass over this one because of the "WIP" tag.

Thanks @seberg!
@WarrenWeckesser there is a merge conflict. |
The pull request includes two methods for generating a sample:
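The two methods are described in detail in the commit message later in the thread (the "count" and "marginals" algorithms). As a hedged sketch of the two ideas, with hypothetical function names and NumPy's univariate `hypergeometric` standing in for the C internals (assumes `nsample <= sum(colors)`):

```python
import numpy as np

def mvhg_count(rng, colors, nsample):
    """'count' idea: permute a pool of sum(colors) color labels and
    tally the first nsample of them."""
    pool = np.repeat(np.arange(len(colors)), colors)
    rng.shuffle(pool)
    return np.bincount(pool[:nsample], minlength=len(colors))

def mvhg_marginals(rng, colors, nsample):
    """'marginals' idea: repeated univariate hypergeometric draws,
    one color at a time, conditioning on what remains."""
    remaining = sum(colors)
    counts = np.zeros(len(colors), dtype=np.int64)
    for i, c in enumerate(colors):
        remaining -= c  # items left in the colors after this one
        if nsample == 0:
            break
        if remaining == 0:
            counts[i] = nsample  # everything left must come from this color
            break
        k = rng.hypergeometric(c, remaining, nsample)
        counts[i] = k
        nsample -= k
    return counts
```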
It is not in the pull request, but I also experimented with a third implementation that used Floyd's algorithm:

It avoided creating the (potentially big) array that is created by the "count" method. I used the C library khash for the hash table. This third method is not in the pull request because, in all my tests, either the "count" method or the "marginals" method was faster. After seeing the pull request "MAINT: Rewrite Floyd algorithm" (#13812), I'm inspired to experiment a bit more with the hash table method. I only timed the full process of generating samples, so I don't know if swapping out khash for the simple hash table with linear probing would make a significant difference.
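Floyd's algorithm draws `nsample` distinct indices from an urn of `sum(colors)` items using exactly `nsample` hash-table insertions, which is why no array of length `sum(colors)` is needed. A sketch in Python, with a `set` standing in for khash and hypothetical function names:

```python
import bisect
import random

def floyd_sample(rng, n, m):
    """Floyd's algorithm: m distinct integers from range(n), m insertions."""
    chosen = set()  # plays the role of the C hash table (khash, above)
    for j in range(n - m, n):
        t = rng.randrange(j + 1)  # uniform on [0, j]
        # j itself can't be in chosen yet, so substituting it keeps
        # the sample distinct and uniformly distributed.
        chosen.add(j if t in chosen else t)
    return chosen

def mvhg_floyd(rng, colors, nsample):
    """Multivariate hypergeometric draw: treat range(sum(colors)) as the urn,
    with color i owning a contiguous block of colors[i] indices."""
    bounds = []
    acc = 0
    for c in colors:
        acc += c
        bounds.append(acc)  # cumulative upper bound of each color block
    counts = [0] * len(colors)
    for idx in floyd_sample(rng, acc, nsample):
        counts[bisect.bisect_right(bounds, idx)] += 1
    return counts
```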
The code generally LGTM. `master` needs to be merged in and the tests adapted to the current style. Whether this goes into 1.17, I leave up to @charris.
@WarrenWeckesser ping

It is a shame this nice addition didn't make it into 1.17. @WarrenWeckesser will you be picking it up again?
Yes, I will get back to this in a few weeks, if not sooner. |
(force-pushed from 44ca481 to e29f9c1)
Rebased and updated.
(force-pushed from e29f9c1 to 244270d)
@rkern's last comment was:
I have since rebased and updated the tests. For anyone taking a fresh look at this PR, the essential additions are:
All the other changes are to support the above, add tests, or update the documentation. |
xref #14608, since both that and this should be merged before 1.18, and both touch generator.pyx |
The new method

`multivariate_hypergeometric(self, object colors, object nsample, size=None, method='marginals')`

of the class `numpy.random.Generator` implements the multivariate hypergeometric distribution; see https://en.wikipedia.org/wiki/Hypergeometric_distribution, specifically the section "Multivariate hypergeometric distribution".

Two algorithms are implemented. The user selects which algorithm to use with the `method` parameter. The default, `method='marginals'`, is based on repeated calls of the univariate hypergeometric distribution function. The other algorithm, selected with `method='count'`, is a brute-force method that allocates an internal array of length `sum(colors)`. It should only be used when that value is small, but it can be much faster than the "marginals" algorithm in that case.

The C implementations of the two methods are in the files random_mvhg_count.c and random_mvhg_marginals.c in numpy/random/src/distributions.
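A brief usage sketch of the new method (assuming NumPy 1.18 or later, where `Generator.multivariate_hypergeometric` is available):

```python
import numpy as np

rng = np.random.default_rng(12345)
colors = [16, 8, 4]   # e.g. 16 red, 8 green, 4 blue marbles in the urn

# Default 'marginals' method: repeated univariate hypergeometric draws.
sample = rng.multivariate_hypergeometric(colors, nsample=6, size=3)

# 'count' method: allocates an internal array of length sum(colors) = 28
# here, so it is only appropriate when that sum is small.
sample2 = rng.multivariate_hypergeometric(colors, nsample=6, size=3,
                                          method='count')
```

Each row of the result is one draw: the per-color counts are nonnegative, bounded by `colors`, and sum to `nsample`.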
(force-pushed from 244270d to 8634b18)
I updated the PR to match the changes in #14608. I don't know why the Azure tests didn't run.

no idea why azure is not running :(

Thanks @WarrenWeckesser
The new file numpy/random/src/distributions/util.c contains the
function safe_sum_nonneg_int64() for summing nonnegative int64
values while checking for overflow.
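The C source of util.c isn't reproduced in the thread. A hypothetical Python analogue of the overflow check it is described as performing, with the int64 bound made explicit (Python ints are arbitrary precision, so the bound must be checked by hand to mirror the C logic):

```python
INT64_MAX = 2**63 - 1

def safe_sum_nonneg_int64(values):
    """Sum values, requiring each to be a nonnegative int64.

    Returns None if any value is negative or if adding a value would
    push the running total past INT64_MAX (the C version signals the
    same conditions as errors; the exact convention may differ).
    """
    total = 0
    for v in values:
        if v < 0 or v > INT64_MAX - total:
            return None  # negative input, or the sum would overflow int64
        total += v
    return total
```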