ENH: random: Add multivariate_hypergeometric function. #8056

Closed

Conversation

WarrenWeckesser
Member

Add a function that draws from a multivariate hypergeometric distribution.

@charris
Member

charris commented Sep 18, 2016

We usually have a check for repeatability. Here, with the two different implementations, we need at least two. It isn't possible to try all the different color combinations, but maybe pick something simple and a long sequence. I used a hash of the sequences for some checks, but attention needs to be paid to little/big endianness so that, viewed as bytes, the sequences are the same. See line 179 in test_random.py.

@charris
Member

charris commented Sep 18, 2016

For small sample lengths (< 256) you could just convert the output to uint8 if you go the hash route.
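
A rough sketch of that kind of check, with a hypothetical seed and arguments (the hashed tests in test_random.py show the real pattern):

import hashlib
import numpy as np

np.random.seed(1234)
sample = np.random.multivariate_hypergeometric([16, 8, 4], 6, size=1000)
# Every count here is < 256, so converting to uint8 makes the byte
# view independent of the platform's byte order.
digest = hashlib.md5(sample.astype(np.uint8).tobytes()).hexdigest()
# Compare `digest` against a value recorded from a reference run.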

ncolors = len(colors)
shape = _shape_from_size(size, ncolors)
samples = np.zeros(shape, dtype=np.int)
samples_ptr = <long*> PyArray_DATA(samples)
Member

I think np.int_ is the approved type for long.
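
That is, something along these lines (a sketch of the suggested change):

samples = np.zeros(shape, dtype=np.int_)  # np.int_ is the scalar type corresponding to C long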

@charris
Member

charris commented Sep 18, 2016

This will likely have conflicts with #6938, which I want to merge first.

EDIT: I also want to make a closer check of the details; the current comments are from a first look.

@WarrenWeckesser
Member Author

This will likely have conflicts with #6938, which I want to merge first.

No problem. And thanks for the prompt "first look"!

@charris
Member

charris commented Sep 19, 2016

I note that rk_hypergeometric is not implemented in the most efficient manner. It can't be completely fixed up in the obvious way while maintaining backwards compatibility, but the floor function and the cast to long can be removed.

@WarrenWeckesser
Member Author

In an inline comment, @charris asked about the rule used to choose between the two algorithms. With _multivariate_hypergeometric1 and _multivariate_hypergeometric2 made public, I used this simple script (and other variations) to investigate the performance:

from __future__ import print_function, division

import sys
import numpy as np
import timeit

num_colors = [3, 5, 10, 15, 20, 25, 50, 100, 200, 500, 1000]
totals = [25, 50, 75, 100, 150, 200, 500, 750, 1000, 1500, 2000, 3000, 5000, 6000, 8000, 10000]

f = sys.stdout
for total in totals:
    for ncolors in num_colors:
        # About 25 sample sizes, log-spaced from 3 up to the total number of items.
        nsamples = np.unique((np.logspace(np.log10(3), np.log10(total), 25) + 0.5).astype(int))
        # A random partition of `total` items among `ncolors` colors.
        colors = np.bincount(np.random.choice(ncolors, total), minlength=ncolors)
        for nsample in nsamples:
            number = int(20000 / ncolors)
            t1 = timeit.timeit('np.random.multivariate_hypergeometric1(colors, nsample, size=100)',
                               globals=globals(), number=number)
            t2 = timeit.timeit('np.random.multivariate_hypergeometric2(colors, nsample, size=100)',
                               globals=globals(), number=number)
            if t1 < t2:
                best = 1
            else:
                best = 2
            # Columns: ncolors, total, nsample, t1, t2, best
            print("%5d %5d %5d %10.6f %10.6f %d" %
                  (ncolors, total, nsample, t1, t2, best), file=f)

I found that algorithm 1 (iterated calls to rk_hypergeometric) was faster when nsample > 12*len(colors). It is not a perfect rule, but in the cases where I saw it pick the wrong algorithm, the difference in performance was on the order of 25 to 30%. (Note that in some extreme cases, the difference in performance can be several orders of magnitude. The simple rule gets these right.)

I also experimented with even larger values of sum(colors) in an IPython shell. When sum(colors) is huge (e.g. a million), algorithm 1 is faster. So the rule I ended up with is: use algorithm 1 when sum(colors) > 500000 or nsample > 12*len(colors).
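
As a Python sketch (the helper name is hypothetical; the actual dispatch happens inside the Cython code):

def _choose_algorithm(colors, nsample):
    # Algorithm 1 makes iterated calls to the univariate
    # hypergeometric sampler; algorithm 2 is the alternative.
    if sum(colors) > 500000 or nsample > 12 * len(colors):
        return 1
    return 2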

I used only one computer for this experimentation, so I don't know how the results would be affected by a different system. I can re-expose the individual algorithms if anyone wants to experiment further.

I guess we could expose both algorithms permanently, with, say, a method argument, but I'd rather not do that without a compelling use-case. If the rule for choosing the method works well enough, we shouldn't clutter the API with options that would rarely, if ever, be needed by a numpy user. (If a compelling use-case shows up later, it is easy to add the option in a future version of numpy.)

@WarrenWeckesser
Member Author

I pushed an update that addresses most of the comments so far. I left the two algorithms as (private) python functions. If anyone wants to experiment with the performance of the two algorithms, they are available as numpy.random.mtrand._rand._multivariate_hypergeometric#, where # is 1 or 2.
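
Access would look something like this (the argument values are just illustrative):

import numpy as np

rs = np.random.mtrand._rand  # the module-level RandomState instance
s1 = rs._multivariate_hypergeometric1([10, 10, 10], 5, size=4)
s2 = rs._multivariate_hypergeometric2([10, 10, 10], 5, size=4)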

@wrwrwr
Contributor

wrwrwr commented Oct 11, 2016

Just an additional data point for the tuning (from a moderately aged Callisto processor running Python 3.5 on Ubuntu 16.04 with ATLAS). I used your script to gather the data and this bit to process it.

For this data the heuristic is correct in 96% of cases, which isn't bad, though by changing 12 to 19 you can get close to 99%.
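
For reference, the accuracy can be computed from the script's output with something like this (hypothetical file name; t generalizes the hard-coded constant):

import numpy as np

# Columns written by the script: ncolors, total, nsample, t1, t2, best
data = np.loadtxt('timings.txt')
ncolors, nsample, best = data[:, 0], data[:, 2], data[:, 5]

def accuracy(t):
    predicted = np.where(nsample > t * ncolors, 1, 2)
    return (predicted == best).mean()

print(accuracy(12), accuracy(19))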

Contributor

@wrwrwr wrwrwr left a comment

Sorry for the nitpicks; feel free to ignore them. I haven't found anything more significant in the whole thing :-)

@WarrenWeckesser
Member Author

@wrwrwr All nits worth picking. :) I'll fix them when I update again. Thanks for testing the heuristic. I'll experiment some more with 12 vs. 19 (or some other value) before updating the code.

@wrwrwr
Contributor

wrwrwr commented Oct 12, 2016

Another bit. Couldn't we speed up cases with the number of samples close to the total number of items using symmetry (possibly just by adding a flag and some subtractions to algorithm 2)?

For example, multivariate_hypergeometric(3 * [int(1e16)], int(3e16) - 5) currently takes a while to compute, but could be sampled by choosing the colors of the few items not to be drawn. The example is rather extreme, but my feeling is that such an approach could let algorithm 2 compete with algorithm 1 in the cases where 1 currently wins.
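
As a sketch of the idea, where draw_mvhg stands for whichever underlying sampler is used:

def sample_with_symmetry(draw_mvhg, colors, nsample):
    total = sum(colors)
    if nsample > total // 2:
        # Choosing which items to leave behind is equivalent to
        # choosing which items to draw, and the complement is smaller.
        leftover = draw_mvhg(colors, total - nsample)
        return [c - l for c, l in zip(colors, leftover)]
    return draw_mvhg(colors, nsample)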

@homu
Contributor

homu commented Oct 28, 2016

☔ The latest upstream changes (presumably #8219) made this pull request unmergeable. Please resolve the merge conflicts.

@charris charris added this to the 1.13.0 release milestone Nov 4, 2016
@WarrenWeckesser
Member Author

That's fine. I ended up rewriting the hypergeometric distribution functions as private functions to be used in multivariate_hypergeometric. It is more work, and it will take more effort to review, but it means that the multivariate hypergeometric function with two colors (which is the univariate hypergeometric distribution) will be faster than the existing hypergeometric function. When numpy.random gets an API for versioning, we'll be able to use the new code in hypergeometric.

@WarrenWeckesser
Member Author

I have updated the code with a faster version of the "marginals" method.

This should not be merged yet. I am investigating the behavior of the method when it is given extremely large arguments.

@WarrenWeckesser WarrenWeckesser force-pushed the multivar-hypergeom branch 2 times, most recently from 87c0508 to 3c303b7 on November 12, 2018
@WarrenWeckesser
Member Author

WarrenWeckesser commented Nov 12, 2018

The test failure on "Shippable" appears to be a bug that is not directly related to this pull request (though it is possibly triggered because not all the tests set a random seed). The test that is failing is TestRandomDist.test_weibull_0, which does assert_equal(np.random.weibull(a=0), 0). The problem is that, even in numpy 1.15.4, we have:

In [36]: np.__version__
Out[36]: '1.15.4'

In [37]: np.random.seed(123)

In [38]: np.random.weibull(a=0, size=10)
Out[38]: array([inf,  0.,  0.,  0., inf,  0., inf, inf,  0.,  0.])

Those inf values should not be there. I created a new issue for this: #12371

@WarrenWeckesser
Member Author

I made a lot of changes since the last review, and the branch for 1.16 is imminent, so I won't be surprised if this gets pushed off to the next release. @charris, just letting you know that you don't have to feel guilty when you do it. 😃

This pull request includes a new C implementation of the (univariate) hypergeometric distribution that is used in the "marginals" method of the multivariate hypergeometric distribution. I went back to Stadlober's thesis and paper to better understand what was going on, and I added a lot more documentation in the function hypergeometric_hrua. This implementation fixes several of the issues that I've reported in the past, but it doesn't replace the existing implementation; we can't do that until NEP 19 takes effect. One new feature in the C code is the function double logfactorial(unsigned long k) that computes log(k!) for an integer k. This is much faster than using loggam; in fact, for 0 <= k <= 125, logfactorial(k) is just a table lookup.
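
For illustration, here is the table-lookup idea mirrored in Python (the real implementation is in C; math.lgamma stands in for the large-k computation):

import math

# log(k!) via a precomputed table for 0 <= k <= 125.
_LOGFACT = [math.lgamma(k + 1) for k in range(126)]

def logfactorial(k):
    if k < 126:
        return _LOGFACT[k]
    return math.lgamma(k + 1)  # stand-in for the asymptotic branch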

I have limited the "marginals" method to sum(colors) < 1e9. When sum(colors) is very large, there is loss of precision in the line T = g - gp; in the C function hypergeometric_hrua.

The loss of precision could be alleviated by reformulating the calculation of T to use a function such as

double log_ratio_factorials(long a, long b)
{
    /* Compute log(a!/b!); requires <math.h> for log(). */
    double result = 0.0;

    if (a < b) {
        /* log(a!/b!) = -(log(a+1) + ... + log(b)) */
        long i;
        for (i = b; i > a; --i) {
            result -= log(i);
        }
    }
    else {
        /* log(a!/b!) = log(b+1) + ... + log(a) */
        long i;
        for (i = a; i > b; --i) {
            result += log(i);
        }
    }
    return result;
}

that computes log(a!/b!) where a and b are integers, but that particular implementation is slow: its time complexity is O(|a - b|), and |a - b| can be big for the parameter ranges where using this function would make a significant difference. And I don't think it is worthwhile to spend more time on this without a solid use-case for sum(colors) >> 1e9. Even the upper bound 1e9 is probably far beyond what anybody will attempt, except perhaps by accident.

With the latest faster version of the "marginals" method, I was tempted to remove the "count" method. But when colors has a lot of small values, and nsample is not too big, the "count" method is still much faster than the "marginals" method. For example,

In [42]: colors = np.full(25, fill_value=5)

In [43]: %timeit np.random.multivariate_hypergeometric(colors, nsample=8, size=1000000, method='marginals')
1.56 s ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [44]: %timeit np.random.multivariate_hypergeometric(colors, nsample=8, size=1000000, method='count')
156 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@charris
Member

charris commented Nov 13, 2018

@WarrenWeckesser So are you happy with this in its current state?

@WarrenWeckesser
Member Author

@charris, I just noticed that the code in the function mvhg_marginals() was making the substitution nsample = total - nsample if nsample > total/2, but then the routines that it (ultimately) calls, hypergeometric_sample() and hypergeometric_hrua(), also make a similar check and transformation for the (univariate) hypergeometric distribution. I'm cleaning that up now.

@charris charris modified the milestones: 1.16.0 release, 1.17.0 release Nov 17, 2018
@charris
Member

charris commented Nov 17, 2018

Pushed off to 1.17. Let me know when you think this is ready.

@WarrenWeckesser
Member Author

No problem. I ran some tests for ranges of the inputs that I hadn't tried before, and I noticed some bias in the "big" tests that I've been running. Now I need to determine if that issue is inherent in the HRUA algorithm, or if it is a limitation of (or bug in) my test code.

@mattip
Member

mattip commented Jun 6, 2019

@bashtage is this something we should consider before the 1.17 release, or can it be added afterward?

@WarrenWeckesser
Member Author

Is there a target date for the release of 1.17?

@bashtage
Contributor

bashtage commented Jun 6, 2019

Since this PR is only about multivariate_hypergeometric, it could wait until later. However, I think @WarrenWeckesser is also planning on improving hypergeometric to work better in edge cases. Any changes to hypergeometric really should get in before 1.17, since these will change the stream (and so might require a warning cycle).

@WarrenWeckesser
Member Author

With randomgen now merged, it makes sense to break this pull request into two parts. The first part is the rewrite of the univariate hypergeometric code. That is now in #13761. The second part will be to add the new multivariate hypergeometric distribution, but that might have to wait until 1.18.

@charris
Member

charris commented Jun 14, 2019

@WarrenWeckesser Closing this then.

@charris charris closed this Jun 14, 2019
@charris charris removed this from the 1.17.0 release milestone Jun 14, 2019
@charris
Member

charris commented Jun 14, 2019

I assume there will be a new PR with the second part.

@WarrenWeckesser
Member Author

@charris wrote

I assume there will be a new PR with the second part.

Yes. If the branching of 1.17 is postponed because of the ongoing discussion of the new randomgen code, it might be possible to get multivariate_hypergeometric into 1.17, but I'm not counting on that.
