ENH: random: Add multivariate_hypergeometric function. #8056

Closed

Conversation

WarrenWeckesser
Member

Add a function that draws from a multivariate hypergeometric distribution.

@charris
Member

charris commented Sep 18, 2016

We usually have a check for repeatability. Here, with the two different implementations, we need at least two. It isn't possible to try all the different color combinations, but maybe pick something simple and a long sequence. I used a hash of the sequences for some checks, but attention needs to be paid to little/big endianness so that, viewed as bytes, the sequences are the same. See line 179 in test_random.py.

@charris
Member

charris commented Sep 18, 2016

For small sample lengths (< 256) you could just convert the output to uint8 if you go the hash route.
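
A rough sketch of that kind of check, with a hypothetical seed and arguments (the hashed tests in test_random.py show the real pattern):

import hashlib
import numpy as np

np.random.seed(1234)
sample = np.random.multivariate_hypergeometric([16, 8, 4], 6, size=1000)
# Every count here is < 256, so converting to uint8 makes the byte
# view independent of the platform's byte order.
digest = hashlib.md5(sample.astype(np.uint8).tobytes()).hexdigest()
# Compare `digest` against a value recorded from a reference run.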

ncolors = len(colors)
shape = _shape_from_size(size, ncolors)
samples = np.zeros(shape, dtype=np.int)
samples_ptr = <long*> PyArray_DATA(samples)
Member

I think np.int_ is the approved type for long.
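
That is, something along these lines (a sketch of the suggested change):

samples = np.zeros(shape, dtype=np.int_)  # np.int_ is the scalar type corresponding to C long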

@charris
Member

charris commented Sep 18, 2016

This will likely have conflicts with #6938, which I want to merge first.

EDIT: I also want to make a closer check of the details; the current comments are from a first look.

@WarrenWeckesser
Member Author

This will likely have conflicts with #6938, which I want to merge first.

No problem. And thanks for the prompt "first look"!

@charris
Member

charris commented Sep 19, 2016

I note that rk_hypergeometric is not implemented in the most efficient manner. It can't be completely fixed up in the obvious way while maintaining backwards compatibility, but the floor function and the cast to long can be removed.

@WarrenWeckesser
Member Author

In an inline comment, @charris asked about the rule used to choose between the two algorithms. With _multivariate_hypergeometric1 and _multivariate_hypergeometric2 made public, I used this simple script (and other variations) to investigate the performance:

from __future__ import print_function, division

import sys
import numpy as np
import timeit

num_colors = [3, 5, 10, 15, 20, 25, 50, 100, 200, 500, 1000]
totals = [25, 50, 75, 100, 150, 200, 500, 750, 1000, 1500, 2000, 3000, 5000, 6000, 8000, 10000]

f = sys.stdout
for total in totals:
    for ncolors in num_colors:
        # About 25 sample sizes, log-spaced from 3 up to the total number of items.
        nsamples = np.unique((np.logspace(np.log10(3), np.log10(total), 25) + 0.5).astype(int))
        # A random partition of `total` items among `ncolors` colors.
        colors = np.bincount(np.random.choice(ncolors, total), minlength=ncolors)
        for nsample in nsamples:
            number = int(20000 / ncolors)
            t1 = timeit.timeit('np.random.multivariate_hypergeometric1(colors, nsample, size=100)',
                               globals=globals(), number=number)
            t2 = timeit.timeit('np.random.multivariate_hypergeometric2(colors, nsample, size=100)',
                               globals=globals(), number=number)
            if t1 < t2:
                best = 1
            else:
                best = 2
            # Columns: ncolors, total, nsample, t1, t2, best
            print("%5d %5d %5d %10.6f %10.6f %d" %
                  (ncolors, total, nsample, t1, t2, best), file=f)

I found that algorithm 1 (iterated calls to rk_hypergeometric) was faster when nsample > 12*len(colors). It is not a perfect rule, but in the cases where I saw it pick the wrong algorithm, the difference in performance was on the order of 25 to 30%. (Note that in some extreme cases, the difference in performance can be several orders of magnitude. The simple rule gets these right.)

I also experimented with even larger values of sum(colors) in an IPython shell. When sum(colors) is huge (e.g. a million), algorithm 1 is faster. So the rule I ended up with is: use algorithm 1 when sum(colors) > 500000 or nsample > 12*len(colors).
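
As a Python sketch (the helper name is hypothetical; the actual dispatch happens inside the Cython code):

def _choose_algorithm(colors, nsample):
    # Algorithm 1 makes iterated calls to the univariate
    # hypergeometric sampler; algorithm 2 is the alternative.
    if sum(colors) > 500000 or nsample > 12 * len(colors):
        return 1
    return 2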

I used only one computer for this experimentation, so I don't know how the results would be affected by a different system. I can re-expose the individual algorithms if anyone wants to experiment further.

I guess we could expose both algorithms permanently, with, say, a method argument, but I'd rather not do that without a compelling use-case. If the rule for choosing the method works well enough, we shouldn't clutter the API with options that would rarely, if ever, be needed by a numpy user. (If a compelling use-case shows up later, it is easy to add the option in a future version of numpy.)

@WarrenWeckesser
Member Author

I pushed an update that addresses most of the comments so far. I left the two algorithms as (private) python functions. If anyone wants to experiment with the performance of the two algorithms, they are available as numpy.random.mtrand._rand._multivariate_hypergeometric#, where # is 1 or 2.
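
Access would look something like this (the argument values are just illustrative):

import numpy as np

rs = np.random.mtrand._rand  # the module-level RandomState instance
s1 = rs._multivariate_hypergeometric1([10, 10, 10], 5, size=4)
s2 = rs._multivariate_hypergeometric2([10, 10, 10], 5, size=4)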

@wrwrwr
Contributor

wrwrwr commented Oct 11, 2016

Just an additional data point for the tuning (from a moderately aged Callisto processor running Python 3.5 on Ubuntu 16.04 with ATLAS). I used your script to gather the data and this bit to process it.

For this data the heuristic is correct in 96% of cases, which isn't bad, though by changing 12 to 19 you can get close to 99%.
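
For reference, the accuracy can be computed from the script's output with something like this (hypothetical file name; t generalizes the hard-coded constant):

import numpy as np

# Columns written by the script: ncolors, total, nsample, t1, t2, best
data = np.loadtxt('timings.txt')
ncolors, nsample, best = data[:, 0], data[:, 2], data[:, 5]

def accuracy(t):
    predicted = np.where(nsample > t * ncolors, 1, 2)
    return (predicted == best).mean()

print(accuracy(12), accuracy(19))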

Contributor

@wrwrwr wrwrwr left a comment

Sorry for the nitpicks; feel free to ignore them. I haven't found anything more significant in the whole thing :-)

@WarrenWeckesser
Member Author

@wrwrwr All nits worth picking. :) I'll fix them when I update again. Thanks for testing the heuristic. I'll experiment some more with 12 vs. 19 (or some other value) before updating the code.

@wrwrwr
Contributor

wrwrwr commented Oct 12, 2016

Another bit. Couldn't we speed up cases with the number of samples close to the total number of items using symmetry (possibly just by adding a flag and some subtractions to algorithm 2)?

For example, multivariate_hypergeometric(3 * [int(1e16)], int(3e16) - 5) currently takes a while to compute, but could be sampled by choosing the colors of the few items not to be drawn. The example is rather extreme, but my feeling is that such an approach could let algorithm 2 compete with algorithm 1 in the cases where 1 currently wins.
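
As a sketch of the idea, where draw_mvhg stands for whichever underlying sampler is used:

def sample_with_symmetry(draw_mvhg, colors, nsample):
    total = sum(colors)
    if nsample > total // 2:
        # Choosing which items to leave behind is equivalent to
        # choosing which items to draw, and the complement is smaller.
        leftover = draw_mvhg(colors, total - nsample)
        return [c - l for c, l in zip(colors, leftover)]
    return draw_mvhg(colors, nsample)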

@homu
Contributor

homu commented Oct 28, 2016

☔ The latest upstream changes (presumably #8219) made this pull request unmergeable. Please resolve the merge conflicts.

@charris charris added this to the 1.13.0 release milestone Nov 4, 2016
@WarrenWeckesser
Member Author

That's fine. I ended up rewriting the hypergeometric distribution functions as private functions to be used in multivariate_hypergeometric. It is more work, and it will take more effort to review, but it means that the multivariate hypergeometric function with two colors (which is the univariate hypergeometric distribution) will be faster than the existing hypergeometric function. When numpy.random gets an API for versioning, we'll be able to use the new code in hypergeometric.

@WarrenWeckesser
Member Author

I have updated the code with a faster version of the "marginals" method.

This should not be merged yet. I am investigating the behavior of the method when it is given extremely large arguments.

@WarrenWeckesser WarrenWeckesser force-pushed the multivar-hypergeom branch 2 times, most recently from 87c0508 to 3c303b7 on November 12, 2018
@WarrenWeckesser
Member Author

WarrenWeckesser commented Nov 12, 2018

The test failure on "Shippable" appears to be a bug that is not directly related to this pull request (though it is possibly triggered because not all the tests set a random seed). The test that is failing is TestRandomDist.test_weibull_0, which does assert_equal(np.random.weibull(a=0), 0). The problem is that, even in numpy 1.15.4, we have:

In [36]: np.__version__
Out[36]: '1.15.4'

In [37]: np.random.seed(123)

In [38]: np.random.weibull(a=0, size=10)
Out[38]: array([inf,  0.,  0.,  0., inf,  0., inf, inf,  0.,  0.])

Those inf values should not be there. I created a new issue for this: #12371

@WarrenWeckesser
Member Author

I made a lot of changes since the last review, and the branch for 1.16 is imminent, so I won't be surprised if this gets pushed off to the next release. @charris, just letting you know that you don't have to feel guilty when you do it. 😃

This pull request includes a new C implementation of the (univariate) hypergeometric distribution that is used in the "marginals" method of the multivariate hypergeometric distribution. I went back to Stadlober's thesis and paper to better understand what was going on, and I added a lot more documentation in the function hypergeometric_hrua. This implementation fixes several of the issues that I've reported in the past, but it doesn't replace the existing implementation; we can't do that until NEP 19 takes effect. One new feature in the C code is the function double logfactorial(unsigned long k) that computes log(k!) for an integer k. This is much faster than using loggam; in fact, for 0 <= k <= 125, logfactorial(k) is just a table lookup.
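
For illustration, here is the table-lookup idea mirrored in Python (the real implementation is in C; math.lgamma stands in for the large-k computation):

import math

# log(k!) via a precomputed table for 0 <= k <= 125.
_LOGFACT = [math.lgamma(k + 1) for k in range(126)]

def logfactorial(k):
    if k < 126:
        return _LOGFACT[k]
    return math.lgamma(k + 1)  # stand-in for the asymptotic branch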

I have limited the "marginals" method to sum(colors) < 1e9. When sum(colors) is very large, there is loss of precision in the line T = g - gp; in the C function hypergeometric_hrua.

The loss of precision could be alleviated by reformulating the calculation of T to use a function such as

double log_ratio_factorials(long a, long b)
{
    /* Compute log(a!/b!); requires <math.h> for log(). */
    double result = 0.0;

    if (a < b) {
        /* log(a!/b!) = -(log(a+1) + ... + log(b)) */
        long i;
        for (i = b; i > a; --i) {
            result -= log(i);
        }
    }
    else {
        /* log(a!/b!) = log(b+1) + ... + log(a) */
        long i;
        for (i = a; i > b; --i) {
            result += log(i);
        }
    }
    return result;
}

that computes log(a!/b!) where a and b are integers, but that particular implementation is slow: its time complexity is O(|a - b|), and |a - b| can be big for the parameter ranges where using this function would make a significant difference. And I don't think it is worthwhile to spend more time on this without a solid use-case for sum(colors) >> 1e9. Even the upper bound 1e9 is probably far beyond what anybody will attempt, except perhaps by accident.

With the latest faster version of the "marginals" method, I was tempted to remove the "count" method. But when colors has a lot of small values, and nsample is not too big, the "count" method is still much faster than the "marginals" method. For example,

In [42]: colors = np.full(25, fill_value=5)

In [43]: %timeit np.random.multivariate_hypergeometric(colors, nsample=8, size=1000000, method='marginals')
1.56 s ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [44]: %timeit np.random.multivariate_hypergeometric(colors, nsample=8, size=1000000, method='count')
156 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@charris
Member

charris commented Nov 13, 2018

@WarrenWeckesser So are you happy with this in its current state?

@WarrenWeckesser
Member Author

@charris, I just noticed that the code in the function mvhg_marginals() was making the substitution nsample = total - nsample if nsample > total/2, but then the routines that it (ultimately) calls, hypergeometric_sample() and hypergeometric_hrua(), also make a similar check and transformation for the (univariate) hypergeometric distribution. I'm cleaning that up now.

@charris charris modified the milestones: 1.16.0 release, 1.17.0 release Nov 17, 2018
@charris
Member

charris commented Nov 17, 2018

Pushed off to 1.17. Let me know when you think this is ready.

@WarrenWeckesser
Member Author

No problem. I ran some tests for ranges of the inputs that I hadn't tried before, and I noticed some bias in the "big" tests that I've been running. Now I need to determine if that issue is inherent in the HRUA algorithm, or if it is a limitation of (or bug in) my test code.

@mattip
Member

mattip commented Jun 6, 2019

@bashtage is this something we should consider before the 1.17 release, or can it be added afterward?

@WarrenWeckesser
Member Author

Is there a target date for the release of 1.17?

@bashtage
Contributor

bashtage commented Jun 6, 2019

Since this PR is only about multivariate_hypergeometric, it could wait until later. However, I think @WarrenWeckesser is also planning on improving hypergeometric to work better in edge cases. Any changes to hypergeometric really should get in before 1.17, since these will change the stream (and so might require a warning cycle).

@WarrenWeckesser
Member Author

With randomgen now merged, it makes sense to break this pull request into two parts. The first part is the rewrite of the univariate hypergeometric code. That is now in #13761. The second part will be to add the new multivariate hypergeometric distribution, but that might have to wait until 1.18.

@charris
Member

charris commented Jun 14, 2019

@WarrenWeckesser Closing this then.

@charris charris closed this Jun 14, 2019
@charris charris removed this from the 1.17.0 release milestone Jun 14, 2019
@charris
Member

charris commented Jun 14, 2019

I assume there will be a new PR with the second part.

@WarrenWeckesser
Member Author

@charris wrote

I assume there will be a new PR with the second part.

Yes. If the branching of 1.17 is postponed because of the ongoing discussion of the new randomgen code, it might be possible to get multivariate_hypergeometric into 1.17, but I'm not counting on that.
