
BUG: Fix numpy.random.dirichlet returns NaN for small 'alpha' parameters. #14924


Merged

Conversation

@trendelkampschroer (Contributor) commented Nov 17, 2019:

BUG: Fix numpy.random.dirichlet returns NaN for small 'alpha' parameters.

The PR introduces a second code branch into the 'dirichlet' method of 'Generator'. This branch is executed whenever the maximum of all 'alpha' parameters for the dirichlet distribution is smaller than one. In this case the random vector is generated by the stick-breaking approach, cf. https://en.wikipedia.org/wiki/Dirichlet_process.

When the 'alpha' smallness condition is not triggered, i.e. 'alpha.max() >= 1.', the existing code path is executed.

Closes #5851.
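
For context, here is a minimal NumPy sketch of the stick-breaking construction being described (illustrative only; the function name and structure are not the actual Cython implementation in this PR):

    import numpy as np

    def dirichlet_stick_breaking(alpha, rng):
        # Draw one Dirichlet(alpha) vector by successively "breaking" the unit
        # stick: each component is a beta draw against the sum of the remaining
        # alphas, scaled by the length of stick still left over.
        alpha = np.asarray(alpha, dtype=np.float64)
        k = alpha.size
        x = np.empty(k)
        remaining = 1.0                      # stick length not yet assigned
        tail = alpha[::-1].cumsum()[::-1]    # tail[j] = alpha[j] + ... + alpha[k-1]
        for j in range(k - 1):
            v = rng.beta(alpha[j], tail[j + 1])
            x[j] = remaining * v
            remaining *= 1.0 - v
        x[k - 1] = remaining
        return x

    rng = np.random.default_rng(0)
    print(dirichlet_stick_breaking([1e-4, 1e-4, 1e-4], rng))  # sums to 1, no NaN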

@charris changed the title from "Feature/dirichlet stick breaking" to "BUG: Fix numpy.random.dirichlet returns NaN for small 'alpha' parameters." Nov 18, 2019
@WarrenWeckesser (Member):

I haven't reviewed the code line by line, but something is not correct. In the following large sample, the mean of the last component is not close to the theoretical expected value:

In [49]: alpha = np.array([0.15, 0.15, 0.2])

In [50]: x = rng.dirichlet(alpha, size=1000000)

In [51]: alpha/alpha.sum()   # Expected values
Out[51]: array([0.3, 0.3, 0.4])

In [52]: x.mean(axis=0)   # Sample mean
Out[52]: array([0.30014439, 0.30044703, 0.42944236])

The sample mean of the last component is far from the expected value of 0.4. The difference is too big to be accounted for by sample variation. If you plot a histogram of x[:,2], you'll see that it is quite different from the expected beta distribution.
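
For reference, a sketch of the kind of histogram check being suggested (assuming scipy and matplotlib are available; the marginal of the last component of a Dirichlet(alpha) vector is Beta(alpha[2], alpha[0] + alpha[1])):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import beta

    rng = np.random.default_rng()
    alpha = np.array([0.15, 0.15, 0.2])
    x = rng.dirichlet(alpha, size=1000000)

    # The marginal of component i of a Dirichlet(alpha) vector is
    # Beta(alpha[i], alpha.sum() - alpha[i]).
    grid = np.linspace(1e-6, 1 - 1e-6, 500)
    plt.hist(x[:, 2], bins=100, density=True, alpha=0.5, label="sample")
    plt.plot(grid, beta.pdf(grid, alpha[2], alpha[:2].sum()), label="expected Beta pdf")
    plt.legend()
    plt.show()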

@WarrenWeckesser (Member) left a comment:

We need a test that would have caught the bug that I described above. For example, something like

    import numpy as np
    import pytest
    from numpy.testing import assert_allclose

    @pytest.mark.slow
    def test_dirichlet_with_moderately_small_alphas():
        """For a large sample, compare the sample mean to the expected mean."""
        alpha = np.array([0.15, 0.15, 0.2])
        expected_mean = alpha / alpha.sum()
        rng = np.random.default_rng(11335577)  # Or no seed?
        x = rng.dirichlet(alpha, size=20000000)
        sample_mean = x.mean(axis=0)
        assert_allclose(sample_mean, expected_mean, rtol=1e-3)

(It could be added to the other dirichlet tests in TestRandomDist, in which case it needs a self argument, and the random seed could be self.seed.)


The rationale for deciding which method to choose should be documented. Here are some notes along those lines:

The old code generates the Dirichlet variates from gamma variates: each alpha value is used as the shape parameter of a gamma distribution, and the resulting gamma variates are normalized to sum to 1. If all the alpha values are sufficiently small, there is a high probability that all the gamma variates will be 0. When that happens, the normalization ends up computing 0/0, giving nan.
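
A minimal sketch of that failure mode in pure NumPy (not the actual Cython code path; with such tiny shape parameters the gamma draws can all underflow to exactly zero):

    import numpy as np

    rng = np.random.default_rng(12345)
    alpha = np.array([1e-4, 1e-4, 1e-4])

    # Normalized-gamma construction: each alpha is the shape of a gamma draw.
    g = rng.standard_gamma(alpha, size=(5, alpha.size))
    x = g / g.sum(axis=1, keepdims=True)   # an all-zero row yields 0/0 -> nan
    print(x)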

The condition for using the new branch, which is based on beta variates, is alpha_arr.max() < 1.0. Is this a good choice? Why?

It is certainly the correct choice for an example such as alpha = [1e-3, 1e-4, 1e-3]. But what about something like alpha = [1.25, 1e-3, 1e-7]? In the current version of the PR, the old code will be used.

In the old code, the random variates are normalized gamma variates, and the problem occurs when all parameters are so small that all the gamma variates are 0, breaking the normalization code. The PDF for the gamma distribution has the term x**(alpha - 1), so if alpha > 1, the probability of the random variate being very close to 0 is extremely small, and effectively we can be sure it will never be 0. So the sum of the gamma variates will also never be 0, and we won't ever compute 0/0.

We could lower the threshold from 1.0 to something moderately smaller, and still be confident that the gamma variate associated with the maximum alpha value will (effectively) never be 0, but deciding by how much we can safely lower the threshold will require some more work. I'm not sure we need to do that.

The motivation for lowering the threshold is performance. The old code tends to be faster than the new method. For example,

In [106]: alpha = np.array([0.15, 0.15, 1.0])   # max is 1, so old code is used.

In [107]: %timeit rng.dirichlet(alpha, size=100000)
27.6 ms ± 747 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [108]: alpha = np.array([0.15, 0.15, np.nextafter(1.0, 0)])  # max < 1, so new code is used.

In [109]: %timeit rng.dirichlet(alpha, size=100000)
38.3 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In this case, the new code takes about 1.4 times as long. If folks are interested in avoiding the performance hit as much as possible, we'll have to work on choosing a lower threshold.

val_data[i + j] = val_data[i + j] * invacc
i = i + k

# Small alpha case: Use stick-breaking approach with beta random variates
Member:

PEP 8: Keep line lengths 79 characters or less (here and elsewhere).

Contributor Author:

Fixed.

Contributor Author:

@WarrenWeckesser : Thanks a lot for the extensive review. So far I have been able to address the following:

  • Adding the suggested test
  • Fixing the formatting issues.

As for documenting the rationale in the code change: Thanks a lot for your suggestions. I haven't found time to do a formal analysis giving the probability of generating gamma variates which sum to zero for a given vector of alpha parameters. That could potentially help to determine a new threshold for the switch between the two code branches.

If I remember correctly from the past discussion, the point of adding the new code branch is to ensure that the old code branch is never called when there is a positive probability of returning NaN. I think this is why we ended up with one as the threshold in the first place, but I'd need to retrace the argument to be 100% sure.

Member:

I'm fine with keeping the threshold at 1. @bashtage, @rkern, any opinions?

Can you expand the comments in the code with a sentence or two about why there are two methods? E.g. the normalized gamma variates method is faster than the "stick breaking" method, but it generates nans when all the gamma variates happen to be 0, and that can happen when all the alpha values are sufficiently small.

Contributor Author:

Will do that as soon as possible.

@trendelkampschroer (Contributor Author) commented Nov 26, 2019:

I have added an inline docstring that provides the rationale for the switch to the stick-breaking approach.

As for the choice of the threshold, the probability for a gamma RV to be smaller than the smallest positive normal floating-point number can be computed using the regularized incomplete gamma function:

import numpy as np
from scipy.special import gammainc

delta = np.finfo(np.float64).tiny   # smallest positive normal double
params = [1.0e-5, 1.0e-4, 1.0e-3, 1.0e-2, 1.0e-1, 1.0]
probs = [gammainc(p, delta) for p in params]
print(probs)
# [0.9929467992914456,
#  0.9316650395588774,
#  0.4927171386203597,
#  0.0008432274068068666,
#  1.8046532550200846e-31,
#  2.2250738585072626e-308]

With these numbers one could argue that if alpha.max() >= 0.1, the probability of generating all zeros is so tiny that it can be neglected, and the threshold can be reduced to 0.1.

Not sure if that is small enough though?

Member:

Nice. I was working through similar reasoning, using np.finfo(np.float64).tiny as the threshold for "numerically zero", and came to roughly the same conclusion. With the revised threshold condition max(alpha) > 0.1, a conservative upper bound on the probability of generating all "tiny" gamma variates is 1.8e-31, so I think a threshold of 0.1 should be safe. For comparison, here are some other values for max(alpha) and the associated upper bounds for the probability that all the gamma variates are "tiny":

   max(alpha)      p
   ----------  ---------
   0.05        4.3e-16
   0.1         1.8e-31
   0.2         3.2e-62
   0.25        1.3e-77
   0.5         1.7e-154
   1.0         2.2e-308
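
For reference, a small sketch that reproduces this table with the same gammainc-based bound used above:

    import numpy as np
    from scipy.special import gammainc

    tiny = np.finfo(np.float64).tiny
    for a in [0.05, 0.1, 0.2, 0.25, 0.5, 1.0]:
        # gammainc(a, tiny) bounds the probability that a Gamma(a) variate
        # is numerically zero (below the smallest positive normal double).
        print(f"{a:<10}  {gammainc(a, tiny):.1e}")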

@bashtage, what do you think? Does the threshold 0.1 sound reasonable?

@trendelkampschroer (Contributor Author) commented Nov 30, 2019:

Thanks a lot for the great comments and suggestions!

I have changed the threshold from 1. to 0.1 to:

  • Not unnecessarily 'break the stream' compared to the legacy method of generation
  • Use the faster method whenever the probability of failure is acceptable
    (from the previous discussion I understand that a probability of failure < 1.8e-31 is acceptable).

I have also extended the (now quite extensive) inline docstring for the stick-breaking approach in the dirichlet method. Could you double-check whether a) this is the right place to document this and b) the 'verbosity level' is OK?

In order to move this forward I'd need some guidance:

  • in decoding the output of the failing CI; at the moment I have no clue what I need to change for the build to pass
  • about how the individual commits in this PR need to be edited for integration into master

Member:

Setting the threshold to 0.1 (so the estimate of the upper bound of the probability that the gamma variates method produces nan is 1.8e-31) seems fine, but to be honest, I don't have a rigorous or even heuristic argument for what we should consider acceptable. How about this... On my computer, dirichlet with alpha=[0.05, 0.5, 1.5] generates approximately 7 million samples per second. After running for a million years (single-threaded), I would generate only about 2.2e20 samples, so the probability that I would get all 0s in a sample is still negligible.
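
A quick back-of-the-envelope check of that estimate, using the rate and run time quoted above:

    samples_per_second = 7e6                    # measured rate quoted above
    seconds_per_million_years = 1e6 * 365.25 * 24 * 3600
    total_samples = samples_per_second * seconds_per_million_years
    print(f"{total_samples:.1e}")               # ~2.2e20 samples
    print(f"{total_samples * 1.8e-31:.1e}")     # ~4e-11: union bound on ever hitting an all-zero draw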

Don't forget to also update the unit test.

Contributor Author:

I have updated the unit tests to respect the updated threshold.

@bashtage (Contributor) commented Nov 27, 2019 via email

alpha_csum_data[j + 1])
val_data[i + j] = acc * v
acc *= (1. - v)
val_data[i + j + 1] = acc
@WarrenWeckesser (Member) commented Nov 30, 2019:

This warning in the travis CI log indicates a real issue:

  numpy/random/_generator.c:19904:12: warning: ‘__pyx_v_j’ may be used uninitialized in this function [-Wmaybe-uninitialized]

It corresponds to the Cython line

                val_data[i + j + 1] = acc

that occurs after the j for-loop. The variable j is presumed to have been set to the final value from the for-loop. But if k is 1, then range(k - 1) is empty, the for-loop does not execute, and j is not set. The simple fix is to replace that line with

                val_data[i + k - 1] = acc
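
A pure-Python illustration of the corner case (in the Cython code the unset C variable j is read as garbage rather than raising, which is what the warning is about):

    # Pure-Python stand-in mirroring the variable names in the snippet above.
    k = 1
    acc = 0.625
    val = [0.0] * k
    for j in range(k - 1):   # range(0) is empty: the body never runs,
        val[j] = 0.0         # so j is never assigned a value
    val[k - 1] = acc         # indexing with k - 1 instead of j + 1 is always valid
    print(val)               # [0.625]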

Contributor Author:

Thanks for the explanation. This is fixed now.

val_data[i + j + 1] = acc
i = i + k

# Standard case: Unit normalisation of a vector of gamma random
Member:

I'm not sure about this style of commenting the else clause. It seems more natural to put the comment after the else, e.g.:

        else:
            # Standard case: Unit normalisation of a vector of gamma random
            # variates

I realize that you also don't do this in the big block of comments before the if statement, but there you are explaining the reason behind this if-else statement, so it seems OK.

Contributor Author:

I have updated the large comment block and I have moved the small comment blocks into the if-else cases as suggested.


@WarrenWeckesser (Member) left a comment:

While looking into the compiler warning, I noticed a backwards compatibility break in how alpha=[] is handled. Before this PR:

In [68]: rng.dirichlet([], size=3)                                                               
Out[68]: array([], shape=(3, 0), dtype=float64)

With this PR:

In [14]: rng.dirichlet([], size=3)                                              
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-661d8cc9cdb5> in <module>
----> 1 rng.dirichlet([], size=3)

_generator.pyx in numpy.random._generator.Generator.dirichlet()

~/mc37numpy/lib/python3.7/site-packages/numpy-1.18.0.dev0+ed19dea-py3.7-macosx-10.7-x86_64.egg/numpy/core/_methods.py in _amax(a, axis, out, keepdims, initial, where)
     28 def _amax(a, axis=None, out=None, keepdims=False,
     29           initial=_NoValue, where=True):
---> 30     return umr_maximum(a, axis, None, out, keepdims, initial, where)
     31 
     32 def _amin(a, axis=None, out=None, keepdims=False,

ValueError: zero-size array to reduction operation maximum which has no identity

The code needs to check for alpha having size 0 before attempting to call the max() method.
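
A sketch of the kind of guard being asked for (a hypothetical Python stand-in; the real check belongs in the Cython dirichlet method):

    import numpy as np

    def _choose_dirichlet_method(alpha):
        # Hypothetical helper: only call .max() when alpha is non-empty, so
        # that dirichlet([], size=n) keeps returning an empty array instead
        # of raising ValueError.
        alpha = np.asarray(alpha, dtype=np.float64)
        if alpha.size > 0 and alpha.max() < 0.1:
            return "stick-breaking"
        return "normalized-gamma"

    print(_choose_dirichlet_method([]))            # normalized-gamma
    print(_choose_dirichlet_method([0.01, 0.02]))  # stick-breaking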

@WarrenWeckesser (Member):

One more thing: a release note about the change in the stream of variates produced by dirichlet is needed. Add an entry in https://github.com/numpy/numpy/tree/master/doc/release/upcoming_changes, e.g. 14924.change.rst, that explains the change.

@WarrenWeckesser added this to the 1.19.0 release milestone Dec 1, 2019
@WarrenWeckesser (Member):

@trendelkampschroer, are you interested in continuing to work on this? I think there are just a few small things to fix up.

@trendelkampschroer (Contributor Author) commented Mar 12, 2020:

Thanks for bearing with me all the time. Yes, I will try to address the remaining points in the coming days.

@trendelkampschroer (Contributor Author) commented Mar 12, 2020:

The updated code checks for k = len(alpha) > 0 before calling the max method. Would you be so kind as to point me to the appropriate release note for the changes in this PR? I guess the one you linked above is outdated.

Do I need to reformat/squash commits? Do I need to rebase on or merge with an upstream branch?

@WarrenWeckesser (Member):

Thanks, @trendelkampschroer. I probably won't review the latest updates until tomorrow.

Would you be so kind as to point me to the appropriate release note for the changes in this PR?

Ah, my link was a GitHub link. From the top level of the numpy source, the location for the release notes is doc/release/upcoming_changes. The file README.rst in that directory explains the format to use for the release note. Check out some of the existing notes for examples.

Do I need to reformat/squash commits? Do I need to rebase on or merge with an upstream branch?

No need to rebase. We'll squash the commits when we merge the PR. That means we'll edit the commit message to reflect just the final change that is merged. If you want complete control over the commit message, you can rebase and squash the commits yourself, and then force push to your feature branch, but that is not necessary.

@WarrenWeckesser self-assigned this Mar 13, 2020
@WarrenWeckesser (Member):

I've committed the small bug fix I noted, and the simplification that Eric suggested. If the tests all pass, I think this is ready to be (squashed and) merged.

@WarrenWeckesser (Member):

Updated with a release note.

@seberg (Member) left a comment:

Thanks @trendelkampschroer and Warren for the fixup; looks good to me. I think we could move the cumsum to later. @WarrenWeckesser, just feel free to merge if you are happy.

WarrenWeckesser and others added 2 commits April 22, 2020 21:14
Co-Authored-By: Sebastian Berg <sebastian@sipsolutions.net>
Co-Authored-By: Sebastian Berg <sebastian@sipsolutions.net>
@WarrenWeckesser (Member):

The Azure Linux PyPy3 test failure appears to be unrelated to this PR. It looks like it failed the same way in #16083.

@mattip (Member) commented Apr 28, 2020:

We should pin the PyPy version we test against. We are using a nightly build rather than an official release. While this is nice for PyPy, it occasionally messes up, as it did in last night's build.

@mattip merged commit 44e26f2 into numpy:master Apr 29, 2020
@mattip (Member) commented Apr 29, 2020:

Thanks @trendelkampschroer and the other contributors.
