BUG: Fix numpy.random.dirichlet returns NaN for small 'alpha' parameters. #14924
Conversation
…r small alpha (and beta) values
I haven't reviewed the code line by line, but something is not correct. In the following large sample, the mean of the last component is not close to the theoretical expected value:
The sample mean of the last component is far from the expected value of 0.4. The difference is too big to be accounted for by sample variation. If you plot a histogram of …
We need a test that would have caught the bug that I described above. For example, something like
@pytest.mark.slow
def test_dirichlet_with_moderately_small_alphas():
    """For a large sample, compare the sample mean to the expected mean."""
    alpha = np.array([0.15, 0.15, 0.2])
    expected_mean = alpha / alpha.sum()
    rng = np.random.default_rng(11335577)  # Or no seed?
    x = rng.dirichlet(alpha, size=20000000)
    sample_mean = x.mean(axis=0)
    assert_allclose(sample_mean, expected_mean, rtol=1e-3)
(It could be added to the other dirichlet tests in TestRandomDist, in which case it needs a self argument, and the random seed could be self.seed.)
The rationale for deciding which method to choose should be documented. Here are some notes along those lines:
The old code generates the Dirichlet variates from gamma variates, where each alpha value is used as the shape parameter of a gamma distribution. The gamma variates are normalized to sum to 1. If all the alpha values are sufficiently small, there is a high probability that all the gamma variates will be 0. When that happens, the normalization ends up computing 0/0, giving nan.
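For illustration, here is a minimal Python sketch of that failure mode (illustrative only, not the library's Cython code; it just normalizes gamma variates the same way):

import numpy as np

rng = np.random.default_rng()
alpha = np.array([1e-4, 1e-4, 1e-4])

# Draw candidate Dirichlet samples the "old" way: normalize gamma variates.
g = rng.standard_gamma(alpha, size=(10, 3))
with np.errstate(invalid="ignore", divide="ignore"):
    d = g / g.sum(axis=1, keepdims=True)

# With such tiny shape parameters most gamma variates underflow to exactly 0.0,
# so many rows end up as 0/0 = nan.
print(np.isnan(d).any(axis=1))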
The condition for using the new branch, which is based on beta variates, is alpha_arr.max() < 1.0. Is this a good choice? Why?
It is certainly the correct choice for an example such as alpha = [1e-3, 1e-4, 1e-3]. But what about something like alpha = [1.25, 1e-3, 1e-7]? In the current version of the PR, the old code will be used.
In the old code, the random variates are normalized gamma variates, and the problem occurs when all parameters are so small that all the gamma variates are 0, breaking the normalization code. The PDF for the gamma distribution has the term x**(alpha - 1), so if alpha > 1, the probability of the random variate being very close to 0 is extremely small, and effectively we can be sure it will never be 0. So the sum of the gamma variates will also never be 0, and we won't ever compute 0/0.
We could lower the threshold from 1.0 to something moderately smaller and still be confident that the gamma variate associated with the maximum alpha value will (effectively) never be 0, but deciding by how much we can safely lower the threshold will require some more work. I'm not sure we need to do that.
The motivation for lowering the threshold is performance. The old code tends to be faster than the new method. For example,
In [106]: alpha = np.array([0.15, 0.15, 1.0]) # max is 1, so old code is used.
In [107]: %timeit rng.dirichlet(alpha, size=100000)
27.6 ms ± 747 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [108]: alpha = np.array([0.15, 0.15, np.nextafter(1.0, 0)]) # max < 1, so new code is used.
In [109]: %timeit rng.dirichlet(alpha, size=100000)
38.3 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In this case, the new code takes about 1.4 times as long. If folks are interested in avoiding the performance hit as much as possible, we'll have to work on choosing a lower threshold.
numpy/random/_generator.pyx (outdated)
val_data[i + j] = val_data[i + j] * invacc
i = i + k

# Small alpha case: Use stick-breaking approach with beta random variates
PEP 8: Keep line lengths 79 characters or less (here and elsewhere).
Fixed.
@WarrenWeckesser: Thanks a lot for the extensive review. So far I have been able to address the following:
- Adding the suggested test
- Fixing the formatting issues.
As for documenting the rationale in the code change: Thanks a lot for your suggestions. I haven't found time to do a formal analysis giving the probability of generating gamma variates which sum to zero for a given vector of alpha parameters. That could potentially help to determine a new threshold for the switch between the two code branches.
If I remember correctly from the past discussion, the point of adding the new code branch is to ensure that the old code branch is never called when there is a positive probability of returning NaN. I think this is why we ended up with 1.0 as the threshold in the first place, but I'd need to retrace the argument again to be 100% sure.
I'm fine with keeping the threshold at 1. @bashtage, @rkern, any opinions?
Can you expand the comments in the code with a sentence or two about why there are two methods? E.g. the normalized gamma variates method is faster than the "stick breaking" method, but it generates nans when all the gamma variates happen to be 0, and that can happen when all the alpha values are sufficiently small.
Will do that, as soon as possible.
I have added an inline docstring that provides the rationale for the switch to the stick-breaking approach.
As for the choice of the threshold: the probability that a gamma RV is smaller than the smallest positive floating-point number can be computed using the regularized incomplete gamma function:
import numpy as np
from scipy.special import gammainc

delta = np.finfo(np.float).tiny
params = [1.0e-5, 1.0e-4, 1.0e-3, 1.0e-2, 1.0e-1, 1.]
probs = [gammainc(p, delta) for p in params]

>>> probs
[0.9929467992914456,
 0.9316650395588774,
 0.4927171386203597,
 0.0008432274068068666,
 1.8046532550200846e-31,
 2.2250738585072626e-308]
With these numbers one could argue that if alpha.max() >= 0.1, the probability of generating all zeros is still so tiny that it can be neglected, and the threshold can be reduced to 0.1.
Not sure if that is small enough, though?
Nice. I was working through similar reasoning, using np.finfo(np.float).tiny as the threshold for "numerically zero", and came to roughly the same conclusion. With the revised threshold condition max(alpha) > 0.1, a conservative upper bound on the probability of generating all "tiny" gamma variates is 1.8e-31, so I think a threshold of 0.1 should be safe. For comparison, here are some other values of max(alpha) and the associated upper bounds for the probability that all the gamma variates are "tiny":
max(alpha) p
---------- ---------
0.05 4.3e-16
0.1 1.8e-31
0.2 3.2e-62
0.25 1.3e-77
0.5 1.7e-154
1.0 2.2e-308
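For reference, a short snippet that should reproduce these bounds (a sketch: it simply evaluates P(Gamma(a) <= tiny) for a = max(alpha), which bounds the probability that every variate in the vector is "tiny"):

import numpy as np
from scipy.special import gammainc

tiny = np.finfo(np.float64).tiny  # the "numerically zero" threshold
for a in [0.05, 0.1, 0.2, 0.25, 0.5, 1.0]:
    # P(Gamma(shape=a) <= tiny): an upper bound on the probability that all
    # gamma variates underflow when a is the largest alpha in the vector.
    print(f"{a:<10} {gammainc(a, tiny):.1e}")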
@bashtage, what do you think? Does the threshold 0.1 sound reasonable?
Thanks a lot for the great comments and suggestions!
I have changed the threshold from 1.0 to 0.1 in order to:
- Not unnecessarily 'break the stream' compared to the legacy method of generation
- Use the higher-performance method whenever the probability of failure is acceptable
From the previous discussion I understand that a probability of failure < 1.8e-31 is acceptable.
I have also extended the (now quite extensive) inline docstring for the stick-breaking approach in the dirichlet method. Could you double check whether this is (a) the right place to document this and (b) whether the 'verbosity level' is OK?
In order to move this forward I'd need some guidance:
- on decoding the output of the failing CI; at the moment I have no clue what I need to change in order for the build to pass
- on how the individual commits in this PR need to be edited for integration into master
Setting the threshold to 0.1 (so the estimate of the upper bound of the probability that the gamma variates method produces nan is 1.8e-31) seems fine, but to be honest, I don't have a rigorous or even heuristic argument for what we should consider acceptable. How about this... On my computer, dirichlet with alpha=[0.05, 0.5, 1.5] generates approximately 7 million samples per second. After running for a million years (single-threaded), I would generate only about 2.2e20 samples, so the probability that I would get all 0s in a sample is still negligible.
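A quick back-of-the-envelope check of those figures (approximate values only):

samples_per_second = 7e6                              # rough rate quoted above
seconds_per_million_years = 1e6 * 365.25 * 24 * 3600  # ~3.16e13
total_samples = samples_per_second * seconds_per_million_years
print(f"{total_samples:.1e}")                         # ~2.2e20 samples
print(f"{total_samples * 1.8e-31:.1e}")               # expected all-zero draws: ~4.0e-11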
Don't forget to also update the unit test.
I have updated the unit tests to respect the updated threshold.
…cters line length
…e mean is close to the exact mean.
…ck-breaking method over the standard approach
0.1 or 0.2, since with this threshold there should be 0 chance of the problem given the resolution of a 53-bit float. Happy either way.
numpy/random/_generator.pyx (outdated)
    alpha_csum_data[j + 1])
val_data[i + j] = acc * v
acc *= (1. - v)
val_data[i + j + 1] = acc
This warning in the travis CI log indicates a real issue:
numpy/random/_generator.c:19904:12: warning: ‘__pyx_v_j’ may be used uninitialized in this function [-Wmaybe-uninitialized]
It corresponds to the Cython line
val_data[i + j + 1] = acc
that occurs after the j for-loop. The variable j is presumed to have been set to the final value from the for-loop. But if k is 1, then range(k - 1) is empty, the for-loop does not execute, and j is not set. The simple fix is to replace that line with
val_data[i + k - 1] = acc
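A plain-Python illustration of the same pitfall and of the suggested fix (hypothetical names, just to show why the loop variable can be left unset):

# When k == 1, range(k - 1) is empty, so the loop body never runs and j is
# never assigned; the C translation of `val_data[i + j + 1] = acc` then reads
# an uninitialized variable. Indexing the last slot directly avoids this.
def last_component(k, acc=1.0):
    val = [0.0] * k
    for j in range(k - 1):
        val[j] = 0.0          # stand-in for the stick-breaking assignments
    val[k - 1] = acc          # the suggested fix: no dependence on j
    return val

print(last_component(1))      # [1.0] -- works even though the loop never ran
print(last_component(3))      # [0.0, 0.0, 1.0]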
Thanks for the explanation. This is fixed now.
numpy/random/_generator.pyx (outdated)
val_data[i + j + 1] = acc
i = i + k

# Standard case: Unit normalisation of a vector of gamma random
I'm not sure about this style of commenting the else clause. It seems more natural to put the comment after the else, e.g.:

else:
    # Standard case: Unit normalisation of a vector of gamma random
    # variates

I realize that you also don't do this in the big block of comments before the if statement, but there you are explaining the reason behind this if-else statement, so it seems OK.
I have updated the large comment block and I have moved the small comment blocks into the if-else cases as suggested.
While looking into the compiler warning, I noticed a backwards compatibility break in how alpha=[] is handled. Before this PR:
In [68]: rng.dirichlet([], size=3)
Out[68]: array([], shape=(3, 0), dtype=float64)
With this PR:
In [14]: rng.dirichlet([], size=3)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-661d8cc9cdb5> in <module>
----> 1 rng.dirichlet([], size=3)
_generator.pyx in numpy.random._generator.Generator.dirichlet()
~/mc37numpy/lib/python3.7/site-packages/numpy-1.18.0.dev0+ed19dea-py3.7-macosx-10.7-x86_64.egg/numpy/core/_methods.py in _amax(a, axis, out, keepdims, initial, where)
28 def _amax(a, axis=None, out=None, keepdims=False,
29 initial=_NoValue, where=True):
---> 30 return umr_maximum(a, axis, None, out, keepdims, initial, where)
31
32 def _amin(a, axis=None, out=None, keepdims=False,
ValueError: zero-size array to reduction operation maximum which has no identity
The code needs to check for alpha having size 0 before attempting to call the max() method.
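A sketch of the kind of guard being suggested (illustrative Python, not the actual Cython change; the helper name and the placement of the 0.1 threshold are assumptions):

import numpy as np

def uses_small_alpha_branch(alpha, threshold=0.1):
    """Return True if the stick-breaking branch should be used."""
    alpha_arr = np.asarray(alpha, dtype=np.float64)
    # Guard the size-0 case so that alpha=[] keeps its old behaviour
    # instead of failing inside max() on an empty array.
    if alpha_arr.size == 0:
        return False
    return alpha_arr.max() < threshold

print(uses_small_alpha_branch([]))            # False -> legacy gamma path
print(uses_small_alpha_branch([0.05, 0.02]))  # True  -> stick-breaking path
print(uses_small_alpha_branch([1.25, 1e-7]))  # False -> legacy gamma path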
One more thing: a release note about the change in the stream of variates produced by dirichlet is needed.
@trendelkampschroer, are you interested in continuing to work on this? I think there are just a few small things to fix up.
Thanks for bearing with me all the time. Yes, I will try to address the remaining points in the coming days.
…rd to the stick breaking approach when generating Dirichlet RVs
…threshold of 0.1.
The updated code checks for alpha having size 0. Do I need to reformat/squash commits? Do I need to rebase on or merge with an upstream branch?
Thanks, @trendelkampschroer. I probably won't review the latest updates until tomorrow.
Ah, my link was a github link. From the top level of the numpy source, the location for the release notes is
No need to rebase. We'll squash the commits when we merge the PR. That means we'll edit the commit message to reflect just the final change that is merged. If you want complete control over the commit message, you can rebase and squash the commits yourself, and then force push to your feature branch, but that is not necessary.
…or.pyx Co-Authored-By: Eric Wieser <wieser.eric@gmail.com>
I've committed the small bug fix I noted, and the simplification that Eric suggested. If the tests all pass, I think this is ready to be (squashed and) merged.
Updated with a release note.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @trendelkampschroer and Warren for the fixup, looks good to me. I think we could move the cumsum to later. @WarrenWeckesser just feel free to merge if you are happy.
Co-Authored-By: Sebastian Berg <sebastian@sipsolutions.net>
Co-Authored-By: Sebastian Berg <sebastian@sipsolutions.net>
Azure Linux PyPy3 test failure appears to be unrelated to this PR. It looks like it failed the same way in #16083.
We should pin the PyPy version we test against. We are using a nightly build rather than an official release. While this is nice for PyPy, it occasionally, like in last night's build, messes up.
Thanks @trendelkampschroer and the other contributors |
BUG: Fix numpy.random.dirichlet returns NaN for small 'alpha' parameters.
The PR introduces a second code branch into the 'dirichlet' method of 'Generator'. This branch will be executed whenever the maximum of all 'alpha' parameters for the dirichlet distribution is smaller than one. In this case the random vector will be generated by the stick breaking approach, cf. https://en.wikipedia.org/wiki/Dirichlet_process.
When the 'alpha' smallness condition is not triggered, i.e. 'alpha.max() >= 1.', the existing code path is executed.
Closes #5851.
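For reference, a minimal Python sketch of the stick-breaking construction described above (illustrative only; the actual implementation is Cython inside Generator.dirichlet, and the function name here is made up):

import numpy as np

def dirichlet_stick_breaking(rng, alpha):
    """Draw one Dirichlet(alpha) sample via stick breaking with beta variates."""
    alpha = np.asarray(alpha, dtype=np.float64)
    k = alpha.size
    tail = np.cumsum(alpha[::-1])[::-1]   # tail[j] = alpha[j] + ... + alpha[k-1]
    x = np.empty(k)
    acc = 1.0                             # remaining length of the "stick"
    for j in range(k - 1):
        v = rng.beta(alpha[j], tail[j + 1])
        x[j] = acc * v
        acc *= 1.0 - v
    x[k - 1] = acc                        # last component takes what is left
    return x

rng = np.random.default_rng(11335577)
print(dirichlet_stick_breaking(rng, [1e-3, 1e-4, 1e-3]))  # sums to 1, no nan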