Bug in np.random.dirichlet for small alpha parameters #5851
Verified for current master.
There is a vintage issue (#2089) also related to Dirichlet parameters. It seems that all that code could use a rethink...
This looks fixable with logarithms.
Something like this, except in C:

```python
from __future__ import print_function
import numpy as np

def _jhonk_naive(a, b):
    # Johnk's beta generator: draw x = u**(1/a), y = v**(1/b) and accept
    # whenever x + y <= 1; the returned value x / (x + y) is Beta(a, b).
    while True:
        u, v = np.random.rand(2)
        x, y = np.power((u, v), (1/a, 1/b))
        if x + y <= 1:
            return x / (x + y)

def _jhonk(a, b):
    # Same algorithm, but with x, y and x + y kept in log space so the ratio
    # survives even when both x and y underflow to 0.0.
    while True:
        u, v = np.random.rand(2)
        logx = np.log(u) / a
        logy = np.log(v) / b
        m = np.maximum(logx, logy)
        logxm = logx - m
        logym = logy - m
        logxpym = np.log(np.exp(logxm) + np.exp(logym))  # log(x + y) - m
        lse_denom = m + logxpym                          # log(x + y)
        if lse_denom <= 0:                               # i.e. x + y <= 1
            return np.exp(logxm - logxpym)               # x / (x + y)

def jhonk(a, b, n, naive=False):
    rvs = _jhonk_naive if naive else _jhonk
    return np.array([rvs(a, b) for _ in range(n)])

print(np.random.beta(0.001, 0.001, 10), '\n')
print(jhonk(0.001, 0.001, 10, naive=True), '\n')
print(jhonk(0.001, 0.001, 10, naive=False), '\n')
print(np.histogram(jhonk(0.001, 0.001, 10000, naive=False))[0], '\n')
print(jhonk(1, 1, 10, naive=False), '\n')
print(np.histogram(jhonk(1, 1, 10000, naive=False))[0], '\n')
```
Edit: This would be for the beta distribution.
As far as I can tell this is a very nice solution to the problem. The naive branch of the code will be much faster, though. It might be good to define values a_min, b_min so that the log-space code is only executed whenever a < a_min and b < b_min. I think the probability for x to be smaller than some tolerance tol is tol^a, so the probability for x and y to be jointly smaller than tol is tol^(a+b). For tol = 1e-323, a = 0.1 and b = 1e-15 the probability to find both x, y < tol is still smaller than 10^{-32}. A test with a = 0.1, b = 0.0001 for N = 1e9 samples with the naive approach did not result in any nans:

```
In [9]: x = np.random.beta(0.1, 0.0001, size=(1e4, 1e5))

In [10]: np.any(np.isnan(x))
Out[10]: False
```

A similar test with a = 0.01, b = 0.001 results in nan values already for N = 1e4. Based on this observation and the above probability argument it might be safe to switch to the log-space algorithm whenever max(a, b) < 0.1. There is still a tiny chance that the code will produce nan, but I believe it is small enough that it is not worth sacrificing the efficiency of the naive approach by using the log-space version for all a, b < 1.
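To put a number on that switching rule, here is a tiny sketch (my own, not from the thread) that evaluates the joint-underflow probability tol**(a+b) for a few (a, b) pairs; the helper name and the choice tol = 1e-308 (roughly the smallest normal double) are illustrative assumptions.

```python
def joint_underflow_prob(a, b, tol=1e-308):
    """Probability that both x = u**(1/a) and y = v**(1/b) fall below tol.

    For u uniform on (0, 1), P(x <= tol) = P(u <= tol**a) = tol**a, so for
    independent u and v the joint probability is tol**(a + b).
    """
    return tol ** (a + b)

# Chance of hitting the degenerate x == y == 0 case in the naive branch:
for a, b in [(0.1, 1e-15), (0.01, 0.001), (0.001, 0.001)]:
    print(a, b, joint_underflow_prob(a, b))
# roughly 1.6e-31, 4.1e-04 and 2.4e-01, in line with the empirical observations above
```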
Does your proposal change what values you get from an RNG using a given seed?
That's some nice math indeed! I think it would be better to have the certainty, rather than just a high probabilistic expectation, that NaNs are not going to be produced. Does the following have any loophole in it?
The only uncertain cases come if either
Answering Nathaniel's question, I don't think the RNG state would be changed.
@jaimefrio Your implementation looks good to me. It has the advantage of more clearly preserving the number of iterations (important for Nathaniel's concern for RNG state backwards compatibility), and of not introducing additional transcendental function overhead when there are no |
Do you want to send a PR and become a numpy contributor, @trendelkampschroer? I'll be happy to steal @argriffing's thunder if you don't...
I think you should go ahead with it, since you proposed an elegant solution. I cannot take credit for anything more than pointing out the problem. I am quite happy to know that there is such a nice fix. I am not sure if the same strategy could be applied to the Dirichlet case, though. The current approach of generating via the Gamma distribution will definitely suffer from the same problem (zero denominator in the normalization).
I'm not 100% sure, but isn't there a way to recursively partition the set of dirichlet dimensions so that you can use beta to sample from it? Or something like http://en.wikipedia.org/wiki/Dirichlet_distribution#Marginal_beta_distributions

This change to the dirichlet distribution sampler would break RNG backwards compatibility. As far as I am aware, for each probability distribution in numpy the sampling scheme is not allowed to substantially change after the first implementation. For example, numpy is locked to Box-Muller sampling for the normal distribution and this cannot easily be changed to ziggurat sampling (#2047) even though ziggurat sampling is better and a scipy patch is available. There seems to be an unsettled debate between adding new sampling functions for each new scheme (e.g. |
Ah, but you could take credit for the work :0
#5858 fixes the beta problem (thanks @argriffing!), but the issues for Dirichlet remain, so reopening.
@argriffing: I think your proposal of generating Dirichlet RNGs using Beta distributed RNGs (stick-breaking approach) should get rid of the problem, given that the Beta-sampler is stable for a, b < 1. In the case that \max{\alpha_i} < 1 the Dirichlet sampler could switch from generation via Gamma-RVs to generation via Beta-RVs. The generation via Beta-RVs does not require normalization, so a float division will not be encountered even if all generated Beta-RVs are zero. This simply entails that the last element of the resulting Dirichlet vector becomes one and all other entries will be equal to zero.

Since #5858 fixes this issue for the Beta-sampler (thanks @jaimefrio), the issue could be fixed for the Dirichlet sampler. I am not sure how to approach this given the RNG backward compatibility policy, though. One could add a switch in mtrand.dirichlet, but this would surely break backwards compatibility. Or implement np.random.dirichlet_with_recursive_beta(...) just to handle the case \alpha_i < 1 for all i. Future versions could issue a warning whenever the algorithm using Gamma RVs is called with a parameter vector (\alpha_1, ..., \alpha_n) with \alpha_i < 1. This does not seem very elegant to me, though.
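For illustration only (not the proposed patch), here is a rough pure-Python sketch of the stick-breaking construction just described, assuming a Beta sampler that itself handles small parameters (#5858): each coordinate is a Beta draw whose second parameter is the alpha mass not yet consumed, scaled by what remains of the stick, so no final normalization, and hence no division, is needed. The function name is hypothetical.

```python
import numpy as np

def dirichlet_stick_breaking(alpha):
    """Sample one Dirichlet(alpha) vector via sequential Beta draws."""
    alpha = np.asarray(alpha, dtype=float)
    k = alpha.shape[0]
    out = np.zeros(k)
    remaining_stick = 1.0           # corresponds to `acc` in the Cython code below
    remaining_alpha = alpha.sum()   # alpha mass of the coordinates not yet drawn
    for j in range(k - 1):
        remaining_alpha -= alpha[j]
        out[j] = remaining_stick * np.random.beta(alpha[j], remaining_alpha)
        remaining_stick -= out[j]
    out[k - 1] = remaining_stick    # whatever is left goes to the last entry
    return out

# Even for tiny alphas this cannot divide by zero: if every Beta draw
# underflows to 0.0, the last component simply becomes 1.0.
print(dirichlet_stick_breaking([0.001, 0.001, 0.001]))
```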
I'd say the only way is a second dirichlet function, perhaps an unimaginative dirichlet2, and issuing a deprecation warning for the other one.
dirichlet2 would then contain a switch: i) generation via gamma RVs if \max(\alpha_i) >= 1, ii) generation via beta RVs if \max(\alpha_i) < 1?
@jaimefrio: Instead of going down the |
I have added a dirichlet2 function using beta-RVs whenever \max(\alpha_i) < 1.

```cython
def dirichlet2(self, object alpha, size=None):
    cdef npy_intp k
    cdef npy_intp totsize
    cdef ndarray alpha_arr, val_arr
    cdef double *alpha_data
    cdef double *val_data
    cdef npy_intp i, j
    cdef double acc, invacc
    cdef double alpha_max
    cdef double alpha_sum

    k = len(alpha)
    alpha_arr = <ndarray>PyArray_ContiguousFromObject(alpha, NPY_DOUBLE, 1, 1)
    alpha_data = <double*>PyArray_DATA(alpha_arr)

    shape = _shape_from_size(size, k)

    diric = np.zeros(shape, np.float64)
    val_arr = <ndarray>diric
    val_data = <double*>PyArray_DATA(val_arr)

    i = 0
    totsize = PyArray_SIZE(val_arr)

    alpha_max = alpha_arr.max()
    if alpha_max < 1.0:
        alpha_sum = alpha_arr.sum()
        # Generation via beta-RVs (stick breaking); no final normalization,
        # hence no division by a possibly-zero sum.
        with self.lock, nogil:
            while i < totsize:
                acc = 1.0
                invacc = alpha_sum
                for j from 0 <= j < k-1:
                    invacc -= alpha_data[j]
                    val_data[i+j] = acc * rk_beta(self.internal_state,
                                                  alpha_data[j], invacc)
                    acc = acc - val_data[i+j]
                val_data[i+(k-1)] = acc
                i = i + k
    else:
        # Generation via gamma-RVs followed by normalization (the existing scheme).
        with self.lock, nogil:
            while i < totsize:
                acc = 0.0
                for j from 0 <= j < k:
                    val_data[i+j] = rk_standard_gamma(self.internal_state,
                                                      alpha_data[j])
                    acc = acc + val_data[i+j]
                invacc = 1/acc
                for j from 0 <= j < k:
                    val_data[i+j] = val_data[i+j] * invacc
                i = i + k

    return diric
```

The code gets rid of the float division error. It is necessary to compute the maximum over the parameter array in order to choose whether to sample via gamma or via beta RVs. I also need two more doubles for storing the max and the sum of the parameter array. Any suggestions/improvements? I can do a pull request anytime, but I probably need to add a test in random/tests/test_random.py beforehand. I am not sure about the test philosophy here. Do I just generate values with the given seed once and assert equal in the test for these values? Unfortunately the solution proposed/discussed in #5299 is currently out of my scope. If you want to use this fix within the proposed new infrastructure I'd be glad to contribute it.
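On the test question: the usual pattern in random/tests/test_random.py is to fix the seed, draw once, and compare against values that were generated once with that seed and hard-coded into the test. Below is a rough sketch of such a test; it is my own illustration, not part of the patch, and it uses np.random.dirichlet as a stand-in for the proposed dirichlet2 (with the real method one would use alphas well below 1 and paste in the recorded values).

```python
import numpy as np
from numpy.testing import assert_array_almost_equal

def test_dirichlet_small_alpha_reproducible():
    # Stand-in for the proposed dirichlet2: fix the seed, draw, and check that
    # the stream is reproducible, the rows are normalized, and no NaNs appear.
    alpha = np.array([0.5, 0.5, 0.5])
    np.random.seed(1234567890)
    first = np.random.dirichlet(alpha, size=100)
    np.random.seed(1234567890)
    second = np.random.dirichlet(alpha, size=100)
    assert_array_almost_equal(first, second, decimal=15)         # same seed, same values
    assert not np.any(np.isnan(first))                           # no NaNs
    assert_array_almost_equal(first.sum(axis=1), np.ones(100))   # rows sum to 1
```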
It seems unlikely that this dirichlet sampling issue would provide enough motivation to move forward on the years-long discussion, where ziggurat and random choice sampling have been unable to do so. To me it looks like a dirichlet2 function would be the realistic option.
That's a rather self-fulfilling prophecy, though... A proper fix to all
Okay, I'll go ahead with a pull request. Let me know if I can do something about the unit test.
np.random.dirichlet fails for small alpha parameters (see numpy#5851). The new implementation np.random.dirichlet2 switches to generation via beta RVs (stick-breaking approach) whenever all alpha parameters are smaller than one.
@trendelkampschroer Yes, that would be appropriate even if it's not merged. I guess there is a 'good is the enemy of the great' argument by @njsmith. Ideally there would already be a framework in place that would allow improving random sampling schemes, but there just isn't. So there is a choice between including this new function vs. delaying until the framework is written, and there are drawbacks to both options. If |
Does this really have the same backwards compatibility requirements as those other discussions? In my mind this is a proper bug, and I have trouble believing that anyone will complain that dirichlet is not generating the backwards compatible sequence of NaNs and 1s that it used to.

I don't know if this fully makes sense, but perhaps one option is to put in

It would be slower and uglier, but we would keep the current interface, and only break backwards compatibility on streams that were already broken, which may be controversial, but I think mostly right. And it allows us to punt on the future random API that will put an end to all random APIs...

The grander RNG scheme is something we should properly document for requirements, and turn into a GSoC project: it's a pity how NumPy is not taking advantage of the free labor that is being made available to us.
I haven't looked carefully at what triggers the NaNs, but the
And in particular, GSoC works best in a greenfield project where there aren't so many constraints that the student is just implementing a spec that we've declared.

PS, @njsmith your email replies are consistently misformatted by GitHub. Some people are likely not seeing most of your comment text, because every paragraph after the first is hidden.
Implementing several new RNGs would be the greenfield part of the project; setting up the infrastructure to support them would be the "pound of flesh" demanded from whoever wants to get to do the fun part. Maybe it has always been like this, and I just need to enjoy the ride and stop worrying. But it seems to me that numpy is en route to a serious shortage-of-active-developers problem. In this situation, we really can't afford not to take advantage of GSoC, if only to get new blood involved.
@rkern: Thanks for the heads up. @jaimefrio: I suggest putting the shortage-of-active-developers discussion on our meeting agenda :-).
What's the current status of this issue? Is there a standard solution for the problem of small alpha parameters?
The stream compatibility is still a blocking issue given our current policy. I am currently drafting a NEP about changing that policy, and introducing @bashtage's work for the proposed new PRNG system. You may wish to contribute this improved |
Closing, very small alpha values now give |
I think we should still consider this a bug. Returning
Now that NEP 19 is implemented, we can revisit the pull request that was proposed to fix this: #5872
Hi,
I encountered a bug when using np.random.dirichlet with small alpha parameters. Call and traceback are below.
I am using numpy-1.9.1.
I believe this is a floating point issue: the distribution has almost all of its mass very close to either (1, 0, 0) or (0, 0, 1).
The 'float division' error already occurs for larger values < 1, e.g. 0.001.
It is likely that this occurs because the Dirichlet distribution is usually sampled via the Gamma distribution followed by normalization. If all values returned from Gamma sampling are zero, then a float division error occurs.
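The call and traceback from the original report are not preserved above; an illustrative reproduction on numpy 1.9.x (the alpha values here are my own choice, and for moderately small alphas such as 0.001 the failure is only intermittent) looks roughly like this:

```python
import numpy as np

# With alphas this small every Gamma(alpha) draw underflows to 0.0 almost
# surely, so the normalization 1/sum(gamma draws) divides by zero:
np.random.dirichlet([1e-10, 1e-10, 1e-10])
# -> ZeroDivisionError: float division   (numpy 1.9.x behaviour)
```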
In addition
produces 'nan' most of the time, while it should be alternating 'almost always' between (1, 0) and (0, 1).
scipy.special.betainc(0.0001, 0.0001, 1e-50) = 0.494.., i.e. nearly half of the Beta(1e-4, 1e-4) probability mass already lies below 1e-50.
It might not be possible to fix that in the current algorithmic framework, but maybe it is possible to discourage/prevent users from supplying too small parameters.
Wow, this wasn't supposed to become such a long post. Thanks to anyone reading/considering this issue.