
Bug in np.random.dirichlet for small alpha parameters #5851


Closed
trendelkampschroer opened this issue May 7, 2015 · 31 comments · Fixed by #14924

Comments

@trendelkampschroer
Contributor

Hi,

I encountered a bug when using np.random.dirichlet with small alpha parameters. Call and traceback are below.

ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-86-73c2067e20c1> in <module>()
----> 1 np.random.dirichlet([0.0001, 0.0, 0.0001])

mtrand.pyx in mtrand.RandomState.dirichlet (numpy/random/mtrand/mtrand.c:24477)()

mtrand.pyx in mtrand.RandomState.dirichlet (numpy/random/mtrand/mtrand.c:24387)()

ZeroDivisionError: float division

I am using numpy-1.9.1.

I believe this is a floating point issue: the distribution has almost all of its mass very close to either (1, 0, 0) or (0, 0, 1).

The 'float division' error already occurs for larger values < 1, e.g. 0.001.

This likely occurs because the Dirichlet distribution is usually sampled via the Gamma distribution followed by normalization. If all values returned from Gamma sampling are zero, then a float division error occurs.
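
For illustration, a minimal NumPy-level sketch of that gamma-then-normalize scheme (not the actual mtrand code): with very small shape parameters the gamma draws underflow to exactly 0.0, so the normalizing sum is zero.

import numpy as np

alpha = [0.0001, 0.0001, 0.0001]
# Gamma(a) with tiny a concentrates so much mass near zero that the
# double-precision samples frequently underflow to exactly 0.0.
g = np.array([np.random.standard_gamma(a) for a in alpha])
print(g)            # often [ 0.  0.  0.]
print(g / g.sum())  # 0/0 -> nan here; the Cython scalar code hits ZeroDivisionError instead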

In addition

np.random.beta(0.0001, 0.0001)

produces 'nan' most of the time, while it should be alternating 'almost always' between (1, 0) and (0, 1).

scipy.special.betainc(0.0001, 0.0001, 1e-50) = 0.494...

It might not be possible to fix that in the current algorithmic framework, but maybe it is possible to discourage/prevent users from supplying too small parameters.
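
For example, a hypothetical guard along those lines (the helper name and the 0.1 threshold are only placeholders, not an existing numpy API):

import warnings
import numpy as np

def check_dirichlet_alpha(alpha, threshold=0.1):
    # Hypothetical helper: warn when every concentration parameter is tiny,
    # since the gamma-based sampler is then likely to return all zeros and
    # fail on normalization.
    alpha = np.asarray(alpha, dtype=float)
    if np.all(alpha < threshold):
        warnings.warn("all alpha < %g; dirichlet/beta samples may degenerate"
                      % threshold, RuntimeWarning)
    return alpha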

Wow, this wasn't supposed to become such a long post. Thanks to anyone reading/considering this issue.

@charris
Member

charris commented May 7, 2015

Verified for current master.

@jaimefrio
Member

There is a vintage issue (#2089) also related to Dirichlet parameters. It seems that all that code could use a rethink...

@argriffing
Contributor

This looks fixable with logarithms.

@argriffing
Contributor

something like this, except in C

from __future__ import print_function
import numpy as np

def _jhonk_naive(a, b):
    while True:
        u, v = np.random.rand(2)
        x, y = np.power((u, v), (1/a, 1/b))
        if x + y <= 1:
            return x / (x + y)

def _jhonk(a, b):
    while True:
        u, v = np.random.rand(2)
        logx = np.log(u) / a
        logy = np.log(v) / b
        m = np.maximum(logx, logy)
        logxm = logx - m
        logym = logy - m
        logxpym = np.log(np.exp(logxm) + np.exp(logym))
        lse_denom = m + logxpym
        if lse_denom <= 0:
            return np.exp(logxm - logxpym)

def jhonk(a, b, n, naive=False):
    rvs = _jhonk_naive if naive else _jhonk
    return np.array([rvs(a, b) for _ in range(n)])

print(np.random.beta(0.001, 0.001, 10), '\n')
print(jhonk(0.001, 0.001, 10, naive=True), '\n')
print(jhonk(0.001, 0.001, 10, naive=False), '\n')
print(np.histogram(jhonk(0.001, 0.001, 10000, naive=False))[0], '\n')
print(jhonk(1, 1, 10, naive=False), '\n')
print(np.histogram(jhonk(1, 1, 10000, naive=False))[0], '\n')

$ python check-beta-sample.py
[  0.  nan  nan   1.   1.  nan   0.  nan   1.  nan] 

check-beta-sample.py:9: RuntimeWarning: invalid value encountered in double_scalars
  return x / (x + y)
[  1.   1.   0.  nan   0.   1.   0.   0.  nan  nan] 

[  1.00000000e+000   1.00000000e+000   1.00000000e+000   0.00000000e+000
   1.00000000e+000   2.66868323e-085   1.00000000e+000   1.00000000e+000
   2.89780592e-157   1.00000000e+000] 

[5001    3    3    3    2    1    1    0    3 4983] 

[ 0.35706113  0.54406512  0.35814279  0.14493494  0.96739129  0.09188022
  0.13299241  0.39228852  0.4331817   0.31667118] 

[ 973 1069  977 1005  989 1027  939 1018 1010  993] 

Edit: This would be for the beta distribution.
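
A note on the log-space version above (my reading of the code, not part of the original comment): the ratio x/(x+y) is computed entirely in log space using the usual log-sum-exp shift by m = \max(\log x, \log y),

x/(x+y) = \exp(\log x - m - \log(e^{\log x - m} + e^{\log y - m})),

so the result stays finite even when x and y themselves underflow to zero in double precision.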

@trendelkampschroer
Contributor Author

As far as I can tell this is a very nice solution to the problem. The naive branch of the code will be much faster though. It might be good to define values a_min, b_min so that the log-space code is only executed whenever a<a_min and b<b_min.

I think the probability for x to be smaller than some tolerance tol is tol^a, so the probability for x and y to be jointly smaller than that tolerance is tol^(a+b).

For tol = 1e-323, a = 0.1 and b = 1e-15, the probability to find both x, y < tol is still smaller than 10^{-32}.
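
For reference, that bound follows from the construction in Jöhnk's algorithm above, where x = u^{1/a} with u uniform on [0, 1]:

P(x < tol) = P(u^{1/a} < tol) = P(u < tol^a) = tol^a, and similarly P(y < tol) = tol^b,
so P(x < tol and y < tol) = tol^{a+b}; with tol = 1e-323, a = 0.1 and b = 1e-15 this is roughly 10^{-32.3}.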

A test with a=0.1, b=0.0001 for N=1e9 samples with the naive approach did not result in any nans:

In [9]: x = np.random.beta(0.1, 0.0001, size=(1e4,1e5))

In [10]: np.any(np.isnan(x))
Out[10]: False

A similar test with a=0.01, b=0.001 results in nan values already for N=1e4. Based on this observation and the above probability argument it might be safe to switch to the log-space algorithm whenever max(a, b)<0.1.

There is still a tiny chance that the code will produce nan, but I believe it is small enough that it might not be worth sacrificing the efficiency of the naive approach by using the log-space version for all a, b < 1.

@njsmith
Member

njsmith commented May 8, 2015

Does your proposal change what values you get from a RNG using a given seed?

@jaimefrio
Member

That's some nice math indeed!

I think it would be better to have the certainty, not a high probabilistic expectation, that NaNs are not going to be produced. Does the following have any loophole in it?

while(1) {
    double u = rk_double(state);
    double v = rk_double(state);
    double x = pow(u, 1.0/a);
    double y = pow(v, 1.0/b);

    if (x + y <= 1) {
        if (x + y > 0) { /* Should we perhaps check against some epsilon instead? */
            return x / (x + y);
        }
        else {
            double logx = log(u) / a;
            double logy = log(v) / b;
            double logm = logx > logy ? logx : logy;
            logx -= logm;
            logy -= logm;

            return exp(logx - log(exp(logx) + exp(logy)));
        }
    }
}

The only uncertain cases come if either u or v happen to be exactly zero. If infinities are properly generated and handled by log and exp, then I think it would work. The only funky case would be if both u and v are zero, which probably has no practical importance, but I believe we would end up doing -inf + inf and therefore have a genuine NaN generated.

Answering Nathaniel's question, I don't think the RNG state would be changed.

@argriffing
Contributor

@jaimefrio Your implementation looks good to me. It has the advantage of more clearly preserving the number of iterations (important for Nathaniel's concern about RNG state backwards compatibility), and of not introducing additional transcendental function overhead when there are no nans. Want to make a PR?

@jaimefrio
Member

Do you want to send a PR and become a numpy contributor, @trendelkampschroer? I'll be happy to steal @argriffing's thunder if you don't...

@trendelkampschroer
Contributor Author

I think you should go ahead with it, since you proposed an elegant solution. I cannot take credit for anything more than pointing out the problem.

I am quite happy to know that there is such a nice fix. I am not sure if the same strategy could be applied to the Dirichlet case, though. The current approach of generating via the Gamma distribution will definitely suffer from the same problem (a zero denominator in the normalization).

@argriffing
Contributor

I am not sure if the same strategy could be applied to the Dirichlet case, though.

I'm not 100% sure, but isn't there a way to recursively partition the set of dirichlet dimensions so that you can use beta to sample from it? Or something like http://en.wikipedia.org/wiki/Dirichlet_distribution#Marginal_beta_distributions
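
For concreteness, a minimal pure-Python sketch of that stick-breaking construction (Beta(\alpha_i, \alpha_{i+1} + ... + \alpha_n) marginals; the function name is only illustrative, not a proposed API):

import numpy as np

def dirichlet_via_beta(alpha, rng=np.random):
    # Stick breaking: x_i = (remaining stick) * Beta(alpha_i, alpha_{i+1} + ... + alpha_n)
    alpha = np.asarray(alpha, dtype=float)
    n = alpha.size
    x = np.empty(n)
    remaining = 1.0
    rest = alpha.sum()
    for i in range(n - 1):
        rest -= alpha[i]
        x[i] = remaining * rng.beta(alpha[i], rest)
        remaining -= x[i]
    x[-1] = remaining   # no normalization step, so no 0/0
    return x

print(dirichlet_via_beta([0.0001, 0.0001, 0.0001]))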

This change to the dirichlet distribution sampler would break RNG backwards compatibility. As far as I am aware, for each probability distribution in numpy the sampling scheme is not allowed to substantially change after the first implementation. For example, numpy is locked to Box-Muller sampling for the normal distribution and this cannot easily be changed to ziggurat sampling (#2047) even though ziggurat sampling is better and a scipy patch is available. There seems to be an unsettled debate between adding new sampling functions for each new scheme (e.g. np.random.normal_with_ziggurat(...), np.random.dirichlet_with_recursive_beta(...)) vs. using new random state objects (e.g. rng = np.random.BleedingEdgeRNG(); x = rng.dirichlet(...)).

@charris
Member

charris commented May 8, 2015

I can not take credit for anything more than pointing out the problem.

Ah, but you could take credit for the work :0

jaimefrio added a commit to jaimefrio/numpy that referenced this issue May 9, 2015
@jaimefrio
Member

#5858 fixes the beta problem (thanks @argriffing!), but the issues for Dirichlet remain, so reopening.

@jaimefrio jaimefrio reopened this May 10, 2015
@trendelkampschroer
Contributor Author

@argriffing: I think your proposal of generating Dirichlet RVs using Beta-distributed RVs (stick-breaking approach) should get rid of the problem, given that the Beta-sampler is stable for a, b < 1.

In the case that \max{\alpha_i} < 1 the Dirichlet sampler could switch from generation via Gamma-RVs to generation via Beta-RVs. The generation via Beta-RVs does not require normalization, so a float division will not be encountered, even if all generated Beta-RVs are zero. This simply entails that the last element of the resulting Dirichlet vector becomes one and all other entries will be equal to zero.

Since #5858 fixes this issue for the Beta-sampler (thanks @jaimefrio), the issue could now also be fixed for the Dirichlet sampler.

I am not sure how to approach this issue given the RNG backwards compatibility policy, though. One could add a switch in mtrand.dirichlet, but this would surely break backwards compatibility.

Or implement np.random.dirichlet_with_recursive_beta(...) just to handle the case \alpha_i<1 for all i. Future versions could issue a warning whenever the algorithm using Gamma RVs is called with a parameter vector (\alpha_1,...,\alpha_n) with \alpha_i < 1. This does not seem very elegant to me, though.

@jaimefrio
Member

I'd say the only way is a second dirichlet function, perhaps an unimaginative dirichlet2, and issuing a deprecation warning for the other one.

@trendelkampschroer
Contributor Author

dirichlet2 would then contain a switch: i) generation via gamma RVs if \max(\alpha_i) >= 1, ii) generation via beta RVs if \max(\alpha_i) < 1?

@njsmith
Member

njsmith commented May 11, 2015

@jaimefrio: Instead of going down the dirichlet2 route, see discussion around #5299 (comment)

@trendelkampschroer
Contributor Author

I have added a dirichlet2 function using beta-RVs whenever \max(\alpha_i) < 1.

def dirichlet2(self, object alpha, size=None):
        cdef npy_intp   k
        cdef npy_intp   totsize
        cdef ndarray    alpha_arr, val_arr
        cdef double     *alpha_data
        cdef double     *val_data
        cdef npy_intp   i, j
        cdef double     acc, invacc

        cdef double alpha_max
        cdef double alpha_sum

        k           = len(alpha)
        alpha_arr   = <ndarray>PyArray_ContiguousFromObject(alpha, NPY_DOUBLE, 1, 1)
        alpha_data  = <double*>PyArray_DATA(alpha_arr)

        shape = _shape_from_size(size, k)

        diric   = np.zeros(shape, np.float64)
        val_arr = <ndarray>diric
        val_data= <double*>PyArray_DATA(val_arr)        

        i = 0
        totsize = PyArray_SIZE(val_arr)
        alpha_max = alpha_arr.max()

        if alpha_max < 1.0:
            # Generation via beta RVs (stick breaking)
            alpha_sum = alpha_arr.sum()
            with self.lock, nogil:
                while i < totsize:
                    acc = 1.0
                    invacc = alpha_sum
                    for j from 0<= j < k-1:
                        invacc -= alpha_data[j]
                        val_data[i+j] = acc * rk_beta(self.internal_state, alpha_data[j], invacc)
                        acc = acc - val_data[i+j]
                    val_data[i+(k-1)] = acc
                    i = i+k

        else:
            # Original generation via gamma RVs followed by normalization
            with self.lock, nogil:
                while i < totsize:
                    acc = 0.0
                    for j from 0 <= j < k:
                        val_data[i+j]   = rk_standard_gamma(self.internal_state,
                                                            alpha_data[j])
                        acc             = acc + val_data[i+j]
                    invacc  = 1/acc
                    for j from 0 <= j < k:
                        val_data[i+j]   = val_data[i+j] * invacc
                    i = i + k

        return diric

The code gets rid of the float division error. It is necessary to compute the maximum over the parameter array in order to choose whether to sample via gamma or via beta RVs. I also need two more doubles for storing the max and the sum of the parameter array.

Any suggestions/improvements?

I can do a pull request anytime, but I probably need to add a test in random/tests/test_random.py beforehand. I am not sure about the test philosophy here. Do I just generate values with the given seed once and assert equal in the test for these values?
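
Not an authoritative answer, but a sketch of what such a test could look like (the usual pattern in test_random.py fixes the seed and compares against values generated once; the version below only asserts properties, since I don't have seeded reference output, and dirichlet2 is the proposed function from the draft above):

import numpy as np
from numpy.testing import assert_allclose

def test_dirichlet2_small_alpha():
    np.random.seed(1234567890)
    alpha = np.array([0.1, 0.0001, 0.0001])
    sample = np.random.dirichlet2(alpha, size=100)
    # Must not contain nan, and every row must still sum to one.
    assert not np.any(np.isnan(sample))
    assert_allclose(sample.sum(axis=1), np.ones(100))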

Unfortunately the solution proposed/discussed in #5299 is currently out of my scope. If you want to use this fix within the proposed new infrastructure I'd be glad to contribute it.

@argriffing
Contributor

Instead of going down the dirichlet2 route, see discussion around #5299 (comment)

It seems unlikely that this dirichlet sampling issue would provide enough motivation to move forward on the years-long discussion, where ziggurat and random choice sampling have been unable to do so. To me it looks like a dirichlet2 function would be the realistic option.

@njsmith
Member

njsmith commented May 13, 2015

That's a rather self-fulfilling prophecy, though... A proper fix to all these issues would be a bit of work, but not that much; we merge bigger pull requests all the time.

@trendelkampschroer
Contributor Author

Okay, I'll go ahead with a pull request. Let me know if I can do something about the unit test.

trendelkampschroer added a commit to trendelkampschroer/numpy that referenced this issue May 13, 2015
np.random.dirichlet fails for small alpha parameters (see numpy#5851).
The new implementation np.random.dirichlet2 switches to generation
via beta RVs (stick breaking approach) whenever all alpha parameters
are smaller than one.
@argriffing
Contributor

@trendelkampschroer Yes, that would be appropriate even if it's not merged.

I guess there is a 'good is the enemy of the great' argument by @njsmith. Ideally there would already be a framework in place that would allow improving random sampling schemes, but there just isn't. So there is a choice between including this new function vs. delaying until the framework is written, and there are drawbacks to both options. If dirichlet2 is added, then it will cause some backwards compatibility cruft when the 'proper fix' is implemented, and it could set a precedent for adding other functions like choice2 or randn2 and possibly further delay the implementation of the better framework. On the other hand delaying until the 'proper fix' means that numpy will have a broken dirichlet sampling function for a range of parameters until the proper fix is implemented, and the decision could have the side-effect of discouraging drive-by contributors. Personally I favor adding dirichlet2 in this case but it is subjective.

@jaimefrio
Member

Does this really have the same backwards compatibility requirements as those other discussions? In my mind this is a proper bug, and I have trouble believing that anyone will complain that dirichlet is not generating the backwards compatible sequence of NaNs and 1s that it used to.

I don't know if this fully makes sense, but perhaps one option is to put in dirichlet2 as a private function; then, before generating every random variate, dirichlet would save the RNG state and call the current algorithm, and only call dirichlet2, with the saved state, if the return had NaNs.
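
A rough Python-level sketch of that save-state-and-fall-back idea (the beta-based helper is hypothetical, and this assumes the 0/0 shows up as nan rather than an exception; the real change would live in mtrand.pyx):

import numpy as np

def dirichlet_with_fallback(alpha, size=None):
    state = np.random.get_state()              # remember where the stream was
    sample = np.random.dirichlet(alpha, size)  # current gamma-based algorithm
    if np.any(np.isnan(sample)):
        np.random.set_state(state)             # rewind so the stream is not consumed twice
        # Hypothetical stick-breaking sampler, e.g. the dirichlet_via_beta
        # sketch earlier in the thread, extended to handle the size argument.
        sample = _dirichlet_via_beta(alpha, size)
    return sample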

It would be slower and uglier, but we would keep the current interface, and only break backwards compatibility on streams that were already broken, which may be controversial, but I think mostly right. And it allows us to punt on the future random API that will put an end to all random APIs...

The grander RNG scheme is something we should properly document for requirements, and turn into a GSoC project: it's a pity how NumPy is not taking advantage of the free labor that is being made available to us.

@njsmith
Member

njsmith commented May 14, 2015

I haven't looked carefully at what triggers the NaNs, but the try-and-fall-back approach does seem at least potentially viable.

NB the "random api to end all random APIs" is -- if we leave out the replaceable core generators for now, we can always add that later -- like 10-20 lines of code to track an extra integer through init/seed/get_state/set_state, plus tests. The main hard part is someone has to write down how it would work in that terrifying language known as "English".

FWIW my experience with GSoC is that it goes best when you don't think of it as free labour, but rather as a chance to provide mentoring and a learning opportunity for a student, with any actually usable code coming out being a nice surprise.

@rkern
Member

rkern commented May 14, 2015

And in particular, GSoC works best in a greenfield project where there aren't so many constraints that they are just implementing a spec that we've declared.

PS, @njsmith, your email replies are consistently misformatted by GitHub. Some people are likely not seeing most of your comment text because every paragraph after the first is hidden.

@jaimefrio
Member

Implementing several new RNGs would be the greenfield part of the project; setting up the infrastructure to support them would be the "pound of flesh" demanded from whoever wants to get to do the fun part of the project.

Maybe it has always been like this, and I just need to enjoy the ride and stop worrying. But it seems to me that numpy is en route to a serious shortage-of-active-developers problem. In this situation, we really can't afford to not take advantage of GSoC, if only to get new blood involved.

@njsmith
Member

njsmith commented May 15, 2015

@rkern: Thanks for the heads up.

@jaimefrio: I suggest putting the shortage-of-active-developers discussion on our meeting agenda :-).

@carlosgmartin
Contributor

What's the current status of this issue? Is there a standard solution for the problem of small alpha parameters?

@rkern
Member

rkern commented Mar 27, 2018

The stream compatibility is still a blocking issue given our current policy.

I am currently drafting a NEP about changing that policy, and introducing @bashtage's work for the proposed new PRNG system. You may wish to contribute this improved dirichlet() implementation to his prototype here: https://github.com/bashtage/randomgen/

@mattip
Member

mattip commented Jul 5, 2019

Closing: very small alpha values now give nan rather than an error, both in legacy and in the new random Generator code. Please reopen if I misunderstood.

@mattip mattip closed this as completed Jul 5, 2019
@WarrenWeckesser
Member

WarrenWeckesser commented Nov 14, 2019

Closing: very small alpha values now give nan rather than an error, both in legacy and in the new random Generator code. Please reopen if I misunderstood.

I think we should still consider this a bug. Returning nan with no warning is not nice, e.g.:

In [2]: rng = np.random.default_rng(None)                                                 

In [3]: alpha = np.array([1e-4, 1e-5, 1e-6])                                              

In [4]: rng.dirichlet(alpha, size=5)                                                      
Out[4]: 
array([[nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan]])

Now that NEP 19 is implemented, we can revisit the pull request that was proposed to fix this: #5872
