BUG: fix negative samples generated by Wald distribution #29609
Conversation
In `numpy.random`, when generating Wald samples, if the mean and scale have a large discrepancy, the current implementation suffers from catastrophic cancellation. An explicit example of this has been added to `test_generator_mt19937.py`. The key line implicated in `distributions.c` is `X = mean + mu_2l * (Y - sqrt(4 * scale * Y + Y * Y));`, which has numerical issues when `Y >> scale`. I have replaced this with the equivalent rationalized form:

```c
d = Y + sqrt(Y) * sqrt(Y + 4 * scale);
X = mean - (2 * mean * Y) / d;
```

And now the test passes.
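To illustrate the effect outside the C code, here is a minimal NumPy sketch of the two formulations side by side (the seed and sample count are arbitrary and are not the ones used in the test suite):

```python
import numpy as np

rng = np.random.default_rng(12345)   # arbitrary seed, for illustration only
mean, scale = 1e8, 2.25              # the values used in the new test case
mu_2l = mean / (2.0 * scale)

n = rng.standard_normal(1_000_000)
Y = mean * n * n

# Current formulation in distributions.c, translated to NumPy:
x_old = mean + mu_2l * (Y - np.sqrt(4.0 * scale * Y + Y * Y))

# Rationalized formulation from this PR:
d = Y + np.sqrt(Y) * np.sqrt(Y + 4.0 * scale)
x_new = mean - (2.0 * mean * Y) / d

print("negative samples, old formula:", int((x_old < 0).sum()))
print("negative samples, new formula:", int((x_new < 0).sum()))
```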
Thanks @mmaaz-git, I confirmed that the problem exists on the main development branch. At first glance, your reformulation looks mathematically correct and avoids the problem. I probably won't get to a closer look until this weekend (anyone else can jump in, of course). One change I suggest before I click the "Approve workflows to run": you say the new test fails for you without the modified formulas, but in my build of numpy, your new test passes using the main development branch. I was able to get the test to fail by increasing …
I've changed the test case to …
For anyone interested, the method that is used to generate the variates is described on the Wikipedia page: https://en.wikipedia.org/wiki/Inverse_Gaussian_distribution#Sampling_from_an_inverse-Gaussian_distribution
Our current implementation uses the following formulation to compute `x`. I changed the notation slightly to be closer to the Wikipedia article and to avoid reusing variable names, so `mean` is `mu` and `scale` is `lam` (i.e. lambda).
```
n = Normal(0, 1)  # Standard normal variate
mu_2l = mu / (2 * lam)
y = mu * n**2
x = mu + mu_2l * (y - sqrt(4 * lam * y + y * y))
```
That is just the first step, where `x` is computed.

The problem is the loss of precision in `y - np.sqrt(4 * lam * y + y * y)`. When `y` is sufficiently large, the two terms in that subtraction are relatively close, which results in a catastrophic loss of precision.
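To see how close those two terms get, here is a quick check with illustrative numbers (not the values from the test suite):

```python
import numpy as np

mu, lam = 1e8, 2.25     # same order of magnitude as the new test case
n = 2.0                 # an unremarkable standard normal draw
y = mu * n**2           # 4e8, so y >> lam

a = y
b = np.sqrt(4 * lam * y + y * y)
print(a, b, a - b)
# a and b agree to roughly 8 significant digits, so almost all of the
# information about the true difference, -4*lam*y / (y + b) ~ -4.5,
# has to come from the last few bits of b.
```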
In this pull request, the formulation is changed to
```
n = Normal(0, 1)  # Standard normal variate
y = mu * n**2
d = y + sqrt(y) * sqrt(y + 4 * lam)
x = mu - (2 * mu * y) / d
```
In effect, @mmaaz-git has "rationalized the numerator" and replaced the problematic subtraction with an addition in the denominator.
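Spelling out that rationalization step (same notation as above):

$$
y - \sqrt{4\lambda y + y^2}
= \frac{y^2 - (4\lambda y + y^2)}{y + \sqrt{4\lambda y + y^2}}
= \frac{-4\lambda y}{y + \sqrt{y}\,\sqrt{y + 4\lambda}},
$$

so, with $\mu_{2l} = \mu / (2\lambda)$,

$$
x = \mu + \mu_{2l}\left(y - \sqrt{4\lambda y + y^2}\right)
= \mu - \frac{2\mu y}{y + \sqrt{y}\,\sqrt{y + 4\lambda}}.
$$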
The following plot of the relative error in `x` as a function of the normal variate `n`, with `mean = mu = 1e8` and `scale = lam = 2.25`, shows the effectiveness of this reformulation. The relative error was determined by computing a high precision result using `mpmath`.

@mmaaz-git, one problem with your version is that the normal variate (and therefore `y`) could be 0, and your formula generates `nan` in that case. You'll need to check for the normal variate being 0; just return `mu` in that case.
We can also make a few small changes to the expression to cut down on the number of floating point operations, and end up with something like the following (where I also show the check for the normal variate being 0):
```
n = Normal(0, 1)  # Standard normal variate
if n == 0:
    return mu
y = mu * n**2
d = 1 + sqrt(1 + 4 * lam / y)
x = mu * (1 - 2 / d)
```
(No need to change the actual variable names; keep `mean`, `scale`, etc.)
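For illustration, here is a pure-Python sketch of the full variate generation with that reformulation and the zero check, following the method on the Wikipedia page linked above (the actual change of course belongs in the C code of `distributions.c`):

```python
import numpy as np

def wald_variate(mu, lam, rng):
    """One inverse-Gaussian (Wald) variate; illustrative sketch only."""
    n = rng.standard_normal()
    if n == 0.0:
        return mu                    # guard discussed above: y == 0 would give nan
    y = mu * n * n
    d = 1.0 + np.sqrt(1.0 + 4.0 * lam / y)
    x = mu * (1.0 - 2.0 / d)         # rationalized form, no catastrophic cancellation
    # Second step of the method: return x or mu**2 / x with the right probability.
    if rng.random() <= mu / (mu + x):
        return x
    return mu * mu / x

rng = np.random.default_rng(0)
samples = [wald_variate(1e8, 2.25, rng) for _ in range(100_000)]
print(min(samples) >= 0.0)
```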
Thanks for the detailed review! Really satisfying to see the relative error plot, and great points about the normal rv being 0 and how to simplify some of the expressions. I have made the requested changes.
In `numpy.random`, when generating Wald samples, if the mean and scale have a large discrepancy, the current implementation sometimes generates negative samples. An explicit example of this has been added to `test_generator_mt19937.py`, with `mean=1e8, scale=2.25` (which is specific to the seed used in that test file).

The key line implicated in `distributions.c` is `X = mean + mu_2l * (Y - sqrt(4 * scale * Y + Y * Y));`, which has numerical issues when `Y >> scale`. I have replaced this with the equivalent rationalized form `d = Y + sqrt(Y) * sqrt(Y + 4 * scale); X = mean - (2 * mean * Y) / d;`.

This can be seen by noting the following algebraic identity: `Y - sqrt(4 * scale * Y + Y * Y) = -4 * scale * Y / (Y + sqrt(Y) * sqrt(Y + 4 * scale))`. As `mu_2l = mean / (2 * scale)`, we then have `X = mean - 2 * mean * Y / (Y + sqrt(Y) * sqrt(Y + 4 * scale))`. Introducing a new variable for the denominator and factoring gives the final expressions above.
And now the test passes, i.e., all samples are non-negative.
I have not made this change to the legacy implementation (nor the legacy test) as I figure they are deprecated, although it also suffers from the same issue.
Disclaimer: I found this failing property during a research project I am working on using LLM agents to write property-based tests (i.e., in Hypothesis). This property, that all Wald samples should be non-negative, was proposed and written "autonomously" by the agent. I spent several hours analyzing the failure and becoming acquainted with the NumPy codebase myself in order to verify it was a real issue, running it with the seed in the test suite, etc. I also pinpointed the cause of the problem and wrote this fix and the PR myself. I take full responsibility for it.
The Hypothesis test that caught the issue was simply:
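Roughly, it was a property along these lines (a reconstructed sketch, not the exact test; the strategies and bounds here are placeholders):

```python
import numpy as np
from hypothesis import given, strategies as st

@given(
    mean=st.floats(min_value=1e-3, max_value=1e16),   # placeholder bounds
    scale=st.floats(min_value=1e-3, max_value=1e16),  # placeholder bounds
)
def test_wald_samples_are_non_negative(mean, scale):
    samples = np.random.default_rng(0).wald(mean, scale, size=1000)
    assert np.all(samples >= 0)
```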
Note that even after this PR, this property fails, albeit the minimal example is now `mean=280432326968266.0, scale=0.25` (i.e., even more extreme values). But at such values, it seems like we hit the limits of floating point computations.