Numpy mean fails/gives huge precision issues with large arrays and axis selection #11331
Well, numpy is not overly fancy about how it calculates means; there is a mechanism that gives better-than-naive precision in some cases (which kicks in for the full array here), see also for example gh-8116
Ok. So I get why a naive summation with float32 would do that, but then it was quite confusing to see it succeed without axis selection.
The reason is memory layout, and thus speed. Doing the (mostly) pairwise summation with numpy's typical machinery only works reasonably along a single axis (where it comes with no performance loss at all), and it is only feasible when summing the fast axis; otherwise others would be complaining about massive performance drops. I agree that there should be more documentation on this, heck I was even hesitant when we first put this in.... It would also be nice to have more stable summations in general....
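As a rough illustration of that point, here is a sketch (the shape and values are illustrative, not from this thread) comparing the same float32 data reduced once along a contiguous axis and once across a strided one:

```python
import numpy as np

rng = np.random.default_rng(0)
c_order = rng.random((30_000_000, 2), dtype=np.float32) * 0.01 + 3.9
f_order = np.asfortranarray(c_order)  # identical values, column-major layout

print(c_order.mean(axis=0, dtype=np.float64))  # float64 reference, ~3.905 per column
print(f_order.mean(axis=0))                    # reduced axis is contiguous in memory: pairwise summation, should stay close
print(c_order.mean(axis=0))                    # reduced axis is strided: naive float32 accumulation typically drifts
```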
I'm glad to see #9393, which will add something to the documentation. But this can be a serious issue that can go unnoticed. A warning would be even more welcome; a fix is of course the ideal situation. My recommended workaround for people running into this issue is to add `dtype=np.float64` to the call.
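A minimal sketch of that workaround, with an illustrative array (the data stays float32; only the accumulator is widened):

```python
import numpy as np

a = np.random.rand(30_000_000, 2).astype(np.float32) * 0.01 + 3.9

print(np.mean(a, axis=0, dtype=np.float64))  # accumulate in float64
print(np.std(a, axis=0, dtype=np.float64))
```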
Another minimalist example that illustrates this issue:

```python
import numpy as np

a = np.random.rand(30*1000*1000, 2).astype(np.float32)*.01 + 3.9  # 30 million pairs
print('Expected:')
print(' mean: ', np.mean(a, axis=0, dtype=np.float64), ' (analytical = 3.905)')
print(' std: ', np.std(a, axis=0, dtype=np.float64), ' (analytical = .01/sqrt(12) = 0.00288...)')
print('Instead, you get:')
print(' mean: ', np.mean(a, axis=0))
print(' std: ', np.std(a, axis=0))
```

Outputs:
I expect this situation to be encountered a lot, and without anyone noticing, in neural network data normalization code where statistics over large training data get computed. People tend to go for single (or half) rather than double precision because of the performance and memory savings, and the fact that successful neural network training rarely requires double precision.
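For instance, a hedged sketch of that normalization pattern (the dataset below is made up for illustration): compute the per-channel statistics with a float64 accumulator, then cast the results back so the rest of the float32 pipeline is unchanged.

```python
import numpy as np

# Hypothetical training data: ~10 million RGB pixels stored as float32.
pixels = np.random.rand(10_000_000, 3).astype(np.float32) * 255

mean = pixels.mean(axis=0, dtype=np.float64).astype(np.float32)
std = pixels.std(axis=0, dtype=np.float64).astype(np.float32)

normalized = (pixels - mean) / std
```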
I suppose the warning could be made length-dependent.
Note that this behaviour is of course inherited by `np.add.reduce` and many other reductions such as `mean`, or users of this reduction, such as `cov`. This is ignored here. Closes numpygh-11331, numpygh-9393, numpygh-13734
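For illustration, a small sketch (sizes and values are arbitrary) showing the same drift through `np.add.reduce` directly:

```python
import numpy as np

a = np.random.rand(30_000_000, 2).astype(np.float32) * 0.01 + 3.9

print(np.add.reduce(a, axis=0))                    # float32 accumulation along the slow axis, visibly off
print(np.add.reduce(a, axis=0, dtype=np.float64))  # ~30_000_000 * 3.905, i.e. about 1.17e8 per column
```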
On Numpy 1.14.2 I get the following:
results in:

```
[127.50656009 127.49165182 127.51390158]
[64. 64. 64.]
127.50413
```
Even considering float32 precision, this type of failure seems odd, especially given that the mean of the entire array can be calculated successfully.
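For what it's worth, a hypothetical reconstruction along these lines (the shape and value range are assumptions, not the reporter's actual code) reproduces the printed values:

```python
import numpy as np

# Assumed data: about 2**26 float32 values per channel, uniform in [0, 255).
rng = np.random.default_rng(0)
a = rng.random((2**26, 3), dtype=np.float32) * 255

print(a.mean(axis=0, dtype=np.float64))  # ~[127.5 127.5 127.5]
print(a.mean(axis=0))                    # float32 accumulator tops out near 2**32 -> ~[64. 64. 64.]
print(a.mean())                          # flattened mean uses pairwise summation -> ~127.5
```

If the array really did have about 2**26 rows, the exact 64.0 falls out of the arithmetic: once the running float32 sum reaches 2**32, the spacing between adjacent representable values is 512, so any addend below 256 rounds away to nothing, and 2**32 divided by 2**26 samples is exactly 64.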