sum result different when slicing before sum #13734

Closed
takluyver opened this issue Jun 7, 2019 · 11 comments
Labels
57 - Close? Issues which may be closable unless discussion continued

Comments

@takluyver
Contributor

I've run into a case where these two expressions give substantially different results:

a.sum(axis=0)[0]  # 3223672.5
a[:, 0].sum(axis=0)  # 3223679.2

I imagine I'm running into the limits of floating point precision in some form - the data are float32. But I'm puzzled why they're different - I'd expect these two to sum up exactly the same values and so be affected by finite precision in the same way.

Other details:

  • The difference is large enough to fail np.testing.assert_allclose
  • As far as I can find, all values are 'normal' numbers - no NaN or infinity
  • They don't span a particularly large range of values - less than two orders of magnitude.
  • All values are positive (I vaguely recall that floating point error can be magnified if large positive and negative numbers cancel each other out).

I can see a difference with random data, but the data I'm using seems to produce a particularly big difference, so I've made an example in a gist including a .npy array with this data.

git clone https://gist.github.com/07c7922f90749ae30d7d45fa02ab6851.git numpy_sum_behaviour
cd numpy_sum_behaviour/
python3 numpy_slice_sum.py
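Without the gist's `.npy` file, the same effect can be sketched with random data (the shape, seed, and tolerances below are illustrative choices, not the original data):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((1_000_000, 2)).astype(np.float32)

via_2d = a.sum(axis=0)[0]   # reduction over the slow axis: naive accumulation
via_slice = a[:, 0].sum()   # 1-D reduction: pairwise summation
ref = a[:, 0].astype(np.float64).sum()  # float64 reference

print(via_2d, via_slice, ref)
```

The two float32 results differ visibly, and the sliced version lands closer to the float64 reference.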

Numpy/Python version information:

>>> import sys, numpy; print(numpy.__version__, sys.version)
1.16.3 3.7.3 (default, May 11 2019, 00:38:04) 
[GCC 9.1.1 20190503 (Red Hat 9.1.1-1)]

I originally saw this on 1.16.4 on another machine, and cut the example down on my laptop with 1.16.3.

@seberg
Member

seberg commented Jun 7, 2019

Yes, summing along the fast axis gives additional numerical precision because numpy uses pairwise summation. But if you sum along the slow axis of a multi-dimensional array, pairwise summation would be much slower, so it isn't used there.

It is a strange beast, but since you basically get bonus precision, in the end we have it...
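For illustration, a minimal Python sketch of the two strategies (NumPy's actual implementation is in C with a larger, unrolled base case; this only shows the idea):

```python
import numpy as np

def naive_sum(x):
    # plain left-to-right accumulation: one rounding error per element, O(n) error growth
    acc = np.float32(0.0)
    for v in x:
        acc += v
    return acc

def pairwise_sum(x):
    # recursively sum the two halves: error grows like O(log n) instead of O(n)
    n = len(x)
    if n <= 8:  # small base case for illustration; NumPy's real one is larger
        return naive_sum(x)
    return pairwise_sum(x[:n // 2]) + pairwise_sum(x[n // 2:])

x = np.full(100_000, np.float32(0.1))
naive = naive_sum(x)
pair = pairwise_sum(x)
print(naive, pair)  # the naive float32 sum drifts further from the exact value
```

With 100,000 copies of float32 `0.1`, the pairwise result stays much closer to the exact sum than the naive loop.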

@seberg seberg added the 57 - Close? Issues which may be closable unless discussion continued label Jun 7, 2019
@seberg
Member

seberg commented Jun 7, 2019

There should be an open issue about this somewhere, saying it should be documented better (although I doubt it would be easy to discover even if documented).

@takluyver
Contributor Author

Thanks for the explanation - I didn't know about pairwise summation.

If it helps with the documentation, I went looking for notes relating to this in the docs for numpy.sum.

It looks like #9393 is the issue to document this. Feel free to close this issue. Now that I have a search term, I see that #11331 is also likely very similar to this.

@seberg
Member

seberg commented Jun 7, 2019

Hmmm, I tried to write a bit:

    The numerical precision of sum (and ``np.add.reduce``) is in general
    limited by directly adding each number individually to the result
    causing rounding errors in every step.
    However, often numpy will use a numerically better approach
    (pairwise summation) leading to improved precision in many use cases.
    This improved precision is always provided when no ``axis`` is given.
    When ``axis`` is given, it will depend on which axis is summed.
    Technically, to provide the best speed possible, the improved precision
    is only used when the summation is along the fast axis in memory.
    Note that the exact precision may vary depending on other parameters.
    In contrast to NumPy, Python's ``sum`` function uses a slower but more
    precise approach to summation.

but not sure it isn't too technical... Will probably open a PR with it in any case.

seberg added a commit to seberg/numpy that referenced this issue Jun 7, 2019
Note that this behaviour is of course inherited into `np.add.reduce` and
many other reductions such as `mean` or users of this reduction, such
as `cov`. This is ignored here.

Closes numpygh-11331, numpygh-9393, numpygh-13734
@bbbbbbbbba

Um... what approach does Python's builtin sum() use? Is it a language feature or an implementation detail?

@seberg
Member

seberg commented Jun 8, 2019

@bbbbbbbbba I have no idea, I guess implementation detail, see also:

https://en.wikipedia.org/wiki/Kahan_summation_algorithm

The standard library of the Python computer language specifies an fsum function for exactly rounded summation, using the Shewchuk algorithm[9] to track multiple partial sums.

Maybe I should link that site...

@seberg
Member

seberg commented Jun 8, 2019

Which means it's incorrect though; fsum uses Shewchuk, sum does not (which makes sense, since it would have to special-case floats).
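The distinction is easy to see with the standard library's `math.fsum`, which performs exactly rounded summation, versus the builtin `sum`, which just adds left to right:

```python
import math

xs = [0.1] * 10
print(sum(xs))        # 0.9999999999999999 -- plain left-to-right float addition
print(math.fsum(xs))  # 1.0 -- Shewchuk exactly rounded summation
```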

@bbbbbbbbba

Actually, isn't it just that builtin sum is starting from 0 by default?

>>> sum(np.ones((10 ** 8,), dtype = np.float32))
100000000.0
>>> type(0 + np.float32(1))
<class 'numpy.float64'>
>>>
>>> sum(np.ones((10 ** 8,), dtype = np.float32), np.float32(0))
16777216.0
>>> type(np.float32(0) + np.float32(1))
<class 'numpy.float32'>
>>>
>>> sum(np.ones((10 ** 8, 1), dtype = np.float32))
array([16777216.], dtype=float32)
>>> type(0 + np.float32([1]))
<class 'numpy.ndarray'>
>>> (0 + np.float32([1])).dtype
dtype('float32')

@seberg
Member

seberg commented Jun 9, 2019

@bbbbbbbbba I am not sure I parse... Just try to add lots of float64, and you will see differences, especially comparing to fsum.

@bbbbbbbbba

Oh, I think we are actually on the same page now. Apparently you were thinking of fsum when you talked about the "slower but more precise approach to summation". I noticed that sum sometimes also seemed to be more precise than the naive approach, but it turned out that it was only because it used float64.
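That promotion also suggests a practical mitigation when slow-axis precision matters: `np.sum` accepts a `dtype` argument for the accumulator, so float64 accumulation can be requested explicitly (the shape and seed here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.random((100_000, 4)).astype(np.float32)

narrow = a.sum(axis=0)                  # float32 accumulation along the slow axis
wide = a.sum(axis=0, dtype=np.float64)  # widen the accumulator instead

ref = a[:, 0].astype(np.float64).sum()  # float64 reference for column 0
print(narrow[0] - ref, wide[0] - ref)
```

The float64 accumulator recovers the lost precision at the cost of a wider result dtype.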

@mattip
Member

mattip commented Jun 11, 2019

Closed by #13737
