ENH: Faster algorithm for computing histograms with equal-size bins #6100

Merged: 1 commit merged into numpy:master on Jul 23, 2015

Conversation

astrofrog (Contributor)

About

When bins are equal-sized, we don't have to use searchsorted to find which bin each value falls in; we can instead just scale and convert the values to bin indices and use bincount.
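
In rough outline, the idea is something like the following (an illustrative sketch with made-up values and variable names, not the exact code in this PR):

import numpy as np

# Illustrative data and binning parameters (hypothetical values)
values = np.random.uniform(-10., 100., 1000)
bins, mn, mx = 45, -5.0, 99.0

# Drop values outside the requested range, as np.histogram does
values = values[(values >= mn) & (values <= mx)]

# With equal-width bins, the bin index is just a linear rescaling,
# and bincount then does the counting in a single pass
norm = bins / (mx - mn)
indices = ((values - mn) * norm).astype(np.intp)
indices[indices == bins] -= 1   # values exactly at mx belong to the last bin
hist = np.bincount(indices, minlength=bins)
edges = np.linspace(mn, mx, bins + 1)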

Results

I carried out speed tests with arrays going from 1 to 10^8 elements. The values were randomly sampled with:

x = np.random.uniform(-10., 100., N)

and the histogram was computed with:

hist, edges = np.histogram(x, bins=45, range=[-5, 99.])

The speed difference is as follows:

[Figure: timing comparison of the old and new np.histogram implementations for array sizes from 1 to 10^8]

The new method is always faster, and for more than 10^5 elements, the speedup is a factor of ~10x.
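
For reference, a comparison along these lines can be reproduced with something like the following (the sizes and repeat counts here are illustrative, not the exact benchmark script used for the plot):

import numpy as np
import timeit

for exponent in range(1, 8):       # up to 10**7 here, to keep the example quick
    N = 10 ** exponent
    x = np.random.uniform(-10., 100., N)
    t = timeit.timeit(lambda: np.histogram(x, bins=45, range=[-5, 99.]),
                      number=10)
    print("N = 10**%d: %.4f s for 10 calls" % (exponent, t))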

I then checked that the memory usage is not increased when binning an array with 10^9 values:

[Figure: peak memory usage while binning an array of 10^9 values]

There is no memory penalty!

Note that I originally implemented this without the iteration over blocks, and not only did it use more memory, it was 2x as slow as the version with looping over chunks (to my surprise).
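
For illustration, the chunked accumulation pattern looks roughly like this (a sketch, not the PR code verbatim):

import numpy as np

def equal_bin_histogram(a, bins, mn, mx, BLOCK=65536):
    # Accumulate bin counts block by block, so that only one BLOCK-sized
    # set of temporaries is alive at any time.
    norm = bins / (mx - mn)
    counts = np.zeros(bins, dtype=np.intp)
    for i in range(0, len(a), BLOCK):
        tmp_a = a[i:i + BLOCK]
        keep = (tmp_a >= mn)
        keep &= (tmp_a <= mx)
        tmp_a = tmp_a[keep]
        indices = ((tmp_a - mn) * norm).astype(np.intp)
        indices[indices == bins] -= 1
        counts += np.bincount(indices, minlength=bins)
    return counts, np.linspace(mn, mx, bins + 1)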

Future work

If this PR is merged, we can also consider switching to using C or Cython for the part where the bin indices are computed. This could provide a further ~3x improvement in performance.

Notes

This partially addresses #6099 (np.histogram could be made even faster by switching to C in places)

A similar speedup could be made for histogramdd

Of course, given how extensively this function is used, it would be great if people could subject it to more rigorous testing (I will also do some more testing to see if I can find cases where there is a performance regression)

@@ -175,54 +175,102 @@ def histogram(a, bins=10, range=None, normed=False, weights=None,
raise AttributeError(
'max must be larger than min in range parameter.')

# Histogram is an integer or a float array depending on the weights.
if weights is None:
    ntype = int
Member

For indices and counts we always use np.intp rather than int.

Contributor Author

Ah ok - but did you mean to comment on the line with indices = tmp_a.astype(int)? (ntype is not used for indices - the code you commented on is the same as before, I just moved it.)

Member

Oh, I see! It is then a bug in the current implementation. An unlikely bug, but the histogram of an array of 2**31 items on a 64-bit system, all of which end up in the same bin, will show a -1 count for that bin, which is not quite what we are after. Since you are at it, you may as well fix it, since it does affect your code: I'm pretty sure bincount always returns either np.intp (without weights) or np.double (with weights).
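
To illustrate the overflow concern with a generic example (not the histogram code itself): a count of 2**31 does not fit in a signed 32-bit integer and wraps around to a negative value when cast down:

import numpy as np

count = np.array(2**31, dtype=np.int64)
print(count.astype(np.int32))   # -2147483648, i.e. the count has wrapped around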

Contributor Author

@jaimefrio - ah, I see. Actually, testing for this uncovered a bug in bincount:

In [1]: import numpy as np

In [2]: values = np.ones(2**32, dtype=np.intp)

In [3]: np.bincount(values)
/Volumes/Raptor/miniconda3/envs/production/bin/python.app: line 3:  7001 Segmentation fault: 11  /Volumes/Raptor/miniconda3/envs/production/python.app/Contents/MacOS/python "$@"

Do you see the same issue?

@jaimefrio (Member)

One generic comment: by using bincount you are forcing your weighted histogram to be convertible to double. Not sure if there is a use case for complex, or object weights, but it may be a good idea to route those to the generic, slower algorithm.

@astrofrog (Contributor Author)

@jaimefrio - thanks for the feedback! I agree about needing to be careful about the weights type. I'm not sure if the code was originally intended to be used with non-float weights, since there was the comment "# Histogram is an integer or a float array depending on the weights.", but I'll try to find some examples of non-float weights and will add some unit tests (and will make sure we use the slower algorithm there).

@astrofrog (Contributor Author)

@jaimefrio - I've now made sure that if the weights are complex, we do the right thing with bincount, and if they are not complex, float, or int, we revert to the original algorithm (and I added a test for all these).

Note that there is still an issue with bincount mentioned in #6100 (comment) but I think that can be dealt with separately. It's not specific to this PR, and it's a definite corner case.

@astrofrog force-pushed the faster-histogram-equal-bins branch from 1c73e0a to 6fc0889 on July 21, 2015 at 20:18
@ctk3b commented Jul 21, 2015

Quick question: in one of my use cases, instead of providing the number of bins, I provide an array of equally spaced bins. The obvious fix is on my end, but I believe that, technically speaking, the optimized code is not used when an array of equally spaced bins is provided. Is that correct?

Would there be problems associated with doing that check as well? I.e. something along the lines of:

bin_widths = np.diff(bins)
if not iterable(bins) or np.all(bin_widths == bin_widths[0]):

There's some overhead for that check of course, and there are probably rounding issues, so the simplest way out may just be to use the function as intended.

@ewmoore (Contributor) commented Jul 21, 2015

@ctk3b, you might want the range keyword.

The issue with checking the bin widths is that it can fail due to rounding error: a set of bins that is meant to be equally spaced does not necessarily pass an exact equality check once the edges are computed in floating point, e.g.:

In [136]: a = np.arange(10) * (1./7)

In [137]: bins = np.arange(10) * (1./7)

In [138]: bin_widths = np.diff(bins)

In [139]: bin_widths == bin_widths[0]
Out[139]: array([ True,  True,  True,  True, False, False, False, False, False], dtype=bool)

edit: doh I see you are aware that rounding issues exist.

@ctk3b commented Jul 21, 2015

Yeah, that's the tricky part I guess. One thing to maybe consider, if the cost of the check is acceptable, is to do an np.allclose() check and suggest to the user that they might want to consider using range and specifying the number of bins instead.

*but they have to be absolutely sure that's what they want
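
A sketch of what such a tolerance-based check might look like (hypothetical; nothing like this was added in this PR):

import numpy as np

def bins_look_uniform(bins, rtol=1e-10):
    # True if the supplied edges are (approximately) equally spaced
    bin_widths = np.diff(bins)
    return np.allclose(bin_widths, bin_widths[0], rtol=rtol)

edges = np.arange(10) * (1. / 7)     # fails an exact equality check...
print(bins_look_uniform(edges))      # ...but passes the tolerance-based one: True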

@astrofrog (Contributor Author)

I personally don't think that we should check for equal bin spacing in the bins array (if it's not an int), but happy to add it if there is a consensus. An alternative is to simply make the docstring clearer and add a note about how to get optimal performance.

@astrofrog (Contributor Author)

@jaimefrio - apart from that question, I think this is pretty much ready for a final review.


# At this point, if the weights are not integer, floating point, or
# complex, we have to use the slow algorithm.
if weights is not None and not weights.dtype.kind in ('i', 'f', 'c'):
Member

I think this would be better implemented as:

if weights is not None and not (np.can_cast(weights.dtype, np.double) or
                                np.can_cast(weights.dtype, np.complex)):
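
For reference, np.can_cast reports whether one dtype can be safely cast to another, which is what the check above relies on:

import numpy as np

print(np.can_cast(np.int32, np.double))            # True
print(np.can_cast(np.float64, np.double))          # True
print(np.can_cast(np.complex128, np.double))       # False: complex weights need the complex branch
print(np.can_cast(np.complex128, np.complex128))   # True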

@jaimefrio (Member)

It mostly LGTM. Have quite a few questions/comments, but aside from the int-that-should-be-an-intp, there is nothing fundamentally wrong that I see.

@astrofrog (Contributor Author)

@jaimefrio - I'll reply here about the blocks/chunks to make sure the comment stays here in the long term.

I copied the use of the blocks from the older/slower method. It turns out that doing this results in a factor of 2 improvement in performance. Let's take the case of a 10^8 array. The times to do the histogram (with the new method in this PR) are:

BLOCK         TIME (s)
2**16         1.15103316307 (current implementation)
2**17         1.10001802444
2**18         1.08962392807
2**19         1.10686206818
2**20         1.23000407219
2**21         1.69827389717
2**22         1.80364918709
2**23         1.91272115707
...
2**28         2.28366589546

So with BLOCK = 65536 we are basically 2x faster than with no chunking (it turns out we could do even better with BLOCK=2**18). I presume this must have something to do with larger arrays needing disproportionately more time to be allocated? In addition, the memory requirements are very different - for the 10^8 array, the peak RAM with BLOCK = 65536 is a little under 900 MB, while if we don't use blocks, the memory requirement goes up to 2.7 GB.

While slightly inelegant, I do think a 2x improvement in speed and 3x improvement in memory usage is worth it. In fact, I think that having a 3x worsening in memory usage would be considered a performance regression by many, even if the code ran 5x faster. Do you agree that it makes sense to leave it as-is? I can add a comment about why this is done.

@jaimefrio (Member)

I'll buy that 2x. Thanks for taking the time to substantiate with measurements all my existential doubts.

Correct me if I am wrong, but it seems we both agree that, eventually, the right thing to do would be to rewrite at least this particular path in C, right?

If you could fix that int-that-should-be-an-intp, and get rid of some of the extraneous white space, I'll go ahead and merge it.

@astrofrog (Contributor Author)

@jaimefrio - yes, I think in the long term we could benefit from switching over to C and I'd be happy to look into it (but that will need to wait a few weeks). I'll finish up the current pull request now so that you can go ahead and merge it.

@astrofrog force-pushed the faster-histogram-equal-bins branch from 5ab0ad5 to 4325078 on July 22, 2015 at 15:54
@astrofrog (Contributor Author)

@jaimefrio - I've made the fixes you requested.

I'd be interested in trying to tackle the switching over to C - should I open an issue to keep track of it, or should I just open a pull request once I have something working?

@astrofrog (Contributor Author)

@jaimefrio - I can also look into applying the same kind of optimization to histogramdd (and by extension histogram2d).


# We set a block size, as this allows us to iterate over chunks when
# computing histograms, to minimize memory usage.
BLOCK = 65536
Contributor

This seems poorly chosen: this is the size of one L1 cache, and numpy likes its copies, so you always end up wiping it.
I get a 30% performance increase for N = 20000 when the block is 4 times smaller, though the same can be achieved by doing the keep computation in place.

Contributor

Ah, it's not the cache (it is very hard to make use of L1 with numpy, unfortunately); it removes 200,000 page faults. 64k seems to trigger some different code path in glibc, not sure why yet; glibc usually only switches allocation mode at much larger values.

Contributor Author

Just to clarify, I just used the value that was set for the original slow method, but as I indicated in #6100 (comment) for large arrays we'd actually benefit from a slightly larger block size.

Rather than spend too much time trying to find a one-size-fits-all block size, I would suggest leaving the block size as it is for now, since this is code that I plan to try to replace with a C iteration (as @jaimefrio and I were discussing).

However, I agree that the keep operation could be done in-place instead. I'll look at this now.

Contributor Author

@juliantaylor - actually, could you clarify what you mean by doing the keep computation in place? I tried doing:

keep = tmp_a >= mn
np.logical_and(keep,  tmp_a <= mx, out=keep)

but I don't see any improvements. Did you mean something else?

I also just checked with BLOCK=65536/4 for different array sizes from 1 to 1e8, but couldn't find any size for which this was faster than the current block size.

Contributor

this is more readable:

        keep = (tmp_a >= mn)
        keep &= (tmp_a <= mx)

I don't really understand what is going on, though my simple test case seems to trigger some bad memory allocation pattern. If I add another histogram call of another size it disappears.
It still might be a good idea, as there is at least one case where it's significantly better and it just adds one line.

Contributor Author

Ok, done in 46b28f30539bacc19e803f4392fab95fc3c127db!

@jaimefrio (Member)

6 commits seem a few too many: if you can squash them into a single one, we can put it in.

@astrofrog force-pushed the faster-histogram-equal-bins branch from 46b28f3 to 34b582a on July 23, 2015 at 07:17
@astrofrog (Contributor Author)

@jaimefrio - done!

# is 2x as fast) and it results in a memory footprint 3x lower in the
# limit of large arrays.
for i in arange(0, len(a), BLOCK):
    tmp_a = a[i:i+BLOCK]
Member

I think the correct iteration limit is something like range(0, ((len(a) - 1) // BLOCK + 1) * BLOCK, BLOCK). With the current limit an array of size a multiple of BLOCK will run an extra iteration on an empty array.

Also, I think using Python's range here is better than NumPy's arange, as it will be an iterator in Py3K, and in Python 2 building a list is typically faster than an array.

Contributor Author

Should I change this for the 'original' method too? (I copied it from there)

Contributor Author

@jaimefrio - are you sure about the extra iteration? If len(a) is a multiple of BLOCK, then because range is exclusive at the upper end, I think it does the right thing:

In [3]: np.arange(0, 30, 10)
Out[3]: array([ 0, 10, 20])

(but yes, we could simply change to Python's range)

Member

Yes, and a[20:20+10] is the full last block.
edit: maybe the confusing thing is that python ranges do not care about out of bounds.
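
A quick illustration with small numbers, purely to show the slicing behaviour:

import numpy as np

a = np.arange(30)
BLOCK = 10
for i in range(0, len(a), BLOCK):   # i takes the values 0, 10, 20
    print(a[i:i + BLOCK].shape)     # every chunk, including the last, has 10 elements
# Slicing also clips quietly past the end, e.g. a[30:40] is simply empty.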

Contributor Author

@seberg - so you agree that the only thing that needs changing is arange -> range, but otherwise it's ok? (I want to make sure I wait until there is a consensus, otherwise if I amend the existing commit and force-push it might get rid of this thread.)

Member

Sorry, it would look like it to me; it was a bit of a stupid comment, I misread the rest.

Contributor Author

@jaimefrio - do you agree? If so, I will make the arange -> range change.

Member

And that's why you don't review code after midnight... You, and whoever wrote the first version, are absolutely right, I was absolutely wrong. I won't make you push yet another change for the range/arange thing.

jaimefrio added a commit that referenced this pull request Jul 23, 2015
ENH: Faster algorithm for computing histograms with equal-size bins
@jaimefrio merged commit eb7104e into numpy:master on Jul 23, 2015
@jaimefrio (Member)

Thanks a lot Thomas.

@astrofrog (Contributor Author)

@jaimefrio - thanks for all your comments! Once I get a chance (reasonably soon hopefully) I'll investigate some of the further improvements discussed here, including for example switching to C for part of the code.

@@ -189,40 +199,86 @@ def histogram(a, bins=10, range=None, normed=False, weights=None,
if mn == mx:
    mn -= 0.5
    mx += 0.5
# At this point, if the weights are not integer, floating point, or
# complex, we have to use the slow algorithm.
Member

What is the rationale for this? What type of weight would not work here?

@astrofrog (Contributor Author) commented Oct 19, 2017

I can't remember for sure, but I think e.g. object arrays work in the weights array:

In [7]: np.histogram([1,2,3], weights=np.array([Decimal(1), Decimal(2.2), Decimal(4.4)], dtype=np.object))
Out[7]: 
(array([Decimal('1'), Decimal('0'), Decimal('0'), Decimal('0'),
        Decimal('0'), Decimal('2.200000000000000177635683940'),
        Decimal('0E-27'), Decimal('0E-27'), Decimal('0E-27'),
        Decimal('4.400000000000000355271367880')], dtype=object),
 array([ 1. ,  1.2,  1.4,  1.6,  1.8,  2. ,  2.2,  2.4,  2.6,  2.8,  3. ]))

Member

Yep, that's it, thanks - you're guarding bincount, which only works for the types you list there.
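
For context, object arrays such as the Decimal weights above have dtype.kind == 'O', so the integer/float/complex check discussed above routes them to the original slow algorithm:

from decimal import Decimal
import numpy as np

weights = np.array([Decimal(1), Decimal('2.2')], dtype=object)
print(weights.dtype.kind)                       # 'O'
print(weights.dtype.kind in ('i', 'f', 'c'))    # False, so the slow path is used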
