ENH: Faster algorithm for computing histograms with equal-size bins #6100
Conversation
@@ -175,54 +175,102 @@ def histogram(a, bins=10, range=None, normed=False, weights=None,
        raise AttributeError(
            'max must be larger than min in range parameter.')

    # Histogram is an integer or a float array depending on the weights.
    if weights is None:
        ntype = int
For indices and counts we always use np.intp rather than int.
Ah ok - but did you mean to comment on the line with indices = tmp_a.astype(int)? (ntype is not used for indices - the code you commented on is the same as before, I just moved it)
Oh, I see! It is then a bug in the current implementation. An unlikely bug, but the histogram of an array of 2**31 items on a 64-bit system, all of which end up in the same bin, will show a -1 count for that bin, which is not quite what we are after. Since you are at it, you may as well fix it, since it does affect your code: I'm pretty sure bincount always returns either np.intp (without weights) or np.double (with weights).
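As a quick added illustration of those return types (the int64 shown assumes a 64-bit platform, where np.intp is int64):
In [1]: import numpy as np
In [2]: np.bincount([0, 1, 1]).dtype
Out[2]: dtype('int64')
In [3]: np.bincount([0, 1, 1], weights=[0.5, 1.0, 2.0]).dtype
Out[3]: dtype('float64')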
@jaimefrio - ah, I see. Actually, testing for this uncovered a bug in bincount:
In [1]: import numpy as np
In [2]: values = np.ones(2**32, dtype=np.intp)
In [3]: np.bincount(values)
/Volumes/Raptor/miniconda3/envs/production/bin/python.app: line 3: 7001 Segmentation fault: 11 /Volumes/Raptor/miniconda3/envs/production/python.app/Contents/MacOS/python "$@"
Do you see the same issue?
One generic comment: by using [...]
@jaimefrio - thanks for the feedback! I agree about needing to be careful about the weights type. I'm not sure if the code was originally intended to be used with non-float weights since there was the comment [...]
@jaimefrio - I've now made sure that if the weights are complex, we do the right thing with bincount, and if they are not complex, float, or int, we revert to the original algorithm (and I added a test for all these). Note that there is still an issue with [...]
Force-pushed from 1c73e0a to 6fc0889.
Quick question: in one of my use cases, instead of providing the number of bins, I provide an array of equally spaced bins. The obvious fix is on my end, but I believe you are, technically speaking, not using the optimized code for the case where an array of equally spaced bins is provided. Is that correct? Would there be problems associated with doing that check as well? I.e. something along the lines of:
bin_widths = np.diff(bins)
if not iterable(bins) or np.all(bin_widths == bin_widths[0])
There's some overhead for that check of course, and there are probably rounding issues, so the simplest way out may just be to use the function as intended.
@ctk3b, you might want [...] The issue with checking the bin widths is that it can fail due to rounding error, since a set of bins that are meant to be equally spaced doesn't have to pass that check in floating point, e.g.:
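A minimal added illustration (the np.linspace edges here are just an assumed example, not from the original comment):
In [1]: import numpy as np
In [2]: widths = np.diff(np.linspace(0, 1, 11))
In [3]: np.all(widths == widths[0])
Out[3]: False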
edit: doh, I see you are aware that rounding issues exist.
Yeah, that's the tricky part I guess. One thing to maybe consider, if the cost of the check is acceptable, is to do a [...] *but they have to be absolutely sure that's what they want.
I personally don't think that we should check for equal bin spacing in the bins array (if it's not an int), but I'm happy to add it if there is a consensus. An alternative is to simply make the docstring clearer and add a note about how to get optimal performance.
@jaimefrio - apart from that question, I think this is pretty much ready for a final review.
# At this point, if the weights are not integer, floating point, or
# complex, we have to use the slow algorithm.
if weights is not None and not weights.dtype.kind in ('i', 'f', 'c'):
I think this would be better implemented as:
if weights is not None and not (np.can_cast(weights.dtype, np.double) or
                                np.can_cast(weights.dtype, np.complex)):
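As a small added illustration of the difference: boolean weights, for example, have dtype kind 'b', so the kind-based check would send them down the slow path even though they cast safely to double:
In [1]: import numpy as np
In [2]: np.array([True, False]).dtype.kind
Out[2]: 'b'
In [3]: np.can_cast(np.bool_, np.double)
Out[3]: True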
It mostly LGTM. I have quite a few questions/comments, but aside from the int-that-should-be-an-intp, there is nothing fundamentally wrong that I see.
@jaimefrio - I'll reply here about the blocks/chunks to make sure the comment stays here in the long term. I copied the use of the blocks from the older/slower method. It turns out that doing this results in a factor of 2 improvement in performance. Let's take the case of a 10^8 array. I timed the histogram with the new method in this PR for several block sizes (the timing table is not reproduced here).
While slightly inelegant, I do think a 2x improvement in speed and a 3x improvement in memory usage is worth it. In fact, I think that a 3x worsening in memory usage would be considered a performance regression by many, even if the code ran 5x faster. Do you agree that it makes sense to leave it as-is? I can add a comment about why this is done.
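A rough added sketch of the chunked accumulation being discussed (names and structure are illustrative, not the exact code in the PR):
import numpy as np

def chunked_bincount(indices, nbins, block=65536):
    # Accumulate the histogram block by block, so the temporaries created for
    # each chunk stay at a fixed, small size instead of scaling with the input.
    counts = np.zeros(nbins, dtype=np.intp)
    for i in range(0, len(indices), block):
        counts += np.bincount(indices[i:i + block], minlength=nbins)
    return counts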
I'll buy that 2x. Thanks for taking the time to substantiate with measurements all my existential doubts. Correct me if I am wrong, but it seems we both agree that, eventually, the right thing to do would be to rewrite at least this particular path in C, right? If you could fix that int-that-should-be-an-intp and get rid of some of the extraneous white space, I'll go ahead and merge it.
@jaimefrio - yes, I think in the long term we could benefit from switching over to C and I'd be happy to look into it (but that will need to wait a few weeks). I'll finish up the current pull request now so that you can go ahead and merge it.
Force-pushed from 5ab0ad5 to 4325078.
@jaimefrio - I've made the fixes you requested. I'd be interested in trying to tackle the switching over to C - should I open an issue to keep track of it, or should I just open a pull request once I have something working?
@jaimefrio - I can also look into applying the same kind of optimization to the [...]
# We set a block size, as this allows us to iterate over chunks when
# computing histograms, to minimize memory usage.
BLOCK = 65536
this seems poorly chosen: this is the size of one L1 cache, and numpy likes its copies, so you always end up wiping it.
I get a 30% performance increase for N = 20000 when the block is 4 times smaller, though the same can be achieved by doing the keep computation in place.
ah, it's not the cache (it is very hard to make use of L1 with numpy unfortunately); it removes 200,000 page faults. 64k seems to trigger some different code path in glibc, not sure why yet - glibc usually only switches allocation mode at much larger values.
Just to clarify, I just used the value that was set for the original slow method, but as I indicated in #6100 (comment) for large arrays we'd actually benefit from a slightly larger block size.
Rather than spend too much time trying to find a one-size-fits-all block size, I would suggest leaving the block size for now since this is code that I plan to try and replace with a C iteration (as @jaimefrio and I were discussing).
However, I agree that the keep operation could be done in-place instead. I'll look at this now.
@juliantaylor - actually, could you clarify what you mean by doing the keep computation in place? I tried doing:
keep = tmp_a >= mn
np.logical_and(keep, tmp_a <= mx, out=keep)
but I don't see any improvements. Did you mean something else?
I also just checked with BLOCK=65536/4 for different array sizes from 1 to 1e8, but couldn't see any for which this was faster than the current block size.
this is more readable:
keep = (tmp_a >= mn)
keep &= (tmp_a <= mx)
I don't really understand what is going on, though my simple testcase seems to trigger some bad memory allocation pattern. If I add another histogram call of another size it disappears.
It still might be a good idea, as there is at least one case where it's significantly better and it just adds one line.
Ok, done in 46b28f30539bacc19e803f4392fab95fc3c127db!
6 commits seem a few too many: if you can squash them into a single one, we can put it in.
Force-pushed from 46b28f3 to 34b582a.
@jaimefrio - done!
# is 2x as fast) and it results in a memory footprint 3x lower in the
# limit of large arrays.
for i in arange(0, len(a), BLOCK):
    tmp_a = a[i:i+BLOCK]
I think the correct iteration limit is something like range(0, ((len(a) - 1) // BLOCK + 1) * BLOCK, BLOCK). With the current limit, an array of size a multiple of BLOCK will run an extra iteration on an empty array.
Also, I think using Python's range here is better than NumPy's arange, as it will be an iterator in Py3K, and in Python 2 building a list is typically faster than an array.
Should I change this for the 'original' method too? (I copied it from there)
@jaimefrio - are you sure about the extra iteration? If len(a) is a multiple of BLOCKSIZE, then because range is exclusive at the upper end, I think it does the right thing:
In [3]: np.arange(0, 30, 10)
Out[3]: array([ 0, 10, 20])
(but yes, we could simply change to Python's range)
Yes, and a[20:20+10] is the full last block.
edit: maybe the confusing thing is that python ranges do not care about out of bounds.
@seberg - so you agree that the only thing that needs changing is arange -> range, but otherwise it's ok? (I want to make sure I wait until there is a consensus; otherwise, if I amend the existing commit and force push, it might get rid of this thread)
Sorry, it would look like it to me; it was a bit of a stupid comment, I misread the rest.
@jaimefrio - do you agree? If so, I will make the arange -> range change.
And that's why you don't review code after midnight... You, and whoever wrote the first version, are absolutely right, I was absolutely wrong. I won't make you push yet another change for the range/arange thing.
ENH: Faster algorithm for computing histograms with equal-size bins
Thanks a lot Thomas.
@jaimefrio - thanks for all your comments! Once I get a chance (reasonably soon hopefully) I'll investigate some of the further improvements discussed here, including for example switching to C for part of the code.
@@ -189,40 +199,86 @@ def histogram(a, bins=10, range=None, normed=False, weights=None,
    if mn == mx:
        mn -= 0.5
        mx += 0.5

    # At this point, if the weights are not integer, floating point, or
    # complex, we have to use the slow algorithm.
What is the rationale for this? What type of weight would not work here?
I can't remember for sure, but I think e.g. object arrays work in the weights array:
In [7]: np.histogram([1,2,3], weights=np.array([Decimal(1), Decimal(2.2), Decimal(4.4)], dtype=np.object))
Out[7]:
(array([Decimal('1'), Decimal('0'), Decimal('0'), Decimal('0'),
Decimal('0'), Decimal('2.200000000000000177635683940'),
Decimal('0E-27'), Decimal('0E-27'), Decimal('0E-27'),
Decimal('4.400000000000000355271367880')], dtype=object),
array([ 1. , 1.2, 1.4, 1.6, 1.8, 2. , 2.2, 2.4, 2.6, 2.8, 3. ]))
Yep, that's it, thanks - you're guarding bincount, which only works for the types you list there.
About
When bins are equal-sized, we don't have to use searchsorted to find which bins values fall in; we can instead just scale and convert the values to the bin indices and use bincount.
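A minimal sketch of that idea (the function below is illustrative, with simplified edge handling, not the exact code in the PR):
import numpy as np

def uniform_bin_histogram(a, nbins, mn, mx):
    # Map each value directly onto its bin index by scaling, then count with
    # bincount instead of searching the bin edges with searchsorted.
    a = np.asarray(a, dtype=float)
    a = a[(a >= mn) & (a <= mx)]
    indices = ((a - mn) * (nbins / (mx - mn))).astype(np.intp)
    indices[indices == nbins] -= 1  # values equal to mx belong in the last bin
    return np.bincount(indices, minlength=nbins)

# e.g. uniform_bin_histogram([0.1, 0.4, 0.4, 0.9], 2, 0.0, 1.0) -> array([3, 1])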
Results
I carried out speed tests with arrays going from 1 to 10^8 elements, with randomly sampled values, comparing the old and the new histogram code (the original benchmark snippet and timing plot are not reproduced here).
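A rough reconstruction of that kind of benchmark (the array sizes, the np.random.random sampling, and the timeit call below are assumptions, not the original snippet):
import numpy as np
import timeit

for n in (10**4, 10**6, 10**8):
    values = np.random.random(n)
    t = timeit.timeit(lambda: np.histogram(values, bins=100), number=3) / 3
    print(n, t)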
The new method is always faster, and for more than 10^5 elements, the speedup is a factor of ~10x.
I then checked that the memory usage is not increased when binning an array with 10^9 values: there is no memory penalty!
Note that I originally implemented this without the iteration over blocks, and not only did it use more memory, it was 2x as slow as the version with looping over chunks (to my surprise).
Future work
If this PR is merged, we can also consider switching to using C or Cython for the part where the bin indices are computed. This could provide a further ~3x improvement in performance.
Notes
This partially addresses #6099 (np.histogram could be made even faster by switching to C in places)
A similar speedup could be made for histogramdd.
Of course, given how extensively this function is used, it would be great if people could subject it to more rigorous testing (I will also do some more testing to see if I can find cases where there is a performance regression).