ENH: add an axis argument to np.bincount #4330


Closed
jaimefrio wants to merge 2 commits

Conversation

@jaimefrio (Member)

This adds an `axis` argument to `np.bincount`, which can now take multidimensional
arrays and do the counting over an arbitrary number of axes. `axis` defaults
to `None`, which does the counting over all the axes, i.e. over the flattened
array.

The shape of the output is computed by removing from `list` (the input array)
all dimensions in `axis`, and appending a dimension of length
`max(max(list) + 1, minlength)` to the end. `out[..., n]` will hold the number
of occurrences of `n` at the given position over the selected axes.

If a `weights` argument is provided, its shape is broadcast with `list`
before removing the axes. In this case, `axis` refers to the axes of `list`
*before* broadcasting, and `out[..., n]` will hold the sum of `weights` over
the selected axes, at all positions where `list` takes the value `n`.

The general case is handled with nested iterators, but shortcuts without
having to set up an iterator are provided for 1D cases, with no performance
loss against the previous version.

As a plus, this PR also solves #823, by providing specialized functions for
all integer types to find the max. There are also specialized functions for
all integer types for counting and doing weighted counting.
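
For illustration, here is how the proposed semantics would play out (the PR was ultimately closed, so these calls show intended behavior, not anything in released NumPy):

>>> a = np.array([[0, 1, 1, 2],
...               [2, 2, 0, 0]])
>>> np.bincount(a, axis=1)  # proposed API, never merged
array([[1, 2, 1],
       [2, 0, 1]])

The output shape is (2, 3): axis 1 is removed from a.shape == (2, 4), and a bins dimension of length max(a) + 1 == 3 is appended.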

@@ -4987,48 +4987,58 @@ def luf(lamdaexpr, *args, **kwargs):
 add_newdoc('numpy.lib._compiled_base', 'bincount',
     """
-    bincount(x, weights=None, minlength=None)
+    bincount(list, weights=None, minlength=None, axis=None)

A reviewer (Member) commented:

I think `list` is an unfortunate choice for a variable name; `x` was fine, but maybe `arr` or `a` instead.

@jaimefrio (Member, Author) replied:

I was basically trying to make explicit the following current behavior:

>>> np.bincount(list=[3,4,5])
array([0, 0, 0, 1, 1, 1], dtype=int64)

Internally, the first argument is a kwarg named `list`. I agree that it is an unfortunate choice, and will change it both internally and in the docstring to `arr`. Since it is an undocumented feature, I understand this change can be made without worries, right?

@charris (Member) commented Feb 22, 2014

I bailed early here ;) Looking at the code size relative to the original function, I prefer the original. The use of type-specific code complicates things enormously. A simple, if not so fast, approach would be to do this in the Python call, just calling bincount in a loop and using max to set up the output array shape.

I guess I'm saying this is a premature optimization ;)

This does suggest we could use a standard function dispatcher.
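
For concreteness, a minimal sketch of that loop-based approach (the `bincount_axis` name and the single-axis handling are illustrative choices, not code from this PR):

import numpy as np

def bincount_axis(a, weights=None, minlength=0, axis=None):
    # Sketch of the pure-Python fallback: size the output with one
    # upfront max(), then call np.bincount once per remaining position.
    a = np.asarray(a)
    if axis is None:
        w = None if weights is None else np.asarray(weights).ravel()
        return np.bincount(a.ravel(), w, minlength)
    a = np.moveaxis(a, axis, -1)          # put the counted axis last
    if weights is not None:
        weights = np.moveaxis(np.asarray(weights), axis, -1)
    nbins = max(int(a.max()) + 1, minlength)
    out = np.zeros(a.shape[:-1] + (nbins,),
                   dtype=np.intp if weights is None else np.double)
    for idx in np.ndindex(a.shape[:-1]):  # loop over the kept axes
        w = None if weights is None else weights[idx]
        out[idx] = np.bincount(a[idx], w, minlength=nbins)
    return out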

@charris (Member) commented Feb 22, 2014

To avoid arr_bincount doing extra work, we could add an argument to set the length and discard any bins greater than that. Maybe `clip`, meaning `minlength` is also a maxlength.
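
A sketch of what that could look like (the `clip` keyword never landed; this just emulates the proposed semantics at the Python level by dropping out-of-range values before counting):

import numpy as np

def bincount_clipped(x, length, weights=None):
    # Emulate "minlength is also a maxlength": discard values that would
    # land in bins >= length, then count into exactly `length` bins.
    x = np.asarray(x)
    keep = x < length
    w = None if weights is None else np.asarray(weights)[keep]
    return np.bincount(x[keep], w, minlength=length)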

@jaimefrio (Member, Author)

Yes, I probably got a little carried away optimizing performance once I got it running. The type-specific code is there to keep the performance for small arrays vs. the old function. There was also the matter of handling uint64 inputs, but there have to be other options if that is all you want done. If you think this is bad, you should have seen it with specialized loops for the weights types, to allow complex numbers there also! ;-)

It is very easy to not do any of this and simply have the iterator buffer and cast everything to npy_intp/npy_double, with a single inlined loop. There is a performance hit for small arrays in setting up the iterator (they then run about 2x slower), but making contiguous copies of 1D arrays when they are not behaved or of the right type will probably take care of most of that.

Let me try to rewrite with a leaner approach: I promise you won't have to read over so much code!

@charris (Member) commented Feb 22, 2014

I think a `clip` keyword might be useful in any case.

@jaimefrio (Member, Author)

I have just pushed a slimmed-down version of ndbincount. As is, it has all the bells and whistles of broadcasting using the `axis` parameter, but is about as fast as the original version for 1D arrays. It could be simplified a little by always using the iterator, but then it would be 2x slower than the original code for small arrays. It's still several hundred more lines of code...

@josef-pkt

Just as a cheerleader for this, since another issue reminded me of it:

statsmodels would have quite a lot of use for a vectorized version of bincount; I have always missed that.
I never hit the "send" button for my reply almost a year ago; the main questions were about broadcasting, and I don't know whether they are still relevant.
In any case, we can adjust to whatever pattern numpy has now for partial reduction, and we will use any vectorized version if it's more efficient than the plain loops.

@jaimefrio (Member, Author)

I am closing this for now. There are several functions out there that could use an axis argument, and bincount is one of them. But rather than bloating them all with code, the proper approach seems to me to build some intermediate layer over nditer that abstracts the complexity away. It would probably make for a nice GSoC project...

@jaimefrio closed this Mar 9, 2015
@rgommers (Member) commented Mar 9, 2015

@jaimefrio you could still consider @charris's suggestion to do this in a for-loop in Python. That would keep the code simple, and non-optimal speed for something that doesn't work at all now is OK :) Reason to do that: the intermediate layer might take a long time to materialize, and this does look useful...

@mhvk (Contributor) commented May 22, 2016

The mailing list conversation reminded me: the multidimensional weight, single-dimension index case of this was something I was very much looking forward to. I can see why this was closed, but it would still be really great to have...
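
For reference, that case can at least be emulated today with np.add.at, albeit without bincount's speed (a sketch, not part of this PR):

import numpy as np

x = np.array([0, 2, 2, 1])            # 1-D bin indices, shared by all rows
w = np.array([[ 1.,  2.,  3.,  4.],   # 2-D weights, one row per channel
              [10., 20., 30., 40.]])
out = np.zeros((w.shape[0], x.max() + 1))
# Scatter-add each weight into the bin its index selects, row by row.
np.add.at(out, (np.arange(w.shape[0])[:, None], x), w)
# out -> [[ 1.,  4.,  5.],
#         [10., 40., 50.]]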

@eric-wieser (Member)

Relevant to #5197

@mhvk (Contributor) commented Mar 25, 2017

It would still be great to have this -- @eric-wieser, np.bincount is very fast; we'd have to be very careful about performance before thinking of moving this to a gufunc (I analyze GBs of baseband data with it, so I care...).

@eric-wieser (Member)

I would hope that the cost of a gufunc would only be a small constant, whereas it sounds like you're using it on large inputs anyway?

@mhvk (Contributor) commented Mar 25, 2017

True, I just worry about the iterator not being optimized as much, but with a good inner loop I guess it should be fine.

@eric-wieser (Member)

> I just worry about the iterator not being optimized as much

As much as what? Obviously iterating over something takes more time than assuming it has length 1...
