-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
ENH: add an axis argument to np.bincount #4330
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This adds an axis argument to np.bincount, which can now take multidimensional arrays and do the counting over an arbitrary number of axes. `axis` defaults to `None`, which does the counting over all the axes, i.e. over the flattened array. The shape of the output is computed by removing from `list` (the input array) all dimensions in `axis`, and appending a dimension of length `max(max(list) + 1, minlength)` to the end. `out[..., n]` will hold the number of occurrences of `n` at the given position over the selected axes. If a `weights` argument is provided, its shape is broadcasted with `list` before removing the axes. In this case, `axis` refers to the axes of `list` *before* broadcasting, and `out[..., n]` will hold the sum of `weights` over the selected axes, at all positions where `list` takes the value `n`. The general case is handled with nested iterators, but shortcuts without having to set up an iterator are provided for 1D cases, with no performance loss against the previous version. As a plus, this PR also solves numpy#823, by providing specialized functions for all integer types to find the max. There are also specialized functions for all integer types for counting and doing weighted counting.
@@ -4987,48 +4987,58 @@ def luf(lamdaexpr, *args, **kwargs): | |||
|
|||
add_newdoc('numpy.lib._compiled_base', 'bincount', | |||
""" | |||
bincount(x, weights=None, minlength=None) | |||
bincount(list, weights=None, minlength=None, axis=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think list
is an unfortunate choice for a variable name, x
was fine, but maybe arr
or a
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was basically trying to make explicit the following current behavior:
>>> np.bincount(list=[3,4,5])
array([0, 0, 0, 1, 1, 1], dtype=int64)
Internally, the first argument is a kwarg with label list
. I agree that it is an unfortunate choice, and will change it both internally and in the docstring to arr
. Since it is an undocumented feature, I understand this change can be made with out worries, right?
I bailed early here ;) looking at the code size here relative to the original function, I prefer the original. The use of type specific code complicates things enormously. A simple, if not so fast, approach, would be to do this in the python call, just calling bincount in a loop and using max to set up the output array shape. I guess I'm saying this is a premature optimization ;) This does suggest we could use a standard function dispatcher. |
To avoid |
Yes, I probably got a little carried away optimizing performance once I got it running. The type specific code, is there to keep the performance for small arrays vs the old function. There was also the thing of handling It is very easy to not do any of this and simply have the iterator buffer and cast everything to Let me try to rewrite with a leaner approach: I promise you want have to read over so much code! |
I think a clip keyword might be useful in any case. |
I have just pushed a slimmed down version of |
Just as cheerleader for this since another issue reminded me of it: statsmodels would have quite a lot of use for a vectorized version of bincount. I always missed that. |
I am closing this for now. There are several functions out there that could use an axis argument, and |
@jaimefrio you could still consider the suggestion of @charris to do this in a for-loop in Python. That would keep the code simple, and non-optimal speed for something that doesn't work at all now is OK:) Reason to do that: the intermediate layer might take a long time to materialize, and this does look useful..... |
The mailing list conversation reminded me: the multidimensional weight, single-dimension index case of this was something I was very much looking forward to. I can see why this was closed, but it would still be really great to have... |
Relevant to #5197 |
Would still be great to have this -- @eric-wieser - |
I would hope that the cost of a |
True, I just worry about the iterator not being optimized as much, but with a good inner loop I guess it should be fine. |
As much as what? Obviously iterating over something takes more time than assuming it has length 1... |
This adds an
axis
argument tonp.bincount
, which can now take multidimensionalarrays and do the counting over an arbitrary number of axes.
axis
defaultsto
None
, which does the counting over all the axes, i.e. over the flattenedarray.
The shape of the output is computed by removing from
list
(the input array)all dimensions in
axis
, and appending a dimension of lengthmax(max(list) + 1, minlength)
to the end.out[..., n]
will hold the numberof occurrences of
n
at the given position over the selected axes.If a
weights
argument is provided, its shape is broadcasted withlist
before removing the axes. In this case,
axis
refers to the axes oflist
before broadcasting, and
out[..., n]
will hold the sum ofweights
overthe selected axes, at all positions where
list
takes the valuen
.The general case is handled with nested iterators, but shortcuts without
having to set up an iterator are provided for 1D cases, with no performance
loss against the previous version.
As a plus, this PR also solves #823, by providing specialized functions for
all integer types to find the max. There are also specialized functions for
all integer types for counting and doing weighted counting.