Skip to content

ENH: added axis param for np.count_nonzero #7177

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 5, 2016
Merged

ENH: added axis param for np.count_nonzero #7177

merged 1 commit into from
Aug 5, 2016

Conversation

gfyoung
Copy link
Contributor

@gfyoung gfyoung commented Feb 3, 2016

Addresses feature request in #391 to add an axis parameter to np.count_nonzero.

Duplicate of #7138 after a forced push on another branch accidentally reset my origin branch for that PR to master, closing it automatically (I had reset hard on the #7138 branch to master so I could do the C refactoring @jaimefrio suggested in the spirit of #4330, whoops).

@homu
Copy link
Contributor

homu commented Feb 14, 2016

☔ The latest upstream changes (presumably #7246) made this pull request unmergeable. Please resolve the merge conflicts.

@charris charris added 01 - Enhancement 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes component: numpy._core and removed 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes labels Feb 14, 2016
@gfyoung
Copy link
Contributor Author

gfyoung commented Feb 29, 2016

@jaimefrio : Unfortunately, I have not been able to make much progress on refactoring the code into C. However, I think it should be good to merge as is since it performs relatively well, and the tests pass. If someone else wants to take a look at this, that would be great as well!

@gfyoung
Copy link
Contributor Author

gfyoung commented Mar 6, 2016

Can somebody take a look at this? I've been pinging this PR for a couple of weeks now with no response from anyone. The code should be ready to land.

@homu
Copy link
Contributor

homu commented Mar 12, 2016

☔ The latest upstream changes (presumably #7346) made this pull request unmergeable. Please resolve the merge conflicts.

@gfyoung
Copy link
Contributor Author

gfyoung commented Mar 12, 2016

Alright, this is the second time I've encountered a merge conflict on this PR. Could someone take a look at this and try to land it?

@homu
Copy link
Contributor

homu commented Mar 13, 2016

☔ The latest upstream changes (presumably #7178) made this pull request unmergeable. Please resolve the merge conflicts.

@shoyer
Copy link
Member

shoyer commented Mar 14, 2016

This code currently has a lot of branches. Given the apparent lack of existing test coverage, I'm not even sure all of them are being exercised. It needs to test every valid dtype, and to test invalid (e.g., axis='foo' or axis=3 for a 2d array), negative and tuple axes (e.g., axis=[0, 1]).

I have also have some broader concerns about count_nonzero:

  1. The behavior for non-numeric dtypes is not documented. What exactly does it mean to be "zero"? Apparently None and '' are all considered zero for object arrays:

    In [13]: x = np.array([None, 0, '', 12])
    
    In [14]: x
    Out[14]: array([None, 0, '', 12], dtype=object)
    
    In [15]: np.count_nonzero(x)
    Out[15]: 1
    

    Is this behavior a bug or a feature? I don't know.

  2. If (x != 0).sum() is faster and more explicit, what's the point of even encouraging the use of this function?

return (multiarray.count_nonzero(a) if axis is None
else (a != '').sum(axis=axis, dtype=intp))

return (multiarray.count_nonzero(a) if axis is None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What falls through to this case? Does it really make sense to use a potentially very slow loop with apply_along_axis?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I covered most of the common cases above, and when the objects start getting really weird (i.e. dtype=object cases), behaviour is unfortunately quite unpredictable with regards to how it will behave when counting along multiple axes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we had a simple is_nonzero function that would return a boolean array. Then we could simply use sum here instead of apply_along_axis (which feels very hacky).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bleh, I was going to say that casting-to-bool was is_nonzero, but it's not true -- string->bool type coercion doesn't check truthiness, it tries to parse the strings or something :-/. I guess internally we don't care about truthiness except in the case of if arr: ..., which can be handled by checking shape + count_nonzero?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@njsmith: Yeah...I hit my head against that option as well. Needless to say, my definition of 'nonzero-ness' has been forever altered...:smile:

@njsmith
Copy link
Member

njsmith commented Mar 14, 2016

What exactly does it mean to be "zero"?

I haven't looked at the rest, but I can explain the name :-). In py2, "nonzero" is the official name for the property that's more commonly known as "truthiness" (see __nonzero__). This is confusing and in py3 they fixed this (__nonzero__ is now __bool__), but this function's name remains as collateral damage.

@gfyoung
Copy link
Contributor Author

gfyoung commented Mar 14, 2016

@shoyer: The reason why a != 0 isn't used by itself is because the definition of non-zero, as @njsmith enumerated, is unfortunately not as clear-cut as it seemed to be at the surface. I too had that thought initially, and then numpy tests told me otherwise.

@shoyer
Copy link
Member

shoyer commented Mar 14, 2016

OK, that makes sense for the definition of "nonzero". It would be good to document that :).

@gfyoung
Copy link
Contributor Author

gfyoung commented Mar 14, 2016

@njsmith : If we could rename it, what should it be called? That feels like a FutureWarning of some kind (i.e. "WARNING: name change coming soon!"), maybe not for this PR but at least for a follow-up.

@gfyoung
Copy link
Contributor Author

gfyoung commented Mar 14, 2016

@shoyer: +1 for the documentation. That has been added along with [ci skip] to this PR until there are other code-related changes to make OR if people deem this merge-able.

@gfyoung
Copy link
Contributor Author

gfyoung commented Mar 21, 2016

@shoyer, @njsmith : Is this good to merge? I've made all of the requested changes, and I haven't seen any complaints in the past week or so.

@shoyer
Copy link
Member

shoyer commented Mar 21, 2016

This still needs comprehensive test coverage, as I mentioned in my comment above

@gfyoung
Copy link
Contributor Author

gfyoung commented Mar 21, 2016

@shoyer: Ah, good point. Thanks for the reminder!

@gfyoung
Copy link
Contributor Author

gfyoung commented Mar 26, 2016

@shoyer, @njsmith : Added lots of new tests for count_nonzero that I think are fairly comprehensive. Travis and Appveyor give the green light. If there is nothing else, should be ready to merge.

@gfyoung gfyoung closed this Apr 6, 2016
@gfyoung gfyoung reopened this Apr 6, 2016
@gfyoung
Copy link
Contributor Author

gfyoung commented Apr 13, 2016

@shoyer , @njsmith : I haven't seen any complaints / issue from either of you or anyone regarding this PR for almost two weeks. Can this be merged?

@charris charris added this to the 1.12.0 release milestone Apr 16, 2016
@charris
Copy link
Member

charris commented Apr 16, 2016

Could the folks involved here take another look.

size = (5, 5, 5, 5)
msg = "Mismatch for axis: %s"

m = randint(-100, 100, size=size)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a random seed to ensure that this test can't fail in a stochastic fashion. I recommend using np.random.RandomState.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@gfyoung
Copy link
Contributor Author

gfyoung commented May 28, 2016

@rgommers : I can try rewriting the current _validate_axis function and see if my new checks are compatible with the current code. Otherwise, I'll add a new one.

@gfyoung
Copy link
Contributor Author

gfyoung commented May 30, 2016

@rgommers : simplicity seems best - ended up using the current _validate_axis function, as that seems to suffice on second look. Travis and Appveyor are happy with this.

@homu
Copy link
Contributor

homu commented Jun 3, 2016

☔ The latest upstream changes (presumably #7689) made this pull request unmergeable. Please resolve the merge conflicts.

@gfyoung
Copy link
Contributor Author

gfyoung commented Jun 7, 2016

@rgommers : any update on this?

@gfyoung
Copy link
Contributor Author

gfyoung commented Jun 8, 2016

Travis is happy, except for the spurious failure on master from the LRU cache PR. Will undo the [ci skip] in the commit once this error has been fixed. Otherwise, this should be ready to merge if there are no other concerns.

@gfyoung gfyoung closed this Jun 29, 2016
@gfyoung gfyoung reopened this Jun 29, 2016
@gfyoung
Copy link
Contributor Author

gfyoung commented Jul 6, 2016

@rgommers : any update on this?

return a.sum(axis=axis, dtype=np.intp)

if issubdtype(a.dtype, np.number):
return (a != 0).sum(axis=axis, dtype=np.intp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allocates a new boolean array of the same shape as the original. I thought the whole point was to avoid doing that...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@madphysicist: When was that the whole point?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought wrong apparently. It just seems a bit hacky to do that with a function that is implemented in C exactly to avoid such an operation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hacky, a bit, but it does get the job done without too much sadness.

@madphysicist
Copy link
Contributor

+1 for the overall idea. I started something in C along similar lines. My main objection is the creation of boolean temp arrays in corner cases. I do like the idea of adding the results along multiple axes once you have applied it to one.

@gfyoung
Copy link
Contributor Author

gfyoung commented Jul 8, 2016

@madphysicist : thanks for taking a look!

Would be great to get feedback from the maintainers (@rgommers or someone else) too so that this can actually be merged.

@gfyoung
Copy link
Contributor Author

gfyoung commented Jul 13, 2016

Can somebody take a look at this?

@perimosocordiae
Copy link
Contributor

+1 from the peanut gallery.

I think getting the axis kwarg in the official API is the most important consideration, even if the underlying code isn't currently the most efficient possible approach. Further optimization and cleanup can happen later, behind the scenes and without user-visible changes.

@madphysicist
Copy link
Contributor

I second that. I would really like to see this in numpy regardless of whether it could be done a different way.

@gfyoung
Copy link
Contributor Author

gfyoung commented Aug 5, 2016

Can somebody take a look at this?

@shoyer
Copy link
Member

shoyer commented Aug 5, 2016

OK, this looks pretty sane to me at this point, so I will merge after CI tests pass.

@shoyer shoyer merged commit 31a95d9 into numpy:master Aug 5, 2016
@gfyoung gfyoung deleted the count_nonzero_axis branch August 5, 2016 07:47
@rgommers
Copy link
Member

rgommers commented Aug 6, 2016

Great, thanks @gfyoung @shoyer & all.

eric-wieser added a commit to eric-wieser/numpy that referenced this pull request Oct 18, 2017
Fixes numpy#9728

This bug was introduced with the `axis` keyword in numpy#7177, as a misguided optimization.
array([2, 3])

"""
if axis is None or axis == ():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd consider this a bug, described in #9728.

theodoregoetz pushed a commit to theodoregoetz/numpy that referenced this pull request Oct 23, 2017
Fixes numpy#9728

This bug was introduced with the `axis` keyword in numpy#7177, as a misguided optimization.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants