
Use count_nonzero instead of sum for boolean arrays #8913


Open
jachymb opened this issue Apr 7, 2017 · 14 comments

Comments

@jachymb

jachymb commented Apr 7, 2017

Hello,
I was computing some correlations when I noticed that I can easily make an implementation significantly faster than corrcoef by replacing sum() with count_nonzero(). Therefore, I think that whenever sum() is called on a boolean array, it should use count_nonzero or similar instead of addition.
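As an illustrative aside (not part of the original report), the effect can be reproduced with a quick timing comparison; exact numbers depend on the machine and NumPy version:

import timeit
import numpy as np

x = np.random.rand(1_000_000) < 0.5   # boolean array

t_sum = timeit.timeit(lambda: x.sum(), number=100)
t_cnz = timeit.timeit(lambda: np.count_nonzero(x), number=100)
print(f"sum():           {t_sum:.4f} s")
print(f"count_nonzero(): {t_cnz:.4f} s")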

@njsmith
Member

njsmith commented Apr 7, 2017

If so then an even better solution would be to port the count_nonzero optimizations to boolean sum.

@jachymb
Author

jachymb commented Apr 8, 2017

That's basically what I thought.

@eric-wieser
Member

eric-wieser commented Jun 4, 2019

Note that boolean sum currently involves a type promotion of each buffer from bool to long, which is probably why it's slow.

You can time this via:

>>> x = np.arange(1000)

In [38]: %timeit np.count_nonzero(x)
5.97 µs ± 494 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [39]: %timeit np.count_nonzero(x, axis=0)
25.6 µs ± 607 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

since the latter just calls into sum, while the former uses the optimized call.
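As an illustrative aside (not part of the original comment), the same fallback can be seen on an actual boolean array; the axis= path goes through the promoted sum, while the plain call uses the dedicated counting loop:

import timeit
import numpy as np

x = np.random.rand(100_000) < 0.5   # boolean array

print(timeit.timeit(lambda: np.count_nonzero(x), number=10_000))          # counting loop
print(timeit.timeit(lambda: np.count_nonzero(x, axis=0), number=10_000))  # falls back to sum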

@mhvk
Contributor

mhvk commented Jun 4, 2019

Another way would be to implement @ahaldane's proposed .first and .count reduction methods.
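As a hedged aside (not part of the original comment): neither .first nor .count exists as a reduction method today, so the following is only a rough sketch of what they would correspond to in terms of current NumPy functions, not the actual proposal:

import numpy as np

mask = np.random.rand(1_000) < 0.5

count = np.count_nonzero(mask)   # roughly what a hypothetical count reduction would return
first = np.argmax(mask)          # index of the first True value (only meaningful if mask.any())
print(count, first)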

@seberg
Member

seberg commented Jun 4, 2019

It might work to create an "lb->l" loop, or similar, and insert it in front of the "ll->l" loop. It will require modifying the strange logic currently used to force the "ll->l" loop...

I am right now looking into changing the type resolution a bit, hoping to make it a bit saner before I attempt replacing it with a completely new version. But... it will mean ABI/API breakage of the type resolution and maybe the loop-finding slots (scipy doesn't use them, numba doesn't use them, and it is pretty crazy to use them in the first place, so hmmm).
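As an aside for readers unfamiliar with the "ll->l" notation (this note is not part of the comment): ufunc inner loops are keyed by dtype character codes, and the registered loops can be listed on any ufunc:

import numpy as np

# Each entry is one registered inner loop; '?' is bool, 'l' is long.
print(np.add.types)
# e.g. ['??->?', 'bb->b', ..., 'll->l', ..., 'OO->O']
# The idea above is to register an additional mixed "lb->l" loop that a boolean
# reduction could pick up, instead of promoting every buffer from bool to long.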

@eric-wieser
Member

I started trying to create an "lb->l" loop locally, but the speed-up was rather disappointing. Some parts of the patch are salvageable.

@seberg
Member

seberg commented Jun 4, 2019

@eric-wieser I guess the overhead is the issue with that patch, or was it just not very fast? Might be interesting.

There is some really strange logic in the whole TypeResolution about how scalars and the specified dtypes are handled. I am trying to pry that out a bit (and delete a bunch of code, because I do not believe it makes a difference).
I just hope that we might get away with simply removing the TypeResolution function slot and, if someone uses it, replacing it with a "Whoa, we did not expect anyone to do this, please contact us" message ;).

@eric-wieser
Member

There's a use case we had not considered in our type resolution discussion.

Normally type resolution takes input types and an optional output type, and deduces a full loop - so int,int->* resolves to int,int->int etc

For the case of a reduction, we don't actually know the input type yet, because we want it to match the output type. We're looking for a loop of the form *, in.dtype -> * with the requirement that the two *s match. I don't think this case was considered.
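As a concrete illustration of that reduction case (added here, not part of the original comment): for a boolean sum the loop actually chosen is not the one matching the input dtype ('??->?', which would just compute a logical OR), but one whose input type matches the long output:

import numpy as np

x = np.array([True, True, True])

print(np.add.reduce(x))              # 3    -- the promoted integer loop is used
print(np.add.reduce(x, dtype=bool))  # True -- forcing bool selects the '??->?' loop (logical OR)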

@eric-wieser
Member

Pushed the patch to my bool_countnonzero branch at https://github.com/eric-wieser/numpy/tree/bool_countnonzero, in case you want to take it further

@seberg
Member

seberg commented Jun 4, 2019

Yeah, the loop lookup will have to know that it is a reduction or accumulation (also because of other things, such as np.multiply.reduce in principle needing to count the number of items being reduced to decide the final output unit – not the dtype, though). Possibly, that would just not implement reduce for starters.
My guess is that, other than confirming that this is the case for the final loop chosen, we can trust users to provide a correct loop if they know about it.

We also have some types already specified/fixed, which makes the problem a bit more high-dimensional to cache, but that should be OK. About the scalar logic: I would like to move it to the beginning of the ufunc parsing, even before starting on the actual UfuncImpl stuff. I am slow with the big step, so I hope that such things may help prepare for it...

@mhvk
Contributor

mhvk commented Jun 4, 2019

Don't assume nobody uses their own type resolver... https://github.com/astropy/astropy/blob/master/astropy/_erfa/ufunc.c.templ#L518.

That said, I'd quite happily adapt to anything that is saner. Also, I recall being rather annoyed at when the functions get called – though I don't remember the details off the top of my head. Anyway, I would be very interested to look at ideas for how to change things.

@seberg
Member

seberg commented Jun 4, 2019

Seriously, ah dang... So maybe call it if it exists and otherwise call a new one... (and put a deprecation on it). The other problem is that, to get really sane in the long run, we need to put the TypeResolution behind a first dispatching step. But it certainly should be possible to support this indefinitely (for ufuncs you create yourself, in any case, since at least you do not replace a TypeResolution on a ufunc someone else created).

@mhvk
Contributor

mhvk commented Jun 4, 2019

@seberg - in my case, I was definitely trying to circumvent problems with the current type casting, so my override would hopefully not be necessary any more. Please ignore it while you think about how you'd want it to be!!

@seberg
Member

seberg commented Jun 4, 2019

@mhvk yes, of course I will not think about it much. But I thought I could try some cleaning up first.

The type_tup (which you conveniently ignore) should actually be a list of "specified_types" fixed much earlier on, before the dispatching. The other thing is that I was hoping to move the whole value-based scalar logic to the entrance of the ufunc. It seems outrageous that dispatching currently requires access to the array information. This is something that needs to change, and I thought changing it earlier than a "revamp" might make things simpler.
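To make the "value-based scalar logic" concrete (this example is added here and describes NumPy's legacy promotion rules of that era, later changed by NEP 50): the result dtype can depend on the value of a Python scalar, not just its type, which is why dispatching needed to look at the actual data:

import numpy as np

a = np.array([1, 2, 3], dtype=np.int8)

print((a + 1).dtype)     # int8  -- the scalar 1 fits in int8, so no promotion
print((a + 1000).dtype)  # int16 -- under legacy promotion, the value 1000 forces a wider loop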
