ENH: add scalar special cases for boolean logical loops #8924
Conversation
See gh-8910 for a masked array change that makes use of these loops.
How do timings compare for cases that do and don't take advantage of this?
It should easily be a factor of 8-16; an SSE unit can perform 16 boolean ops at once.
I'm more concerned about the impact of the branching for small loop sizes when we aren't in the special case this is optimizing for. What determines the number of elements in the inner loop? Does the ufunc always pick a contiguous dimension, or the longest one, or what?
The ufunc machinery overhead is an order of magnitude higher than the cost of the inner loops. The code inside them only starts mattering when the arrays become larger than a few thousand elements. The inner loop size is the full array if it can be expressed via the strides without buffering.
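For illustration only (not part of this PR), a rough sketch of that scaling: per-call overhead dominates for small boolean arrays, and the per-element cost of the inner loop only becomes visible once the arrays grow past a few thousand elements.

```python
# Illustrative sketch (not from this PR): per-call vs. per-element cost of a
# boolean ufunc, showing when the inner loop starts to matter.
import numpy as np
import timeit

for n in (16, 1_000, 1_000_000):
    a = np.zeros(n, dtype=bool)
    b = np.ones(n, dtype=bool)
    reps = 10_000
    t = timeit.timeit(lambda: np.logical_or(a, b), number=reps)
    per_call = t / reps
    print("n=%-9d %8.2f us/call  %8.4f ns/element"
          % (n, per_call * 1e6, per_call / n * 1e9))
```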
This helps, but it's still slower than working with the full arrays (on my MSVC build, anyway). Setup:

In [1]: b = np.ma.make_mask_none((100, 100), writeable=False)
In [2]: f = np.ma.make_mask_none((100, 100), writeable=True)

Before:

After:
Force-pushed from dd8f47e to 53f9d22
I didn't add code for scalar | scalar as that probably doesn't happen in practice; the masked code usually checks for double nomask before doing a masked operation. That it is still slower than the full-array case is actually due to the iterator. In the full-array case a fast path skips the nditer setup. Apparently this does not trigger when a zero-d array is involved, so the expensive iterator gets set up.
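For context, a small sketch (not from the PR itself) of how a zero-stride readonly boolean view arises; np.broadcast_to is used here only as a stand-in for the nomask change proposed in gh-8910.

```python
# Sketch only: a scalar boolean broadcast to a full shape gives a readonly,
# zero-stride view, which is the operand shape these special cases target.
import numpy as np

nomask = np.broadcast_to(np.False_, (100, 100))
print(nomask.strides)           # (0, 0) -> every element aliases one byte
print(nomask.flags.writeable)   # False  -> safe to share as a readonly view

full = np.ones((100, 100), dtype=bool)
out = np.logical_or(full, nomask)   # the inner loop sees a zero step for nomask
print(out.all())                    # True: OR with an all-False mask is a copy
```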
Force-pushed from 53f9d22 to e1b3653
numpy/core/src/umath/loops.c.src (outdated)
 */
    if (steps[0] == 0) {
        if (steps[1] == 0) {
            BOOL_SCALAR_OP(1, 1, ip2, (*args[0] @OP@ *args[1]));
I think we might want const bool val = *args[0] @OP@ *args[1], so that it doesn't get calculated repeatedly inside the loop?
shouldn't be a relevant code path, but why not.
Scalar logical loops just boil down to memcpy or memset. While not very useful in general, some cases like masked arrays can profit when creating zero stride boolean arrays for readonly views.
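A hedged illustration (not from the PR) of the shortcut that sentence describes: with one operand a constant boolean, the logical ops reduce to either a copy of the other operand or a constant fill.

```python
# Illustration only: with a constant boolean operand, logical ops reduce to
# copying the other operand (memcpy-like) or filling the output with a
# constant (memset-like), which is what the special-cased loops exploit.
import numpy as np

x = np.random.rand(1000) > 0.5

assert np.array_equal(np.logical_or(x, False), x)    # x | False -> copy of x
assert np.logical_or(x, True).all()                  # x | True  -> all True
assert np.array_equal(np.logical_and(x, True), x)    # x & True  -> copy of x
assert not np.logical_and(x, False).any()            # x & False -> all False
```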
Force-pushed from e1b3653 to d041977
Does the SIMD work supersede this? @Qiyu8 @seiko2plus Thoughts?
IMO, A
@juliantaylor @eric-wieser I'm inclined to close this, thoughts?
I'm going to close this. Thanks Julian.