-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
ENH: Added libdivide for floor divide #17727
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
It seems this PR conflates two optimizations:
I would be interested to see if the second optimization alone is enough to convince the compiler to optimize the division, as was mentioned in the original issue. |
Hmm, that is true, maybe we should start off with some simple optimizations (see if we can remove the As far as I can tell, libdivide also supports AVX2 and AVX512 (if defines are set), which we could use but will require looking into the universal intrinsics probably. |
I used the PEP definition of
Unfortunately, no improvement or not anything visible, attaching my results: Please see the code changes I did in the latest commit, I expected at least a 20-30% improvement, is there an error in my logic? |
According to the license of libdivide, we can use the zlib variant, which allows us to distribute and modify the source. We need to be careful to document any future modifications for Universal SIMD and should give attribution in our LICENSES_bundled.txt |
How can we set special build environment variables for the CI to pickup, modify the description? I want |
Under which build conditions would it be better not to use |
Will do these later this week:
|
If the header and license is not included in the tarball when doing Edit: |
The files are getting included in the tar ball @mattip :
|
Currently, I am trying to break down libdivide header by following: #13516 (comment) I hope this is the right approach to support multi-platform SIMD |
I think it is fine to leave universal simd for another PR. Other reviewers: should this hit the mailing list/get a release note? |
Oh, sure @mattip , it seems like theres more we can do with libdivide itself, different algos we can choose from, etc. Right now it's the most generic one, aimed to work with compatibility. Will raise future PRs with experimentations. |
d3d932e
to
72dcc04
Compare
Since this doesn't change anything user-facing, I don't think we need to hit the mailing list or even add release note. It might be nice to make a release note to summarize some speed improved also due to SIMD (maybe we should add a performance category to the release notes, since performance changes always seem to interest everyone even though they are rarely important from a compatibility perspective). We should double check that zlib is fine for inclusion, so that this has definitely no affect on the NumPy license. |
I want some such section in the release notes and would be grateful is someone else provides it. |
numpy/core/src/umath/loops.c.src
Outdated
npy_set_floatstatus_divbyzero(); | ||
*((@type@ *)op1) = 0; | ||
} | ||
else if (((in1 > 0) != (in2 > 0)) && (in1 % in2 != 0)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can libdivide
compute in1 % in2
for you? It seems a bit silly to use libdivide only to then perform a remainder calculation without it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I honestly think we can avoid this %
and rewrite it as postprocessing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I still don't know how removing the %
gave no performance boost :). The compiler is magically optimizing something. #17727 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, interesting. @ganesh-k13 two things: First make sure you are dividing a positive by a negative number (or vice versa), otherwise this is not hit at all. Second, was the timing difference with libdivide? I guess it might be the compiler is smart enough to optimize the modulo away, but I would be surprised if it is smart enough when libdivide is being used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They seem to have not done it yet: ridiculousfish/libdivide#9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All this changes is subtract one for rounding purproses, Now unless there is some edge case again, I think you can just do without the subtract, and then move the if to later, so that if (res < 0) && (res * in2 != in1) { res -= 1}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You were right about not hitting the case, @seberg , seems like in the profile script I forgot to invert the signs. Above method seems to work, few edge cases to iron out(like <= 0, etc), will try them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found three edge cases:
- res is
0
, then possible negative divisor/dividend - divisor is 0 or -0, handled by putting inside else
- dividend is 0, handled by the same logic as 1.
Let me know if any more are there.
[EDIT]: Can use the same logic in sliding as well.
a94afb9
to
285d810
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks for all the quick followups. I am happy with the current loop macro setups (we can always change them easily anyway).
There is only one big issue remaining from my side, and that is the empty-array case, which we have to guard against in the constant branch. (And it is slightly scary tests did not find it, although it may be that we have tests for it, and we would just have to use valgrind or so to notice the issue reliably).
@@ -249,6 +249,29 @@ def test_division_int(self): | |||
assert_equal(x // 100, [0, 0, 0, 1, -1, -1, -1, -1, -2]) | |||
assert_equal(x % 100, [5, 10, 90, 0, 95, 90, 10, 0, 80]) | |||
|
|||
@pytest.mark.parametrize("input_dtype", | |||
[np.int8, np.int16, np.int32, np.int64]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably more than necessary, but that is fine. The different unit cases are not super interesting for the division code, but frankly a bit many tests are fine :).
benchmarks/benchmarks/bench_ufunc.py
Outdated
self.x = np.arange(size) | ||
|
||
def time_floor_divide(self, size): | ||
self.x//8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose dividing by 8 is one of the best case scenarios? A bit curious how things behave for dividing by a less weird number? (but honestly, just curious, I trust that libdivide is worth it).
One thing I am more curious about is how much the speedup is for the small integers, like int8
, etc? I guess there are not many specialized registers for those, so the upcast is probably just as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not happy with the initial tests, so I rewrote them, please let me know
[EDIT] After improved commit a5e1235
before after ratio
[3dca0c71] [8912ffd9]
<master> <enh_14959-libdivide>
- 2.36±0.01μs 2.13±0.04μs 0.90 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 0)
- 2.11±0.01μs 1.84±0.05μs 0.87 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)
- 2.11±0.02μs 1.79±0.01μs 0.85 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)
- 2.19±0.01μs 1.66±0.01μs 0.76 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)
- 2.18±0.03μs 1.65±0.01μs 0.76 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)
- 45.1±0.3ms 25.6±0.05ms 0.57 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 0)
- 285±2μs 137±0.2μs 0.48 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)
- 287±0.5μs 138±1μs 0.48 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)
- 121±0.8ms 53.5±0.2ms 0.44 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 43)
- 122±0.2ms 53.3±0.3ms 0.44 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -43)
- 115±0.08ms 44.8±1ms 0.39 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)
- 116±0.6ms 44.4±0.7ms 0.38 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)
- 304±0.5μs 107±8μs 0.35 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)
- 126±1ms 42.3±0.3ms 0.34 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -8)
- 126±0.1ms 42.4±0.07ms 0.34 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 8)
- 302±1μs 100±2μs 0.33 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)
- 42.7±0.1ms 12.9±0.04ms 0.30 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 0)
- 120±0.1ms 35.1±0.09ms 0.29 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)
- 118±0.2ms 34.6±0.05ms 0.29 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)
- 98.8±1μs 22.5±0.2μs 0.23 bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 0)
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did choose 8 to show nice results :) . But yeah we get some speedup for a number like 43 with int8 itself.
f5a66f3
to
ca4ba20
Compare
Hey @seberg, any other changes needed here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I am happy to put this in, may merge in a bit. But maybe someone else wants to do a quick pass. (marked a few nitpicks, but nothing that matters). There might be some followup with other division loops possible, but not sure.
An exhaustive test for int32 might be neat (not as a test, just to be sure), but it should not be necessary.
Thanks @seberg , I was planning to work on the universal intrinsics. We still have not used the SSE and AVX versions of libdivide. I will make a POC like last time to see the improvements and we can take a call if it's worth the change.
We have covered ints and timedeltas, are there more? Will be happy to port to them as well in a follow-up. |
Hmm, might have thought in the wrong direction about other loops, I guess libdivide probably does not have any unsigned integer versions? Should just double check which |
libdivide does support unsigned 32 and 64 bit ints. I did a quick browse of the code, the other division loops are for float only. Universal intrinsics will be a follow up PR.
Anything in particular I can test here? Any testing strategy? The current added UT tests for all boundry case numbers. |
Thanks @ganesh-k13 |
Further implementation and test improvements can be in follow-on PRs. |
Improvements:
Specs: ryzen 3600(12 cores), 32 GB memory
Changes:
Added a macro check to use libdivide or not:NPY_USE_LIBDIVIDE
%
in floor divide code.*new
Profiler code:
https://gist.github.com/ganesh-k13/6201227c5d3d65902c6eaf71357e72b1
Note: This is just a POC to show some performance improvementsresolves: #14959
cc:
@eric-wieser
@mattip
@seberg