ENH: Added libdivide for floor divide #17727

ganesh-k13 · 2020-11-07T08:09:40Z

Improvements:

Specs: ryzen 3600(12 cores), 32 GB memory

Changes:

~~Added a macro check to use libdivide or not: NPY_USE_LIBDIVIDE~~
Added libdivide to floor divide.
Optimisation to remove % in floor divide code.
*Added libdivide to timedelta.
*Benchmark tests for ints to show improvements.
*UT for ints and timedelta floor divide.

*new

Profiler code:

https://gist.github.com/ganesh-k13/6201227c5d3d65902c6eaf71357e72b1

~~Note: This is just a POC to show some performance improvements~~

resolves: #14959

cc:
@eric-wieser
@mattip
@seberg

numpy/core/src/umath/loops.c.src

mattip · 2020-11-07T19:51:08Z

It seems this PR conflates two optimizations:

using libdivide
optimizing the loop for the case where is2 == 0.

I would be interested to see if the second optimization alone is enough to convince the compiler to optimize the division, as was mentioned in the original issue.

seberg · 2020-11-07T21:39:27Z

Hmm, that is true. Maybe we should start off with some simple optimizations: see if we can remove the % operation, and see what happens if we optimize (or let the compiler optimize) the is2 == 0 loop.

As far as I can tell, libdivide also supports AVX2 and AVX512 (if defines are set), which we could use but will require looking into the universal intrinsics probably.

ganesh-k13 · 2020-11-08T06:48:04Z

I used the PEP definition of // to write the logic:

Cast to float to get float division value and floor it. Hence, no mod required.
Used the constant loop logic as before for compiler optimization.

Unfortunately, there was no improvement, or at least nothing visible. Attaching my results:

Please see the code changes in my latest commit. I expected at least a 20-30% improvement; is there an error in my logic?
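One caveat with the float-cast approach described above: a double has a 53-bit significand, so 64-bit integers outside that range cannot be represented exactly, and flooring the float quotient can disagree with integer floor division. A minimal Python sketch of that effect follows; `floor_div_via_float` is a hypothetical helper for illustration, not code from this PR.

```python
import math


def floor_div_via_float(a, b):
    """Floor division computed by casting to float first (illustrative only)."""
    # Mirrors the "cast to float, divide, floor" idea from the comment above.
    return math.floor(float(a) / float(b))


# Small operands agree with integer floor division.
small_ok = floor_div_via_float(-7, 2) == (-7 // 2)

# 2**62 + 1 is not exactly representable as a double, so the result drifts.
big = 2**62 + 1
big_ok = floor_div_via_float(big, 1) == (big // 1)

print(small_ok, big_ok)
```

This only illustrates the precision limit for large int64 values; it says nothing about the performance of either approach.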

numpy/core/src/umath/loops.c.src

mattip · 2020-11-08T07:33:51Z

According to the license of libdivide, we can use the zlib variant, which allows us to distribute and modify the source. We need to be careful to document any future modifications for Universal SIMD and should give attribution in our LICENSES_bundled.txt

ganesh-k13 · 2020-11-08T08:33:05Z

How can we set special build environment variables for the CI to pick up? By modifying the description? I want NPY_USE_LIBDIVIDE to be set.

mattip · 2020-11-08T09:44:22Z

Under which build conditions would it be better not to use NPY_USE_LIBDIVIDE? Perhaps it would be better to rethink why you might want that in the first place, if at all.

numpy/core/src/umath/loops.c.src

ganesh-k13 · 2020-11-08T18:55:16Z

Will do these later this week:

Add proper license
> document any future modifications for Universal SIMD

mattip · 2020-11-09T13:27:39Z

If the header and license are not included in the tarball when doing python setup.py sdist, they should be added to ./MANIFEST.in. We should set up a CI job to make sure we can build from the sdist tarball.

Edit: s/header/header and license/

ganesh-k13 · 2020-11-10T03:17:35Z

The files are included in the tarball, @mattip:

numpy-1.20.0.dev0+d3d932e/numpy/core/include/numpy/libdivide/
numpy-1.20.0.dev0+d3d932e/numpy/core/include/numpy/libdivide/LICENSE.txt
numpy-1.20.0.dev0+d3d932e/numpy/core/include/numpy/libdivide/libdivide.h

ganesh-k13 · 2020-11-10T03:29:56Z

Currently, I am trying to break down the libdivide header by following: #13516 (comment)

I hope this is the right approach to support multi-platform SIMD.

mattip · 2020-11-10T04:38:35Z

I think it is fine to leave universal simd for another PR. Other reviewers: should this hit the mailing list/get a release note?

ganesh-k13 · 2020-11-10T04:45:34Z

Oh, sure @mattip, it seems like there's more we can do with libdivide itself, such as choosing between different algorithms. Right now it uses the most generic one, aimed at compatibility. I will raise future PRs with experiments.

seberg · 2020-11-10T15:38:24Z

Since this doesn't change anything user-facing, I don't think we need to hit the mailing list or even add a release note. It might be nice to make a release note to summarize some speed improvements, including those due to SIMD (maybe we should add a performance category to the release notes, since performance changes always seem to interest everyone even though they are rarely important from a compatibility perspective).

We should double check that zlib is fine for inclusion, so that this definitely has no effect on the NumPy license.

charris · 2020-11-10T16:48:19Z

> It might be nice to make a release note to summarize some speed improvements, including those due to SIMD

I want some such section in the release notes and would be grateful if someone else provides it.

numpy/core/src/umath/loops.c.src

eric-wieser · 2020-11-10T16:54:00Z

numpy/core/src/umath/loops.c.src

+                npy_set_floatstatus_divbyzero();
+                *((@type@ *)op1) = 0;
+            }
+            else if (((in1 > 0) != (in2 > 0)) && (in1 % in2 != 0)) {


Can libdivide compute in1 % in2 for you? It seems a bit silly to use libdivide only to then perform a remainder calculation without it.

I honestly think we can avoid this % and rewrite it as postprocessing?

So I still don't know how removing the % gave no performance boost :). The compiler is magically optimizing something. #17727 (comment)

Oh, interesting. @ganesh-k13 two things: First make sure you are dividing a positive by a negative number (or vice versa), otherwise this is not hit at all. Second, was the timing difference with libdivide? I guess it might be the compiler is smart enough to optimize the modulo away, but I would be surprised if it is smart enough when libdivide is being used?

They seem to have not done it yet: ridiculousfish/libdivide#9

All this changes is subtracting one for rounding purposes. Now unless there is some edge case again, I think you can just do without the subtract, and then move the if to later, so that if (res < 0) && (res * in2 != in1) { res -= 1}?

You were right about not hitting the case, @seberg. It seems that in the profile script I forgot to invert the signs. The above method seems to work; there are a few edge cases to iron out (like <= 0, etc.), and I will try them.

I found three edge cases:

1. res is 0, with a possibly negative divisor/dividend

2. divisor is 0 or -0, handled by putting it inside the else

3. dividend is 0, handled by the same logic as 1.

Let me know if there are any more.
[EDIT]: Can use the same logic in sliding as well.
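The post-processing idea discussed in this thread can be sketched in Python as follows. This is a hedged illustration, not the code merged in the PR: `c_trunc_div` and `floor_div_postprocess` are hypothetical names, C's truncating division is emulated by hand because Python's `//` already floors, and integer overflow is not modeled.

```python
def c_trunc_div(a, b):
    """Emulate C integer division, which truncates toward zero (b != 0)."""
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q


def floor_div_postprocess(a, b):
    """Floor division built from a truncated quotient plus a correction."""
    if b == 0:
        # The diff above sets the divide-by-zero flag and stores 0 here.
        return 0
    res = c_trunc_div(a, b)
    # Correct only when the signs differ and the division was inexact.
    # Testing the operand signs instead of res < 0 covers the res == 0 case.
    if ((a > 0) != (b > 0)) and (res * b != a):
        res -= 1
    return res


print(floor_div_postprocess(-1, 2), floor_div_postprocess(7, -2))
```

Checking `res * b != a` replaces the `in1 % in2 != 0` test from the original condition, so no second division-like operation is needed.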

numpy/core/src/umath/fast_loop_macros.h

numpy/core/tests/test_umath.py

seberg

OK, thanks for all the quick followups. I am happy with the current loop macro setups (we can always change them easily anyway).

There is only one big issue remaining from my side, and that is the empty-array case, which we have to guard against in the constant branch. (And it is slightly scary tests did not find it, although it may be that we have tests for it, and we would just have to use valgrind or so to notice the issue reliably).

numpy/core/src/umath/loops.c.src

seberg · 2020-11-22T17:14:49Z

numpy/core/tests/test_umath.py

@@ -249,6 +249,29 @@ def test_division_int(self):
        assert_equal(x // 100, [0, 0, 0, 1, -1, -1, -1, -1, -2])
        assert_equal(x % 100, [5, 10, 90, 0, 95, 90, 10, 0, 80])

+    @pytest.mark.parametrize("input_dtype",
+            [np.int8, np.int16, np.int32, np.int64])


Probably more than necessary, but that is fine. The different unit cases are not super interesting for the division code, but frankly, having a few too many tests is fine :).

numpy/core/tests/test_umath.py

seberg · 2020-11-22T17:21:59Z

benchmarks/benchmarks/bench_ufunc.py

+        self.x = np.arange(size)
+
+    def time_floor_divide(self, size):
+        self.x//8


I suppose dividing by 8 is one of the best case scenarios? A bit curious how things behave for dividing by a less weird number? (but honestly, just curious, I trust that libdivide is worth it).

One thing I am more curious about is how much the speedup is for the small integers, like int8, etc? I guess there are not many specialized registers for those, so the upcast is probably just as well?

I was not happy with the initial tests, so I rewrote them, please let me know
[EDIT] After improved commit a5e1235

       before           after         ratio
     [3dca0c71]       [8912ffd9]
     <master>         <enh_14959-libdivide>
-     2.36±0.01μs      2.13±0.04μs     0.90  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 0)
-     2.11±0.01μs      1.84±0.05μs     0.87  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)
-     2.11±0.02μs      1.79±0.01μs     0.85  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)
-     2.19±0.01μs      1.66±0.01μs     0.76  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)
-     2.18±0.03μs      1.65±0.01μs     0.76  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)
-      45.1±0.3ms      25.6±0.05ms     0.57  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 0)
-         285±2μs        137±0.2μs     0.48  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)
-       287±0.5μs          138±1μs     0.48  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)
-       121±0.8ms       53.5±0.2ms     0.44  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 43)
-       122±0.2ms       53.3±0.3ms     0.44  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -43)
-      115±0.08ms         44.8±1ms     0.39  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)
-       116±0.6ms       44.4±0.7ms     0.38  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)
-       304±0.5μs          107±8μs     0.35  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)
-         126±1ms       42.3±0.3ms     0.34  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -8)
-       126±0.1ms      42.4±0.07ms     0.34  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 8)
-         302±1μs          100±2μs     0.33  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)
-      42.7±0.1ms      12.9±0.04ms     0.30  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 0)
-       120±0.1ms      35.1±0.09ms     0.29  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)
-       118±0.2ms      34.6±0.05ms     0.29  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)
-        98.8±1μs       22.5±0.2μs     0.23  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 0)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

I did choose 8 to show nice results :). But yes, we get some speedup for a number like 43 even with int8.
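For context on why a loop-invariant divisor helps even when it is not a power of two: the division can be replaced by one multiplication with a precomputed "magic" constant followed by a shift. The sketch below shows that general idea for unsigned 32-bit values using Python's big integers. It is not libdivide's actual algorithm or API, and `make_magic` and `fast_div_u32` are hypothetical names.

```python
def make_magic(d):
    """Precompute a 64-bit-scaled reciprocal for a divisor 1 <= d < 2**32."""
    if not 1 <= d < 2**32:
        raise ValueError("divisor must fit in an unsigned 32-bit integer")
    # Equals ceil(2**64 / d), or 2**64 / d exactly when d is a power of two.
    return (2**64 - 1) // d + 1


def fast_div_u32(n, magic):
    """Compute n // d for 0 <= n < 2**32 with a multiply and a shift."""
    return (n * magic) >> 64


magic_43 = make_magic(43)  # computed once, outside the hot loop
quotients = [fast_div_u32(n, magic_43) for n in (0, 42, 43, 1000, 2**32 - 1)]
print(quotients)
```

The precomputation is paid once per divisor, which is why this kind of approach fits the constant-divisor (is2 == 0) loop discussed earlier in the thread.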

doc/release/upcoming_changes/17727.performance.rst

numpy/core/src/umath/loops.c.src

ganesh-k13 · 2020-11-26T14:00:46Z

Hey @seberg, any other changes needed here?

seberg

OK, I am happy to put this in, may merge in a bit. But maybe someone else wants to do a quick pass. (marked a few nitpicks, but nothing that matters). There might be some followup with other division loops possible, but not sure.

An exhaustive test for int32 might be neat (not as a test, just to be sure), but it should not be necessary.

LICENSES_bundled.txt

numpy/core/src/umath/loops.c.src

ganesh-k13 · 2020-12-01T03:28:21Z

Thanks @seberg , I was planning to work on the universal intrinsics. We still have not used the SSE and AVX versions of libdivide. I will make a POC like last time to see the improvements and we can take a call if it's worth the change.

> some followup with other division loops possible

We have covered ints and timedeltas, are there more? Will be happy to port to them as well in a follow-up.

seberg · 2020-12-01T03:41:39Z

Hmm, might have thought in the wrong direction about other loops, I guess libdivide probably does not have any unsigned integer versions? Should just double check which divide related loops might be similar...

ganesh-k13 · 2020-12-01T08:24:32Z

libdivide does support unsigned 32 and 64 bit ints. I did a quick browse of the code; the other division loops are for float only. Universal intrinsics will be a follow-up PR.
Regarding:

> An exhaustive test for int32 might be neat

Anything in particular I can test here? Any testing strategy? The currently added unit tests cover all boundary-case numbers.

mattip · 2020-12-02T21:43:00Z

Thanks @ganesh-k13

mattip · 2020-12-02T21:43:40Z

Further implementation and test improvements can be in follow-on PRs.

ENH: Added libdiv

179038f

github-actions bot added the 01 - Enhancement label Nov 7, 2020

eric-wieser reviewed Nov 7, 2020

numpy/core/src/umath/loops.c.src

ENH: Fixed typos in header | use in2 over ip2

e89175b

seberg marked this pull request as draft November 7, 2020 21:36

ENH: Added optimal divisor

565759b

mattip reviewed Nov 8, 2020

numpy/core/src/umath/loops.c.src

ENH: Added libdivide header

d0c934c

eric-wieser reviewed Nov 8, 2020

numpy/core/src/umath/loops.c.src

ganesh-k13 added 2 commits November 8, 2020 18:08

ENH: Made libdivide default

b02399a

ENH: Handled divide by 0 case

f0ddb7c

charris added the component: numpy._core label Nov 10, 2020

ENH: Added libdivide zlib license

72dcc04

ganesh-k13 force-pushed the enh_14959-libdivide branch from d3d932e to 72dcc04 November 10, 2020 10:11

eric-wieser reviewed Nov 10, 2020

numpy/core/src/umath/loops.c.src

eric-wieser reviewed Nov 10, 2020

ganesh-k13 added 2 commits November 21, 2020 19:23

ENH: Optimized 0 divisor cases

0e2116f

TST: Minor changes to floor divide | Added cases for timedelta divide

f93ca93

eric-wieser reviewed Nov 21, 2020

numpy/core/src/umath/fast_loop_macros.h

eric-wieser reviewed Nov 21, 2020

numpy/core/src/umath/fast_loop_macros.h

eric-wieser reviewed Nov 21, 2020

numpy/core/src/umath/fast_loop_macros.h

ENH: Remove looping definitions | Renamed fast loop macros

285d810

ganesh-k13 force-pushed the enh_14959-libdivide branch from a94afb9 to 285d810 November 22, 2020 06:04

ENH: Removed unsed macro check

9825795

ganesh-k13 commented Nov 22, 2020

numpy/core/tests/test_umath.py

seberg reviewed Nov 22, 2020

ganesh-k13 added 4 commits November 23, 2020 12:34

BUG: Added better 0 checks

1f104fd

BENCH: Added floor divide benchmarks (numpy#17727)

2fde590

DOC: Improved floor division (numpy#17727)

8912ffd

BENCH: Improve floor divide benchmarks (numpy#17727)

a5e1235

eric-wieser reviewed Nov 23, 2020

numpy/core/src/umath/loops.c.src

ganesh-k13 mentioned this pull request Nov 23, 2020

BUG, ENH: Refine Nat checks and division by 0 setting in timedelta division code #17831

Open

BUG,TST: Fixed division by 0 status setting

ca4ba20

ganesh-k13 force-pushed the enh_14959-libdivide branch from f5a66f3 to ca4ba20 November 24, 2020 05:40

seberg self-requested a review November 27, 2020 00:46

seberg approved these changes Nov 30, 2020

LICENSES_bundled.txt

numpy/core/src/umath/loops.c.src

numpy/core/src/umath/loops.c.src

MAINT: Linting fixes

28aa883

mattip merged commit 03692a7 into numpy:master Dec 2, 2020

ganesh-k13 mentioned this pull request Dec 22, 2020

ENH: libdivide for unsigned integers #18055

Closed

xumingkuan mentioned this pull request Jun 8, 2021

[Opt] Add a strength reduction pass taichi-dev/taichi#944

Open
