
ENH: Move dispatch-able umath fast-loops to the new dispatcher #16396


Closed
wants to merge 5 commits into from

Conversation

seiko2plus
Member

@seiko2plus seiko2plus commented May 27, 2020

This pull request changes


Move dispatch-able umath fast-loops to the new dispatcher

Re-implement the dispatch-able umath fast-loops to fit the requirements of the new dispatcher,
and add a new attribute `dispatch` to the umath generator, similar to the current `simd` attribute but for the new dispatcher.
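For illustration, the intended usage of the new attribute might be modeled roughly like this (a runnable sketch using a simplified stand-in for the generator's type-description entries; the class layout and the `loops_arithmetic` source name are hypothetical, not the actual `generate_umath.py` API):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class TypeDescription:
    """Simplified stand-in for a umath generator type description."""
    type_char: str
    # Old path: explicitly listed SIMD variants, wired via per-function attributes.
    simd: Optional[Tuple[str, ...]] = None
    # New path: name of the dispatch-able source (without its extension);
    # the new dispatcher picks the best-enabled target at runtime.
    dispatch: Optional[str] = None

# Old style: tie a type to explicitly listed SIMD variants.
old = TypeDescription('f', simd=('avx2', 'avx512f'))

# New style: point at a dispatch-able source instead.
new = TypeDescription('f', dispatch='loops_arithmetic')

print(old.simd, new.dispatch)
```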

@charris charris added 01 - Enhancement component: numpy._core component: SIMD Issues in SIMD (fast instruction sets) code or machinery labels May 28, 2020
@seiko2plus seiko2plus force-pushed the move_fast_loops branch 2 times, most recently from 3df132a to 486e6a8 Compare June 18, 2020 07:44
Member

@mattip mattip left a comment


LGTM. I know this is just a reorganization, but it would be prudent to run benchmarks to make sure nothing has changed.

@mattip
Member

mattip commented Jul 10, 2020

Still waiting for a benchmark on this PR on a machine with AVX512

Member

@anirudh2290 anirudh2290 left a comment


I am on a machine with avx512f and avx512cd features. I benchmarked bench_ufunc and bench_avx and performance increased in all cases.

       before           after         ratio
     [58da484a]       [d96263c2]
-      21.6±0.1μs       20.5±0.1μs     0.95  bench_ufunc.UFunc.time_ufunc_types('bitwise_or')
-      13.8±0.3μs       12.8±0.2μs     0.93  bench_ufunc.UFunc.time_ufunc_types('invert')
-      21.6±0.4μs       19.9±0.3μs     0.92  bench_ufunc.UFunc.time_ufunc_types('bitwise_xor')
-        892±10ns         820±30ns     0.92  bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.), subok=True))
-        912±20ns         830±40ns     0.91  bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.), subok=True, where=True))
-      13.9±0.1μs       12.7±0.2μs     0.91  bench_ufunc.UFunc.time_ufunc_types('bitwise_not')
-        635±30ns         566±10ns     0.89  bench_ufunc.Scalar.time_add_scalar
-        914±40ns         810±30ns     0.89  bench_ufunc.ArgParsing.time_add_arg_parsing((array(1.), array(2.)))
-      1.60±0.1μs      1.38±0.01μs     0.86  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), 0))
-      1.64±0.1μs      1.40±0.02μs     0.85  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), 0, None))
-      40.4±0.8μs         34.4±1μs     0.85  bench_ufunc.CustomInplace.time_char_or_temp
-        849±30ns         717±20ns     0.85  bench_ufunc.Scalar.time_add_scalar_conv
-        899±10ns         759±30ns     0.84  bench_ufunc.Scalar.time_add_scalar_conv_complex
-      1.64±0.1μs      1.37±0.01μs     0.84  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.])))
-      32.8±0.4μs         25.1±1μs     0.76  bench_ufunc.CustomInplace.time_int_or
-      26.3±0.3μs       19.3±0.3μs     0.73  bench_ufunc.CustomInplace.time_char_or
-      47.7±0.4μs       31.8±0.2μs     0.67  bench_ufunc.CustomInplace.time_double_add
-      93.8±0.6μs       25.0±0.3μs     0.27  bench_ufunc.UFunc.time_ufunc_types('right_shift')
-      96.6±0.4μs       22.3±0.3μs     0.23  bench_ufunc.UFunc.time_ufunc_types('left_shift')
-      5.45±0.1μs       5.11±0.1μs     0.94  bench_avx.AVX_cmplx_funcs.time_ufunc('conjugate', 1, 'D')
-     3.13±0.03μs      2.91±0.06μs     0.93  bench_avx.AVX_UFunc.time_ufunc('trunc', 1, 'd')
-     3.17±0.02μs      2.91±0.05μs     0.92  bench_avx.AVX_UFunc.time_ufunc('floor', 1, 'd')
-     6.35±0.06μs       5.12±0.1μs     0.81  bench_avx.AVX_UFunc.time_ufunc('frexp', 4, 'f')
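The large `left_shift`/`right_shift` gains above can be spot-checked outside the asv harness with a quick `timeit` run (a rough sketch; the array size and repeat count are arbitrary and do not match the asv benchmark setup):

```python
import timeit
import numpy as np

# Arbitrary size chosen for illustration only.
a = np.arange(1 << 16, dtype=np.int64)

# Time the ufunc that showed the largest improvement in the table above.
t = timeit.timeit(lambda: np.left_shift(a, 1), number=1000)
print(f"left_shift: {t / 1000 * 1e6:.1f} us per call")
```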

@seiko2plus
Member Author

seiko2plus commented Jul 11, 2020

Benchmarking results (ufunc inner loops via #15987)

Measuring before and after this patch

The goal of the following benchmarks is to detect auto-vectorization performance changes between the old
and the new dispatcher.


Measuring the performance gain by enabling SKX in the new dispatcher

in other words, comparing SKX (new dispatcher) vs AVX2 (new dispatcher)


Hardware & OS

AWS/EC2 optimized instance (c5d.xlarge), run on an Intel SKX chip (Skylake server).

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:            4
CPU MHz:             3402.876
BogoMIPS:            6000.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            25344K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
Linux ip-172-31-8-210 5.3.0-1023-aws #25~18.04.1-Ubuntu SMP Fri Jun 5 15:18:30 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.5.0-3ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)

@seiko2plus
Member Author

seiko2plus commented Jul 11, 2020

@mattip, I made two mistakes when I reported the binary size to @rgommers during review of this patch within #13516:

  • I only checked the size increase for SKX within the domain of the new dispatcher,
    but I didn't check the cost of using the new dispatcher for AVX2.
  • I stripped all of the extra ELF symbols, not only the debugging tables.

Now, what is the real cost?

# sizes in kbytes; the option '-Wl,--strip-debug' is passed to the linker during the build
# note: the following sizes are based on what GCC 7.5 generates; other versions or
# compilers may act differently.
4864	_multiarray_umath_after.so
1436	_multiarray_umath_after.so.gz
4056	_multiarray_umath_after_noskx.so
1216	_multiarray_umath_after_noskx.so.gz
3548	_multiarray_umath_before.so
1084	_multiarray_umath_before.so.gz

And why did the binary size increase even without AVX512_SKX?
Because the new dispatcher compiles each target in a separate source file with its own flags,
which gives the compiler the ability to optimize any related utilities in that scope,
while the current dispatching code is based on compiler attributes that only affect the function scope.
Edit: Link Time Optimization (LTO) may also have a negative impact on the final size; not sure yet.
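From the sizes listed above, the two incremental costs can be separated (a quick calculation using only the numbers reported in this comment):

```python
# Sizes in KB, copied from the table above (GCC 7.5, --strip-debug, uncompressed).
before = 3548        # old dispatcher
after_noskx = 4056   # new dispatcher, AVX512_SKX disabled
after = 4864         # new dispatcher, AVX512_SKX enabled

# Cost of moving to the new dispatcher alone (per-target compilation of AVX2 paths):
print(after_noskx - before)   # 508 KB
# Additional cost of enabling the AVX512_SKX target:
print(after - after_noskx)    # 808 KB
```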

Any performance gain?
Definitely, yes, for both AVX2 and AVX512_SKX, but there is also a slight-to-massive loss of performance in
a few kernels when it comes to non-contiguous memory access.

I think we will need to replace the fast-loop macros with SIMD code, to improve the performance of non-contiguous operations
and also to decrease the binary size.

seiko2plus and others added 5 commits July 11, 2020 09:34
    add a new attribute `dispatch` to the umath generator,
    the same as the current `simd` but for the new dispatcher; it requires
    the name of the dispatch-able source without its extension,
    plus the signature types.

Co-authored-by: Matti Picus <matti.picus@gmail.com>
    remove definitions and declarations of dispatch-able umath fast-loops
    from `loops.c.src` and `loops.h.src`
    re-implement the dispatch-able umath fast-loops to fit the requirement
    of the new dispatcher
    re-implement the workaround of disabling optimization for
    `*_right_shift`.
    update umath generator and use attribute `dispatch` instead of
    `simd`.
@mattip
Member

mattip commented Jul 13, 2020

I think the benefit is worth the binary growth. The benchmarks seem to show the change is worthwhile in most cases. It would be nice if all the benchmarks were equal or faster; I wonder why AVX is sometimes better even for contiguous vectors of smaller data sizes (bytes).

@r-devulap
Member

I am not sure if these are related to the changes in this PR, but I am seeing regressions in some of the benchmarks on a SKX:

     before           after         ratio                                     
     [34b748bb]       [ba2239e9]                                                
     <master>         <seiko-move-fast-loops>                                   
     15.4±0.05ms         23.2±6ms     1.51  bench_function_base.Sort.time_argsort('merge', 'float64', ('reversed',))
        728±10ns        971±200ns     1.33  bench_core.Core.time_array_l1          
     3.53±0.05ms       4.46±0.7ms     1.26  bench_scalar.ScalarMath.time_addition_pyint('float32')
     8.11±0.05ms         9.85±2ms     1.22  bench_lib.Nan.time_nancumsum(200, 0)
        686±10ns        829±100ns     1.21  bench_core.Core.time_empty_100         
     7.58±0.02ms         9.15±1ms     1.21  bench_core.Core.time_array_l100        
     7.73±0.02ms      9.29±0.03ms     1.20  bench_function_base.Sort.time_sort('merge', 'int16', ('uniform',))
      12.7±0.1ms         15.2±2ms     1.20  bench_core.Core.time_diag_l100      

@seiko2plus
Member Author

@r-devulap, which compiler/version did you use?

It seems there's no way to achieve that by relying on auto-vectorization. I detected negative impacts with MSVC and old versions of GCC, and also weird behavior from gcc-9; not sure yet about the Intel compiler.

It would be nice if all the benchmarks were equal or faster

I'm going to replace the fast-loop macros with universal intrinsics; I see it as the only approach that can achieve that.

@r-devulap
Member

gcc-9 (Ubuntu 9.2.1-17ubuntu1~18.04.1) 9.2.1 20191102

@seiko2plus
Member Author

seiko2plus commented Jul 17, 2020

I closed this PR in favor of replacing auto-vectorization with universal intrinsics, since it is impossible to get stable, consistent performance across widely used compilers.

@mattip
Member

mattip commented Jul 22, 2020

Perhaps a more gradual approach, replacing a single class of ufuncs at a time (say, logical ops), might be more tractable.

Labels
01 - Enhancement component: numpy._core component: SIMD Issues in SIMD (fast instruction sets) code or machinery

5 participants