SIMD: Replace raw SIMD of sin/cos with NPYV(universal intrinsics) #17587

seiko2plus · 2020-10-19T14:37:03Z

SIMD: Replace raw SIMD of sin/cos with NPYV

The new code improves the performance of non-contiguous memory access
for the output array without any reduction in performance.
On PPC64LE the performance increased by a factor of 2-3.0, and by a factor of 1.5-2.0 on aarch64.
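
To make "non-contiguous memory access for the output array" concrete, here is a minimal pure-Python model of a blocked unary loop that reads with one stride and writes with another, including a partial final block. It is only a sketch: `strided_unary`, `LANES`, and the list-based buffers are hypothetical names for illustration, not the NPYV intrinsics or the code in this PR.

```python
import math

LANES = 8  # hypothetical vector width, e.g. 8 float32 lanes in a 256-bit register


def strided_unary(func, src, src_stride, dst, dst_stride, count, lanes=LANES):
    """Apply func to `count` elements read with src_stride and written with dst_stride."""
    i = 0
    while i < count:
        n = min(lanes, count - i)  # the last block may be partial
        # "non-contiguous load": gather n elements spaced src_stride apart
        block = [src[(i + k) * src_stride] for k in range(n)]
        # compute on the whole block (stands in for the SIMD kernel)
        block = [func(v) for v in block]
        # "non-contiguous store": scatter the results spaced dst_stride apart
        for k in range(n):
            dst[(i + k) * dst_stride] = block[k]
        i += n
    return dst


count = 10
src = [0.1 * j for j in range(count * 2)]  # input is read with stride 2
dst = [None] * (count * 3)                 # output is written with stride 3
strided_unary(math.sin, src, 2, dst, 3, count)
```

In this model the slots of `dst` between strided positions are left untouched, which is the behaviour a strided output requires.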

TODO:

remove temporary commits
benchmarks
add testing cases for the new intrinsics
rebase after ENH, SIMD: Add new NPYV intrinsics pack(1) #17790, ENH, SIMD: Add new NPYV intrinsics pack(0) #17789
re-run the performance tests and update the current one

Performance tests(ASV)

Args

--bench-compare master bench_ufunc_strides.Unary -- --sort name --cpu-affinity 1,5

X86

I had to rely on my local machine because I couldn't get stable ratios on AWS.
See the standalone benchmark for AVX512F.

CPU

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               142
Model name:          Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
Stepping:            10
CPU MHz:             1800.344
CPU max MHz:         4000.0000
CPU min MHz:         400.0000
BogoMIPS:            3984.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx

OS

Linux ac6279ab1a82 4.19.0-13-amd64 #1 SMP Debian 4.19.160-2 (2020-11-28) x86_64 x86_64 x86_64 GNU/Linux
gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0

Benchmark

AVX2 & FMA3 - Changed only

       before           after         ratio
     [098a3b41]       [a0322ee9]
     <master>         <to_npyv_sincos_f32>
        259~3us       55.1~0.2us     0.21  bench_ufunc_strides.Unary.time_ufunc('cos', 1, 2, 'f')
        260~4us       56.2~0.2us     0.22  bench_ufunc_strides.Unary.time_ufunc('cos', 1, 4, 'f')
      334~0.8us      60.4~0.07us     0.18  bench_ufunc_strides.Unary.time_ufunc('cos', 2, 2, 'f')
      335~0.9us       61.5~0.2us     0.18  bench_ufunc_strides.Unary.time_ufunc('cos', 2, 4, 'f')
      337~0.4us       62.1~0.2us     0.18  bench_ufunc_strides.Unary.time_ufunc('cos', 4, 2, 'f')
        339~2us       61.2~0.6us     0.18  bench_ufunc_strides.Unary.time_ufunc('cos', 4, 4, 'f')
       266~10us       54.9~0.2us     0.21  bench_ufunc_strides.Unary.time_ufunc('sin', 1, 2, 'f')
       270~20us       55.6~0.2us     0.21  bench_ufunc_strides.Unary.time_ufunc('sin', 1, 4, 'f')
        331~3us       60.3~0.1us     0.18  bench_ufunc_strides.Unary.time_ufunc('sin', 2, 2, 'f')
        332~2us       61.0~0.3us     0.18  bench_ufunc_strides.Unary.time_ufunc('sin', 2, 4, 'f')
        336~1us       61.7~0.3us     0.18  bench_ufunc_strides.Unary.time_ufunc('sin', 4, 2, 'f')
      335~0.2us       61.5~0.4us     0.18  bench_ufunc_strides.Unary.time_ufunc('sin', 4, 4, 'f')
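
As a reading aid for the ASV table above: the ratio column is after/before, so values below 1 mean the new code is faster. The short sketch below recomputes the ratio and the equivalent speedup for three rows copied from the table; `asv_ratio` is a hypothetical helper name, not part of ASV.

```python
def asv_ratio(before_us, after_us):
    """ASV-style ratio: after / before, so < 1 means the change is faster."""
    return after_us / before_us


# (ufunc, stride_in, stride_out, before in us, after in us), copied from the table
rows = [
    ("cos", 1, 2, 259.0, 55.1),
    ("cos", 2, 2, 334.0, 60.4),
    ("sin", 4, 4, 335.0, 61.5),
]

for name, s_in, s_out, before, after in rows:
    ratio = asv_ratio(before, after)
    print(f"{name} {s_in}->{s_out}: ratio={ratio:.2f} speedup={1 / ratio:.1f}x")
```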

Power little-endian

CPU

Architecture:                    ppc64le
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       8
NUMA node(s):                    1
Model:                           2.2 (pvr 004e 1202)
Model name:                      POWER9 (architected), altivec supported
L1d cache:                       256 KiB
L1i cache:                       256 KiB
NUMA node0 CPU(s):               0-7
Vulnerability L1tf:              Not affected
Vulnerability Meltdown:          Mitigation; RFI Flush
Vulnerability Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable

processor   : 7
cpu     : POWER9 (architected), altivec supported
clock       : 2200.000000MHz
revision    : 2.2 (pvr 004e 1202)

timebase    : 512000000
platform    : pSeries
model       : IBM pSeries (emulated by qemu)
machine     : CHRP IBM pSeries (emulated by qemu)
MMU     : Radix

OS

Linux 8b2db3b0dfac 4.19.0-2-powerpc64le
gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)

Benchmark

VSX2(ISA >= 2.07) - Changed only

       before           after         ratio
     [098a3b41]       [a0322ee9]
     <master>         <to_npyv_sincos_f32>
       120±0.2μs      44.7±0.02μs     0.37  bench_ufunc_strides.Unary.time_ufunc('cos', 1, 1, 'f')
       121±0.5μs      48.9±0.04μs     0.40  bench_ufunc_strides.Unary.time_ufunc('cos', 1, 2, 'f')
       121±0.3μs      49.1±0.04μs     0.41  bench_ufunc_strides.Unary.time_ufunc('cos', 1, 4, 'f')
       121±0.2μs      48.7±0.02μs     0.40  bench_ufunc_strides.Unary.time_ufunc('cos', 2, 1, 'f')
       121±0.1μs      52.4±0.04μs     0.43  bench_ufunc_strides.Unary.time_ufunc('cos', 2, 2, 'f')
       121±0.1μs      52.5±0.05μs     0.43  bench_ufunc_strides.Unary.time_ufunc('cos', 2, 4, 'f')
       121±0.2μs      48.8±0.06μs     0.40  bench_ufunc_strides.Unary.time_ufunc('cos', 4, 1, 'f')
       122±0.6μs      52.6±0.04μs     0.43  bench_ufunc_strides.Unary.time_ufunc('cos', 4, 2, 'f')
      122±0.09μs      53.0±0.01μs     0.43  bench_ufunc_strides.Unary.time_ufunc('cos', 4, 4, 'f')
       126±0.6μs      44.1±0.01μs     0.35  bench_ufunc_strides.Unary.time_ufunc('sin', 1, 1, 'f')
       131±0.5μs      48.2±0.02μs     0.37  bench_ufunc_strides.Unary.time_ufunc('sin', 1, 2, 'f')
       130±0.7μs      48.4±0.02μs     0.37  bench_ufunc_strides.Unary.time_ufunc('sin', 1, 4, 'f')
       131±0.6μs      47.9±0.02μs     0.37  bench_ufunc_strides.Unary.time_ufunc('sin', 2, 1, 'f')
       131±0.5μs      51.4±0.04μs     0.39  bench_ufunc_strides.Unary.time_ufunc('sin', 2, 2, 'f')
       131±0.6μs      51.6±0.02μs     0.39  bench_ufunc_strides.Unary.time_ufunc('sin', 2, 4, 'f')
       130±0.7μs      48.1±0.02μs     0.37  bench_ufunc_strides.Unary.time_ufunc('sin', 4, 1, 'f')
       131±0.2μs      51.7±0.02μs     0.39  bench_ufunc_strides.Unary.time_ufunc('sin', 4, 2, 'f')
       131±0.4μs      52.0±0.05μs     0.40  bench_ufunc_strides.Unary.time_ufunc('sin', 4, 4, 'f')

Performance tests(standalone #15987)

Args used within #15987

--filter "(sin|cos)::.*[f]" --strides 1 2 3 10 --msleep 1 --iteration 100

Note: --msleep 1 forces the running thread to sleep for 1 millisecond before collecting each sample
so that any frequency reduction can recover, since throttling seems to affect wall time when AVX512F is enabled.
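
The sampling scheme described in the note can be sketched roughly as follows. This is an assumption about how `--msleep` and `--iteration` behave, based only on the note above and the "metric: gmean" label used in the tables; it is not the code of the tool in #15987, and `sample` and `gmean` are hypothetical names.

```python
import math
import time


def gmean(values):
    """Geometric mean of positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def sample(fn, iterations=100, msleep=1):
    """Time fn() `iterations` times, idling `msleep` ms before each sample."""
    samples_ms = []
    for _ in range(iterations):
        if msleep:
            time.sleep(msleep / 1000.0)  # let the core clock recover before timing
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return gmean(samples_ms)


data = [0.001 * i for i in range(1024)]
result_ms = sample(lambda: [math.sin(v) for v in data], iterations=20, msleep=1)
print(f"gmean: {result_ms:.4f} ms")
```

The geometric mean is less sensitive to a few slow outlier samples than the arithmetic mean, which is presumably why it is used as the reported metric.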

X86

CPU

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              2
Core(s) per socket:              2
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Stepping:                        7
CPU MHz:                         3604.410
BogoMIPS:                        6000.00
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       64 KiB
L1i cache:                       64 KiB
L2 cache:                        2 MiB
L3 cache:                        35.8 MiB
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     KVM: Vulnerable
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke

OS

Linux ip-172-31-28-146 5.4.0-1025-aws
gcc version 7.5.0 (Ubuntu 7.5.0-6ubuntu2)

Benchmark

AVX512F - Contiguous only

metric: gmean, units: ms

name of test	before_contig_avx512f	after_contig_avx512f	after_contig_avx512f vs before_contig_avx512f
cos::1024 f::1 -> f::1	0.0011	0.0011	1.01
cos::2048 f::1 -> f::1	0.0018	0.0018	1.01
cos::4096 f::1 -> f::1	0.0033	0.0029	`1.13`
sin::1024 f::1 -> f::1	0.0011	0.0011	1.01
sin::2048 f::1 -> f::1	0.0018	0.0018	0.98
sin::4096 f::1 -> f::1	0.0032	0.0030	`1.07`

AVX512F

metric: gmean, units: ms

name of test	before_avx512f	after_avx512f	after_avx512f vs before_avx512f
cos::1024 f::1 -> f::1	0.0011	0.0011	1.01
cos::2048 f::1 -> f::1	0.0018	0.0018	0.99
cos::4096 f::1 -> f::1	0.0034	0.0032	1.05
cos::1024 f::1 -> f::2	0.0139	0.0010	`14.02`
cos::2048 f::1 -> f::2	0.0278	0.0019	`14.76`
cos::4096 f::1 -> f::2	0.0561	0.0046	`12.17`
cos::1024 f::1 -> f::3	0.0140	0.0010	`13.88`
cos::2048 f::1 -> f::3	0.0280	0.0019	`14.76`
cos::4096 f::1 -> f::3	0.0565	0.0045	`12.54`
cos::1024 f::1 -> f::10	0.0140	0.0012	`12.03`
cos::2048 f::1 -> f::10	0.0280	0.0020	`14.18`
cos::4096 f::1 -> f::10	0.0562	0.0046	`12.13`
cos::1024 f::2 -> f::1	0.0010	0.0009	`1.07`
cos::2048 f::2 -> f::1	0.0019	0.0017	`1.08`
cos::4096 f::2 -> f::1	0.0038	0.0035	`1.09`
cos::1024 f::2 -> f::2	0.0200	0.0012	`16.48`
cos::2048 f::2 -> f::2	0.0400	0.0025	`16.22`
cos::4096 f::2 -> f::2	0.0799	0.0048	`16.82`
cos::1024 f::2 -> f::3	0.0200	0.0012	`16.53`
cos::2048 f::2 -> f::3	0.0400	0.0024	`16.88`
cos::4096 f::2 -> f::3	0.0800	0.0047	`17.02`
cos::1024 f::2 -> f::10	0.0200	0.0013	`15.8`
cos::2048 f::2 -> f::10	0.0400	0.0025	`16.08`
cos::4096 f::2 -> f::10	0.0801	0.0050	`16.02`
cos::1024 f::3 -> f::1	0.0010	0.0009	`1.07`
cos::2048 f::3 -> f::1	0.0019	0.0017	`1.09`
cos::4096 f::3 -> f::1	0.0039	0.0035	`1.11`
cos::1024 f::3 -> f::2	0.0200	0.0012	`16.6`
cos::2048 f::3 -> f::2	0.0400	0.0024	`16.65`
cos::4096 f::3 -> f::2	0.0802	0.0048	`16.53`
cos::1024 f::3 -> f::3	0.0200	0.0012	`16.69`
cos::2048 f::3 -> f::3	0.0400	0.0024	`16.92`
cos::4096 f::3 -> f::3	0.0802	0.0047	`17.09`
cos::1024 f::3 -> f::10	0.0200	0.0013	`15.8`
cos::2048 f::3 -> f::10	0.0400	0.0025	`16.01`
cos::4096 f::3 -> f::10	0.0804	0.0050	`16.23`
cos::1024 f::10 -> f::1	0.0011	0.0010	`1.08`
cos::2048 f::10 -> f::1	0.0021	0.0019	`1.11`
cos::4096 f::10 -> f::1	0.0042	0.0038	`1.11`
cos::1024 f::10 -> f::2	0.0200	0.0013	`15.14`
cos::2048 f::10 -> f::2	0.0400	0.0026	`15.54`
cos::4096 f::10 -> f::2	0.0801	0.0052	`15.46`
cos::1024 f::10 -> f::3	0.0200	0.0013	`14.92`
cos::2048 f::10 -> f::3	0.0400	0.0026	`15.54`
cos::4096 f::10 -> f::3	0.0799	0.0051	`15.59`
cos::1024 f::10 -> f::10	0.0200	0.0014	`14.02`
cos::2048 f::10 -> f::10	0.0400	0.0028	`14.4`
cos::4096 f::10 -> f::10	0.0802	0.0055	`14.49`
sin::1024 f::1 -> f::1	0.0011	0.0011	1.01
sin::2048 f::1 -> f::1	0.0017	0.0017	1.03
sin::4096 f::1 -> f::1	0.0033	0.0032	1.02
sin::1024 f::1 -> f::2	0.0132	0.0013	`10.26`
sin::2048 f::1 -> f::2	0.0264	0.0020	`13.36`
sin::4096 f::1 -> f::2	0.0533	0.0046	`11.55`
sin::1024 f::1 -> f::3	0.0132	0.0013	`10.5`
sin::2048 f::1 -> f::3	0.0267	0.0020	`13.49`
sin::4096 f::1 -> f::3	0.0532	0.0046	`11.61`
sin::1024 f::1 -> f::10	0.0132	0.0014	`9.35`
sin::2048 f::1 -> f::10	0.0264	0.0021	`12.63`
sin::4096 f::1 -> f::10	0.0528	0.0047	`11.31`
sin::1024 f::2 -> f::1	0.0012	0.0011	1.04
sin::2048 f::2 -> f::1	0.0020	0.0019	`1.06`
sin::4096 f::2 -> f::1	0.0038	0.0035	`1.07`
sin::1024 f::2 -> f::2	0.0181	0.0015	`12.21`
sin::2048 f::2 -> f::2	0.0361	0.0023	`15.41`
sin::4096 f::2 -> f::2	0.0723	0.0047	`15.53`
sin::1024 f::2 -> f::3	0.0181	0.0014	`12.73`
sin::2048 f::2 -> f::3	0.0364	0.0023	`15.76`
sin::4096 f::2 -> f::3	0.0723	0.0047	`15.26`
sin::1024 f::2 -> f::10	0.0181	0.0015	`12.2`
sin::2048 f::2 -> f::10	0.0362	0.0024	`14.85`
sin::4096 f::2 -> f::10	0.0724	0.0049	`14.82`
sin::1024 f::3 -> f::1	0.0012	0.0011	1.04
sin::2048 f::3 -> f::1	0.0020	0.0019	1.05
sin::4096 f::3 -> f::1	0.0038	0.0036	`1.08`
sin::1024 f::3 -> f::2	0.0181	0.0015	`12.45`
sin::2048 f::3 -> f::2	0.0362	0.0024	`15.39`
sin::4096 f::3 -> f::2	0.0724	0.0047	`15.44`
sin::1024 f::3 -> f::3	0.0181	0.0014	`12.65`
sin::2048 f::3 -> f::3	0.0365	0.0023	`15.71`
sin::4096 f::3 -> f::3	0.0724	0.0047	`15.26`
sin::1024 f::3 -> f::10	0.0181	0.0015	`12.29`
sin::2048 f::3 -> f::10	0.0362	0.0025	`14.77`
sin::4096 f::3 -> f::10	0.0726	0.0049	`14.92`
sin::1024 f::10 -> f::1	0.0013	0.0012	1.04
sin::2048 f::10 -> f::1	0.0022	0.0021	`1.08`
sin::4096 f::10 -> f::1	0.0042	0.0038	`1.09`
sin::1024 f::10 -> f::2	0.0181	0.0015	`11.79`
sin::2048 f::10 -> f::2	0.0361	0.0025	`14.26`
sin::4096 f::10 -> f::2	0.0725	0.0051	`14.28`
sin::1024 f::10 -> f::3	0.0181	0.0015	`11.81`
sin::2048 f::10 -> f::3	0.0364	0.0025	`14.37`
sin::4096 f::10 -> f::3	0.0723	0.0052	`13.91`
sin::1024 f::10 -> f::10	0.0181	0.0016	`11.17`
sin::2048 f::10 -> f::10	0.0362	0.0027	`13.24`
sin::4096 f::10 -> f::10	0.0725	0.0055	`13.3`
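
Unlike the ASV tables, the ratio column in the standalone tables is before/after, so values above 1 are speedups. The sketch below parses a few rows copied from the table above and recomputes that ratio. The reading of a test name as `<func>::<size> <dtype>::<input stride> -> <dtype>::<output stride>` is an assumption inferred from the `--strides 1 2 3 10` argument, and `parse_row` is a hypothetical helper.

```python
import re

# Assumed layout: "<func>::<size> <in dtype>::<in stride> -> <out dtype>::<out stride>"
NAME_RE = re.compile(r"(\w+)::(\d+) (\w+)::(\d+) -> (\w+)::(\d+)")


def parse_row(line):
    """Split one tab-separated table row and recompute before/after."""
    name, before, after, listed = line.split("\t")
    func, size, _in_t, in_stride, _out_t, out_stride = NAME_RE.fullmatch(name).groups()
    return {
        "func": func,
        "size": int(size),
        "in_stride": int(in_stride),
        "out_stride": int(out_stride),
        "listed": float(listed.strip("`")),  # highlighted ratios carry backticks
        "recomputed": float(before) / float(after),
    }


rows = [
    "cos::1024 f::1 -> f::1\t0.0011\t0.0011\t1.01",
    "cos::1024 f::1 -> f::2\t0.0139\t0.0010\t`14.02`",
    "sin::4096 f::10 -> f::10\t0.0725\t0.0055\t`13.3`",
]
parsed = [parse_row(r) for r in rows]
for p in parsed:
    print(p["func"], p["size"], p["in_stride"], "->", p["out_stride"],
          f"listed={p['listed']} recomputed={p['recomputed']:.2f}")
```

The recomputed values differ slightly from the listed ones because the table shows times rounded to four decimal places, while the listed ratios were presumably computed from the unrounded measurements.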

AVX2 & FMA3 - Contiguous only

metric: gmean, units: ms

name of test	before_contig_avx2_fma3	after_contig_avx2_fma3	after_contig_avx2_fma3 vs before_contig_avx2_fma3
cos::1024 f::1 -> f::1	0.0015	0.0014	1.05
cos::2048 f::1 -> f::1	0.0027	0.0026	1.05
cos::4096 f::1 -> f::1	0.0053	0.0051	1.04
sin::1024 f::1 -> f::1	0.0014	0.0014	1.02
sin::2048 f::1 -> f::1	0.0026	0.0026	1.03
sin::4096 f::1 -> f::1	0.0051	0.0050	1.03

AVX2 & FMA3

metric: gmean, units: ms

name of test	before_avx2_fma3	after_avx2_fma3	after_avx2_fma3 vs before_avx2_fma3
cos::1024 f::1 -> f::1	0.0015	0.0014	1.05
cos::2048 f::1 -> f::1	0.0027	0.0026	1.05
cos::4096 f::1 -> f::1	0.0052	0.0050	1.05
cos::1024 f::1 -> f::2	0.0139	0.0019	`7.24`
cos::2048 f::1 -> f::2	0.0279	0.0037	`7.46`
cos::4096 f::1 -> f::2	0.0556	0.0073	`7.59`
cos::1024 f::1 -> f::3	0.0139	0.0019	`7.2`
cos::2048 f::1 -> f::3	0.0279	0.0037	`7.51`
cos::4096 f::1 -> f::3	0.0552	0.0073	`7.61`
cos::1024 f::1 -> f::10	0.0139	0.0019	`7.39`
cos::2048 f::1 -> f::10	0.0279	0.0037	`7.57`
cos::4096 f::1 -> f::10	0.0557	0.0072	`7.72`
cos::1024 f::2 -> f::1	0.0018	0.0018	0.99
cos::2048 f::2 -> f::1	0.0035	0.0035	0.99
cos::4096 f::2 -> f::1	0.0066	0.0069	0.96
cos::1024 f::2 -> f::2	0.0188	0.0023	`8.2`
cos::2048 f::2 -> f::2	0.0376	0.0048	`7.83`
cos::4096 f::2 -> f::2	0.0750	0.0088	`8.55`
cos::1024 f::2 -> f::3	0.0188	0.0023	`8.34`
cos::2048 f::2 -> f::3	0.0376	0.0044	`8.52`
cos::4096 f::2 -> f::3	0.0751	0.0088	`8.52`
cos::1024 f::2 -> f::10	0.0188	0.0023	`8.28`
cos::2048 f::2 -> f::10	0.0376	0.0045	`8.33`
cos::4096 f::2 -> f::10	0.0752	0.0090	`8.32`
cos::1024 f::3 -> f::1	0.0018	0.0018	1.0
cos::2048 f::3 -> f::1	0.0035	0.0035	1.0
cos::4096 f::3 -> f::1	0.0067	0.0072	`0.93`
cos::1024 f::3 -> f::2	0.0188	0.0022	`8.43`
cos::2048 f::3 -> f::2	0.0375	0.0044	`8.51`
cos::4096 f::3 -> f::2	0.0752	0.0092	`8.15`
cos::1024 f::3 -> f::3	0.0188	0.0023	`8.31`
cos::2048 f::3 -> f::3	0.0376	0.0044	`8.54`
cos::4096 f::3 -> f::3	0.0750	0.0093	`8.1`
cos::1024 f::3 -> f::10	0.0188	0.0024	`7.93`
cos::2048 f::3 -> f::10	0.0375	0.0045	`8.36`
cos::4096 f::3 -> f::10	0.0753	0.0094	`8.04`
cos::1024 f::10 -> f::1	0.0019	0.0020	0.96
cos::2048 f::10 -> f::1	0.0036	0.0037	0.96
cos::4096 f::10 -> f::1	0.0072	0.0073	0.98
cos::1024 f::10 -> f::2	0.0188	0.0025	`7.66`
cos::2048 f::10 -> f::2	0.0375	0.0048	`7.79`
cos::4096 f::10 -> f::2	0.0748	0.0096	`7.8`
cos::1024 f::10 -> f::3	0.0188	0.0025	`7.56`
cos::2048 f::10 -> f::3	0.0376	0.0048	`7.78`
cos::4096 f::10 -> f::3	0.0750	0.0097	`7.74`
cos::1024 f::10 -> f::10	0.0188	0.0025	`7.52`
cos::2048 f::10 -> f::10	0.0375	0.0049	`7.65`
cos::4096 f::10 -> f::10	0.0753	0.0098	`7.69`
sin::1024 f::1 -> f::1	0.0015	0.0014	1.05
sin::2048 f::1 -> f::1	0.0027	0.0025	1.05
sin::4096 f::1 -> f::1	0.0051	0.0048	`1.07`
sin::1024 f::1 -> f::2	0.0139	0.0018	`7.5`
sin::2048 f::1 -> f::2	0.0277	0.0037	`7.51`
sin::4096 f::1 -> f::2	0.0555	0.0071	`7.8`
sin::1024 f::1 -> f::3	0.0138	0.0018	`7.5`
sin::2048 f::1 -> f::3	0.0278	0.0037	`7.6`
sin::4096 f::1 -> f::3	0.0556	0.0072	`7.75`
sin::1024 f::1 -> f::10	0.0139	0.0018	`7.56`
sin::2048 f::1 -> f::10	0.0277	0.0036	`7.67`
sin::4096 f::1 -> f::10	0.0556	0.0071	`7.88`
sin::1024 f::2 -> f::1	0.0018	0.0018	1.02
sin::2048 f::2 -> f::1	0.0034	0.0034	0.99
sin::4096 f::2 -> f::1	0.0065	0.0067	0.97
sin::1024 f::2 -> f::2	0.0190	0.0022	`8.48`
sin::2048 f::2 -> f::2	0.0382	0.0047	`8.1`
sin::4096 f::2 -> f::2	0.0766	0.0086	`8.95`
sin::1024 f::2 -> f::3	0.0190	0.0022	`8.77`
sin::2048 f::2 -> f::3	0.0383	0.0043	`8.84`
sin::4096 f::2 -> f::3	0.0762	0.0087	`8.77`
sin::1024 f::2 -> f::10	0.0191	0.0022	`8.68`
sin::2048 f::2 -> f::10	0.0382	0.0044	`8.6`
sin::4096 f::2 -> f::10	0.0762	0.0087	`8.72`
sin::1024 f::3 -> f::1	0.0018	0.0018	1.02
sin::2048 f::3 -> f::1	0.0034	0.0034	0.99
sin::4096 f::3 -> f::1	0.0066	0.0067	0.98
sin::1024 f::3 -> f::2	0.0191	0.0022	`8.77`
sin::2048 f::3 -> f::2	0.0382	0.0044	`8.77`
sin::4096 f::3 -> f::2	0.0761	0.0086	`8.87`
sin::1024 f::3 -> f::3	0.0191	0.0022	`8.87`
sin::2048 f::3 -> f::3	0.0383	0.0043	`8.87`
sin::4096 f::3 -> f::3	0.0760	0.0087	`8.78`
sin::1024 f::3 -> f::10	0.0191	0.0022	`8.76`
sin::2048 f::3 -> f::10	0.0383	0.0044	`8.75`
sin::4096 f::3 -> f::10	0.0761	0.0088	`8.66`
sin::1024 f::10 -> f::1	0.0019	0.0018	1.02
sin::2048 f::10 -> f::1	0.0035	0.0035	1.0
sin::4096 f::10 -> f::1	0.0068	0.0069	0.99
sin::1024 f::10 -> f::2	0.0191	0.0022	`8.63`
sin::2048 f::10 -> f::2	0.0381	0.0045	`8.56`
sin::4096 f::10 -> f::2	0.0765	0.0088	`8.74`
sin::1024 f::10 -> f::3	0.0192	0.0022	`8.69`
sin::2048 f::10 -> f::3	0.0382	0.0044	`8.6`
sin::4096 f::10 -> f::3	0.0765	0.0089	`8.64`
sin::1024 f::10 -> f::10	0.0191	0.0022	`8.52`
sin::2048 f::10 -> f::10	0.0382	0.0045	`8.46`
sin::4096 f::10 -> f::10	0.0766	0.0090	`8.55`

ARM8 64-bit

CPU

Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       ARM
Model:                           1
Model name:                      Neoverse-N1
Stepping:                        r3p1
BogoMIPS:                        243.75
L1d cache:                       256 KiB
L1i cache:                       256 KiB
L2 cache:                        4 MiB
L3 cache:                        32 MiB
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

OS

Linux ip-172-31-6-63 5.4.0-1024-aws #24-Ubuntu SMP Sat Sep 5 06:17:48 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
gcc-7 (Ubuntu/Linaro 7.5.0-6ubuntu2) 7.5.0

Benchmark

ASIMD - Contiguous only

metric: gmean, units: ms

name of test	before_contig	after_contig	after_contig vs before_contig
cos::1024 f::1 -> f::1	0.0072	0.0037	`1.93`
cos::2048 f::1 -> f::1	0.0149	0.0074	`2.0`
cos::4096 f::1 -> f::1	0.0313	0.0149	`2.11`
sin::1024 f::1 -> f::1	0.0072	0.0037	`1.97`
sin::2048 f::1 -> f::1	0.0148	0.0073	`2.03`
sin::4096 f::1 -> f::1	0.0305	0.0146	`2.09`

ASIMD

metric: gmean, units: ms

name of test	before	after	after vs before
cos::1024 f::1 -> f::1	0.0057	0.0037	`1.53`
cos::2048 f::1 -> f::1	0.0125	0.0074	`1.68`
cos::4096 f::1 -> f::1	0.0260	0.0149	`1.75`
cos::1024 f::1 -> f::2	0.0057	0.0042	`1.37`
cos::2048 f::1 -> f::2	0.0124	0.0083	`1.49`
cos::4096 f::1 -> f::2	0.0260	0.0166	`1.56`
cos::1024 f::1 -> f::3	0.0057	0.0042	`1.37`
cos::2048 f::1 -> f::3	0.0124	0.0083	`1.49`
cos::4096 f::1 -> f::3	0.0260	0.0167	`1.56`
cos::1024 f::1 -> f::10	0.0057	0.0042	`1.37`
cos::2048 f::1 -> f::10	0.0127	0.0086	`1.48`
cos::4096 f::1 -> f::10	0.0262	0.0167	`1.57`
cos::1024 f::2 -> f::1	0.0060	0.0040	`1.5`
cos::2048 f::2 -> f::1	0.0125	0.0080	`1.56`
cos::4096 f::2 -> f::1	0.0261	0.0160	`1.63`
cos::1024 f::2 -> f::2	0.0060	0.0044	`1.36`
cos::2048 f::2 -> f::2	0.0125	0.0088	`1.42`
cos::4096 f::2 -> f::2	0.0261	0.0177	`1.47`
cos::1024 f::2 -> f::3	0.0061	0.0044	`1.37`
cos::2048 f::2 -> f::3	0.0125	0.0088	`1.41`
cos::4096 f::2 -> f::3	0.0262	0.0177	`1.48`
cos::1024 f::2 -> f::10	0.0060	0.0044	`1.36`
cos::2048 f::2 -> f::10	0.0126	0.0089	`1.42`
cos::4096 f::2 -> f::10	0.0264	0.0177	`1.49`
cos::1024 f::3 -> f::1	0.0057	0.0042	`1.35`
cos::2048 f::3 -> f::1	0.0126	0.0084	`1.51`
cos::4096 f::3 -> f::1	0.0265	0.0168	`1.57`
cos::1024 f::3 -> f::2	0.0057	0.0047	`1.22`
cos::2048 f::3 -> f::2	0.0126	0.0093	`1.36`
cos::4096 f::3 -> f::2	0.0265	0.0187	`1.42`
cos::1024 f::3 -> f::3	0.0057	0.0047	`1.22`
cos::2048 f::3 -> f::3	0.0127	0.0093	`1.36`
cos::4096 f::3 -> f::3	0.0266	0.0187	`1.42`
cos::1024 f::3 -> f::10	0.0057	0.0047	`1.22`
cos::2048 f::3 -> f::10	0.0128	0.0094	`1.37`
cos::4096 f::3 -> f::10	0.0266	0.0187	`1.43`
cos::1024 f::10 -> f::1	0.0060	0.0048	`1.26`
cos::2048 f::10 -> f::1	0.0125	0.0095	`1.31`
cos::4096 f::10 -> f::1	0.0263	0.0190	`1.38`
cos::1024 f::10 -> f::2	0.0061	0.0051	`1.2`
cos::2048 f::10 -> f::2	0.0125	0.0102	`1.23`
cos::4096 f::10 -> f::2	0.0263	0.0204	`1.29`
cos::1024 f::10 -> f::3	0.0061	0.0051	`1.18`
cos::2048 f::10 -> f::3	0.0125	0.0102	`1.22`
cos::4096 f::10 -> f::3	0.0263	0.0204	`1.29`
cos::1024 f::10 -> f::10	0.0061	0.0052	`1.16`
cos::2048 f::10 -> f::10	0.0126	0.0102	`1.23`
cos::4096 f::10 -> f::10	0.0264	0.0206	`1.28`
sin::1024 f::1 -> f::1	0.0073	0.0037	`2.0`
sin::2048 f::1 -> f::1	0.0147	0.0073	`2.01`
sin::4096 f::1 -> f::1	0.0300	0.0146	`2.06`
sin::1024 f::1 -> f::2	0.0074	0.0041	`1.79`
sin::2048 f::1 -> f::2	0.0146	0.0082	`1.78`
sin::4096 f::1 -> f::2	0.0300	0.0164	`1.83`
sin::1024 f::1 -> f::3	0.0073	0.0041	`1.79`
sin::2048 f::1 -> f::3	0.0146	0.0082	`1.78`
sin::4096 f::1 -> f::3	0.0300	0.0164	`1.83`
sin::1024 f::1 -> f::10	0.0073	0.0041	`1.78`
sin::2048 f::1 -> f::10	0.0147	0.0085	`1.74`
sin::4096 f::1 -> f::10	0.0301	0.0164	`1.83`
sin::1024 f::2 -> f::1	0.0072	0.0039	`1.85`
sin::2048 f::2 -> f::1	0.0147	0.0078	`1.89`
sin::4096 f::2 -> f::1	0.0301	0.0156	`1.93`
sin::1024 f::2 -> f::2	0.0072	0.0044	`1.65`
sin::2048 f::2 -> f::2	0.0147	0.0088	`1.68`
sin::4096 f::2 -> f::2	0.0301	0.0176	`1.71`
sin::1024 f::2 -> f::3	0.0073	0.0044	`1.66`
sin::2048 f::2 -> f::3	0.0147	0.0088	`1.68`
sin::4096 f::2 -> f::3	0.0302	0.0176	`1.72`
sin::1024 f::2 -> f::10	0.0073	0.0044	`1.66`
sin::2048 f::2 -> f::10	0.0148	0.0088	`1.68`
sin::4096 f::2 -> f::10	0.0302	0.0175	`1.72`
sin::1024 f::3 -> f::1	0.0073	0.0042	`1.75`
sin::2048 f::3 -> f::1	0.0146	0.0083	`1.76`
sin::4096 f::3 -> f::1	0.0299	0.0167	`1.79`
sin::1024 f::3 -> f::2	0.0073	0.0046	`1.59`
sin::2048 f::3 -> f::2	0.0146	0.0091	`1.6`
sin::4096 f::3 -> f::2	0.0299	0.0183	`1.63`
sin::1024 f::3 -> f::3	0.0073	0.0046	`1.57`
sin::2048 f::3 -> f::3	0.0147	0.0091	`1.6`
sin::4096 f::3 -> f::3	0.0299	0.0183	`1.64`
sin::1024 f::3 -> f::10	0.0073	0.0046	`1.59`
sin::2048 f::3 -> f::10	0.0147	0.0092	`1.61`
sin::4096 f::3 -> f::10	0.0300	0.0183	`1.64`
sin::1024 f::10 -> f::1	0.0073	0.0047	`1.57`
sin::2048 f::10 -> f::1	0.0147	0.0094	`1.57`
sin::4096 f::10 -> f::1	0.0301	0.0187	`1.61`
sin::1024 f::10 -> f::2	0.0073	0.0050	`1.45`
sin::2048 f::10 -> f::2	0.0146	0.0101	`1.45`
sin::4096 f::10 -> f::2	0.0301	0.0201	`1.5`
sin::1024 f::10 -> f::3	0.0073	0.0050	`1.46`
sin::2048 f::10 -> f::3	0.0146	0.0101	`1.45`
sin::4096 f::10 -> f::3	0.0301	0.0201	`1.5`
sin::1024 f::10 -> f::10	0.0074	0.0051	`1.45`
sin::2048 f::10 -> f::10	0.0147	0.0101	`1.45`
sin::4096 f::10 -> f::10	0.0302	0.0201	`1.5`

Power little-endian

CPU

Architecture:                    ppc64le
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       8
NUMA node(s):                    1
Model:                           2.2 (pvr 004e 1202)
Model name:                      POWER9 (architected), altivec supported
L1d cache:                       256 KiB
L1i cache:                       256 KiB
NUMA node0 CPU(s):               0-7
Vulnerability L1tf:              Not affected
Vulnerability Meltdown:          Mitigation; RFI Flush
Vulnerability Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable

processor   : 7
cpu     : POWER9 (architected), altivec supported
clock       : 2200.000000MHz
revision    : 2.2 (pvr 004e 1202)

timebase    : 512000000
platform    : pSeries
model       : IBM pSeries (emulated by qemu)
machine     : CHRP IBM pSeries (emulated by qemu)
MMU     : Radix

OS

Linux 8b2db3b0dfac 4.19.0-2-powerpc64le
gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)

Benchmark

VSX2(ISA >= 2.07) - Contiguous only

metric: gmean, units: ms

name of test	before_contig	after_contig	after_contig vs before_contig
cos::1024 f::1 -> f::1	0.0131	0.0044	`2.94`
cos::2048 f::1 -> f::1	0.0265	0.0089	`2.99`
cos::4096 f::1 -> f::1	0.0535	0.0176	`3.03`
sin::1024 f::1 -> f::1	0.0134	0.0042	`3.16`
sin::2048 f::1 -> f::1	0.0265	0.0085	`3.13`
sin::4096 f::1 -> f::1	0.0542	0.0169	`3.2`

VSX2(ISA >= 2.07)

metric: gmean, units: ms

name of test	before	after	after vs before
cos::1024 f::1 -> f::1	0.0127	0.0044	`2.86`
cos::2048 f::1 -> f::1	0.0264	0.0088	`2.99`
cos::4096 f::1 -> f::1	0.0538	0.0184	`2.92`
cos::1024 f::1 -> f::2	0.0129	0.0048	`2.7`
cos::2048 f::1 -> f::2	0.0268	0.0095	`2.83`
cos::4096 f::1 -> f::2	0.0546	0.0189	`2.88`
cos::1024 f::1 -> f::3	0.0129	0.0047	`2.72`
cos::2048 f::1 -> f::3	0.0268	0.0095	`2.82`
cos::4096 f::1 -> f::3	0.0547	0.0189	`2.89`
cos::1024 f::1 -> f::10	0.0129	0.0047	`2.72`
cos::2048 f::1 -> f::10	0.0269	0.0095	`2.84`
cos::4096 f::1 -> f::10	0.0546	0.0190	`2.87`
cos::1024 f::2 -> f::1	0.0131	0.0048	`2.73`
cos::2048 f::2 -> f::1	0.0266	0.0095	`2.79`
cos::4096 f::2 -> f::1	0.0547	0.0191	`2.87`
cos::1024 f::2 -> f::2	0.0130	0.0051	`2.55`
cos::2048 f::2 -> f::2	0.0266	0.0102	`2.61`
cos::4096 f::2 -> f::2	0.0544	0.0210	`2.58`
cos::1024 f::2 -> f::3	0.0131	0.0051	`2.56`
cos::2048 f::2 -> f::3	0.0281	0.0103	`2.74`
cos::4096 f::2 -> f::3	0.0544	0.0208	`2.62`
cos::1024 f::2 -> f::10	0.0131	0.0051	`2.55`
cos::2048 f::2 -> f::10	0.0266	0.0102	`2.6`
cos::4096 f::2 -> f::10	0.0544	0.0205	`2.65`
cos::1024 f::3 -> f::1	0.0130	0.0048	`2.7`
cos::2048 f::3 -> f::1	0.0271	0.0096	`2.84`
cos::4096 f::3 -> f::1	0.0543	0.0191	`2.83`
cos::1024 f::3 -> f::2	0.0129	0.0051	`2.53`
cos::2048 f::3 -> f::2	0.0271	0.0102	`2.65`
cos::4096 f::3 -> f::2	0.0543	0.0205	`2.65`
cos::1024 f::3 -> f::3	0.0130	0.0051	`2.53`
cos::2048 f::3 -> f::3	0.0272	0.0102	`2.66`
cos::4096 f::3 -> f::3	0.0543	0.0205	`2.65`
cos::1024 f::3 -> f::10	0.0130	0.0053	`2.46`
cos::2048 f::3 -> f::10	0.0280	0.0102	`2.73`
cos::4096 f::3 -> f::10	0.0563	0.0204	`2.75`
cos::1024 f::10 -> f::1	0.0133	0.0048	`2.76`
cos::2048 f::10 -> f::1	0.0265	0.0096	`2.77`
cos::4096 f::10 -> f::1	0.0551	0.0191	`2.89`
cos::1024 f::10 -> f::2	0.0133	0.0051	`2.6`
cos::2048 f::10 -> f::2	0.0266	0.0102	`2.59`
cos::4096 f::10 -> f::2	0.0552	0.0205	`2.7`
cos::1024 f::10 -> f::3	0.0133	0.0051	`2.59`
cos::2048 f::10 -> f::3	0.0266	0.0102	`2.59`
cos::4096 f::10 -> f::3	0.0552	0.0205	`2.7`
cos::1024 f::10 -> f::10	0.0133	0.0051	`2.58`
cos::2048 f::10 -> f::10	0.0265	0.0102	`2.59`
cos::4096 f::10 -> f::10	0.0552	0.0205	`2.7`
sin::1024 f::1 -> f::1	0.0134	0.0042	`3.16`
sin::2048 f::1 -> f::1	0.0271	0.0085	`3.2`
sin::4096 f::1 -> f::1	0.0535	0.0169	`3.17`
sin::1024 f::1 -> f::2	0.0133	0.0046	`2.9`
sin::2048 f::1 -> f::2	0.0268	0.0091	`2.93`
sin::4096 f::1 -> f::2	0.0530	0.0183	`2.9`
sin::1024 f::1 -> f::3	0.0133	0.0046	`2.9`
sin::2048 f::1 -> f::3	0.0268	0.0091	`2.94`
sin::4096 f::1 -> f::3	0.0530	0.0188	`2.82`
sin::1024 f::1 -> f::10	0.0133	0.0047	`2.83`
sin::2048 f::1 -> f::10	0.0268	0.0094	`2.87`
sin::4096 f::1 -> f::10	0.0530	0.0183	`2.9`
sin::1024 f::2 -> f::1	0.0131	0.0046	`2.87`
sin::2048 f::2 -> f::1	0.0264	0.0092	`2.89`
sin::4096 f::2 -> f::1	0.0530	0.0183	`2.9`
sin::1024 f::2 -> f::2	0.0131	0.0050	`2.65`
sin::2048 f::2 -> f::2	0.0264	0.0099	`2.68`
sin::4096 f::2 -> f::2	0.0531	0.0198	`2.68`
sin::1024 f::2 -> f::3	0.0131	0.0049	`2.66`
sin::2048 f::2 -> f::3	0.0264	0.0099	`2.68`
sin::4096 f::2 -> f::3	0.0530	0.0198	`2.68`
sin::1024 f::2 -> f::10	0.0131	0.0050	`2.65`
sin::2048 f::2 -> f::10	0.0265	0.0099	`2.68`
sin::4096 f::2 -> f::10	0.0531	0.0197	`2.69`
sin::1024 f::3 -> f::1	0.0130	0.0046	`2.82`
sin::2048 f::3 -> f::1	0.0263	0.0092	`2.86`
sin::4096 f::3 -> f::1	0.0532	0.0183	`2.9`
sin::1024 f::3 -> f::2	0.0130	0.0050	`2.61`
sin::2048 f::3 -> f::2	0.0263	0.0099	`2.65`
sin::4096 f::3 -> f::2	0.0533	0.0198	`2.69`
sin::1024 f::3 -> f::3	0.0136	0.0050	`2.74`
sin::2048 f::3 -> f::3	0.0263	0.0099	`2.65`
sin::4096 f::3 -> f::3	0.0533	0.0198	`2.69`
sin::1024 f::3 -> f::10	0.0130	0.0050	`2.61`
sin::2048 f::3 -> f::10	0.0263	0.0099	`2.66`
sin::4096 f::3 -> f::10	0.0532	0.0198	`2.69`
sin::1024 f::10 -> f::1	0.0128	0.0046	`2.78`
sin::2048 f::10 -> f::1	0.0264	0.0092	`2.88`
sin::4096 f::10 -> f::1	0.0537	0.0184	`2.91`
sin::1024 f::10 -> f::2	0.0133	0.0050	`2.67`
sin::2048 f::10 -> f::2	0.0264	0.0099	`2.66`
sin::4096 f::10 -> f::2	0.0537	0.0198	`2.72`
sin::1024 f::10 -> f::3	0.0128	0.0050	`2.58`
sin::2048 f::10 -> f::3	0.0264	0.0099	`2.67`
sin::4096 f::10 -> f::3	0.0537	0.0198	`2.71`
sin::1024 f::10 -> f::10	0.0128	0.0050	`2.58`
sin::2048 f::10 -> f::10	0.0264	0.0099	`2.67`
sin::4096 f::10 -> f::10	0.0537	0.0198	`2.71`

numpy/core/src/umath/loops_trig.dispatch.c.src

seiko2plus · 2020-11-17T05:50:25Z

The new NPYV intrinsics have been moved to separate pull requests: #17789 and #17790.

numpy/core/src/umath/loops_trig.dispatch.c.src

seiko2plus · 2020-12-26T10:50:35Z

ping @mattip

seiko2plus · 2020-12-26T11:05:23Z

numpy/core/src/umath/loops_trigonometric.dispatch.c.src

@@ -0,0 +1,230 @@
+/*@targets
+ ** $maxopt $werror baseline


Suggested change

- ** $maxopt $werror baseline
+ ** $maxopt baseline

Stop treating warnings as errors once the CI passes the tests.

mattip · 2020-12-26

CI is passing.

seiko2plus · 2020-12-26

Done. I used this policy temporarily during development to catch any warnings.

numpy/core/code_generators/generate_umath.py

mattip · 2020-12-26T16:18:20Z

Nice speedups. Is this for 32-bit float only or also for 64-bit?

Edit: 32 bit only.

seiko2plus · 2020-12-26T16:38:04Z

@mattip, I only replaced the raw SIMD code for f32 with NPYV.

mattip · 2020-12-26T18:00:17Z

Thanks @seiko2plus

seiko2plus force-pushed the to_npyv_sincos_f32 branch 12 times, most recently from 7161e30 to e01dc6e on October 21, 2020 10:25

seiko2plus marked this pull request as ready for review October 21, 2020 10:25

seiko2plus commented Oct 21, 2020

numpy/core/src/umath/loops_trig.dispatch.c.src (outdated)

seiko2plus commented Oct 21, 2020

numpy/core/src/umath/loops_trig.dispatch.c.src (outdated)

seiko2plus commented Oct 21, 2020

numpy/core/src/umath/loops_trig.dispatch.c.src (outdated)

seiko2plus mentioned this pull request Oct 21, 2020

SIMD: Add partial/non-contig load and store intrinsics for 32/64-bit #17340

Merged

seiko2plus force-pushed the to_npyv_sincos_f32 branch 2 times, most recently from 1163a6d to 84c4c2d on October 25, 2020 14:04

Qiyu8 reviewed Oct 26, 2020

numpy/core/src/common/simd/neon/math.h (outdated)

seiko2plus force-pushed the to_npyv_sincos_f32 branch from 84c4c2d to 8900a72 on October 26, 2020 17:20

seiko2plus commented Oct 27, 2020

numpy/core/src/umath/loops_trig.dispatch.c.src (outdated)

seiko2plus force-pushed the to_npyv_sincos_f32 branch 2 times, most recently from 518fd92 to 2a01e5f on November 1, 2020 16:39

charris added the "01 - Enhancement" and "component: SIMD" labels Nov 2, 2020

seiko2plus force-pushed the to_npyv_sincos_f32 branch 2 times, most recently from 360472c to bb08eb2 on November 11, 2020 03:40

This was referenced Nov 14, 2020

SIMD, BUG: fix reuses the previous values during the fallback to libc #17763

Merged

ENH, SIMD: Add new NPYV intrinsics pack(0) #17789

Merged

seiko2plus mentioned this pull request Nov 17, 2020

ENH, SIMD: Add new NPYV intrinsics pack(1) #17790

Merged

seiko2plus force-pushed the to_npyv_sincos_f32 branch from bb08eb2 to 8f829c9 on November 17, 2020 05:45

seiko2plus commented Nov 17, 2020

numpy/core/src/umath/loops_trig.dispatch.c.src (outdated)

seiko2plus mentioned this pull request Nov 17, 2020

BLD: Enable Werror=undef in travis #17791

Merged

seiko2plus force-pushed the to_npyv_sincos_f32 branch 7 times, most recently from b958d43 to a0322ee on December 26, 2020 10:48

seiko2plus commented Dec 26, 2020

seiko2plus mentioned this pull request Dec 26, 2020

ENH: libdivide for unsigned integers #18055

Closed

mattip reviewed Dec 26, 2020

numpy/core/code_generators/generate_umath.py (outdated)

seiko2plus added 3 commits December 26, 2020 16:32

SIMD: Replace raw SIMD of sin/cos with NPYV

b162886

The new code improves the performance of non-contiguous memory access for the output array without any reduction in performance. For PPC64LE the performance increased by 2-3.0, and 1.5-2.0 on aarch64.

MAINT: Suppress maybe-uninitialized warning in gcc on VSX

968288a

BENCH: Rename bench_avx.py to bench_ufunc_strides.py

1470654

This test should not be exclusive to AVX. This patch also extends the unary test to cover different sets of output strides.
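
As a rough illustration of what covering different input and output strides means, the sketch below models a strided unary loop in pure Python and checks it over a grid of stride pairs. It is not the code in bench_ufunc_strides.py: `strided_unary` and `run_case` are hypothetical names, strides are counted in elements (NumPy's C inner loops receive byte steps), and the stride values 1, 2, 3 and 10 are simply the ones that appear in the benchmark labels in this thread.

```python
import math


def strided_unary(func, src, istride, dst, ostride, n):
    """Apply func to n items, reading every istride-th element of src
    and writing every ostride-th element of dst."""
    for i in range(n):
        dst[i * ostride] = func(src[i * istride])


def run_case(n, istride, ostride):
    """Build buffers large enough for n strided items and run sin over them."""
    src = [0.001 * k for k in range(n * istride)]
    dst = [0.0] * (n * ostride)
    strided_unary(math.sin, src, istride, dst, ostride, n)
    return src, dst


STRIDES = (1, 2, 3, 10)  # assumed grid, taken from the benchmark labels
results = {}
for istride in STRIDES:
    for ostride in STRIDES:
        src, dst = run_case(8, istride, ostride)
        expected = [math.sin(x) for x in src[::istride]]
        # Only the strided output slots should hold results.
        results[(istride, ostride)] = dst[::ostride] == expected

print(all(results.values()))
```

A real benchmark would time each stride pair instead of checking equality. The point here is only that an output stride greater than 1 leaves gaps between written elements, which is the non-contiguous store pattern the first commit message refers to.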

seiko2plus force-pushed the to_npyv_sincos_f32 branch from a0322ee to 1470654 on December 26, 2020 16:32

mattip merged commit ce82028 into numpy:master Dec 26, 2020

seiko2plus deleted the to_npyv_sincos_f32 branch January 9, 2021 16:52

mattip mentioned this pull request Jan 24, 2023

ENH: Add support SLEEF for transcendental functions #23068

Open
