SIMD: Replace raw SIMD of sin/cos with NPYV(universal intrinsics) #17587
ping @mattip
@@ -0,0 +1,230 @@
/*@targets
** $maxopt $werror baseline
- ** $maxopt $werror baseline
+ ** $maxopt baseline
Remove treating warnings as errors once CI passes the tests.
CI is passing
Done. I used this policy temporarily during development to detect any warnings.
Nice speedups. Is this for 32-bit float only or also for 64-bit? Edit: 32-bit only.
The new code improves the performance of non-contiguous memory access for the output array without any performance regression. On PPC64LE performance improved by 2.0-3.0x, and on aarch64 by 1.5-2.0x.
This test should not be exclusive to AVX. This patch also extends the unary test to cover different sets of output strides.
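As a rough illustration of what covering different output strides means, the sketch below writes float32 results into non-contiguous views of a larger buffer and compares them with a contiguous run. It is not the test added by this patch: the helper name `check_output_strides`, the input range, and the tolerances are assumptions chosen for illustration.

```python
import numpy as np


def check_output_strides(ufunc, n=1024, strides=(1, 2, 3, 4)):
    """Return True if strided-output results match a contiguous run.

    Hypothetical helper for illustration; not part of NumPy's test suite.
    """
    x = np.linspace(-10.0, 10.0, n, dtype=np.float32)
    expected = ufunc(x)  # contiguous input, freshly allocated contiguous output
    for s in strides:
        buf = np.zeros(n * s, dtype=np.float32)
        out = buf[::s]  # view with an element stride of s (non-contiguous for s > 1)
        ufunc(x, out=out)  # the ufunc writes directly into the strided view
        # Loose tolerance (assumed): a few float32 ULPs for values in [-1, 1]
        if not np.allclose(out, expected, rtol=1e-6, atol=1e-6):
            return False
    return True
```

Calling `check_output_strides(np.sin)` and `check_output_strides(np.cos)` exercises both functions touched by this PR over the same set of output strides.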
@mattip, just replaced the raw SIMD code of f32 with NPYV.
Thanks @seiko2plus
Merge after #17790, #17789
SIMD: Replace raw SIMD of sin/cos with NPYV
The new code improves the performance of non-contiguous memory access
for the output array without any performance regression.
On PPC64LE performance improved by 2.0-3.0x, and on aarch64 by 1.5-2.0x.
TODO:
Performance tests(ASV)
Args
X86
I had to run the benchmarks on my local machine because I couldn't get stable ratios on AWS.
See the standalone benchmark for AVX512F.
CPU
OS
Benchmark
AVX2 & FMA3 - Changed only
Power little-endian
CPU
OS
Benchmark
VSX2(ISA >= 2.07) - Changed only
Performance tests(standalone #15987)
Args used within #15987
Note: --msleep 1 forces the running thread to sleep for 1 millisecond before collecting each sample, to recover from any frequency reduction, since throttling appears to affect wall time when AVX512F is enabled.
X86
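The sampling scheme described in the --msleep note, together with the gmean metric used in the results, can be sketched as follows. This is not the standalone benchmark from #15987: `sample_wall_times` and `gmean_ratio` are hypothetical helpers, and the before/after direction of the ratio is an assumption.

```python
import statistics
import time


def sample_wall_times(func, samples=5, msleep=1.0):
    """Time func() `samples` times, sleeping `msleep` ms before each sample."""
    times_ms = []
    for _ in range(samples):
        # Sleep first so the core can recover from any frequency reduction
        time.sleep(msleep / 1000.0)
        start = time.perf_counter()
        func()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return times_ms


def gmean_ratio(before_ms, after_ms):
    """Ratio of geometric means; a value above 1 means `after` is faster."""
    return statistics.geometric_mean(before_ms) / statistics.geometric_mean(after_ms)
```

The geometric mean is less sensitive to a single slow outlier sample than the arithmetic mean, which helps when comparing ratios across runs.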
CPU
OS
Benchmark
AVX512F - Contiguous only
metric: gmean, units: ms
1.13
1.07
AVX512F
metric: gmean, units: ms
14.02
14.76
12.17
13.88
14.76
12.54
12.03
14.18
12.13
1.07
1.08
1.09
16.48
16.22
16.82
16.53
16.88
17.02
15.8
16.08
16.02
1.07
1.09
1.11
16.6
16.65
16.53
16.69
16.92
17.09
15.8
16.01
16.23
1.08
1.11
1.11
15.14
15.54
15.46
14.92
15.54
15.59
14.02
14.4
14.49
10.26
13.36
11.55
10.5
13.49
11.61
9.35
12.63
11.31
1.06
1.07
12.21
15.41
15.53
12.73
15.76
15.26
12.2
14.85
14.82
1.08
12.45
15.39
15.44
12.65
15.71
15.26
12.29
14.77
14.92
1.08
1.09
11.79
14.26
14.28
11.81
14.37
13.91
11.17
13.24
13.3
AVX2 & FMA3 - Contiguous only
metric: gmean, units: ms
AVX2 & FMA3
metric: gmean, units: ms
7.24
7.46
7.59
7.2
7.51
7.61
7.39
7.57
7.72
8.2
7.83
8.55
8.34
8.52
8.52
8.28
8.33
8.32
0.93
8.43
8.51
8.15
8.31
8.54
8.1
7.93
8.36
8.04
7.66
7.79
7.8
7.56
7.78
7.74
7.52
7.65
7.69
1.07
7.5
7.51
7.8
7.5
7.6
7.75
7.56
7.67
7.88
8.48
8.1
8.95
8.77
8.84
8.77
8.68
8.6
8.72
8.77
8.77
8.87
8.87
8.87
8.78
8.76
8.75
8.66
8.63
8.56
8.74
8.69
8.6
8.64
8.52
8.46
8.55
ARM8 64-bit
CPU
OS
Benchmark
ASIMD - Contiguous only
metric: gmean, units: ms
1.93
2.0
2.11
1.97
2.03
2.09
ASIMD
metric: gmean, units: ms
1.53
1.68
1.75
1.37
1.49
1.56
1.37
1.49
1.56
1.37
1.48
1.57
1.5
1.56
1.63
1.36
1.42
1.47
1.37
1.41
1.48
1.36
1.42
1.49
1.35
1.51
1.57
1.22
1.36
1.42
1.22
1.36
1.42
1.22
1.37
1.43
1.26
1.31
1.38
1.2
1.23
1.29
1.18
1.22
1.29
1.16
1.23
1.28
2.0
2.01
2.06
1.79
1.78
1.83
1.79
1.78
1.83
1.78
1.74
1.83
1.85
1.89
1.93
1.65
1.68
1.71
1.66
1.68
1.72
1.66
1.68
1.72
1.75
1.76
1.79
1.59
1.6
1.63
1.57
1.6
1.64
1.59
1.61
1.64
1.57
1.57
1.61
1.45
1.45
1.5
1.46
1.45
1.5
1.45
1.45
1.5
Power little-endian
CPU
OS
Benchmark
VSX2(ISA >= 2.07) - Contiguous only
metric: gmean, units: ms
2.94
2.99
3.03
3.16
3.13
3.2
VSX2(ISA >= 2.07)
metric: gmean, units: ms
2.86
2.99
2.92
2.7
2.83
2.88
2.72
2.82
2.89
2.72
2.84
2.87
2.73
2.79
2.87
2.55
2.61
2.58
2.56
2.74
2.62
2.55
2.6
2.65
2.7
2.84
2.83
2.53
2.65
2.65
2.53
2.66
2.65
2.46
2.73
2.75
2.76
2.77
2.89
2.6
2.59
2.7
2.59
2.59
2.7
2.58
2.59
2.7
3.16
3.2
3.17
2.9
2.93
2.9
2.9
2.94
2.82
2.83
2.87
2.9
2.87
2.89
2.9
2.65
2.68
2.68
2.66
2.68
2.68
2.65
2.68
2.69
2.82
2.86
2.9
2.61
2.65
2.69
2.74
2.65
2.69
2.61
2.66
2.69
2.78
2.88
2.91
2.67
2.66
2.72
2.58
2.67
2.71
2.58
2.67
2.71