ENH:Umath Replace raw SIMD of unary float point(32-64) with NPYV - g0 #16247
Conversation
@seiko2plus this needs a redo now that the infrastructure is in place.
@seiko2plus Ping. Looks like the recent header change merges reguires a rebase. Any possibility of breaking out the fixes required for the 32 bit wheel builds? |
@charris, I'm going to rebase it and push the local changes just after I'm done with #16782.
@charris,

> Any possibility of breaking out the fixes required for the 32 bit wheel builds?

No, I have to implement a SIMD kernel that handles scalars as well as vectors; just give me a week.
// a * (1/√a)
npyv_f32 sqrt = vmulq_f32(a, rsqrte);
// return zero if a is zero
npyv_u32 bits = vbicq_u32(vreinterpretq_u32_f32(sqrt), vceqq_f32(a, zero));
The OpenCV SIMD sqrt is here. But you are right: if we don't check for zero, the result becomes a NaN.
Thanks. Via #16782 I caught two issues here: positive infinity (same as zero) and precision.
Unfortunately, I had to add a third Newton-Raphson iteration to provide acceptable precision.
Can you add a test for that failure?
I added a check for all sqrt special cases, see:
https://github.com/numpy/numpy/blob/1f872658984b2f8b0fda7022e72ad333a62864f3/numpy/core/tests/test_simd.py#L184-L188
I also made another fix to npyv_sqrt_f32() in the last push, which fixes the floating-point division-by-zero error raised by vrsqrteq_f32(x) when x is zero.
Now all the tests pass on armhf.
          TD(O, f='Py_square'),
          ),
'reciprocal':
    Ufunc(1, 1, None,
          docstrings.get('numpy.core.umath.reciprocal'),
          None,
-         TD(ints+inexact, simd=[('avx2', ints), ('fma', 'fd'), ('avx512f','fd')]),
+         TD(ints+inexact, simd=[('avx2', ints)], dispatch=[('loops_unary_fp', 'fd')]),
This will fix the failing 32-bit wheel build?
This change activates the new dispatcher.
The 32-bit wheel build fails due to an aggressive gcc optimization that doesn't respect division by zero; this issue also existed on 64-bit when AVX2 and AVX512F aren't enabled.
The fix is mainly here:
https://github.com/numpy/numpy/blob/1f872658984b2f8b0fda7022e72ad333a62864f3/numpy/core/src/umath/loops_unary_fp.dispatch.c.src#L129-L133
where the partial load intrinsics npyv_load_till_* and npyv_loadn_till_* guarantee adding "one" to the tail of the vector.
I also made a slight change in the last push to use the NPYV version for overlapped arrays, also to guarantee the same precision on armhf.
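To illustrate the idea behind the partial load (a scalar sketch with a hypothetical `load_till_f32` helper, not the real intrinsic): loading fewer elements than the vector width fills the unused tail lanes with a caller-chosen value, so a subsequent full-width `1/x` cannot divide by zero or garbage in those lanes.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4

/* Scalar emulation of an npyv_load_till_* style partial load:
 * copy up to `nlane` elements from `src` and pad the remaining
 * lanes with `fill`.  Filling the tail with 1.0f lets a full-width
 * reciprocal run over all lanes without raising a spurious
 * divide-by-zero on the lanes past the end of the array. */
static void load_till_f32(float dst[LANES], const float *src,
                          size_t nlane, float fill)
{
    for (size_t i = 0; i < LANES; i++) {
        dst[i] = (i < nlane) ? src[i] : fill;
    }
}
```

The vector code then computes the reciprocal of the whole register and stores back only the first `nlane` results, so the padded lanes never reach the output.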
/*@targets
 ** $maxopt baseline
 ** sse2 vsx2 neon
 **/
/**
 * Force use SSE only on x86, even if AVX2 or AVX512F are enabled
 * through the baseline, since scatter(AVX512F) and gather very costly
 * to handle non-contiguous memory access comparing with SSE for
 * such small operations that this file covers.
 */
#define NPY_SIMD_FORCE_128
@r-devulap, I only enabled SSE and dropped AVX2 and AVX512F since there's no performance gain for contiguous arrays; also, the emulated versions of the partial and non-contiguous memory load/store intrinsics show better performance compared with the gather/scatter (AVX512F) intrinsics, especially when I unroll by x2/x4.
I guess we should run benchmarks to compare this to the current code. I would expect x86_64 to be no slower, and arm64 to be faster. @Qiyu8 any thoughts?
#include <Python.h>               // for PyObject
#include "numpy/numpyconfig.h"    // for NPY_VISIBILITY_HIDDEN
I am curious why the shift in order is needed; usually Python.h comes first.
To suppress the warning "declaration of 'struct timespec*'"; this compiler warning is raised when math.h gets included before Python.h.
Do you have some more info about this you can link to?
NPY_FINLINE npyv_f64 npyv_square_f64(npyv_f64 a)
{ return _mm256_mul_pd(a, a); }

#endif
Do the new intrinsics need to be added to the _simd module explicitly, or is it automatic?
    return !(nomemoverlap((char*)src, src_step*len, (char*)dst, dst_step*len));
}

#endif // _NPY_UMATH_LOOPS_UTILS_H_
We have solve_may_share_memory in common/mem_overlap.c; can we use this opportunity to refactor the code to use it? If not, then this function should be moved to that file.
Apparently solve_may_share_memory() is used to resolve array overlap at the Python level, while what we do here is avoid performing SIMD vector operations on overlapped arrays, since the user expects scalar-by-scalar behavior on overlap, which acts differently than unrolling; also, we use scatter/gather operations, which may lead to undefined behaviour.

> If not, then this function should be moved to that file.

I don't think this function is related to the content of common/mem_overlap.c.
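For illustration only, here is a scalar sketch of the gating logic described above (a hypothetical `may_overlap_unsafe` helper assuming positive byte strides; not the actual `nomemoverlap` implementation): exact aliasing (same pointer and same stride) is safe for the SIMD path because each lane reads and writes the same element, while any partial overlap forces the scalar loop, since unrolled or gather/scatter access could observe half-written data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decide whether a src/dst pair is unsafe for the vectorized loop.
 * Strides are in bytes and assumed positive for this sketch. */
static bool may_overlap_unsafe(const char *src, intptr_t src_step,
                               char *dst, intptr_t dst_step, size_t len)
{
    if (src == (const char *)dst && src_step == dst_step) {
        return false;  /* exact aliasing: lanes are independent */
    }
    const char *src_end = src + (intptr_t)len * src_step;
    const char *dst_end = (const char *)dst + (intptr_t)len * dst_step;
    /* conservative byte-range intersection test */
    return !(src_end <= (const char *)dst || dst_end <= src);
}
```

Disjoint ranges and exact aliasing take the SIMD path; anything in between falls back to the scalar loop, matching the scalar-by-scalar semantics the user expects.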
data_recip = self.load([1/x for x in data]) # load to truncate precision
recip = self.recip(vdata)
assert recip == data_recip
Nice, the power of _simd to show the intrinsics are correct is compelling.
It should be very helpful for the upcoming architectures.
I already provided good benchmarks in the description of this pull-request, based on #15987 and the last changes I made.
IMO, a standalone benchmark script for universal intrinsics is the performance assurance from a technical perspective, but what @mattip cares about is the final performance that a NumPy user can perceive, which should be measured by
@seiko2plus this now has a merge conflict
Yes, this is what we now need, to verify these changes have not impacted x86_64 in a bad way. I would do it, but I don't seem to be able to stabilize my machine enough to get consistent results.
this patch also improves division precision for NEON/A32
- only covers sqrt, absolute, square and reciprocal
- fix SIMD memory overlap check for aliasing (same ptr & stride)
- unify fp/domain errors for both scalars and vectors
That's actually the exact reason behind #15987: it only compares the inner loops of the ufunc, which reduces the number of outliers.

Try to stabilize your system via:

    sudo python -m pyperf system tune

You will need to compare against multiple dispatched targets, so you'll have to use the environment variable, for example:

    # disable AVX512F to benchmark AVX2
    export NPY_DISABLE_CPU_FEATURES="AVX512F"
    # run asv or #15987

    # disable AVX512F and AVX2 to benchmark SSE
    export NPY_DISABLE_CPU_FEATURES="AVX2 AVX512F"
    # run asv or #15987
Doesn't do much on an AMD machine.

@hameerabbasi could you benchmark this?

I originally posted this on #15987 by mistake: I ran this PR on a live environment without a desktop (Ubuntu Server), using the method in the PR description. The noise was around 3% and this PR had a performance impact of ±5%, so not too much of a difference.
Thanks @hameerabbasi. So it seems this is good to be merged: performance on x86_64 is unchanged, and this unlocks universal intrinsics for the other architectures (and should solve that pesky test failure on old gcc on 32-bit linux).
Thanks @seiko2plus
This pull-request:

- replaces the raw SIMD of sqrt, absolute, square and reciprocal with the NumPy C SIMD vectorization interface (NPYV)
- fixes the SIMD memory overlap check for aliasing (same ptr & stride)
- unifies fp/domain errors for both scalars and vectors, which fixes AVX test failures for 32-bit manylinux1 wheels #17174
- improves float32 division precision on NEON/A32
- adds new NPYV intrinsics: sqrt, abs, recip and square
- reorders Python.h to suppress the warning "declaration of 'struct timespec*'"

merge after #17340
closes #17174
TODO:
Performance tests

Args used within #15987
Note: --msleep 1 forces the running thread to sleep 1 millisecond before collecting each sample, to revert any frequency reduction, since throttling seems to affect wall time when AVX512F is enabled.

[Benchmark tables (metric: gmean, units: ms) for X86 (AVX512F, AVX2 and SSE3, each in full and contiguous-only variants), ARM8 64-bit (ASIMD) and Power little-endian (VSX2, ISA >= 2.07); the row/column labels were lost in extraction, so only the section headings are preserved here.]

Binary size of _multiarray_umath.cpython-ver-arch-linux-gnu.so in kbytes. Note: debugging symbols are stripped.
EDIT: left some notes