SIMD optimizations with Highway #3618
Conversation
In preparation for Highway. Also, don't use liborc for `vips_abs()`, as that didn't yield any usable speedup.
In addition to disabling SIMD completely using `--vips-novector` or `VIPS_NOVECTOR`, one has the option to selectively override specific SIMD targets using:
- the `VIPS_VECTOR` environment variable;
- the `vips_vector_disable_targets()` function.

Handy for testing and benchmarking purposes.
This partially reverts commit dfdf899.
In favor of `InterleaveLower` / `InterleaveUpper`.
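For context, a quick sketch of what these Highway ops do (the helper here is illustrative, not code from the commit): within each 128-bit block, `InterleaveLower` alternates lanes from the lower halves of its two inputs and `InterleaveUpper` does the same for the upper halves.

```cpp
#include <hwy/highway.h>

namespace hn = hwy::HWY_NAMESPACE;

// Illustrative zip of two vectors: for u8 lanes in a 128-bit block,
// a = {a0..a15}, b = {b0..b15} gives lo = {a0,b0,...,a7,b7} and
// hi = {a8,b8,...,a15,b15}.
template <class D, class V = hn::Vec<D>>
void Zip(D d, V a, V b, V& lo, V& hi)
{
    lo = hn::InterleaveLower(d, a, b);
    hi = hn::InterleaveUpper(d, a, b);
}
```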
Just fall back to the C paths if SIMD is not supported.
For images with 3 or 4 bands.
By casting back to the unpremultiplied format immediately after `vips_premultiply()`.
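A hedged sketch of the pipeline this describes, using the public libvips C API (the scale and the uchar format are example values, and unreferencing of the intermediates is omitted for brevity):

```cpp
#include <vips/vips.h>

// Premultiply converts to float; casting straight back to the original
// 8-bit format keeps the resize on the integer SIMD paths. Unpremultiply
// and a final cast restore the original format afterwards.
static VipsImage *
resize_with_alpha(VipsImage *in, double scale)
{
    VipsImage *t[5];

    if (vips_premultiply(in, &t[0], NULL) ||
        vips_cast(t[0], &t[1], VIPS_FORMAT_UCHAR, NULL) ||
        vips_resize(t[1], &t[2], scale, NULL) ||
        vips_unpremultiply(t[2], &t[3], NULL) ||
        vips_cast(t[3], &t[4], VIPS_FORMAT_UCHAR, NULL))
        return NULL;

    return t[4];
}
```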
The fixed-point coefficients are 16-bit.
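A minimal sketch of the quantization step (the shift value is illustrative, not necessarily the one libvips uses):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Quantize double kernel coefficients to 16-bit fixed point so the SIMD
// paths can use integer multiply-accumulate; the integer dot product is
// later scaled back down with a rounding right-shift:
//   result = (sum + (1 << (shift - 1))) >> shift
std::vector<int16_t>
to_fixed_point(const std::vector<double>& kernel, int shift)
{
    std::vector<int16_t> fixed(kernel.size());
    for (size_t i = 0; i < kernel.size(); i++)
        fixed[i] = static_cast<int16_t>(std::lrint(kernel[i] * (1 << shift)));
    return fixed;
}
```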
This is fantastic Kleis, what a huge project, and congratulations on getting it over the line. I'll run some tests here.
I tried a few things:
This is limited by jpg encode and decode, but you can see a nice improvement in CPU time. If you make it more CPU limited, the speedup is more obvious:
Haha 7x faster in real time because sigma 100 will make master fall off the orc path.
I like the new Highway infrastructure. It should make it relatively simple to add more Highway paths to other operations in future. I've not noticed any bad results.
Wow, this is great, thank you Kleis! I'll go away and do some testing, but please don't let that stop you merging. We don't currently include vector paths via oss-fuzz but perhaps we might want to consider doing so?
In preparation for libvips/libvips#3618.
Indeed, somehow
I just opened PR google/oss-fuzz#10868 for this. I'll update
I think this would be a bad idea for large shrinks -- if you are shrinking by x100, for example, `reducev` would need to read the input image in huge chunks. `shrinkv` has the nice property of never fetching too many input scanlines in one go.
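A rough worked version of that (assuming a lanczos3-style kernel whose ~6-tap support scales with the shrink factor): at a 100x vertical shrink, `reducev` would need about

$$6 \times 100 = 600$$

input scanlines per output scanline, while `shrinkv`'s box average reads exactly 100 rows and can stream them.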
Ah, you're right. I tested this with:

Benchmark script: https://gist.github.com/kleisauke/ea7f7e12ae043aa1151dbc09987600a7

$ curl -LO https://github.com/kleisauke/vips-microbench/raw/master/images/x.jpg
$ python3 gap-bench.py --gap=2.0 -o gap-2.0.json
$ python3 gap-bench.py --gap=0.0 -o gap-0.0.json
$ python3 -m pyperf compare_to gap-2.0.json gap-0.0.json --table
+----------------+---------+----------------------+
| Benchmark | gap-2.0 | gap-0.0 |
+================+=========+======================+
| 4x | 567 ms | 305 ms: 1.86x faster |
+----------------+---------+----------------------+
| 8x | 424 ms | 306 ms: 1.39x faster |
+----------------+---------+----------------------+
| 9.4x | 391 ms | 303 ms: 1.29x faster |
+----------------+---------+----------------------+
| 16x | 355 ms | 315 ms: 1.12x faster |
+----------------+---------+----------------------+
| 64x | 338 ms | 415 ms: 1.23x slower |
+----------------+---------+----------------------+
| Geometric mean | (ref) | 1.17x faster |
+----------------+---------+----------------------+
Benchmark hidden because not significant (2): 2x, 32x
In preparation for libvips/libvips#3618.
A big difference in memory use too:
My initial testing on an Intel i7-1255U laptop (2 performance cores with hyperthreading plus 8 efficiency cores without, "12 cores" in total) with AVX2 suggests there is noticeable variance in multi-threaded resize performance compared with liborc, with a seemingly random range of +15% at best to -5% at worst. I've yet to dig into the details, but it could be a clock speed reduction of the non-hyperthreading cores when hot, or AVX2 "heavy" operations causing slowdown due to throttling/lane widening, or maybe some operations are now so fast that there are more cache evictions.
Thanks for testing @lovell! The 5% slowdown sounds like a CPU clock throttling issue; does this also occur with

It could also be due to the over-computation issue mentioned in #2757, which could be circumvented by forcing random access (I'm not sure whether this can be done from the CLI). I'll have a look to see if I can reproduce this on my old AVX2 laptop.
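For reference, forcing random access from the C API is just the standard loader option (this is the existing libvips API, not something new in this PR):

```cpp
#include <vips/vips.h>

// Open with random access so downstream operations can reuse pixels
// instead of recomputing them, sidestepping the over-computation that
// sequential mode can trigger (see #2757).
VipsImage *in = vips_image_new_from_file("x.jpg",
    "access", VIPS_ACCESS_RANDOM,
    NULL);
if (in == NULL)
    vips_error_exit(NULL);
```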
I could not reproduce this on my old AVX2 laptop. Tested with:

$ curl -LO https://github.com/lovell/sharp/raw/main/test/fixtures/2569067123_aca715a2ee_o.jpg
$ curl -LO https://github.com/lovell/sharp/raw/main/test/fixtures/alpha-premultiply-2048x1536-paper.png
$ curl -LO https://github.com/lovell/sharp/raw/main/test/fixtures/4.webp
$ python3 thumbnail-bench.py 2569067123_aca715a2ee_o.jpg -o jpeg-highway.json
.....................
720x: Mean +- std dev: 49.2 ms +- 0.8 ms
$ python3 thumbnail-bench.py alpha-premultiply-2048x1536-paper.png -o png-highway.json
.....................
720x: Mean +- std dev: 83.8 ms +- 1.6 ms
$ python3 thumbnail-bench.py 4.webp -o webp-highway.json
.....................
720x: Mean +- std dev: 30.3 ms +- 0.4 ms
$ python3 thumbnail-bench.py 2569067123_aca715a2ee_o.jpg -o jpeg-orc.json
.....................
720x: Mean +- std dev: 60.0 ms +- 1.9 ms
$ python3 thumbnail-bench.py alpha-premultiply-2048x1536-paper.png -o png-orc.json
.....................
720x: Mean +- std dev: 100 ms +- 2 ms
$ python3 thumbnail-bench.py 4.webp -o webp-orc.json
.....................
720x: Mean +- std dev: 30.3 ms +- 0.4 ms
$ python3 -m pyperf compare_to jpeg-orc.json jpeg-highway.json --table
+-----------+----------+-----------------------+
| Benchmark | jpeg-orc | jpeg-highway |
+===========+==========+=======================+
| 720x | 60.0 ms | 49.2 ms: 1.22x faster |
+-----------+----------+-----------------------+
$ python3 -m pyperf compare_to png-orc.json png-highway.json --table
+-----------+---------+-----------------------+
| Benchmark | png-orc | png-highway |
+===========+=========+=======================+
| 720x | 100 ms | 83.8 ms: 1.20x faster |
+-----------+---------+-----------------------+
$ python3 -m pyperf compare_to webp-orc.json webp-highway.json --table
Benchmark hidden because not significant (1): 720x Notes
So on this benchmark, Highway is ~16% to ~22% faster when compared with liborc.
I've done more testing and can confirm
I think Intel brands this as Turbo Boost. If the CPU is mostly using just one core, that single core gets about a 20% or 30% clock bump above the standard rated frequency. Once you start to load a couple of cores, it'll clock back down to normal speeds. Maybe disable Turbo Boost and try benchmarking again? I always forget how to do this, but SO suggests: https://askubuntu.com/a/620114

The other factor might be the cache. Your cores will share L2/L3, so single-core performance will in effect get a cache boost.
\o/, this will be in libvips 8.15. |
Highway main author here. Great to see this, congrats @kleisauke on the great results and thanks for letting us know :)
Agreed. Hyperthreads share the vector units of a core; on Intel there are essentially two arithmetic ports plus one for shuffles, and it is pretty easy to keep them busy with a single thread. Scheduling threads onto the same core will not help and may actually hurt. When benchmarking, I usually use taskset or numactl to pin threads to the first hyperthread in each core. In addition, faster vectorization means we are closer to being memory-bound, and in particular influenced by background activity that happens to use more of the bandwidth during the test.
Congratulations on landing this huge thing Kleis!
Just catching up with this - it's "music to my ears", as you can expect ;-) Well done Kleis!
Previously, `seq->coff` was used both for storing offsets to clear values (zero values in masks) and as an array for non-128 mask coefficients. However, in commit 40e2884 (PR libvips#3618), `seq->coff` was restricted to `guint8` values, making it incompatible with storing offsets. Fix this by syncing the C paths with the Highway implementation.
* morph: fix erode Highway path
* morph: sync C-paths with the Highway implementation

  Previously, `seq->coff` was used both for storing offsets to clear values (zero values in masks) and as an array for non-128 mask coefficients. However, in commit 40e2884 (PR #3618), `seq->coff` was restricted to `guint8` values, making it incompatible with storing offsets. Fix this by syncing the C paths with the Highway implementation.

* morph: prefer bitwise NOT over bitwise XOR

  `~p` and `p ^ 255` produce the same result on uchar images, as XOR affects only the lowest 8 bits.
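For what it's worth, the NOT/XOR equivalence in that last item is easy to check exhaustively; a self-contained sketch (not libvips code):

```cpp
#include <cassert>

int main()
{
    // For uchar pixels only the low 8 bits survive the store, so
    // bitwise NOT and XOR with 255 produce identical results.
    for (int v = 0; v < 256; v++) {
        unsigned char p = static_cast<unsigned char>(v);
        assert(static_cast<unsigned char>(~p) ==
            static_cast<unsigned char>(p ^ 255));
    }
    return 0;
}
```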
This PR optimizes the `reduce{h,v}`, `convi`, and `morph` operations using portable SIMD/vector instructions through Highway. The liborc paths serve as fallbacks whenever Highway >= v1.0.5 is unavailable[1].

Motivation
Traditionally, libvips has depended on liborc's runtime compiler to dynamically generate optimized SIMD/vector code for the target architecture. However, maintaining this code proved challenging, and it didn't generalize to other architectures (such as WebAssembly). Additionally, it lacked support for newer instruction sets (like AVX2 and AVX-512), and liborc's vector paths didn't match the precision of the C paths (as noted here).
Highway is a C++ library with carefully-chosen functions that map well to CPU instructions without extensive compiler transformations. Because Highway is a library (rather than a code generator or compiler) it facilitates straightforward development, debugging, and maintenance of the code. Highway supports five architectures[2]; the same application code can target various instruction sets, including those with 'scalable' vectors (size unknown at compile time).
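To give a flavour of the programming model, here is a minimal static-dispatch sketch along the lines of Highway's quick-start example (illustrative only, not code from this PR):

```cpp
#include <hwy/highway.h>

namespace hn = hwy::HWY_NAMESPACE;

// Multiply-accumulate over float arrays using whatever vector width the
// compiled target provides; Lanes(d) is only known at runtime on
// 'scalable' targets such as SVE and RVV. Remainder handling is omitted
// for brevity, so n is assumed to be a multiple of the lane count.
void MulAddLoop(const float* HWY_RESTRICT a, const float* HWY_RESTRICT b,
                float* HWY_RESTRICT out, size_t n)
{
    const hn::ScalableTag<float> d;
    for (size_t i = 0; i < n; i += hn::Lanes(d)) {
        const auto va = hn::LoadU(d, a + i);
        const auto vb = hn::LoadU(d, b + i);
        const auto vo = hn::LoadU(d, out + i);
        hn::StoreU(hn::MulAdd(va, vb, vo), d, out + i);
    }
}
```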
Usage
Users can view the available targets for their platform using the `--targets` flag.

Additionally, users can specify which of the available targets to use at runtime via the `VIPS_VECTOR` environment variable, which is particularly useful for testing and benchmarking.

As always, you have the option to disable the vector paths with the `--vips-novector` flag or the `VIPS_NOVECTOR` environment variable.
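For illustration, a hedged sketch of the runtime override from C++, using the `vips_vector_disable_targets()` function from the commits above; its exact signature is an assumption here (a bitmask of Highway targets, the `HWY_*` constants from `hwy/targets.h`), so check `vector.h` for the real prototype:

```cpp
#include <vips/vips.h>
#include <hwy/targets.h>

int
main(int argc, char **argv)
{
    if (VIPS_INIT(argv[0]))
        vips_error_exit(NULL);

    // Assumed signature: suppress the AVX-512 paths, leaving libvips
    // to pick the best remaining target (AVX2 or below on x64).
    vips_vector_disable_targets(HWY_AVX3 | HWY_AVX3_DL);

    /* ... image processing ... */

    vips_shutdown();
    return 0;
}
```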
Accuracy and performance
This PR underwent accuracy and speed testing on the following targets:
https://gist.github.com/kleisauke/1f28a9fc156c753bcb1239b6fc1a2e62
It produces identical output to the C paths on these targets, addressing issue #2047.
On my AMD Ryzen 9 7900 workstation, this implementation shows a noticeable speed improvement, ranging from ~15% faster to ~2.5x faster depending on the number of worker threads used. See the benchmark results at:
https://github.com/kleisauke/vips-microbench/blob/master/results/simd-highway.md
Feel free to benchmark this across additional architectures!
Backward compatibility
Several liborc-specific functions are now deprecated; see API changes below for details[3].
This PR should not affect backward compatibility. The `abi-compliance-checker` result is available at:
https://kleisauke.nl/compat_reports/vips/master_to_simd-highway/compat_report.html
References
[1]: Highway packaging status
[2]: Highway targets
Highway currently targets the following 'clusters' of features:
- `SSE2` (any x64)
- `SSSE3` (~Intel Core)
- `SSE4` (~Nehalem)
- `AVX2` (~Haswell)
- `AVX3` (~Skylake)
- `AVX3_DL` (~Icelake)
- `AVX3_ZEN4` (Zen4)
- `AVX3_SPR` (~Sapphire Rapids)
- `NEON` (Armv7+)
- `SVE` (plus its specialization for 256-bit vectors, `SVE_256`)
- `SVE2` (plus its specialization for 128-bit vectors, `SVE2_128`)
- `PPC8` (v2.07)
- `PPC9` (v3.0)
- `PPC10` (v3.1B)
- `RVV` (1.0)
- `WASM`
- `WASM_EMU256` (a 2x unrolled version of wasm128)

[3]: API Changes
`memory.h`:

A new function to allocate memory aligned on a specific boundary, along with a function for releasing that memory.

`vector.h`:

New functions to obtain or disable specific targets; the previous `VipsVector` / `VipsExecutor` APIs are deprecated.
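As a sketch of how the new `memory.h` pair might be used; the function names and argument order here follow the libvips `vips_tracked_` convention but are assumptions, so check `memory.h` for the actual prototypes:

```cpp
#include <vips/vips.h>

int
main(int argc, char **argv)
{
    if (VIPS_INIT(argv[0]))
        vips_error_exit(NULL);

    // Assumed names/signature: allocate a 64-byte-aligned buffer (wide
    // enough for AVX-512 loads), then release it through the matching
    // free so libvips' tracked-memory accounting stays balanced.
    void *buf = vips_tracked_aligned_alloc(4096, 64);
    if (buf == NULL)
        vips_error_exit(NULL);

    /* ... use buf ... */

    vips_tracked_aligned_free(buf);

    vips_shutdown();
    return 0;
}
```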