-
-
Notifications
You must be signed in to change notification settings - Fork 11.1k
ENH: Enable 16-bit VQSort routines on AArch64 #25346
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Following the enablement of VQSort, this enables the 16-bit routines, which improves the performance quite a bit: ``` | Change | Before [895c752] <highway-vqsort-16~1> | After [5abc0cac] <highway-vqsort-16> | Ratio | Benchmark (Parameter) | |----------|-------------------------------------------|----------------------------------------|---------|--------------------------------------------------------------------------------| | + | 63.1±0.1μs | 79.7±4μs | 1.26 | bench_function_base.Sort.time_sort('merge', 'uint32', ('sorted_block', 100)) | | + | 35.8±0.01μs | 41.5±6μs | 1.16 | bench_function_base.Sort.time_sort('merge', 'uint32', ('sorted_block', 1000)) | | + | 652±1μs | 695±5μs | 1.07 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 1000)) | | + | 728±0.5μs | 767±2μs | 1.05 | bench_function_base.Sort.time_sort('heap', 'float16', ('random',)) | | + | 681±1μs | 716±4μs | 1.05 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 100)) | | + | 49.8±0.1μs | 52.3±0.05μs | 1.05 | bench_function_base.Sort.time_sort('quick', 'int16', ('ordered',)) | | - | 58.9±0.05μs | 55.9±0.2μs | 0.95 | bench_function_base.Sort.time_sort('heap', 'float16', ('reversed',)) | | - | 59.0±0.03μs | 55.7±0.02μs | 0.95 | bench_function_base.Sort.time_sort('heap', 'float16', ('uniform',)) | | - | 81.6±0.2μs | 52.6±0.03μs | 0.64 | bench_function_base.Sort.time_sort('quick', 'int16', ('reversed',)) | | - | 135±0.05μs | 59.7±0.1μs | 0.44 | bench_function_base.Sort.time_sort('quick', 'float16', ('ordered',)) | | - | 251±0.9μs | 55.2±0.03μs | 0.22 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 1000)) | | - | 340±1μs | 59.2±0.05μs | 0.17 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 1000)) | | - | 312±0.6μs | 52.8±0.04μs | 0.17 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 10)) | | - | 329±0.7μs | 52.7±0.06μs | 0.16 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 100)) | | - | 413±0.8μs | 57.6±0.02μs | 0.14 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 10)) | | - | 407±0.6μs | 56.8±0.05μs | 0.14 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 100)) | | - | 376±0.6μs | 52.8±0.05μs | 0.14 | bench_function_base.Sort.time_sort('quick', 'int16', ('random',)) | | - | 471±1μs | 56.9±0.06μs | 0.12 | bench_function_base.Sort.time_sort('quick', 'float16', ('random',)) | | - | 49.6±0.01μs | 2.15±0.02μs | 0.04 | bench_function_base.Sort.time_sort('quick', 'int16', ('uniform',)) | | - | 153±0.08μs | 4.28±0.02μs | 0.03 | bench_function_base.Sort.time_sort('quick', 'float16', ('reversed',)) | | - | 143±1μs | 4.28±0.02μs | 0.03 | bench_function_base.Sort.time_sort('quick', 'float16', ('uniform',)) | ```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
#include "highway_qsort_16bit.dispatch.h" | ||
#endif | ||
NPY_CPU_DISPATCH_DECLARE(template <typename T> void QSort, (T *arr, npy_intp size)) | ||
NPY_CPU_DISPATCH_DECLARE(template <typename T> void QSelect, (T* arr, npy_intp num, npy_intp kth)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QSelect
isn't implemented yet, right? I suppose it can come after google/highway#1710.
{ | ||
hwy::HWY_NAMESPACE::VQSortStatic(reinterpret_cast<hwy::float16_t*>(arr), size, hwy::SortAscending()); | ||
} | ||
template<> void NPY_CPU_DISPATCH_CURFX(QSort)(uint16_t *arr, intptr_t size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just out of curiosity, do the uint16_t and int16_t versions also need NPY_CPU_FEATURE_ASIMDHP
? I thought that was only required for vector float16 operations.
Found these numbers interesting: 35x speed for sorting float16 reversed array but only 1.5x for int16 ❗
|
Thank you @Mousius |
This looks to have broken linux aarch64 builds: https://cirrus-ci.com/task/6261480706801664. |
No, this pr #25247 broke the build according to the Highway build error |
I will try to update Highway within #25397 to see if this is going to fix the build error, otherwise, disable SVE's quicksort and report upstream. |
Following the enablement of VQSort, this enables the 16-bit routines, which improves the performance quite a bit: