ENH: Enable 16-bit VQSort routines on AArch64 #25346

Mousius · 2023-12-08T14:39:31Z

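Following the enablement of VQSort, this PR enables the 16-bit routines, which substantially improves `quick` sort performance for most `int16` and `float16` inputs (a ratio below 1 means faster after the change):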

| Change   | Before [895c7520] <highway-vqsort-16~1>   | After [5abc0cac] <highway-vqsort-16>   |   Ratio | Benchmark (Parameter)                                                          |
|----------|-------------------------------------------|----------------------------------------|---------|--------------------------------------------------------------------------------|
| +        | 63.1±0.1μs                                | 79.7±4μs                               |    1.26 | bench_function_base.Sort.time_sort('merge', 'uint32', ('sorted_block', 100))   |
| +        | 35.8±0.01μs                               | 41.5±6μs                               |    1.16 | bench_function_base.Sort.time_sort('merge', 'uint32', ('sorted_block', 1000))  |
| +        | 652±1μs                                   | 695±5μs                                |    1.07 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 1000))  |
| +        | 728±0.5μs                                 | 767±2μs                                |    1.05 | bench_function_base.Sort.time_sort('heap', 'float16', ('random',))             |
| +        | 681±1μs                                   | 716±4μs                                |    1.05 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 100))   |
| +        | 49.8±0.1μs                                | 52.3±0.05μs                            |    1.05 | bench_function_base.Sort.time_sort('quick', 'int16', ('ordered',))             |
| -        | 58.9±0.05μs                               | 55.9±0.2μs                             |    0.95 | bench_function_base.Sort.time_sort('heap', 'float16', ('reversed',))           |
| -        | 59.0±0.03μs                               | 55.7±0.02μs                            |    0.95 | bench_function_base.Sort.time_sort('heap', 'float16', ('uniform',))            |
| -        | 81.6±0.2μs                                | 52.6±0.03μs                            |    0.64 | bench_function_base.Sort.time_sort('quick', 'int16', ('reversed',))            |
| -        | 135±0.05μs                                | 59.7±0.1μs                             |    0.44 | bench_function_base.Sort.time_sort('quick', 'float16', ('ordered',))           |
| -        | 251±0.9μs                                 | 55.2±0.03μs                            |    0.22 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 1000))   |
| -        | 340±1μs                                   | 59.2±0.05μs                            |    0.17 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 1000)) |
| -        | 312±0.6μs                                 | 52.8±0.04μs                            |    0.17 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 10))     |
| -        | 329±0.7μs                                 | 52.7±0.06μs                            |    0.16 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 100))    |
| -        | 413±0.8μs                                 | 57.6±0.02μs                            |    0.14 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 10))   |
| -        | 407±0.6μs                                 | 56.8±0.05μs                            |    0.14 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 100))  |
| -        | 376±0.6μs                                 | 52.8±0.05μs                            |    0.14 | bench_function_base.Sort.time_sort('quick', 'int16', ('random',))              |
| -        | 471±1μs                                   | 56.9±0.06μs                            |    0.12 | bench_function_base.Sort.time_sort('quick', 'float16', ('random',))            |
| -        | 49.6±0.01μs                               | 2.15±0.02μs                            |    0.04 | bench_function_base.Sort.time_sort('quick', 'int16', ('uniform',))             |
| -        | 153±0.08μs                                | 4.28±0.02μs                            |    0.03 | bench_function_base.Sort.time_sort('quick', 'float16', ('reversed',))          |
| -        | 143±1μs                                   | 4.28±0.02μs                            |    0.03 | bench_function_base.Sort.time_sort('quick', 'float16', ('uniform',))           |


r-devulap

LGTM.

r-devulap · 2023-12-11T21:37:49Z

numpy/_core/src/npysort/highway_qsort.hpp

+    #include "highway_qsort_16bit.dispatch.h"
+#endif
+NPY_CPU_DISPATCH_DECLARE(template <typename T> void QSort, (T *arr, npy_intp size))
+NPY_CPU_DISPATCH_DECLARE(template <typename T> void QSelect, (T* arr, npy_intp num, npy_intp kth))


QSelect isn't implemented yet, right? I suppose it can come after google/highway#1710.

r-devulap · 2023-12-11T21:39:18Z

numpy/_core/src/npysort/highway_qsort_16bit.dispatch.cpp

+{
+    hwy::HWY_NAMESPACE::VQSortStatic(reinterpret_cast<hwy::float16_t*>(arr), size, hwy::SortAscending());
+}
+template<> void NPY_CPU_DISPATCH_CURFX(QSort)(uint16_t *arr, intptr_t size)


Just out of curiosity, do the uint16_t and int16_t versions also need NPY_CPU_FEATURE_ASIMDHP? I thought that was only required for vector float16 operations.
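For readers unfamiliar with the pattern in the diff above, the per-type explicit specialisations can be sketched roughly as follows. This is only an illustration: `QSortSketch` is a hypothetical name, `std::sort` stands in for the Highway call, and none of NumPy's dispatch macros or CPU feature checks are modelled, so it says nothing about which features each specialisation actually needs.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the dispatched QSort template (not NumPy's declaration).
// Only the explicitly specialised element types below are usable.
template <typename T>
void QSortSketch(T *arr, std::ptrdiff_t size);

// Specialisation for unsigned 16-bit integers; std::sort is a placeholder
// for the vectorised sort used in the real code.
template <>
void QSortSketch<std::uint16_t>(std::uint16_t *arr, std::ptrdiff_t size) {
    std::sort(arr, arr + size);
}

// Specialisation for signed 16-bit integers, same placeholder implementation.
template <>
void QSortSketch<std::int16_t>(std::int16_t *arr, std::ptrdiff_t size) {
    std::sort(arr, arr + size);
}
```

Calling `QSortSketch` with an element type that has no specialisation fails at link time, which is one reason each supported 16-bit type gets its own definition.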

r-devulap · 2023-12-11T21:50:45Z

Found these numbers interesting: a 35x speedup for sorting a reversed float16 array, but only 1.5x for int16 ❗

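| Change | Before | After | Ratio | Benchmark (Parameter) |
|--------|--------|-------|-------|-----------------------|
| - | 153±0.08μs | 4.28±0.02μs | 0.03 | bench_function_base.Sort.time_sort('quick', 'float16', ('reversed',)) |
| - | 81.6±0.2μs | 52.6±0.03μs | 0.64 | bench_function_base.Sort.time_sort('quick', 'int16', ('reversed',)) |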
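The 35x and 1.5x figures follow from dividing the "before" time by the "after" time, which is the inverse of the Ratio column. A minimal sketch of that arithmetic, where the helper name `speedup` is an assumption for illustration and not part of the benchmark suite:

```cpp
// Speedup factor = time before / time after.
// This is the inverse of the "Ratio" column (after / before) in the tables.
double speedup(double before_us, double after_us) {
    return before_us / after_us;
}
```

With the mean timings from the two rows, `speedup(153.0, 4.28)` is roughly 35.7 for float16 and `speedup(81.6, 52.6)` is roughly 1.55 for int16.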

seiko2plus · 2023-12-14T16:23:08Z

Thank you @Mousius

charris · 2023-12-14T19:00:14Z

This looks to have broken linux aarch64 builds: https://cirrus-ci.com/task/6261480706801664.

seiko2plus · 2023-12-14T19:07:46Z

> This looks to have broken linux aarch64 builds: https://cirrus-ci.com/task/6261480706801664.

No, PR #25247 broke the build, according to the Highway build error.

seiko2plus · 2023-12-14T19:37:02Z

I will try updating Highway within #25397 to see whether that fixes the build error; otherwise, I will disable SVE's quicksort and report the issue upstream.

github-actions bot added the 01 - Enhancement label Dec 8, 2023

r-devulap approved these changes Dec 11, 2023

seiko2plus merged commit da8afcb into numpy:main Dec 14, 2023

seiko2plus mentioned this pull request Dec 14, 2023

BUG, SIMD: Fix quicksort build error when Highway/SVE is enabled #25397

Merged
