Skip to content

ENH: Enable 16-bit VQSort routines on AArch64 #25346

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 14, 2023

Conversation

Mousius
Copy link
Member

@Mousius Mousius commented Dec 8, 2023

Following the enablement of VQSort, this enables the 16-bit routines, which improves the performance quite a bit:

| Change   | Before [895c7520] <highway-vqsort-16~1>   | After [5abc0cac] <highway-vqsort-16>   |   Ratio | Benchmark (Parameter)                                                          |
|----------|-------------------------------------------|----------------------------------------|---------|--------------------------------------------------------------------------------|
| +        | 63.1±0.1μs                                | 79.7±4μs                               |    1.26 | bench_function_base.Sort.time_sort('merge', 'uint32', ('sorted_block', 100))   |
| +        | 35.8±0.01μs                               | 41.5±6μs                               |    1.16 | bench_function_base.Sort.time_sort('merge', 'uint32', ('sorted_block', 1000))  |
| +        | 652±1μs                                   | 695±5μs                                |    1.07 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 1000))  |
| +        | 728±0.5μs                                 | 767±2μs                                |    1.05 | bench_function_base.Sort.time_sort('heap', 'float16', ('random',))             |
| +        | 681±1μs                                   | 716±4μs                                |    1.05 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 100))   |
| +        | 49.8±0.1μs                                | 52.3±0.05μs                            |    1.05 | bench_function_base.Sort.time_sort('quick', 'int16', ('ordered',))             |
| -        | 58.9±0.05μs                               | 55.9±0.2μs                             |    0.95 | bench_function_base.Sort.time_sort('heap', 'float16', ('reversed',))           |
| -        | 59.0±0.03μs                               | 55.7±0.02μs                            |    0.95 | bench_function_base.Sort.time_sort('heap', 'float16', ('uniform',))            |
| -        | 81.6±0.2μs                                | 52.6±0.03μs                            |    0.64 | bench_function_base.Sort.time_sort('quick', 'int16', ('reversed',))            |
| -        | 135±0.05μs                                | 59.7±0.1μs                             |    0.44 | bench_function_base.Sort.time_sort('quick', 'float16', ('ordered',))           |
| -        | 251±0.9μs                                 | 55.2±0.03μs                            |    0.22 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 1000))   |
| -        | 340±1μs                                   | 59.2±0.05μs                            |    0.17 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 1000)) |
| -        | 312±0.6μs                                 | 52.8±0.04μs                            |    0.17 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 10))     |
| -        | 329±0.7μs                                 | 52.7±0.06μs                            |    0.16 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 100))    |
| -        | 413±0.8μs                                 | 57.6±0.02μs                            |    0.14 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 10))   |
| -        | 407±0.6μs                                 | 56.8±0.05μs                            |    0.14 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 100))  |
| -        | 376±0.6μs                                 | 52.8±0.05μs                            |    0.14 | bench_function_base.Sort.time_sort('quick', 'int16', ('random',))              |
| -        | 471±1μs                                   | 56.9±0.06μs                            |    0.12 | bench_function_base.Sort.time_sort('quick', 'float16', ('random',))            |
| -        | 49.6±0.01μs                               | 2.15±0.02μs                            |    0.04 | bench_function_base.Sort.time_sort('quick', 'int16', ('uniform',))             |
| -        | 153±0.08μs                                | 4.28±0.02μs                            |    0.03 | bench_function_base.Sort.time_sort('quick', 'float16', ('reversed',))          |
| -        | 143±1μs                                   | 4.28±0.02μs                            |    0.03 | bench_function_base.Sort.time_sort('quick', 'float16', ('uniform',))           |

Following the enablement of VQSort, this enables the 16-bit routines, which improves the performance quite a bit:

```
| Change   | Before [895c752] <highway-vqsort-16~1>   | After [5abc0cac] <highway-vqsort-16>   |   Ratio | Benchmark (Parameter)                                                          |
|----------|-------------------------------------------|----------------------------------------|---------|--------------------------------------------------------------------------------|
| +        | 63.1±0.1μs                                | 79.7±4μs                               |    1.26 | bench_function_base.Sort.time_sort('merge', 'uint32', ('sorted_block', 100))   |
| +        | 35.8±0.01μs                               | 41.5±6μs                               |    1.16 | bench_function_base.Sort.time_sort('merge', 'uint32', ('sorted_block', 1000))  |
| +        | 652±1μs                                   | 695±5μs                                |    1.07 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 1000))  |
| +        | 728±0.5μs                                 | 767±2μs                                |    1.05 | bench_function_base.Sort.time_sort('heap', 'float16', ('random',))             |
| +        | 681±1μs                                   | 716±4μs                                |    1.05 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 100))   |
| +        | 49.8±0.1μs                                | 52.3±0.05μs                            |    1.05 | bench_function_base.Sort.time_sort('quick', 'int16', ('ordered',))             |
| -        | 58.9±0.05μs                               | 55.9±0.2μs                             |    0.95 | bench_function_base.Sort.time_sort('heap', 'float16', ('reversed',))           |
| -        | 59.0±0.03μs                               | 55.7±0.02μs                            |    0.95 | bench_function_base.Sort.time_sort('heap', 'float16', ('uniform',))            |
| -        | 81.6±0.2μs                                | 52.6±0.03μs                            |    0.64 | bench_function_base.Sort.time_sort('quick', 'int16', ('reversed',))            |
| -        | 135±0.05μs                                | 59.7±0.1μs                             |    0.44 | bench_function_base.Sort.time_sort('quick', 'float16', ('ordered',))           |
| -        | 251±0.9μs                                 | 55.2±0.03μs                            |    0.22 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 1000))   |
| -        | 340±1μs                                   | 59.2±0.05μs                            |    0.17 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 1000)) |
| -        | 312±0.6μs                                 | 52.8±0.04μs                            |    0.17 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 10))     |
| -        | 329±0.7μs                                 | 52.7±0.06μs                            |    0.16 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 100))    |
| -        | 413±0.8μs                                 | 57.6±0.02μs                            |    0.14 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 10))   |
| -        | 407±0.6μs                                 | 56.8±0.05μs                            |    0.14 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 100))  |
| -        | 376±0.6μs                                 | 52.8±0.05μs                            |    0.14 | bench_function_base.Sort.time_sort('quick', 'int16', ('random',))              |
| -        | 471±1μs                                   | 56.9±0.06μs                            |    0.12 | bench_function_base.Sort.time_sort('quick', 'float16', ('random',))            |
| -        | 49.6±0.01μs                               | 2.15±0.02μs                            |    0.04 | bench_function_base.Sort.time_sort('quick', 'int16', ('uniform',))             |
| -        | 153±0.08μs                                | 4.28±0.02μs                            |    0.03 | bench_function_base.Sort.time_sort('quick', 'float16', ('reversed',))          |
| -        | 143±1μs                                   | 4.28±0.02μs                            |    0.03 | bench_function_base.Sort.time_sort('quick', 'float16', ('uniform',))           |
```
Copy link
Member

@r-devulap r-devulap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

#include "highway_qsort_16bit.dispatch.h"
#endif
NPY_CPU_DISPATCH_DECLARE(template <typename T> void QSort, (T *arr, npy_intp size))
NPY_CPU_DISPATCH_DECLARE(template <typename T> void QSelect, (T* arr, npy_intp num, npy_intp kth))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QSelect isn't implemented yet, right? I suppose it can come after google/highway#1710.

{
hwy::HWY_NAMESPACE::VQSortStatic(reinterpret_cast<hwy::float16_t*>(arr), size, hwy::SortAscending());
}
template<> void NPY_CPU_DISPATCH_CURFX(QSort)(uint16_t *arr, intptr_t size)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity, do the uint16_t and int16_t versions also need NPY_CPU_FEATURE_ASIMDHP? I thought that was only required for vector float16 operations.

@r-devulap
Copy link
Member

Found these numbers interesting: 35x speed for sorting float16 reversed array but only 1.5x for int16 ❗

--- --- --- --- ---
- 153±0.08μs 4.28±0.02μs 0.03 bench_function_base.Sort.time_sort('quick', 'float16', ('reversed',))
- 81.6±0.2μs 52.6±0.03μs 0.64 bench_function_base.Sort.time_sort('quick', 'int16', ('reversed',))

@seiko2plus seiko2plus merged commit da8afcb into numpy:main Dec 14, 2023
@seiko2plus
Copy link
Member

Thank you @Mousius

@charris
Copy link
Member

charris commented Dec 14, 2023

This looks to have broken linux aarch64 builds: https://cirrus-ci.com/task/6261480706801664.

@seiko2plus
Copy link
Member

This looks to have broken linux aarch64 builds: https://cirrus-ci.com/task/6261480706801664.

No, this pr #25247 broke the build according to the Highway build error

@seiko2plus
Copy link
Member

seiko2plus commented Dec 14, 2023

I will try to update Highway within #25397 to see if this is going to fix the build error, otherwise, disable SVE's quicksort and report upstream.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants