[libc] Use UMAXV.4S to reduce bcmp result. #99260

pcc · 2024-07-17T00:51:19Z

We can use UMAXV.4S to reduce the comparison result in a single
instruction. This improves performance by roughly 4% on Apple M1:

Summary
bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 ran
1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.02 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10

(1 = original, 2 = a variant of this patch that uses UMAXV.16B, 3 = this patch)

Created using spr 1.3.6-beta.1

llvmbot · 2024-07-17T00:51:49Z

@llvm/pr-subscribers-libc

Author: None (pcc)

Changes

We can use UMAXV.4S to reduce the comparison result in a single
instruction. This improves performance by roughly 4% on Apple M1:

Summary
bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 ran
1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.02 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10

(1 = original, 2 = a variant of this patch that uses UMAXV.16B, 3 = this patch)

Full diff: https://github.com/llvm/llvm-project/pull/99260.diff

1 Files Affected:

(modified) libc/src/string/memory_utils/op_aarch64.h (+6-12)

diff --git a/libc/src/string/memory_utils/op_aarch64.h b/libc/src/string/memory_utils/op_aarch64.h
index 1090ea2617f09..5c08a6ae48b04 100644
--- a/libc/src/string/memory_utils/op_aarch64.h
+++ b/libc/src/string/memory_utils/op_aarch64.h
@@ -84,8 +84,7 @@ template <size_t Size> struct Bcmp {
       uint8x16_t a = vld1q_u8(_p1);
       uint8x16_t n = vld1q_u8(_p2);
       uint8x16_t an = veorq_u8(a, n);
-      uint32x2_t an_reduced = vqmovn_u64(vreinterpretq_u64_u8(an));
-      return vmaxv_u32(an_reduced);
+      return vmaxvq_u32(vreinterpretq_u32_u8(an));
     } else if constexpr (Size == 32) {
       auto _p1 = as_u8(p1);
       auto _p2 = as_u8(p2);
@@ -97,12 +96,9 @@ template <size_t Size> struct Bcmp {
       uint8x16_t bo = veorq_u8(b, o);
       // anbo = (a ^ n) | (b ^ o).  At least one byte is nonzero if there is
       // a difference between the two buffers.  We reduce this value down to 4
-      // bytes in two steps. First, calculate the saturated move value when
-      // going from 2x64b to 2x32b. Second, compute the max of the 2x32b to get
-      // a single 32 bit nonzero value if a mismatch occurred.
+      // bytes using the UMAXV instruction to compute the max across the vector.
       uint8x16_t anbo = vorrq_u8(an, bo);
-      uint32x2_t anbo_reduced = vqmovn_u64(vreinterpretq_u64_u8(anbo));
-      return vmaxv_u32(anbo_reduced);
+      return vmaxvq_u32(vreinterpretq_u32_u8(anbo));
     } else if constexpr ((Size % BlockSize) == 0) {
       for (size_t offset = 0; offset < Size; offset += BlockSize)
         if (auto value = Bcmp<BlockSize>::block(p1 + offset, p2 + offset))
@@ -129,8 +125,7 @@ template <size_t Size> struct Bcmp {
       uint8x16_t bo = veorq_u8(b, o);
       // anbo = (a ^ n) | (b ^ o)
       uint8x16_t anbo = vorrq_u8(an, bo);
-      uint32x2_t anbo_reduced = vqmovn_u64(vreinterpretq_u64_u8(anbo));
-      return vmaxv_u32(anbo_reduced);
+      return vmaxvq_u32(vreinterpretq_u32_u8(anbo));
     } else if constexpr (Size == 32) {
       auto _p1 = as_u8(p1);
       auto _p2 = as_u8(p2);
@@ -150,9 +145,8 @@ template <size_t Size> struct Bcmp {
       uint8x16_t cpdq = vorrq_u8(cp, dq);
       // abnocpdq = ((a ^ n) | (b ^ o)) | ((c ^ p) | (d ^ q)).  Reduce this to
       // a nonzero 32 bit value if a mismatch occurred.
-      uint64x2_t abnocpdq = vreinterpretq_u64_u8(anbo | cpdq);
-      uint32x2_t abnocpdq_reduced = vqmovn_u64(abnocpdq);
-      return vmaxv_u32(abnocpdq_reduced);
+      uint8x16_t abnocpdq = anbo | cpdq;
+      return vmaxvq_u32(vreinterpretq_u32_u8(abnocpdq));
     } else {
       static_assert(cpp::always_false<decltype(Size)>, "SIZE not implemented");
     }

SchrodingerZhu · 2024-08-03T05:40:37Z

Hi,

Thank you for the patch. Unfortunately, I think the proposed change is causing failures in tests:

Ran 5 tests.  PASS: 5  FAIL: 0
[4171/5229] Running unit test libc.test.src.stdio.snprintf_test.__unit__
FAILED: libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.snprintf_test.__unit__ /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.snprintf_test.__unit__ 
cd /home/schrodinger/development/llvm-project/build/libc/test/src/stdio && /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/libc.test.src.stdio.snprintf_test.__unit__.__build__
[==========] Running 2 tests from 1 test suite.
[ RUN      ] LlvmLibcSNPrintfTest.CutOff
/home/schrodinger/development/llvm-project/libc/test/src/stdio/snprintf_test.cpp:23: FAILURE
      Expected: buff
      Which is: 
To be equal to: "A simple string"
      Which is: A simple string
[  FAILED  ] LlvmLibcSNPrintfTest.CutOff
[ RUN      ] LlvmLibcSNPrintfTest.NoCutOff
/home/schrodinger/development/llvm-project/libc/test/src/stdio/snprintf_test.cpp:53: FAILURE
      Expected: buff
      Which is: 
To be equal to: "A simple string with no conversions."
      Which is: A simple string with no conversions.
[  FAILED  ] LlvmLibcSNPrintfTest.NoCutOff
Ran 2 tests.  PASS: 0  FAIL: 2
[4172/5229] Running unit test libc.test.src.stdio.vsnprintf_test.__unit__
FAILED: libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.vsnprintf_test.__unit__ /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.vsnprintf_test.__unit__ 
cd /home/schrodinger/development/llvm-project/build/libc/test/src/stdio && /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/libc.test.src.stdio.vsnprintf_test.__unit__.__build__
[==========] Running 2 tests from 1 test suite.
[ RUN      ] LlvmLibcVSNPrintfTest.CutOff
/home/schrodinger/development/llvm-project/libc/test/src/stdio/vsnprintf_test.cpp:35: FAILURE
      Expected: buff
      Which is: 
To be equal to: "A simple string"
      Which is: A simple string
[  FAILED  ] LlvmLibcVSNPrintfTest.CutOff
[ RUN      ] LlvmLibcVSNPrintfTest.NoCutOff
/home/schrodinger/development/llvm-project/libc/test/src/stdio/vsnprintf_test.cpp:64: FAILURE
      Expected: buff
      Which is: 
To be equal to: "A simple string with no conversions."
      Which is: A simple string with no conversions.
[  FAILED  ] LlvmLibcVSNPrintfTest.NoCutOff
Ran 2 tests.  PASS: 0  FAIL: 2
[4173/5229] Running unit test libc.test.src.stdio.fprintf_test.__unit__
FAILED: libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.fprintf_test.__unit__ /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.fprintf_test.__unit__ 
cd /home/schrodinger/development/llvm-project/build/libc/test/src/stdio && /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/libc.test.src.stdio.fprintf_test.__unit__.__build__
[==========] Running 1 test from 1 test suite.
[ RUN      ] LlvmLibcFPrintfTest.WriteToFile
/home/schrodinger/development/llvm-project/libc/test/src/stdio/fprintf_test.cpp:65: FAILURE
      Expected: printf_test::fread(data, 1, sizeof(simple) - 1, file)
      Which is: 0
To be equal to: sizeof(simple) - 1
      Which is: 37
[  FAILED  ] LlvmLibcFPrintfTest.WriteToFile
Ran 1 tests.  PASS: 0  FAIL: 1
[4184/5229] Linking CXX executable libc/test/src/stdio/libc.test.src.stdio.remove_test.__hermetic__.__build__
ninja: build stopped: subcommand failed.

SchrodingerZhu

See previous comment

pcc · 2025-08-15T03:14:07Z

can't reproduce this test failure. I did:

cmake -G Ninja -DLLVM_ENABLE_RUNTIMES=libc -DLLVM_ENABLE_PROJECTS=clang\;lld -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_ENABLE_ASSERTIONS=1 -DLLVM_TARGETS_TO_BUILD=AArch64 -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ../llvm
ninja libc
ninja -C ./runtimes/runtimes-bins check-libc

all tests passed.

SchrodingerZhu · 2025-08-15T04:08:54Z

I will test it again. Thanks for letting me know.On Thu, Aug 14, 2025 at 23:14, Peter Collingbourne ***@***.***> wrote: Reopened #99260. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you are on a team that was mentioned.Message ID: ***@***.***>

[𝘀𝗽𝗿] initial version

841c2e8

Created using spr 1.3.6-beta.1

llvmbot added the libc label Jul 17, 2024

pcc requested a review from gchatelet July 17, 2024 00:51

lntue changed the title ~~libc: Use UMAXV.4S to reduce bcmp result.~~ [libc] Use UMAXV.4S to reduce bcmp result. Jul 17, 2024

SchrodingerZhu requested changes Aug 3, 2024

View reviewed changes

pcc closed this Aug 5, 2024

pcc reopened this Aug 15, 2025

lntue requested a review from overmighty August 26, 2025 01:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[libc] Use UMAXV.4S to reduce bcmp result. #99260

[libc] Use UMAXV.4S to reduce bcmp result. #99260

Uh oh!

pcc commented Jul 17, 2024

Uh oh!

llvmbot commented Jul 17, 2024

Uh oh!

SchrodingerZhu commented Aug 3, 2024

Uh oh!

SchrodingerZhu left a comment

Uh oh!

pcc commented Aug 15, 2025

Uh oh!

SchrodingerZhu commented Aug 15, 2025 via email

Uh oh!

Uh oh!

[libc] Use UMAXV.4S to reduce bcmp result. #99260

Are you sure you want to change the base?

[libc] Use UMAXV.4S to reduce bcmp result. #99260

Uh oh!

Conversation

pcc commented Jul 17, 2024

Uh oh!

llvmbot commented Jul 17, 2024

Uh oh!

SchrodingerZhu commented Aug 3, 2024

Uh oh!

SchrodingerZhu left a comment

Choose a reason for hiding this comment

Uh oh!

pcc commented Aug 15, 2025

Uh oh!

SchrodingerZhu commented Aug 15, 2025 via email

Uh oh!

Uh oh!