Skip to content

Conversation

pcc
Copy link
Contributor

@pcc pcc commented Jul 17, 2024

We can use UMAXV.4S to reduce the comparison result in a single
instruction. This improves performance by roughly 4% on Apple M1:

Summary
bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 ran
1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.02 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10

(1 = original, 2 = a variant of this patch that uses UMAXV.16B, 3 = this patch)

Created using spr 1.3.6-beta.1
@llvmbot llvmbot added the libc label Jul 17, 2024
@llvmbot
Copy link
Member

llvmbot commented Jul 17, 2024

@llvm/pr-subscribers-libc

Author: None (pcc)

Changes

We can use UMAXV.4S to reduce the comparison result in a single
instruction. This improves performance by roughly 4% on Apple M1:

Summary
bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10 ran
1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark3 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.01 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.02 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.03 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark2 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.03 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10
1.05 ± 0.02 times faster than bin/libc.src.string.bcmp_benchmark1 --study-name="new bcmp" --sweep-mode --sweep-max-size=128 --output=/dev/null --num-trials=10

(1 = original, 2 = a variant of this patch that uses UMAXV.16B, 3 = this patch)


Full diff: https://github.com/llvm/llvm-project/pull/99260.diff

1 Files Affected:

  • (modified) libc/src/string/memory_utils/op_aarch64.h (+6-12)
diff --git a/libc/src/string/memory_utils/op_aarch64.h b/libc/src/string/memory_utils/op_aarch64.h
index 1090ea2617f09..5c08a6ae48b04 100644
--- a/libc/src/string/memory_utils/op_aarch64.h
+++ b/libc/src/string/memory_utils/op_aarch64.h
@@ -84,8 +84,7 @@ template <size_t Size> struct Bcmp {
       uint8x16_t a = vld1q_u8(_p1);
       uint8x16_t n = vld1q_u8(_p2);
       uint8x16_t an = veorq_u8(a, n);
-      uint32x2_t an_reduced = vqmovn_u64(vreinterpretq_u64_u8(an));
-      return vmaxv_u32(an_reduced);
+      return vmaxvq_u32(vreinterpretq_u32_u8(an));
     } else if constexpr (Size == 32) {
       auto _p1 = as_u8(p1);
       auto _p2 = as_u8(p2);
@@ -97,12 +96,9 @@ template <size_t Size> struct Bcmp {
       uint8x16_t bo = veorq_u8(b, o);
       // anbo = (a ^ n) | (b ^ o).  At least one byte is nonzero if there is
       // a difference between the two buffers.  We reduce this value down to 4
-      // bytes in two steps. First, calculate the saturated move value when
-      // going from 2x64b to 2x32b. Second, compute the max of the 2x32b to get
-      // a single 32 bit nonzero value if a mismatch occurred.
+      // bytes using the UMAXV instruction to compute the max across the vector.
       uint8x16_t anbo = vorrq_u8(an, bo);
-      uint32x2_t anbo_reduced = vqmovn_u64(vreinterpretq_u64_u8(anbo));
-      return vmaxv_u32(anbo_reduced);
+      return vmaxvq_u32(vreinterpretq_u32_u8(anbo));
     } else if constexpr ((Size % BlockSize) == 0) {
       for (size_t offset = 0; offset < Size; offset += BlockSize)
         if (auto value = Bcmp<BlockSize>::block(p1 + offset, p2 + offset))
@@ -129,8 +125,7 @@ template <size_t Size> struct Bcmp {
       uint8x16_t bo = veorq_u8(b, o);
       // anbo = (a ^ n) | (b ^ o)
       uint8x16_t anbo = vorrq_u8(an, bo);
-      uint32x2_t anbo_reduced = vqmovn_u64(vreinterpretq_u64_u8(anbo));
-      return vmaxv_u32(anbo_reduced);
+      return vmaxvq_u32(vreinterpretq_u32_u8(anbo));
     } else if constexpr (Size == 32) {
       auto _p1 = as_u8(p1);
       auto _p2 = as_u8(p2);
@@ -150,9 +145,8 @@ template <size_t Size> struct Bcmp {
       uint8x16_t cpdq = vorrq_u8(cp, dq);
       // abnocpdq = ((a ^ n) | (b ^ o)) | ((c ^ p) | (d ^ q)).  Reduce this to
       // a nonzero 32 bit value if a mismatch occurred.
-      uint64x2_t abnocpdq = vreinterpretq_u64_u8(anbo | cpdq);
-      uint32x2_t abnocpdq_reduced = vqmovn_u64(abnocpdq);
-      return vmaxv_u32(abnocpdq_reduced);
+      uint8x16_t abnocpdq = anbo | cpdq;
+      return vmaxvq_u32(vreinterpretq_u32_u8(abnocpdq));
     } else {
       static_assert(cpp::always_false<decltype(Size)>, "SIZE not implemented");
     }

@pcc pcc requested a review from gchatelet July 17, 2024 00:51
@lntue lntue changed the title libc: Use UMAXV.4S to reduce bcmp result. [libc] Use UMAXV.4S to reduce bcmp result. Jul 17, 2024
@SchrodingerZhu
Copy link
Contributor

Hi,

Thank you for the patch. Unfortunately, I think the proposed change is causing failures in tests:

Ran 5 tests.  PASS: 5  FAIL: 0
[4171/5229] Running unit test libc.test.src.stdio.snprintf_test.__unit__
FAILED: libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.snprintf_test.__unit__ /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.snprintf_test.__unit__ 
cd /home/schrodinger/development/llvm-project/build/libc/test/src/stdio && /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/libc.test.src.stdio.snprintf_test.__unit__.__build__
[==========] Running 2 tests from 1 test suite.
[ RUN      ] LlvmLibcSNPrintfTest.CutOff
/home/schrodinger/development/llvm-project/libc/test/src/stdio/snprintf_test.cpp:23: FAILURE
      Expected: buff
      Which is: 
To be equal to: "A simple string"
      Which is: A simple string
[  FAILED  ] LlvmLibcSNPrintfTest.CutOff
[ RUN      ] LlvmLibcSNPrintfTest.NoCutOff
/home/schrodinger/development/llvm-project/libc/test/src/stdio/snprintf_test.cpp:53: FAILURE
      Expected: buff
      Which is: 
To be equal to: "A simple string with no conversions."
      Which is: A simple string with no conversions.
[  FAILED  ] LlvmLibcSNPrintfTest.NoCutOff
Ran 2 tests.  PASS: 0  FAIL: 2
[4172/5229] Running unit test libc.test.src.stdio.vsnprintf_test.__unit__
FAILED: libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.vsnprintf_test.__unit__ /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.vsnprintf_test.__unit__ 
cd /home/schrodinger/development/llvm-project/build/libc/test/src/stdio && /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/libc.test.src.stdio.vsnprintf_test.__unit__.__build__
[==========] Running 2 tests from 1 test suite.
[ RUN      ] LlvmLibcVSNPrintfTest.CutOff
/home/schrodinger/development/llvm-project/libc/test/src/stdio/vsnprintf_test.cpp:35: FAILURE
      Expected: buff
      Which is: 
To be equal to: "A simple string"
      Which is: A simple string
[  FAILED  ] LlvmLibcVSNPrintfTest.CutOff
[ RUN      ] LlvmLibcVSNPrintfTest.NoCutOff
/home/schrodinger/development/llvm-project/libc/test/src/stdio/vsnprintf_test.cpp:64: FAILURE
      Expected: buff
      Which is: 
To be equal to: "A simple string with no conversions."
      Which is: A simple string with no conversions.
[  FAILED  ] LlvmLibcVSNPrintfTest.NoCutOff
Ran 2 tests.  PASS: 0  FAIL: 2
[4173/5229] Running unit test libc.test.src.stdio.fprintf_test.__unit__
FAILED: libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.fprintf_test.__unit__ /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/CMakeFiles/libc.test.src.stdio.fprintf_test.__unit__ 
cd /home/schrodinger/development/llvm-project/build/libc/test/src/stdio && /home/schrodinger/development/llvm-project/build/libc/test/src/stdio/libc.test.src.stdio.fprintf_test.__unit__.__build__
[==========] Running 1 test from 1 test suite.
[ RUN      ] LlvmLibcFPrintfTest.WriteToFile
/home/schrodinger/development/llvm-project/libc/test/src/stdio/fprintf_test.cpp:65: FAILURE
      Expected: printf_test::fread(data, 1, sizeof(simple) - 1, file)
      Which is: 0
To be equal to: sizeof(simple) - 1
      Which is: 37
[  FAILED  ] LlvmLibcFPrintfTest.WriteToFile
Ran 1 tests.  PASS: 0  FAIL: 1
[4184/5229] Linking CXX executable libc/test/src/stdio/libc.test.src.stdio.remove_test.__hermetic__.__build__
ninja: build stopped: subcommand failed.

Copy link
Contributor

@SchrodingerZhu SchrodingerZhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comment

@pcc pcc closed this Aug 5, 2024
@pcc
Copy link
Contributor Author

pcc commented Aug 15, 2025

can't reproduce this test failure. I did:

cmake -G Ninja -DLLVM_ENABLE_RUNTIMES=libc -DLLVM_ENABLE_PROJECTS=clang\;lld -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_ENABLE_ASSERTIONS=1 -DLLVM_TARGETS_TO_BUILD=AArch64 -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ../llvm
ninja libc
ninja -C ./runtimes/runtimes-bins check-libc

all tests passed.

@pcc pcc reopened this Aug 15, 2025
@SchrodingerZhu
Copy link
Contributor

SchrodingerZhu commented Aug 15, 2025 via email

@lntue lntue requested a review from overmighty August 26, 2025 01:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants