wc: Speed optimization #7934

Merged · 2 commits merged into uutils:main on May 15, 2025

Conversation

drinkcat
Contributor

Fixes #7929.

wc: Align buffer to 32-byte boundary

bytecount uses vector operations to speed up line counting.
At least on x86 with AVX2 support, the vectors are 256 bits (32 bytes)
wide, and operations are much faster when the data is aligned.

Improves total performance by about 4%, matching GNU wc's performance.

wc: Increase buffer size to 256 KiB

Improves performance by about 4% on large files.
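The combined effect of the two commits can be sketched as a minimal, self-contained Rust snippet. The `AlignedBuffer` struct and `#[repr(align(32))]` attribute match the diff quoted later in this thread; the 256 KiB value comes from the commit message. The `main` function here is only an illustration, not the actual wc code path:

```rust
// Read-buffer size from the "Increase buffer size to 256 KiB" commit.
const BUF_SIZE: usize = 256 * 1024;

// Wrapping the byte array in a #[repr(align(32))] struct guarantees a
// 32-byte starting address, so bytecount's 256-bit AVX2 loads operate
// on aligned data (matching the diff in src/uu/wc/src/count_fast.rs).
#[repr(align(32))]
struct AlignedBuffer {
    data: [u8; BUF_SIZE],
}

fn main() {
    let buf = Box::new(AlignedBuffer { data: [0u8; BUF_SIZE] });
    // The first element of `data` inherits the struct's 32-byte alignment.
    assert_eq!(buf.data.as_ptr() as usize % 32, 0);
    println!("buffer of {} bytes is 32-byte aligned", buf.data.len());
}
```

Because the alignment is a property of the type, every allocation of `AlignedBuffer` (stack or heap) gets the guarantee without any manual pointer adjustment.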


This gets us close to, or better than, GNU's version:

seq 10000000000000 1 inf | head -n 100000000 > seq100M
cargo build -r -p uu_wc && taskset -c 0 hyperfine --warmup 10 -L wc ./wc.main,target/release/wc,wc "{wc} -l tmp/seq100M"
   Compiling uu_wc v0.0.30 (/home/drinkcat/dev/coreutils/coreutils/src/uu/wc)
    Finished `release` profile [optimized] target(s) in 11.51s
Benchmark 1: ./wc.main -l tmp/seq100M
  Time (mean ± σ):     265.1 ms ±   5.4 ms    [User: 49.7 ms, System: 212.3 ms]
  Range (min … max):   259.8 ms … 279.8 ms    11 runs
 
Benchmark 2: target/release/wc -l tmp/seq100M
  Time (mean ± σ):     239.0 ms ±   4.1 ms    [User: 38.2 ms, System: 198.1 ms]
  Range (min … max):   233.2 ms … 249.2 ms    12 runs
 
Benchmark 3: wc -l tmp/seq100M
  Time (mean ± σ):     243.4 ms ±   6.2 ms    [User: 43.9 ms, System: 196.5 ms]
  Range (min … max):   237.5 ms … 258.1 ms    11 runs
 
Summary
  target/release/wc -l tmp/seq100M ran
    1.02 ± 0.03 times faster than wc -l tmp/seq100M
    1.11 ± 0.03 times faster than ./wc.main -l tmp/seq100M

And on 1brc dataset from original report:

cargo build -r -p uu_wc && taskset -c 0 hyperfine --warmup 3 -L wc ./wc.main,target/release/wc,wc "{wc} -l ../../1brc/data/measurements.txt"
    Finished `release` profile [optimized] target(s) in 0.10s
Benchmark 1: ./wc.main -l ../../1brc/data/measurements.txt
  Time (mean ± σ):      2.633 s ±  0.016 s    [User: 0.533 s, System: 2.075 s]
  Range (min … max):    2.615 s …  2.668 s    10 runs
 
Benchmark 2: target/release/wc -l ../../1brc/data/measurements.txt
  Time (mean ± σ):      2.375 s ±  0.017 s    [User: 0.404 s, System: 1.948 s]
  Range (min … max):    2.355 s …  2.406 s    10 runs
 
Benchmark 3: wc -l ../../1brc/data/measurements.txt
  Time (mean ± σ):      2.408 s ±  0.017 s    [User: 0.457 s, System: 1.928 s]
  Range (min … max):    2.383 s …  2.440 s    10 runs
 
Summary
  target/release/wc -l ../../1brc/data/measurements.txt ran
    1.01 ± 0.01 times faster than wc -l ../../1brc/data/measurements.txt
    1.11 ± 0.01 times faster than ./wc.main -l ../../1brc/data/measurements.txt

drinkcat added 2 commits May 15, 2025 05:02
@drinkcat drinkcat changed the title Wc faster wc: Speed optimization May 15, 2025

GNU testsuite comparison:

Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)

@jnsgruk

jnsgruk commented May 15, 2025

Nice, same result on my machine, which is a 7040 series Ryzen laptop chip.

❯ hyperfine --warmup 3 "wc -l ~/temp/measurements.txt" "target/release/wc -l ~/temp/measurements.txt"
Benchmark 1: wc -l ~/omgrust/measurements.txt
  Time (mean ± σ):      1.528 s ±  0.068 s    [User: 0.166 s, System: 1.362 s]
  Range (min … max):    1.435 s …  1.626 s    10 runs

Benchmark 2: target/release/wc -l ~/omgrust/measurements.txt
  Time (mean ± σ):      1.378 s ±  0.074 s    [User: 0.182 s, System: 1.195 s]
  Range (min … max):    1.270 s …  1.525 s    10 runs

Summary
  target/release/wc -l ~/temp/measurements.txt ran
    1.11 ± 0.08 times faster than wc -l ~/temp/measurements.txt

@sylvestre sylvestre merged commit c76c0ad into uutils:main May 15, 2025
70 checks passed
@sylvestre
Contributor

well done!

@drinkcat
Contributor Author

Nice, same result on my machine, which is a 7040 series Ryzen laptop chip.

It's actually a lot better than what I see. Interesting! Thanks for testing.

@willshuttleworth
Contributor

Previously the coreutils implementation was about 10x faster on my M1 Mac; now it's about 13x faster!

hyperfine -w 3 'wc -l /tmp/seq100M' './target/release/wc -l /tmp/seq100M'
Benchmark 1: wc -l /tmp/seq100M
  Time (mean ± σ):     900.0 ms ±   0.3 ms    [User: 844.2 ms, System: 55.4 ms]
  Range (min … max):   899.6 ms … 900.4 ms    10 runs

Benchmark 2: ./target/release/wc -l /tmp/seq100M
  Time (mean ± σ):      66.8 ms ±   0.8 ms    [User: 12.6 ms, System: 53.8 ms]
  Range (min … max):    65.8 ms …  70.6 ms    43 runs

Summary
  ./target/release/wc -l /tmp/seq100M ran
   13.47 ± 0.17 times faster than wc -l /tmp/seq100M

@sylvestre
Contributor

Summary
./target/release/wc -l /tmp/seq100M ran
13.47 ± 0.17 times faster than wc -l /tmp/seq100M

I wonder how that is possible :)

Is wc here the BSD Apple implementation?

@willshuttleworth
Contributor

Yes, I am using the default Apple implementation.

@drinkcat
Contributor Author

@willshuttleworth oh nice, thanks!

I'm not completely sure about the aarch64 code, but I see some load operations on 8x16x4 lanes, which sounds like 512 bits / 64 bytes (though the underlying registers are 128 bits wide).

Mind trying to see if 64-byte alignment improves performance?

diff --git a/src/uu/wc/src/count_fast.rs b/src/uu/wc/src/count_fast.rs
index 479183263b21..0dd7c3882a31 100644
--- a/src/uu/wc/src/count_fast.rs
+++ b/src/uu/wc/src/count_fast.rs
@@ -201,7 +201,7 @@ pub(crate) fn count_bytes_fast<T: WordCountable>(handle: &mut T) -> (usize, Opti
 ///
 /// This is useful as bytecount uses 256-bit wide vector operations that run much
 /// faster on aligned data (at least on x86 with AVX2 support).
-#[repr(align(32))]
+#[repr(align(64))]
 struct AlignedBuffer {
     data: [u8; BUF_SIZE],
 }

(you can mv target/release/wc wc.main before the change to easily compare)

Thanks!

@willshuttleworth
Contributor

@drinkcat I'm not seeing a difference with 64-byte alignment:

hyperfine -w 3 './wc-main -l /tmp/seq100M' './target/release/wc -l /tmp/seq100M'
Benchmark 1: ./wc-main -l /tmp/seq100M
  Time (mean ± σ):      66.0 ms ±   0.3 ms    [User: 12.6 ms, System: 53.0 ms]
  Range (min … max):    65.5 ms …  66.6 ms    43 runs

Benchmark 2: ./target/release/wc -l /tmp/seq100M
  Time (mean ± σ):      66.0 ms ±   0.3 ms    [User: 12.6 ms, System: 53.0 ms]
  Range (min … max):    65.4 ms …  66.8 ms    43 runs

Summary
  ./target/release/wc -l /tmp/seq100M ran
    1.00 ± 0.01 times faster than ./wc-main -l /tmp/seq100M

@drinkcat
Contributor Author

Neat, thanks for trying!

Successfully merging this pull request may close these issues.

wc -l appears to be consistently slower than gnu implementation