wc: Speed optimization #7934

Merged · 2 commits merged into uutils:main on May 15, 2025

Conversation

drinkcat
Contributor

Fixes #7929.

wc: Align buffer to 32-byte boundary

bytecount uses vector operations to speed up line counting.
At least on x86 with AVX2 support, the vectors are 256 bits (32 bytes)
wide, and operations are much faster when the data is aligned.

Improves total performance by about 4%, matching GNU wc's performance.

wc: Increase buffer size to 256 KiB

Improves performance by about 4% on large files.
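The combined effect of the two commits can be sketched as a minimal, self-contained Rust snippet. The `AlignedBuffer` struct and `#[repr(align(32))]` attribute match the diff quoted later in this thread; the 256 KiB value comes from the commit message. The `main` function here is only an illustration, not the actual wc code path:

```rust
// Read-buffer size from the "Increase buffer size to 256 KiB" commit.
const BUF_SIZE: usize = 256 * 1024;

// Wrapping the byte array in a #[repr(align(32))] struct guarantees a
// 32-byte starting address, so bytecount's 256-bit AVX2 loads operate
// on aligned data (matching the diff in src/uu/wc/src/count_fast.rs).
#[repr(align(32))]
struct AlignedBuffer {
    data: [u8; BUF_SIZE],
}

fn main() {
    let buf = Box::new(AlignedBuffer { data: [0u8; BUF_SIZE] });
    // The first element of `data` inherits the struct's 32-byte alignment.
    assert_eq!(buf.data.as_ptr() as usize % 32, 0);
    println!("buffer of {} bytes is 32-byte aligned", buf.data.len());
}
```

Because the alignment is a property of the type, every allocation of `AlignedBuffer` (stack or heap) gets the guarantee without any manual pointer adjustment.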


This gets us close to, or better than, GNU's version:

seq 10000000000000 1 inf | head -n 100000000 > seq100M
cargo build -r -p uu_wc && taskset -c 0 hyperfine --warmup 10 -L wc ./wc.main,target/release/wc,wc "{wc} -l tmp/seq100M"
   Compiling uu_wc v0.0.30 (/home/drinkcat/dev/coreutils/coreutils/src/uu/wc)
    Finished `release` profile [optimized] target(s) in 11.51s
Benchmark 1: ./wc.main -l tmp/seq100M
  Time (mean ± σ):     265.1 ms ±   5.4 ms    [User: 49.7 ms, System: 212.3 ms]
  Range (min … max):   259.8 ms … 279.8 ms    11 runs
 
Benchmark 2: target/release/wc -l tmp/seq100M
  Time (mean ± σ):     239.0 ms ±   4.1 ms    [User: 38.2 ms, System: 198.1 ms]
  Range (min … max):   233.2 ms … 249.2 ms    12 runs
 
Benchmark 3: wc -l tmp/seq100M
  Time (mean ± σ):     243.4 ms ±   6.2 ms    [User: 43.9 ms, System: 196.5 ms]
  Range (min … max):   237.5 ms … 258.1 ms    11 runs
 
Summary
  target/release/wc -l tmp/seq100M ran
    1.02 ± 0.03 times faster than wc -l tmp/seq100M
    1.11 ± 0.03 times faster than ./wc.main -l tmp/seq100M

And on 1brc dataset from original report:

cargo build -r -p uu_wc && taskset -c 0 hyperfine --warmup 3 -L wc ./wc.main,target/release/wc,wc "{wc} -l ../../1brc/data/measurements.txt"
    Finished `release` profile [optimized] target(s) in 0.10s
Benchmark 1: ./wc.main -l ../../1brc/data/measurements.txt
  Time (mean ± σ):      2.633 s ±  0.016 s    [User: 0.533 s, System: 2.075 s]
  Range (min … max):    2.615 s …  2.668 s    10 runs
 
Benchmark 2: target/release/wc -l ../../1brc/data/measurements.txt
  Time (mean ± σ):      2.375 s ±  0.017 s    [User: 0.404 s, System: 1.948 s]
  Range (min … max):    2.355 s …  2.406 s    10 runs
 
Benchmark 3: wc -l ../../1brc/data/measurements.txt
  Time (mean ± σ):      2.408 s ±  0.017 s    [User: 0.457 s, System: 1.928 s]
  Range (min … max):    2.383 s …  2.440 s    10 runs
 
Summary
  target/release/wc -l ../../1brc/data/measurements.txt ran
    1.01 ± 0.01 times faster than wc -l ../../1brc/data/measurements.txt
    1.11 ± 0.01 times faster than ./wc.main -l ../../1brc/data/measurements.txt

drinkcat added 2 commits May 15, 2025 05:02
@drinkcat drinkcat changed the title Wc faster wc: Speed optimization May 15, 2025

GNU testsuite comparison:

Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)

@jnsgruk

jnsgruk commented May 15, 2025

Nice, same result on my machine, which is a 7040 series Ryzen laptop chip.

❯ hyperfine --warmup 3 "wc -l ~/temp/measurements.txt" "target/release/wc -l ~/temp/measurements.txt"
Benchmark 1: wc -l ~/omgrust/measurements.txt
  Time (mean ± σ):      1.528 s ±  0.068 s    [User: 0.166 s, System: 1.362 s]
  Range (min … max):    1.435 s …  1.626 s    10 runs

Benchmark 2: target/release/wc -l ~/omgrust/measurements.txt
  Time (mean ± σ):      1.378 s ±  0.074 s    [User: 0.182 s, System: 1.195 s]
  Range (min … max):    1.270 s …  1.525 s    10 runs

Summary
  target/release/wc -l ~/temp/measurements.txt ran
    1.11 ± 0.08 times faster than wc -l ~/temp/measurements.txt

@sylvestre sylvestre merged commit c76c0ad into uutils:main May 15, 2025
70 checks passed
@sylvestre
Contributor

well done!

@drinkcat
Contributor Author

Nice, same result on my machine, which is a 7040 series Ryzen laptop chip.

It's actually a lot better than what I see. Interesting! Thanks for testing.

@willshuttleworth
Contributor

Previously the coreutils implementation was about 10x faster on my M1 Mac; now it's about 13x faster!

hyperfine -w 3 'wc -l /tmp/seq100M' './target/release/wc -l /tmp/seq100M'
Benchmark 1: wc -l /tmp/seq100M
  Time (mean ± σ):     900.0 ms ±   0.3 ms    [User: 844.2 ms, System: 55.4 ms]
  Range (min … max):   899.6 ms … 900.4 ms    10 runs

Benchmark 2: ./target/release/wc -l /tmp/seq100M
  Time (mean ± σ):      66.8 ms ±   0.8 ms    [User: 12.6 ms, System: 53.8 ms]
  Range (min … max):    65.8 ms …  70.6 ms    43 runs

Summary
  ./target/release/wc -l /tmp/seq100M ran
   13.47 ± 0.17 times faster than wc -l /tmp/seq100M

@sylvestre
Contributor

Summary
./target/release/wc -l /tmp/seq100M ran
13.47 ± 0.17 times faster than wc -l /tmp/seq100M

I wonder how that is possible :)

Is wc here the BSD Apple implementation?

@willshuttleworth
Contributor

Yes, I am using the default Apple implementation.

@drinkcat
Contributor Author

@willshuttleworth oh nice, thanks!

I'm not completely sure about the aarch64 code, but I see some load operations on 8x16x4 lanes, which sounds like 512 bits / 64 bytes (though the underlying registers are 128 bits wide).

Mind trying to see if 64-byte alignment improves performance?

diff --git a/src/uu/wc/src/count_fast.rs b/src/uu/wc/src/count_fast.rs
index 479183263b21..0dd7c3882a31 100644
--- a/src/uu/wc/src/count_fast.rs
+++ b/src/uu/wc/src/count_fast.rs
@@ -201,7 +201,7 @@ pub(crate) fn count_bytes_fast<T: WordCountable>(handle: &mut T) -> (usize, Opti
 ///
 /// This is useful as bytecount uses 256-bit wide vector operations that run much
 /// faster on aligned data (at least on x86 with AVX2 support).
-#[repr(align(32))]
+#[repr(align(64))]
 struct AlignedBuffer {
     data: [u8; BUF_SIZE],
 }

(you can mv target/release/wc wc.main before the change to easily compare)

Thanks!

@willshuttleworth
Contributor

@drinkcat I'm not seeing a difference with 64-byte alignment:

hyperfine -w 3 './wc-main -l /tmp/seq100M' './target/release/wc -l /tmp/seq100M'
Benchmark 1: ./wc-main -l /tmp/seq100M
  Time (mean ± σ):      66.0 ms ±   0.3 ms    [User: 12.6 ms, System: 53.0 ms]
  Range (min … max):    65.5 ms …  66.6 ms    43 runs

Benchmark 2: ./target/release/wc -l /tmp/seq100M
  Time (mean ± σ):      66.0 ms ±   0.3 ms    [User: 12.6 ms, System: 53.0 ms]
  Range (min … max):    65.4 ms …  66.8 ms    43 runs

Summary
  ./target/release/wc -l /tmp/seq100M ran
    1.00 ± 0.01 times faster than ./wc-main -l /tmp/seq100M

@drinkcat
Contributor Author

Neat, thanks for trying!

Successfully merging this pull request may close these issues.

wc -l appears to be consistently slower than gnu implementation