cksum: take out regex pattern matching #7580

GTimothy · 2025-03-25T14:27:52Z

After reading #7504, i looked in https://github.com/uutils/coreutils-tracking/blob/main/individual-size-results/ and saw that cksum had a big jump in size as well. I identified the culprit: bringing in the regex crate.

After looking a bit into regex for bytes, I tried removing the regex calls altogether and i came up with this branch.

The binary size is much smaller: 1.4MB vs 3.1MB .
The binary is much faster (?!) tested on a 162500 lines long file of b2sum 50% tagged and 50%untagged checksum :

taskset -c 0 hyperfine --warmup 10 -L cksum ../cksum.latest,../target/release/cksum "{cksum} -c checkbig.b2sum" --export-csv output.csv
Benchmark 1: ../cksum.latest -c checkbig.b2sum
  Time (mean ± σ):     914.7 ms ±   2.8 ms    [User: 451.6 ms, System: 447.8 ms]
  Range (min … max):   910.0 ms … 920.1 ms    10 runs

Benchmark 2: ../target/release/cksum -c checkbig.b2sum
  Time (mean ± σ):     658.5 ms ±   1.4 ms    [User: 196.4 ms, System: 446.7 ms]
  Range (min … max):   656.4 ms … 660.2 ms    10 runs

Summary
  '../target/release/cksum -c checkbig.b2sum' ran
    1.39 ± 0.01 times faster than '../cksum.latest -c checkbig.b2sum'

And passes the gnu tests:

bash util/run-gnu-test.sh ../gnu/tests/cksum/*.pl ../gnu/tests/cksum/*.sh

PASS: ../gnu/tests/cksum/md5sum-newline.pl
PASS: ../gnu/tests/cksum/sha224sum.pl
PASS: ../gnu/tests/cksum/sha256sum.pl
PASS: ../gnu/tests/cksum/sum.pl
PASS: ../gnu/tests/cksum/sm3sum.pl
PASS: ../gnu/tests/cksum/sha1sum.pl
PASS: ../gnu/tests/cksum/sha384sum.pl
PASS: ../gnu/tests/cksum/sha512sum.pl
PASS: ../gnu/tests/cksum/md5sum.pl
PASS: ../gnu/tests/cksum/cksum-base64.pl
PASS: ../gnu/tests/cksum/cksum-raw.sh
PASS: ../gnu/tests/cksum/cksum-c.sh
PASS: ../gnu/tests/cksum/md5sum-bsd.sh
PASS: ../gnu/tests/cksum/md5sum-parallel.sh
PASS: ../gnu/tests/cksum/b2sum.sh
PASS: ../gnu/tests/cksum/cksum-a.sh
PASS: ../gnu/tests/cksum/cksum.sh
PASS: ../gnu/tests/cksum/sum-sysv.sh
PASS: ../gnu/tests/cksum/sha1sum-vec.pl
============================================================================
Testsuite summary for GNU coreutils 9.6-dirty
============================================================================
# TOTAL: 19
# PASS:  19
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
============================================================================

as well as:

make UTILS=cksum test

test result: ok. 135 passed; 0 failed; 1 ignored; 0 measured; 0 filtered out; finished in 0.54s

If this is of value, please let me know. Then I would still have some cleanup to do.

Any remarks/thoughts are very welcome.

github-actions · 2025-03-25T15:14:17Z

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

sylvestre · 2025-03-25T15:28:55Z

of course, perf and file size are key!

github-actions · 2025-03-25T19:06:48Z

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

GTimothy · 2025-03-25T22:45:41Z

I think this is ready for review.
As of now,
-the binary size is much smaller: 1.4MB vs 3.1MB .
-perf seems much closer to that of my distro's cksum

taskset -c 0 hyperfine --warmup 10 -L cksum ../cksum.latest,../target/release/cksum,cksum "{cksum} -c checkbig.b2sum"
Benchmark 1: ../cksum.latest -c checkbig.b2sum
  Time (mean ± σ):     932.7 ms ±   2.9 ms    [User: 463.1 ms, System: 450.7 ms]
  Range (min … max):   929.1 ms … 939.7 ms    10 runs

Benchmark 2: ../target/release/cksum -c checkbig.b2sum
  Time (mean ± σ):     667.3 ms ±   1.5 ms    [User: 208.3 ms, System: 441.9 ms]
  Range (min … max):   665.2 ms … 669.9 ms    10 runs

Benchmark 3: cksum -c checkbig.b2sum
  Time (mean ± σ):     629.7 ms ±   0.8 ms    [User: 162.2 ms, System: 451.2 ms]
  Range (min … max):   628.1 ms … 631.0 ms    10 runs

Summary
  'cksum -c checkbig.b2sum' ran
    1.06 ± 0.00 times faster than '../target/release/cksum -c checkbig.b2sum'
    1.48 ± 0.00 times faster than '../cksum.latest -c checkbig.b2sum'

github-actions · 2025-03-26T06:39:57Z

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

sylvestre · 2025-03-26T07:41:19Z

looks great! bravo

Dunno if it is a regression from this change but it seems that the fuzzer isn't happy:

=== TEST cksum ["cksum", "--untagged", "-a", "sha224", "/tmp/onbcuwafsc"]
File content (/tmp/checksum_file)
rD91pj59+zO22nIRPx0e/HkQnLQK3XUHQYb0rA==  /tmp/yhlguutsnv

Running test ["cksum", "--untagged", "-a", "sha224", "/tmp/onbcuwafsc"]
=== Compare result for: cksum ["--untagged", "-a", "sha224", "/tmp/onbcuwafsc"]
Rust stdout:
SHA224 (/tmp/onbcuwafsc) = e05d21dc0cc78150b068bf91688bb497257336cdf5b345ca5314da84

GNU stdout:
e05d21dc0cc78150b068bf91688bb497257336cdf5b345ca5314da84  /tmp/onbcuwafsc

--- START diff
-SHA224 (/tmp/onbcuwafsc) = e05d21dc0cc78150b068bf91688bb497257336cdf5b345ca5314da84
+e05d21dc0cc78150b068bf91688bb497257336cdf5b345ca5314da84  /tmp/onbcuwafsc
--- END diff

Discrepancies detected: stdout differs
===  KO  Test failed and will panic for: cksum ["--untagged", "-a", "sha224", "/tmp/onbcuwafsc"]

thread '<unnamed>' panicked at fuzz_targets/fuzz_common/mod.rs:375:13:
Test failed for: cksum ["--untagged", "-a", "sha224", "/tmp/onbcuwafsc"]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
==3121== ERROR: libFuzzer: deadly signal

GTimothy · 2025-03-26T09:11:33Z

I cannot find your fuzzer error in the latest test, was it in a previous test ?

The CICD / MinRustV issue was my fault, I accidentally swapped variables filename and checksum when updating tests. The f5a569e commit should fix this.

sylvestre · 2025-03-26T15:55:17Z

https://github.com/uutils/coreutils/actions/runs/14071244361/job/39420840651

RenjiSann · 2025-03-26T17:51:51Z

looks great! bravo

Dunno if it is a regression from this change but it seems that the fuzzer isn't happy:

The fuzzer discrepancy is not caused by this change. This PR seems only affect cksum --check, so I think it is not the reason for this issue. Good catch though

RenjiSann · 2025-03-26T17:55:47Z

Very good changes overall 👍
It's cool to see such an improvement in performance and binary size.

~~One small comment, I would keep the regexes as comments in the different LineFormat::parse_*, to show what the code is actually parsing.~~ Meh, that's nitpicking, I'm good with the templates you already included.

RenjiSann · 2025-03-26T19:38:31Z

src/uucore/src/lib/features/checksum.rs

+        let (filename, checksum) = ByteSliceExt::split_once(after_paren, b") = ")
+            .or_else(|| ByteSliceExt::split_once(after_paren, b")= "))?;


This won't work for files called like 'foo) = ', because the .position() used in ByteSliceExt is short-circuiting, when we'd like it to be greedy.

I see, I did not think of this case. If I implement rsplit_once that iterates from right to left, that would solve this, no?
I'll add a test case.

Also, I think there is a second issue here.

MD5(filename)= fds65dsf46as5df4d6f54asd is valid openssl syntax.
MD5 (filename) = fds65dsf46as5df4d6f54asd is valid posix syntax.
MD5(filename) = fds65dsf46as5df4d6f54asd is invalid syntax I think but would match here. I'll add a test case.

I guess that'll work. I'd add a test where there is a ' ) = ' in the middle of the digest as well to see if the error message is the one expected.

GTimothy · 2025-03-27T09:45:35Z

looks great! bravo
Dunno if it is a regression from this change but it seems that the fuzzer isn't happy:

The fuzzer discrepancy is not caused by this change. This PR seems only affect cksum --check, so I think it is not the reason for this issue. Good catch though

I am looking into it none the less, but independently of this PR.

src/uucore/src/lib/features/checksum.rs

github-actions · 2025-03-27T12:10:31Z

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

github-actions · 2025-03-27T13:25:07Z

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

GTimothy · 2025-03-28T07:29:35Z

failures:

---- test_who::test_boot stdout ----
run: who --version
uutils-tests-info: version for the reference coreutil 'who' is higher than expected; expected: 8.3, found: 9.4
run: who -b
run: /home/runner/work/coreutils/coreutils/target/x86_64-unknown-linux-gnu/debug/coreutils who -b

thread 'test_who::test_boot' panicked at tests/by-util/test_who.rs:37:39:
assertion failed: `(left == right)`

Diff < left / right > :
>         system boot  Mar 27 13:00
>


stack backtrace:
   0: rust_begin_unwind
             at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/std/src/panicking.rs:692:5
   1: core::panicking::panic_fmt
             at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/panicking.rs:75:14
   2: tests::common::util::CmdResult::stdout_is
             at ./tests/common/util.rs:481:9
   3: tests::test_who::test_boot
             at ./tests/by-util/test_who.rs:37:9
   4: tests::test_who::test_boot::{{closure}}
             at ./tests/by-util/test_who.rs:33:15
   5: core::ops::function::FnOnce::call_once
             at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/ops/function.rs:250:5
   6: core::ops::function::FnOnce::call_once
             at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


failures:
    test_who::test_boot

test result: FAILED. 3836 passed; 1 failed; 54 ignored; 0 measured; 0 filtered out; finished in 92.27s

I am pretty confident I did not cause this CICD/build error.

sylvestre · 2025-03-28T10:56:46Z

Could you please improve the stack ?
Some changes should be merged into one (spelling for example) but please keep separate (at least) utf-8 and regexp removal

GTimothy · 2025-03-29T11:57:08Z

The utf-8 was a regression on my part, the original regex handled filenames as bytes, my first rewrite did not.
I was thinking of merging the commits into three:

rewrite lineformat parsing without regex (merging utf8 and regexp)
update tests to test new parsers not regex
fix : filename that include separator should parse + add tests

maybe a fourth:

fix: hybrid posix/openssl parsing should fail + add tests

What do you think ?

RenjiSann · 2025-03-29T12:11:46Z

The utf-8 was a regression on my part, the original regex handled filenames as bytes, my first rewrite did not. I was thinking of merging the commits into three:
* rewrite lineformat parsing without regex (merging utf8 and regexp)

* update tests to test new parsers not regex

* fix : filename that include separator should parse + add tests
maybe a fourth:
* fix: hybrid posix/openssl parsing should fail  + add tests
What do you think ?

The fourth one is not needed IMO, the 3 others are good 👍

GTimothy · 2025-03-29T13:03:21Z

Is this ok?

github-actions · 2025-03-29T13:37:36Z

GNU testsuite comparison:

Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)

removes dependency on the regex crate for LineFormat detection and parsing, resulting in a faster and lighter cksum binary.

…dd tests fixes this non-regex implementation's flaw with file_names containing the separator's pattern: - replaces left-to-right greedy separator match with right-to-left one. - added bugfix tests fixes secondary bug: positive match on hybrid posix-openssl format adds secondary bugfix tests Co-authored-by: Dorian Péron <72708393+RenjiSann@users.noreply.github.com>

RenjiSann · 2025-03-31T15:18:38Z

I'm wondering what's wrong with the SELinux test

github-actions · 2025-03-31T16:21:23Z

GNU testsuite comparison:

Skip an intermittent issue tests/misc/stdbuf (fails in this run but passes in the 'main' branch)
Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

GTimothy force-pushed the cksum_no_regex branch 2 times, most recently from bbbbd00 to b1a19f7 Compare March 25, 2025 18:46

GTimothy force-pushed the cksum_no_regex branch from b1a19f7 to 8f41040 Compare March 25, 2025 22:23

GTimothy marked this pull request as ready for review March 25, 2025 22:45

RenjiSann reviewed Mar 26, 2025

View reviewed changes

RenjiSann reviewed Mar 27, 2025

View reviewed changes

src/uucore/src/lib/features/checksum.rs Show resolved Hide resolved

GTimothy force-pushed the cksum_no_regex branch from 5178fc0 to 80e5b2f Compare March 27, 2025 12:43

GTimothy force-pushed the cksum_no_regex branch from 647adc2 to fca3971 Compare March 29, 2025 13:01

GTimothy added 3 commits March 31, 2025 17:18

checksum/cksum: rewrite lineformat parsing without regex

621f2b5

removes dependency on the regex crate for LineFormat detection and parsing, resulting in a faster and lighter cksum binary.

checksum/cksum: update tests to test new parsers not regex

04ad555

RenjiSann force-pushed the cksum_no_regex branch from fca3971 to 09a9dc7 Compare March 31, 2025 15:18

RenjiSann merged commit 6675460 into uutils:main Apr 1, 2025
67 of 68 checks passed

BrewTestBot mentioned this pull request May 24, 2025

uutils-coreutils 0.1.0 Homebrew/homebrew-core#224645

Merged

moonfruit mentioned this pull request May 26, 2025

uutils-selected 0.1.0 moonfruit/homebrew-tap#243

Closed

		let (filename, checksum) = ByteSliceExt::split_once(after_paren, b") = ")
		.or_else(\|\| ByteSliceExt::split_once(after_paren, b")= "))?;

Uh oh!

cksum: take out regex pattern matching #7580

cksum: take out regex pattern matching #7580

Uh oh!

Conversation

GTimothy commented Mar 25, 2025

Uh oh!

github-actions bot commented Mar 25, 2025

Uh oh!

sylvestre commented Mar 25, 2025

Uh oh!

github-actions bot commented Mar 25, 2025

Uh oh!

GTimothy commented Mar 25, 2025

Uh oh!

github-actions bot commented Mar 26, 2025

Uh oh!

sylvestre commented Mar 26, 2025

Uh oh!

GTimothy commented Mar 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sylvestre commented Mar 26, 2025

Uh oh!

RenjiSann commented Mar 26, 2025

Uh oh!

RenjiSann commented Mar 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

RenjiSann Mar 26, 2025

Choose a reason for hiding this comment

Uh oh!

GTimothy Mar 27, 2025

Choose a reason for hiding this comment

Uh oh!

RenjiSann Mar 27, 2025

Choose a reason for hiding this comment

Uh oh!

GTimothy commented Mar 27, 2025

Uh oh!

Uh oh!

github-actions bot commented Mar 27, 2025

Uh oh!

github-actions bot commented Mar 27, 2025

Uh oh!

GTimothy commented Mar 28, 2025

Uh oh!

sylvestre commented Mar 28, 2025

Uh oh!

GTimothy commented Mar 29, 2025

Uh oh!

RenjiSann commented Mar 29, 2025

Uh oh!

GTimothy commented Mar 29, 2025

Uh oh!

github-actions bot commented Mar 29, 2025

Uh oh!

RenjiSann commented Mar 31, 2025

Uh oh!

github-actions bot commented Mar 31, 2025

Uh oh!

Uh oh!

Uh oh!

GTimothy commented Mar 26, 2025 •

edited

Loading

RenjiSann commented Mar 26, 2025 •

edited

Loading