Skip to content

Conversation

GTimothy
Copy link
Contributor

After reading #7504, i looked in https://github.com/uutils/coreutils-tracking/blob/main/individual-size-results/ and saw that cksum had a big jump in size as well. I identified the culprit: bringing in the regex crate.

After looking a bit into regex for bytes, I tried removing the regex calls altogether and i came up with this branch.

The binary size is much smaller: 1.4MB vs 3.1MB .
The binary is much faster (?!) tested on a 162500 lines long file of b2sum 50% tagged and 50%untagged checksum :

taskset -c 0 hyperfine --warmup 10 -L cksum ../cksum.latest,../target/release/cksum "{cksum} -c checkbig.b2sum" --export-csv output.csv
Benchmark 1: ../cksum.latest -c checkbig.b2sum
  Time (mean ± σ):     914.7 ms ±   2.8 ms    [User: 451.6 ms, System: 447.8 ms]
  Range (min … max):   910.0 ms … 920.1 ms    10 runs

Benchmark 2: ../target/release/cksum -c checkbig.b2sum
  Time (mean ± σ):     658.5 ms ±   1.4 ms    [User: 196.4 ms, System: 446.7 ms]
  Range (min … max):   656.4 ms … 660.2 ms    10 runs

Summary
  '../target/release/cksum -c checkbig.b2sum' ran
    1.39 ± 0.01 times faster than '../cksum.latest -c checkbig.b2sum'

And passes the gnu tests:

bash util/run-gnu-test.sh ../gnu/tests/cksum/*.pl ../gnu/tests/cksum/*.sh
PASS: ../gnu/tests/cksum/md5sum-newline.pl
PASS: ../gnu/tests/cksum/sha224sum.pl
PASS: ../gnu/tests/cksum/sha256sum.pl
PASS: ../gnu/tests/cksum/sum.pl
PASS: ../gnu/tests/cksum/sm3sum.pl
PASS: ../gnu/tests/cksum/sha1sum.pl
PASS: ../gnu/tests/cksum/sha384sum.pl
PASS: ../gnu/tests/cksum/sha512sum.pl
PASS: ../gnu/tests/cksum/md5sum.pl
PASS: ../gnu/tests/cksum/cksum-base64.pl
PASS: ../gnu/tests/cksum/cksum-raw.sh
PASS: ../gnu/tests/cksum/cksum-c.sh
PASS: ../gnu/tests/cksum/md5sum-bsd.sh
PASS: ../gnu/tests/cksum/md5sum-parallel.sh
PASS: ../gnu/tests/cksum/b2sum.sh
PASS: ../gnu/tests/cksum/cksum-a.sh
PASS: ../gnu/tests/cksum/cksum.sh
PASS: ../gnu/tests/cksum/sum-sysv.sh
PASS: ../gnu/tests/cksum/sha1sum-vec.pl
============================================================================
Testsuite summary for GNU coreutils 9.6-dirty
============================================================================
# TOTAL: 19
# PASS:  19
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
============================================================================

as well as:

make UTILS=cksum test
test result: ok. 135 passed; 0 failed; 1 ignored; 0 measured; 0 filtered out; finished in 0.54s

If this is of value, please let me know. Then I would still have some cleanup to do.

Any remarks/thoughts are very welcome.

Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

@sylvestre
Copy link
Contributor

of course, perf and file size are key!

@GTimothy GTimothy force-pushed the cksum_no_regex branch 2 times, most recently from bbbbd00 to b1a19f7 Compare March 25, 2025 18:46
Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

@GTimothy
Copy link
Contributor Author

I think this is ready for review.
As of now,
-the binary size is much smaller: 1.4MB vs 3.1MB .
-perf seems much closer to that of my distro's cksum

taskset -c 0 hyperfine --warmup 10 -L cksum ../cksum.latest,../target/release/cksum,cksum "{cksum} -c checkbig.b2sum"
Benchmark 1: ../cksum.latest -c checkbig.b2sum
  Time (mean ± σ):     932.7 ms ±   2.9 ms    [User: 463.1 ms, System: 450.7 ms]
  Range (min … max):   929.1 ms … 939.7 ms    10 runs

Benchmark 2: ../target/release/cksum -c checkbig.b2sum
  Time (mean ± σ):     667.3 ms ±   1.5 ms    [User: 208.3 ms, System: 441.9 ms]
  Range (min … max):   665.2 ms … 669.9 ms    10 runs

Benchmark 3: cksum -c checkbig.b2sum
  Time (mean ± σ):     629.7 ms ±   0.8 ms    [User: 162.2 ms, System: 451.2 ms]
  Range (min … max):   628.1 ms … 631.0 ms    10 runs

Summary
  'cksum -c checkbig.b2sum' ran
    1.06 ± 0.00 times faster than '../target/release/cksum -c checkbig.b2sum'
    1.48 ± 0.00 times faster than '../cksum.latest -c checkbig.b2sum'

@GTimothy GTimothy marked this pull request as ready for review March 25, 2025 22:45
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

@sylvestre
Copy link
Contributor

looks great! bravo

Dunno if it is a regression from this change but it seems that the fuzzer isn't happy:

=== TEST cksum ["cksum", "--untagged", "-a", "sha224", "/tmp/onbcuwafsc"]
File content (/tmp/checksum_file)
rD91pj59+zO22nIRPx0e/HkQnLQK3XUHQYb0rA==  /tmp/yhlguutsnv

Running test ["cksum", "--untagged", "-a", "sha224", "/tmp/onbcuwafsc"]
=== Compare result for: cksum ["--untagged", "-a", "sha224", "/tmp/onbcuwafsc"]
Rust stdout:
SHA224 (/tmp/onbcuwafsc) = e05d21dc0cc78150b068bf91688bb497257336cdf5b345ca5314da84

GNU stdout:
e05d21dc0cc78150b068bf91688bb497257336cdf5b345ca5314da84  /tmp/onbcuwafsc

--- START diff
-SHA224 (/tmp/onbcuwafsc) = e05d21dc0cc78150b068bf91688bb497257336cdf5b345ca5314da84
+e05d21dc0cc78150b068bf91688bb497257336cdf5b345ca5314da84  /tmp/onbcuwafsc
--- END diff

Discrepancies detected: stdout differs
===  KO  Test failed and will panic for: cksum ["--untagged", "-a", "sha224", "/tmp/onbcuwafsc"]

thread '<unnamed>' panicked at fuzz_targets/fuzz_common/mod.rs:375:13:
Test failed for: cksum ["--untagged", "-a", "sha224", "/tmp/onbcuwafsc"]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
==3121== ERROR: libFuzzer: deadly signal

@GTimothy
Copy link
Contributor Author

GTimothy commented Mar 26, 2025

I cannot find your fuzzer error in the latest test, was it in a previous test ?

The CICD / MinRustV issue was my fault, I accidentally swapped variables filename and checksum when updating tests. The f5a569e commit should fix this.

@sylvestre
Copy link
Contributor

@RenjiSann
Copy link
Collaborator

looks great! bravo

Dunno if it is a regression from this change but it seems that the fuzzer isn't happy:

The fuzzer discrepancy is not caused by this change. This PR seems only affect cksum --check, so I think it is not the reason for this issue. Good catch though

@RenjiSann
Copy link
Collaborator

RenjiSann commented Mar 26, 2025

Very good changes overall 👍
It's cool to see such an improvement in performance and binary size.

One small comment, I would keep the regexes as comments in the different LineFormat::parse_*, to show what the code is actually parsing. Meh, that's nitpicking, I'm good with the templates you already included.

Comment on lines 512 to 513
let (filename, checksum) = ByteSliceExt::split_once(after_paren, b") = ")
.or_else(|| ByteSliceExt::split_once(after_paren, b")= "))?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work for files called like 'foo) = ', because the .position() used in ByteSliceExt is short-circuiting, when we'd like it to be greedy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I did not think of this case. If I implement rsplit_once that iterates from right to left, that would solve this, no?
I'll add a test case.

Also, I think there is a second issue here.

MD5(filename)= fds65dsf46as5df4d6f54asd is valid openssl syntax.
MD5 (filename) = fds65dsf46as5df4d6f54asd is valid posix syntax.
MD5(filename) = fds65dsf46as5df4d6f54asd is invalid syntax I think but would match here. I'll add a test case.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that'll work. I'd add a test where there is a ' ) = ' in the middle of the digest as well to see if the error message is the one expected.

@GTimothy
Copy link
Contributor Author

looks great! bravo
Dunno if it is a regression from this change but it seems that the fuzzer isn't happy:

The fuzzer discrepancy is not caused by this change. This PR seems only affect cksum --check, so I think it is not the reason for this issue. Good catch though

I am looking into it none the less, but independently of this PR.

Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

@GTimothy
Copy link
Contributor Author

failures:

---- test_who::test_boot stdout ----
run: who --version
uutils-tests-info: version for the reference coreutil 'who' is higher than expected; expected: 8.3, found: 9.4
run: who -b
run: /home/runner/work/coreutils/coreutils/target/x86_64-unknown-linux-gnu/debug/coreutils who -b

thread 'test_who::test_boot' panicked at tests/by-util/test_who.rs:37:39:
assertion failed: `(left == right)`

Diff < left / right > :
>         system boot  Mar 27 13:00
>


stack backtrace:
   0: rust_begin_unwind
             at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/std/src/panicking.rs:692:5
   1: core::panicking::panic_fmt
             at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/panicking.rs:75:14
   2: tests::common::util::CmdResult::stdout_is
             at ./tests/common/util.rs:481:9
   3: tests::test_who::test_boot
             at ./tests/by-util/test_who.rs:37:9
   4: tests::test_who::test_boot::{{closure}}
             at ./tests/by-util/test_who.rs:33:15
   5: core::ops::function::FnOnce::call_once
             at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/ops/function.rs:250:5
   6: core::ops::function::FnOnce::call_once
             at /rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


failures:
    test_who::test_boot

test result: FAILED. 3836 passed; 1 failed; 54 ignored; 0 measured; 0 filtered out; finished in 92.27s

I am pretty confident I did not cause this CICD/build error.

@sylvestre
Copy link
Contributor

Could you please improve the stack ?
Some changes should be merged into one (spelling for example) but please keep separate (at least) utf-8 and regexp removal

@GTimothy
Copy link
Contributor Author

The utf-8 was a regression on my part, the original regex handled filenames as bytes, my first rewrite did not.
I was thinking of merging the commits into three:

  • rewrite lineformat parsing without regex (merging utf8 and regexp)
  • update tests to test new parsers not regex
  • fix : filename that include separator should parse + add tests

maybe a fourth:

  • fix: hybrid posix/openssl parsing should fail + add tests

What do you think ?

@RenjiSann
Copy link
Collaborator

The utf-8 was a regression on my part, the original regex handled filenames as bytes, my first rewrite did not. I was thinking of merging the commits into three:

* rewrite lineformat parsing without regex (merging utf8 and regexp)

* update tests to test new parsers not regex

* fix : filename that include separator should parse + add tests

maybe a fourth:

* fix: hybrid posix/openssl parsing should fail  + add tests

What do you think ?

The fourth one is not needed IMO, the 3 others are good 👍

@GTimothy
Copy link
Contributor Author

Is this ok?

Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)

removes dependency on the regex crate for LineFormat detection and
parsing, resulting in a faster and lighter cksum binary.
…dd tests

fixes this non-regex implementation's flaw with file_names containing
the separator's pattern:
 - replaces left-to-right greedy separator match with right-to-left one.
 - added bugfix tests
fixes secondary bug: positive match on hybrid posix-openssl format
adds secondary bugfix tests

Co-authored-by: Dorian Péron
<72708393+RenjiSann@users.noreply.github.com>
@RenjiSann
Copy link
Collaborator

I'm wondering what's wrong with the SELinux test

Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/misc/stdbuf (fails in this run but passes in the 'main' branch)
Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

@RenjiSann RenjiSann merged commit 6675460 into uutils:main Apr 1, 2025
67 of 68 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants