cksum: Accept non UTF-8 inputs #6603

RenjiSann · 2024-08-01T22:24:17Z

github-actions · 2024-08-02T09:25:42Z

GNU testsuite comparison:

Skipping an intermittent issue tests/tail/inotify-dir-recreate (passes in this run but fails in the 'main' branch)

RenjiSann · 2024-08-17T09:34:47Z

@BenWiederhake Could you take a look ? :)

sylvestre · 2024-08-17T10:16:23Z

src/uucore/src/lib/features/checksum.rs

+            if !last.is_ascii_whitespace() {
+                break;
+            }
+            line_trim = &line_trim[..line_trim.len() - 1];


could you please document what this line is doing, it isn't obvious

It is a simple implementation of the trim_ascii() function which is not available for MSRV 1.70.

The behavior is extracted in its own documented function in #6654

src/uucore/src/lib/features/checksum.rs

sylvestre · 2024-08-17T10:19:13Z

tests/by-util/test_cksum.rs

@@ -1277,3 +1277,27 @@ fn test_non_utf8_filename() {
        .stdout_is_bytes(b"SHA256 (funky\xffname) = e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\n")
        .no_stderr();
 }
+
+#[cfg(target_os = "linux")]


why linux only ?

could you please to add a test to inject invalid utf8 ?

why linux only ?

This is somewhat related to this.

TLDR: With the current MSRV being 1.70, on windows. there is no way to translate between OsStr and &[u8] without going through str (this holds for owned types as well).

could you please to add a test to inject invalid utf8 ?

This is what the test should be doing. Using non UTF-8 characters that would make the old implementation fail because it would try to create Strings from these characters.

there is no way to translate between OsStr and &[u8]

Yes there is: os_str_as_bytes, just like you also use in the crate itself.

This is what the test should be doing.

No, the test does not check for correct handling of files with non-UTF-8 names.

I'd like to be sure I get it correctly. In the added test, I should check that the hashes variable get written in a file with a non-UTF-8 name, is this correct ?

If yes, I might need to refactor slightly the test framework in order to be able to write to files with non-UTF-8 names.

RenjiSann · 2024-08-26T13:21:11Z

@sylvestre all your comments are addressed in #6654. If that's still an issue, please tell me, I will refactor the code accordingly.

BenWiederhake

I'm confused.

This PR has older code, doesn't address the points sylvestre already mentioned, and you also say that #6654 should be used instead.
The other PR (#6654) doesn't address sylvestre's points either (e.g. the linux-only issue), and seems to be a collection of many different PRs, which makes it unnecessarily difficult to review, and it's also in draft mode (which indicates to me that it's more of a "sneak peek", and not ready).

So which PR would you like us to review?

BenWiederhake · 2024-09-07T21:26:14Z

tests/by-util/test_cksum.rs

@@ -1277,3 +1277,27 @@ fn test_non_utf8_filename() {
        .stdout_is_bytes(b"SHA256 (funky\xffname) = e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\n")
        .no_stderr();
 }
+
+#[cfg(target_os = "linux")]


there is no way to translate between OsStr and &[u8]

Yes there is: os_str_as_bytes, just like you also use in the crate itself.

This is what the test should be doing.

No, the test does not check for correct handling of files with non-UTF-8 names.

RenjiSann · 2024-09-17T12:58:13Z

The other PR doesn't address sylvestre's points either (e.g. the linux-only issue)

The test is linux-only because os_str_as_bytes will automatically fail if given non-UTF-8 characters on windows.
This is due to the fact that there is no safe provided way to convert a OsStr to &[u8] on windows, and we are required to go through an intermediary &str conversion.

Improving the handling of windows filesystem exceptions this might be a further enhancement, but I don't think this should be the main point of this PR.

and seems to be a collection of many different PRs, which makes it unnecessarily difficult to review,

True, #6654 is a really big refactor, which solves many problems at once. I think about re-organizing all the changes in smaller PRs, while trying to keep the reasoning clear.

RenjiSann · 2024-09-17T13:35:36Z

So which PR would you like us to review?

Let's review and merge this one first, I eventually added the changes requested which I originally put in #6654.
Next, I will work on splitting #6654 into reasonably-sized changes, and try to document things correctly

RenjiSann · 2024-09-17T16:42:49Z

As requested, I changed the code to handle non-UTF8 filenames, and I added a test for it.

Previously, I used String::from_utf8_lossy to display the filename on the terminal when needed. I realized this was wrong, because it actually inserts a REPLACEMENT CHARACTER U+FFFD for any non-UTF-8 sequence, when we actually want to omit them completely.
At first, I tried to copy-paste the implementation of String::from_utf8_lossy and adapt it, but I figured out, because of unstabilized internals, that this could not be done without copy-pasting even more code from the stdlib.

The compromise I went with is to just remove the U+FFFD chars from the output of String::from_utf8_lossy. It might produce unwanted behavior in case the character appears in the original filename, but I consider this to be unlikely enough to wait to fix it when the involved String's API is stable (MSRV 1.79 iirc).

github-actions · 2024-09-17T16:58:20Z

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)

github-actions · 2024-09-18T13:45:17Z

GNU testsuite comparison:

Skipping an intermittent issue tests/rm/rm1 (passes in this run but fails in the 'main' branch)

RenjiSann · 2024-10-11T12:21:08Z

@BenWiederhake If you want to take a look, this should finally be mergeable

github-actions · 2024-10-11T12:46:20Z

GNU testsuite comparison:

GNU test failed: tests/timeout/timeout. tests/timeout/timeout is passing on 'main'. Maybe you have to rebase?

github-actions · 2024-10-15T09:39:45Z

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)
GNU test failed: tests/timeout/timeout. tests/timeout/timeout is passing on 'main'. Maybe you have to rebase?

BenWiederhake

Please add a platform-independent test that checks the comment-feature
Please decide on how to handle the "incorrect" output. It's fine if you don't want to deal with it and just want to get this merged, but please at least leave behind a FIXME to mark the places where the code is still technically incorrect.

BenWiederhake · 2024-10-16T00:20:09Z

tests/by-util/test_cksum.rs

+        .arg("--check")
+        .arg(at.subdir.join("check"))
+        .succeeds()
+        .stdout_is("funkyname: OK\n")


This is the wrong output. If the filename contains \xff, then the output should contain something in its place.

If you decide it's in-scope for this PR, then please change the code to emit something like that.

If you decide it's out-of-scope for this PR, then please at least add a FIXME here plus another FIXME in the code where the filename is emitted, to indicate that the code is technically incorrect.

You are right, I confused the outputs when doing my tests

src/uucore/src/lib/features/checksum.rs

- add a test for non UTF-8 chars in comments - add test for non-UTF-8 chars in filenames

RenjiSann · 2024-10-18T23:36:55Z

I closing this PR in favor of these 2 new ones :

comments handling (cksum/hashsum: Support for commented lines in checksum files #6782)
non-UTF-8 handling (cksum/hashsum: Support for non-UTF-8 input in checksum files #6793)

RenjiSann force-pushed the renji/utf8-cksum-comment branch 6 times, most recently from 8111dc4 to 9eecf25 Compare August 2, 2024 08:58

RenjiSann force-pushed the renji/utf8-cksum-comment branch 2 times, most recently from b4d46d2 to f2abbae Compare August 3, 2024 10:02

sylvestre reviewed Aug 17, 2024

View reviewed changes

src/uucore/src/lib/features/checksum.rs Outdated Show resolved Hide resolved

sylvestre reviewed Aug 17, 2024

View reviewed changes

src/uucore/src/lib/features/checksum.rs Outdated Show resolved Hide resolved

sylvestre reviewed Aug 17, 2024

View reviewed changes

RenjiSann mentioned this pull request Aug 17, 2024

checksum: rework for improving checkum checking GNU behavior match #6654

Closed

BenWiederhake requested changes Sep 7, 2024

View reviewed changes

sylvestre force-pushed the renji/utf8-cksum-comment branch from f2abbae to c2f43fa Compare September 16, 2024 07:25

RenjiSann force-pushed the renji/utf8-cksum-comment branch from c2f43fa to 5d68b3b Compare September 17, 2024 13:32

RenjiSann force-pushed the renji/utf8-cksum-comment branch 4 times, most recently from 3533b6e to 0415d72 Compare September 17, 2024 16:32

RenjiSann force-pushed the renji/utf8-cksum-comment branch from 0415d72 to 07da5d9 Compare September 18, 2024 13:17

RenjiSann force-pushed the renji/utf8-cksum-comment branch from 07da5d9 to 326b5d7 Compare October 11, 2024 08:48

RenjiSann force-pushed the renji/utf8-cksum-comment branch 2 times, most recently from 2848d7f to df019d1 Compare October 11, 2024 12:20

RenjiSann force-pushed the renji/utf8-cksum-comment branch from df019d1 to 33853fe Compare October 15, 2024 09:12

BenWiederhake requested changes Oct 16, 2024

View reviewed changes

RenjiSann added 2 commits October 16, 2024 11:52

checksum: Allow for non UTF-8 content in input file

63633aa

test(cksum): add non-UTF-8 handling tests

7c0ff21

- add a test for non UTF-8 chars in comments - add test for non-UTF-8 chars in filenames

RenjiSann force-pushed the renji/utf8-cksum-comment branch from 33853fe to 7c0ff21 Compare October 16, 2024 09:52

This was referenced Oct 16, 2024

cksum/hashsum: Support for commented lines in checksum files #6782

Merged

cksum/hashsum: Support for non-UTF-8 input in checksum files #6793

Merged

RenjiSann closed this Oct 18, 2024

RenjiSann deleted the renji/utf8-cksum-comment branch January 29, 2025 15:21

Uh oh!

cksum: Accept non UTF-8 inputs #6603

cksum: Accept non UTF-8 inputs #6603

Uh oh!

Conversation

RenjiSann commented Aug 1, 2024

Uh oh!

github-actions bot commented Aug 2, 2024

Uh oh!

RenjiSann commented Aug 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

RenjiSann Aug 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

RenjiSann commented Aug 26, 2024

Uh oh!

BenWiederhake left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

RenjiSann commented Sep 17, 2024

Uh oh!

RenjiSann commented Sep 17, 2024

Uh oh!

RenjiSann commented Sep 17, 2024

Uh oh!

github-actions bot commented Sep 17, 2024

Uh oh!

github-actions bot commented Sep 18, 2024

Uh oh!

RenjiSann commented Oct 11, 2024

Uh oh!

github-actions bot commented Oct 11, 2024

Uh oh!

github-actions bot commented Oct 15, 2024

Uh oh!

BenWiederhake left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

RenjiSann commented Oct 18, 2024

Uh oh!

Uh oh!

RenjiSann commented Aug 17, 2024 •

edited

Loading

RenjiSann Aug 17, 2024 •

edited

Loading