Skip to content

Conversation

RenjiSann
Copy link
Collaborator

Fixes #6573

@RenjiSann RenjiSann force-pushed the renji/utf8-cksum-comment branch 6 times, most recently from 8111dc4 to 9eecf25 Compare August 2, 2024 08:58
Copy link

github-actions bot commented Aug 2, 2024

GNU testsuite comparison:

Skipping an intermittent issue tests/tail/inotify-dir-recreate (passes in this run but fails in the 'main' branch)

@RenjiSann RenjiSann force-pushed the renji/utf8-cksum-comment branch 2 times, most recently from b4d46d2 to f2abbae Compare August 3, 2024 10:02
@RenjiSann
Copy link
Collaborator Author

RenjiSann commented Aug 17, 2024

@BenWiederhake Could you take a look ? :)

if !last.is_ascii_whitespace() {
break;
}
line_trim = &line_trim[..line_trim.len() - 1];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please document what this line is doing, it isn't obvious

Copy link
Collaborator Author

@RenjiSann RenjiSann Aug 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a simple implementation of the trim_ascii() function which is not available for MSRV 1.70.

The behavior is extracted in its own documented function in #6654

@@ -1277,3 +1277,27 @@ fn test_non_utf8_filename() {
.stdout_is_bytes(b"SHA256 (funky\xffname) = e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\n")
.no_stderr();
}

#[cfg(target_os = "linux")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why linux only ?

could you please to add a test to inject invalid utf8 ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why linux only ?

This is somewhat related to this.

TLDR: With the current MSRV being 1.70, on windows. there is no way to translate between OsStr and &[u8] without going through str (this holds for owned types as well).

could you please to add a test to inject invalid utf8 ?

This is what the test should be doing. Using non UTF-8 characters that would make the old implementation fail because it would try to create Strings from these characters.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no way to translate between OsStr and &[u8]

Yes there is: os_str_as_bytes, just like you also use in the crate itself.

This is what the test should be doing.

No, the test does not check for correct handling of files with non-UTF-8 names.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to be sure I get it correctly. In the added test, I should check that the hashes variable get written in a file with a non-UTF-8 name, is this correct ?

If yes, I might need to refactor slightly the test framework in order to be able to write to files with non-UTF-8 names.

@RenjiSann
Copy link
Collaborator Author

@sylvestre all your comments are addressed in #6654. If that's still an issue, please tell me, I will refactor the code accordingly.

Copy link
Collaborator

@BenWiederhake BenWiederhake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused.

  • This PR has older code, doesn't address the points sylvestre already mentioned, and you also say that #6654 should be used instead.
  • The other PR (#6654) doesn't address sylvestre's points either (e.g. the linux-only issue), and seems to be a collection of many different PRs, which makes it unnecessarily difficult to review, and it's also in draft mode (which indicates to me that it's more of a "sneak peek", and not ready).

So which PR would you like us to review?

@@ -1277,3 +1277,27 @@ fn test_non_utf8_filename() {
.stdout_is_bytes(b"SHA256 (funky\xffname) = e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\n")
.no_stderr();
}

#[cfg(target_os = "linux")]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no way to translate between OsStr and &[u8]

Yes there is: os_str_as_bytes, just like you also use in the crate itself.

This is what the test should be doing.

No, the test does not check for correct handling of files with non-UTF-8 names.

@sylvestre sylvestre force-pushed the renji/utf8-cksum-comment branch from f2abbae to c2f43fa Compare September 16, 2024 07:25
@RenjiSann
Copy link
Collaborator Author

The other PR doesn't address sylvestre's points either (e.g. the linux-only issue)

The test is linux-only because os_str_as_bytes will automatically fail if given non-UTF-8 characters on windows.
This is due to the fact that there is no safe provided way to convert a OsStr to &[u8] on windows, and we are required to go through an intermediary &str conversion.

Improving the handling of windows filesystem exceptions this might be a further enhancement, but I don't think this should be the main point of this PR.

and seems to be a collection of many different PRs, which makes it unnecessarily difficult to review,

True, #6654 is a really big refactor, which solves many problems at once. I think about re-organizing all the changes in smaller PRs, while trying to keep the reasoning clear.

@RenjiSann RenjiSann force-pushed the renji/utf8-cksum-comment branch from c2f43fa to 5d68b3b Compare September 17, 2024 13:32
@RenjiSann
Copy link
Collaborator Author

So which PR would you like us to review?

Let's review and merge this one first, I eventually added the changes requested which I originally put in #6654.
Next, I will work on splitting #6654 into reasonably-sized changes, and try to document things correctly

@RenjiSann RenjiSann force-pushed the renji/utf8-cksum-comment branch 4 times, most recently from 3533b6e to 0415d72 Compare September 17, 2024 16:32
@RenjiSann
Copy link
Collaborator Author

As requested, I changed the code to handle non-UTF8 filenames, and I added a test for it.

Previously, I used String::from_utf8_lossy to display the filename on the terminal when needed. I realized this was wrong, because it actually inserts a REPLACEMENT CHARACTER U+FFFD for any non-UTF-8 sequence, when we actually want to omit them completely.
At first, I tried to copy-paste the implementation of String::from_utf8_lossy and adapt it, but I figured out, because of unstabilized internals, that this could not be done without copy-pasting even more code from the stdlib.

The compromise I went with is to just remove the U+FFFD chars from the output of String::from_utf8_lossy. It might produce unwanted behavior in case the character appears in the original filename, but I consider this to be unlikely enough to wait to fix it when the involved String's API is stable (MSRV 1.79 iirc).

Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)

@RenjiSann RenjiSann force-pushed the renji/utf8-cksum-comment branch from 0415d72 to 07da5d9 Compare September 18, 2024 13:17
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/rm/rm1 (passes in this run but fails in the 'main' branch)

@RenjiSann RenjiSann force-pushed the renji/utf8-cksum-comment branch from 07da5d9 to 326b5d7 Compare October 11, 2024 08:48
@RenjiSann RenjiSann force-pushed the renji/utf8-cksum-comment branch 2 times, most recently from 2848d7f to df019d1 Compare October 11, 2024 12:20
@RenjiSann
Copy link
Collaborator Author

@BenWiederhake If you want to take a look, this should finally be mergeable

Copy link

GNU testsuite comparison:

GNU test failed: tests/timeout/timeout. tests/timeout/timeout is passing on 'main'. Maybe you have to rebase?

@RenjiSann RenjiSann force-pushed the renji/utf8-cksum-comment branch from df019d1 to 33853fe Compare October 15, 2024 09:12
Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)
GNU test failed: tests/timeout/timeout. tests/timeout/timeout is passing on 'main'. Maybe you have to rebase?

Copy link
Collaborator

@BenWiederhake BenWiederhake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Please add a platform-independent test that checks the comment-feature
  • Please decide on how to handle the "incorrect" output. It's fine if you don't want to deal with it and just want to get this merged, but please at least leave behind a FIXME to mark the places where the code is still technically incorrect.

.arg("--check")
.arg(at.subdir.join("check"))
.succeeds()
.stdout_is("funkyname: OK\n")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the wrong output. If the filename contains \xff, then the output should contain something in its place.

  • If you decide it's in-scope for this PR, then please change the code to emit something like that.
  • If you decide it's out-of-scope for this PR, then please at least add a FIXME here plus another FIXME in the code where the filename is emitted, to indicate that the code is technically incorrect.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I confused the outputs when doing my tests

- add a test for non UTF-8 chars in comments
- add test for non-UTF-8 chars in filenames
@RenjiSann
Copy link
Collaborator Author

I closing this PR in favor of these 2 new ones :

@RenjiSann RenjiSann closed this Oct 18, 2024
@RenjiSann RenjiSann deleted the renji/utf8-cksum-comment branch January 29, 2025 15:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cksum: --check can't handle non-UTF-8 in comments
3 participants