Skip to content

printf: accept non-UTF-8 input in FORMAT and ARGUMENT arguments #7209

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

jtracey
Copy link
Contributor

@jtracey jtracey commented Jan 25, 2025

Rebases #6812 on #7208.

EDIT: Now also includes a commit from me to pass the printf-mb GNU test, as well as a few other odds and ends. Fixes #6804.

@jtracey
Copy link
Contributor Author

jtracey commented Jan 25, 2025

This is just a rebase, it compiles and passes our tests, but there are still some kinks to work out to get it to work as expected/pass GNU tests.

Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/usage_vs_getopt (passes in this run but fails in the 'main' branch)

@sylvestre
Copy link
Contributor

needs to be rebased again :/ Sorry

@jtracey jtracey force-pushed the printf-allow-non-utf-8 branch 3 times, most recently from ca73a2d to 51649ec Compare February 22, 2025 01:48
Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

@jtracey
Copy link
Contributor Author

jtracey commented Feb 22, 2025

Rebased on #7208 again. This still needs some cleaning up IMO, but I'm going to hold off until #7208 gets merged.

@jtracey jtracey force-pushed the printf-allow-non-utf-8 branch from 51649ec to 6d9ab8f Compare April 30, 2025 19:14
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

@jtracey jtracey force-pushed the printf-allow-non-utf-8 branch from 6d9ab8f to 2ec2433 Compare April 30, 2025 20:21
Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

@jtracey jtracey force-pushed the printf-allow-non-utf-8 branch from 2ec2433 to 016df8a Compare May 3, 2025 01:05
@jtracey
Copy link
Contributor Author

jtracey commented May 3, 2025

Force push before the most recent is the last with the individual commits from #6812, most recent force push squashes them into one and adds my fixes in another commit. Final rebased #6812 commits before squashing are: 9eddbca a9f53a6 bc7516a 2ec2433

Copy link

github-actions bot commented May 3, 2025

GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

@jtracey jtracey force-pushed the printf-allow-non-utf-8 branch from 016df8a to 0760508 Compare May 3, 2025 02:10
@jtracey jtracey marked this pull request as ready for review May 3, 2025 02:11
@jtracey
Copy link
Contributor Author

jtracey commented May 3, 2025

To spell out a bit what my commit does: there was some redundant handling of various pieces of parsing format arguments, each with their own bugs. I minimized and simplified that, and got most things to only being implemented once in more proper locations, so that, e.g., behavior no longer differs between %i and %d format strings (aside from the type), misc. utils like seq no longer accept things like 'a as numbers, etc.

Copy link

github-actions bot commented May 3, 2025

GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

Copy link

github-actions bot commented May 5, 2025

GNU testsuite comparison:

Skip an intermittent issue tests/misc/stdbuf (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)
Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

@jtracey
Copy link
Contributor Author

jtracey commented May 10, 2025

@sylvestre: This is ready for review btw (no rush, just pinging in case the status change didn't).

Other implementations of `printf` permit arbitrary data to be passed
to `printf`. The only restriction is that a null byte terminates
FORMAT and ARGUMENT argument strings (since they are C strings).

The current implementation only accepts FORMAT and ARGUMENT
arguments that are valid UTF-8 (this is being enforced by clap).

This commit removes the UTF-8 validation by switching to OsStr
and OsString.

This allows users to use `printf` to transmit or reformat null-safe
but not UTF-8-safe data, such as text encoded in an 8-bit text
encoding. See the `non_utf_8_input` test for an example (ISO-8859-1
text).
@jtracey jtracey force-pushed the printf-allow-non-utf-8 branch from c45824b to cae5441 Compare May 27, 2025 01:43
Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/misc/stdbuf (fails in this run but passes in the 'main' branch)
Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)
Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

Copy link
Contributor

@drinkcat drinkcat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking at GNU test printf-mb failure and realized you did the work already, and I'm somewhat familiar with that part of the codebase ,-)

Nothing too serious, just wondering about non-Unix OSStr parsing...

where
T: ExtendedParser + std::convert::From<u8> + std::convert::From<u32> + Default,
{
let s = os_str_as_bytes(os)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This always succeeds on unix platforms, and fails on other platforms if the string can't be coerced to UTF-8.

Is it really ok to fail here?

  • If the string doesn't start with a '/", we probably still want to move forward to extended_parse?
  • Is there something we can do to still try to figure out if the first char is a \'? (maybe we should use os_str_as_bytes_lossy?!)

(I was looking at this wondering if we really need to result a Result<T, ...> instead of just T, as that'd avoid a lot of the other changes... But I guess we'd still need to fail if the first character is ' and the second is U+FFFD REPLACEMENT CHARACTER.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • If the string doesn't start with a '/", we probably still want to move forward to extended_parse?

Well, there are no numbers that contain a unicode replacement character, so it would fail anyway. 😛 I think it's fine to fail early here, the string being invalid unicode is likely to be about as useful error context as the string not being a valid number.

  • Is there something we can do to still try to figure out if the first char is a \'? (maybe we should use os_str_as_bytes_lossy?!)

The only case we would get such a failure is on Windows with a string that is not valid unicode. Unlike unix platforms, this almost never happens, since all Windows strings are UTF-16(ish) -- namely, Windows doesn't allow arbitrary byte sequences (or byte pairs) as input, it has to look like valid UTF-16(ish). The exception (the "ish") is that Windows does allow invalid surrogate pairs, which can't be converted to bytes in any meaningful way, which is why conversion to bytes has to be fallible. This means that any argument here containing an invalid codepoint also necessarily contains data that doesn't represent any particular byte, which, aside from being difficult to do in practice, I can't think of any reason you would want that data to then be parsed as a literal -- it seems much more likely that something has gone horribly wrong.

But I guess we'd still need to fail if the first character is ' and the second is U+FFFD REPLACEMENT CHARACTER.

Problem with that strategy is we'd need some way to distinguish the replacement character from lossy conversion and the replacement character as input.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • If the string doesn't start with a '/", we probably still want to move forward to extended_parse?

Well, there are no numbers that contain a unicode replacement character, so it would fail anyway. 😛 I think it's fine to fail early here, the string being invalid unicode is likely to be about as useful error context as the string not being a valid number.

I should have made my thought clearer... In general, partial matches still return whatever can be parsed (with an error). And you do the same here with ' prefixes (see my new comment below though, you do change the behaviour a bit by manually printing the error instead of returning a PartialMatch)

So I think something like 123abc[badunicode] will return PartialMatch(123, "abc[U+FFFD]") with your change, I think I'm ok with that? But you do reject \'abc[badunicode], wouldn't be better if we could still return the ASCII value of a here? (ideally a PartialMatch(97, "abc[U+FFFD]").

  • Is there something we can do to still try to figure out if the first char is a \'? (maybe we should use os_str_as_bytes_lossy?!)

The only case we would get such a failure is on Windows with a string that is not valid unicode. Unlike unix platforms, this almost never happens, since all Windows strings are UTF-16(ish) -- namely, Windows doesn't allow arbitrary byte sequences (or byte pairs) as input, it has to look like valid UTF-16(ish). The exception (the "ish") is that Windows does allow invalid surrogate pairs, which can't be converted to bytes in any meaningful way, which is why conversion to bytes has to be fallible. This means that any argument here containing an invalid codepoint also necessarily contains data that doesn't represent any particular byte, which, aside from being difficult to do in practice, I can't think of any reason you would want that data to then be parsed as a literal -- it seems much more likely that something has gone horribly wrong.

Goodness, thanks for the explanation...

If this is exceedingly rare, I do wonder if it's not easier to just print a warning, and return a best guess value (even if that ends up being U+FFFD due to the lossy replacement...).

Copy link
Contributor

@drinkcat drinkcat May 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay... back from running after my train. I think my thoughts are getting a bit clearer. I'll continue my point from below here to keep things in one place.

I'm not terribly happy that get_num returns a Result, it's really weird that it propagates the errors from literal strings \' (in a very uncommon non-Unix corner case BTW), but processes the errors from extended_parse to the best possible number.

I'm also not too happy to move code out of extended_parse (I've been spending a lot of time trying to move all the parsing code in one place...). But I also understand we may not want to pass OsStr to extended_parse... So maybe it's ok to do this processing here (I'd just add a comment in extended_parse), since argument.rs is the only user (AFAICT).

So I think I'd do something like this:

  • fn get_num<T>(os: &OsStr) -> T
  • let s = os.to_string_lossy();
  • If s doesn't start with \', call extended_parse. If not, call another function parse_literal(?) with the original OsStr that also returns a Result<T, ExtendedParserError<'_, T>>
    • In parse_literal, you can call os_str_as_bytes and just return a NotNumeric if that fails (or feel free to create a new error) -- or maybe you can just use to_string_lossy and accept that we'd sometimes return 0xfffd, all of these options sound ok to me.
    • In parse_literal, you can return a PartialMatch instead of copy-pasting the warning.
  • Then you can call extract_value on the return value from above (either extended_parse or the parse_literal)

That simplifies the code a bit, avoids a bunch of unwrap, removes the warning message duplication.

WDYT? I obviously didn't try, so maybe something doesn't work, or things can be done in yet another different way ,-)

jtracey added 2 commits May 29, 2025 22:41
This fixes handling of format arguments, in part by eliminating duplicate
implementations. Utilities with format arguments other than printf will no
longer accept things like "'a" as numbers, etc.
@jtracey jtracey force-pushed the printf-allow-non-utf-8 branch from fe3770c to f9ca1e3 Compare May 30, 2025 02:42
Copy link
Contributor Author

@jtracey jtracey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review!

@jtracey
Copy link
Contributor Author

jtracey commented May 30, 2025

Sorry for any duplicate notifications, GitHub did something weird because I grouped my responses in a review.

Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

};
// Emit a warning if there are additional characters
if bytes.len() > len {
show_warning!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized you change the behaviour here, this used to return a PartialMatch. I'm not 100% sure if this is critical.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this makes me wonder if it's a good idea to move this code out of num_parser.rs:parse)

[-- sorry I need to run, I'll try to think more about this later]

@RenjiSann
Copy link
Collaborator

@jtracey ping ? ^^

@jtracey
Copy link
Contributor Author

jtracey commented Jul 7, 2025

Sorry, in the middle of a move + new job. I'll try to get to this in the next week. I wouldn't be offended if someone else wanted to take it over, though that would start to be a real pile of authors, so if this is pressing it might be simpler to merge as-is, then take a look at @drinkcat's suggestions (haven't had time to look in depth, but at a glance they don't seem unreasonable).

@drinkcat
Copy link
Contributor

Goodness, even a rebase was challenging, ended up squashing 2 commits together to make the rebase slightly easier, preserved author info in commit message. #8329. I'll try to apply the changes I suggested, next.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

echo, printf: UTF-8 sensitivity in arguments
5 participants