printf: accept non-UTF-8 input in FORMAT and ARGUMENT arguments #7209

jtracey · 2025-01-25T06:32:22Z

EDIT: Now also includes a commit from me to pass the printf-mb GNU test, as well as a few other odds and ends. Fixes #6804.

jtracey · 2025-01-25T06:41:18Z

This is just a rebase. It compiles and passes our tests, but there are still some kinks to work out before it behaves as expected and passes the GNU tests.

github-actions · 2025-01-25T06:58:49Z

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/usage_vs_getopt (passes in this run but fails in the 'main' branch)

sylvestre · 2025-02-16T23:19:31Z

needs to be rebased again :/ Sorry

github-actions · 2025-02-22T02:12:58Z

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

jtracey · 2025-02-22T02:15:46Z

Rebased on #7208 again. This still needs some cleaning up IMO, but I'm going to hold off until #7208 gets merged.

github-actions · 2025-04-30T19:48:33Z

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

github-actions · 2025-04-30T20:56:38Z

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

jtracey · 2025-05-03T01:11:50Z

The force push before the most recent one is the last with the individual commits from #6812; the most recent force push squashes them into one and adds my fixes in another commit. The final rebased #6812 commits before squashing are: 9eddbca a9f53a6 bc7516a 2ec2433

src/uucore/src/lib/features/format/argument.rs

github-actions · 2025-05-03T01:40:33Z

GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

jtracey · 2025-05-03T02:24:14Z

To spell out a bit what my commit does: there was some redundant handling of various pieces of parsing format arguments, each with their own bugs. I minimized and simplified that, and got most things to only being implemented once in more proper locations, so that, e.g., behavior no longer differs between %i and %d format strings (aside from the type), misc. utils like seq no longer accept things like 'a as numbers, etc.

github-actions · 2025-05-03T02:44:23Z

GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

github-actions · 2025-05-05T17:28:08Z

GNU testsuite comparison:

Skip an intermittent issue tests/misc/stdbuf (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)
Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

jtracey · 2025-05-10T01:34:46Z

@sylvestre: This is ready for review btw (no rush, just pinging in case the status change didn't).

Other implementations of `printf` permit arbitrary data to be passed to `printf`. The only restriction is that a null byte terminates FORMAT and ARGUMENT argument strings (since they are C strings). The current implementation only accepts FORMAT and ARGUMENT arguments that are valid UTF-8 (this is being enforced by clap). This commit removes the UTF-8 validation by switching to OsStr and OsString. This allows users to use `printf` to transmit or reformat null-safe but not UTF-8-safe data, such as text encoded in an 8-bit text encoding. See the `non_utf_8_input` test for an example (ISO-8859-1 text).
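To make the motivation concrete, here is a minimal Unix-only sketch of an argument that a UTF-8-only parser would reject but a byte-oriented `printf` can pass through unchanged. The helper names (`latin1_argument`, `as_utf8`, `raw_bytes`) are hypothetical and not taken from this PR; only the standard-library calls are real.

```rust
use std::ffi::{OsStr, OsString};
#[cfg(unix)]
use std::os::unix::ffi::{OsStrExt, OsStringExt};

/// Build an argument holding "café" encoded as ISO-8859-1.
/// The trailing byte 0xE9 ('é' in ISO-8859-1) is not valid UTF-8 on its own.
#[cfg(unix)]
fn latin1_argument() -> OsString {
    OsString::from_vec(vec![b'c', b'a', b'f', 0xE9])
}

/// What a UTF-8-only argument parser effectively requires:
/// this returns None for the argument above.
fn as_utf8(arg: &OsStr) -> Option<&str> {
    arg.to_str()
}

/// What a byte-oriented printf needs: the raw bytes, unchanged.
#[cfg(unix)]
fn raw_bytes(arg: &OsStr) -> &[u8] {
    arg.as_bytes()
}
```

With `OsStr`, the argument survives intact even though `to_str()` cannot represent it.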

github-actions · 2025-05-27T02:26:49Z

GNU testsuite comparison:

Skip an intermittent issue tests/misc/stdbuf (fails in this run but passes in the 'main' branch)
Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)
Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

github-actions · 2025-05-29T00:26:00Z

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

drinkcat

I was looking at GNU test printf-mb failure and realized you did the work already, and I'm somewhat familiar with that part of the codebase ,-)

Nothing too serious, just wondering about non-Unix OSStr parsing...

drinkcat · 2025-05-29T15:03:46Z

src/uucore/src/lib/features/format/argument.rs

+    where
+        T: ExtendedParser + std::convert::From<u8> + std::convert::From<u32> + Default,
+    {
+        let s = os_str_as_bytes(os)?;


This always succeeds on unix platforms, and fails on other platforms if the string can't be coerced to UTF-8.

Is it really ok to fail here?

If the string doesn't start with a `'`/`"`, we probably still want to move forward to extended_parse?

Is there something we can do to still try to figure out if the first char is a `'`? (maybe we should use os_str_as_bytes_lossy?!)

(I was looking at this wondering if we really need to return a Result<T, ...> instead of just T, as that'd avoid a lot of the other changes... But I guess we'd still need to fail if the first character is ' and the second is U+FFFD REPLACEMENT CHARACTER.)

Maybe this could help but it's not stable... https://doc.rust-lang.org/std/ffi/struct.OsStr.html#method.slice_encoded_bytes
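The platform split described in this comment can be sketched roughly as follows. `bytes_of` is a hypothetical stand-in for a helper like `os_str_as_bytes`; the real uucore signature and error type may differ, and the error message is a placeholder.

```rust
use std::ffi::OsStr;

/// On Unix an OsStr is an arbitrary byte sequence, so this never fails.
#[cfg(unix)]
fn bytes_of(os: &OsStr) -> Result<&[u8], String> {
    use std::os::unix::ffi::OsStrExt;
    Ok(os.as_bytes())
}

/// Elsewhere, go through UTF-8 and fail if the string is not valid Unicode.
#[cfg(not(unix))]
fn bytes_of(os: &OsStr) -> Result<&[u8], String> {
    os.to_str()
        .map(str::as_bytes)
        .ok_or_else(|| String::from("argument is not valid Unicode"))
}
```

The question raised here is what the caller should do with the `Err` case on non-Unix platforms: propagate it, or fall back to a lossy conversion.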

> If the string doesn't start with a `'`/`"`, we probably still want to move forward to extended_parse?

Well, there are no numbers that contain a unicode replacement character, so it would fail anyway. 😛 I think it's fine to fail early here, the string being invalid unicode is likely to be about as useful error context as the string not being a valid number.

> Is there something we can do to still try to figure out if the first char is a `'`? (maybe we should use os_str_as_bytes_lossy?!)

The only case we would get such a failure is on Windows with a string that is not valid unicode. Unlike unix platforms, this almost never happens, since all Windows strings are UTF-16(ish) -- namely, Windows doesn't allow arbitrary byte sequences (or byte pairs) as input, it has to look like valid UTF-16(ish). The exception (the "ish") is that Windows does allow invalid surrogate pairs, which can't be converted to bytes in any meaningful way, which is why conversion to bytes has to be fallible. This means that any argument here containing an invalid codepoint also necessarily contains data that doesn't represent any particular byte, which, aside from being difficult to do in practice, I can't think of any reason you would want that data to then be parsed as a literal -- it seems much more likely that something has gone horribly wrong.

> But I guess we'd still need to fail if the first character is ' and the second is U+FFFD REPLACEMENT CHARACTER.

Problem with that strategy is we'd need some way to distinguish the replacement character from lossy conversion and the replacement character as input.
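Both points can be illustrated with the standard library's UTF-16 decoders, which work the same on every platform. This is only a sketch of the underlying issue, not code from the PR; `strict` and `lossy` are hypothetical wrapper names.

```rust
/// Strict decoding: fails if the input contains an unpaired surrogate,
/// which Windows strings are allowed to contain.
fn strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Lossy decoding: substitutes U+FFFD for anything that cannot be decoded.
fn lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

Given `[0x0027, 0xD800]` (a quote followed by a lone high surrogate) and `[0x0027, 0xFFFD]` (a quote followed by a genuine replacement character), `strict` rejects only the first, but `lossy` returns the same string for both. After a lossy conversion the two inputs can no longer be told apart.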

> > If the string doesn't start with a `'`/`"`, we probably still want to move forward to extended_parse?

> Well, there are no numbers that contain a unicode replacement character, so it would fail anyway. 😛 I think it's fine to fail early here, the string being invalid unicode is likely to be about as useful error context as the string not being a valid number.

I should have made my thought clearer... In general, partial matches still return whatever can be parsed (with an error). And you do the same here with ' prefixes (see my new comment below though, you do change the behaviour a bit by manually printing the error instead of returning a PartialMatch)

So I think something like 123abc[badunicode] will return PartialMatch(123, "abc[U+FFFD]") with your change, and I think I'm ok with that. But you do reject 'abc[badunicode]; wouldn't it be better if we could still return the ASCII value of a here (ideally a PartialMatch(97, "abc[U+FFFD]"))?

> > Is there something we can do to still try to figure out if the first char is a `'`? (maybe we should use os_str_as_bytes_lossy?!)

> The only case we would get such a failure is on Windows with a string that is not valid unicode. Unlike unix platforms, this almost never happens, since all Windows strings are UTF-16(ish) -- namely, Windows doesn't allow arbitrary byte sequences (or byte pairs) as input, it has to look like valid UTF-16(ish). The exception (the "ish") is that Windows does allow invalid surrogate pairs, which can't be converted to bytes in any meaningful way, which is why conversion to bytes has to be fallible. This means that any argument here containing an invalid codepoint also necessarily contains data that doesn't represent any particular byte, which, aside from being difficult to do in practice, I can't think of any reason you would want that data to then be parsed as a literal -- it seems much more likely that something has gone horribly wrong.

Goodness, thanks for the explanation...

If this is exceedingly rare, I do wonder if it's not easier to just print a warning, and return a best guess value (even if that ends up being U+FFFD due to the lossy replacement...).
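The partial-match behaviour discussed above can be sketched with a simplified, hypothetical result type. `Parsed` and `parse_prefix` are illustrative names only; uucore's `ExtendedParserError` has more variants and different payloads, and this sketch handles decimal digits only.

```rust
/// Simplified, hypothetical stand-in for a parse result with partial matches.
#[derive(Debug, PartialEq)]
enum Parsed<'a> {
    /// The whole input was a number.
    Full(u64),
    /// A leading number was found; the second field is the unparsed rest.
    Partial(u64, &'a str),
    /// No usable leading number (this sketch also lands here on overflow).
    NotNumeric,
}

/// Parse the longest leading run of ASCII digits and report what is left.
fn parse_prefix(input: &str) -> Parsed<'_> {
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    let (digits, rest) = input.split_at(end);
    match digits.parse::<u64>() {
        Ok(value) if rest.is_empty() => Parsed::Full(value),
        Ok(value) => Parsed::Partial(value, rest),
        Err(_) => Parsed::NotNumeric,
    }
}
```

Under this scheme a lossily converted input such as `123abc` followed by U+FFFD still yields the value 123, with the replacement character ending up in the reported remainder.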

Okay... back from running after my train. I think my thoughts are getting a bit clearer. I'll continue my point from below here to keep things in one place.

I'm not terribly happy that get_num returns a Result. It's really weird that it propagates the errors from `'` literal strings (in a very uncommon non-Unix corner case BTW), but processes the errors from extended_parse to the best possible number.

I'm also not too happy to move code out of extended_parse (I've been spending a lot of time trying to move all the parsing code in one place...). But I also understand we may not want to pass OsStr to extended_parse... So maybe it's ok to do this processing here (I'd just add a comment in extended_parse), since argument.rs is the only user (AFAICT).

So I think I'd do something like this:

fn get_num<T>(os: &OsStr) -> T

let s = os.to_string_lossy();

If s doesn't start with `'`, call extended_parse. If it does, call another function parse_literal(?) with the original OsStr that also returns a Result<T, ExtendedParserError<'_, T>>

In parse_literal, you can call os_str_as_bytes and just return a NotNumeric if that fails (or feel free to create a new error) -- or maybe you can just use to_string_lossy and accept that we'd sometimes return 0xfffd, all of these options sound ok to me.

In parse_literal, you can return a PartialMatch instead of copy-pasting the warning.

Then you can call extract_value on the return value from above (either extended_parse or the parse_literal)

That simplifies the code a bit, avoids a bunch of unwrap, removes the warning message duplication.

WDYT? I obviously didn't try, so maybe something doesn't work, or things can be done in yet another different way ,-)
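A rough, self-contained sketch of the shape proposed above, under several assumptions: the value type is fixed to `u64` instead of a generic `T`, `parse_number` is a decimal-only stand-in for `extended_parse`, `ParseError` is a simplified stand-in for `ExtendedParserError`, the warning texts are placeholders, and `parse_literal` takes the lossy string (the "accept that we'd sometimes return 0xfffd" option) rather than the original `OsStr`.

```rust
use std::ffi::OsStr;

#[derive(Debug)]
enum ParseError {
    NotNumeric,
    /// Best-effort value plus the text that was not consumed.
    PartialMatch(u64, String),
}

/// Stand-in for `extended_parse`: decimal digits only.
fn parse_number(s: &str) -> Result<u64, ParseError> {
    let end = s.find(|c: char| !c.is_ascii_digit()).unwrap_or(s.len());
    let (digits, rest) = s.split_at(end);
    match digits.parse::<u64>() {
        Ok(v) if rest.is_empty() => Ok(v),
        Ok(v) => Err(ParseError::PartialMatch(v, rest.to_string())),
        Err(_) => Err(ParseError::NotNumeric),
    }
}

/// Stand-in for the proposed `parse_literal`: the code point after the quote.
/// A bare quote is treated as NotNumeric here, which is a simplification.
fn parse_literal(s: &str) -> Result<u64, ParseError> {
    let mut chars = s.chars();
    chars.next(); // skip the leading quote
    match chars.next() {
        None => Err(ParseError::NotNumeric),
        Some(c) => {
            let value = u64::from(u32::from(c));
            let rest = chars.as_str();
            if rest.is_empty() {
                Ok(value)
            } else {
                Err(ParseError::PartialMatch(value, rest.to_string()))
            }
        }
    }
}

/// Single place that turns errors into a warning plus a best-effort value.
fn extract_value(result: Result<u64, ParseError>) -> u64 {
    match result {
        Ok(v) => v,
        Err(ParseError::PartialMatch(v, rest)) => {
            eprintln!("warning: {}: trailing characters ignored", rest);
            v
        }
        Err(ParseError::NotNumeric) => {
            eprintln!("warning: expected a numeric value");
            0
        }
    }
}

/// Infallible entry point: dispatch on the first character, then extract.
fn get_num(os: &OsStr) -> u64 {
    let s = os.to_string_lossy();
    let result = if s.starts_with('\'') || s.starts_with('"') {
        parse_literal(&s)
    } else {
        parse_number(&s)
    };
    extract_value(result)
}
```

Because both parsers return the same `Result` type, the warning logic lives only in `extract_value`, and `get_num` itself no longer needs to return a `Result`.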

src/uucore/src/lib/lib.rs

This fixes handling of format arguments, in part by eliminating duplicate implementations. Utilities with format arguments other than printf will no longer accept things like "'a" as numbers, etc.

jtracey

Thanks for the review!

jtracey · 2025-05-30T03:16:41Z

Sorry for any duplicate notifications, GitHub did something weird because I grouped my responses in a review.

github-actions · 2025-05-30T03:27:09Z

GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

drinkcat · 2025-05-30T06:10:37Z

src/uucore/src/lib/features/format/argument.rs

+                };
+                // Emit a warning if there are additional characters
+                if bytes.len() > len {
+                    show_warning!(


I just realized you change the behaviour here, this used to return a PartialMatch. I'm not 100% sure if this is critical.

(this makes me wonder if it's a good idea to move this code out of num_parser.rs:parse)

[-- sorry I need to run, I'll try to think more about this later]

RenjiSann · 2025-06-30T16:45:14Z

@jtracey ping ? ^^

jtracey · 2025-07-07T12:51:10Z

Sorry, in the middle of a move + new job. I'll try to get to this in the next week. I wouldn't be offended if someone else wanted to take it over, though that would start to be a real pile of authors, so if this is pressing it might be simpler to merge as-is, then take a look at @drinkcat's suggestions (haven't had time to look in depth, but at a glance they don't seem unreasonable).

drinkcat · 2025-07-11T02:54:20Z

Goodness, even a rebase was challenging, ended up squashing 2 commits together to make the rebase slightly easier, preserved author info in commit message. #8329. I'll try to apply the changes I suggested, next.

jtracey mentioned this pull request Jan 25, 2025

printf: accept non-UTF-8 input in FORMAT and #6812

Closed

jtracey force-pushed the printf-allow-non-utf-8 branch 3 times, most recently from ca73a2d to 51649ec Compare February 22, 2025 01:48

jtracey force-pushed the printf-allow-non-utf-8 branch from 51649ec to 6d9ab8f Compare April 30, 2025 19:14

jtracey force-pushed the printf-allow-non-utf-8 branch from 6d9ab8f to 2ec2433 Compare April 30, 2025 20:21

jtracey force-pushed the printf-allow-non-utf-8 branch from 2ec2433 to 016df8a Compare May 3, 2025 01:05

jtracey commented May 3, 2025

jtracey force-pushed the printf-allow-non-utf-8 branch from 016df8a to 0760508 Compare May 3, 2025 02:10

jtracey marked this pull request as ready for review May 3, 2025 02:11

jtracey mentioned this pull request May 3, 2025

timeout: Should not accept "character" input (e.g. '0) #7678

Open

jtracey force-pushed the printf-allow-non-utf-8 branch from 0760508 to c45824b Compare May 5, 2025 16:55

jtracey force-pushed the printf-allow-non-utf-8 branch from c45824b to cae5441 Compare May 27, 2025 01:43

drinkcat reviewed May 29, 2025

jtracey added 2 commits May 29, 2025 22:41

uucore, printf: improve non-UTF-8 format arguments

ee3ac83

This fixes handling of format arguments, in part by eliminating duplicate implementations. Utilities with format arguments other than printf will no longer accept things like "'a" as numbers, etc.

printf: remove passing tests from why-error.md

f9ca1e3

jtracey force-pushed the printf-allow-non-utf-8 branch from fe3770c to f9ca1e3 Compare May 30, 2025 02:42

jtracey commented May 30, 2025

drinkcat reviewed May 30, 2025

drinkcat mentioned this pull request Jul 11, 2025

printf: accept non-UTF-8 input in FORMAT and ARGUMENT arguments #8329

Open
