-
-
Notifications
You must be signed in to change notification settings - Fork 1.5k
printf: accept non-UTF-8 input in FORMAT and ARGUMENT arguments #7209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This is just a rebase, it compiles and passes our tests, but there are still some kinks to work out to get it to work as expected/pass GNU tests. |
GNU testsuite comparison:
|
needs to be rebased again :/ Sorry |
ca73a2d
to
51649ec
Compare
GNU testsuite comparison:
|
51649ec
to
6d9ab8f
Compare
GNU testsuite comparison:
|
6d9ab8f
to
2ec2433
Compare
GNU testsuite comparison:
|
2ec2433
to
016df8a
Compare
GNU testsuite comparison:
|
016df8a
to
0760508
Compare
To spell out a bit what my commit does: there was some redundant handling of various pieces of parsing format arguments, each with their own bugs. I minimized and simplified that, and got most things to only being implemented once in more proper locations, so that, e.g., behavior no longer differs between |
GNU testsuite comparison:
|
0760508
to
c45824b
Compare
GNU testsuite comparison:
|
@sylvestre: This is ready for review btw (no rush, just pinging in case the status change didn't). |
Other implementations of `printf` permit arbitrary data to be passed to `printf`. The only restriction is that a null byte terminates FORMAT and ARGUMENT argument strings (since they are C strings). The current implementation only accepts FORMAT and ARGUMENT arguments that are valid UTF-8 (this is being enforced by clap). This commit removes the UTF-8 validation by switching to OsStr and OsString. This allows users to use `printf` to transmit or reformat null-safe but not UTF-8-safe data, such as text encoded in an 8-bit text encoding. See the `non_utf_8_input` test for an example (ISO-8859-1 text).
c45824b
to
cae5441
Compare
GNU testsuite comparison:
|
GNU testsuite comparison:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking at GNU test printf-mb
failure and realized you did the work already, and I'm somewhat familiar with that part of the codebase ,-)
Nothing too serious, just wondering about non-Unix OSStr parsing...
where | ||
T: ExtendedParser + std::convert::From<u8> + std::convert::From<u32> + Default, | ||
{ | ||
let s = os_str_as_bytes(os)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This always succeeds on unix platforms, and fails on other platforms if the string can't be coerced to UTF-8.
Is it really ok to fail here?
- If the string doesn't start with a
'
/"
, we probably still want to move forward toextended_parse
? - Is there something we can do to still try to figure out if the first char is a
\'
? (maybe we should useos_str_as_bytes_lossy
?!)
(I was looking at this wondering if we really need to result a Result<T, ...>
instead of just T
, as that'd avoid a lot of the other changes... But I guess we'd still need to fail if the first character is '
and the second is U+FFFD REPLACEMENT CHARACTER.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this could help but it's not stable... https://doc.rust-lang.org/std/ffi/struct.OsStr.html#method.slice_encoded_bytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- If the string doesn't start with a
'
/"
, we probably still want to move forward toextended_parse
?
Well, there are no numbers that contain a unicode replacement character, so it would fail anyway. 😛 I think it's fine to fail early here, the string being invalid unicode is likely to be about as useful error context as the string not being a valid number.
- Is there something we can do to still try to figure out if the first char is a
\'
? (maybe we should useos_str_as_bytes_lossy
?!)
The only case we would get such a failure is on Windows with a string that is not valid unicode. Unlike unix platforms, this almost never happens, since all Windows strings are UTF-16(ish) -- namely, Windows doesn't allow arbitrary byte sequences (or byte pairs) as input, it has to look like valid UTF-16(ish). The exception (the "ish") is that Windows does allow invalid surrogate pairs, which can't be converted to bytes in any meaningful way, which is why conversion to bytes has to be fallible. This means that any argument here containing an invalid codepoint also necessarily contains data that doesn't represent any particular byte, which, aside from being difficult to do in practice, I can't think of any reason you would want that data to then be parsed as a literal -- it seems much more likely that something has gone horribly wrong.
But I guess we'd still need to fail if the first character is ' and the second is U+FFFD REPLACEMENT CHARACTER.
Problem with that strategy is we'd need some way to distinguish the replacement character from lossy conversion and the replacement character as input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- If the string doesn't start with a
'
/"
, we probably still want to move forward toextended_parse
?Well, there are no numbers that contain a unicode replacement character, so it would fail anyway. 😛 I think it's fine to fail early here, the string being invalid unicode is likely to be about as useful error context as the string not being a valid number.
I should have made my thought clearer... In general, partial matches still return whatever can be parsed (with an error). And you do the same here with '
prefixes (see my new comment below though, you do change the behaviour a bit by manually printing the error instead of returning a PartialMatch)
So I think something like 123abc[badunicode]
will return PartialMatch(123, "abc[U+FFFD]")
with your change, I think I'm ok with that? But you do reject \'abc[badunicode]
, wouldn't be better if we could still return the ASCII value of a
here? (ideally a PartialMatch(97, "abc[U+FFFD]")
.
- Is there something we can do to still try to figure out if the first char is a
\'
? (maybe we should useos_str_as_bytes_lossy
?!)The only case we would get such a failure is on Windows with a string that is not valid unicode. Unlike unix platforms, this almost never happens, since all Windows strings are UTF-16(ish) -- namely, Windows doesn't allow arbitrary byte sequences (or byte pairs) as input, it has to look like valid UTF-16(ish). The exception (the "ish") is that Windows does allow invalid surrogate pairs, which can't be converted to bytes in any meaningful way, which is why conversion to bytes has to be fallible. This means that any argument here containing an invalid codepoint also necessarily contains data that doesn't represent any particular byte, which, aside from being difficult to do in practice, I can't think of any reason you would want that data to then be parsed as a literal -- it seems much more likely that something has gone horribly wrong.
Goodness, thanks for the explanation...
If this is exceedingly rare, I do wonder if it's not easier to just print a warning, and return a best guess value (even if that ends up being U+FFFD
due to the lossy replacement...).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay... back from running after my train. I think my thoughts are getting a bit clearer. I'll continue my point from below here to keep things in one place.
I'm not terribly happy that get_num
returns a Result, it's really weird that it propagates the errors from literal strings \'
(in a very uncommon non-Unix corner case BTW), but processes the errors from extended_parse
to the best possible number.
I'm also not too happy to move code out of extended_parse
(I've been spending a lot of time trying to move all the parsing code in one place...). But I also understand we may not want to pass OsStr
to extended_parse
... So maybe it's ok to do this processing here (I'd just add a comment in extended_parse
), since argument.rs
is the only user (AFAICT).
So I think I'd do something like this:
fn get_num<T>(os: &OsStr) -> T
let s = os.to_string_lossy();
- If
s
doesn't start with\'
, callextended_parse
. If not, call another functionparse_literal
(?) with the originalOsStr
that also returns aResult<T, ExtendedParserError<'_, T>>
- In
parse_literal
, you can callos_str_as_bytes
and just return aNotNumeric
if that fails (or feel free to create a new error) -- or maybe you can just useto_string_lossy
and accept that we'd sometimes return0xfffd
, all of these options sound ok to me. - In
parse_literal
, you can return aPartialMatch
instead of copy-pasting the warning.
- In
- Then you can call
extract_value
on the return value from above (eitherextended_parse
or theparse_literal
)
That simplifies the code a bit, avoids a bunch of unwrap
, removes the warning message duplication.
WDYT? I obviously didn't try, so maybe something doesn't work, or things can be done in yet another different way ,-)
This fixes handling of format arguments, in part by eliminating duplicate implementations. Utilities with format arguments other than printf will no longer accept things like "'a" as numbers, etc.
fe3770c
to
f9ca1e3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review!
Sorry for any duplicate notifications, GitHub did something weird because I grouped my responses in a review. |
GNU testsuite comparison:
|
}; | ||
// Emit a warning if there are additional characters | ||
if bytes.len() > len { | ||
show_warning!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized you change the behaviour here, this used to return a PartialMatch
. I'm not 100% sure if this is critical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(this makes me wonder if it's a good idea to move this code out of num_parser.rs:parse)
[-- sorry I need to run, I'll try to think more about this later]
@jtracey ping ? ^^ |
Sorry, in the middle of a move + new job. I'll try to get to this in the next week. I wouldn't be offended if someone else wanted to take it over, though that would start to be a real pile of authors, so if this is pressing it might be simpler to merge as-is, then take a look at @drinkcat's suggestions (haven't had time to look in depth, but at a glance they don't seem unreasonable). |
Goodness, even a rebase was challenging, ended up squashing 2 commits together to make the rebase slightly easier, preserved author info in commit message. #8329. I'll try to apply the changes I suggested, next. |
Rebases #6812 on #7208.
EDIT: Now also includes a commit from me to pass the printf-mb GNU test, as well as a few other odds and ends. Fixes #6804.