decide if and how to support non-UTF-8, single byte field separators/delimiters on non-Unix

Related: #554

As part of getting GNU's join tests to pass (#2634), we implemented support for non-Unicode field separators on Unix-like platforms (#2902). Other platforms are not supported because only `std::os::unix::ffi::OsStrExt` provides the `as_bytes()` method for `OsStr`s (see the [std::ffi docs on conversions](https://doc.rust-lang.org/1.58.1/std/ffi/index.html#conversions)). Clap can only provide arguments in one of two forms: as Rust `String`s or as `OsString`s. The former can only represent valid Unicode data, and the latter has a platform-dependent representation that is opaque to the consumer by default. Because of this, on most platforms we cannot directly read arbitrary single bytes from arguments.
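To make the platform split concrete, here is a minimal sketch of how a single-byte separator could be pulled out of an argument. The helper name `separator_byte` is hypothetical and is not the code from #2902; the only std APIs it relies on are `OsStrExt::as_bytes()` on Unix and `OsStr::to_str()` elsewhere.

```rust
use std::ffi::OsStr;

/// Hypothetical helper: return the separator if the argument is exactly one byte.
/// On Unix the raw bytes of the argument are available, so any byte value works.
#[cfg(unix)]
fn separator_byte(arg: &OsStr) -> Option<u8> {
    use std::os::unix::ffi::OsStrExt;
    match arg.as_bytes() {
        [b] => Some(*b),
        _ => None,
    }
}

/// Elsewhere the argument must first be valid Unicode, so the only
/// single-byte values that can get through are ASCII.
#[cfg(not(unix))]
fn separator_byte(arg: &OsStr) -> Option<u8> {
    match arg.to_str()?.as_bytes() {
        [b] => Some(*b),
        _ => None,
    }
}
```

With this shape, `separator_byte(OsStr::new(","))` yields `Some(b',')` on every platform, while a lone non-UTF-8 byte such as `0xA7` can only be accepted by the Unix variant.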

Using a non-ASCII byte as a field separator is somewhat rare, but far from unheard of. We should decide whether this usage is common enough to warrant supporting it on other, non-Unix platforms, and if so, how that support should be implemented. The options I see are:
 - Extend the `\0` syntax implemented in #2881. While GNU doesn't support it, we could fairly easily extend this to parsing everything following the `\` as a u8 (presumably following [printf's syntax](https://github.com/uutils/coreutils/blob/4f9ba87c52d27f639f19f091ca475975915b89cb/src/uu/printf/src/printf.rs#L61)). This would have the advantage of also making it easier to use non-Unicode values generally, even on Unix platforms, where they currently have to be created using workarounds like `$(printf '\247')`. The disadvantage is that it contradicts GNU's behavior in a literal sense, though not in any way that is currently tested.
 - Add a new option. Similar to the above, but rather than extending an existing option, it would avoid directly contradicting GNU's behavior by making a new `--separator-value` option or something. Presumably it shouldn't be hard to pick a name that has a very low probability of ever colliding with a future GNU option. The disadvantage of this is it adds a redundant option to something already exposed, in a non-standard way.
 - OS-specific hacks. This would only be available on Windows, since it's the only other OS that exposes its internal `OsString` representation in any way, but we could hack around the UTF-16 values to represent any single byte value we want. E.g., my understanding (someone with a Windows dev environment can confirm) is that there are ways to pass invalid UTF-16 arguments from the command line, so we could choose to interpret values from `0xD800` to `0xD8FF` as the bytes from `0x00` to `0xFF`, respectively. Currently, these values (in isolation) will cause an error, as they can't be turned into UTF-8, and have no obvious alternative meaning, so they are in some sense "safe" to overload (i.e., GNU can't even represent these values, let alone have intended behavior for them). The disadvantages are that this could only be exposed on Windows, and is a pretty unintuitive hack that would need some explaining.
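
As a rough illustration of the first and third options, the sketch below shows what the parsing could look like. Both function names are hypothetical, and the escape syntax is an assumption (one to three octal digits after the backslash, so that `\247` matches the `$(printf '\247')` workaround). For the Windows variant, the `u16` units are assumed to come from `std::os::windows::ffi::OsStrExt::encode_wide()`.

```rust
/// Option 1 sketch (assumed syntax): interpret `\NNN`, with 1 to 3 octal
/// digits, as a raw byte. `\0` keeps its existing meaning of the NUL byte.
fn parse_octal_escape(arg: &str) -> Option<u8> {
    let digits = arg.strip_prefix('\\')?;
    let all_octal = digits.bytes().all(|b| (b'0'..=b'7').contains(&b));
    if digits.is_empty() || digits.len() > 3 || !all_octal {
        return None;
    }
    // Values above 0o377 do not fit in a u8 and are rejected here.
    u8::from_str_radix(digits, 8).ok()
}

/// Option 3 sketch: treat a lone UTF-16 code unit in 0xD800..=0xD8FF as the
/// byte 0x00..=0xFF. On Windows the units would come from `encode_wide()`.
fn surrogate_to_byte(units: &[u16]) -> Option<u8> {
    match units {
        [u @ 0xD800..=0xD8FF] => Some((*u - 0xD800) as u8),
        _ => None,
    }
}
```

Under these assumptions, `parse_octal_escape("\\247")` and `surrogate_to_byte(&[0xD8A7])` both produce the byte `0xA7`. The second option (a dedicated flag) could reuse the same octal parser behind a different argument name.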

The options basically run from "most elegant but least safe" to "safest but least elegant", where "safe" here is "unlikely to conflict with GNU join behavior" (e.g., if someone were to write a fuzzer to compare behavior, how careful would they have to be). My personal preference is for the first option, but I also don't use non-unix platforms, so don't need much of a say.

decide if and how to support non-UTF-8, single byte field separators/delimiters on non-Unix #3075