decide if and how to support non-UTF-8, single byte field separators/delimiters on non-Unix #3075

@jtracey

Related: #554

As part of getting GNU's join tests to pass (#2634), we implemented support for non-Unicode field separators on Unix-like platforms (#2902). Other platforms are not supported because only std::os::unix::ffi::OsStrExt provides the as_bytes() method for OsStr (see the std::ffi docs on conversions). Clap can only provide arguments in one of two forms: as Rust Strings or as OsStrings. The former can only represent valid Unicode data, and the latter has a platform-dependent representation that is opaque to the consumer by default. Because of this, on most OSs we cannot directly read arbitrary single bytes from arguments.
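To make the platform split concrete, here is a minimal sketch of how a single-byte separator could be pulled out of an argument. The helper name `separator_byte` is hypothetical and is not the code from #2902; the non-Unix fallback is only an assumption about what is possible when the argument has to go through `to_str()`.

```rust
use std::ffi::OsStr;

/// Hypothetical helper: on Unix, the raw bytes of the argument are
/// available, so any single byte (including non-UTF-8 ones) is accepted.
#[cfg(unix)]
fn separator_byte(arg: &OsStr) -> Option<u8> {
    use std::os::unix::ffi::OsStrExt;
    match arg.as_bytes() {
        [b] => Some(*b),
        _ => None,
    }
}

/// Hypothetical fallback: elsewhere the argument must be valid Unicode,
/// so only separators that are one byte in UTF-8 (i.e. ASCII) get through.
#[cfg(not(unix))]
fn separator_byte(arg: &OsStr) -> Option<u8> {
    match arg.to_str()?.as_bytes() {
        [b] => Some(*b),
        _ => None,
    }
}
```

With this shape, a byte such as 0xA7 can only reach the Unix branch; on other platforms it is rejected before join ever sees it.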

Using a non-ASCII byte as a field separator is somewhat rare, but far from unheard of. It should be decided if this usage is common enough to warrant supporting it on other, non-unix platforms, and if so, how that support should be implemented. The options I see are:

  • Extend the \0 syntax implemented in #2881 (join: add support for -t '\0'). While GNU doesn't support it, we could fairly easily extend this to parse everything following the \ as a u8 (presumably following printf's syntax). This would have the advantage of also making it easier to use non-Unicode values generally, even on Unix platforms, where they currently have to be created using workarounds like $(printf '\247'). The disadvantage is that it contradicts GNU's behavior in a literal sense, though not in any way that is currently tested.
  • Add a new option. Similar to the above, but rather than extending an existing option, it would avoid directly contradicting GNU's behavior by making a new --separator-value option or something. Presumably it shouldn't be hard to pick a name that has a very low probability of ever colliding with a future GNU option. The disadvantage of this is it adds a redundant option to something already exposed, in a non-standard way.
  • OS-specific hacks. This would only be available on Windows, since it's the only other OS that exposes its internal OsString representation in any way, but we could hack around the UTF-16 values to represent any single byte value we want. E.g., my understanding (someone with a Windows dev environment can confirm) is that there are ways to pass invalid UTF-16 arguments from the command line, so we could choose to interpret values from 0xD800 to 0xD8FF as the bytes from 0x00 to 0xFF, respectively. Currently, these values (in isolation) will cause an error, as they can't be turned into UTF-8, and have no obvious alternative meaning, so they are in some sense "safe" to overload (i.e., GNU can't even represent these values, let alone have intended behavior for them). The disadvantages are that this could only be exposed on Windows, and is a pretty unintuitive hack that would need some explaining.
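As a rough sketch of what the first two options would need to parse, assuming printf-style octal escapes of one to three digits after the backslash: the function name `parse_separator_escape` is hypothetical, and the exact syntax (octal only, no hex) is an assumption rather than something decided here.

```rust
/// Hypothetical parser for a separator argument that is valid Unicode.
/// Accepts either a single-byte literal (e.g. ",") or a backslash followed
/// by one to three octal digits (e.g. "\247"), and returns the byte value.
fn parse_separator_escape(arg: &str) -> Option<u8> {
    match arg.as_bytes() {
        // A plain one-byte separator, including a lone backslash.
        [b] => Some(*b),
        // Backslash followed by 1..=3 characters that must all be octal digits.
        [b'\\', digits @ ..] if (1..=3).contains(&digits.len()) => {
            let mut value: u32 = 0;
            for d in digits {
                if !(b'0'..=b'7').contains(d) {
                    return None;
                }
                value = value * 8 + u32::from(d - b'0');
            }
            // Three octal digits can reach 0o777, which does not fit in a byte.
            if value <= 0xFF {
                Some(value as u8)
            } else {
                None
            }
        }
        _ => None,
    }
}
```

Under these assumptions, `-t '\247'` would yield the byte 0xA7 on every platform without needing `$(printf '\247')`, and `-t '\0'` keeps its current meaning of NUL.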

The options basically run from "most elegant but least safe" to "safest but least elegant", where "safe" here means "unlikely to conflict with GNU join behavior" (e.g., if someone were to write a fuzzer to compare behavior, how careful would they have to be). My personal preference is for the first option, but I also don't use non-Unix platforms, so I don't need much of a say.
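For reference, the mapping described in the third option could look roughly like the following. The names `byte_from_wide` and `separator_from_os` are hypothetical, and whether a lone surrogate can actually be passed from a Windows command line is still the unconfirmed assumption mentioned above.

```rust
/// Hypothetical mapping for the Windows-only option: a single UTF-16 code
/// unit in 0xD800..=0xD8FF (an unpaired high surrogate) is interpreted as
/// the byte 0x00..=0xFF. Anything else is not handled by this path.
fn byte_from_wide(units: &[u16]) -> Option<u8> {
    match units {
        [u] if (0xD800..=0xD8FF).contains(u) => Some((*u - 0xD800) as u8),
        _ => None,
    }
}

/// Hypothetical Windows entry point: encode_wide() exposes the argument's
/// UTF-16 code units, which are then fed to the mapping above.
#[cfg(windows)]
fn separator_from_os(arg: &std::ffi::OsStr) -> Option<u8> {
    use std::os::windows::ffi::OsStrExt;
    let units: Vec<u16> = arg.encode_wide().collect();
    byte_from_wide(&units)
}
```

Ordinary separators such as "," would still go through the normal Unicode path; this only gives a meaning to arguments that currently produce an error.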
