Description
Related: #554
As part of getting GNU's join tests to pass (#2634), we implemented support for non-Unicode field separators on unix-like platforms (#2902). The reason for not supporting other platforms is that only `std::os::unix::ffi::OsStrExt` provides the `as_bytes()` method for `OsStr`s (see the `std::ffi` docs on conversions). Clap can only provide arguments in one of two forms: as Rust `String`s, or as `OsString`s. The former can only represent valid Unicode data, and the latter has a platform-dependent representation, which is by default opaque to the consumer. Because of this, most OSs cannot directly represent arbitrary single bytes in arguments.
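To make the platform split concrete, here is a minimal sketch of how a single-byte separator might be pulled out of an argument. The helper name `separator_byte` is hypothetical and is not taken from the join implementation; it only illustrates that raw bytes are reachable via `as_bytes()` on Unix, while the fallback has to go through `to_str()` and therefore only sees valid Unicode.

```rust
use std::ffi::OsStr;

/// Hypothetical helper (not from the actual join code): returns the
/// separator if the argument is exactly one byte long.
#[cfg(unix)]
fn separator_byte(arg: &OsStr) -> Option<u8> {
    // On Unix, an OsStr is an arbitrary byte sequence, and OsStrExt
    // exposes it directly.
    use std::os::unix::ffi::OsStrExt;
    match arg.as_bytes() {
        [b] => Some(*b),
        _ => None,
    }
}

/// Fallback for other platforms: the internal representation is opaque
/// here, so only arguments that are valid Unicode can be inspected.
#[cfg(not(unix))]
fn separator_byte(arg: &OsStr) -> Option<u8> {
    match arg.to_str()?.as_bytes() {
        [b] => Some(*b),
        _ => None,
    }
}
```

In the fallback, a separator such as the byte `0xA7` can never be returned: the only single-byte UTF-8 strings are ASCII characters.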
Using a non-ASCII byte as a field separator is somewhat rare, but far from unheard of. It should be decided if this usage is common enough to warrant supporting it on other, non-unix platforms, and if so, how that support should be implemented. The options I see are:
- Extend the `\0` syntax implemented in "join: add support for `-t '\0'`" (#2881). While GNU doesn't support it, we could fairly easily extend this to parsing everything following the `\` as a u8 (presumably following printf's syntax). This would have the advantage of also making it easier to use non-Unicode values generally, even on Unix platforms, where they currently have to be created using workarounds like `$(printf '\247')`. The disadvantage is that it contradicts GNU's behavior in a literal sense, though not in any way that is currently tested.
- Add a new option. Similar to the above, but rather than extending an existing option, it would avoid directly contradicting GNU's behavior by making a new `--separator-value` option or something. Presumably it shouldn't be hard to pick a name that has a very low probability of ever colliding with a future GNU option. The disadvantage of this is that it adds a redundant option to something already exposed, in a non-standard way.
- OS-specific hacks. This would only be available on Windows, since it's the only other OS that exposes its internal `OsString` representation in any way, but we could hack around the UTF-16 values to represent any single byte value we want. E.g., my understanding (someone with a Windows dev environment can confirm) is that there are ways to pass invalid UTF-16 arguments from the command line, so we could choose to interpret values from `0xD800` to `0xD8FF` as the bytes from `0x00` to `0xFF`, respectively. Currently, these values (in isolation) will cause an error, as they can't be turned into UTF-8, and have no obvious alternative meaning, so they are in some sense "safe" to overload (i.e., GNU can't even represent these values, let alone have intended behavior for them). The disadvantages are that this could only be exposed on Windows, and is a pretty unintuitive hack that would need some explaining.
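As a rough illustration of the first and third options, the sketch below shows the two conversions involved. Both function names are hypothetical, and the escape format is an assumption: it reads the "printf syntax" mentioned above as a backslash followed by one to three octal digits. On Windows, the UTF-16 code units would presumably come from `std::os::windows::ffi::OsStrExt::encode_wide`; the mapping itself is kept platform-independent here.

```rust
/// Option 1 (assumed format): parse a backslash followed by one to three
/// octal digits into a single byte, e.g. `\247` -> 0xA7.
fn parse_octal_escape(arg: &str) -> Option<u8> {
    let digits = arg.strip_prefix('\\')?;
    let all_octal = digits.bytes().all(|b| (b'0'..=b'7').contains(&b));
    if digits.is_empty() || digits.len() > 3 || !all_octal {
        return None;
    }
    // Values above 0o377 do not fit in a u8 and are rejected here.
    u8::from_str_radix(digits, 8).ok()
}

/// Option 3: interpret a lone UTF-16 code unit in 0xD800..=0xD8FF as the
/// byte 0x00..=0xFF, following the mapping proposed in the list above.
fn byte_from_surrogate(unit: u16) -> Option<u8> {
    if (0xD800..=0xD8FF).contains(&unit) {
        Some((unit - 0xD800) as u8)
    } else {
        None
    }
}
```

With this reading, `-t '\247'` would select the same separator that currently requires `$(printf '\247')` on Unix, and the existing `\0` case falls out of the same parser.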
The options basically run from "most elegant but least safe" to "safest but least elegant", where "safe" here is "unlikely to conflict with GNU join behavior" (e.g., if someone were to write a fuzzer to compare behavior, how careful would they have to be). My personal preference is for the first option, but I also don't use non-unix platforms, so don't need much of a say.