Starting with localization #3997

Open

tertsdiepraam opened this issue Oct 3, 2022 · 34 comments

@tertsdiepraam
Member

TL;DR: I want to add a new util for locale generation and provide locale-aware functionality in uucore


uutils currently follows the C locale for most of its operations, and the locale settings of the system are mostly ignored. This has led to issues and PRs like these:

We've mostly been putting this off due to missing libraries in Rust, but recently this has changed with the release of icu4x. It covers many of the things we need, like locale-aware datetime formatting, locale-aware collation, etc.

However, it requires data to operate on, which is different from the usual data generated by locale-gen and friends (if I understand correctly). There are essentially 2 viable ways to include data with icu4x¹:

  1. Store a blob on the filesystem to read at runtime (BlobDataProvider).
  2. Encode the data as Rust code included in the binary (BakedDataProvider).

Since we don't know up front what locales we might need, I think we need to use the BlobDataProvider and allow the user to generate their own locale data on command. So, I propose we do the following:

  1. Add a new util, called locale-gen or something similar
    • This util downloads and stores the locale data in a global directory (I'm not sure where, could also be controlled by an environment variable).
    • This util would be a wrapper around the icu_datagen crate².
    • It could also read from system config files and install any necessary locales based on the system config automatically.
    • Since this util needs access to the internet, we will run into issues similar to those we had with uudoc back when it automatically downloaded examples, so it needs to be optional.³
  2. Create locale-aware functionality in uucore as much as possible, so that the utils themselves don't have to bother with checking the right environment variables, loading the ICU data, etc.
    • For example, to determine the collation locale, the LC_ALL, LC_COLLATE and LANG env vars need to be checked (in that order of precedence); see the sketch after this list.
    • For the utils, we then just expose a sort/collate function that checks (and caches) the locale and performs the correct collation.
  3. Change the utils to use the locale-aware functions provided by uucore.
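
As a rough illustration of point 2, here is a minimal sketch of what such a cached lookup in uucore could look like. The function name and the fallback are hypothetical; only the standard library is assumed:

```rust
use std::env;
use std::sync::OnceLock;

/// Hypothetical uucore helper: resolve the collation locale once and cache it.
/// POSIX precedence: LC_ALL overrides LC_COLLATE, which overrides LANG;
/// if none is set (or all are empty), fall back to "C".
fn collation_locale() -> &'static str {
    static LOCALE: OnceLock<String> = OnceLock::new();
    LOCALE
        .get_or_init(|| {
            ["LC_ALL", "LC_COLLATE", "LANG"]
                .iter()
                .filter_map(|var| env::var(var).ok())
                .find(|val| !val.is_empty())
                .unwrap_or_else(|| "C".to_string())
        })
        .as_str()
}
```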

Do you see any problems with this approach? Are there alternatives we should explore first?

Footnotes

  1. They also have FsDataProvider which is meant for development only.

  2. This crate also has a CLI, but we need to tailor it for use with coreutils, by setting nicer defaults for our purpose.

  3. icu_datagen uses reqwest, which will lead to similar problems as in https://github.com/uutils/coreutils/pull/3184

@tertsdiepraam
Member Author

There is also rust_icu, a wrapper around ICU4C, which works without additional datagen but pulls in a big C dependency. So I guess we have to choose between C code and custom datagen?

@tertsdiepraam
Member Author

I'm no longer sure rust_icu works without datagen. icu4c also has a different data format from POSIX. I think the only future-proof way forward is to embrace icu4x's data format. I wonder if the Unicode folks are willing to spec out some standard location for this data and provide some tools for managing it. It'd be nice if all applications built using icu4x that want to store the data in the filesystem could share their data.

@VorpalBlade
Contributor

VorpalBlade commented Feb 4, 2024

I was running into essentially the same problem for my own command line tools.

  • Did you figure out a standard location to store the data?
  • What about translations? icu4x seems to handle everything except for LC_MESSAGES. Or am I missing something?
  • Could you consider putting the logic for locale env parsing, etc. in a separate crate rather than uucore, so other projects outside of uutils can reuse it (without copy-pasting code)? It would be good to be able to solve this for all sorts of POSIX command line tools rather than reinvent the wheel every time. Especially with proper support for mixed locales (which you seem to be considering, and which I use, but few others care about).

@tertsdiepraam
Member Author

Did you figure out a standard location to store the data?

Not yet. We should start talking to some people about that :)

What about translations? icu4x seems to handle everything except for LC_MESSAGES. Or am I missing something?

Translations are out of scope for a while for us I think, but if you want it, I think Project Fluent is the gold standard there.

Could you consider putting the logic for locale env parsing, etc. in a separate crate rather than uucore, so other projects outside of uutils can reuse it (without copy-pasting code)?

If there is a significant amount of code, it should definitely go in a separate crate.

Especially with proper support for mixed locales (which you seem to be considering, and which I use, but few others care about).

Yeah I think we should support mixed locales. At least, if by mixed locale you mean that for example collation is done in one locale and number formatting in another or something like that. icu4x can do all of that I believe.

@VorpalBlade
Contributor

Yeah I think we should support mixed locales. At least, if by mixed locale you mean that for example collation is done in one locale and number formatting in another or something like that. icu4x can do all of that I believe.

Exactly. I use LC_MESSAGES in English (for searchability and because translations tend to be poor), but I use sv_SE.UTF-8 for everything else, except for collation, where I prefer C.UTF-8 for case-sensitive sorting.

@jtracey
Contributor

jtracey commented Mar 16, 2025

To get this moving faster, I'd like to propose another interim solution: implementing safe wrappers around the relevant libc functions that implement locale functionality (strcoll, wcstombs, etc.). While a pure-Rust solution would be nice in the long run, we already link against libc on every supported platform, and by using its functionality, we can use the locale configurations already provided by the operating system instead of figuring out a way to distribute our own. Since the GNU coreutils are presumably relying on glibc to handle this, it would also mean a very high likelihood of matching GNU behavior on glibc platforms.

We wouldn't want to adversely affect our existing performance wins, so this should probably be implemented as fallback functionality when a relevant LC_* environment variable for this util is set. I've gone through all the GNU info and man pages to see where this would work, and it seems doable to me. If there's no objection, I'll file some tracking issues.
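
For illustration, a minimal sketch of what one such wrapper could look like, assuming the libc crate and that setlocale(LC_ALL, "") has been called once at startup (the function name is made up):

```rust
use std::cmp::Ordering;
use std::ffi::CString;

/// Hypothetical safe wrapper around libc's strcoll. Assumes setlocale() has
/// already been called and that the inputs contain no interior NUL bytes.
fn collate(a: &str, b: &str) -> Ordering {
    let ca = CString::new(a).expect("no interior NUL");
    let cb = CString::new(b).expect("no interior NUL");
    // SAFETY: both pointers are valid, NUL-terminated C strings.
    let r = unsafe { libc::strcoll(ca.as_ptr(), cb.as_ptr()) };
    r.cmp(&0)
}
```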

With translations, we could easily use gettext-rs, which would come with the major advantage of allowing us to piggyback on any GNU coreutils strings that are already translated on the system (we would just have to "lie" about which domain we are, though "coreutils" isn't wrong per se). The problem with that is I suspect the copyright on those strings is owned by the GNU project, and therefore subject to the GPL. One could argue using them is covered by the compatibility exemptions to copyright, but I'm not a lawyer and have zero comfort making that call.

@sylvestre
Contributor

Oh no
Please don't use the libc functions for this. They are pretty bad and painful. I'd rather have no translations than a painful system.

And we can't use the translation files from upstream as they are under the GPL.

In general, please don't start this work before I've validated the technical solution :)

@jtracey
Contributor

jtracey commented Mar 16, 2025

To be clear, I'm suggesting wrapping libc functions for locales, not translations. E.g., implementing a Rust wrapper function that provides a cleaner interface to do a collated comparison by wrapping the libc functionality.

@sylvestre
Contributor

Yeah, don't worry, I got it initially :)

@tertsdiepraam
Member Author

Since this is gaining some traction again, I'd like to mention that I've changed my mind. I now think that icu4x would be the wrong choice here for compatibility (even though it is a great project). Wrapping or replacing libc's functionality is a better option, I think.

@sylvestre
Contributor

@tertsdiepraam why not icu?

@sylvestre
Contributor

sylvestre commented Mar 16, 2025

I started to experiment with fluent

@tertsdiepraam
Member Author

I think icu4x will make things difficult because it requires that external data, which makes it harder for distros to package uutils. It might also be incompatible with the posix functionality.

@jtracey
Contributor

jtracey commented Mar 16, 2025

Right, to spell it out a bit more: locale files are distributed in a compiled format that is tightly coupled with glibc. Our only options are to either find our own way to build out the infrastructure required to create and distribute our own locale files (without just giving everyone every possible locale), or rely on the ones that come with the OS/libc. My first thought was "write a parser for the libc files in Rust", and while that's still possible in theory, the compiled format isn't documented, and it is "tightly coupled with glibc" in the sense that it can completely change between release versions (I suspect they're just byte stores of internal structs, but I haven't looked at the code). Any Rust code that tries to operate on the system locales would therefore not only need to parse these files, but use a different parser for every version of libc (and figure out which to use), or just wrap the libc functions to let libc figure it out.

For longer term solutions, we could maybe negotiate with OS vendors (especially GNU) to standardize on a documented compiled locale format, but that would obviously take some long term discussion.

@VorpalBlade
Contributor

VorpalBlade commented Mar 16, 2025

@jtracey wrote:

Right, to spell it out a bit more: locale files are distributed in a compiled format that is tightly coupled with glibc.
[...]
My first thought was "write a parser for the libc files in Rust", and while that's still possible in theory, the compiled format isn't documented, and "tightly coupled with glibc" in the sense that it can completely change between release versions

Also: what about musl (used on e.g. Alpine Linux)? Or *BSD? Or even Windows or macOS?

Let's look at the various categories of locales by environment variable:

LC_COLLATE

This is how strings are ordered (relevant for sorting, for example). It defines whether you have a sort order of ABCabc (as in the C locale) or AaBbCc (as in most other Western locales). It is also what defines that, in the Swedish alphabet, åäö come after xyz, etc.

Implementation-wise this is annoying: it looks like we can't easily extract the meaning of, for example, LC_COLLATE=sv_SE.UTF-8. The best we can do is a sort that calls strcoll(const char*, const char*) from Rust for every comparison (or figure out how to use strxfrm, which seems about as bad). Having such an FFI call for every comparison is going to be quite bad for performance. It will also not handle embedded NUL bytes on lines (since it uses C strings).
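
As a purely illustrative sketch (assuming the libc crate and a prior setlocale call), the strxfrm route would precompute a sort key per line, so the sort itself compares plain bytes instead of making an FFI call per comparison:

```rust
use std::ffi::CString;

/// Illustrative only: transform a string into its locale-dependent sort key.
/// Assumes setlocale() was already called and the input has no interior NUL bytes.
fn sort_key(s: &str) -> Vec<u8> {
    let c = CString::new(s).expect("no interior NUL");
    // A first call with a zero-length buffer returns the required key length.
    let needed = unsafe { libc::strxfrm(std::ptr::null_mut(), c.as_ptr(), 0) };
    let mut buf = vec![0u8; needed + 1];
    unsafe {
        libc::strxfrm(buf.as_mut_ptr().cast(), c.as_ptr(), buf.len());
    }
    buf.pop(); // drop the trailing NUL
    buf
}

// Usage: lines.sort_by_cached_key(|line| sort_key(line));
```

Whether this actually beats per-comparison strcoll would need measuring; as noted above, it may be about as bad.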

LC_NUMERIC

This describes how to format numbers. Do you write one and a half as 1.5 (English) or 1,5 (several European languages, such as Swedish)? What about the thousands separator? Do you group by hundreds? Or tens of thousands?

It looks like string comparison is the worst issue, though: for numbers, localeconv returns a struct lconv, whose data we can presumably use from a Rust library to determine the decimal separator, how to group thousands, etc.
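
For example, a small sketch (assuming the libc crate and a prior setlocale call) of pulling the decimal separator out of localeconv:

```rust
use std::ffi::CStr;

/// Sketch: read the current locale's decimal separator via libc::localeconv.
/// Assumes setlocale(LC_ALL, "") has already been called.
fn decimal_point() -> String {
    // SAFETY: localeconv returns a pointer to storage owned by libc; the
    // decimal_point field is a NUL-terminated C string.
    unsafe {
        let lc = libc::localeconv();
        CStr::from_ptr((*lc).decimal_point)
            .to_string_lossy()
            .into_owned()
    }
}
```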

LC_MESSAGES

Translated messages: what people usually think about, but also the most boring and arguably easiest part to handle. There are already some libraries in Rust for this, such as Fluent.

LC_CTYPE

This covers which letters are considered upper or lower case, as well as how to convert between them. It looks like this just affects the behaviour of other functions; there is no "dump the data for the current locale" function for this. So you would again have to use libc functions (isupper, toupper, ...).
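
For a single-byte taste of what that looks like (assuming the libc crate and a prior setlocale call; note this says nothing about multibyte encodings such as UTF-8):

```rust
/// Sketch: locale-aware uppercasing of one byte via libc's toupper.
/// toupper is only defined for values representable as unsigned char (or EOF),
/// so this cannot handle multibyte characters.
fn upper_byte(b: u8) -> u8 {
    unsafe { libc::toupper(b as libc::c_int) as u8 }
}
```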

EDIT: This also handles Unicode vs. multibyte encodings, etc. I suspect anything except UTF-8 is legacy; do we care about that?

LC_MONETARY

I doubt this is needed for coreutils, but localeconv covers this case too.

LC_TIME

Some of the info seems to be available from nl_langinfo, but we then need to parse a strftime format string unless we want to use strftime from libc. The specific strings for "January", "Monday", etc. are also available from nl_langinfo.

One thing I can't figure out is where you get the first day of the week, though (whether it is Monday or some other unusual day, like Sunday in the US).
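
A sketch of what reading those strings could look like on Linux/glibc (assuming the libc crate; the first-weekday question above is a glibc extension and not covered here):

```rust
use std::ffi::CStr;

/// Sketch: fetch a locale-dependent string from nl_langinfo.
/// Assumes setlocale(LC_ALL, "") has been called; the returned pointer is
/// owned by libc and may be overwritten by later calls, so copy it out.
fn langinfo(item: libc::nl_item) -> String {
    unsafe { CStr::from_ptr(libc::nl_langinfo(item)).to_string_lossy().into_owned() }
}

fn demo() {
    let fmt = langinfo(libc::D_T_FMT); // strftime-style pattern, still needs parsing
    let jan = langinfo(libc::MON_1);   // "January" in the current locale
    let mon = langinfo(libc::DAY_2);   // DAY_1 is Sunday in the nl_langinfo numbering
    println!("{fmt} / {jan} / {mon}");
}
```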

Others (GNU extensions)

This includes LC_ADDRESS, LC_TELEPHONE, LC_PAPER, etc. It's doubtful these are needed for coreutils.

Rust libraries

I think we (the Rust community) probably need to come up with a standardised library with stable locale files that can be reused by many such programs. It seems to me that icu4x would be that, but you noted that "It might also be incompatible with the posix functionality." Why would that be?

Relevant links

@tertsdiepraam
Member Author

tertsdiepraam commented Mar 16, 2025

Very nice summary!

you noted that "It might also be incompatible with the posix functionality.", why would that be?

It's mostly my fear of it being incompatible in some cases. Like if there are small collation differences for some reason. I don't know whether that's the case, but it's likely that some differences will be found.

@jtracey
Contributor

jtracey commented Mar 17, 2025

Having such a FFI call for every comparison is going to be quite bad for performance.

This is why I wanted such wrappers to only intelligently fall back when a relevant variable is set, but it is at least worth mentioning that all I/O (and syscalls generally) already goes through libc FFI, and Windows already needs to do pretty intensive string manipulation to constantly convert between UTF-8 and UTF-16 on the fly. The extra work we'd need to do is pretty small compared to that, so I suspect most users wouldn't notice the performance cost, even on the slow path.

It will also not handle embedded 0 bytes on lines (since it uses C strings).

That's not a big deal, since that's already true for the existing utils we're aiming for compatibility with, right? If we want to do something smarter, we can just split on null bytes (no additional copies necessary since each slice would already be a valid C string, other than the last one).
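
A tiny sketch of that splitting idea (plain std, no extra copies):

```rust
/// Split a line on NUL bytes. The sub-slices borrow from `line`, so nothing is
/// copied; every chunk except the last is followed in `line` by the NUL
/// delimiter, so a pointer to its start is already a valid C string.
fn nul_separated(line: &[u8]) -> impl Iterator<Item = &[u8]> {
    line.split(|&b| b == 0)
}
```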

EDIT: This also handles Unicode vs multibyte encodings etc. I suspect anything except UTF-8 is legacy, do we care about that?

Windows is almost always UTF-16 (transparently handled with Rust's I/O for the most part, but not if you're interoperating between it and another OS), and Linux systems still support a ton of alternatives (locale -m | wc -l gives 236 on my system). IIUC, non-Unicode CJK encodings still see some significant use. Safely converting between encodings is one of the main documented features of printf (~20% of the info page), so I don't think it makes sense to just call it deprecated. Coercing Windows to use something other than UTF-16 sounds like a nightmare we shouldn't touch for now, but on all the other platforms I think it's worth doing.

@zbraniecki

Hi all, just found this thread through the Ubuntu announcement. I'm one of the authors of ICU4X and Fluent. Happy to support you in the evaluation of the technologies for internationalization and localization.

I started to experiment with fluent

Fluent is a localization system, not an i18n system. It uses ICU4X for i18n.

I think icu4x will make things difficult because it requires that external data, which makes it harder for distros to package uutils.

that's not accurate. ICU4X allows you to "bake" data into the library, or provide external data stored in data files. Your choice.

Based on what I read here, my initial suggestion is that:

  • For i18n you should look to ICU4X. It provides you with a modern, Rust-native API surface, with the ability to slice the code and data to minimize the binary size impact compared to ICU4C.
  • For localization you're in a bit tougher spot. Fluent is a robust localization system by Mozilla, used in Firefox, and has a good Rust library, so you can use it as-is. But in recent years Mozilla has been collaborating with Unicode on the design of MessageFormat 2.0, which is now nearing stabilization. MF2.0 is an evolutionary step on top of Fluent, and ICU4X is working on MF2.0 integration directly, but it will take some time (EOY?) before we have MF2.0 in ICU4X. You could start with Fluent and then migrate to MF2.0, as the Fluent project owner (eemeli), who is also a co-author of MF2.0, is working on tooling to automate such migrations.

@sylvestre
Contributor

Thanks @zbraniecki and nice to see you here! :)

@tertsdiepraam
Member Author

Hi! Thanks for your input!

that's not accurate. ICU4X allows you to "bake" data into the library, or provide external data stored in data files. Your choice.

Right, I was sloppy in my wording there. I discarded the baked data option, because recompiling for every locale seemed silly and including all locales probably isn't viable either. Having the external data is a plus, but users (and/or distros) would need a way to install locales into a standardized folder that we can access. So basically locale-gen for ICU4X data. If you know something like that, let us know!

@zbraniecki

because recompiling for every locale seemed silly and including all locales probably isn't viable either.

I don't think including all locales is as irrational as you make it sound. The real challenge is that you get yourself into the business of deciding on the tradeoff between disk payload and locale coverage. My argument is that you are already in that business if you want to create a multi-locale application.
What you can do is kick the can down the road - make it the distro's question. And it seems that that's what you suggest doing.

So basically locale-gen for ICU4X data. If you know something like that, let us know!

Correct, if you want the customer of your application suite to package locales, you need to make it a public API for the consumer to place locale data in a way that your application can discover and use it.
ICU4X provides robust tooling for data generation: you can use it to generate data from existing CLDR files or fetch it from the network, generate a single data package or a package per locale, and then use a DataProvider to access this data and feed it to the ICU4X runtime.

I recommend, as a tracer bullet, supplying date +%c with ICU4X-produced data. It will give you a good overview of options to package data per locale, and once you have an architecture that scales for you, you can scale the support.

@Manishearth

and including all locales probably isn't viable either

I'll note: by default, ICU4X baked data ships with most locales and it's not actually that much data. You don't need to do any datagen for this.

I would recommend experimenting with ICU4X default baked data at first, seeing what the impact is, and then tweaking with datagen if you really need to make it loadable. ICU4X is very much designed to give full data flexibility, but you should try and see what works before paying that cost.

(At the very least you should just use baked data for "singleton" crates, like icu_properties, icu_normalizer, and icu_casemap, where the data is not locale dependent)

@tertsdiepraam
Member Author

Alright, thank you both!

@VorpalBlade
Contributor

@Manishearth Looking towards a future with many core command line tools written in Rust, it would be ideal if the locale data could be shared between them. Is the data format of icu4x stable across versions (or at least backward/forward compatible, so that older versions ignore unknown fields from newer data)?

If not what would it take to get to that point, and is it something you are planning to do?

@Manishearth

It's... complicated. The answer rounds up to "yes, for the cases you care about", but it's worth going into more detail. As I said, data flexibility is a core goal, so this is something we've thought about extensively.

Data markers

For a given "data marker" (also called a "data key"), the serialized data format is stable. For example, the formatting data for decimal formatting, DecimalSymbolsV1 is a stable format (the underlying type is DecimalSymbols): it may evolve in ways that add more variants/fields but only if old data can be successfully deserialized by new code. We try to also maintain forward compatibility.

The baked data format is not: icu_decimal v2.0 MUST be used with either icu_decimal_data v2.0, or baked data generated by icu4x-datagen 2.0. The baked data format constructs these types directly, and the actual Rust types are internally unstable, so mixing versions is bad.

Baked data is largely irrelevant for your question: it is designed for when you want to bake data into your binary, which is the opposite of what you want if you're trying to "share" data across CLI tools.

API-level data requirements

Okay, so that was about "data marker"s. How does that relate to things you actually use?

Well, individual APIs pull in a (usually small) list of data markers. You can see which ones in the _unstable constructors of individual types, for example DecimalFormatter pulls in DecimalSymbolsV1 and DecimalDigitsV1. ICU4X datagen has a keyextract option that lets it figure out which data markers you need by analyzing a built binary, so you don't usually have to manually collect these, but it's possible.

Now, the fact that DecimalFormatter uses DecimalSymbolsV1 and DecimalDigitsV1 is itself not stable. We might change it to also pull in DecimalSomethingElseV1. Or we might change the symbols format and introduce DecimalDigitsV2. Usually when doing so we do not expect to remove the ability to generate old data for at least a couple releases, so datagen will be perfectly capable of generating DecimalDigitsV1 and DecimalDigitsV2 data at the same time.

All this means is that new code may not successfully work with old data, if it ends up needing extra data. However new and old code can still share the data, because the underlying data is stable.

This type of change does not happen often, FWIW. We also try hard to make new code continue to work with old data when making such changes, especially if our users engage with us on such needs (if we know which APIs you're using, we can take extra effort to make these migrations painless). For example, if we change a data format, we may still introduce code such that when loading from blob data, if DecimalDigitsV2 is missing we still fall back to DecimalDigitsV1 and do a somewhat suboptimal conversion (or something else). If we add a new data marker, we may make it so that the absence of that data does not cause an error, and instead hits fallback behavior.

A minor caveat for segmentation: Segmentation data is tied to unicode version, as is segmentation code. While the format is backwards compatible, if you are using ICU4X for segmentation, we do not recommend mixing data and code from different versions.

Summary

Putting all of this together:

  • CLI tools will be able to share data even on disparate ICU4X versions
  • Data is organized under data "marker"s. In the filesystem provider these markers end up as individual folders¹.
  • Data under the same "marker" is backwards compatible, and usually forwards compatible
  • You will have to do some work to know which data marker each CLI tool needs (easy), and perform some union operation, or slice things in your packaging system. This is all possible.
  • You may have to version your data such that a tool using newer ICU4X requires newer data.
  • When ICU4X changes marker requirements of an API, you may need your data bundle to carry around old and new data. This is rare, and we can manage how it happens to better fit your needs if we know what APIs you are using

Final note: ICU4X 2.0 should be released in the next month or so. That does include major changes in APIs and data, and very little is backwards compatible. We do not expect to be making major changes often (scale of years, and we don't have major plans for 3.0 any time soon)

Footnotes

  1. Your use case may actually want a different kind of provider, where each data marker corresponds to a single postcard-formatted file. We don't have this; it is easy to build in third-party code, but it is also something we could provide in first-party code if you think it would be nice.

@hsivonen

you noted that "It might also be incompatible with the posix functionality.", why would that be?

It's mostly my fear of it being incompatible in some cases. Like if there are small collation differences for some reason. I don't know whether that's the case, but it's likely that some differences will be found.

(I wrote most of the ICU4X collator.)

It looks a lot like glibc tries to use the ISO counterpart of DUCET as the root collation while applying CLDR tailorings, but I didn’t check whether glibc really uses the LDML collation root instead of the DUCET root, since it would make sense to take the root and the tailorings from the same upstream. See https://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation for the current differences between DUCET and the CLDR/LDML root collation.

ICU4X uses LDML/CLDR Collation root and does not provide raw DUCET. The collation rules can change between Unicode/CLDR versions, so it’s a bad idea to rely on the absence of “small collation differences” even between versions of the same collation library. Apart from bugs, one likely source of differences between different libraries is the libraries being on different Unicode/CLDR versions.

Furthermore, ICU4C and ICU4X provide two root collation alternatives that differ in data size and in how Han characters are sorted. The difference shows up if the decisive comparison is between Han characters from different Unicode blocks: The smaller data orders such characters by Unicode block and the larger data orders them by radical–stroke. If both Han characters are from the same Unicode block, they are ordered by radical–stroke either way. Moreover, all CJK locales tailor the collation order of common (for the locale) Han characters anyway, so the difference isn’t that relevant in practice. (View source on this demo for example characters to test with.)

If you wish to evaluate the performance of the ICU4X collator, I suggest testing with the PR implementing an identical-prefix optimization applied. It improves performance significantly for scenarios like sorting a bunch of file names in a directory when the multiple file names have a common prefix.

Windows is almost always UTF-16 (transparently handled with Rust's I/O for the most part, but not if you're interoperating between it and another OS), and Linux systems still support a ton of alternatives (locale -m | wc -l gives 236 on my system). IIUC, non-Unicode CJK encodings still see some significant use.

My recollection off the top of my head is that Red Hat changed its defaults to UTF-8 in 2002 and Debian in 2007 with Ubuntu defaulting to UTF-8 from the start (i.e. before upstream Debian). Firefox hasn’t supported non-UTF-8 file paths on Linux for years, and judging from the absence of bug reports it’s not a problem. (I’m the owner of character encodings in Firefox, but I’m commenting on uutils in personal hobbyist capacity.) Chances are that it’s entirely impractical to try to run a Linux system with a non-UTF-8 locale setting these days even if glibc still comes with non-UTF-8 locale definitions.

Windows still has non-UTF-8 defaults for interpreting stdout for terminal display beyond just CJK, but the exe metadata can tell Windows to use UTF-8. I think Rust programs do that by default, but I haven’t actually examined the metadata of Rust binaries. Not my call, of course, but I think supporting old Windows versions that don’t support the metadata that allows an exe to opt into UTF-8 interpretation of stdout in the terminal is a very bad use of effort.

It should be fine to be UTF-8-only.

@sylvestre
Contributor

Nice to see that many Mozillians here!
How would you start implementing this in the Rust coreutils?

@VorpalBlade
Contributor

Chances are that it’s entirely impractical to try to run a Linux system with a non-UTF-8 locale setting these days even if glibc still comes with non-UTF-8 locale definitions.

There might be a use case for being able to load old text files (but this is really the job of iconv in the command line world). And you might want to use the plain C locale to make tools just treat everything as raw bytes. What is important is to not panic on non-UTF-8 data; you need to gracefully do something sensible for many low-level tools.

@hsivonen

How would you start implementing this in the Rust coreutils ?

I'd use ICU4X with baked data, since, unlike glibc, ICU4X is cross-platform and, unlike ICU4C, ICU4X is Rust-native and doesn't involve dealing with non-Rust dependencies. Baked data is a fine starting point for experimentation (and likely also for shipping, especially when all the utilities are compiled into one executable anyway).

I’ll focus on collation and encodings, since I know those areas the best. (I think it makes sense to use Fluent for UI text and ICU4X for other stuff like segmentation, case conversion, number formatting, and datetime formatting.)

I’d introduce a PosixLocale enum with variants C and Bcp47(icu_locale_core::Locale) (Bcp47 to be bikeshed, or just Option<icu_locale_core::Locale> where None means C to avoid the bikeshed).

I’d parse LC_COLLATE and LC_ALL (by taking the part before the first . or @, replacing _ with -, mapping C and POSIX to PosixLocale::C, and parsing other values into icu_locale_core::Locale), giving LC_COLLATE and LC_ALL the right precedence. On Mac and Windows, I’d assume them to be used just for overriding the collation to C, and otherwise the preference would need to be looked up from a non-environment-variable API. (On my Mac, neither LC_COLLATE nor LC_ALL is set in the environment.) It would be good to share this code with ICU4X.
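
A sketch of that parsing, using the Option form of the PosixLocale idea above and assuming icu_locale_core (the helper name and the inclusion of LANG as a final fallback are my additions):

```rust
use std::env;
use icu_locale_core::Locale;

/// Sketch: resolve the collation locale, with None standing in for the C locale.
/// LC_ALL takes precedence over LC_COLLATE, which takes precedence over LANG.
fn resolve_collation_locale() -> Option<Locale> {
    let raw = ["LC_ALL", "LC_COLLATE", "LANG"]
        .iter()
        .filter_map(|var| env::var(var).ok())
        .find(|val| !val.is_empty())?;
    // Keep only the part before the first '.' or '@' (drops codeset/modifier).
    let base = raw.split(|c: char| c == '.' || c == '@').next().unwrap_or("");
    match base {
        "" | "C" | "POSIX" => None,
        other => other.replace('_', "-").parse::<Locale>().ok(),
    }
}
```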

I expect the answer for what to do in case of failure (the value exists but fails to parse, or the environment variable doesn’t exist on a platform that doesn’t have a non-environment-variable API) to depend on the category of preference (collation, number format, etc.). For collation, there are 3 plausible fallbacks: C, the CLDR root collation, and en-US-posix. I think the CLDR root collation is the right fallback, but I could be persuaded of the merits of en-US-posix. (en-US-posix takes the root collation and tailors the ASCII range to be C/POSIX locale-like.)

Then I’d start with something simpler than the sort utility, and add ICU4X collator sorting to e.g. ls such that PosixLocale::C runs the pre-existing code and PosixLocale::Bcp47 instantiates icu_collator::CollatorBorrowed with baked data (the default baked data) for the locale and options that are the closest match to the behavior of GNU ls (I haven’t researched what options those are).

Where uutils ls currently sorts file names as a.display_name.cmp(&b.display_name), I’d make the collator-aware branch sort by collator.compare_utf8(a.display_name.as_encoded_bytes(), b.display_name.as_encoded_bytes()).

(Caveat: The POSIX spec talks about collations having to provide a total order of characters, but, per LDML collation, ICU4X will report inputs whose NFD form matches as equal even if the bytes differ. I don’t know if POSIX theoretically requires an even higher collation strength than what Strength::Identical provides, but chances are that for practical purposes even Strength::Identical is excessive.)

(Sorry about not submitting this as a patch right now.)

There might be a use case for being able to load old text files (but this is really the job of iconv in the command line world).

Yeah, telling users to convert legacy file contents to UTF-8 first is a better way than adding legacy encoding locale support by parsing the part after the dot in LC_* environment variables and adding conversion capabilities all over the place in all the utilities.

Do I understand correctly that uutils doesn’t have an iconv replacement, yet? https://crates.io/crates/recode_rs goes a long way on the side of converting from legacy encodings to UTF-8, but on the reverse side it exhibits enough Web-specific quirks that folks might not accept it as an iconv drop-in even for the smaller set of supported encodings.

And you might want to use the plain C locale to make tools just treat everything as raw bytes.

Yes, I think it makes sense to retain the current behaviors for C in a way that bypasses ICU4X. Especially for sorting, the performance penalty from real collation is significant, and some users will want to be able to get the C behavior in a way that bypasses real collation.

What is important is to not panic on non-UTF8 data, you need to gracefully do something sensible for many low level tools.

icu_collator::CollatorBorrowed::compare requires &str, but icu_collator::CollatorBorrowed::compare_utf8 takes &[u8] and treats UTF-8 errors as U+FFFD according to the Encoding Standard. It’s actually useful when the input attempts to be UTF-8 but fails a bit, and it at least does something non-panicky (albeit not particularly useful) with input that doesn’t even attempt to be UTF-8.

@hsivonen

It looks like I had misunderstood en-US-posix, and the purpose of en-US-posix is to represent the C locale in the BCP47 value space.

So instead of introducing a PosixLocale enum, the way to go is likely this:

  1. Use icu_locale_core::Locale to represent the locale, even including C.
  2. Represent the C locale as locale!("en-US-posix").
  3. When about to sort, if the locale equals locale!("en-US-posix"), do the byte-wise lexical sort; else use the ICU4X collator.

That is, never actually use the ICU4X collator with the en-US-posix CLDR tailoring.
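
Putting that together with the earlier compare_utf8 suggestion, a sketch of the branch (the collator construction is omitted; `collate` is a stand-in for something like `|a, b| collator.compare_utf8(a, b)`):

```rust
use std::cmp::Ordering;
use icu_locale_core::Locale;

/// Sketch: byte-wise sort for the C locale (represented as en-US-posix),
/// collator-backed sort otherwise.
fn sort_names<F>(names: &mut [Vec<u8>], locale: &Locale, collate: F)
where
    F: Fn(&[u8], &[u8]) -> Ordering,
{
    // Parsed at runtime here; the locale!("en-US-posix") form above could be used instead.
    let c_locale: Locale = "en-US-posix".parse().expect("valid BCP 47 tag");
    if *locale == c_locale {
        names.sort_unstable(); // plain byte-wise lexical sort, no collation data needed
    } else {
        names.sort_unstable_by(|a, b| collate(a, b));
    }
}
```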

@jtracey
Contributor

jtracey commented Apr 30, 2025

Chances are that it’s entirely impractical to try to run a Linux system with a non-UTF-8 locale setting these days even if glibc still comes with non-UTF-8 locale definitions.

It's been a few years, but I've handled an issue in another project from at least one user who had problems using a non-UTF-8 locale (we just told them to run under C, but that feels less acceptable for a project as fundamental as the coreutils). I'm fine with a decision of "it's not worth it", but again, GNU's printf spends a considerable amount of its info page describing how to use it to safely handle multiple encodings, so it is definitely a compatibility-breaking decision to choose to never support it.

Windows still has non-UTF-8 defaults for interpreting stdout for terminal display beyond just CJK, but the exe metadata can tell Windows to use UTF-8. I think Rust programs do that by default, but I haven’t actually examined the metadata of Rust binaries. Not my call, of course, but I think supporting old Windows versions that don’t support the metadata that allows an exe to opt into UTF-8 interpretation of stdout in the terminal is a very bad use of effort.

Rust does some dark magic to let it operate compatibly between UTF-8 and Windows. We can't rely on normal string handling because the coreutils regularly interface with non-UTF-8 data. Obviously that happens all the time in Unix (e.g., sort or join on byte separators), but in particular, Windows OsStrings can contain unpaired surrogate code points, which is why Rust's Windows FFI extensions expose a UTF-16(ish) interface. We can't completely ignore this, since there are real filesystems that have these awful file names; plus, we want to be able to do proper globbing, which, even on Windows filesystems with only valid Unicode, requires handling this (see #7161 and the issues it links). Remember, coreutils aren't just used to display text to the user; their inputs and outputs are frequently piped and interface with the rest of the OS. I agree that it doesn't make sense to prioritize Windows encoding issues over basic functionality, and I think the "ICU4X with baked data" route could be viable (assuming other concerns, e.g. size, turn out okay), but I wanted to point this out to make sure we avoid dropping the (partial) support we do have, or building a bunch of tech debt for future support.

@VorpalBlade
Contributor

We can't completely ignore this, since there are real filesystems that have these awful file names;

This is not just a Windows concern: on Linux (and *nix in general), valid file names can contain any byte except NUL and /. Yes, you can have file names containing newlines or binary data. Usually you don't, for obvious reasons. But to be correct you need to handle that. (And when messing up logic in shell scripts, I have had to remove file names with newlines or leading dashes, etc., more than once. Coreutils cannot assume it operates on sane data.)

@jtracey
Contributor

jtracey commented Apr 30, 2025

Yep, Unix OsStrings are basically just byte sequences until you try to semantically interpret them. The distinction is that Windows OsStrings can contain semantically valid strings (for some definition of "semantically valid") that do not map onto individual byte sequences (nor UTF-8), so you can't just parse them in the C locale for things like sorting.

@Manishearth

Note that the ICU4X collator can be expanded to support more encodings as needed. Currently it supports UTF-8 and UTF-16.
