Description
TL;DR: I want to add a new util for locale generation and provide locale-aware functionality in `uucore`.
uutils is currently following the `C` locale for most of its operations and the locale settings of the system are mostly ignored. This has led to issues and PRs like these:
- ls: " is not taken in consideration when building table #3584
- sort -k doesn't handle locale #3123
- "ls" returns different sort order, incorrect results for subdirectory listings. #2149
- `expr` is failing with multibyte chars #3132
- ls: Compatibility Tracking Issue #1872 (much locale-related missing functionality)
We've mostly been putting this off due to missing libraries in Rust, but this has recently changed with the release of `icu4x`. It covers many of the things we need, like locale-aware datetime formatting, locale-aware collation, etc.
However, it requires data to operate on, which is different from the usual data generated by `locale-gen` and friends (if I understand correctly). There are essentially 2 viable ways to include data with `icu4x` [1]:
- Store a blob on the filesystem to read at runtime (`BlobDataProvider`).
- Encode the data as Rust code included in the binary (`BakedDataProvider`).
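To make the runtime-blob option more concrete, here is a minimal std-only sketch of how a util might locate and read such a blob. The environment variable `UUTILS_LOCALE_DATA_DIR`, the default directory and the file name are all hypothetical placeholders, not existing uutils or `icu4x` conventions. Handing the bytes to `icu4x`'s `BlobDataProvider` is deliberately left out.

```rust
use std::env;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// All three names below are assumptions made up for this sketch.
const DATA_DIR_VAR: &str = "UUTILS_LOCALE_DATA_DIR";
const DEFAULT_DATA_DIR: &str = "/usr/share/uutils/locale";
const BLOB_FILE_NAME: &str = "icu_data.postcard";

/// Build the path of the locale blob, preferring a non-empty override
/// directory and falling back to the default directory otherwise.
fn blob_path(override_dir: Option<&str>) -> PathBuf {
    let dir = match override_dir {
        Some(dir) if !dir.is_empty() => dir,
        _ => DEFAULT_DATA_DIR,
    };
    Path::new(dir).join(BLOB_FILE_NAME)
}

/// Read the whole blob into memory. A real implementation would pass
/// these bytes on to the icu4x blob provider.
fn load_blob(path: &Path) -> io::Result<Vec<u8>> {
    fs::read(path)
}

fn main() {
    let override_dir = env::var(DATA_DIR_VAR).ok();
    let path = blob_path(override_dir.as_deref());
    match load_blob(&path) {
        Ok(bytes) => println!("loaded {} bytes from {}", bytes.len(), path.display()),
        // Without data we could keep today's behaviour and stay in the C locale.
        Err(err) => eprintln!("no locale data at {} ({}), using the C locale", path.display(), err),
    }
}
```

A missing blob is treated as a soft error here, so a util without generated data would simply behave as it does today.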
Since we don't know up front what locales we might need, I think we need to use the `BlobDataProvider` and allow the user to generate their own locale data on command. So, I propose we do the following:
- Add a new util, called `locale-gen` or something similar.
  - This util downloads and stores the locale data in a global directory (I'm not sure where; it could also be controlled by an environment variable).
  - This util would be a wrapper around the `icu_datagen` crate [2].
  - It could also read from system config files and install any necessary locales based on the system config automatically.
  - Since this util needs access to the internet, we will run into issues similar to those we had with `uudoc` back when it automatically downloaded examples, so it needs to be optional [3].
- Create locale-aware functionality in `uucore` as much as possible, so that the utils themselves don't have to bother with checking the right environment variables, loading the icu data, etc.
  - For example, to check the collation locale, the `LC_COLLATE`, `LC_ALL` and `LANG` env vars need to be checked.
  - For the utils, we then just expose a `sort/collate` function that checks (and caches) the locale and performs the correct collation.
- Change the utils to use the locale-aware functions provided by `uucore`.
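As a rough sketch of the "check and cache the locale" part, the lookup could look something like the following. It follows the POSIX precedence, where a non-empty `LC_ALL` overrides `LC_COLLATE`, which in turn overrides `LANG`. The function names `resolve_collation_locale` and `collation_locale` are hypothetical, and the fallback to `"C"` is an assumption that matches the current behaviour. The actual collation step through `icu4x` is not shown.

```rust
use std::sync::OnceLock;

/// Resolve the collation locale name from an environment lookup.
/// The lookup is passed in as a closure so the precedence logic can be
/// tested without touching the real process environment.
fn resolve_collation_locale<F>(lookup: F) -> String
where
    F: Fn(&str) -> Option<String>,
{
    // POSIX precedence: LC_ALL, then the category variable, then LANG.
    for name in ["LC_ALL", "LC_COLLATE", "LANG"] {
        if let Some(value) = lookup(name) {
            // An empty value counts as unset.
            if !value.is_empty() {
                return value;
            }
        }
    }
    // Assumed fallback: keep today's behaviour when nothing is set.
    "C".to_string()
}

/// Resolve the locale once per process and reuse the cached result.
fn collation_locale() -> &'static str {
    static LOCALE: OnceLock<String> = OnceLock::new();
    LOCALE.get_or_init(|| resolve_collation_locale(|name| std::env::var(name).ok()))
}

fn main() {
    println!("collation locale: {}", collation_locale());
}
```

With something like this in `uucore`, a util such as `sort` would only call the exposed collation function and never read the environment variables itself.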
Do you see any problems with this approach? Are there alternatives we should explore first?
Footnotes
1. They also have `FsDataProvider`, which is meant for development only.
2. This crate also has a CLI, but we need to tailor it for use with coreutils, by setting nicer defaults for our purpose.
3. `icu_datagen` uses `reqwest`, which will lead to similar problems as in https://github.com/uutils/coreutils/pull/3184