Starting with localization #3997
There is also
I'm no longer sure
I was running into essentially the same problem for my own command line tools.
Not yet. We should start talking to some people about that :)
Translations are out of scope for a while for us I think, but if you want it, I think Project Fluent is the gold standard there.
If there is a significant amount of code, it should definitely go in a separate crate.
Yeah, I think we should support mixed locales. At least, if by mixed locale you mean that, for example, collation is done in one locale and number formatting in another, or something like that.
Exactly. I use LC_MESSAGES in English (for searchability and because translations tend to be poor), but I use sv_SE.UTF-8 for everything else, except for collate where I prefer C.UTF-8 for case-sensitive sort.
To get this moving faster, I'd like to propose another interim solution: implementing safe wrappers around the relevant libc functions that implement locale functionality. We wouldn't want to adversely affect our existing performance wins, so this should probably be implemented as fallback functionality when a relevant LC_* environment variable for this util is set. I've gone through all the GNU info and man pages to see where this would work, and it seems doable to me. If there's no objection, I'll file some tracking issues.

With translations, we could easily use gettext-rs, which would come with the major advantage of allowing us to piggyback on any GNU coreutils strings that are already translated on the system (we would just have to "lie" about which domain we are, though "coreutils" isn't wrong per se). The problem with that is that I suspect the copyright on those strings is owned by the GNU project, and therefore subject to the GPL. One could argue using them is covered by the compatibility exemptions to copyright, but I'm not a lawyer and have zero comfort making that call.
Oh no. And we can't use the translation files from upstream as they are under the GPL. In general, please don't start this work before I've validated the technical solution :)
To be clear, I'm suggesting wrapping libc functions for locales, not translations. E.g., implementing a Rust wrapper function that provides a cleaner interface to do a collated comparison by wrapping the libc functionality.
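For illustration, a minimal sketch of such a wrapper, assuming the libc crate; the function name collate_cmp and the placement of the setlocale call are hypothetical, not existing uutils code:

```rust
use std::cmp::Ordering;
use std::ffi::CString;

/// Compare two strings using the current LC_COLLATE setting via strcoll(3).
/// Falls back to plain byte order if either string contains an interior NUL.
fn collate_cmp(a: &str, b: &str) -> Ordering {
    let (ca, cb) = match (CString::new(a), CString::new(b)) {
        (Ok(ca), Ok(cb)) => (ca, cb),
        _ => return a.as_bytes().cmp(b.as_bytes()),
    };
    // SAFETY: both pointers are valid, NUL-terminated C strings.
    let r = unsafe { libc::strcoll(ca.as_ptr(), cb.as_ptr()) };
    r.cmp(&0)
}

fn main() {
    // Adopt the locale from the environment (LC_ALL / LC_COLLATE / LANG) once at startup.
    unsafe { libc::setlocale(libc::LC_COLLATE, b"\0".as_ptr().cast()) };
    println!("{:?}", collate_cmp("äpple", "zebra"));
}
```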
Yeah, don't worry, I got it initially :)
Since this is gaining some traction again, I'd like to mention that I've changed my mind. I now think that icu4x would be the wrong choice here for compatibility (even though it is a great project). Wrapping or replacing libc's functionality is a better option, I think.
@tertsdiepraam why not icu?
I started to experiment with fluent |
I think icu4x will make things difficult because it requires external data, which makes it harder for distros to package uutils. It might also be incompatible with the posix functionality.
Right, to spell it out a bit more: locale files are distributed in a compiled format that is tightly coupled with glibc. Our only options are to either find our own way to build out the infrastructure required to create and distribute our own locale files (without just giving everyone every possible locale), or rely on the ones that come with the OS/libc.

My first thought was "write a parser for the libc files in Rust", and while that's still possible in theory, the compiled format isn't documented, and "tightly coupled with glibc" in the sense that it can completely change between release versions (I suspect they're just byte stores of internal structs, but I haven't looked at the code). Any Rust code that tries to operate with the system locales would therefore not only need to parse these files, but use a different parser for every version of libc (and figure out which to use), or just wrap the libc functions to let libc figure it out. For longer-term solutions, we could maybe negotiate with OS vendors (especially GNU) to standardize on a documented compiled locale format, but that would obviously take some long-term discussion.
@jtracey wrote:
Also: What about musl (used on e.g. Alpine Linux)? Or *BSD? Or even Windows or macOS?

Let's look at the various categories of locales by environment variables:

LC_COLLATE
This is how to order strings (relevant for sorting, for example). It defines whether you have a sort order of ABCabc (as in the C locale) or AaBbCc (as in most other western locales). It is also what defines that in the Swedish alphabet åäö comes after xyz, etc. Implementation-wise this is annoying; it looks like we can't easily extract the meaning of, for example,

LC_NUMERIC
This describes how to format numbers. Do you write one and a half as 1.5 (English) or 1,5 (several European languages, such as Swedish)? What about the thousands separator? Or do you do hundreds? Or tens of thousands? It looks like string comparison is the worst issue though, as for numbers

LC_MESSAGES
Translated messages, what people usually think about. But also the most boring and arguably easiest part to handle. There are already some libraries in Rust for this, such as fluent.

LC_CTYPE
This covers which letters are considered upper or lower case, as well as how to convert between them. It looks like this just affects the behaviour of other functions; there is no "dump the data for the current locale" function for this. So you would again have to use libc functions.
EDIT: This also handles Unicode vs multibyte encodings etc. I suspect anything except UTF-8 is legacy, do we care about that?

LC_MONETARY
I doubt this is needed for coreutils, but

LC_TIME
Some of the info seems to be available from
One thing I can't figure out is where you get the first day of the week though (whether it is Monday or some other unusual day, like Sunday in the US).

Others (GNU extensions)
This includes

Rust libraries
I think we (the Rust community) probably need to come up with a standardised library with stable locale files that can be reused by many such programs. It seems to me that icu4x would be that, but you noted that "It might also be incompatible with the posix functionality.", why would that be?
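As a point of reference for the environment-variable handling discussed above, POSIX resolves each category as LC_ALL, then the specific LC_* variable, then LANG, falling back to the C locale; a tiny sketch under that assumption (the helper name effective_locale is made up):

```rust
use std::env;

/// Resolve the locale value for one category (e.g. "LC_COLLATE") using the
/// POSIX precedence LC_ALL > LC_<category> > LANG, treating empty values as unset.
fn effective_locale(category: &str) -> String {
    ["LC_ALL", category, "LANG"]
        .into_iter()
        .filter_map(|name| env::var(name).ok())
        .find(|value| !value.is_empty())
        .unwrap_or_else(|| "C".to_string())
}

fn main() {
    // e.g. `LC_ALL=sv_SE.UTF-8 cargo run` prints "sv_SE.UTF-8"
    println!("{}", effective_locale("LC_COLLATE"));
}
```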
Very nice summary!
It's mostly my fear of it being incompatible in some cases. Like if there are small collation differences for some reason. I don't know whether that's the case, but it's likely that some differences will be found.
This is why I wanted such wrappers to only intelligently fall back when a relevant variable is set, but it is at least worth mentioning that all I/O (and syscalls generally) are already going through libc ffi, and Windows already needs to do pretty intensive string manipulation to constantly convert between UTF-8 and UTF-16 on the fly. The extra work we'd need to do is pretty small compared to that, so I suspect most users wouldn't notice the performance cost, even on the slow path.
That's not a big deal, since that's already true for the existing utils we're aiming for compatibility with, right? If we want to do something smarter, we can just split on null bytes (no additional copies necessary since each slice would already be a valid C string, other than the last one).
Windows is almost always UTF-16 (transparently handled with Rust's I/O for the most part, but not if you're interoperating between it and another OS), and Linux systems still support a ton of alternatives.
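A trivial sketch of the zero-copy null-byte splitting mentioned a couple of paragraphs up (just an illustration, not project code):

```rust
/// Split a buffer of NUL-terminated C strings into byte slices without copying;
/// each slice borrows from `buf` and is already a valid C-string body.
fn split_c_strings(buf: &[u8]) -> impl Iterator<Item = &[u8]> {
    buf.split(|&b| b == 0).filter(|s| !s.is_empty())
}

fn main() {
    let parts: Vec<&[u8]> = split_c_strings(b"foo\0bar\0baz\0").collect();
    assert_eq!(parts, [&b"foo"[..], &b"bar"[..], &b"baz"[..]]);
}
```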
Hi all, just found this thread through the Ubuntu announcement. I'm one of the authors of ICU4X and Fluent. Happy to support you in the evaluation of the technologies for internationalization and localization.
Fluent is a localization system, not an i18n system. It uses ICU4X for i18n.
That's not accurate. ICU4X allows you to "bake" data into the library, or provide external data stored in data files. Your choice. Based on what I read here, my initial suggestion is that:
Thanks @zbraniecki and nice to see you here! :) |
Hi! Thanks for your input!
Right, I was sloppy in my wording there. I discarded the baked data option, because recompiling for every locale seemed silly and including all locales probably isn't viable either. Having the external data is a plus, but users (and/or distros) would need a way to install locales into a standardized folder that we can access. So basically
I don't think including all locales is as irrational as you make it sound. The real challenge is that you get yourself into the business of deciding on the tradeoff between disk payload and locale coverage. My argument is that you are already in that business if you want to create a multi-locale application.
Correct, if you want the customer of your application suite to package locales, you need to make it a public API for the consumer to place locale data in a way that your application can discover and use it. I recommend as a tracer bullet to supply
I'll note: by default, ICU4X baked data ships with most locales and it's not actually that much data. You don't need to do any datagen for this. I would recommend experimenting with ICU4X default baked data at first, seeing what the impact is, and then tweaking with datagen if you really need to make it loadable. ICU4X is very much designed to give full data flexibility, but you should try and see what works before paying that cost. (At the very least you should just use baked data for "singleton" crates, like
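To give a feel for the baked-data route, a minimal sketch assuming the icu meta-crate around 1.x with its default compiled data (constructor names change in ICU4X 2.0, so treat this as illustrative rather than the final uutils approach):

```rust
use icu::collator::{Collator, CollatorOptions};
use icu::locid::locale;

fn main() {
    // Baked/compiled data: no external files, no datagen step.
    let collator = Collator::try_new(&locale!("sv-SE").into(), CollatorOptions::new())
        .expect("Swedish collation data should be in the default baked data");

    let mut words = vec!["zebra", "äpple", "Apple"];
    words.sort_by(|a, b| collator.compare(a, b));
    // Swedish collation places "ä" after "z".
    println!("{words:?}");
}
```

In uucore this would presumably be built once per process and cached, rather than constructed per comparison.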
Alright, thank you both!
@Manishearth Looking towards a future with many core command-line tools written in Rust, it would be ideal if the locale data could be shared between them. Is the data format of icu4x stable across versions (or at least backward/forward compatible, so that older versions ignore unknown fields from newer data)? If not, what would it take to get to that point, and is it something you are planning to do?
It's... complicated. The answer rounds up to "yes, for the cases you care about", but it's worth going into more detail. As I said, data flexibility is a core goal, so this is something we've thought about extensively.

Data markers
For a given "data marker" (also called a "data key"), the serialized data format is stable. For example, the formatting data for decimal formatting,
The baked data format is not. Baked data is largely irrelevant for your question: it is designed for when you want to bake data into your binary, which is the opposite of what you want if you're trying to "share" data across CLI tools.

API-level data requirements
Okay, so that was about "data marker"s. How does that relate to things you actually use? Well, individual APIs pull in a (usually small) list of data markers. You can see which ones in the
Now, the fact that DecimalFormatter uses
All this means is that new code may not successfully work with old data, if it ends up needing extra data. However, new and old code can still share the data, because the underlying data is stable. This type of change does not happen often, FWIW. We also try hard to make new code continue to work with old data when making such changes, especially if our users engage with us on such needs (if we know which APIs you're using, we can take extra effort to make these migrations painless). For example, if we change a data format, we may still introduce code such that when loading from blob data, if

A minor caveat for segmentation: segmentation data is tied to the Unicode version, as is segmentation code. While the format is backwards compatible, if you are using ICU4X for segmentation, we do not recommend mixing data and code from different versions.

Summary
Putting all of this together:
Final note: ICU4X 2.0 should be released in the next month or so. That does include major changes in APIs and data, and very little is backwards compatible. We do not expect to be making major changes often (scale of years, and we don't have major plans for 3.0 any time soon).
(I wrote most of the ICU4X collator.)

It looks a lot like glibc tries to use the ISO counterpart of DUCET as the root collation while applying CLDR tailorings, but I didn't check whether glibc really uses the LDML collation root instead of the DUCET root, since it would make sense to take the root and the tailorings from the same upstream. See https://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation for the current differences between DUCET and the CLDR/LDML root collation. ICU4X uses the LDML/CLDR root collation and does not provide raw DUCET.

The collation rules can change between Unicode/CLDR versions, so it's a bad idea to rely on the absence of "small collation differences" even between versions of the same collation library. Apart from bugs, one likely source of differences between different libraries is the libraries being on different Unicode/CLDR versions.

Furthermore, ICU4C and ICU4X provide two root collation alternatives that differ in data size and in how Han characters are sorted. The difference shows up if the decisive comparison is between Han characters from different Unicode blocks: the smaller data orders such characters by Unicode block and the larger data orders them by radical–stroke. If both Han characters are from the same Unicode block, they are ordered by radical–stroke either way. Moreover, all CJK locales tailor the collation order of common (for the locale) Han characters anyway, so the difference isn't that relevant in practice. (View source on this demo for example characters to test with.)

If you wish to evaluate the performance of the ICU4X collator, I suggest testing with the PR implementing an identical-prefix optimization applied. It improves performance significantly for scenarios like sorting a bunch of file names in a directory when multiple file names share a common prefix.
My recollection off the top of my head is that Red Hat changed its defaults to UTF-8 in 2002 and Debian in 2007, with Ubuntu defaulting to UTF-8 from the start (i.e. before upstream Debian). Firefox hasn't supported non-UTF-8 file paths on Linux for years, and judging from the absence of bug reports it's not a problem. (I'm the owner of character encodings in Firefox, but I'm commenting on uutils in a personal hobbyist capacity.) Chances are that it's entirely impractical to try to run a Linux system with a non-UTF-8 locale setting these days, even if glibc still comes with non-UTF-8 locale definitions.

Windows still has non-UTF-8 defaults for interpreting stdout for terminal display beyond just CJK, but the exe metadata can tell Windows to use UTF-8. I think Rust programs do that by default, but I haven't actually examined the metadata of Rust binaries. Not my call, of course, but I think supporting old Windows versions that don't support the metadata that allows an exe to opt into UTF-8 interpretation of stdout in the terminal is a very bad use of effort. It should be fine to be UTF-8-only.
Nice to see so many Mozillians here!
There might be a use case for being able to load old text files (but this is really the job of iconv in the command line world). And you might want to use the plain C locale to make tools just treat everything as raw bytes. What is important is to not panic on non-UTF-8 data; many low-level tools need to gracefully do something sensible with it.
I'd use ICU4X with baked data, since, unlike glibc, ICU4X is cross-platform and, unlike ICU4C, ICU4X is Rust-native and doesn't involve dealing with non-Rust dependencies. Baked data is a fine starting point for experimentation (and likely also for shipping, especially when all the utilities are compiled into one executable anyway).

I'll focus on collation and encodings, since I know those areas the best. (I think it makes sense to use Fluent for UI text and ICU4X for other stuff like segmentation, case conversion, number formatting, and datetime formatting.)

I'd introduce a
I'd parse
I expect the answer for what to do in case of failure (the value exists but fails to parse, or the environment variable doesn't exist on a platform that doesn't have a non-environment-variable API) to depend on the category of preference (collation, number format, etc.). For collation, there are 3 plausible fallbacks:

Then I'd start with something simpler than the
Where

(Caveat: The POSIX spec talks about collations having to provide a total order of characters, but, per LDML collation, ICU4X will report inputs whose NFD forms match as equal even if the bytes differ. I don't know if POSIX theoretically requires an even higher collation strength than what

(Sorry about not submitting this as a patch right now.)
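One small, concrete piece of the above is mapping a POSIX locale value such as "sv_SE.UTF-8" to a BCP-47 tag that ICU4X understands; a sketch under that assumption (posix_to_bcp47 is a made-up helper, and real POSIX values can need more massaging, e.g. "@" modifiers):

```rust
use icu::locid::Locale;

/// Map a POSIX locale value (e.g. "sv_SE.UTF-8") to a BCP-47 Locale.
/// Returns None for "C"/"POSIX", which callers would treat as plain byte order.
fn posix_to_bcp47(value: &str) -> Option<Locale> {
    // Drop the ".codeset" and "@modifier" parts, keep "lang_REGION".
    let base = value.split(|c| c == '.' || c == '@').next()?;
    if base.is_empty() || base == "C" || base == "POSIX" {
        return None;
    }
    base.replace('_', "-").parse().ok()
}

fn main() {
    assert_eq!(posix_to_bcp47("C"), None);
    assert_eq!(posix_to_bcp47("sv_SE.UTF-8"), "sv-SE".parse::<Locale>().ok());
}
```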
Yeah, telling users to convert legacy file contents to UTF-8 first is a better way than adding legacy encoding locale support by parsing the part after the dot in
Do I understand correctly that
Yes, I think it makes sense to retain the current behaviors for
It looks like I had misunderstood
So instead of introducing a
That is, never actually use the ICU4X collator with the
It's been a few years, but I've handled an issue in another project from at least one user who had problems using a non-UTF-8 locale (we just told them to run under C, but that feels less acceptable for a project as fundamental as the coreutils). I'm fine with a decision of "it's not worth it", but again, GNU's printf spends a considerable amount of its
Rust does some dark magic to let it operate compatibly between UTF-8 and Windows. We can't rely on normal string handling because the coreutils regularly interface with non-UTF-8 data. Obviously that happens all the time in Unix (e.g.,
This is not just a Windows concern: on Linux (and *nix in general), valid file names can contain any byte except NUL and /. Yes, you can have file names containing newlines or binary data. Usually you don't, for obvious reasons. But to be correct you need to handle that. (And when messing up logic in shell scripts I have had to remove file names with newlines, or leading dashes, etc. more than once. Coreutils cannot assume it operates on sane data.)
Yep, Unix
Note that the ICU4X collator can be expanded to support more encodings as needed. Currently it supports UTF-8 and UTF-16.
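For what that could look like with the non-UTF-8 file names discussed above, a sketch assuming the 1.x icu crate's compare_utf8 (which treats invalid sequences like the replacement character); the code is illustrative, not uutils code:

```rust
use std::ffi::OsString;
use std::os::unix::ffi::OsStrExt;

use icu::collator::{Collator, CollatorOptions};
use icu::locid::locale;

fn main() {
    let collator = Collator::try_new(&locale!("sv-SE").into(), CollatorOptions::new())
        .expect("collation data");

    // File names on Unix are arbitrary bytes; no UTF-8 validation needed up front.
    let mut names: Vec<OsString> = std::fs::read_dir(".")
        .expect("readable directory")
        .filter_map(|entry| entry.ok().map(|e| e.file_name()))
        .collect();

    names.sort_by(|a, b| collator.compare_utf8(a.as_bytes(), b.as_bytes()));
    println!("{names:?}");
}
```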
TL;DR: I want to add a new util for locale generation and provide locale-aware functionality in uucore.

uutils is currently following the C locale for most of its operations and the locale settings of the system are mostly ignored. This has led to issues and PRs like these:

- expr is failing with multibyte chars #3132

We've mostly been putting this off due to missing libraries in Rust, but recently, this has changed with the release of icu4x. It covers many of the things we need like locale-aware datetime formatting, locale-aware collation, etc.

However, it requires data to operate on, which is different from the usual data generated by locale-gen and friends (if I understand correctly). There are essentially 2 viable ways to include data with icu4x [1]:

- external data loaded at runtime (BlobDataProvider).
- data baked into the binary (BakedDataProvider).

Since we don't know up front what locales we might need, I think we need to use the BlobDataProvider and allow the user to generate their own locale data on command. So, I propose we do the following:

- Add a new util like locale-gen or something similar, based on the icu_datagen crate [2].
- Keep the data download optional: automatic downloads were a problem for uudoc back when it automatically downloaded examples. [3]
- Implement the locale handling in uucore as much as possible, so that the utils themselves don't have to bother with checking the right environment variables, loading the icu data, etc.
  - The LC_COLLATE, LC_ALL and LANG env vars need to be checked.
  - A sort/collate function in uucore that checks (and caches) the locale and performs the correct collation.

Do you see any problems with this approach? Are there alternatives we should explore first? (A rough sketch of what such a uucore collation helper could look like follows after the footnotes.)

Footnotes

1. They also have FsDataProvider which is meant for development only. ↩
2. This crate also has a CLI, but we need to tailor it for use with coreutils, by setting nicer defaults for our purpose. ↩
3. icu_datagen uses reqwest, which will lead to similar problems as in https://github.com/uutils/coreutils/pull/3184 ↩
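As referenced above, a rough sketch of what such a cached uucore collation helper might look like, combining the environment lookup and locale parsing sketched earlier in the thread; all names are hypothetical, and the icu 1.x constructors shown here change in 2.0:

```rust
use std::cmp::Ordering;
use std::env;
use std::sync::OnceLock;

use icu::collator::{Collator, CollatorOptions};
use icu::locid::Locale;

static COLLATOR: OnceLock<Option<Collator>> = OnceLock::new();

/// Build the collator once: LC_ALL > LC_COLLATE > LANG, empty values count as unset.
/// None means "C"/"POSIX" (or unusable data), i.e. keep today's byte-wise behaviour.
fn collator() -> &'static Option<Collator> {
    COLLATOR.get_or_init(|| {
        let value = ["LC_ALL", "LC_COLLATE", "LANG"]
            .into_iter()
            .filter_map(|name| env::var(name).ok())
            .find(|value| !value.is_empty())?;
        let base = value.split(|c| c == '.' || c == '@').next()?;
        if base == "C" || base == "POSIX" {
            return None;
        }
        let locale: Locale = base.replace('_', "-").parse().ok()?;
        Collator::try_new(&locale.into(), CollatorOptions::new()).ok()
    })
}

/// Locale-aware comparison with a fallback to plain byte order.
pub fn locale_cmp(a: &str, b: &str) -> Ordering {
    match collator() {
        Some(c) => c.compare(a, b),
        None => a.as_bytes().cmp(b.as_bytes()),
    }
}

fn main() {
    let mut v = vec!["zebra", "äpple"];
    v.sort_by(|a, b| locale_cmp(a, b));
    println!("{v:?}");
}
```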