Why MessageFormat needs a successor #84

Merged
merged 5 commits into from
Jul 20, 2020

Collaborator

@mihnita mihnita commented May 14, 2020

Issue #49

In preparation for the group meeting.

@mihnita mihnita requested a review from zbraniecki May 15, 2020 00:01
@mihnita
Collaborator Author

mihnita commented May 15, 2020

@zibi: thank you for offering to help.
I had you in mind since I started writing this :-)


If you look at the history of the file, I started with a list of bullets that were the same as the list of issues we already collected in GitHub.
But "filtered" through my opinionated view of things.
And that would not be very useful.


**Mandatory xkcd:**

[<img src='https://imgs.xkcd.com/comics/standards.png'>](https://xkcd.com/927/)
Collaborator

😂😂😂😂😂😂😂😂😂😂😂

@zbraniecki
Member

zbraniecki commented May 18, 2020

Going high-level first.

I like the "5 points" - it simplifies the logic and makes it easy to read "what they're addressing" and "what should I expect out of the outcome".

I don't like personal sentences in WG documents. Sentences like "I've started with..." read more like a personal take than a formal position of the WG.

Here's my summary of your outline:

  • Intro
    • MessageFormat is the Unicode API for software localization
    • It's a 20-year-old, well-designed, proven solution
    • Its design is optimized for the software development model of 20 years ago, and its shortcomings result in mixed reception and adoption by the industry.
    • Drawing on the current wave of software development (dynamic languages, modern UI frameworks, and new forms of user interaction such as voice and VR) and on the lessons learned from MessageFormat, we aim to design the next iteration of MessageFormat, suitable for the current generation of software and for adoption by Web Standards.
  • Core problems
    • Non-extensible syntax
    • Every "feature" encoded in the data model
    • Limited specification and tooling around semantic features
    • Designed for imperative APIs only

I removed pt (2) from your listing because it doesn't feel like a problem that requires fixing via a new iteration - we could fix it just by adding a test suite. Let me know if you agree.

On the other hand, I see a number of MF shortcomings that are not captured here:

  • MF syntax makes it hard to recover from resolution errors, which makes the system less resilient and limits fallbacking.
  • MF doesn't provide any meta-data (semantic comments, localizer comments, string versioning etc.)
  • Limited multi-variant messages (I'm not sure how to phrase it, but it seems that we aim to provide more generic ability to provide variants of the same message that depend on some information, while MF is designed to allow for variants that are tied to basically PLURAL since custom formatters are deprecated and never got any adoption)
  • Multiline strings are not well supported (not sure if the handling of whitespace in multilines is actually specified well, for example for HTML fragments)
  • Modern formatters enable much richer/better integration that MF wasn't designed for, and with mixed formatting control

I trust you on selection of those, so feel free to incorporate or reject any and all of the above.

Finally, I'm wondering if this is the right place to extract the conclusion from the shortcoming listed by you about "hard to add/remove features" into some form of "L10n systems are hard to change, they stay for decades as software changes. Encoding every feature deep into the data model and syntax does not stand the test of time. In conclusion we believe that designing a more generic data model on top of which we can develop features that can be extended and even deprecated as the time goes without affecting low-level foundational tooling is critical for the longevity of the system".
Do you think it's the consensus or am I projecting my personal bias here? :)

@mihnita
Collaborator Author

mihnita commented May 20, 2020

> I don't like personal sentences in WG documents. Sentences like "I've started with..." read more like a personal take than a formal position of the WG.

Thanks, I'll do that
(For a long time I've been told that I am "too forceful", and I learned to "sprinkle" a lot of "I think", "in my opinion" and so on :-)

I'll go through your list and refactor, but I can't take your full suggestion "as is".

There are things that are not addressed because they are not MessageFormat issues, or are already included, or can be fixed in the existing form, no need for a new version:


>   • MF syntax makes it hard to recover from resolution errors, which makes the system less resilient and limits fallbacking.

There is no fallback or resolution, because "Designed to be API only", so it is already captured.

The loading of the string (including resolution / fallback) is the responsibility of some other component. You load the string (using the Android ResourceManager, Java Resources, Windows / MacOS APIs, C/C++ gettext), then pass it to the MessageFormat class.

This allows one to use ICU without being forced to migrate the whole resource resolution out of the OS (making the resolution results inconsistent in the process).

Add this to the "detailed notes" of the bullet?
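
The division of labor described above (load the string with whatever the platform offers, then hand it to the formatter) could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the JDK's `java.text.MessageFormat` as a stand-in for ICU's class, and the `Messages` bundle, the `files.deleted` key, and the `format` helper are hypothetical names invented for the example.

```java
import java.text.MessageFormat;
import java.util.ListResourceBundle;
import java.util.Locale;
import java.util.ResourceBundle;

class LoadThenFormat {
    // Hypothetical in-memory bundle standing in for the platform's string store
    // (Android resources, .properties files, gettext catalogs, ...).
    public static class Messages extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                {"files.deleted", "{0} deleted {1,number,integer} files."}
            };
        }
    }

    // Step 1: the resource layer resolves the key (locale fallback lives there).
    // Step 2: the formatter only ever sees the already-loaded pattern string.
    public static String format(ResourceBundle bundle, String key, Locale locale, Object... args) {
        String pattern = bundle.getString(key);
        return new MessageFormat(pattern, locale).format(args);
    }

    public static void main(String[] args) {
        ResourceBundle bundle = new Messages();
        System.out.println(format(bundle, "files.deleted", Locale.ENGLISH, "Ana", 3));
    }
}
```

Note that the formatter never learns where the pattern came from, which is why fallback and metadata end up being concerns of the storage layer in this design.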


>   • MF doesn't provide any meta-data (semantic comments, localizer comments, string versioning etc.)
>   • Multiline strings are not well supported (not sure if the handling of whitespace in multilines is actually specified well, for example for HTML fragments)

Same for both items, addressed by "Designed to be API only", not a complete solution.

You can use whatever metadata your native string storage mechanism supports.
MessageFormat is agnostic in that respect and treats this as "out of scope".
If I have a way to add metadata (comments, examples, length limits) to non-ICU messages in "strings storage", then I am free to use it for ICU messages too.
Same for multiline.
One can say that "Java properties are not multiline friendly"

Whitespace and multiline handling can't be specified well in HTML fragments to begin with; MessageFormat is neither better nor worse in this respect.
If my fragment is used inside a <pre> or a div with white-space set to nowrap / pre / pre-line / pre-wrap then I get different behavior. If my fragment is used in 2 different contexts, it might end up rendered differently.

So MessageFormat is again agnostic here. Solving multiline whitespace is the problem of the string loading mechanism, or of the rendering part.

And all 3 of the issues above can be solved without a new version of MessageFormat; we "only" need to add a "MessageFormat store", with resolution + fallback.

But the price is loss of flexibility: now as an iOS / Android / Qt developer I need to either move ALL my strings to a "MF store", with its own resolution / fallback / metadata, or keep non MF strings in a "native store" and MF strings in a "MF store" (with the huge risk that the resolution is different)


>   • Limited multi-variant messages (I'm not sure how to phrase it, but it seems that we aim to provide more generic ability to provide variants of the same message that depend on some information, while MF is designed to allow for variants that are tied to basically PLURAL since custom formatters are deprecated and never got any adoption)
>   • Modern formatters enable much richer/better integration that MF wasn't designed for, and with mixed formatting control

Agree with both.
But I've considered that to be captured in "Does not have any “extension points”"
We can split it in a separate point, or add this clarification in the explanation of the bullet?


> Finally, I'm wondering if this is the right place to extract the conclusion from the shortcoming listed by you about "hard to add/remove features" into some form of "L10n systems are hard to change, they stay for decades as software changes. Encoding every feature deep into the data model and syntax does not stand the test of time. In conclusion we believe that designing a more generic data model on top of which we can develop features that can be extended and even deprecated as the time goes without affecting low-level foundational tooling is critical for the longevity of the system".

I agree with this, and I'll see how I take this and use it.
Most likely not a bullet, but probably in the explanation section of "Hard to map to the existing localization core structures"


> Do you think it's the consensus or am I projecting my personal bias here? :)

As you might have noticed in my answers above, there is some bias (probably yours and mine :-)
Let me "sleep on it" a bit and I'll see what we get.

@stasm
Collaborator

stasm commented May 20, 2020

Would it be a good idea to enumerate in more detail the design mistakes which you briefly touch on in point 3? Things like positional and named parameters, escaping, etc. I feel like each would benefit from its own bullet point with an example and the rationale about why we think they didn't work out.

@mihnita
Collaborator Author

mihnita commented Jun 11, 2020

> Would it be a good idea to enumerate in more detail the design mistakes which you briefly touch on in point 3?

I'm afraid that would make this document too big.

There is an older Google Doc (public, shared) where I listed what I like / dislike about MessageFormat. I can link to it. Or I can convert it to a markdown file and post it somewhere (where?)
That document is "kind of opinionated", it is not necessarily the position of the working group.

But if others agree to include all that here I am also OK to include some of that doc here.

@mihnita
Collaborator Author

mihnita commented Jun 11, 2020

Changed things quite a bit. I've tried to capture / address all of your points, even if that didn't change the bullets too much.

“de facto reference-implementations”, and the ports to other languages
(JavaScript, Go, Dart, etc.) are at risk for being “slightly incompatible”

### 3. Can't remove anything, even if we now know better
Collaborator

Not sure if this is related to ICU itself or more to its process? But it's a good reminder that we should avoid falling back into the same situation (e.g. by supporting versioning).


### 4. Hard to map to the existing localization core structures

The format is not supported by any major localization system that I know of. \
Collaborator

Do you mean TMS here? and can you elaborate a bit more around what you would expect in terms of support?

Collaborator Author

@mihnita mihnita Jun 15, 2020

The term was "intentionally vague", since I didn't want to get into debates about TMS vs GMS vs CAT vs whatever-else-abbreviations-people-use.

Support means most of the stuff people do with other formats: extract (with unescape), segment, leverage, translate protecting codes, validate, merge (with proper escape).

I don't count "take the file as is, with curly brackets and all, and let translators edit it raw" as "support".
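
To make "translate protecting codes" and "merge" from the list above a bit more concrete, here is a minimal sketch of what a tool might do with a simple message. Everything in it is an assumption made for illustration: the `PlaceholderProtector` class, the `<ph id="N"/>` token shape, and the choice to hide each top-level `{...}` argument as a whole. It deliberately ignores apostrophe quoting, so it is not a correct MessageFormat parser.

```java
import java.util.ArrayList;
import java.util.List;

class PlaceholderProtector {
    /** Text that is safe to show a translator, plus the codes that were hidden. */
    static final class ProtectedMessage {
        final String text;
        final List<String> codes;
        ProtectedMessage(String text, List<String> codes) {
            this.text = text;
            this.codes = codes;
        }
    }

    // Extract: replace every top-level {...} argument with an opaque token.
    // Simplification (assumption): apostrophe quoting is not handled.
    static ProtectedMessage protect(String message) {
        StringBuilder out = new StringBuilder();
        List<String> codes = new ArrayList<>();
        int depth = 0;
        int start = -1;
        for (int i = 0; i < message.length(); i++) {
            char c = message.charAt(i);
            if (c == '{') {
                if (depth == 0) start = i;   // remember where the argument begins
                depth++;
            } else if (c == '}' && depth > 0) {
                depth--;
                if (depth == 0) {            // argument closed: hide it behind a token
                    codes.add(message.substring(start, i + 1));
                    out.append("<ph id=\"").append(codes.size() - 1).append("\"/>");
                }
            } else if (depth == 0) {
                out.append(c);               // plain translatable text
            }
        }
        if (depth != 0) throw new IllegalArgumentException("Unbalanced braces: " + message);
        return new ProtectedMessage(out.toString(), codes);
    }

    // Merge: put the original codes back into the translated text.
    static String restore(String translated, List<String> codes) {
        String result = translated;
        for (int i = 0; i < codes.size(); i++) {
            result = result.replace("<ph id=\"" + i + "\"/>", codes.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        ProtectedMessage p = protect("Hello {0}, you have {1,number} messages.");
        System.out.println(p.text);
        System.out.println(restore("Bonjour <ph id=\"0\"/>, vous avez <ph id=\"1\"/> messages.", p.codes));
    }
}
```

A sketch like this breaks down as soon as an argument contains translatable text (for example, the branches of a plural or select), because hiding the whole argument hides that text from the translator as well. That gap is part of what separates real support from handling the message as a raw string.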

Member

@zbraniecki zbraniecki left a comment

This looks good. I agree with all points and I think it captures my mental model really well. Thank you!

are doing it in ICU itself.
It also means most tools used to process these messages are built rigidly,
and are unprepared to handle changes
(think localization tools, linters, friendly UIs, etc.).
Member

I'd maybe add that, from the experience with Fluent, the fact that almost every feature in MF is a syntax feature means that on the very lowest level (AST, parser) every extension basically breaks the existing functionality, because the file can no longer be parsed and its AST operated on.

Fluent's AST is more generic which means that more of the functionality falls on higher levels of abstraction and in fact, over the last 3 years, all accrued "wants" can be implemented either on the higher level (Semantic Comments, Dynamic References etc.) or require additive syntax relaxations (flatten selectors, rich overlays).
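
One way to picture the difference between a "syntax feature" and a feature that lives at a higher level is a data model in which a placeholder carries its formatting function as plain data. The sketch below is purely hypothetical: it is not Fluent's AST and not a proposal from this thread, and the names `Part`, `Text`, `Placeholder`, and `REGISTRY` are invented for the illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

class GenericModelSketch {
    // A message is a flat list of parts; tools can walk it without knowing any function.
    interface Part {}

    static final class Text implements Part {
        final String value;
        Text(String value) { this.value = value; }
    }

    static final class Placeholder implements Part {
        final String argName;
        final String functionName; // plain data, may be null or unknown to this runtime
        Placeholder(String argName, String functionName) {
            this.argName = argName;
            this.functionName = functionName;
        }
    }

    // Functions are registered at runtime instead of being baked into the grammar.
    static final Map<String, Function<Object, String>> REGISTRY = new HashMap<>();

    static String format(List<Part> parts, Map<String, Object> values) {
        StringBuilder sb = new StringBuilder();
        for (Part part : parts) {
            if (part instanceof Text) {
                sb.append(((Text) part).value);
            } else {
                Placeholder ph = (Placeholder) part;
                Object value = values.get(ph.argName);
                Function<Object, String> fn =
                    ph.functionName == null ? null : REGISTRY.get(ph.functionName);
                // Unknown functions degrade to plain string conversion instead of a parse error.
                sb.append(fn != null ? fn.apply(value) : String.valueOf(value));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        REGISTRY.put("upper", v -> String.valueOf(v).toUpperCase(Locale.ROOT));
        List<Part> parts = Arrays.asList(
            new Text("Hi "), new Placeholder("name", "upper"),
            new Text("! "), new Placeholder("n", "futureFn"));
        Map<String, Object> values = new HashMap<>();
        values.put("name", "ana");
        values.put("n", 3);
        System.out.println(format(parts, values));
    }
}
```

In a model along these lines, adding or deprecating a function changes the registry, not the parser, so a linter or localization tool built against the tree keeps working when it meets a function it has never seen.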

@@ -0,0 +1,146 @@
# Why `MessageFormat` needs a successor ([issue #49](https://github.com/unicode-org/message-format-wg/issues/49))
Collaborator

In my opinion, the issue number should be placed alongside the article in a "more info", "links", or "discussions" section, or in the intro while giving context.

7 participants