diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md index d1399137fb..9255007fb5 100644 --- a/.github/ISSUE_TEMPLATE/feature_request.md +++ b/.github/ISSUE_TEMPLATE/feature_request.md @@ -2,7 +2,7 @@ name: Feature request about: Suggest an idea or feature for Message Format title: '' -labels: '' +labels: Preview-Feedback assignees: '' --- diff --git a/.github/ISSUE_TEMPLATE/feedback.md b/.github/ISSUE_TEMPLATE/feedback.md new file mode 100644 index 0000000000..3d807e4082 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/feedback.md @@ -0,0 +1,10 @@ +--- +name: Feedback +about: Use this template to enter feedback on the MessageFormat part of LDML +title: "[FEEDBACK] " +labels: Feedback +assignees: '' + +--- + +The Working Group is looking for implementation reports, success stories, problems encountered, suggestions for improvements, and errata. diff --git a/.github/ISSUE_TEMPLATE/tech-preview-feedback.md b/.github/ISSUE_TEMPLATE/tech-preview-feedback.md deleted file mode 100644 index c762047891..0000000000 --- a/.github/ISSUE_TEMPLATE/tech-preview-feedback.md +++ /dev/null @@ -1,11 +0,0 @@ ---- -name: Tech Preview Feedback -about: Use this template to enter feedback on the Tech Preview (LDML45) release of - MF2 -title: "[FEEDBACK] " -labels: Preview-Feedback -assignees: '' - ---- - - diff --git a/.github/workflows/validate_tests.yml b/.github/workflows/validate_tests.yml new file mode 100644 index 0000000000..beb4ee2948 --- /dev/null +++ b/.github/workflows/validate_tests.yml @@ -0,0 +1,27 @@ +name: Validate test data + +on: + push: + branches: + - main + paths: + - test/** + pull_request: + paths: + - test/** + +jobs: + run_all: + name: Validate tests using schema + runs-on: ubuntu-latest + steps: + - name: Checkout repo + uses: actions/checkout@v4 + - name: Install CLI tool for JSON Schema validation + run: npm install --global ajv-cli + - name: Validate tests using the latest schema version + run: > + ajv validate 
--spec=draft2020 --allow-union-types + -s $(ls -1v schemas/*/*schema.json | tail -1) + -d 'tests/**/*.json' + working-directory: ./test diff --git a/.gitignore b/.gitignore index e617da4486..e053c35522 100644 --- a/.gitignore +++ b/.gitignore @@ -1,3 +1,3 @@ -.DS_Store +.* node_modules/ package-lock.json diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index d28236c057..1b2bb58bf5 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,13 +1,6 @@ # Contributing to this project -## Joining the Working Group - -We are looking for participation from software developers, localization engineers and others with experience -in Internationalization (I18N) and Localization (L10N). If you wish to contribute to this work, please review -the information on the Contributor License Agreement below. In addition, you should: - -1. Apply to join our [mailing list](https://groups.google.com/a/chromium.org/forum/#!forum/message-format-wg) -2. Watch this repository (use the "Watch" button in the upper right corner) +To join this Working Group, please read the information in the [README.md](./README.md) as well as the Contributor License Agreement information just below: diff --git a/README.md b/README.md index 102a51f868..fc9c099ea4 100644 --- a/README.md +++ b/README.md @@ -4,135 +4,74 @@ Welcome to the home page for the MessageFormat Working Group, a subgroup of the ## Charter -The Message Format Working Group (MFWG) is tasked with developing an industry standard for the representation of localizable message strings to be a successor to [ICU MessageFormat](https://unicode-org.github.io/icu/userguide/format_parse/messages/). MFWG will recommend how to remove redundancies, make the syntax more usable, and support more complex features, such as gender, inflections, and speech. MFWG will also consider the integration of the new standard with programming environments, including, but not limited to, ICU, DOM, and ECMAScript, and with localization platform interchange. 
The output of MFWG will be a specification for the new syntax. - -- [Why ICU MessageFormat Needs a Successor](docs/why_mf_next.md) -- [Goals and Non-Goals](docs/goals.md) - -## MessageFormat 2 Technical Preview - -The MessageFormat 2 specification is a new part of -the [LDML](https://www.unicode.org/reports/tr35/) specification. -This specification is initially released as a "Tech Preview", -which means that the stability policy is not in effect and feedback from -users and implementers might result in changes to the syntax, data model, -functions, or other normative aspects of MessageFormat 2. -Such changes are expected to be minor and, to the extent possible, -to be compatible with what is defined in the Tech Preview. - -The MFWG welcomes any and all feedback, including bugs reports, implementation -reports, success stories, feature requests, requests for clarification, -or anything that would be helpful in stabilizing the specification and +The MessageFormat Working Group (MFWG) is tasked with developing and supporting an industry standard +for the representation of localizable message strings. +MessageFormat is designed to support software developers, translators, and end users with fluent messages +and locally-adapted presentation for data values +while providing a framework for increasingly complex features, such as gender, inflections, and speech. +Our goal is to provide an interoperable syntax, message data model, and associated processing that is +capable of being adopted by any presentation framework or programming environment. + +## The Unicode MessageFormat Standard + +The [Unicode MessageFormat Standard](./spec/) is a stable part of CLDR. +It was approved by the CLDR Technical Committee +and is recommended for implementation and adoption. +The normative version of the specification is published as a part of [TR35](https://www.unicode.org/reports/tr35/). +This repository contains the editor's copy. 
+ +**Unicode MessageFormat** is sometimes referred to as _MessageFormat 2.0_, +since it replaces earlier message formatting capabilities built into ICU. + +Some _default functions_ and items in the `u:` namespace are still in Draft status. +Feedback from users and implementers might result in changes to these capabilities. + +The MessageFormat Working Group and CLDR Technical Committee welcome any and all feedback, +including bug reports, +implementation reports, +success stories, +feature requests, +requests for clarification, +or anything that would be helpful in supporting or enhancing the specification and promoting widespread adoption. -The MFWG specifically requests feedback on the following issues: -- How best to define value resolution [#678](https://github.com/unicode-org/message-format-wg/issues/678) -- How to perform non-integer exact number selection [#675](https://github.com/unicode-org/message-format-wg/issues/675) -- Whether `markup` should support additional spaces [#650](https://github.com/unicode-org/message-format-wg/issues/650) -- Whether "attribute-like" behavior is needed and what form it should take [#642](https://github.com/unicode-org/message-format-wg/issues/642) -- Whether to relax constraints on complex message start [#610](https://github.com/unicode-org/message-format-wg/issues/610) -- Whether omitting the `*` variant key should be permitted [#603](https://github.com/unicode-org/message-format-wg/issues/603) - -## What is MessageFormat 2? - -Software needs to construct messages that incorporate various pieces of information. -The complexities of the world's languages make this challenging. -MessageFormat 2 defines the data model, syntax, processing, and conformance requirements -for the next generation of dynamic messages. -It is intended for adoption by programming languages, software libraries, and software localization tooling. 
-It enables the integration of internationalization APIs (such as date or number formats), -and grammatical matching (such as plurals or genders). -It is extensible, allowing software developers to create formatting -or message selection logic that add on to the core capabilities. -Its data model provides a means of representing existing syntaxes, -thus enabling gradual adoption by users of older formatting systems. - -The goal is to allow developers and translators to create natural-sounding, grammatically-correct, -user interfaces that can appear in any language and support the needs of diverse cultures. - -## MessageFormat 2 Specification and Syntax - -The current specification starts [here](spec/README.md) and may have changed since the publication -of the Tech Preview version. -The Tech Preview specification is [here](tr35-messageformat.md) (link to follow). - -The current draft syntax for defining messages can be found in [spec/syntax.md](./spec/syntax.md). -The syntax is formally described in [ABNF](spec/message.abnf). - -Messages can be simple strings: - - Hello, world! - -Messages can interpolate arguments: - - Hello {$user}! - -Messages can transform those arguments using _formatting functions_. -Functions can optionally take _options_: - - Today is {$date :datetime} - Today is {$date :datetime weekday=long}. 
- -Messages can use a _selector_ to choose between different _variants_, -which correspond to the grammatical (or other) requirements of the language: - - .match {$count :integer} - 0 {{You have no notifications.}} - one {{You have {$count} notification.}} - * {{You have {$count} notifications.}} - -Messages can annotate arguments with formatting instructions -or assign local values for use in the formatted message: - - .input {$date :datetime weekday=long month=medium day=short} - .local $numPigs = {$pigs :integer} - {{On {$date} you had this many pigs: {$numPigs}}} - -The message syntax supports using multiple _selectors_ and other features -to build complex messages. -It is designed so that implementations can extend the set of functions or their options -using the same syntax. -Implementations may even support users creating their own functions. - -See more examples and the formal definition of the grammar in [spec/syntax.md](./spec/syntax.md). - -## Normative Changes during Tech Preview - -The Working Group continues to address feedback -and develop portions of the specification not completed for the LDML45 Tech Preview release. -The `main` branch of this repository contains changes implemented since the technical preview. - -Implementers should be aware of the following normative changes during the tech preview period: -- _(list to be updated during tech preview)_ - -## Implementations - -(The working group expects that ICU75 will include both Java and C/C++ implementations of the tech preview specification) - -- Java: [`com.ibm.icu.message2`](https://unicode-org.github.io/icu-docs/apidoc/dev/icu4j/index.html?com/ibm/icu/message2/package-summary.html), part of ICU 72 released in October 2022, is a _tech preview_ implementation of the MessageFormat 2 syntax, together with a formatting API. See the [ICU User Guide](https://unicode-org.github.io/icu/userguide/format_parse/messages/mf2.html) for examples and a quickstart guide. 
-- JavaScript: [`messageformat`](https://github.com/messageformat/messageformat/tree/master/packages/mf2-messageformat) 4.0 implements the MessageFormat 2 syntax, together with a polyfill of the runtime API proposed for ECMA-402. - ## Sharing Feedback -Technical Preview Feedback: [file an issue here](https://github.com/unicode-org/message-format-wg/issues/new?labels=Preview-Feedback&projects=&template=tech-preview-feedback.md&title=%5BFEEDBACK%5D+) +Do you have feedback on the specification or any of its elements? [File an issue here](https://github.com/unicode-org/message-format-wg/issues/new?labels=Preview-Feedback&projects=&template=tech-preview-feedback.md&title=%5BFEEDBACK%5D+) -We invite feedback about the current syntax draft, as well as the real-life use-cases, requirements, tooling, runtime APIs, localization workflows, and other topics. +We invite feedback about implementation difficulties, +proposed functions or options, +real-life use-cases, +requirements for future work, +tooling, +runtime APIs, +localization workflows, +and other topics. - General questions and thoughts → [post a discussion thread](https://github.com/unicode-org/message-format-wg/discussions). - Actionable feedback (bugs, feature requests) → [file a new issue](https://github.com/unicode-org/message-format-wg/issues). -## Participation +## Participation / Joining the Working Group -To join in: +We are looking for participation from software developers, localization engineers and others with experience +in Internationalization (I18N) and Localization (L10N). +If you wish to contribute to this work, please review the information about the Contributor License Agreement below. -1. Review [CONTRIBUTING.md](./CONTRIBUTING.md) -2. Apply to join our [mailing list](https://groups.google.com/a/chromium.org/forum/#!forum/message-format-wg) -3. Watch this repository (use the "Watch" button in the upper right corner) +To follow this work: +1. 
Apply to join our [mailing list](https://groups.google.com/a/chromium.org/forum/#!forum/message-format-wg) +2. Watch this repository (use the "Watch" button in the upper right corner) -### Copyright & Licenses +To contribute to this work, in addition to the above: +1. Each individual MUST have a copy of the CLA on file. See below. +2. Individuals who are employees of Unicode Member organizations SHOULD contact their member representative. + Individuals who are not employees of Unicode Member organizations MUST contact the chair to request Invited Expert status. + Employees of Unicode Member organizations MAY also apply for Invited Expert status, + subject to approval from their member representative. -Copyright © 2019-2024 Unicode, Inc. Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries. +### Copyright & Licenses -The project is released under [LICENSE](./LICENSE). +Copyright © 2019-2025 Unicode, Inc. Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries. A CLA is required to contribute to this project - please refer to the [CONTRIBUTING.md](./CONTRIBUTING.md) file (or start a Pull Request) for more information. + +The contents of this repository are governed by the Unicode [Terms of Use](https://www.unicode.org/copyright.html) and are released under [LICENSE](./LICENSE). diff --git a/delegates.md b/delegates.md index 2452e4fc42..e0254e08ef 100644 --- a/delegates.md +++ b/delegates.md @@ -19,6 +19,10 @@ Please include your primary affiliation (e.g., the company you represent or wher ## Delegate List +This list is not in any way official. +Please sign the CLA to participate. +Contributions to specification language are acknowledged in the specification. 
+ ### Current - Addison Phillips - Unicode (APP) @@ -29,6 +33,7 @@ Please include your primary affiliation (e.g., the company you represent or wher - Jan Mühlemann - Locize (JMU) - Janne Tynkkynen - PayPal (JMT) - Jeff Genovy - Microsoft (JMG) +- Harmit Goswamy - Mozilla (HGO) - John Watson - Facebook (JRW) - Long Ho - Dropbox (LHO) - Mick Monaghan - Guidewire - (MMN) diff --git a/docs/checklist-for-pourover-creation.md b/docs/checklist-for-pourover-creation.md new file mode 100644 index 0000000000..7bf3adafd9 --- /dev/null +++ b/docs/checklist-for-pourover-creation.md @@ -0,0 +1,76 @@ +# Notes on How to Create a Pour Over + +Being a compendium of tasks needed to get a clean pour over of the spec. + +> [!IMPORTANT] +> This is a work in progress. Do not believe anything you read in this page. +> If you are reading this, you're probably in the wrong place. + +- get a JIRA ticket for the pour over +- update forkmessage-format-wg +- cldr works on branches of main, not in forks, so pull cldr and checkout a new branch to work in + for the pour over (e.g. CLDR-18323-message-format-v47-pour) +- check that the header (see below) is in place +- insert the spec parts over the contents of part 9 (tr35-messageFormat.md under docs/ldml) + - remove subsidiary TOCs (from the README.md files in subdirectory parts and in intro.md) + - as you go, change all cross-document links to local + - some links are to spec/ and some are to .md; + generally you can replace `filename.md` with `#filename`, although the README ones are tricksy + - change the many links to message.abnf to a section link + - change the link to message.json to a section link + - make a ## section for the message.abnf and insert with abnf backticks + - make a ### section of message.json and insert with json backticks + - altogether remove the why_mf_next link +- check the toc. 
@srl295 made a change so that `message.json` and `message.abnf` linkify automagically (as `#messagejson` and `#messageabnf`) + so there should be no need to touch the autogenerated stuff. + If you need to generate a TOC by hand (unlikely) try https://bitdowntoc.derlin.ch/ + but the tr-archive tool generates a TOC under dist, so use that preferably + +- use `bash make-tr-archive.sh` to generate + +- use the tools/scripts/tr-archive tools to generate the HTML + instructions in that location in the CLDR repo + +- use `npm run serve` to view the HTML output locally + +- git add/git commit/git push + +> [!IMPORTANT] +> Be sure to make all commits in the CLDR style: +> `CLDR-jiranum description` + +- Create a release in the message-format-wg repo + + +--- + +> [!NOTE] +> Below is the markdown for the header + +``` +--- +linkify: true +--- +## Unicode Technical Standard \#35 + +# Unicode Locale Data Markup Language (LDML)
Part 9: MessageFormat + +|Version|47 (draft) | +|-------|------------------------| +|Editors|Addison Phillips and [other CLDR committee members](tr35.md#Acknowledgments)| + +For the full header, summary, and status, see [Part 1: Core](tr35.md). + +### _Summary_ +``` + +--- + +> [!NOTE] +> Below is the markdown for the footer + +``` +* * * +© 2001–2025 Unicode, Inc. +This publication is protected by copyright, and permission must be obtained from Unicode, Inc. +``` diff --git a/docs/goals.md b/docs/goals.md index aa954e30bd..14caeed234 100644 --- a/docs/goals.md +++ b/docs/goals.md @@ -39,7 +39,7 @@ The design goals are listed below. escape sequences, whitespace, markup, as well as parsing errors. 3. A specification for a one-to-one mapping between the data model and XLIFF. - _Note: not part of the LDML45 release._ + _Note: This deliverable is not included in the LDML46.1 Final Candidate release._ 4. A specification for resolving messages at runtime, including runtime errors. diff --git a/docs/tech-preview-blog-post.md b/docs/tech-preview-blog-post.md new file mode 100644 index 0000000000..1f5ea8424c --- /dev/null +++ b/docs/tech-preview-blog-post.md @@ -0,0 +1,128 @@ +# Blog Post for Technical Preview + +Today, Unicode announced the Technical Preview of MessageFormat 2, +a new standard for creating and managing user interface strings. +These messages can dynamically include data values formatted +(using information in the Common Locale Data Repository [CLDR]) +according to the needs of the language and culture of the end user. +Such messages can be adjusted to meet the linguistic needs of each +language and are designed to be translated easily and efficiently. + +Previously, software developers had to choose between many different +APIs and templating languages to build user interface strings. +These solutions did not always provide for the features of different +human languages. 
Support was limited to specific platforms +and these formats were not widely supported by translation tools, +making translation and adaptation to specific cultures costly +and time consuming. +Most significantly, message formatting was limited to a small +number of built-in formats. + +One of the challenges in adapting software to work for +users with different languages and cultures is the need for **_dynamic messages_**. +Whenever a user interface needs to present data as part of a larger message, +that data needs to be formatted. +In many languages, including English, the message itself needs to be altered +to make it grammatically correct. + +For example, if a message in English might read: + +> Your item had **1,023** views on **April 8, 2024**. + +The equivalent message in French might read: + +> Votre article a eu **1 023** vues le **8 avril 2024**. + +Or Japanese: + +> あなたのアイテムは **2024 年 4 月 8 日**に **1,023** 回閲覧されました。 + +But even in English, there are grammatical variations required: + +> Your item had _no views_... +> +> Your item had 1 _view_... +> +> Your item had 1,043 _views_... + +Once messages have been created, they need to be translated into the various +languages and adapted for the various cultures around the world. +Previously, there was no widely adopted standard, +and existing formats provided only rudimentary support for managing +the variations needed by other languages. +Thus, it could be difficult for translators to do their work effectively. + +For example, the same message shown above needs a different set of variations +in order to support Polish: + +> Twój przedmiot nie _ma_ żadnych _wyświetleń_. +> +> Twój przedmiot _miał_ 1 _wyświetlenie_. +> +> Twój przedmiot _miał_ 2 _wyświetlenia_. +> +> Twój przedmiot _ma_ 5 _wyświetleń_. + + +MessageFormat 2 makes it easy to write messages like this +without developers needing to know about such language variation. 
+In fact, developers don't need to learn about any of the language +and formatting variations needed by languages other than their own +nor write code that manipulates formatting. + +MessageFormat 2 messages can be simple strings: +``` + Hello, world! +``` + +A message can also include _placeholders_ that are replaced by user-provided values: +``` + Hello {$user}! +``` + +The user-provided values can be transformed or formatted using functions: +``` + Today is {$date :date} + Today is {$date :datetime weekday=long}. +``` + +Messages can use a function (called a _selector_) to choose between +different versions of a message. +These allow messages to be tailored to the grammatical (or other) requirements of +a given language: +``` + .match {$count :integer} + 0 {{You have no views.}} + one {{You have {$count} view.}} + * {{You have {$count} views.}} +``` + +Unlike the previous version of MessageFormat, MessageFormat 2 is designed for +extension by implementers and even end users. +This means that new functionality can be added to messages without modifying +either existing messages or, in some cases, even the core library containing the +MessageFormat 2 code. + +MessageFormat 2 provides a rich and extensible set of functionality +to permit the creation of natural-sounding, grammatically-correct, +messages, while enabling rapid, accurate translation +and extension using new and improved internationalization functionality +in any computing system. + +The Technical Preview is available for comment. +The stable version of this specification is expected to be part of the +Fall 2024 release of CLDR (v46). +Implementations are available in ICU4J (Java) and ICU4C (C/C++) +as well as JavaScript. +Feedback about implementation experience, +syntax, +functionality, +or other parts of the specification is welcome! +See the end of this article for details on participation and how to comment on this work. 
+ +MessageFormat 2 consists of multiple parts: +a syntax, including a formal grammar, for writing messages; +a data model for representing messages (including those ported from other APIs); +a registry of required functions; +a function description mechanism for use by implementations and tools; +and a test suite. diff --git a/docs/tools/linkify.js b/docs/tools/linkify.js new file mode 100644 index 0000000000..770134fbfe --- /dev/null +++ b/docs/tools/linkify.js @@ -0,0 +1,50 @@ +// Work in progress: tooling to linkify the HTML produced from +// the MessageFormat 2 markdown. +// this has been tested on the tr35-messageformat.html file +// but not implemented in LDML45 +function linkify() { + const terms = findTerms(); + const missing = new Set(); + const links = document.querySelectorAll("em"); + links.forEach((item) => { + const target = generateId(item.textContent); + if (terms.has(target)) { + const el = item.lastElementChild ?? item; + el.innerHTML = `<a href="#${target}">${item.textContent}</a>`; + } else { + missing.add(target); + } + }); + // report missing terms + // (leave out sort if you want it in file order) + Array.from(missing).sort().forEach((item)=> { + console.log(item); + }); +} + +function findTerms() { + const terms = new Set(); + document.querySelectorAll("dfn").forEach((item) => { + // console.log(index + ": " + item.textContent); + const term = generateId(item.textContent); + // guard against duplicates + if (terms.has(term)) { + console.log("Duplicate term: " + term); + } + terms.add(term); + item.setAttribute("id", term); + }); + return terms; +} + +function generateId(term) { + const id = term.toLowerCase().replaceAll(" ", "-"); + if (id.endsWith("rategies")) { + // found in the bidi isolation strategies + return id.slice(0, -3) + "y"; + } else if (id.endsWith("s") && id !== "status") { + // regular English plurals + return id.slice(0, -1); + } + return id; +} diff --git a/docs/why_mf_next.md b/docs/why_mf_next.md index d7a7f26c7b..b03152a1f0 100644 --- 
a/docs/why_mf_next.md +++ b/docs/why_mf_next.md @@ -1,24 +1,21 @@ # Why `MessageFormat` needs a successor ([issue #49](https://github.com/unicode-org/message-format-wg/issues/49)) -Check out the [YouTube video](https://www.youtube.com/watch?v=-DlS6KNopoU) -of the Unicode Technical Workshop (UTW) +Check out the [YouTube video](https://www.youtube.com/watch?v=4jucYXE42_s) +of the Unicode Technical Workshop 2024 (UTW) presentation about MessageFormat 2.0 which includes a discussion of why MessageFormat is important and why MessageFormat 2.0 is needed. ## Intro -The `MessageFormat` API and syntax have been around for a long time. - -Intro +The `MessageFormat` API and syntax have been around for a long time: - `MessageFormat` is the Unicode API for software localization -- It is 20 years old, well designed, proven solution - Its design was optimized for the software development model - of twenty years ago. - Implementers, developers, and translators struggle with its shortcomings. +- It is 20 years old and is a well-designed, proven solution -The current wave of software development uses dynamic languages, modern UI -frameworks and new forms of user interactions (voice, VR etc.). +However, its design was optimized for the software development model of twenty +years ago. Implementers, developers, and translators struggle with its +shortcomings. The current wave of software development uses dynamic languages, +modern UI frameworks, and new forms of user interactions (voice, VR etc.). Considering these new challenges, combined with the lessons learned from using `MessageFormat`, we aim to design the next iteration of `MessageFormat` diff --git a/exploration/bidi-usability.md b/exploration/bidi-usability.md new file mode 100644 index 0000000000..49bfcc1aac --- /dev/null +++ b/exploration/bidi-usability.md @@ -0,0 +1,646 @@ +# Bidi Usability + +Status: **Proposed** + +
+Metadata:
+
+- Contributors: @aphillips, @eemeli
+- First proposed: 2024-03-27
+- Pull Requests: #754, #781
+ +## Objective + +_What is this proposal trying to achieve?_ + +The MessageFormat 2 syntax uses whitespace as a required delimiter +as well as permitting the use of whitespace to make _messages_ easier to read. +In addition, a _message_ can include bidirectional text in identifiers and literals. + +MessageFormat's syntax also uses a variety of "sigils" and markers to form the structure of a _message_. +These sigils are ASCII punctuation characters that have neutral directionality. +This means that the inclusion of right-to-left ("RTL") identifiers or literals in a _message_ +can result in the syntax looking "scrambled" or, in extreme cases, appearing to have a different meaning +due to [spillover](https://www.w3.org/TR/i18n-glossary/#dfn-spillover-effects). + +To prevent spillover effects and to allow users (particularly RTL language users) +to author _messages_ in a straightforward way, we want to allow the syntax to include appropriate +bidirectional support and to recommend to tool and translation technology implementers +mechanisms to make _messages_ that include RTL characters easy to work with +without introducing spoofing or "Trojan Source" attack vectors. + +## Background + +_What context is helpful to understand this proposal?_ + +If you are unfamiliar with bidirectional or right-to-left text, there is a basic introduction +[here](https://www.w3.org/International/articles/inline-bidi-markup/uba-basics). + +MessageFormat _message_ strings are created and edited primarily by humans. +The original _message_ is often written by a software developer or user experience designer. +Translators need to work with the target-language versions of each _message_. +Like many templating or domain-specific languages, MF2 uses neutrally-directional symbols +to form portions of the syntax. 
+When the _message_ contains right-to-left (RTL) characters in translations or +in portions of the syntax, +the plain-text of the message and the Unicode Bidirectional Algorithm (UBA, UAX#9) +can interact in ways that make the _message_ unintelligible or difficult to parse visually. + +Machines do not have a problem parsing _messages_ that contain RTL characters, +but users need to be able to discern what a _message_ does. +For example, users need to be able to match _keys_ in a _variant_ to _selectors_ +in a `.match` statement. +Or they want to know how a _pattern_ will be evaluated, +such as understanding the _options_ and _values_ in a _placeholder_. + +In addition, it is possible to construct messages that use bidi characters to spoof +users into believing that a _message_ does something different than what it actually does. + +The current syntax does not permit bidi controls in _name_ tokens, +_unquoted literals_, +or in the non-pattern whitespace portions of a _message_. + +Permitting the Unicode bidi **isolate** characters and the standalone strongly-directional markers +would enable tools, including translation tools, and users who are writing in RTL languages +to format a _message_ so that its plain-text representation and its function +are unambiguous. + +The isolates are paired invisible characters inserted around a portion of a string. +The start of an isolated sequence is one of: +- U+2066 LEFT-TO-RIGHT ISOLATE (LRI) +- U+2067 RIGHT-TO-LEFT ISOLATE (RLI) +- U+2068 FIRST-STRONG ISOLATE (FSI) + +The end of an isolated sequence is U+2069 POP DIRECTIONAL ISOLATE (PDI). + +The characters inside an isolated sequence have the initial string direction +corresponding to the starting character ( +left-to-right for `LRI`, +right-to-left for `RLI`, +or auto for `FSI`). +They are called "isolates" because the enclosed text is **isolated** from surrounding text +while being processed using the Unicode Bidirectional Algorithm (UBA). 
+The surrounding text treats the sequence as-if it were a single neutral character, +while the interior sequence is processed using the base direction specified by the isolate +starting character. + +> [!NOTE] +> One of the side-effects of using `{`/`}` and `{{`/`}}` to delimit _expressions_ +> and _patterns_ is that these paired enclosing punctuations provide a measure of +> isolation in UBA. +> This is an additional reason not to change over to quote marks (which are not enclosing) +> around patterns. + +This design also allows for the use of strongly directional marker characters. +These include: +- U+200E LEFT-TO-RIGHT MARK (LRM) +- U+200F RIGHT-TO-LEFT MARK (RLM) +- U+061C ARABIC LETTER MARK (ALM) + +These characters are invisible strongly-directional characters. +They are used in bidirectional +text to coerce certain directional behavior (usually to mark the end of +a sequence of characters that would otherwise be ambiguous or interact with +neutrals or opposite direction runs in an unhelpful way). + +### Strictness and Abuse + +We want the syntax to be somewhat permissive, particularly when it comes to paired isolates. +The isolates and strongly-directional marks are invisible except in certain specialized editing environments. +While users and tools should be strict about using well-formed isolate sequences, +we don't want to have invisible characters or whitespace generate additional syntax errors except where necessary. +Therefore, it should not be a syntax error if a user, editor, or tool fails to match opening/closing isolates. + +It is possible to generate a "strict" version of the ABNF that is more restrictive about isolate pairing. +Such an ABNF might be used by message serializers to ensure high-quality message generation. 
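A serializer that wants to be strict about isolate pairing could check a _message_ before emitting it. The following is a minimal sketch in JavaScript under stated assumptions, not part of the specification: the function name `findUnpairedIsolates` and its return shape are hypothetical, and it only verifies that every LRI, RLI, or FSI has a matching PDI. It does not check where in the syntax the controls are permitted.

```javascript
// Hypothetical helper (illustration only): report isolate controls that
// are not properly paired in a message string.
const LRI = "\u2066"; // LEFT-TO-RIGHT ISOLATE
const RLI = "\u2067"; // RIGHT-TO-LEFT ISOLATE
const FSI = "\u2068"; // FIRST-STRONG ISOLATE
const PDI = "\u2069"; // POP DIRECTIONAL ISOLATE

function findUnpairedIsolates(message) {
  const open = [];     // indexes of isolate starts still waiting for a PDI
  const strayPdi = []; // indexes of PDIs with no preceding isolate start
  for (let i = 0; i < message.length; i++) {
    const ch = message[i];
    if (ch === LRI || ch === RLI || ch === FSI) {
      open.push(i);
    } else if (ch === PDI) {
      if (open.length > 0) {
        open.pop(); // closes the most recently opened isolate
      } else {
        strayPdi.push(i);
      }
    }
  }
  return { unclosed: open, strayPdi };
}

// Usage: a strict serializer might refuse to emit a message
// when either list is non-empty.
const report = findUnpairedIsolates("{\u2066$x :string\u2069}");
console.log(report.unclosed.length === 0 && report.strayPdi.length === 0);
```

Because all four controls are in the Basic Multilingual Plane, indexing the string by UTF-16 code unit is sufficient here. A relaxed parser would accept the same input without running such a check.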
+ +Unfortunately, permitting a "relaxed" handling of isolates/marks, when mixed with whitespace, +could produce the various Trojan Source effects described in [[UTS55]](https://www.unicode.org/reports/tr55/#Usability-bidi)) + +## Use-Cases + +_What use-cases do we see? Ideally, quote concrete examples._ + +1. Presentation of _keys_ can change if the text of the _key's_ _literal_ is not isolated: +``` +.match {$م2صر :string}{$num :integer} +م2صر 0 {{The {$م2صر} is actually the first key}} +م2صر * {{This one appears okay}} +``` + +> [!NOTE] +> +> The first _variant_ in the use case above is actually: +>``` +> \u06452\u0635\u0631 0 {{The {$\u06452\u0635\u0631} is actually the first key}} +>``` + + +2. Presentation in an expression can change if portions of the expression + are not isolated or do not restore LTR order: +> In the following example, we use the same string with a number inserted into the middle of +> the string to make the bidi effects visible. +> The numbers correspond to: +> 1. operand +> 2. function +> 3. option name +> 4. option value + +``` +You have {$م1صر :م2صر م3صر=م4صر} <- no controls +You have {$م1صر‎ :م2صر‎ م3صر‎=م4صر‎} <- LRM after each RTL token +``` + +3. As a developer or translator, I want to make unquoted RTL literals or names appear correctly + in my plain-text editing environment. + I don't want to have to manage a lot of paired controls, when I can get the right effect using + strongly directional mark characters (LRM, RLM, ALM) + +4. As a translation tool or MF2 implementation, I want to automatically generate + _messages_ which display correctly when they contain RTL text or substring with minimal user intervention. + +## Requirements + +_What properties does the solution have to manifest to enable the use-cases above?_ + +To prevent RTL _literals_ from having spillover effects with surrounding syntax, +it should be possible to bidi isolate a _quoted_ or _unquoted_ _literal_. 
+ +>``` +> .local $title = {|البحرين مصر الكويت!|} +> .local $egypt = {مصر :string} +>``` + +To prevent _patterns_ from having spillover effects with other parts of a _message_, +particularly with _keys_ in a _variant_, +it should be possible to bidi-isolate a _quoted-pattern_. + +>``` +> .match {$foo :string} +> isolate {{البحرين مصر الكويت!}} +>``` + +To prevent _markup_, _placeholders_, or _expressions_ from having spillover effects +with other parts of a _message_ +it should be possible to bidi isolate the contents of a _markup_ or an _expression_. + +>``` +> You can find it in {$مصر}. +>``` + +To prevent RTL identifiers from having spillover effects with other parts of an _expression_, +it should be possible to include "local effect" bidi controls following an _identifier_, +_name_, +_option value_, +or _literal_. +These controls must not be included into the _identifier_, _name_, _option value_, or _literal_, +that is, it must be possible to distinguish these characters from the identifier, +name, option value, or literal in question. + +>``` +> You can use {$م1صر‎ :م2صر‎ م3صر‎=م4صر‎} +>``` + +To prevent RTL _namespace_ names from having spillover effects with _function_ names, +it should be possible to include "local effect" strongly directional marks in an _identifier_: +> In this example, the _namespace_ is `:م2` and the _name_ is `:ن⁩3`, but the sequence is displayed +> with a spillover effect. +> (Note that the number in each name _trails_ the Arabic letter: it appears to the left because the +> string is RTL!). +>``` +> {$a1 :b2:c3} +> {$م1 :م2:ن3} spillover effects +> {⁦$م1‎ :م2‎:ن3‎⁩} with isolates and LRMs +>``` + +Newlines inside of messages should not harm later syntax. + +``` +* * {{\u0645
\u0646}} 123 456 {{ No LRM==bad }} +* * {{م +ن}} 123 456 {{ No LRM==bad }} + +* * {{\u0645
\u0646}}\u200e 123 456 {{ LRM }} +* * {{م +ن}}‎ 123 456 {{ LRM }} +``` + + +Naive text editors, when operating in a right-to-left context, +might display a _message_ with an RTL base direction. +While the display of the _message_ might be somewhat damaged by this, +it should still produce results that are as reasonable as possible. + +## Constraints + +_What prior decisions and existing conditions limit the possible design?_ + +Users cannot be expected to create or manage bidirectional controls or +marks in _messages_, since the characters are invisible and can be difficult +to manage. +Tools (such as resource editors or translation editors) +and other implementations of MessageFormat 2 serialization are strongly +encouraged to provide paired isolates around any right-to-left +syntax as described in this design so that _messages_ display appropriately as plain text. + +Ideally we do not want RLM/LRM/ALM to be part of the parsed +`name`, `variable`, `reserved-keyword`, `unquoted`, or any other term +defined in terms of `name`. +This is complicated to do in ABNF because each of these tokens is followed either by +whitespace or by some closing marker such as `}`. +The workaround in #763 was to permit these characters _before_ or _after_ whitespace +using the various whitespace productions. +This works at the cost of allowing spurious markers. + +We want isolate characters to be _outside_ of patterns. +There is an open question about how best to place them. +One option would be to place them adjacent to the "pattern quote" character sequences `{{`/`}}`. +Another option would be to place them _inside_ the pattern quotes, e.g. `{\u2066{`/`}\u2069}`. + +Bidi isolates and marks are invisible characters. +Whitespace is also invisible. +Mixing these may be problematic. +Not allowing these to mix could produce annoying parse errors. + +## Proposed Design + +_Describe the proposed solution. 
Consider syntax, formatting, errors, registry, tooling, interchange._ + +I propose adopting a hybrid approach in which we permit "super-loose isolation". +This allows users to include isolates and strongly directional characters in the whitespace +portions of the syntax in order to make messages appear correctly. + +The second part of the hybrid approach would be to recommend ("SHOULD") the "strict isolation" +design for serializers. +(Note that "strict" and "super-loose" use non-identical productions with the name `bidi`. +These serve different purposes and are consistent with strict being narrower than super-loose.) +This syntax is a subset of the super-loose syntax and can be applied selectively to messages that +have RTL sequences or which have problematic display. + + +## Alternatives Considered + +_What other solutions are available?_ +_How do they compare against the requirements?_ +_What other properties do they have?_ + +### Nothing +We could do nothing. + +A likely outcome of doing nothing is that RTL users would insert bidi controls into +_messages_ in an attempt to make the _pattern_ and/or _placeholders_ display correctly. +These controls would become part of the output of the _message_, +showing up inappropriately at runtime. +Because these characters are invisible, users might be very frustrated trying to manage +the results or debug what is wrong with their messages. + +By contrast, if users insert too many or the wrong controls using the recommended design, +the _message_ would still be functional and would emit no undesired characters. + +### LTR Messages with isolating sequences + +The syntax of a _message_ assumes a left-to-right base direction +both for the complete text of the _message_ as well as for each line (paragraph) +contained therein. +We prefer LTR display because human understanding of a _message_ depends on LTR word tokens, +as well as token ordering (as in a placeholder or with variant keys). 
+Note that LTR display is **_not_** a requirement, because that is beyond the scope of MF2 itself. +However, tool and editor implementers ought to pay attention to this assumption. + +Preferring LTR display is not the disadvantage to right-to-left languages that it might first appear: +- Bidi inside of _patterns_ works normally (we go to great lengths to make the interior + of _patterns_ work as plain text) +- _Placeholders_ and _markup_ can be isolated (treated as neutrals) so that they appear + in the correct location in an RTL _pattern_ +- _Expressions_ use isolates and directional marks to display internal tokens in the + correct order and without spillover effects +- The syntax uses enclosing marks (specifically curly brackets) which the Unicode Bidirectional Algorithm + pairs up for shaping purposes, resulting in a weak form of isolation in the syntax itself. + +The syntax permits (but does not require) isolating bidi controls to be used on the +**outside** of the following: +- unquoted literals +- quoted literals +- quoted patterns + +We permit any of the isolate starting characters (LRI, RLI, FSI) because we want to allow +the user to set the base direction of a _literal_ or _pattern_ according to its respective +actual contents. + +> [!IMPORTANT] +> This change adds a "lookahead" to the process of determining if a given _message_ is +> "simple" or "complex", as LRI, RLI, and FSI are all valid starters for a simple message +> as well as being allowed before a quoted pattern, declaration, or selector. 
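
The lookahead mentioned in the note above could be sketched roughly as follows. This is an illustrative simplification, not the normative rule: the function name `looks_complex` is hypothetical, and it assumes a _message_ is "complex" when the first character after any leading whitespace and isolate starters begins a declaration or `.match` (a `.`) or a quoted pattern (`{{`).

```python
# Isolate starters that may precede either a simple or a complex message.
ISOLATE_STARTERS = "\u2066\u2067\u2068"  # LRI, RLI, FSI
# Whitespace characters from the `s` production in this document.
WHITESPACE = " \t\r\n\u3000"


def looks_complex(message: str) -> bool:
    """Guess whether a message is complex by looking past leading isolates.

    Assumption: only whitespace and isolate starters are skipped before
    testing for "." (declaration or .match) or "{{" (quoted pattern).
    """
    index = 0
    skippable = WHITESPACE + ISOLATE_STARTERS
    while index < len(message) and message[index] in skippable:
        index += 1
    rest = message[index:]
    return rest.startswith(".") or rest.startswith("{{")
```

Without the isolates, the first non-whitespace character alone is enough to decide. With them, a parser has to scan past an unbounded run of invisible characters before it knows which mode it is in.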
+This would change the ABNF as follows: +(Notice that this change includes a production `bidi` described further down +in this document) +```abnf +literal = [open-isolate] (quoted-literal / (unquoted-literal [bidi])) [close-isolate] +quoted-pattern = [open-isolate] "{{" pattern "}}" [close-isolate] + +open-isolate = %x2066-2068 +close-isolate = %x2069 +``` + +> [!IMPORTANT] +> The isolating characters go on the **_outside_** of the various _literal_ and _pattern_ +> productions because characters on the **_inside_** of these are part of the _literal_'s +> or _pattern_'s textual content. +> We need to allow users to include bidi characters, including isolates and strongly directional marks +> in the output of MF2. + +- Permit **left-to-right** isolates + (starting with LRI `U+2066` and ending with PDI `U+2069`) + to be used **immediately inside** the following: + - expressions + - markup + +- Permit any type of isolate sequence + (starting with LRI `U+2066`, RLI `U+2067`, or FSI `U+2068` and ending with PDI `U+2069`) + around any token inside of an expression or markup. + +- Permit the use of LRM, RLM, or ALM strongly directional marks immediately following any of the items that + **end** with the `name` production in the ABNF. + This includes _identifiers_ found in the names of + _functions_ + and _options_, + plus the names of _variables_, + as well as the contents of _unquoted_ literals. 
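
The last bullet above suggests how a serializer might behave. Here is a minimal sketch, assuming a serializer that appends an LRM after any name containing right-to-left characters. The helper names `has_rtl` and `serialize_name` are hypothetical, and the choice of LRM (rather than RLM or ALM) is an assumption made for illustration only.

```python
import unicodedata

LRM = "\u200e"  # U+200E LEFT-TO-RIGHT MARK
# Bidi classes of strongly right-to-left characters (Hebrew "R", Arabic "AL").
RTL_CLASSES = {"R", "AL"}


def has_rtl(text: str) -> bool:
    """Return True if any character has a strong right-to-left bidi class."""
    return any(unicodedata.bidirectional(char) in RTL_CLASSES for char in text)


def serialize_name(name: str) -> str:
    """Emit a name, followed by an LRM when the name contains RTL characters.

    The mark is written after the name and is not part of it; a parser
    following this design would not include it in the parsed name.
    """
    return name + LRM if has_rtl(name) else name
```

Names made up only of left-to-right or neutral characters are emitted unchanged, so messages without RTL tokens gain no extra invisible characters.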
+ +This would change the ABNF as follows (assuming the above changes are also incorporated): +```abnf +expression = "{" [LRI] (literal-expression / variable-expression / annotation-expression) [close-isolate] "}" +literal-expression = [s] literal [s annotation] *(s attribute) [s] +variable-expression = [s] variable [s annotation] *(s attribute) [s] +annotation-expression = [s] annotation *(s attribute) [s] +markup = "{" [LRI] [s] "#" identifier *(s option) *(s attribute) [s] ["/"] [close-isolate] "}" ; open and standalone + / "{" [LRI] [s] "/" identifier *(s option) *(s attribute) [s] [close-isolate] "}" ; close +LRI = %x2066 +``` + +> [!NOTE] +> This design only permits LTR isolates at the expression level because the contents of an _expression_ +> or _markup_ must be laid out left-to-right. +> _Literal_ values can be right-to-left isolated within that or use strongly +> directional marks to ensure correct display. + +> [!NOTE] +> Notice that _unquoted literals_ can also be surrounded by bidi isolates +> using the previous syntax modification just above. +> The isolates are **not** a part of the literal! + +> [!NOTE] +> Notice that `reserved-annotation` is not in the ABNF changes because it already +> permits the marks in question. +> Any syntax derived from `reserved-annotation` +> (i.e. when unreserving a new statement in a future addition) +> would need to handle bidi explicitly using the model already established here. 
+ +```abnf +variable-expression = "{" [s] variable [bidi] [s annotation] *(s attribute) [s] "}" +function = ":" identifier [bidi] *(s option) +option = [LRI] identifier [bidi] [s] "=" [s] (literal / variable) [bidi] [close-isolate] +attribute = [LRI] "@" identifier [bidi] [[s] "=" [s] ((literal / variable) [bidi])] [close-isolate] +markup = "{" [LRI] [s] "#" identifier [bidi] *(s option) *(s attribute) [s] ["/"] [close-isolate] "}" ; open and standalone + / "{" [LRI] [s] "/" identifier [bidi] *(s option) *(s attribute) [s] [close-isolate] "}" ; close +identifier = [(namespace ns-separator)] name +ns-separator = [bidi] ":" +bidi = [ %x200E-200F / %x061C ] +``` + +**Open Issues** + +The ABNF changes found above put isolates and strongly directional marks into specific locations, +such as directly next to `{`/`}`/`{{`/`}}` markers +or directly following "tokens" such as `name`. +This makes it a syntax error for whitespace to appear around the isolates or marks. +A more permissive design would add the isolates and strongly directional marks to required and optional +whitespace in the syntax and depend on users/editors to appropriately pair or position the marks +to get optimal display. + + +### Super-loose isolation + +Add isolates and strongly directional marks to required and optional whitespace in the syntax. +This would permit users to get the effects described by the above design, +as long as they use isolates/marks in a "responsible" way. + +The exception to this is the namespace separator, used in `identifier`. +This requires the ability to insert isolates or strongly directional marks +between the namespace and name portions, where whitespace is not permitted. +This is the only location in the syntax where such characters might be needed +but whitespace is not at least optional. 
+This could be defined as: +```abnf +ns-separator = [bidi] ":" [bidi] +``` + +Here are the other ABNF changes: + +```abnf +; strongly directional marks and bidi isolates +; ALM / LRM / RLM / LRI / RLI / FSI / PDI +bidi = %x061C / %x200E / %x200F / %x2066-2069 + +; optional whitespace +owsp = *( s / bidi ) + +; required whitespace +wsp = [ owsp ] 1*s [ owsp ] + +; whitespace characters +s = ( SP / HTAB / CR / LF / %x3000 ) +``` + +**Pros** +- Avoids problems with syntax errors that users and tools might find difficult to debug. +- Effective if used carefully. +- Addresses need to comply with UAX#31 + +**Cons** +- Syntax does not prevent poor display outcomes, including enabling some Trojan Source cases (UTS#55); + note that tooling or linting can help ameliorate these issues. + +### Strict isolation all the time + +Apply bidi isolates in a strict way. +In this design: +1. The open/close isolate characters are syntactically required to be paired. + This introduces parse errors for unpaired invisible characters, + which could lead to bad user experiences. + +As noted above, the "strict" version of the ABNF should be adopted by serializers and for +message normalization. 
+ +```abnf +variable-expression = "{" [s] variable [bidi] [s annotation] *(s attribute) [s] "}" +function = ":" identifier [bidi] *(s option) +option = identifier [bidi] [s] "=" [s] (literal / variable) [bidi] + / LRI identifier [bidi] [s] "=" [s] (literal / variable) [bidi] close-isolate +attribute = "@" identifier [bidi] [[s] "=" [s] ((literal / variable) [bidi])] + / LRI "@" identifier [bidi] [[s] "=" [s] ((literal / variable) [bidi])] close-isolate +markup = "{" [s] "#" identifier [bidi] *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone + / "{" LRI [s] "#" identifier [bidi] *(s option) *(s attribute) [s] ["/"] close-isolate "}" + / "{" [s] "/" identifier [bidi] *(s option) *(s attribute) [s] "}" ; close + / "{" LRI [s] "/" identifier [bidi] *(s option) *(s attribute) [s] close-isolate "}" ; close +identifier = [(namespace ns-separator)] name +ns-separator = [bidi] ":" [bidi] +bidi = [ %x200E-200F / %x061C ] +``` + + +### Isolate `name` rather than `unquoted-literal` + +Isolating rather than marking `name` helps ensure +that its directionality does not spill over to adjoining syntax. + +The following replaces the proposed design's changes to `literal` and the `[bidi]` additions to +`variable-expression`, `function`, `option`, `attribute`, `markup`, and `ns-separator`: +```abnf +name = [open-isolate] name-start *name-char [close-isolate] +quoted-literal = [open-isolate] "|" *(quoted-char / quoted-escape) "|" [close-isolate] +``` + +For example, this allows for the proper rendering of the expression +``` +{⁦:⁧אחת⁩:⁧שתיים⁩⁩} +``` +where "אחת" is the `namespace` of the `identifier`. +Without `name` isolation, this would (misleadingly) render as +``` +{⁦:אחת:שתיים⁩} +``` + +Note that the parsed value of the `name` does not include the open/close isolates, +just as they're not included in the parsed values of quoted literals or quoted patterns, +even though the production includes the characters. 
+We could accomplish this by adding additional productions to manage `name`, at the cost +of a more complex ABNF. + +**Pros** +- In the syntax, it's much simpler to include the changes to `name` in the `name` rule, + rather than patching every place where `name` is used. + +**Cons** +- Implementations need to remove isolates from the `name` token before comparing + the value to other values (such as comparing `function` or `variable` names). + Because of namespacing, this requires looking _inside_ the token. +- Implementations might need to insert isolates when generating names upon serialization. + The current data model does not separate `namespace` and `name`, + so this might be more complicated. +- `unquoted-literal` values appear as keys, as operands, and as option values. + If not isolated, these can cause spillover effects, so we might need both `name` + and `unquoted-literal` isolation. + +### Deeper Syntax Changes +We could alter the syntax to make it more "bidi robust", +such as by using strongly directional characters instead of neutrals. + +### Forbid RTL characters in `name` and/or `unquoted` +We could alter the syntax to forbid using RTL characters in names and unquoted literals. +This would make the syntax consist solely of LTR and neutral characters. +One flavor of this would be to restrict tokens to US ASCII. + +Cons: +- This would break compatibility with NCName/QName; we would be back to + defining our own idiosyncratic namespace +- Unicode could define more RTL characters in the future, making the syntax + brittle +- This is not friendly to non-English/non-Latin users and represents a usability + restriction in environments in which names can be non-ASCII values + +### Permit LRI, RLI, and FSI inside expressions and markup + +We could permit RLI/FSI to be used inside _expressions_ and _markup_. +This would be an advantage for simple _expressions_ containing only or primarily +RTL content. 
+For example: +``` +{⁧لت-123-م...⁩} // RLI isolated +{لت-123-م...} +``` + +We could also permit users/editors to use RTL base direction for editing. +This is tricky, as the syntax promotes the use of left-to-right runs +that will "stick together" unless isolated. +This is most visible in _selectors_ and _variant_ _keys_. + +Consider this message: +``` +.match {$\u06451\u0645}{$\u06462\u0646} +one two {{normal LTR}} +\u2067one\u2069 \u2067two\u2069 {{RLI around each key}} +\u2066one\u2069 \u2066two\u2069 {{LRI around each key}} +\u0645 \u0646 {{RTL}} +* \u0646 {{star is first}} +\u0645 * {{star is second}} +``` + +In an LTR context the _message_ displays like this (red lines around display errors): +![image](https://github.com/unicode-org/message-format-wg/assets/69082/f19cbf99-94f2-4f36-805b-8da0750bc5f2) + +In an RTL context, there is an equivalent case: +![image](https://github.com/unicode-org/message-format-wg/assets/69082/1b2e1c67-aebc-455b-98e9-99f9e620c543) + +Coercing proper display in both LTR and RTL contexts requires +complex sets of controls. + +**Pros** +- Can provide both LTR and RTL native editing experiences + +**Cons** +- Requires complex sets of bidi controls +- RTL editing/display is mostly a special case; + we already afford the ability to edit RTL in _patterns_ and _literals_ + +### Hybrid approaches + +Strict syntactical requirements produce better _display_ outcomes +that solve the various problems enumerated in this design document. +However, the strictness comes with a cost: otherwise-valid messages, +including messages that display completely as expected and are not in any way misleading, +can produce syntax errors. +These errors can be difficult to debug, since the characters are invisible. +Syntax errors are generally treated as fatal by processors. + +Semi-strict or super-loose strategies can be used to avoid producing these types of syntax error. +However, valid messages using these approaches can have stray (e.g. 
unpaired isolates), +malformed (e.g. PDI before LRI/RLI/FSI), +or badly formatted character sequences (wrapping the wrong things), +unless the user or the user's tools are careful. +This can include deliberate abuse, such as Trojan Source attacks (see UTS#55), +in which Bad Actors create messages that have a misleading appearance vs. their runtime interpretation. + +A hybrid ("Postel's Law") approach would be to permit the use of isolates and strongly directional marks +in whitespace in a permissive way (see: "super-loose isolation"), +particularly in runtime formatting operations +but strongly encourage tools to implement message normalization on a strictly-defined grammar +(see: "strict isolation all the time") +and to encourage users to use the strict version of the grammar when writing or serializing messages. + +The hybrid approach would include tests to allow implementations to claim +adherence to the stricter grammar. + +**Pros** +- Messages can be written that solve all display problems +- Stray, unpaired, repeated, or other invisible typos do not produce spurious + syntax errors +- Provides a foundation for tools to claim strict conformance and message normalization + as well as guidance to implementers to make them want to adopt it +- Messages are valid while being edited (such as when the open or close isolate has been + inserted but the corresponding opposite isolate hasn't been entered yet) + +**Cons** +- Requires additional effort to maintain the grammar +- Requires additional effort to maintain tests +- Valid messages can contain Trojan Source and other negative display consequences; + messages can be checked, however, using the strict grammar, so tools could warn + users of potential abuse diff --git a/exploration/code-mode-introducer.md b/exploration/code-mode-introducer.md index 64b28d2f32..c1d979a212 100644 --- a/exploration/code-mode-introducer.md +++ b/exploration/code-mode-introducer.md @@ -1,6 +1,6 @@ # Design Proposal: Choosing a Code Mode 
Introducer -Status: **Proposed** +Status: **Accepted**
Metadata @@ -77,10 +77,9 @@ private-start = "^" / "&" _Describe the proposed solution. Consider syntax, formatting, errors, registry, tooling, interchange._ -We need to choose one of these (or another option not yet considered). -Presentation at UTW did not produce any opinions. - -Based on the pro/cons below, I would suggest Option D is possibly the best option? +A modified version of Option D was chosen. +The keyword `when` was dropped after this design was completed. +"Simple" messages do not require the `pattern` to be quoted. ## Alternatives Considered diff --git a/exploration/data-driven-tests.md b/exploration/data-driven-tests.md index cf1c710f8a..13f2ee84f2 100644 --- a/exploration/data-driven-tests.md +++ b/exploration/data-driven-tests.md @@ -1,6 +1,6 @@ # Data-driven tests -Status: **Proposed** +Status: **Accepted**
Metadata @@ -20,64 +20,55 @@ One of the [deliverables of the Message Format Working Group (MFWG)](https://git > "A conformance test suite for parsing and formatting messages sufficient to ensure implementations can validate conformance to the specification(s) provided". -This design proposal captures the planned approach for the suite. +This design proposal captures the planned approach for the suite: -This approach includes _how_ tests are written: They should be captured in a single platform-agnostic format that can be utilized by all MF2 implementations. There should be no need to rewrite individual test cases for each platform. +- It captures _what_ kind of tests are written by identifying the aspects of the MessageFormat 2 (MF2) specification that must be tested and the categories of test that do this. -This approach also includes _what_ kind of tests are written. We need to identify which parts of MF2 should be covered by different types of test as a minimum. +- It also captures _how_ tests are written by describing the single platform-agnostic format that can be used by any MF2 test runner. ## Background -Several pre-existing test files have been considered before forming this proposal: +Several pre-existing test suites have been considered before forming this proposal: - [**Unicode's Data Driven Test framework**](https://github.com/unicode-org/conformance) is a project with a goal that aligns with that of MFWG's conformance test suite. - [**message-format-wg XML test format**](https://github.com/unicode-org/message-format-wg/tree/514758923abac13a2c5eb71b6b6cdef4a181280e/test) includes a test schema and accompanying test examples from which we can take inspiration. -- [**Intl.MessageFormat polyfill tests**](https://github.com/messageformat/messageformat/tree/main/packages/mf2-messageformat/src) are implementation-specific but they capture the type of tests that we may want to include in the conformance test suite. 
The polyfill itself is an implementation that the test suite could be run against. +- [**Intl.MessageFormat polyfill tests**](https://github.com/messageformat/messageformat/blob/ee1bc08826f0855d00a9ace4db001c06a8679983/packages/mf2-messageformat/src/messageformat.test.ts) are implementation-specific but they capture the type of tests that we may want to include in the conformance test suite. The polyfill itself is an implementation that the test suite could be run against. -- [**ICU**](https://github.com/unicode-org/icu) also contains platform-specific MF2 test cases that could be reused for the conformance test suite, including the [ICU4J tests](https://github.com/unicode-org/icu/tree/main/icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2) and [Tim Chevalier's draft ICU4C tests](https://github.com/catamorphism/icu/blob/parser-plus-data-model-plus-full-api/icu4c/source/test/intltest/messageformat2test.cpp). +- [**ICU**](https://github.com/unicode-org/icu) also contains platform-specific MF2 test cases that could be reused for the conformance test suite, including the [ICU4J tests](https://github.com/unicode-org/icu/blob/4f75c627675b426938f569003ee9dc0ea43490bb/icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/MessageFormat2Test.java) and [ICU4C tests](https://github.com/unicode-org/icu/blob/6d5555a739179b5d177e73db7c111c5ef1cac22d/icu4c/source/test/intltest/messageformat2test.cpp). ## Use-Cases -**Developers** of MF2 implementations need to easily verify that their completed implementation conforms to the specification. This needs to be fully automated and easily repeatable. +**Developers** of MF2 implementations need to easily verify that their completed implementation conforms to the specification. This needs to be fully automated and easily repeatable. For incomplete and incorrect implementations, it is important for developers to easily understand where the specification is not being met and why. 
-For incomplete and incorrect implementations, it is important for developers to easily understand where the specification is not being met and why. +**Stakeholders** and **MF2 users** may use the tests as human-readable documentation of the specification. They need to be easily navigable and legible for this purpose. -The main platforms for which the tests should initially run are: - -- Node.js -- ICU4J (Java) -- ICU4C (C++) +**Vendors** using tooling that conforms to the specification may want to run tests against it to verify that this is the case. -Other platforms, such as ICU4X (Rust) may be added later. +## Requirements -**Stakeholders** and **MF2 users** may use the conformance test suite as human-readable documentation of the specification. It needs to be easily navigable and legible for this purpose. +### Test the specification, not necessarily the final output -**Vendors** using tooling that conforms to the specification may want to run tests against it to verify that this is the case. +Every piece of the specification should be testable. In order to test the specification in isolation, the test suite should be independent of Unicode CLDR locale data. -## Requirements +### Provide tests, not runners -- Test framework +Unlike the test suites within [ICU](https://github.com/unicode-org/icu), this suite does not target a specific implementation and is not tied to any particular executor. It is completely platform-agnostic. Consumers of the tests can decide how they are run. - - The test cases and assertions must be easy to read. - - The test cases and assertions must be completely platform-agnostic. - - The framework must include the platform-specific test executors as part of the solution. - - The framework must be extendable with new executors (e.g. ICU4X) and it should be clear how to do this. +### Use a versatile format -- Test content - - **Syntax tests:** These test that valid patterns are evaluated correctly and that invalid patterns are identified. 
Where standard registry functions are used, they also test that the correct function is invoked with the expected arguments. - - **Selector tests:** These test that the correct case of a `match` statement is selected, based on what follows the `when` keyword. +The tests should be captured in a format that is highly portable and easily integrable with a wide range of technologies. The format should be easy to read while being flexible enough to capture the necessary detail of all test input and output. ## Constraints -### External dependencies can impact portability +### Function output can differ between implementations -The platform-agnostic nature of the tests means that great caution must be taken around adding dependencies. The test suite must cater for a range of technology stacks and workflows with different restrictions around external dependencies. +The behaviour of default registry functions such as `:number` and `:datetime` is dependent on locale-specific data and may vary between implementations. [Test functions](https://github.com/unicode-org/message-format-wg/blob/6414b6c7d9faed6c1b4645b92b3548a8ea0ad332/test/README.md) should be used to write more isolated tests. ### Errors and evaluation strategy may not be consistent -It is important to test error cases for each of the test types mentioned above but, because variable evaluation is not captured within the standard, we cannot guarantee what kind of error will be raised in all cases. +Variable evaluation is not captured within the standard so we cannot guarantee the order in which errors are encountered. For example, the pattern below may or may not result in an error depending on how lazily the expression is evaluated. This presents a challenge for testing. @@ -93,209 +84,92 @@ local $foo = {$bar} {Hello, {$bar}!} ``` -### The output of formatters may not be stable over time - -Where possible, any parts of the suite that do not directly test the formatters should be independent of their output. 
This is to reduce the number of test failures caused by formatter output changes. +For this reason, error tests should capture all errors present in each test case. ### Data model is not part of the specification -Although a standard data model is included in this repository, there is no requirement for all MF2 implementations to use it. This means that any data model tests included in the test suite may fail for otherwise standard-compliant implementations. If any tests of this type are included, they must be optional. +Although a standard data model is included in this repository, there is no requirement for all MF2 implementations to use it. Tests that rely on the structure of this data model may fail for standard-compliant implementations. If any tests of this type are included, they must be treated as optional. ## Proposed Design -### Test framework - -The MF2 test framework should follow the ['Unicode & CLDR Data Driven Test'](https://github.com/unicode-org/conformance) framework. - -As per the project's [README.md](https://github.com/unicode-org/conformance#readme): - -> "The goal of this work is an easy-to-use framework for verifying that an implementation of ICU functions agrees with the required behavior. When a DDT test passes, it a strong indication that output is consistent across platforms. [...] Data Driven Test (DDT) focuses on functions that accept data input such as numbers, date/time data, and other basic information." - -This aligns closely with the goals and characteristics of the MF2 tests. Parity with ICU procedures is an added advantage. - -The README specifies that test cases and expected results are to be located in separate files (including the rationale for this). - -#### Test file example - -`example_1_test.json` - -```jsonc -{ - "Test scenario": "example_1", - "description": "Test cases for XYZ", - "testType": "syntax", // Tests will require different setup steps or function calls depending on their purpose. 
- "tests": [ - { - "label": "0000", - "locale": "en-US", - "pattern": "{Some MF2 pattern}", - "options": {}, // Optional configuration - "input": { "namedArg": "foo" } // Arguments to the function being tested, such as a message.formatToString() function. May vary with testType. - } - // ... - ] -} -``` - -#### Verification file example - -`example_1_verify.json` - -```jsonc -{ - "Test scenario": "example_1", - "verifications": [ - { - "label": "0000", - "verify": "Expected result" - } - // ... - ] -} -``` - ### Test format -As per the 'Unicode & CLDR Data Driven Test' documentation, test and verification files are provided in JSON format. The proposal is to write tests in YAML and transpile them to JSON. - -JSON does not support multiline strings so test files may need to include `\n` line breaks in order to capture multiline patterns, which may impact readability. This is the main reason not to author tests in JSON directly. Assuming both the source and JSON-format tests are committed to the repository, the JSON remains the single source of truth for the tests and it can be consumed by the test executor without the need for any transpilation at runtime. - -The source format should offer the following: +Tests should be written in JSON. This format aligns with the requirements above around versatility, as well as providing a favorable editing experience. It offers: -- Precise control over whitespace as many MF2 tests concern this. -- Literal newlines for use in multiline patterns. +- Precise control over whitespace - tests are needed around whitespace handling. - Concise readable syntax. -- Comment syntax. - Validation against a schema. -- (Optional) Editor integration for syntax highlighting and validation. +- Editor integration for syntax highlighting and validation. -YAML fulfils these requirements and is widely used. 
+Other considerations around using JSON: -There is a [test generator](https://github.com/unicode-org/conformance/tree/main/testgen) included in the 'Unicode & CLDR Data Driven Test' repository. At the time of writing, this is specific to number format tests and is not easily adaptable to the needs of MF2. It does, however, demonstrate generating JSON from source files. +- It does not support multiline strings. Test files may need to include `\n` line breaks in order to capture multiline patterns, which may impact readability. +- It does not include a syntax for comments. The test schema should include an explicit field to capture test descriptions. -### Test content -#### Syntax tests +### Test schema -These tests evaluate the pattern based on the runtime arguments. Formatters are shown as stringified representations of the function because formatter output is tested separately. +JSON Schema should be used to capture the structure of test files. `"$comment"` properties can be used within the schema for any additional documentation required. -Example: +The proposed schema is included under [test/schemas/v0/](https://github.com/unicode-org/message-format-wg/tree/b4fd5a666a02950c57f0a454f65bf16a0bf03bf4/test/schemas/v0). Its version can be incremented to v1 when the proposal is accepted. -```jsonc -{ - "label": "Renders multiple inputs in formatted string", - "locale": "en-US", - "pattern": "{{$strArg :string} and {$numArg :number minimumFractionDigits=2}}", - "inputs": { - "strArg": { "type": "string", "value": "foo" }, - "numArg": { "type": "number", "value": 123 } - } - // "verify": "{ formatter: "string", value: "foo" } and { formatter: number, value: 123, minimumFractionDigits: 2 }" -} -``` +It is important that the schema is versioned. The version number should be captured within the schema files themselves because these files may be copied and used out of the context of this repository. 
By using a version directory and $id property for the schema, we can bump a schema version by changing one directory name and updating the `$id` property in the schema file(s) to match. -#### Selector tests +Although the use of [semantic versioning](https://semver.org/) has been discussed, it is likely to be overkill for our purposes. -These are extensive tests of the cases within a `match` statement. Testing of multiple selectors is included. +In order to reduce the verbosity of test files that contain multiple similar tests, the MF2 schema should include a `defaultTestProperties` property. This is an object that specifies properties to be used for every test case in the file (unless overridden in individual tests). -Single selector example: +Default properties can be used for expected outputs as well as inputs. For example: ```jsonc -{ - "label": "Matches numbers other than one", - "locale": "en-US", - "pattern": "match {$arg :number} when 1 {result 1} when * {result multi}", - "inputs": { - "arg": { "type": "number", "value": 2 } - } - // "verify": "result multi" -} +// The given locale for every test case is "en-US". +"defaultTestProperties": { "locale": "en-US" } + +// The expected string output for every test case is "Hello" +// and no test cases result in an error. +"defaultTestProperties": { "exp": "Hello", "expErrors": false } ``` -Multiple selector example: +### Test content -```jsonc -{ - "label": "Matches wildcard strings and numbers other than one", - "locale": "en-US", - "pattern": "match {$name :string} {$count :number} when apple 1 {result apple 1} when apple * {result apple multi} when * 1 {result other 1} when * * {result other multi}", - "inputs": { - "name": { "type": "string", "value": "banana" }, - "count": { "type": "number", "value": 3 } - } - // "verify": "result other multi" -} -``` +#### Syntax tests -#### Formatter tests (optional) +These tests evaluate the pattern `src` using the given runtime `params`. 
Assertions are made on the output, which can be formatted as either a single string or parts, and any resulting errors. Syntax tests are the core of the test suite. -These tests focus on the standard registry's formatters (e.g. `:number`, `:datetime`). They cover the different options that can be passed to each formatter (e.g. `offset`, `skeleton`). +#### Function tests -If the output of a formatter changes in the future, these tests may need updating. +There are two types of function test: -Example: +- __Selector tests__ test the cases within a `match` statement. Testing of multiple selectors is included. +- __Formatter tests__ test the standard registry's formatters (e.g. `:number`, `:datetime`). They cover the different options that can be passed to each formatter (e.g. `offset`, `skeleton`). -```jsonc -{ - "label": "Skeleton affects datetime format", - "locale": "en-US", - "pattern": "{$givenDateTime :datetime skeleton=yMMMdE}", - "inputs": { - "givenDateTime": { "type": "datetime", "value": "2000-12-31T00:00:00.000Z" } - } - // "verify": "Sun, 31 Dec 2000" -} -``` +As mentioned above, the behaviour of some of the default registry functions such as `:number` and `:datetime` is dependent on locale-specific data and may vary between implementations. There are special functions designed for test use only, which include `:test:select` and `:test:format` for replacing selectors and formatters respectively in the syntax tests. More information on these test functions can be found [here](https://github.com/unicode-org/message-format-wg/blob/6414b6c7d9faed6c1b4645b92b3548a8ea0ad332/test/README.md#test-functions). #### Data model tests (optional) There is no standard data model within the specification, which means that we cannot create mandatory data model tests. 
-If a particular implementation of MF2 exposes a standardized representation of [the data model](../spec/data-model/message.json), perhaps through a `mf2.toCanonicalJson();` function or similar, then we could create tests that assert against this. +If a particular implementation of MF2 exposes a standardized representation of [the data model](../spec/data-model/message.json), perhaps through a `mf2.toCanonicalJson();` function or similar, then we could create tests that assert against this in future. ## Alternatives Considered +### YAML test syntax + +YAML has some advantages over JSON: + +- It is extremely readable. +- It supports multiline strings. +- It supports comments. + +However, the flexibility of the syntax means that there is a risk of introducing ambiguity into the test cases. This makes it unsuitable. + + ### XML test syntax -As mentioned above, there are several advantages to writing tests in XML: +There are several advantages to writing tests in XML: - It allows preservation of whitespace in strings, which is crucial for MF2 test cases. - It allows literal newline characters in strings, which provides enhanced readability for multiline patterns. - It supports a schema format, which can be used to validate test files. -- It is widely supported. - -XML is fairly verbose though. It is better suited to writing markup, which is not our use-case. - -### Gherkin test syntax and Cucumber runner - -Based on the readability concerns mentioned above, the Gherkin syntax was also considered. 
- -Example: - -```feature -Feature: Multi-selector messages - - Background: - Given the username is "Matt" - And the source is: - """ - match {$photoCount :number} {$userGender :equals} - when 1 masculine {{$userName} added a new photo to his album.} - when 1 feminine {{$userName} added a new photo to her album.} - when 1 * {{$userName} added a new photo to their album.} - when * masculine {{$userName} added {$photoCount} photos to his album.} - when * feminine {{$userName} added {$photoCount} photos to her album.} - when * * {{$userName} added {$photoCount} photos to their album.} - """ - - Scenario: One item - male - When the message is resolved with params: - | key | value | - | photoCount | 1 | - | userGender | masculine | - Then the string output is "Matt added a new photo to his album." -``` - -The [Cucumber framework](https://cucumber.io/) was considered because of its integration with the Gherkin syntax. Cucumber's approach of using platform-specific step definitions for Gherkin scenarios aligns with our goal of having a data-only representation of the test content. It may, however, be difficult to support Cucumber in certain technology stacks and workflows. -It would be possible to transpile Gherkin to JSON without using Cucumber, which would provide similar benefits to the YAML transpilation mentioned above. This can be discussed further. +XML is fairly verbose though. It is better suited to writing markup. diff --git a/exploration/dataflow-composability.md b/exploration/dataflow-composability.md new file mode 100644 index 0000000000..e0ea68c155 --- /dev/null +++ b/exploration/dataflow-composability.md @@ -0,0 +1,827 @@ +# Data Flow for Composable Functions + +Status: **Proposed** + +
+Metadata:
+
+- Contributors: @catamorphism, @stasm
+- First proposed: 2024-02-13
+- Pull Requests: #645, #646
+ +## Objective + +Custom formatting functions should be able to +inspect the raw value and formatting options +of their arguments. +In addition, while a custom formatter may eagerly +format its operand to a string, +returning the raw underlying value +and the formatting options used for formatting +is also useful, +in case another function wants to extend these options +or use them for other logic. + +Exposing the underlying structure of a function's inputs, +as well as requiring formatters to return structured outputs, +makes it possible to specify how different functions +can be _composed_ together +(as shown in Example 1.1 below). + +### Pull Requests + +The pull request for this design document itself is [#645](https://github.com/unicode-org/message-format-wg/pull/645). + +A draft pull request, [#646](https://github.com/unicode-org/message-format-wg/pull/646), +shows what the [formatting spec](https://github.com/unicode-org/message-format-wg/blob/main/spec/formatting.md) +would look like if this design document were accepted. +(As of this writing, #646 reflects a slightly older version +of this design document, so some of the names used are different.) + +## Background + +In the accepted version of the spec (as of this writing), +the term "resolved value" is used for several different kinds +of intermediate values, +and the structure of resolved values is left completely +implementation-specific. + +Providing a mechanism for custom formatters to inspect more +detailed information about their arguments requires the +different kinds of intermediate values to be differentiated +from each other and more precisely specified. + +At the same time, the implementation can still be given freedom +to define the underlying types for representing formattable values +and formatted results. This proposal just defines wrappers +for those types that implementations must use in order to +make custom functions as flexible as possible.
+ +## Use-Cases + +Use cases from [issue 515](https://github.com/unicode-org/message-format-wg/issues/515): + +The following code fragment +invokes the `:number` formatter on the literal `1`, +binds the result to `$a`, and then invokes `:number` on the +value bound to `$a`. + +If the value of `$a` does not allow for inspecting the previous options +passed to the first call to `:number`, +then `$b` would format as `1.000`. + +### Example 1.1 +``` +.local $a = {1 :number minIntegerDigits=3} // formats as 001. +.local $b = {$a :number minFractionDigits=3} // formats as 001.000 +// min integer digits are preserved from the previous call. +``` + +In other words: the user likely expects this code to be equivalent to: + +### Example 1.2 +``` +.local $b = {1 :number minIntegerDigits=3 minFractionDigits=3} +``` + +But without `:number` being able to access the previously passed options, +the two fragments won't be equivalent. +This requires `:number` to return a value that encodes +the options that were passed in, the value that was passed in, +and the formatted result, +not just the formatted result. + +This example is an instance of the basic motivator for this proposal: +allowing data that flows out of a function call +to flow back into another function call +with all of its metadata (e.g. options) preserved. + +### Example 1.3 +``` +.input {$item :noun case=accusative count=1} +.local $colorMatchingGrammaticalNumberGenderCase = {$color :adjective accord=$item} +``` + +The `:adjective` function is a hypothetical custom formatter. +If the value of its `accord` option is a string, it's hard for `:adjective` +to use the value of `accord` to inflect the value of `$color` appropriately +given the value of `$item`.
+We want to pass not the formatted result of `{$item :noun case=accusative count=1}` +into `:adjective`, but rather, a structure that encodes that formatted result, +along with the resolved value of `$item` and the names and values of the options +previously passed to `:noun`: `case` and `count`. + +### Example 1.4 + +``` +.local $foo = {$arg :func} +``` + +Here, `$arg` is treated as an input variable. +Suppose that internally, an implementation wants to pre-define +some formatting options on all input variables +(or all input variables whose values have a particular type). +It would be helpful if functions could accept +a single argument that wraps the value of `$arg` +along with these predefined options, +separate from the options that are specific to the function. + +## Requirements + +- Define the structure passed in as an argument to a custom formatting function. +- Define the structure that a custom formatting function should return. +- Maintain the options passed into the callee as a _separate_ argument to the + formatter, to avoid confusion. (See Example 4 below.) +- The structure returned as a value must encode the formatted result, +input value, and options that were passed in. +- Articulate the difference between a "formattable value" +(which is the range of the input mapping (argument mapping), +and the result of evaluating a _literal_) +and a "formatted value" +(which is what a formatting function (usually) returns). +- Clarify the handling of formattable vs. formatted values: +does a formatting function take either, or both? + - This proposal proposes that formatter _inputs_ are a superset of + formatter _outputs_ (in other words, the output of a formatter can + be passed back in to another formatter). + +Any solution should support the examples shown in the "Examples" section. 
+Minimally, any solution should identify a set of concepts +sufficient for the spec to articulate +that function return values must include a representation of +their input and options, +and not just a "fully formatted" string, other value, or sequence of values. + +## Constraints + +### Implementation-defined behavior + +According to +[the "Introduction" section of the spec](https://github.com/unicode-org/message-format-wg/blob/main/spec/formatting.md#introduction) +> "The form of the resolved value is implementation defined" + +In this proposal, we want to maintain all the +flexibility that implementations require, +as promised by the existing spec, +while also describing more precisely +what an expression "resolves to". + +### Requirements for being formattable + +The same paragraph of the spec describes the form of the resolved value as follows: + +> "...it needs to be "formattable", i.e. it contains everything required by the eventual formatting." + +The purpose of this proposal is to define +what "everything required by the eventual formatting" means, +reconciling the promise of an implementation-defined "resolved value" +with the requirement that implementations preserve +enough metadata when binding names to values +(either directly in a _declaration_ +or indirectly in a function call). + +### Evaluation order + +The spec does not require either eager or lazy evaluation. + +Typically, a lazy implementation has an internal +"thunk" type representing a delayed computation. + +That presents no problems for this proposal, +since we distinguish a _nameable value_ type +(which may appear in the formatter's local environment, +but _not_ as a runtime argument to a function) +that may be distinct from any of the other value types. + +### The function registry + +Function registry contains specifications of the correct "input", +but that's distinct from what we're saying is the "input". 
+The proposal needs terminology that separates the two: +* `:number` takes a string or number (for example), +but that operand always arrives wrapped in a structure +(something like a `FormattedPlaceholder`) +that also has fields representing the options from the previous +formatter. + +See the ["Function Resolution" section of the spec](https://github.com/unicode-org/message-format-wg/blob/main/spec/formatting.md#function-resolution). + +Step 4: "Call the function implementation with the following arguments..." +* "If the expression includes an operand, its resolved value." + +If the form of the resolved value is implementation-defined, +it's hard to say what the form of the input to the formatting function is, +and likewise its result. + +### Internal representations + +The interfaces presented in this proposal should +be taken as guidance to implementations +for how to provide the minimum functionality +for composable functions +(how to preserve options through chains of formatters). +The in-memory representation of each type +depends on the programming language +and other implementation choices. +As with the [data model spec](https://github.com/unicode-org/message-format-wg/blob/main/spec/data-model/README.md), +an example model for dataflow is presented here +using TypeScript notation. + +## Proposed Design + +### Taxonomy + + +This proposal introduces several new concepts +and eliminates the term "resolved value" from the spec. + +* _Nameable value_: A value that a variable can be bound to +in the runtime environment that defines local variables. +This concept is introduced to make it easier for the spec +to formally account for both eager and lazy evaluation. +In an eager implementation, "nameable value" is synonymous +with "annotated formattable value", +while in a lazy implementation, a "nameable value" would be a +closure (code paired with the set of in-scope variables).
+ +> [!IMPORTANT] +> +> In the rest of this proposal, we elide the distinction +> between a _nameable value_ and +> an _annotated formattable value_, +> as the semantics of MessageFormat can be +> implemented either eagerly or lazily +> with the same observable results +> (other than possible differences in the set of errors). +> Examples assume eager evaluation. +> In a lazy implementation, _nameable values_ would +> be constructed when processing a _declaration_, +> and these values would be _forced_ when +> formatting a _pattern_ requires it. + +* _Formattable value_: An implementation-specific type +that is the range of the input mapping (message arguments), +as well as what literals format to. + +* _Formatted value_: The result of a formatting function. +This type is implementation-specific, but expected to +at least include strings. + +* _Formatted part_: The ultimate result of formatting +a part of a pattern (_text_ or _expression_). +In an implementation that offers "formatting to parts" +(as in [the Formatted Parts proposal](./formatted-parts.md)), +the _formatted part_ type might be the same as the +_formatted value_ type, or it might be different. +(In the latter case, the implementation might apply an +implementation-specific transformation +that maps a _formatted value_ onto a _formatted part_ +in a particular _formatting context_.) + +* _Fallback value_: A value representing a formatting error. + * For simplicity, this proposal elides the details of + error handling and thus this type is + not discussed further. + +* _Annotated formattable value_: Encapsulates any +value that can be passed as an argument to a formatting function, +including: + * formattable values that don't yet have formatted output + associated with them + * previously formatted values, which can be reformatted by + another formatting function + * fallback values +The values of named options passed to formatting functions +are also _annotated formattable values_. 
+ +> NOTE: Names are subject to change; an _annotated formattable value_ +> could also be called an _operand value_, +> since it has the capability of being passed into a function. + +* _Markup value_: The result of formatting a markup item +(disjoint from _annotated formattable values_). + * For simplicity, we elide the details of markup values + from this proposal. + +* _Preformatted value_: A formattable value paired with a +formatted value (and some extra information). + +> NOTE: A _preformatted value_ could also be called an _annotated value_ +> (see Example 1.4) + +In the current spec, a "resolved value" can be a +nameable value, formattable value, or preformatted value, +depending on context. + +This proposal keeps the "formattable value" and "formatted value" +types implementation-specific while defining wrapper types +around them, which is intended to strike a balance between +freedom of choice for implementors +and specifiability. + +The following diagram shows the relationships between types, +for a typical implementation that supports formatting to parts. +The largest red box shows the boundaries of the formatter itself. +_Formattable values_ flow into the formatter from the input mapping +and from source code (literals). +Squiggly green lines indicate inclusion (for example, +an _annotated formattable value_ includes a _formattable value_, +as shown by the green line +from `AnnotatedFormattableValue` to `FormattableValue`.) +The orange box represents the execution context for functions. +_Annotated formattable values_ flow in and out of it +through calls and returns (the call and return operations +are implementation-specific.) +An implementation-specific `format` operation can +map a _formattable value_ to a _formatted value_ +(when a value needs to be formatted with defaults according to its type) +as well as mapping a _formatted value_ to a _formatted part_. 
+ +![Diagram of different value types](./dataflow1.jpg) + +#### Example Implementation + +The following diagram is a schematic of +an _example_ implementation. +The example is an eager implementation +that provides "format to parts" functionality. +(For expository reasons, we define a separate +_nameable value_ type even though this would +be the same as _annotated formattable value_.) + +The schematic does not show +inclusions between option lists +and other types. + +![Diagram of example implementation](./dataflow2.jpg) + +A way to think about an implementation declaratively +is to specify some operations and their type signatures. + +Types: + +- Environment : Name → NameableValue +- Callable : Name × OptionMap +- OptionMap : Name → AnnotatedFormattableValue + +Operations: + +* FORCE : NameableValue → AnnotatedFormattableValue +* STORE : Environment -> Name -> NameableValue -> Environment +* LOOKUP-GLOBAL : Name -> FormattableValue +* LOOKUP-LOCAL : Environment -> Name -> NameableValue +* EVAL-LITERAL : Literal -> String +* STRING-TO-FORMATTABLE : String -> FormattableValue +* CALL : Callable -> AnnotatedFormattableValue -> AnnotatedFormattableValue +* FORMAT-TO-VALUE : FormattableValue -> FormattedValue +* FORMAT-TO-PART : FormattedValue -> FormattedPart + +(This presentation does not handle errors, e.g., +what happens if LOOKUP-GLOBAL is invoked on a name +it doesn't have a mapping for.) + +In the diagram, single green arrows represent extracting a named field, +and double green arrows represent wrapping an object in a larger object. +Pink arrows represent operations listed above. + +In this model, summarizing the relationships between the types: + +* A _nameable value_ can be FORCED to an _annotated formattable value_. +* A _formattable value_ can be wrapped in an _annotated formattable value_ + and can be extracted from an _annotated formattable value_. 
+ It can also be queried from the input mapping via LOOKUP-GLOBAL + and can be formatted from a literal via EVAL-LITERAL. +* An _annotated formattable value_ can be STORED in a local environment. + It can be queried from a local environment via LOOKUP-LOCAL. + It can be passed to a function via CALL. + It can be returned from a function via CALL. +* A _preformatted value_ can be extracted from an _annotated formattable value_. +* A _formatted value_ can be obtained from an arbitrary string, + or formatted from a _formattable value_ via FORMAT-TO-VALUE, + or extracted from a _preformatted value_. + It can be formatted to a _formatted part_ via FORMAT-TO-PART. +* A _formatted part_ can be formatted from a _formatted value_ via FORMAT-TO-PART. + +The above exploration is not normative; +rather, it shows how these types could be used to structure an implementation. + +### Interfaces + +Since the rest of the proposal assumes eager evaluation, +this section omits a definition for _nameable values_. + +``` +interface FormattableValue { + type: "formattable"; + value: unknown; // implementation-dependent +} + +interface FormattedValue { + type: "formatted"; + value: unknown; // implementation-dependent +} + +interface FormattedPart { + type: "formattedPart"; + value: unknown; // implementation-dependent +} + +type FormatterInput = + | FallbackValue + | FormattableValue + | PreformattedValue; + +interface AnnotatedFormattableValue { + type: "annotatedFormattable"; + source: string; + value: FormatterInput; +} + +interface FallbackValue { + type: "fallback"; + fallback: string; +} + +interface PreformattedValue { + type: "preformatted"; + options: Iterable<{ name: string; value: AnnotatedFormattableValue}>; + formatter: string; + input: AnnotatedFormattableValue; + output?: FormattedValue; +} +``` + +Implementations are free to extend the `AnnotatedFormattableValue` +and `PreformattedValue` interfaces with additional fields.
+ +For example, an implementation could add a `source-text` field +to the `AnnotatedFormattableValue` interface +to track the text (concrete syntax) +of a MessageFormat expression +that was used to construct the value, +for error diagnostics. + +### Function Signatures + +In C++, for example, the interface for custom formatters could look like: + +``` +virtual AnnotatedFormattableValue customFormatter(AnnotatedFormattableValue&& argument, + std::map<std::string, AnnotatedFormattableValue>&& options) const = 0; +``` + +Note that this proposal is orthogonal to the existing +[data model and validation rules for the function registry](https://github.com/unicode-org/message-format-wg/blob/main/spec/registry.md). +Input validation can still be applied to the underlying _formattable values_ and _formatted values_ +wrapped in the `argument` and the values for the `options`. + +## Examples + +### Example 2.1 + +Same code as Example 1.1: +``` +.local $a = {1 :number minIntegerDigits=3} // formats as 001. +.local $b = {$a :number minFractionDigits=3} // formats as 001.000 +``` + +In an implementation with a `FormattedValue` type +that includes a `FormattedNumber` variant, +the right-hand side of `$a` is evaluated to the following _annotated formattable value_: + +``` +AnnotatedFormattableValue { + value: PreformattedValue { + options: [{ name: 'minIntegerDigits'; value: AnnotatedFormattableValue { + value: FormattableValue('3')}}]; + formatter: "number", + input: AnnotatedFormattableValue { + value: FormattableValue('1')}; + output: FormattedValue { value: FormattedNumber('001.') }}} +``` + +Then, the right-hand side of `$b` is evaluated +in an environment that binds the name `a` +to the above _annotated formattable value_.
+The result is: +``` +AnnotatedFormattableValue { + value: PreformattedValue { + options: [{ name: 'minFractionDigits'; value: AnnotatedFormattableValue { + value: FormattableValue('3')}}]; + formatter: "number"; + input: AnnotatedFormattableValue { + value: PreformattedValue { + options: [{ name: 'minIntegerDigits'; value: AnnotatedFormattableValue { + value: FormattableValue('3')}}]; + formatter: "number", + input: AnnotatedFormattableValue { value: FormattableValue('1')} + output: FormattedValue { value: FormattedNumber('001.') }}}; + output: FormattedValue { value: FormattedNumber('001.000') } }} +``` + +Notice that in the second object, +the `input` field's contents are identical to the first object. + +When calling the implementation of the built-in `number` +formatting function (call it `Number::format()`), +while evaluating the right-hand side of `$b`, +`number` receives two sets of options, in different ways: +* The option `minFractionDigits=3` is passed to `Number::format()` + in its `options` argument. +* The option `minIntegerDigits=3` is embedded in the `argument` value. + +In general, the previous formatter might be different from +`Number::format()`, so all the "previous" options have to be +separated from the options in the option map. +The solution in this proposal encodes them in a tree-like structure +representing the chain of previous formatting calls +(really a list-like structure, since formatters have a single argument). + +### Example 2.2 + +This example motivates why option values need to be +_annotated formattable values_. 
+ +Same code as Example 1.3: +``` +.input {$item :noun case=accusative count=1} +.local $colorMatchingGrammaticalNumberGenderCase = {$color :adjective accord=$item} +``` + +When processing the `.input` declaration, +supposing that `item` is bound in the input mapping +to the string 'balloon', +and `color` is bound in the input mapping +to the string 'red', +the name `item` is bound in the runtime environment +to the following _annotated formattable value_: + + +``` +AnnotatedFormattableValue { + value: PreformattedValue { + options: [{ name: 'case'; value: AnnotatedFormattableValue { + value: FormattableValue('accusative')}}, + { name: 'count'; value: AnnotatedFormattableValue { + value: FormattableValue('1')}}]; + formatter: "noun", + input: AnnotatedFormattableValue { + value: FormattableValue('balloon')}; + output: FormattedValue { value: 'balloon' }}} +``` + +(As this example uses English, the `case` option has no effect +on the formatted output.) + +Then, when processing the right-hand side of the `local` declaration, +the argument to the `adjective` formatter is as follows: + +``` +AnnotatedFormattableValue { + value: FormattableValue('red') +} +``` + +and the option mapping maps the name 'accord' to the same +_annotated formattable value_ that was returned by the `noun` formatter, +shown above. 
+ +The result of the call to the `adjective` formatter looks like: + +``` +AnnotatedFormattableValue { + value: PreformattedValue { + options: [{name: 'accord'; value: AnnotatedFormattableValue { + value: PreformattedValue { + options: [{ name: 'case'; value: AnnotatedFormattableValue { + value: FormattableValue('accusative')}}, + { name: 'count'; value: AnnotatedFormattableValue { + value: FormattableValue('1')}}]; + formatter: "noun", + input: AnnotatedFormattableValue { + value: FormattableValue('balloon')}; + output: FormattedValue { value: 'balloon' }}}}]; + formatter: 'adjective'; + input: AnnotatedFormattableValue { value: FormattableValue('red') }; + output: FormattedValue { value: 'red' }}} +``` + +Note that the value of the 'accord' option +in the outer `options` field of the `value` field +is the same as the first `AnnotatedFormattableValue` in this subsection. +As before, since the example uses English, the `accord` option has +no effect on the output. + +If the output of the `adjective` formatter was formatted by a subsequent +formatter, it would be able to inspect the value of the output's +`accord` option, along with all of **its** options. + +### Example 2.3 + +For the same code as Example 1.4: +``` +.local $foo = {$arg :func} +``` + +To get the custom functionality described in that example, +the formatter would proceed as follows, +supposing that in the _input mapping_, the name `arg` +is bound to the value `42`, +and the implementation chooses to annotate all integer inputs +with the option `foo`. + +* Bind the name `foo` to the following _annotated formattable value_. +* Pass an argument with the following structure to `func`: + +``` +AnnotatedFormattableValue { + value: PreformattedValue { + options: [{ name: 'foo'; value: FormattableValue('bar') } /* , etc. 
 */]; + formatter: "default", + input: AnnotatedFormattableValue { + value: FormattableValue('42')}; + } +} +``` + +("Default" is just an arbitrary name for the purposes of the example. +The implementation could choose any unbound formatter name +to indicate that +this value was constructed by applying "default options" +rather than calling a formatter.) + +If the function `func` returns the string +`"baz"` in this case, +the return value from `func` would look like: + +``` +AnnotatedFormattableValue { + value: PreformattedValue { + options: []; + formatter: "func"; + input: AnnotatedFormattableValue { + value: PreformattedValue { + options: [{ name: 'foo'; value: FormattableValue('bar') } /* , etc. */]; + formatter: "default", + input: AnnotatedFormattableValue { + value: FormattableValue('42')}; + }} + output: FormattedValue('baz'); + } +} +``` + +Note that the outer `input` field is the same as +the previous `AnnotatedFormattableValue`. + +## Alternatives Considered + +### Severely limit how local variables are used + +Perhaps the most restrictive option is to forbid composition by +restricting where local variables can be referenced. + +Suppose that local variables can only be used in patterns, +not in the right-hand sides of subsequent declarations. +Furthermore, suppose that they can only appear unannotated +in patterns. This could be enforced via a new type of data model +error. + +This is probably too severe, because we want to be able to write: + +``` +.local $x = {1} +.match {$x :number} +* {{wildcard}} +``` + +and not just: + +``` +.match {1 :number} +* {{wildcard}} +``` + +Such a solution would make local declarations much less useful. + +### Not defining the shape of inputs or outputs to custom formatters + +Leave it to implementations, +as is currently done in the spec with "resolved values". + +The disadvantage of this approach is that +it means an implementation can be spec-compliant +without providing composable functions. 
+ +### Functions return minimal results; formatter fills in extra results + +In this alternative: the function argument would still be +an _annotated formattable value_, +but the function can just return a _formatted value_ +since otherwise, it's just copying identical fields into the +result (the source text doesn't change; the formatter name +and resolved options are already known by the caller code +in the MessageFormat implementation; etc.) + +But, what happens if a function "wants" to just +return the _formattable value_ that is passed in; +if the result type of the function is the same as _formatted value_, +then this can't be expressed. This suggests: + +### Functions return the union of _formatted value_ and a _formattable value_ + +This is hard to express in some programming languages' type systems. + +### Encode metadata in formatting context + +The argument to the function would be the union of +a _formattable value_ and a _formatted value_ (to allow reformatting) +and the function would have to access the formatting context to +query its previously-passed options, etc. + +This does make the common case simple (most custom functions will +probably not need to inspect values in this way), +but allowing the argument to be a union type +also makes it hard to express the function signature +in some programming languages. + +### Functions don't preserve options and can't inspect previous options + +May violate intuition (as with the number example) +or make grammatical transformations much harder to implement +(as in the accord example) + +### Other representations + +* Alternative: No _annotated formattable values_; just +a _formattable value_ with some required fields, and +specify that the implementation may add further fields. + +* Alternative: Still have an _annotated formattable value_ +that wraps a _formattable value_, +but instead of having a separate _preformatted value_ type, +combine the two types and consider some fields optional +(i.e. 
the fields that only appear in a _preformatted value_ +and not a _formattable value_). + +* Alternative: flat structure instead of tree structure. Consider: + +``` +AnnotatedFormattableValue { + value: PreformattedValue { + options: [{'a': AnnotatedFormattableValue(FormattableValue(1))}]; + formatter: "F", + input: AnnotatedFormattableValue { + value: PreformattedValue { + options: [{'b': AnnotatedFormattableValue(FormattableValue(2))}]; + formatter: "F"; + input: AnnotatedFormattableValue { + value: PreformattedValue { + options: [{'c': AnnotatedFormattableValue(FormattableValue(3))}]; + formatter: "F"; + input: AnnotatedFormattableValue(FormattableValue('foo')); + output: FormattedValue("X"); + } + } + output: FormattedValue("Y")}} + output: FormattedValue("Z")}} +``` + +for some formatter F. Recursing through this tree structure to find all the previous options +might be tedious. +For nested values where the formatter is the same for all the nested values, +a single `options` list might be more convenient. +However, it's unclear how to use a flat representation if the nested values +are produced by different formatters that take different options. +What if the output of formatter `F`, which was passed an option `a`, +is passed back into another formatter `G` which also takes an option `a`, +that has a different semantics from `F`'s semantics for `a`? + +### Restricting composition + +If composition of different functions was disallowed +(made into a data model error), then +there would be no need for a common representation of output values +for all functions. + +## Incidental notes + +The spec currently says: + +> Function access to the _formatting context_ MUST be minimal and read-only, +> and execution time SHOULD be limited. + +Choosing the right representation for _annotated formattable values_ +might reduce the need for functions to access the _formatting context_, +though it would probably not eliminate that need. 
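
The nested representation discussed under "Other representations" can be walked with a short loop, because each value has at most one `input`. The TypeScript below is a minimal sketch and not part of the proposal. The type and field names mirror the pseudo-notation used in the examples, while the `kind` discriminant, the `source` field, and the helpers `lit` and `previousOptions` are assumptions introduced only for illustration.

```typescript
// Hypothetical TypeScript rendering of the structures used in the examples.
// None of these names come from an existing library.

// An unformatted value; `source` holds its literal text (assumed field name).
interface FormattableValue { kind: 'formattable'; source: string }

interface FormattedValue { value: string }

interface ResolvedOption { name: string; value: AnnotatedFormattableValue }

// The record of one formatter call: its options, its input, and its output.
interface PreformattedValue {
  kind: 'preformatted';
  options: ResolvedOption[];
  formatter: string;
  input: AnnotatedFormattableValue;
  output: FormattedValue;
}

interface AnnotatedFormattableValue { value: FormattableValue | PreformattedValue }

interface OptionRecord { formatter: string; name: string; value: AnnotatedFormattableValue }

// Collect every option passed along the chain of previous formatting calls,
// outermost call first. Each record keeps its formatter name, so an option `a`
// given to one formatter is not confused with an option `a` given to another.
function previousOptions(v: AnnotatedFormattableValue): OptionRecord[] {
  const found: OptionRecord[] = [];
  let current = v.value;
  while (current.kind === 'preformatted') {
    const formatter = current.formatter;
    for (const opt of current.options) {
      found.push({ formatter, name: opt.name, value: opt.value });
    }
    current = current.input.value; // step to the previous call's input
  }
  return found;
}

// Helper that wraps a literal as an annotated formattable value.
const lit = (source: string): AnnotatedFormattableValue =>
  ({ value: { kind: 'formattable', source } });

// The nested value from the `number` example: minIntegerDigits=3 was applied
// first, then minFractionDigits=3 was applied to that result.
const inner: AnnotatedFormattableValue = {
  value: {
    kind: 'preformatted',
    options: [{ name: 'minIntegerDigits', value: lit('3') }],
    formatter: 'number',
    input: lit('1'),
    output: { value: '001.' },
  },
};

const outer: AnnotatedFormattableValue = {
  value: {
    kind: 'preformatted',
    options: [{ name: 'minFractionDigits', value: lit('3') }],
    formatter: 'number',
    input: inner,
    output: { value: '001.000' },
  },
};

console.log(previousOptions(outer).map(o => `${o.formatter}:${o.name}`).join(', '));
// prints "number:minFractionDigits, number:minIntegerDigits"
```

Keeping the formatter name on each collected record is one way to sidestep the ambiguity raised for the flat alternative, at the cost of leaving it to each function to decide which earlier options it understands.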
diff --git a/exploration/dataflow1.jpg b/exploration/dataflow1.jpg new file mode 100644 index 0000000000..53583149c1 Binary files /dev/null and b/exploration/dataflow1.jpg differ diff --git a/exploration/dataflow2.jpg b/exploration/dataflow2.jpg new file mode 100644 index 0000000000..dd8bd451c4 Binary files /dev/null and b/exploration/dataflow2.jpg differ diff --git a/exploration/default-registry-and-mf1-compatibility.md b/exploration/default-registry-and-mf1-compatibility.md index 3890506553..c0fff06066 100644 --- a/exploration/default-registry-and-mf1-compatibility.md +++ b/exploration/default-registry-and-mf1-compatibility.md @@ -81,8 +81,8 @@ Functions for formatting [date/time values](#operands) in the default registry a - `:time` If no options are specified, each of the functions defaults to the following: -- `{$d :datetime}` is the same as `{$d :datetime dateStyle=short timeStyle=short}` -- `{$d :date}` is the same as `{$d :date style=short}` +- `{$d :datetime}` is the same as `{$d :datetime dateStyle=medium timeStyle=short}` +- `{$d :date}` is the same as `{$d :date style=medium}` - `{$t :time}` is the same as `{$t :time style=short}` > [!NOTE] @@ -166,8 +166,8 @@ The function `:date` has these function-specific _style_ options: - `style` - `full` - `long` - - `medium` - - `short` (default) + - `medium` (default) + - `short` The function `:time` has these function-specific _style_ options: - `style` @@ -244,9 +244,9 @@ The followind date/time options are *not* part of the default registry. 
Implementations SHOULD avoid creating options that conflict with these, but are encouraged to track development of these options during Tech Preview: - `calendar` (default is locale-specific) - - valid [Unicode Calendar Identifier](https://cldr-smoke.unicode.org/spec/main/ldml/tr35.html#UnicodeCalendarIdentifier) + - valid [Unicode Calendar Identifier](https://unicode.org/reports/tr35/tr35.html#UnicodeCalendarIdentifier) - `numberingSystem` (default is locale-specific) - - valid [Unicode Number System Identifier](https://cldr-smoke.unicode.org/spec/main/ldml/tr35.html#UnicodeNumberSystemIdentifier) + - valid [Unicode Number System Identifier](https://unicode.org/reports/tr35/tr35.html#UnicodeNumberSystemIdentifier) - `timeZone` (default is system default time zone or UTC) - valid identifier per [BCP175](https://www.rfc-editor.org/rfc/rfc6557) @@ -300,7 +300,7 @@ How to write an MF1 format or selector in MF2: | Plural (selector) | `{num,plural, ...}` | `.match {$num :plural}`
`.match {$num :number}` | | | Ordinal (selector) | `{num,selectordinal, ...}` | `.match {$num :ordinal}`
`.match {$num :number select=ordinal}` | | | Ordinal (format) | `{num,ordinal}` | | missing | -| Date | `{date,date}` | `{$date :date}`
`{$date :datetime}` | short date is default | +| Date | `{date,date}` | `{$date :date}`
`{$date :datetime}` | medium date is default | | Date | `{date,date,short}` | `{$date :date style=short}`
`{$date :datetime dateStyle=short}` | also medium,long,full | | Time | `{date,time}` | `{$date :time}`
`{$date :datetime timeStyle=short}` | shorthand or timeStyle required | | Date | `{date,time,short}` | `{$date :time style=short}`
`{$date :datetime timeStyle=short}` | also medium,long,full | diff --git a/exploration/error-handling.md b/exploration/error-handling.md new file mode 100644 index 0000000000..82a412a176 --- /dev/null +++ b/exploration/error-handling.md @@ -0,0 +1,177 @@ +# Error Handling + +Status: **Accepted** + +
+ Metadata +
+
Contributors
+
@echeran
+
First proposed
+
2024-06-02
+
Issues
+
#782
+
#830
+
#831
+
Pull Requests
+
#795
+
#804
+
Meeting Notes
+
2024-05-06
+
2024-05-13
+
2024-05-20
+
2024-07-15
+
2024-07-22
+
+
+ +## Objective + +Decide whether and what implementations "MUST" / "SHOULD" / "MAY" perform after a runtime error, regarding: + +1. information about error(s) + - including, if relevant, the minimum number of errors for which such information is expected +1. a fallback representation of the message + +## Background + +In practice, +runtime errors happen when formatting messages. +It is useful to provide information about any errors back to the callsite. +It is useful to the end user to provide a best effort fallback representation of the message. +Specifying the behavior in such cases promotes consistent results across conformant implementations. + +However, implementations of MessageFormat 2.0 will be faced with different constraints due to various reasons: + +* Programming language: the language of the implementation informs idiomatic patterns of error handling. +In Java, errors are thrown and subsequently caught in a `try...catch` block. +In Rust, fallible callsites (those which can return errors) should return a `Result` monad. +In both languages, built-in error handling assumes a singular error. +* Environment constraints: as mentioned in [feedback from ICU4X](https://github.com/unicode-org/message-format-wg/issues/782#issuecomment-2103177417), +ICU4X operates in low resource environments for which returning at most 1 error is desirable +because returning more than 1 error would require heap allocation. +* Programming conventions and idioms: in [feedback from ICU-TC](https://docs.google.com/document/d/11yJUWedBIpmq-YNSqqDfgUxcREmlvV0NskYganXkQHA/edit#bookmark=id.lx4ls9eelh99), +they found over the 25 years of maintaining the library that there was more cost than benefit in additionally providing error information with a default best effort return value compared to just returning the default best effort value. 
+The additional constraint in ICU4C's C++ style to return an error code rather than throwing errors using the STL further complicates the usefulness and likelihood to be used correctly by developers, especially during nested calls. + +> [!NOTE] +> The wording in this document uses the word "signal" with regard to providing +> information about an error rather than "return" or "emit" when referring to +> a requirement that an implementation must at least indicate that an error has +> occurred. +> The word "signal" better accommodates more alternatives in the solution space +> like those that only choose to indicate that an error occurred, +> while still including those that additionally prefer to return the error +> itself as an error object. +> (By contrast, "return an error" implies that an error object will be thrown or +> returned, and "emit an error" is ambiguous as to what is or isn't performed.) +## Use Cases + +As a software developer, I want message formatting calls to signal runtime errors +in a manner consistent with my programming language/environment. +I would like error signals to include diagnostic information that allows me to debug errors. + +As a software developer, I sometimes need to be able to emit a formatted message +even if a runtime error has occurred. + +As a software developer, I sometimes want to avoid "fatal" error signals, +such as might occur due to unconstrained inputs, +errors in translation of the message, +or other reasons outside my control. +For example, in Java, throwing an Exception is a common means of signaling an error. +However, `java.text.NumberFormat` provides both throwing and non-throwing +`parse` methods to allow developers to avoid a "fatal" throw of `ParseException` +(if the exception were uncaught). + +As a MessageFormat implementer, I want to be able to signal errors in an idiomatic way +for my language and still be conformant with MF2 requirements. + +## Accepted Design + +The following design was selected in #830. 
+ +### MUST signal errors and MUST provide fallback + +* Implementations MUST provide a mechanism for signaling errors. There is no specific requirement for what form signaling an error takes. +* Implementations MUST provide a mechanism for getting a fallback representation of a message that produces a formatting or selection error. Note that this can be entirely separate from the first requirement. +* An implementation is not conformant unless it provides access to both behaviors. It is compliant to do both in a single formatting attempt. + +> In all cases, when encountering an error, +> a message formatter MUST be able to signal an error or errors. +> It MUST also provide the appropriate fallback representation of the _message_ defined +> in this specification. + +This alternative requires that an implementation provide both an error signal +and a means of accessing a "best-effort" fallback message. +This slightly relaxes the requirement of "returning" an error +(to allow a locally-appropriate signal of the error). + +Under this alternative, implementations can be conformant by providing +two separate formatting methods or functions, +one of which returns the fallback string and one of which signals the error. + +Similar to the current spec text, +this alternative requires implementations to provide useful information: +both a signal that an error occurred and a best effort message. +A downside to this alternative is that these requirements together assume that +all implementations will want to pay the cost of constructing a representative message +after the occurrence of an error. + +## Alternatives Considered + +### Current spec: require information from error(s) and a representative best effort message + +The current spec text says: + +> In all cases, when encountering a runtime error, +> a message formatter MUST provide some representation of the message. +> An informative error or errors MUST also be separately provided. 
+ +This alternative places constraints on implementations to provide multiple avenues of useful information (to the callsite and user). + +This alternative establishes constraints that would contravene the constraints that exist in projects that have implemented MF 2.0 (or likely will soon), based on: +* programming language idioms/constraints +* execution environment constraints +* experience-based programming guidelines + +For example, in ICU, +[the suggested practice](https://docs.google.com/document/d/11yJUWedBIpmq-YNSqqDfgUxcREmlvV0NskYganXkQHA/edit#bookmark=id.lx4ls9eelh99) +is to avoid additionally returning optional error codes when providing best-effort formatted results. + +### MUST signal errors and SHOULD provide fallback + +* Implementations MUST provide a mechanism for signaling errors. There is no specific requirement for what form signaling an error takes. +* Implementations SHOULD provide a mechanism for getting a fallback representation of a message that produces a formatting or selection error. Note that this can be entirely separate from the first requirement. +* Implementations are conformant if they only signal errors. + +### SHOULD signal errors and MUST provide fallback + +* Implementations SHOULD provide a mechanism for signaling errors. There is no specific requirement for what form signaling an error takes. +* Implementations MUST provide a mechanism for getting a fallback representation of a message that produces a formatting or selection error. Note that this can be entirely separate from the first requirement. +* Implementations are conformant if they only provide a fallback representation of a message. + + +### Error handling is not a normative requirement + +* Implementations are not required by MF2 to signal errors or to provide access to a fallback representation. + - The specification provides guidance on error conditions; on what error types exist; and what the fallback representation is. 
+ +> When encountering an error during formatting, +> a message formatter MAY provide some representation of the message, +> or it MAY provide an informative error or errors. +> An implementation MAY provide both. + +This alternative places no expectations on implementations, +which supports the constraints we know now, +as well as any possible constraints in the future +(ex: new programming languages, new execution environments). + +This alternative does not assume or assert that some type of useful information +(error info, representative message) +will be possible and should be returned. + +### Alternate wording + +> When an error is encountered during formatting, +> a message formatter can provide an informative error (or errors) +> or some representation of the message or both. \ No newline at end of file diff --git a/exploration/exact-match-selector-options.md b/exploration/exact-match-selector-options.md index d35f47dbdc..0280b04ebb 100644 --- a/exploration/exact-match-selector-options.md +++ b/exploration/exact-match-selector-options.md @@ -1,4 +1,4 @@ -# Design Proposal Template +# Name of the "Exact Match" selector function Status: **Accepted** diff --git a/exploration/expression-attributes.md b/exploration/expression-attributes.md index 33009dc95b..0253fc49e0 100644 --- a/exploration/expression-attributes.md +++ b/exploration/expression-attributes.md @@ -1,16 +1,22 @@ # Expression Attributes -Status: **Proposed** +Status: **Accepted**
Metadata
Contributors
@eemeli
+
@aphillips
First proposed
2023-08-27
-
Pull Request
+
Pull Requests
#458
+
#772
+
#780
+
#792
+
#845
+
#846
@@ -24,43 +30,118 @@ Function options may influence the resolution, selection, and formatting of anno These provide a great solution for options like `minFractionDigits`, `dateStyle`, or other similar factors that influence the formatted result. -However, this single bag of options is not appropriate in all cases, -in particular for attributes that pertain to the expression as a selector or a placeholder. -For example, many of the [XLIFF 2 inline element] attributes don't really make sense as function options. +Such options naturally correspond to function arguments or builder-style function constructors. +Each option is specific to the associated API. +Message authors, such as translators or developers, want consistent ways to do common tasks, +such as providing hints to translation tools or overriding the locale, +but, unless MessageFormat provides otherwise, cannot rely on implementations +to consistently implement these. + +To reduce the learning curve for users and improve consistency, +it would be useful to have common options +(generally those related to the formatting context) +shared between all functions. + +Separately from formatting concerns, +it is often useful to attach other information to message expressions and markup. +For example, presenting how an example value could be formatted can be very useful for the message's translation. +Providing the original source representation of a placeholder may be essential for being able to format a non-MF2 message, +if it has been transformed to MF2 to provide translators with a unified experience. +As a specific example, many of the [XLIFF 2 inline element] attributes have no meaning +to the function that they appear as options or annotations of. 
 [XLIFF 2 inline element]: http://docs.oasis-open.org/xliff/xliff-core/v2.1/os/xliff-core-v2.1-os.html#inlineelements ## Use-Cases +### User Story: Formatting Context Override +As a message author, I want to override values in the _formatting context_ for a specific _expression_. +I would like to do this in a consistent, effective manner that does not require a change to the +_function_ or _markup_ support code in order to be effective. +As far as the code is concerned, it just reads the value from the _formatting context_ normally. + +A common example of this is the _locale_. +Overriding the locale used by a function might be needed if I want a specific locale chosen: +``` +You format {42 :number @locale=fr} like this in French. +``` +Or if I want to supply it in a variable: +``` +You format {42 :number @locale=$userSpecified} like this in {$userSpecified} +``` + +Other examples include _direction_ or the _time zone_: +``` +The MAC address is always LTR: {$mac :string @dir=ltr} +I don't want the system default time zone or the one in $d: {$d :date @timezone=|America/Phoenix|} +``` + +An implementation might want to override a custom contextual value: +``` +You format this specially: {42 :number @amzn:marketplace=US} +``` + +### User Story: Translation Tooling +As a translator or developer, I want to ensure that instructions to CAT tools, +including information for human translators or information that helps MT, can be included in +the message and preserved through the translation process. + +In general, such instructions, metadata, etc. do not affect the runtime formatting of the message. +Implementers of functions or markup do not wish to access these and might be annoyed if +the names of translation-related fields conflict with the normal naming of options. +Message compilers might remove these expression attributes when creating messages for use by the runtime. 
+ +Some examples include: +- In addition to supporting a limited set of HTML elements, + Android String Resources use `` to wrap + [nontranslatable content](https://developer.android.com/guide/topics/resources/localization#mark-message-parts). + This is best represented in MF2 with a `@translate=no` attribute. +- Web extension `messages.json` files allow for named [placeholders](https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/i18n/Locale-Specific_Message_reference#placeholders) + that are mapped to indexed arguments. + These may include an example, which is best represented in MF2 as an `@example=...` attribute. + +- In #772, @eemeli calls out: + > While working on [moz.l10n](https://github.com/mozilla/moz-l10n/), + > a new Python localization library that uses the MF2 message and + > [resource data model](https://github.com/eemeli/message-resource-wg/pull/16) to represent messages + > from a number of different current syntaxes, + + Apple's Xcode supports localization of plural messages via `.stringsdict` XML files, + which encode the plural variable's name as a `NSStringLocalizedFormatKey` value, + where it appears as e.g. `%#@countOfFoo@` or similar. + To display only the relevant "countOfFoo" name of this variable to localizers as context, + it's best to use a `@source=...` attribute on the selector. + +### General Use Cases At least the following expression attributes should be considered: - Attributes with a formatting runtime impact: - - `id` — An identifier for the expression. + - `id` — An identifier for the expression or markup. This is included in the formatted part, - and allows the parts of an expression to be explicitly addressed. + and allows each part of a message to be explicitly addressed. > Example identifying two literal numbers: > > ``` - > The first number was {1234 :number @id=first} and the second {56789 :number @id=second}. 
+ > The first number was {1234 :number u:id=first} and the second {56789 :number u:id=second}. > ``` - `locale` — An override for the locale used to format the expression. - Should be expressed as a non-empty sequence of BCP 47 language codes. - > Example embedding a French literal in an English message: + > Example embedding a French date in an English message: > > ``` - > In French, “{|bonjour| @locale=fr}” is a greeting + > In French, this date would be displayed as {|2024-05-06| :date u:locale=fr} > ``` - `dir` — An override for the LTR/RTL/auto directionality of the expression. - > Example explicitly isolating the directionality of a placeholder: + > Example explicitly isolating the directionality of a placeholder + > for a custom user-defined function: > > ``` - > Welcome, {$username @dir=auto} + > Welcome, {$user :x:username u:dir=auto} > ``` - Attributes relevant for translators, tools, and other message operations, @@ -87,15 +168,26 @@ At least the following expression attributes should be considered: whether the expression should or should not be localised. The values here correspond to those used for this property in HTML and elsewhere. - > Example embedding a non-translatable French literal in an English message: + > Example embedding a non-translatable CLI command in a message: > > ``` - > In French, "{|bonjour| @locale=fr @translate=no}" is a greeting + > Use {+code @translate=no}git ls-files{-code} to list all files in a repository. > ``` - `canCopy`, `canDelete`, `canOverlap`, `canReorder`, etc. — Flags supported by XLIFF 2 inline elements +- Attributes used to represent features of other localization syntaxes + when parsed to a MessageFormat 2 data model. + + - `source` — A literal value representing the source syntax of an expression. 
+ + > Example selector representing an Xcode stringsdict `NSStringLocalizedFormatKey` value: + > + > ``` + > .match {$count :number @source=|%#@count@|} + > ``` + ## Requirements Attributes can be assigned to any expression, @@ -103,20 +195,16 @@ including expressions without an annotation. Attributes are distinct from function options. -Common attributes are defined by the MF2 specification -and must be supported by all implementations. +Common options or attributes should work the same way in different functions. + +Special options or attributes should not conflict with other option names. Users may define their own attributes. Implementations may define their own attributes. -Some attributes may have an effect on the formatting of an expression. -These cannot be defined within comments either within or outside a message. - Each attribute relates to a specific expression. -An attribute's scope is limited to the expression to which it relates. - Multiple attributes should be assignable to a single expression. Attributes should be assignable to all expressions, not just placeholders. @@ -131,38 +219,132 @@ the reserved/private-use rules will need to be adjusted to support attributes. ## Proposed Design -Add support for option-like `@key=value` attribute pairs at the end of any expression. +Provide separate solutions for attributes that impact formatting, +and those which do not, +so that their namespaces are not comingled. + +### Contextual options + +Define the expected values and handling for the following options +wherever they are used: + +- `u:id` — A string value that is included as an `id` or other suitable value + in the formatted part(s) for the placeholder, + or any other structured formatted results. + Ignored when formatting to a string, but could show up in error messages. +- `u:locale` — A comma-delimited list of BCP 47 language tags, + or an implementation-defined list of such tags. 
+ The tags are parsed, and they replace the _locale_ + defined in the _formatting context_ for this expression or markup. +- `u:dir` — One of the string values `ltr`, `rtl`, or `auto`. + Replaces the character directionality + defined in the _formatting context_ for this expression or markup. -If the syntax for function options is extended to support flag-like options -(see #386), -also extend expression attribute syntax to match. +Error handling should be well defined for invalid values. -To distinguish expression attributes from options, +Additional restrictions could be imposed, +e.g. requiring that each `u:id` is unique within a formatted message. + +### Attributes + +Add support for standalone `@key` as well as +option-like `@key=value` attribute pairs with a literal value +at the end of any expression or markup. + +To distinguish attributes from options, +require `@` as a prefix for each attribute assignment. +Examples: `@translate=yes` and `@xliff:canCopy`. + +Do not allow expression or markup attributes to influence the formatting context, +or pass them to function handlers. + +Drop variable values from the `attribute` rule: + +```diff +-attribute = "@" identifier [[s] "=" [s] (literal / variable)] ++attribute = "@" identifier [[s] "=" [s] literal] +``` + +## Alternatives Considered + +### Do nothing + +Continue to [caution](https://github.com/unicode-org/message-format-wg/blob/d38ff326d2381b3ef361e996c3431d1b251518d6/spec/syntax.md#attributes) +function authors and other implementers away from creating function-specific or implementation-specific option values +for the use cases presented above. + +As should be obvious, the current situation is not tenable in the long term, and should be resolved. + +### Do not provide any guidance + +Do not include in the spec rules or guidance for declaring formatted part identifiers, +or overriding the message locale or directionality. 
+ +Do not define a common way to communicate information +about an expression or markup to translators or tools. + +This would mean not defining anything for default registry functions either, +effectively requiring implementation-specific options like `icu:locale`. + +Other functions could use their own definitions and handling for similar options, +such as `locale` or `x:lang`. + +Formatted parts for markup would not be able to directly include an identifier. + +If not explicitly defined, less information will be provided to translators. + +Function options may be used as a workaround, +but each implementation and user will end up with different practices. + +### Define options for default registry only + +Do not define a common way to communicate information +about an expression or markup to translators or tools. +Instead, define at least `locale` and `dir` as options for default registry functions, +with handling internal to each function implementation. + +Other functions could use their own definitions and handling for similar options, +such as `locale` or `x:lang`. + +Formatted parts for markup would not be able to directly include an identifier. +Implementations and users will need to invent their own practices with +markup option names like `l10n-id`, `mf:id`, or `markupId` +to refer to specific markup parts. + +Do not define a common way to communicate information +about an expression or markup to translators or tools. + +### Use attributes also for contextual options + +Add support for standalone `@key` as well as +option-like `@key=value` attribute pairs with a literal or variable value +at the end of any expression or markup. + +To distinguish attributes from options, require `@` as a prefix for each attribute asignment. Examples: `@translate=yes` and `@locale=$exprLocale`. 
-Define the meaning and supported values of some expression attributes in the specification, +Define the meaning and supported values of some attributes in the specification, including at least `@dir` and `@locale`. To support later extension of the specified set of attributes while allowing user extensibility, -suggest custom attribute names to include a U+002D Hyphen-Minus `-`. -Examples: `@can-copy=no`, `@note-link=|https://...|`. +require custom attribute names to be namespaced. +Examples: `@xliff:can-copy=no`, `@note:link=|https://...|`. Allow expression attributes to influence the formatting context, but do not directly pass them to user-defined functions. -## Alternatives Considered +### Use function options, but with some suggested "discard" namespace like `_` -### Do not support expression attributes +Examples: `_:translate=yes` and `_:example=|World|`. -If not explicitly defined, less information will be provided to translators. +Requires reserving an additional namespace. -Function options may be used as a workaround, -but each implementation and user will end up with different practices. +Requires cooperation from implementers to ignore all options using the namespace. -### Use function options, but with some suggested prefix like `_` +Makes defining namespaced attributes difficult. -A bit less bad than the previous, but still mixes attributes and options into the same namespace. +Could be combined with the definition of `u:dir`, `u:locale`, and `u:id` contextual options. At least a no-op function is required for otherwise unannotated expressions. @@ -176,17 +358,6 @@ esp. if a similar expression is used in multiple variants. Comments should not influence the runtime behaviour of a formatter. 
-### Define `@attributes` as above, but explicitly namespace custom attributes - -As namespacing may also be required for function names and function option names, -and because we want to allow at least for custom function options -to be definable on default formatters, -the namespace rules for parts of the specification would end up differing. - -By suggesting instead of requiring, -we rely on our stability policy to guide implementations to keep clear of the namespace -that may be claimed by later versions of the specification. - ### Enable function chaining within a single expression By allowing for multiple annotation functions on a single expression, diff --git a/exploration/function-composition-part-1.md b/exploration/function-composition-part-1.md new file mode 100644 index 0000000000..3fb2677136 --- /dev/null +++ b/exploration/function-composition-part-1.md @@ -0,0 +1,1193 @@ +# Function Composition + +Status: **Obsolete** + +
+Metadata
+
+* Contributors: @catamorphism (also see Acknowledgments)
+* First proposed: 2024-03-26
+* Pull Requests: #753, #806
+ +## Objectives + +* Present a complete list of alternative designs for how to +provide the machinery for function composition. +* Create a shared vocabulary for discussing these alternatives. + +> [!NOTE] +> This design document is preserved as part of a valuable conversation about +> function composition, but it is not the basis for the design eventually +> accepted. + +### Problem statement: defining resolved values + +The problem defined in this design document is that +the meaning of function composition is ambiguous. + +An obstacle to disambiguating its meaning +is the absence of a definition of "resolved value". + +A necessary but not sufficient condition for specifying +the meaning of function composition +is to define the structure +of the inputs and outputs to function implementations. +These definitions must be flexible enough to accommodate +different implementation languages and design choices, +while also being concrete enough for ease of reasoning. + +It's much easier to think about what it means for +one function to operate on the result of another function +if we know what the inputs and outputs to those functions +look like. + +This design document primarily attempts to address +the constraints that a definition of "resolved value" must satisfy. + +The spec currently leaves the term "resolved value" +undefined. +At the same time, the spec implicitly constrains +the form of a resolved value +through the operations on resolved values that it defines. +Implementing the spec requires the implementor to infer these constraints. +An indirect goal of this document +is to begin making those constraints explicit +so that the implementor can consult documentation +rather than inferring requirements. + +Defining resolved values implies several subsidiary goals, +as there are several things that can be done with resolved values +at runtime: + +1. Resolved values can be bound to _variables_ + (whose names arise from _declarations_ in the syntax). +2. 
Resolved values can be passed to function implementations. +3. Resolved values can be returned from function implementations. + +This implies: + +4. The resolved value returned from one function implementation +can be passed to another function implementation, +either one that implements the same function or a different function. + +which is the problem of composition. + +### Subsidiary problems + + 1. Define the runtime meaning of `.local` (and `.input`). + +> _In a declaration, the resolved value of the expression is bound to a variable, which is available for use by later expressions_ +(["Expression and Markup Resolution"](https://github.com/unicode-org/message-format-wg/blob/main/spec/formatting.md#expression-and-markup-resolution)). + +The term "resolved value" is left implementation-dependent, but its meaning affects the observable behavior of the formatter. (Note: Since `.input` can be treated as syntactic sugar for `.local`, we need only consider `.local` in examples.) + + 2. Define the types that custom functions manipulate. + +> _Call the function implementation with the following arguments... If the expression includes an operand, its resolved value. +> ....If the call succeeds, resolve the value of the expression as the result of that function call._ +(["Function Resolution"](https://github.com/unicode-org/message-format-wg/blob/main/spec/formatting.md#function-resolution) ) + +The same term, "resolved value", is used to describe the values +that function implementations consume and produce. +An implementation that supports custom functions needs to define concrete types +(whose details depend on the underlying programming language of the implementation) +that capture all the details of "resolved values". +Implementors would benefit from guidance +on how to map the MessageFormat concept of a "resolved value" +onto a concrete type. + + 3. 
Define the semantics of composition + +A function can consume a value produced by another function, +since the language provides `.local` declarations +and any _variable_ can appear in an _expression_ +that has an _annotation_. +(Functions implement _annotations_.) +The meaning of this composition depends on the definition +of "resolved value". +Even with a precise definition of "resolved value", +an additional question is raised of what the contract +with custom functions needs to be +in order to conform with the semantics of MessageFormat. +Disambiguating the composability question +affects observable behavior. + + 4. Provide useful names for different kinds of functions + +Currently, a given MessageFormat function can be thought of as a +formatter, a selector, or both. +Because the implementations for formatters and selectors +naturally have different type signatures +(a formatter consumes and produces a resolved value, +while a selector produces a list of keys), +a single MessageFormat function that supports both formatting and selection +is expected to symbolize multiple implementations. +It may be useful to think of some functions as "transformers" +that consume and produce resolved values, +and others as "formatters" that consume a resolved value +and produce a formatted value. +Defining these terms is another problem this document outlines. + +## Use Cases + +Given the breadth of the problems being addressed, +we start with a set of use cases +before generalizing them in the "Background" section. + +Not all of these use cases may be desirable to enable. +Part of the process of discussing this design document +includes deciding on whether any of them should +be explicitly disallowed. + +Several of the following examples use the `:number` built-in function. +However, the issues illustrated in the examples are general. +Most of the issues identified with `:number` also apply +to **any** function with multiple named options. 
+ +_This document includes examples contributed by Markus Scherer and Elango Cheran_ + +### Composition + +The presence of local _declarations_ +allows function composition. + +The following is _not_ a syntactically valid +MessageFormat 2 message: + +``` +{{{|1| :number} :number}} +``` + +This string is not a syntactically correct message +because no nesting of function annotations is allowed +(see the _expression_ nonterminal in the grammar.) + + +However, by substitution, +one can write an observationally equivalent message: + +``` +.local $x = {|1| :number} +{{$x :number}} +``` + +This implies that the composition of two functions +(conceptually, `number . number` in this case) +has to have a meaning, +since it can be expressed syntactically, +as long as every intermediate result is named. + +That doesn't mean that every possible combination +of functions has to make sense if composed. +Functions are allowed to signal errors +and return values that indicate that an error occurred. + +How functions might compose depends on what they return +and what they can accept as arguments. +In turn, what they can accept as arguments +depends on what a variable name denotes. +Functions do not accept MessageFormat variable names +as arguments, +but rather the **value** denoted by a variable name. +The MessageFormat implementation must +manage the binding of variable names to values. + +### Overriding options + +The meaning of function composition in MessageFormat +depends on what can be assumed +about the arguments and return values +of function implementations. + +Consider the following example: + +Example Y1: +``` + .local $x = {$num :number maxFrac=2} + .local $y = {$x :number maxFrac=5} + {{$x} {$y}} +``` + +If the external input value of `$num` is "0.33333", +what should this message format to? + +1. `0.33 0.33333` +2. `0.33 0.33` +3. It's an error, because the value bound to `$x` is +a formatted number, which `:number` does not accept. 
+ +The answer to this question is unclear. +How different values for the same options might +(or might not) combine is not straightforward. +Perhaps some do, others don't, and the details are +specific to each function. + +Example A6: +Should $y be formatted as ٩٨٧٦٥ or 98765? + +``` + input $num = 98765 + .local $x = {$num :number numberSystem=arab} + .local $y = {$x :number numberSystem=latn} +``` + +Example A7: +Should $y be formatted as +98765 or 98765? +``` + + + input $num = 98765 + .local $x = {$num :number signDisplay=always} + .local $y = {$x :number signDisplay=auto} +``` + +These examples raise the question +of whether functions return values +that preserve the input option names and values +that were passed in to the function. + +While some use cases don't work well +(or at least work surprisingly) +if options are **not** preserved +in function outputs, +defining what it means to compose options +is not straightforward. + +### Combining options + +In Example Y1, +two function calls are composed +with the same set of option names +(and different option values). + +A question also arises of how to combine options with different names. + +Example A4: + +``` + .local $x = {|1| :number minInt=3} + .local $y = {$x :number maxFrac=5} + {{$x} {$y}} +``` + +Is this message equivalent to Example A5? + +Example A5: + +``` + .local $x = {|1| :number maxFrac=5 minInt=3} + {{$x} {$x}} +``` + +In compositions of calls to the same function, +it might make intuitive sense to union together +the option sets, letting the outermost enclosing +call take precedence if the same option is specified +multiple times. + +It's less obvious what to do with compositions of +calls to different functions, which may have different +option sets. + + +### Computation vs. applying formatting + +The previous examples conceptually return +the same value that was passed in, +annotated with formatting hints. + +But functions can do other kinds of computation. 
+ +Example B1: +``` + .local $age = {$person :getAge} + .local $y = {$age :duration skeleton=yM} + .local $z = {$y :uppercase} +``` + +Although there is also a pipeline of functions +(conceptually, `uppercase . duration . getAge`), +the formatted value returned by `:getAge` is _not_ +just "the argument with formatting options applied", +but rather, a piece of the argument. +Other functions can be imagined that do more general +computation on arguments. + +It would not be correct to say that +`:uppercase` converts a person to uppercase, +nor would it be correct to say that `:uppercase` +converts a number to uppercase. + +This example only makes sense if `:uppercase` +operates on the "formatted result" +of evaluating the _expression_ bound to `$y`. + +This suggests a representation for named values +that allows functions to choose whether to +inspect the "formatted result", the "argument" and options, +or both. +Or, it may also suggest that we consider +whether we allow composition at all when the functions differ. + +## Background + +In the use cases, we have seen that +the meaning of "resolved value" affects the observable behavior of formatting. +This in turn affects the semantics of MessageFormat, +particularly with respect to function composition. +The desired semantics for function composition +impose requirements on the expressivity of a "resolved value". + +[The introduction to the spec](https://github.com/unicode-org/message-format-wg/blob/main/spec/formatting.md) states: + +> _The form of the resolved value is implementation defined and the value might not be evaluated or formatted yet. However, it needs to be "formattable", i.e. it contains everything required by the eventual formatting._ + +What "everything required" means depends on the semantics of formatting. 
+ +And from the ["Expression and Markup Resolution"](https://github.com/unicode-org/message-format-wg/blob/main/spec/formatting.md#expression-and-markup-resolution) section: + +> _Since a variable can be referenced in different ways later, implementations SHOULD NOT immediately fully format the value for output._ + +Is there a distinction between a "resolved value" +and a "fully formatted" resolved value? +This text suggests that there is. +Again, the distinction affects observable behavior. + +To understand how these distinctions affect behavior, +we turn to some examples. + +### Base example + +The following example is unambiguous +with the current spec: + +Example Z1: +``` +.local $n = {|1.00123|} +.local $x = {$n :number maxFrac=2} +.local $y = {$n :number maxFrac=3} +{{$x} {$y}} +``` + +If this message is formatted to a string, +the output is "1.00 1.001". + +If the message body is changed +(for example, to annotate `$x` and/or `$y` +with other _annotations_), +what can we assume about the resolved values +of `$x` and `$y`? + +It must be possible to distinguish +the resolved value of `$x` +from the resolved value of `$y`, +as `{{$x} {$y}}` formats to "1.00 1.001" +while `{{$x} {$x}}` formats to "1.00 1.00". + +The question is _when_ the two resolved values +look the same, +and when they look different. + +Two possible answers: + +1. The two resolved values are always different: +`$x` is bound to a resolved value +that encodes the value of the `maxFrac` option, 2; +while for `$y`, this value is 3. +2. Functions cannot distinguish the two resolved values +from each other. +However, the formatter can distinguish the two resolved values +from each other when formatting a pattern. + +How would the spec need to change in order to +force either answer 1 or answer 2 to be unambiguously true? + +Put a different way: does every expression +have a single resolved value +in a given formatting context? 
+Or does the resolved value of an expression +depend on the context +in which the expression appears? + +### Custom and built-in functions + +An implementation that supports custom functions +would be expected to define an interface between the message formatter +and function implementations. +An implementation is free to choose whether to use the same interface +for calling built-in functions, +or to build the implementations of those functions +directly into the message formatter. + +For that reason, this document refers to custom functions +when discussing the function interface. +However, in some implementations, the same questions arise +for built-in functions. + +### Formattable and formatted values + +In implementing the custom function registry, +it might be natural to suppose that +conceptually, the type signature of a function implementation +is: + +``` +Formattable -> FormattedValue +``` + + +The names of these two types are taken from the ICU4C +implementation. A `Formattable` is a value +tagged with a type (for example, doubles, integers, +strings, and pointers to arbitrary objects). +A `FormattedValue` represents something that +already contains all the information needed +to either render it as a string, +or convert it to a "part" according to a hypothetical +"format to parts" interface. + +It would be tempting to use the existing `Formattable` +type for the input of a function implementation, +and use the existing `FormattedValue` type for the output. +However, functions with this signature +can't compose with each other. + +When the following text refers to `Formattable` or +`FormattedValue`, it should be taken to refer to +abstract "input" and "output" types. + +### Ambiguous examples + +Returning to Example Y1, consider two possible models +of the runtime behavior of function composition. + +#### Preservation Model + +This model preserves the options in the result of the function. + +1. 
Evaluate `$num` to a value and pass it to the `:number` function, + along with named options `{"maxFrac": "2"}` +2. Let `X` be the result of the function. `X` is + an object + encapsulating the following fields: + * The source value, `"0.33333"` + * The fully-evaluated options, `{"maxFrac": "2"}` + * The formatted result, a `FormattedNumber` object + representing the string `"0.33"` +3. Bind the name `$x` to the value `X`. +4. Evaluate `$y` to a value, which is `X`, + and pass it to the `:number` function, + along with named options `{"maxFrac": "5"}` +5. Bind `$y` to the result, which is an object `Y` + encapsulating the following fields: + * The source value, `"0.33333"` (same as `X`'s source value) + * The fully-evaluated options, `{"maxFrac": "5"}` + (note: the original `maxFrac` option value has been discarded) + * The formatted result, a `FormattedNumber` object + representing the string `"0.33333"` + +The formatted result is then "0.33 0.33333". + +#### Formatted Value Model + +This model preserves the formatted value of the function, +but not the options that were passed to the function. + +1. Evaluate `$num` to a value and pass it to the `:number` function, + along with named options `{"maxFrac": "2"}` +2. Let `F` be the result of the function. `F` is + a `FormattedNumber` object + representing the string `"0.33"` +3. Bind the name `$x` to the value `F`. +4. Evaluate `$y` to a value, which is `F`, + and pass it to the `:number` function, + along with named options `{"maxFrac": "5"}` +5. Inside the `:number` function, check the type + of the input, notice that it is already a + `FormattedNumber`, and return it as-is +6. Bind `$y` to the result, which is `F`. + +The formatted result is then "0.33 0.33". 
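The two models above can be sketched in code. The following Python is a minimal illustration under stated assumptions, not part of the spec or of any implementation: `format_decimal`, `Preserved`, `number_preserving`, and `number_formatted_only` are hypothetical names, `Decimal.quantize` stands in for real locale-aware number formatting, and `maxFrac` is treated as a fixed number of fraction digits, matching the outputs used in this document's examples.

```python
from dataclasses import dataclass
from decimal import Decimal


def format_decimal(source: str, max_frac: int) -> str:
    """Hypothetical stand-in for :number; rounds to `max_frac` fraction digits."""
    quantum = Decimal(1).scaleb(-max_frac)  # max_frac=2 gives a quantum of 0.01
    return str(Decimal(source).quantize(quantum))


@dataclass(frozen=True)
class FormattedNumber:
    """Formatted value model: only the formatted text survives."""
    text: str


@dataclass(frozen=True)
class Preserved:
    """Preservation model: source value, options, and formatted result."""
    source: str
    options: dict
    formatted: FormattedNumber


def number_preserving(operand, options: dict) -> Preserved:
    # Look back to the original source value if the operand already carries one.
    source = operand.source if isinstance(operand, Preserved) else operand
    text = format_decimal(source, int(options["maxFrac"]))
    # As in the preservation model above, the earlier option values are discarded.
    return Preserved(source, dict(options), FormattedNumber(text))


def number_formatted_only(operand, options: dict) -> FormattedNumber:
    # An already formatted operand is returned as-is; the new options are ignored.
    if isinstance(operand, FormattedNumber):
        return operand
    return FormattedNumber(format_decimal(operand, int(options["maxFrac"])))


# Example Y1 with the external input $num = "0.33333"
x = number_preserving("0.33333", {"maxFrac": "2"})
y = number_preserving(x, {"maxFrac": "5"})
preservation_output = f"{x.formatted.text} {y.formatted.text}"

fx = number_formatted_only("0.33333", {"maxFrac": "2"})
fy = number_formatted_only(fx, {"maxFrac": "5"})
formatted_value_output = f"{fx.text} {fy.text}"

print(preservation_output)     # 0.33 0.33333
print(formatted_value_output)  # 0.33 0.33
```

In this sketch the only difference between the two functions is what they return: the preserving variant keeps enough information for a later call to reformat the source value, while the formatted-only variant leaves a later call nothing to work with except the finished text.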
+ +#### Comparison between models + +The difference is in step 2: whether the implementation +of the `number` function returns a value encapsulating +the various options that were passed in (preservation model), +or only a formatted result (formatted value model). + +In terms of implementation, the result depends on +the nature of the value that is bound to +a local variable in the environment used in evaluation +within the message formatter. +The structure of this value might not be exactly the same +as the structure that is passed to functions: +for example, in a lazy implementation, unevaluated thunks +might be stored in the environment, and then evaluated +before a function call. +Still, whatever value is stored in the environment +must capture as much information as is needed by functions. + +In the formatted value model, the value is a simple "formatted value", +analogously to MessageFormat 1. + +In the preservation model, it is a more structured value that captures +both the "formatted value" and everything that was used to construct it. + + +### The structure of named values + +In the current spec, the "value bound to a variable" +when processing a `.local` (or `.input`) declaration +is a "resolved value". +A "resolved value" +is also the operand of a function. + +Another way to resolve the ambiguity between +the formatted value model and the preservation model is to ask +when two resolved values are the same +and when they are different. + +Example Z1 shows how two different names, +which are bound to two different resolved values, +**might** appear to be the same resolved value +when appearing as the operand of a function, +depending on how the spec is interpreted. 
+ +_The following is drawn from comments by Mark Davis._ + +Example A1: +``` + .local $n = {|1|} + .local $x = {$n :number maxFrac=2} +``` + +The formatter processes the right-hand side +of the `.local` declaration of `$x` +and binds it to a value (a "resolved value", +per the spec.) + +The use of "resolved value" implies that +the following two concepts: +* "the value of `$x` in the formatter's local environment" +and +* "the meaning of `$x` as an operand to a function" +denote the same entity. + +So what is that entity? There are at least two interpretations, +corresponding to the models previously presented: + +Interpretation 1: The meaning of `$x` is +a value that effectively represents the string `"1.00"`. + +Consider another example: + +Example A2: +``` + .local $n1 = {|1|} + .local $n2 = {|1.00123|} + .local $x1 = {$n1 :number maxFrac=2} + .local $x2 = {$n2 :number maxFrac=2} +``` + +Under interpretation 1, `$x1` and `$x2` are +interchangeable in any further piece of the message +that follows this fragment. +No processing can distinguish the resolved values +of the two variables. +This corresponds to the formatted value model. + +Interpretation 2: The meaning of `$x` is +a value that represents +a formatted string `"1.00"`, +bundled with information about the source value (`"1"`), +formatter name, and formatter options. +If the resolved value of `$x`, V, is passed to another function, +that function can distinguish V from another value V1 +that represents the same formatted string, +with different options. +This corresponds to the preservation model. + +The choice of interpretation affects the meaning of +function composition, +since it affects which values a function implementation +can distinguish from each other. + +The two interpretations affect behavior +only if `$x` is passed to another function; +if it's only used unannotated in a pattern, +then the two interpretations imply the same result. 
+ +In other words, in example A2, +functions can distinguish the resolved value of `$x1` +from the resolved value of `$x2`. + +An implication of interpretation 2 is that the meaning +of `$x2` (in example A2) depends on context. +When it appears in a pattern, with no annotation, +it is interchangeable with `$x1`. +When it appears in a function annotation, +it is distinguishable from `$x1`. +If every expression has a single "resolved value" +that is determined in a context-free way, +that rules out interpretation 2. + +Another way to express interpretation 2 is to +express the semantics of an unannotated variable +in a pattern in this way: + +`{{$x}}` is implicitly +`{{$x :format}}` + +where `:format` extracts the string (or "formatted parts") +representation of `$x`'s resolved value +and discards everything else, which is not needed for +producing the final formatting result. +If this implicit processing +(an implicit type coercion?) +is introduced into the spec, +the meaning of variable names +is once again context-independent. + +Also consider: + +Example A3: + +``` + .local $n1 = {|1.00123|} + .local $n2 = {|1.00|} + .local $x0 = {$n2 :number maxFrac=2} + .local $x = {$n1 :number maxFrac=2} + .local $y = {$x :number maxFrac=5} + {{$x} {$y}} +``` + +Interpretation 1: The result of formatting this message to a string +is `"1.00 1.00000"`. +`$x` is bound to a resolved value denoting a formatted string `1.00`, +with no metadata. +The second call to `:number`, with `$x`'s resolved value as its operand, +cannot distinguish the resolved value of `$x` +from the resolved value of `$x0`. +Therefore, its result is the same as if it was passed +the resolved value of `$x0`. + +Interpretation 2: The result of formatting this message to a string +is `"1.00 1.00123"`. +Under this interpretation, the second call to `:number` +can distinguish the resolved value +of `$x` from the resolved value of `$x0`. 
+The original string that `$x` was constructed from +is part of the resolved value, +and can be accessed to reformat that string +with more digits of precision. + +When it comes to example A3, +either interpretation might be surprising to some users. + +Under interpretation 2, some users might be surprised that `$y` +has more precision than `$x`. +If their mental model is that `$y` can only depend +on the result of formatting the right-hand side +of the declaration of `$x`, +then the intuition is that the extra digits are "lost" +upon assigning a value to `$x`. + +To understand why interpretation 1 might be surprising +to other users, consider an analogy. + +#### Spreadsheet analogy + +_This idea is from Mark Davis._ + +Interpretation 2 treats variables analogously to cells in a spreadsheet. +In a cell of a spreadsheet, referring to a value by name +creates a reference, +rather than copying its value. +Unlike cells in spreadsheets, variables in MessageFormat are immutable. +However, like cells in spreadsheets, the definitions of variables +can refer to other variables that are in scope. + +Consider a tabular rendering of example A3 (with a pseudo-spreadsheet syntax) +(this would be the "formula" view): + +| | A | B | C | D | E | +|----|---------|------|------------------|------------------|------------------| +| 1 | 1.00123 | 1.00 | =(B1, maxFrac=2) | =(A1, maxFrac=2) | =(D1, maxFrac=5) | + +with the following mappings between cells and variables: + +| Cell | Variable | +|------|----------| +| A1 | n1 | +| B1 | n2 | +| C1 | x0 | +| D1 | x | +| E1 | y | + +And the "output" view: + +| | A | B | C | D | E | +|----|---------|------|------|------|---------| +| 1 | 1.00123 | 1.00 | 1.00 | 1.00 | 1.00123 | + +In E1, the reference to D1 is a reference to the _value_ of A1, +with added formatting options that can be extended or overridden. 
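The spreadsheet view of Example A3 can be sketched as follows. This is a minimal Python illustration, not an implementation of the spec: `fmt`, `number_v1`, `number_v2`, and `render` are hypothetical names, a cell is modelled as a `(source, options)` tuple, and `Decimal.quantize` stands in for real number formatting, with `maxFrac` treated as a fixed number of fraction digits as in the outputs quoted for Example A3.

```python
from decimal import Decimal


def fmt(source: str, max_frac: int) -> str:
    """Hypothetical stand-in for :number formatting."""
    return str(Decimal(source).quantize(Decimal(1).scaleb(-max_frac)))


def number_v1(operand: str, max_frac: int) -> str:
    """Interpretation 1: a variable holds only the formatted string (a copied value)."""
    return fmt(operand, max_frac)


def number_v2(operand, max_frac: int) -> tuple:
    """Interpretation 2: a variable holds the source plus options, like a cell formula."""
    source, options = operand if isinstance(operand, tuple) else (operand, {})
    # Later options extend or override earlier ones; the source is carried along.
    return (source, {**options, "maxFrac": max_frac})


def render(cell: tuple) -> str:
    """Format a (source, options) cell for output."""
    source, options = cell
    return fmt(source, options["maxFrac"])


n1 = "1.00123"  # cell A1

# Interpretation 1: $y is computed from the already formatted text of $x.
x1 = number_v1(n1, 2)
y1 = number_v1(x1, 5)
interpretation_1 = f"{x1} {y1}"

# Interpretation 2: $y still refers to the source value behind $x.
x2 = number_v2(n1, 2)
y2 = number_v2(x2, 5)
interpretation_2 = f"{render(x2)} {render(y2)}"

print(interpretation_1)  # 1.00 1.00000
print(interpretation_2)  # 1.00 1.00123
```

Under the second variant, `y2` carries the same source value as cell A1, which is why the extra digits reappear when `maxFrac` is raised.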
+ +This is consistent with interpretation 2, in which `$x` is bound to a structure +that contains the value of `$n1`. +(This could be implemented by copying, since MessageFormat variables are immutable.) + + +### Semantics: return values for functions + +Turning to the question of what a function returns, +this at first glance is the same question as +"what is a named value?", +as both are "resolved values" according to the spec. +But both interpretation 1 and interpretation 2 complicate that. + +Alternative 1: A function returns a "formatted value". +This matches the formatted value model, where formatted values +are bound to names. + +Alternative 2: A function returns a composite value +that conceptually pairs a base value (possibly the +operand of the function, but possibly not; see Example B1) +with options. +This matches the preservation model. +If we preserve the single usage of "resolved value" +in the spec, this implies that the (base value, options) +representation applies to all resolved values, +not just those returned by functions. + +For more details, see the "model implementations" section. + +### Allowing different kinds of functions + +The syntax could also distinguish multiple kinds of functions, +some of which are composable and some not. + +Or, the function registry could allow functions to be declared in this way, +with incorrect uses of functions being resolution errors. + +(See "Different kinds of composition" under "Alternatives to consider".) + +### Summarizing use cases + +There seem to be several areas of ambiguity: + +* Are named values essentially `FormattedValue`s, +or do they have additional structure that is used +internally in the formatter? (formatted value model vs. preservation model) +* In the preservation model, some functions "look back" for the original value, +(like `number`) +while others return a new "source value" +(like `getAge` in Example B1). 
+ +The choice of internal value influences both areas, +or rather, the desired answers to these questions +constrain the choice of internal value. + +Composition might mean that "the second function operates on the output of the first", +or "the second function operates on the **input** of the first, plus 'hints' supplied by the first", +or it might mean either one depending on which function(s) are involved. + +The question is how to craft the spec in a way that is consistent with expectations. + +## Requirements + +In the rest of this document, we assume some version of +the preservation model. +However, if the formatted value model is more desired, +the questions arise of how to +forbid compositions of functions that would do surprising things +under that model. + +Even under the preservation model, some instances of composition +won't make sense, +so we need to define the error behavior when it doesn't make sense. + +This implies that we need: + +* Preservation of options +* Preservation of the original operand value, at least for some functions +* Passing through a new operand value _and_ options when appropriate, as with functions that extract fields +(see Example B1) +* (Possibly) Preservation of data about the names of functions that were previously called. + +The function registry needs to be explicit about which pieces of data +a function preserves, +and what it expects in its input. + +More generally than just "preservation of options", +a solution needs to specify a minimum set of requirements +for the internal value representation, +so that functions can be passed the values they need. +It also needs to provide a mechanism for declaring +when functions can compose with each other. + +### Guarantee portability + +A message that has a valid result in one implementation +should not result in an error in a different implementation. 
+ +### Identify a set of use cases that must be supported + +Some use cases for composition are given in this document, +and others are yet to be identified. + +One of the outcomes of the design process +should be to agree on a set of use cases involving function composition +that implementations must support in order to be spec-compliant. + +This does not rule out the possibility of deciding +_not_ to support any of these use cases +(that is, a solution that heavily restricts function composition +rather than giving it meaning). + +### Avoid over-constraining implementations + +An implementation is free to use any types +and set of operations for coercing between types +as long as it preserves the observable behavior +of a message. + +The proposed solution should not impose +unnecessary constraints on the implementation. +However, it _should_ make the requirements +for "resolved values" +clear enough so that implementors can +make a well-informed decision +on what these types and operations should be. + +The proposed solution might involve normative changes to the spec, +or it might be sufficient to define a set of examples +and how they should work. + +### Stay aligned with user needs + +It may be that in practice, MessageFormat users will not +find it useful to compose functions in complex ways. +MessageFormat was not initially meant to be +a general-purpose programming language. +If supporting function composition in its full generality +leads to an excessively complex or restrictive spec, +and if user feedback suggests such use cases are rare in practice, +it might be more desirable to rule out ambiguous examples +by means of syntactic restrictions and/or runtime errors. + +## Constraints + +_What prior decisions and existing conditions limit the possible design?_ + +One prior decision is that the same definition of +"resolved value" appears in multiple places in the spec. 
+If "resolved value" is defined broadly enough +(an annotated value with rich metadata), +then this prior decision need not be changed. + +A second constraint is +the difficulty of developing a precise definition of "resolved value" +that can be made specific in the interface for custom functions, +which is implementation-language-neutral. + +A third constraint is the "typeless" nature of the existing MessageFormat spec. +The idea of specifying which functions are able to compose with each other +resembles the idea of specifying a type system for functions. +Specifying rules for function composition, while also remaining typeless, +seems difficult and potentially unpredictable. + +## Prior work + +In a previous iteration of the spec, +[PR #198](https://github.com/unicode-org/message-format-wg/pull/198) (September 2021) +proposed to add a general `Formattable` interface to represent runtime values. +This proposal was not merged, and eventually, the term "resolved value" +was incorporated into the spec instead to abstract over this interface. + +Currently, custom functions are passed their argument +(_operand_ in MessageFormat syntax) +and a mapping of option names to option values separately. +In PR 198, a `Formattable` would have encapsulated +both the argument and options. +A variation on this idea might make it simpler +to return values with options: +if a function implementation has type `Formattable -> Formattable`, +then it can preserve some options in the returned value, +omit them, or add new ones. + +A wiki page from August 2021, +["Data and Execution Model Differences"](https://github.com/unicode-org/message-format-wg/wiki/Data-&-Execution-Model-Differences#formatting-function-dependencies), +includes a section on "Formatting Function Dependencies", +which includes the question: +"When implementing a formatting function, +what values/arguments does it need to have access to?" 
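The `Formattable -> Formattable` variation mentioned above can be illustrated with a minimal sketch. The `Formattable` shape and both functions below are hypothetical and chosen only for illustration; they do not reproduce the interface actually proposed in PR #198.

```ts
// Hypothetical shape: an operand value bundled with its options.
// This is an assumption for illustration, not the interface from PR #198.
interface Formattable {
  value: unknown;
  options: Record<string, unknown>;
}

// Every function takes one Formattable and returns one Formattable.
type FormatFunction = (input: Formattable) => Formattable;

// Hypothetical function that preserves incoming options and adds a new one.
const withMinimumFractionDigits = (digits: number): FormatFunction =>
  (input) => ({
    value: input.value,
    options: { ...input.options, minimumFractionDigits: digits },
  });

// Hypothetical function that omits all options and returns a plain string value.
const toPlainString: FormatFunction = (input) => ({
  value: String(input.value),
  options: {},
});

const start: Formattable = { value: 42, options: { numberingSystem: "latn" } };
const preserved = withMinimumFractionDigits(2)(start); // keeps numberingSystem
const dropped = toPlainString(preserved); // carries no options forward
```

Because each function decides what ends up in the returned `options`, whether options survive composition is a property of the individual function rather than of the formatter.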
+ +A GitHub discussion also from August 2021, +["A Modular and Extensible MessageFormat 2.0"](https://github.com/unicode-org/message-format-wg/discussions/190), +includes discussion of the "runtime model" for values; +a proposed design based on this document +might also specify a runtime model for values. +At the time, "pattern element formatters" +and "formatting functions" were conceived +as separate categories of functions. +This document also suggests +introducing more categories of functions, +but split in a different way. +The August 2021 discussion suggests giving +"pattern element formatters" more access to message context. +Such a distinction might be useful again: +it might be desirable to allow function authors +to declare either simple functions that +don't compose with each other, +or more complex functions with a richer interface +that can compose with each other. + +There are some common elements between those past discussions +and this document. +But in contrast with previous proposals that give functions +access to message context, +this document suggests a purely functional approach +in which values the function needs to have access to +are encapsulated in a single argument type +and values the function needs to return +are encapsulated in a single return type. + +Another notable difference from some of the prior work +is that in the current spec, the data model +is completely separate from the model for values +that functions operate on. +That is, functions operate on values +without needing to know or care +whether those values were obtained from +evaluating a MessageFormat _literal_ or _variable_. +We do not propose changing that. +Hence, revisiting the extensibility of the runtime model +now that the data model is settled +may result in a more workable solution. + +## Alternatives to be considered + +The goal of this section is to present a _complete_ list of +alternatives that may be considered by the working group. 
+ +Each alternative corresponds to a different concrete +definition of "resolved value". + +## Introducing type names + +It's useful to be able to refer to three types: + +* `InputType`: This type encompasses strings, numbers, date/time values, +and all other possible implementation-specific types that input variables can be +assigned to. The details are implementation-specific. +* `MessageValue`: The "resolved value" type; see [PR 728](https://github.com/unicode-org/message-format-wg/pull/728). +* `ValueType`: This type is the union of an `InputType` and a `MessageValue`. + +It's tagged with a string tag so functions can do type checks. + +```ts +interface ValueType { + type(): string + value(): unknown +} +``` + +## Alternatives to consider + +In lieu of the usual "Proposed design" and "Alternatives considered" sections, +we offer some alternatives already considered in separate discussions. + +Because of our constraints, implementations are **not required** +to use the `MessageValue` interface internally as described in +any of the sections. +The purpose of defining the interface is to guide implementors. +An implementation that uses different types internally +but allows the same observable behavior for composition +is compliant with the spec. + +Five alternatives are presented: +1. Typed functions +2. Formatted value model +3. Preservation model +4. Allow both kinds of composition +5. Don't allow composition + +### Typed functions + +Types are a way for users of a language +to reason about the kinds of data +that functions can operate on. +The most ambitious solution is to specify +a type system for MessageFormat functions. + +In this solution, `ValueType` is not what is defined above, +but instead is the most general type +in a system of user-defined types. +(The internal definitions are omitted.) +Using the function registry, +each custom function could declare its own argument type +and result type. +This does not imply the existence of any static typechecking. 
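As a rough illustration of functions declaring their argument and result types, the sketch below keeps a table of signatures and checks whether a chain of calls lines up. The names `FunctionSignature`, `signatures`, and `checkChain` are hypothetical and are not part of the spec or the function registry; the three signatures are the ones this document gives informally for Example B1.

```ts
// Hypothetical names throughout; this is a sketch, not a registry format.
type TypeName = string;

interface FunctionSignature {
  argument: TypeName; // type tag the function expects for its operand
  result: TypeName;   // type tag of the value the function returns
}

// Declared signatures, mirroring the informal notation used for Example B1.
const signatures = new Map<string, FunctionSignature>([
  ["getAge", { argument: "Person", result: "Number" }],
  ["duration", { argument: "Number", result: "String" }],
  ["uppercase", { argument: "String", result: "String" }],
]);

// Walks a chain in which each function is applied to the previous result.
// Returns human-readable problems; an empty list means the chain composes.
function checkChain(inputType: TypeName, chain: string[]): string[] {
  const problems: string[] = [];
  let current = inputType;
  for (const name of chain) {
    const sig = signatures.get(name);
    if (sig === undefined) {
      problems.push(`unknown function :${name}`);
      break;
    }
    if (sig.argument !== current) {
      problems.push(`:${name} expects ${sig.argument} but receives ${current}`);
    }
    current = sig.result;
  }
  return problems;
}

// :uppercase is applied directly to the Number returned by :getAge,
// so this chain is reported as not composing.
const problems = checkChain("Person", ["getAge", "uppercase"]);
```

A check of this kind could run as an optional lint step or at runtime; the sketch takes no position on which, and uses simple name equality where a real design would need subtyping.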
+ +Example B1: +``` + .local $age = {$person :getAge} + .local $y = {$age :duration skeleton=yM} + .local $z = {$y :uppercase} +``` + +In an informal notation, +the three custom functions in this example +have the following type signatures: + +``` +getAge : Person -> Number +duration : Number -> String +uppercase : String -> String +``` + +The [function registry data model](https://github.com/unicode-org/message-format-wg/blob/main/spec/registry.md) +could be extended to define `Number` and `String` +as subtypes of `MessageValue`. +A custom function author could use the custom +registry they define to define `Person` as +a subtype of `MessageValue`. + +An optional static typechecking pass (linting) +would then detect any cases where functions are composed in a way that +doesn't make sense. The advantage of this approach is documentation. + +### Formatted value model (Composition operates on output) + +To implement the "formatted value" model, +the `MessageValue` definition would look as in [PR 728](https://github.com/unicode-org/message-format-wg/pull/728), but without +the `resolvedOptions()` method: + +```ts +interface MessageValue { + formatToString(): string + formatToX(): X // where X is an implementation-defined type + getValue(): ValueType + selectKeys(keys: string[]): string[] +} +``` + +`MessageValue` is effectively a `ValueType` with methods. + +Using this definition would make some of the use cases +impractical. For example, the result of Example A4 +might be surprising. Also, Example 1.3 from +[the dataflow composability design doc](https://github.com/unicode-org/message-format-wg/blob/main/exploration/dataflow-composability.md) +wouldn't work because options aren't preserved. + +### Preservation model (Composition can operate on input and options) + +In the preservation model, +functions "pipeline" the input through multiple calls. 
+ +The `ValueType` definition is different: + +```ts +interface ValueType { + type(): string + value(): InputType | MessageValue +} +``` + +The resolved value interface would include both "input" +and "output" methods: + +```ts +interface MessageValue { + formatToString(): string + formatToX(): X // where X is an implementation-defined type + getInput(): ValueType + getOutput(): ValueType + properties(): { [key: string]: ValueType } + selectKeys(keys: string[]): string[] +} +``` + +Compared to PR 728: +The `resolvedOptions()` method is renamed to `properties`. +Individual function implementations +choose which options to pass through into the resulting +`MessageValue`. + +Instead of using `unknown` as the result type of `getValue()`, +we use `ValueType`, mentioned previously. +Instead of using `unknown` as the value type for the +`properties()` object, we use `ValueType`, +since options can also be full `MessageValue`s with their own options. +(The motivation for this is Example 1.3 from +[the "dataflow composability" design doc](https://github.com/unicode-org/message-format-wg/blob/main/exploration/dataflow-composability.md).) + +This solution allows functions to pipeline input, +operate on output, or both; as well as to examine +previously passed options. Any example from this +document can be implemented. + +Without a mechanism for type signatures, +it may be hard for users to tell which combinations +of functions compose without errors, +and for implementors to document that information +for users. + +### Allow both kinds of composition (with different syntax) + +By introducing new syntax, the same function could have +either "preservation" or "formatted value" behavior. 
+ +Consider (this suggestion is from Elango Cheran): + +``` + .local $x = {$num :number maxFrac=2} + .pipeline $y = {$x :number maxFrac=5 padStart=3} + {{$x} {$y}} +``` + +`.pipeline` would be a new keyword that acts like `.local`, +except that if its expression has a function annotation, +the formatter would apply the "preservation model" semantics +to the function. + +### Don't allow composition for built-in functions + +Another option is to define the built-in functions this way, +notionally: + +``` +number : Number -> FormattedNumber +date : Date -> FormattedDate +``` + +The `MessageValue` type would be defined the same way +as in the formatted value model. + +The difference is that built-in functions +would not accept a "formatted result" +(would signal a runtime error in these cases). + +As with the formatted value model, this restricts the +behavior of custom functions. + +### Non-alternative: Allow composition in some implementations + +Allow composition only if the implementation requires functions to return a resolved value as defined in [PR 728](https://github.com/unicode-org/message-format-wg/pull/728). + +This violates the portability requirement. + +## Acknowledgments + +This document incorporates comments and suggestions from Elango Cheran, Mark Davis, and Markus Scherer. diff --git a/exploration/maintaining-registry.md b/exploration/maintaining-registry.md new file mode 100644 index 0000000000..f5cc411f02 --- /dev/null +++ b/exploration/maintaining-registry.md @@ -0,0 +1,316 @@ +# Maintaining and Registering Functions + +Status: **Proposed** + +
+ Metadata +
+
Contributors
+
@aphillips
+
First proposed
+
2024-02-12
+
Pull Requests
+
#634
+
+
+ +## Objective + +_What is this proposal trying to achieve?_ + +Describe how to manage the registration of functions and options under the +auspices of MessageFormat 2.0. +This includes the REQUIRED Functions which are normatively required by MF2.0, +functions or options in the Unicode `u:` namespace, +and functions/options that are recommended for interoperability. + +## Background + +_What context is helpful to understand this proposal?_ + +MessageFormat v2 originally included the concept of "function registries", +including a "default function registry" required of conformant implementations. + +The terms "registry" and "default registry" suggest machine-readability +and various relationships between function sets that the working group decided +were not appropriate. + +MessageFormat v2 includes a REQUIRED set of functions. +Implementations are required to implement all of the _selectors_ +and _formatters_ in this set, +including _operands_, _options_, and option values. +Our goal is to be as universal as possible, +making MFv2's message syntax available to developers in many different +runtimes in a wholly consistent manner. +Because we want broad adoption in many different programming environments +and because the capabilities +and functionality available in these environments vary widely, +this REQUIRED set of functions must be conservative in its requirements +such that every implementation can reasonably implement it. + +Promoting message interoperability can and should go beyond this. +Even when a given feature or function cannot be adopted by all platforms, +diversity in the function names, operands, options, error behavior, +and so forth remains undesirable. +Another way to say this is that, ideally, there should be only one way to +do a given formatting or selection operation in terms of the syntax of a message. + +This suggests that there exists a set of functions and options that +extends the REQUIRED set of functions. 
+Such a set contains the "templates" for functions that go beyond those every implementation +must provide or which contain additional, optional features (options, option values) +that implementations can provide if they are motivated and capable of doing so. +These specifications are normative for the functionality that they provide, +but are optional for implementers. + +There also needs to be a mechanism and process by which functions in the default namespace +can be incubated for future inclusion in either the REQUIRED set of functions +or in this extended, optional set. + +### Examples + +_Function Incubation_ + +CLDR and ICU have defined locale data and formatting for personal names. +This functionality is new in CLDR and ICU. +Because it is new, few, if any, non-ICU implementations are currently prepared to implement +a function such as a `:person` formatter or selector. +Implementation and usage experience is limited in ICU. +Where functionality is made available, we don't want it to vary from +platform to platform. + +_Option Incubation_ + +In the Tech Preview (LDML45) release, options for `:number` (and friends) +and `:datetime` (and friends) were omitted, including `currency` for `:number` +and `timeZone` for `:datetime`. +The options and their values were reserved, possibly for the LDML46 release as required, +but they also might be retained at a lower level of maturity. + +## Use-Cases + +_What use-cases do we see? Ideally, quote concrete examples._ + +As an implementer, I want to know what functions, options, and option values are +required to claim support for MF2: +- I want to know what options I am required to implement. +- I want to know what the values of each option are. +- I want to know what the options and their values mean. +- I want to be able to implement all of the required functions using my runtime environment + without difficulty. 
+- I want to be able to use my local I18N APIs, which might use an older release of CLDR + or might not be based on CLDR data at all. + This could mean that my output might not match that of a CLDR-based implementation. + +As an implementer, user, translator, or tools author, I expect functions, options, +and option values to be stable. +The meaning and use of these, once established, should never change. +Messages that work today must work tomorrow. +This doesn't mean that the output is stabilized or that selectors won't +produce different results for a given input or locale. + +As an implementer, I want to track best practices for newer I18N APIs +(such as implementing personal name formatting/selection) +without being required to implement any such APIs that I'm not ready for. + +As an implementer, I want to be assured that functions or options added in the future +will not conflict with functions or options that I have created for my local users. + +As a developer, I want to be able to implement my own local functions or local options +and be assured that these do not conflict with future additions by the core standard. + +As a tools developer, I want to track both required and optional function development +so that I can produce consistent support for messages that use these features. + +As a translator, I want messages to be consistent in their meaning. +I want functions and options to work consistently. +I want selection and formatting rules to be consistent so that I only have +to learn them once and so that there are no local quirks. + +As a user, I want to be able to use required functions and their options in my messages. +I want to be able to quickly adopt new additions as my implementation supports them +or be able to choose plug-in or shim implementations. +I never want to have to rewrite a message because a function or its options have changed. 
+ +As an implementer or user, I want to be able to suggest useful additions to MF2 functionality +so that users can benefit from consistent, standardized features. +I want to understand the status of my proposal (and those of others) and know that a public, +structured, well-managed process has been applied. + +## Requirements + +_What properties does the solution have to manifest to enable the use-cases above?_ + +The Standard Function Set needs to describe the minimum set of selectors and formatters +needed to create messages effectively. +This must be compatible with ICU MessageFormat 1 messages. + +There must be a clear process for the creation of new selectors that are required +by the Standard Function Set, +which includes a maturation process that permits implementer feedback. + +There must be a clear process for the creation of new formatters that are required +by the Standard Function Set, +which includes a maturation process that permits implementer feedback. + +There must be a clear process for the addition of options or option values that are required +by the Standard Function Set, +which includes a maturation process that permits implementer feedback. + +There must be a clear process for the deprecation of any functions, options, or option values +that are no longer I18N best practices. +The stability guarantees of our standard do not permit removal of any of these. + +## Constraints + +_What prior decisions and existing conditions limit the possible design?_ + +## Proposed Design + +_Describe the proposed solution. Consider syntax, formatting, errors, registry, tooling, interchange._ + +The MessageFormat WG will release a set of specifications +that standardize the implementation of functions and options in the default namespace of +MessageFormat v2 beginning with the LDML46 release. 
+Implementations and users are strongly discouraged from defining +their own functions or options that use the default namespace. +Future updates to these sets of functions and options will coincide with LDML releases. + +Each _function_ is described by a single specification document. +Each such document will use a common template. +A _function_ can be a _formatting function_, +a _selector_, +or both. + +The specification will indicate if the _formatting function_, +the _selector function_, or, where applicable, both are `REQUIRED` or `RECOMMENDED`. +The specification must describe operands, including literal representations. + +The specification includes all defined _options_ for the function. +Each _option_ must define which values it accepts. +An _option_ is either `REQUIRED` or `RECOMMENDED`. + +_Functions_ or _options_ that have a `RECOMMENDED` status +must have a maturity level assigned. +The maturity levels are: +- **Proposed** +- **Accepted** +- **Released** +- **Deprecated** + +_Functions_ and _options_ that have a `REQUIRED` status have only the +`Released` and `Deprecated` statuses. + +* An _option_ can be `REQUIRED` for a `RECOMMENDED` function. + This means that the function is optional to implement but, when implemented, it must include the option. +* An _option_ can be `RECOMMENDED` for a `REQUIRED` function. + This means that the function is required, but implementations are not required to implement the option. +* An _option_ can be `RECOMMENDED` for a `RECOMMENDED` function. + This means that the function is optional to implement and the option is optional when implementing the function. + +A function specification describes the function's _operand_ or _operands_, +its formatting options (if any) and their values, +its selection options (if any) and their values, +its formatting behavior (if any), +its selection behavior (if any), +and its resolved value behavior. + +`REQUIRED` functions are stable and subject to stability guarantees. 
+Such entries will be limited in scope to functions that can reasonably be +implemented in nearly any programming environment. +> Examples: `:string`, `:number`, `:datetime`, `:date`, `:time` + + +`RECOMMENDED` functions are stable and subject to stability guarantees once they +reach the status of **Released**. +Implementations are not required to implement _functions_ or _options_ with a `RECOMMENDED` status +when claiming MF2 conformance. +Implementations MUST NOT implement functions or options that conflict with `RECOMMENDED` functions or options. + +`RECOMMENDED` values may have their status changed to `REQUIRED`, +but not vice-versa. + +> Option Examples: `:datetime` might have a `timeZone` option in LDML46. +> Function Examples: We don't currently have any, but potential work here +> might include personal name formatting, gender-based selectors, etc. + +The CLDR-TC reserves the `u:` namespace for use by the Unicode Consortium. +This namespace can contain _functions_ or _options_. +Implementations are not required to implement these _functions_ or _options_ +and may adopt or ignore them at their discretion, +but are encouraged to implement these items. + +Items in the Unicode Reserved Namespace are stable and subject to stability guarantees. +This namespace might sometimes be used to incubate functionality before +promotion to the default namespace in a future release. +In such cases, the `u:` namespace version is retained, but deprecated. +> Examples: Number and date skeletons are an example of Unicode extension +> possibilities. +> Providing a well-documented shorthand to augment "option bags" is +> popular with some developers, +> but it is not universally available and could represent a barrier to adoption +> if normatively required. + +All `REQUIRED`, `RECOMMENDED`, and Unicode namespace function or option specifications go through +a development process that includes these levels of maturity: + +1. 
**Proposed** The _function_ or _option_, along with necessary documentation, + has been proposed for inclusion in a future release. +2. **Accepted** The _function_ or _option_ has been accepted but is not yet released. + During this period, changes can still be made. +3. **Released** The _function_ or _option_ is accepted as of a given LDML release that MUST be specified. +4. **Deprecated** The _function_ or _option_ was previously _released_ but has been deprecated. + Implementations are still required to support `REQUIRED` functions or options that are deprecated. +5. **Rejected** The _function_ or _option_ was considered and rejected by the MF2 WG and/or the CLDR-TC. + Such items are not part of any standard, but might be maintained for historical reference. + +A proposal can seek to modify an existing function. +For example, if a _function_ `:foo` were a `RECOMMENDED` function in the LDMLxx release, +a proposal to add an _option_ `bar` to this function would take the form +of a proposal to alter the existing specification of `:foo`. +Multiple proposals can exist for a given _function_ or _option_. + +### Process + +Proposals for additions are made via pull requests in a unicode-org GitHub repo +using a specific template TBD. +Proposals for changes are made via pull requests in a unicode-org GitHub repo +using a specific template TBD against the existing specification for the function or option. + +Proposals must be made at least _x months_ prior to the release date to be included +in a specific LDML release. +The CLDR-TC will consider each proposal using _process details here_ and make a determination. +The CLDR-TC may delegate approval to the MF2 WG. +Decisions by the MF2 WG may be appealed to the CLDR-TC. +Decisions by the CLDR-TC may be appealed using _existing process_. + +Technical discussion during the approval process is strongly encouraged. 
+Changes to the proposal, +such as in response to comments or implementation experience, are permitted +until the proposal has been approved. +Once approved, changes require re-approval (how?) + + +The timing of official releases of the Standard Function Set and Optional Set is the same as CLDR/LDML. +Each LDML release will include: +- **Released** specifications in the Standard Function Set +- **Released** specifications in the Unicode reserved namespace +- a section of the MF2 specification specifically incorporating versions of the above +- **Accepted** entries for each of the above available for testing and feedback + +Proposals for additions to any of the above include the following: +- a design document, which MUST contain: + - the exact text to include in the MF2 specification using a template to be named later + +Each proposal is stored in a directory indicating its maturity level. +The maturity levels are: +- **Accepted** Items waiting for the next CLDR release. +- **Released** Complete designs that are released. +- **Proposed** Proposals that have not yet been considered by the MFWG or which are under active development. +- **Rejected** Proposals that have been rejected by the MFWG in the past. + +## Alternatives Considered + +_What other solutions are available?_ +_How do they compare against the requirements?_ +_What other properties do they have?_ diff --git a/exploration/number-selection.md b/exploration/number-selection.md index f0a81cbd09..16bb05bff7 100644 --- a/exploration/number-selection.md +++ b/exploration/number-selection.md @@ -1,6 +1,6 @@ # Selection on Numerical Values -Status: **Accepted** +Status: **Re-Opened**
Metadata @@ -13,6 +13,7 @@ Status: **Accepted**
Pull Request
#471
#621
+
#859
@@ -41,8 +42,8 @@ in PR #842 +@eemeli points out a number of gaps or infelicities in the current specification +and there was extensive discussion of how to address these gaps. + +The `key` for exact numeric match in a variant has to be a string. +The format of such strings, therefore, has to be specified if messages are to be portable and interoperable. +In LDML45 Tech Preview we selected JSON's number serialization as a source for `key` values. +The JSON serialization is ambiguous, in that a given number value might be serialized validly in more than one way: +``` +123 +123.0 +1.23E2 +... etc... +``` + ## Use-Cases As a user, I want to write messages that use the correct plural for @@ -68,13 +84,71 @@ As a user, I want to write messages that mix exact matching and either plural or ordinal selection in a single message. > For example: >``` ->.match {$numRemaining} ->0 {{You have no more chances remaining (exact match)}} ->1 {{You have one more chance remaining (exact match)}} +>.match $numRemaining +>0 {{You have no more chances remaining (exact match)}} +>1 {{You have one more chance remaining (exact match)}} >one {{You have {$numRemaining} chance remaining (plural)}} -> * {{You have {$numRemaining} chances remaining (plural)}} +>* {{You have {$numRemaining} chances remaining (plural)}} >``` +As a user, I want the selector to match the options specified: +``` +.local $num = {123.123 :number maximumFractionDigits=2 minimumFractionDigits=2} +.match $num +123.12 {{This matches}} +120 {{This does not match}} +123.123 {{This does not match}} +1.23123E2 {{Does this match?}} +* {{ ... }} +``` + +Note that badly written keys just don't match, but we want users to be able to intuit whether a given set of keys will work or not. + +``` +.local $num = {123.456 :integer} +.match $num +123.456 {{Should not match?}} +123 {{Should match}} +123.0 {{Should not match?}} +* {{ ... }} +``` + +There can be complications, which we might need to define. 
Consider: + +``` +.local $num = {123.002 :number maximumFractionDigits=1 minimumFractionDigits=0} +.match $num +123.002 {{Should not match?}} +123.0 {{Does minimumFractionDigits make this not match?}} +123 {{Does minimumFractionDigits make this match?}} +* {{ ... }} +``` + +As an implementer, I am concerned about the cost of incorporating _options_ into the selector. +This might be accomplished by building a "second formatter". +Some implementations, such as ICU4J's, might use interfaces like `FormattedNumber` to feed the selector. +Implementations might also apply options by modifying the number value of the _operand_ +(or shadowing the options' effect on the value). + +As a user, I want to be able to perform exact matching using arbitrary-precision numeric types where they are available. + +As an implementer, I do **not** want to be required to provide or implement arbitrary precision +numeric types not available in my platform. +Programming/runtime environments vary widely in support of these types. +MF2 should not prevent the implementation from using, for example, `BigDecimal` or `BigInt` types +and should permit their use in MF2 messages. +MF2 should not _require_ implementations to support such types where they do not exist. +The problem of numeric type precision, +which is implementation dependent, +should not affect how message `key` values are specified. + +> For example: +>``` +>.local $num = {11111111111111.11111111111111 :number} +>.match $num +>11111111111111.11111111111111 {{This works on some implementations.}} +>* {{... but not on others? 
...}} +>``` ## Requirements @@ -166,7 +240,7 @@ function `:number`: - `engineering` - `compact` - `numberingSystem` - - valid [Unicode Number System Identifier](https://cldr-smoke.unicode.org/spec/main/ldml/tr35.html#UnicodeNumberSystemIdentifier) + - valid [Unicode Number System Identifier](https://unicode.org/reports/tr35/tr35.html#UnicodeNumberSystemIdentifier) (default is locale-specific) - `signDisplay` - `auto` (default) @@ -206,7 +280,7 @@ function `:integer`: - `ordinal` - `exact` - `numberingSystem` - - valid [Unicode Number System Identifier](https://cldr-smoke.unicode.org/spec/main/ldml/tr35.html#UnicodeNumberSystemIdentifier) + - valid [Unicode Number System Identifier](https://unicode.org/reports/tr35/tr35.html#UnicodeNumberSystemIdentifier) (default is locale-specific) - `signDisplay` - `auto` (default) @@ -248,7 +322,7 @@ The following options are _not_ part of the default registry. Implementations SHOULD avoid creating options that conflict with these, but are encouraged to track development of these options during Tech Preview: - `currency` - - valid [Unicode Currency Identifier](https://cldr-smoke.unicode.org/spec/main/ldml/tr35.html#UnicodeCurrencyIdentifier) + - valid [Unicode Currency Identifier](https://unicode.org/reports/tr35/tr35.html#UnicodeCurrencyIdentifier) (no default) - `currencyDisplay` - `symbol` (default) @@ -278,7 +352,8 @@ but can cause problems in target locales that the original developer is not cons > considering other locale's need for a `one` plural: > > ``` -> .match {$var} +> .input {$var :integer} +> .match $var > 1 {{You have one last chance}} > one {{You have {$var} chance remaining}} // needed by languages such as Polish or Russian > // such locales typically require other keywords @@ -290,7 +365,13 @@ but can cause problems in target locales that the original developer is not cons ### Percent Style When implementing `style=percent`, the numeric value of the operand -MUST be divided by 100 for the purposes of 
formatting. +MUST be multiplied by 100 for the purposes of formatting. + +> For example, +> ``` +> .local $percent = {1 :integer style=percent} +> {{This formats as '100%' in the en-US locale: {$percent}}} +> ``` ### Selection @@ -416,7 +497,9 @@ To expand on the last of these, consider this message: ``` -.match {$count :plural minimumFractionDigits=1} +.input {$count :number minimumFractionDigits=1} +.local $selector = {$count :plural} +.match $selector 0 {{You have no apples}} 1 {{You have exactly one apple}} * {{You have {$count :number minimumFractionDigits=1} apples}} @@ -431,9 +514,9 @@ With the proposed design, this message would much more naturally be written as: ``` .input {$count :number minimumFractionDigits=1} -.match {$count} -0 {{You have no apples}} -1 {{You have exactly one apple}} +.match $count +0.0 {{You have no apples}} +1.0 {{You have exactly one apple}} one {{You have {$count} apple}} * {{You have {$count} apples}} ``` @@ -460,3 +543,96 @@ and they _might_ converge on some overlap that users could safely use across pla #### Cons - No guarantees about interoperability for a relatively core feature. + +## Alternatives Considered (`key` matching) + +### Standardize the Serialization Forms + +Modify the above exact match as follows. +Note that this implementation is less restrictive than before, but still leaves some +values that cannot be matched. +> [!IMPORTANT] +> The exact behavior of exact literal match is only defined for +> a specific range of numeric values and does not support scientific notation. +> Very large or very small numeric values will be difficult to perform +> exact matching on. +> Avoid depending on these types of keys in message selection. +> [!IMPORTANT] +> For implementations that do not have arbitrary precision numeric types +> or operands that do not use these types, +> it is possible to specify a key value that exceeds the precision +> of the underlying type. 
+> Such a key value will not work reliably or may not work at all +> in such implementations. +> Avoid depending on such key values in message selection. +Number literals in the MessageFormat 2 syntax use a subset of the +[format defined for a JSON number](https://www.rfc-editor.org/rfc/rfc8259#section-6). +The resolved value of an `operand` exactly matches a numeric literal `key` +if, when the `operand` is serialized using this format, +the two strings are equal. +```abnf +number = [ "-" ] int [ fraction ] +integer = "0" / [ "-" ] (digit19 *DIGIT) +int = "0" / (digit19 *DIGIT) +digit19 = %x31-39 ; 1-9 +fraction = "." 1*DIGIT +``` +If the function `:integer` is used or the `maximumFractionDigits` is 0, +the production `integer` is used and any fractional amount is omitted, +otherwise the `minimumFractionDigits` number of digits is produced, +zero-filled as needed. +The implementation applies the `maximumSignificantDigits` to the value +being serialized. +This might involve locally specific rounding. +The `minimumSignificantDigits` has no effect on the value produced for comparison. +The option `signDisplay` has no effect on the value produced for comparison. +> [!NOTE] +> Implementations are not expected to implement this exactly as written, +> as there are clearly optimizations that can be applied. 
+> Here are some examples: +> ``` +> .input {$num :integer} +> .match $num +> 0 {{The number 0}} +> 1 {{The number 1}} +> -1 {{The number -1}} +> 1.0 {{This cannot match}} +> 1.1 {{This cannot match}} +> ``` +> ``` +> .input {$num :number maximumFractionDigits=2 minimumFractionDigits=2} +> .match $num +> 0 {{This does not match}} +> 0.00 {{This matches the value 0}} +> 0.0 {{This does not match}} +> 0.000 {{This does not match}} +> ``` +> ``` +> .input {$num :number minimumFractionDigits=2 maximumFractionDigits=5} +> .match $num +> 0.12 {{Matches the value 0.12}} +> 0.123 {{Matches the value 0.123}} +> 0.12345 {{Matches the value 0.12345}} +> 0.123456 {{Does not match}} +> 0.12346 {{May match the value 0.123456 depending on local rounding mode?}} +> ``` +> ``` +> .input {$num :number} +> -0 {{Error: Bad Variant Key}} +> -99 {{The value -99}} +> 1111111111111111111111111111 {{Might exceed the size of local integer type, but is valid}} +> 11111111111111.1111111111111 {{Might exceed local floating point precision, but is valid}} +> 1.23e-37 {{Error: Bad Variant Key}} +> ``` + + + +### Compare numeric values + +This is the design proposed in #842. + +This modifies the key-match algorithm to use implementation-defined numeric value exact match: + +> 1. Let `exact` be the numeric value represented by `key`. +> 1. If `value` and `exact` are numerically equal, then + diff --git a/exploration/quoted-literals.md b/exploration/quoted-literals.md index e46d7451c7..6eba257ed5 100644 --- a/exploration/quoted-literals.md +++ b/exploration/quoted-literals.md @@ -1,6 +1,6 @@ # Quoted Literals -Status: **Proposed** +Status: **Accepted**
Metadata diff --git a/exploration/registry-xml/README.md b/exploration/registry-xml/README.md new file mode 100644 index 0000000000..a3a3a6890c --- /dev/null +++ b/exploration/registry-xml/README.md @@ -0,0 +1,184 @@ +# MessageFormat 2.0 Registry + +Implementations and tooling can greatly benefit from a +structured definition of formatting and matching functions available to messages at runtime. + +> [!IMPORTANT] +> This definition was initially developed to be a part of the MessageFormat 2.0 specification, +> but has been left out in preference of less structural definitions of message functions +> and an expectation that real-world experience with tooling will be able to inform +> later considerations to return to this topic. + +## Goals + +The registry provides a description of MessageFormat 2 functions, +in order to support the following goals and use-cases: + +- Validate semantic properties of messages. For example: + - Type-check values passed into functions. + - Validate that matching functions are only called in selectors. + - Validate that formatting functions are only called in placeholders. + - Verify the exhaustiveness of variant keys given a selector. +- Support the localization roundtrip. For example: + - Generate variant keys for a given locale during XLIFF extraction. +- Improve the authoring experience. For example: + - Forbid edits to certain function options (e.g. currency options). + - Autocomplete function and option names. + - Display on-hover tooltips for function signatures with documentation. + - Display/edit known message metadata. + - Restrict input in GUI by providing a dropdown with all viable option values. + +## Conformance and Use + +Implementations are not required to provide a machine-readable registry +nor to read or interpret the registry data model in order to be conformant. 
+ +The MessageFormat 2.0 Registry was created to describe +the core set of formatting and selection _functions_, +including _operands_, _options_, and _option_ values. +This is the minimum set of functionality needed for conformance. +By using the same names and values, _messages_ can be used interchangeably +by different implementations, +regardless of programming language or runtime environment. +This ensures that developers do not have to relearn core MessageFormat syntax +and functionality when moving between platforms +and that translators do not need to know about the runtime environment for most +selection or formatting operations. + +The registry provides a machine-readable description of _functions_ +suitable for tools, such as those used in translation automation, so that +variant expansion and information about available _options_ and their effects +are available in the translation ecosystem. +To that end, implementations are strongly encouraged to provide appropriately +tailored versions of the registry for consumption by tools +(even if not included in software distributions) +and to encourage any add-on or plug-in functionality to provide +a registry to support localization tooling. + +## Registry Data Model + +MessageFormat 2 functions can be invoked in two contexts: + +- inside placeholders, to produce a part of the message's formatted output; + for example, a raw value of `|1.5|` may be formatted to `1,5` in a language which uses commas as decimal separators, +- inside selectors, to contribute to selecting the appropriate variant among all given variants. + +A single _function name_ may be used in both contexts, +regardless of whether it's implemented as one or multiple functions. + +A _signature_ defines one particular set of at most one argument and any number of named options +that can be used together in a single call to the function. +`<formatSignature>` corresponds to a function call inside a placeholder inside translatable text. 
+`<matchSignature>` corresponds to a function call inside a selector. + +A signature may define the positional argument of the function with the `<input>` element. +If the `<input>` element is not present, the function is defined as a nullary function. +A signature may also define one or more `