Add interchange data model description + JSON Schema definition #393

eemeli · 2023-06-17T18:18:33Z

This proposes a way for us to fulfill our deliverable of:

> A formal definition of the canonical data model for representing localizable dynamic message strings.

This is not a data model that implementations MUST use, but one which they MAY support. It's intended to work as a formalization of the meaning of our syntax as well as an interchange format for messages in other syntaxes, such as MF1. Effectively, it's synonymous with our syntax, but expressed as a parsed JSON structure.

It is intended to be a format capable of representing messages in all current syntaxes. As in, you should be able to parse anything into this structure with no data loss, and then use a compatible MF2 library with your messages. Round-tripping messages from formats that support inline selectors, such as MF1 and Fluent, might not retain their original structure, but the result will be semantically equivalent.

The format is included inline in the markdown doc using TypeScript interfaces, but its canonical form is provided as a JSON Schema definition.
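As an illustration of the "semantically equivalent, but not structurally identical" point, here is a rough sketch of how an MF1 message with an inline plural selector might look once parsed into this kind of structure. Only the `select` message fields and the `Expression` `type`/`body` fields are taken from this PR; every other interface and field name, and the function names `plural` and `number`, are assumptions made for the example.

```typescript
// Assumed, simplified subset of the data model interfaces.
interface Text { type: 'text'; value: string }
interface Literal { type: 'literal'; value: string }
interface CatchallKey { type: '*' }
interface VariableRef { type: 'variable'; name: string }
interface FunctionRef { type: 'function'; name: string; operand?: Literal | VariableRef }
interface Expression { type: 'expression'; body: Literal | VariableRef | FunctionRef }
type Pattern = Array<Text | Expression>
interface Variant { keys: Array<Literal | CatchallKey>; value: Pattern }
interface SelectMessage {
  type: 'select'
  declarations: unknown[] // not used in this example
  selectors: Expression[]
  variants: Variant[]
}

// MF1 source: You have {count, plural, one {# file} other {# files}}.
const count: VariableRef = { type: 'variable', name: 'count' }
const countAsNumber: Expression = {
  type: 'expression',
  body: { type: 'function', name: 'number', operand: count } // hypothetical function name
}

// The inline selector is hoisted to the top level, so the surrounding text
// ("You have " and ".") is repeated in every variant.
const message: SelectMessage = {
  type: 'select',
  declarations: [],
  selectors: [
    { type: 'expression', body: { type: 'function', name: 'plural', operand: count } }
  ],
  variants: [
    {
      keys: [{ type: 'literal', value: 'one' }],
      value: [{ type: 'text', value: 'You have ' }, countAsNumber, { type: 'text', value: ' file.' }]
    },
    {
      keys: [{ type: '*' }],
      value: [{ type: 'text', value: 'You have ' }, countAsNumber, { type: 'text', value: ' files.' }]
    }
  ]
}
```

Serializing this back to MF1 would most likely produce a top-level plural rather than the original inline one, which is the loss of structure mentioned above.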

cdaringe · 2023-06-19T18:32:12Z

Could this PR be considered part of an unmentioned step in #396 (Add missing formatting sections)? Could this data model be just one of many in play? This model reads to me like a ParsedMessage data model, where message parsing is actually the first unmentioned step of #396.
I can foresee other possible data models, like your MessageValue structure, also being a spec deliverable.

In the case of a central message provider dispatching messages to N-type-clients, perhaps in some systems:

- sending the raw ParsedMessage model over-the-wire could be applicable (data was not available to the central provider, thus unable to progress thru resolution & selection), or
- sending the MessageValue over the wire could be applicable (data was available to the central provider, thus able to progress thru resolution & selection, but defers formatting to the client),

...depending on data availability in either context. Being able to distribute the parsing could be beneficial (as clearly evidenced by this PR)!

eemeli · 2023-06-19T20:37:01Z

> This model reads to me like a ParsedMessage data-model, where message parsing is actually the first unmentioned step of #396.

That's a good point; the formatting spec should really mention that it's getting a parsed message structure from somewhere to start with. We should not require parsing syntax into the data model specified here before formatting, but the synonymity between the data model and the syntax should allow for messages from either source (syntax or data model) to be equivalently formatted.

> I can foresee other possible data models, like your MessageValue structure, also being a spec deliverable.

That is a possibility, but tbh it's somewhat unlikely. For Intl.MessageFormat the MessageValue structure makes sense, as there it's intended to act as a base platform for further wrapping formatters. Consider a DOM/HTML message formatter: For something like that, it makes much more sense to output DOM fragments rather than MessageValues. Similarly in other settings, the surrounding environment will strongly guide the shape of non-string formatting targets.

I don't think there is sufficient value in harmonising the output format of MF2 to overcome the costs that would be incurred by implementations needing to fit its mold.

> sending the raw ParsedMessage model [...] or sending the MessageValue over the wire could be applicable

Parsing something like MessageResource should be comparable to parsing JSON, and formatting a message should be a rather fast operation, so I rather hope that we'll be able to not need such optimizations. But yes, there is value in portable representations of messages.

I see the following as probably the greatest benefits provided by this data model:

1. Easing the initial or partial adoption of MF2. Consider an existing application or other system that already has messages in some legacy format, tooling and pipelines for their translation, and a runtime for their formatting.

   With a well-defined data model, it becomes easier for a step-by-step transition to MF2 to take place. Initially, the runtime parser could be replaced with a parser targeting the MF2 data model and the formatter with an MF2 implementation. Next, the message data could be transformed to MF2 during the application build. Further steps could replace the whole localization stack with MF2.

   Without a data model, it becomes much harder to progress in such small steps, and the immediate cost of starting to use MF2 becomes much higher.
2. Message transformations. Not only between message formats, but also when message contents need to change. With a data model, it's much easier for interoperable tooling to be able to e.g. update a variable reference in all localized versions of a message.
3. Providing a universal "message" definition. This one is admittedly a bit fuzzier than the preceding, but potentially even more valuable. As far as I know, this data model is the first one that even attempts to be compatible with all message formats. With this, you should be able to take a message in any existing syntax and represent it as structured data. I don't really know what people might do with it, but providing a qualitatively new tool is bound to be useful in all sorts of ways.
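The "update a variable reference in all localized versions of a message" case could look roughly like the following. This is a minimal sketch over an assumed, cut-down version of the data model: declarations and function references are left out, `pattern` is treated as a plain array, and the helper `renameVariable` is hypothetical rather than part of any MF2 library.

```typescript
// Assumed, simplified interfaces; only `type`, `body` and `pattern`
// are names that appear in this PR.
interface Text { type: 'text'; value: string }
interface Literal { type: 'literal'; value: string }
interface VariableRef { type: 'variable'; name: string }
interface Expression { type: 'expression'; body: Literal | VariableRef }
interface PatternMessage {
  type: 'message'
  pattern: Array<Text | Expression> // declarations omitted for brevity
}

// Hypothetical helper: returns a copy of the message in which every
// reference to the variable `from` refers to `to` instead.
function renameVariable(msg: PatternMessage, from: string, to: string): PatternMessage {
  const pattern = msg.pattern.map((part): Text | Expression => {
    if (part.type === 'expression' && part.body.type === 'variable' && part.body.name === from) {
      return { type: 'expression', body: { type: 'variable', name: to } }
    }
    return part
  })
  return { ...msg, pattern }
}

// The same structural edit is applied to each localized version,
// independently of the syntax the messages were originally written in.
const localized: Record<string, PatternMessage> = {
  en: {
    type: 'message',
    pattern: [
      { type: 'text', value: 'Hello, ' },
      { type: 'expression', body: { type: 'variable', name: 'user' } }
    ]
  },
  fi: {
    type: 'message',
    pattern: [
      { type: 'text', value: 'Hei, ' },
      { type: 'expression', body: { type: 'variable', name: 'user' } }
    ]
  }
}

const renamed: Record<string, PatternMessage> = {}
for (const locale of Object.keys(localized)) {
  renamed[locale] = renameVariable(localized[locale], 'user', 'userName')
}
```

A real tool would also need to walk declarations, selectors, variants and function options; the point here is only that the edit operates on structure rather than on source text.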

catamorphism · 2023-06-20T00:22:36Z

spec/data-model.md

+While this document uses TypeScript syntax for their definition,
+the canonical and authoritative source is the `message.json`
+[JSON Schema](https://json-schema.org/) definition.


I'm a little worried about maintenance of this doc (at least while the proposal continues to change) and keeping the TypeScript and JSON Schema definitions consistent with each other. Though I can see why you want to have the TypeScript interfaces for exposition and the JSON Schema definition for reference. Maybe that's a non-concern given that you say that the JSON Schema one is authoritative.

eemeli · 2023-06-20

I don't think we'll be changing the structure much any more, and I'd say that the experience with the ABNF would indicate that the need to update two different places when applying changes is not too onerous. And yes, the explicit reference to the JSON Schema as authoritative is intended to disambiguate any divergences, should such form.

catamorphism · 2023-06-20T00:33:18Z

spec/data-model.md

+  type: 'select'
+  declarations: Declaration[]
+  selectors: Expression[]
+  variants: Variant[]


Currently, the ICU4J Mf2DataModel class has this method to get the message's variants:

public OrderedMap<SelectorKeys, Pattern> getVariants();

This is less convenient to implement in ICU4C than it is to return a list of Variants, but I'm trying to do it anyway for the sake of parity with ICU4J.

I like what you have here (Variant[]) better because it's easier to implement in C++, where ICU4C doesn't enjoy the benefits of Java's polymorphic OrderedMap class. But also because while the list and map representations are isomorphic, I think it's more appealing to have an API that returns a list and let users build their own on top of it that does some kind of optimization for more efficient pattern-matching than it is to return a map when maybe efficient lookup isn't always necessary.

Whether it ends up being a list or a map, mostly I just wanted to highlight that the ICU4J and ICU4C implementations should match what's defined here.
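To make the "return a list and let users build their own" idea concrete, here is a minimal sketch of layering a keyed lookup on top of a `Variant[]` list. The `keys` and `value` fields, the `CatchallKey` shape, the string placeholder for `Pattern`, and the helpers `keyString` and `indexVariants` are all assumptions for illustration; this is not how ICU4J or ICU4C do it.

```typescript
// Assumed shapes; Pattern is reduced to a string to keep the example short.
interface Literal { type: 'literal'; value: string }
interface CatchallKey { type: '*' }
type Pattern = string
interface Variant { keys: Array<Literal | CatchallKey>; value: Pattern }

// Hypothetical helper: encode a key list as a single string usable as a Map key.
// Literal keys get a prefix so that a literal "*" differs from the catch-all.
function keyString(keys: Array<Literal | CatchallKey>): string {
  return JSON.stringify(keys.map(k => (k.type === '*' ? '*' : 'lit:' + k.value)))
}

// Build an index from the list. A JavaScript Map iterates in insertion order,
// so the original variant order is still recoverable from the index.
function indexVariants(variants: Variant[]): Map<string, Pattern> {
  const index = new Map<string, Pattern>()
  for (const variant of variants) {
    const key = keyString(variant.keys)
    if (!index.has(key)) index.set(key, variant.value) // first occurrence wins
  }
  return index
}
```

An implementation that never needs keyed lookup can simply iterate the list and skip building the index.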

catamorphism · 2023-06-20T04:11:51Z

spec/message.json

+      "properties": {
+        "type": { "const": "message" },
+        "declarations": { "$ref": "#/$defs/declarations" },
+        "pattern": { "$ref": "#/$defs/pattern" }


Seems like it might be simpler to have a single message, which has a body attribute that's one of pattern or select, and then select would just have the selectors and variants fields? (In other words, declarations would be factored out.)

eemeli · 2023-06-21

There's also the approach @stasm has suggested, i.e. expressing single-pattern messages as having an empty list for selectors, and a single entry in variants with a correspondingly empty list of keys.

As I mentioned on the call, I'd much prefer landing something first and then iterating on it, rather than sorting out the details all in one go.

stasm · 2023-07-24

I filed #437 to continue discussing this.
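For readers following along, the unification attributed to @stasm above can be sketched as a small conversion. The `PatternMessage` and `SelectMessage` field names come from the snippets in this PR; `UnifiedMessage`, `unify`, the `value` field on `Variant`, and the placeholder types are hypothetical and only serve the illustration.

```typescript
// Placeholder types; the real Declaration and Pattern shapes are not needed here.
type Declaration = unknown
type Pattern = string
interface Literal { type: 'literal'; value: string }
interface VariableRef { type: 'variable'; name: string }
interface Expression { type: 'expression'; body: Literal | VariableRef }
interface Variant { keys: Literal[]; value: Pattern }

interface PatternMessage { type: 'message'; declarations: Declaration[]; pattern: Pattern }
interface SelectMessage {
  type: 'select'
  declarations: Declaration[]
  selectors: Expression[]
  variants: Variant[]
}

// Hypothetical unified shape: every message has selectors and variants.
interface UnifiedMessage {
  declarations: Declaration[]
  selectors: Expression[]
  variants: Variant[]
}

function unify(msg: PatternMessage | SelectMessage): UnifiedMessage {
  if (msg.type === 'message') {
    // A single-pattern message becomes zero selectors plus one variant
    // whose key list is correspondingly empty.
    return {
      declarations: msg.declarations,
      selectors: [],
      variants: [{ keys: [], value: msg.pattern }]
    }
  }
  return { declarations: msg.declarations, selectors: msg.selectors, variants: msg.variants }
}
```

The trade-off is that consumers lose the `type` discriminant and instead check whether `selectors` is empty.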

catamorphism

Looks good. I agree with your intention to land something and iterate on it more later, so in that spirit, I think the only one of my suggestions that needs to be addressed is the clarification about what the fields of a Reserved are.

catamorphism

Looks great!

stasm

As agreed in yesterday's meeting, we'll land this without the JSON Schema for now.

stasm · 2023-07-11T05:47:43Z

spec/data-model.md

+
+interface Expression {
+  type: 'expression'
+  body: Literal | VariableRef | FunctionRef | Reserved


A FunctionRef can also have a Literal or a VariableRef as an argument, and in fact, due to how our syntax is designed, I'd argue that the function's argument is more important than the function. (E.g. it comes first in the syntax.)

I'd like to suggest an alternative way to structure our expressions, to more closely map to our syntax:

expression = "{" [s] ((operand [s annotation]) / annotation) [s] "}"

Let's special-case argument-less functions rather than function-less operands.

Instead of Literal | VariableRef | FunctionRef here and operand?: Literal | VariableRef inside FunctionRef, we can do:

type Expression = OperandExpr | FunctionExpr;

interface OperandExpr {
  operand: Literal | VariableRef;
  annotation?: FunctionExpr;
}

interface FunctionExpr {
  name: string;
  options: Map<string, Literal | VariableRef>;
}

FWIW, this is how I implemented expressions in stasm/message2:
https://github.com/stasm/message2/blob/4abf43f2023b6e20d8ee1d462684d0741ece791b/syntax/ast.ts#L44-L70

(Not blocking this PR on this, but I'd like to discuss this change as a follow-up.)

eemeli · 2023-07-11

I will be happy to discuss this further in a follow-on issue or PR.

stasm · 2023-07-24

Filed #436 to continue this.
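To compare the two expression shapes discussed in this thread, here is a minimal sketch of converting the PR's `Expression` into the suggested `OperandExpr | FunctionExpr` form. `Reserved` is left out because the suggestion does not cover it, the `name` and `options` fields on `FunctionRef` are assumed, and `AltExpression` and `toAlt` are hypothetical names.

```typescript
interface Literal { type: 'literal'; value: string }
interface VariableRef { type: 'variable'; name: string }
type Operand = Literal | VariableRef

// Shape in this PR (Reserved omitted; `name` and `options` are assumed fields).
interface FunctionRef {
  type: 'function'
  name: string
  operand?: Operand
  options: Map<string, Operand>
}
interface Expression { type: 'expression'; body: Operand | FunctionRef }

// Suggested alternative shape, renamed here to avoid clashing with Expression.
interface FunctionExpr { name: string; options: Map<string, Operand> }
interface OperandExpr { operand: Operand; annotation?: FunctionExpr }
type AltExpression = OperandExpr | FunctionExpr

function toAlt(expr: Expression): AltExpression {
  const body = expr.body
  // A bare literal or variable becomes an operand without an annotation.
  if (body.type !== 'function') return { operand: body }
  const annotation: FunctionExpr = { name: body.name, options: body.options }
  // With an operand, the function hangs off the operand as its annotation;
  // without one, the argument-less function is the special case.
  return body.operand ? { operand: body.operand, annotation } : annotation
}

// Data-model form of an annotated variable such as `$count` with a `number` function.
const example: Expression = {
  type: 'expression',
  body: {
    type: 'function',
    name: 'number',
    operand: { type: 'variable', name: 'count' },
    options: new Map<string, Operand>()
  }
}
const converted = toAlt(example)
```

In the converted form the operand sits at the top level, mirroring the `operand [s annotation]` order of the syntax.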

eemeli · 2023-07-18T14:10:49Z

@aphillips In the interests of getting this landed, I went ahead and applied the language change myself, dropping references to the JSON schema. As this already has a ✅ from @stasm, I believe that we can merge this without waiting for the next meeting if you yourself are ok with the text.

eemeli · 2023-07-18T14:12:53Z

To make sure that it's usable for future work after this lands, the JSON schema is still available here:
https://github.com/eemeli/message-format-wg/blob/json-schema-data-model/spec/message.json

eemeli added the data model and Agenda+ labels Jun 17, 2023

eemeli requested review from aphillips, stasm, zbraniecki, echeran and mihnita June 17, 2023 18:18

Add interchange data model description + JSON schema

6bdc19a

eemeli force-pushed the data-model branch from 7fa8620 to 6bdc19a Compare June 17, 2023 18:21

catamorphism reviewed Jun 20, 2023

Apply suggestions from code review

4802d9b

Co-authored-by: Tim Chevalier <tjc@igalia.com>

eemeli requested a review from catamorphism June 21, 2023 05:01

catamorphism reviewed Jun 24, 2023

Apply suggestions from code review

655be51

eemeli requested a review from catamorphism June 25, 2023 07:23

catamorphism approved these changes Jul 5, 2023

eemeli mentioned this pull request Jul 6, 2023

Replace (literal / variable) with operand in definition of option #412

Closed

stasm approved these changes Jul 11, 2023

eemeli commented Jul 11, 2023

eemeli added 2 commits July 18, 2023 17:01

Drop JSON Schema definition

a83ccb3

Include / as a valid Reserved sigil

53c2aef

aphillips removed the Agenda+ label Jul 24, 2023

aphillips approved these changes Jul 24, 2023

This was referenced Jul 24, 2023

Split Expression into OperandExpr and FunctionExpr #436

Closed

Unify PatternMessage and SelectMessage #437

Closed

aphillips merged commit 0545f50 into unicode-org:main Jul 24, 2023

eemeli deleted the data-model branch July 24, 2023 19:06

eemeli mentioned this pull request Jul 24, 2023

Add JSON Schema & XML DTD definitions of message data model #439

Merged
