Add interchange data model description + JSON Schema definition #393

Merged
merged 5 commits into from
Jul 24, 2023

Conversation

eemeli
Collaborator

@eemeli eemeli commented Jun 17, 2023

This proposes a way for us to fulfill our deliverable of:

A formal definition of the canonical data model for representing localizable dynamic message strings.

This is not a data model that implementations MUST use, but one which they MAY support. It's intended to work as a formalization of the meaning of our syntax as well as an interchange format for messages in other syntaxes, such as MF1. Effectively, it's synonymous with our syntax, but expressed as a parsed JSON structure.

It is intended to be a format capable of representing messages in all current syntaxes. As in, you should be able to parse anything into this structure with no data loss, and then use a compatible MF2 library with your messages. Full roundtripping of formats that support inline selectors like MF1 and Fluent might not retain their original structure, but will be semantically equivalent.

The format is included inline in the markdown doc using TypeScript interfaces, but its canonical form is provided as a JSON Schema definition.
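To make "expressed as a parsed JSON structure" concrete, here is a minimal sketch of what a simple message might look like as data. Only the `type`, `declarations` and `pattern` fields of the message and the `type: 'expression'` / `body` fields of an expression appear in the excerpts quoted in this review; the `Text` and `VariableRef` shapes, and treating a pattern as a flat array, are assumptions made for illustration.

```typescript
// Hypothetical, simplified shapes; not the PR's full data model.
interface Text { type: 'text'; value: string }
interface VariableRef { type: 'variable'; name: string }
interface Expression { type: 'expression'; body: VariableRef }
interface PatternMessage {
  type: 'message'
  declarations: unknown[]
  pattern: Array<Text | Expression>
}

// A greeting with one variable placeholder, written directly as data.
const greeting: PatternMessage = {
  type: 'message',
  declarations: [],
  pattern: [
    { type: 'text', value: 'Hello, ' },
    { type: 'expression', body: { type: 'variable', name: 'name' } },
    { type: 'text', value: '!' }
  ]
}

// Because it is plain data, the message survives a JSON round trip unchanged.
const roundTripped: PatternMessage = JSON.parse(JSON.stringify(greeting))
```

A parser for some other syntax could target a structure like this, after which any formatter that accepts the data model could consume the result.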

@eemeli added the "data model" (Issues related to the Interchange Data Model) and "Agenda+" (Requested for upcoming teleconference) labels on Jun 17, 2023
@cdaringe
Contributor

cdaringe commented Jun 19, 2023

  1. Could this PR be considered part of an unmentioned step in "Add missing formatting sections" (#396)? Could this data model be just one of many in play? This model reads to me like a ParsedMessage data model, where message parsing is actually the first unmentioned step of #396.
  2. I can foresee other possible data models, like your MessageValue structure, also being a spec deliverable.

In the case of a central message provider dispatching messages to N-type-clients, perhaps in some systems:

  • sending the raw ParsedMessage model over-the-wire could be applicable (data was not available to the central provider, thus unable to progress thru resolution & selection), or
  • sending the MessageValue over the wire could be applicable (data was available to the central provider, thus able to progress thru resolution & selection, but defers formatting to the client),

...depending on data availability in either context. Being able to distribute the parsing could be beneficial (as clearly evidenced by this PR)!

@eemeli
Collaborator Author

eemeli commented Jun 19, 2023

This model reads to me like a ParsedMessage data-model, where message parsing is actually the first unmentioned step of #396.

That's a good point; the formatting spec should really mention that it's getting a parsed message structure from somewhere to start with. We should not require parsing syntax into the data model specified here before formatting, but the synonymity between the data model and the syntax should allow for messages from either source (syntax or data model) to be equivalently formatted.

I can foresee other possible data models, like your MessageValue structure, also being a spec deliverable.

That is a possibility, but tbh it's somewhat unlikely. For Intl.MessageFormat the MessageValue structure makes sense, as there it's intended to act as a base platform for further wrapping formatters. Consider a DOM/HTML message formatter: For something like that, it makes much more sense to output DOM fragments rather than MessageValues. Similarly in other settings, the surrounding environment will strongly guide the shape of non-string formatting targets.

I don't think there is sufficient value in harmonising the output format of MF2 to overcome the costs that would be incurred by implementations needing to fit its mold.

sending the raw ParsedMessage model [...] or sending the MessageValue over the wire could be applicable

Parsing something like MessageResource should be comparable to parsing JSON, and formatting a message should be a rather fast operation, so I rather hope that we won't need such optimizations. But yes, there is value in portable representations of messages.

I see the following as probably the greatest benefits provided by this data model:

  1. Easing the initial or partial adoption of MF2. Consider an existing application or other system that already has messages in some legacy format, tooling and pipelines for their translation, and a runtime for their formatting.

    With a well-defined data model, it becomes easier for a step-by-step transition to MF2 to take place. Initially, the runtime parser could be replaced with a parser targeting the MF2 data model and the formatter with an MF2 implementation. Next, the message data could be transformed to MF2 during the application build. Further steps could replace the whole localization stack with MF2.

    Without a data model, it becomes much harder to progress in such small steps, and the immediate cost of starting to use MF2 becomes much higher.

  2. Message transformations. Not only between message formats, but also when message contents need to change. With a data model, it's much easier for interoperable tooling to be able to e.g. update a variable reference in all localized versions of a message.

  3. Providing a universal "message" definition. This one is admittedly a bit fuzzier than the preceding, but potentially even more valuable. As far as I know, this data model is the first one that even attempts to be compatible with all message formats. With this, you should be able to take a message in any existing syntax and represent it as structured data. I don't really know what people might do with it, but providing a qualitatively new tool is bound to be useful in all sorts of ways.
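The message-transformation benefit in point 2 can be sketched roughly as follows: once every localized version of a message is available as structured data, renaming a variable reference is a plain tree rewrite. The shapes below are simplified and hypothetical (patterns as flat arrays, expressions holding only a variable reference), and `renameVariable` is an illustrative helper, not part of any MF2 tooling.

```typescript
// Simplified, hypothetical shapes for illustration; not the PR's full model.
interface Text { type: 'text'; value: string }
interface VariableRef { type: 'variable'; name: string }
interface Expression { type: 'expression'; body: VariableRef }
type PatternElement = Text | Expression
interface PatternMessage {
  type: 'message'
  declarations: unknown[]
  pattern: PatternElement[]
}

// Returns a copy of the message with every reference to `from` renamed to `to`.
function renameVariable(msg: PatternMessage, from: string, to: string): PatternMessage {
  const pattern = msg.pattern.map((el): PatternElement =>
    el.type === 'expression' && el.body.name === from
      ? { type: 'expression', body: { type: 'variable', name: to } }
      : el
  )
  return { type: 'message', declarations: msg.declarations, pattern }
}

// The same message in two locales, keyed by locale code.
const localized = new Map<string, PatternMessage>([
  ['en', { type: 'message', declarations: [], pattern: [
    { type: 'text', value: 'Hello, ' },
    { type: 'expression', body: { type: 'variable', name: 'user' } }
  ] }],
  ['fi', { type: 'message', declarations: [], pattern: [
    { type: 'text', value: 'Hei, ' },
    { type: 'expression', body: { type: 'variable', name: 'user' } }
  ] }]
])

// Apply the same rename to every localized version.
const renamed = new Map<string, PatternMessage>()
localized.forEach((msg, locale) => {
  renamed.set(locale, renameVariable(msg, 'user', 'userName'))
})
```

The translated text is untouched; only the structural reference changes, which is hard to guarantee when editing message strings with search and replace.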

Comment on lines 14 to 16
While this document uses TypeScript syntax for their definition,
the canonical and authoritative source is the `message.json`
[JSON Schema](https://json-schema.org/) definition.
Collaborator

I'm a little worried about maintenance of this doc (at least while the proposal continues to change) and keeping the TypeScript and JSON Schema definitions consistent with each other. Though I can see why you want to have the TypeScript interfaces for exposition and the JSON Schema definition for reference. Maybe that's a non-concern given that you say that the JSON Schema one is authoritative.

Collaborator Author

I don't think we'll be changing the structure much any more, and I'd say that the experience with the ABNF would indicate that the need to update two different places when applying changes is not too onerous. And yes, the explicit reference to the JSON Schema as authoritative is intended to disambiguate any divergences, should such form.

type: 'select'
declarations: Declaration[]
selectors: Expression[]
variants: Variant[]
Collaborator

Currently, the ICU4J Mf2DataModel class has this method to get the message's variants:

public OrderedMap<SelectorKeys, Pattern> getVariants();

This is less convenient to implement in ICU4C than it is to return a list of Variants, but I'm trying to do it anyway for the sake of parity with ICU4J.

I like what you have here (Variant[]) better, partly because it's easier to implement in C++, where ICU4C doesn't enjoy the benefits of Java's polymorphic OrderedMap class. But also because, while the list and map representations are isomorphic, I find it more appealing for the API to return a list and let users build their own structure on top of it for more efficient pattern matching, rather than to return a map when efficient lookup isn't always necessary.

Whether it ends up being a list or a map, mostly I just wanted to highlight that the ICU4J and ICU4C implementations should match what's defined here.
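As a rough illustration of "return a list and let users build their own on top of it", the sketch below derives an insertion-ordered lookup map from a variant list. The `Variant` shape is simplified and hypothetical (string keys with `'*'` as a catch-all, a string in place of a pattern), and `indexVariants` is an illustrative helper rather than an ICU4J or ICU4C API.

```typescript
// Simplified, hypothetical shape; the PR's actual key and pattern types are richer.
interface Variant { keys: string[]; value: string }

// Builds an index from a variant list. A JavaScript Map iterates in insertion
// order, so the list order is still recoverable from the index.
function indexVariants(variants: Variant[]): Map<string, Variant> {
  const index = new Map<string, Variant>()
  for (const variant of variants) {
    const id = JSON.stringify(variant.keys) // unambiguous id for a list of keys
    // Assumption for this sketch: if a key list is repeated, keep the first one.
    if (!index.has(id)) index.set(id, variant)
  }
  return index
}

const variants: Variant[] = [
  { keys: ['one'], value: 'One item' },
  { keys: ['*'], value: 'Many items' }
]

const index = indexVariants(variants)
const exact = index.get(JSON.stringify(['one']))
```

This only covers exact lookup by key list; real variant selection also has to handle catch-all keys and selector preferences, which is one reason the list form is the more neutral starting point.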

"properties": {
"type": { "const": "message" },
"declarations": { "$ref": "#/$defs/declarations" },
"pattern": { "$ref": "#/$defs/pattern" }
Collaborator

Seems like it might be simpler to have a single message, which has a body attribute that's one of pattern or select, and then select would just have the selectors and variants fields? (In other words, declarations would be factored out.)
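One way to read this suggestion is sketched below, with `declarations` factored out onto a single message type and `body` as a discriminated union. The shapes are hypothetical and simplified (a string stands in for a pattern and for a selector expression), and the `'pattern'` / `'select'` tag values are assumptions chosen for the sketch.

```typescript
// Hypothetical, simplified shapes illustrating the suggested structure.
type Pattern = string
interface Declaration { name: string; value: string }
interface Variant { keys: string[]; value: Pattern }

interface PatternBody { type: 'pattern'; pattern: Pattern }
interface SelectBody { type: 'select'; selectors: string[]; variants: Variant[] }

// A single message type: declarations live here, not on each body kind.
interface Message {
  declarations: Declaration[]
  body: PatternBody | SelectBody
}

// Example consumer: counts how many patterns a message contains.
function countPatterns(msg: Message): number {
  const body = msg.body
  return body.type === 'pattern' ? 1 : body.variants.length
}

const plain: Message = {
  declarations: [],
  body: { type: 'pattern', pattern: 'Hello!' }
}

const plural: Message = {
  declarations: [],
  body: {
    type: 'select',
    selectors: ['count'],
    variants: [
      { keys: ['one'], value: 'One item' },
      { keys: ['*'], value: 'Many items' }
    ]
  }
}
```

With this layout, code that only cares about declarations never needs to branch on the message kind.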

Collaborator Author

There's also the approach @stasm has suggested, i.e. expressing single-pattern messages as having an empty list for selectors, and a single entry in variants with a correspondingly empty list of keys.

As I mentioned on the call, I'd much prefer landing something first and then iterating on it, rather than sorting out the details all in one go.
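The empty-selectors representation mentioned above could be sketched as a normalization step. The `type`, `declarations`, `pattern`, `selectors` and `variants` field names follow the excerpts quoted in this review; everything else (strings standing in for patterns and expressions, the `keys` and `value` fields of a variant, the `toSelectForm` helper) is hypothetical.

```typescript
// Simplified, hypothetical shapes; strings stand in for patterns and expressions.
type Pattern = string
interface Variant { keys: string[]; value: Pattern }
interface PatternMessage { type: 'message'; declarations: unknown[]; pattern: Pattern }
interface SelectMessage {
  type: 'select'
  declarations: unknown[]
  selectors: string[]
  variants: Variant[]
}

// Expresses any message in the select form: a single-pattern message becomes
// a select with no selectors and one variant with no keys.
function toSelectForm(msg: PatternMessage | SelectMessage): SelectMessage {
  if (msg.type === 'select') return msg
  return {
    type: 'select',
    declarations: msg.declarations,
    selectors: [],
    variants: [{ keys: [], value: msg.pattern }]
  }
}

const single: PatternMessage = { type: 'message', declarations: [], pattern: 'Hello!' }
const normalized = toSelectForm(single)
```

A consumer that only handles the select form then needs one code path, at the cost of making the common single-pattern case slightly more verbose.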

Collaborator

I filed #437 to continue discussing this.

Co-authored-by: Tim Chevalier <tjc@igalia.com>
@eemeli eemeli requested a review from catamorphism June 21, 2023 05:01
Collaborator

@catamorphism left a comment

Looks good. I agree with your intention to land something and iterate on it more later, so in that spirit, I think the only one of my suggestions that needs to be addressed is the clarification about what the fields of a Reserved are.

@eemeli eemeli requested a review from catamorphism June 25, 2023 07:23
Collaborator

@catamorphism left a comment

Looks great!

Collaborator

@stasm left a comment

As agreed in yesterday's meeting, we'll land this without the JSON schema for now.


interface Expression {
type: 'expression'
body: Literal | VariableRef | FunctionRef | Reserved
Collaborator

@stasm Jul 11, 2023

A FunctionRef can also have a Literal or a VariableRef as an argument, and in fact, due to how our syntax is designed, I'd argue that the function's argument is more important than the function. (E.g. it comes first in the syntax.)

I'd like to suggest an alternative way to structure our expressions, to more closely map to our syntax:

expression = "{" [s] ((operand [s annotation]) / annotation) [s] "}"

Let's special-case argument-less functions rather than function-less operands.

Instead of Literal | VariableRef | FunctionRef here and operand?: Literal | VariableRef inside FunctionRef, we can do:

type Expression = OperandExpr | FunctionExpr;
interface OperandExpr {
    operand: Literal | VariableRef;
    annotation?: FunctionExpr;
}
interface FunctionExpr {
    name: string;
    options: Map<string, Literal | VariableRef>;
}

FWIW, this is how I implemented expressions in stasm/message2:
https://github.com/stasm/message2/blob/4abf43f2023b6e20d8ee1d462684d0741ece791b/syntax/ast.ts#L44-L70

(Not blocking this PR on this, but I'd like to discuss this change as a follow-up.)
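To compare the two shapes, the sketch below converts an expression in the structure under review into the alternative proposed in this comment. `Reserved` is left out for brevity, and the `Literal`, `VariableRef` and `FunctionRef` field layouts (including `options` as a `Map`) are assumptions; only `body`, `operand`, `annotation`, `name` and `options` are named in the thread. `toAltExpression` is an illustrative helper.

```typescript
// Hypothetical operand shapes shared by both structures.
interface Literal { type: 'literal'; value: string }
interface VariableRef { type: 'variable'; name: string }
type Operand = Literal | VariableRef

// Structure under review (Reserved omitted): the function wraps its operand.
interface FunctionRef {
  type: 'function'
  name: string
  operand?: Operand
  options: Map<string, Operand>
}
interface Expression { type: 'expression'; body: Operand | FunctionRef }

// Structure proposed in the comment: the operand comes first, annotation second.
interface FunctionExpr { name: string; options: Map<string, Operand> }
interface OperandExpr { operand: Operand; annotation?: FunctionExpr }
type AltExpression = OperandExpr | FunctionExpr

function toAltExpression(expr: Expression): AltExpression {
  const body = expr.body
  // A bare literal or variable becomes an operand without an annotation.
  if (body.type !== 'function') return { operand: body }
  const fn: FunctionExpr = { name: body.name, options: body.options }
  // An argument-less function is the special case; otherwise the operand leads.
  return body.operand ? { operand: body.operand, annotation: fn } : fn
}

// A variable `count` annotated with a function named `number` (names are illustrative).
const annotated: Expression = {
  type: 'expression',
  body: {
    type: 'function',
    name: 'number',
    operand: { type: 'variable', name: 'count' },
    options: new Map<string, Operand>()
  }
}

const converted = toAltExpression(annotated)
```

Both shapes carry the same information in this sketch; the difference is which case gets the optional field.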

Collaborator Author

I will be happy to discuss this further in a follow-on issue or PR.

Collaborator

@stasm Jul 24, 2023

Filed #436 to continue this.

@eemeli
Collaborator Author

eemeli commented Jul 18, 2023

@aphillips In the interests of getting this landed, I went ahead and applied the language change myself, dropping references to the JSON schema. As this already has a ✅ from @stasm, I believe that we can merge this without waiting for the next meeting if you yourself are ok with the text.

@eemeli
Collaborator Author

eemeli commented Jul 18, 2023

To make sure that it's usable for future work after this lands, the JSON schema is still available here:
https://github.com/eemeli/message-format-wg/blob/json-schema-data-model/spec/message.json

@aphillips removed the "Agenda+" (Requested for upcoming teleconference) label on Jul 24, 2023
Labels
data model Issues related to the Interchange Data Model

5 participants