Draft of the registry specification #368

Merged
merged 12 commits into from
Jun 5, 2023
34 changes: 34 additions & 0 deletions spec/registry.dtd
@@ -0,0 +1,34 @@
<!ELEMENT registry (function|pattern)*>

<!ELEMENT function (description,(formatSignature|matchSignature)+)>
<!ATTLIST function name NMTOKEN #REQUIRED>
Collaborator

Suggested change
<!ATTLIST function name NMTOKEN #REQUIRED>
<!ATTLIST function name ID #REQUIRED>

In the syntax, function names are restricted to name rather than nmtoken, so not all NMTOKEN values can be valid here. Also, ensuring that function definitions map 1:1 to identifiers seems pretty reasonable?

Collaborator Author

The ID type is also subject to a validity constraint about there only being one element of the given name in the XML document: https://www.w3.org/TR/xml/#id.

I wasn't sure if we wanted to be this strict. Perhaps it's OK to have functions named the same as regex patterns? Or to have two function definitions with the same name? That's why I went for name and NMTOKEN.


<!ELEMENT description (#PCDATA)>

<!ELEMENT pattern EMPTY>
<!ATTLIST pattern id ID #REQUIRED>
<!ATTLIST pattern regex CDATA #REQUIRED>

<!ELEMENT formatSignature (input?,option*)>
<!ATTLIST formatSignature position (open|close|standalone) "standalone">
<!ATTLIST formatSignature locales NMTOKENS #IMPLIED>

<!ELEMENT matchSignature (input?,option*,match*)>
<!ATTLIST matchSignature locales NMTOKENS #IMPLIED>

<!ELEMENT input EMPTY>
<!ATTLIST input values NMTOKENS #IMPLIED>
Collaborator

If we end up allowing for some nmtoken-ish values to be used unquoted as arguments (#364), it would be good to match that rule here.

Collaborator Author

I don't think #364 is related, actually. My intent here was to allow specifying either an enumeration of nmtokens or a regex to validate arguments that are MF2 literals. This seems orthogonal to whether these literals are quoted or not?

Collaborator

I meant that if #364 lands, then nearly all NMTOKEN values may be used without quoting, except for the ones that start with : and -. Given that, it might be appropriate to leave those out of the values that are definable by values rather than pattern.

Collaborator Author

I don't think we can enforce such a restriction in the DTD alone. LDML does it by extending the DTD with annotations: https://unicode.org/reports/tr35/#57-dtd-annotations.

<!ATTLIST input pattern IDREF #IMPLIED>
<!ATTLIST input readonly (true|false) "false">

<!ELEMENT option EMPTY>
<!ATTLIST option name NMTOKEN #REQUIRED>
<!ATTLIST option values NMTOKENS #IMPLIED>
<!ATTLIST option default NMTOKEN #IMPLIED>
<!ATTLIST option pattern IDREF #IMPLIED>
<!ATTLIST option required (true|false) "false">
<!ATTLIST option readonly (true|false) "false">

<!ELEMENT match EMPTY>
<!ATTLIST match values NMTOKENS #IMPLIED>
<!ATTLIST match pattern IDREF #IMPLIED>
180 changes: 180 additions & 0 deletions spec/registry.md
@@ -0,0 +1,180 @@
# WIP DRAFT MessageFormat 2.0 Registry

_This document is non-normative._

Implementations and tooling can greatly benefit from a structured definition of the formatting and matching functions available to messages at runtime.
The _registry_ is a mechanism for storing such declarations in a portable manner.

## Goals

The registry provides a machine-readable description of MessageFormat extensions (custom functions),
in order to support the following goals and use-cases:

* Validate semantic properties of messages. For example:
    * Type-check values passed into functions.
    * Validate that matching functions are only called in selectors.
    * Validate that formatting functions are only called in placeholders.
    * Verify the exhaustiveness of variant keys given a selector.
* Support the localization roundtrip. For example:
    * Generate variant keys for a given locale during XLIFF extraction.
* Improve the authoring experience. For example:
    * Forbid edits to certain function options (e.g. currency options).
    * Autocomplete function and option names.
    * Display on-hover tooltips for function signatures with documentation.
    * Display/edit known message metadata.
    * Restrict input in GUI by providing a dropdown with all viable option values.

## Data Model

The registry contains descriptions of function signatures.
[`registry.dtd`](./registry.dtd) describes its data model.

The main building block of the registry is the `<function>` element.
It represents an implementation of a custom function available to translation at runtime.
A function defines a human-readable _description_ of its behavior
and one or more machine-readable _signatures_ of how to call it.
Named `<pattern>` elements can optionally define regex validation rules for literals, option values, and variant keys.

MessageFormat functions can be invoked in two contexts:
* inside placeholders, to produce a part of the message's formatted output;
  for example, a raw value of `|1.5|` may be formatted to `1,5` in a language which uses commas as decimal separators,
* inside selectors, to contribute to selecting the appropriate variant among all given variants.

A single _function name_ may be used in both contexts,
regardless of whether it's implemented as one or multiple functions.

A _signature_ defines one particular set of at most one argument and any number of named options that can be used together in a single call to the function.
`<formatSignature>` corresponds to a function call inside a placeholder inside translatable text.
`<matchSignature>` corresponds to a function call inside a selector.
Signatures with a non-empty `locales` attribute are locale-specific and only available in translations in the given languages.

A signature may define the positional argument of the function with the `<input>` element.
If the `<input>` element is not present, the function is defined as a nullary function.
A signature may also define one or more `<option>` elements representing _named options_ to the function.
An option can be omitted in a call to the function,
unless its `required` attribute is set to `true`.
Collaborator

Would it be an error for an option to have both a required and a default attribute?

Options either accept a finite enumeration of values (the `values` attribute)
Collaborator

Since a finite enumeration can be expressed as a regex, I'm wondering if it would be simpler to only allow a regex? I can imagine a tool providing finite enumerations as syntactic sugar.

Collaborator Author (@stasm, May 19, 2023)

Hmm, interesting. I'm not opposed to removing values and keeping pattern as the only option. The reasons why I opted for both were:

  • values allows the values to be specified inline (<match values="one other"/>) whereas pattern refers to a <pattern> element by id. I felt it was more convenient in particular for plural categories, cases and genders to be able to define them inline rather than define a regex for each possible combination.
  • Some supplemental data in CLDR uses values="<enumeration>", too, e.g. https://github.com/unicode-org/cldr/blob/release-43/common/supplemental/grammaticalFeatures.xml, so it's easy to reuse this data as-is.

Do you feel strongly about this?

Collaborator

Generating autocompletion values for an editor is easy from an explicit list in values, but really hard from a regexp pattern. I'd much rather keep values even if only for that.

Collaborator

@stasm No, I don't feel strongly, and @eemeli has a good point, so I withdraw the question :)

or validate their input with a regular expression (the `pattern` attribute).
Read-only options (the `readonly` attribute) can be displayed to translators in CAT tools, but may not be edited.

Matching-function signatures additionally include one or more `<match>` elements
Collaborator

Is it "one or more" or "zero or more"? The DTD implies that there can be zero <match> elements, as does the second signature for number in the example below.

to define the keys against which they can match when used as selectors.
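
As an illustration of how tooling might consume these rules, here is a minimal Python sketch that checks an option value against either its `values` enumeration or its referenced `<pattern>`. The registry fragment, the helper names (`load_patterns`, `option_accepts`), the use of full-string matching with Python's `re` dialect, and the choice to accept any value when neither attribute is set are all assumptions made for this example; the draft does not specify them.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical registry fragment shaped like the data model described above.
REGISTRY_XML = """
<registry>
    <pattern id="positiveInteger" regex="[0-9]+"/>
    <function name="number">
        <description>Format a number.</description>
        <formatSignature>
            <input/>
            <option name="style" values="decimal percent" default="decimal"/>
            <option name="minimumFractionDigits" pattern="positiveInteger"/>
        </formatSignature>
    </function>
</registry>
"""


def load_patterns(registry):
    """Map each <pattern> id to its compiled regular expression."""
    return {p.get("id"): re.compile(p.get("regex")) for p in registry.iter("pattern")}


def option_accepts(option, value, patterns):
    """Return True if `value` passes the option's `values` or `pattern` rule.

    Assumptions: `values` is checked before `pattern`, the regex must match
    the whole value, and an option with neither attribute accepts anything.
    """
    values = option.get("values")
    if values is not None:
        return value in values.split()
    pattern_id = option.get("pattern")
    if pattern_id is not None:
        return patterns[pattern_id].fullmatch(value) is not None
    return True


registry = ET.fromstring(REGISTRY_XML)
patterns = load_patterns(registry)
options = {o.get("name"): o for o in registry.iter("option")}
```

With this fragment, `option_accepts(options["style"], "percent", patterns)` holds, while `"currency"` is rejected because it is not in the enumerated list.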

## Example

The following `registry.xml` is an example of a registry file
which may be provided by an implementation to describe its built-in functions.
For the sake of brevity, only `locales="en"` is considered.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE registry SYSTEM "./registry.dtd">

<registry>
    <function name="platform">
        <description>Match the current OS.</description>
        <matchSignature>
            <match values="windows linux macos android ios"/>
        </matchSignature>
    </function>

    <pattern id="anyNumber" regex="-?[0-9]+(\.[0-9]+)?"/>
    <pattern id="positiveInteger" regex="[0-9]+"/>
    <pattern id="currencyCode" regex="[A-Z]{3}"/>

    <function name="number">
        <description>
            Format a number.
            Match a numerical value against CLDR plural categories or against a number literal.
        </description>

        <matchSignature locales="en">
Member

I'm very curious about the locales attribute--curious in this case is a euphemism for "nervous about". I would prefer if functions were mostly locale neutral (I can declare, for example, a number format in any message in any locale).

Collaborator (@eemeli, May 19, 2023)

We should have some way to define at least locale-dependent option and match values. For example, the registry should allow for a way to note that while the whole set of CLDR categories is zero one two few many other, a specific locale such as en only uses one other.

One alternative could be for <matchSignature> and <formatSignature> to be able to contain an <override locales="en fr it"> section.

Collaborator Author

For completeness of the example, my plan was to allow the overrides directly on the level of the matchSignature and formatSignature:

<function name="plural">
    <matchSignature locales="en">
        <input pattern="anyNumber"/>
        <option name="type" values="cardinal ordinal"/>
        <option .../>
        <match values="one other"/>                    ← English plurals
        <match pattern="anyNumber"/>
    </matchSignature>

    <matchSignature locales="pl">
        <input pattern="anyNumber"/>
        <option name="type" values="cardinal ordinal"/>
        <option .../>
        <match values="one few many other"/>           ← Polish plurals
        <match pattern="anyNumber"/>
    </matchSignature>
</function>

Member

I am not a fan of this. I think that a plural selector should recognize all of the keywords and not depend on the registry to "allow" or "disallow" the enumerated values for a given locale (note that locales must be a "language range", otherwise you would need to list every possible language tag and is probably an extended language range, e.g. so that pl-PL matches pl-Latn-PL -- not that anyone would use that tag)

I think it is reasonable that (for example) the root locale message can contain keywords that do not fire for a specific locale but which fire in others, in case that message is used in one of those other locales. This is also why the * defaulting key exists (such as the case where an English-language resource with only one and * gets used in the pl locale).

In the case of plurals, CLDR provides the data about which keywords apply to which locale. In the case of some other formatter or selector that is not supplied by CLDR, the implementation should know what applies to each locale. If we want the registry to describe that relationship (for example to support exploding the matrix of keys in localization tools), I think I would prefer that it be separate from the signature, e.g.:

<function name="customPluralLikeSelector">
   <matchSignature>
      <input pattern="anyNumber"/>
      <option name="type" values="foo bar"/>
      <match values="zero one two few many other"/>
      <match pattern="anyNumber"/>
   </matchSignature>
   <validate comment="the naming is terrible and we wouldn't structure it exactly like this">
      <match type="values">
         <value lang="">one other</value>
         <value lang="pl">one few many other</value>
         <value lang="ja">other</value>
         <!-- etc. -->
      </match>
   </validate>
</function>

Collaborator Author

Oh, I think I like it!

Collaborator Author

Filed #410 to discuss this further.

            <input pattern="anyNumber"/>
            <option name="type" values="cardinal ordinal"/>
            <option name="minimumIntegerDigits" pattern="positiveInteger"/>
            <option name="minimumFractionDigits" pattern="positiveInteger"/>
            <option name="maximumFractionDigits" pattern="positiveInteger"/>
            <option name="minimumSignificantDigits" pattern="positiveInteger"/>
            <option name="maximumSignificantDigits" pattern="positiveInteger"/>
            <match values="one other"/>
            <match pattern="anyNumber"/>
        </matchSignature>

        <formatSignature locales="en">
            <input pattern="anyNumber"/>
            <option name="minimumIntegerDigits" pattern="positiveInteger"/>
            <option name="minimumFractionDigits" pattern="positiveInteger"/>
            <option name="maximumFractionDigits" pattern="positiveInteger"/>
            <option name="minimumSignificantDigits" pattern="positiveInteger"/>
            <option name="maximumSignificantDigits" pattern="positiveInteger"/>
            <option name="style" readonly="true" values="decimal currency percent unit" default="decimal"/>
            <option name="currency" readonly="true" pattern="currencyCode"/>
        </formatSignature>
    </function>
</registry>
```

Given the above description, the `:number` function is defined to work both in a selector and a placeholder:

    match {$count :number}
    when 1 {One new message}
    when other {{$count :number} new messages}

Furthermore,
`:number`'s `<matchSignature>` contains two `<match>` elements
which allow validating the variant keys.
A variant key is considered valid
if at least one `<match>` validation rule passes.

* `<match pattern="anyNumber"/>` can be used to validate the `when 1` variant
  by testing the `1` key against the `anyNumber` regular expression defined in the registry file.
* `<match values="one other"/>` can be used to validate the `when other` variant
  by verifying that the `other` key is present in the list of enumerated values: `one other`.
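
The "at least one rule passes" check can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper name `key_is_valid` is hypothetical, the regex is assumed to be compatible with Python's `re` module and to match the whole key, and the `anyNumber` pattern used here makes the fractional part optional so that integer keys such as `1` pass.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical excerpt of a registry, reduced to the parts needed for key validation.
SIGNATURE_XML = r"""
<registry>
    <pattern id="anyNumber" regex="-?[0-9]+(\.[0-9]+)?"/>
    <function name="number">
        <matchSignature locales="en">
            <match values="one other"/>
            <match pattern="anyNumber"/>
        </matchSignature>
    </function>
</registry>
"""


def key_is_valid(key, signature, patterns):
    """Return True if at least one <match> rule of the signature accepts `key`."""
    for rule in signature.findall("match"):
        values = rule.get("values")
        if values is not None and key in values.split():
            return True
        pattern_id = rule.get("pattern")
        # Assumption: the referenced regex must match the entire key.
        if pattern_id is not None and patterns[pattern_id].fullmatch(key):
            return True
    return False


registry = ET.fromstring(SIGNATURE_XML)
patterns = {p.get("id"): re.compile(p.get("regex")) for p in registry.iter("pattern")}
signature = registry.find("function[@name='number']/matchSignature")
```

Here `key_is_valid("1", signature, patterns)` succeeds through the `anyNumber` rule and `key_is_valid("other", signature, patterns)` through the enumeration, while a key such as `few` fails both rules for this English signature.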

----

A localization engineer can then extend the registry by defining the following `customRegistry.xml` file.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE registry SYSTEM "./registry.dtd">

<registry>
    <function name="noun">
        <description>Handle the grammar of a noun.</description>
        <formatSignature locales="en">
            <input/>
            <option name="article" values="definite indefinite"/>
            <option name="plural" values="one other"/>
            <option name="case" values="nominative genitive" default="nominative"/>
        </formatSignature>
    </function>

    <function name="adjective">
        <description>Handle the grammar of an adjective.</description>
        <formatSignature locales="en">
            <input/>
            <option name="article" values="definite indefinite"/>
            <option name="plural" values="one other"/>
            <option name="case" values="nominative genitive" default="nominative"/>
Collaborator

I'm a little confused about why there's a case option for an adjective if this signature is only defined in locale "en".

Collaborator Author

English doesn't make it easy to build meaningful examples of grammatical features :) I'll try to come up with something better.

Collaborator

Yeah, I know :) I was just thinking the example could be from a different language, maybe.

        </formatSignature>
        <formatSignature locales="en">
            <input/>
            <option name="article" values="definite indefinite"/>
            <option name="accord"/>
Collaborator

The accord option doesn't have a pattern or an enumeration, which surprises me given lines 51-52 (which suggest to me that every option has one or the other).

Collaborator Author

Right, this is still WIP and underspec'ed. I'd like to continue the discussion about validating runtime values outside of this PR.

        </formatSignature>
    </function>
</registry>
```

Messages can now use the `:noun` and the `:adjective` functions.
The following message references the first signature of `:adjective`,
which expects the `plural` and `case` options:

    {You see {$color :adjective article=indefinite plural=one case=nominative} {$object :noun case=nominative}!}
Collaborator

What happens if you write {You see {$color :adjective article=indefinite}}? Since none of the options in the example are required, it's unclear which of the two signatures for :adjective should be used. Or does it not matter because the idea is that :adjective has a single implementation for all signatures?


The following message references the second signature of `:adjective`,
which only expects the `accord` option:

    let $obj = {$object :noun case=nominative}
    {You see {$color :adjective article=indefinite accord=$obj} {$obj}!}
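
This draft does not define how a call is resolved to one of several signatures. As one possible tooling check, the Python sketch below lists the signatures that are compatible with the options used in a call. The rule it applies (every option in the call must be declared, and every option marked `required="true"` must be present) and the helper name `compatible_signatures` are assumptions made for this example, not part of the specification.

```python
import xml.etree.ElementTree as ET

# The two `:adjective` signatures from the custom registry above.
ADJECTIVE_XML = """
<function name="adjective">
    <formatSignature locales="en">
        <input/>
        <option name="article" values="definite indefinite"/>
        <option name="plural" values="one other"/>
        <option name="case" values="nominative genitive" default="nominative"/>
    </formatSignature>
    <formatSignature locales="en">
        <input/>
        <option name="article" values="definite indefinite"/>
        <option name="accord"/>
    </formatSignature>
</function>
"""


def compatible_signatures(function, call_options):
    """Return the indexes of format signatures compatible with the call.

    Assumed rule: the call may only use declared options and must include
    all required ones. ElementTree does not apply DTD defaults, so a missing
    `required` attribute is read as None and treated as not required.
    """
    used = set(call_options)
    result = []
    for index, signature in enumerate(function.findall("formatSignature")):
        options = signature.findall("option")
        declared = {o.get("name") for o in options}
        required = {o.get("name") for o in options if o.get("required") == "true"}
        if used <= declared and required <= used:
            result.append(index)
    return result


function = ET.fromstring(ADJECTIVE_XML)
```

Under this assumed rule, a call using `article`, `plural` and `case` is compatible only with the first signature, a call using `article` and `accord` only with the second, and a call using just `article` is compatible with both, which a tool could report as ambiguous.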