Draft of the registry specification #368

Merged: 12 commits, Jun 5, 2023

@stasm stasm commented Mar 13, 2023

This PR includes 2 files:

  • registry.dtd is the schema for defining registries in XML; it is normative.
  • registry.md is the non-normative documentation explaining the motivation and the schema. It also includes examples.

This PR is based on my old spec proposal from January 2022 (unicode-org#218), and the more recent presentation (https://github.com/unicode-org/message-format-wg/blob/main/meetings/2023/notes-2023-02-06.md) that I did to resume the work on the design of the registry.

For now, I've focused on describing custom functions by defining their signatures. A single signature corresponds to one set of: the current locale, the argument, and the options bag.

I didn't address all feedback from our February 6 meeting in this PR. Looking at my notes, here are the topics for future discussions:

  • Not all options should be locale-specific.
  • Some options should be common to all signatures of a given function.
  • Support other data types besides functions:
    • markup,
    • metadata (comments, max length, screenshot URL, etc.),
    • global variables.
  • Describe the interface of runtime arguments and local variables (i.e. the return types of formatting functions). Right now the validation of arguments and option values only applies to literal values.

@eemeli eemeli left a comment

A first pass with some inline comments & questions.

One key factor that seems unaddressed is the representation of variable inputs. In other words, if my message includes something like

{$foo :datetime dateStyle=long}

How can I represent an expectation that the input $foo ought to be an actual date object of whatever description, rather than e.g. a string representation of a date?

Edit: D'oh, that's your last checkbox above.

<!ELEMENT registry (function*|pattern*)>

<!ELEMENT function (description|signature+)>
<!ATTLIST function name NMTOKEN #REQUIRED>
Collaborator

Suggested change
<!ATTLIST function name NMTOKEN #REQUIRED>
<!ATTLIST function name ID #REQUIRED>

In the syntax, function names are restricted to name rather than nmtoken, so not all NMTOKEN values can be valid here. Also, ensuring that function definitions map 1:1 to identifiers seems pretty reasonable?

Collaborator Author

The ID type is also subject to a validity constraint about there only being one element of the given name in the XML document: https://www.w3.org/TR/xml/#id.

I wasn't sure if we wanted to be this strict. Perhaps it's OK to have functions named the same as regex patterns? Or to have two function definitions with the same name? That's why I went for name and NMTOKEN.

<!ATTLIST pattern id ID #REQUIRED>
<!ATTLIST pattern regex CDATA #REQUIRED>

<!ELEMENT signature (input*|param*|match*)>
Collaborator

Suggested change
<!ELEMENT signature (input*|param*|match*)>
<!ELEMENT signature (description?|input*|param*|match*)>

Looking at the examples and considering usage in particular for contextual help (e.g. IntelliSense), I would think that we'd like to at least allow for descriptions to attach not just to <function>, but effectively all elements within them?

From the examples, I gather that the title attributes of <input> and <param> are likely meant to feed into such contextual help. What's the reason for preferring an element for this in one place and an attribute in another?


<!ELEMENT input EMPTY>
<!ATTLIST input title CDATA #IMPLIED>
<!ATTLIST input values NMTOKENS #IMPLIED>
Collaborator

If we end up allowing for some nmtoken-ish values to be used unquoted as arguments (#364), it would be good to match that rule here.

Collaborator Author

I don't think #364 is related, actually. My intent here was to allow specifying either an enumeration of nmtokens or a regex to validate arguments that are MF2 literals. This seems orthogonal to whether these literals are quoted or not?

Collaborator

I meant that if #364 lands, then nearly all NMTOKEN values may be used without quoting, except for the ones that start with : and -. Given that, it might be appropriate to leave those out of the values that are definable by values rather than pattern.

Collaborator Author

I don't think we can enforce such a restriction in the DTD alone. LDML does it by extending DTD with annotations: https://unicode.org/reports/tr35/#57-dtd-annotations.
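Since the DTD alone cannot express this restriction, a tool outside the DTD could check it. The following is a minimal sketch of such a check, not part of the PR. The function name `check_values` and the report keys are hypothetical, and the regex is an ASCII-only approximation of the XML NMTOKEN production (the real one admits many non-ASCII characters).

```python
import re

# ASCII-only approximation of XML's NMTOKEN; an assumption made for brevity.
NMTOKEN_ASCII = re.compile(r"[A-Za-z0-9._:\-]+")


def check_values(values_attr: str) -> dict[str, list[str]]:
    """Classify the whitespace-separated tokens of a `values` attribute.

    Tokens starting with ':' or '-' are valid NMTOKENs but, per the
    discussion above, could not be written unquoted if #364 lands.
    """
    report = {"ok": [], "needs_quoting": [], "not_nmtoken": []}
    for token in values_attr.split():
        if not NMTOKEN_ASCII.fullmatch(token):
            report["not_nmtoken"].append(token)
        elif token[0] in ":-":
            report["needs_quoting"].append(token)
        else:
            report["ok"].append(token)
    return report


report = check_values("one other -1 :foo a+b")
```

Here `report["needs_quoting"]` collects the tokens that a stricter rule might want to exclude from `values` and leave to `pattern` instead.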

<!ATTLIST param default NMTOKEN #IMPLIED>
<!ATTLIST param pattern IDREF #IMPLIED>
<!ATTLIST param required (true|false) "false">
<!ATTLIST param readonly (true|false) "false">
Collaborator

Suggested change
<!ATTLIST param readonly (true|false) "false">
<!ATTLIST param translate (true|false) "true">

Given that developers and translators will both be working with MF2, it might be clearer to use translate rather than readonly to communicate the intent of this attribute.

As a related question, shouldn't this also be available for <input>?

Collaborator Author

In general, I'm not fond of true being the default value. I think I'd prefer to name these attributes such that their lack is equal to false.

Collaborator Author

Re. readonly/translate for <input> -- good point. I think I'll drop the ability to define more than one <input> element per signature, then.

Collaborator Author

Re. readonly/translate for <input> -- good point. I think I'll drop the ability to define more than one <input> element per signature, then.

Done in 831a9cd.


<!ELEMENT signature (input*|param*|match*)>
<!ATTLIST signature type (match|format) #REQUIRED>
<!ATTLIST signature position (open|close|standalone) "standalone">
Collaborator

Is this meant to be a building block for markup support? This attribute does not appear to be documented or explained.

@stasm stasm marked this pull request as ready for review March 27, 2023 17:47

@catamorphism catamorphism left a comment

I left some inline comments. Overall, the major thing that's unclear to me is the split between formatting and matching functions. More examples (both of valid and invalid messages) might help. I'm also not sure what a matching function returns, in terms of the code that implements the interface.

spec/registry.md Outdated
* Generate variant keys for a given locale during XLIFF extraction.
* Verify the exhaustiveness of variant keys given a selector.
* Type-check values passed into functions.
* Validate that matching functions are only called in selectors.
Collaborator

This and the next bullet point are also hard to understand with "matching functions" and "formatting functions" being undefined terms.

Likewise, it might be useful to add "calling a formatting function in a selector context" and "calling a matching function in a placeholder context" to the list of possible errors in formatting.md. Depending on how I read the "Unknown Function errors" section, maybe that's already implicit, but since users would probably find a "function called in wrong context" error more helpful than an "undefined function" error, it's probably better to be explicit.

Collaborator Author

You're right, let's be explicit about this. I'd prefer to leave this out of this PR, however, because I think that we will first need to decide whether it's allowed to match on a function-less expression (i.e. match {$foo} instead of match {$foo :func}).

Collaborator Author

Filed #409 to discuss this further.

spec/registry.md Outdated
It represents an implementation of a custom function available to translation at runtime.
A function defines a human-readable _description_ of its behavior
and one or more machine-readable _signatures_ of how to call it.
Named `<pattern>` elements can optionally define regex validation rules for input, option values, and variant keys.
Collaborator

Is a regex enough to validate inputs or option values? For example, the second example on slide 13 here suggests that a formatting function could take some structured data as an argument.

Collaborator Author

I don't know yet how to validate runtime types. I'm suggesting to use regexes to validate literal values. Let's discuss what sort of type validation is even possible for some implementations.

Collaborator Author

Filed #407 about regexes and #408 about runtime types.

<input title="Adjective id"/>
<option name="article" values="definite indefinite"/>
<option name="plural" values="one other"/>
<option name="case" values="nominative genitive" default="nominative"/>
Collaborator

I'm a little confused about why there's a case option for an adjective if this signature is only defined in locale "en".

Collaborator Author

English doesn't make it easy to build meaningful examples of grammatical features :) I'll try to come up with something better.

Collaborator

Yeah, I know :) I was just thinking the example could be from a different language, maybe.

<formatSignature locales="en">
<input title="Adjective id"/>
<option name="article" values="definite indefinite"/>
<option name="accord"/>
Collaborator

The accord option doesn't have a pattern or an enumeration, which surprises me given lines 51-52 (which suggest to me that every option has one or the other).

Collaborator Author

Right, this is still WIP and underspec'ed. I'd like to continue the discussion about validating runtime values outside of this PR.

A signature may define the positional argument of the function with the `<input>` element.
A signature may also define one or more `<option>` elements representing _named options_ to the function.
Options are optional by default,
unless the `required` attribute is present.
Collaborator

Would it be an error for an option to have both a required and a default attribute?
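If the answer turns out to be yes, a registry linter could flag the combination. The following is a minimal sketch under that assumption. The helper name `conflicting_options` and the sample registry are hypothetical; only the `<option>` element and its `name`, `required` and `default` attributes come from the snippets in this PR.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry fragment, written only to exercise the check.
SAMPLE = """
<registry>
  <function name="adjective">
    <formatSignature locales="en">
      <option name="article" values="definite indefinite" required="true"/>
      <option name="case" values="nominative genitive"
              default="nominative" required="true"/>
    </formatSignature>
  </function>
</registry>
"""


def conflicting_options(registry_xml: str) -> list[str]:
    """Return names of options that are both required and have a default."""
    root = ET.fromstring(registry_xml)
    conflicts = []
    for option in root.iter("option"):
        is_required = option.get("required") == "true"
        has_default = option.get("default") is not None
        if is_required and has_default:
            conflicts.append(option.get("name"))
    return conflicts


conflicts = conflicting_options(SAMPLE)
```

Note that ElementTree does not apply DTD attribute defaults, so an absent `required` attribute is simply treated as not required here.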

The following message references the first signature of `:adjective`,
which expects the `plural` and `case` options:

{You see {$color :adjective article=indefinite plural=one case=nominative} {$object :noun case=nominative}!}
Collaborator

What happens if you write {You see {$color :adjective article=indefinite}}? Since none of the options in the example are required, it's unclear which of the two signatures for :adjective should be used. Or does it not matter because the idea is that :adjective has a single implementation for all signatures?
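The ambiguity can be made concrete with a small sketch. It assumes a resolution rule that the PR does not define: a signature is a candidate when every option used in the message is declared by it and every required option is present. The `Signature` class and `matching_signatures` helper are hypothetical; the option names are taken from the two `:adjective` signatures quoted above.

```python
from dataclasses import dataclass, field


@dataclass
class Signature:
    options: set[str]                      # option names the signature declares
    required: set[str] = field(default_factory=set)


def matching_signatures(signatures, used_options):
    """Return the indexes of signatures compatible with the options used."""
    used = set(used_options)
    return [
        index
        for index, sig in enumerate(signatures)
        if used <= sig.options and sig.required <= used
    ]


# The two :adjective signatures from the example; no option is required.
ADJECTIVE = [
    Signature(options={"article", "plural", "case"}),
    Signature(options={"article", "accord"}),
]

ambiguous = matching_signatures(ADJECTIVE, ["article"])
unambiguous = matching_signatures(ADJECTIVE, ["article", "plural", "case"])
```

With only `article` given, both signatures remain candidates, which is the situation the question describes; adding `plural` and `case` narrows the choice to the first one.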

or validate their input with a regular expression (the `pattern` attribute).
Read-only options (the `readonly` attribute) can be displayed to translators in CAT tools, but may not be edited.

Matching-function signatures additionally include one or more `<match>` elements
Collaborator

Is it "one or more" or "zero or more"? The DTD implies that there can be zero <match> elements, as does the second signature for number in the example below.

<!ATTLIST signature locales NMTOKENS #IMPLIED>

<!ELEMENT input EMPTY>
<!ATTLIST input title CDATA #IMPLIED>
Member

I assume this title (and others such as the one on line 24) are meant to be human readable descriptions of the function/input/parameter? If so, it's bad practice to make them an attribute--because they can't easily be localized that way. It's better to make subsidiary elements to contain natural language text. The title fields don't appear to be necessary to the functionality in any case.

Collaborator Author

I'll remove title for now and we can discuss human-readable metadata separately.

Collaborator Author

Filed #405 to discuss this further.

<!ATTLIST input pattern NMTOKEN #IMPLIED>
<!ATTLIST input readonly (true|false) "false">

<!ELEMENT param EMPTY>
Member

There is nothing about types here. Strongly typed languages need some way to express this?

spec/registry.md Outdated
* Type-check values passed into functions.
* Validate that matching functions are only called in selectors.
* Validate that formatting functions are only called in placeholders.
* Forbid edits to certain function options (e.g. currency options).
Member

Not sure that "currency" is a good example here

Match a numerical value against CLDR plural categories or against a number literal.
</description>

<matchSignature locales="en">
Member

I'm very curious about the locales attribute--curious in this case is a euphemism for "nervous about". I would prefer if functions were mostly locale neutral (I can declare, for example, a number format in any message in any locale).

Collaborator

@eemeli eemeli May 19, 2023

We should have some way to define at least locale-dependent option and match values. For example, the registry should allow for a way to note that while the whole set of CLDR categories is zero one two few many other, a specific locale such as en only uses one other.

One alternative could be for <matchSignature> and <formatSignature> to be able to contain an <override locales="en fr it"> section.

Collaborator Author

For completeness of the example, my plan was to allow the overrides directly on the level of the matchSignature and formatSignature:

<function name="plural">
    <matchSignature locales="en">
        <input pattern="anyNumber"/>
        <option name="type" values="cardinal ordinal"/>
        <option .../>
        <match values="one other"/>                    ← English plurals
        <match pattern="anyNumber"/>
    </matchSignature>

    <matchSignature locales="pl">
        <input pattern="anyNumber"/>
        <option name="type" values="cardinal ordinal"/>
        <option .../>
        <match values="one few many other"/>           ← Polish plurals
        <match pattern="anyNumber"/>
    </matchSignature>
</function>

Member

I am not a fan of this. I think that a plural selector should recognize all of the keywords and not depend on the registry to "allow" or "disallow" the enumerated values for a given locale. (Note that locales must be a "language range", otherwise you would need to list every possible language tag. It is probably an extended language range, e.g. so that pl-PL matches pl-Latn-PL, not that anyone would use that tag.)

I think it is reasonable that (for example) the root locale message can contain keywords that do not fire for a specific locale but which fire in others, in case that message is used in one of those other locales. This is also why the * defaulting key exists (such as the case where an English-language resource with only one and * gets used in the pl locale).

In the case of plurals, CLDR provides the data about which keywords apply to which locale. In the case of some other formatter or selector that is not supplied by CLDR, the implementation should know what applies to each locale. If we want the registry to describe that relationship (for example to support exploding the matrix of keys in localization tools), I think I would prefer that it be separate from the signature, e.g.:

<function name="customPluralLikeSelector">
   <matchSignature>
      <input pattern="anyNumber"/>
      <option name="type" value="foo bar"/>
      <match value="zero one two few many other"/>
      <match pattern="anyNumber"/>
   </matchSignature>
   <validate comment="the naming is terrible and we wouldn't structure it exactly like this">
      <match type="values">
         <value lang="">one other</value>
         <value lang="pl">one few many other</value>
         <value lang="ja">other</value>
         <!-- etc. -->
      </match>
   </validate>
</function>
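To illustrate how such a per-locale table might be consulted, here is a minimal sketch that combines it with the extended filtering algorithm of RFC 4647 (section 3.3.2), which is what makes the range pl-PL match the tag pl-Latn-PL. The helper names and the dictionary layout are hypothetical, and treating the empty `lang` as the fallback entry is an assumption; the key lists are the ones from the example above.

```python
def extended_filter(language_range: str, tag: str) -> bool:
    """RFC 4647 extended filtering: does `language_range` match `tag`?"""
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")
    # The first subtags must match, unless the range starts with a wildcard.
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False
    r, t = 1, 1
    while r < len(range_subtags):
        if range_subtags[r] == "*":
            r += 1                      # wildcard: skip it in the range
        elif t >= len(tag_subtags):
            return False                # range has subtags the tag lacks
        elif range_subtags[r] == tag_subtags[t]:
            r += 1
            t += 1
        elif len(tag_subtags[t]) == 1:
            return False                # never skip over a singleton
        else:
            t += 1                      # skip an unmatched tag subtag
    return True


# Keys from the <validate> example; "" is assumed to be the fallback entry.
MATCH_VALUES = {
    "": ["one", "other"],
    "pl": ["one", "few", "many", "other"],
    "ja": ["other"],
}


def match_values_for(locale: str, table=MATCH_VALUES) -> list[str]:
    """Pick the entry whose range matches `locale` with the most subtags."""
    candidates = [rng for rng in table if rng and extended_filter(rng, locale)]
    best = max(candidates, key=lambda rng: len(rng.split("-")), default="")
    return table[best]


polish_keys = match_values_for("pl-Latn-PL")
```

A localization tool could use a lookup like this to expand the matrix of variant keys for a target locale without the selector itself rejecting the other keywords.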

Collaborator Author

Oh, I think I like it!

Collaborator Author

Filed #410 to discuss this further.

Collaborator Author

stasm commented May 19, 2023

Thank you, @eemeli, @catamorphism, and @aphillips for your reviews. I was able to address some feedback, but not all of it yet. I'll continue over the weekend. I'm already seeing a few topics that I'd like to leave out of this PR and discuss separately:

  • How to represent human-readable meta-data, such as the now-removed title attribute.
  • Add open/close concepts to function signatures. Are they still formatting functions, or do we need a new category for them?
  • What should happen when a formatting function is used for matching, and vice versa? Should implementations support it, or raise errors?
    • Possibly related: should match {$foo} be allowed, or do we want to always require a function name in selectors?
  • How can we validate runtime types? E.g. {$count :number} -- what can we realistically do to know anything about $count?
    • Related: are regexes good enough for validating literals?

I'd like to suggest that we try to merge this PR without answering the above questions, and continue iterating on it later.

@aphillips
Member

Possibly related: should match {$foo} be allowed, or do we want to always require a function name in selectors?

We absolutely allow this! Note that this is the same as/similar to the current ICU SelectFormat.

match {$foo}
when |bar| {say bar}
when |baz| {say baz}
when * {say whatever}
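The selection behaviour of this message can be sketched as a plain lookup with a catch-all. This is only an illustration of literal matching without a function, assuming exact string comparison of the input against the variant keys; the `select` helper and the dictionary representation of the variants are hypothetical and not how any implementation is required to work.

```python
def select(value: str, variants: dict[str, str]) -> str:
    """Return the pattern whose key equals `value`, else the `*` variant."""
    if value != "*" and value in variants:
        return variants[value]
    return variants["*"]


# The three variants of the message above, keyed by their literal keys.
VARIANTS = {
    "bar": "say bar",
    "baz": "say baz",
    "*": "say whatever",
}

chosen = select("baz", VARIANTS)
fallback = select("qux", VARIANTS)
```

Any input other than `bar` or `baz` falls through to the `*` variant, which mirrors the `other` branch of ICU's SelectFormat.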

@aphillips
Member

@stasm

Thanks for your (continuing) work on this. This will, of course, be a topic for Monday's call. I will also have recommendations on the other PRs by then.

Do you have opinions about the questions you raise? Would it be productive to try to bring in solutions? Or do you want to raise separate issues to discuss these? I'd like to iterate quickly on these where possible.

Collaborator Author

stasm commented May 19, 2023

Do you have opinions about the questions you raise? Would it be productive to try to bring in solutions? Or do you want to raise separate issues to discuss these?

Right now I'd prefer to focus energy on merging this PR, in a minimal state that represents at least partial agreement about the direction. (Nothing's final yet, we can iterate freely.)

Let's discuss on Monday and try to evaluate how big these other topics are. I have a sense that some are non-controversial (like adding the open/close concept), while others will require that we better define the requirements: in particular the runtime type validation will benefit if we better define what it means for us to be programming-language-agnostic.

@eemeli eemeli left a comment

We still need to iterate on the registry, but I think we should merge this so that we can do so.
