Draft of the registry specification #368
This PR includes 2 files:

* `registry.dtd` is the schema for defining registries in XML; it is normative.
* `registry.md` is the non-normative documentation explaining the motivation and the schema. It also includes examples.

This PR is based on my [old spec proposal from January 2022](unicode-org#218), and the more recent [presentation](https://github.com/unicode-org/message-format-wg/blob/main/meetings/2023/notes-2023-02-06.md) that I did to resume the work on the design of the registry.

For now, I've focused on describing custom functions by defining their signatures. A single signature corresponds to one set of: the current locale, the argument, and the options bag.

I didn't address all feedback from our February 6 meeting in this PR. Looking at my notes, here are the topics for future discussions:

* [ ] Not all options should be locale-specific.
* [ ] Some options should be common to all signatures of a given function.
* Support other data types besides functions:
  * [ ] markup,
  * [ ] metadata (comments, max length, screenshot URL, etc.),
  * [ ] global variables.
* Describe the interface of runtime arguments and local variables (i.e. the return types of formatting functions). Right now the validation of arguments and option values only applies to literal values.
A first pass with some inline comments & questions.
One key factor that seems unaddressed is the representation of variable inputs. In other words, if my message includes something like
{$foo :datetime dateStyle=long}
How can I represent an expectation that the input `$foo` ought to be an actual date object of whatever description, rather than e.g. a string representation of a date?
Edit: D'oh, that's your last checkbox above.
<!ELEMENT registry (function*|pattern*)>

<!ELEMENT function (description|signature+)>
<!ATTLIST function name NMTOKEN #REQUIRED>
Suggested change:
- <!ATTLIST function name NMTOKEN #REQUIRED>
+ <!ATTLIST function name ID #REQUIRED>
In the syntax, function names are restricted to `name` rather than `nmtoken`, so not all `NMTOKEN` values can be valid here. Also, ensuring that function definitions map 1:1 to identifiers seems pretty reasonable?
The `ID` type is also subject to a validity constraint about there only being one element of the given name in the XML document: https://www.w3.org/TR/xml/#id.
I wasn't sure if we wanted to be this strict. Perhaps it's OK to have functions named the same as regex patterns? Or to have two function definitions with the same name? That's why I went for `name` and `NMTOKEN`.
spec/registry.dtd
Outdated
<!ATTLIST pattern id ID #REQUIRED>
<!ATTLIST pattern regex CDATA #REQUIRED>

<!ELEMENT signature (input*|param*|match*)>
Suggested change:
- <!ELEMENT signature (input*|param*|match*)>
+ <!ELEMENT signature (description?|input*|param*|match*)>
Looking at the examples and considering usage in particular for contextual help (e.g. IntelliSense), I would think that we'd like to at least allow for descriptions to attach not just to `<function>`, but effectively all elements within them?
From the examples, I gather that the `title` attributes of `<input>` and `<param>` are likely meant to feed into such contextual help. What's the reason for preferring an element for this in one place and an attribute in another?
<!ELEMENT input EMPTY>
<!ATTLIST input title CDATA #IMPLIED>
<!ATTLIST input values NMTOKENS #IMPLIED>
If we end up allowing for some `nmtoken`-ish values to be used unquoted as arguments (#364), it would be good to match that rule here.
I don't think #364 is related, actually. My intent here was to allow specifying either an enumeration of nmtokens or a regex to validate arguments that are MF2 literals. This seems orthogonal to whether these literals are quoted or not?
I meant that if #364 lands, then nearly all `NMTOKEN` values may be used without quoting, except for the ones that start with `:` and `-`. Given that, it might be appropriate to leave those out of the values that are definable by `values` rather than `pattern`.
I don't think we can enforce such a restriction in the DTD alone. LDML does it by extending the DTD with annotations: https://unicode.org/reports/tr35/#57-dtd-annotations.
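A restriction like this could also be checked by a separate lint pass that runs after DTD validation. The following is only a minimal sketch in Python under stated assumptions: the rule (no token starting with `:` or `-`) is taken from the comment above about #364 and is not agreed spec text, the function name `find_unquotable_values` and the sample registry are hypothetical, and the `name` attribute on `<param>` is assumed for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed rule from the discussion: tokens starting with ':' or '-'
# could not be used unquoted, so flag them in `values` enumerations.
FORBIDDEN_PREFIXES = (":", "-")


def find_unquotable_values(registry_xml: str) -> list[tuple[str, str]]:
    """Return (element tag, token) pairs whose token starts with ':' or '-'."""
    root = ET.fromstring(registry_xml)
    problems = []
    # Walk every element in document order and inspect its `values` attribute.
    for elem in root.iter():
        values = elem.get("values")
        if values is None:
            continue
        for token in values.split():
            if token.startswith(FORBIDDEN_PREFIXES):
                problems.append((elem.tag, token))
    return problems


# Hypothetical registry fragment, loosely following the DTD draft in this PR.
SAMPLE = """
<registry>
  <function name="example">
    <signature type="format">
      <input values="short -long"/>
      <param name="style" values="plain :fancy"/>
    </signature>
  </function>
</registry>
"""

print(find_unquotable_values(SAMPLE))
```

This keeps the DTD itself unchanged and moves the extra constraint into tooling, which is roughly the same division of labor as the LDML annotations mentioned above.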
spec/registry.dtd
Outdated
<!ATTLIST param default NMTOKEN #IMPLIED>
<!ATTLIST param pattern IDREF #IMPLIED>
<!ATTLIST param required (true|false) "false">
<!ATTLIST param readonly (true|false) "false">
Suggested change:
- <!ATTLIST param readonly (true|false) "false">
+ <!ATTLIST param translate (true|false) "true">
Given that developers and translators will both be working with MF2, it might be clearer to use `translate` rather than `readonly` to communicate the intent of this attribute.
As a related question, shouldn't this also be available for `<input>`?
In general, I'm not fond of `true` being the default value. I think I'd prefer to name these attributes such that their lack is equal to `false`.
Re. `readonly`/`translate` for `<input>` -- good point. I think I'll drop the ability to define more than one `<input>` element per signature, then.
> Re. `readonly`/`translate` for `<input>` -- good point. I think I'll drop the ability to define more than one `<input>` element per signature, then.
Done in 831a9cd.
spec/registry.dtd
Outdated
<!ELEMENT signature (input*|param*|match*)>
<!ATTLIST signature type (match|format) #REQUIRED>
<!ATTLIST signature position (open|close|standalone) "standalone">
Is this meant to be a building block for markup support? This attribute does not appear to be documented or explained.
I left some inline comments. Overall, the major thing that's unclear to me is the split between formatting and matching functions. More examples (both of valid and invalid messages) might help. I'm also not sure what a matching function returns, in terms of the code that implements the interface.
spec/registry.md
Outdated
* Generate variant keys for a given locale during XLIFF extraction.
* Verify the exhaustiveness of variant keys given a selector.
* Type-check values passed into functions.
* Validate that matching functions are only called in selectors.
This and the next bullet point are also hard to understand with "matching functions" and "formatting functions" being undefined terms.
Likewise, it might be useful to add "calling a formatting function in a selector context" and "calling a matching function in a placeholder context" to the list of possible errors in formatting.md. Depending on how I read the "Unknown Function errors" section, maybe that's already implicit, but since users would probably find a "function called in wrong context" error more helpful than an "undefined function" error, it's probably better to be explicit.
You're right, let's be explicit about this. I'd prefer to leave this out of this PR, however, because I think that we will first need to decide whether it's allowed to match on a function-less expression (i.e. `match {$foo}` instead of `match {$foo :func}`).
Filed #409 to discuss this further.
spec/registry.md
Outdated
It represents an implementation of a custom function available to translation at runtime.
A function defines a human-readable _description_ of its behavior
and one or more machine-readable _signatures_ of how to call it.
Named `<pattern>` elements can optionally define regex validation rules for input, option values, and variant keys.
Is a regex enough to validate inputs or option values? For example, the second example on slide 13 here suggests that a formatting function could take some structured data as an argument.
I don't know yet how to validate runtime types. I'm suggesting using regexes to validate literal values. Let's discuss what sort of type validation is even possible for some implementations.
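As a rough illustration of what validating literal values could look like, here is a minimal Python sketch. It assumes an in-memory form of the registry where named patterns come from `<pattern id=... regex=...>` and an option carries either a `values` enumeration or a `pattern` reference. The function name `literal_is_valid` and the regex given for `anyNumber` are hypothetical and not taken from the PR.

```python
import re

# Hypothetical named patterns, standing in for <pattern id="..." regex="..."/>.
# The regex for "anyNumber" is an assumption made for this example only.
PATTERNS = {"anyNumber": r"-?[0-9]+(\.[0-9]+)?"}


def literal_is_valid(literal, values=None, pattern=None, patterns=PATTERNS):
    """Check a literal against an enumeration (`values`) or a named regex (`pattern`).

    With neither constraint present, any literal is accepted.
    """
    if values is not None:
        # `values` is a whitespace-separated enumeration, like NMTOKENS.
        return literal in values.split()
    if pattern is not None:
        # The whole literal must match, not just a prefix.
        return re.fullmatch(patterns[pattern], literal) is not None
    return True


print(literal_is_valid("ordinal", values="cardinal ordinal"))
print(literal_is_valid("1.5", pattern="anyNumber"))
print(literal_is_valid("one and a half", pattern="anyNumber"))
```

This only covers literals written in the message. It says nothing about runtime values such as a date object passed as `$foo`, which is the open question in this thread.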
<input title="Adjective id"/>
<option name="article" values="definite indefinite"/>
<option name="plural" values="one other"/>
<option name="case" values="nominative genitive" default="nominative"/>
I'm a little confused about why there's a `case` option for an adjective if this signature is only defined in locale `"en"`.
English doesn't make it easy to build meaningful examples of grammatical features :) I'll try to come up with something better.
Yeah, I know :) I was just thinking the example could be from a different language, maybe.
<formatSignature locales="en">
<input title="Adjective id"/>
<option name="article" values="definite indefinite"/>
<option name="accord"/>
The `accord` option doesn't have a pattern or an enumeration, which surprises me given lines 51-52 (which suggest to me that every option has one or the other).
Right, this is still WIP and underspec'ed. I'd like to continue the discussion about validating runtime values outside of this PR.
A signature may define the positional argument of the function with the `<input>` element.
A signature may also define one or more `<option>` elements representing _named options_ to the function.
Options are optional by default,
unless the `required` attribute is present.
Would it be an error for an option to have both a `required` and a `default` attribute?
The following message references the first signature of `:adjective`,
which expects the `plural` and `case` options:

{You see {$color :adjective article=indefinite plural=one case=nominative} {$object :noun case=nominative}!}
What happens if you write `{You see {$color :adjective article=indefinite}}`? Since none of the options in the example are `required`, it's unclear which of the two signatures for `:adjective` should be used. Or does it not matter because the idea is that `:adjective` has a single implementation for all signatures?
or validate their input with a regular expression (the `pattern` attribute).
Read-only options (the `readonly` attribute) can be displayed to translators in CAT tools, but may not be edited.

Matching-function signatures additionally include one or more `<match>` elements
Is it "one or more" or "zero or more"? The DTD implies that there can be zero `<match>` elements, as does the second signature for `number` in the example below.
spec/registry.dtd
Outdated
<!ATTLIST signature locales NMTOKENS #IMPLIED>

<!ELEMENT input EMPTY>
<!ATTLIST input title CDATA #IMPLIED>
I assume this `title` (and others such as the one on line 24) are meant to be human readable descriptions of the function/input/parameter? If so, it's bad practice to make them an attribute--because they can't easily be localized that way. It's better to make subsidiary elements to contain natural language text. The `title` fields don't appear to be necessary to the functionality in any case.
I'll remove `title` for now and we can discuss human-readable metadata separately.
Filed #405 to discuss this further.
spec/registry.dtd
Outdated
<!ATTLIST input pattern NMTOKEN #IMPLIED>
<!ATTLIST input readonly (true|false) "false">

<!ELEMENT param EMPTY>
There is nothing about types here. Strongly typed languages need some way to express this?
spec/registry.md
Outdated
* Type-check values passed into functions.
* Validate that matching functions are only called in selectors.
* Validate that formatting functions are only called in placeholders.
* Forbid edits to certain function options (e.g. currency options).
Not sure that "currency" is a good example here
Match a numerical value against CLDR plural categories or against a number literal.
</description>

<matchSignature locales="en">
I'm very curious about the `locales` attribute--curious in this case is a euphemism for "nervous about". I would prefer if functions were mostly locale neutral (I can declare, for example, a number format in any message in any locale).
We should have some way to define at least locale-dependent option and match values. For example, the registry should allow for a way to note that while the whole set of CLDR categories is `zero one two few many other`, a specific locale such as `en` only uses `one other`.
One alternative could be for `<matchSignature>` and `<formatSignature>` to be able to contain an `<override locales="en fr it">` section.
For completeness of the example, my plan was to allow the overrides directly on the level of the `matchSignature` and `formatSignature`:
<function name="plural">
    <matchSignature locales="en">
        <input pattern="anyNumber"/>
        <option name="type" values="cardinal ordinal"/>
        <option .../>
        <match values="one other"/> ← English plurals
        <match pattern="anyNumber"/>
    </matchSignature>
    <matchSignature locales="pl">
        <input pattern="anyNumber"/>
        <option name="type" values="cardinal ordinal"/>
        <option .../>
        <match values="one few many other"/> ← Polish plurals
        <match pattern="anyNumber"/>
    </matchSignature>
</function>
I am not a fan of this. I think that a plural selector should recognize all of the keywords and not depend on the registry to "allow" or "disallow" the enumerated values for a given locale. (Note that `locales` must be a "language range", otherwise you would need to list every possible language tag. It is probably an extended language range, e.g. so that `pl-PL` matches `pl-Latn-PL` -- not that anyone would use that tag.)
I think it is reasonable that (for example) the `root` locale message can contain keywords that do not fire for a specific locale but which fire in others, in case that `message` is used in one of those other locales. This is also why the `*` defaulting key exists (such as the case where an English-language resource with only `one` and `*` gets used in the `pl` locale).
In the case of plurals, CLDR provides the data about which keywords apply to which locale. In the case of some other formatter or selector that is not supplied by CLDR, the implementation should know what applies to each locale. If we want the registry to describe that relationship (for example to support exploding the matrix of keys in localization tools), I think I would prefer that it be separate from the signature, e.g.:
<function name="customPluralLikeSelector">
    <matchSignature>
        <input pattern="anyNumber"/>
        <option name="type" value="foo bar"/>
        <match value="zero one two few many other"/>
        <match pattern="anyNumber"/>
    </matchSignature>
    <validate comment="the naming is terrible and we wouldn't structure it exactly like this">
        <match type="values">
            <value lang="">one other</value>
            <value lang="pl">one few many other</value>
            <value lang="ja">other</value>
            <!-- etc. -->
        </match>
    </validate>
</function>
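To make the idea of keeping locale data separate from the signature more concrete, here is a minimal Python sketch of how a tool might look up the variant keys for a locale from a table like the `<value lang="...">` entries above. Everything here is an assumption for illustration: the function name `variant_keys_for` is hypothetical, and the fallback is a simplified truncation of subtags (with `""` as the root entry), not the extended language range matching mentioned earlier and not a case-insensitive comparison.

```python
# Hypothetical per-locale data mirroring the <value lang="..."> sketch above.
VALUES_BY_LANG = {
    "": "one other",            # root / default entry
    "pl": "one few many other",
    "ja": "other",
}


def variant_keys_for(locale: str, table=VALUES_BY_LANG) -> list[str]:
    """Find the keys for `locale`, truncating subtags until an entry matches."""
    tag = locale
    while tag:
        if tag in table:
            return table[tag].split()
        # Drop the last subtag: "pl-Latn-PL" -> "pl-Latn" -> "pl" -> ""
        tag = tag.rpartition("-")[0]
    # Nothing more specific matched, so fall back to the root entry.
    return table[""].split()


print(variant_keys_for("pl-Latn-PL"))
print(variant_keys_for("en-US"))
```

A localization tool could use such a lookup to explode the matrix of keys for a target locale, while the selector implementation itself continues to accept every keyword.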
Oh, I think I like it!
Filed #410 to discuss this further.
…bout nullary functions
Thank you, @eemeli, @catamorphism, and @aphillips for your reviews. I was able to address some feedback, but not all of it yet. I'll continue over the weekend. I'm already seeing a few topics that I'd like to leave out of this PR and discuss separately:
I'd like to suggest that we try to merge this PR without answering the above questions, and continue iterating on it later.
We absolutely allow this! Note that this is the same as/similar to the current ICU SelectFormat.
Thanks for your (continuing) work on this. This will, of course, be a topic for Monday's call. I will also have recommendations on the other PRs by then. Do you have opinions about the questions you raise? Would it be productive to try to bring in solutions? Or do you want to raise separate issues to discuss these? I'd like to iterate quickly on these where possible...
Right now I'd prefer to focus energy on merging this PR, in a minimal state that represents at least partial agreement about the direction. (Nothing's final yet, we can iterate freely.) Let's discuss on Monday and try to evaluate how big these other topics are. I have a sense that some are non-controversial (like adding the open/close concept), while others will require that we better define the requirements: in particular the runtime type validation will benefit if we better define what it means for us to be programming-language-agnostic.
We still need to iterate on the registry, but I think we should merge this so that we can do so.