Add Literal Resolution section to formatting.md #382

Merged
merged 2 commits into from
Jun 5, 2023

Conversation

eemeli
Collaborator

@eemeli eemeli commented May 17, 2023

This is in part a follow-up from this conversation with @aphillips: #364 (comment)

The intent here is to be clear and explicit about the meaning of literal values. While putting this together, I started to think that we might need yet another section discussing treatment of message parts other than formatting. For instance, I understand us to be aligned on expecting literal values to be presented as non-translatable by default. It would be good to note this somewhere, but "formatting" isn't really the right place for it.

@eemeli eemeli added the spec-text and Agenda+ (Requested for upcoming teleconference) labels May 17, 2023
@@ -8,6 +8,15 @@ when formatting a message for display in a user interface, or for some later pro
The document is part of the MessageFormat 2.0 specification,
the successor to ICU MessageFormat, henceforth called ICU MessageFormat 1.0.

## Literal Resolution

The resolved value of _text_, _literal_ and _nmtoken_ tokens
Collaborator Author

This will need to include unquoted if/once #364 is accepted.

Collaborator

@catamorphism catamorphism May 22, 2023

The word "value" seems to be getting used in multiple ways in this paragraph.

The first sentence refers to "the resolved value of text, literal and nmtoken tokens"; if resolution is the relation that maps character strings in the language defined by the ABNF onto values, then I understand "value" as being used semantically here. The spec doesn't (yet?) define criteria for membership in this set of semantic values, not in the precise way that the ABNF defines membership in the set of syntactically valid messages.

However, the next sentence refers to an "option value", which I take as being a syntactic concept: the token that appears on the right-hand side of the '=' in the option nonterminal in the ABNF.

Defining "value" and "resolution" before these terms are used, and replacing "or option value" with "on the right-hand side of an option", might help clarify things. (This could be done in the glossary, which uses "value" many times without defining it (possibly not always with the same meaning), and doesn't define "resolution", and cross-referenced here; could be in a future PR.)

Collaborator Author

The intended meaning of "resolved value" here is the value that will ultimately get formatted. So for an unquoted literal 42, it would be the string '42', while for a quoted literal |foo\|bar|, it would be the string 'foo|bar'. For a variable reference $foo, it would be the value of the variable, which could really be anything.

My intent would be to explain this term as a part of the bigger formatting PR I'm now working on.

Renaming "option value" does sound like a good idea.
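
A minimal sketch of the resolution described in this comment, assuming that a quoted literal is delimited by `|` and that a backslash escapes the character that follows it. `resolve_literal` is a hypothetical helper for illustration only, not something the spec defines:

```python
def resolve_literal(source: str) -> str:
    """Return the string value of a literal as written in a message.

    Assumption: a quoted literal is wrapped in |...| and uses a backslash
    to escape the next character; anything else is treated as an unquoted
    literal whose resolved value is its own text.
    """
    if len(source) >= 2 and source.startswith("|") and source.endswith("|"):
        body = source[1:-1]
        parts = []
        i = 0
        while i < len(body):
            if body[i] == "\\" and i + 1 < len(body):
                # An escape sequence resolves to its escaped character.
                parts.append(body[i + 1])
                i += 2
            else:
                parts.append(body[i])
                i += 1
        # The resolved value is the concatenation of the parts.
        return "".join(parts)
    return source
```

Under these assumptions, the unquoted `42` and the quoted `|42|` both resolve to the same string, and the escaped pipe in `|foo\|bar|` resolves to a plain `|`.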

Collaborator

As long as "resolved"/"resolution" are defined in the bigger PR, I'm fine with leaving those terms undefined in this one.

The resolved value of _text_, _literal_ and _nmtoken_ tokens
is always a string concatenation of its parts,
with escape sequences resolving to their escaped characters.
When a literal value is used as a formatting function argument or option value,
Member

Because this is a normative requirement, I would suggest making this its own paragraph.

Collaborator Author

I think the only non-normative part here is the "... such that e.g. ..." example. Would it be better if that were separated into its own paragraph?

Member

I'm saying something different, which is: make each normative requirement start its own paragraph (unless there is a good reason not to). That makes it easier to find each requirement and check that, for example, it has tests or that one's implementation complies with it.

Comment on lines 17 to 18
the formatting function MUST treat option values the same independently of their presentation,
such that e.g. the options `foo=42` and `foo=|42|` have the same effect.
Member

I think this might not be correct or that it has the potential to be incorrect?

The reason we define nmtoken separate from literal (or unquoted) is because it fulfills a separate role. For example, in a plural selector, the key values can include nmtoken values from an array of keywords (zero, one, two, few, many, other/*) and separately certain literal values. A formatter such as the number formatter might accept a number of nmtoken keyword arguments (integer, percent) but might also allow literals in the same argument.

Admittedly I can't think of a use case like this currently and it would be a poor formatter design that depended solely on the difference between integer and |integer| to know if the value were meant to be a token or a string. In fact, the more I look at this, the more I tend to think that nmtoken might need to go (reserving literal values as keywords in key and option could be done in the registry)? It would certainly simplify the ABNF.

The downside of removing nmtoken is that the nmtoken production allows key values that start with - and :, such as when -42 or when :foo, without quoting the values (|-42| and |:foo|). I don't care about : (in fact, I think it's confusing to allow it), but the minus sign feels important for operating with numbers.

Collaborator Author

> In fact, the more I look at this, the more I tend to think that nmtoken might need to go (reserving literal values as keywords in key and option could be done in the registry)? It would certainly simplify the ABNF.

Could you clarify whether you're suggesting that the registry ought to influence what's valid syntax? That would be rather problematic.

Member

I am not suggesting that the registry would influence what is valid syntax at the ABNF level. What I am suggesting is that the registry might determine what options are valid for a given formatter or selector, e.g. when few is valid (or perhaps "valid") for :plural but when fex is not, even though both are syntactically valid. (In fact, according to formatting.md, it is a selector error).

Another way of saying what I'm saying above is that the key of a when statement is always a literal, whose interpretation is selector-specific.
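
A rough sketch of that split between syntax and selector-specific interpretation. The `SELECTOR_KEYWORDS` table and the `invalid_keys` helper are hypothetical stand-ins for registry data, and treating `*` as a catch-all key is an assumption made for the example:

```python
# Hypothetical registry data: the keyword keys that each selector accepts.
SELECTOR_KEYWORDS = {
    "plural": {"zero", "one", "two", "few", "many", "other"},
}


def invalid_keys(selector: str, keys: list[str]) -> list[str]:
    """Return the keys that the named selector does not recognise.

    Every key is assumed to be syntactically valid already; this check is
    about the selector's own interpretation of the literal, not the ABNF.
    """
    allowed = SELECTOR_KEYWORDS.get(selector, set())
    # Assumption: "*" is a catch-all key that any selector accepts.
    return [key for key in keys if key != "*" and key not in allowed]
```

With this table, `invalid_keys("plural", ["few", "fex", "*"])` reports only `fex`, even though `few` and `fex` are equally valid as syntax.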

Collaborator Author

Agreed. That's included in the current registry PR #368 as e.g. <match values="one other"/>.

<!ELEMENT match EMPTY>
<!ATTLIST match values NMTOKENS #IMPLIED>
<!ATTLIST match pattern NMTOKEN #IMPLIED>

is always a string concatenation of its parts,
with escape sequences resolving to their escaped characters.
When a literal value is used as a formatting function argument or option value,
the formatting function MUST treat option values the same independently of their presentation,
Collaborator

@catamorphism catamorphism May 22, 2023

Another way of saying this -- if a definition of the term "value" (that is, the range of the resolution function whose domain is the formal language defined by the ABNF) is added -- is that the formatting function is defined on semantic values rather than syntactically. In other words, the resolution relation maps both 42 and |42| to the same semantic value; if semantic values are the domain of the formatting function, then it's impossible by construction for the formatting function to distinguish the two. (Unless the value domain is defined so that 42 and |42| are distinguishable, but even in that case, writing down the meta-language that describes the value domain would make it easier to define where the formatting function should treat different values as equivalent to each other.)

Collaborator Author

Going a bit further, I sensed on our call yesterday that we have consensus on formatting functions not being able to determine whether an option has been set from a literal or a variable. As in, a formatting function receiving a value '42' for the foo option would not know if the message had set foo=42, foo=|42|, or foo=$bar where $bar had a value of '42'.

Collaborator

Then, given the consensus, I think it would make sense to rephrase this as a requirement on whatever calls the functions (the formatter?), not on the functions themselves. It's true that the formatting function must treat option values the same independently of their presentation, but it's also impossible for it to do otherwise!

A way to phrase it might be that the domain of the formatting function is resolved values, and then if you wanted to add examples, one example could be that 42, |42|, and $bar all map to the same resolved value (assuming an environment in which bar is bound to 42), specifically '42'.
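
To make that concrete, here is a small sketch in which the caller resolves the right-hand side of an option before the function ever sees it. Both `resolve_operand` and `format_with_foo` are hypothetical names, and escape handling inside quoted literals is left out for brevity:

```python
def resolve_operand(source: str, variables: dict) -> object:
    """Map the right-hand side of an option to its resolved value.

    Assumptions: `$name` is a variable reference, `|...|` is a quoted
    literal (escapes are not handled in this sketch), and anything else
    is an unquoted literal.
    """
    if source.startswith("$"):
        return variables[source[1:]]
    if len(source) >= 2 and source.startswith("|") and source.endswith("|"):
        return source[1:-1]
    return source


def format_with_foo(foo: object) -> str:
    # A hypothetical formatting function: it receives only the resolved
    # value, so it cannot tell how the option was written in the message.
    return f"foo is {foo!r}"
```

In an environment where `bar` is bound to the string `'42'`, the sources `42`, `|42|` and `$bar` all resolve to the same value, so `format_with_foo` produces identical output for each of them.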

The resolved value of _text_, _literal_ and _nmtoken_ tokens
is always a string concatenation of its parts,
with escape sequences resolving to their escaped characters.
When a literal value is used as a formatting function argument or option value,
Collaborator

I'm also not sure whether the word "value" is meant to be syntactic or semantic in "a literal value". It might be clearer (if more verbose) to write:

"When a text, literal, or nmtoken token is used as a formatting function argument or option value..."

The current wording, if read literally, implies that the requirement to not distinguish between different syntactic constructs with the same meaning only applies to literals. Which would be confusing, since I don't think it's possible to write two lexically distinct literals with the same meaning.

Collaborator Author

If #364 is accepted, this should clear up a bit as the ABNF literal becomes the only thing this applies to. But yes, this should also avoid the dreaded term "value".

@eemeli
Collaborator Author

eemeli commented May 23, 2023

@aphillips @catamorphism I've updated the PR following your suggestions; could you re-review and potentially close any/all of the above discussions, or clarify if there are further changes you'd like to see?

Collaborator

@catamorphism catamorphism left a comment

There's just one change I would still request (left as a line comment just now). Everything else looks good now! I don't think that the Github UI gives me the ability to close the previous discussions because I submitted them as "comments" rather than "request changes", but if you have that ability, it's fine to close them.

with escape sequences resolving to their escaped characters.
When a _literal_ or _nmtoken_ is used as an _expression_ argument
or on the right-hand side of an _option_,
the formatting function MUST treat their resolved values the same independently of their presentation,
Collaborator

I think it's worth re-wording this to clarify that the contract between the caller of the formatting function, and the callee (formatting function), makes it impossible to do otherwise. (See #382 (comment) ).

Collaborator Author

It's not necessarily impossible in all implementations. A formatting function needs to be passed some amount of contextual information, such as the current locale, and it's possible to consider an implementation that also includes in that context something like an AST of the current expression. This might make sense for instance in order to enable errors in specific options to be positioned exactly in terms of source offsets.

This statement is specifying that even in such a hypothetical situation, a valid formatting function is not allowed to vary its behaviour based on the quoting style of the literal value.
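
As a sketch of that hypothetical situation: the function below is handed some AST details alongside the resolved value, uses the source offset to position an error, and never consults the quoting style. `OptionSource`, `OptionError`, `parse_digits_option` and the `digits` option itself are all invented for illustration:

```python
from dataclasses import dataclass


@dataclass
class OptionSource:
    """Hypothetical AST details an implementation might expose to functions."""
    offset: int   # where the option value starts in the message source
    quoted: bool  # whether the literal was written as |...|


class OptionError(ValueError):
    def __init__(self, message: str, offset: int):
        super().__init__(f"{message} (at source offset {offset})")
        self.offset = offset


def parse_digits_option(value: str, source: OptionSource) -> int:
    """Validate a hypothetical `digits` option using only its resolved value.

    `source.offset` is used to position the error; `source.quoted` is
    deliberately never consulted, so `digits=2` and `digits=|2|` behave alike.
    """
    if not (value.isascii() and value.isdigit()):
        raise OptionError(f"invalid digits value {value!r}", source.offset)
    return int(value)
```

The `quoted` flag is available to the function here, yet a conforming function under the proposed wording would have to ignore it, as this one does.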

Collaborator

Do errors count as "behavior"? It sounds like you're saying the error might be different based on the AST of the current expression, which suggests not treating resolved values the same independently of their presentation (to me, a different error is different behavior).

Collaborator Author

I meant that an implementation may exist which, for good reasons, provides a pathway by which a formatting function could determine whether an option value was originally quoted or not.

For errors, I think the current spec shape of specifying the type of error is appropriate.

## Literal Resolution

The resolved value of _text_, _literal_ and _nmtoken_ tokens
is always a string concatenation of its parts,
Collaborator

What are the "parts" here? I would think these items (text, literal and nmtoken) are part-less?

Collaborator Author

This is meant to refer to the *-char and *-escape parts of text and literal, and name-char for nmtoken, as hinted by the rest of this sentence.

@aphillips aphillips merged commit 4e33d64 into unicode-org:main Jun 5, 2023
@eemeli eemeli deleted the resolve-literals branch June 5, 2023 19:44
4 participants