-
-
Notifications
You must be signed in to change notification settings - Fork 36
Define the grammar as an ABNF (RFC 5234) #347
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking really good. Thanks Stas!
Most of my comments are corner cases and probably I should raise them as separate issues.
spec/message.abnf
Outdated
markup-start = "+" name | ||
markup-end = "-" name | ||
name = name-start *name-char | ||
nmtoken = 1*name-char |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is nmtoken
more relaxed about starters than name
? Is it because we want to allow digits??
Note that nmtoken
is basically an unquoted literal which we provide for convenience, e.g. so you can say:
when foo {{$count :number option=bar}}
instead of having to quote foo
and bar
as (foo)
and (bar)
I guess what I'm digging at here is whether we can simplify parsing rules given that we're mostly breaking on whitespace or on specific characters like =
. The less we use specific character lists or rules, the easier it is to implement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name
and nmtoken
are taken directly from https://www.w3.org/TR/xml/#NT-Name. And yes, the goal was to make it convenient to use (a) numerical option values (fractionDigits=1
) and (b) option values such as year=2-digit
.
More broadly, the use of nmtoken
is intended to align MF2 with LDML. CLDR data is defined as attribute enums (e.g. in ldmlSupplemental.dtd), which implies that they are Nmtokens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nmtoken
however is actually exposed to more than just LDML/CLDR constructs. Developers of selectors and formatters will use them to define key values and option values. That is, our user base is wider than just CLDR stuff or ICU developers. If we define values that are a strict superset of those restrictions then any XML, LDML, or CLDR values will work. LDML being a dialect of XML means that they are restricted by XML, but we don't have to be. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, are you suggesting that keys and option values should be even less restricted than nmtoken
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My goal here (last year when I wrote this for my proposal) was to make sure that whatever appears in LDML attlists (now or in the future) can be a valid key or an option value. That's why I chose Nmtoken
for these two productions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I'm suggesting that any characters we omit we know why we are omitting them. Some implementations will probably not check the values so long as they are able to parse the various tokens (which don't really depend on the characters inside the token); others will pedantically check the character ranges.
To be honest, most developers will write enumerated values in ASCII for keys and option values. But for cases when they don't the restrictions on users should make sense in our world (and not just because someone else did it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LDML being a dialect of XML
I hope you meant "LDML being an application of XML". XML doesn't have any dialects (except for XML 1.1, maybe).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@duerst exactly so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving aside my recently-filed issues that are not on topic, I think this is merge-worthy.
Thanks, @aphillips. I researched using ABNF after I started working on whitespace rules. This PR is unfortunately based on my work in #344. I'll keep it as a draft while #344 is being discussed. |
Co-authored-by: Caleb Maclennan <caleb@alerque.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few editorial suggestions inline, but I'm fine with merging even without them.
Co-authored-by: Eemeli Aro <eemeli@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! I support an initial mechanical conversion to ABNF, but have some suggestions for followup improvements.
message = [s] *(declaration [s]) body [s] | ||
|
||
declaration = let s variable [s] "=" [s] "{" [s] expression [s] "}" | ||
body = pattern | ||
/ (selectors 1*([s] variant)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to my comment below, I think these rules would be more readable with conventional OWS and RWS (as in "{optional,required} white space") rules, and column-aligned as in RFC 5234.
message = [s] *(declaration [s]) body [s] | |
declaration = let s variable [s] "=" [s] "{" [s] expression [s] "}" | |
body = pattern | |
/ (selectors 1*([s] variant)) | |
message = OWS *(declaration OWS) body OWS | |
declaration = let RWS variable OWS "=" OWS "{" OWS expression OWS "}" | |
body = pattern | |
/ (selectors 1*(OWS variant)) |
and so on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we discuss using RWS
and OWS
in a separate PR? They touch every production in the ABNF.
I'm against column alignment for as long as we expect the grammar to change. They generate needless diffs. Let's do it once when the grammar stabilizes.
expression = ((literal / variable) [s annotation]) | ||
/ annotation | ||
annotation = function *(s option) | ||
option = name [s] "=" [s] (literal / nmtoken / variable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the expected semantics of an option value that is a nmtoken
but not a name
, as in e.g. {:func foo=1}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It parses as a literal, "1"
. The implementation of :func
can interpret it as a number if it makes sense to do so for the foo
option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrm, so the value of an option is either a variable or a literal, but the literal can be implicit rather than quoted? Are {:func foo=|1|}
and {:func foo=1}
therefore indistinguishable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a super-subtle difference between nmtoken
and literal
values.
An nmtoken
value might be validated at parse time and the values that can be present in an nmtoken
are restricted vs. the values permitted in a literal. The use of numbers is fairly common in existing formatters, Cf. Intl.NumberFormat
options such as maximumSignificantDigits
. But other values have limited (and enumerated) values which might be validated at parse time.
A literal
value probably is a parsing error (invalid argument) when the function wants a number or enumerated value. MF's options are untyped, but the underlying implementation might not be.
There may be a "tripping hazard" here for users who can't see the difference between:
{:func symbol=US$}
(invalid, as $
is reserved) and {:func symbol=|US$|}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrm, so the value of an option is either a variable or a literal, but the literal can be implicit rather than quoted?
Yes.
Are
{:func foo=|1|}
and{:func foo=1}
therefore indistinguishable?
As far as I recall we have not had an explicit discussion on this, but my position would be that during formatting the :func
handler should not be able to distinguish these two from each other.
pattern = "{" *(text / placeholder) "}" | ||
selectors = match 1*([s] selector) | ||
selector = "{" [s] expression [s] "}" | ||
variant = when 1*(s key) [s] pattern | ||
key = nmtoken / literal / "*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pattern = "{" *(text / placeholder) "}" | |
selectors = match 1*([s] selector) | |
selector = "{" [s] expression [s] "}" | |
variant = when 1*(s key) [s] pattern | |
key = nmtoken / literal / "*" | |
pattern = "{" *(text / placeholder) "}" | |
selectors = match 1*([s] selector) | |
selector = "{" [s] expression [s] "}" | |
variant = when 1*(s key) [s] pattern | |
key = nmtoken / literal / "*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for suggesting this. As I said above, I'd prefer to not column-align the ABNF for now because I don't think the names of production are final and I expect the next few weeks to bring a few changes.
…cape easier to understand Co-authored-by: Richard Gibson <richard.gibson@gmail.com>
Co-authored-by: Richard Gibson <richard.gibson@gmail.com>
Co-authored-by: Richard Gibson <richard.gibson@gmail.com>
I understand why now (ugly syntax from ABNF)
…On Tue, Feb 28, 2023 at 2:58 PM Stanisław Małolepszy < ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In spec/message.abnf
<#347 (comment)>
:
> +let = %x6C.65.74
+match = %x6D.61.74.63.68
+when = %x77.68.65.6E
My understanding is that let = "let" would make the let keyword
case-insensitive, due to how literal text strings work in ABNF. That's why
we'd need the case-sensitive extension: let = %s"let". See also:
https://github.com/unicode-org/message-format-wg/pull/347/files#diff-a33e1c859f2aaad86b37ea3b3b5e8d45331199d6e879c5aba38eca2f23f01865
.
—
Reply to this email directly, view it on GitHub
<#347 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACJLEMGOZ2URHLNIFAEZOVTWZZ7JBANCNFSM6AAAAAAU3NM7WI>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
Co-authored-by: Richard Gibson <richard.gibson@gmail.com>
Co-authored-by: Eemeli Aro <eemeli@gmail.com>
Let me list the suggestions to the PR that I'd like to discuss separately, after it lands, so that we can keep track of them all easily.
|
I'm going to merge this now. Please feel free to suggest further improvements by opening new PRs. Please remember to not only edit |
Based on #344. Fixes #342.
Convert our W3C EBNF grammar to ABNF, as defined by RFC 5234.
I validated the grammar using https://author-tools.ietf.org/abnf.
I was also able to run some fuzz testing by:
Not all files parsed correctly. I'll investigate.
message.ebnf
.syntax.md
.syntax.md
.