Skip to content

Define the grammar as an ABNF (RFC 5234) #347

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Mar 1, 2023
Merged

Define the grammar as an ABNF (RFC 5234) #347

merged 15 commits into from
Mar 1, 2023

Conversation

stasm
Copy link
Collaborator

@stasm stasm commented Feb 14, 2023

Based on #344. Fixes #342.

Convert our W3C EBNF grammar to ABNF, as defined by RFC 5234.

I validated the grammar using https://author-tools.ietf.org/abnf.

I was also able to run some fuzz testing by:

Not all files parsed correctly. I'll investigate.

  • Remove message.ebnf.
  • Update BNF snipptes in syntax.md.
  • Update the Complete EBNF section in syntax.md.

Copy link
Member

@aphillips aphillips left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really good. Thanks Stas!

Most of my comments are corner cases and probably I should raise them as separate issues.

markup-start = "+" name
markup-end = "-" name
name = name-start *name-char
nmtoken = 1*name-char
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is nmtoken more relaxed about starters than name? Is it because we want to allow digits??

Note that nmtoken is basically an unquoted literal which we provide for convenience, e.g. so you can say:

when foo {{$count :number option=bar}}

instead of having to quote foo and bar as (foo) and (bar)

I guess what I'm digging at here is whether we can simplify parsing rules given that we're mostly breaking on whitespace or on specific characters like =. The less we use specific character lists or rules, the easier it is to implement.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name and nmtoken are taken directly from https://www.w3.org/TR/xml/#NT-Name. And yes, the goal was to make it convenient to use (a) numerical option values (fractionDigits=1) and (b) option values such as year=2-digit.

More broadly, the use of nmtoken is intended to align MF2 with LDML. CLDR data is defined as attribute enums (e.g. in ldmlSupplemental.dtd), which implies that they are Nmtokens.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nmtoken however is actually exposed to more than just LDML/CLDR constructs. Developers of selectors and formatters will use them to define key values and option values. That is, our user base is wider than just CLDR stuff or ICU developers. If we define values that are a strict superset of those restrictions then any XML, LDML, or CLDR values will work. LDML being a dialect of XML means that they are restricted by XML, but we don't have to be. Thoughts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, are you suggesting that keys and option values should be even less restricted than nmtoken?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My goal here (last year when I wrote this for my proposal) was to make sure that whatever appears in LDML attlists (now or in the future) can be a valid key or an option value. That's why I chose Nmtoken for these two productions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I'm suggesting that any characters we omit we know why we are omitting them. Some implementations will probably not check the values so long as they are able to parse the various tokens (which don't really depend on the characters inside the token); others will pedantically check the character ranges.

To be honest, most developers will write enumerated values in ASCII for keys and option values. But for cases when they don't the restrictions on users should make sense in our world (and not just because someone else did it)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aphillips

LDML being a dialect of XML

I hope you meant "LDML being an application of XML". XML doesn't have any dialects (except for XML 1.1, maybe).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@duerst exactly so

Copy link
Member

@aphillips aphillips left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving aside my recently-filed issues that are not on topic, I think this is merge-worthy.

@stasm
Copy link
Collaborator Author

stasm commented Feb 14, 2023

Thanks, @aphillips. I researched using ABNF after I started working on whitespace rules. This PR is unfortunately based on my work in #344. I'll keep it as a draft while #344 is being discussed.

Base automatically changed from sta/explicit-whitespace to main February 28, 2023 09:28
@stasm stasm marked this pull request as ready for review February 28, 2023 11:38
@stasm stasm requested review from aphillips and eemeli February 28, 2023 11:39
@stasm stasm requested review from echeran and mihnita February 28, 2023 11:39
Copy link
Collaborator

@eemeli eemeli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few editorial suggestions inline, but I'm fine with merging even without them.

Co-authored-by: Eemeli Aro <eemeli@gmail.com>
Copy link
Collaborator

@gibson042 gibson042 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! I support an initial mechanical conversion to ABNF, but have some suggestions for followup improvements.

Comment on lines +1 to +5
message = [s] *(declaration [s]) body [s]

declaration = let s variable [s] "=" [s] "{" [s] expression [s] "}"
body = pattern
/ (selectors 1*([s] variant))
Copy link
Collaborator

@gibson042 gibson042 Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to my comment below, I think these rules would be more readable with conventional OWS and RWS (as in "{optional,required} white space") rules, and column-aligned as in RFC 5234.

Suggested change
message = [s] *(declaration [s]) body [s]
declaration = let s variable [s] "=" [s] "{" [s] expression [s] "}"
body = pattern
/ (selectors 1*([s] variant))
message = OWS *(declaration OWS) body OWS
declaration = let RWS variable OWS "=" OWS "{" OWS expression OWS "}"
body = pattern
/ (selectors 1*(OWS variant))

and so on.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we discuss using RWS and OWS in a separate PR? They touch every production in the ABNF.

I'm against column alignment for as long as we expect the grammar to change. They generate needless diffs. Let's do it once when the grammar stabilizes.

expression = ((literal / variable) [s annotation])
/ annotation
annotation = function *(s option)
option = name [s] "=" [s] (literal / nmtoken / variable)
Copy link
Collaborator

@gibson042 gibson042 Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the expected semantics of an option value that is a nmtoken but not a name, as in e.g. {:func foo=1}?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It parses as a literal, "1". The implementation of :func can interpret it as a number if it makes sense to do so for the foo option.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, so the value of an option is either a variable or a literal, but the literal can be implicit rather than quoted? Are {:func foo=|1|} and {:func foo=1} therefore indistinguishable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a super-subtle difference between nmtoken and literal values.

An nmtoken value might be validated at parse time and the values that can be present in an nmtoken are restricted vs. the values permitted in a literal. The use of numbers is fairly common in existing formatters, Cf. Intl.NumberFormat options such as maximumSignificantDigits. But other values have limited (and enumerated) values which might be validated at parse time.

A literal value probably is a parsing error (invalid argument) when the function wants a number or enumerated value. MF's options are untyped, but the underlying implementation might not be.

There may be a "tripping hazard" here for users who can't see the difference between:

{:func symbol=US$} (invalid, as $ is reserved) and {:func symbol=|US$|}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, so the value of an option is either a variable or a literal, but the literal can be implicit rather than quoted?

Yes.

Are {:func foo=|1|} and {:func foo=1} therefore indistinguishable?

As far as I recall we have not had an explicit discussion on this, but my position would be that during formatting the :func handler should not be able to distinguish these two from each other.

Comment on lines +7 to +11
pattern = "{" *(text / placeholder) "}"
selectors = match 1*([s] selector)
selector = "{" [s] expression [s] "}"
variant = when 1*(s key) [s] pattern
key = nmtoken / literal / "*"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pattern = "{" *(text / placeholder) "}"
selectors = match 1*([s] selector)
selector = "{" [s] expression [s] "}"
variant = when 1*(s key) [s] pattern
key = nmtoken / literal / "*"
pattern = "{" *(text / placeholder) "}"
selectors = match 1*([s] selector)
selector = "{" [s] expression [s] "}"
variant = when 1*(s key) [s] pattern
key = nmtoken / literal / "*"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for suggesting this. As I said above, I'd prefer to not column-align the ABNF for now because I don't think the names of production are final and I expect the next few weeks to bring a few changes.

stasm and others added 3 commits February 28, 2023 22:50
…cape easier to understand

Co-authored-by: Richard Gibson <richard.gibson@gmail.com>
Co-authored-by: Richard Gibson <richard.gibson@gmail.com>
Co-authored-by: Richard Gibson <richard.gibson@gmail.com>
@macchiati
Copy link
Member

macchiati commented Feb 28, 2023 via email

Co-authored-by: Richard Gibson <richard.gibson@gmail.com>
Co-authored-by: Eemeli Aro <eemeli@gmail.com>
@stasm
Copy link
Collaborator Author

stasm commented Mar 1, 2023

Let me list the suggestions to the PR that I'd like to discuss separately, after it lands, so that we can keep track of them all easily.

@stasm
Copy link
Collaborator Author

stasm commented Mar 1, 2023

I'm going to merge this now. Please feel free to suggest further improvements by opening new PRs. Please remember to not only edit message.abnf, but to also make the corresponding changes in syntax.md.

@stasm stasm merged commit dee9a34 into main Mar 1, 2023
@stasm stasm deleted the sta/abnf branch March 1, 2023 08:34
srl295 added a commit to srl295/cldr that referenced this pull request Jan 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Choose BNF syntax for describing the grammar
7 participants