Define the grammar as an ABNF (RFC 5234) #347

stasm · 2023-02-14T11:10:35Z

Based on #344. Fixes #342.

Convert our W3C EBNF grammar to ABNF, as defined by RFC 5234.

I validated the grammar using https://author-tools.ietf.org/abnf.

I was also able to run some fuzz testing by:

generating 1000 random files using abnfgen,
auto-converting the ABNF back to W3C EBNF using https://www.bottlecaps.de/convert/,
generating an LL(2) parser from it using https://www.bottlecaps.de/rex/,
and finally parsing the generated files with the parser.

Not all files parsed correctly. I'll investigate.

Remove message.ebnf.
Update BNF snippets in syntax.md.
Update the Complete EBNF section in syntax.md.

spec/message.abnf

aphillips

This is looking really good. Thanks Stas!

Most of my comments are corner cases and probably I should raise them as separate issues.

spec/message.abnf

aphillips · 2023-02-14T16:51:04Z

spec/message.abnf

+markup-start = "+" name
+markup-end = "-" name
+name = name-start *name-char
+nmtoken = 1*name-char


Why is nmtoken more relaxed about starters than name? Is it because we want to allow digits?

Note that nmtoken is basically an unquoted literal which we provide for convenience, e.g. so you can say:

when foo {{$count :number option=bar}}

instead of having to quote foo and bar as (foo) and (bar)

I guess what I'm digging at here is whether we can simplify parsing rules given that we're mostly breaking on whitespace or on specific characters like =. The less we use specific character lists or rules, the easier it is to implement.

name and nmtoken are taken directly from https://www.w3.org/TR/xml/#NT-Name. And yes, the goal was to make it convenient to use (a) numerical option values (fractionDigits=1) and (b) option values such as year=2-digit.
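To make the name vs. nmtoken distinction concrete, here is a minimal sketch. It assumes an ASCII-only approximation of XML 1.0's NameStartChar and NameChar; the real productions (and the ones in message.abnf) also admit many non-ASCII ranges, which are omitted. The helper names `is_name` and `is_nmtoken` are hypothetical.

```python
import re

# ASCII-only approximation of XML 1.0 NameStartChar / NameChar.
# Assumption: the non-ASCII ranges of the real productions are left out.
NAME_START = r"[A-Za-z_:]"
NAME_CHAR = r"[A-Za-z_:0-9.\-]"

NAME_RE = re.compile(rf"{NAME_START}{NAME_CHAR}*")   # name    = name-start *name-char
NMTOKEN_RE = re.compile(rf"{NAME_CHAR}+")            # nmtoken = 1*name-char


def is_name(s: str) -> bool:
    """True if the whole string matches the (approximated) name production."""
    return NAME_RE.fullmatch(s) is not None


def is_nmtoken(s: str) -> bool:
    """True if the whole string matches the (approximated) nmtoken production."""
    return NMTOKEN_RE.fullmatch(s) is not None


if __name__ == "__main__":
    for candidate in ["bar", "1", "2-digit", "US$"]:
        print(candidate, is_name(candidate), is_nmtoken(candidate))
```

Under this approximation, values such as 1 and 2-digit are nmtokens but not names, which is the convenience case described above.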

More broadly, the use of nmtoken is intended to align MF2 with LDML. CLDR data is defined as attribute enums (e.g. in ldmlSupplemental.dtd), which implies that they are Nmtokens.

nmtoken however is actually exposed to more than just LDML/CLDR constructs. Developers of selectors and formatters will use them to define key values and option values. That is, our user base is wider than just CLDR stuff or ICU developers. If we define values that are a strict superset of those restrictions then any XML, LDML, or CLDR values will work. LDML being a dialect of XML means that they are restricted by XML, but we don't have to be. Thoughts?

Wait, are you suggesting that keys and option values should be even less restricted than nmtoken?

My goal here (last year when I wrote this for my proposal) was to make sure that whatever appears in LDML attlists (now or in the future) can be a valid key or an option value. That's why I chose Nmtoken for these two productions.

Yes. I'm suggesting that any characters we omit we know why we are omitting them. Some implementations will probably not check the values so long as they are able to parse the various tokens (which don't really depend on the characters inside the token); others will pedantically check the character ranges.

To be honest, most developers will write enumerated values in ASCII for keys and option values. But for cases when they don't the restrictions on users should make sense in our world (and not just because someone else did it)

@aphillips

LDML being a dialect of XML

I hope you meant "LDML being an application of XML". XML doesn't have any dialects (except for XML 1.1, maybe).

@duerst exactly so

spec/message.abnf

aphillips

Leaving aside my recently-filed issues that are not on topic, I think this is merge-worthy.

stasm · 2023-02-14T18:29:28Z

Thanks, @aphillips. I researched using ABNF after I started working on whitespace rules. This PR is unfortunately based on my work in #344. I'll keep it as a draft while #344 is being discussed.

spec/message.abnf

Co-authored-by: Caleb Maclennan <caleb@alerque.com>

eemeli

A few editorial suggestions inline, but I'm fine with merging even without them.

spec/message.abnf

spec/syntax.md

Co-authored-by: Eemeli Aro <eemeli@gmail.com>

gibson042

This is great! I support an initial mechanical conversion to ABNF, but have some suggestions for followup improvements.

spec/message.abnf

gibson042 · 2023-02-28T17:47:25Z

spec/message.abnf

+message = [s] *(declaration [s]) body [s]
+
+declaration = let s variable [s] "=" [s] "{" [s] expression [s] "}"
+body = pattern
+     / (selectors 1*([s] variant))


Similar to my comment below, I think these rules would be more readable with conventional OWS and RWS (as in "{optional,required} white space") rules, and column-aligned as in RFC 5234.

Suggested change

-message = [s] *(declaration [s]) body [s]
-
-declaration = let s variable [s] "=" [s] "{" [s] expression [s] "}"
-body = pattern
-     / (selectors 1*([s] variant))
+message     = OWS *(declaration OWS) body OWS
+
+declaration = let RWS variable OWS "=" OWS "{" OWS expression OWS "}"
+body        = pattern
+            / (selectors 1*(OWS variant))

and so on.

Can we discuss using RWS and OWS in a separate PR? They touch every production in the ABNF.

I'm against column alignment for as long as we expect the grammar to change. They generate needless diffs. Let's do it once when the grammar stabilizes.

spec/message.abnf

gibson042 · 2023-02-28T18:09:59Z

spec/message.abnf

+expression = ((literal / variable) [s annotation])
+           / annotation
+annotation = function *(s option)
+option = name [s] "=" [s] (literal / nmtoken / variable)


What are the expected semantics of an option value that is a nmtoken but not a name, as in e.g. {:func foo=1}?

It parses as a literal, "1". The implementation of :func can interpret it as a number if it makes sense to do so for the foo option.

Hrm, so the value of an option is either a variable or a literal, but the literal can be implicit rather than quoted? Are {:func foo=|1|} and {:func foo=1} therefore indistinguishable?

There's a super-subtle difference between nmtoken and literal values.

An nmtoken value might be validated at parse time, and the values that can be present in an nmtoken are restricted compared to the values permitted in a literal. The use of numbers is fairly common in existing formatters (cf. Intl.NumberFormat options such as maximumSignificantDigits). Other options accept a limited, enumerated set of values, which might also be validated at parse time.

A literal value probably is a parsing error (invalid argument) when the function wants a number or enumerated value. MF's options are untyped, but the underlying implementation might not be.

There may be a "tripping hazard" here for users who can't see the difference between:

{:func symbol=US$} (invalid, as $ is reserved) and {:func symbol=|US$|}

Hrm, so the value of an option is either a variable or a literal, but the literal can be implicit rather than quoted?

Yes.

Are {:func foo=|1|} and {:func foo=1} therefore indistinguishable?

As far as I recall we have not had an explicit discussion on this, but my position would be that during formatting the :func handler should not be able to distinguish these two from each other.
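A rough sketch of that position, for illustration only: both spellings of an option value reduce to the same literal before the function handler sees them. Everything here is an assumption rather than agreed behavior: `parse_option_value` is a hypothetical helper, the nmtoken check is an ASCII-only approximation, and escape sequences inside |...| are not handled.

```python
import re

# ASCII-only approximation of nmtoken (assumption; non-ASCII ranges omitted).
NMTOKEN_RE = re.compile(r"[A-Za-z_:0-9.\-]+")


def parse_option_value(src: str) -> tuple[str, str]:
    """Classify an option value as ("variable", name) or ("literal", text).

    Hypothetical helper: a quoted literal and a bare nmtoken both come out
    as ("literal", text), so a handler cannot tell them apart.
    """
    if src.startswith("$"):
        return ("variable", src[1:])
    if len(src) >= 2 and src[0] == "|" and src[-1] == "|":
        # Quoted literal; escapes inside the pipes are ignored in this sketch.
        return ("literal", src[1:-1])
    if NMTOKEN_RE.fullmatch(src):
        # Unquoted nmtoken, treated as an implicit literal.
        return ("literal", src)
    raise ValueError(f"not a valid option value: {src!r}")


if __name__ == "__main__":
    print(parse_option_value("1"))
    print(parse_option_value("|1|"))
    print(parse_option_value("|US$|"))
```

With this shape, foo=1 and foo=|1| yield the same value, while foo=US$ is rejected at parse time and foo=|US$| is accepted, matching the "tripping hazard" example above.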

spec/message.abnf

gibson042 · 2023-02-28T18:38:19Z

spec/message.abnf

+pattern = "{" *(text / placeholder) "}"
+selectors = match 1*([s] selector)
+selector = "{" [s] expression [s] "}"
+variant = when 1*(s key) [s] pattern
+key = nmtoken / literal / "*"


Suggested change

-pattern = "{" *(text / placeholder) "}"
-selectors = match 1*([s] selector)
-selector = "{" [s] expression [s] "}"
-variant = when 1*(s key) [s] pattern
-key = nmtoken / literal / "*"
+pattern   = "{" *(text / placeholder) "}"
+selectors = match 1*([s] selector)
+selector  = "{" [s] expression [s] "}"
+variant   = when 1*(s key) [s] pattern
+key       = nmtoken / literal / "*"

Thanks for suggesting this. As I said above, I'd prefer not to column-align the ABNF for now because I don't think the names of the productions are final and I expect the next few weeks to bring a few changes.

spec/message.abnf

Co-authored-by: Richard Gibson <richard.gibson@gmail.com>

macchiati · 2023-02-28T23:09:02Z

I understand why now (ugly syntax from ABNF)

On Tue, Feb 28, 2023 at 2:58 PM Stanisław Małolepszy wrote:

In spec/message.abnf:

+let = %x6C.65.74
+match = %x6D.61.74.63.68
+when = %x77.68.65.6E

My understanding is that let = "let" would make the let keyword case-insensitive, due to how literal text strings work in ABNF. That's why we'd need the case-sensitive extension: let = %s"let". See also: https://github.com/unicode-org/message-format-wg/pull/347/files#diff-a33e1c859f2aaad86b37ea3b3b5e8d45331199d6e879c5aba38eca2f23f01865.
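The hex spelling of the keywords can be reproduced mechanically. The sketch below is only an illustration, not part of the PR: `hex_terminal` and `matches_quoted_string` are hypothetical helper names. The first spells an ASCII keyword as an RFC 5234 numeric terminal; the second models the case-insensitive matching of quoted strings that motivates the hex form.

```python
def hex_terminal(keyword: str) -> str:
    """Spell an ASCII keyword as an RFC 5234 numeric terminal (%xNN.NN...)."""
    if not keyword or not keyword.isascii():
        raise ValueError("expected a non-empty ASCII keyword")
    return "%x" + ".".join(f"{ord(ch):02X}" for ch in keyword)


def matches_quoted_string(literal: str, candidate: str) -> bool:
    """Model RFC 5234 quoted strings, which match case-insensitively."""
    return literal.lower() == candidate.lower()


if __name__ == "__main__":
    for kw in ["let", "match", "when"]:
        print(f"{kw} = {hex_terminal(kw)}")
    # A quoted "let" would also accept "LET", which the hex form avoids.
    print(matches_quoted_string("let", "LET"))
```

The numeric form matches exact code points only, so the keywords stay lowercase without relying on the %s"..." extension from RFC 7405.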

Co-authored-by: Richard Gibson <richard.gibson@gmail.com>

spec/message.abnf

Co-authored-by: Eemeli Aro <eemeli@gmail.com>

stasm · 2023-03-01T08:22:22Z

Let me list the suggestions to the PR that I'd like to discuss separately, after it lands, so that we can keep track of them all easily.

Perhaps merge name and nmtoken. Discussed by @aphillips in #347 (comment).
Decide whether boolean options for markup should be allowed, e.g. {+checkbox checked}. Suggested by @aphillips in #347 (comment).
Perhaps drop the function and markup-start productions. Suggested by @eemeli in #347 (comment).
Use RWS and OWS. Suggested by @gibson042 in #347 (comment) and #347 (comment).
Column-align productions in the entire file, or at least inside blocks of related productions. Suggested by @gibson042 in #347 (comment).
Decide whether foo=x and foo=|x| should have different AST representations. See #347 (comment).

stasm · 2023-03-01T08:33:25Z

I'm going to merge this now. Please feel free to suggest further improvements by opening new PRs. Please remember to not only edit message.abnf, but to also make the corresponding changes in syntax.md.

alerque reviewed Feb 14, 2023

aphillips reviewed Feb 14, 2023

stasm mentioned this pull request Feb 14, 2023

Make syntax keywords case-sensitive #349

Closed

aphillips approved these changes Feb 14, 2023

aphillips reviewed Feb 17, 2023

aphillips mentioned this pull request Feb 17, 2023

Document: "no nesting function calls" #353

Closed

stasm mentioned this pull request Feb 17, 2023

Add explicit whitespace definitions #344

Merged

aphillips reviewed Feb 18, 2023

Base automatically changed from sta/explicit-whitespace to main February 28, 2023 09:28

stasm and others added 7 commits February 28, 2023 12:03

Define the grammar as an ABNF (RFC 5234)

7d193e7

Remove ALPHA and DIGIT, which are built-in

8c3a4e5

Co-authored-by: Caleb Maclennan <caleb@alerque.com>

Refactor text-char and literal-char to use non-ascii-char

27edd64

Define let, match, when as separate tokens

ec6c07e

Remove the EBNF file

52fe0ca

Inline the non-ascii-char production

fe29172

Update the ABNF snippets inside spec/syntax.md

18ec7c1

stasm force-pushed the sta/abnf branch from 84bbfa1 to 18ec7c1 Compare February 28, 2023 11:36

stasm marked this pull request as ready for review February 28, 2023 11:38

stasm requested review from aphillips and eemeli February 28, 2023 11:39

alerque approved these changes Feb 28, 2023

stasm requested review from echeran and mihnita February 28, 2023 11:39

eemeli approved these changes Feb 28, 2023

eemeli reviewed Feb 28, 2023

aphillips approved these changes Feb 28, 2023

Simplify text-escape and literal-escape

6ffc08d

Co-authored-by: Eemeli Aro <eemeli@gmail.com>

stasm mentioned this pull request Feb 28, 2023

Change the literal delimiter to the vertical pipe character. #359

Merged

gibson042 approved these changes Feb 28, 2023

stasm and others added 3 commits February 28, 2023 22:50

Introduce the backslash production to make text-escape and literal-escape easier to understand

8a00351

Co-authored-by: Richard Gibson <richard.gibson@gmail.com>

Denormalize the alternatives inside the placeholder production

7ef8a11

Co-authored-by: Richard Gibson <richard.gibson@gmail.com>

Link XML's Name and Nmtoken

f3a73d8

Co-authored-by: Richard Gibson <richard.gibson@gmail.com>

Use the builtin ABNF whitespace productions

ad48d07

Co-authored-by: Richard Gibson <richard.gibson@gmail.com>

eemeli reviewed Mar 1, 2023

Remove the unused markup production

4787729

Co-authored-by: Eemeli Aro <eemeli@gmail.com>

stasm added 2 commits March 1, 2023 09:26

Add comment about keywords being lowercase

4205cd2

Update syntax.md with the recent suggestions to the ABNF

517612d

stasm merged commit dee9a34 into main Mar 1, 2023

stasm deleted the sta/abnf branch March 1, 2023 08:34

stasm mentioned this pull request Nov 3, 2023

Adopt RFC 7405 (Case-Sensitive String Support in ABNF) for grammar #501

Closed

srl295 added a commit to srl295/cldr that referenced this pull request Jan 1, 2025

CLDR-18197 kbd: update spec to mention abnf

133dbf7

- hat tip: unicode-org/message-format-wg#347
