Define the grammar as an ABNF (RFC 5234) #347

Merged
merged 15 commits into from
Mar 1, 2023

59 changes: 59 additions & 0 deletions spec/message.abnf
@@ -0,0 +1,59 @@
message = [s] *(declaration [s]) body [s]

declaration = let s variable [s] "=" [s] "{" [s] expression [s] "}"
body = pattern
     / (selectors 1*([s] variant))
Comment on lines +1 to +5
Collaborator

@gibson042 Feb 28, 2023

Similar to my comment below, I think these rules would be more readable with conventional OWS and RWS (as in "{optional,required} white space") rules, and column-aligned as in RFC 5234.

Suggested change
-message = [s] *(declaration [s]) body [s]
-declaration = let s variable [s] "=" [s] "{" [s] expression [s] "}"
-body = pattern
-     / (selectors 1*([s] variant))
+message     = OWS *(declaration OWS) body OWS
+declaration = let RWS variable OWS "=" OWS "{" OWS expression OWS "}"
+body        = pattern
+            / (selectors 1*(OWS variant))

and so on.

Collaborator Author

Can we discuss using RWS and OWS in a separate PR? They touch every production in the ABNF.

I'm against column alignment for as long as we expect the grammar to change: realigning generates needless diffs. Let's do it once, when the grammar stabilizes.


pattern = "{" *(text / placeholder) "}"
selectors = match 1*([s] selector)
selector = "{" [s] expression [s] "}"
variant = when 1*(s key) [s] pattern
key = nmtoken / literal / "*"
Comment on lines +7 to +11
Collaborator

Suggested change
-pattern = "{" *(text / placeholder) "}"
-selectors = match 1*([s] selector)
-selector = "{" [s] expression [s] "}"
-variant = when 1*(s key) [s] pattern
-key = nmtoken / literal / "*"
+pattern   = "{" *(text / placeholder) "}"
+selectors = match 1*([s] selector)
+selector  = "{" [s] expression [s] "}"
+variant   = when 1*(s key) [s] pattern
+key       = nmtoken / literal / "*"

Collaborator Author

Thanks for suggesting this. As I said above, I'd prefer not to column-align the ABNF for now, because I don't think the names of the productions are final and I expect the next few weeks to bring a few changes.


placeholder = "{" [s] expression [s] "}"
            / "{" [s] markup-start *(s option) [s] "}"
            / "{" [s] markup-end [s] "}"

expression = ((literal / variable) [s annotation])
           / annotation
annotation = function *(s option)
option = name [s] "=" [s] (literal / nmtoken / variable)
Collaborator

@gibson042 Feb 28, 2023

What are the expected semantics of an option value that is an `nmtoken` but not a `name`, as in e.g. `{:func foo=1}`?

Collaborator Author

It parses as a literal, `"1"`. The implementation of `:func` can interpret it as a number if it makes sense to do so for the `foo` option.

Collaborator

Hrm, so the value of an option is either a variable or a literal, but the literal can be implicit rather than quoted? Are `{:func foo=|1|}` and `{:func foo=1}` therefore indistinguishable?

Member

There's a super-subtle difference between nmtoken and literal values.

An nmtoken value might be validated at parse time, and the characters permitted in an nmtoken are restricted compared with those permitted in a literal. Numeric option values are fairly common in existing formatters; cf. `Intl.NumberFormat` options such as `maximumSignificantDigits`. Other options accept only a limited (enumerated) set of values, which might also be validated at parse time.

A literal value is probably a parsing error (invalid argument) when the function wants a number or an enumerated value. MF's options are untyped, but the underlying implementation might not be.

There may be a "tripping hazard" here for users who can't see the difference between:

`{:func symbol=US$}` (invalid, as `$` is reserved) and `{:func symbol=|US$|}`

Collaborator

> Hrm, so the value of an option is either a variable or a literal, but the literal can be implicit rather than quoted?

Yes.

> Are `{:func foo=|1|}` and `{:func foo=1}` therefore indistinguishable?

As far as I recall, we have not had an explicit discussion on this, but my position would be that during formatting the `:func` handler should not be able to distinguish these two from each other.
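
The position taken in this thread, that an unquoted nmtoken and a quoted literal with the same content should reach the function handler as the same value, can be sketched in Python. This is only an illustration under stated assumptions, not the reference parser: `OptionValue` and `parse_option_value` are hypothetical names, the literal delimiters follow this PR's `literal` rule (`(` and `)`) rather than the `|...|` form used in the comments, and the character classes are reduced to ASCII.

```python
import re
from dataclasses import dataclass

# ASCII-only approximations of the PR's rules (an assumption made for
# brevity; the real name-start/name-char rules admit many more ranges).
NAME_START = r"[A-Za-z_]"
NAME_CHAR = r"[A-Za-z0-9_.\-]"
NMTOKEN_RE = re.compile(rf"{NAME_CHAR}+")
VARIABLE_RE = re.compile(rf"\${NAME_START}{NAME_CHAR}*")
LITERAL_RE = re.compile(r"\((?:[^()\\]|\\[\\()])*\)")


@dataclass(frozen=True)
class OptionValue:
    kind: str   # "literal" or "variable"
    value: str


def parse_option_value(src: str) -> OptionValue:
    """Classify an option value as variable, quoted literal, or nmtoken."""
    if VARIABLE_RE.fullmatch(src):
        return OptionValue("variable", src[1:])
    if LITERAL_RE.fullmatch(src):
        # Drop the delimiters, then resolve \\, \( and \) escapes.
        body = re.sub(r"\\([\\()])", r"\1", src[1:-1])
        return OptionValue("literal", body)
    if NMTOKEN_RE.fullmatch(src):
        # An nmtoken is handed on as an implicit literal.
        return OptionValue("literal", src)
    raise ValueError(f"not a valid option value: {src!r}")
```

With this sketch, `parse_option_value("1")` and `parse_option_value("(1)")` compare equal, while `parse_option_value("US$")` raises `ValueError` because `$` is not a name character.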


; reserved keywords are always lowercase
let = %x6C.65.74 ; "let"
match = %x6D.61.74.63.68 ; "match"
when = %x77.68.65.6E ; "when"

text = 1*(text-char / text-escape)
text-char = %x0-5B ; omit \
          / %x5D-7A ; omit {
          / %x7C ; omit }
          / %x7E-D7FF ; omit surrogates
          / %xE000-10FFFF

literal = "(" *(literal-char / literal-escape) ")"
literal-char = %x0-27 ; omit ( and )
             / %x2A-5B ; omit \
             / %x5D-D7FF ; omit surrogates
             / %xE000-10FFFF

variable = "$" name
function = ":" name
markup-start = "+" name
markup-end = "-" name

name = name-start *name-char ; matches XML https://www.w3.org/TR/xml/#NT-Name
nmtoken = 1*name-char ; matches XML https://www.w3.org/TR/xml/#NT-Nmtokens
name-start = ALPHA / "_"
           / %xC0-D6 / %xD8-F6 / %xF8-2FF
           / %x370-37D / %x37F-1FFF / %x200C-200D
           / %x2070-218F / %x2C00-2FEF / %x3001-D7FF
           / %xF900-FDCF / %xFDF0-FFFD / %x10000-EFFFF
name-char = name-start / DIGIT / "-" / "." / %xB7
          / %x0300-036F / %x203F-2040

text-escape = backslash ( backslash / "{" / "}" )
literal-escape = backslash ( backslash / "(" / ")" )
backslash = %x5C ; U+005C REVERSE SOLIDUS "\"

s = 1*( SP / HTAB / CR / LF )
51 changes: 0 additions & 51 deletions spec/message.ebnf

This file was deleted.

145 changes: 81 additions & 64 deletions spec/syntax.md
@@ -6,8 +6,8 @@
1. [Design Goals](#design-goals)
1. [Design Restrictions](#design-restrictions)
1. [Overview & Examples](#overview--examples)
1. [Simple Messages](#simple-messages)
1. [Simple Placeholders](#simple-placeholders)
1. [Messages](#messages)
1. [Placeholders](#placeholders)
1. [Formatting Functions](#formatting-functions)
1. [Markup Elements](#markup-elements)
1. [Selection](#selection)
@@ -21,14 +21,14 @@
1. [Patterns](#patterns)
1. [Placeholders](#placeholders)
1. [Expressions](#expressions)
1. [Markup Elements](#markup-elements)
1. [Markup](#markup)
1. [Tokens](#tokens)
1. [Text](#text)
1. [Keywords](#keywords)
1. [Text and Literals](#text-and-literals)
1. [Names](#names)
1. [Quoted Strings](#quoted-strings)
1. [Escape Sequences](#escape-sequences)
1. [Whitespace](#whitespace)
1. [Complete EBNF](#complete-ebnf)
1. [Complete ABNF](#complete-abnf)

### Introduction to This Section

@@ -254,8 +254,10 @@ A *message* MUST be delimited with `{` at the start, and `}` at the end. Whitesp
appear outside the delimiters; such whitespace is ignored. No other content is permitted
outside the delimiters.

```ebnf
Message ::= Declaration* ( Pattern | Selector Variant+ )
```abnf
message = [s] *(declaration [s]) body [s]
body = pattern
     / (selectors 1*([s] variant))
```

### Variable Declarations
@@ -264,17 +266,18 @@ A ***declaration*** is an expression binding a variable identifier
within the scope of the message to the value of an expression.
This local variable can then be used in other expressions within the same message.

```ebnf
Declaration ::= 'let' WhiteSpace Variable '=' '{' Expression '}'
```abnf
declaration = let s variable [s] "=" [s] "{" [s] expression [s] "}"
```

### Selectors

A ***selector*** is a statement containing one or more expressions
which will be used to choose one of the *variants* during formatting.

```ebnf
Selector ::= 'match' ( '{' Expression '}' )+
```abnf
selectors = match 1*([s] selector)
selector = "{" [s] expression [s] "}"
```

Examples:
@@ -298,9 +301,9 @@ A ***variant*** is a keyed *pattern*.
The keys are used to match against the selector expressions defined in the `match` statement.
The key `*` is a "catch-all" key, matching all selector values.

```ebnf
Variant ::= 'when' ( WhiteSpace VariantKey )+ Pattern
VariantKey ::= Literal | Nmtoken | '*'
```abnf
variant = when 1*(s key) [s] pattern
key = nmtoken / literal / "*"
```

A _well-formed_ message is considered _valid_ if the following requirements are satisfied:
@@ -325,8 +328,8 @@ This serves 3 purposes:
- The syntax needs to make it as clear as possible which parts of the message body
are translatable and which ones are part of the formatting logic definition.

```ebnf
Pattern ::= '{' (Text | Placeholder)* '}' /* ws: explicit */
```abnf
pattern = "{" *(text / placeholder) "}"
```

Examples:
@@ -341,8 +344,10 @@ Whitespace within a *pattern* is meaningful and MUST be preserved.

A ***placeholder*** contains either an expression or a markup element.

```ebnf
Placeholder ::= '{' (Expression | Markup | MarkupEnd) '}'
```abnf
placeholder = "{" [s] expression [s] "}"
            / "{" [s] markup-start *(s option) [s] "}"
            / "{" [s] markup-end [s] "}"
```

### Expressions
@@ -357,13 +362,11 @@ other than the operand in front of them.

Standalone function calls don't have any operands in front of them.

```ebnf
Expression ::= Operand Annotation? | Annotation
Operand ::= Literal | Variable
Annotation ::= Function Option*
Option ::= Name '=' (Literal | Nmtoken | Variable)
Variable ::= '$' Name /* ws: explicit */
Function ::= ':' Name /* ws: explicit */
```abnf
expression = ((literal / variable) [s annotation])
           / annotation
annotation = function *(s option)
option = name [s] "=" [s] (literal / nmtoken / variable)
```

Examples:
@@ -400,12 +403,6 @@ each with its own syntax.
They mimic XML elements, but do not require well-formedness.
Standalone display elements should be represented as function expressions.

```ebnf
Markup ::= MarkupStart Option*
MarkupStart ::= '+' Name /* ws: explicit */
MarkupEnd ::= '-' Name /* ws: explicit */
```

Examples:

```
@@ -420,7 +417,18 @@ Examples:

The grammar defines the following tokens for the purpose of the lexical analysis.

### Text and literals
### Keywords

The following three keywords are reserved: `let`, `match`, and `when`.

```abnf
; reserved keywords are always lowercase
let = %x6C.65.74 ; "let"
match = %x6D.61.74.63.68 ; "match"
when = %x77.68.65.6E ; "when"
```
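
The keywords are spelled as `%x` code-point sequences rather than as quoted strings because quoted strings in RFC 5234 ABNF match case-insensitively, so `"let"` would also accept `LET`. A minimal Python sketch of the resulting case-sensitive match follows; `KEYWORDS` and `keyword_at` are hypothetical names used only for illustration, and the sketch does not check that the keyword is followed by whitespace.

```python
from typing import Optional

# Code points copied from the ABNF rules above. Each %xNN.NN... terminal
# matches exactly these values, so "LET" or "Let" is not a keyword.
KEYWORDS = {
    "let": bytes([0x6C, 0x65, 0x74]),
    "match": bytes([0x6D, 0x61, 0x74, 0x63, 0x68]),
    "when": bytes([0x77, 0x68, 0x65, 0x6E]),
}


def keyword_at(src: str, pos: int = 0) -> Optional[str]:
    """Return the keyword starting at `pos`, or None if there is none."""
    for name, encoded in KEYWORDS.items():
        if src.startswith(encoded.decode("ascii"), pos):
            return name
    return None
```

Here `keyword_at("let $x = {(1)}")` returns `"let"`, and `keyword_at("LET $x = {(1)}")` returns `None`.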

### Text and Literals

_Text_ is the translatable content of a _pattern_, and _Literal_ is used for matching
variants and providing input to expressions.
@@ -431,19 +439,21 @@ surrogate code points U+D800 through U+DBFF (which cannot be encoded into UTF-8)

All code points are preserved.

#### Text

```ebnf
Text ::= (TextChar | TextEscape)+ /* ws: explicit */
TextChar ::= AnyChar - ('{' | '}' | Esc)
AnyChar ::= [#x0-#x10FFFF] - [#xD800-#xDBFF]
```abnf
text = 1*(text-char / text-escape)
text-char = %x0-5B ; omit \
          / %x5D-7A ; omit {
          / %x7C ; omit }
          / %x7E-D7FF ; omit surrogates
          / %xE000-10FFFF
```

#### Literal

```ebnf
Literal ::= '(' (LiteralChar | LiteralEscape)* ')' /* ws: explicit */
LiteralChar ::= AnyChar - ('(' | ')' | Esc)
```abnf
literal = "(" *(literal-char / literal-escape) ")"
literal-char = %x0-27 ; omit ( and )
             / %x2A-5B ; omit \
             / %x5D-D7FF ; omit surrogates
             / %xE000-10FFFF
```
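
The `text-char` and `literal-char` rules are plain code-point ranges, so they translate directly into range checks. The Python sketch below copies the ranges from the ABNF rules; the function names are hypothetical and the code is an illustration, not part of the specification.

```python
# Inclusive code-point ranges copied from the ABNF rules.
TEXT_CHAR_RANGES = [
    (0x0, 0x5B),         # omit \
    (0x5D, 0x7A),        # omit {
    (0x7C, 0x7C),        # omit }
    (0x7E, 0xD7FF),      # omit surrogates
    (0xE000, 0x10FFFF),
]
LITERAL_CHAR_RANGES = [
    (0x0, 0x27),         # omit ( and )
    (0x2A, 0x5B),        # omit \
    (0x5D, 0xD7FF),      # omit surrogates
    (0xE000, 0x10FFFF),
]


def _in_ranges(ch: str, ranges) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_text_char(ch: str) -> bool:
    """True if `ch` may appear unescaped in translatable text."""
    return _in_ranges(ch, TEXT_CHAR_RANGES)


def is_literal_char(ch: str) -> bool:
    """True if `ch` may appear unescaped inside a literal."""
    return _in_ranges(ch, LITERAL_CHAR_RANGES)
```

The two rules differ only in which delimiters they exclude: `(` is a valid `text-char` but not a `literal-char`, and `{` is the reverse.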

### Names
@@ -465,27 +475,34 @@ In particular, the grammatical feature data [specified in LDML](https://unicode.
and [defined in CLDR](https://unicode-org.github.io/cldr-staging/charts/latest/grammar/index.html)
uses Nmtokens.

```ebnf
Name ::= NameStart NameChar* /* ws: explicit */
Nmtoken ::= NameChar+ /* ws: explicit */
NameStart ::= [a-zA-Z] | "_"
| [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF]
| [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D]
| [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF]
| [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
NameChar ::= NameStart | [0-9] | "-" | "." | #xB7
| [#x0300-#x036F] | [#x203F-#x2040]
```abnf
variable = "$" name
function = ":" name
markup-start = "+" name
markup-end = "-" name
```

```abnf
name = name-start *name-char ; matches XML https://www.w3.org/TR/xml/#NT-Name
nmtoken = 1*name-char ; matches XML https://www.w3.org/TR/xml/#NT-Nmtokens
name-start = ALPHA / "_"
           / %xC0-D6 / %xD8-F6 / %xF8-2FF
           / %x370-37D / %x37F-1FFF / %x200C-200D
           / %x2070-218F / %x2C00-2FEF / %x3001-D7FF
           / %xF900-FDCF / %xFDF0-FFFD / %x10000-EFFFF
name-char = name-start / DIGIT / "-" / "." / %xB7
          / %x0300-036F / %x203F-2040
```
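
The practical difference between `name` and `nmtoken` is the first character: a `name` must begin with a `name-start`, while an `nmtoken` may begin with any `name-char`, including a digit. A Python sketch using regular expressions built from the ranges above shows this; `is_name` and `is_nmtoken` are hypothetical helper names, not part of the specification.

```python
import re

# Character-class bodies transcribed from the name-start and name-char rules.
NAME_START = (
    "A-Za-z_"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
NAME_CHAR = NAME_START + r"0-9.\-" + "\u00B7\u0300-\u036F\u203F-\u2040"

NAME_RE = re.compile(f"[{NAME_START}][{NAME_CHAR}]*")
NMTOKEN_RE = re.compile(f"[{NAME_CHAR}]+")


def is_name(src: str) -> bool:
    """True if the whole string matches the `name` rule."""
    return NAME_RE.fullmatch(src) is not None


def is_nmtoken(src: str) -> bool:
    """True if the whole string matches the `nmtoken` rule."""
    return NMTOKEN_RE.fullmatch(src) is not None
```

For example, `1st` is an `nmtoken` but not a `name`, and `US$` is neither, since `$` is not a `name-char`.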

### Escape Sequences

Escape sequences are introduced by the backslash character (`\`).
They are allowed in translatable text as well as in literals.

```ebnf
Esc ::= '\'
TextEscape ::= Esc Esc | Esc '{' | Esc '}'
LiteralEscape ::= Esc Esc | Esc '(' | Esc ')'
```abnf
text-escape = backslash ( backslash / "{" / "}" )
literal-escape = backslash ( backslash / "(" / ")" )
backslash = %x5C ; U+005C REVERSE SOLIDUS "\"
```
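
Resolving these escapes is a single left-to-right pass: a backslash must be followed by another backslash or by one of the two delimiters of the surrounding construct. The Python sketch below illustrates that for a run of text or for the body of a literal; the function names are hypothetical, and rejecting an unescaped delimiter reflects the `text-char` and `literal-char` rules rather than anything stated in this section.

```python
def _unescape(src: str, delimiters: str) -> str:
    """Resolve backslash escapes; `delimiters` are the escapable characters."""
    out = []
    i = 0
    while i < len(src):
        ch = src[i]
        if ch == "\\":
            # A backslash must be followed by a backslash or a delimiter.
            if i + 1 >= len(src) or src[i + 1] not in delimiters + "\\":
                raise ValueError(f"invalid escape at index {i}")
            out.append(src[i + 1])
            i += 2
        elif ch in delimiters:
            raise ValueError(f"unescaped {ch!r} at index {i}")
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def unescape_text(src: str) -> str:
    """Resolve text-escape sequences in a run of text (no placeholders)."""
    return _unescape(src, "{}")


def unescape_literal(src: str) -> str:
    """Resolve literal-escape sequences in the body of a literal."""
    return _unescape(src, "()")
```

Note that sequences such as `\n` are not defined by the grammar, so this sketch rejects them instead of guessing a meaning.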

### Whitespace
@@ -496,12 +513,12 @@ Inside _patterns_,
whitespace is part of the translatable content and is recorded and stored verbatim.
Whitespace is not significant outside translatable text, except where required by the syntax.

```ebnf
WhiteSpace ::= #x9 | #xD | #xA | #x20 /* ws: definition */
```abnf
s = 1*( SP / HTAB / CR / LF )
```

## Complete EBNF
## Complete ABNF

The complete EBNF is available as [`message.ebnf`](./message.ebnf).
It uses the [W3C flavor](https://www.w3.org/TR/xml/#sec-notation) of the BNF notation.
The grammar is an LL(1) grammar without backtracking.
The grammar is formally defined in [`message.abnf`](./message.abnf)
using the ABNF notation,
as specified by [RFC 5234](https://datatracker.ietf.org/doc/html/rfc5234).