whitespace in the EBNF #340

Closed
aphillips opened this issue Feb 11, 2023 · 13 comments · Fixed by #344
Labels
Agenda+ (Requested for upcoming teleconference), syntax (Issues related with syntax or ABNF)

Comments

@aphillips
Member

Is your feature request related to a problem? Please describe.
The EBNF is a bit handwavy about whitespace. As it is currently written, no whitespace is permitted in places where we often write spaces in our examples, e.g. around the = in statements like let $foo = (bar)

Describe the solution you'd like
Go through the EBNF and ensure that we permit (or disallow!) whitespace appropriately. I intend to file a PR for this.

Note that we should discuss whether we permit LWSP or WSP or just our WhiteSpace production. There are some okay-ish arguments for LWSP or just WSP.

Describe why your solution should shape the standard
Parsers will be written from the EBNF. It should be correct and complete.

Additional context or examples
See above.

aphillips added the syntax and Agenda+ labels on Feb 11, 2023
@stasm
Collaborator

stasm commented Feb 12, 2023

This is something I intended to do over the end-of-the-year break, but got sick instead. I have a WIP branch in which I ran into some issues with defining the whitespace in when foo vs. when(foo). I'll see if I can rebase and update it today.

@aphillips
Member Author

Thanks @stasm. In that case, I won't invest more time in fixing the EBNF and will wait on your PR.

@stasm
Collaborator

stasm commented Feb 13, 2023

I worked on this yesterday and this morning and got a bit stuck. Let me try to document my attempt.

My goal was to encode in the EBNF the following two requirements.

  1. Whitespace inside patterns must be explicitly preserved. For example, the whitespace around { this } must be preserved in the parsed output.
  2. Whitespace outside patterns can be ignored. In fact, one of the design constraints for the syntax states: "Whitespace outside the translatable content should be insignificant. It should be possible to define a message entirely on a single line with no ambiguity, as well as to format it over multiple lines for clarity."

There are a few follow-up questions and edge-cases related to (2) which require some more consideration:

  1. Is whitespace required around curly braces?
    • let $foo={...} or let $foo = {...}?
    • match{...} or match {...}?
    • My opinion: both are fine → no, whitespace around curly braces is not required.
  2. Is space required after let?
    • let$foo={...} or let $foo={...}?
    • My opinion: forbid let$foo → yes, whitespace after let is required.
  3. In a function call, is whitespace required after a literal option value?
    • {:func opt=(literal)opt=literal} or {:func opt=(literal) opt=literal}?
    • My opinion: both are fine → no, whitespace is not required after a literal option value.
  4. Is whitespace required between consecutive literals?
    • when (literal)(literal) or when (literal) (literal)?
    • My opinion: both are fine → no, whitespace is not required between consecutive literals.
  5. Is whitespace required around the default-key star?
    • when* or when *?
    • when ** or when * *?
    • when (literal)* or when (literal) *?
    • My opinion: all are fine → no, not required (although I wouldn't recommend the space-less variant).

I'd like us to agree on all of them before this issue can be fixed.

@aphillips
Member Author

@stasm Thanks.

Why not post the PR? I think most of your judgements in the follow-up questions are correct. The differences in the EBNF should be minor. We can have a hot debate about whether LWSP or plain WSP should be used (and where). I think the statement in (2) is mostly but not entirely correct, e.g. space is sometimes used as a separator (next to let... are there any others?).

I also noted your comments elsewhere about which sort of BNF to use and we should discuss that.

@stasm
Collaborator

stasm commented Feb 13, 2023

Some more thoughts on context-free grammars.

So far we've managed to keep spec/message.ebnf an LL(1) grammar, at least according to REx, which I mentioned in #342. This was possible due to how REx is able to ignore whitespace in most cases, and requires additional markup (/* ws: explicit */) to not ignore it in certain productions.

I'm not fond of making our grammar specific to one tool, which is why I had attempted to define whitespace rules in the EBNF explicitly before. However, I'm not sure I know how to do this right. The main issue boils down to this: what looks intuitive is oftentimes not LL(1).


For example, I'd like to allow when(literal) but disallow whenkey. One way to think about it is something like the following:

'when' (s+ Nmtoken | s* (Literal | '*'))+
        |            |
        + At least one space required before a "bare" Nmtoken.
                     |
                     + All whitespace optional before a literal or *.

This representation suffers from the so-called first/first conflict, which is common in LL grammars. When the parser sees a space after when it doesn't know which production to choose: s+ Nmtoken or s* (Literal | '*').
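
To make the conflict concrete, here is a minimal Python sketch that computes the FIRST sets of the two alternatives and intersects them. The terminal class names ("s", "Nmtoken", "Literal", "*") and the helper names are placeholders chosen for this illustration; they are not taken from spec/message.ebnf or from REx.

```python
# Each item is (set of terminal classes, quantifier), where the quantifier is
# "1" (exactly once), "+" (one or more) or "*" (zero or more).
# The terminal class names are placeholders for this sketch.

def first_of_sequence(seq):
    """Return the terminal classes that can start `seq`."""
    first = set()
    for terminals, quantifier in seq:
        first |= terminals
        if quantifier != "*":  # a non-nullable item ends the scan
            break
    return first

def first_first_conflict(alt_a, alt_b):
    """Terminal classes on which a one-token lookahead cannot choose."""
    return first_of_sequence(alt_a) & first_of_sequence(alt_b)

# 'when' (s+ Nmtoken | s* (Literal | '*'))+
bare_key = [({"s"}, "+"), ({"Nmtoken"}, "1")]
delimited_key = [({"s"}, "*"), ({"Literal", "*"}, "1")]

print(first_first_conflict(bare_key, delimited_key))  # {'s'}
```

Both alternatives can begin with whitespace, so a non-empty intersection is exactly the first/first conflict described above.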

This particular issue can be solved by left-factoring the production in question:

'when' ((Literal | '*') | s+ (Literal | '*' | Nmtoken))+
        |                 |
        + Either something without a space...
                          |
                          + or the same thing OR a Nmtoken with a space.

This is now LL(1), but arguably is also less readable for a human reader.
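
As a rough illustration of why the left-factored form needs only one character of lookahead, here is a hand-written Python sketch of that production. It is not the project's parser: the character set for Nmtoken, the treatment of a literal as anything between parentheses, and the function name parse_when_keys are all simplifying assumptions, and the input is assumed to end right after the last key.

```python
import string

WS = " \t\r\n"
# Assumption: a simplified stand-in for the real Nmtoken character set.
NMTOKEN_CHARS = set(string.ascii_letters + string.digits + "_-")

def parse_when_keys(text):
    """Parse 'when' ((Literal | '*') | s+ (Literal | '*' | Nmtoken))+."""
    if not text.startswith("when"):
        raise SyntaxError("expected 'when'")
    pos = len("when")
    keys = []
    while pos < len(text):
        ch = text[pos]  # the single character of lookahead
        spaced = ch in WS
        if spaced:
            # Second alternative: consume s+, then look at the key itself.
            while pos < len(text) and text[pos] in WS:
                pos += 1
            if pos == len(text):
                raise SyntaxError("expected a key after whitespace")
            ch = text[pos]
        if ch == "(":
            end = text.find(")", pos)
            if end == -1:
                raise SyntaxError("unterminated literal")
            keys.append(text[pos:end + 1])
            pos = end + 1
        elif ch == "*":
            keys.append("*")
            pos += 1
        elif spaced and ch in NMTOKEN_CHARS:
            # A bare Nmtoken is only reachable after whitespace.
            start = pos
            while pos < len(text) and text[pos] in NMTOKEN_CHARS:
                pos += 1
            keys.append(text[start:pos])
        else:
            raise SyntaxError(f"unexpected {ch!r} at offset {pos}")
    if not keys:
        raise SyntaxError("expected at least one key")
    return keys

print(parse_when_keys("when(foo) bar *"))  # ['(foo)', 'bar', '*']
```

With these assumptions, when(foo) is accepted while whenkey raises a SyntaxError, because a bare Nmtoken can only be reached through the branch that has already consumed whitespace.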


There are other constructs, however, which I don't know how to refactor while keeping the grammar LL(1). Assuming a slightly simplified syntax, I'd like to define that whitespace is required between function options, and optional at the end of a placeholder.

Option ::= Name s* '=' s* (Literal | Nmtoken | Variable)
FunctionCall ::= FunctionName (s+ Option)* s*
                                           |
                                           + This whitespace could also be defined
                                             elsewhere, e.g. before the closing '}'
                                             of the Placeholder production.

This is also rather understandable and hopefully readable for a human. It also seems to be a fairly standard way of describing a set of repeated symbols. For example the XML spec defines the start tag as follows:

[40] STag ::= '<' Name (S Attribute)* S? '>'

The problem is that this, again, is not LL(1). This is an example of a first/follow conflict. When the parser is done parsing an Option and sees whitespace, it doesn't know if it should expand another (s+ Option) or instead continue to the trailing s*.

I don't know how to refactor this into LL(1), or even if it's possible.
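
For what it's worth, a hand-written parser can usually sidestep this particular conflict by consuming the whitespace first and only then deciding whether another option follows. That corresponds roughly to a restructured production such as Options ::= (s+ (Option Options)?)?, which is an assumption of this sketch and not something proposed in this thread. The Python below is a minimal illustration for the simplified syntax above: the name character set, the helper names, and the choice to treat the end of the string as a stand-in for the closing '}' are all hypothetical.

```python
import string

WS = " \t\r\n"
# Assumption: a simplified stand-in for the real Name/Nmtoken character sets.
NAME_CHARS = set(string.ascii_letters + string.digits + "_-")

def _skip_ws(text, pos):
    while pos < len(text) and text[pos] in WS:
        pos += 1
    return pos

def _read_name(text, pos):
    start = pos
    while pos < len(text) and text[pos] in NAME_CHARS:
        pos += 1
    if pos == start:
        raise SyntaxError(f"expected a name at offset {start}")
    return text[start:pos], pos

def _read_value(text, pos):
    """Read a (literal), a $variable, or a bare Nmtoken."""
    if pos < len(text) and text[pos] == "(":
        end = text.find(")", pos)
        if end == -1:
            raise SyntaxError("unterminated literal")
        return text[pos:end + 1], end + 1
    if pos < len(text) and text[pos] == "$":
        name, pos = _read_name(text, pos + 1)
        return "$" + name, pos
    return _read_name(text, pos)

def parse_function_call(text):
    """Parse ':name (s+ option)* s*' up to the end of `text`."""
    if not text.startswith(":"):
        raise SyntaxError("expected ':'")
    name, pos = _read_name(text, 1)
    options = []
    while pos < len(text):
        if text[pos] not in WS:
            raise SyntaxError(f"expected whitespace at offset {pos}")
        pos = _skip_ws(text, pos)
        if pos == len(text):
            break  # the whitespace turned out to be trailing
        key, pos = _read_name(text, pos)
        pos = _skip_ws(text, pos)
        if pos == len(text) or text[pos] != "=":
            raise SyntaxError(f"expected '=' at offset {pos}")
        pos = _skip_ws(text, pos + 1)
        value, pos = _read_value(text, pos)
        options.append((key, value))
    return name, options

print(parse_function_call(":func opt=(literal) opt2=bar "))
```

The decision between "another option" and "trailing whitespace" is made after the whitespace has been consumed, by looking at a single following character. Whether the equivalent grammar is acceptable to a given LL(1) tool is a separate question that this sketch does not settle.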


With all of the above in mind, perhaps LL(1) is too strict? We haven't documented it as a hard requirement, although we did mention in a few discussions about the syntax that it would be nice to have.

I think there are a number of paths forward from here:

  1. Drop the LL(1) requirement, and instead define an LL(k) grammar (although I'm not sure how to fix the second problem) or an LR grammar. This boils down to building an LR parser that can parse the grammar that we define.
  2. Go for LL(1) with backtracking. This forfeits the complexity guarantee of O(n) where n is the length of input, but I think it would solve the issues I outlined above.
  3. Explicitly split the grammar into tokenization and parsing steps. By operating on well-defined tokens, we could control and forbid things like whenkey (missing the required space) -- they would end up as unrecognized tokens.
  4. Maybe try a parsing expression grammar (PEG) instead? See "Choose BNF syntax for describing the grammar" (#342).
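
To illustrate option 3, here is a minimal Python sketch of a separate tokenization step. The token classes, the regular expressions, and the keyword set are assumptions made for this example and do not reflect the actual productions in spec/message.ebnf.

```python
import re

# Assumption: simplified token shapes, chosen only for illustration.
TOKEN_RE = re.compile(r"""
    (?P<WS>\s+)
  | (?P<LITERAL>\([^)]*\))
  | (?P<STAR>\*)
  | (?P<WORD>[A-Za-z0-9_-]+)
""", re.VERBOSE)

KEYWORDS = {"let", "match", "when"}

def tokenize(text):
    """Split `text` into (kind, value) tokens, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected {text[pos]!r} at offset {pos}")
        kind, value = m.lastgroup, m.group()
        if kind == "WORD":
            # A word is a keyword only if the whole word matches one.
            kind = "KEYWORD" if value in KEYWORDS else "NMTOKEN"
        if kind != "WS":
            tokens.append((kind, value))
        pos = m.end()
    return tokens

print(tokenize("when(foo)"))  # [('KEYWORD', 'when'), ('LITERAL', '(foo)')]
print(tokenize("whenkey"))    # [('NMTOKEN', 'whenkey')]
```

Because words are matched greedily, whenkey never produces a when keyword token, so a parser that expects the keyword would reject it. Note that dropping whitespace tokens, as this sketch does, also means the parser could no longer require whitespace between keys; a real design would have to decide whether to keep them.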

@stasm
Collaborator

stasm commented Feb 13, 2023

> Why not post the PR?

I got stuck (as you can probably tell from my comments above) and haven't finished it yet. I think what I'd like to do is make changes that make the grammar LL(1) with backtracking, and submit that for discussion.

@eemeli
Collaborator

eemeli commented Feb 13, 2023

My opinions:

  1. Is whitespace required around curly braces?
  • let $foo={...} or let $foo = {...}?
  • match{...} or match {...}?

Whether the value is in braces shouldn't matter; it's more about the context. So I would allow but not require spaces around the =, but I think that each of the match expressions should be separated from its surroundings by whitespace.

  2. Is space required after let?
  • let$foo={...} or let $foo={...}?

Yes, the space after let should be required.

  3. In a function call, is whitespace required after a literal option value?
  • {:func opt=(literal)opt=literal} or {:func opt=(literal) opt=literal}?

Yes, the space between options should be required. Here too it's simpler if the shape of the option value (i.e. literal vs. (literal)) does not affect the spacing requirements.

  4. Is whitespace required between consecutive literals?
  • when (literal)(literal) or when (literal) (literal)?

Yes, as with match the when keys should be space-separated. They should also be space-separated from the {pattern}.

  5. Is whitespace required around the default-key star?
  • when* or when *?
  • when ** or when * *?
  • when (literal)* or when (literal) *?

Yes, spaces should be required around the *. Same logic as before: the shape of the value shouldn't affect its surrounding spacing rules.

@stasm
Collaborator

stasm commented Feb 13, 2023

I spent some more time thinking about this after the meeting and I think I agree that it's better to require whitespace around options and variant keys at all times. In my head I call this the "XML model": in XML, too, attributes must be separated by whitespace even though attr="val"attr="val" could be parsed unambiguously.

I have a stronger conviction that the whitespace around = and {...} should be optional. I'm OK with let $foo={...}, as well as with match{...}{...} and with when key{Hello}.

@eemeli
Collaborator

eemeli commented Feb 13, 2023

Agreed on everything except for this:

> I'm OK with [...] match{...}{...} and with when key{Hello}.

Specifically, I'm concerned that a line like

when one two{foo}

appears to associate the {foo} with just the two, rather than both one and two. Separating the pattern makes it clearer that it's the pattern for the whole line:

when one two {foo}

Following from that, and from the spaces needed within when one two, I would require spaces around the match expressions as well, to maintain the correspondence between the expressions and the keys.

@stasm
Collaborator

stasm commented Feb 13, 2023

I feel like this is a slippery slope. If we require spaces around the match expressions, then why not around the let expressions? Instead, I think when one two{foo} should be valid, but simply not recommended.
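
The "valid, but not recommended" idea could be expressed as a lint rule that runs on input the parser has already accepted. The following Python sketch is purely illustrative: lint_variant is a hypothetical helper, and it assumes the first '{' on the line opens the variant's pattern.

```python
def lint_variant(line):
    """Return style warnings for a variant line that is assumed to be valid."""
    warnings = []
    brace = line.find("{")  # assumption: the first '{' opens the pattern
    if brace > 0 and not line[brace - 1].isspace():
        warnings.append("add whitespace before the pattern's opening '{'")
    return warnings

print(lint_variant("when one two{foo}"))   # one warning
print(lint_variant("when one two {foo}"))  # []
```

Both lines would parse, but only the first one is flagged.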

@stasm
Collaborator

stasm commented Feb 16, 2023

My general approach to whitespace is that I wouldn't want the syntax to punish users for sloppiness. While I have rather strong opinions on how I'd like to see messages formatted, I don't want to impose them on others. Parsing should be lenient. Linting can be strict.

This is why I originally proposed not even requiring whitespace around literals, e.g. when(foo)(bar). However, when(foo)bar and when(foo)* seemed iffy to me. Same with when**. So I accept the argument of consistency: all keys and all option-value pairs must be surrounded by whitespace.

(Incidentally, if we switch to | as the literal delimiter (#263), when|foo||bar| would be harder for me to accept than when(foo)(bar) is.)

For the same reason of leniency in accepting input, I don't want to require whitespace around {...}, both when used to wrap expressions (in let and match), and when used to wrap patterns. It's not how I would ever choose to format a message myself, but why punish users for it with a syntax error?
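
As a sketch of the position described in this comment (keys must be preceded by whitespace, whitespace before {...} is optional, and whitespace inside the pattern is preserved), a variant line could be matched as follows. The regular expressions, the simplified key shapes, and the parse_variant name are assumptions for illustration; nested placeholders inside the pattern are not handled.

```python
import re

# Assumption: a key is a (literal), the default-key star, or a bare token.
KEY = r"(?:\([^)]*\)|\*|[A-Za-z0-9_-]+)"
# Every key needs leading whitespace; whitespace before '{' is optional.
VARIANT_RE = re.compile(rf"when((?:\s+{KEY})+)\s*\{{([^}}]*)\}}")

def parse_variant(text):
    """Return (keys, pattern) for a single 'when ... {pattern}' line."""
    m = VARIANT_RE.fullmatch(text)
    if m is None:
        raise SyntaxError(f"not a valid variant: {text!r}")
    keys = re.findall(KEY, m.group(1))
    pattern = m.group(2)  # kept verbatim, including its whitespace
    return keys, pattern

print(parse_variant("when one two{ foo }"))  # (['one', 'two'], ' foo ')
```

Under these assumptions, when one two{foo} and when one two {foo} are both accepted, while when(foo)(bar) {x} and when**{x} are rejected because the keys are not separated by whitespace.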

@echeran
Collaborator

echeran commented Feb 17, 2023

> My general approach to whitespace is that I wouldn't want the syntax to punish users for sloppiness. While I have rather strong opinions on how I'd like to see messages formatted, I don't want to impose them on others. Parsing should be lenient. Linting can be strict.

The words "sloppiness", "punish", and "impose" are doing a lot of work here, and I want to offer an alternative perspective. Only in the past few years have I worked on projects where source code formatting was strictly enforced. At first, it felt cumbersome for me to include an extra step (to use the tooling), but the payoff was consistency across developers, and no spurious diffs in PR reviews due to formatting. Over time, the side benefit of less cognitive load accrued: I no longer had to worry about things I used to, like manually matching the reviewer's / codebase's subjective preferences of obj.method(1 + 2) vs. obj.method( 1 + 2 ), line lengths, etc. In projects where I saw strict formatting introduced, I saw similar initial arguments against it quickly go away (including a recent case where I unwittingly became the agent of change).

Having tooling to help users author messages is very useful, of course. For MessageFormat, @nbouvrette showed us, way back when, this community tool from @vanwagonet, Online ICU Message Editor, which interactively validates and demonstrates an ICU MessageFormat v1 message pattern.

(Fun anecdote: In cases where the syntax is very regular to begin with, you can create tooling where formatting is 100% predetermined and deterministic.)

I find the strictness actually empowering because it allows me to spend less time thinking about syntax and formatting, so I can spend more time on higher-level concerns.

@stasm
Collaborator

stasm commented Feb 18, 2023

Perhaps sloppy wasn't the right word. Is scrappy better?

> I find the strictness actually empowering because it allows me to spend less time thinking about syntax and formatting, so I can spend more time on higher-level concerns.

@echeran I share the sentiment. I enjoy coding without thinking about the formatting, too, even if sometimes I'd prefer a different particular formatting. But the benefit of not even having to think about and discuss formatting is far greater than that of applying my own preference.

I think we can expect similar tooling to emerge for MessageFormat 2. It might be a bit more involved for strings embedded directly in the source code (but still feasible). But we should also expect that some users won't have access to such tooling, either because of the stack they use, or the limitations of the build system, or the limitations of the editor. In their case, the cost of the grammar's strictness would be entirely theirs.

Strictness is great once you're past the learning curve and when you have good tooling that can help you comply. Learning is a scrappy process which benefits from lenient parsing.

I guess I'm trying to be realistic: I anticipate that developers will want to spend as little time writing MessageFormat syntax as possible. Hence my attempt to relax the grammar and remove as much friction as possible.

4 participants