Add explicit whitespace definitions #344

stasm · 2023-02-13T22:14:44Z

Fixes #340. This PR implements the following whitespace rules:

Is whitespace required around curly braces?

let $foo={...} or let $foo = {...}?

match{...} or match {...}?

No, whitespace is optional around braced expressions (as in let $foo={...} and match{...}{...}) nor around braced patterns (when key{Hello} is OK). Whitespace is optional around the =.

Is space required after let?

let$foo={...} or let $foo={...}?

Yes, whitespace is required after let.

In a function call, is whitespace required after a literal option value?

{:func opt=(literal)opt=literal} or {:func opt=(literal) opt=literal}?

Yes, whitespace is required between all option pairs, regardless of whether the option value is a literal or not. Whitespace is optional around the =.

Is whitespace required between consecutive literals?

when (literal)(literal) or when (literal) (literal)?

Yes, whitespace is required between all when keys, regardless of the kind of the keys.

Is whitespace required around the default-key star?

when* or when *?

when ** or when * *?

when (literal)* or when (literal) *?

Same as above, whitespace is required between all when keys.

The grammar is now LL(1) with backtracking, however I noticed that with one functional change compared to the current main, we could make it LL(2) without backtracking. Namely, we can do it by forbidding empty placeholders.

There are a few follow-ups I'd like to open PRs for after this one:

Convert to ABNF, as per Choose BNF syntax for describing the grammar #342.
Rename some productions, e.g. Function → FunctionName, Annotation → FunctionCall. Please review this PR without paying too much attention to naming.
Allow both lowercase and uppercase spelling of the let, match, and when keywords.

spec/message.ebnf

eemeli

I think we should require more spaces; see line comments for specifics. My opinions here are not set in stone, but I would prefer a bit more weight to be placed on the reader experience. So while e.g. match{$foo}{$bar} is easily parsable by a machine, I would maintain that a human would find match {$foo} {$bar} faster to understand. As with all code, people will spend far more time reading rather than writing MF2.

spec/message.ebnf

stasm · 2023-02-13T23:17:08Z

I think we should require more spaces; see line comments for specifics. My opinions here are not set in stone, but I would prefer a bit more weight to be placed on the reader experience. So while e.g. match{$foo}{$bar} is easily parsable by a machine, I would maintain that a human would find match {$foo} {$bar} faster to understand. As with all code, people will spend far more time reading rather than writing MF2.

I agree with the sentiment, but the grammar is not, in my opinion, the right vehicle to convey such guidelines. There's precedent and a strong convention in many programming languages that braced blocks can be positioned in a rather free-form manner relative to other symbols. I think we should respect it. Otherwise, it's going to turn out that our syntax is actually really quite whitespace-significant outside patterns, too.

spec/message.ebnf

…cktracking.

aphillips

Looking good as a start. Not a full review--just a couple of quick comments.

aphillips · 2023-02-13T23:45:14Z

spec/message.ebnf

-/* WhiteSpace */
-WhiteSpace ::= #x9 | #xD | #xA | #x20 /* ws: definition */
+/* White space */
+s ::= (#x9 | #xD | #xA | #x20)+


Use LWSP, which is built-in, instead of defining our own?

Looking at the definition in https://datatracker.ietf.org/doc/html/rfc5234#appendix-B, LWSP excludes standalone LF, which I don't think we want to do?

Also, the same appendix says that it contains some basic rules that are in common use and that [b]asic rules are in uppercase. Does common use mean that they're built-in, or perhaps that they're expected to be defined in a certain way?

The reason I'm asking is that I found an ABNF parser at https://author-tools.ietf.org/abnf and it complained about DIGIT and ALPHA being undefined. Bug or intended behavior?

I would prefer keeping an explicit whitespace definition within the spec, rather than introducing an external dependency.

Also, the same appendix says that it contains some basic rules that are in common use and that [b]asic rules are in uppercase. Does common use mean that they're built-in, or perhaps that they're expected to be defined in a certain way?

I think in an RFC, you would usually say that the following rules are imported from RFC 5234, and give a list of the rules you need.

The reason I'm asking is that I found an ABNF parser at https://author-tools.ietf.org/abnf and it complained about DIGIT and ALPHA being undefined. Bug or intended behavior?

DIGIT and ALPHA are too frequent for such a bug not having been detected. I'd guess you have to add the rules you reference from RFC 5234 by hand.

Bill's parser (which powers author-tools) has always complained about the built-ins. I think most I-D authors ignore the warnings (I know I learned to). For example, if you run BCP47's ABNF in the tool, it complains about all of the built-ins (and BCP47 is a published set of RFCs blessed by the editor):

Language-Tag = langtag / privateuse / grandfathered langtag = language [ "-" script ] [ "-" region ] *( "-" variant ) *( "-" extension ) [ "-" privateuse ] ; ALPHA UNDEFINED language = ( 2*3ALPHA [ "-" extlang ] ) / 4ALPHA / 5*8ALPHA extlang = 3ALPHA *2( "-" 3ALPHA ) script = 4ALPHA ; DIGIT UNDEFINED region = 2ALPHA / 3DIGIT variant = 5*8alphanum / ( DIGIT 3alphanum ) extension = singleton 1*( "-" ( 2*8alphanum ) ) singleton = DIGIT / %x41-57 / %x59-5A / %x61-77 / %x79-7A privateuse = "x" 1*( "-" ( 1*8alphanum ) ) grandfathered = irregular / regular irregular = "en-GB-oed" / "i-ami" / "i-bnn" / "i-default" / "i-enochian" / "i-hak" / "i-klingon" / "i-lux" / "i-mingo" / "i-navajo" / "i-pwn" / "i-tao" / "i-tay" / "i-tsu" / "sgn-BE-FR" / "sgn-BE-NL" / "sgn-CH-DE" regular = "art-lojban" / "cel-gaulish" / "no-bok" / "no-nyn" / "zh-guoyu" / "zh-hakka" / "zh-min" / "zh-min-nan" / "zh-xiang" alphanum = ( ALPHA / DIGIT ) obs-language-tag = primary-subtag *( "-" subtag ) primary-subtag = 1*8ALPHA subtag = 1*8( ALPHA / DIGIT ) ; CRLF UNDEFINED registry = record *( "%%" CRLF record ) record = 1*field field = ( field-name field-sep field-body CRLF ) field-name = ( ALPHA / DIGIT ) [ *( ALPHA / DIGIT / "-" ) ( ALPHA / DIGIT ) ] ; SP UNDEFINED field-sep = *SP ":" *SP

Looking at the definition in https://datatracker.ietf.org/doc/html/rfc5234#appendix-B, LWSP excludes standalone LF, which I don't think we want to do?

I'm curious about the LF and its exclusion from LWSP. In addition to +1 on @eemeli's comment to allow for whitespace between Declarations (the let ...), I think it would be natural for people who want to serialize messages to a file to want to put a newline in between them. Same applies to the comment for whitespace between each variant in a selection message. At least, by "natural", I mean that I think this is how we have already written all of our example messages in our documentation so far.

What are the reasons for LWSP not including LF (what use cases does it support, and do those apply here)?

However we decide this, I think our examples in our documentation should match our syntax, and we just have to update one or both to ensure that.

In addition to +1 on @eemeli's comment to allow for whitespace between Declarations (the let ...) [...]

Did you mean to say "to allow" or "to require"? @eemeli's comment uses a stronger language.

Same applies to the comment for whitespace between each variant in a selection message. At least, by "natural", I mean that I think this is how we have already written all of our example messages in our documentation so far.

I agree about the "natural". I think we should recommend readable formatting and lead by example. This PR is more about defining the bounds of what's considered valid, even if it's formatted horribly.

@stasm Ah, yep, I meant "require".

spec/message.ebnf

stasm · 2023-02-16T09:45:10Z

@eemeli All your unresolved comments are related to the whitespace requirement around {...}, both when used to wrap expressions and when used to wrap patterns. Are you willing to relax your opinion on this?

eemeli · 2023-02-16T13:32:57Z

After some consideration, I think my only remaining concern is minification. I agree that there's some value in being lenient during parsing.

For existing web formats such as CSS, JSON & JS, when they're transmitted over the wire, it is common for any extra spaces, comments and such to be stripped out, which leads to a formulation that's sufficiently terse to need tooling before it's understandable by humans.

If we make spaces around {...} optional, we would end up with messages being transmitted in such form as well:

let $f={(foo)}match{$f}when 1{yes}when *{no}

On the other hand, if we require the spaces, the minimum size of this message would grow by six characters, but it would be much more human-readable:

let $f={(foo)}
match {$f}
when 1 {yes}
when * {no}

I would be interested to hear if anyone else has thoughts on this aspect, which I'm not aware of us explicitly discussing before. If you think that this is not of much concern, then I'd be fine relaxing my stance.

stasm · 2023-02-17T12:14:08Z

I would be interested to hear if anyone else has thoughts on this aspect, which I'm not aware of us explicitly discussing before.

It's fair to expect that people will want to attempt minification, although personally I wouldn't recommend it, mostly for practical reasons. Our syntax is already quite minimal and most of the contents of a message is the translation itself—which must not be minified. I can perhaps see some opportunity in removing leading whitespace from before when clauses, if the original formatting indents them. But otherwise, the benefits of minification are going to be minuscule.

alerque · 2023-02-17T12:38:12Z

Compressing message content (over the wire or at rest) any time size matters using GZIP or other standard mechanisms will be several orders of magnitude more effective than any possible minification, and also completely unaffected by any attempts to minify before compression.

eemeli · 2023-02-17T12:40:24Z

[...] I can perhaps see some opportunity in removing leading whitespace from before when clauses, if the original formatting indents them. But otherwise, the benefits of minification are going to be minuscule.

Actually, the benefits of minifying message resources can be significant, when considering a well-commented file as a starting point. All of the contextual information required for translation is unnecessary at runtime, and can easily make up a quarter or more of the total size. If/when used for web content, MF2 resources will be minified.

The application of an intelligent minifier is going to rely on a parser that understands the syntax it's working with, and it will apply all possible savings, trimming whitespace wherever it can even if the difference between match {...} and match{...} is, as I would agree with you, minuscule.

To see what I mean, take a look at the static nominally human-readable text-based assets that are loaded alongside the HTML of this very page, and observe that they are all minified to the furtheset extent possible.

Given this, my point here then is that making the whitespaces around {...} that we're talking about optional will lead to minified MF2 content being just as unreadable as minified CSS, whereas making them required will make the minified output much more readable.

eemeli · 2023-02-17T12:55:08Z

As a data point regarding minification, the total compressed size of all 280 FTL files currently in the Firefox repo is 274kB when raw, but 201kB minified; that's a saving of about 27%. I would expect minification of an MF2 resource format to provide similar file size savings for most users.

stasm · 2023-02-17T16:03:25Z

I would guess most of the savings can be attributed to the removal of comments, the indentation of attributes and the indentation of nested selectors? None of which exist in MessageFormat 2.

Taking a step back: are you advocating that minification is unavoidable and that we should attempt to preserve the readability of minified messages at the expense of the parser’s leniency?

aphillips · 2023-02-17T16:16:24Z

Minification as a process is not geared towards human legibility and it doesn't bother me that JS code (or MF patterns) aren't as legible after passing through such as process. It does, however, provide the "opposite extreme" to blowing up whitespace and might serve as a useful reminder of what "valid" looks like.

Note that translation systems might produce minified patterns by default, since they have no reason to prettify patterns by inserting extraneous whitespace. So we should be reasonably happy with minified results, as lots of processors will likely fold up whitespace.

As such, I think our current syntax holds up pretty well, based on @eemeli's example: let $f={(foo)}match{$f}when 1{yes}when *{no} While that's not optimal, it can be visually parsed.

echeran · 2023-02-17T19:44:25Z

spec/message.ebnf

-/* WhiteSpace */
-WhiteSpace ::= #x9 | #xD | #xA | #x20 /* ws: definition */
+/* White space */
+s ::= (#x9 | #xD | #xA | #x20)+


Looking at the definition in https://datatracker.ietf.org/doc/html/rfc5234#appendix-B, LWSP excludes standalone LF, which I don't think we want to do?

I'm curious about the LF and its exclusion from LWSP. In addition to +1 on @eemeli's comment to allow for whitespace between Declarations (the let ...), I think it would be natural for people who want to serialize messages to a file to want to put a newline in between them. Same applies to the comment for whitespace between each variant in a selection message. At least, by "natural", I mean that I think this is how we have already written all of our example messages in our documentation so far.

What are the reasons for LWSP not including LF (what use cases does it support, and do those apply here)?

However we decide this, I think our examples in our documentation should match our syntax, and we just have to update one or both to ensure that.

echeran · 2023-02-17T19:56:37Z

spec/message.ebnf

-VariantKey ::= Literal | Nmtoken | '*'
-Pattern ::= '{' (Text | Placeholder)* '}' /* ws: explicit */
+Pattern ::= '{' (Text | Placeholder)* '}'
+Selector ::= 'match' (s? '{' s? Expression s? '}')+


For the rule on this line, another point to consider is the name of the rule. It's the singular word "Selector" when it includes 1 or more things that we have been casually referring to as selectors. So I think it would be clearer to use the plural word "Selectors" and separate out a rule for the syntax of that individual concept named using the singular word "Selector":

Suggested change

Selector ::= 'match' (s? '{' s? Expression s? '}')+

Selectors ::= 'match' (s Selector)+

Selector ::= '{' s? Expression s? '}'

If doing so, then rename the reference to this rule within the Body rule accordingly.

(This suggested change also incorporates @eemeli's suggested change in a parallel comment that LGTM.)

Thanks, @echeran. I tried to limit the changes in this PR to only those related to whitespace. That's why in the description I wrote: Please review this PR without paying too much attention to naming. :)

The exact change you're suggesting is implemented in #347. Glad to see that we agree :)

echeran · 2023-02-17T22:54:13Z

I would be interested to hear if anyone else has thoughts on this aspect, which I'm not aware of us explicitly discussing before. If you think that this is not of much concern, then I'd be fine relaxing my stance.

I also think it would be preferable to require whitespace around {...}, even if that is stricter. I think it increases the readability of the selector+variant portion because the whole thing becomes consistently space-separated.

We haven't talked about whitespace specifically, but here are my high level understandings of how we think about the syntax:

It should be able to represent the data model 1:1
It should be human-friendly

Explaining the above, starting with the 2nd point (human-friendly representation), I think the argument for eschewing delimiters (#253, #255) in favor of keywords (#287) was that it would be more human-friendly. The space savings and regularity of delimiters were not the biggest criteria for those arguments in favor of keywords.

Regarding the first point, it should be obvious. Ever since we started talking about the syntax in earnest (#178), we've wanted to have a syntax that represents anything that the data model can specify. And even though "human-friendly" can be a squishy thing to measure (ex: I think delimiters are more important for readability than keywords), I'm okay because I know there's always a 1:1 correspondence to any other representation/API I would find.

So we already have precedent for when readability in service of being human-friendly wins over saving a few characters. I think the minification arguments don't move me much because, at this point, the unresolved comments all relate to whitespace around {...}, which seems too relatively infrequent. I also trust / agree with @alerque 's sentiment that if we really care about space savings, then we would be better off starting with a compression like GZIP.

@stasm's point about parser leniency (should this be required vs. optional?) is expounded upon more over in the issue at #340 (comment), so I'll respond there.

alerque · 2023-02-17T23:04:56Z

@eemeli How much of that difference is the removal of optional and inconsequential (but potentially hefty) blobs like comments an how much is reducing the syntax niceties like whitespace to a minimum?

eemeli · 2023-02-18T08:21:12Z

@stasm:
I would guess most of the savings can be attributed to the removal of comments, the indentation of attributes and the indentation of nested selectors? None of which exist in MessageFormat 2.

Also in answer to @alerque: The vast majority (probably something like 95%) of the savings come from dropping comments.

Taking a step back: are you advocating that minification is unavoidable and that we should attempt to preserve the readability of minified messages at the expense of the parser’s leniency?

I would like us to be explicit about the choice we're making. We're building a new format for encoding human messages that we've put some work into making human-readable. The ecosystem around us is one that now has years of experience applying AST transforms on source code for various purposes, so the complexity and cost of applying e.g. minification to MF2 is marginal. Therefore, we should presume that it will be used by at least some large fraction of MF2 emitters that presume that their output will primarily be consumed by machines rather than humans.

So while it is entirely valid for us to make the call that whitespace being optional results in a more lenient parser and that's desirable, I want our choice on this to be explicit rather than implicit.

stasm · 2023-02-27T19:22:16Z

At the February 27 teleconference we agreed to move ahead with this PR. To summarize the current agreement, there have been three major threads of discussion above:

whether to require whitespace around {...}, i.e. whether match{...} should be allowed,
whether to require whitespace between variant keys, i.e. whether when*** and when foo* should be allowed,
whether to require whitespace between option pairs, i.e. whether {:func opt=(x)opt=y} should be allowed.

This PR implements the following answers:

no, whitespace around {...} is optional,
yes, whitespace between variant keys is required,
yes, whitespace between option pairs is required.

During the meeting we also discussed possibly further relaxing the whitespace requirements related to (2) and (3) above. I'll file a new PR with suggested changes to the BNF, and we will continue the discussion there.

eemeli

As discussed on the call today, I'm ok with this proceeding in its current shape.

aphillips

Please merge this. I believe this will almost immediately be overwritten by the ABNF version??

stasm · 2023-02-28T09:26:00Z

I believe this will almost immediately be overwritten by the ABNF version??

Yes. #347 will follow today or tomorrow.

stasm commented Feb 13, 2023

View reviewed changes

spec/message.ebnf Outdated Show resolved Hide resolved

stasm requested review from eemeli, aphillips, echeran and mihnita February 13, 2023 22:17

eemeli requested changes Feb 13, 2023

View reviewed changes

spec/message.ebnf Show resolved Hide resolved

spec/message.ebnf Show resolved Hide resolved

spec/message.ebnf Show resolved Hide resolved

spec/message.ebnf Show resolved Hide resolved

spec/message.ebnf Show resolved Hide resolved

aphillips reviewed Feb 13, 2023

View reviewed changes

spec/message.ebnf Outdated Show resolved Hide resolved

Add explicit whitespace definitions. The grammar is now LL(1) with ba…

8106286

…cktracking.

stasm force-pushed the sta/explicit-whitespace branch from 626f9ec to 8106286 Compare February 13, 2023 23:39

aphillips reviewed Feb 13, 2023

View reviewed changes

This was referenced Feb 14, 2023

Disallow empty placeholders in the syntax #345

Closed

Consider escaping by doubling the special characters #346

Closed

Define the grammar as an ABNF (RFC 5234) #347

Merged

echeran reviewed Feb 17, 2023

View reviewed changes

Merge branch 'main' into sta/explicit-whitespace

05d0da5

stasm requested review from aphillips and eemeli February 27, 2023 19:22

stasm requested a review from echeran February 27, 2023 19:22

eemeli approved these changes Feb 27, 2023

View reviewed changes

aphillips approved these changes Feb 27, 2023

View reviewed changes

stasm merged commit 6f86f8a into main Feb 28, 2023

stasm deleted the sta/explicit-whitespace branch February 28, 2023 09:28

	Selector ::= 'match' (s? '{' s? Expression s? '}')+
	Selectors ::= 'match' (s Selector)+
	Selector ::= '{' s? Expression s? '}'

Uh oh!

Add explicit whitespace definitions #344

Add explicit whitespace definitions #344

Uh oh!

Conversation

stasm commented Feb 13, 2023

Uh oh!

Uh oh!

eemeli left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

stasm commented Feb 13, 2023

Uh oh!

Uh oh!

aphillips left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

stasm commented Feb 16, 2023

Uh oh!

eemeli commented Feb 16, 2023

Uh oh!

stasm commented Feb 17, 2023

Uh oh!

alerque commented Feb 17, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eemeli commented Feb 17, 2023

Uh oh!

eemeli commented Feb 17, 2023

Uh oh!

stasm commented Feb 17, 2023

Uh oh!

aphillips commented Feb 17, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stasm Feb 17, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

echeran commented Feb 17, 2023

Uh oh!

alerque commented Feb 17, 2023

Uh oh!

eemeli commented Feb 18, 2023

Uh oh!

stasm commented Feb 27, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eemeli left a comment

Choose a reason for hiding this comment

Uh oh!

aphillips left a comment

Choose a reason for hiding this comment

alerque commented Feb 17, 2023 •

edited

Loading

stasm Feb 17, 2023 •

edited

Loading

stasm commented Feb 27, 2023 •

edited

Loading