-
-
Notifications
You must be signed in to change notification settings - Fork 36
Add explicit whitespace definitions #344
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should require more spaces; see line comments for specifics. My opinions here are not set in stone, but I would prefer a bit more weight to be placed on the reader experience. So while e.g. match{$foo}{$bar}
is easily parsable by a machine, I would maintain that a human would find match {$foo} {$bar}
faster to understand. As with all code, people will spend far more time reading rather than writing MF2.
I agree with the sentiment, but the grammar is not, in my opinion, the right vehicle to convey such guidelines. There's precedent and a strong convention in many programming languages that braced blocks can be positioned in a rather free-form manner relative to other symbols. I think we should respect it. Otherwise, it's going to turn out that our syntax is actually really quite whitespace-significant outside patterns, too. |
626f9ec
to
8106286
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good as a start. Not a full review--just a couple of quick comments.
/* WhiteSpace */ | ||
WhiteSpace ::= #x9 | #xD | #xA | #x20 /* ws: definition */ | ||
/* White space */ | ||
s ::= (#x9 | #xD | #xA | #x20)+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use LWSP
, which is built-in, instead of defining our own?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the definition in https://datatracker.ietf.org/doc/html/rfc5234#appendix-B, LWSP
excludes standalone LF
, which I don't think we want to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the same appendix says that it contains some basic rules that are in common use and that [b]asic rules are in uppercase. Does common use mean that they're built-in, or perhaps that they're expected to be defined in a certain way?
The reason I'm asking is that I found an ABNF parser at https://author-tools.ietf.org/abnf and it complained about DIGIT
and ALPHA
being undefined. Bug or intended behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer keeping an explicit whitespace definition within the spec, rather than introducing an external dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the same appendix says that it contains some basic rules that are in common use and that [b]asic rules are in uppercase. Does common use mean that they're built-in, or perhaps that they're expected to be defined in a certain way?
I think in an RFC, you would usually say that the following rules are imported from RFC 5234, and give a list of the rules you need.
The reason I'm asking is that I found an ABNF parser at https://author-tools.ietf.org/abnf and it complained about
DIGIT
andALPHA
being undefined. Bug or intended behavior?
DIGIT and ALPHA are too frequent for such a bug not having been detected. I'd guess you have to add the rules you reference from RFC 5234 by hand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bill's parser (which powers author-tools) has always complained about the built-ins. I think most I-D authors ignore the warnings (I know I learned to). For example, if you run BCP47's ABNF in the tool, it complains about all of the built-ins (and BCP47 is a published set of RFCs blessed by the editor):
Language-Tag = langtag / privateuse / grandfathered
langtag = language [ "-" script ] [ "-" region ] *( "-" variant ) *( "-" extension ) [ "-" privateuse ]
; ALPHA UNDEFINED
language = ( 2*3ALPHA [ "-" extlang ] ) / 4ALPHA / 5*8ALPHA
extlang = 3ALPHA *2( "-" 3ALPHA )
script = 4ALPHA
; DIGIT UNDEFINED
region = 2ALPHA / 3DIGIT
variant = 5*8alphanum / ( DIGIT 3alphanum )
extension = singleton 1*( "-" ( 2*8alphanum ) )
singleton = DIGIT / %x41-57 / %x59-5A / %x61-77 / %x79-7A
privateuse = "x" 1*( "-" ( 1*8alphanum ) )
grandfathered = irregular / regular
irregular = "en-GB-oed" / "i-ami" / "i-bnn" / "i-default" / "i-enochian" / "i-hak" / "i-klingon" / "i-lux" / "i-mingo" / "i-navajo" / "i-pwn" / "i-tao" / "i-tay" / "i-tsu" / "sgn-BE-FR" / "sgn-BE-NL" / "sgn-CH-DE"
regular = "art-lojban" / "cel-gaulish" / "no-bok" / "no-nyn" / "zh-guoyu" / "zh-hakka" / "zh-min" / "zh-min-nan" / "zh-xiang"
alphanum = ( ALPHA / DIGIT )
obs-language-tag = primary-subtag *( "-" subtag )
primary-subtag = 1*8ALPHA
subtag = 1*8( ALPHA / DIGIT )
; CRLF UNDEFINED
registry = record *( "%%" CRLF record )
record = 1*field
field = ( field-name field-sep field-body CRLF )
field-name = ( ALPHA / DIGIT ) [ *( ALPHA / DIGIT / "-" ) ( ALPHA / DIGIT ) ]
; SP UNDEFINED
field-sep = *SP ":" *SP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the definition in https://datatracker.ietf.org/doc/html/rfc5234#appendix-B,
LWSP
excludes standaloneLF
, which I don't think we want to do?
I'm curious about the LF
and its exclusion from LWSP
. In addition to +1 on @eemeli's comment to allow for whitespace between Declaration
s (the let ...
), I think it would be natural for people who want to serialize messages to a file to want to put a newline in between them. Same applies to the comment for whitespace between each variant in a selection message. At least, by "natural", I mean that I think this is how we have already written all of our example messages in our documentation so far.
What are the reasons for LWSP
not including LF
(what use cases does it support, and do those apply here)?
However we decide this, I think our examples in our documentation should match our syntax, and we just have to update one or both to ensure that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to +1 on @eemeli's comment to allow for whitespace between
Declaration
s (thelet ...
) [...]
Did you mean to say "to allow" or "to require"? @eemeli's comment uses a stronger language.
Same applies to the comment for whitespace between each variant in a selection message. At least, by "natural", I mean that I think this is how we have already written all of our example messages in our documentation so far.
I agree about the "natural". I think we should recommend readable formatting and lead by example. This PR is more about defining the bounds of what's considered valid, even if it's formatted horribly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stasm Ah, yep, I meant "require".
@eemeli All your unresolved comments are related to the whitespace requirement around |
After some consideration, I think my only remaining concern is minification. I agree that there's some value in being lenient during parsing. For existing web formats such as CSS, JSON & JS, when they're transmitted over the wire, it is common for any extra spaces, comments and such to be stripped out, which leads to a formulation that's sufficiently terse to need tooling before it's understandable by humans. If we make spaces around
On the other hand, if we require the spaces, the minimum size of this message would grow by six characters, but it would be much more human-readable:
I would be interested to hear if anyone else has thoughts on this aspect, which I'm not aware of us explicitly discussing before. If you think that this is not of much concern, then I'd be fine relaxing my stance. |
It's fair to expect that people will want to attempt minification, although personally I wouldn't recommend it, mostly for practical reasons. Our syntax is already quite minimal and most of the contents of a message is the translation itself—which must not be minified. I can perhaps see some opportunity in removing leading whitespace from before |
Compressing message content (over the wire or at rest) any time size matters using GZIP or other standard mechanisms will be several orders of magnitude more effective than any possible minification, and also completely unaffected by any attempts to minify before compression. |
Actually, the benefits of minifying message resources can be significant, when considering a well-commented file as a starting point. All of the contextual information required for translation is unnecessary at runtime, and can easily make up a quarter or more of the total size. If/when used for web content, MF2 resources will be minified. The application of an intelligent minifier is going to rely on a parser that understands the syntax it's working with, and it will apply all possible savings, trimming whitespace wherever it can even if the difference between To see what I mean, take a look at the static nominally human-readable text-based assets that are loaded alongside the HTML of this very page, and observe that they are all minified to the furtheset extent possible. Given this, my point here then is that making the whitespaces around |
As a data point regarding minification, the total compressed size of all 280 FTL files currently in the Firefox repo is 274kB when raw, but 201kB minified; that's a saving of about 27%. I would expect minification of an MF2 resource format to provide similar file size savings for most users. |
I would guess most of the savings can be attributed to the removal of comments, the indentation of attributes and the indentation of nested selectors? None of which exist in MessageFormat 2. Taking a step back: are you advocating that minification is unavoidable and that we should attempt to preserve the readability of minified messages at the expense of the parser’s leniency? |
Minification as a process is not geared towards human legibility and it doesn't bother me that JS code (or MF patterns) aren't as legible after passing through such as process. It does, however, provide the "opposite extreme" to blowing up whitespace and might serve as a useful reminder of what "valid" looks like. Note that translation systems might produce minified patterns by default, since they have no reason to prettify patterns by inserting extraneous whitespace. So we should be reasonably happy with minified results, as lots of processors will likely fold up whitespace. As such, I think our current syntax holds up pretty well, based on @eemeli's example: |
/* WhiteSpace */ | ||
WhiteSpace ::= #x9 | #xD | #xA | #x20 /* ws: definition */ | ||
/* White space */ | ||
s ::= (#x9 | #xD | #xA | #x20)+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the definition in https://datatracker.ietf.org/doc/html/rfc5234#appendix-B,
LWSP
excludes standaloneLF
, which I don't think we want to do?
I'm curious about the LF
and its exclusion from LWSP
. In addition to +1 on @eemeli's comment to allow for whitespace between Declaration
s (the let ...
), I think it would be natural for people who want to serialize messages to a file to want to put a newline in between them. Same applies to the comment for whitespace between each variant in a selection message. At least, by "natural", I mean that I think this is how we have already written all of our example messages in our documentation so far.
What are the reasons for LWSP
not including LF
(what use cases does it support, and do those apply here)?
However we decide this, I think our examples in our documentation should match our syntax, and we just have to update one or both to ensure that.
VariantKey ::= Literal | Nmtoken | '*' | ||
Pattern ::= '{' (Text | Placeholder)* '}' /* ws: explicit */ | ||
Pattern ::= '{' (Text | Placeholder)* '}' | ||
Selector ::= 'match' (s? '{' s? Expression s? '}')+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the rule on this line, another point to consider is the name of the rule. It's the singular word "Selector" when it includes 1 or more things that we have been casually referring to as selectors. So I think it would be clearer to use the plural word "Selectors" and separate out a rule for the syntax of that individual concept named using the singular word "Selector":
Selector ::= 'match' (s? '{' s? Expression s? '}')+ | |
Selectors ::= 'match' (s Selector)+ | |
Selector ::= '{' s? Expression s? '}' |
If doing so, then rename the reference to this rule within the Body
rule accordingly.
(This suggested change also incorporates @eemeli's suggested change in a parallel comment that LGTM.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think it would be preferable to require whitespace around We haven't talked about whitespace specifically, but here are my high level understandings of how we think about the syntax:
Explaining the above, starting with the 2nd point (human-friendly representation), I think the argument for eschewing delimiters (#253, #255) in favor of keywords (#287) was that it would be more human-friendly. The space savings and regularity of delimiters were not the biggest criteria for those arguments in favor of keywords. Regarding the first point, it should be obvious. Ever since we started talking about the syntax in earnest (#178), we've wanted to have a syntax that represents anything that the data model can specify. And even though "human-friendly" can be a squishy thing to measure (ex: I think delimiters are more important for readability than keywords), I'm okay because I know there's always a 1:1 correspondence to any other representation/API I would find. So we already have precedent for when readability in service of being human-friendly wins over saving a few characters. I think the minification arguments don't move me much because, at this point, the unresolved comments all relate to whitespace around @stasm's point about parser leniency (should this be required vs. optional?) is expounded upon more over in the issue at #340 (comment), so I'll respond there. |
@eemeli How much of that difference is the removal of optional and inconsequential (but potentially hefty) blobs like comments an how much is reducing the syntax niceties like whitespace to a minimum? |
Also in answer to @alerque: The vast majority (probably something like 95%) of the savings come from dropping comments.
I would like us to be explicit about the choice we're making. We're building a new format for encoding human messages that we've put some work into making human-readable. The ecosystem around us is one that now has years of experience applying AST transforms on source code for various purposes, so the complexity and cost of applying e.g. minification to MF2 is marginal. Therefore, we should presume that it will be used by at least some large fraction of MF2 emitters that presume that their output will primarily be consumed by machines rather than humans. So while it is entirely valid for us to make the call that whitespace being optional results in a more lenient parser and that's desirable, I want our choice on this to be explicit rather than implicit. |
At the February 27 teleconference we agreed to move ahead with this PR. To summarize the current agreement, there have been three major threads of discussion above:
This PR implements the following answers:
During the meeting we also discussed possibly further relaxing the whitespace requirements related to (2) and (3) above. I'll file a new PR with suggested changes to the BNF, and we will continue the discussion there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed on the call today, I'm ok with this proceeding in its current shape.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please merge this. I believe this will almost immediately be overwritten by the ABNF version??
Yes. #347 will follow today or tomorrow. |
Fixes #340. This PR implements the following whitespace rules:
No, whitespace is optional around braced expressions (as in
let $foo={...}
andmatch{...}{...}
) nor around braced patterns (when key{Hello}
is OK). Whitespace is optional around the=
.Yes, whitespace is required after
let
.Yes, whitespace is required between all option pairs, regardless of whether the option value is a literal or not. Whitespace is optional around the
=
.Yes, whitespace is required between all when keys, regardless of the kind of the keys.
Same as above, whitespace is required between all when keys.
The grammar is now LL(1) with backtracking, however I noticed that with one functional change compared to the current
main
, we could make it LL(2) without backtracking. Namely, we can do it by forbidding empty placeholders.There are a few follow-ups I'd like to open PRs for after this one:
Function
→FunctionName
,Annotation
→FunctionCall
. Please review this PR without paying too much attention to naming.let
,match
, andwhen
keywords.