Add negative-start rule #399

eemeli · 2023-06-19T17:52:11Z

This is yet another potential solution for allowing unquoted negative numbers; see also #397 and #398.

This change makes a specific exception for `-` to start an `unquoted` literal, provided that it is followed by a `.` or a digit. As these characters are not included in `name-start`, they disambiguate the parsing of e.g. `-1` or `-.5` when used as an operand.

spec/message.abnf

spec/syntax.md

Co-authored-by: Addison Phillips <addisonI18N@gmail.com>

eemeli · 2023-06-30T02:46:19Z

From @gibson042:

> I've asked before, and will repeat the question: what is the value of staying close to XML Nmtoken but not matching it exactly? If there's going to be deviation (and unexplained deviation at that), what makes that a better foundation than e.g. UAX 31 Identifiers, or for that matter of dropping the pretense that this syntax has any relevant relationship to another one?

UAX 31 identifiers support a much more limited set of strings than we want to support as unquoted values; for example values such as 42, 13.0 and html:b are currently valid unquoted literals, but are not valid UAX 31 identifiers. XML Nmtoken is really close to being exactly what we want, with two specific starting-character exclusions we've needed to carve out.
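To make the distinction concrete, here is a minimal Python sketch. It is not part of the PR: `is_nmtoken_ascii` is a hypothetical helper that covers only the ASCII subset of XML NameChar, and Python's `str.isidentifier()` stands in for UAX 31 (Python identifiers are based on XID_Start/XID_Continue, with `_` additionally allowed as a start character), so the comparison is approximate.

```python
import re

# ASCII-only approximation of XML Nmtoken, i.e. one or more NameChar.
# Assumption: the non-ASCII NameChar ranges are omitted for brevity.
NMTOKEN_ASCII = re.compile(r"[A-Za-z0-9_:.\-]+")


def is_nmtoken_ascii(value: str) -> bool:
    """Return True if value consists only of ASCII NameChar characters."""
    return NMTOKEN_ASCII.fullmatch(value) is not None


# For each example value from the comment above, record
# (matches Nmtoken, is a Python/UAX 31-style identifier).
results = {
    value: (is_nmtoken_ascii(value), value.isidentifier())
    for value in ["42", "13.0", "html:b"]
}

for value, (nmtoken_ok, identifier_ok) in results.items():
    print(f"{value!r}: nmtoken={nmtoken_ok} identifier={identifier_ok}")
```

All three values pass the token check and fail the identifier check, which is the gap between a "token" grammar and an "identifier" grammar described here.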

One key differentiator here is that for unquoted we're not trying to match an "identifier" grammar, but a "token" grammar. As far as I'm aware, no real regularity exists on this, and the XML Nmtoken seems like a pretty good starting point.

From @peter-b:

> We really do not need a N+1th incompatible specification for identifiers. It would be greatly preferable to either:
>
> * use an identifier grammar production that exactly matches some existing standard
> * use UAX 31
>
> Any other choice requires an extraordinarily strong justification, which has not been provided here.

Please correct me if I'm mistaken, but this seems like a point that's more about name rather than unquoted, yes? If this is the case, replies to the following should be spun off into their own issue rather than being tracked under this PR:

For name, using UAX 31 identifiers would be conceivable as a starting point, provided that the grammar also allowed for a path-splitting character such as ., -, or : to support constructs like html:b or foo.bar. This is not an uncontroversial topic, and when settling on a syntax the WG opted to allow each implementation to figure out for itself e.g. whether the . in foo.bar has any special meaning, rather than making the call at the syntax level. Basing name on XML Name supports this, while other choices would require this discussion to be reopened.

gibson042 · 2023-06-30T08:30:01Z

> XML Nmtoken is really close to being exactly what we want, with two specific starting-character exclusions we've needed to carve out.

"Needed" seems like an overly strong characterization. What problems would ensue from updating the syntax to match XML Nmtoken exactly (keeping in mind that it is possible for source to be syntactically valid while still being rejected for higher-level semantic reasons)? And if the deviation is actually necessary, shouldn't the reason be documented in a grammar comment, either directly or with a referencing link?

eemeli · 2023-06-30T09:16:36Z

> What problems would ensue from updating the syntax to match XML Nmtoken exactly (keeping in mind that it is possible for source to be syntactically valid while still being rejected for higher-level semantic reasons)?

XML Nmtoken may start with either the : or - characters, which are function start indicators in MF2. This causes an overlap at the beginning of an expression, which may have an operand. Effectively, matching the rule exactly for unquoted would require us to change the function start characters, or disallow unquoted as an operand; only allowing it as an option value.

> And if the deviation is actually necessary, shouldn't the reason be documented in a grammar comment, either directly or with a referencing link?

This clarification is provided in the syntax.md, to which this PR adds this text:

To make _unquoted_ literals distinct from _function_ names,
a literal MUST be quoted if it begins with a `:`
or if it begins with a `-` that is **not** followed by a `.` or a digit.

At least my understanding is that our syntax is defined by both message.abnf and syntax.md, and that duplicating the latter into the former would not necessarily be beneficial -- if you're looking for explanations, then syntax.md already provides them, along with the ABNF rules.
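
As an illustration only (this is neither the ABNF in the PR nor any implementation's parser), the prose rule quoted above could be sketched in Python as follows. The helper name `may_be_unquoted` is hypothetical, the character ranges are copied from the NameStartChar and NameChar productions of XML 1.0 (Fifth Edition), and the sketch assumes that an unquoted literal is otherwise exactly an Nmtoken.

```python
import re

# XML 1.0 (Fifth Edition) NameStartChar ranges, as regex character-class content.
NAME_START = (
    ":A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# NameChar adds "-", ".", digits, and a few combining/extender characters.
NAME_CHAR = NAME_START + "\\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

# Nmtoken is one or more NameChar.
NMTOKEN = re.compile("[" + NAME_CHAR + "]+")


def may_be_unquoted(literal: str) -> bool:
    """Sketch of the rule: Nmtoken, minus ':' starts and most '-' starts."""
    if NMTOKEN.fullmatch(literal) is None:
        return False
    if literal[0] == ":":
        # ':' starts a function name, so such a literal must be quoted.
        return False
    if literal[0] == "-":
        # '-' is only allowed when followed by '.' or an ASCII digit.
        return len(literal) > 1 and literal[1] in ".0123456789"
    return True


for sample in ["-1", "-.5", "42", "html:b", "-foo", ":foo", "+1"]:
    print(f"{sample!r}: {may_be_unquoted(sample)}")
```

Under these assumptions `-1` and `-.5` are accepted, while `-foo` and `:foo` must be quoted. `+1` is rejected for a different reason: `+` is not an XML NameChar at all.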

aphillips

This addresses my previous comments, but this PR's discussion thread makes me think we should make some changes to the text to clarify what we're doing.

aphillips · 2023-06-30T16:31:50Z

spec/syntax.md

@@ -462,9 +462,10 @@ except for surrogate code points U+D800 through U+DFFF.
 The characters `\` and `|` MUST be escaped as `\\` and `\|`.

 **_Unquoted_** literals have a much more restricted range that


I think this is missing some information that could help avoid repeats of the conversation in this PR.

Suggested change

-**_Unquoted_** literals have a much more restricted range that
+A _literal_ MAY appear **_unquoted_** when its content matches
+the `unquoted` production.
+Similar to other productions in the message syntax,
+the content of an _unquoted_ literal is identical to XML's [Nmtoken]
+(https://www.w3.org/TR/xml/#NT-Nmtoken) except that specific characters
+meaningful to other portions of the syntax are not allowed at the start. Specifically:
+* A _literal_ MUST be quoted if it begins with a `:` (U+003A COLON).
+* An _unquoted_ literal MUST NOT begin with a `-` (U+002D HYPHEN MINUS)
+unless that character is followed by a `.` (U+002E FULL STOP) or an ASCII digit.
+This allows literals which might be used to represent negative numbers, such
+as `-42` or `-.345`, to appear unquoted.

eemeli · 2023-07-02

I'm fine with being more verbose about the description here, but the above suggestion is a bit problematic:

* The first paragraph is somewhat tautological, given that this is how practically all of the syntax is defined.
* I don't think we have any other productions that refer to or approximate XML Nmtoken. Our name is rather close to XML Name, but its first-character rules are quite different.

spec/message.abnf

Co-authored-by: Addison Phillips <addisonI18N@gmail.com>

gibson042 · 2023-07-07T20:31:14Z

> XML Nmtoken may start with either the : or - characters, which are function start indicators in MF2. This causes an overlap at the beginning of an expression, which may have an operand. Effectively, matching the rule exactly for unquoted would require us to change the function start characters, or disallow unquoted as an operand; only allowing it as an option value.

If XML Nmtoken is sufficiently important to use as a foundation for unquoted literals, then having overlap between it and function prefix sigils seems like an unnecessary misstep. Without dragging this too far into the weeds, I will note that most ASCII punctuation characters are excluded from NameChar and therefore available for that use in MF2, particularly all of those that are currently included in reserved-start = "!" / "@" / "#" / "%" / "^" / "&" / "*" / "<" / ">" / "?" / "~". Accordingly, of the three PRs relating to unquoted negative number operands, I think {#open}...{/close} of #398 heads in the best direction (although it would still need to replace {:functionName} with something like {!functionName} so that valid Nmtoken instances like ":functionName" can be used as unquoted operands).
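
The claim about available punctuation can be checked with a short Python sketch. This is an illustration rather than anything from the spec: `is_ascii_name_char` is a hypothetical helper covering only the ASCII subset of XML NameChar, and `RESERVED_START` simply copies the characters listed in the `reserved-start` rule quoted above.

```python
# Characters from the reserved-start rule quoted in the comment above.
RESERVED_START = "!@#%^&*<>?~"


def is_ascii_name_char(ch: str) -> bool:
    """ASCII subset of XML NameChar: letters, digits, ':', '_', '-', '.'."""
    return ch.isascii() and (ch.isalnum() or ch in ":_-.")


# Sigil candidates that can never begin (or appear in) an Nmtoken.
available = [ch for ch in RESERVED_START if not is_ascii_name_char(ch)]

# Of ':', '-' and '+', the ones that are NameChar and so overlap with Nmtoken.
conflicting = [ch for ch in ":-+" if is_ascii_name_char(ch)]

print("available:", available)
print("conflicting:", conflicting)
```

Every `reserved-start` character ends up in `available`, while `:` and `-` both land in `conflicting`. `+` does not, because it is not a NameChar.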

aphillips · 2023-07-07T23:32:40Z

@gibson042 That's a good point. If we want to claim that our naming is rooted in XML, why not just go all the way and use Name and Nmtoken and choose sigils (there are plenty available) that don't conflict with these? The main thing is that we'd need to change our (arbitrary) choice of : too. It would greatly simplify the logic behind our naming and perhaps our syntax.

macchiati · 2023-07-08T00:04:04Z

The important issue is that our syntax be as natural and as non-error-prone as possible. I don't see much, if any, value in adhering strictly to Nmtoken if that adherence compromises the syntax, as long as we have an unambiguous definition.

gibson042 · 2023-07-08T13:33:15Z

Sure, but are you claiming that adopting Nmtoken would compromise the syntax? Is {:foo} really more natural than {!foo}/{@foo}/{#foo}/etc.? Is it really less error prone to accept {-1 :bar} while rejecting {+1 :bar}?

aphillips · 2023-07-08T14:39:59Z

@macchiati I agree. However, if we can achieve full compatibility with our chosen namespace scheme by choosing a different (somewhat randomly chosen) sigil, doesn't that (slightly) reduce potential errors? The syntax is only required to be internally consistent, but it's better if it is also widely compatible with the contexts in which it will appear and if exterior software requires only minimal modification to interact with our constructs.

Of course, it isn't as if the world is constructed around Nmtoken. It's nice that our naming matches something, but that something is not exactly like most programming language namespaces and it will conflict with naming conventions established in some other resource, templating, scripting, or programming contexts somewhere. As a reminder, the reason we chose Nmtoken (and Name) was to maximize compatibility with (potential) LDML constructs, since CLDR uses XML. The "carve out" for various sigils doesn't conflict with naming in LDML currently and future conflicts are under our control. If we don't care about other XML dialects, non-conformance with Nmtoken (et al) may ultimately not matter. Most users will name their variables and functions something rational (probably an ASCII alphanumeric token).

stasm · 2023-09-18T11:29:30Z

I continue to believe that this PR treats the symptom rather than the root cause. I think there are two blockers before we can consider this PR for merging:

* Finish the design of the open/close spannables, including their syntax, in "Design document for Open/Close Expressions" (#470).
* Document the use cases for unquoted literals in "Document why we need unquoted literals" (#478).

eemeli · 2023-09-21T09:55:42Z

Just to note, on the call of 4 September we agreed to merge this on or immediately after the end of the Seville colloquium on 13 September, unless we explicitly decide against doing so. No such decision has been made.

I agree that this is treating the symptom rather than the root cause, but something very much like this is required for negative number support with our current syntax. We should patch this now, while acknowledging that it may be wholly superseded by later changes.

stasm · 2023-09-21T10:25:26Z

That agreement was based on the assumption that we would have completed the design of open/close in Seville. That didn't happen and the design is still ongoing in #470.

I want this PR to stay open because I want this to be a friction point, so that we don't get complacent with the current proposed design on open/close spannables.

An alternative which I'd be happy to go with instead is to remove the current open/close features from the spec on the main branch, since it's pending a refactor in #470 anyways. We can then fix the negative literals the proper way, and then discuss if we want to compromise in order to enable the + and - syntax for spannables.

aphillips · 2023-09-21T16:26:43Z

We have a significant refactor of the ABNF that's going to be needed due to all of the changes we agreed to in Seville and later.

We do need to incorporate negative number literals (which we have consensus on) at some point. I think your point, @stasm, is that this particular formulation is only needed if we use the sigil - in spannables.

I don't think this PR is necessarily the right friction point for addressing the sigil use. My suggestion would be (a) merge this (knowing that it's potentially throw-away); (b) figure out the more-radical syntax changes (specifically text-mode and namespacing) and implement those in the ABNF; (c) then address any changes for spannable (which might change the sigils). Isn't #470 sufficient as a friction point for the last?

aphillips · 2023-11-06T18:37:03Z

Held for discussion in 2023-11-13 call

aphillips · 2023-12-04T19:58:52Z

I believe this is now obsolete and should be closed in favor of #553

eemeli · 2023-12-04T20:10:54Z

Yeah, this is out of date.

eemeli force-pushed the negativity branch from 4b1ac92 to 8324e33 on June 19, 2023 19:29

Add negative-start rule

c60f711

eemeli force-pushed the negativity branch from 8324e33 to c60f711 on June 19, 2023 19:29

eemeli marked this pull request as ready for review June 19, 2023 19:33

eemeli requested review from aphillips and stasm June 19, 2023 19:34

aphillips requested changes Jun 19, 2023

spec/message.abnf (outdated, resolved)

spec/syntax.md (outdated, resolved)

spec/syntax.md (outdated, resolved)

eemeli and others added 2 commits June 19, 2023 23:53

Apply suggestions from code review

f1166f5

Co-authored-by: Addison Phillips <addisonI18N@gmail.com>

Merge negative-start into unquoted-start

686e99c

eemeli requested a review from aphillips June 30, 2023 02:46

aphillips approved these changes Jun 30, 2023


Update spec/message.abnf

b8b0dfa

Co-authored-by: Addison Phillips <addisonI18N@gmail.com>

stasm mentioned this pull request Jul 2, 2023

Change the syntax of the #open and /close function calls #398

Closed

Merge branch 'main' into negativity

dd3abc7

aphillips added the Agenda+ Requested for upcoming teleconference label Nov 6, 2023

gibson042 mentioned this pull request Nov 8, 2023

Name syntax should align with XML #519

Closed

gibson042 mentioned this pull request Nov 27, 2023

Implementing namespacing #524

Merged

aphillips added Stale Obsolete? and removed Agenda+ Requested for upcoming teleconference labels Dec 4, 2023

eemeli closed this Dec 4, 2023

eemeli deleted the negativity branch December 4, 2023 20:10

