Add negative-start rule #399

Closed
wants to merge 5 commits into from

eemeli
Collaborator

@eemeli eemeli commented Jun 19, 2023

This is yet another potential solution for allowing unquoted negative numbers; see also #397 and #398.

This change makes a specific exception for - to start an unquoted, provided that it is followed by a . or a digit. As these characters are not included in name-start, they disambiguate the parsing of e.g. -1 or -.5 when used as an operand.
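The start rule described above can be sketched as a small predicate. This is only an illustration of the first-character check, not the PR's ABNF: the function name `must_be_quoted_by_start` is hypothetical, and the sketch does not validate the remaining characters of the literal.

```python
def must_be_quoted_by_start(literal: str) -> bool:
    """Return True if the leading characters force the literal to be quoted.

    Hypothetical helper: it only models the start rule from this PR
    (":" always forces quoting; "-" forces quoting unless it is followed
    by "." or an ASCII digit). It does not check later characters.
    """
    if literal.startswith(":"):
        return True
    if literal.startswith("-"):
        # "-" may start an unquoted literal only before "." or a digit.
        return not (len(literal) > 1 and literal[1] in ".0123456789")
    return False


for sample in ["-1", "-.5", "-foo", ":foo", "42"]:
    print(sample, must_be_quoted_by_start(sample))
```

A lone `-` has no following character, so under this reading it would still need quoting.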

@eemeli eemeli marked this pull request as ready for review June 19, 2023 19:33
@eemeli eemeli requested review from aphillips and stasm June 19, 2023 19:34
eemeli and others added 2 commits June 19, 2023 23:53
Co-authored-by: Addison Phillips <addisonI18N@gmail.com>
@eemeli
Collaborator Author

eemeli commented Jun 30, 2023

From @gibson042:

> I've asked before, and will repeat the question: what is the value of staying close to XML Nmtoken but not matching it exactly? If there's going to be deviation (and unexplained deviation at that), what makes that a better foundation than e.g. UAX 31 Identifiers, or for that matter of dropping the pretense that this syntax has any relevant relationship to another one?

UAX 31 identifiers support a much more limited set of strings than we want to support as unquoted values; for example values such as 42, 13.0 and html:b are currently valid unquoted literals, but are not valid UAX 31 identifiers. XML Nmtoken is really close to being exactly what we want, with two specific starting-character exclusions we've needed to carve out.

One key differentiator here is that for unquoted we're not trying to match an "identifier" grammar, but a "token" grammar. As far as I'm aware, no real regularity exists on this, and the XML Nmtoken seems like a pretty good starting point.


From @peter-b:

> We really do not need an N+1th incompatible specification for identifiers. It would be greatly preferable to either:
>
>   • use an identifier grammar production that exactly matches some existing standard
>   • use UAX 31
>
> Any other choice requires an extraordinarily strong justification, which has not been provided here.

Please correct me if I'm mistaken, but this seems like a point that's more about name than about unquoted, yes? If this is the case, replies to the following should be spun off into their own issue rather than being tracked under this PR:

For name, using UAX 31 identifiers would be conceivable as a starting point, provided that the grammar also allowed for a path-splitting character such as ., -, or : to support constructs like html:b or foo.bar. This is not an uncontroversial topic, and when settling on a syntax the WG opted to allow each implementation to figure out for itself e.g. whether the . in foo.bar has any special meaning, rather than making the call at the syntax level. Basing name on XML Name supports this, while other choices would require this discussion to be reopened.

@eemeli eemeli requested a review from aphillips June 30, 2023 02:46
@gibson042
Collaborator

> XML Nmtoken is really close to being exactly what we want, with two specific starting-character exclusions we've needed to carve out.

"Needed" seems like an overly strong characterization. What problems would ensue from updating the syntax to match XML Nmtoken exactly (keeping in mind that it is possible for source to be syntactically valid while still being rejected for higher-level semantic reasons)? And if the deviation is actually necessary, shouldn't the reason be documented in a grammar comment, either directly or with a referencing link?

@eemeli
Collaborator Author

eemeli commented Jun 30, 2023

> What problems would ensue from updating the syntax to match XML Nmtoken exactly (keeping in mind that it is possible for source to be syntactically valid while still being rejected for higher-level semantic reasons)?

XML Nmtoken may start with either the : or - characters, which are function start indicators in MF2. This causes an overlap at the beginning of an expression, which may have an operand. Effectively, matching the rule exactly for unquoted would require us to change the function start characters, or disallow unquoted as an operand; only allowing it as an option value.
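A minimal sketch of the overlap described above, restricted to the ASCII subset of XML `NameChar` (letters, digits, `:`, `_`, `-`, `.`). The names `ASCII_NMTOKEN`, `FUNCTION_SIGILS`, and `collides_with_function_start` are hypothetical, and the sigil set is taken from this comment rather than from the ABNF.

```python
import re

# ASCII-only approximation of XML Nmtoken: one or more NameChar.
# The real production also admits many non-ASCII ranges, omitted here.
ASCII_NMTOKEN = re.compile(r"[:A-Za-z_.0-9-]+")

# Function start indicators, as stated in the comment above (assumption).
FUNCTION_SIGILS = {":", "-"}


def collides_with_function_start(token: str) -> bool:
    """True if token is a valid (ASCII) Nmtoken that begins with a sigil."""
    return ASCII_NMTOKEN.fullmatch(token) is not None and token[0] in FUNCTION_SIGILS


for sample in [":foo", "-foo", "-1", "foo", "html:b"]:
    print(sample, collides_with_function_start(sample))
```

Note that `-1` is itself a valid Nmtoken starting with `-`, which is why negative numbers need either an exception like the one in this PR or different sigils.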

> And if the deviation is actually necessary, shouldn't the reason be documented in a grammar comment, either directly or with a referencing link?

This clarification is provided in the syntax.md, to which this PR adds this text:

To make _unquoted_ literals distinct from _function_ names,
a literal MUST be quoted if it begins with a `:`
or if it begins with a `-` that is **not** followed by a `.` or a digit.

At least my understanding is that our syntax is defined by both message.abnf and syntax.md, and that duplication of the latter into the former would not necessarily be beneficial -- if you're looking for explanations, then syntax.md already provides that, along with the ABNF rules.

Member

@aphillips aphillips left a comment

This addresses my previous comments, but this PR's discussion thread makes me think we should make some changes to the text to clarify what we're doing.

spec/syntax.md Outdated
@@ -462,9 +462,10 @@ except for surrogate code points U+D800 through U+DFFF.
The characters `\` and `|` MUST be escaped as `\\` and `\|`.

**_Unquoted_** literals have a much more restricted range that
Member

I think this is missing some information that could help avoid repeats of the conversation in this PR.

Suggested change
**_Unquoted_** literals have a much more restricted range that
A _literal_ MAY appear **_unquoted_** when its content matches
the `unquoted` production.
Similar to other productions in the message syntax,
the content of an _unquoted_ literal is identical to XML's [Nmtoken]
(https://www.w3.org/TR/xml/#NT-Nmtoken) except that specific characters
meaningful to other portions of the syntax are not allowed at the start. Specifically:
* A _literal_ MUST be quoted if it begins with a `:` (U+003A COLON).
* An _unquoted_ literal MUST NOT begin with a `-` (U+002D HYPHEN MINUS)
unless that character is followed by a `.` (U+002E FULL STOP) or an ASCII digit.
This allows literals which might be used to represent negative numbers, such
as `-42` or `-.345`, to appear unquoted.

Collaborator Author

I'm fine with being more verbose about the description here, but the above suggestion is a bit problematic:

  • The first paragraph is somewhat tautological, given that this is how practically all of the syntax is defined.
  • I don't think we have any other productions that refer to or approximate XML Nmtoken. Our name is rather close to XML Name, but its first-character rules are quite different.

Co-authored-by: Addison Phillips <addisonI18N@gmail.com>
@gibson042
Collaborator

gibson042 commented Jul 7, 2023

> XML Nmtoken may start with either the : or - characters, which are function start indicators in MF2. This causes an overlap at the beginning of an expression, which may have an operand. Effectively, matching the rule exactly for unquoted would require us to change the function start characters, or disallow unquoted as an operand; only allowing it as an option value.

If XML Nmtoken is sufficiently important to use as a foundation for unquoted literals, then having overlap between it and function prefix sigils seems like an unnecessary misstep. Without dragging this too far into the weeds, I will note that most ASCII punctuation characters are excluded from NameChar and therefore available for that use in MF2, particularly all of those that are currently included in reserved-start = "!" / "@" / "#" / "%" / "^" / "&" / "*" / "<" / ">" / "?" / "~". Accordingly, of the three PRs relating to unquoted negative number operands, I think {#open}...{/close} of #398 heads in the best direction (although it would still need to replace {:functionName} with something like {!functionName} so that valid Nmtoken instances like ":functionName" can be used as unquoted operands).
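The claim that the `reserved-start` characters fall outside `NameChar` can be checked with a short sketch. For ASCII input only the ASCII members of XML `NameChar` matter, so the set below is exact for these candidates; `ASCII_NAME_CHARS` and `free_sigils` are hypothetical names used for illustration.

```python
import string

# ASCII members of XML NameChar: letters, digits, ":", "_", "-", ".".
ASCII_NAME_CHARS = set(string.ascii_letters + string.digits + ":_-.")

# Characters listed in reserved-start in the comment above.
RESERVED_START = "!@#%^&*<>?~"


def free_sigils(candidates: str) -> list[str]:
    """Return the candidates that can never begin (or appear in) an Nmtoken."""
    return [ch for ch in candidates if ch not in ASCII_NAME_CHARS]


print(free_sigils(RESERVED_START))  # every reserved-start character is free
print(free_sigils(":-+"))           # of the sigils discussed here, only "+" is free
```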

@aphillips
Member

@gibson042 That's a good point. If we want to claim that our naming is rooted in XML, why not just go all the way and use Name and Nmtoken and choose sigils (there are plenty available) that don't conflict with these? The main thing is that we'd need to change our (arbitrary) choice of : too. It would greatly simplify the logic behind our naming and perhaps our syntax.

@macchiati
Member

macchiati commented Jul 8, 2023

The important issue is that our syntax be as natural and as little error-prone as possible. I don't see that there is much if any value to adhering strictly to Nmtoken if that adherence compromises the syntax, as long as we have an unambiguous definition.

@gibson042
Collaborator

Sure, but are you claiming that adopting Nmtoken would compromise the syntax? Is {:foo} really more natural than {!foo}/{@foo}/{#foo}/etc.? Is it really less error prone to accept {-1 :bar} while rejecting {+1 :bar}?
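The `-1` versus `+1` asymmetry raised here can be reproduced with a rough sketch that combines an ASCII-only Nmtoken approximation with the start rule from this PR. The name `can_be_unquoted` is hypothetical and the check is not the actual `unquoted` production, which also covers non-ASCII characters.

```python
import re

# ASCII-only approximation of XML Nmtoken (assumption: non-ASCII omitted).
ASCII_NMTOKEN = re.compile(r"[:A-Za-z_.0-9-]+")


def can_be_unquoted(literal: str) -> bool:
    """Approximate the proposed rule for ASCII literals."""
    if ASCII_NMTOKEN.fullmatch(literal) is None:
        return False  # e.g. "+" is not a NameChar at all
    if literal[0] == ":":
        return False
    if literal[0] == "-":
        # Exception from this PR: "-" is fine before "." or a digit.
        return len(literal) > 1 and literal[1] in ".0123456789"
    return True


for sample in ["-1", "+1", "1", "-foo", "13.0", "html:b"]:
    print(sample, can_be_unquoted(sample))
```

Under these assumptions `-1` is accepted while `+1` is rejected, because `+` is not part of `NameChar` in the first place.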

@aphillips
Member

@macchiati I agree. However, if we can achieve full compatibility with our chosen namespace scheme by choosing a different (somewhat randomly chosen) sigil, doesn't that (slightly) reduce potential errors? The syntax is only required to be internally consistent, but it's better if it is also widely compatible with the contexts in which it will appear and if exterior software requires only minimal modification to interact with our constructs.

Of course, it isn't as if the world is constructed around Nmtoken. It's nice that our naming matches something, but that something is not exactly like most programming language namespaces and it will conflict with naming conventions established in some other resource, templating, scripting, or programming contexts somewhere. As a reminder, the reason we chose Nmtoken (and Name) was to maximize compatibility with (potential) LDML constructs, since CLDR uses XML. The "carve out" for various sigils doesn't conflict with naming in LDML currently and future conflicts are under our control. If we don't care about other XML dialects, non-conformance with Nmtoken (et al) may ultimately not matter. Most users will name their variables and functions something rational (probably an ASCII alphanumeric token).

@stasm
Collaborator

stasm commented Sep 18, 2023

I continue to believe that this PR treats the symptom rather than the root cause. I think there are two blockers before we can consider this PR for merging:

@eemeli
Collaborator Author

eemeli commented Sep 21, 2023

Just to note, on the call of 4 September we agreed to merge this on or immediately after the end of the Seville colloquium on 13 September, unless we explicitly decide against doing so. No such decision has been made.

I agree that this is treating the symptom rather than the root cause, but that something very much like this is required for negative number support with our current syntax. We should patch this now, while acknowledging that it may be wholly superseded by later changes.

@stasm
Collaborator

stasm commented Sep 21, 2023

That agreement was based on the assumption that we would have completed the design of open/close in Seville. That didn't happen and the design is still ongoing in #470.

I want this PR to stay open because I want this to be a friction point, so that we don't get complacent with the current proposed design on open/close spannables.

An alternative which I'd be happy to go with instead is to remove the current open/close features from the spec on the main branch, since it's pending a refactor in #470 anyways. We can then fix the negative literals the proper way, and then discuss if we want to compromise in order to enable the + and - syntax for spannables.

@aphillips
Member

We have a significant refactor of the ABNF that's going to be needed due to all of the changes we agreed to in Seville and later.

We do need to incorporate negative number literals (which we have consensus on) at some point. I think your point, @stasm, is that this particular formulation is only needed if we use the sigil - in spannables.

I don't think this PR is necessarily the right friction point for addressing the sigil use. My suggestion would be (a) merge this (knowing that it's potentially throw-away); (b) figure out the more-radical syntax changes (specifically text-mode and namespacing) and implement those in the ABNF; (c) then address any changes for spannable (which might change the sigils). Isn't #470 sufficient as a friction point for the last?

@aphillips aphillips added the Agenda+ Requested for upcoming teleconference label Nov 6, 2023
@aphillips
Member

Held for discussion in 2023-11-13 call

@aphillips
Member

I believe this is now obsolete and should be closed in favor of #553

@aphillips aphillips added Stale Obsolete? and removed Agenda+ Requested for upcoming teleconference labels Dec 4, 2023
@eemeli
Collaborator Author

eemeli commented Dec 4, 2023

Yeah, this is out of date.

@eemeli eemeli closed this Dec 4, 2023
@eemeli eemeli deleted the negativity branch December 4, 2023 20:10