Allow colon in name-start, matching XML Name #483

eemeli · 2023-09-22T07:56:02Z

Ping @gibson042, as I apparently can't request a review from you.

While recently looking at the syntax, I realised that there is no place in our syntax where a leading : in a name actually conflicts with the function sigil :. Using it as a first character is probably still a bad idea.

@stasm, note that this change would preclude us from later allowing multiple annotations in a single expression.

A : as a first character is still not allowed in _unquoted_, where it would indeed conflict with _function_.

stasm

I'd rather see us go the other way: have a custom name, but stick to XML's nmtoken.

Making the name syntax stricter is actually good for interchange: it lowers the risk that other data formats will not be able to represent names easily.

Furthermore, we may need to forbid a certain character from the name-char production, in order to bless it as the namespace separator (#475).

When it comes to names, the source of truth is on our side (MF), and we care about making it expressible in other formats.

OTOH, I'd like to make our nmtoken conform exactly to XML's Nmtoken, because it will be used to represent data defined somewhere else than MF (e.g. LDML).

This, however, would mean giving up on unquoted literals as operands. I filed #478 to document why that's not acceptable (or perhaps to reconsider whether it actually is).

aphillips · 2023-09-22T14:56:30Z

Thanks for thinking about this. I'm kind of sad to just jump into trying to "mend" the namespace. Solving the syntax will help us close out naming, not the other way around.

I agree that we should either (a) commit to some established naming regime or (b) clearly cast off and define our own (and not pretend to be sorta-kinda something else). Some thoughts:

- We have an open design document about namespacing in which the : is suggested as a namespace separator. Yes, there are other characters to consider instead. But any character accepted there will necessarily affect the name syntax.
- I think NCName in XML Namespaces might be a candidate if we go with colon for a namespace separator (this is almost directly opposite this PR, since that definition disallows exactly one character--the : !!).
- In Seville we discussed why UAX31 is not attractive.
- Our tie to XML is somewhat nebulous in any case: it's to ensure that LDML constructs are always supported. But it's doubtful that CLDR will test the limits of XML (and maybe they should be compatible with us in the future 🤣)
- In practice, only a very small set of users will choose non-ASCII non-alphanum identifiers, since there are practical problems with using non-ASCII characters in a translation regime. I think we need to permit non-ASCII, but we don't need to get crazy in doing so. That leaves punctuation characters, which we use a lot of in our syntax. Hence my desire to solve that first.

stasm · 2023-09-22T21:33:10Z

> Solving the syntax will help us close out naming, not the other way around.

Thanks, this sounds aligned with the process I'd like us to follow here:

1. Identify problems (e.g. namespacing, spannables, text-first).
2. Gather the use-cases.
3. Distill requirements from them.
4. Design a solution.
5. Document newly imposed constraints.
6. Change other parts of the syntax according to the constraints.

This is why I've been holding off on the name/nmtoken discussion: we might not even need it if some of our other discussions in flight require us to go with something else.

I'd be OK hitting pause on this PR, too.

stasm · 2023-09-22T21:40:32Z

> Our tie to XML is somewhat nebulous in any case: it's to ensure that LDML constructs are always supported. But it's doubtful that CLDR will test the limits of XML (and maybe they should be compatible with us in the future 🤣)

Right, this is important. Plus, realistically, any LDML troublemaker can still be quoted if needed. OTOH, I feel rather strongly about not differing by a single character from XML's Nmtoken.

Out of curiosity, I browsed the CLDR to look for any such troublemakers (i.e. LDML values which are XML Nmtoken but are not our current nmtoken). I only found -x and -Inf in RBNF.

aphillips · 2023-09-22T22:52:11Z

> OTOH, I feel rather strongly about not differing by a single character from XML's Nmtoken.

Can you clarify why you feel this way? Is it because Nmtoken is important as a namespace construct in its own right? Because Nmtoken is implemented elsewhere, so it will be easier to validate? Or does it have technical qualities that are important to our design?

I tend to think that being "mostly compatible" with some standard, such as Nmtoken, is the same thing as "not compatible", and we should either be compatible or not even mention Nmtoken.

Or is it something else?

Note that switching to NCName (which differs from Nmtoken by almost exactly a single character), as proposed in #475 (for example) would break this.

For reference, I think it's helpful to remind ourselves of what is in Nmtoken:

NameStartChar  ::= ":"  | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] |
                   [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | 
                   [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] |  [#x10000-#xEFFFF]
NameChar       ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
Name           ::= NameStartChar (NameChar)*
Nmtoken        ::= (NameChar)+

And where we are:

name = name-start *name-char
name-start = ALPHA / "_"
           / %xC0-D6 / %xD8-F6 / %xF8-2FF
           / %x370-37D / %x37F-1FFF / %x200C-200D
           / %x2070-218F / %x2C00-2FEF / %x3001-D7FF
           / %xF900-FDCF / %xFDF0-FFFD / %x10000-EFFFF
name-char  = name-start / DIGIT / "-" / "." / ":"
           / %xB7 / %x300-36F / %x203F-2040

We are closer to Name than Nmtoken, because we use name-start. But we're different because we don't permit : at the start. (Note: XML 1.1 is slightly different here.)
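To make the comparison concrete, here is a minimal Python sketch that transcribes the two grammars quoted above into regular expressions. The names `SHARED_START`, `EXTRA`, `xml_name`, `xml_nmtoken` and `mf_name` are made up for this illustration and are not part of any spec or library; the ranges are copied from the productions in this comment rather than re-checked against the XML specification.

```python
import re

# Ranges that appear in both XML NameStartChar and our name-start.
SHARED_START = (
    "A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# Additional characters allowed after the first one: "-", ".", digits, etc.
EXTRA = "\\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

XML_START = ":" + SHARED_START          # XML NameStartChar includes ":"
XML_CHAR = XML_START + EXTRA            # XML NameChar
MF_START = SHARED_START                 # our name-start: no ":"
MF_CHAR = MF_START + EXTRA + ":"        # our name-char: ":" allowed later on

xml_name = re.compile(f"[{XML_START}][{XML_CHAR}]*")
xml_nmtoken = re.compile(f"[{XML_CHAR}]+")
mf_name = re.compile(f"[{MF_START}][{MF_CHAR}]*")

def matches(pattern, text):
    """Return True when the whole string matches the production."""
    return pattern.fullmatch(text) is not None

# A leading colon is the only place where the two Name grammars disagree here.
print(matches(xml_name, ":foo"), matches(mf_name, ":foo"))
# "-x" and "-Inf" (the RBNF values mentioned earlier) are Nmtokens but not Names.
print(matches(xml_nmtoken, "-x"), matches(xml_name, "-x"))
```

With these definitions the two name-char sets are identical, so the whole difference between the two Name productions comes down to whether ":" may appear in first position.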

stasm · 2023-09-23T09:36:58Z

> > I feel rather strongly about not differing by a single character from XML's Nmtoken
>
> Can you clarify why you feel this way? Is it because Nmtoken is important as a namespace construct in its own right? Because Nmtoken is implemented elsewhere, so it will be easier to validate? Or does it have technical qualities that are important to our design?

Thanks for asking. I should have elaborated in my previous comment, but it was already late here. What I meant to say is effectively the same as you did:

> I tend to think that being "mostly compatible" with some standard, such as Nmtoken, is the same thing as "not compatible", and we should either be compatible or not even mention Nmtoken.

The "feel strongly" part was about the fact that today we keep talking about nmtoken when in fact it's not XML's Nmtoken. If we keep referencing it, then I would strongly want us to not differ from XML's definition by just one or a couple of characters.

Then, there's also the matter of principles of design. I don't want us to reinvent concepts where well-established alternatives exist, in particular in matters not directly related to i18n. I think nmtoken is potentially one such alternative for the concept of unquoted literals (are there others?). The 100% compatibility with LDML is a nice touch, but as noted, we have a solution for that: quoted literals.

Related to the above point is this:

> Because Nmtoken is implemented elsewhere, so it will be easier to validate?

This is a nice side effect of reusing well-established concepts. It's likely not enough on its own to be the reason for sticking to XML's Nmtoken, but it's an example of additional benefits that we can reap if we do.

Most importantly, I agree with you that we should first solve other issues currently in flight and then come back here and figure out what we want name and nmtoken to be like.

aphillips

We should fix this by choosing XML Name's namespace instead of trying to force-fit XML Name.

aphillips · 2023-11-12T16:45:44Z

spec/syntax.md

+The namespace for _name_ matches XML's [Name](https://www.w3.org/TR/xml/#NT-Name).
+
+As `:` is also used as the start sigil of _function_,
+using a _name_ with it as a first character is NOT RECOMMENDED.


I disagree with this. We have namespacing in another PR just now and there's a reasonable solution: instead of XML Name, use NCName from XML Namespaces as the basis. The definition of NCName is exactly "Name minus the : character".
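As a rough illustration of why a colon-free name helps namespacing, the sketch below splits a qualified name on a single colon, in the style of the XML Namespaces QName production (`Prefix ':' LocalPart`). It is deliberately simplified: `NCNAME` here covers only the ASCII subset of NCName, and `split_qname` is a hypothetical helper, not something from the spec or this PR.

```python
import re

# ASSUMPTION: ASCII-only approximation of NCName; the non-ASCII ranges of
# XML Name are omitted to keep the example short.
NCNAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"

# Optional prefix followed by a colon, then a local part; both are NCNames.
QNAME = re.compile(rf"(?:({NCNAME}):)?({NCNAME})")

def split_qname(text):
    """Return (prefix, local) for a qualified name, or None if it is invalid.

    Because neither part may contain ":", at most one colon can appear and
    the split point is never ambiguous.
    """
    m = QNAME.fullmatch(text)
    if m is None:
        return None
    return m.group(1), m.group(2)

print(split_qname("ns:local"))   # prefix and local part
print(split_qname("local"))      # no prefix
print(split_qname(":local"))     # rejected: empty prefix
print(split_qname("a:b:c"))      # rejected: more than one colon
```

If names could themselves start with or contain a colon, as with XML Name, the last two inputs would need extra rules to decide where the namespace ends.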

gibson042 · 2023-11-12

Why should there be such an unnecessarily complex overlap between variable/function/option names and unquoted literals? If the function sigil were just replaced with something that does not appear in XML NameChar then the names could be exactly described by XML Name and the unquoted literals by either 1*name-char (i.e., XML Nmtoken) or by 1*(name-char / "+" / …), in either case forming a strict superset rather than a near-superset to exclude artificially-induced ambiguity w.r.t. colons.

eemeli · 2023-11-29T09:46:11Z

Closing, as this no longer fits with the accepted namespacing design.

Allow colon in name-start, matching XML Name

afb920a

eemeli added the syntax label (Issues related with syntax or ABNF) Sep 22, 2023

eemeli requested review from aphillips and stasm September 22, 2023 07:56

stasm reviewed Sep 22, 2023

gibson042 mentioned this pull request Oct 22, 2023

[* ACTION REQUIRED *] Choosing a Core Syntax #499

Closed

gibson042 mentioned this pull request Nov 8, 2023

Name syntax should align with XML #519

Closed

aphillips requested changes Nov 12, 2023

eemeli closed this Nov 29, 2023

eemeli deleted the normal-name branch January 23, 2024 08:12
