Allow colon in name-start, matching XML Name #483

Closed
wants to merge 1 commit into from

Conversation

eemeli
Collaborator

@eemeli eemeli commented Sep 22, 2023

Ping @gibson042, as I apparently can't request a review from you.

While recently looking at the syntax, I realised that there is no place in our syntax where a leading : in a name actually conflicts with the function sigil :. Using it as a first character is probably still a bad idea.

@stasm, note that this change would preclude us from later allowing multiple annotations in a single expression.

A : as a first character continues to not be allowed in unquoted, where it would indeed conflict with function.

@eemeli eemeli added the syntax Issues related with syntax or ABNF label Sep 22, 2023
@eemeli eemeli requested review from aphillips and stasm September 22, 2023 07:56
Collaborator

@stasm stasm left a comment

I'd rather see us go the other way: have a custom name, but stick to XML's nmtoken.


Making the name syntax stricter is actually good for interchange: it lowers the risk that other data formats will not be able to represent names easily.

Furthermore, we may need to forbid one certain character from the name-char production, in order to bless it as the namespace separator (#475).

When it comes to names, the source of truth is on our side (MF), and we care about making it expressible in other formats.


OTOH, I'd like to make our nmtoken conform exactly to XML's Nmtoken, because it will be used to represent data defined somewhere else than MF (e.g. LDML).

This, however, would mean giving up on unquoted literals as operands. I filed #478 to document why that's not acceptable (or perhaps to reconsider whether it actually is).

@aphillips
Member

Thanks for thinking about this. I'm kind of sad to just jump into trying to "mend" the namespace. Solving the syntax will help us close out naming, not the other way around.

I agree that we should either (a) commit to some established naming regime or (b) clearly cast off and define our own (and not pretend to be sorta-kinda something else). Some thoughts:

  • We have an open design document about namespacing in which the : is suggested as a namespace separator. Yes, there are other characters to consider instead. But any character accepted there will necessarily affect the name syntax.
  • I think NCName in XML Namespaces might be a candidate if we go with colon for a namespace separator (this is almost directly opposite this PR, since that definition disallows exactly one character--the : !!).
  • In Seville we discussed why UAX31 is not attractive.
  • Our tie to XML is somewhat nebulous in any case: it's to ensure that LDML constructs are always supported. But it's doubtful that CLDR will test the limits of XML (and maybe they should be compatible with us in the future 🤣)
  • In practice, only a very small set of users will choose non-ASCII non-alphanum identifiers, since there are practical problems with using non-ASCII characters in a translation regime. I think we need to permit non-ASCII, but we don't need to get crazy in doing so. That leaves punctuation characters, which we use a lot of in our syntax. Hence my desire to solve that first.

@stasm
Collaborator

stasm commented Sep 22, 2023

Solving the syntax will help us close out naming, not the other way around.

Thanks, this sounds aligned with the process I'd like us to follow here:

  1. Identify problems (e.g. namespacing, spannables, text-first).
  2. Gather the use-cases.
  3. Distill requirements from them.
  4. Design a solution.
  5. Document newly imposed constraints.
  6. Change other parts of the syntax according to the constraints.

This is why I've been holding off on the name/nmtoken discussion: we might not even need it if some of our other discussions in flight require us to go with something else.

I'd be OK hitting pause on this PR, too.

@stasm
Collaborator

stasm commented Sep 22, 2023

Our tie to XML is somewhat nebulous in any case: it's to ensure that LDML constructs are always supported. But it's doubtful that CLDR will test the limits of XML (and maybe they should be compatible with us in the future 🤣)

Right, this is important. Plus, realistically, any LDML troublemaker can still be quoted if needed. OTOH, I feel rather strongly about not differing by a single character from XML's Nmtoken.

Out of curiosity, I browsed the CLDR to look for any such troublemakers (i.e. LDML values which are XML Nmtoken but are not our current nmtoken). I only found -x and -Inf in RBNF.

@aphillips
Member

OTOH, I feel rather strongly about not differing by a single character from XML's Nmtoken.

Can you clarify why you feel this way? Is it because Nmtoken is important as a namespace construct in its own right? Because Nmtoken is implemented elsewhere, so it will be easier to validate? Or does it have technical qualities that are important to our design?

I tend to think that being "mostly compatible" with some standard, such as Nmtoken, is the same thing as "not compatible", and we should either be compatible or not even mention Nmtoken.

Or is it something else?

Note that switching to NCName (which differs from Nmtoken by almost exactly a single character), as proposed in #475 (for example) would break this.

For reference, I think it's helpful to remind ourselves of what is in Nmtoken:

NameStartChar  ::= ":"  | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] |
                   [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | 
                   [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] |  [#x10000-#xEFFFF]
NameChar       ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
Name           ::= NameStartChar (NameChar)*
Nmtoken        ::= (NameChar)+

And where we are:

name = name-start *name-char
name-start = ALPHA / "_"
           / %xC0-D6 / %xD8-F6 / %xF8-2FF
           / %x370-37D / %x37F-1FFF / %x200C-200D
           / %x2070-218F / %x2C00-2FEF / %x3001-D7FF
           / %xF900-FDCF / %xFDF0-FFFD / %x10000-EFFFF
name-char  = name-start / DIGIT / "-" / "." / ":"
           / %xB7 / %x300-36F / %x203F-2040

We are closer to Name than Nmtoken, because we use name-start. But we're different because we don't permit : at the start. (Note: XML1.1 is slightly different here)
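
To make the comparison concrete, here is a minimal sketch that transcribes the two grammars quoted above into regular expressions and checks a few strings against each. The names `XML_NAME`, `XML_NMTOKEN`, `MF_NAME`, and `classify` are made up for illustration, and `MF_NAME` reflects only the `name` rule as quoted in this comment, not any later version of the syntax.

```python
import re

# Ranges common to XML NameStartChar and the quoted name-start (":" kept separate).
START_RANGES = [
    (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Additional characters allowed after the first position:
# "-" and ".", digits, #xB7, #x300-#x36F, #x203F-#x2040.
EXTRA_RANGES = [(0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
                (0x300, 0x36F), (0x203F, 0x2040)]
COLON = [(0x3A, 0x3A)]


def char_class(ranges):
    """Build a regex character class from (low, high) code point pairs."""
    body = "".join(f"\\U{lo:08X}-\\U{hi:08X}" for lo, hi in ranges)
    return f"[{body}]"


xml_start = char_class(COLON + START_RANGES)                # NameStartChar
xml_char = char_class(COLON + START_RANGES + EXTRA_RANGES)  # NameChar
mf_start = char_class(START_RANGES)                         # name-start: no ":"
mf_char = xml_char                                          # name-char: same set as NameChar

XML_NAME = re.compile(f"{xml_start}{xml_char}*")
XML_NMTOKEN = re.compile(f"{xml_char}+")
MF_NAME = re.compile(f"{mf_start}{mf_char}*")


def classify(text):
    """Report which of the three productions match the whole string."""
    return {
        "xml_name": XML_NAME.fullmatch(text) is not None,
        "xml_nmtoken": XML_NMTOKEN.fullmatch(text) is not None,
        "mf_name": MF_NAME.fullmatch(text) is not None,
    }


if __name__ == "__main__":
    for sample in [":foo", "foo:bar", "-x", "1abc"]:
        print(sample, classify(sample))
```

Under these assumptions, `:foo` is the only kind of input where XML Name and the quoted `name` rule disagree, while strings such as `-x` match Nmtoken alone.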

@stasm
Collaborator

stasm commented Sep 23, 2023

I feel rather strongly about not differing by a single character from XML's Nmtoken

Can you clarify why you feel this way? Is it because Nmtoken is important as a namespace construct in its own right? Because Nmtoken is implemented elsewhere, so it will be easier to validate? Or does it have technical qualities that are important to our design?

Thanks for asking. I should have elaborated in my previous comment, but it was already late here. What I meant to say is effectively the same as you did:

I tend to think that being "mostly compatible" with some standard, such as Nmtoken, is the same thing as "not compatible", and we should either be compatible or not even mention Nmtoken.

The "feel strongly" part was about the fact that today we keep talking about nmtoken when in fact it's not XML's Nmtoken. If we keep referencing it, then I would strongly want us not to differ from XML's definition by just one or a couple of characters.

Then, there's also the matter of principles of design. I don't want us to reinvent concepts where well-established alternatives exist, in particular in matters not directly related to i18n. I think nmtoken is potentially one such alternative for the concept of unquoted literals (are there others?). The 100% compatibility with LDML is a nice touch, but as noted, we have a solution for that: quoted literals.

Related to the above point is this:

Because Nmtoken is implemented elsewhere, so it will be easier to validate?

This is a nice side effect of reusing well-established concepts. It's likely not enough on its own to be the reason for sticking to XML's Nmtoken, but it's an example of additional benefits that we can reap if we do.


Most importantly, I agree with you that we should first solve other issues currently in flight and then come back here and figure out what we want name and nmtoken to be like.

Member

@aphillips aphillips left a comment

We should fix this by choosing XML Name's namespace instead of trying to force-fit XML Name.

Comment on lines +634 to +637
The namespace for _name_ matches XML's [Name](https://www.w3.org/TR/xml/#NT-Name).

As `:` is also used as the start sigil of _function_,
using a _name_ with it as a first character is NOT RECOMMENDED.
Member

I disagree with this. We have namespacing in another PR just now and there's a reasonable solution: instead of XML Name use XML-Name's NCName as the basis. The definition of NCName is exactly "Name minus the : character"
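
As a rough illustration of that alternative, the sketch below builds an NCName matcher (XML Name with `:` removed from both character sets) and uses `:` purely as a separator, in the way the QName production in XML Namespaces does. The helper `split_qualified` is hypothetical and is not taken from the namespacing PR; it only shows why excluding `:` from the name itself makes the split unambiguous.

```python
import re

# NameStartChar from the XML 1.0 grammar, minus ":" (the NCName start set).
NC_START = (
    r"A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    r"\u0370-\u037D\u037F-\u1FFF\u200C-\u200D\u2070-\u218F"
    r"\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD"
    r"\U00010000-\U000EFFFF"
)
# Characters additionally allowed after the first position.
NC_EXTRA = r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

NCNAME = re.compile(f"[{NC_START}][{NC_START}{NC_EXTRA}]*")


def split_qualified(text):
    """Split 'prefix:local' or 'local' into (prefix, local).

    Hypothetical helper: returns None unless every part is an NCName.
    """
    prefix, sep, local = text.partition(":")
    if not sep:
        # No colon at all: the whole string must be a plain NCName.
        return (None, text) if NCNAME.fullmatch(text) else None
    if NCNAME.fullmatch(prefix) and NCNAME.fullmatch(local):
        return (prefix, local)
    return None


if __name__ == "__main__":
    for sample in ["foo", "ns:foo", ":foo", "a:b:c"]:
        print(repr(sample), split_qualified(sample))
```

With this shape, a leading `:` never belongs to a name, so it cannot collide with the function sigil, and at most one `:` can appear as the separator.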

Collaborator

Why should there be such an unnecessarily complex overlap between variable/function/option names and unquoted literals? If the function sigil were just replaced with something that does not appear in XML NameChar then the names could be exactly described by XML Name and the unquoted literals by either 1*name-char (i.e., XML Nmtoken) or by 1*(name-char / "+" / …), in either case forming a strict superset rather than a near-superset to exclude artificially-induced ambiguity w.r.t. colons.

@eemeli
Collaborator Author

eemeli commented Nov 29, 2023

Closing, as this no longer fits with the accepted namespacing design.

@eemeli eemeli closed this Nov 29, 2023
@eemeli eemeli deleted the normal-name branch January 23, 2024 08:12
Labels
syntax Issues related with syntax or ABNF

4 participants