Skip to content

Forbid {�} as a valid expression #576

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 9, 2024
Merged

Forbid {�} as a valid expression #576

merged 2 commits into from
Jan 9, 2024

Conversation

eemeli
Copy link
Collaborator

@eemeli eemeli commented Dec 24, 2023

Implementation feedback from working on updates to the JS messageformat library:

At the moment, the name rule includes U+FFFD as a valid name-start character. This means that {�} is a valid unquoted literal.

This is a potential footgun for implementations that choose to extend our "always emit something, report errors via side channel" formatting behaviour also to their stringification API. If such a stringifier is asked to stringify a data model with unsupported contents, is an obvious character to use in the output, especially given how we're already using it. Currently, however, this would mean that the broken output of the stringifier would parse as valid MF2 syntax.

There is no good reason why should be supported as a name, and broken output should not appear valid.

@eemeli eemeli added the syntax Issues related with syntax or ABNF label Dec 24, 2023
@aphillips
Copy link
Member

aphillips commented Dec 24, 2023

Actually, the error is partly on our part. We want name to be NCName and NCName uses XML's Char production.

Char is defined:

  Char ::= | #x9 \| #xA \| #xD \| [#x20-#xD7FF] \| [#xE000-#xFFFD] \| [#x10000-#x10FFFF] | /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

We would deviate from this by omitting FFFD, which I think is an error on NCName's part.

Copy link
Member

@aphillips aphillips left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the minor tweak, this should be merged.

@aphillips aphillips added blocker-candidate The submitter thinks this might be a block for the next release fast-track Editorial change permitted to use fast-track merge rules normative Issue affects normative text in the specification LDML45 LDML45 Release (Tech Preview) labels Dec 24, 2023
@duerst
Copy link

duerst commented Dec 24, 2023

This may have to go into a separate issue/PR, but it seems related, so I'm posting it here:
The grammar currently has the following:

simple-start-char = %x0-2D
quoted-char = %x0-5B         ; omit \
reserved-char  = %x00-08        ; omit HTAB and LF

This not only allows CR/LF/tab/FF, which may be okay, but also all kinds of other characters with essentially undefined or in one or another way weird (think e.g. about ESC) behavior. I think excluding those would be very prudent. This would avoid security issues and other nastities (e.g. searching for invisible differences if one of these characters is invisible).

@duerst
Copy link

duerst commented Dec 25, 2023

This not only allows CR/LF/tab/FF, which may be okay, but also all kinds of other characters with essentially undefined or in one or another way weird (think e.g. about ESC) behavior.

On rereading, I realized I should have been a bit more specific. I'm talking about the C0 area, %x00-19.

@gibson042
Copy link
Collaborator

This not only allows CR/LF/tab/FF, which may be okay, but also all kinds of other characters with essentially undefined or in one or another way weird (think e.g. about ESC) behavior.

On rereading, I realized I should have been a bit more specific. I'm talking about the C0 area, %x00-19.

Rejecting control characters was explicitly considered, cf. #268 and #282 and #290. I believe the outcome is explained in syntax.md (emphasis mine):

The syntax should define as few special characters and sigils as possible. Note that this necessitates extra care when presenting messages for human consumption, because they may contain invisible characters such as U+200B ZERO WIDTH SPACE, control characters such as U+0000 NULL and U+0009 TAB, permanently reserved noncharacters (U+FDD0 through U+FDEF and U+nFFFE and U+nFFFF where n is 0x0 through 0x10), private-use code points (U+E000 through U+F8FF, U+F0000 through U+FFFFD, and U+100000 through U+10FFFD), unassigned code points, and other potentially confusing content.

Any Unicode code point is allowed [in text], except for surrogate code points U+D800 through U+DFFF inclusive.

A literal MAY include any Unicode code point except for surrogate code points U+D800 through U+DFFF.

@duerst
Copy link

duerst commented Jan 4, 2024

@gibson042 Thanks for your explanation. I think this is all okay, except when it comes to security considerations. Somewhere in the spec, the fact that a message can contain arbitrary control characters should be clearly called out as a security issue.

@aphillips
Copy link
Member

@duerst Thanks for the comment. Of course you're right: there are security risks here. For example, bidi-based spoofing is an issue. I created #579 to track this and other security considerations.

Somewhere in the spec, the fact that a message can contain arbitrary control characters should be clearly called out as a security issue.

Our syntax permits arbitrary control characters to appear inside the pattern portions of a message, but this does not imply anything about environment in which the message is serialized, stored, or processed. We don't provide character escapes, for example, because it is expected that the host format or system will do so. Most users will see the message as it is layered in their source code format and this will often make control characters visible as escaped.

Most string types are defined as a sequence of code points or code units with minimal restriction on the characters used. Our goal is that anything you could write as a string resource can be used in a pattern.

@eemeli eemeli requested a review from aphillips January 5, 2024 08:44
@aphillips
Copy link
Member

Ship it...

          |\___..--"/
   __..--``        /
  \_______________/   

Copy link
Collaborator

@stasm stasm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Would it make sense to treat U+FFFD the same way we treat surrogates, and forbid it globally from simple-start-char (which implies text-char), from quoted-char, and from reserved-char?

@aphillips
Copy link
Member

@stasm

Would it make sense to treat U+FFFD the same way we treat surrogates, and forbid it globally from simple-start-char (which implies text-char), from quoted-char, and from reserved-char?

No, because it is a valid text char (can appear in literals).

@aphillips aphillips merged commit a074819 into main Jan 9, 2024
@aphillips aphillips deleted the forbid-logo branch January 9, 2024 14:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocker-candidate The submitter thinks this might be a block for the next release fast-track Editorial change permitted to use fast-track merge rules LDML45 LDML45 Release (Tech Preview) normative Issue affects normative text in the specification syntax Issues related with syntax or ABNF
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants