Forbid {�} as a valid expression #576

eemeli · 2023-12-24T11:56:15Z

Implementation feedback from working on updates to the JS messageformat library:

At the moment, the name rule includes U+FFFD as a valid name-start character. This means that {�} is a valid unquoted literal.

This is a potential footgun for implementations that choose to extend our "always emit something, report errors via side channel" formatting behaviour to their stringification API as well. If such a stringifier is asked to stringify a data model with unsupported contents, � is an obvious character to use in the output, especially given how we're already using it. Currently, however, this would mean that the broken output of the stringifier would parse as valid MF2 syntax.

There is no good reason why � should be supported as a name, and broken output should not appear valid.
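To make the footgun concrete, here is a minimal Python sketch. It assumes the name-start rule mirrors XML 1.0's NameStartChar without ":" (the ranges below are copied from the XML spec, not from spec/message.abnf), and `stringify_operand` and `looks_like_unquoted_literal` are hypothetical helpers, not part of any messageformat library.

```python
# Ranges from XML 1.0 NameStartChar, minus ":" (assumed to mirror name-start).
NAME_START = [
    (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6),
    (0xF8, 0x2FF), (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]


def is_name_start(ch: str, forbid_fffd: bool) -> bool:
    """True if ch may start a name; optionally carve out U+FFFD."""
    cp = ord(ch)
    if forbid_fffd and cp == 0xFFFD:
        return False
    return any(lo <= cp <= hi for lo, hi in NAME_START)


def stringify_operand(value: object) -> str:
    # Hypothetical "always emit something" stringifier.
    if isinstance(value, str):
        return "{|" + value + "|}"  # quoted literal (no escaping; sketch only)
    return "{\uFFFD}"  # fallback for unsupported contents


def looks_like_unquoted_literal(expr: str, forbid_fffd: bool) -> bool:
    # Only inspects the first character after "{"; enough to show the problem.
    return (
        len(expr) >= 3
        and expr[0] == "{"
        and expr[-1] == "}"
        and is_name_start(expr[1], forbid_fffd)
    )


broken = stringify_operand(object())
print(looks_like_unquoted_literal(broken, forbid_fffd=False))  # True
print(looks_like_unquoted_literal(broken, forbid_fffd=True))   # False
```

With U+FFFD carved out of the name rule, the fallback output no longer passes for a valid expression, so the failure surfaces at parse time.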

aphillips · 2023-12-24T15:21:53Z

Actually, the error is partly on our part. We want name to be NCName and NCName uses XML's Char production.

Char is defined:

  Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]  /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

We would deviate from this by omitting U+FFFD, whose inclusion I think is an error on NCName's part.
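As a quick check of the production quoted above, the following Python sketch transcribes Char into a predicate. The function name `is_xml_char` is made up for illustration.

```python
def is_xml_char(cp: int) -> bool:
    """Membership test for XML 1.0's Char production, by code point."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


print(is_xml_char(0xFFFD))  # True: U+FFFD is the last code point of its range
print(is_xml_char(0xFFFE))  # False: noncharacter
print(is_xml_char(0xD800))  # False: surrogate
```

Omitting U+FFFD therefore only means ending the `#xE000-#xFFFD` style range one code point earlier, at U+FFFC.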

aphillips

With the minor tweak, this should be merged.

spec/message.abnf

spec/syntax.md

duerst · 2023-12-24T23:07:44Z

This may have to go into a separate issue/PR, but it seems related, so I'm posting it here:
The grammar currently has the following:

simple-start-char = %x0-2D
quoted-char = %x0-5B         ; omit \
reserved-char  = %x00-08        ; omit HTAB and LF

This not only allows CR/LF/tab/FF, which may be okay, but also all kinds of other characters with essentially undefined or in one or another way weird (think e.g. about ESC) behavior. I think excluding those would be very prudent. This would avoid security issues and other nastiness (e.g. searching for invisible differences if one of these characters is invisible).

duerst · 2023-12-25T02:40:05Z

> This not only allows CR/LF/tab/FF, which may be okay, but also all kinds of other characters with essentially undefined or in one or another way weird (think e.g. about ESC) behavior.

On rereading, I realized I should have been a bit more specific. I'm talking about the C0 area, %x00-1F.
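For illustration, a small Python sketch of how a tool might report C0 controls (U+0000 through U+001F) in a message. The helper name `c0_controls` is hypothetical, and whether tab, LF, and CR should be reported at all is a policy choice this sketch does not make.

```python
def c0_controls(message: str) -> list[tuple[int, str]]:
    """Return (index, code point label) for every C0 control in message."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(message)
        if ord(ch) <= 0x1F
    ]


print(c0_controls("Hi\x1b there\t"))  # [(2, 'U+001B'), (9, 'U+0009')]
```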

gibson042 · 2023-12-26T17:06:39Z

> This not only allows CR/LF/tab/FF, which may be okay, but also all kinds of other characters with essentially undefined or in one or another way weird (think e.g. about ESC) behavior.
>
> On rereading, I realized I should have been a bit more specific. I'm talking about the C0 area, %x00-1F.

Rejecting control characters was explicitly considered, cf. #268, #282, and #290. I believe the outcome is explained in syntax.md (emphasis mine):

> The syntax should define as few special characters and sigils as possible. Note that this necessitates extra care when presenting messages for human consumption, because they may contain invisible characters such as U+200B ZERO WIDTH SPACE, control characters such as U+0000 NULL and U+0009 TAB, permanently reserved noncharacters (U+FDD0 through U+FDEF and U+nFFFE and U+nFFFF where n is 0x0 through 0x10), private-use code points (U+E000 through U+F8FF, U+F0000 through U+FFFFD, and U+100000 through U+10FFFD), unassigned code points, and other potentially confusing content.

> Any Unicode code point is allowed [in text], except for surrogate code points U+D800 through U+DFFF inclusive.

> A literal MAY include any Unicode code point except for surrogate code points U+D800 through U+DFFF.
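One way a tool could take the "extra care when presenting messages for human consumption" mentioned in the quoted text is to escape code points by Unicode general category. The Python sketch below is only an assumption about what such a presentation layer might do; `make_visible` and the `\u{...}` notation are illustrative choices, not anything defined by the spec.

```python
import unicodedata

# Cc = control, Cf = format (e.g. U+200B), Co = private use,
# Cn = unassigned, which also covers noncharacters such as U+FDD0.
HIDDEN_CATEGORIES = {"Cc", "Cf", "Co", "Cn"}


def make_visible(message: str) -> str:
    """Replace invisible or confusing code points with a \\u{XXXX} escape."""
    out = []
    for ch in message:
        if unicodedata.category(ch) in HIDDEN_CATEGORIES:
            out.append(f"\\u{{{ord(ch):04X}}}")
        else:
            out.append(ch)
    return "".join(out)


print(make_visible("a\u200bb"))    # a\u{200B}b
print(make_visible("x\x00\ty"))    # x\u{0000}\u{0009}y
```

This is for display only; the escaped form is not MF2 syntax and is not meant to be parsed back.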

duerst · 2024-01-04T05:52:19Z

@gibson042 Thanks for your explanation. I think this is all okay, except when it comes to security considerations. Somewhere in the spec, the fact that a message can contain arbitrary control characters should be clearly called out as a security issue.

aphillips · 2024-01-04T16:29:50Z

@duerst Thanks for the comment. Of course you're right: there are security risks here. For example, bidi-based spoofing is an issue. I created #579 to track this and other security considerations.

> Somewhere in the spec, the fact that a message can contain arbitrary control characters should be clearly called out as a security issue.

Our syntax permits arbitrary control characters to appear inside the pattern portions of a message, but this does not imply anything about the environment in which the message is serialized, stored, or processed. We don't provide character escapes, for example, because it is expected that the host format or system will do so. Most users will see the message as it is layered in their source code format, and this will often make control characters visible as escapes.

Most string types are defined as a sequence of code points or code units with minimal restriction on the characters used. Our goal is that anything you could write as a string resource can be used in a pattern.
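As an example of a host format doing the escaping, the Python sketch below stores a message in JSON using the standard `json` module. JSON is just one possible host format, chosen here for illustration; the message text is made up.

```python
import json

# A message whose pattern contains ESC (U+001B) and ZERO WIDTH SPACE (U+200B).
message = "Hello\x1b {$name}\u200b!"

# With the default ensure_ascii=True, json.dumps escapes control characters
# and all non-ASCII code points, so they become visible in the stored file.
encoded = json.dumps(message)
print(encoded)  # "Hello\u001b {$name}\u200b!"

# Decoding restores the original code points unchanged.
assert json.loads(encoded) == message
```

The message syntax itself never sees the escapes: the host format's parser resolves them before the string reaches the MF2 parser.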

aphillips · 2024-01-05T17:40:00Z

Ship it...

          |\___..--"/
   __..--``        /
  \_______________/

stasm

LGTM.

Would it make sense to treat U+FFFD the same way we treat surrogates, and forbid it globally from simple-start-char (which implies text-char), from quoted-char, and from reserved-char?

aphillips · 2024-01-09T14:08:05Z

@stasm

> Would it make sense to treat U+FFFD the same way we treat surrogates, and forbid it globally from simple-start-char (which implies text-char), from quoted-char, and from reserved-char?

No, because it is a valid text char (can appear in literals).
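The distinction can be sketched in Python as follows. The ranges are an assumption: they take quoted-char to be every code point except the backslash, the pipe, and the surrogates, which is consistent with the partial rule quoted earlier in the thread but is not copied from spec/message.abnf.

```python
def is_quoted_char(cp: int) -> bool:
    """Assumed quoted-char: any code point except backslash, "|", surrogates."""
    return (
        0x0 <= cp <= 0x5B
        or 0x5D <= cp <= 0x7B
        or 0x7D <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0x10FFFF
    )


print(is_quoted_char(0xFFFD))  # True: {|�|} stays a valid quoted literal
print(is_quoted_char(0xD800))  # False: surrogates are excluded everywhere
```

So the change only affects where U+FFFD may appear as part of a name; text and quoted literals can still carry it.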

Forbid {�} as a valid expression

fdb8da3

eemeli added the syntax label Dec 24, 2023

aphillips requested changes Dec 24, 2023

spec/message.abnf (outdated)

spec/syntax.md (outdated)

aphillips added the blocker-candidate, fast-track, normative, and LDML45 labels Dec 24, 2023

aphillips mentioned this pull request Jan 4, 2024

Security considerations section #579

Closed

Drop U+FFFD also from name-char

f556bfb

eemeli requested a review from aphillips January 5, 2024 08:44

aphillips approved these changes Jan 5, 2024


stasm approved these changes Jan 9, 2024


aphillips merged commit a074819 into main Jan 9, 2024

aphillips deleted the forbid-logo branch January 9, 2024 14:10
