-
-
Notifications
You must be signed in to change notification settings - Fork 36
Forbid {�} as a valid expression #576
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Actually, the error is partly on our part. We want Char is defined: Char ::= | #x9 \| #xA \| #xD \| [#x20-#xD7FF] \| [#xE000-#xFFFD] \| [#x10000-#x10FFFF] | /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */ We would deviate from this by omitting FFFD, which I think is an error on NCName's part. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the minor tweak, this should be merged.
This may have to go into a separate issue/PR, but it seems related, so I'm posting it here:
This not only allows CR/LF/tab/FF, which may be okay, but also all kinds of other characters with essentially undefined or in one or another way weird (think e.g. about ESC) behavior. I think excluding those would be very prudent. This would avoid security issues and other nastities (e.g. searching for invisible differences if one of these characters is invisible). |
On rereading, I realized I should have been a bit more specific. I'm talking about the C0 area, %x00-19. |
Rejecting control characters was explicitly considered, cf. #268 and #282 and #290. I believe the outcome is explained in syntax.md (emphasis mine):
|
@gibson042 Thanks for your explanation. I think this is all okay, except when it comes to security considerations. Somewhere in the spec, the fact that a message can contain arbitrary control characters should be clearly called out as a security issue. |
@duerst Thanks for the comment. Of course you're right: there are security risks here. For example, bidi-based spoofing is an issue. I created #579 to track this and other security considerations.
Our syntax permits arbitrary control characters to appear inside the pattern portions of a message, but this does not imply anything about environment in which the message is serialized, stored, or processed. We don't provide character escapes, for example, because it is expected that the host format or system will do so. Most users will see the message as it is layered in their source code format and this will often make control characters visible as escaped. Most string types are defined as a sequence of code points or code units with minimal restriction on the characters used. Our goal is that anything you could write as a string resource can be used in a pattern. |
Ship it...
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Would it make sense to treat U+FFFD the same way we treat surrogates, and forbid it globally from simple-start-char
(which implies text-char
), from quoted-char
, and from reserved-char
?
No, because it is a valid text char (can appear in literals). |
Implementation feedback from working on updates to the JS
messageformat
library:At the moment, the
name
rule includes U+FFFD as a validname-start
character. This means that{�}
is a valid unquoted literal.This is a potential footgun for implementations that choose to extend our "always emit something, report errors via side channel" formatting behaviour also to their stringification API. If such a stringifier is asked to stringify a data model with unsupported contents,
�
is an obvious character to use in the output, especially given how we're already using it. Currently, however, this would mean that the broken output of the stringifier would parse as valid MF2 syntax.There is no good reason why
�
should be supported as aname
, and broken output should not appear valid.