-
-
Notifications
You must be signed in to change notification settings - Fork 36
Add markup as designed #574
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Markup-open and markup-close parts SHOULD be paired, but this is not a requirement. A markup-open MAY be present in a message without a corresponding markup-close, and vice versa. Markup-standalone parts are not paired.
There is no mention of whether markup is forced to nest or not. It seems like we should have a statement about it, rather than leaving the question open in reader's minds.
Maybe it would be better to state that whether markup-open and markup-close are paired and/or nested depends on the markup standard that they are following, not this specification.
I agree that we should spell this out. Note that the design document does cover it: markup is neither actually paired nor does it require any specific nesting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A more thorough review. Lots of nitty-gritty details.
**_<dfn>Markup</dfn>_** _placeholders_ are _pattern_ parts | ||
that can be used to represent non-language parts of a _message_, | ||
such as inline elements or styling that should apply to a span of parts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about the language here. Perhaps:
**_<dfn>Markup</dfn>_** _placeholders_ are _pattern_ parts | |
that can be used to represent non-language parts of a _message_, | |
such as inline elements or styling that should apply to a span of parts. | |
**_<dfn>Markup</dfn>_** are _placeholders_ in a _pattern_ that | |
represent externally defined, non-formattable parts of a _message_. | |
Unless otherwise provisioned, _markup_ is omitted from the formatted _message_. | |
Examples of markup might include inline styling or translator | |
instructions to be applied to a span of _message_ parts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to presume that "formatting" is a process that always produces a string. I don't think that's an accurate portrayal for this spec, which is supporting non-string formatting targets even if it does not explicitly define them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that's a jump. It doesn't say anything about string resolution--it just says that the markup can be omitted ("unless otherwise provisioned"). It doesn't say what form any of that takes. We could expand on the examples, I suppose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's still referring to markup as "non-formattable", and states that "markup is omitted [unless] otherwise provisioned". Our design doc, on the other hand, includes this:
message-format-wg/exploration/open-close-placeholders.md
Lines 253 to 259 in 2a69d2a
When formatting to parts (as proposed in <a href="https://github.com/unicode-org/message-format-wg/pull/463">#463</a>), | |
markup placeholders format to an object including the following properties: | |
- The `type` of the markup: `"markup-open" | "markup-standalone" | "markup-close"` | |
- The `name` of the markup, e.g. `"b"` for `{#b}` | |
- For _markup_, | |
the `options` with the resolved key-value pairs of the placeholder options |
So they are formattable, and are not omitted by default except when formatting to a string.
Also, we did end up includes standalone, for which placeholders like {#img/}
or {#button/}
are rather likely to end up as concrete parts of the output
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, but then... how are img
or button
different from b
?
I think what we might be trying to say is:
**_<dfn>Markup</dfn>_** _placeholders_ are _pattern_ parts | |
that can be used to represent non-language parts of a _message_, | |
such as inline elements or styling that should apply to a span of parts. | |
**_<dfn>Markup</dfn>_** is an implementation-specific | |
(but otherwise unsupported) mechanism intended to provide for | |
the inclusion of styling, annotation, or various markup syntaxes | |
in a _message_ in ways that are compatible with translation. | |
A _markup_ _placeholder_ can only appear in a _pattern_ | |
and might be omitted or remain unprocessed during the formatting process. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, but then... how are
img
orbutton
different fromb
?
They don't have a span of parts within the message that they're wrapping. They explicitly enable the representation of inline elements that should be called out in the explanation of markup.
Markup is an implementation-specific (but otherwise unsupported) mechanism [...]
What does this mean, and why does it need to be called out as the first thing about markup? How is markup "unsupported"?
My understanding of the MF2 position on markup is that we support syntax for it, and define how to represent it in non-string output. I don't really see how it could be more "supported" than that? Defining the output more precisely would've been much easier if we could have included formatted parts in the spec, but alas, that was not to be.
Maybe @mihnita already has some idea how the ICU4J implementation will handle markup, to give us another point of view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote the above hurriedly yesterday, regretting "unsupported" even as I suggested it. I don't think it's necessarily "wrong" though.
The problem with markup is that we don't provide real support for it. It doesn't use functions per-se and we don't provide anything in the registry for it. Open/close don't mean anything, other than as semantic constructs--we don't require them to be in order or paired.
What we're doing is reserving syntax for a partially specified feature. I think there are three ways that implementations will respond to this: (i) they won't support markup; (ii) they'll provide "pass through" support for it using the data model and/or some other means; or (iii) they'll implement either specific markup support (such as for HTML) or some form of pluggable (generic) support.
All implementations can check the syntax of a placeholder and can resolve any variables. Therefore all implementations MUST emit a syntax error for a malformed placeholder and MUST emit a resolution error if they encounter one.
Let's consider an implementation that supports markup. I think we should provide support for implementers who want to put additional requirements on (specific) markup, such as requiring open/close pairing, requiring open/close to be in order, require specific options (or validate option values). Users will appreciate it if a typo in their markup produces an appropriate error (@stasm has cited this in the past). This suggests that we provide an optional MarkupError that implementations MAY emit.
What about pass through? I think this is what we currently describe, albeit we require formatToString
to remove the markup (pass through is to the data model or parts. Here we might want to relax the requirement to allow implementations to decide how to pass markup through (callers might need to reparse the output as a result).
What about non-supporting implementations? They might still emit data model parts or they might eat them. I think we might want to require that they not fail on markup but permit them to not emit the markup.
A key consideration is not to allow markup to become a security attack vector. We don't want <scr{#bad}ipt>
to become a script tag (or other abusive sequence) in ways that are hard to detect. This can happen if the markup doesn't collapse to nothing until late.
The last two considered together might look like:
- An implementation SHOULD NOT emit the markup placeholder when formatting to string.
- An implementation MAY (SHOULD?) emit unrecognized markup placeholders in the data model (i.e. they do something other than formatToString)
- An implementation SHOULD NOT apply any additional validation (beyond syntax and resolution) to unrecognized or unsupported markup.
- An implementation MAY emit a MarkupError for unrecognized or unsupported markup. (this has to be allowed if implementations that support markup are to identify typos)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel strongly about this thread, but think we need to merge the overall changes into the spec. I am filing a separate issue to track this conversation.
@@ -12,7 +12,7 @@ their handling during formatting is specified here as well. | |||
|
|||
Formatting of a _message_ is defined by the following operations: | |||
|
|||
- **_Expression Resolution_** determines the value of an _expression_, | |||
- **_Expression and Markup Resolution_** determines the value of an _expression_ or _markup_, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should make markup resolution a separate bullet in this list. Resolving markup is a very different operations from resolving expressions due to markup's not calling any functions, due to the default ignorable property in formatToString
, and due to the fact that it can only happen inside patterns.
In fact, I'm not even sure if we should call it resolution. I guess we still resolve options, but crucially, in its current design, markup is encoded in the data model rather directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should make markup resolution a separate bullet in this list. Resolving markup is a very different operations from resolving expressions due to markup's not calling any functions, due to the default ignorable property in
formatToString
, and due to the fact that it can only happen inside patterns.
I would prefer to keep these together, because the similarities are much greater than the differences.
- Markup resolution does not call external functions, but its results may be defined in the same way as for expressions. We also have expressions such as literals and variables which do not necessarily call any functions.
- Markup resolution uses the same option resolution as functions.
- The behaviour of markup during string formatting is a concern during formatting, but not resolution.
- Markup resolution only being used for pattern placeholders does not change how it is done.
In fact, I'm not even sure if we should call it resolution. I guess we still resolve options, but crucially, in its current design, markup is encoded in the data model rather directly.
Could you clarify what you mean by "data model" here? The only data model we define in detail is for the message before formatting, from which we need to select, resolve, and format a pattern during formatting. We left out formatted parts, so the output structure is not well defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My first reaction to the draft was like @stasm's: it's okay to be somewhat repetitious in a spec. Keeping a wall between the two similar-but-unlike concepts might be useful.
I didn't make that comment in the end, because there is a lot of commonality and the resolution into different buckets does happen later.
We also have expressions such as literals and variables which do not necessarily call any functions.
Are you sure? If there is a placeholder {|zebra|}
, I'm pretty sure that invokes the :string
formatting function. If there is a placeholder {$foo}
, I'd allow the implementation to introspect foo
's type, but I would expect some function to be called in the end. We don't want it to be an error to write messages like Hello {$user}
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a placeholder
{|zebra|}
, I'm pretty sure that invokes the:string
formatting function. If there is a placeholder{$foo}
, I'd allow the implementation to introspectfoo
's type, but I would expect some function to be called in the end.
I agree that this is a reasonable thing for an implementation to do, but I don't think we should expect it to always happen. There are also reasonable further questions, such as: What formatter is called for {|zebra|}
if an implementation allows for overriding default formatters, and :string
is so overridden? Is it the core :string
formatter, or the user-provided :string
formatter that handles the zebra?
I think that's a question we ought to not resolve in this spec, leaving it instead to implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's a question we ought to not resolve in this spec, leaving it instead to implementations.
I agree. Hmm... what concrete change are we proposing with this thread?
literal = quoted / unquoted | ||
variable = "$" name | ||
function = (":" / "+" / "-") identifier | ||
function = ":" identifier *(s option) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move *(s option)
back to the annotation
production. It's helpful that variable
and function
define names. In fact, we should consider renaming them to veriable-name
and function-name
, respectively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand, isn't it also useful for a function to have options? Previously, they were only a part of the annotation, which could effectively have any shape as private-use and reserved annotations.
@@ -3,7 +3,7 @@ message = simple-message / complex-message | |||
simple-message = [simple-start pattern] | |||
simple-start = simple-start-char / text-escape / placeholder | |||
pattern = *(text-char / text-escape / placeholder) | |||
placeholder = expression | |||
placeholder = expression / markup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize this is cosmetic (or academic?), but I think I was assuming we'd instead do:
pattern = *(text-char / text-escape / placeholder / markup)
placeholder = expression
That is, that markup won't end up being placeholders, but rather a new type of pattern elements, similar to how you implemented it in the data model. I presume that ABNF will have the most impact on shaping MF2's vocabulary, which is why I think we should be careful here.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part of the ABNF was explicitly included in the design doc, and so I'd really rather not change that here. I also think that having a catch-all name for "stuff in a pattern that has curly braces" is useful, and that "placeholder" works for that.
I would strongly prefer discussing this change as a separate follow-up to this PR.
Co-authored-by: Addison Phillips <addison@unicode.org> Co-authored-by: Stanisław Małolepszy <sta@malolepszy.org>
I think it is "unsupported" in the sense that MF itself attaches no
specific meaning to any of the markup identifiers; it's just a 'pass
through' mechanism.
…On Sat, Dec 23, 2023 at 8:08 AM Eemeli Aro ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In spec/syntax.md
<#574 (comment)>
:
> +**_<dfn>Markup</dfn>_** _placeholders_ are _pattern_ parts
+that can be used to represent non-language parts of a _message_,
+such as inline elements or styling that should apply to a span of parts.
You're right, but then... how are img or button different from b?
They don't have a span of parts within the message that they're wrapping.
They explicitly enable the representation of inline elements that should be
called out in the explanation of *markup*.
*Markup* is an implementation-specific (but otherwise unsupported)
mechanism [...]
What does this mean, and why does it need to be called out as the first
thing about markup? How is markup "unsupported"?
My understanding of the MF2 position on *markup* is that we support
syntax for it, and define how to represent it in non-string output. I don't
really see how it could be more "supported" than that? Defining the output
more precisely would've been much easier if we could have included
formatted parts in the spec, but alas, that was not to be.
Maybe @mihnita <https://github.com/mihnita> already has some idea how the
ICU4J implementation will handle markup, to give us another point of view?
—
Reply to this email directly, view it on GitHub
<#574 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACJLEMHXC7CVJYL6AXFRR7DYK36ZLAVCNFSM6AAAAABA5STDYCVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTOOJVGQ2TCOJSGQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
spec/data-model/message.json
Outdated
} | ||
}, | ||
"required": ["name", "value"] | ||
"options": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works, but aesthetically I'd prefer keeping the shallower approach and just duplicating "options": { "type": "array", "items": { "$ref": "#/$defs/option" } }
(it's nice to have an easily-referenced "option" type).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you clarify what use case you have in mind for an externally-referrable JSON schema option type?
In my experience (and according to the reference docs), these $defs
are meant only for internal use within this schema, and that only the top-level entity here should be considered as an external API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said, it's just aesthetic. Referencing a singular is both shallower and a bit easier to keep in my human memory when contemplating the schema.
- For _markup-open_ and _markup_standalone_, | ||
the resolved _options_ values after _option resolution_. | ||
|
||
The resolution of _markup_ MUST always succeed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look quite right for a use of RFC 2119 "MUST".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The usage here is intentional, at least on my part: An implementation must not allow for the resolution of any {#foo}
part to ever fail.
Co-authored-by: Richard Gibson <richard.gibson@gmail.com> Co-authored-by: Addison Phillips <addison@unicode.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly minor comments. Getting close.
@@ -214,11 +220,43 @@ that the implementation attaches to that _annotation_. | |||
```ts | |||
interface UnsupportedAnnotation { | |||
type: "unsupported-annotation"; | |||
sigil: "!" | "@" | "#" | "%" | "^" | "&" | "*" | "<" | ">" | "/" | "?" | "~"; | |||
sigil: "!" | "@" | "%" | "^" | "&" | "*" | "+" | "<" | ">" | "?" | "~"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what order is this list in? Probably we should sort the reserved sigils into code point order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's currently something close to the layout order on a US keyboard? I've no opinion on the order here.
Unlike _functions_, the resolution of _markup_ is not customizable. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I'm saying is: it's not "afterwards" at all, it might be now as part of the resolution, if the implementation chooses to make it work that way.
Instead of saying this (which doesn't really help implementations either way), perhaps state what our limits are:
Unlike _functions_, the resolution of _markup_ is not customizable. | |
This specification does not fully define how _markup_ is resolved. | |
Implementations can handle _markup_ according to their own needs, so long as the following | |
requirements are met. |
When formatting to a string, the default representation of all _markup_ | ||
MUST be an empty string. | ||
Implementations MAY offer functionality for customizing this, | ||
such as by emitting XML-ish tags for each _markup_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
reserved-annotation-start = "!" / "@" / "%" / "*" / "+" | ||
/ "<" / ">" / "?" / "~" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted above, we should probably sort these
Co-authored-by: Addison Phillips <addison@unicode.org>
markup = "{" [s] markup-open [s] ["/"] "}" | ||
/ "{" [s] markup-close [s] "}" | ||
markup-open = "#" identifier *(s option) | ||
markup-close = "/" identifier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing markup-standalone ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's there as ["/"]
in the first line. It's not a separate production because it would require an infinite lookahead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's get this important set of changes in and use separate issues to make any necessary modifications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the term "shape" is confusing for many readers and may well be misleading: I stumble across a term like "implementation's shape".
It appears that "structure" is what is meant. If so, it should be replaced by that.
If this is merged, I can move this comment to the new issue for cleanup.
@macchiati I think your suggested (editorial) change is worth considering. I am filing an issue on your behalf to track it, so that we can merge this work. |
Per the 2024-11-08 teleconference, merging this work. Please use new issues or PRs to address specific items within it. |
This implements markup as designed, including syntax, data model, and formatting changes.
The open/close/value "kind" concept is removed from functions, which now only ever use
:
as a prefix. The+
is added toreserved-annotation-start
, but-
is not.Resolution and formatting to a string is defined for markup, but formatting to parts is not, given that we don't define parts.
Noting for @gibson042 in particular: In the data model, I did need to add a
type: "expression"
on the expressions, so that they could be differentiated from thetype: "markup"
elements that a pattern may now also hold.