Add markup as designed #574

eemeli · 2023-12-20T23:55:12Z

This implements markup as designed, including syntax, data model, and formatting changes.

The open/close/value "kind" concept is removed from functions, which now only ever use : as a prefix. The + is added to reserved-annotation-start, but - is not.

Resolution and formatting to a string is defined for markup, but formatting to parts is not, given that we don't define parts.

Noting for @gibson042 in particular: In the data model, I did need to add a type: "expression" on the expressions, so that they could be differentiated from the type: "markup" elements that a pattern may now also hold.

macchiati

Markup-open and markup-close parts SHOULD be paired, but this is not a requirement. A markup-open MAY be present in a message without a corresponding markup-close, and vice versa. Markup-standalone parts are not paired.

There is no mention of whether markup is forced to nest or not. It seems like we should have a statement about it, rather than leaving the question open in reader's minds.

Maybe it would be better to state that whether markup-open and markup-close are paired and/or nested depends on the markup standard that they are following, not this specification.

spec/data-model/README.md

aphillips · 2023-12-21T14:52:44Z

@macchiati

There is no mention of whether markup is forced to nest or not. It seems like we should have a statement about it, rather than leaving the question open in reader's minds.

I agree that we should spell this out. Note that the design document does cover it: markup is neither actually paired nor does it require any specific nesting.

aphillips

A more thorough review. Lots of nitty-gritty details.

spec/data-model/README.md

spec/data-model/message.json

aphillips · 2023-12-21T18:45:29Z

spec/syntax.md

+**_<dfn>Markup</dfn>_** _placeholders_ are _pattern_ parts
+that can be used to represent non-language parts of a _message_,
+such as inline elements or styling that should apply to a span of parts.


I'm not sure about the language here. Perhaps:

Suggested change

**_<dfn>Markup</dfn>_** _placeholders_ are _pattern_ parts

that can be used to represent non-language parts of a _message_,

such as inline elements or styling that should apply to a span of parts.

**_<dfn>Markup</dfn>_** are _placeholders_ in a _pattern_ that

represent externally defined, non-formattable parts of a _message_.

Unless otherwise provisioned, _markup_ is omitted from the formatted _message_.

Examples of markup might include inline styling or translator

instructions to be applied to a span of _message_ parts.

This seems to presume that "formatting" is a process that always produces a string. I don't think that's an accurate portrayal for this spec, which is supporting non-string formatting targets even if it does not explicitly define them.

No, that's a jump. It doesn't say anything about string resolution--it just says that the markup can be omitted ("unless otherwise provisioned"). It doesn't say what form any of that takes. We could expand on the examples, I suppose.

It's still referring to markup as "non-formattable", and states that "markup is omitted [unless] otherwise provisioned". Our design doc, on the other hand, includes this:

message-format-wg/exploration/open-close-placeholders.md

Lines 253 to 259 in 2a69d2a

When formatting to parts (as proposed in <a href="https://github.com/unicode-org/message-format-wg/pull/463">#463</a>),

markup placeholders format to an object including the following properties:

- The `type` of the markup: `"markup-open" | "markup-standalone" | "markup-close"`

- The `name` of the markup, e.g. `"b"` for `{#b}`

- For _markup_,

the `options` with the resolved key-value pairs of the placeholder options

So they are formattable, and are not omitted by default except when formatting to a string.

Also, we did end up includes standalone, for which placeholders like {#img/} or {#button/} are rather likely to end up as concrete parts of the output

You're right, but then... how are img or button different from b?

I think what we might be trying to say is:

Suggested change

**_<dfn>Markup</dfn>_** _placeholders_ are _pattern_ parts

that can be used to represent non-language parts of a _message_,

such as inline elements or styling that should apply to a span of parts.

**_<dfn>Markup</dfn>_** is an implementation-specific

(but otherwise unsupported) mechanism intended to provide for

the inclusion of styling, annotation, or various markup syntaxes

in a _message_ in ways that are compatible with translation.

A _markup_ _placeholder_ can only appear in a _pattern_

and might be omitted or remain unprocessed during the formatting process.

You're right, but then... how are img or button different from b?

They don't have a span of parts within the message that they're wrapping. They explicitly enable the representation of inline elements that should be called out in the explanation of markup.

Markup is an implementation-specific (but otherwise unsupported) mechanism [...]

What does this mean, and why does it need to be called out as the first thing about markup? How is markup "unsupported"?

My understanding of the MF2 position on markup is that we support syntax for it, and define how to represent it in non-string output. I don't really see how it could be more "supported" than that? Defining the output more precisely would've been much easier if we could have included formatted parts in the spec, but alas, that was not to be.

Maybe @mihnita already has some idea how the ICU4J implementation will handle markup, to give us another point of view?

I wrote the above hurriedly yesterday, regretting "unsupported" even as I suggested it. I don't think it's necessarily "wrong" though.

The problem with markup is that we don't provide real support for it. It doesn't use functions per-se and we don't provide anything in the registry for it. Open/close don't mean anything, other than as semantic constructs--we don't require them to be in order or paired.

What we're doing is reserving syntax for a partially specified feature. I think there are three ways that implementations will respond to this: (i) they won't support markup; (ii) they'll provide "pass through" support for it using the data model and/or some other means; or (iii) they'll implement either specific markup support (such as for HTML) or some form of pluggable (generic) support.

All implementations can check the syntax of a placeholder and can resolve any variables. Therefore all implementations MUST emit a syntax error for a malformed placeholder and MUST emit a resolution error if they encounter one.

Let's consider an implementation that supports markup. I think we should provide support for implementers who want to put additional requirements on (specific) markup, such as requiring open/close pairing, requiring open/close to be in order, require specific options (or validate option values). Users will appreciate it if a typo in their markup produces an appropriate error (@stasm has cited this in the past). This suggests that we provide an optional MarkupError that implementations MAY emit.

What about pass through? I think this is what we currently describe, albeit we require formatToString to remove the markup (pass through is to the data model or parts. Here we might want to relax the requirement to allow implementations to decide how to pass markup through (callers might need to reparse the output as a result).

What about non-supporting implementations? They might still emit data model parts or they might eat them. I think we might want to require that they not fail on markup but permit them to not emit the markup.

A key consideration is not to allow markup to become a security attack vector. We don't want <scr{#bad}ipt> to become a script tag (or other abusive sequence) in ways that are hard to detect. This can happen if the markup doesn't collapse to nothing until late.

The last two considered together might look like:

An implementation SHOULD NOT emit the markup placeholder when formatting to string.

An implementation MAY (SHOULD?) emit unrecognized markup placeholders in the data model (i.e. they do something other than formatToString)

An implementation SHOULD NOT apply any additional validation (beyond syntax and resolution) to unrecognized or unsupported markup.

An implementation MAY emit a MarkupError for unrecognized or unsupported markup. (this has to be allowed if implementations that support markup are to identify typos)

I feel strongly about this thread, but think we need to merge the overall changes into the spec. I am filing a separate issue to track this conversation.

spec/syntax.md

stasm · 2023-12-22T00:07:57Z

spec/formatting.md

@@ -12,7 +12,7 @@ their handling during formatting is specified here as well.

 Formatting of a _message_ is defined by the following operations:

- **_Expression Resolution_** determines the value of an _expression_,
+- **_Expression and Markup Resolution_** determines the value of an _expression_ or _markup_,


I think we should make markup resolution a separate bullet in this list. Resolving markup is a very different operations from resolving expressions due to markup's not calling any functions, due to the default ignorable property in formatToString, and due to the fact that it can only happen inside patterns.

In fact, I'm not even sure if we should call it resolution. I guess we still resolve options, but crucially, in its current design, markup is encoded in the data model rather directly.

I think we should make markup resolution a separate bullet in this list. Resolving markup is a very different operations from resolving expressions due to markup's not calling any functions, due to the default ignorable property in formatToString, and due to the fact that it can only happen inside patterns.

I would prefer to keep these together, because the similarities are much greater than the differences.

Markup resolution does not call external functions, but its results may be defined in the same way as for expressions. We also have expressions such as literals and variables which do not necessarily call any functions.

Markup resolution uses the same option resolution as functions.

The behaviour of markup during string formatting is a concern during formatting, but not resolution.

Markup resolution only being used for pattern placeholders does not change how it is done.

In fact, I'm not even sure if we should call it resolution. I guess we still resolve options, but crucially, in its current design, markup is encoded in the data model rather directly.

Could you clarify what you mean by "data model" here? The only data model we define in detail is for the message before formatting, from which we need to select, resolve, and format a pattern during formatting. We left out formatted parts, so the output structure is not well defined.

My first reaction to the draft was like @stasm's: it's okay to be somewhat repetitious in a spec. Keeping a wall between the two similar-but-unlike concepts might be useful.

I didn't make that comment in the end, because there is a lot of commonality and the resolution into different buckets does happen later.

We also have expressions such as literals and variables which do not necessarily call any functions.

Are you sure? If there is a placeholder {|zebra|}, I'm pretty sure that invokes the :string formatting function. If there is a placeholder {$foo}, I'd allow the implementation to introspect foo's type, but I would expect some function to be called in the end. We don't want it to be an error to write messages like Hello {$user}...

If there is a placeholder {|zebra|}, I'm pretty sure that invokes the :string formatting function. If there is a placeholder {$foo}, I'd allow the implementation to introspect foo's type, but I would expect some function to be called in the end.

I agree that this is a reasonable thing for an implementation to do, but I don't think we should expect it to always happen. There are also reasonable further questions, such as: What formatter is called for {|zebra|} if an implementation allows for overriding default formatters, and :string is so overridden? Is it the core :string formatter, or the user-provided :string formatter that handles the zebra?

I think that's a question we ought to not resolve in this spec, leaving it instead to implementations.

I think that's a question we ought to not resolve in this spec, leaving it instead to implementations.

I agree. Hmm... what concrete change are we proposing with this thread?

stasm · 2023-12-22T00:22:22Z

spec/message.abnf

 literal = quoted / unquoted
 variable = "$" name
-function = (":" / "+" / "-") identifier
+function = ":" identifier *(s option)


Please move *(s option) back to the annotation production. It's helpful that variable and function define names. In fact, we should consider renaming them to veriable-name and function-name, respectively.

On the other hand, isn't it also useful for a function to have options? Previously, they were only a part of the annotation, which could effectively have any shape as private-use and reserved annotations.

stasm · 2023-12-22T00:22:41Z

spec/message.abnf

@@ -3,7 +3,7 @@ message = simple-message / complex-message
 simple-message = [simple-start pattern]
 simple-start = simple-start-char / text-escape / placeholder
 pattern = *(text-char / text-escape / placeholder)
-placeholder = expression
+placeholder = expression / markup


I realize this is cosmetic (or academic?), but I think I was assuming we'd instead do:

pattern = *(text-char / text-escape / placeholder / markup) placeholder = expression

That is, that markup won't end up being placeholders, but rather a new type of pattern elements, similar to how you implemented it in the data model. I presume that ABNF will have the most impact on shaping MF2's vocabulary, which is why I think we should be careful here.

Thoughts?

This part of the ABNF was explicitly included in the design doc, and so I'd really rather not change that here. I also think that having a catch-all name for "stuff in a pattern that has curly braces" is useful, and that "placeholder" works for that.

I would strongly prefer discussing this change as a separate follow-up to this PR.

spec/syntax.md

spec/data-model/README.md

Co-authored-by: Addison Phillips <addison@unicode.org> Co-authored-by: Stanisław Małolepszy <sta@malolepszy.org>

spec/syntax.md

macchiati · 2023-12-23T22:35:17Z

I think it is "unsupported" in the sense that MF itself attaches no specific meaning to any of the markup identifiers; it's just a 'pass through' mechanism.

…

On Sat, Dec 23, 2023 at 8:08 AM Eemeli Aro ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In spec/syntax.md <#574 (comment)> : > +**_<dfn>Markup</dfn>_** _placeholders_ are _pattern_ parts +that can be used to represent non-language parts of a _message_, +such as inline elements or styling that should apply to a span of parts. You're right, but then... how are img or button different from b? They don't have a span of parts within the message that they're wrapping. They explicitly enable the representation of inline elements that should be called out in the explanation of *markup*. *Markup* is an implementation-specific (but otherwise unsupported) mechanism [...] What does this mean, and why does it need to be called out as the first thing about markup? How is markup "unsupported"? My understanding of the MF2 position on *markup* is that we support syntax for it, and define how to represent it in non-string output. I don't really see how it could be more "supported" than that? Defining the output more precisely would've been much easier if we could have included formatted parts in the spec, but alas, that was not to be. Maybe @mihnita <https://github.com/mihnita> already has some idea how the ICU4J implementation will handle markup, to give us another point of view? — Reply to this email directly, view it on GitHub <#574 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACJLEMHXC7CVJYL6AXFRR7DYK36ZLAVCNFSM6AAAAABA5STDYCVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTOOJVGQ2TCOJSGQ> . You are receiving this because you were mentioned.Message ID: ***@***.***>

spec/data-model/README.md

gibson042 · 2023-12-24T18:41:26Z

spec/data-model/message.json

-        }
-      },
-      "required": ["name", "value"]
+    "options": {


This works, but aesthetically I'd prefer keeping the shallower approach and just duplicating "options": { "type": "array", "items": { "$ref": "#/$defs/option" } } (it's nice to have an easily-referenced "option" type).

Could you clarify what use case you have in mind for an externally-referrable JSON schema option type?

In my experience (and according to the reference docs), these $defs are meant only for internal use within this schema, and that only the top-level entity here should be considered as an external API.

As I said, it's just aesthetic. Referencing a singular is both shallower and a bit easier to keep in my human memory when contemplating the schema.

gibson042 · 2023-12-24T18:45:35Z

spec/formatting.md

+- For _markup-open_ and _markup_standalone_,
+  the resolved _options_ values after _option resolution_.
+
+The resolution of _markup_ MUST always succeed.


This doesn't look quite right for a use of RFC 2119 "MUST".

The usage here is intentional, at least on my part: An implementation must not allow for the resolution of any {#foo} part to ever fail.

Co-authored-by: Richard Gibson <richard.gibson@gmail.com> Co-authored-by: Addison Phillips <addison@unicode.org>

aphillips

Mostly minor comments. Getting close.

spec/data-model/README.md

aphillips · 2024-01-05T19:26:55Z

spec/data-model/README.md

@@ -214,11 +220,43 @@ that the implementation attaches to that _annotation_.
 ```ts
 interface UnsupportedAnnotation {
  type: "unsupported-annotation";
-  sigil: "!" | "@" | "#" | "%" | "^" | "&" | "*" | "<" | ">" | "/" | "?" | "~";
+  sigil: "!" | "@" | "%" | "^" | "&" | "*" | "+" | "<" | ">" | "?" | "~";


what order is this list in? Probably we should sort the reserved sigils into code point order?

I think it's currently something close to the layout order on a US keyboard? I've no opinion on the order here.

spec/data-model/README.md

spec/formatting.md

aphillips · 2024-01-05T19:38:47Z

spec/formatting.md

+Unlike _functions_, the resolution of _markup_ is not customizable.
+


What I'm saying is: it's not "afterwards" at all, it might be now as part of the resolution, if the implementation chooses to make it work that way.

Instead of saying this (which doesn't really help implementations either way), perhaps state what our limits are:

Suggested change

Unlike _functions_, the resolution of _markup_ is not customizable.

This specification does not fully define how _markup_ is resolved.

Implementations can handle _markup_ according to their own needs, so long as the following

requirements are met.

aphillips · 2024-01-05T19:40:00Z

spec/formatting.md

+When formatting to a string, the default representation of all _markup_
+MUST be an empty string.
+Implementations MAY offer functionality for customizing this,
+such as by emitting XML-ish tags for each _markup_.


aphillips · 2024-01-05T19:40:20Z

spec/message.abnf

+reserved-annotation-start = "!" / "@" / "%" / "*" / "+"
+                          / "<" / ">" / "?" / "~"


As noted above, we should probably sort these

Co-authored-by: Addison Phillips <addison@unicode.org>

mihnita · 2024-01-08T18:23:07Z

spec/message.abnf

+markup       = "{" [s] markup-open [s] ["/"] "}"
+             / "{" [s] markup-close [s] "}"
+markup-open  = "#" identifier *(s option)
+markup-close = "/" identifier


missing markup-standalone ?

It's there as ["/"] in the first line. It's not a separate production because it would require an infinite lookahead.

aphillips

Let's get this important set of changes in and use separate issues to make any necessary modifications.

macchiati

I think the term "shape" is confusing for many readers and may well be misleading: I stumble across a term like "implementation's shape".

It appears that "structure" is what is meant. If so, it should be replaced by that.

If this is merged, I can move this comment to the new issue for cleanup.

aphillips · 2024-01-11T16:46:48Z

@macchiati I think your suggested (editorial) change is worth considering. I am filing an issue on your behalf to track it, so that we can merge this work.

aphillips · 2024-01-11T16:49:55Z

Per the 2024-11-08 teleconference, merging this work. Please use new issues or PRs to address specific items within it.

* Fix merge of #574 and #585 * Fix JSON Schema options definition

Add markup as designed

6738e5e

eemeli requested review from aphillips, stasm and gibson042 December 20, 2023 23:55

macchiati reviewed Dec 21, 2023

View reviewed changes

aphillips reviewed Dec 21, 2023

View reviewed changes

spec/data-model/README.md Outdated Show resolved Hide resolved

spec/data-model/README.md Outdated Show resolved Hide resolved

aphillips requested changes Dec 21, 2023

View reviewed changes

stasm requested changes Dec 22, 2023

View reviewed changes

eemeli commented Dec 22, 2023

View reviewed changes

spec/data-model/README.md Outdated Show resolved Hide resolved

eemeli commented Dec 22, 2023

View reviewed changes

spec/data-model/README.md Outdated Show resolved Hide resolved

eemeli and others added 2 commits December 22, 2023 09:22

Apply suggestions from code review

4a2ebf0

Co-authored-by: Addison Phillips <addison@unicode.org> Co-authored-by: Stanisław Małolepszy <sta@malolepszy.org>

Apply suggestions from code review

6bc7786

eemeli commented Dec 22, 2023

View reviewed changes

spec/syntax.md Outdated Show resolved Hide resolved

eemeli mentioned this pull request Dec 22, 2023

The data model uses "name" for both namespaced and non-namespaced names #575

Closed

Add separate markup-standalone schema in messages.json

24c8cd9

gibson042 reviewed Dec 24, 2023

View reviewed changes

eemeli mentioned this pull request Jan 3, 2024

Update MF2 to match upstream changes messageformat/messageformat#414

Merged

Apply suggestions from code review

a5a3f06

Co-authored-by: Richard Gibson <richard.gibson@gmail.com> Co-authored-by: Addison Phillips <addison@unicode.org>

eemeli requested review from aphillips, gibson042, stasm and macchiati January 5, 2024 18:37

aphillips reviewed Jan 5, 2024

View reviewed changes

Apply suggestions from code review

a96c442

Co-authored-by: Addison Phillips <addison@unicode.org>

mihnita reviewed Jan 8, 2024

View reviewed changes

eemeli mentioned this pull request Jan 10, 2024

Flatten JSON & TS data model Pattern element #585

Merged

aphillips mentioned this pull request Jan 10, 2024

Normative requirements for markup handling #586

Closed

aphillips approved these changes Jan 10, 2024

View reviewed changes

aphillips mentioned this pull request Jan 10, 2024

Sort the reserved sigils, check for completeness #587

Closed

macchiati reviewed Jan 10, 2024

View reviewed changes

Merge branch 'main' into add-markup

da349bd

aphillips merged commit a5d2d89 into main Jan 11, 2024

aphillips deleted the add-markup branch January 11, 2024 16:50

aphillips mentioned this pull request Jan 11, 2024

Replace 'shape' with another term in README and formatting #590

Closed

eemeli added a commit that referenced this pull request Jan 12, 2024

Fix merge of #574 and #585

582029f

eemeli mentioned this pull request Jan 12, 2024

Fix merge of markup PR #591

Merged

aphillips pushed a commit that referenced this pull request Jan 13, 2024

Fix merge of markup PR (#591)

715e99e

* Fix merge of #574 and #585 * Fix JSON Schema options definition

eemeli mentioned this pull request Mar 2, 2024

Drop obsolete mention of "kind" in function-annotation JSON Schema definition #699

Merged

-**_<dfn>Markup</dfn>_** _placeholders_ are _pattern_ parts
-that can be used to represent non-language parts of a _message_,
-such as inline elements or styling that should apply to a span of parts.
+**_<dfn>Markup</dfn>_** are _placeholders_ in a _pattern_ that
+represent externally defined, non-formattable parts of a _message_.
+Unless otherwise provisioned, _markup_ is omitted from the formatted _message_.
+Examples of markup might include inline styling or translator
+instructions to be applied to a span of _message_ parts.

	When formatting to parts (as proposed in <a href="https://github.com/unicode-org/message-format-wg/pull/463">#463</a>),
	markup placeholders format to an object including the following properties:

	- The `type` of the markup: `"markup-open" \| "markup-standalone" \| "markup-close"`
	- The `name` of the markup, e.g. `"b"` for `{#b}`
	- For _markup_,
	the `options` with the resolved key-value pairs of the placeholder options

-**_<dfn>Markup</dfn>_** _placeholders_ are _pattern_ parts
-that can be used to represent non-language parts of a _message_,
-such as inline elements or styling that should apply to a span of parts.
+**_<dfn>Markup</dfn>_** is an implementation-specific
+(but otherwise unsupported) mechanism intended to provide for
+the inclusion of styling, annotation, or various markup syntaxes
+in a _message_ in ways that are compatible with translation.
+A _markup_ _placeholder_ can only appear in a _pattern_
+and might be omitted or remain unprocessed during the formatting process.

		Unlike _functions_, the resolution of _markup_ is not customizable.

-Unlike _functions_, the resolution of _markup_ is not customizable.
+This specification does not fully define how _markup_ is resolved.
+Implementations can handle _markup_ according to their own needs, so long as the following
+requirements are met.

		reserved-annotation-start = "!" / "@" / "%" / "*" / "+"
		/ "<" / ">" / "?" / "~"

Uh oh!

Add markup as designed #574

Add markup as designed #574

Uh oh!

Conversation

eemeli commented Dec 20, 2023

Uh oh!

macchiati left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

aphillips commented Dec 21, 2023

Uh oh!

aphillips left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

macchiati commented Dec 23, 2023 via email

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aphillips left a comment

Choose a reason for hiding this comment

macchiati left a comment •

edited

Loading