Skip to content

Add markup as designed #574

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jan 11, 2024
Merged

Add markup as designed #574

merged 7 commits into from
Jan 11, 2024

Conversation

eemeli
Copy link
Collaborator

@eemeli eemeli commented Dec 20, 2023

This implements markup as designed, including syntax, data model, and formatting changes.

The open/close/value "kind" concept is removed from functions, which now only ever use : as a prefix. The + is added to reserved-annotation-start, but - is not.

Resolution and formatting to a string is defined for markup, but formatting to parts is not, given that we don't define parts.

Noting for @gibson042 in particular: In the data model, I did need to add a type: "expression" on the expressions, so that they could be differentiated from the type: "markup" elements that a pattern may now also hold.

Copy link
Member

@macchiati macchiati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Markup-open and markup-close parts SHOULD be paired, but this is not a requirement. A markup-open MAY be present in a message without a corresponding markup-close, and vice versa. Markup-standalone parts are not paired.

There is no mention of whether markup is forced to nest or not. It seems like we should have a statement about it, rather than leaving the question open in reader's minds.

Maybe it would be better to state that whether markup-open and markup-close are paired and/or nested depends on the markup standard that they are following, not this specification.

@aphillips
Copy link
Member

@macchiati

There is no mention of whether markup is forced to nest or not. It seems like we should have a statement about it, rather than leaving the question open in reader's minds.

I agree that we should spell this out. Note that the design document does cover it: markup is neither actually paired nor does it require any specific nesting.

Copy link
Member

@aphillips aphillips left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more thorough review. Lots of nitty-gritty details.

Comment on lines +656 to +658
**_<dfn>Markup</dfn>_** _placeholders_ are _pattern_ parts
that can be used to represent non-language parts of a _message_,
such as inline elements or styling that should apply to a span of parts.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the language here. Perhaps:

Suggested change
**_<dfn>Markup</dfn>_** _placeholders_ are _pattern_ parts
that can be used to represent non-language parts of a _message_,
such as inline elements or styling that should apply to a span of parts.
**_<dfn>Markup</dfn>_** are _placeholders_ in a _pattern_ that
represent externally defined, non-formattable parts of a _message_.
Unless otherwise provisioned, _markup_ is omitted from the formatted _message_.
Examples of markup might include inline styling or translator
instructions to be applied to a span of _message_ parts.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to presume that "formatting" is a process that always produces a string. I don't think that's an accurate portrayal for this spec, which is supporting non-string formatting targets even if it does not explicitly define them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's a jump. It doesn't say anything about string resolution--it just says that the markup can be omitted ("unless otherwise provisioned"). It doesn't say what form any of that takes. We could expand on the examples, I suppose.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still referring to markup as "non-formattable", and states that "markup is omitted [unless] otherwise provisioned". Our design doc, on the other hand, includes this:

When formatting to parts (as proposed in <a href="https://github.com/unicode-org/message-format-wg/pull/463">#463</a>),
markup placeholders format to an object including the following properties:
- The `type` of the markup: `"markup-open" | "markup-standalone" | "markup-close"`
- The `name` of the markup, e.g. `"b"` for `{#b}`
- For _markup_,
the `options` with the resolved key-value pairs of the placeholder options

So they are formattable, and are not omitted by default except when formatting to a string.

Also, we did end up includes standalone, for which placeholders like {#img/} or {#button/} are rather likely to end up as concrete parts of the output

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, but then... how are img or button different from b?

I think what we might be trying to say is:

Suggested change
**_<dfn>Markup</dfn>_** _placeholders_ are _pattern_ parts
that can be used to represent non-language parts of a _message_,
such as inline elements or styling that should apply to a span of parts.
**_<dfn>Markup</dfn>_** is an implementation-specific
(but otherwise unsupported) mechanism intended to provide for
the inclusion of styling, annotation, or various markup syntaxes
in a _message_ in ways that are compatible with translation.
A _markup_ _placeholder_ can only appear in a _pattern_
and might be omitted or remain unprocessed during the formatting process.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, but then... how are img or button different from b?

They don't have a span of parts within the message that they're wrapping. They explicitly enable the representation of inline elements that should be called out in the explanation of markup.

Markup is an implementation-specific (but otherwise unsupported) mechanism [...]

What does this mean, and why does it need to be called out as the first thing about markup? How is markup "unsupported"?

My understanding of the MF2 position on markup is that we support syntax for it, and define how to represent it in non-string output. I don't really see how it could be more "supported" than that? Defining the output more precisely would've been much easier if we could have included formatted parts in the spec, but alas, that was not to be.

Maybe @mihnita already has some idea how the ICU4J implementation will handle markup, to give us another point of view?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote the above hurriedly yesterday, regretting "unsupported" even as I suggested it. I don't think it's necessarily "wrong" though.

The problem with markup is that we don't provide real support for it. It doesn't use functions per-se and we don't provide anything in the registry for it. Open/close don't mean anything, other than as semantic constructs--we don't require them to be in order or paired.

What we're doing is reserving syntax for a partially specified feature. I think there are three ways that implementations will respond to this: (i) they won't support markup; (ii) they'll provide "pass through" support for it using the data model and/or some other means; or (iii) they'll implement either specific markup support (such as for HTML) or some form of pluggable (generic) support.

All implementations can check the syntax of a placeholder and can resolve any variables. Therefore all implementations MUST emit a syntax error for a malformed placeholder and MUST emit a resolution error if they encounter one.

Let's consider an implementation that supports markup. I think we should provide support for implementers who want to put additional requirements on (specific) markup, such as requiring open/close pairing, requiring open/close to be in order, require specific options (or validate option values). Users will appreciate it if a typo in their markup produces an appropriate error (@stasm has cited this in the past). This suggests that we provide an optional MarkupError that implementations MAY emit.

What about pass through? I think this is what we currently describe, albeit we require formatToString to remove the markup (pass through is to the data model or parts. Here we might want to relax the requirement to allow implementations to decide how to pass markup through (callers might need to reparse the output as a result).

What about non-supporting implementations? They might still emit data model parts or they might eat them. I think we might want to require that they not fail on markup but permit them to not emit the markup.

A key consideration is not to allow markup to become a security attack vector. We don't want <scr{#bad}ipt> to become a script tag (or other abusive sequence) in ways that are hard to detect. This can happen if the markup doesn't collapse to nothing until late.

The last two considered together might look like:

  • An implementation SHOULD NOT emit the markup placeholder when formatting to string.
  • An implementation MAY (SHOULD?) emit unrecognized markup placeholders in the data model (i.e. they do something other than formatToString)
  • An implementation SHOULD NOT apply any additional validation (beyond syntax and resolution) to unrecognized or unsupported markup.
  • An implementation MAY emit a MarkupError for unrecognized or unsupported markup. (this has to be allowed if implementations that support markup are to identify typos)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel strongly about this thread, but think we need to merge the overall changes into the spec. I am filing a separate issue to track this conversation.

@@ -12,7 +12,7 @@ their handling during formatting is specified here as well.

Formatting of a _message_ is defined by the following operations:

- **_Expression Resolution_** determines the value of an _expression_,
- **_Expression and Markup Resolution_** determines the value of an _expression_ or _markup_,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make markup resolution a separate bullet in this list. Resolving markup is a very different operations from resolving expressions due to markup's not calling any functions, due to the default ignorable property in formatToString, and due to the fact that it can only happen inside patterns.

In fact, I'm not even sure if we should call it resolution. I guess we still resolve options, but crucially, in its current design, markup is encoded in the data model rather directly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make markup resolution a separate bullet in this list. Resolving markup is a very different operations from resolving expressions due to markup's not calling any functions, due to the default ignorable property in formatToString, and due to the fact that it can only happen inside patterns.

I would prefer to keep these together, because the similarities are much greater than the differences.

  • Markup resolution does not call external functions, but its results may be defined in the same way as for expressions. We also have expressions such as literals and variables which do not necessarily call any functions.
  • Markup resolution uses the same option resolution as functions.
  • The behaviour of markup during string formatting is a concern during formatting, but not resolution.
  • Markup resolution only being used for pattern placeholders does not change how it is done.

In fact, I'm not even sure if we should call it resolution. I guess we still resolve options, but crucially, in its current design, markup is encoded in the data model rather directly.

Could you clarify what you mean by "data model" here? The only data model we define in detail is for the message before formatting, from which we need to select, resolve, and format a pattern during formatting. We left out formatted parts, so the output structure is not well defined.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first reaction to the draft was like @stasm's: it's okay to be somewhat repetitious in a spec. Keeping a wall between the two similar-but-unlike concepts might be useful.

I didn't make that comment in the end, because there is a lot of commonality and the resolution into different buckets does happen later.

We also have expressions such as literals and variables which do not necessarily call any functions.

Are you sure? If there is a placeholder {|zebra|}, I'm pretty sure that invokes the :string formatting function. If there is a placeholder {$foo}, I'd allow the implementation to introspect foo's type, but I would expect some function to be called in the end. We don't want it to be an error to write messages like Hello {$user}...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a placeholder {|zebra|}, I'm pretty sure that invokes the :string formatting function. If there is a placeholder {$foo}, I'd allow the implementation to introspect foo's type, but I would expect some function to be called in the end.

I agree that this is a reasonable thing for an implementation to do, but I don't think we should expect it to always happen. There are also reasonable further questions, such as: What formatter is called for {|zebra|} if an implementation allows for overriding default formatters, and :string is so overridden? Is it the core :string formatter, or the user-provided :string formatter that handles the zebra?

I think that's a question we ought to not resolve in this spec, leaving it instead to implementations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's a question we ought to not resolve in this spec, leaving it instead to implementations.

I agree. Hmm... what concrete change are we proposing with this thread?

literal = quoted / unquoted
variable = "$" name
function = (":" / "+" / "-") identifier
function = ":" identifier *(s option)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move *(s option) back to the annotation production. It's helpful that variable and function define names. In fact, we should consider renaming them to veriable-name and function-name, respectively.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, isn't it also useful for a function to have options? Previously, they were only a part of the annotation, which could effectively have any shape as private-use and reserved annotations.

@@ -3,7 +3,7 @@ message = simple-message / complex-message
simple-message = [simple-start pattern]
simple-start = simple-start-char / text-escape / placeholder
pattern = *(text-char / text-escape / placeholder)
placeholder = expression
placeholder = expression / markup
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this is cosmetic (or academic?), but I think I was assuming we'd instead do:

pattern = *(text-char / text-escape / placeholder / markup)
placeholder = expression

That is, that markup won't end up being placeholders, but rather a new type of pattern elements, similar to how you implemented it in the data model. I presume that ABNF will have the most impact on shaping MF2's vocabulary, which is why I think we should be careful here.

Thoughts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part of the ABNF was explicitly included in the design doc, and so I'd really rather not change that here. I also think that having a catch-all name for "stuff in a pattern that has curly braces" is useful, and that "placeholder" works for that.

I would strongly prefer discussing this change as a separate follow-up to this PR.

eemeli and others added 2 commits December 22, 2023 09:22
Co-authored-by: Addison Phillips <addison@unicode.org>
Co-authored-by: Stanisław Małolepszy <sta@malolepszy.org>
@macchiati
Copy link
Member

macchiati commented Dec 23, 2023 via email

}
},
"required": ["name", "value"]
"options": {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works, but aesthetically I'd prefer keeping the shallower approach and just duplicating "options": { "type": "array", "items": { "$ref": "#/$defs/option" } } (it's nice to have an easily-referenced "option" type).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify what use case you have in mind for an externally-referrable JSON schema option type?

In my experience (and according to the reference docs), these $defs are meant only for internal use within this schema, and that only the top-level entity here should be considered as an external API.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said, it's just aesthetic. Referencing a singular is both shallower and a bit easier to keep in my human memory when contemplating the schema.

- For _markup-open_ and _markup_standalone_,
the resolved _options_ values after _option resolution_.

The resolution of _markup_ MUST always succeed.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look quite right for a use of RFC 2119 "MUST".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usage here is intentional, at least on my part: An implementation must not allow for the resolution of any {#foo} part to ever fail.

Co-authored-by: Richard Gibson <richard.gibson@gmail.com>
Co-authored-by: Addison Phillips <addison@unicode.org>
Copy link
Member

@aphillips aphillips left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly minor comments. Getting close.

@@ -214,11 +220,43 @@ that the implementation attaches to that _annotation_.
```ts
interface UnsupportedAnnotation {
type: "unsupported-annotation";
sigil: "!" | "@" | "#" | "%" | "^" | "&" | "*" | "<" | ">" | "/" | "?" | "~";
sigil: "!" | "@" | "%" | "^" | "&" | "*" | "+" | "<" | ">" | "?" | "~";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what order is this list in? Probably we should sort the reserved sigils into code point order?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's currently something close to the layout order on a US keyboard? I've no opinion on the order here.

Comment on lines +262 to +263
Unlike _functions_, the resolution of _markup_ is not customizable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'm saying is: it's not "afterwards" at all, it might be now as part of the resolution, if the implementation chooses to make it work that way.

Instead of saying this (which doesn't really help implementations either way), perhaps state what our limits are:

Suggested change
Unlike _functions_, the resolution of _markup_ is not customizable.
This specification does not fully define how _markup_ is resolved.
Implementations can handle _markup_ according to their own needs, so long as the following
requirements are met.

Comment on lines +650 to +653
When formatting to a string, the default representation of all _markup_
MUST be an empty string.
Implementations MAY offer functionality for customizing this,
such as by emitting XML-ish tags for each _markup_.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Comment on lines +73 to +74
reserved-annotation-start = "!" / "@" / "%" / "*" / "+"
/ "<" / ">" / "?" / "~"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted above, we should probably sort these

Co-authored-by: Addison Phillips <addison@unicode.org>
markup = "{" [s] markup-open [s] ["/"] "}"
/ "{" [s] markup-close [s] "}"
markup-open = "#" identifier *(s option)
markup-close = "/" identifier
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing markup-standalone ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's there as ["/"] in the first line. It's not a separate production because it would require an infinite lookahead.

Copy link
Member

@aphillips aphillips left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get this important set of changes in and use separate issues to make any necessary modifications.

Copy link
Member

@macchiati macchiati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the term "shape" is confusing for many readers and may well be misleading: I stumble across a term like "implementation's shape".

It appears that "structure" is what is meant. If so, it should be replaced by that.

If this is merged, I can move this comment to the new issue for cleanup.

@aphillips
Copy link
Member

@macchiati I think your suggested (editorial) change is worth considering. I am filing an issue on your behalf to track it, so that we can merge this work.

@aphillips
Copy link
Member

Per the 2024-11-08 teleconference, merging this work. Please use new issues or PRs to address specific items within it.

@aphillips aphillips merged commit a5d2d89 into main Jan 11, 2024
@aphillips aphillips deleted the add-markup branch January 11, 2024 16:50
eemeli added a commit that referenced this pull request Jan 12, 2024
@eemeli eemeli mentioned this pull request Jan 12, 2024
aphillips pushed a commit that referenced this pull request Jan 13, 2024
* Fix merge of #574 and #585

* Fix JSON Schema options definition
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants