[FEEDBACK] Improving bidi layout #661

macchiati · 2024-02-17T05:43:12Z

In MF1.0, we allow RLM and LRM in any position between syntax tokens that is not part of the literal message; those characters are then ignorable in parsing the message. The advantage of this is that many people will be viewing the message in plaintext editors, and the default results for the bidi algorithm is designed for normal text, and can really mess up "code-like" text (as in MF1.0 or MF2.0).

Here is an example of what happens with BIDI text. This is using a tool (mentioned below) that has the convention that UPPERCASE stands for bidi characters so that people can read them, and I used ⁅...⁆ is used in place of {...} because of the tooling.

Example Source:
* ⁅⁅YOU HAVE ⁅$count⁆ NOTIFICATION.⁆⁆

Resulting Display:
⁅⁅.NOITACIFITON ⁅count :number$⁆ EVAH UOY⁆⁆ *

Notice that "YOU ARE" is reversed & NOTIFICATION is reversed (which is the correct order among the pieces), but the $ is jumbled.

but if the line started with 'one' instead of *, you'd get different results.

one ⁅⁅EVAH UOY ⁅$count :number⁆ NOITACIFITON.⁆⁆

If the syntax allows the insertion of an LRM in locations that are not part of the literal message and ignores those LRM, the source can be consistently left-to-right (so that the syntax is in the right order).

* ⁅⁅YOU HAVE ⁅$count :number⁆ NOTIFICATION.⁆⁆

Resulting Display:

* ⁅⁅EVAH UOY ⁅$count :number⁆ NOITACIFITON.⁆⁆

That is still not good for BIDI languages: the ideal would be that the flow of the syntax was LTR, but the flow of the variant message would be consistently RTL, something like:

* ⁅⁅YOU HAVE ⁅$count :number⁆ NOTIFICATION.⁆⁆

Resulting Display:
⁅⁅.NOITACIFITON ⁅$count :number⁆ EVAH UOY⁆⁆ *

A good "MF2" editor or localization tool would use something more sophisticated (we should call that to people's attention in the spec, pointing to https://www.unicode.org/reports/tr55, especially https://www.unicode.org/reports/tr55/#Ordering for applying HL4).

But a lot of people will be using plain-text editors, and allowing the entry of LRM and RLM can make the syntax more readable. (And tooling can insert those characters in the appropriate places automatically, so that plain text looks better.)

The tool is:
https://util.unicode.org/UnicodeJsps/bidi.jsp?a=one+%E2%81%85%E2%81%85YOU+HAVE+%E2%81%85%24count%E2%81%86+NOTIFICATION.%E2%81%86%E2%81%86&p=Auto&hack=on

Notes:

What it does not show is that the brackets ⁅...⁆ are mirrored if displayed R to L, so I did that manually above.
If you insert either x or X in the Source (and ignore them in the Reordered output) you can see what the ordering result would be.

eemeli · 2024-02-17T11:12:58Z

LRM and RLM are valid content-char, so if they show up within a pattern the syntax is completely fine with that. But is this about asking for them to be allowed (and ignored?) outside patterns as well, e.g. among variant keys?

aphillips · 2024-02-17T15:49:29Z

What I think Mark is trying to ask for is that optional whitespace inside expressions be permitted to contain bidi controls (tools should prefer isolating paired controls for this--specifically LRI/PDI).

Pattern exteriors are not a problem because the {{ and }} do mirror, but constructs outside of patterns will have mixed direction display issues. The overall syntax is not very bidi friendly, so this message looks badly formed when shown in an LTR context, but is, in fact, syntactically correct:

The same message presented RTL is surprisingly "backwards correct" (but only because I didn't use any LTR options/values or names):

We can't just add bidi controls to the whitespace production s because the controls are invisible and would interfere with visual parsing. They need to be optional.

Current:

s = 1*( SP / HTAB / CR / LF / %x3000 )

Modified:

s    = *1(%x200E / %x200F / %x2066 / %x2067 / %x2068 / %x2069) 
       1*( SP / HTAB / CR / LF / %x3000 ) 
       *1(%x200E / %x200F / %x2066 / %x2067 / %x2068 / %x2069)

The other alternative would be to fix statements and expressions. This requires making the isolate controls optional on the brackets in pairs. Something modeled like the following (note that expression here compresses three productions in the real ABNF):

expression = "{" [s] (literal / variable) [s annotation] *(s attribute) [s] "}"
           / "{" [s] (s annotation) *(s attribute) [s] "}"
           / "{%x2066" [s] (literal / variable) [s annotation] *(s attribute) [s] "%x2069}"
           / "{%x2066" [s] (s annotation) *(s attribute) [s] "%x2069}"

(The above does not show how statements need modification.)

@eemeli asked:

LRM and RLM are valid content-char, so if they show up within a pattern the syntax is completely fine with that. But is this about asking for them to be allowed (and ignored?) outside patterns as well, e.g. among variant keys?

Yes, that's what he's asking for. Note that these characters are not valid name chars. This also makes them illegal in unquoted literals.

I'll add that one does not want to insert these characters into the text part of a pattern for purpose of making the expressions look good, because they would be part of the output and interfere with the actual message display in bad ways. That's why I put the isolating controls inside the brackets above.

macchiati · 2024-02-17T19:39:14Z

Yes, that's the ask. I marked this as the Preview-Feedback because I don't think this is urgent enough to require addressing in v45. (BTW, @markusicu was the one who caught this.) We might put a note in to the effect that bidi issues will be reviewed during the tech preview.

An expansion of %x200E / %x200F / %x2066 / %x2067 / %x2068 / %x2069 with names is below. (Handy for people who haven't got the codes memorized.)

Code point	Name
U+200E	LEFT-TO-RIGHT MARK
U+200F	RIGHT-TO-LEFT MARK
U+2066	LEFT-TO-RIGHT ISOLATE
U+2067	RIGHT-TO-LEFT ISOLATE
U+2068	FIRST STRONG ISOLATE
U+2069	POP DIRECTIONAL ISOLATE

I think the approach of building into the grammar just enough to produce the right effects is very promising; that has the promise of detecting and preventing mistakes. There are some complications that will need enough time to explore, so we definitely don't want to rush this.

Note: The error messages for malformed bidi controls needs to be clear, because the characters are normally invisible. A simple approach like "error X on line 10 index 15" (aside from being extremely unfriendly in the first place) is especially not recommended for any line that contains RTL characters or isolates.

eggrobin · 2024-02-18T00:47:06Z

@macchiati called my attention to this; I have skimmed the discussion more quickly than is advisable, and I am too unfamiliar with the context and too busy with a host of other things to be of very much use here, but I should like to point people to the bits of standards I edit that at a glance seemed relevant to me.

Mark wrote:

allowing the entry of LRM and RLM can make the syntax more readable.

See https://www.unicode.org/reports/tr31/#Whitespace. Addison also wrote about isolates, which may make sense here, since this seems like a case of https://www.unicode.org/reports/tr55/#Interspersed-Syntax, for which the marks are not enough; these are default ignorable, so adding them to the set of whitespace characters as part of a profile with no other qualifications would put them in the set of ignorable format controls.

In this context, I would tend to think having the isolates as ignorable format controls is sensible.

(And tooling can insert those characters in the appropriate places automatically, so that plain text looks better.)

On this see https://www.unicode.org/reports/tr55/#Conversion-To-Plain-Text. If you want this to insert isolates it would have to be beefed up a little.

I would recommend, whatever you do, to include a conformance statement for https://www.unicode.org/reports/tr31/#R3a, so that there well-understood framework for reviewers and implementers to work with.

aphillips · 2024-02-18T17:29:04Z

@eggrobin Thanks for this.

I would recommend, whatever you do, to include a conformance statement for https://www.unicode.org/reports/tr31/#R3a, so that there well-understood framework for reviewers and implementers to work with.

I'm not sure how R3 helps? The problem here is that our syntax (which can be single line) uses neutral characters, including some mirroring characters, and allows non-ASCII (including both weak and strong characters in either direction) in identifiers and values.

If you want this to insert isolates it would have to be beefed up a little.

Actually, beefed up a lot. The pointer to "Conversion to Plain Text" is helpful.

In this release, I think we might add LRM as an optional permitted character before the whitespace production in the syntax. This would allow tools to implement this. We could provide a pointer to this algorithm for users as well.

Current:

; Whitespace
s = 1*( SP / HTAB / CR / LF / %x3000 )

Proposed:

; Whitespace, optionally prepended with LRM
;    see: https://www.unicode.org/reports/tr55/#Conversion-To-Plain-Text
s = [%x200E] 1*( SP / HTAB / CR / LF / %x3000 )

eggrobin · 2024-02-18T18:22:31Z

I'm not sure how R3 helps?

Because it allows LRMs optionally by default, next to whitespace and in similar places; but more broadly because it provides a standardized terminology and baseline for definitions of formal syntaxes to describe what kind of invisible characters they allow optionally next to whitespace (and in related places).

In this release, I think we might add LRM as an optional permitted character before the whitespace production in the syntax. This would allow tools to implement this.

Again, this is starting to sound a lot like item 2 of UAX31-R3a; see also https://www.unicode.org/reports/tr31/#Contexts_for_Ignorable_Format_Controls which it references.
The point of introducing the concept of ignorable format controls was to make it possible to implement that algorithm or ones in a similar vein (the recent changes to UAX31, which include the split of R3 into R3a and R3b and the expansion of Section 4, were made concurrently with the development of UTS55, by the same people, including yours truly; these two documents are designed to work with each other).

duerst · 2024-02-19T08:06:00Z

I think the goal of allowing bidi marks to make it easier to understand the structure of the data is a worthwhile goal.

But we should also make sure that this doesn't open the door to Trojan source attacks. While MF2 maybe has less attack surface for Trojan source attacks than the average programming language, I don't think it's zero.

aphillips · 2024-02-19T16:38:16Z

I think we should add a statement about R3a conformance. We are specifically an R3a-2 implementation, since we don't allow all of the whitespace characters in R3a-1. We would need to add LRM and maybe RLM using the syntax I proposed above. This isn't a question so much of Trojan source as it is one of letting tools make RTL messages legible to humans without interfering with the output.

@duerst I agree that the attack surface is non-zero.

This partially addresses #661 by allowing the LRM character in message whitespace. This is whitespace **_outside_** pattern `text`. Tools can use this to help ensure that messages are formatted visually in a way consistent with LTR presentation of a message.

aphillips · 2024-03-05T14:53:16Z

Although the PR for this exists, the WG decided not to include this in LDML45 because it is late-breaking. See meeting notes here

macchiati added the Preview-Feedback Feedback gathered during the technical preview label Feb 17, 2024

aphillips added Agenda+ Requested for upcoming teleconference LDML45 LDML45 Release (Tech Preview) labels Feb 19, 2024

aphillips mentioned this issue Feb 19, 2024

Fix whitespace conformance to match UAX31 (including permitting LRM/RLM) #673

Closed

aphillips added LDML46 LDML46 Release (Tech Preview - October 2024) and removed Agenda+ Requested for upcoming teleconference labels Feb 26, 2024

aphillips removed the LDML45 LDML45 Release (Tech Preview) label Mar 5, 2024

mihnita added the blocker-candidate The submitter thinks this might be a block for the next release label Aug 28, 2024

aphillips mentioned this issue Sep 11, 2024

Add bidi support and address UAX31/UTS55 requirements #884

Merged

aphillips closed this as completed in #884 Sep 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEEDBACK] Improving bidi layout #661

[FEEDBACK] Improving bidi layout #661

macchiati commented Feb 17, 2024

eemeli commented Feb 17, 2024

aphillips commented Feb 17, 2024

macchiati commented Feb 17, 2024

eggrobin commented Feb 18, 2024 •

edited

Loading

aphillips commented Feb 18, 2024

eggrobin commented Feb 18, 2024 •

edited

Loading

duerst commented Feb 19, 2024

aphillips commented Feb 19, 2024

aphillips commented Mar 5, 2024

[FEEDBACK] Improving bidi layout #661

[FEEDBACK] Improving bidi layout #661

Comments

macchiati commented Feb 17, 2024

eemeli commented Feb 17, 2024

aphillips commented Feb 17, 2024

macchiati commented Feb 17, 2024

eggrobin commented Feb 18, 2024 • edited Loading

aphillips commented Feb 18, 2024

eggrobin commented Feb 18, 2024 • edited Loading

duerst commented Feb 19, 2024

aphillips commented Feb 19, 2024

aphillips commented Mar 5, 2024

eggrobin commented Feb 18, 2024 •

edited

Loading

eggrobin commented Feb 18, 2024 •

edited

Loading