-
-
Notifications
You must be signed in to change notification settings - Fork 36
[FEEDBACK] Improving bidi layout #661
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
LRM and RLM are valid |
What I think Mark is trying to ask for is that optional whitespace inside expressions be permitted to contain bidi controls (tools should prefer isolating paired controls for this--specifically LRI/PDI). Pattern exteriors are not a problem because the The same message presented RTL is surprisingly "backwards correct" (but only because I didn't use any LTR options/values or names): We can't just add bidi controls to the whitespace production Current: s = 1*( SP / HTAB / CR / LF / %x3000 ) Modified: s = *1(%x200E / %x200F / %x2066 / %x2067 / %x2068 / %x2069)
1*( SP / HTAB / CR / LF / %x3000 )
*1(%x200E / %x200F / %x2066 / %x2067 / %x2068 / %x2069) The other alternative would be to fix statements and expressions. This requires making the isolate controls optional on the brackets in pairs. Something modeled like the following (note that expression = "{" [s] (literal / variable) [s annotation] *(s attribute) [s] "}"
/ "{" [s] (s annotation) *(s attribute) [s] "}"
/ "{%x2066" [s] (literal / variable) [s annotation] *(s attribute) [s] "%x2069}"
/ "{%x2066" [s] (s annotation) *(s attribute) [s] "%x2069}" (The above does not show how statements need modification.) @eemeli asked:
Yes, that's what he's asking for. Note that these characters are not valid name chars. This also makes them illegal in unquoted literals. I'll add that one does not want to insert these characters into the |
Yes, that's the ask. I marked this as the Preview-Feedback because I don't think this is urgent enough to require addressing in v45. (BTW, @markusicu was the one who caught this.) We might put a note in to the effect that bidi issues will be reviewed during the tech preview. An expansion of %x200E / %x200F / %x2066 / %x2067 / %x2068 / %x2069 with names is below. (Handy for people who haven't got the codes memorized.)
I think the approach of building into the grammar just enough to produce the right effects is very promising; that has the promise of detecting and preventing mistakes. There are some complications that will need enough time to explore, so we definitely don't want to rush this. Note: The error messages for malformed bidi controls needs to be clear, because the characters are normally invisible. A simple approach like "error X on line 10 index 15" (aside from being extremely unfriendly in the first place) is especially not recommended for any line that contains RTL characters or isolates. |
@macchiati called my attention to this; I have skimmed the discussion more quickly than is advisable, and I am too unfamiliar with the context and too busy with a host of other things to be of very much use here, but I should like to point people to the bits of standards I edit that at a glance seemed relevant to me. Mark wrote:
See https://www.unicode.org/reports/tr31/#Whitespace. Addison also wrote about isolates, which may make sense here, since this seems like a case of https://www.unicode.org/reports/tr55/#Interspersed-Syntax, for which the marks are not enough; these are default ignorable, so adding them to the set of whitespace characters as part of a profile with no other qualifications would put them in the set of ignorable format controls. In this context, I would tend to think having the isolates as ignorable format controls is sensible.
On this see https://www.unicode.org/reports/tr55/#Conversion-To-Plain-Text. If you want this to insert isolates it would have to be beefed up a little. I would recommend, whatever you do, to include a conformance statement for https://www.unicode.org/reports/tr31/#R3a, so that there well-understood framework for reviewers and implementers to work with. |
@eggrobin Thanks for this.
I'm not sure how R3 helps? The problem here is that our syntax (which can be single line) uses neutral characters, including some mirroring characters, and allows non-ASCII (including both weak and strong characters in either direction) in identifiers and values.
Actually, beefed up a lot. The pointer to "Conversion to Plain Text" is helpful. In this release, I think we might add LRM as an optional permitted character before the whitespace production in the syntax. This would allow tools to implement this. We could provide a pointer to this algorithm for users as well. Current: ; Whitespace
s = 1*( SP / HTAB / CR / LF / %x3000 ) Proposed: ; Whitespace, optionally prepended with LRM
; see: https://www.unicode.org/reports/tr55/#Conversion-To-Plain-Text
s = [%x200E] 1*( SP / HTAB / CR / LF / %x3000 ) |
Because it allows LRMs optionally by default, next to whitespace and in similar places; but more broadly because it provides a standardized terminology and baseline for definitions of formal syntaxes to describe what kind of invisible characters they allow optionally next to whitespace (and in related places).
Again, this is starting to sound a lot like item 2 of UAX31-R3a; see also https://www.unicode.org/reports/tr31/#Contexts_for_Ignorable_Format_Controls which it references. |
I think the goal of allowing bidi marks to make it easier to understand the structure of the data is a worthwhile goal. But we should also make sure that this doesn't open the door to Trojan source attacks. While MF2 maybe has less attack surface for Trojan source attacks than the average programming language, I don't think it's zero. |
I think we should add a statement about R3a conformance. We are specifically an R3a-2 implementation, since we don't allow all of the whitespace characters in R3a-1. We would need to add LRM and maybe RLM using the syntax I proposed above. This isn't a question so much of Trojan source as it is one of letting tools make RTL messages legible to humans without interfering with the output. @duerst I agree that the attack surface is non-zero. |
This partially addresses #661 by allowing the LRM character in message whitespace. This is whitespace **_outside_** pattern `text`. Tools can use this to help ensure that messages are formatted visually in a way consistent with LTR presentation of a message.
Although the PR for this exists, the WG decided not to include this in LDML45 because it is late-breaking. See meeting notes here |
In MF1.0, we allow RLM and LRM in any position between syntax tokens that is not part of the literal message; those characters are then ignorable in parsing the message. The advantage of this is that many people will be viewing the message in plaintext editors, and the default results for the bidi algorithm is designed for normal text, and can really mess up "code-like" text (as in MF1.0 or MF2.0).
Here is an example of what happens with BIDI text. This is using a tool (mentioned below) that has the convention that UPPERCASE stands for bidi characters so that people can read them, and I used ⁅...⁆ is used in place of {...} because of the tooling.
Example Source:
* ⁅⁅YOU HAVE ⁅$count⁆ NOTIFICATION.⁆⁆
Resulting Display:
⁅⁅.NOITACIFITON ⁅count :number$⁆ EVAH UOY⁆⁆ *
Notice that "YOU ARE" is reversed & NOTIFICATION is reversed (which is the correct order among the pieces), but the $ is jumbled.
but if the line started with 'one' instead of *, you'd get different results.
one ⁅⁅EVAH UOY ⁅$count :number⁆ NOITACIFITON.⁆⁆
If the syntax allows the insertion of an LRM in locations that are not part of the literal message and ignores those LRM, the source can be consistently left-to-right (so that the syntax is in the right order).
* ⁅⁅YOU HAVE ⁅$count :number⁆ NOTIFICATION.⁆⁆
Resulting Display:
* ⁅⁅EVAH UOY ⁅$count :number⁆ NOITACIFITON.⁆⁆
That is still not good for BIDI languages: the ideal would be that the flow of the syntax was LTR, but the flow of the variant message would be consistently RTL, something like:
* ⁅⁅YOU HAVE ⁅$count :number⁆ NOTIFICATION.⁆⁆
Resulting Display:
⁅⁅.NOITACIFITON ⁅$count :number⁆ EVAH UOY⁆⁆ *
A good "MF2" editor or localization tool would use something more sophisticated (we should call that to people's attention in the spec, pointing to https://www.unicode.org/reports/tr55, especially https://www.unicode.org/reports/tr55/#Ordering for applying HL4).
But a lot of people will be using plain-text editors, and allowing the entry of LRM and RLM can make the syntax more readable. (And tooling can insert those characters in the appropriate places automatically, so that plain text looks better.)
The tool is:
https://util.unicode.org/UnicodeJsps/bidi.jsp?a=one+%E2%81%85%E2%81%85YOU+HAVE+%E2%81%85%24count%E2%81%86+NOTIFICATION.%E2%81%86%E2%81%86&p=Auto&hack=on
Notes:
The text was updated successfully, but these errors were encountered: