Skip to content

Add Bidirectional Isolation section to formatting #315

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 19, 2023

Conversation

eemeli
Copy link
Collaborator

@eemeli eemeli commented Nov 11, 2022

Closes #28

Proposing here effectively the same solution as discussed in the issue.

CC @aharon-lanin, @rxaviers, @macchiati who've commented on the issue but I'm not able to add as a reviewer.

Copy link
Member

@aphillips aphillips left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start.


## Bidirectional Isolation

Where appropriate,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Where appropriate" ... "MUST" seems difficult to implement/test.

I think we should say when it is appropriate. I believe this might be:

When a message part has a direction different from the message pattern string or when the message part's direction is not known...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that "Where appropriate ... " is far too vague to be a conformance MUST.

However, condition is too strong.

  1. There are many instances where isolation is not required, such as a single Arabic word in a placeholder surrounded by Latin. It would be better to supply a functional description. Off the top of my head: w
    • Where the text of the placeholder or surrounding text would rearrange due to the insertion.
  2. There is a significant number of products in use that do not yet support isolates. [Aside: I really wish I'd made the embeddings be isolating in the first place, but water far under the bridge.] For use of MF2.0 with those products, we have to have a fallback position (ugly as that is).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the text following @aphillips's suggestions.

@macchiati Is there an existing algorithm for determining when isolation would not be required? I presume that this would require scanning both the placeholder's stringified value as well as that of its surroundings, and ensuring that neither have any weakly typed characters immediately adjacent to the other.

I would prefer an algorithm that would not require such introspection, even if it does come at a cost of needing to include isolating characters in some situations where they have no effect.

To support products and situations which do not support isolates, would it be sufficient to mandate the default behaviour, and allow for a formatter to provide alternatives? For example, in the JS implementation, I could imagine including an option like:

type MesageFormatOptions = {
  isolate = 'isolate' | 'embed' | 'none' = 'isolate',
  ...
}

This would allow for defaulting to isolation (explicitly spec-defined), while also being able to provide implementation-defined fallback solutions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are thinking along similar lines.

For any hard conformance requirement, I think we should be precise as to the minimum required. That is relatively simple to apply, as long as an implementation has access to an api for the bidi algorithm. And someone could do an analysis to try to come up with a simpler algorithm for this specific task (but such an algorithm would take a lot of testing — there are many edge cases).

However, what we can do make it clear that an implementation can "over-isolate"; add codes that are not needed — the most extreme case would be to isolate every placeholder at every level. That would be an approach that anyone could implement. And there could be intermediate approaches.

## Bidirectional Isolation

Where appropriate,
the formatted string representation of a message MUST isolate message parts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the formatted string representation" doesn't say what happens with formatToParts. Do message segments have lang/dir metadata or do they include isolating controls? I'd prefer metadata (so that one could implement markup instead of controls)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for documenting the parts behavior.
I also prefer metadata, or special parts.
Whatever the "format to parts" will produce, but not raw Unicode bidi controls.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the 'formatToString' functionality, I think we have no choice but to deploy the bidi controls. For the full data model, I agree that metadata is better. (One can argue that the formatToString is just semantic sugar, but I think the end result of processing for formatToString() had better agree with that.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left out the formatted parts representation from this PR because I think that's a separate concern, and a topic on which I find us to have less agreement. In fact, I would be (pleasantly!) surprised if we were able to define a common formatted-part definition that could be shared by all implementations, given how much the design of such an implementation is tied to its surroundings.

Rather than discussing this further here, we should continue in #41 or #272.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there was agreement that the format to parts would return the bidi info as some kind of part / meta on parts, rather than bidi control characters.
If we have agreement on that, can we say it in the spec?
Or did I misunderstand it and there is no agreement?

@mihnita
Copy link
Collaborator

mihnita commented Nov 13, 2022

I also think that a big part of the discussion on #28 was about who is responsible for adding those control characters / special-bidi-control parts (when we format to parts)

This change is saying what should happen, but not who does it, the "engine" or the function (the placeholder formatter).


## Bidirectional Isolation

Where appropriate,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that "Where appropriate ... " is far too vague to be a conformance MUST.

However, condition is too strong.

  1. There are many instances where isolation is not required, such as a single Arabic word in a placeholder surrounded by Latin. It would be better to supply a functional description. Off the top of my head: w
    • Where the text of the placeholder or surrounding text would rearrange due to the insertion.
  2. There is a significant number of products in use that do not yet support isolates. [Aside: I really wish I'd made the embeddings be isolating in the first place, but water far under the bridge.] For use of MF2.0 with those products, we have to have a fallback position (ugly as that is).

## Bidirectional Isolation

Where appropriate,
the formatted string representation of a message MUST isolate message parts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the 'formatToString' functionality, I think we have no choice but to deploy the bidi controls. For the full data model, I agree that metadata is better. (One can argue that the formatToString is just semantic sugar, but I think the end result of processing for formatToString() had better agree with that.)

@duerst
Copy link

duerst commented Nov 14, 2022 via email

@eemeli eemeli requested a review from aphillips November 14, 2022 08:12
@aphillips
Copy link
Member

@duerst

Is there, or can we imagine, a characterization that may not be as precise (as it may contain some false positives) but may be cheaper to check (and of course shouldn't contain any false negatives)?

We only need to consider the first and last characters of the placeable. If they are both strong and both in the same direction as the string, the inside of the string is effectively embedded. However, we also have to consider the characters just before/after the placeable. You can see this in the famous "purple pizza" example here. The placeable ("purple pizza" in Hebrew) is unidirectional, but it interacts with the neutrals and weakly directional numbers just adjacent in the enclosing LTR string.

image

(Interestingly, this example actually has two placeables: presumably the number 4 is runtime formatted. When the number is isolated with LRI/PDI like this the string works 😉--including the reordering of the neutral - in an RTL context. However, this is only true for this string)

@macchiati

Where the text of the placeholder or surrounding text would rearrange due to the insertion.

That would require full bidi analysis of the final text (which includes resolving all of the placeables, as with the - 4 reviews above!) My proposed algorithm is a stab at making the analysis as simple as possible.

There is a significant number of products in use that do not yet support isolates.

Regrettably, yes. The shim is to embed and then restore direction with host string paragraph base direction RLM/LRM at the end of the placeable. That's... ugly (especially when combined with isolates!!) and probably should be a compatibility mode? Worst case the string with isolates performs no worse than it did before (if your platform doesn't support isolates, you may need to put bidi controls and such into your translations)?

Copy link
Member

@aphillips aphillips left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me (since it's what I proposed). The remainder of the discussion of course still has to be agreed.

Copy link
Member

@macchiati macchiati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't take into account the recent conversations.

@mihnita
Copy link
Collaborator

mihnita commented Nov 14, 2022

full bidi analysis of the final text (which includes resolving all of the placeables)

I think that since the discussion was around formatToString and formatToParts, then the placeholders are indeed already resolved.

I don't know about full bidi analysis, or just the optimized one you suggested (first/last chars of the resolved placeholder + adjacent characters before / after the placeholder). We can leave implementation dependent (with the risk that some implementations will wrap things even if not needed?)
But we can assume the placeholders to be resolved.

@aphillips
Copy link
Member

@mihnita

I don't know about full bidi analysis, or just the optimized one you suggested (first/last chars of the resolved placeholder + adjacent characters before / after the placeholder). We can leave implementation dependent (with the risk that some implementations will wrap things even if not needed?)

Actually, my "optimized" version doesn't work. Embedding levels don't break runs in the necessary way. In the purple pizza example, putting RLMs or RLE/PDF characters around the outside doesn't fix the problem. The neutral spaces, hyphen, and weak number "stick" to the trailing end of inserted value (the left visual end) of the RTL run.

I guess my main disagreement with @macchiati is that I don't think it is necessary to dive inside what MF considers a placeable. The "part" in question might contain runs with isolating controls or other characters to get the right presentation, but the isolating wrapper is all MF cares about. I agree with @eemeli that having an option like none makes sense and I agree that you'd need formatToParts to solve this with markup.

Am I right in thinking that MF itself doesn't have "nested" formats? I think we can ignore selectors (such as plural), because these were deliberately set up to choose complete-thought pattern strings. If so and MF only has pattern level placeables then MF only needs to govern isolation of its own placeables and leaves subformat internal bidi considerations to the subformatter (which is fine because its output is isolated from the host pattern), I think we can describe and implement that reliably.

I think bidi analysis to see if the placeable results in reordering is super hard to get right. You want "logical end" runs to switch sides under bidi in lots of cases. I think that the only strings where you can ignore isolation are made entirely of strong and neutral characters--where all of the strongs are the same direction. This could be fast fail, but it's still looking at potentially many character properties...

@mihnita
Copy link
Collaborator

mihnita commented Nov 14, 2022

Actually, my "optimized" version doesn't work.

OK. Then the "where necessary" is problematic, so we can maybe leave it out...

For the "dive inside": I think it is needed.
It is what I've meant by "the functions" should add bidi control info, not the "engine"

MF2 does not have nested constructs the way MF1 had with plural / selectors.
But the result of the placeholders might have a "tree-like" structure that might need bidi info.

For example "We are closed {when :daterange skeleton=yMMMMd}..."
The placeholder is a range. Which has a start / end. And each of the two parts might have a year, month, day.
Possibly with different formatting (so the parts should be able to address those children), and bidi.
Or a list formatter, which might be "... {people :list} ...", and list wold result in each person name potentially wrapped in bidi characters.


About "none", that looks like a value pre-formatting.
The translator (or developer) would specify "none" for the placeholder, and the result (formatToString / formatToParts) would output no bidi info.

But then the values listed there ('isolate' | 'embed' | 'none') are probably not enough.
We have to really be specific about the values pre-format (coming from the dev / translator) and post-format, created by the function / engine, taking into consideration the pre-format bidi options, and the values we format.

@duerst
Copy link

duerst commented Nov 15, 2022

Actually, my "optimized" version doesn't work.

OK. Then the "where necessary" is problematic, so we can maybe leave it out...

But wouldn't that mean that even completely LTR cases (e.g. all English or so) would need BIDI formatting (markup or formatting characters or whatever) when following the spec? I agree with Addison that finding an easy criterion that does exactly the right thing may be difficult or impossible to find. But there should for sure be some criteria that are easy and just "on the right side".

@eemeli
Copy link
Collaborator Author

eemeli commented Nov 15, 2022

@mihnita:
OK. Then the "where necessary" is problematic, so we can maybe leave it out...

@duerst:
But wouldn't that mean that even completely LTR cases (e.g. all English or so) would need BIDI formatting (markup or formatting characters or whatever) when following the spec? I agree with Addison that finding an easy criterion that does exactly the right thing may be difficult or impossible to find. But there should for sure be some criteria that are easy and just "on the right side".

This is why the current proposal starts with this qualifier, which is intended to allow for e.g. completely LTR content to be output without any isolates:

When a message part has a direction different from the message as a whole
or when the part's direction is not known,
the formatted string representation of the message MUST isolate the part in question
with an explicit isolate character: [...]

This is intentionally worded to describe the intended result, and not constrain an implementation's choice of how this is achieved. For the purposes of this language, it would be equally valid for an implementation to do e.g. one of the following, or something else:

  1. When stringifying each individual literal and placeholder, pass in information about the directionality of the whole message, and expect the result to include isolate characters, if necessary. The stringifcation of the whole message would then be the direct concatenation of these strings.
  2. Stringify each literal and placeholder separately and without information about the message's directionality, but expect the result to include metadata about the directionality of the part. Then, when stringifying the whole message, add isolate characters between parts if and as necessary.

We do not need to agree on how an implementation formats its parts in order to agree on default bidi isolation in formatted string output.

Comment on lines 24 to 25
When a message part has a direction different from the message as a whole
or when the part's direction is not known,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
When a message part has a direction different from the message as a whole
or when the part's direction is not known,
When a message part has a direction different from the message as a whole,

@macchiati With reference to your comment #315 (comment), would this be the sort of change you're looking for? Or have I perhaps understood your "minimum" incorrectly?

The intent here would be to reduce the complexity required of implementations, such that they only need to add isolates if they explicitly know that a message and a part have different directionality. For instance, if an instance doesn't have the capability to determine the directionality of a variable $foo, it would not be required to isolate it.

To be clear, my own preference would be to require isolation by default in that case, but to allow for an implementation to be configurable to act differently, isolating either less, more, or somehow differently than defined in the spec.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

such that they only need to add isolates if they explicitly know that a message and a part have different directionality. For instance, if an instance doesn't have the capability to determine the directionality of a variable $foo, it would not be required to isolate it.

This is fundamentally the problem. In order to know that a given part has a different direction and/or requires isolation, you have to run the bidi algorithm over the assembled string and see if it generates multiple "runs" and that run boundaries fall on placeholder boundaries.

To be clear, my own preference would be to require isolation by default in that case, but to allow for an implementation to be configurable to act differently, isolating either less, more, or somehow differently than defined in the spec.

+1

The spec could be written to isolate by default, but allow implementations to optimize their output (that is, omit isolation if they determine it is safe to do so)

I suspect this means that the MUST on line 26 should either be amended with the out I just mentioned or become SHOULD... unless...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best way to state the requirement in L24 and beyond is something like the following.

The formatted string representation MUST ensure that when the BIDI algorithm is applied to the message, the part must behave as if it were surrounded by the appropriate isolate characters. The prefixed isolate character is:

  • U+2066 LEFT-TO-RIGHT ISOLATE if the part is known to have LTR directionality,
  • U+2067 RIGHT-TO-LEFT ISOLATE if the part is known to have RTL directionality, or
  • U+2068 FIRST STRONG ISOLATE if the part's directionality is not certain.

The postfix isolate character is U+2069 POP DIRECTIONAL ISOLATE.

NOTE: This requirement does not imply that BIDI isolates must be present in the resulting text:

  • In the vast majority of cases, they are not necessary and can be omitted. Implementations can have more or less sophisticated algorithms to determine this, but they need to default to using the isolates (or equivalent mechanisms) wherever there is any question.
  • Moreover, many implementations support the older BIDI embedding characters and BIDI marks, so an implementation may need to use a combination of those instead.
    The important requirement is the behavior, not the mechanism used to achieve it.

@aphillips
Copy link
Member

Apologies for a long note.

@duerst said:

But wouldn't that mean that even completely LTR cases (e.g. all English or so) would need BIDI formatting (markup or formatting characters or whatever) when following the spec? I agree with Addison that finding an easy criterion that does exactly the right thing may be difficult or impossible to find. But there should for sure be some criteria that are easy and just "on the right side".

Having mulled is over a bit and having gone back and written some code to play with it...

I think the logic could be: if the placeable has a single directional run and the characters up to and including the nearest non-weak characters before/after the placeable would form a single run, then isolation is not needed. Note that there are 20 different directionality classes. My quick-and-dirty code looks something like this in Java and probably a number of other directionality classes are needed in the switch:

    private static int findPreviousNonNeutral(String input) {
        if (input == null || input.length() == 0) return 0;
        int index = input.indexOf('\ufffc');
        while (--index >= 0) {
            char ch = input.charAt(index);
            switch (Character.getDirectionality(ch)) {
               default: return index;
               case Character.DIRECTIONALITY_OTHER_NEUTRALS:
               case Character.DIRECTIONALITY_UNDEFINED:
               case Character.DIRECTIONALITY_WHITESPACE:
               case Character.DIRECTIONALITY_ARABIC_NUMBER:
               case Character.DIRECTIONALITY_EUROPEAN_NUMBER:
               case Character.DIRECTIONALITY_SEGMENT_SEPARATOR:
               case Character.DIRECTIONALITY_EUROPEAN_NUMBER_SEPARATOR:
                   continue;
            }
        }
        return 0;
    }

Is there a better estimate?

@eemeli noted:

The intent here would be to reduce the complexity required of implementations, such that they only need to add isolates if they explicitly know that a message and a part have different directionality. For instance, if an instance doesn't have the capability to determine the directionality of a variable $foo, it would not be required to isolate it.

Matching the base paragraph direction of the message and the part isn't sufficient to prevent spillover. The part's direction is needed to decide which isolate start control to use. Plenty of inserted strings have a base paragraph direction that matches that of the message but produce spillover effects unless the part is isolated. Also, don't forget that strings have both front and trailing ends that can reorder.

So I think the guidance should be "isolate unless you know you do not have to". The guidance on which isolate controls to use depends on the part's base direction, but the use/non-use of isolates is a separate concern. Perhaps:

When isolation is enabled, implementations MUST insert isolating controls around message parts except when the inserted message part extends the bidirectional run in the message at the point of insertion and does not change existing run boundaries vs. insertion of a blank part.

Note that some strings present correctly in one direction but spillover in the other.


My code to test this looks something like:

    private static void testMyBidiAssertion() {
        for (String val : VALUES_TO_INSERT) {
            ImmutableMap<String,Object> args = ImmutableMap.of("str", val);
            // if this is true then it is possible that no isolates are needed
            System.out.println(shouldBeSingleRun(val));
            for (String pattern : PATTERNS_TO_GO_INTO) {
                com.ibm.icu.text.MessageFormat mf = new com.ibm.icu.text.MessageFormat(pattern);
                
                // use a well-known character to find the placeable's location
                // and thus find the start and end of the substring to test
                String tmp = mf.format(ImmutableMap.of("str", "\ufffc"));
                int start = findPreviousNonNeutral(tmp);
                int end = findNextNonNeutral(tmp) + val.length();
                String tmp2 = mf.format(args).substring(start, end);
                System.out.println(tmp2 + " " + shouldBeSingleRun(tmp2));
                
                // if my assertion holds true, this should produce no spillover
                if (shouldBeSingleRun(tmp2)) {
                    System.out.println("> " + mf.format(args));
                }
            }
        }
    }

    private static boolean shouldBeSingleRun(String input) {
        // The Bidi class is from ICU4J and I am running the latest version
        if (input == null || input.length() == 0) return true;
        Bidi bidi = new Bidi(input, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        Bidi bidiRTL = new Bidi(input, Bidi.DIRECTION_DEFAULT_RIGHT_TO_LEFT);
        // probably something more optimal is possible here
        return (bidi.countRuns() == 1 && bidiRTL.countRuns() == 1);
    }

And some example test values:

    static final String[] VALUES_TO_INSERT = {
        "\u05e4\u05d9\u05e6\u05d4 \u05e1\u05d2\u05d5\u05dc\u05d4", // purple pizza, all RTL
        "purple pizza", // doh
        "\u05e4\u05d9\u05e6\u05d4 \u05e1\u05d2\u05d5\u05dc\u05d4 3", // purple pizza 3
        "104.96.78.63", // numbers
        "1,234.56 AED", // numbers and letters
        "---",          // can go anywhere?
    };
    
    static final String[] PATTERNS_TO_GO_INTO = {
         "abc {str} xzy",         // ltr
         "\u0648 {str} \u0648",   // rtl
         "3 - {str} - 5",         // punctuation separated
         "3-{str}-5",             // non-separated
         "moo ({str}) moo",       // enclosing ltr
         "\u0648 ({str}) \u0648", // enclosing rtl
    };

@eemeli eemeli mentioned this pull request Jan 2, 2023
@eemeli
Copy link
Collaborator Author

eemeli commented Jan 24, 2023

I've refactored the proposal to account for some of the issues discussed above. Specifically:

  • The proposed strategy is now only required to be one of the isolation strategies supported by an implementation.
  • The method of determining a part's directionality is now explicitly left to each implementation.
  • The language/teminology used in the text is more explicitly defined.

@aphillips @macchiati @mihnita One usage scenario that's not supported here is something like a "compatibility mode" which could help guarantee that different implementations provide the same sequence of code points for the same inputs. Should one be added? It would probably work like this:

For every placeholder, prefix its formatted string output with a FIRST STRONG ISOLATE and postfix it with a POP DIRECTIONAL ISOLATE.

While isolating all placeholders would be useless in most cases, the output would at least be predictable.

@eemeli eemeli requested review from aphillips and macchiati January 24, 2023 14:43
@aphillips
Copy link
Member

@eemeli I'll have a look at the revision. On this:

One usage scenario that's not supported here is something like a "compatibility mode" which could help guarantee that different implementations provide the same sequence of code points for the same inputs.

That would be tricky from the point of view that the values being placed could vary by CLDR version or other implementation detail. It would also be better if a function could control the introducing isolate to have the appropriate base direction (RLI or LRI when the direction is known) vs. FSI (when it is not). If I were to introduce a "compatibility mode" it would be to force isolation for every placeholder or to force no isolation for every one such.

@eemeli
Copy link
Collaborator Author

eemeli commented Jan 24, 2023

@aphillips:
It would also be better if a function could control the introducing isolate to have the appropriate base direction (RLI or LRI when the direction is known) vs. FSI (when it is not).

This is what the PR's current strategy does, effectively. The compat mode would be an optional alternative to it, something a bit like the localeMatcher option used by JS Intl constructors.

If I were to introduce a "compatibility mode" it would be to force isolation for every placeholder or to force no isolation for every one such.

Yes, the idea would be to enforce FSI/PDI isolation for every single placeholder in compatibility mode. When enabled, we could be relatively certain that the formatted output would be presented correctly, with no need to know anything about the directionality of the message or the placeholders. The cost of this would be having lots of unnecessary FSI/PDI code points included in the output, but OTOH they'd always be there no matter what the underlying implementation is.

Copy link
Member

@aphillips aphillips left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're making progress. Thanks for the update!

@@ -18,3 +18,38 @@ the local variable takes precedence.

It is an error for a local variable definition to
refer to a local variable that's defined after it in the message.

## Bidirectional Isolation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably call this something more generic, such as "Handling Bidirectional Text"

Comment on lines 31 to 33
Implementations may provide one or more strategies for isolation,
but one such strategy must behave as follows,
when formatting the message as a string:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would reverse the normative statements for clarity to implementers:

Implementations must provide at least one isolation mechanism from the list below. They may provide additional mechanisms in order to allow better overall control from users.

Copy link
Collaborator Author

@eemeli eemeli Jan 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, the PR only presents one mechanism/strategy. If we add the compatibility mode I mention in an earlier comment, I would prefer either requiring both or at least the compatibility mode of all implementations, rather than "at least one".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that there's something not right in how this is current phrased. MAY provide one or more... and one MUST behave as follows... seem to be at odds with each other? Can an implementation provide zero strategies?

I liked the wording suggested by @macchiati in #315 (comment). @aphillips's suggestion above also sounds good to me.

Comment on lines 35 to 39
Considering a "formatted part" to be the formatted string representation of a Placeholder,
when the formatted part has a direction different from the message as a whole
or when its directionality is not known,
the formatted string representation of the message must
prefix the formatted part with one of the following isolate characters:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This leaves a gap for same-base-paragraph-direction placeholders with various edge effects (generally starting or ending with weak runs or runs with neighboring character issues). I admit that we've wrapped ourselves up trying to define when we are required to isolate without having to run the bidi algorithm. I don't have a solution yet, so perhaps keeps this for now, but add a note about the issue??

I would split the term definition of "formatted part" from the requirement.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I see the hole you're describing. Could you give an example?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern string is:

السعر {$price} + {$shipping_amt} الشحن

The currency code for both is USD and the symbol for that is USD in the ar-AE locale. The resulting string is:

السعر 1,234.56 AED + 12.99 USD الشحن

It should be:

السعر ⁧1,234.56 USD⁩ + ⁧12.99 USD⁩ الشحن

Here are the code point sequences with and without RLI/PDI around the prices:

without: \u0627\u0644\u0633\u0639\u0631 1,234.56 USD + 12.99 USD \u0627\u0644\u0634\u062d\u0646
with: \u0627\u0644\u0633\u0639\u0631 \u20671,234.56 USD\u2069 + \u206712.99 USD\u2069 \u0627\u0644\u0634\u062d\u0646

In both cases the base paragraph direction of the pattern string and also each of the placeables is RTL, but the appearance of weak (numbers) and strong (letters) LTR runs confuses bidi unless isolated.

Dates often have this sort of thing too or any string that starts or ends with numeric sequences or enclosing punctuation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see it now, how the formatted value might be nominally RTL but only composed of neutral and LTR characters. It's interesting to play around with this in different browsers, as I see this:

const usd = new Intl.NumberFormat('ar-AE', { style: 'currency', currency: 'USD'}).format(12.99)

// in Chrome & Node.js
usd === "US$ 12.99"
usd.split('').map(c=>c.charCodeAt(0))
// [ 85, 83, 36, 160, 49, 50, 46, 57,  57 ]

// in Firefox
usd === "‏12.99 US$"
usd.split('').map(c=>c.charCodeAt(0))
// [ 8207, 49, 50, 46, 57, 57, 160, 85, 83, 36 ]

Specifically, Firefox prefixes the string with an RTL mark for currencies. It doesn't really help with the isolation, though.

So I agree that this situation is left outside the "must" statement, I think it's potentially ok? As in, at least we don't have a "must not" so that an implementation is allowed to introspect the formatted part and determine that it ought to be isolated.

Comment on lines 48 to 51
An implementation must define its own method for
determining the directionality of a formatted part,
relying either on metadata attached to the formatted part,
or on introspection of its contents.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add one more callout here. Many of the formatted parts are produced by a formatter with a given locale. APIs exist in at least a few platforms for determining the default base direction based on locale. So perhaps:

An implementation must determine the base paragraph direction for each formatted part. These might include a metadata field/method, introspection of the formatted part using UBA [UAX9], or from the locale of the placeholder's formatter (for example, Intl.Locale's textInfo in JavaScript). Formatters are responsible for using whatever mechanism is provided by the implementation to provide appropriate directional information for each formatted part.

Comment on lines 53 to 55
For example, an implementation could provide a `:number` formatting function
which would always produce output matching the message's locale,
allowing for its formatted string representation to never need isolation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're assuming that the message formatter's locale matches the language of the pattern, which is a bad assumption. Many (most!) resource systems fall back or allow sparse population of translated resources. As a result, the locale of the message formatter (and its subformatters) may not match that of the pattern string. This is a feature. It is often the case that developers will wish to test code in a given locale before the translations are available or to use pseudo.

I also note that this paragraph is not an example of the preceding requirement. Perhaps insert this just ahead of this paragraph:

Implementations should not isolate formatted parts when the base paragraph direction of the pattern and the formatted parts match and there are no opposite direction runs of text in the formatted part.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me give an example to verify if I've understood your concern right:

An app using English as its fallback locale is partially localized to Arabic, with some English messages filling in holes. If such a LTR message is formatted using an RTL formatter and it contains a placeholder with a variable reference to an RTL value, we should isolate the placeholder's value in the output, even though the formatter and the value are both RTL.

If this is the case, then I think what we need is a more explicit definition somewhere of the "message locale", which I would consider to be English in the above example, even if the formatter is nominally in Arabic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's more or less effectively what I'm saying. The challenge is that (let's use Java as an example), I have code that looks like this:

ResourceBundle myPatterns = ResourceBundle.getBundle("com.example.Patterns", myRuntimeLocale);
MessageFormat2 formatter = new MessageFormat2(myPatterns.getString("somePattern"), myRuntimeLocale);

If you query myPatterns for its locale, you get myRuntimeLocale, even though the bundle is really a stack of bundles. If myRuntimeLocale is ar-AE, the actual translations probably live in the ar bundle. If the developer has added a new string, though, somePattern may be coming from the ROOT locale bundle (and be in, e.g. English). Both formatter and its subformat patterns are trying to serve the ar-AE locale, though and finding out which specific locale bundle coughed up the pattern is both inaccessible to it (MF2 doesn't know that the pattern is coming from a bundle--it's just a string, as far as it is concerned) and doesn't add any value to instrument in most cases.

The actual language (as opposed to the asserted language/locale of it) is not really a concern for MF2, as long as the pattern is valid. If we wanted to do language-sensitive processing (capitalization, for example), we'd still use a locale for that.

When it comes to bidi, it also doesn't matter as much as the base paragraph direction does, save that the language might serve as a hint for what the direction should be (lacking other information).

@eemeli eemeli added the Agenda+ Requested for upcoming teleconference label Feb 1, 2023
Comment on lines 31 to 33
Implementations may provide one or more strategies for isolation,
but one such strategy must behave as follows,
when formatting the message as a string:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that there's something not right in how this is current phrased. MAY provide one or more... and one MUST behave as follows... seem to be at odds with each other? Can an implementation provide zero strategies?

I liked the wording suggested by @macchiati in #315 (comment). @aphillips's suggestion above also sounds good to me.

but one such strategy must behave as follows,
when formatting the message as a string:

Considering a "formatted part" to be the formatted string representation of a Placeholder,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we not use part to mean the stringified result of formatting. We've sort of established formatToParts() as the name of the method which returns structured formatting results, which implies that a part is one such structured result of formatting. Perhaps formatted elements or formatted pattern elements would work better?

@aphillips
Copy link
Member

@eemeli could you review my and @stasm's comments on the current diff? If those changes were integrated, I'd support merging (as we discussed in the previous teleconference). Thanks!

@eemeli
Copy link
Collaborator Author

eemeli commented Jun 18, 2023

@aphillips @stasm @macchiati I've rebased and rewritten the section based on your comments. Please take another look?

@eemeli eemeli requested review from aphillips, stasm and macchiati and removed request for macchiati June 18, 2023 13:47
Copy link
Member

@aphillips aphillips left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make the one change below.

I have to go out for some hours today, but have in mind additional comments. As noted before, I would be very happy if we merged this in tomorrow's call and then worked on that as a separate PR.

Co-authored-by: Addison Phillips <addisonI18N@gmail.com>
@eemeli eemeli requested a review from aphillips June 18, 2023 20:56
@aphillips aphillips merged commit 0a09bc2 into unicode-org:main Jun 19, 2023
@eemeli eemeli deleted the isolate-bidi branch June 19, 2023 16:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Agenda+ Requested for upcoming teleconference
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for BiDi in placeables
6 participants