proposal: replace first-match with best-match #351
Comments
As far as I can tell, the problem we are trying to solve here has a specific expression in the treatment of plural matching, i.e. whether to select Are we aware of any other selectors which might have different or further needs than plurals, or can we be satisfied that any solution which is determined to be sufficient for plurals is sufficient for all selectors? |
Good callout. I suspect time-based selectors will be similar to plurals, particularly when messages need to decide between floating (aka "local") and incremental (aka "instant" or possibly "zoned" values) or other places where precision varies. Similarly some of the Relative formatters break behavior above or below certain values. Person name formatting might also match patterns based on the fields in a name plus locale factors. And maybe measurements more generally (such as breaking from mg => g => kg or from ounces => pounds)? |
I think the three possible approaches are (in brief):
As I said when this came up, I think that a company like mine could work around the fragility of
BTW, some of your examples really aren't suitable for selectors, but instead need to be internal to a formatter. For people names, there are many different input parameters, and optionally multiple patterns per input parameter. Or take measurement units. They also need to be one level down, because what needs to be supplied is a
Then the appropriate unit and amount of that unit are formatted. That process isn't really appropriate or feasible for the selector mechanism. (I can flesh this out more if there are any questions.) |
+100 The only improvement (?) I would make is to use the sum of squares of the scores. This matches the intuitive idea of distance in the real world. The distance from origin to a point:
In our case there is no need to extract the root, as we only compare the scores. I think the main benefit is that it matches the natural notion of distance, so it is less artificial and probably more intuitive. I posted a more detailed comment, with pseudocode, on the commit (#358): https://github.com/unicode-org/message-format-wg/pull/358/files#r1125011916 |
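The sum-of-squares idea in the comment above can be sketched as follows. This is only an illustration under stated assumptions: the per-selector distance scale (0 for an exact match, 1 for a category match, 2 for the catch-all) and the names `SquaredDistance`, `squaredDistance`, and `closest` are invented for this sketch and do not come from the explainer or from #358. Since only the ordering of scores matters, the square root is never taken.

```java
import java.util.Arrays;
import java.util.List;

class SquaredDistance {
    // Sum of squared per-selector distances; smaller means a closer variant.
    // The distance scale (0, 1, 2) is a hypothetical choice for this sketch.
    static int squaredDistance(int[] distances) {
        int sum = 0;
        for (int d : distances) {
            sum += d * d;
        }
        return sum;
    }

    // Index of the candidate with the smallest squared distance.
    // Ties go to the earliest candidate.
    static int closest(List<int[]> candidates) {
        int best = 0;
        for (int i = 1; i < candidates.size(); i++) {
            if (squaredDistance(candidates.get(i)) < squaredDistance(candidates.get(best))) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<int[]> candidates = Arrays.asList(
            new int[] {0, 2},  // exact on the first selector, catch-all on the second
            new int[] {1, 1}   // category match on both selectors
        );
        // A plain sum ties at 2 vs 2; squaring gives 4 vs 2, so index 1 wins.
        System.out.println(closest(candidates));  // prints 1
    }
}
```

With this scale, squaring penalizes a single far-off selector more than two moderately close ones, which is the "natural distance" intuition the comment describes.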
I think I've had a bit of a revelation which I'd like to share before I go to sleep today. (We'll see whether I still think of it as a revelation tomorrow morning...) With first-match, it's possible to arrange variants such that the meaning of the
With best-match, if I imagine a message with the following variant:
Such a variant may win over Thus, #322 is closely related. |
Argh, I edited this example a few times before posting and ended up getting it wrong. I meant to say: Such a variant may win over |
@stasm I guess it is kind of like that. The thing is: the value The "value" of a Using your example:
For English, However, this is a perverse message because the matching is clearly incomplete. There are many cases where you can't get a reasonable value out of it. First match doesn't work either with only that set of messages!

I think the note I have about non-plural complex match cases is helpful here. I don't know what the gender determination means, so I can't write a "perfected matrix", but what I do know is that there is some (possibly locale-affected) set of values that the selector can produce. In fact, in your example the two

I'd be careful about using speculative examples (I've mostly stuck to plural in examples) because we can get wrapped up in hard-to-understand hypotheticals. If we instead describe what a selector can do, we want to make a complete test case to describe what happens and where the failures are. To evaluate if

Finally, note that a list of complex selectors produces a huge matrix (98 entries in Polish for my example, and more in Arabic or Russian!). I think Elango's callout in the call is a good one. Even if for survival purposes tools and developers always write the matrix in first-match order (to help them ensure that they don't miss a case!), the chances of making a mistake grow with complexity. Having the runtime fix the matrix for you is better than having it error on a mal-ordered matrix. The penalty for getting the order wrong anywhere in a long matrix of entries in first-match might be severe, and my intuition says that, for any matrix (complete or incomplete), having the runtime fix the order of the matrix does not produce a different result than having had the optimal order in the first place.

I think the onus on first-match would be to show a case where a non-optimally ordered matrix has value in excess of the burden on everyone else. Are there any use cases for non-optimally ordered matrices? |
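The intuition that a runtime can "fix the order of the matrix" can be sketched for the single-selector case. Everything here is an assumption made for illustration: the specificity ranking (exact number, then plural category, then `*`), the class name `SortedFirstMatch`, and the convention that the caller passes in the already-resolved plural category. It is not the algorithm from the explainer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SortedFirstMatch {
    // Hypothetical specificity rank: exact number (0), plural category (1), "*" (2).
    static int rank(String key) {
        if (key.equals("*")) return 2;
        return key.chars().allMatch(Character::isDigit) ? 0 : 1;
    }

    static boolean matches(String key, int value, String category) {
        return key.equals("*") || key.equals(String.valueOf(value)) || key.equals(category);
    }

    // Plain first-match: the first key that matches wins, so author order matters.
    static String firstMatch(List<String> keys, int value, String category) {
        for (String key : keys) {
            if (matches(key, value, category)) return key;
        }
        return null;
    }

    // Sort by specificity first (List.sort is stable), then apply first-match.
    static String sortedMatch(List<String> keys, int value, String category) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(Comparator.comparingInt(SortedFirstMatch::rank));
        return firstMatch(sorted, value, category);
    }

    public static void main(String[] args) {
        List<String> malOrdered = Arrays.asList("*", "one", "1");
        System.out.println(firstMatch(malOrdered, 1, "one"));   // prints *
        System.out.println(sortedMatch(malOrdered, 1, "one"));  // prints 1
    }
}
```

In this sketch the mal-ordered list silently selects the catch-all under first-match, while sorting first yields the same result a carefully ordered list (`1`, `one`, `*`) would have produced.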
You're right, the example is incomplete. In a real-life scenario it's unlikely that this message would omit other, better variants. But I think my point still stands -- even in best-match, we cannot analyze variants completely independently, because the presence of other variants defines what
I think it would be easier to find more realistic examples in other languages, but for the sake of the discussion we stick to English, whose grammar rules don't require a lot of complexity that we're designing for.
Correct, the issue of the unclear meaning of
Ack, this is one of the benefits of best-match, and I agree with it. I'm trying to make best-match even more robust by fixing The point about being able to consider individual variants independently of each other has been one of the most compelling arguments in favor of best-match for me. My mental model is that it would make it possible to just look at a single variant at a time and know how to translate it:
This would be great for l10n workflows which only send a subset of translation units for localization, great for tooling and QA because the translator only needs to consider However, am I right to say that
If we look at each variant separately, we'll end up with:
...which isn't enough of a "spec" for translators to know how to translate it. Only by looking at the other variant in the message can they know what This discussion makes me consider two other topics:
|
@stasm asked:
I don't think so. I do think that the match statement is needed to understand what the But then, there are lots of message format patterns (or just plain strings) that lack sufficient context for a good translation by themselves. Does it help if you mentally replace the term
I think this would be scary. The source locale (i.e. the developer) might lump together items that need to be separate in another locale. There is another reason why a |
I think so? In English this is easiest for me at least to see by starting with a degenerate case like
where initially the
and that's obviously wrong, because introducing the
As I mention, this is a bit of a degenerate case, but effectively something like this happens every time you introduce a new variant, independently of how the selection happens. If we're deeply aware of how selection happens, then we might be able to look at the variants and deduce which ones are potentially affected by an update. For instance, when introducing a
we might see that in many locales this narrows the meaning of |
@eemeli I disagree with your logic somewhat? Introducing a The existence of a Here's a different illustration. Let's consider what I recommend for developers to write for a basic plural in the
Now let's compile our application with that message--and no localizations--and run it in the For the value Now we run in the For the value We do write different keys when we localize to different locales to handle these cases. And we do introduce special cases like Using a different example from my not-plural examples. I can write a case like:
... where What I'm trying to say is: the IOW, @stasm is correct that |
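The locale dependence discussed in the comment above can be made concrete with a simplified sketch of plural categories for `en` and `pl`. It covers non-negative integers only and hand-codes the CLDR cardinal rules for those two locales; real code should use a library such as ICU's `PluralRules` rather than this. The class name `PluralSketch` and its method names are invented for the illustration.

```java
class PluralSketch {
    // English cardinal rule for integers: 1 is "one", everything else is "other".
    static String english(long n) {
        return n == 1 ? "one" : "other";
    }

    // Polish cardinal rule for non-negative integers:
    // 1 is "one"; 2-4, 22-24, ... (but not 12-14) are "few"; the rest are "many".
    static String polish(long n) {
        if (n == 1) return "one";
        long mod10 = n % 10;
        long mod100 = n % 100;
        boolean teens = mod100 >= 12 && mod100 <= 14;
        if (mod10 >= 2 && mod10 <= 4 && !teens) return "few";
        return "many";
    }

    public static void main(String[] args) {
        long[] samples = {0, 1, 2, 5, 12, 22};
        for (long n : samples) {
            // The same number lands in different categories depending on the locale.
            System.out.println(n + ": en=" + english(n) + " pl=" + polish(n));
        }
    }
}
```

A message that only provides `one` and `*` keys is complete for English, but in Polish every value in the `few` and `many` categories falls through to `*`, which is why the meaning of the catch-all depends on the locale doing the selection.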
Isn't this self-contradictory? If the |
(This might be a topic for a different discussion.) While I understand what you're saying here, @aphillips, I also question that in this example we should consider calling Polish formatters in English messages. If there's no translation and the UI falls back to the source language, messages should use the source language's formatters and matchers. |
I agree with @eemeli here. If
|
(And then, because we should be able to detect that |
I know and understand that this has happened and can happen again. Are we saying that graceful handling of such changes is a hard requirement for MF2? |
(I apologize for the volley of comments this morning.) Let me take a step back and try to rephrase the problem. When
However, when
For Of course, in this particular example, the issue is mitigated by the presence of the I'm concerned, however, that in multi-selector messages this problem will manifest more often, and will lead to incorrect translations. That's because How do we do it? This problem also applies to first-match and incomplete messages, but it can be mitigated by reordering variants. Reordering effectively changes the meaning of |
No, not really. "Graceful" is in the eye of the beholder. What I'm saying is more like: "if you have a
This isn't how I18N APIs work! 😀 Localizations often fall back (particularly during development, before the translations are available), but that doesn't mean that you want to lose the formatter or selector behavior. The solution for having English is to provide the localization, but you want the functionality to match the locale expected. Otherwise, for example, you can't test using pseudo or use features that depend on the locale. I often write I18N demos with the equivalent of:
And then code like:

    // Assumes ICU4J's MessageFormat (named arguments) and Guava's ImmutableMap.
    import com.google.common.collect.ImmutableMap;
    import com.ibm.icu.text.MessageFormat;
    import java.util.Locale;
    import java.util.Map;

    // the loop over the locales is usually replaced by a list box
    for (Locale locale : Locale.getAvailableLocales()) {
        MessageFormat fmt = new MessageFormat(abovePattern, locale);
        for (int count = 0; count < 30; count++) {
            Map<String, Object> args = ImmutableMap.of("count", count);
            System.out.println(String.format("%d: %s", count, fmt.format(args)));
        }
    }

When one sees my zero/one/two example above, the temptation is to think that At Amazon I had the equivalent of the above localized. You could run it on our demo portal with the portal in any language doing selection in another locale (it required a bit of work to get the right outcomes with the translators). Maybe @zbraniecki can take a screenshot of the demo if it is still there 😺.
Yes, that's correct. At some point this is unavoidable. Somewhere in this thread there's an example of trying to make the following generic in English:
This string is impossible to translate into many languages for the same reason we tell developers to use a plural format in the first place. You can't fix (e.g.) Polish orthography to be generic in the same way you can English. Similarly, this is a bad source string:
... because, while that is correct English, that is making the assumption that the locale that selected this message was an English locale. These are the same problem. The solution to the Anyway, @stasm goes on to note:
Yes, absolutely the problem will manifest more often with a matrix selector. As noted, some of these matrices will be hairy (98 entries in Polish for the three plural selector example with only one special case). I don't agree that We also cannot prevent developers from writing the bad "You have one item" message. We can only educate them. |
Closing per 2023-06-19 telecon discussion |
Is your feature request related to a problem? Please describe.
I would like to reconsider the choice of "first-match" in pattern selection by proposing that we switch to "best-match". I have prepared an explainer.
Describe the solution you'd like
See explainer.
Describe why your solution should shape the standard
This is a core feature.
Additional context or examples
See explainer.