
[Intl][Emoji] Arrows instead of emoji (CLDR hierarchical ↑↑↑) #53116


Closed
smnandre opened this issue Dec 18, 2023 · 3 comments · Fixed by #53203


@smnandre
Member

Symfony version(s) affected

6.4,7.0,7.1

Description

The CLDR hierarchical marker (the characters ↑↑↑) is copied into the resource files, so arrows end up inserted in the transliterated text.

I took a list of the "top 100 emojis" to see what proportion was affected, and went through all the xx_YY files.

| Locale | Emojis | Arrows | % error |
| ------ | ------ | ------ | ------- |
| en_CA | 0 | 100 | 🔴 🔴 🔴 🔴 🔴 🔴 🔴 🔴 🔴 🔴 |
| fr_CA | 26 | 74 | 🟢 🟢 🟢 🔴 🔴 🔴 🔴 🔴 🔴 🔴 |
| es_MX | 34 | 66 | 🟢 🟢 🟢 🔴 🔴 🔴 🔴 🔴 🔴 🔴 |
| en_IN | 0 | 100 | 🔴 🔴 🔴 🔴 🔴 🔴 🔴 🔴 🔴 🔴 |

How to reproduce

$message = 'Hello 🐨 ❤️ 🤣';

echo (EmojiTransliterator::create('fr'))->transliterate($message).PHP_EOL;
echo (EmojiTransliterator::create('fr_CA'))->transliterate($message).PHP_EOL;

// --> Hello koala cœur rouge️ se rouler par terre de rire
// --> Hello ↑↑↑ ↑↑↑️ rire à se rouler par terre

This also affects the Slugger when emoji support is enabled:

$message = '🐨❤️';

echo (new AsciiSlugger('fr'))->withEmoji()->slug($message).PHP_EOL;
echo (new AsciiSlugger('fr_CA'))->withEmoji()->slug($message).PHP_EOL;

// --> koala-coeur-rouge
// --> 

Possible Solution

Something needs to be adapted in the generation of the files.

Additional Context

No response

@smnandre
Member Author

Top 100 emojis according to the first click-bait site I found:

😂 ❤️ 🤣 👍 😭 🙏 😘 🥰 😍 😊
🎉 😁 💕 🥺 😅 🔥 ☺️ 🤦 ♥️ 🤷
🙄 😆 🤗 😉 🎂 🤔 👏 🙂 😳 🥳
😎 👌 💜 😔 💪 ✨ 💖 👀 😋 😏
😢 👉 💗 😩 💯 🌹 💞 🎈 💙 😃
😡 💐 😜 🙈 🤞 😄 🤤 🙌 🤪 ❣️
😀 💋 💀 👇 💔 😌 💓 🤩 🙃 😬
😱 😴 🤭 😐 🌞 😒 😇 🌸 😈 🎶
✌️ 🎊 🥵 😞 💚 ☀️ 🖤 💰 😚 👑
🎁 💥 🙋 ☹️ 😑 🥴 👈 💩 ✅ 👋

@smnandre smnandre changed the title [Emoji] Arrows instead of emoji (CLDR hierarchical ↑↑↑) [Intl][Emoji] Arrows instead of emoji (CLDR hierarchical ↑↑↑) Dec 18, 2023
@xabbuh xabbuh added the Intl label Dec 18, 2023
@nicolas-grekas
Member

/cc @lyrixx ;)

@smnandre
Member Author

I started to look at it, and I have some news :)

Almost everything relates to the Resources/emoji/build.php script.

--

1/ The merge with the parent locales does not work because $mapsByLocale has an intermediate level (keyed by mb_strlen). Since the merge is not recursive, we lose data there.

foreach ($results as $result) {
    // ...
    $codePointsCount = mb_strlen($emoji);
    $mapsByLocale[$locale][$codePointsCount][$emoji] = $name;
}

// ...

$maps += $mapsByLocale[$parentLocale] ?? [];
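
To illustrate the point above, here is a minimal sketch of why the array union loses data on a nested map, and one possible per-bucket merge. The sample data and the mergeWithParent() helper are hypothetical, not taken from build.php.

```php
<?php
// Hypothetical data in the $mapsByLocale[$locale][$codePointsCount][$emoji] shape.
$mapsByLocale = [
    'fr' => [
        1 => ['🐨' => 'koala', '🤣' => 'se rouler par terre de rire'],
    ],
    'fr_CA' => [
        1 => ['🤣' => 'rire à se rouler par terre'],
    ],
];

// Shallow union: bucket 1 already exists in the child,
// so the parent's whole bucket is ignored and the koala is lost.
$shallow = $mapsByLocale['fr_CA'] + $mapsByLocale['fr'];

// Hypothetical helper: union each code point count bucket separately.
// Child entries win; emojis missing in the child fall back to the parent.
function mergeWithParent(array $child, array $parent): array
{
    foreach ($parent as $count => $names) {
        $child[$count] = ($child[$count] ?? []) + $names;
    }

    return $child;
}

$merged = mergeWithParent($mapsByLocale['fr_CA'], $mapsByLocale['fr']);

var_dump(isset($shallow[1]['🐨'])); // false: the shallow union dropped it
echo $merged[1]['🐨'].PHP_EOL;      // koala
echo $merged[1]['🤣'].PHP_EOL;      // rire à se rouler par terre
```

The child locale keeps its own translation where it has one, and inherits only what it is missing.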

--

2/ the arrows can simply be ignored 😃

So something like this should be enough:

// 263A FE0F    ; fully-qualified     # ☺️ E0.6 smiling face
preg_match('{^(?<codePoints>[\w ]+) +; [\w-]+ +# (?<emoji>.+) E\d+\.\d+ ?(?<name>.+)$}Uu', $line, $matches);
if (!$matches) {
    throw new \DomainException("Could not parse line: \"$line\".");
}
+if (str_contains($matches['name'], '↑')) {     
+    continue;
+}
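
As a rough sketch of the same idea on already-parsed data: drop every name containing the arrow marker, then let the parent locale fill the gaps. The $parent and $child arrays are made-up samples, and this is only an illustration, not the actual patch.

```php
<?php
// Hypothetical flat maps for a parent locale and a child locale.
$parent = ['🐨' => 'koala', '🤣' => 'se rouler par terre de rire'];
$child = ['🐨' => '↑↑↑', '🤣' => 'rire à se rouler par terre'];

// Remove entries whose name contains the arrow marker.
$filtered = array_filter(
    $child,
    static fn (string $name): bool => !str_contains($name, '↑')
);

// The child wins where it has a real name; the parent fills the rest.
$resolved = $filtered + $parent;

echo $resolved['🐨'].PHP_EOL; // koala
echo $resolved['🤣'].PHP_EOL; // rire à se rouler par terre
```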

--

3/ Some code points were added "manually", and they interfered with the build:

// We also add a version without the "Zero Width Joiner"
$codePoints = str_replace('200d ', '', $codePoints);
$emojisCodePoints[$codePoints] = $matches['emoji'];

That created two code point sequences for the same emoji, whereas we need to build a 1:1 map in the end. This is what caused all the warnings during the build, due to false-negative tests.
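
A small sketch of the collision, using one ZWJ sequence as a hypothetical sample (the variable names follow the snippet above, the data is invented for the example):

```php
<?php
// "Man technologist": man (1F468) + Zero Width Joiner (200D) + laptop (1F4BB).
$emoji = "\u{1F468}\u{200D}\u{1F4BB}";
$codePoints = '1f468 200d 1f4bb';

$emojisCodePoints = [];
$emojisCodePoints[$codePoints] = $emoji;
// The "manual" extra entry without the Zero Width Joiner.
$emojisCodePoints[str_replace('200d ', '', $codePoints)] = $emoji;

// Two keys now point at the same emoji, so the map is no longer 1:1.
// Flipping it keeps only the last key seen for that emoji.
$flipped = array_flip($emojisCodePoints);

echo count($emojisCodePoints).PHP_EOL; // 2
echo count($flipped).PHP_EOL;          // 1
echo $flipped[$emoji].PHP_EOL;         // 1f468 1f4bb
```

After the flip, the sequence that survives is the one without the joiner, which no longer matches the real emoji.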

--

4/ Good news: JSON repository

I played a bit with the script and discovered there is a CLDR-JSON repository, which is much easier to manipulate during the build. One locale file could be generated like this (once the data is loaded):

foreach (self::getCldrAnnotations() as $locale => $data) {
    $rules = array_map(fn (array $data) => $data['tts'][0] ?? null, $data);
    $localeRules[$locale] = [...$localeRules[$locale] ?? [], ...array_filter($rules)];
}
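
For context, a minimal sketch of extracting the "tts" names from one annotations document. The JSON layout below (an "annotations" object nested inside "annotations", with "default" and "tts" lists per emoji) is my assumption about the CLDR-JSON files, and the sample content is invented.

```php
<?php
// Assumed shape of one CLDR-JSON annotations file (sample data, not real CLDR content).
$json = <<<'JSON'
{"annotations": {"annotations": {
    "🐨": {"default": ["koala"], "tts": ["koala"]},
    "🤣": {"default": ["rire"]}
}}}
JSON;

$data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
$annotations = $data['annotations']['annotations'];

// Keep the first "tts" value of each emoji and drop entries that have none.
$rules = array_filter(array_map(
    static fn (array $entry): ?string => $entry['tts'][0] ?? null,
    $annotations
));

echo $rules['🐨'].PHP_EOL;          // koala
var_dump(isset($rules['🤣']));      // false: no "tts" entry
```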

And seeing how complete and simple to parse/reuse it is, I think we should use it to build the Intl data too.

--

5/ I also looked more deeply into the derived annotations (75% of the emoji data weight)... and if we accept a "second pass" when we build the transliterator map, we could make it almost seamless.

Something like:

// BEFORE
    '👨🏻‍❤‍💋‍👨🏼' => 'bisou : homme, homme, peau claire et peau moyennement claire',
    '👨🏻‍❤‍💋‍👨🏽' => 'bisou : homme, homme, peau claire et peau légèrement mate',
    '👨🏻‍❤‍💋‍👨🏾' => 'bisou : homme, homme, peau claire et peau mate',
    '👨🏻‍❤‍💋‍👨🏿' => 'bisou : homme, homme, peau claire et peau foncée',
    '👨🏼‍❤‍💋‍👨🏻' => 'bisou : homme, homme, peau moyennement claire et peau claire',
    '👨🏼‍❤‍💋‍👨🏼' => 'bisou : homme, homme et peau moyennement claire',
    '👨🏼‍❤‍💋‍👨🏽' => 'bisou : homme, homme, peau moyennement claire et peau légèrement mate',
    '👨🏼‍❤‍💋‍👨🏾' => 'bisou : homme, homme, peau moyennement claire et peau mate',
    '👨🏼‍❤‍💋‍👨🏿' => 'bisou : homme, homme, peau moyennement claire et peau foncée',

// AFTER
    '👨🏻‍❤‍💋‍👨🏼' => '💋: 👨🏻, 👨🏻, 🏼 et 🏼',
    '👨🏻‍❤‍💋‍👨🏽' => '💋: 👨🏻, 👨🏻, 🏼 et 🏽',
    '👨🏻‍❤‍💋‍👨🏾' => '💋: 👨🏻, 👨🏻, 🏼 et 🏾',
    '👨🏻‍❤‍💋‍👨🏿' => '💋: 👨🏻, 👨🏻, 🏼 et 🏿',
    '👨🏼‍❤‍💋‍👨🏻' => '💋: 👨🏻, 👨🏻, 🏼 et 🏼',
    '👨🏼‍❤‍💋‍👨🏼' => '💋: 👨🏻, 👨🏻 et 🏼',
    '👨🏼‍❤‍💋‍👨🏽' => '💋: 👨🏻, 👨🏻, 🏼 et 🏽',
    '👨🏼‍❤‍💋‍👨🏾' => '💋: 👨🏻, 👨🏻, 🏼 et 🏾',
    '👨🏼‍❤‍💋‍👨🏿' => '💋: 👨🏻, 👨🏻, 🏼 et 🏿',
    '👨🏽‍❤‍💋‍👨🏻' => '💋: 👨🏻, 👨🏻, 🏽 et 🏼',
    '👨🏽‍❤‍💋‍👨🏼' => '💋: 👨🏻, 👨🏻, 🏽 et 🏼',

As they are "modifiers", we may not even need them at all and could build that map following the specs!
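
A rough sketch of what such a second pass could look like: the compact entry stores emoji as placeholders, and they are expanded from the base names when the map is built. The $baseNames map and the compact format are hypothetical.

```php
<?php
// Hypothetical base names (simple emoji and skin tone modifiers) for the 'fr' locale.
$baseNames = [
    '💋' => 'bisou',
    '👨' => 'homme',
    '🏻' => 'peau claire',
    '🏼' => 'peau moyennement claire',
];

// Hypothetical compact form of a derived annotation.
$compact = '💋 : 👨, 👨, 🏻 et 🏼';

// Second pass: strtr() replaces every placeholder in one go,
// without re-scanning text it has already substituted.
$expanded = strtr($compact, $baseNames);

echo $expanded.PHP_EOL;
// bisou : homme, homme, peau claire et peau moyennement claire
```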

--

I'll work on it tomorrow (at least the first part, to fix the bug currently impacting... around half the world population 😅).

But if someone wants to do all (or some) of that today, feel really, really free to start without me :)

nicolas-grekas added a commit that referenced this issue Dec 26, 2023
…add missing data) (smnandre)

This PR was squashed before being merged into the 6.4 branch.

Discussion
----------

[Intl] [Emoji] Fix emoji files (remove wrong characters / add missing data)

| Q             | A
| ------------- | ---
| Branch?       | 6.4
| Bug fix?      | yes
| New feature?  | no
| Deprecations? | no
| Issues        | Fix #53116
| License       | MIT

Fixes two things:
* unwanted characters (`↑↑↑`) instead of expected translations
* merging between child and parent locales

Also adapted the code to maintain a shared / reproducible order in the generated files.

Commits
-------

e9fcff5 [Intl] [Emoji] Fix emoji files (remove wrong characters / add missing data)

4 participants