Skip to content

[Intl] Allow compressing emoji and data maps #49970

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

nicolas-grekas
Copy link
Member

@nicolas-grekas nicolas-grekas commented Apr 7, 2023

Q A
Branch? 6.3
Bug fix? no
New feature? no
Deprecations? no
Tickets #49943
License MIT
Doc PR -

This PR proposes to compress data and emoji maps into .php.gz files that are uncompressed on the fly.
Compatibility with opcache is preserved so that this will have no impact on performance.

I see this PR as an alternative to #49947.

du -sh src/Symfony/Component/Intl/:

  • before: 45M
  • after: 11M

du -sh src/Symfony/Component/Intl/Resources/data/transliterator/:

  • before: 27MB
  • after: 3,9M

The dependency to ext-zlib is made optional by not-compressing data for the "en" and the "meta" locales. The "en" locale is the default one in our recipes so this should help first timers. A LogicException is throw when using any other locales or when dealing with emojis and ext-zlib is missing.

For the record, this is how I compressed all files:

FILES=$(find Resources/data/ -name '*.php' | grep -v -F '/en.php' | grep -v -F '/meta.php')
echo $FILES | xargs -P0 -n1 php -r 'file_put_contents("compress.zlib://".$argv[1].".gz", file_get_contents($argv[1]));'
echo $FILES | xargs -P0 -n1 php -r 'unlink($argv[1].(filesize($argv[1].".gz") >= filesize($argv[1]) ? ".gz" : ""));'

@carsonbot carsonbot added this to the 6.3 milestone Apr 7, 2023
@nicolas-grekas nicolas-grekas force-pushed the intl-zip-emoji branch 4 times, most recently from 16bb293 to a4679ac Compare April 7, 2023 13:39
@stof
Copy link
Member

stof commented Apr 7, 2023

Given that dependents of EmojiTransliterator and of the classes exposing ICU data are different, I think it still make sense to split the component even though we can reduce the size of the data for the emoji data.
Requiring someone wanting to manage emojis in their slug to install all the ICU data still wastes space for them (12M is still huge compared to typical composer packages)

@stof
Copy link
Member

stof commented Apr 7, 2023

Given that zlib is now required to load any data in symfony/intl, it must be declared as a dependency in composer.json (there is no feature that works without it, and existing features now require it to work)

@nicolas-grekas
Copy link
Member Author

You might have missed my update @stof: The dependency to ext-zlib is made optional by not-compressing data for the "en" and the "meta" locales, which match our defaults.

@stof
Copy link
Member

stof commented Apr 7, 2023

It still does not preserve BC for existing projects that work without zlib and would break when upgrading to 6.3 without notice.

If we do that, we should probably have the dependency on zlib in 6.3 and 6.4 (so that projects don't upgrade in the semver range while breaking support for their transliteration or ICU data usage) and make it actually optional only in 7.0

@nicolas-grekas
Copy link
Member Author

nicolas-grekas commented Apr 7, 2023

I think I'm fine either ways:

  • requiring ext-zlib shouldn't be a too strong requirement as it's already a recommended extension for composer. Doing that would also kinda "break BC" by some definition of "BC break", since we're going to ask ppl to add the ext if they don't have it already. This approach has the benefit of breaking at composer install time.
  • not-requiring ext-zlib: that's another "BC break" by some other definition of "BC break", since loading non-"en" locales will fail for ppl that don't have "ext-zlib". This approach has the benefit of making it easier for first timers to install a Symfony project.

Our BC policy doesn't cover this topic so we can decide either way IMHO. I have a slight preference for approach 2. 🤘

/cc @symfony/mergers in case you want to chime in here.

@fabpot
Copy link
Member

fabpot commented Apr 7, 2023

I prefer this PR instead of #49947 as having a "full" component for "just" the emoji transliterator looks weird to me.

@fabpot
Copy link
Member

fabpot commented Apr 7, 2023

I like option 2 (about zlib req).

@stof
Copy link
Member

stof commented Apr 7, 2023

Option 1 would not break BC as composer update would not select the new version if the extension is missing. On the other hand, option 2 will fail at runtime the first time a non-en user connects.

Another argument for keeping the component split in addition of reducing the size of their data: they don't even have the same platform dependency. EmojiTransliterator depends on ext-intl while other parts of symfony/intl has nothing to do with that extension

@fabpot
Copy link
Member

fabpot commented Apr 8, 2023

Option 1 would not break BC as composer update would not select the new version if the extension is missing. On the other hand, option 2 will fail at runtime the first time a non-en user connects.

Another argument for keeping the component split in addition of reducing the size of their data: they don't even have the same platform dependency. EmojiTransliterator depends on ext-intl while other parts of symfony/intl has nothing to do with that extension

Why not create a new component, but having one just for the emoji tranliterator seems weird to me (too small, too specific). We would need to come up with a more generic component for which the first part would be the emoji transliterator.

@stof
Copy link
Member

stof commented Apr 8, 2023

Well, to me, the reason to keep that component focused on the emoji transliterator is precisely the fact that it needs a bunch of data to operate. Mixing other features in the same component would force people to download a big package if they want to use those other features and not the transliteration. Even with the zlib compression, the data is still 4MB. And as reported in #49943, there are environments where the package size is an issue.

@OskarStark
Copy link
Contributor

I think, but I could be wrong, Fabien meant to create a dedicated component then symfony/transliterator, right?

@yguedidi
Copy link
Contributor

yguedidi commented Apr 8, 2023

Quick idea: isn't this a new "data" concept that could live next to bridges, bundles, components and contracts?
So have src/Symfony/Data/Emojis or something like that.
Composer package could be symfony/data-emojis maybe

@stof
Copy link
Member

stof commented Apr 8, 2023

@yguedidi separating the data from the code needing it would create more packages, not less. And it would be more complex to locate the data or would require adding autoloadable PHP classes in those data packages that can return the data (at which point we would have half of the existing symfony/intl code moved there)

@stof
Copy link
Member

stof commented Apr 8, 2023

@OskarStark but what else would go there ? The Transliterator class itself is already part of PHP itself. And if we have other special transliterators needing their data, the question is whether projects using one will also use others or no. If the answer is no, putting their data together still leads to wasted space (which is what the issue is about).

@stof
Copy link
Member

stof commented Apr 8, 2023

@yguedidi thus, if we have packages meant to expose this data, it means that the structure of this data is now the public API of those packages and must be covered by BC. This is a big no-go to me. Today, the data files in symfony/intl are purely internal (and we changed their format in the past, and the current PR changing it again is something that would be a no-go if they were not internal)

@nicolas-grekas
Copy link
Member Author

nicolas-grekas commented Apr 8, 2023

On my side, I see the theory, but the pragmatic aspect of keeping everything here still wins: splitting 4M out of 11MB doesn't look worth the costs of the split to me.

@kevinpapst
Copy link

The dependency to ext-zlib is made optional by not-compressing data for the "en" and the "meta" locales.

Most projects I know have a user base with more than 90% of them using a different locale, which makes it mandatory and therefor my vote goes to option 1: adding zlib as a composer dependency. It would also allow to remove that special treatment by handling all locales the same way and compressing english as well.

@wouterj
Copy link
Member

wouterj commented Apr 10, 2023

What about:

  1. Create a sub package (e.g. symfony/intl-emoji-data) with all emoji data and a dummy class EmojiData (similar to the dummy FullStack class)
  2. In the EmojiTransliterator, check if the EmojiData class exists and otherwise throw a message saying you need to install the symfony/intl-emoji-data package to use the EmojiTransliterator.

@stof
Copy link
Member

stof commented Apr 11, 2023

@kevinpapst dealing with multiple locales does not imply that you need the translated language names and country names.

@stof
Copy link
Member

stof commented Apr 11, 2023

@wouterj I fail to see the benefit of this symfony/intl-emoji-data package that would contain a EmojiData class compared to putting the EmojiTransliterator class in the package directly (which is what #49947 does)

@dkarlovi
Copy link
Contributor

@stof the value of intl-*-data could be that you'd be able to further fragment down the line. Having a standalone Emoji transliterator component is weird, feels like the old symfony/inflector which was a single use component, being abandoned in favor of symfony/string.

@wouterj
Copy link
Member

wouterj commented Apr 13, 2023

@stof it was an alternative suggestion to somewhat find a middle ground between what you're proposing and Fabien's opinion. If an EmojiTranslator is thought to be too little value for a new component, but we want to have a separate package to avoid lots of unused disk space, let's create a package that adds zero value standalone.

@nicolas-grekas nicolas-grekas force-pushed the intl-zip-emoji branch 2 times, most recently from f4b81dd to d339300 Compare April 14, 2023 17:19
@fabpot
Copy link
Member

fabpot commented Apr 17, 2023

Let's sum up the data sizes in the last few versions:

  • 6.1: 18M
  • 6.2: 44M (because of the addition of the emoji transliterator)

This PR would reduce the data size to 12M, way less than 6.1 (and the previous versions).
And the emoji data represent 4M out of 12M.
So, I don't see why we should split "just" the emoji data in another component.

To me, this PR is enough to fix the size issue of this component (as I still think that an emoji transliterator is not a big enough feature to warrant a full component).

@nicolas-grekas
Copy link
Member Author

nicolas-grekas commented Apr 18, 2023

Maybe wild idea: what about keeping all the code infrastructure in this PR, but NOT compressing any files?
Instead, we would tell people that care about the size that they CAN compress to .php.gz if they need to when preparing their images.
WDYT?

@dkarlovi
Copy link
Contributor

Maybe wild idea: what about keeping all the code infrastructure in this PR, but NOT compressing any files? Instead, we would tell people that care about the size that they CAN compress to .php.gz if they need to when preparing their images. WDYT?

What is the reason to not compress?

@nicolas-grekas
Copy link
Member Author

nicolas-grekas commented Apr 18, 2023

What is the reason to not compress?

Convenience mostly, on two aspects:

  • not requiring ppl to install ext-gzip. Yes, it's an almost always needed extension, but still, that can only improve the DX.
  • improving the compressibility of the git history of this very repo.

@dkarlovi
Copy link
Contributor

dkarlovi commented Apr 18, 2023

👍 Do you think it would make sense to do it when installing if ext-gzip is detected or at least provide a script to do it in the package itself? Using the script would allow future upgrades.

compressibility of the git history of this very repo.

Sounds like it would be a good candidate to use LFS here if the data files are expected to change often, it helps significantly with the Git index size.

@stof
Copy link
Member

stof commented Apr 18, 2023

Compressing after downloading would not reduce the bandwidth usage of the download (though I'm not sure how big the zip itself is on both cases) and would require a composer plugin to perform the compression and delete the original files (btw, this would create a mess for installations from source as composer would detect the files as modified)

@stof
Copy link
Member

stof commented Apr 18, 2023

Sounds like it would be a good candidate to use LFS here if the data files are expected to change often

The data files will change once a year, when Unicode releases a new version.

@dkarlovi
Copy link
Contributor

Compressing after downloading would not reduce the bandwidth usage of the download

Personally I don't care about that at all, I'm just looking forward to being able to minimize Docker images size.

@nicolas-grekas
Copy link
Member Author

About the bandwidth, this is a mistake in the original report: the data is always compressed on the network. Either via ZIP when dist, or git compression when source.

@nicolas-grekas
Copy link
Member Author

I updated the PR to ship only the compression+decompression logic, but not the pre-compressed files.

@nicolas-grekas nicolas-grekas changed the title [Intl] Compress emoji and data maps [Intl] Allow compressing emoji and data maps Apr 18, 2023
}

foreach (glob(dirname(__DIR__).'/data/*/*.php') as $file) {
if (in_array(basename($file), ['en.php', 'meta.php'], true)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given that this script requires zlib anyway, is there any benefit to keeping those uncompressed ?

Builder::saveRules(Builder::buildGitHubRules($emojisCodePoints));
Builder::saveRules(Builder::buildSlackRules($emojisCodePoints));
Builder::saveRules(Builder::buildStripRules($emojisCodePoints));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why changing the order ?

@@ -192,7 +192,7 @@ public static function saveRules(iterable $rulesByLocale): void
{
$firstChars = [];
foreach ($rulesByLocale as $filename => $rules) {
file_put_contents(self::TARGET_DIR."/$filename.php", "<?php\n\nreturn ".VarExporter::export($rules).";\n");
file_put_contents('compress.zlib://'.self::TARGET_DIR."/$filename.php.gz", "<?php\n\nreturn ".VarExporter::export($rules).";\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it expected that you still build them as compressed here ?

@stof
Copy link
Member

stof commented Apr 18, 2023

@nicolas-grekas can you add tests covering the loading of the compressed data ? Now that the repo does not pre-compress them anymore, this is not tested on CI anymore.

@stof stof closed this Apr 18, 2023
@stof
Copy link
Member

stof commented Apr 18, 2023

oups, closed it by mistake. And apparently, github does not want me to re-open it

nicolas-grekas added a commit that referenced this pull request Apr 18, 2023
…grekas)

This PR was merged into the 6.3 branch.

Discussion
----------

[Intl] Allow compressing emoji and data maps

| Q             | A
| ------------- | ---
| Branch?       | 6.3
| Bug fix?      | no
| New feature?  | yes
| Deprecations? | no
| Tickets       | Fix #49943
| License       | MIT
| Doc PR        | -

Resuming #49970, which was closed by mistake :)

Commits
-------

de6a74c [Intl] Allow compressing emoji and data maps
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants