Skip to content

MNT Replace PDF build by ZIP of the HTML #17564

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Jan 12, 2021

Conversation

alfaro96
Copy link
Member

Reference Issues/PRs

Closes #17051.

What does this implement/fix? Explain your changes.

This PR replaces the build of PDF documentation by a ZIP of the HTML for the offline documentation.

@alfaro96 alfaro96 changed the title MNT Replace PDF build by ZIP of the HTML [WIP] MNT Replace PDF build by ZIP of the HTML Jun 11, 2020
@alfaro96 alfaro96 marked this pull request as draft June 11, 2020 09:18
@alfaro96
Copy link
Member Author

Hey @rth @thomasjpfan!

Do you mind to review this PR (the log confirms that the ZIP of the HTML is properly uploaded and this link provides the downloadable file?

@alfaro96 alfaro96 marked this pull request as ready for review June 11, 2020 10:23
@alfaro96 alfaro96 changed the title [WIP] MNT Replace PDF build by ZIP of the HTML MNT Replace PDF build by ZIP of the HTML Jun 11, 2020
Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how much running optipng reduces the size of the zip file...

@adrinjalali
Copy link
Member

Nice, thanks @alfaro96 . Since we're not gonna have a ton of these files laying around, I don't think we'd need to worry about the size too much, since it requires cpu cycles to compress them.

@rth
Copy link
Member

rth commented Jun 11, 2020

Since we're not gonna have a ton of these files laying around, I don't think we'd need to worry about the size too much, since it requires cpu cycles to compress them.

Generally we have an issue with our deployment docs repo size right now: I tried to clone it and at 6% it was at 3.7GB. Because we store the all files for each commit. Even with a clone at --depth 1 the size of the repo (uncompressed) is 2.0GB. I'm kind of surprised we have been allowed by Github to do it, as it's way more than they ToS https://help.github.com/en/github/working-with-github-pages/about-github-pages#usage-limits

So I think anything we can do to reduce the size of files we put there could be worth doing, even at the cost of some CPU time. Though maybe compressing files in previous releases could be a start. The PDF was 48MB in size the ZIP of the PDF is 75MB, so not that different but still a bit more.

@alfaro96
Copy link
Member Author

alfaro96 commented Jun 11, 2020

Since we're not gonna have a ton of these files laying around, I don't think we'd need to worry about the size too much, since it requires cpu cycles to compress them.

Generally we have an issue with our deployment docs repo size right now: I tried to clone it and at 6% it was at 3.7GB. Because we store the all files for each commit. Even with a clone at --depth 1 the size of the repo (uncompressed) is 2.0GB. I'm kind of surprised we have been allowed by Github to do it, as it's way more than they ToS https://help.github.com/en/github/working-with-github-pages/about-github-pages#usage-limits

So I think anything we can do to reduce the size of files we put there could be worth doing, even at the cost of some CPU time. Though maybe compressing files in previous releases could be a start. The PDF was 48MB in size the ZIP of the PDF is 75MB, so not that different but still a bit more.

The actual size of the ZIP file is 75MB (1.56 times the size of the PDF file). Running optipng under the _images directory with *.png reduces the size of the ZIP file to 59MB (1.23 times the size of the PDF file).

If repository size is a problem, I think that compressing the images with optipng could be interesting, at the cost of slightly increasing the CPU time.

@alfaro96
Copy link
Member Author

I wonder how much running optipng reduces the size of the zip file...

According to my calculations, running optipng reduces the size of the ZIP file by a 27.11%.

@rth
Copy link
Member

rth commented Jun 11, 2020

According to my calculations, running optipng reduces the size of the ZIP file by a 27.11%.

Great. How long does it take?

We should probably optimize png and create this Zip in the deploy job not in the job that is building docs (as it runs only for commits on master and only once per commit, unlike the build docs job).

Edit: opened #17568 for additional things we could try.

@alfaro96
Copy link
Member Author

alfaro96 commented Jun 11, 2020

According to my calculations, running optipng reduces the size of the ZIP file by a 27.11%.

Great. How long does it take?

Compressing the images takes 6:02 (using a docker container with the same hardware resources than CircleCI). I think that is quite affordable.

We should probably optimize png and create this Zip in the deploy job not in the job that is building docs (as it runs only for commits on master and only once per commit, unlike the build docs job).

IIUC, I think that we can compress the images in the dist rule of the Makefile since the following lines

if [[ "$CIRCLE_BRANCH" =~ ^master$|^[0-9]+\.[0-9]+\.X$ && -z "$CI_PULL_REQUEST" ]]
then
# PDF linked into HTML
make_args="dist LATEXMKOPTS=-halt-on-error"

are executed under the same condition than the deploy job

- deploy:
command: |
if [[ "${CIRCLE_BRANCH}" =~ ^master$|^[0-9]+\.[0-9]+\.X$ ]]; then
bash build_tools/circle/push_doc.sh doc/_build/html/stable
fi

Edit: opened #17568 for additional things we could try.

I have committed these changes to optimize the images (_images/*.png).

Thanks @rth!

@jnothman
Copy link
Member

Generally we have an issue with our deployment docs repo size right now: I tried to clone it and at 6% it was at 3.7GB. Because we store the all files for each commit. Even with a clone at --depth 1 the size of the repo (uncompressed) is 2.0GB. I'm kind of surprised we have been allowed by Github to do it, as it's way more than they ToS help.github.com/en/github/working-with-github-pages/about-github-pages#usage-limits

Would it be appropriate to use git lfs to at least keep the git out of the server?

Or should we look into less append-only usage of the repo?

@rth
Copy link
Member

rth commented Aug 29, 2020

Would it be appropriate to use git lfs to at least keep the git out of the server?

git lfs doesn't seem to work with github pages git-lfs/git-lfs#791

@alfaro96
Copy link
Member Author

alfaro96 commented Oct 8, 2020

Should we move on with this PR? (#18151 solves the case insensitive filesystems)

@thomasjpfan
Copy link
Member

Should we move on with this PR? (#18151 solves the case insensitive filesystems)

Okay, lets try to move this forward.

The zipped file ends up to have a nested folder structure: html/stable/* where the website is two levels deep. I would expect index.html to be in top directory without any nesting.

For the future, we can work around the github pages being huge as follows:

  1. Create another append only repo called scikit-learn-website-backup that is not hosted on github pages.
  2. When we deploy here, we push to scikit-learn-website-backup (we can optionally setup git-lfs here).
  3. We have a github action in scikit-learn-website-backup to do a force push to scikit-learn.github.io and we keep scikit-learn.github.io to one commit.

This way we have a backup scikit-learn-website-backup and have a github pages that is not huge.

I suspect this is what matplotlib does with its devdocs

@cmarmo cmarmo removed the Needs Decision Requires decision label Oct 13, 2020
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just wanted to know if @thomasjpfan has any other comments before merging this PR?

@alfaro96
Copy link
Member Author

alfaro96 commented Jan 5, 2021

I have merged with master to check for possible errors.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there aren't any HTML versions for the older versions, the links for the older PDF versions are not generated by build_tools/circle/list_versions.py. I think we should continue to link to the PDF versions for <=0.24 and then have the HTML versions linked for >=1.0.0. (Unfortunately, this means having two code paths in list_versions.py depending on the version.

@alfaro96
Copy link
Member Author

Since there aren't any HTML versions for the older versions, the links for the older PDF versions are not generated by build_tools/circle/list_versions.py. I think we should continue to link to the PDF versions for <=0.24 and then have the HTML versions linked for >=1.0.0. (Unfortunately, this means having two code paths in list_versions.py depending on the version.

You are absolutely right. I think that the current solution fixes the problem.

@alfaro96
Copy link
Member Author

The generated doc/versions.rst file looks good:

:orphan:

Available documentation for Scikit-learn
========================================

Web-based documentation is available for versions listed below:

* `Scikit-learn 1.0.dev0 (dev) documentation <https://scikit-learn.org/dev/>`_
* `Scikit-learn 0.24.0 (stable) documentation <https://scikit-learn.org/stable/>`_ (`PDF 53.7 MB <https://scikit-learn.org/stable//_downloads/scikit-learn-docs.pdf>`_)
* `Scikit-learn 0.23.2 documentation <https://scikit-learn.org/0.23/>`_ (`PDF 51.0 MB <https://scikit-learn.org/0.23//_downloads/scikit-learn-docs.pdf>`_)
* `Scikit-learn 0.22.2 documentation <https://scikit-learn.org/0.22/>`_ (`PDF 48.5 MB <https://scikit-learn.org/0.22//_downloads/scikit-learn-docs.pdf>`_)
* `Scikit-learn 0.21.3 documentation <https://scikit-learn.org/0.21/>`_ (`PDF 46.7 MB <https://scikit-learn.org/0.21//_downloads/scikit-learn-docs.pdf>`_)
* `Scikit-learn 0.20.4 documentation <https://scikit-learn.org/0.20/>`_ (`PDF 45.2 MB <https://scikit-learn.org/0.20//_downloads/scikit-learn-docs.pdf>`_)
* `Scikit-learn 0.19.2 documentation <https://scikit-learn.org/0.19/>`_ (`PDF 42.2 MB <https://scikit-learn.org/0.19//_downloads/scikit-learn-docs.pdf>`_)
* `Scikit-learn 0.18.2 documentation <https://scikit-learn.org/0.18/>`_ (`PDF 46.5 MB <https://scikit-learn.org/0.18//_downloads/scikit-learn-docs.pdf>`_)
* `Scikit-learn 0.17.1 documentation <https://scikit-learn.org/0.17/>`_ (`PDF 46.0 MB <https://scikit-learn.org/0.17//_downloads/scikit-learn-docs.pdf>`_)
* `Scikit-learn 0.16.1 documentation <https://scikit-learn.org/0.16/>`_ (`PDF 56.8 MB <https://scikit-learn.org/0.16//_downloads/scikit-learn-docs.pdf>`_)
* `Scikit-learn 0.15-git documentation <https://scikit-learn.org/0.15/>`_

Note that the ZIP file is not yet available in the website.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@thomasjpfan thomasjpfan merged commit be4c1d1 into scikit-learn:master Jan 12, 2021
@alfaro96 alfaro96 deleted the ship_zip_html branch January 12, 2021 14:57
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Jan 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stop generating pdf docs?
7 participants