Skip to content

MNT Parallel build of sphinx_gallery #29614

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

StefanieSenger
Copy link
Contributor

@StefanieSenger StefanieSenger commented Aug 2, 2024

Reference Issues/PRs

#29570

What does this implement/fix? Explain your changes.

This is an attempt to configure the doc/conf.py so that the sphinx_gallery build runs on two cores (see issue). The goal of this PR is to trigger the CI and see how it deals with it.

Locally, running make html and OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 make html (I haven't yet been able to measure differences in speed, but they) seem equally stable, except for one tricky thing:

examples/feature_selection/plot_rfe_with_cross_validation.py triggers the following errors (either one, I've ran it multiple times) within resp. right after the parallelized joblib part in RFECV.fit().

AttributeError
../examples/feature_selection/plot_rfe_with_cross_validation.py unexpectedly failed to execute correctly:

Traceback (most recent call last):
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/examples/feature_selection/plot_rfe_with_cross_validation.py", line 60, in <module>
    rfecv.fit(X, y)
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/base.py", line 1514, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/feature_selection/_rfe.py", line 774, in fit
    scores_features = parallel(
                      ^^^^^^^^^
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/utils/parallel.py", line 77, in __call__
    return super().__call__(iterable_with_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/joblib/parallel.py", line 2007, in __call__
    return output if self.return_generator else list(output)
                                                ^^^^^^^^^^^^
  File "/home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/joblib/parallel.py", line 1650, in _get_outputs
    yield from self._retrieve()
  File "/home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/joblib/parallel.py", line 1754, in _retrieve
    self._raise_error_fast()
  File "/home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/joblib/parallel.py", line 1789, in _raise_error_fast
    error_job.get_result(self.timeout)
  File "/home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/joblib/parallel.py", line 745, in get_result
    return self._return_or_raise()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/joblib/parallel.py", line 763, in _return_or_raise
    raise self._result
AttributeError: 'LogisticRegression' object has no attribute 'coef_'
ValueError
../examples/feature_selection/plot_rfe_with_cross_validation.py unexpectedly failed to execute correctly:

Traceback (most recent call last):
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/examples/feature_selection/plot_rfe_with_cross_validation.py", line 60, in <module>
    rfecv.fit(X, y)
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/base.py", line 1514, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/feature_selection/_rfe.py", line 781, in fit
    scores = np.array(scores)
             ^^^^^^^^^^^^^^^^
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (5,) + inhomogeneous part.

Can this be due to a race condition? If it appears on the CI also, we need to deal with it somehow.

Copy link

github-actions bot commented Aug 2, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 1ef8457. Link to the linter CI: here

@StefanieSenger
Copy link
Contributor Author

StefanieSenger commented Aug 2, 2024

For the CI failure in ci/circleci: doc-min-dependencies:

This new feature in sphinx_gallery is only available from version 0.17, and we use "sphinx-gallery": ("0.16.0", "docs"), for this test. But I'm not sure what we would usually do in such a case.

For the CI failure in ci/circleci: doc:

This is one of the two errors that can occur that I also had locally.

@lesteve
Copy link
Member

lesteve commented Aug 3, 2024

I think the first question to try to answer is "what could we gain with this"

  • either to speed up the CI build. Right now it seems, this saves maybe 10 minutes out of 45 minutes so I am wondering whether this is worth pursuing 🤔. Maybe push a few commits with [doc build] because this can vary from one run to the next. The doc build in this PR took 35 minutes. I have in mind that on main it is more ~45 minutes which matches recents CircleCI builds on main
    image

  • or when running locally although I guess running the full doc build does not happen that often and if this involves remembering to set environment variables this is probably not that great. Having said that setting the environment variables could be done in doc/conf.py

A few more comments before I forget:

  • one downside could be that some examples are showing the impact on parallelism like changing n_jobs and they could become misleading since we only have one
  • switching to a bigger CircleCI worker could be an option if the main aim is to make the CircleCI doc build faster Try running examples in parallel during doc build #29570 (comment)
  • some example are using n_jobs=2 or even n_jobs=3 IIRC, which seems like a bad idea if we use the sphinx-gallery parallel feature, since right now we should be having only a single core per example

About the errors, I am a bit suprised that running the examples in parallel can cause this kind of errors but maybe I am missing something.

About the doc-min-dependencies, this can be left for later when we get a better sense of "is this really worth it?"

@StefanieSenger
Copy link
Contributor Author

Thank you @lesteve, I understand the priorities now. I will run some experiments measuring speedup and sum it all up (while disregarding the errors for now). We can talk in September about the "if it is worth it". :)

@StefanieSenger
Copy link
Contributor Author

StefanieSenger commented Aug 6, 2024

I have fixed the inconsistent failure (setting number jobs in the example to 1 worked) and ran all of the builds three times each.
Here are the results:

  • CI build: 34, 32 and 32 minutes
  • local build with make html: 22, 22 and 21 minutes
  • local build with OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 make html: 23, 20 and 20 minutes
  • compared to 28 minutes building locally on main

So, there seems to not be any meaningful difference in the local builds with or without those variables set, but an improvement compared to building without changing the sphinx-gallery configuration.

@adrinjalali
Copy link
Member

I think 10min difference in CI is worth pursuing this. We can bump the min version to fix the min dependency issue.

As for the local stuff, I'd like to have a spin thingy which disables sets the threads (blas, ...) to 1, and uses all CPU cores available. That should speed up tings locally substantially.

@StefanieSenger
Copy link
Contributor Author

Do you also think it's worth it, @lesteve?

I was thinking about the examples that use n_jobs=more_than_one. If I understand correctly, then they never really ran on several cores, but rather ran as several threads in the past. I think it's legit, since the examples are there to illustrate use cases, and the users will most probably have more cores available.
Building the examples in parallel would probably not change that, would it? Each example will still have one core available and might then use several threads. Do you think there is any need for action?

I don't know so much to say about switching to a bigger CircleCI worker. Is this is separate action that we could take in addition?

@lesteve
Copy link
Member

lesteve commented Sep 4, 2024

Do you also think it's worth it, @lesteve?

I don't know 🙄 I think we can always try and revert if we realize that this causes too many annoyances.

The fact that there were weird Python errors originally was unexpected and that makes me think this is a bit brittle but this may actually be an issue in RFECV (full disclosure I have seen very similar errors in RFECV when testing CPython 3.13 free-threaded but I thought it was only happening because I was setting the joblib backend to threading) ...

@lesteve
Copy link
Member

lesteve commented Sep 4, 2024

FWIW I can reproduce the RFECV issue when running inside a joblib.Parallel #29783 so this seems like a real bug that was uncovered by trying to run the example in parallel.

@StefanieSenger
Copy link
Contributor Author

@scikit-learn-bot update lock-files --select-build doc

@StefanieSenger
Copy link
Contributor Author

StefanieSenger commented Sep 5, 2024

So, it seems that there is a problem when building sphinx-gallery in parallel for the files where the __init__.py file of model_selection has kind of a split-off part in enable_halving_search_cv.py. @adrinjalali helped me see that this might be a sphinx issue. I am trying to make a reproducible; it will take me a few days.

@StefanieSenger
Copy link
Contributor Author

StefanieSenger commented Sep 6, 2024

I have observed that the failure (ImportError, see current CI) is inconsistent. We have two examples that use enable_halving_search_cv.py and we can see that sometimes, the ImportError is on HalvingGridSearchCV from file A and sometimes it's on HalvingRandomSearchCV from file B, and sometimes it is not appearing at all.

I was trying to reproduce it in an example_project, without success though, because either
a) things there run too fast for parallelism to really to take effect, or
b) the minimal project runs on more cores than the scikit-learn example build, and thus there is less thread-switching. (Edit: I have now run sphinx-build with -j 1 (on one job) and that doesn't make any difference.)
:scratching_head:

Edit: I was experimenting with putting the enablings of the modules into the "builder-inited" handler and into an __init__.py file on top of the doc folder, finally, I tried to add it to the Makefile, and all of that did not work. I will still push this with comments as a reminder of what doesn't work.

@StefanieSenger
Copy link
Contributor Author

It seems that the most recent release of sphinx-gallery 0.17.1 (which main uses now) has resolved our problems. I think it was this fix for adding nested class support(?): sphinx-gallery/sphinx-gallery#1364.

@StefanieSenger
Copy link
Contributor Author

Running examples/model_selection/plot_cv_indices.py in the CircleCI failed twice with:

Too long with no output (exceeded 10m0s): context deadline exceeded

Locally, this example runs quite fast in a matter of 1-2 seconds. I'm not quite sure what could be the reason.

@lesteve
Copy link
Member

lesteve commented Sep 18, 2024

No idea, I would say try to push an empty commit and see whether it goes away ...

@ogrisel
Copy link
Member

ogrisel commented Sep 23, 2024

Maybe you can also try to add to the beginning of the example the following debugging code:

import faulthandler

faulthandler.dump_traceback_later(60, exit=True)

and:

cancel_dump_traceback_later()

at the end.

This will interrupt the Python execution after 60 seconds and print tracebacks to identify what is causing the slow/frozen execution. I am not sure if passing exit=True is a good idea or not to help debugging.

https://docs.python.org/3/library/faulthandler.html

The point is to collect extra info in the CI logs when the execution of this example happens on the CI prior to reaching sphinx-gallery own time out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants