MNT Parallel build of sphinx_gallery #29614
Conversation
For the CI failure in …: this new feature in sphinx_gallery is only available from version 0.17, and we use ….

For the CI failure in …: this is one of the two errors that can occur, which I also had locally.
I think the first question to try to answer is "what could we gain with this"
A few more comments before I forget:
About the errors, I am a bit surprised that running the examples in parallel can cause this kind of error, but maybe I am missing something. About the doc-min-dependencies build, this can be left for later, when we get a better sense of "is this really worth it?"
Thank you @lesteve, I understand the priorities now. I will run some experiments measuring the speedup and sum it all up (disregarding the errors for now). We can talk in September about whether it is worth it. :)
I have fixed the inconsistent failure (setting the number of jobs in the example to 1 worked) and ran all of the builds three times each.
So, there seems to be no meaningful difference between the local builds with or without those variables set, but there is an improvement compared to building without changing the sphinx-gallery configuration.
I think a 10 min difference in CI is worth pursuing. We can bump the min version to fix the min-dependency issue. As for the local stuff, I'd like to have a …
Do you also think it's worth it, @lesteve? I was thinking about the examples that use ….

I don't know enough to say much about switching to a bigger CircleCI worker. Is this a separate action that we could take in addition?
I don't know 🙄 I think we can always try, and revert if we realize that this causes too many annoyances. The fact that there were weird Python errors originally was unexpected, and that makes me think this is a bit brittle, but it may actually be an issue in …
FWIW I can reproduce the RFECV issue when running inside a …
@scikit-learn-bot update lock-files --select-build doc |
So, it seems that there is a problem when building sphinx-gallery in parallel for the files where the …
I have observed that the failure (…) occurs. I was trying to reproduce it in an example_project, without success though, because either ….

Edit: I was experimenting with putting the enabling of the modules into the "builder-inited" handler and into an …
It seems that the most recent release of sphinx-gallery, 0.17.1 (which main uses now), has resolved our problems. I think it was this fix adding nested class support(?): sphinx-gallery/sphinx-gallery#1364.
Running …

Locally, this example runs quite fast, in a matter of 1-2 seconds. I'm not quite sure what the reason could be.
No idea, I would say try to push an empty commit and see whether it goes away …
Maybe you can also try to add the following debugging code to the beginning of the example:

```python
import faulthandler

faulthandler.dump_traceback_later(60, exit=True)
```

and at the end:

```python
faulthandler.cancel_dump_traceback_later()
```

This will interrupt the Python execution after 60 seconds and print tracebacks to identify what is causing the slow/frozen execution. I am not sure if passing ….

https://docs.python.org/3/library/faulthandler.html

The point is to collect extra info in the CI logs when the execution of this example happens on the CI, prior to reaching sphinx-gallery's own timeout.
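A self-contained sketch of the watchdog idea above, safe to run outside CI: the 0.5 s timeout, the temporary file, and `exit=False` are demonstration choices, not what the example would actually use (there it would be 60 s, stderr, and `exit=True`).

```python
import faulthandler
import tempfile
import time

# Arm a watchdog that dumps all thread tracebacks to a file if execution
# takes longer than the timeout; exit=False keeps the process alive.
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback_later(0.5, file=f, exit=False)
    time.sleep(1.0)  # simulate a slow/frozen example; the watchdog fires
    faulthandler.cancel_dump_traceback_later()
    f.seek(0)
    report = f.read()

# The dump contains per-thread tracebacks in faulthandler's usual format.
assert "most recent call first" in report
```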
Reference Issues/PRs
#29570
What does this implement/fix? Explain your changes.
This is an attempt to configure `doc/conf.py` so that the sphinx_gallery build runs on two cores (see issue). The goal of this PR is to trigger the CI and see how it deals with it.

Locally, running `make html` and `OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 make html` seem equally stable (I haven't yet been able to measure differences in speed), except for one tricky thing: `examples/feature_selection/plot_rfe_with_cross_validation.py` triggers one of the following errors (I've run it multiple times) within, resp. right after, the parallelized joblib part in `RFECV.fit()`:

- `AttributeError`
- `ValueError`
Can this be due to a race condition? If it appears on the CI also, we need to deal with it somehow.
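For reference, the sphinx_gallery side of such a change would look roughly like this in `doc/conf.py` (a sketch, not the actual diff: the `parallel` option exists in sphinx-gallery 0.17+, but the value `2` and the placeholder keys around it are assumptions):

```python
# doc/conf.py (fragment) -- hypothetical sketch of the configuration change
sphinx_gallery_conf = {
    # ... existing gallery configuration would stay as-is ...
    # Run the example scripts on 2 cores via joblib (sphinx-gallery >= 0.17):
    "parallel": 2,
}
```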