MNT Parallel build of sphinx_gallery #29614

StefanieSenger · 2024-08-02T12:31:51Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This is an attempt to configure the doc/conf.py so that the sphinx_gallery build runs on two cores (see issue). The goal of this PR is to trigger the CI and see how it deals with it.

Locally, running make html and OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 make html (I haven't yet been able to measure differences in speed, but they) seem equally stable, except for one tricky thing:

examples/feature_selection/plot_rfe_with_cross_validation.py triggers the following errors (either one, I've ran it multiple times) within resp. right after the parallelized joblib part in RFECV.fit().

AttributeError

../examples/feature_selection/plot_rfe_with_cross_validation.py unexpectedly failed to execute correctly:

Traceback (most recent call last):
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/examples/feature_selection/plot_rfe_with_cross_validation.py", line 60, in <module>
    rfecv.fit(X, y)
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/base.py", line 1514, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/feature_selection/_rfe.py", line 774, in fit
    scores_features = parallel(
                      ^^^^^^^^^
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/utils/parallel.py", line 77, in __call__
    return super().__call__(iterable_with_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/joblib/parallel.py", line 2007, in __call__
    return output if self.return_generator else list(output)
                                                ^^^^^^^^^^^^
  File "/home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/joblib/parallel.py", line 1650, in _get_outputs
    yield from self._retrieve()
  File "/home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/joblib/parallel.py", line 1754, in _retrieve
    self._raise_error_fast()
  File "/home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/joblib/parallel.py", line 1789, in _raise_error_fast
    error_job.get_result(self.timeout)
  File "/home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/joblib/parallel.py", line 745, in get_result
    return self._return_or_raise()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/joblib/parallel.py", line 763, in _return_or_raise
    raise self._result
AttributeError: 'LogisticRegression' object has no attribute 'coef_'

ValueError

../examples/feature_selection/plot_rfe_with_cross_validation.py unexpectedly failed to execute correctly:

Traceback (most recent call last):
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/examples/feature_selection/plot_rfe_with_cross_validation.py", line 60, in <module>
    rfecv.fit(X, y)
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/base.py", line 1514, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/feature_selection/_rfe.py", line 781, in fit
    scores = np.array(scores)
             ^^^^^^^^^^^^^^^^
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (5,) + inhomogeneous part.

Can this be due to a race condition? If it appears on the CI also, we need to deal with it somehow.

github-actions · 2024-08-02T12:33:15Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

_{Generated for commit: 1ef8457. Link to the linter CI: here}

StefanieSenger · 2024-08-02T12:46:53Z

For the CI failure in ci/circleci: doc-min-dependencies:

This new feature in sphinx_gallery is only available from version 0.17, and we use "sphinx-gallery": ("0.16.0", "docs"), for this test. But I'm not sure what we would usually do in such a case.

For the CI failure in ci/circleci: doc:

This is one of the two errors that can occur that I also had locally.

lesteve · 2024-08-03T06:20:11Z

I think the first question to try to answer is "what could we gain with this"

either to speed up the CI build. Right now it seems, this saves maybe 10 minutes out of 45 minutes so I am wondering whether this is worth pursuing 🤔. Maybe push a few commits with [doc build] because this can vary from one run to the next. The doc build in this PR took 35 minutes. I have in mind that on main it is more ~45 minutes which matches recents CircleCI builds on main
or when running locally although I guess running the full doc build does not happen that often and if this involves remembering to set environment variables this is probably not that great. Having said that setting the environment variables could be done in doc/conf.py

A few more comments before I forget:

one downside could be that some examples are showing the impact on parallelism like changing n_jobs and they could become misleading since we only have one
switching to a bigger CircleCI worker could be an option if the main aim is to make the CircleCI doc build faster Try running examples in parallel during doc build #29570 (comment)
some example are using n_jobs=2 or even n_jobs=3 IIRC, which seems like a bad idea if we use the sphinx-gallery parallel feature, since right now we should be having only a single core per example

About the errors, I am a bit suprised that running the examples in parallel can cause this kind of errors but maybe I am missing something.

About the doc-min-dependencies, this can be left for later when we get a better sense of "is this really worth it?"

StefanieSenger · 2024-08-03T06:42:04Z

Thank you @lesteve, I understand the priorities now. I will run some experiments measuring speedup and sum it all up (while disregarding the errors for now). We can talk in September about the "if it is worth it". :)

StefanieSenger · 2024-08-06T13:39:20Z

I have fixed the inconsistent failure (setting number jobs in the example to 1 worked) and ran all of the builds three times each.
Here are the results:

CI build: 34, 32 and 32 minutes
local build with make html: 22, 22 and 21 minutes
local build with OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 make html: 23, 20 and 20 minutes
compared to 28 minutes building locally on main

So, there seems to not be any meaningful difference in the local builds with or without those variables set, but an improvement compared to building without changing the sphinx-gallery configuration.

adrinjalali · 2024-08-13T13:30:15Z

I think 10min difference in CI is worth pursuing this. We can bump the min version to fix the min dependency issue.

As for the local stuff, I'd like to have a spin thingy which disables sets the threads (blas, ...) to 1, and uses all CPU cores available. That should speed up tings locally substantially.

StefanieSenger · 2024-09-02T12:47:45Z

Do you also think it's worth it, @lesteve?

I was thinking about the examples that use n_jobs=more_than_one. If I understand correctly, then they never really ran on several cores, but rather ran as several threads in the past. I think it's legit, since the examples are there to illustrate use cases, and the users will most probably have more cores available.
Building the examples in parallel would probably not change that, would it? Each example will still have one core available and might then use several threads. Do you think there is any need for action?

I don't know so much to say about switching to a bigger CircleCI worker. Is this is separate action that we could take in addition?

lesteve · 2024-09-04T09:21:58Z

Do you also think it's worth it, @lesteve?

I don't know 🙄 I think we can always try and revert if we realize that this causes too many annoyances.

The fact that there were weird Python errors originally was unexpected and that makes me think this is a bit brittle but this may actually be an issue in RFECV (full disclosure I have seen very similar errors in RFECV when testing CPython 3.13 free-threaded but I thought it was only happening because I was setting the joblib backend to threading) ...

lesteve · 2024-09-04T15:40:44Z

FWIW I can reproduce the RFECV issue when running inside a joblib.Parallel #29783 so this seems like a real bug that was uncovered by trying to run the example in parallel.

StefanieSenger · 2024-09-05T10:46:38Z

@scikit-learn-bot update lock-files --select-build doc

StefanieSenger · 2024-09-05T17:51:13Z

So, it seems that there is a problem when building sphinx-gallery in parallel for the files where the __init__.py file of model_selection has kind of a split-off part in enable_halving_search_cv.py. @adrinjalali helped me see that this might be a sphinx issue. I am trying to make a reproducible; it will take me a few days.

StefanieSenger · 2024-09-06T13:39:48Z

I have observed that the failure (ImportError, see current CI) is inconsistent. We have two examples that use enable_halving_search_cv.py and we can see that sometimes, the ImportError is on HalvingGridSearchCV from file A and sometimes it's on HalvingRandomSearchCV from file B, and sometimes it is not appearing at all.

I was trying to reproduce it in an example_project, without success though, because either
a) things there run too fast for parallelism to really to take effect, or
b) the minimal project runs on more cores than the scikit-learn example build, and thus there is less thread-switching. (Edit: I have now run sphinx-build with -j 1 (on one job) and that doesn't make any difference.)
:scratching_head:

Edit: I was experimenting with putting the enablings of the modules into the "builder-inited" handler and into an __init__.py file on top of the doc folder, finally, I tried to add it to the Makefile, and all of that did not work. I will still push this with comments as a reminder of what doesn't work.

…llel_build

StefanieSenger · 2024-09-18T08:14:37Z

It seems that the most recent release of sphinx-gallery 0.17.1 (which main uses now) has resolved our problems. I think it was this fix for adding nested class support(?): sphinx-gallery/sphinx-gallery#1364.

StefanieSenger · 2024-09-18T10:20:24Z

Running examples/model_selection/plot_cv_indices.py in the CircleCI failed twice with:

Too long with no output (exceeded 10m0s): context deadline exceeded

Locally, this example runs quite fast in a matter of 1-2 seconds. I'm not quite sure what could be the reason.

lesteve · 2024-09-18T12:21:08Z

No idea, I would say try to push an empty commit and see whether it goes away ...

ogrisel · 2024-09-23T13:19:12Z

Maybe you can also try to add to the beginning of the example the following debugging code:

import faulthandler

faulthandler.dump_traceback_later(60, exit=True)

and:

cancel_dump_traceback_later()

at the end.

This will interrupt the Python execution after 60 seconds and print tracebacks to identify what is causing the slow/frozen execution. I am not sure if passing exit=True is a good idea or not to help debugging.

https://docs.python.org/3/library/faulthandler.html

The point is to collect extra info in the CI logs when the execution of this example happens on the CI prior to reaching sphinx-gallery own time out.

StefanieSenger added 2 commits August 2, 2024 14:02

add configuration for parallel sphinx_gallery build

1d9d860

[doc build]

9b0dff9

[doc build]

7509695

StefanieSenger added 2 commits August 5, 2024 11:55

[doc build]

86766cc

only one job in example rfe_with_cross_validation

1190134

StefanieSenger and others added 2 commits September 2, 2024 14:53

change min version of gallery and [doc build]

a345b15

Merge branch 'main' into parallel_build

d04f3e7

lesteve mentioned this pull request Sep 4, 2024

Running RFECV.fit inside joblib.Parallel causes ValueError or AttributeError #29783

Closed

change min version in pyproject.toml [doc build]

42fcb54

web-flow and others added 2 commits September 5, 2024 10:48

Update lock files

6dc229b

clarify documentation on re-building lock files

5847515

more precise documentation of how to update lock files

bb1e8a4

StefanieSenger added 6 commits September 6, 2024 17:24

unsuccessful attempts to run enabling files before parallel build

bd64c24

typo

2001585

improve comment on setattr()

792a981

Merge branch 'main' of github.com:scikit-learn/scikit-learn into para…

0ccdb2e

…llel_build

re-build lock files

14287c0

Merge branch 'main' of github.com:scikit-learn/scikit-learn into para…

3c571f3

…llel_build

rebuild lock-files after merge with main [doc build]

862429a

StefanieSenger added 2 commits September 18, 2024 10:16

delete unnecessary and unsuccesful fixing attempts [doc build]

19978b0

[doc build]

d90da62

[doc build]

1ef8457

lesteve mentioned this pull request Oct 30, 2024

FIX Make RFECV thread-safe when used with joblib threading backend #30176

Merged

Uh oh!

MNT Parallel build of sphinx_gallery #29614

Are you sure you want to change the base?

MNT Parallel build of sphinx_gallery #29614

Uh oh!

Conversation

StefanieSenger commented Aug 2, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Uh oh!

github-actions bot commented Aug 2, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✔️ Linting Passed

Uh oh!

StefanieSenger commented Aug 2, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lesteve commented Aug 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

StefanieSenger commented Aug 3, 2024

Uh oh!

StefanieSenger commented Aug 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adrinjalali commented Aug 13, 2024

Uh oh!

StefanieSenger commented Sep 2, 2024

Uh oh!

lesteve commented Sep 4, 2024

Uh oh!

lesteve commented Sep 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

StefanieSenger commented Sep 5, 2024

Uh oh!

StefanieSenger commented Sep 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

StefanieSenger commented Sep 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

StefanieSenger commented Sep 18, 2024

Uh oh!

StefanieSenger commented Sep 18, 2024

Uh oh!

lesteve commented Sep 18, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ogrisel commented Sep 23, 2024

Uh oh!

Uh oh!

StefanieSenger commented Aug 2, 2024 •

edited

Loading

github-actions bot commented Aug 2, 2024 •

edited

Loading

StefanieSenger commented Aug 2, 2024 •

edited

Loading

lesteve commented Aug 3, 2024 •

edited

Loading

StefanieSenger commented Aug 6, 2024 •

edited

Loading

lesteve commented Sep 4, 2024 •

edited

Loading

StefanieSenger commented Sep 5, 2024 •

edited

Loading

StefanieSenger commented Sep 6, 2024 •

edited

Loading

lesteve commented Sep 18, 2024 •

edited

Loading