
Parallel computing with nested cross-validation #10232


Closed
mattvan83 opened this issue Nov 30, 2017 · 18 comments
Labels
Documentation, Easy, good first issue, help wanted

Comments

@mattvan83

Dear sklearn experts,

Standard use of nested cross-validation within sklearn doesn't allow multi-core computing. As in the example below, n_jobs has to be set to 1 for the inner/outer loops:

gs = GridSearchCV(pipe_svc, param_grid, scoring=score_type, cv=cv, n_jobs=1)
scores = cross_val_score(gs, X, y, scoring=score_type, cv=cv, n_jobs=1)

Would there be a not-too-difficult way to parallelize jobs in nested cross-validation? It would greatly reduce the computation time.

Thanks in advance!

Best,
Matthieu

@agramfort
Member

agramfort commented Nov 30, 2017 via email

@glemaitre
Member

You can probably refer to #3754, which should be closely related to your question.

> why do you think it must be set to 1?

Good catch :)

@glemaitre
Member

I am closing this, since it is more related to joblib.

@amueller
Member

amueller commented Dec 8, 2017

Should this be in the FAQ? I get asked this regularly (and on a webcast two days ago). It's also non-obvious to users how to decide which n_jobs to set and we could give some guidance on that.

@glemaitre
Member

> Should this be in the FAQ? I get asked this regularly (and on a webcast two days ago). It's also non-obvious to users how to decide which n_jobs to set and we could give some guidance on that.

it seems reasonable.

@glemaitre glemaitre reopened this Dec 8, 2017
@glemaitre glemaitre added the Documentation, Easy, good first issue, help wanted labels Dec 8, 2017
@akshkr

akshkr commented Dec 13, 2017

Can I take this?

@jnothman
Member

jnothman commented Dec 13, 2017 via email

@akshkr

akshkr commented Dec 13, 2017

> Should this be in the FAQ? I get asked this regularly (and on a webcast two days ago). It's also non-obvious to users how to decide which n_jobs to set and we could give some guidance on that.

This is the one I'll have to solve, right?

@jnothman
Member

jnothman commented Dec 13, 2017 via email

@jnothman
Member

jnothman commented Dec 13, 2017 via email

@akshkr

akshkr commented Dec 13, 2017

Will figure it out.

@H4dr1en
Contributor

H4dr1en commented May 2, 2019

Has this been done now? Do we still have to specify n_jobs=1 for the inner parallelism (estimator)?

@jnothman
Member

jnothman commented May 2, 2019 via email

@ManishAradwad
Contributor

ManishAradwad commented Apr 14, 2020

There already seem to be two questions in the FAQ (First and Second) that address the n_jobs parameter's use cases, but they don't explain how it should be used.

Am I supposed to add a new question explaining how to use the n_jobs parameter? Also, this section of the Glossary seems to address that part well enough.

@jinamshah

Hi,
Is this still an issue? If so, can someone help me understand it better?

@tnwei
Contributor

tnwei commented Oct 13, 2020

From the last PR linked above, nested parallelism is now enabled by default when you use parallel_backend("dask"); see the docs. This solves the original problem that led to this issue being opened, so I believe further clarification in the docs shouldn't be required anymore.
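
For reference, a minimal sketch of that pattern (assuming a local dask.distributed cluster; the estimator, grid, and data below are placeholders, not from this thread):

from dask.distributed import Client
from joblib import parallel_backend
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
client = Client()  # local cluster; the "dask" joblib backend will use it

# inner loop: hyperparameter search; outer loop: cross_val_score
gs = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3, n_jobs=-1)
with parallel_backend("dask"):
    scores = cross_val_score(gs, X, y, cv=3, n_jobs=-1)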

Unless we want to document this more extensively in a tutorial, I suggest we close this issue.

@NicolasHug
Member

I'm going to close this as the original post was addressed: no need to set n_jobs to 1.

As to documenting what n_jobs should be in nested CV procedures, this is neither easy nor a good first issue IMHO.

We have related docs in https://scikit-learn.org/stable/modules/computing.html#parallelism and there is also #14228 which I think will tackle most of it.
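
In the meantime, a minimal sketch of one common setup (an illustration, not official scikit-learn guidance; estimator and data are placeholders): parallelize only the outer loop, so the total worker count stays bounded instead of multiplying across nesting levels.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# serial inner search: each outer-fold worker runs its grid search sequentially
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5, n_jobs=1)

# parallel outer loop: up to one worker per outer fold
scores = cross_val_score(inner, X, y, cv=5, n_jobs=-1)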

@robna
Contributor

robna commented Nov 3, 2022

Although this is closed, I think it would be good to get an update here for users who land on this issue now:

What is the current best practice for parallelism in nested cross-validation with sklearn today?
(Nov 2022: sklearn stable v1.1.3 or 1.2.dev0 would be the relevant recent versions.)
@mattvan83 @NicolasHug

Running inside a Jupyter notebook, I am trying to use parallel computation on a server (120 CPU cores) like so:

from joblib import parallel_backend
from sklearn.model_selection import GridSearchCV, cross_validate

with parallel_backend('loky', n_jobs=-1):
    innerCV = GridSearchCV(
        pipe,
        params,
        scoring=scoring,
        refit=refit_scorer,
        cv=10,
        verbose=1,
        )

    outerCV = cross_validate(
        innerCV,
        model_X,
        model_y,
        scoring=scoring,
        cv=10,
        return_estimator=True,
        verbose=1,
        )

pipe is my estimator object, which is itself a sklearn.pipeline wrapping some transformers and the various candidate estimators specified in my params grid.

It runs without errors; however, I am not sure it is fully optimised. At times during the fit I see load on all CPUs, but most of the time just 10 of them get to work. I assume this is due to the cv=10 I am using here (though is it the inner or the outer loop that gets parallelised?).

The times when all CPUs are in use might be when an estimator with some internal (numpy) parallelisation is being tested, I assume?

So, is this a sensible way to approach nested CV parallelisation in sklearn today? Or would it be better to:

  • specify n_jobs in the inner and outer CV individually instead of using a context manager (see the sketch below)?
    • and if so, should they both get n_jobs=-1?
  • use the dask backend instead of loky (would it be beneficial on a single machine)?
  • go about it in a completely different way?

Any guidance welcome!
Thanks, robna.
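
As a back-of-the-envelope sketch of the first option above (explicit n_jobs per level, reusing the names and imports from the snippet above; the 10 x 12 split for a 120-core machine is an assumption, not a benchmark):

outer_jobs = 10                  # one worker per outer fold (cv=10)
inner_jobs = 120 // outer_jobs   # 12 inner workers per fold, ~120 tasks in flight

innerCV = GridSearchCV(pipe, params, scoring=scoring, refit=refit_scorer,
                       cv=10, n_jobs=inner_jobs)
outerCV = cross_validate(innerCV, model_X, model_y, scoring=scoring,
                         cv=10, return_estimator=True, n_jobs=outer_jobs)

Whether the inner level actually runs in parallel under loky depends on joblib's nested-parallelism handling, so watching CPU load as described above remains the practical check.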
