
Parallel computing with nested cross-validation #10232


Closed
mattvan83 opened this issue Nov 30, 2017 · 18 comments
Labels
Documentation, Easy, good first issue, help wanted

Comments

@mattvan83

Dear sklearn experts,

Standard use of nested cross-validation within sklearn doesn't allow multi-core computing. As in the example below, n_jobs has to be set to 1 for the inner/outer loops:

gs = GridSearchCV(pipe_svc, param_grid, scoring=score_type, cv=cv, n_jobs=1)
scores = cross_val_score(gs, X, y, scoring=score_type, cv=cv, n_jobs=1)

Would there be a not-too-difficult way to parallelize jobs in nested cross-validation? It would greatly reduce the computation time.

Thanks in advance!

Best,
Matthieu

@agramfort
Member

agramfort commented Nov 30, 2017 via email

@glemaitre
Member

You can probably refer to #3754, which should be closely related to your question.

> why do you think it must be set to 1?

Good catch :)

@glemaitre
Member

I am closing this, since it is more related to joblib.

@amueller
Member

amueller commented Dec 8, 2017

Should this be in the FAQ? I get asked this regularly (and on a webcast two days ago). It's also non-obvious to users how to decide which n_jobs to set and we could give some guidance on that.

@glemaitre
Member

> Should this be in the FAQ? I get asked this regularly (and on a webcast two days ago). It's also non-obvious to users how to decide which n_jobs to set and we could give some guidance on that.

it seems reasonable.

@glemaitre glemaitre reopened this Dec 8, 2017
@glemaitre glemaitre added the Documentation, Easy, good first issue, help wanted labels Dec 8, 2017
@akshkr

akshkr commented Dec 13, 2017

Can I take this?

@jnothman
Member

jnothman commented Dec 13, 2017 via email

@akshkr

akshkr commented Dec 13, 2017

> Should this be in the FAQ? I get asked this regularly (and on a webcast two days ago). It's also non-obvious to users how to decide which n_jobs to set and we could give some guidance on that.

This is the one I'll have to solve, right?

@jnothman
Member

jnothman commented Dec 13, 2017 via email

@jnothman
Member

jnothman commented Dec 13, 2017 via email

@akshkr

akshkr commented Dec 13, 2017

Will figure it out.

@H4dr1en
Contributor

H4dr1en commented May 2, 2019

Has this been done now? Do we still have to specify n_jobs=1 for the inner parallelism (estimator)?

@jnothman
Member

jnothman commented May 2, 2019 via email

@ManishAradwad
Contributor

ManishAradwad commented Apr 14, 2020

There already seem to be two questions in the FAQ (First and Second) that address the n_jobs parameter's use cases, but they don't explain how it should be used.

Am I supposed to add a new question explaining how to use the n_jobs parameter? Also, this section of the Glossary seems to address that part well enough.

@jinamshah

Hi,
Is this still an issue? If so, can someone help me understand it better?

@tnwei
Contributor

tnwei commented Oct 13, 2020

From the last PR linked above, nested parallelism is now enabled by default when you use parallel_backend("dask"); see the docs. This solves the original problem that led to this issue being opened, so I believe further clarification in the docs shouldn't be required anymore.
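
For reference, a minimal sketch of that pattern (assuming a local dask.distributed cluster; the estimator, grid, and data below are placeholders, not from this thread):

from dask.distributed import Client
from joblib import parallel_backend
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
client = Client()  # local cluster; the "dask" joblib backend will use it

# inner loop: hyperparameter search; outer loop: cross_val_score
gs = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3, n_jobs=-1)
with parallel_backend("dask"):
    scores = cross_val_score(gs, X, y, cv=3, n_jobs=-1)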

Unless we want to document this more extensively in a tutorial, I suggest we close this issue.

@NicolasHug
Member

I'm going to close this as the original post was addressed: no need to set n_jobs to 1.

As to documenting what n_jobs should be in nested CV procedures, this is neither easy nor a good first issue IMHO.

We have related docs in https://scikit-learn.org/stable/modules/computing.html#parallelism and there is also #14228 which I think will tackle most of it.
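
In the meantime, a minimal sketch of one common setup (an illustration, not official scikit-learn guidance; estimator and data are placeholders): parallelize only the outer loop, so the total worker count stays bounded instead of multiplying across nesting levels.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# serial inner search: each outer-fold worker runs its grid search sequentially
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5, n_jobs=1)

# parallel outer loop: up to one worker per outer fold
scores = cross_val_score(inner, X, y, cv=5, n_jobs=-1)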

@robna
Contributor

robna commented Nov 3, 2022

Although this is closed, I think it would be good to get an update here for users who land on this issue now:

What is the current best practice for parallelism in nested cross-validation with sklearn today?
(Nov 2022: sklearn stable v1.1.3 or 1.2.dev0 would be the relevant recent versions.)
@mattvan83 @NicolasHug

Running inside a Jupyter notebook, I am trying to use parallel computation on a server (120 CPU cores) like so:

from joblib import parallel_backend
from sklearn.model_selection import GridSearchCV, cross_validate

with parallel_backend('loky', n_jobs=-1):
    innerCV = GridSearchCV(
        pipe,
        params,
        scoring=scoring,
        refit=refit_scorer,
        cv=10,
        verbose=1,
        )

    outerCV = cross_validate(
        innerCV,
        model_X,
        model_y,
        scoring=scoring,
        cv=10,
        return_estimator=True,
        verbose=1,
        )

pipe is my estimator object, which is itself a sklearn.pipeline wrapping some transformers and the various candidate estimators specified in my params grid.

It runs without errors; however, I am not sure it is fully optimised. At times during the fit I see load on all CPUs, but most of the time just 10 of them get to work. I assume this is due to the cv=10 I am using here (though is it the inner or the outer loop that gets parallelised?).

The times when all CPUs are in use might be when an estimator with some internal (numpy) parallelisation is being tested, I assume?

So, is this a sensible way to approach nested CV parallelisation in sklearn today? Or would it be better to:

  • specify n_jobs in the inner and outer CV individually instead of using a context manager (see the sketch below)?
    • and if so, should they both get n_jobs=-1?
  • use the dask backend instead of loky (would it be beneficial on a single machine)?
  • go about it in a completely different way?

Any guidance welcome!
Thanks, robna.
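
As a back-of-the-envelope sketch of the first option above (explicit n_jobs per level, reusing the names and imports from the snippet above; the 10 x 12 split for a 120-core machine is an assumption, not a benchmark):

outer_jobs = 10                  # one worker per outer fold (cv=10)
inner_jobs = 120 // outer_jobs   # 12 inner workers per fold, ~120 tasks in flight

innerCV = GridSearchCV(pipe, params, scoring=scoring, refit=refit_scorer,
                       cv=10, n_jobs=inner_jobs)
outerCV = cross_validate(innerCV, model_X, model_y, scoring=scoring,
                         cv=10, return_estimator=True, n_jobs=outer_jobs)

Whether the inner level actually runs in parallel under loky depends on joblib's nested-parallelism handling, so watching CPU load as described above remains the practical check.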
