Skip to content

Add ability to override joblib backend for scikit-learn estimators #8804

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jcrist opened this issue Apr 27, 2017 · 19 comments
Closed

Add ability to override joblib backend for scikit-learn estimators #8804

jcrist opened this issue Apr 27, 2017 · 19 comments
Labels
Needs Decision - Close Requires decision for closing Performance

Comments

@jcrist
Copy link

jcrist commented Apr 27, 2017

Some calls to Parallel in scikit-learn hardcode what joblib backend to use. It would be nice if these could be overridden to use a different backend using the parallel_backend contextmanager. This would allow optionally using the dask.distributed backend in more places, which may provide speedups (see comment here).

One way to do this would be to check if there's a globally set backend (as set by the context manager) and use that, otherwise use the specified fallback. This might look like:

from sklearn.externals.joblib.parallel import _backend

def active_backend_or(default):
    """If there is an active joblib backend use that, otherwise use the default"""
    return getattr(_backend, 'backend_and_jobs', (default, None))[0]

# Use the active backend if set, otherwise use "threading"
Parallel(backend=active_backend_or("threading"))(...)

I'm not sure if this is a fix that should be implemented in scikit-learn or in joblib. Opening this for discussion.

@lesteve
Copy link
Member

lesteve commented Apr 28, 2017

Now I am wondering why using a context manager only overrides the default backend (when not specified explicitly) and not all of them.

In other words:

with parallel_backend(outer_backend):
    Parallel(backend=inner_backend)(...)

Should outer_backend have priority over inner_backend?

Ping @ogrisel, I seem to remember we chatted about related things but I don't remember the outcome of the discussion.

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Apr 28, 2017 via email

@lesteve
Copy link
Member

lesteve commented Apr 28, 2017

One problem that I see is that when the backing is threading, the code can safely assume that there is shared memory, and use it. This is no longer the case with other backend.

That's a good point actually and this is going to be a problem since #8672, which was merged around a month ago relies on the threading backend. In https://github.com/scikit-learn/scikit-learn/pull/8672/files#diff-65dc1c804f310acfd90f5ea83286065cR387 the function that was passed into Parallel has side-effects and this only works because backend='threading'.

@jcrist
Copy link
Author

jcrist commented Apr 28, 2017

In your benchmark: did you check that the results were the same in the different backends, or did you only compare the run-time?

I'm not sure exactly what to compare for trees, but they have the same score, and the following passes (for RandomForestClassifier):

for t1, t2 in zip(distributed_res.estimators_, threading_res.estimators_):
     np.testing.assert_array_equal(t1.tree_.value, t2.tree_.value)

That is a good point though, and would need to be considered when determining what can have overridable backends.

Should outer_backend have priority over inner_backend?

I don't think so, as that wouldn't allow internal code to ensure using threading. For the case of RandomForest we could allow fit to use any backend, but predict would require threading (my benchmark found that distributing predict was slower anyway, which makes sense because of the low overhead of predict).

@jnothman
Copy link
Member

jnothman commented Apr 29, 2017 via email

@jcrist
Copy link
Author

jcrist commented May 1, 2017

allow_override=False?

If we were to do that, my preference would be that Parallel backends are not overridable by default, and we set allow_override=True for calls that are known to work fine under non-shared memory backends.

For moving forward on this, is this something that should be an issue/PR to joblib instead of here?

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented May 1, 2017 via email

@jcrist
Copy link
Author

jcrist commented May 1, 2017

Hmmm, that's an interesting idea. I will say that as a novice joblib user, I find the concept of a default backend and whether that backend can/cannot be swapped out to be simpler than a set of semantic requirements (probably simpler to implement too). It's not immediately clear to me what nogil means here (could mean either that my function releases the gil or that I want to compute in processes to avoid the gil?). I will cede to your judgement here though.

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented May 1, 2017 via email

@jmschrei
Copy link
Member

jmschrei commented May 5, 2017 via email

@jcrist
Copy link
Author

jcrist commented May 22, 2017

Is there anything I can do to help expedite this issue? If I was to make an example PR showing what I think should be done, should I submit it here or joblib? Would that be a good next step?

@jnothman
Copy link
Member

jnothman commented May 22, 2017 via email

@jcrist
Copy link
Author

jcrist commented May 23, 2017

See joblib/joblib#524.

@jmschrei
Copy link
Member

This is a really neat addition, thanks for the link. I'll be watching it.

TomAugspurger added a commit to TomAugspurger/scikit-learn that referenced this issue Jun 22, 2018
Uses the new prefer / require keywords from joblib/joblib#602.

This allows users to control how jobs are parallelized in more situations.
For example, training a RandomForest on a cluster of machines with the dask backend.

Closes scikit-learn#8804
tomMoral pushed a commit to ogrisel/scikit-learn that referenced this issue Jun 22, 2018
Uses the new prefer / require keywords from joblib/joblib#602.

This allows users to control how jobs are parallelized in more situations.
For example, training a RandomForest on a cluster of machines with the dask backend.

Closes scikit-learn#8804
@jakirkham
Copy link
Contributor

Should this be closed? PR ( #11166 ) allows optional use of a different joblib and PR ( joblib/joblib#602 ) solved the shared memory constraint. Were there other outstanding issues?

@GaelVaroquaux
Copy link
Member

I am not convinced: right now specifying the backend with system joblib will work only if scikit-learn is "unvendored".

@jnothman
Copy link
Member

jnothman commented Feb 5, 2019 via email

@cmarmo cmarmo added Needs Decision - Close Requires decision for closing Performance labels Dec 16, 2021
@cmarmo
Copy link
Contributor

cmarmo commented Dec 16, 2021

@GaelVaroquaux, @ogrisel, joblib is no longer distributed with scikit-learn, right? Shall this is issue be closed now? Thanks!

@thomasjpfan
Copy link
Member

I think this issue can be closed now. As noted in the dask-ml docs, parallel_backend can be used to switch the parallel backend to "dask".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Decision - Close Requires decision for closing Performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants