Add ability to override joblib backend for scikit-learn estimators #8804

jcrist · 2017-04-27T18:36:50Z

Some calls to Parallel in scikit-learn hardcode what joblib backend to use. It would be nice if these could be overridden to use a different backend using the parallel_backend contextmanager. This would allow optionally using the dask.distributed backend in more places, which may provide speedups (see comment here).

One way to do this would be to check if there's a globally set backend (as set by the context manager) and use that, otherwise use the specified fallback. This might look like:

from sklearn.externals.joblib.parallel import _backend

def active_backend_or(default):
    """If there is an active joblib backend use that, otherwise use the default"""
    return getattr(_backend, 'backend_and_jobs', (default, None))[0]

# Use the active backend if set, otherwise use "threading"
Parallel(backend=active_backend_or("threading"))(...)

I'm not sure if this is a fix that should be implemented in scikit-learn or in joblib. Opening this for discussion.

The text was updated successfully, but these errors were encountered:

lesteve · 2017-04-28T13:49:45Z

Now I am wondering why using a context manager only overrides the default backend (when not specified explicitly) and not all of them.

In other words:

with parallel_backend(outer_backend):
    Parallel(backend=inner_backend)(...)

Should outer_backend have priority over inner_backend?

Ping @ogrisel, I seem to remember we chatted about related things but I don't remember the outcome of the discussion.

GaelVaroquaux · 2017-04-28T13:55:48Z

One problem that I see is that when the backing is threading, the code can safely assume that there is shared memory, and use it. This is no longer the case with other backend. In your benchmark: did you check that the results were the same in the different backends, or did you only compare the run-time? I don't remember if random forest use a shared memory or not.

lesteve · 2017-04-28T14:11:23Z

One problem that I see is that when the backing is threading, the code can safely assume that there is shared memory, and use it. This is no longer the case with other backend.

That's a good point actually and this is going to be a problem since #8672, which was merged around a month ago relies on the threading backend. In https://github.com/scikit-learn/scikit-learn/pull/8672/files#diff-65dc1c804f310acfd90f5ea83286065cR387 the function that was passed into Parallel has side-effects and this only works because backend='threading'.

jcrist · 2017-04-28T15:56:16Z

In your benchmark: did you check that the results were the same in the different backends, or did you only compare the run-time?

I'm not sure exactly what to compare for trees, but they have the same score, and the following passes (for RandomForestClassifier):

for t1, t2 in zip(distributed_res.estimators_, threading_res.estimators_):
     np.testing.assert_array_equal(t1.tree_.value, t2.tree_.value)

That is a good point though, and would need to be considered when determining what can have overridable backends.

Should outer_backend have priority over inner_backend?

I don't think so, as that wouldn't allow internal code to ensure using threading. For the case of RandomForest we could allow fit to use any backend, but predict would require threading (my benchmark found that distributing predict was slower anyway, which makes sense because of the low overhead of predict).

jnothman · 2017-04-29T09:30:32Z

allow_override=False?

…

On 29 Apr 2017 1:56 am, "Jim Crist" ***@***.***> wrote: In your benchmark: did you check that the results were the same in the different backends, or did you only compare the run-time? I'm not sure exactly what to compare for trees, but they have the same score, and the following passes (for RandomForestClassifier): for t1, t2 in zip(distributed_res.estimators_, threading_res.estimators_): np.testing.assert_array_equal(t1.tree_.value, t2.tree_.value) That is a good point though, and would need to be considered when determining what can have overridable backends. Should outer_backend have priority over inner_backend? I don't think so, as that wouldn't allow internal code to ensure using threading. For the case of RandomForest we could allow fit to use any backend, but predict would require threading (my benchmark found that distributing predict was slower anyway, which makes sense because of the low overhead of predict). — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#8804 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz6-pytxxwJWKCUWev4jtC4gdq-icxks5r0gwhgaJpZM4NKmzc> .

jcrist · 2017-05-01T17:18:11Z

allow_override=False?

If we were to do that, my preference would be that Parallel backends are not overridable by default, and we set allow_override=True for calls that are known to work fine under non-shared memory backends.

For moving forward on this, is this something that should be an issue/PR to joblib instead of here?

GaelVaroquaux · 2017-05-01T17:24:38Z

I think that the notion of overrides is going to be challenging to do in a non fragile way. Maybe we should have a semantics in the way we specify the backends. To give an example, what the threading backend provides is shared memory, what it doesn't provide is parallel computing for gil. A semantic specification might look like "backend='nogil,shared_mem'", which would say that the code needs shared mem, but doesn't need gil-level parallelism. Another dimension to look at would be whether the operations are expected to be very short or not.

jcrist · 2017-05-01T17:37:04Z

Hmmm, that's an interesting idea. I will say that as a novice joblib user, I find the concept of a default backend and whether that backend can/cannot be swapped out to be simpler than a set of semantic requirements (probably simpler to implement too). It's not immediately clear to me what nogil means here (could mean either that my function releases the gil or that I want to compute in processes to avoid the gil?). I will cede to your judgement here though.

GaelVaroquaux · 2017-05-01T17:40:50Z

It's not immediately clear to me what nogil means here (could mean either that my function releases the gil or that I want to compute in processes to avoid the gil?).

I agree that the semantics should be made clear in documentation. I also agree that better wording must be explored. The question is rather: are these the good axis to explore? Ping @ogrisel, @aabadie, @lesteve

jmschrei · 2017-05-05T18:10:07Z

I think it would be good to allow users to directly specify what the input to joblib is if they want to override it. That way they can call upon documentation from both sklearn, joblib, and other packages built on joblib to try to understand what is happening better. In sklearn documentation we would want to specify why we chose what we did, for example that 'threading' is useful for trees, but not for estimators that don't release the GIL.

…

On Mon, May 1, 2017 at 10:41 AM, Gael Varoquaux ***@***.***> wrote: > It's not immediately clear to me what nogil means here (could mean > either that my function releases the gil or that I want to compute in > processes to avoid the gil?). I agree that the semantics should be made clear in documentation. I also agree that better wording must be explored. The question is rather: are these the good axis to explore? Ping @ogrisel, @aabadie, @lesteve — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#8804 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADvEEH0NtnXeWp-1459QW4jZkBs-3mH_ks5r1hktgaJpZM4NKmzc> .

jcrist · 2017-05-22T21:25:12Z

Is there anything I can do to help expedite this issue? If I was to make an example PR showing what I think should be done, should I submit it here or joblib? Would that be a good next step?

jnothman · 2017-05-22T22:29:31Z

if it requires changes to joblib, best to raise it there

…

On 23 May 2017 7:25 am, "Jim Crist" ***@***.***> wrote: Is there anything I can do to help expedite this issue? If I was to make an example PR showing what I think should be done, should I submit it here or joblib? Would that be a good next step? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#8804 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz621WBJreYMQ35lXfNC5Rnr5GcIL1ks5r8f06gaJpZM4NKmzc> .

jcrist · 2017-05-23T17:25:50Z

See joblib/joblib#524.

jmschrei · 2017-05-24T07:08:36Z

This is a really neat addition, thanks for the link. I'll be watching it.

Uses the new prefer / require keywords from joblib/joblib#602. This allows users to control how jobs are parallelized in more situations. For example, training a RandomForest on a cluster of machines with the dask backend. Closes scikit-learn#8804

jakirkham · 2018-07-02T15:08:29Z

Should this be closed? PR ( #11166 ) allows optional use of a different joblib and PR ( joblib/joblib#602 ) solved the shared memory constraint. Were there other outstanding issues?

GaelVaroquaux · 2018-07-02T15:36:38Z

I am not convinced: right now specifying the backend with system joblib will work only if scikit-learn is "unvendored".

jnothman · 2019-02-05T20:39:14Z

We provide sklearn.utils.parallel_backend for this reason. It's not perfect.

cmarmo · 2021-12-16T23:05:19Z

@GaelVaroquaux, @ogrisel, joblib is no longer distributed with scikit-learn, right? Shall this is issue be closed now? Thanks!

thomasjpfan · 2022-06-30T19:37:25Z

I think this issue can be closed now. As noted in the dask-ml docs, parallel_backend can be used to switch the parallel backend to "dask".

jcrist mentioned this issue May 17, 2017

Joblib scatter dask/distributed#1022

Merged

jcrist mentioned this issue May 23, 2017

Add allow_override kwarg to Parallel joblib/joblib#524

Closed

TomAugspurger mentioned this issue Jun 22, 2018

[WIP] Use Joblib backend hints rather than hardcoding #11345

Closed

cmarmo added Needs Decision - Close Requires decision for closing Performance labels Dec 16, 2021

thomasjpfan closed this as completed Jun 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ability to override joblib backend for scikit-learn estimators #8804

Add ability to override joblib backend for scikit-learn estimators #8804

jcrist commented Apr 27, 2017 •

edited

Loading

lesteve commented Apr 28, 2017 •

edited

Loading

GaelVaroquaux commented Apr 28, 2017 via email

lesteve commented Apr 28, 2017 •

edited

Loading

jcrist commented Apr 28, 2017

jnothman commented Apr 29, 2017 via email

jcrist commented May 1, 2017

GaelVaroquaux commented May 1, 2017 via email

jcrist commented May 1, 2017

GaelVaroquaux commented May 1, 2017 via email

jmschrei commented May 5, 2017 via email

jcrist commented May 22, 2017

jnothman commented May 22, 2017 via email

jcrist commented May 23, 2017

jmschrei commented May 24, 2017

jakirkham commented Jul 2, 2018

GaelVaroquaux commented Jul 2, 2018

jnothman commented Feb 5, 2019 via email

cmarmo commented Dec 16, 2021

thomasjpfan commented Jun 30, 2022

Add ability to override joblib backend for scikit-learn estimators #8804

Add ability to override joblib backend for scikit-learn estimators #8804

Comments

jcrist commented Apr 27, 2017 • edited Loading

lesteve commented Apr 28, 2017 • edited Loading

GaelVaroquaux commented Apr 28, 2017 via email

lesteve commented Apr 28, 2017 • edited Loading

jcrist commented Apr 28, 2017

jnothman commented Apr 29, 2017 via email

jcrist commented May 1, 2017

GaelVaroquaux commented May 1, 2017 via email

jcrist commented May 1, 2017

GaelVaroquaux commented May 1, 2017 via email

jmschrei commented May 5, 2017 via email

jcrist commented May 22, 2017

jnothman commented May 22, 2017 via email

jcrist commented May 23, 2017

jmschrei commented May 24, 2017

jakirkham commented Jul 2, 2018

GaelVaroquaux commented Jul 2, 2018

jnothman commented Feb 5, 2019 via email

cmarmo commented Dec 16, 2021

thomasjpfan commented Jun 30, 2022

jcrist commented Apr 27, 2017 •

edited

Loading

lesteve commented Apr 28, 2017 •

edited

Loading

lesteve commented Apr 28, 2017 •

edited

Loading