Add ability to override joblib backend for scikit-learn estimators #8804
Now I am wondering why using a context manager only overrides the default backend (when a backend is not specified explicitly) and not all of them. In other words:

```python
with parallel_backend(outer_backend):
    Parallel(backend=inner_backend)(...)
```

Should outer_backend have priority over inner_backend? |
Ping @ogrisel, I seem to remember we chatted about related things but I don't remember the outcome of the discussion. |
One problem that I see is that when the backend is threading, the code
can safely assume that there is shared memory, and use it. This is no
longer the case with other backends.
In your benchmark: did you check that the results were the same in the
different backends, or did you only compare the run-time? I don't
remember if random forests use shared memory or not.
|
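The shared-memory caveat above can be seen directly with joblib (a sketch, assuming joblib and NumPy are installed; `fill` is an illustrative helper, not anything from scikit-learn): with the threading backend, in-place writes made by workers are visible to the caller, while a process-based backend such as loky works on pickled copies for small arrays.

```python
import numpy as np
from joblib import Parallel, delayed

def fill(arr, i):
    # In-place write: visible to the caller only if memory is shared.
    arr[i] = i

a = np.zeros(4)
Parallel(n_jobs=2, backend="threading")(delayed(fill)(a, i) for i in range(4))
# Threads share memory, so a is now [0., 1., 2., 3.]

b = np.zeros(4)
Parallel(n_jobs=2, backend="loky")(delayed(fill)(b, i) for i in range(4))
# Worker processes mutate their own copies (small arrays are pickled,
# not memmapped by default), so b stays all zeros.
```

This is exactly why code written against the threading backend can silently return wrong results when another backend is swapped in.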
That's a good point actually, and this is going to be a problem since #8672, which was merged around a month ago and relies on the threading backend. In https://github.com/scikit-learn/scikit-learn/pull/8672/files#diff-65dc1c804f310acfd90f5ea83286065cR387 the function that was passed into |
I'm not sure exactly what to compare for trees, but they have the same score, and the following passes (for RandomForestClassifier):

```python
for t1, t2 in zip(distributed_res.estimators_, threading_res.estimators_):
    np.testing.assert_array_equal(t1.tree_.value, t2.tree_.value)
```

That is a good point though, and would need to be considered when determining what can have overridable backends.

> Should outer_backend have priority over inner_backend?

I don't think so, as that wouldn't allow internal code to ensure using threading. For the case of RandomForest we could allow fit to use any backend, but predict would require threading (my benchmark found that distributing predict was slower anyway, which makes sense because of the low overhead of predict). |
allow_override=False?
…On 29 Apr 2017 1:56 am, "Jim Crist" ***@***.***> wrote:
|
If we were to do that, my preference would be that
For moving forward on this, is this something that should be an issue/PR to joblib instead of here? |
I think that the notion of overrides is going to be challenging to do in
a non-fragile way.
Maybe we should have semantics in the way we specify the backends. To
give an example, what the threading backend provides is shared memory;
what it doesn't provide is parallel computing for code that holds the GIL.
A semantic specification might look like "backend='nogil,shared_mem'",
which would say that the code needs shared memory, but doesn't need
GIL-level parallelism.
Another dimension to look at would be whether the individual operations
are expected to be very short or not.
|
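One way to read this proposal is as a capability match between what code requires and what backends provide. The sketch below is purely illustrative — the capability names, backend table, and `pick_backend` helper are made up here and were never joblib API:

```python
# Toy illustration of a "semantic" backend specification.
# Capability names and the backend table are hypothetical.
BACKEND_CAPABILITIES = {
    "threading": {"shared_mem"},    # shared memory, but no escape from the GIL
    "multiprocessing": {"nogil"},   # separate processes escape the GIL
}

def pick_backend(spec):
    """Return the first backend providing every capability in `spec`.

    `spec` is a comma-separated requirement string, e.g. 'shared_mem'.
    """
    required = {s.strip() for s in spec.split(",")}
    for name, provided in BACKEND_CAPABILITIES.items():
        if required <= provided:
            return name
    raise ValueError(f"no backend provides all of {sorted(required)}")
```

Under this toy model, requesting both `shared_mem` and `nogil` fails, which makes the tension in the discussion concrete: no single backend satisfies both.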
Hmmm, that's an interesting idea. I will say that as a novice joblib user, I find the concept of a default backend and whether that backend can/cannot be swapped out to be simpler than a set of semantic requirements (probably simpler to implement too). It's not immediately clear to me what nogil means here (could mean either that my function releases the gil or that I want to compute in processes to avoid the gil?). |
I think it would be good to allow users to directly specify what the input
to joblib is if they want to override it. That way they can call upon
documentation from sklearn, joblib, and other packages built on joblib
to better understand what is happening. In the sklearn documentation we
would want to explain why we chose what we did, for example that
'threading' is useful for trees, but not for estimators that don't release
the GIL.
…On Mon, May 1, 2017 at 10:41 AM, Gael Varoquaux ***@***.***> wrote:
> It's not immediately clear to me what nogil means here (could mean
> either that my function releases the gil or that I want to compute in
> processes to avoid the gil?).
I agree that the semantics should be made clear in documentation. I also
agree that better wording must be explored.
The question is rather: are these the right axes to explore?
Ping @ogrisel, @aabadie, @lesteve
|
Is there anything I can do to help expedite this issue? If I was to make an example PR showing what I think should be done, should I submit it here or joblib? Would that be a good next step? |
If it requires changes to joblib, best to raise it there.
…On 23 May 2017 7:25 am, "Jim Crist" ***@***.***> wrote:
|
See joblib/joblib#524. |
This is a really neat addition, thanks for the link. I'll be watching it. |
Uses the new prefer / require keywords from joblib/joblib#602. This allows users to control how jobs are parallelized in more situations. For example, training a RandomForest on a cluster of machines with the dask backend. Closes scikit-learn#8804
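The `prefer` / `require` keywords mentioned above are the joblib API that came out of this discussion (joblib >= 0.12). A minimal sketch, assuming joblib is installed: `prefer` is a soft hint that a `parallel_backend` context can override, while `require="sharedmem"` is a hard constraint that forces a shared-memory backend even inside such a context.

```python
from joblib import Parallel, delayed, parallel_backend

# Soft hint: threads are used when no context manager selects a backend.
out = Parallel(n_jobs=2, prefer="threads")(delayed(abs)(-i) for i in range(5))

# Hard constraint: even inside a loky (process-based) context,
# require="sharedmem" falls back to a shared-memory backend.
with parallel_backend("loky"):
    out2 = Parallel(n_jobs=2, require="sharedmem")(
        delayed(abs)(-i) for i in range(5)
    )
```

This is what lets library code protect shared-memory assumptions (the concern raised earlier in the thread) while still leaving other calls overridable.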
Should this be closed? PR (#11166) allows optional use of a different |
I am not convinced: right now specifying the backend with system joblib will work only if scikit-learn is "unvendored". |
We provide sklearn.utils.parallel_backend for this reason. It's not perfect.
|
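For reference, scikit-learn re-exported joblib's context manager as `sklearn.utils.parallel_backend` (since deprecated in favour of using joblib directly). A minimal sketch, assuming scikit-learn and joblib are installed, of redirecting an estimator's internal `Parallel` calls without touching the estimator:

```python
from joblib import parallel_backend  # sklearn.utils re-exported this helper
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 0, 1]

# Internal Parallel calls that don't hardcode a backend pick up "threading".
with parallel_backend("threading", n_jobs=2):
    clf = RandomForestClassifier(n_estimators=4, random_state=0).fit(X, y)
```

Swapping `"threading"` for a distributed backend (e.g. dask's) is the use case that motivated this issue.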
@GaelVaroquaux, @ogrisel, joblib is no longer distributed with scikit-learn, right? Shall this issue be closed now? Thanks! |
I think this issue can be closed now. As noted in the dask-ml docs, |
Some calls to `Parallel` in scikit-learn hardcode what joblib backend to use. It would be nice if these could be overridden to use a different backend using the `parallel_backend` contextmanager. This would allow optionally using the `dask.distributed` backend in more places, which may provide speedups (see comment here). One way to do this would be to check if there's a globally set backend (as set by the context manager) and use that, otherwise use the specified fallback. This might look like:
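(The code snippet from the original issue did not survive extraction. A self-contained toy sketch of the described check — names are illustrative and do not reflect joblib's internals — could be:)

```python
from contextlib import contextmanager

_active_backend = None  # set only while a parallel_backend block is active

@contextmanager
def parallel_backend(name):
    """Toy stand-in for joblib's parallel_backend context manager."""
    global _active_backend
    prev, _active_backend = _active_backend, name
    try:
        yield
    finally:
        _active_backend = prev

def effective_backend(fallback="threading"):
    """Use the globally set backend if any, otherwise the hardcoded fallback."""
    return _active_backend if _active_backend is not None else fallback
```

Inside a `with parallel_backend("dask.distributed"):` block, `effective_backend()` returns the global choice; outside it, the hardcoded fallback wins.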
I'm not sure if this is a fix that should be implemented in scikit-learn or in joblib. Opening this for discussion.