Skip to content

FIX remove all mention to externals.joblib in the codebase #12345

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 29 commits into from
Nov 19, 2018

Conversation

tomMoral
Copy link
Contributor

This PR intends to fix the failure seen in the CRON job using master branch of joblib. (See here)

It removes all mention to externals.joblib in the codebase and replace it with sklearn.utils.

Note that for testing purposes, it is necessary to add the module joblib in utils to be able to access more advanced functionality of joblib without importing directly the module. I am not sure this is the best level to expose this, so let me know if this should only be access from sklearn.utils._joblib.

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, apart for the comment below and there is a flake8 failure.

Thanks for improving this!

Copy link
Member

@TomDLT TomDLT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
Should this be backported to 0.20.1 ?

@TomDLT TomDLT added this to the 0.20.1 milestone Oct 16, 2018
@rth
Copy link
Member

rth commented Oct 16, 2018

Should this be backported to 0.20.1 ?

+1

rth
rth previously requested changes Oct 16, 2018
@rth
Copy link
Member

rth commented Oct 16, 2018

So my comments above are not exactly on the right lines :) but the general point is that I think we should not expose public attributes sklearn.utils.{joblib,joblib_version}. What we do in sklearn.utils._joblib and import it from there directly internally is fine..

@tomMoral
Copy link
Contributor Author

Ok I tried to remove all direct import of joblib, which is now only available in _joblib.

Note that I also exposed dump and load in sklearn.utils to avoid putting private API import in the doc. Is that okay or should they be renamed?

@rth
Copy link
Member

rth commented Oct 16, 2018

Hmm, right joblib.{dump, load} is also used in examples in which case joblib.dump is still better than dump.

I don't know anymore, other opinions on this would be welcome @jnothman @ogrisel @amueller

It depends on how you think the interaction between scikit-learn and joblib will evolve in the future.

In 0.20.0 we have exposed sklearn.utils.{cpu_count, Parallel, Memory, delayed, hash, parallel_backend, register_parallel_backend}. Now it turns out that this is not enough and some at least dump/load are necessary. The sklearn.utils is namespace is progressively getting invaded by joblib and I'm not convinced it's a good thing. For internal code, I guess importing from sklearn.utils import _joblib could have been enough, but then the question is what to do for examples. One thing I find important is that it should be clear for users that the parallel capability is coming from a separate project called joblib, and in that sense in examples,

from sklearn.externals.joblib import Parallel
# or
from sklearn.utils._joblib import Parallel

is better IMO than

from sklearn.utils import Parallel

so that bugs are reported to the right issue tracker and joblib gets the recognition it deserves.

Maybe we could have exposed just sklearn.utils.joblib instead of the half dozen of current methods, but I guess we can't do much now that 0.20.0 is out.

What happens if say in v0.15 joblib deprecates or renames joblib.hash (or some other method). Will we have to do a deprecation cycle on sklearn.utils.hash, or do we consider those not part of the public API (in which case we might as well do something about it now)?

@rth rth dismissed their stale review October 16, 2018 14:55

No longer sure about the optimal solution.

@amueller
Copy link
Member

the public/private state of utils is still a bit unclear.
I didn't actually realize we exposed sklearn.utils.Parallel otherwise I would have vetoed that :(
I think we should unvendor as soon as possible, and having these names defined is very weird in that case. If we deprecated sklearn.externals.joblib and told people to use sklearn.utils.X then we shouldn't be removing those. But we can backpaddle and deprecate them in 0.20.1

Either way I feel like we need to resolve this for 0.20.1.
Can someone point me to the original discussion for this? What was the reason for moving to utils? (I know the reason for defining _joblib was to optionally unvendor).

@amueller
Copy link
Member

(FYI I also had vetoed vendoring loky inside vendoring joblib very explicitly and that was ignored...)

@tomMoral
Copy link
Contributor Author

For reference:

@amueller
Copy link
Member

hm... I'd still vote deprecate everything and unvendor. That will probably break all pickles in two versions but will remove some major headaches going forward.
It "only" raises the questions which versions of joblib each sklearn should support.

@jnothman
Copy link
Member

The advantage of vendoring (or pinning) is that, unless the user does something very explicit, we know which version of joblib they're using when handling issues around a particular version. I assume @GaelVaroquaux would have an opinion here.

I don't think we need to provide load and dump in utils. If we do, we can prefix them by joblib_. Yes, I agree there are open questions about API stability for public vendored components... but we had already been expecting users to write code that depended on this vendoring of joblib, really.

@rth
Copy link
Member

rth commented Oct 17, 2018

Whether we vendor joblib, and how much of its API we are exposing as a public API of scikit-learn are two somewhat different/unrelated concerns. Let's move this discussion to a separate issue so it doesn't get lost in this PR, I'll try to open one later today with a summary of different points of view.

@amueller
Copy link
Member

@rth I don't think they are entirely separate, unless you would consider providing an interface without vendoring. Which we could do - but it would seem pretty strange to me.
And we as long as we vendor, we need to provide a public API, right?

@amueller
Copy link
Member

I don't think we need to provide load and dump in utils.

Can you elaborate? How else would you fix the problem this PR addresses?

@jnothman
Copy link
Member

Well we never use sklearn.externals.joblib.load in our codebase... so we don't strictly need to support it. But we can.

@tomMoral
Copy link
Contributor Author

It is used in the datasets. I added sklearn.utils.dump and sklearn.utils.load because they are presented in the datasets and imported for instance here.

Also, it is mentioned in the documentation for instance here

@jnothman
Copy link
Member

jnothman commented Oct 18, 2018 via email

@rth
Copy link
Member

rth commented Oct 23, 2018

Tried to write some summary on the joblib integration in #12447

@amueller
Copy link
Member

I guess in the interest of getting 0.20.1 out we should probably table the larger discussion and finish this up in some way for now?

@tomMoral
Copy link
Contributor Author

I am not sure I got the general concensus here.

I can rename dump and load to joblib_* and finish this PR like that, with joblib objects being available in sklearn.utils.

Or I add an extra layer for all joblib-related import such that everything is imported under sklearn.utils.joblib. I think this solution is better as it makes it clear that the tools used are defined in the joblib library and should not be considered as part of the sklearn API.

I can wrap this up tomorrow once given the green-light on the chosen solution.

@amueller
Copy link
Member

I don't have a strong preference between the two. Having a more clear namespace for joblib seems desirable, but that would also require moving all the other functions and objects again, right?

@tomMoral
Copy link
Contributor Author

tomMoral commented Oct 29, 2018 via email

@tomMoral
Copy link
Contributor Author

tomMoral commented Nov 6, 2018

I will move all the exposed functions in sklearn.utils.joblib if nobody has a strong opposition against it.

@ogrisel?

@tomMoral
Copy link
Contributor Author

Done.

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! Thanks for your patience on this one @tomMoral !

LGTM. Will wait for a day or so before merging, in case there are other comments.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deprecation is not reflected in doc/modules/classes.rst

@tomMoral
Copy link
Contributor Author

@jnothman I updated module/classes.rst, let me know if this is the right way to do it.
I also added a more detailed entry in what's new.

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Nov 18, 2018 via email

@jnothman
Copy link
Member

The changes to classes.rst look fine

@amueller
Copy link
Member

@jnothman are you ok with doing it this way? You seemed a bit concerned about the change earlier. It seems the easiest way forward to me.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

- |API| Removed all mentions of ``sklearn.externals.joblib``, and deprecated
joblib methods exposed in ``sklearn.utils``, except for
:func:`utils.parallel_backend` and :func:`utils.register_parallel_backend`,
which impact the behavior of `sklearn`. Other functionalities are part of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"which allow users to configure parallel computation in scikit-learn".

@jnothman jnothman merged commit d25da1b into scikit-learn:master Nov 19, 2018
@jnothman
Copy link
Member

Thanks @tomMoral

@amueller
Copy link
Member

yay!

@jolespin
Copy link

jolespin commented Dec 1, 2018

Recently I changed all my code to use sklearn.externals.joblib instead of joblib alone. What should I use that will be compatible in the future? Would I be able still use the joblib from sklearn or will this be deprecated in future versions? Thanks.

@jnothman
Copy link
Member

jnothman commented Dec 3, 2018 via email

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants