Skip to content

[MRG] MAINT: option to unvendor joblib #11166

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jun 29, 2018

Conversation

GaelVaroquaux
Copy link
Member

Setting the SKLEARN_SITE_JOBLIB to a non null value will now force scikit-learn to use the site joblib rather than the vendored one.

This is going to be very useful to test the progress in joblib: the joblib team is hard at work on enabling better parallelism and scaling, but real-world test cases (by the developers themselves or advanced users) are crucial.

Setting the SKLEARN_SITE_JOBLIB to a non null value will now force
scikit-learn to use the site joblib rather than the vendored one.
@jnothman
Copy link
Member

Test failures still

GaelVaroquaux added a commit to GaelVaroquaux/distributed that referenced this pull request May 31, 2018
Follows scikit-learn/scikit-learn#11166 to deal
with unvendoring joblib in scikit-learn
@@ -1,6 +1,9 @@
This directory contains bundled external dependencies that are updated
every once in a while.

Note to developers and advanced users: setting the SKLEARN_SITE_JOBLIB to
a non null value will force scikit-learn to use the site joblib.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe note that joblib.Memory cache and joblib pickles may be invalidated when switching from one to the other.

@amueller
Copy link
Member

This sounds good to me. It needs mention in the dev docs, though, right?

@GaelVaroquaux
Copy link
Member Author

GaelVaroquaux commented May 31, 2018 via email

@amueller
Copy link
Member

amueller commented Jun 4, 2018

I was 100% sure I replied to this... I think it should be somewhere with set_config which is currently in doc/modules/computational_performance.rst. Maybe we should have a section on configuration in the user guide? There's no explicit section on set_config and get_config right now :-/
I would also add it to the joblib entry in the glossary maybe?

@GaelVaroquaux
Copy link
Member Author

GaelVaroquaux commented Jun 4, 2018 via email

@amueller
Copy link
Member

amueller commented Jun 4, 2018

I think we can merge 6 and 7 but I don't understand your new TOC.
Would it make sense to distinguish RAM saving strategies from speeding up strategies from parallelization strategies?

I'm not sure I understand the Model Reshaping section. Is that feature selection?

There seems to be a bunch of model specific stuff in 7. like "sparsify" which might be better housed in the linear models section maybe?

@amueller
Copy link
Member

amueller commented Jun 4, 2018

Also 6 and 7 are a bit hard to find imho :-/

@GaelVaroquaux
Copy link
Member Author

GaelVaroquaux commented Jun 4, 2018 via email

@amueller
Copy link
Member

amueller commented Jun 4, 2018

Sure, go for it.

@jnothman
Copy link
Member

jnothman commented Jun 5, 2018 via email

@GaelVaroquaux
Copy link
Member Author

GaelVaroquaux commented Jun 5, 2018 via email

@GaelVaroquaux
Copy link
Member Author

I've reorganized the docs and updated the glossary. This is ready for review and merge.

@GaelVaroquaux GaelVaroquaux changed the title MAINT: option to unvendor joblib [MRG] MAINT: option to unvendor joblib Jun 11, 2018

When this environment variable is set to a non zero value,
scikit-learn uses the site joblib rather than its vendored version.
Consequently, joblib must be installed for scikit-learn to run
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't think it's worth mentioning that Memory and dumped content may be invalidated when switching?

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please mention in what's new

@glemaitre glemaitre added this to the 0.20 milestone Jun 28, 2018
@jnothman
Copy link
Member

jnothman commented Jun 28, 2018 via email

@glemaitre
Copy link
Member

Sounds reasonable, but can also be changed (on or off) after release if we want it or don't.

OK let's do that later.

@glemaitre glemaitre merged commit 106bb9e into scikit-learn:master Jun 29, 2018
@glemaitre
Copy link
Member

Thanks

@sklearn-lgtm
Copy link

This pull request introduces 2 alerts when merging d5de52c into f97b515 - view on LGTM.com

new alerts:

  • 2 for Unused import

Comment posted by LGTM.com

@GaelVaroquaux
Copy link
Member Author

GaelVaroquaux commented Jun 29, 2018 via email

@jnothman
Copy link
Member

jnothman commented Jul 2, 2018

This has broken CircleCI on master, which is trying, if I understand correctly, to load pickles which reference sklearn.externals.joblib.numpy_pickle, resulting in an ImportError.

Firstly, I presume to fix Circle, we just need to clear a cache somewhere (but I've not worked out where).

Secondly, are we unnecessarily breaking pickles from previous versions? Should we see if we can find a way to make this function importable when loading a joblib pickle?

@lesteve
Copy link
Member

lesteve commented Jul 2, 2018

Weird, here is the first CircleCI failed build in master. It only seems to be a problem in Python 2.

Firstly, I presume to fix Circle, we just need to clear a cache somewhere (but I've not worked out where).

I can probably find the way to clear the CircleCI cache let me look for this.

Secondly, are we unnecessarily breaking pickles from previous versions? Should we see if we can find a way to make this function importable when loading a joblib pickle?

Breaking pickles is not something we should do lightly. We should definitely double check that and see whether this can be fixed.

@jnothman
Copy link
Member

jnothman commented Jul 2, 2018

Well it's easily fixed if we move sklearn.externals._joblib back to sklearn.externals.joblib, and import internally from ._joblib or something confusing like that.

One hack is to use sys.modules['sklearn.externals.joblib'] = joblib and same for all its sub-modules, when SKLEARN_SITE_JOBLIB is set ...

A more "correct" approach might be to modify sys.path_hooks.

@lesteve
Copy link
Member

lesteve commented Jul 2, 2018

I can probably find the way to clear the CircleCI cache let me look for this.

So it looks like the only way to clear the cache is to change the cache key in circle/config.yml, see this.

I think clearing the cache would just hide the problem though. At the moment on master we can not import as we would do from joblib:

On master (5916d57):

import sklearn.externals.joblib.numpy_pickle

Python 3:

ModuleNotFoundError: No module named 'sklearn.externals.joblib.numpy_pickle'; 'sklearn.externals.joblib' is not a package

Python 2:

ImportError: No module named numpy_pickle

On 526aede (i.e. before this PR was merged) the same snippet runs fine.

@GaelVaroquaux
Copy link
Member Author

I am attempting to fix the problem in #11403. I don't know if the approach will work, but if it does, it's a simple fix.

@lesteve
Copy link
Member

lesteve commented Jul 2, 2018

I am attempting to fix the problem in #11403. I don't know if the approach will work, but if it does, it's a simple fix.

Thanks for this! I can reproduce the backward compatibility problem locally (on Python 2.7) and I don't think #11403 fixes it. When loading the pickle sklearn.externals.joblib.numpy_pickle is imported and this import fails even with your fix:

Here is what the pickle contains for completeness (a joblib pickle of np.array([1., 2., 3.]) using 526aede):

In [3]: open('/tmp/test.joblib', 'rb').read()
Out[3]: '\x80\x02csklearn.externals.joblib.numpy_pickle\nNumpyArrayWrapper\nq\x00)\x81q\x01}q\x02(U\x05dtypeq\x03cnumpy\ndtype\nq\x04U\x02f8q\x05K\x00K\x01\x87q\x06Rq\x07(K\x03U\x01<q\x08NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tq\tbU\x05shapeq\nK\x03\x85q\x0bU\x05orderq\x0cU\x01Cq\rU\x08subclassq\x0ecnumpy\nndarray\nq\x0fU\nallow_mmapq\x10\x88ub\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@.'

@GaelVaroquaux
Copy link
Member Author

GaelVaroquaux commented Jul 2, 2018 via email

@lesteve
Copy link
Member

lesteve commented Jul 2, 2018

Maybe a bit hacky, but what about having only joblib/__init__.py that does this:

import sys
import os

if os.environ.get('SKLEARN_SITE_JOBLIB', False):
    import joblib
    module = sys.modules['joblib']
else:
    import sklearn.externals._joblib
    module = sys.modules['sklearn.externals._joblib']

sys.modules[__name__] = module

I did not test this extensively besides checking that import sklearn.externals.joblib.numpy_pickle works in both cases and looking that sklearn.externals.joblib.numpy_pickle.__file__ makes sense.

@GaelVaroquaux
Copy link
Member Author

GaelVaroquaux commented Jul 2, 2018 via email

@jnothman
Copy link
Member

jnothman commented Jul 2, 2018 via email

@lesteve
Copy link
Member

lesteve commented Jul 6, 2018

I tried a bit more and the sys.modules hack does not seem to solve all the pickling problems ... in particular there seems to be some problems with the compressed pickles that I haven't fully understood.

While I am at it, a less hacky way of doing it may be to use sys.meta_path. It does not seem that easy, but maybe good examples to look at would be python-future and/or six.

@rth
Copy link
Member

rth commented Jul 6, 2018

Yeah, but how likely is it that it would still work with some other packages that use sys.module / sys.meta_path hacks (for instance pyinstaller/pyinstaller#2246 )?

It seems that our only option is to invert _joblib and joblib. It seems
the wrong API, but I don't see another way of doing it without breaking
backward compat.

If the documentations repeats often enough that everything under sklearn.externals is private and should not be used in user code maybe that would be enough?

Another way of looking at this, could be to raise a warning if the bundled joblib (e.g. sklearn.externals.joblib) is imported from outside of sklearn. That would allow keeping just one joblib module and remove the issue with the public/private naming convention, but I'm not not sure if it's technically feasible (might also involve some sys hacks) and even if it is, desirable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants