[MRG] MAINT: option to unvendor joblib #11166

GaelVaroquaux · 2018-05-30T18:02:19Z

Setting the SKLEARN_SITE_JOBLIB to a non null value will now force scikit-learn to use the site joblib rather than the vendored one.

This is going to be very useful to test the progress in joblib: the joblib team is hard at work on enabling better parallelism and scaling, but real-world test cases (by the developers themselves or advanced users) are crucial.
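
As a rough sketch of the switch described above (an illustration only, not necessarily the merged implementation), the selection can be expressed as a truthiness check on the environment variable. The helper name `joblib_source` is hypothetical; the check mirrors the `os.environ.get('SKLEARN_SITE_JOBLIB', False)` test posted later in this thread.

```python
import os

ENV_VAR = "SKLEARN_SITE_JOBLIB"


def joblib_source(environ=None):
    """Return "site" or "vendored" for a given environment mapping.

    Hypothetical helper: it only models the decision, it imports nothing.
    """
    environ = os.environ if environ is None else environ
    # Any non-empty string is truthy, so under this check even "0"
    # selects the site joblib.
    return "site" if environ.get(ENV_VAR) else "vendored"


print(joblib_source({}))                 # variable unset
print(joblib_source({ENV_VAR: "1"}))     # variable set
```

Under this assumption an unset or empty variable keeps the vendored copy, and any other value selects the site joblib.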

jnothman · 2018-05-30T23:47:06Z

Test failures still

jnothman · 2018-05-31T01:04:26Z

sklearn/externals/README

@@ -1,6 +1,9 @@
 This directory contains bundled external dependencies that are updated
 every once in a while.

+Note to developers and advanced users: setting the SKLEARN_SITE_JOBLIB to
+a non null value will force scikit-learn to use the site joblib.


Maybe note that joblib.Memory cache and joblib pickles may be invalidated when switching from one to the other.

amueller · 2018-05-31T16:58:31Z

This sounds good to me. It needs mention in the dev docs, though, right?

GaelVaroquaux · 2018-05-31T18:30:27Z

> It needs mention in the dev docs, though, right?

Sure. I was indeed wondering what to do about the documentation. Where would you mention it?

amueller · 2018-06-04T17:51:10Z

I was 100% sure I replied to this... I think it should be somewhere with set_config which is currently in doc/modules/computational_performance.rst. Maybe we should have a section on configuration in the user guide? There's no explicit section on set_config and get_config right now :-/
I would also add it to the joblib entry in the glossary maybe?

GaelVaroquaux · 2018-06-04T21:25:52Z

> set_config which is currently in doc/modules/computational_performance.rst. Maybe we should have a section on configuration in the user guide?

How about I merge sections 6 and 7 into a section "Computing with scikit-learn" and rewrite it as follows:

6. Computing with scikit-learn
   6.1 Strategies to scale computationally: bigger data
       - Scaling with instances using out-of-core learning
   6.2 Computational Performance
       - Prediction Latency
       - Prediction Throughput
       - Tips and Tricks
   6.3 Parallelism, resources, and configuration
       - Configuration switches
       - Parallel and distributed computing (when the new joblib is released)

I would like two +1s on this plan before I do it, as there will be a bit of work. @amueller, @jnothman, @agramfort, @ogrisel?

> I would also add it to the joblib entry in the glossary maybe?

+1

amueller · 2018-06-04T21:41:06Z

I think we can merge 6 and 7 but I don't understand your new TOC.
Would it make sense to distinguish RAM saving strategies from speeding up strategies from parallelization strategies?

I'm not sure I understand the Model Reshaping section. Is that feature selection?

There seems to be a bunch of model specific stuff in 7. like "sparsify" which might be better housed in the linear models section maybe?

amueller · 2018-06-04T21:51:11Z

Also 6 and 7 are a bit hard to find imho :-/

GaelVaroquaux · 2018-06-04T21:55:38Z

Agreed with all these comments, but they are on things that are already in the docs. I won't undertake a full rewrite for now (I should be reviewing grants instead of coding :$). Would you consider the changes that I suggest as an improvement on the existing?

amueller · 2018-06-04T21:56:36Z

Sure, go for it.

jnothman · 2018-06-05T08:43:28Z

I've not looked at the grand plan, but putting it in the config functions will only work before importing modules that import from sklearn.externals.joblib, so I don't see how that's feasible. The config can still be managed through get_config, but setting seems inappropriate. set_config and get_config have an API reference, which I've thought sufficient, as well as disparate user guide references in appropriate usage contexts.

GaelVaroquaux · 2018-06-05T09:12:16Z

> I've not looked at the grand plan, but putting it in the config functions will only work before importing modules that import from sklearn.externals.joblib, so I don't see how that's feasible.

That's why it's coded as an environment variable.
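
To illustrate the point about import time: a module-level switch is evaluated once, when the module is first executed, so changing the setting afterwards has no effect on an already-loaded module. The sketch below uses a throwaway module built with `types.ModuleType`; the module name `fake_externals_joblib` and the `BACKEND` attribute are hypothetical stand-ins, not scikit-learn code.

```python
import os
import types

# Source of a stand-in module that picks its backend at import time.
SOURCE = """
import os
BACKEND = "site" if os.environ.get("SKLEARN_SITE_JOBLIB") else "vendored"
"""


def load_fake_module():
    """Execute SOURCE in a fresh module object, as a first import would."""
    module = types.ModuleType("fake_externals_joblib")
    exec(SOURCE, module.__dict__)
    return module


os.environ.pop("SKLEARN_SITE_JOBLIB", None)
already_loaded = load_fake_module()      # "imported" with the variable unset

os.environ["SKLEARN_SITE_JOBLIB"] = "1"  # too late for already_loaded
fresh = load_fake_module()               # a new process would see this

print(already_loaded.BACKEND, fresh.BACKEND)
os.environ.pop("SKLEARN_SITE_JOBLIB", None)  # leave the environment clean
```

The already-loaded module keeps the choice made when it was executed, which is why a runtime setter cannot take effect once anything has imported from sklearn.externals.joblib.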

GaelVaroquaux · 2018-06-11T08:43:20Z

I've reorganized the docs and updated the glossary. This is ready for review and merge.

jnothman · 2018-06-11T10:20:09Z

doc/modules/computing.rst

+
+    When this environment variable is set to a non zero value,
+    scikit-learn uses the site joblib rather than its vendored version.
+    Consequently, joblib must be installed for scikit-learn to run


You don't think it's worth mentioning that Memory and dumped content may be invalidated when switching?

jnothman

Also, please mention in what's new

jnothman · 2018-06-28T21:45:58Z

> Shall we add an install of joblib-master in the CRON job

Sounds reasonable, but can also be changed (on or off) after release if we want it or don't.

glemaitre · 2018-06-29T14:34:39Z

> Sounds reasonable, but can also be changed (on or off) after release if we want it or don't.

OK let's do that later.

glemaitre · 2018-06-29T14:36:03Z

Thanks

sklearn-lgtm · 2018-06-29T14:59:56Z

This pull request introduces 2 alerts when merging d5de52c into f97b515 - view on LGTM.com

new alerts:

2 for Unused import

Comment posted by LGTM.com

GaelVaroquaux · 2018-06-29T16:12:16Z

Thank you Guillaume!

jnothman · 2018-07-02T11:42:42Z

This has broken CircleCI on master, which is trying, if I understand correctly, to load pickles which reference sklearn.externals.joblib.numpy_pickle, resulting in an ImportError.

Firstly, I presume to fix Circle, we just need to clear a cache somewhere (but I've not worked out where).

Secondly, are we unnecessarily breaking pickles from previous versions? Should we see if we can find a way to make this function importable when loading a joblib pickle?

lesteve · 2018-07-02T11:57:24Z

Weird, here is the first CircleCI failed build in master. It only seems to be a problem in Python 2.

> Firstly, I presume to fix Circle, we just need to clear a cache somewhere (but I've not worked out where).

I can probably find a way to clear the CircleCI cache; let me look into this.

> Secondly, are we unnecessarily breaking pickles from previous versions? Should we see if we can find a way to make this function importable when loading a joblib pickle?

Breaking pickles is not something we should do lightly. We should definitely double check that and see whether this can be fixed.

jnothman · 2018-07-02T12:18:29Z

Well it's easily fixed if we move sklearn.externals._joblib back to sklearn.externals.joblib, and import internally from ._joblib or something confusing like that.

One hack is to use sys.modules['sklearn.externals.joblib'] = joblib and same for all its sub-modules, when SKLEARN_SITE_JOBLIB is set ...

A more "correct" approach might be to modify sys.path_hooks.
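
For illustration, the sys.modules aliasing idea can be sketched with a standard-library package standing in for joblib. The alias name `aliased_json_demo` is hypothetical, and this only shows the mechanism on `json`; it is not a proposed patch for scikit-learn.

```python
import importlib
import json
import json.decoder  # make sure the submodule is loaded before aliasing
import sys

ALIAS = "aliased_json_demo"  # hypothetical alias name

# Register the package and every already-loaded submodule under the alias.
for name, module in list(sys.modules.items()):
    if name == "json" or name.startswith("json."):
        sys.modules[ALIAS + name[len("json"):]] = module

# A dotted import of the alias now resolves from sys.modules directly.
decoder = importlib.import_module(ALIAS + ".decoder")
print(decoder is json.decoder)
```

Only submodules that are already imported get an alias this way, which is why the hack has to cover "all its sub-modules" explicitly.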

lesteve · 2018-07-02T13:03:21Z

> I can probably find a way to clear the CircleCI cache; let me look into this.

So it looks like the only way to clear the cache is to change the cache key in circle/config.yml, see this.

I think clearing the cache would just hide the problem though. At the moment on master we can not import as we would do from joblib:

On master (5916d57):

import sklearn.externals.joblib.numpy_pickle

Python 3:

ModuleNotFoundError: No module named 'sklearn.externals.joblib.numpy_pickle'; 'sklearn.externals.joblib' is not a package

Python 2:

ImportError: No module named numpy_pickle

On 526aede (i.e. before this PR was merged) the same snippet runs fine.

GaelVaroquaux · 2018-07-02T13:36:54Z

I am attempting to fix the problem in #11403. I don't know if the approach will work, but if it does, it's a simple fix.

lesteve · 2018-07-02T14:14:02Z

> I am attempting to fix the problem in #11403. I don't know if the approach will work, but if it does, it's a simple fix.

Thanks for this! I can reproduce the backward compatibility problem locally (on Python 2.7) and I don't think #11403 fixes it. When loading the pickle sklearn.externals.joblib.numpy_pickle is imported and this import fails even with your fix:

Here is what the pickle contains for completeness (a joblib pickle of np.array([1., 2., 3.]) using 526aede):

In [3]: open('/tmp/test.joblib', 'rb').read()
Out[3]: '\x80\x02csklearn.externals.joblib.numpy_pickle\nNumpyArrayWrapper\nq\x00)\x81q\x01}q\x02(U\x05dtypeq\x03cnumpy\ndtype\nq\x04U\x02f8q\x05K\x00K\x01\x87q\x06Rq\x07(K\x03U\x01<q\x08NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tq\tbU\x05shapeq\nK\x03\x85q\x0bU\x05orderq\x0cU\x01Cq\rU\x08subclassq\x0ecnumpy\nndarray\nq\x0fU\nallow_mmapq\x10\x88ub\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@.'
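
To illustrate why the module path matters: a pickle stores the defining module of each class by name, as the `csklearn.externals.joblib.numpy_pickle` prefix in the bytes above shows. Here is a minimal sketch, assuming a hypothetical module name `legacy_pkg_demo` and a stand-in class (not joblib's real NumpyArrayWrapper):

```python
import pickle
import sys
import types

MODULE_NAME = "legacy_pkg_demo"  # hypothetical module name

# Build a throwaway module that holds a stand-in class.
module = types.ModuleType(MODULE_NAME)
Wrapper = type("Wrapper", (object,), {"__module__": MODULE_NAME})
module.Wrapper = Wrapper
sys.modules[MODULE_NAME] = module

instance = Wrapper()
instance.shape = (3,)
payload = pickle.dumps(instance, protocol=2)

# The module path is written into the byte stream by name.
path_is_embedded = MODULE_NAME.encode("ascii") in payload

# Simulate the module being renamed or removed in a later release.
del sys.modules[MODULE_NAME]
try:
    pickle.loads(payload)
    load_error = None
except ImportError as exc:
    load_error = exc

print(path_is_embedded, type(load_error).__name__)
```

Once the recorded module name can no longer be imported, loading fails with an ImportError, which is the same class of failure as reported here.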

GaelVaroquaux · 2018-07-02T14:48:07Z

> When loading the pickle sklearn.externals.joblib.numpy_pickle is imported and this import fails even with your fix:

Indeed, that's correct. "from ... import numpy_pickle" works, but not "import ....numpy_pickle". The latter is needed. :(. It seems that our only option is to invert _joblib and joblib. It seems the wrong API, but I don't see another way of doing it without breaking backward compat.
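
The difference between the two import forms can be reproduced in isolation. In this sketch the names `realpkg_demo` and `shim_demo` are hypothetical stand-ins: a real package with a `numpy_pickle` submodule, and a plain module (not a package) that re-exports it.

```python
import importlib
import os
import sys
import tempfile

root = tempfile.mkdtemp()

# A real package with a submodule.
os.makedirs(os.path.join(root, "realpkg_demo"))
with open(os.path.join(root, "realpkg_demo", "__init__.py"), "w") as f:
    f.write("")
with open(os.path.join(root, "realpkg_demo", "numpy_pickle.py"), "w") as f:
    f.write("VALUE = 42\n")

# A plain module that only re-exports the submodule as an attribute.
with open(os.path.join(root, "shim_demo.py"), "w") as f:
    f.write("from realpkg_demo import numpy_pickle\n")

sys.path.insert(0, root)
importlib.invalidate_caches()

# Works: this is an attribute lookup on the shim_demo module.
from shim_demo import numpy_pickle

# Fails: a dotted import requires shim_demo to be a package.
dotted_error = None
try:
    importlib.import_module("shim_demo.numpy_pickle")
except ImportError as exc:
    dotted_error = exc

print(numpy_pickle.VALUE, type(dotted_error).__name__)
```

Unpickling performs the dotted form of the import, so re-exporting the submodule as an attribute of a plain module is not enough.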

lesteve · 2018-07-02T16:59:35Z

Maybe a bit hacky, but what about having only joblib/__init__.py that does this:

import sys
import os

if os.environ.get('SKLEARN_SITE_JOBLIB', False):
    import joblib
    module = sys.modules['joblib']
else:
    import sklearn.externals._joblib
    module = sys.modules['sklearn.externals._joblib']

sys.modules[__name__] = module

I did not test this extensively beyond checking that import sklearn.externals.joblib.numpy_pickle works in both cases and that sklearn.externals.joblib.numpy_pickle.__file__ makes sense.

GaelVaroquaux · 2018-07-02T17:06:50Z

I really frown on hacking sys.modules. It is a recipe to break other things, in other packages.

jnothman · 2018-07-02T19:54:24Z

@lesteve, does that work?

lesteve · 2018-07-06T14:17:13Z

I tried a bit more and the sys.modules hack does not seem to solve all the pickling problems ... in particular there seem to be some problems with the compressed pickles that I haven't fully understood.

While I am at it, a less hacky way of doing it may be to use sys.meta_path. It does not seem that easy, but maybe good examples to look at would be python-future and/or six.

rth · 2018-07-06T14:39:07Z

Yeah, but how likely is it that it would still work with some other packages that use sys.modules / sys.meta_path hacks (for instance pyinstaller/pyinstaller#2246)?

> It seems that our only option is to invert _joblib and joblib. It seems
> the wrong API, but I don't see another way of doing it without breaking
> backward compat.

If the documentation repeats often enough that everything under sklearn.externals is private and should not be used in user code, maybe that would be enough?

Another way of looking at this could be to raise a warning if the bundled joblib (e.g. sklearn.externals.joblib) is imported from outside of sklearn. That would allow keeping just one joblib module and remove the issue with the public/private naming convention, but I'm not sure if it's technically feasible (it might also involve some sys hacks) or, even if it is, desirable.

MAINT: option to unvendor joblib

b87fafa

GaelVaroquaux added the Enhancement label May 30, 2018

FIX: fix setup.py

beb06a5

GaelVaroquaux added 2 commits May 31, 2018 02:04

FIX: fix unvendoring joblib

5163b41

BUG: fix vendoring joblib

4b00e8b

GaelVaroquaux added a commit to GaelVaroquaux/distributed that referenced this pull request May 31, 2018

Unvendoring joblib

57a7c95

Follows scikit-learn/scikit-learn#11166 to deal with unvendoring joblib in scikit-learn

GaelVaroquaux mentioned this pull request May 31, 2018

Unvendoring joblib dask/distributed#2019

Closed

jnothman reviewed May 31, 2018

GaelVaroquaux mentioned this pull request May 31, 2018

ENH: avoid oversubscription with nested for loops joblib/joblib#690

Open

This was referenced Jun 1, 2018

Make externals private? #11182

Closed

Test too slow: test_mean_shift.py::test_parallel #11146

Closed

GaelVaroquaux added 2 commits June 11, 2018 10:30

DOC: unite docs on configuration and computation

8575972

DOC: update glossary for joblib entry

c4c4b73

GaelVaroquaux changed the title ~~MAINT: option to unvendor joblib~~ [MRG] MAINT: option to unvendor joblib Jun 11, 2018

jnothman approved these changes Jun 11, 2018

jnothman reviewed Jun 11, 2018

DOC: warn about incompatible version and CHANGELOG

551157c

glemaitre added this to the 0.20 milestone Jun 28, 2018

glemaitre added 2 commits June 29, 2018 16:33

Update computing.rst

975ddf4

Update v0.20.rst

d5de52c

glemaitre approved these changes Jun 29, 2018

glemaitre merged commit 106bb9e into scikit-learn:master Jun 29, 2018

GaelVaroquaux mentioned this pull request Jul 2, 2018

FIX: attempt to fix pickling backward compat #11403

Closed

jakirkham mentioned this pull request Jul 2, 2018

Add ability to override joblib backend for scikit-learn estimators #8804

Closed

jnothman mentioned this pull request Jul 3, 2018

Regression for unpickling due to #11166 #11408

Closed

gglanzani mentioned this pull request Jul 3, 2018

Protect against new sklearn API dask/distributed#2091

Closed

rth mentioned this pull request Jul 16, 2018

Add sklearn.show_versions() similar to pandas.show_versions (with numpy blas binding info) #11522

Closed

tomMoral mentioned this pull request Oct 16, 2018

FIX remove all mention to externals.joblib in the codebase #12345

Merged

rth mentioned this pull request Oct 23, 2018

[RFC] Define future dependence on joblib #12447

Closed

trevorstephens mentioned this pull request Dec 6, 2018

sklearn seems to be devendoring joblib trevorstephens/gplearn#114

Closed
