-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] MAINT: option to unvendor joblib #11166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Setting the SKLEARN_SITE_JOBLIB to a non null value will now force scikit-learn to use the site joblib rather than the vendored one.
Test failures still |
Follows scikit-learn/scikit-learn#11166 to deal with unvendoring joblib in scikit-learn
@@ -1,6 +1,9 @@ | |||
This directory contains bundled external dependencies that are updated | |||
every once in a while. | |||
|
|||
Note to developers and advanced users: setting the SKLEARN_SITE_JOBLIB to | |||
a non null value will force scikit-learn to use the site joblib. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe note that joblib.Memory cache and joblib pickles may be invalidated when switching from one to the other.
This sounds good to me. It needs mention in the dev docs, though, right? |
It needs mention in the dev docs, though, right?
Sure. I was indeed wondering what to do about the documentation. Where
would you mention it?
|
I was 100% sure I replied to this... I think it should be somewhere with |
set_config which is currently in doc/modules/computational_performance.rst.
Maybe we should have a section on configuration in the user guide?
How about I merge section 6 and 7 into a section "Computing with
scikit-learn": and rewrite it as following.
6. Computing with scikit-learn
6.1 Strategies to scale computationally: bigger data
- Scaling with instances using out-of-core learning
6.2 Computational Performance
- Prediction Latency
- Prediction Throughput
- Tips and Tricks
6.3 Parallelism, resources, and configuration
- Configuration switches
- Parallel and distributed computing (when the new joblib is released)
I would like two +1s on this plan before I do it, as there will be a bit
of work. @amueller, @jnothman, @agramfort, @ogrisel?
I would also add it to the joblib entry in the glossary maybe?
+1
|
I think we can merge 6 and 7 but I don't understand your new TOC. I'm not sure I understand the Model Reshaping section. Is that feature selection? There seems to be a bunch of model specific stuff in 7. like "sparsify" which might be better housed in the linear models section maybe? |
Also 6 and 7 are a bit hard to find imho :-/ |
Agreed with all these comments, but they are on things that are already
in the docs. I won't undertake a full rewrite for now (I should be
reviewing grants instead of coding :$).
Would you consider the changes that I suggest as an improvement on the
existing?
|
Sure, go for it. |
I've not looked at the grand plan, but putting it in the config functions
will only work before importing modules that import from
sklearn.externals.joblib, so I don't see how that's feasbile. The config
can still be managed through get_config, but setting seems inappropriate.
set_config and get_config have API reference which I've thought sufficient
as well as disparate user guide references in appropriate usage contexts.
|
I've not looked at the grand plan, but putting it in the config functions
will only work before importing modules that import from
sklearn.externals.joblib, so I don't see how that's feasbile.
That's why it's coded why an environment variable.
|
I've reorganized the docs and updated the glossary. This is ready for review and merge. |
doc/modules/computing.rst
Outdated
|
||
When this environment variable is set to a non zero value, | ||
scikit-learn uses the site joblib rather than its vendored version. | ||
Consequently, joblib must be installed for scikit-learn to run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't think it's worth mentioning that Memory and dumped content may be invalidated when switching?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, please mention in what's new
Shall we add an install of joblib-master in the CRON job
Sounds reasonable, but can also be changed (on or off) after release if we
want it or don't.
|
OK let's do that later. |
Thanks |
This pull request introduces 2 alerts when merging d5de52c into f97b515 - view on LGTM.com new alerts:
Comment posted by LGTM.com |
Thank you Guillaume!
|
This has broken CircleCI on master, which is trying, if I understand correctly, to load pickles which reference Firstly, I presume to fix Circle, we just need to clear a cache somewhere (but I've not worked out where). Secondly, are we unnecessarily breaking pickles from previous versions? Should we see if we can find a way to make this function importable when loading a joblib pickle? |
Weird, here is the first CircleCI failed build in master. It only seems to be a problem in Python 2.
I can probably find the way to clear the CircleCI cache let me look for this.
Breaking pickles is not something we should do lightly. We should definitely double check that and see whether this can be fixed. |
Well it's easily fixed if we move sklearn.externals._joblib back to sklearn.externals.joblib, and import internally from ._joblib or something confusing like that. One hack is to use A more "correct" approach might be to modify |
So it looks like the only way to clear the cache is to change the cache key in circle/config.yml, see this. I think clearing the cache would just hide the problem though. At the moment on master we can not import as we would do from joblib: On master (5916d57): import sklearn.externals.joblib.numpy_pickle Python 3:
Python 2:
On 526aede (i.e. before this PR was merged) the same snippet runs fine. |
I am attempting to fix the problem in #11403. I don't know if the approach will work, but if it does, it's a simple fix. |
Thanks for this! I can reproduce the backward compatibility problem locally (on Python 2.7) and I don't think #11403 fixes it. When loading the pickle Here is what the pickle contains for completeness (a joblib pickle of In [3]: open('/tmp/test.joblib', 'rb').read()
Out[3]: '\x80\x02csklearn.externals.joblib.numpy_pickle\nNumpyArrayWrapper\nq\x00)\x81q\x01}q\x02(U\x05dtypeq\x03cnumpy\ndtype\nq\x04U\x02f8q\x05K\x00K\x01\x87q\x06Rq\x07(K\x03U\x01<q\x08NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tq\tbU\x05shapeq\nK\x03\x85q\x0bU\x05orderq\x0cU\x01Cq\rU\x08subclassq\x0ecnumpy\nndarray\nq\x0fU\nallow_mmapq\x10\x88ub\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@.' |
When loading the pickle sklearn.externals.joblib.numpy_pickle is
imported and this import fails even with your fix:
Indeed, that's correct. "from ... import numpy_picke" works, but not
"import ....numpy_pickle". The latter is needed.
:(.
It seems that our only option is to invert _joblib and joblib. It seems
the wrong API, but I don't see another way of doing it without breaking
backward compat.
|
Maybe a bit hacky, but what about having only import sys
import os
if os.environ.get('SKLEARN_SITE_JOBLIB', False):
import joblib
module = sys.modules['joblib']
else:
import sklearn.externals._joblib
module = sys.modules['sklearn.externals._joblib']
sys.modules[__name__] = module I did not test this extensively besides checking that |
I really frown on hacking sys.modules. It is a recipe to break other
things, in other packages.
|
@lesteve, does that work?
|
I tried a bit more and the While I am at it, a less hacky way of doing it may be to use |
Yeah, but how likely is it that it would still work with some other packages that use
If the documentations repeats often enough that everything under Another way of looking at this, could be to raise a warning if the bundled joblib (e.g. |
Setting the SKLEARN_SITE_JOBLIB to a non null value will now force scikit-learn to use the site joblib rather than the vendored one.
This is going to be very useful to test the progress in joblib: the joblib team is hard at work on enabling better parallelism and scaling, but real-world test cases (by the developers themselves or advanced users) are crucial.