Unvendor joblib #13531
Conversation
obj = _unpickle(fobj, filename, mmap_mode)
return obj
from joblib.numpy_pickle import *
This was added to allow loading pickles generated with older versions of scikit-learn, used e.g. in fetch_20newsgroups_vectorized.
#13527 is merged and we think it has fixed the CircleCI issues.
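For background, pickle stores the fully qualified module path of every class it serializes, so files written by the vendored copy reference `sklearn.externals.joblib.*` paths. A minimal sketch of why keeping an alias module around preserves old pickles — the module and class names here are made up for illustration:

```python
import pickle
import sys
import types

class Wrapper:
    """Stand-in for a class that joblib pickles by reference."""
    def __init__(self, data):
        self.data = data

# Pretend the class lived in a vendored module when the pickle was written.
old = types.ModuleType("vendored_joblib_demo")
old.Wrapper = Wrapper
Wrapper.__module__ = "vendored_joblib_demo"
sys.modules["vendored_joblib_demo"] = old

payload = pickle.dumps(Wrapper([1, 2, 3]))  # records "vendored_joblib_demo.Wrapper"

# After un-vendoring, the old module path no longer exists:
del sys.modules["vendored_joblib_demo"]
try:
    pickle.loads(payload)
    failed = False
except ModuleNotFoundError:
    failed = True

# A thin shim module re-exporting the names (the role played here by
# `from joblib.numpy_pickle import *` in sklearn/externals/joblib)
# makes the old pickle loadable again.
shim = types.ModuleType("vendored_joblib_demo")
shim.Wrapper = Wrapper
sys.modules["vendored_joblib_demo"] = shim
obj = pickle.loads(payload)
```

The shim only has to make the old dotted path resolve to the same classes; no data conversion is involved.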
'register_parallel_backend', 'parallel_backend',
'register_store_backend', 'register_compressor',
'wrap_non_picklable_objects']
from joblib import *
Also added so that unpickling of existing models doesn't raise errors about non existing paths.
I presume you've tested unpickling something with a joblib numpy array in this branch?
Yes, I checked that we can unpickle a numpy array and the PCA estimator serialized with joblib in scikit-learn 0.20.2.
Also CircleCI caches pickled datasets used in examples, and I think we could have seen failures in examples if it didn't work.
Without this change I saw failures locally in tests that used the cached data for fetch_20newsgroups_vectorized.
You won't see CircleCI examples fail unless you force a full build.
Right, forgot about that. Re-triggered the full CircleCI build -- still no errors.
We can keep a small sklearn estimator serialized with joblib in scikit-learn 0.20.2 in a test directory, and include a test that unpickles it to make sure it doesn't error.
@thomasjpfan There is a test for unpickling numpy arrays in https://github.com/scikit-learn/scikit-learn/pull/13531/files#r269922098. I'm not convinced that unpickling an actual scikit-learn estimator as part of the test suite is worth it: if it's simple, we would basically be testing the Python pickle module. If it's more complex and interacts with lots of parts of scikit-learn, it is bound to break as our estimators evolve (and we won't be able to unpickle it).
I have tested that unpickling works outside of the test suite as well, as mentioned above.
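A rough sketch of what the proposed regression test could look like — `TinyEstimator` and the fixture file name are placeholders, and a real version would commit a file actually serialized by scikit-learn 0.20.2 instead of generating one on the fly:

```python
import pickle
import tempfile
from pathlib import Path

class TinyEstimator:
    """Hypothetical stand-in for a small fitted estimator; the real test
    would load a file committed to the repo that was written by
    scikit-learn 0.20.2 with the vendored joblib."""
    def __init__(self, coef):
        self.coef_ = coef

def check_unpickle(path):
    """Load a previously stored estimator and sanity-check an attribute,
    making sure the load itself does not raise."""
    with open(path, "rb") as f:
        est = pickle.load(f)
    assert hasattr(est, "coef_")
    return est

with tempfile.TemporaryDirectory() as d:
    fixture = Path(d) / "estimator-0.20.2.pkl"
    fixture.write_bytes(pickle.dumps(TinyEstimator([0.5, -1.5])))
    est = check_unpickle(fixture)
```

The value of such a test is catching regressions in the alias module, at the cost of the maintenance concerns raised above.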
This is looking pretty good. I hope it doesn't upset installations too much.
from joblib import effective_n_jobs
from joblib import hash
from joblib import cpu_count, Parallel, Memory, delayed
from joblib import parallel_backend, register_parallel_backend
We still keep the sklearn.utils._joblib module I suppose.
Maybe @GaelVaroquaux @ogrisel @tomMoral can weigh in: is joblib >= 0.11 still raising DeprecationWarnings on Python 3?
If not, we could get rid of this file too.
Also @rth, we can remove the comment at the top of the file mentioning local / site joblib.
@NicolasHug I agree we should probably get rid of sklearn.utils._joblib, but that would involve updating a large number of imports:
sklearn/multioutput.py
28:from .utils._joblib import Parallel, delayed
sklearn/pipeline.py
19:from .utils._joblib import Parallel, delayed
sklearn/multiclass.py
56:from .utils._joblib import Parallel
57:from .utils._joblib import delayed
sklearn/decomposition/dict_learning.py
16:from ..utils._joblib import Parallel, delayed, effective_n_jobs
sklearn/decomposition/online_lda.py
23:from ..utils._joblib import Parallel, delayed, effective_n_jobs
sklearn/datasets/rcv1.py
24:from ..utils import _joblib
sklearn/datasets/covtype.py
29:from ..utils import _joblib
sklearn/datasets/species_distributions.py
53:from sklearn.utils import _joblib
sklearn/datasets/twenty_newsgroups.py
47:from ..utils import _joblib
sklearn/datasets/olivetti_faces.py
26:from ..utils import _joblib
sklearn/datasets/kddcup99.py
24:from ..utils import _joblib
sklearn/datasets/california_housing.py
36:from ..utils import _joblib
sklearn/datasets/lfw.py
22:from ..utils._joblib import Memory
23:from ..utils import _joblib
sklearn/ensemble/partial_dependence.py
13:from ..utils._joblib import Parallel, delayed
sklearn/ensemble/voting_classifier.py
20:from ..utils._joblib import Parallel, delayed
sklearn/ensemble/iforest.py
18:from ..utils.fixes import _joblib_parallel_args
sklearn/ensemble/forest.py
52:from ..utils._joblib import Parallel, delayed
61:from ..utils.fixes import parallel_helper, _joblib_parallel_args
sklearn/ensemble/base.py
15:from ..utils._joblib import effective_n_jobs
sklearn/ensemble/bagging.py
15:from ..utils._joblib import Parallel, delayed
sklearn/linear_model/least_angle.py
24:from ..utils._joblib import Parallel, delayed
sklearn/linear_model/theil_sen.py
23:from ..utils._joblib import Parallel, delayed, effective_n_jobs
sklearn/linear_model/stochastic_gradient.py
12:from ..utils._joblib import Parallel, delayed
35:from ..utils.fixes import _joblib_parallel_args
sklearn/linear_model/coordinate_descent.py
21:from ..utils._joblib import Parallel, delayed, effective_n_jobs
23:from ..utils.fixes import _joblib_parallel_args
sklearn/linear_model/base.py
26:from ..utils._joblib import Parallel, delayed
sklearn/linear_model/omp.py
19:from ..utils._joblib import Parallel, delayed
sklearn/linear_model/logistic.py
36:from ..utils._joblib import Parallel, delayed, effective_n_jobs
37:from ..utils.fixes import _joblib_parallel_args
sklearn/neighbors/base.py
27:from ..utils._joblib import Parallel, delayed, effective_n_jobs
28:from ..utils._joblib import __version__ as joblib_version
sklearn/model_selection/_validation.py
25:from ..utils._joblib import Parallel, delayed
26:from ..utils._joblib import logger
sklearn/model_selection/_search.py
31:from ..utils._joblib import Parallel, delayed
sklearn/feature_selection/rfe.py
18:from ..utils._joblib import Parallel, delayed, effective_n_jobs
sklearn/metrics/pairwise.py
28:from ..utils._joblib import Parallel
29:from ..utils._joblib import delayed
30:from ..utils._joblib import effective_n_jobs
sklearn/utils/validation.py
26:from ._joblib import Memory
27:from ._joblib import __version__ as joblib_version
sklearn/utils/fixes.py
221: from . import _joblib
sklearn/utils/__init__.py
15:from . import _joblib
sklearn/utils/testing.py
52:from sklearn.utils._joblib import joblib
sklearn/utils/estimator_checks.py
15:from sklearn.utils import _joblib
sklearn/tests/test_multioutput.py
20:from sklearn.utils._joblib import cpu_count
sklearn/tests/test_site_joblib.py
3:from sklearn.utils._joblib import Parallel, delayed, Memory, parallel_backend
sklearn/covariance/graph_lasso_.py
26:from ..utils._joblib import Parallel, delayed
sklearn/cluster/mean_shift_.py
26:from ..utils._joblib import Parallel
27:from ..utils._joblib import delayed
sklearn/tests/test_pipeline.py
34:from sklearn.utils._joblib import Memory
35:from sklearn.utils._joblib import __version__ as joblib_version
sklearn/cluster/k_means_.py
31:from ..utils._joblib import Parallel
32:from ..utils._joblib import delayed
33:from ..utils._joblib import effective_n_jobs
sklearn/compose/_column_transformer.py
17:from ..utils._joblib import Parallel, delayed
sklearn/manifold/mds.py
15:from ..utils._joblib import Parallel
16:from ..utils._joblib import delayed
17:from ..utils._joblib import effective_n_jobs
sklearn/ensemble/tests/test_bagging.py
36:from sklearn.utils import _joblib
sklearn/ensemble/tests/test_forest.py
25:from sklearn.utils._joblib import joblib
26:from sklearn.utils._joblib import parallel_backend
27:from sklearn.utils._joblib import register_parallel_backend
28:from sklearn.utils._joblib import __version__ as __joblib_version__
sklearn/linear_model/tests/test_sgd.py
30:from sklearn.utils import _joblib
31:from sklearn.utils._joblib import parallel_backend
sklearn/neighbors/tests/test_neighbors.py
29:from sklearn.utils._joblib import joblib
30:from sklearn.utils._joblib import parallel_backend
sklearn/neighbors/tests/test_kde.py
13:from sklearn.utils import _joblib
sklearn/metrics/tests/test_score_objects.py
40:from sklearn.utils import _joblib
sklearn/utils/tests/test_utils.py
306: from sklearn.utils._joblib import Parallel, Memory, delayed
307: from sklearn.utils._joblib import cpu_count, hash, effective_n_jobs
308: from sklearn.utils._joblib import parallel_backend
309: from sklearn.utils._joblib import register_parallel_backend
319: from sklearn.utils._joblib import joblib
sklearn/utils/tests/test_estimator_checks.py
11:from sklearn.utils import _joblib
and I would rather do it in a separate PR, as this diff is large enough. It's a bit orthogonal to unvendoring anyway, and rather about how we alias or don't alias joblib imports. The point of this PR is to remove sklearn.externals.joblib with minimal changes elsewhere.
Removed the no longer relevant warning.
I agree that the import should be changed in a separate PR to keep it readable.
@@ -26,24 +16,4 @@ def test_old_pickle(tmpdir):
b'\x0fU\nallow_mmapq\x10\x88ub\x01\x00\x00\x00\x00\x00\x00\x00.',
mode='wb')
Essentially this becomes a test of the joblib dependency. It's still useful, but somewhat non-standard.
CI-wise this went rather smoothly indeed. For new scikit-learn installs (or when upgrading from versions with vendored joblib to non-vendored ones) this should be transparent to the user: the latest joblib will be installed as a dependency automatically by pip or conda. I checked that for pip in a clean environment. For subsequent updates, we need to make sure users have an incentive (and are aware of the necessity) to upgrade joblib at the same time as they upgrade scikit-learn. Conda will handle this automatically, but pip won't.
This is ready to be reviewed. cc @ogrisel @lesteve @tomMoral
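For reference, the mechanism that makes fresh installs transparent is declaring joblib among the package's install requirements. A sketch of the relevant packaging metadata — the exact version pins below are illustrative, not necessarily the ones used by this PR:

```python
# Illustrative packaging metadata: once joblib is un-vendored, it must be a
# declared runtime dependency so that `pip install scikit-learn` resolves
# and installs it automatically. Version pins here are assumptions.
install_requires = [
    "numpy>=1.11.0",
    "scipy>=0.17.0",
    "joblib>=0.11",
]

# Note on upgrades: pip's default "only-if-needed" strategy leaves an
# already-satisfying joblib untouched when scikit-learn is upgraded,
# whereas conda solves the whole environment jointly -- which is the
# upgrade caveat discussed above.
assert any(req.startswith("joblib") for req in install_requires)
```

This is a sketch of a `setup.py` `install_requires` list rather than a copy of the project's actual setup configuration.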
I forgot to submit my comments, and they are now probably outdated.
'register_parallel_backend', 'parallel_backend',
'register_store_backend', 'register_compressor',
'wrap_non_picklable_objects']
from joblib import *
I presume you've tested unpickling something with a joblib numpy array in this branch?
sklearn/externals/joblib/__init__.py
Outdated
@@ -1,133 +1,3 @@
"""Joblib is a set of tools to provide **lightweight pipelining in
Python**. In particular:
# Import necessary to preserve backward compatibility of pickles
Is this a short-term or long-term solution? At what point should we deprecate?
As far as I understand, this is necessary to allow loading pickles created with the vendored joblib. Currently there is some number of those, but after this is merged and a few releases pass, their relative share should decrease. So I would say we can maybe deprecate it in 0.22 or 0.23.
Though another concern is making sure people are not using this module as an alias for joblib. So maybe we should raise a warning here.
Will add a comment or warning once we reach an agreement about how to handle this.
I may be missing the bigger picture, but I'd be in favor of raising a deprecation warning here since this is our only way to discourage users from using sklearn.externals.joblib.
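A sketch of how such a warning could be emitted when sklearn.externals.joblib is imported. The message wording and version numbers are assumptions, and the `from joblib import *` re-export that would follow in the real module is elided so the snippet has no third-party dependency:

```python
import warnings

def _deprecated_alias_import():
    """Mimics what executing sklearn/externals/joblib/__init__.py would do:
    emit a DeprecationWarning, then fall through to `from joblib import *`
    (elided in this sketch). Message text and versions are assumptions."""
    warnings.warn(
        "sklearn.externals.joblib is deprecated; import joblib directly",
        DeprecationWarning,
        stacklevel=2,
    )

# Demonstrate that importing through the alias would trigger the warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    _deprecated_alias_import()
```

Since a module body runs only once per process, the warning would fire on the first import of the alias and stay silent afterwards.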
Do we want to consider this a blocker for 0.21 (which should be released, like, yesterday)?
I also think it would be good to have this in v0.21
I suppose this also closes #12263 and perhaps other loky-specific issues?
Some comments / questions
from joblib import effective_n_jobs
from joblib import hash
from joblib import cpu_count, Parallel, Memory, delayed
from joblib import parallel_backend, register_parallel_backend
Maybe @GaelVaroquaux @ogrisel @tomMoral can weigh in: is joblib >= 0.11 still raising DeprecationWarnings on Python 3?
If not, we could get rid of this file too.
Also @rth, we can remove the comment at the top of the file mentioning local / site joblib.
@@ -5,9 +5,5 @@ def configuration(parent_package='', top_path=None):
from numpy.distutils.misc_util import Configuration
config = Configuration('externals', parent_package, top_path)
config.add_subpackage('joblib')
is this still needed?
Thanks for the review @NicolasHug!
I tried to remove it, but then unpickling of the old pickles that we have in one test started to fail. So I suppose we still need it. Reverted that change.
sklearn/externals/joblib/__init__.py
Outdated
@@ -1,133 +1,3 @@
"""Joblib is a set of tools to provide **lightweight pipelining in
Python**. In particular:
# Import necessary to preserve backward compatibility of pickles
I may be missing the bigger picture, but I'd be in favor of raising a deprecation warning here since this is our only way to discourage users from using sklearn.externals.joblib.
'register_parallel_backend', 'parallel_backend',
'register_store_backend', 'register_compressor',
'wrap_non_picklable_objects']
if not hasattr(sys, "_is_pytest_session"):
Does that mean the tests still use externals.joblib somehow?
Yes, we test that old pickles can be restored
Also, more generally, pytest will import all Python files during test collection, and this guard is necessary so that we don't fail at collection time due to our error-on-DeprecationWarning policy.
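The guard pattern under discussion, wrapped in a function here purely for illustration — in the real module the check sits at module level, and `sys._is_pytest_session` is an attribute set by scikit-learn's conftest.py:

```python
import sys
import warnings

def warn_unless_pytest():
    """Sketch of the module-level guard in sklearn/externals/joblib:
    only emit the deprecation warning outside a pytest session, so that
    test collection (which imports every module) doesn't trip the
    error-on-DeprecationWarning test policy. Returns True if it warned."""
    if not hasattr(sys, "_is_pytest_session"):
        warnings.warn(
            "sklearn.externals.joblib is deprecated; use joblib instead",
            DeprecationWarning,
        )
        return True
    return False

# Outside pytest the warning fires (suppressed here to keep output clean)...
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    warn_unless_pytest()

# ...while inside a pytest session the conftest-set flag keeps it silent.
sys._is_pytest_session = True  # simulate what conftest.py would do
silent = not warn_unless_pytest()
del sys._is_pytest_session
```

An alternative would be filtering the warning in the pytest configuration instead of special-casing the module, but the flag keeps the behavior local to the alias.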
oh right, thanks
LGTM! The current state seems to be doing the unvendoring with the minimal possible changes so it makes sense.
from joblib import effective_n_jobs
from joblib import hash
from joblib import cpu_count, Parallel, Memory, delayed
from joblib import parallel_backend, register_parallel_backend
I agree that the import should be changed in a separate PR to keep it readable.
Thanks @tomMoral!
Merging, let's give it a try, thanks Roman!
This reverts commit 785bd53.
`scikit-learn` (<0.21) does not explicitly declare joblib as an external dependency (scikit-learn/scikit-learn#13531).
Closes #12447
This unvendors joblib. A major concern is backward compatibility of pickles.
Todo:
`pip install scikit-learn` in a clean environment automatically installs `joblib`