Unvendor joblib #13531

Merged: 19 commits, Apr 17, 2019

Conversation

rth
Member

@rth rth commented Mar 27, 2019

Closes #12447

This unvendors joblib. A major concern is backward compatibility of pickles.

Todo:

  • update documentation
  • check that pip install scikit-learn in a clean environment automatically installs joblib

obj = _unpickle(fobj, filename, mmap_mode)

return obj
from joblib.numpy_pickle import *
Member Author

This was added to allow loading pickles generated with older versions of scikit-learn, used e.g. in fetch_20newsgroups_vectorized.
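The mechanism can be illustrated with a stdlib-only sketch (module and class names here are made up): pickle stores objects by import path, so pickles written under the old vendored path only load if a module registered at that path still re-exports the same names, which is what the star import above provides.

```python
import pickle
import sys
import types

# Toy reproduction of the problem the compatibility import solves:
# a class is "defined" at the old path, pickled, and the old path then
# disappears, as it does when joblib is unvendored.
oldlib = types.ModuleType("oldlib")

class Payload:
    def __init__(self, value):
        self.value = value

Payload.__module__ = "oldlib"   # pretend it was defined at the old path
oldlib.Payload = Payload
sys.modules["oldlib"] = oldlib
data = pickle.dumps(Payload(42))

# Simulate a new process where the old module path no longer exists.
del sys.modules["oldlib"]
try:
    pickle.loads(data)
except ModuleNotFoundError:
    print("old path gone -> unpickling fails")

# The shim: a module registered at the old path re-exporting the names
# (the effect of `from joblib.numpy_pickle import *` in this file).
shim = types.ModuleType("oldlib")
shim.Payload = Payload
sys.modules["oldlib"] = shim
print(pickle.loads(data).value)  # 42
```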

@rth
Member Author

rth commented Mar 27, 2019

So it looks like at least Azure CI is fine. CircleCI fails due to #13525, so that would need to be merged first.

I'm not sure what else I should check to make sure we are keeping backward compatibility of pickles. cc @tomMoral

@adrinjalali
Member

#13527 is merged and we think it has fixed the CircleCI issues.

'register_parallel_backend', 'parallel_backend',
'register_store_backend', 'register_compressor',
'wrap_non_picklable_objects']
from joblib import *
Member Author

Also added so that unpickling of existing models doesn't raise errors about non existing paths.

Member

I presume you've tested unpickling something with a joblib numpy array in this branch?

Member Author

Yes, I checked that we can unpickle a numpy array and a PCA estimator serialized with joblib in scikit-learn 0.20.2.

Also CircleCI caches pickled datasets used in examples, and I think we could have seen failures in examples if it didn't work.

Without this change I saw failures locally in tests that used the cached data for fetch_20newsgroups_vectorized
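A quick local check along those lines might look as follows (a sketch, assuming joblib and numpy are installed; the original verification also round-tripped a fitted PCA estimator from 0.20.2, which is not reproduced here):

```python
import tempfile

import numpy as np
import joblib

# Dump a numpy array with joblib and load it back; with unvendored
# joblib this exercises the same numpy_pickle code path that old
# scikit-learn pickles rely on.
with tempfile.NamedTemporaryFile(suffix=".joblib") as f:
    arr = np.arange(6).reshape(2, 3)
    joblib.dump(arr, f.name)
    loaded = joblib.load(f.name)

print(np.array_equal(arr, loaded))
```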

Member

You won't see CircleCI examples fail unless you force a full build

Member Author

Right, forgot about that. Re-triggered the full CircleCI build -- no errors still.

Member

We can keep a small sklearn estimator serialized with joblib in scikit-learn 0.20.2 in a test directory, and include a test that unpickles it to make sure it doesn't error.

Member Author

@thomasjpfan There is a test for unpickling numpy arrays in https://github.com/scikit-learn/scikit-learn/pull/13531/files#r269922098. I'm not convinced that unpickling an actual scikit-learn estimator as part of the test suite is worth it: if it's simple, we would basically be testing the Python cPickle module. If it's more complex and interacts with lots of parts of scikit-learn, it is bound to break as our estimators evolve (and we won't be able to unpickle it).

I have tested that unpickling works outside of the test suite as well as mentioned above.

@jnothman
Member

This is looking pretty good. I hope it doesn't upset installations too much.

@rth rth changed the title WIP: Unvendor joblib Unvendor joblib Mar 28, 2019
from joblib import effective_n_jobs
from joblib import hash
from joblib import cpu_count, Parallel, Memory, delayed
from joblib import parallel_backend, register_parallel_backend
Member Author

We still keep the sklearn.utils._joblib module I suppose.

Member

Maybe @GaelVaroquaux @ogrisel @tomMoral can weigh in: is joblib >= 0.11 still raising DeprecationWarnings on Python 3?

If not we could get rid of this file too.

Also @rth we can remove the comment at the top of the file mentioning local / site joblib

Member Author

@NicolasHug I agree we should probably get rid of sklearn.utils._joblib but that would involve updating a large number of imports

sklearn/multioutput.py
28:from .utils._joblib import Parallel, delayed

sklearn/pipeline.py
19:from .utils._joblib import Parallel, delayed

sklearn/multiclass.py
56:from .utils._joblib import Parallel
57:from .utils._joblib import delayed

sklearn/decomposition/dict_learning.py
16:from ..utils._joblib import Parallel, delayed, effective_n_jobs

sklearn/decomposition/online_lda.py
23:from ..utils._joblib import Parallel, delayed, effective_n_jobs

sklearn/datasets/rcv1.py
24:from ..utils import _joblib

sklearn/datasets/covtype.py
29:from ..utils import _joblib

sklearn/datasets/species_distributions.py
53:from sklearn.utils import _joblib

sklearn/datasets/twenty_newsgroups.py
47:from ..utils import _joblib

sklearn/datasets/olivetti_faces.py
26:from ..utils import _joblib

sklearn/datasets/kddcup99.py
24:from ..utils import _joblib

sklearn/datasets/california_housing.py
36:from ..utils import _joblib

sklearn/datasets/lfw.py
22:from ..utils._joblib import Memory
23:from ..utils import _joblib

sklearn/ensemble/partial_dependence.py
13:from ..utils._joblib import Parallel, delayed

sklearn/ensemble/voting_classifier.py
20:from ..utils._joblib import Parallel, delayed

sklearn/ensemble/iforest.py
18:from ..utils.fixes import _joblib_parallel_args

sklearn/ensemble/forest.py
52:from ..utils._joblib import Parallel, delayed
61:from ..utils.fixes import parallel_helper, _joblib_parallel_args

sklearn/ensemble/base.py
15:from ..utils._joblib import effective_n_jobs

sklearn/ensemble/bagging.py
15:from ..utils._joblib import Parallel, delayed

sklearn/linear_model/least_angle.py
24:from ..utils._joblib import Parallel, delayed

sklearn/linear_model/theil_sen.py
23:from ..utils._joblib import Parallel, delayed, effective_n_jobs

sklearn/linear_model/stochastic_gradient.py
12:from ..utils._joblib import Parallel, delayed
35:from ..utils.fixes import _joblib_parallel_args

sklearn/linear_model/coordinate_descent.py
21:from ..utils._joblib import Parallel, delayed, effective_n_jobs
23:from ..utils.fixes import _joblib_parallel_args

sklearn/linear_model/base.py
26:from ..utils._joblib import Parallel, delayed

sklearn/linear_model/omp.py
19:from ..utils._joblib import Parallel, delayed

sklearn/linear_model/logistic.py
36:from ..utils._joblib import Parallel, delayed, effective_n_jobs
37:from ..utils.fixes import _joblib_parallel_args

sklearn/neighbors/base.py
27:from ..utils._joblib import Parallel, delayed, effective_n_jobs
28:from ..utils._joblib import __version__ as joblib_version

sklearn/model_selection/_validation.py
25:from ..utils._joblib import Parallel, delayed
26:from ..utils._joblib import logger

sklearn/model_selection/_search.py
31:from ..utils._joblib import Parallel, delayed

sklearn/feature_selection/rfe.py
18:from ..utils._joblib import Parallel, delayed, effective_n_jobs

sklearn/metrics/pairwise.py
28:from ..utils._joblib import Parallel
29:from ..utils._joblib import delayed
30:from ..utils._joblib import effective_n_jobs

sklearn/utils/validation.py
26:from ._joblib import Memory
27:from ._joblib import __version__ as joblib_version

sklearn/utils/fixes.py
221:    from . import _joblib

sklearn/utils/__init__.py
15:from . import _joblib

sklearn/utils/testing.py
52:from sklearn.utils._joblib import joblib

sklearn/utils/estimator_checks.py
15:from sklearn.utils import _joblib

sklearn/tests/test_multioutput.py
20:from sklearn.utils._joblib import cpu_count

sklearn/tests/test_site_joblib.py
3:from sklearn.utils._joblib import Parallel, delayed, Memory, parallel_backend

sklearn/covariance/graph_lasso_.py
26:from ..utils._joblib import Parallel, delayed

sklearn/cluster/mean_shift_.py
26:from ..utils._joblib import Parallel
27:from ..utils._joblib import delayed

sklearn/tests/test_pipeline.py
34:from sklearn.utils._joblib import Memory
35:from sklearn.utils._joblib import __version__ as joblib_version

sklearn/cluster/k_means_.py
31:from ..utils._joblib import Parallel
32:from ..utils._joblib import delayed
33:from ..utils._joblib import effective_n_jobs

sklearn/compose/_column_transformer.py
17:from ..utils._joblib import Parallel, delayed

sklearn/manifold/mds.py
15:from ..utils._joblib import Parallel
16:from ..utils._joblib import delayed
17:from ..utils._joblib import effective_n_jobs

sklearn/ensemble/tests/test_bagging.py
36:from sklearn.utils import _joblib

sklearn/ensemble/tests/test_forest.py
25:from sklearn.utils._joblib import joblib
26:from sklearn.utils._joblib import parallel_backend
27:from sklearn.utils._joblib import register_parallel_backend
28:from sklearn.utils._joblib import __version__ as __joblib_version__

sklearn/linear_model/tests/test_sgd.py
30:from sklearn.utils import _joblib
31:from sklearn.utils._joblib import parallel_backend

sklearn/neighbors/tests/test_neighbors.py
29:from sklearn.utils._joblib import joblib
30:from sklearn.utils._joblib import parallel_backend

sklearn/neighbors/tests/test_kde.py
13:from sklearn.utils import _joblib

sklearn/metrics/tests/test_score_objects.py
40:from sklearn.utils import _joblib

sklearn/utils/tests/test_utils.py
306:    from sklearn.utils._joblib import Parallel, Memory, delayed
307:    from sklearn.utils._joblib import cpu_count, hash, effective_n_jobs
308:    from sklearn.utils._joblib import parallel_backend
309:    from sklearn.utils._joblib import register_parallel_backend
319:    from sklearn.utils._joblib import joblib

sklearn/utils/tests/test_estimator_checks.py
11:from sklearn.utils import _joblib

and I would rather do it in a separate PR, as this diff is large enough. It's a bit orthogonal to unvendoring anyway, and more about how we alias (or don't alias) joblib imports. The point of this PR is to remove sklearn.externals.joblib with minimal changes elsewhere.

Removed the no longer relevant warning.

Contributor

I agree that the import should be changed in a separate PR to keep it readable.

@@ -26,24 +16,4 @@ def test_old_pickle(tmpdir):
b'\x0fU\nallow_mmapq\x10\x88ub\x01\x00\x00\x00\x00\x00\x00\x00.',
mode='wb')
Member Author

@rth rth Mar 28, 2019

Essentially this becomes a test of the joblib dependency. It's still useful, but somewhat non standard.

@rth
Member Author

rth commented Mar 28, 2019

This is looking pretty good. I hope it doesn't upset installations too much.

CI wise this went rather smoothly indeed.

For new scikit-learn installs (or when upgrading from versions with vendored joblib to non-vendored ones) this should be transparent to the user: latest joblib will be installed as a dependency automatically with pip or conda. Checked that for pip in a clean environment.

For subsequent updates, we need to make sure users have an incentive (and are aware of the necessity) to upgrade joblib at the same time as they upgrade scikit-learn. Conda will handle this automatically, but pip won't with just pip install -U scikit-learn, given that --upgrade-strategy defaults to only-if-needed (instead of eager). Though in the end it's not that different from the question of how to ensure the latest scipy/numpy/pandas is used with scikit-learn when possible.

This is ready to be reviewed. cc @ogrisel @lesteve @tomMoral

Member

@jnothman jnothman left a comment

My comments forgot to be submitted and are now probably outdated.


@@ -1,133 +1,3 @@
"""Joblib is a set of tools to provide **lightweight pipelining in
Python**. In particular:
# Import necessary to preserve backward compatibility of pickles
Member

is this a short-term or long-term solution? at what point should we deprecate?

Member Author

As far as I understand, this is necessary to allow loading pickles created with the vendored joblib. Currently a fair number of those exist, but once this is merged and a few releases pass, their relative share should decrease. So I would say maybe we can deprecate it in 0.22 or 0.23.

Though another concern is to make sure people are not using this module as an alias for joblib. So maybe we should raise a warning here.

Will add a comment or warning once we reach an agreement about how to handle this.

Member

I may be missing the bigger picture, but I'd be in favor of raising a deprecation warning here since this is our only way to discourage users from using sklearn.externals.joblib.

@jnothman
Member

jnothman commented Apr 6, 2019

Do we want to consider this a blocker for 0.21 (which should be released, like, yesterday)?

@rth
Member Author

rth commented Apr 6, 2019

I also think it would be good to have this in v0.21

@jnothman
Member

jnothman commented Apr 9, 2019

I suppose this also closes #12263 and perhaps other loky-specific issues?

Member

@NicolasHug NicolasHug left a comment

Some comments / questions


@@ -5,9 +5,5 @@ def configuration(parent_package='', top_path=None):
from numpy.distutils.misc_util import Configuration
config = Configuration('externals', parent_package, top_path)
config.add_subpackage('joblib')
Member

is this still needed?

Member Author

Thanks for the review @NicolasHug !

Tried to remove it, but then unpickling of old pickles that we have in one test started to fail. So I suppose we still need it. Reverted that change.


'register_parallel_backend', 'parallel_backend',
'register_store_backend', 'register_compressor',
'wrap_non_picklable_objects']
if not hasattr(sys, "_is_pytest_session"):
Member

does that mean the tests still use externals.joblib somehow?

Member

Yes, we test that old pickles can be restored

Member Author

Also, more generally, pytest imports all Python files during test collection, and this is necessary so that we don't fail at collection time due to our error-on-DeprecationWarning policy.
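A minimal sketch of the gating pattern under discussion (the warning text is illustrative; `_is_pytest_session` is the attribute from the diff, set by scikit-learn's conftest during test runs):

```python
import sys
import warnings

def _warn_on_legacy_import():
    # Warn on import of the legacy alias, except during a pytest session,
    # where the suite's error-on-DeprecationWarning policy would turn the
    # collection-time import into a failure.
    if not hasattr(sys, "_is_pytest_session"):
        warnings.warn(
            "sklearn.externals.joblib is deprecated, import joblib directly",
            DeprecationWarning,
        )

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    _warn_on_legacy_import()

# Outside a pytest session the warning fires once.
print(len(caught))  # 1
```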

Member

oh right, thanks

Contributor

@tomMoral tomMoral left a comment

LGTM! The current state seems to be doing the unvendoring with the minimal possible changes so it makes sense.


@rth
Member Author

rth commented Apr 17, 2019

Thanks @tomMoral !

@NicolasHug NicolasHug merged commit fc33d30 into scikit-learn:master Apr 17, 2019
@NicolasHug
Member

Merging, let's give it a try, thanks Roman!

@rth rth deleted the unvendor-joblib branch April 17, 2019 13:45
jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Apr 25, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
takebayashi added a commit to takebayashi/pynndescent that referenced this pull request May 21, 2019
`scikit-learn` (<0.21) does not explicitly declare joblib as an external dependency (scikit-learn/scikit-learn#13531).
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019

Successfully merging this pull request may close these issues.

[RFC] Define future dependence on joblib
6 participants