Unvendor joblib #13531

Merged: 19 commits, Apr 17, 2019

Conversation

rth
Member

@rth rth commented Mar 27, 2019

Closes #12447

This unvendors joblib. A major concern is backward compatibility of pickles.

Todo:

  • update documentation
  • check that pip install scikit-learn in a clean environment automatically installs joblib

obj = _unpickle(fobj, filename, mmap_mode)

return obj
from joblib.numpy_pickle import *
Member Author

This was added to allow loading pickles generated with older versions of scikit-learn, used e.g. in fetch_20newsgroups_vectorized.
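The mechanism can be illustrated with a stdlib-only sketch (module and class names here are made up): pickle stores objects by import path, so pickles written under the old vendored path only load if a module registered at that path still re-exports the same names, which is what the star import above provides.

```python
import pickle
import sys
import types

# Toy reproduction of the problem the compatibility import solves:
# a class is "defined" at the old path, pickled, and the old path then
# disappears, as it does when joblib is unvendored.
oldlib = types.ModuleType("oldlib")

class Payload:
    def __init__(self, value):
        self.value = value

Payload.__module__ = "oldlib"   # pretend it was defined at the old path
oldlib.Payload = Payload
sys.modules["oldlib"] = oldlib
data = pickle.dumps(Payload(42))

# Simulate a new process where the old module path no longer exists.
del sys.modules["oldlib"]
try:
    pickle.loads(data)
except ModuleNotFoundError:
    print("old path gone -> unpickling fails")

# The shim: a module registered at the old path re-exporting the names
# (the effect of `from joblib.numpy_pickle import *` in this file).
shim = types.ModuleType("oldlib")
shim.Payload = Payload
sys.modules["oldlib"] = shim
print(pickle.loads(data).value)  # 42
```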

@rth
Member Author

rth commented Mar 27, 2019

So it looks like at least Azure CI is fine. CircleCI fails due to #13525, so that would need to be merged first.

I'm not sure what else I should check to make sure we are keeping backward compatibility of pickles. cc @tomMoral

@adrinjalali
Member

#13527 is merged and we think it has fixed the CircleCI issues.

'register_parallel_backend', 'parallel_backend',
'register_store_backend', 'register_compressor',
'wrap_non_picklable_objects']
from joblib import *
Member Author

Also added so that unpickling of existing models doesn't raise errors about non existing paths.

Member

I presume you've tested unpickling something with a joblib numpy array in this branch?

Member Author

Yes, I checked that we can unpickle a numpy array and a PCA estimator serialized with joblib in scikit-learn 0.20.2.

Also CircleCI caches pickled datasets used in examples, and I think we could have seen failures in examples if it didn't work.

Without this change I saw failures locally in tests that used the cached data for fetch_20newsgroups_vectorized
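A quick local check along those lines might look as follows (a sketch, assuming joblib and numpy are installed; the original verification also round-tripped a fitted PCA estimator from 0.20.2, which is not reproduced here):

```python
import tempfile

import numpy as np
import joblib

# Dump a numpy array with joblib and load it back; with unvendored
# joblib this exercises the same numpy_pickle code path that old
# scikit-learn pickles rely on.
with tempfile.NamedTemporaryFile(suffix=".joblib") as f:
    arr = np.arange(6).reshape(2, 3)
    joblib.dump(arr, f.name)
    loaded = joblib.load(f.name)

print(np.array_equal(arr, loaded))
```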

Member

You won't see CircleCI examples fail unless you force a full build

Member Author

Right, forgot about that. Re-triggered the full CircleCI build -- no errors still.

Member

We can keep a small sklearn estimator serialized with joblib in scikit-learn 0.20.2 in a test directory, and include a test that unpickles it to make sure it doesn't error.

Member Author

@thomasjpfan There is a test for unpickling numpy arrays in https://github.com/scikit-learn/scikit-learn/pull/13531/files#r269922098. I'm not convinced that unpickling an actual scikit-learn estimator as part of the test suite is worth it: if it's simple, we would basically be testing the Python cPickle module. If it's more complex and interacts with lots of parts of scikit-learn, it is bound to break as our estimators evolve (and we won't be able to unpickle it).

I have tested that unpickling works outside of the test suite as well as mentioned above.

@jnothman
Member

This is looking pretty good. I hope it doesn't upset installations too much.

@rth rth changed the title WIP: Unvendor joblib Unvendor joblib Mar 28, 2019
from joblib import effective_n_jobs
from joblib import hash
from joblib import cpu_count, Parallel, Memory, delayed
from joblib import parallel_backend, register_parallel_backend
Member Author

We still keep the sklearn.utils._joblib module I suppose.

Member

Maybe @GaelVaroquaux @ogrisel @tomMoral can weigh in: is joblib >= 0.11 still raising DeprecationWarnings on Python 3?

If not we could get rid of this file too.

Also @rth we can remove the comment at the top of the file mentioning local / site joblib

Member Author

@NicolasHug I agree we should probably get rid of sklearn.utils._joblib but that would involve updating a large number of imports

sklearn/multioutput.py
28:from .utils._joblib import Parallel, delayed

sklearn/pipeline.py
19:from .utils._joblib import Parallel, delayed

sklearn/multiclass.py
56:from .utils._joblib import Parallel
57:from .utils._joblib import delayed

sklearn/decomposition/dict_learning.py
16:from ..utils._joblib import Parallel, delayed, effective_n_jobs

sklearn/decomposition/online_lda.py
23:from ..utils._joblib import Parallel, delayed, effective_n_jobs

sklearn/datasets/rcv1.py
24:from ..utils import _joblib

sklearn/datasets/covtype.py
29:from ..utils import _joblib

sklearn/datasets/species_distributions.py
53:from sklearn.utils import _joblib

sklearn/datasets/twenty_newsgroups.py
47:from ..utils import _joblib

sklearn/datasets/olivetti_faces.py
26:from ..utils import _joblib

sklearn/datasets/kddcup99.py
24:from ..utils import _joblib

sklearn/datasets/california_housing.py
36:from ..utils import _joblib

sklearn/datasets/lfw.py
22:from ..utils._joblib import Memory
23:from ..utils import _joblib

sklearn/ensemble/partial_dependence.py
13:from ..utils._joblib import Parallel, delayed

sklearn/ensemble/voting_classifier.py
20:from ..utils._joblib import Parallel, delayed

sklearn/ensemble/iforest.py
18:from ..utils.fixes import _joblib_parallel_args

sklearn/ensemble/forest.py
52:from ..utils._joblib import Parallel, delayed
61:from ..utils.fixes import parallel_helper, _joblib_parallel_args

sklearn/ensemble/base.py
15:from ..utils._joblib import effective_n_jobs

sklearn/ensemble/bagging.py
15:from ..utils._joblib import Parallel, delayed

sklearn/linear_model/least_angle.py
24:from ..utils._joblib import Parallel, delayed

sklearn/linear_model/theil_sen.py
23:from ..utils._joblib import Parallel, delayed, effective_n_jobs

sklearn/linear_model/stochastic_gradient.py
12:from ..utils._joblib import Parallel, delayed
35:from ..utils.fixes import _joblib_parallel_args

sklearn/linear_model/coordinate_descent.py
21:from ..utils._joblib import Parallel, delayed, effective_n_jobs
23:from ..utils.fixes import _joblib_parallel_args

sklearn/linear_model/base.py
26:from ..utils._joblib import Parallel, delayed

sklearn/linear_model/omp.py
19:from ..utils._joblib import Parallel, delayed

sklearn/linear_model/logistic.py
36:from ..utils._joblib import Parallel, delayed, effective_n_jobs
37:from ..utils.fixes import _joblib_parallel_args

sklearn/neighbors/base.py
27:from ..utils._joblib import Parallel, delayed, effective_n_jobs
28:from ..utils._joblib import __version__ as joblib_version

sklearn/model_selection/_validation.py
25:from ..utils._joblib import Parallel, delayed
26:from ..utils._joblib import logger

sklearn/model_selection/_search.py
31:from ..utils._joblib import Parallel, delayed

sklearn/feature_selection/rfe.py
18:from ..utils._joblib import Parallel, delayed, effective_n_jobs

sklearn/metrics/pairwise.py
28:from ..utils._joblib import Parallel
29:from ..utils._joblib import delayed
30:from ..utils._joblib import effective_n_jobs

sklearn/utils/validation.py
26:from ._joblib import Memory
27:from ._joblib import __version__ as joblib_version

sklearn/utils/fixes.py
221:    from . import _joblib

sklearn/utils/__init__.py
15:from . import _joblib

sklearn/utils/testing.py
52:from sklearn.utils._joblib import joblib

sklearn/utils/estimator_checks.py
15:from sklearn.utils import _joblib

sklearn/tests/test_multioutput.py
20:from sklearn.utils._joblib import cpu_count

sklearn/tests/test_site_joblib.py
3:from sklearn.utils._joblib import Parallel, delayed, Memory, parallel_backend

sklearn/covariance/graph_lasso_.py
26:from ..utils._joblib import Parallel, delayed

sklearn/cluster/mean_shift_.py
26:from ..utils._joblib import Parallel
27:from ..utils._joblib import delayed

sklearn/tests/test_pipeline.py
34:from sklearn.utils._joblib import Memory
35:from sklearn.utils._joblib import __version__ as joblib_version

sklearn/cluster/k_means_.py
31:from ..utils._joblib import Parallel
32:from ..utils._joblib import delayed
33:from ..utils._joblib import effective_n_jobs

sklearn/compose/_column_transformer.py
17:from ..utils._joblib import Parallel, delayed

sklearn/manifold/mds.py
15:from ..utils._joblib import Parallel
16:from ..utils._joblib import delayed
17:from ..utils._joblib import effective_n_jobs

sklearn/ensemble/tests/test_bagging.py
36:from sklearn.utils import _joblib

sklearn/ensemble/tests/test_forest.py
25:from sklearn.utils._joblib import joblib
26:from sklearn.utils._joblib import parallel_backend
27:from sklearn.utils._joblib import register_parallel_backend
28:from sklearn.utils._joblib import __version__ as __joblib_version__

sklearn/linear_model/tests/test_sgd.py
30:from sklearn.utils import _joblib
31:from sklearn.utils._joblib import parallel_backend

sklearn/neighbors/tests/test_neighbors.py
29:from sklearn.utils._joblib import joblib
30:from sklearn.utils._joblib import parallel_backend

sklearn/neighbors/tests/test_kde.py
13:from sklearn.utils import _joblib

sklearn/metrics/tests/test_score_objects.py
40:from sklearn.utils import _joblib

sklearn/utils/tests/test_utils.py
306:    from sklearn.utils._joblib import Parallel, Memory, delayed
307:    from sklearn.utils._joblib import cpu_count, hash, effective_n_jobs
308:    from sklearn.utils._joblib import parallel_backend
309:    from sklearn.utils._joblib import register_parallel_backend
319:    from sklearn.utils._joblib import joblib

sklearn/utils/tests/test_estimator_checks.py
11:from sklearn.utils import _joblib

and I would rather do it in a separate PR, as this diff is large enough. It's a bit orthogonal to unvendoring anyway, and more about how we alias (or don't alias) joblib imports. The point of this PR is to remove sklearn.externals.joblib with minimal changes elsewhere.

Removed the no longer relevant warning.

Contributor

I agree that the import should be changed in a separate PR to keep it readable.

@@ -26,24 +16,4 @@ def test_old_pickle(tmpdir):
b'\x0fU\nallow_mmapq\x10\x88ub\x01\x00\x00\x00\x00\x00\x00\x00.',
mode='wb')
Member Author

@rth rth Mar 28, 2019

Essentially this becomes a test of the joblib dependency. It's still useful, but somewhat non standard.

@rth
Member Author

rth commented Mar 28, 2019

This is looking pretty good. I hope it doesn't upset installations too much.

CI wise this went rather smoothly indeed.

For new scikit-learn installs (or when upgrading from versions with vendored joblib to non-vendored ones) this should be transparent to the user: latest joblib will be installed as a dependency automatically with pip or conda. Checked that for pip in a clean environment.

For subsequent updates, we need to make sure users have an incentive (and are aware of the necessity) to upgrade joblib at the same time as they upgrade scikit-learn. Conda will handle this automatically, but pip won't with just pip install -U scikit-learn, given that --upgrade-strategy defaults to only-if-needed (instead of eager). Though in the end it's not that different from the question of how to ensure the latest scipy/numpy/pandas is used with scikit-learn when possible.

This is ready to be reviewed. cc @ogrisel @lesteve @tomMoral

Member

@jnothman jnothman left a comment

My comments forgot to be submitted and are now probably outdated.


@@ -1,133 +1,3 @@
"""Joblib is a set of tools to provide **lightweight pipelining in
Python**. In particular:
# Import necessary to preserve backward compatibility of pickles
Member

is this a short-term or long-term solution? at what point should we deprecate?

Member Author

As far as I understand, this is necessary to allow loading pickles created with the vendored joblib. Currently a fair number of those exist, but once this is merged and a few releases pass, their relative share should decrease. So I would say maybe we can deprecate it in 0.22 or 0.23.

Though another concern is to make sure people are not using this module as an alias for joblib. So maybe we should raise a warning here.

Will add a comment or warning once we reach an agreement about how to handle this.

Member

I may be missing the bigger picture, but I'd be in favor of raising a deprecation warning here since this is our only way to discourage users from using sklearn.externals.joblib.

@jnothman
Member

jnothman commented Apr 6, 2019

Do we want to consider this a blocker for 0.21 (which should be released, like, yesterday)?

@rth
Member Author

rth commented Apr 6, 2019

I also think it would be good to have this in v0.21

@jnothman
Member

jnothman commented Apr 9, 2019

I suppose this also closes #12263 and perhaps other loky-specific issues?

Member

@NicolasHug NicolasHug left a comment

Some comments / questions


@@ -5,9 +5,5 @@ def configuration(parent_package='', top_path=None):
from numpy.distutils.misc_util import Configuration
config = Configuration('externals', parent_package, top_path)
config.add_subpackage('joblib')
Member

is this still needed?

Member Author

Thanks for the review @NicolasHug !

Tried to remove it, but then unpickling of old pickles that we have in one test started to fail. So I suppose we still need it. Reverted that change.


'register_parallel_backend', 'parallel_backend',
'register_store_backend', 'register_compressor',
'wrap_non_picklable_objects']
if not hasattr(sys, "_is_pytest_session"):
Member

does that mean the tests still use externals.joblib somehow?

Member

Yes, we test that old pickles can be restored

Member Author

Also, more generally, pytest imports all Python files during test collection, and this is necessary so that we don't fail at collection time due to our error-on-DeprecationWarning policy.
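A minimal sketch of the gating pattern under discussion (the warning text is illustrative; `_is_pytest_session` is the attribute from the diff, set by scikit-learn's conftest during test runs):

```python
import sys
import warnings

def _warn_on_legacy_import():
    # Warn on import of the legacy alias, except during a pytest session,
    # where the suite's error-on-DeprecationWarning policy would turn the
    # collection-time import into a failure.
    if not hasattr(sys, "_is_pytest_session"):
        warnings.warn(
            "sklearn.externals.joblib is deprecated, import joblib directly",
            DeprecationWarning,
        )

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    _warn_on_legacy_import()

# Outside a pytest session the warning fires once.
print(len(caught))  # 1
```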

Member

oh right, thanks

Contributor

@tomMoral tomMoral left a comment

LGTM! The current state seems to be doing the unvendoring with the minimal possible changes so it makes sense.


@rth
Member Author

rth commented Apr 17, 2019

Thanks @tomMoral !

@NicolasHug NicolasHug merged commit fc33d30 into scikit-learn:master Apr 17, 2019
@NicolasHug
Member

Merging, let's give it a try, thanks Roman!

@rth rth deleted the unvendor-joblib branch April 17, 2019 13:45
jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Apr 25, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
takebayashi added a commit to takebayashi/pynndescent that referenced this pull request May 21, 2019
`scikit-learn` (<0.21) does not explicitly declare joblib as an external dependency (scikit-learn/scikit-learn#13531).
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019

Successfully merging this pull request may close these issues.

[RFC] Define future dependence on joblib
6 participants