Unvendor joblib #13531
This file was deleted.
`@@ -1,133 +1,15 @@` — the body of the vendored `sklearn.externals.joblib` module is replaced by a deprecation shim.

Before:

```python
"""Joblib is a set of tools to provide **lightweight pipelining in
Python**. In particular:

1. transparent disk-caching of functions and lazy re-evaluation
   (memoize pattern)

2. easy simple parallel computing

Joblib is optimized to be **fast** and **robust** in particular on large
data and has specific optimizations for `numpy` arrays. It is
**BSD-licensed**.


==================== ===============================================
**Documentation:**   https://joblib.readthedocs.io

**Download:**        http://pypi.python.org/pypi/joblib#downloads

**Source code:**     http://github.com/joblib/joblib

**Report issues:**   http://github.com/joblib/joblib/issues
==================== ===============================================


Vision
--------

The vision is to provide tools to easily achieve better performance and
reproducibility when working with long running jobs.

* **Avoid computing twice the same thing**: code is rerun over an
  over, for instance when prototyping computational-heavy jobs (as in
  scientific development), but hand-crafted solution to alleviate this
  issue is error-prone and often leads to unreproducible results

* **Persist to disk transparently**: persisting in an efficient way
  arbitrary objects containing large data is hard. Using
  joblib's caching mechanism avoids hand-written persistence and
  implicitly links the file on disk to the execution context of
  the original Python object. As a result, joblib's persistence is
  good for resuming an application status or computational job, eg
  after a crash.

Joblib addresses these problems while **leaving your code and your flow
control as unmodified as possible** (no framework, no new paradigms).

Main features
------------------

1) **Transparent and fast disk-caching of output value:** a memoize or
   make-like functionality for Python functions that works well for
   arbitrary Python objects, including very large numpy arrays. Separate
   persistence and flow-execution logic from domain logic or algorithmic
   code by writing the operations as a set of steps with well-defined
   inputs and outputs: Python functions. Joblib can save their
   computation to disk and rerun it only if necessary::

      >>> from sklearn.externals.joblib import Memory
      >>> cachedir = 'your_cache_dir_goes_here'
      >>> mem = Memory(cachedir)
      >>> import numpy as np
      >>> a = np.vander(np.arange(3)).astype(np.float)
      >>> square = mem.cache(np.square)
      >>> b = square(a)  # doctest: +ELLIPSIS
      ________________________________________________________________________________
      [Memory] Calling square...
      square(array([[0., 0., 1.],
             [1., 1., 1.],
             [4., 2., 1.]]))
      ___________________________________________________________square - 0...s, 0.0min

      >>> c = square(a)
      >>> # The above call did not trigger an evaluation

2) **Embarrassingly parallel helper:** to make it easy to write readable
   parallel code and debug it quickly::

      >>> from sklearn.externals.joblib import Parallel, delayed
      >>> from math import sqrt
      >>> Parallel(n_jobs=1)(delayed(sqrt)(i**2) for i in range(10))
      [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]


3) **Fast compressed Persistence**: a replacement for pickle to work
   efficiently on Python objects containing large data (
   *joblib.dump* & *joblib.load* ).

..
    >>> import shutil ; shutil.rmtree(cachedir)

"""

# PEP0440 compatible formatted version, see:
# https://www.python.org/dev/peps/pep-0440/
#
# Generic release markers:
#   X.Y
#   X.Y.Z   # For bugfix releases
#
# Admissible pre-release markers:
#   X.YaN   # Alpha release
#   X.YbN   # Beta release
#   X.YrcN  # Release Candidate
#   X.Y     # Final release
#
# Dev branch marker is: 'X.Y.dev' or 'X.Y.devN' where N is an integer.
# 'X.Y.dev0' is the canonical version of 'X.Y.dev'
#
__version__ = '0.13.0'


from .memory import Memory, MemorizedResult, register_store_backend
from .logger import PrintTime
from .logger import Logger
from .hashing import hash
from .numpy_pickle import dump
from .numpy_pickle import load
from .compressor import register_compressor
from .parallel import Parallel
from .parallel import delayed
from .parallel import cpu_count
from .parallel import register_parallel_backend
from .parallel import parallel_backend
from .parallel import effective_n_jobs

from .externals.loky import wrap_non_picklable_objects


__all__ = ['Memory', 'MemorizedResult', 'PrintTime', 'Logger', 'hash', 'dump',
           'load', 'Parallel', 'delayed', 'cpu_count', 'effective_n_jobs',
           'register_parallel_backend', 'parallel_backend',
           'register_store_backend', 'register_compressor',
           'wrap_non_picklable_objects']
```

After:

```python
# Import necessary to preserve backward compatibility of pickles
import sys
import warnings

from joblib import *

msg = ("sklearn.externals.joblib is deprecated in 0.21 and will be removed "
       "in 0.23. Please import this functionality directly from joblib, "
       "which can be installed with: pip install joblib. If this warning is "
       "raised when loading pickled models, you may need to re-serialize "
       "those models with scikit-learn 0.21+.")

if not hasattr(sys, "_is_pytest_session"):
    warnings.warn(msg, category=DeprecationWarning)
```

Review comments on the `if not hasattr(sys, "_is_pytest_session"):` line:

- does that mean the tests still use
- Yes, we test that old pickles can be restored
- Also more generally pytest will import all python files during test collection, and this is necessary so that we don't fail at collection time due to our error on
- oh right, thanks
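The `from joblib import *` re-export is what keeps previously pickled models loadable: pickle records each class as a (module, qualified name) pair, so the old import path must keep resolving to the real objects. Below is a minimal stdlib-only sketch of that mechanism, not scikit-learn's actual code; the module names `real_joblib` and `legacy_shim` are invented stand-ins for joblib and the old vendored path.

```python
import pickle
import sys
import types

# Stand-in for the real library; "legacy_shim" plays the role of the
# old vendored import path.  Both names are invented for this sketch.
real = types.ModuleType("real_joblib")

class Memory:
    def __init__(self, location):
        self.location = location

# Pretend old pickles recorded the class under the legacy path.
Memory.__module__ = "legacy_shim"
real.Memory = Memory
sys.modules["real_joblib"] = real

# The shim at the old path, equivalent to `from real_joblib import *`.
shim = types.ModuleType("legacy_shim")
shim.__dict__.update(real.__dict__)
sys.modules["legacy_shim"] = shim

# An "old" pickle: it references legacy_shim.Memory by name.
old_payload = pickle.dumps(Memory("/tmp/cache"))

# Without the shim, unpickling fails: the recorded module is gone.
del sys.modules["legacy_shim"]
try:
    pickle.loads(old_payload)
except ModuleNotFoundError as exc:
    print("without shim:", exc)

# With the shim restored, the old payload loads again.
sys.modules["legacy_shim"] = shim
obj = pickle.loads(old_payload)
print(type(obj).__name__, obj.location)
```

This is also why the shim must warn rather than raise on plain import: unpickling triggers the import of the recorded module path.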
This file was deleted.
Also added so that unpickling of existing models doesn't raise errors about non-existent paths.
I presume you've tested unpickling something with a joblib numpy array in this branch?
Yes, I checked that we can unpickle a numpy array and the PCA estimator serialized with joblib in scikit-learn 0.20.2.
Also, CircleCI caches pickled datasets used in examples, and I think we would have seen failures in the examples if it didn't work.
Without this change I saw failures locally in tests that used the cached data for `fetch_20newsgroups_vectorized`.
You won't see the CircleCI examples fail unless you force a full build.
Right, forgot about that. I re-triggered the full CircleCI build; still no errors.
We can keep a small sklearn estimator serialized with joblib in scikit-learn 0.20.2 in a test directory, and include a test that unpickles it to make sure it doesn't error.
@thomasjpfan There is a test for unpickling numpy arrays in https://github.com/scikit-learn/scikit-learn/pull/13531/files#r269922098. I'm not convinced that unpickling an actual scikit-learn estimator as part of the test suite is worth it: if the estimator is simple, we would basically be testing the Python cPickle module; if it's more complex and interacts with lots of parts of scikit-learn, it is bound to break as our estimators evolve (and we won't be able to unpickle it).
I have tested that unpickling works outside of the test suite as well as mentioned above.
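For reference, the kind of regression test suggested in this thread could look roughly like the sketch below. All file names are invented; in a real suite the artifact would be serialized once by the old release and committed to the test data directory, whereas here it is fabricated in-place so the sketch runs standalone.

```python
import pathlib
import pickle
import tempfile

# Stand-in fixture: in a real test this file would have been produced by
# scikit-learn 0.20.2 and checked into the repository.  The file name
# "logreg-0.20.2.pkl" is made up for illustration.
fixture_dir = pathlib.Path(tempfile.mkdtemp())
fixture = fixture_dir / "logreg-0.20.2.pkl"
fixture.write_bytes(pickle.dumps({"coef_": [1.0, -2.0], "intercept_": 0.5}))

def test_old_artifact_still_unpickles():
    # Only assert that deserialization succeeds and basic attributes
    # survive; exercising the full estimator API would couple the test
    # to internals that change between releases.
    obj = pickle.loads(fixture.read_bytes())
    assert obj["coef_"] == [1.0, -2.0]

test_old_artifact_still_unpickles()
print("ok")
```

The trade-off discussed above applies: the simpler the stored object, the more this tests pickle itself rather than scikit-learn's backward compatibility.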