Stale scikit-learn datasets can throw pickling errors #14328

Closed
deniederhut opened this issue Jul 13, 2019 · 11 comments

@deniederhut
Contributor

Description

Loading a dataset that has been downloaded to ~/scikit_learn_data some time in the past can throw unpickling errors after updating scikit-learn. Possible solutions might be:

  • clearing the dataset cache after updating
  • including the scikit-learn version used to serialize the dataset objects
  • serializing with something besides pickle

Deleting the data folder resolves this issue.
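As a rough illustration of the third option (serializing with something besides pickle), the sketch below stores a small dataset as JSON. The function names and the file layout are hypothetical and are not part of scikit-learn. The point is only that a plain-data format records values rather than import paths, so loading it does not depend on which modules exist in the installed version.

```python
import json


def save_dataset(path, data, target, feature_names):
    """Write a dataset as plain JSON (hypothetical layout, for illustration)."""
    payload = {
        "format_version": 1,  # lets a future loader detect layout changes
        "feature_names": list(feature_names),
        "data": [[float(v) for v in row] for row in data],
        "target": [float(t) for t in target],
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)


def load_dataset(path):
    """Read the dataset back; no class or module lookup is involved."""
    with open(path, encoding="utf-8") as fh:
        payload = json.load(fh)
    return payload["data"], payload["target"], payload["feature_names"]
```

JSON is slow and bulky for large numeric arrays, so this is meant to show the idea, not to suggest a specific format.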

Steps/Code to Reproduce

python examples/impute/plot_iterative_imputer_variants_comparison.py

Expected Results

No error is thrown.

Actual Results

An unpickling error is thrown when trying to find the _joblib module in sklearn.externals.

Traceback (most recent call last):
  File "examples/impute/plot_iterative_imputer_variants_comparison.py", line 61, in <module>
    X_full, y_full = fetch_california_housing(return_X_y=True)
  File "/Users/dillon/githubpackages/scikit-learn/sklearn/datasets/california_housing.py", line 133, in fetch_california_housing
    cal_housing = _refresh_cache([filepath], 6)
  File "/Users/dillon/githubpackages/scikit-learn/sklearn/datasets/base.py", line 930, in _refresh_cache
    data = tuple([joblib.load(f) for f in files])
  File "/Users/dillon/githubpackages/scikit-learn/sklearn/datasets/base.py", line 930, in <listcomp>
    data = tuple([joblib.load(f) for f in files])
  File "/Users/dillon/anaconda/envs/sklearn-dev/lib/python3.7/site-packages/joblib/numpy_pickle.py", line 598, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/Users/dillon/anaconda/envs/sklearn-dev/lib/python3.7/site-packages/joblib/numpy_pickle.py", line 526, in _unpickle
    obj = unpickler.load()
  File "/Users/dillon/anaconda/envs/sklearn-dev/lib/python3.7/pickle.py", line 1085, in load
    dispatch[key[0]](self)
  File "/Users/dillon/anaconda/envs/sklearn-dev/lib/python3.7/pickle.py", line 1373, in load_global
    klass = self.find_class(module, name)
  File "/Users/dillon/anaconda/envs/sklearn-dev/lib/python3.7/pickle.py", line 1423, in find_class
    __import__(module, level=0)
ModuleNotFoundError: No module named 'sklearn.externals._joblib'
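The failure mode in this traceback can be reproduced without scikit-learn. A pickle stores classes by reference (module path plus name), so the module must still be importable at load time. The sketch below uses a made-up module name, `legacy_vendored_mod`, as a stand-in for a vendored package that a later release removes.

```python
import pickle
import sys
import types

MODULE_NAME = "legacy_vendored_mod"  # hypothetical name, not a real package


def make_stale_pickle():
    """Pickle an object whose class lives in a module that then disappears."""
    mod = types.ModuleType(MODULE_NAME)

    class Payload:
        pass

    # Make the class look as if it were defined at the top of the fake module.
    Payload.__module__ = MODULE_NAME
    Payload.__qualname__ = "Payload"
    mod.Payload = Payload

    sys.modules[MODULE_NAME] = mod
    try:
        return pickle.dumps(Payload())
    finally:
        # Simulate upgrading to a release where the module no longer exists.
        del sys.modules[MODULE_NAME]


def try_load(blob):
    """Return the ModuleNotFoundError raised while unpickling, or None."""
    try:
        pickle.loads(blob)
    except ModuleNotFoundError as exc:
        return exc
    return None
```

Calling `try_load(make_stale_pickle())` returns the exception raised by `find_class`, the same step that fails in the traceback above.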

Versions

System:
    python: 3.7.3 (default, Mar 27 2019, 16:54:48)  [Clang 4.0.1 (tags/RELEASE_401/final)]
executable: /Users/dillon/anaconda/envs/sklearn-dev/bin/python
   machine: Darwin-17.7.0-x86_64-i386-64bit

BLAS:
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
  lib_dirs: /Users/dillon/anaconda/envs/sklearn-dev/lib
cblas_libs: mkl_rt, pthread

Python deps:
       pip: 19.1.1
setuptools: 41.0.1
   sklearn: 0.22.dev0
     numpy: 1.16.4
     scipy: 1.2.1
    Cython: 0.29.11
    pandas: 0.24.2
matplotlib: 3.1.0
    joblib: 0.13.2
@amueller amueller added the Bug label Jul 14, 2019
@edrogers
Contributor

I'm having trouble reproducing this issue. Running the Iterative Imputer example with sklearn 0.21.0 (the earliest release to include it), then updating to 0.22.dev0 and re-running the example throws no errors for me. Could you clarify how you originally got stale data? Was there a Python version change, or some change among your sub-dependencies?

Failed attempt to reproduce

Starting with sklearn 0.21.0:

$ python -c "import sklearn; print(sklearn.__version__)"
0.21.0
$ pip freeze | grep -v scikit-learn
Click==7.0
cycler==0.10.0
Cython==0.29.11
joblib==0.13.2
kiwisolver==1.1.0
matplotlib==3.1.0
numpy==1.16.4
pandas==0.24.2
pip-tools==3.8.0
pyparsing==2.4.0
python-dateutil==2.8.0
pytz==2019.1
scipy==1.2.1
six==1.12.0
$ python examples/impute/plot_iterative_imputer_variants_comparison.py
# ...
# Lots of output and ConvergenceWarnings
# ...
$ sha1sum ~/scikit_learn_data/cal_housing_py3.pkz
41bac263fe97ef4d681395c42af14e1fa2c72ce5  ~/scikit_learn_data/cal_housing_py3.pkz

This cached version then gets loaded fine by sklearn 0.22.dev0. It also has the same SHA1 as a copy generated directly by 0.22.dev0.

$ python -c "import sklearn; print(sklearn.__version__)"
0.22.dev0
$ python examples/impute/plot_iterative_imputer_variants_comparison.py
# ...
# Lots of output and ConvergenceWarnings ... no ModuleNotFoundError
# ...

@deniederhut
Contributor Author

Hm... I can get close. If you clear out your data cache

rm -r ~/scikit_learn_data

then install an older version of sklearn

pip install "scikit-learn==0.19.0"

download california housing

python -c "from sklearn.datasets import fetch_california_housing;fetch_california_housing()"

then install a more recent version of sklearn

pip install "scikit-learn==0.21.2"

and access california housing again, you see:

/Users/dillon/anaconda/envs/scikit-test/lib/python3.6/site-packages/sklearn/externals/joblib/__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=DeprecationWarning)

If you do this in the reverse direction, it fails with

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/dillon/anaconda/envs/py35/lib/python3.5/site-packages/sklearn/datasets/california_housing.py", line 106, in fetch_california_housing
    cal_housing = joblib.load(filepath)
  File "/Users/dillon/anaconda/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 575, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/Users/dillon/anaconda/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 507, in _unpickle
    obj = unpickler.load()
  File "/Users/dillon/anaconda/envs/py35/lib/python3.5/pickle.py", line 1043, in load
    dispatch[key[0]](self)
  File "/Users/dillon/anaconda/envs/py35/lib/python3.5/pickle.py", line 1342, in load_global
    klass = self.find_class(module, name)
  File "/Users/dillon/anaconda/envs/py35/lib/python3.5/pickle.py", line 1392, in find_class
    __import__(module, level=0)
ImportError: No module named 'joblib'

Presumably, I had originally downloaded the california data some release before 0.19.0, when sklearn.externals.joblib was called sklearn.externals._joblib.

@edrogers
Contributor

Okay, great. I can reproduce your DeprecationWarning.

Digging through the github history here, it appears this situation wasn't altogether unexpected. @rth took on the task of unvendoring joblib and right at the top of the PR mentions:

A major concern is backward compatibility of pickles.

Hence, the inclusion of the DeprecationWarning. The behavior is unattractive, but I'm not sure what action can be taken. You've pointed out that clearing the dataset cache after updating does work. Is there a smarter approach than the status quo?

Interested in hearing input.

@deniederhut
Contributor Author

So, one option would be to tag the downloads with the version of scikit-learn that pickled them, and check that tag before loading the data. If the version doesn't match, download a new thing.
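A minimal sketch of that idea, assuming a sidecar `.version` file next to the cached pickle. The helper name `load_versioned`, the sidecar convention, and the `fetch` callable are all hypothetical; scikit-learn's fetchers do not work this way.

```python
import pickle
from pathlib import Path


def load_versioned(cache_path, current_version, fetch):
    """Load a cached pickle only if it was written by `current_version`.

    `fetch` is a zero-argument callable that downloads or rebuilds the data.
    The version tag lives in a sidecar file (an assumption of this sketch).
    """
    cache_path = Path(cache_path)
    tag_path = cache_path.with_name(cache_path.name + ".version")

    if cache_path.exists() and tag_path.exists():
        if tag_path.read_text().strip() == current_version:
            with cache_path.open("rb") as fh:
                return pickle.load(fh)

    # Missing cache, missing tag, or version mismatch: fetch and re-tag.
    data = fetch()
    with cache_path.open("wb") as fh:
        pickle.dump(data, fh)
    tag_path.write_text(current_version)
    return data
```

The tag is checked before the pickle is opened, so a stale file is never unpickled. The cost is a fresh download after every version change, including upgrades that would have loaded the old file without trouble.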

@adrinjalali
Member

I don't think we'd want to support the case where people would downgrade the package. And also for the warning, it should automatically fix the issue with #14197

I'd say this is a "won't fix" issue. Closing, happy to have it opened if the other core devs disagree.

@rth
Member

rth commented Jul 18, 2019

And also for the warning, it should automatically fix the issue with #14197

Doesn't that PR actually solve this issue by refreshing local cache for legacy pickles? Not just the warning?

@adrinjalali
Member

That PR only refreshes the cache if the warning is raised. The issue here is that going back from a future version of sklearn raises an error, and I think it should.

@rth
Member

rth commented Jul 18, 2019

The issue here is that going back from a future version of sklearn raises an error, and I think it should.

I think that was only done to reproduce the original issue. The original issue, unless I misunderstood something, is that pickles created with older scikit-learn versions fail to load with newer scikit-learn versions. Maybe I shouldn't have removed sklearn.externals._joblib during joblib unvendoring.

In any case, since the only complaint we got about it came from dataset loading, it should be fairly straightforward to modify the _refresh_cache function to also refresh cache if pickles failed to load with ModuleNotFoundError: No module named 'sklearn.externals._joblib', I think? I'll look into it.

@rth
Member

rth commented Jul 18, 2019

it should be fairly straightforward to modify the _refresh_cache function to also refresh cache if pickles failed to load

OK, that statement was too optimistic. _refresh_cache is oblivious to the download URL, so another mechanism would be needed, something like:

from os.path import exists

filepath = _pkl_filepath(data_home, 'cal_housing.pkz')

cal_housing = None
if exists(filepath):
    try:
        cal_housing = _refresh_cache([filepath], 6)
    except (ImportError,):
        # whitelist exceptions that are expected when loading
        # outdated pickles, otherwise raise exception
        pass  # leave cal_housing as None so the download below runs

if cal_housing is None:
    # download it
    ...

which would require modifying all the fetchers. Not sure if we really want to do that, or if adding a message in _refresh_cache to manually remove the cache folder (or even remove the problematic pickle and ask the user to re-run the example again) would be better.
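One way to avoid repeating that try/except in every fetcher would be a shared helper that takes the download step as a callable. The sketch below is a standalone illustration using only the standard library. `load_cache_or_refetch`, `STALE_PICKLE_ERRORS`, and the `refetch` argument are hypothetical names, and plain `pickle` stands in for `joblib.load`. It also emits the warning and removes the problematic file, combining both alternatives mentioned above.

```python
import os
import pickle
import warnings

# Exceptions treated as "the pickle is stale"; anything else still propagates.
# ModuleNotFoundError is a subclass of ImportError, so it is covered here.
STALE_PICKLE_ERRORS = (ImportError,)


def load_cache_or_refetch(filepath, refetch):
    """Load `filepath` if possible; otherwise rebuild it via `refetch()`."""
    if os.path.exists(filepath):
        try:
            with open(filepath, "rb") as fh:
                return pickle.load(fh)
        except STALE_PICKLE_ERRORS as exc:
            warnings.warn(
                f"Cached file {filepath} could not be unpickled ({exc}); "
                "removing it and fetching the data again."
            )
            os.remove(filepath)

    data = refetch()
    with open(filepath, "wb") as fh:
        pickle.dump(data, fh)
    return data
```

Each fetcher would still need to pass its own download logic as `refetch`, so this reduces the duplication but does not remove the need to touch every fetcher.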

@rth rth reopened this Jul 18, 2019
@deniederhut
Contributor Author

I think that was only done to reproduce the original issue.

Correct.

As for fixes, printing a warning about stale cache data sounds reasonable to me.

@lesteve
Member

lesteve commented Jan 21, 2025

I would say "not worth fixing for now", let's close this one.

This was an issue a while ago when joblib was unvendored, but I would guess that a similar issue is unlikely to happen. In case I am wrong, another issue will be created and we can have a closer look ...

@lesteve lesteve closed this as not planned Jan 21, 2025

7 participants