Stale scikit-learn datasets can throw pickling errors #14328
Comments
I'm having trouble reproducing this issue. Running the Iterative Imputer example with sklearn 0.21.0 (the earliest release to include it), then updating to 0.22.dev0 and re-running the example throws no errors for me. Could you clarify how you originally got stale data? Was there a Python version change, or some change among your sub-dependencies?
Failed attempt to reproduce
Starting with sklearn 0.21.0:
This cached version then gets loaded fine by sklearn 0.22.dev0. It also has the same SHA1 as a copy generated directly by 0.22.dev0.
|
Hm... I can get close. If you clear out your data cache
then install an older version of sklearn
download california housing
then install a more recent version of sklearn
and access california housing again, you see:
If you do this in the reverse direction, it fails with
Presumably, I had originally downloaded the california data some release before 0.19.0, when |
Okay, great. I can reproduce your DeprecationWarning. Digging through the GitHub history here, it appears this situation wasn't altogether unexpected. @rth took on the task of unvendoring joblib and right at the top of the PR mentions:
Hence, the inclusion of the DeprecationWarning. The behavior is unattractive, but I'm not sure what action can be taken. You've pointed out that clearing the dataset cache after updating does work. Is there a smarter approach than the status quo? Interested in hearing input. |
So, one option would be to tag the downloads with the version of scikit-learn that pickled them, and check that tag before loading the data. If the version doesn't match, download a fresh copy. |
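The tagging idea could look roughly like the following. This is a minimal standard-library sketch, not scikit-learn code: `save_with_tag`, `load_or_refresh`, the `.version` sidecar file, and the `download` callback are all hypothetical names chosen for illustration. The tag lives in a separate text file so it can be checked without unpickling anything.

```python
import pickle
import tempfile
from pathlib import Path


def _tag_path(cache_path):
    # Hypothetical sidecar file holding the version that wrote the pickle.
    return cache_path.with_name(cache_path.name + ".version")


def save_with_tag(cache_path, data, version):
    """Pickle `data` and record which library version produced the file."""
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    _tag_path(cache_path).write_text(version)


def load_or_refresh(cache_path, current_version, download):
    """Return cached data if its tag matches, otherwise download again."""
    tag = _tag_path(cache_path)
    if cache_path.exists() and tag.exists() and tag.read_text() == current_version:
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    # Missing file, missing tag, or version mismatch: treat the cache as stale.
    data = download()
    save_with_tag(cache_path, data, current_version)
    return data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "cal_housing.pkz"
        save_with_tag(cache, {"source": "old"}, "0.19.0")
        # Same version: the cached copy is reused.
        print(load_or_refresh(cache, "0.19.0", lambda: {"source": "new"}))
        # Different version: the cache is replaced by a fresh download.
        print(load_or_refresh(cache, "0.22.0", lambda: {"source": "new"}))
```

One trade-off of this approach is that every version bump triggers a re-download, even when the old pickle would have loaded fine.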
I don't think we'd want to support the case where people downgrade the package. As for the warning, #14197 should fix it automatically. I'd say this is a "won't fix" issue. Closing; happy to have it reopened if the other core devs disagree. |
Doesn't that PR actually solve this issue by refreshing local cache for legacy pickles? Not just the warning? |
That PR only refreshes the cache if the warning is raised. The issue here is that going back from a future version of sklearn raises an error, and I think it should. |
I think that was only done to reproduce the original issue. The original issue, unless I misunderstood something, is that pickles created with older scikit-learn versions fail to load with newer scikit-learn versions. Maybe I shouldn't have removed In any case, since the only complaint we got about it came from dataset loading, it should be fairly straightforward to modify the |
OK that statement was too optimistic.
    filepath = _pkl_filepath(data_home, 'cal_housing.pkz')
    cal_housing = None
    if exists(filepath):
        try:
            cal_housing = _refresh_cache([filepath], 6)
        except (ImportError,):
            # whitelist exceptions that are expected when loading
            # outdated pickles, otherwise raise exception
            pass
    if cal_housing is None:
        # download it
        ...
which would require modifying all the fetchers. Not sure if we really want to do that, or if adding a message in |
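The load-then-fall-back pattern in the snippet above can be tried out without scikit-learn. The sketch below is an illustration only: it uses plain `pickle` rather than the `_refresh_cache` helper, and `load_cached_or_download`, `write_stale_pickle`, and the `vanished_module` name are hypothetical. It simulates a stale cache by pickling an object whose module is then removed, which is assumed to resemble what happens when a module is dropped between releases.

```python
import pickle
import sys
import tempfile
import types
from pathlib import Path

# Exceptions assumed to signal an outdated pickle; anything else propagates.
STALE_PICKLE_ERRORS = (ImportError,)


def load_cached_or_download(cache_path, download):
    """Try the cached pickle first; rebuild it if it looks stale."""
    data = None
    if cache_path.exists():
        try:
            with open(cache_path, "rb") as f:
                data = pickle.load(f)
        except STALE_PICKLE_ERRORS:
            # The pickle references a module that can no longer be imported.
            data = None
    if data is None:
        data = download()
        with open(cache_path, "wb") as f:
            pickle.dump(data, f)
    return data


def write_stale_pickle(cache_path):
    """Simulate a stale cache: pickle an object whose module then vanishes."""
    module = types.ModuleType("vanished_module")  # stand-in for a removed module
    payload_cls = type("Payload", (), {"__module__": "vanished_module"})
    module.Payload = payload_cls
    sys.modules["vanished_module"] = module
    try:
        with open(cache_path, "wb") as f:
            pickle.dump(payload_cls(), f)
    finally:
        # After this, unpickling raises ModuleNotFoundError (an ImportError).
        del sys.modules["vanished_module"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "cal_housing.pkz"
        write_stale_pickle(cache)
        print(load_cached_or_download(cache, lambda: "freshly downloaded"))
```

Keeping the whitelist narrow means unrelated failures, such as a corrupted file, still surface as errors instead of silently triggering a download.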
Correct. As for fixes, printing a warning about stale cache data sounds reasonable to me. |
I would say "not worth fixing for now", let's close this one. This was an issue a while ago when joblib was unvendored, but I would guess that a similar issue is unlikely to happen. In case I am wrong, another issue will be created and we can have a closer look ... |
Description
Loading a dataset that has been downloaded to ~/scikit_learn_data some time in the past can throw unpickling errors after updating scikit-learn. Possible solutions might be:
pickle
Deleting the data folder resolves this issue.
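For reference, scikit-learn ships `sklearn.datasets.clear_data_home()` for emptying the cache. The standard-library sketch below shows the same workaround by hand; `clear_cache_dir` is a hypothetical helper, and the path must be passed explicitly so that nothing is deleted by accident.

```python
import shutil
import tempfile
from pathlib import Path


def clear_cache_dir(data_home):
    """Delete a dataset cache folder; return True if it existed."""
    data_home = Path(data_home).expanduser()
    existed = data_home.exists()
    # ignore_errors makes the call a no-op when the folder is already gone.
    shutil.rmtree(data_home, ignore_errors=True)
    return existed


if __name__ == "__main__":
    # Demonstrated on a throwaway folder rather than the real ~/scikit_learn_data.
    with tempfile.TemporaryDirectory() as tmp:
        fake_home = Path(tmp) / "scikit_learn_data"
        fake_home.mkdir()
        (fake_home / "cal_housing.pkz").write_bytes(b"stale bytes")
        print(clear_cache_dir(fake_home))
```

The next call to a dataset fetcher then finds no cached file and downloads the data again.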
Steps/Code to Reproduce
Expected Results
No error is thrown.
Actual Results
An unpickling error is thrown when trying to find the _joblib module in sklearn.externals.
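Some background on why a moved module produces this kind of error: a pickle stores classes and functions by import path, not by value, so loading it re-imports whatever path was valid at dump time. The sketch below illustrates that with a standard-library class; `referenced_globals` is a hypothetical helper, and `OrderedDict` merely stands in for whatever objects the cached dataset contains.

```python
import pickle
import pickletools
from collections import OrderedDict


def referenced_globals(payload):
    """List the 'module name' pairs a pickle will import when loaded."""
    return [arg for opcode, arg, _pos in pickletools.genops(payload)
            if opcode.name == "GLOBAL"]


# Protocol 3 is used so each import path shows up as a single GLOBAL opcode.
payload = pickle.dumps(OrderedDict(a=1), protocol=3)
print(referenced_globals(payload))
```

If one of the listed modules is renamed or removed after the file was written, `pickle.load` fails at import time, which matches the symptom described here.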
Versions