FIX introduce a refresh_cache param to fetch_... functions. #14197
Conversation
Also, with this change, I think we can remove the …
Is it possible to only introduce a function (i.e., refresh_cache)? If we introduce a parameter, we'll need a deprecation cycle to remove it.
Yes, as I said: I've changed only one of the dataset files, will fix the others once we agree on a solution.
Yes, as Joel suggested:
This also allows the user to set the parameter to …
What's Joel's suggestion? I saw something like "I would be okay to detect this case and re-cache over a deprecation period". Does that mean that we need to download/construct the dataset every time we use it?
I had mostly just intended this to happen automatically, without a parameter to control it, and with that functionality disappearing after a couple of versions.
The issue is when users don't have a reliable or easy connection from those systems, and the functionality would delete the files while they cannot easily re-download them. Although with the default value here, the same thing happens. I can remove the parameter and have the deprecation cycle done completely silently, if that's what you prefer. I find the parameter useful anyway, even w/o …
Why would it necessarily delete the files?
Thinking more about it, it seems difficult to solve the issue if we only introduce a function to refresh the cache, because it's difficult to determine which files to remove. So I'm OK with this solution (i.e., add a parameter). If we don't like it later, we can deprecate it.
If I understand correctly, users can still use the datasets they pickled previously (e.g., before sklearn 0.21), right? So it seems not good to download/construct the datasets again by default?
I don't see why we should download again here at all... Shouldn't we just unpickle (with sklearn.externals.joblib) and re-pickle (with joblib)?
Didn't think of that, will do.
Yes, that's more what I was thinking ;)
I'll vote +1 for this solution.
sklearn/datasets/base.py
Outdated
@@ -919,3 +920,25 @@ def _fetch_remote(remote, dirname=None):
                          "file may be corrupted.".format(file_path, checksum,
                                                          remote.checksum))
    return file_path


def _refresh_cache(path):
You put this function in base.py, but actually it's designed only for fetch_covtype?
It still suppresses the warning if the folder/file is read-only. Trying to fix that.
Are you replying to my comment? I mean, things like this won't work for other functions (e.g., fetch_20newsgroups_vectorized).
I'll see what I need to do once I check the other instances which need to be fixed; for now I'm still working on this one. It was not a response to your comment, it was a response to @jnothman's comment regarding handling the case where I/O fails.
This now fixes all the instances of the issue, except for … The codecov check fails. I've tested all of the examples, both with the old persisted data being converted to the new format and with the files being read-only. But I don't think we can easily test them (unless we do some monkey-patching; not sure if it's worth it). How does this look @jnothman?
Codecov doesn't seem to be working properly on this patch anyway. I suppose it's kinda ready now.
sklearn/datasets/base.py
Outdated
other_warns = [w for w in warns if not str(w.message).startswith(msg)]
for w in other_warns:
    warnings.warn(message=w.message, category=w.category)
Ideally you should also be using the original module and line number. Is there an alternative function better suited to issuing the warning?
There's warn_explicit, but it didn't actually end up showing the warning and I couldn't fix it in 5 minutes, so I switched back to warn.
Add a test refreshing a pretend pickle?
Not really testing the pickle itself, but testing different routes through the warnings now.
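The warning routes in question follow the pattern from the snippet under review: record everything raised while loading, swallow the expected deprecation message, and re-issue the rest to the caller. A minimal self-contained sketch (the helper name and message prefix are hypothetical, not sklearn's actual identifiers):

```python
import warnings

EXPECTED_MSG = "sklearn.externals.joblib is deprecated"


def load_and_filter_warnings(load_func):
    """Run ``load_func`` while recording warnings, drop the expected
    deprecation message, and re-issue everything else."""
    with warnings.catch_warnings(record=True) as warns:
        warnings.simplefilter("always")
        data = load_func()
    # Re-issue only the warnings unrelated to the expected message.
    # Note the original module/line info is lost here, which is what
    # warn_explicit could preserve if it cooperated.
    other_warns = [w for w in warns
                   if not str(w.message).startswith(EXPECTED_MSG)]
    for w in other_warns:
        warnings.warn(message=w.message, category=w.category)
    return data
```

Passing the recorded Warning instance as `message` keeps its class, so the re-issued warning retains its original category even though `category` is ignored in that case.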
sklearn/datasets/tests/test_base.py
Outdated
    return 0


def _load_warn_unrelated(*args, **kwargs):
    warnings.warn("unrelated warning", UserWarning)
For a tighter test, this could be a DeprecationWarning.
Could you add an entry in the what's new? We will need it to remember to make the changes (it was quite useful when I wanted to remove deprecated code). I don't know if we should have a specific section, like "maintenance", to group things such as deprecations or these types of changes.
sklearn/datasets/base.py
Outdated
def _refresh_cache(files, compress):
    # REMOVE in v0.23
The TODO marker can be easier to catch:
- # REMOVE in v0.23
+ # TODO: REMOVE in v0.23
sklearn/datasets/base.py
Outdated
        warnings.warn(message=message, category=DeprecationWarning)

    if len(data) == 1:
        return data[0]
One-liner:
return data[0] if len(data) == 1 else data
@@ -129,7 +130,9 @@ def fetch_california_housing(data_home=None, download_if_missing=True,
        remove(archive_path)

    else:
-       cal_housing = joblib.load(filepath)
+       cal_housing = _refresh_cache([filepath], 6)
+       # Revert to the following two lines in v0.23
- # Revert to the following two lines in v0.23
+ # TODO: Revert to the following two lines in v0.23
sklearn/datasets/covtype.py
Outdated
- X = joblib.load(samples_path)
- y = joblib.load(targets_path)
+ X, y = _refresh_cache([samples_path, targets_path], 9)
+ # Revert to the following two lines in v0.23
- # Revert to the following two lines in v0.23
+ # TODO: Revert to the following two lines in v0.23
sklearn/datasets/kddcup99.py
Outdated
- X = joblib.load(samples_path)
- y = joblib.load(targets_path)
+ X, y = _refresh_cache([samples_path, targets_path], 0)
+ # Revert to the following two lines in v0.23
- # Revert to the following two lines in v0.23
+ # TODO: Revert to the following two lines in v0.23
sklearn/datasets/olivetti_faces.py
Outdated
@@ -107,7 +108,9 @@ def fetch_olivetti_faces(data_home=None, shuffle=False, random_state=0,
        joblib.dump(faces, filepath, compress=6)
        del mfile
    else:
-       faces = joblib.load(filepath)
+       faces = _refresh_cache([filepath], 6)
+       # Revert to the following two lines in v0.23
- # Revert to the following two lines in v0.23
+ # TODO: Revert to the following two lines in v0.23
sklearn/datasets/rcv1.py
Outdated
- X = joblib.load(samples_path)
- sample_id = joblib.load(sample_id_path)
+ X, sample_id = _refresh_cache([samples_path, sample_id_path], 9)
+ # Revert to the following two lines in v0.23
- # Revert to the following two lines in v0.23
+ # TODO: Revert to the following two lines in v0.23
sklearn/datasets/rcv1.py
Outdated
- y = joblib.load(sample_topics_path)
- categories = joblib.load(topics_path)
+ y, categories = _refresh_cache([sample_topics_path, topics_path], 9)
+ # Revert to the following two lines in v0.23
- # Revert to the following two lines in v0.23
+ # TODO: Revert to the following two lines in v0.23
@@ -259,6 +260,8 @@ def fetch_species_distributions(data_home=None,
                                **extra_params)
        joblib.dump(bunch, archive_path, compress=9)
    else:
-       bunch = joblib.load(archive_path)
+       bunch = _refresh_cache([archive_path], 9)
+       # Revert to the following two lines in v0.23
- # Revert to the following two lines in v0.23
+ # TODO: Revert to the following two lines in v0.23
Otherwise LGTM.
Codecov doesn't seem to be properly analyzing this PR. Otherwise, would you please check the what's new entry and let me know if that's what you thought, @glemaitre?
You approve, @glemaitre?
doc/whats_new/v0.21.rst
Outdated
:mod:`sklearn.datasets`
.......................

- |Fix| :func:`fetch_california_housing`, :func:`fetch_covtype`,
Links are missing.
I think this may just work without the func role. Otherwise you can use the ~ prefix.
They're resolved now.
Is this good to merge?
Thanks @adrinjalali |
This kinda fixes #14177.

This does not change the warning message, but it introduces a refresh_cache parameter to the fetch_... functions that download and persist the data using joblib. The proposal is to add this parameter, with some variation after reviews.

I've changed only one of the dataset files; will fix the others once we agree on a solution.

@glemaitre @jnothman