[MRG] ENH: dataset-fetching with use figshare, partial download and checksum #7429
Conversation
sorry for not following up on this. Will review shortly.
@amueller no problem, this sort of thing is pretty tedious to check.
From a code perspective this otherwise LGTM
sklearn/datasets/covtype.py (outdated)
@@ -33,8 +33,7 @@
 from ..utils import check_random_state

-URL = ('http://archive.ics.uci.edu/ml/'
-       'machine-learning-databases/covtype/covtype.data.gz')
+URL = ('https://ndownloader.figshare.com/files/5976039')
no need to keep parentheses. They were there for line-wrapped strings only.
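For illustration, the parentheses were only ever doing work for the old line-wrapped, implicitly-concatenated literals; the new single-line URLs can drop them:

```python
# Parentheses were needed only to wrap an implicitly-concatenated,
# line-wrapped string across two lines; a single literal needs none.
URL_WRAPPED = ('http://archive.ics.uci.edu/ml/'
               'machine-learning-databases/covtype/covtype.data.gz')
URL = 'https://ndownloader.figshare.com/files/5976039'
```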
sklearn/datasets/lfw.py (outdated)
@@ -44,14 +44,15 @@
 logger = logging.getLogger(__name__)

 BASE_URL = "http://vis-www.cs.umass.edu/lfw/"
Hmm. Can we assume no backwards compatibility issues with these names? I'm fine with this.
hmm, I believe so. I specifically made an attempt to maintain backwards compatibility by changing the associated variable names
sklearn/datasets/kddcup99.py (outdated)
@@ -29,11 +29,9 @@
 from ..utils import shuffle as shuffle_method

-URL10 = ('http://archive.ics.uci.edu/ml/'
-         'machine-learning-databases/kddcup99-mld/kddcup.data_10_percent.gz')
+URL10 = ('https://ndownloader.figshare.com/files/5976042')
parens
sklearn/datasets/kddcup99.py (outdated)
-URL = ('http://archive.ics.uci.edu/ml/'
-       'machine-learning-databases/kddcup99-mld/kddcup.data.gz')
+URL = ('https://ndownloader.figshare.com/files/5976045')
parens
sklearn/datasets/rcv1.py (outdated)
     'https://ndownloader.figshare.com/files/5976060',
     'https://ndownloader.figshare.com/files/5976057'
 ]
+URL_topics = ('https://ndownloader.figshare.com/files/5976048')
parens
@@ -64,8 +64,7 @@
 logger = logging.getLogger(__name__)

-URL = ("http://people.csail.mit.edu/jrennie/"
-       "20Newsgroups/20news-bydate.tar.gz")
+URL = ("https://ndownloader.figshare.com/files/5975967")
parens
I've then compared the MD5 sums produced and found them identical except for generated images. However, I've realised that those MD5 sum checks should be part of the fetch. Please add this:
I'm also concerned about what we lose by throwing away the original URLs. Should we use a mapping rather than replacing them?
@jnothman good idea, but I'm not too sure where you want me to add the MD5 checks. Could you elaborate a bit more?
Hmm... perhaps I'm being slow, but how would such a mapping work?
I would create a helper function that fetches a file from a particular URL and stores it at a path relative to data home, with an MD5 check. So it would be passed the URL and the expected MD5 among other args. Not sure if that's sufficient.
Not sure if mapping is necessary.
@jnothman what i ended up doing instead was running md5 on the downloaded archives rather than the raw datafiles. I basically just wrote a general fetch_dataset function that, given a URL, path, and md5, downloads the URL to a temp location, checks the archive / raw data with the given MD5, then moves it to the proper location. Do you think this would be suitable, or is it greatly preferred to run md5 on the extracted files?
That's okay, except that it doesn't catch the case where files later get corrupted. (Reply sent by email on 21 December 2016.)
@jnothman I initially had written a function to just check the archives, but I decided to check the hashes of the generated files as well. However, it seems like I get different md5 hashes than you do. For example,
(branch updated from d23e1d2 to 7186af8)
I agree. We should aim to merge this soon and certainly ensure it's available for the next release. Are you able to wrap it up @nelson-liu? Fix merge conflicts and I'll give it a final look.
When were you planning for the next release? In a bit of a crunch right now, but I should be able to get to this next weekend (unfortunately not the coming one)
Current vague release plan is "before June". I hope we can actually make it happen. So this is urgent only from the perspective of users seeking solutions in master to current broken functionality. We can get someone else to put in any finishing touches if you'd rather. (Reply sent by email on 5 April 2017.)
Bump?
Oops, this fell off my radar; I'll do it today. Thanks for the reminder @jnothman
@jnothman ok, I think it should be good now (pending CI)... I manually reran some of the
CircleCI error seems genuine:
@lesteve thanks! I think it'd be best to remove the validation on the actual processed file, since I also realized that the hash of the generated file would change depending on python version (2 vs 3)...
appveyor failure looks spurious... does someone want to rerun that test or should i just push a spurious commit?
I'm sure we'll find something to nitpick about to make your next commit not spurious.
I would appreciate if there were a bit more consistency across fetch functions as to how the checksum, URL and local path are related in code. We could of course just store this in a table globally, but I'm not sure that's so elegant either.
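One way to get that consistency across fetchers would be a small shared record type rather than a global table; purely a sketch (the type name, field names, and checksum placeholder are all hypothetical):

```python
from collections import namedtuple

# One record per remote resource ties local filename, URL and checksum
# together, so every fetch function can consume the same structure.
RemoteFile = namedtuple("RemoteFile", ["filename", "url", "checksum"])

COVTYPE_ARCHIVE = RemoteFile(
    filename="covtype.data.gz",
    url="https://ndownloader.figshare.com/files/5976039",
    checksum="<md5-of-archive>",  # hypothetical placeholder, not a real hash
)
```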
if expected_checksum != _md5(path):
    # remove the corrupted file
    remove(path)
    raise ValueError("{} has an MD5 hash differing "
IOError?
                  "{}-").format(existing_size)):
        raise IOError("Server does not support the HTTP Range "
                      "header, cannot resume download.")
except:
Use except Exception: instead; a bare except: also catches KeyboardInterrupt.
# get the amount of path_temp we've downloaded
existing_size = getsize(path_temp)
print("Resuming download from previous temp file, "
      "already have {} bytes".format(existing_size))
I'd prefer these on stderr
except:
    # delete the temp file and retry download of whole file
    remove(path_temp)
    print("Attempting to re-download file.")
stderr?
mention "after {!r}".format(exc)?
temp_file = open(path_temp, "ab")
# get the amount of path_temp we've downloaded
existing_size = getsize(path_temp)
print("Resuming download from previous temp file, "
mention URL?
from ..externals import joblib

DATA_URL = "http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz"
Perhaps it is appropriate to leave the original URL in a comment.
(Or indeed to have it available as a mirror??)
@@ -95,20 +88,21 @@ def fetch_california_housing(data_home=None, download_if_missing=True):
     raise IOError("Data not found and `download_if_missing` is False")

     print('downloading Cal. housing from %s to %s' % (DATA_URL, data_home))
     archive_fileobj = BytesIO(urlopen(DATA_URL).read())
     archive_path = join(data_home, "cal_housing.tgz")
     expected_checksum = "130d0eececf165046ec4dc621d121d80"
It seems appropriate to me for this to appear as a global with the URL.
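i.e. something along these lines at module level (the checksum is the one quoted in the diff above; the constant name is a suggestion, not the PR's code):

```python
# Keeping the URL and its expected checksum side by side at module level
# makes the pairing obvious and easy to update together.
DATA_URL = "http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz"
DATA_CHECKSUM = "130d0eececf165046ec4dc621d121d80"  # MD5 of cal_housing.tgz
```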
URL = ('http://archive.ics.uci.edu/ml/'
       'machine-learning-databases/kddcup99-mld/kddcup.data.gz')
URL10 = 'https://ndownloader.figshare.com/files/5976042'

why the blank line
URL = ('http://archive.ics.uci.edu/ml/'
       'machine-learning-databases/kddcup99-mld/kddcup.data.gz')
URL10 = 'https://ndownloader.figshare.com/files/5976042'
I think it is now less clear what 10 means. Rename the variable to URL_10_PERCENT
 files = []
-for file_url in file_urls:
+for file_name, file_url in zip(FILE_NAMES, FILE_URLS):
If you already have zip, why not zip the checksums in too, rather than using a different data structure?
also, a needless inconsistency with lfw.
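Zipping the checksums in alongside the names and URLs would keep the three parallel sequences in lockstep; a sketch with made-up values (the real rcv1 lists and hashes differ):

```python
# Hypothetical parallel lists standing in for the module-level constants.
FILE_NAMES = ["part0.dat.gz", "part1.dat.gz"]
FILE_URLS = ["https://ndownloader.figshare.com/files/5976060",
             "https://ndownloader.figshare.com/files/5976057"]
CHECKSUMS = ["md5-of-part0", "md5-of-part1"]  # placeholder hashes

files = []
for file_name, file_url, checksum in zip(FILE_NAMES, FILE_URLS, CHECKSUMS):
    # each iteration sees one coherent (name, url, checksum) triple,
    # instead of carrying the checksums in a separate data structure
    files.append((file_name, file_url, checksum))
```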
Good work, @nelson-liu
Would you be able to fix this up to merge for the upcoming release, @nelson-liu? Or should we get someone else to do the final touch-ups?
sorry for being so flaky --- it'd be good to get someone else to finish the last bits. I'll see if I can pitch in, though...
Closing in favor of #9240.
Reference Issue
addresses #7425, #8089

What does this implement/fix? Explain your changes.
Moved the datasets downloaded by the fetchers (fetch_...) to figshare and changed their URLs accordingly.

Any other comments?
I'm pretty sure CI doesn't test datasets that aren't already downloaded to my machine; is there a way to test this PR besides going into all of the tests and setting download_if_missing=True?
@amueller I emailed the figshare details to your gmail (t3kcit)