[MRG] ENH: dataset-fetching with figshare, partial download and checksum #7429


Closed

Conversation

@nelson-liu (Contributor) commented Sep 15, 2016

Reference Issue

Addresses #7425, #8089.

What does this implement/fix? Explain your changes.

  • Uploaded all of the datasets (fetch_...) to figshare and changed their URLs accordingly.
  • Added a custom download function that lets users resume interrupted downloads.
  • Lastly, implemented MD5 checking of both the raw downloaded files and the produced datasets.

Any other comments?

I'm pretty sure CI doesn't exercise the fetchers for datasets that aren't already downloaded. Is there a way to test this PR besides going into all of the tests and setting download_if_missing=True?

@amueller I emailed the figshare details to your gmail (t3kcit)

@nelson-liu nelson-liu changed the title Use figshare in datasets [MRG] Uploaded datasets we fetch to figshare Sep 15, 2016
@amueller (Member)

Sorry for not following up on this; will review shortly.

@amueller amueller added this to the 0.19 milestone Oct 11, 2016
@nelson-liu (Contributor Author)

@amueller no problem, this sort of thing is pretty tedious to check.

@jnothman (Member) left a comment

From a code perspective, this otherwise LGTM.

@@ -33,8 +33,7 @@
 from ..utils import check_random_state
 
 
-URL = ('http://archive.ics.uci.edu/ml/'
-       'machine-learning-databases/covtype/covtype.data.gz')
+URL = ('https://ndownloader.figshare.com/files/5976039')
(Member)

No need to keep the parentheses; they were there for line-wrapped strings only.

@@ -44,14 +44,15 @@
 logger = logging.getLogger(__name__)
 
 
-BASE_URL = "http://vis-www.cs.umass.edu/lfw/"
(Member)

Hmm. Can we assume there are no backwards-compatibility issues with these names? I'm fine with this.

(Contributor Author)

Hmm, I believe so. I specifically made an effort to maintain backwards compatibility when changing the associated variable names.

@@ -29,11 +29,9 @@
 from ..utils import shuffle as shuffle_method
 
 
-URL10 = ('http://archive.ics.uci.edu/ml/'
-         'machine-learning-databases/kddcup99-mld/kddcup.data_10_percent.gz')
+URL10 = ('https://ndownloader.figshare.com/files/5976042')
(Member)

parens


-URL = ('http://archive.ics.uci.edu/ml/'
-       'machine-learning-databases/kddcup99-mld/kddcup.data.gz')
+URL = ('https://ndownloader.figshare.com/files/5976045')
(Member)

parens

+    'https://ndownloader.figshare.com/files/5976060',
+    'https://ndownloader.figshare.com/files/5976057'
+]
+URL_topics = ('https://ndownloader.figshare.com/files/5976048')
(Member)

parens

@@ -64,8 +64,7 @@
 logger = logging.getLogger(__name__)
 
 
-URL = ("http://people.csail.mit.edu/jrennie/"
-       "20Newsgroups/20news-bydate.tar.gz")
+URL = ("https://ndownloader.figshare.com/files/5975967")
(Member)

parens

@jnothman (Member)

echo '
from sklearn.datasets import *
for name in list(locals()):
    if name.startswith("fetch_") and name != "fetch_mldata":
        print(name)
        locals()[name]()
' > /tmp/foo.py &&
mkdir /tmp/master_data /tmp/pr_data &&
git checkout master &&
SCIKIT_LEARN_DATA=/tmp/master_data python /tmp/foo.py &&
git checkout upstream/pr/7429 &&
SCIKIT_LEARN_DATA=/tmp/pr_data python /tmp/foo.py

I've then compared the MD5 sums produced and found them identical except for generated images.

However, I've realised that those MD5 sum checks should be part of the fetch.

Please add this:

05297c95ca4efd1017820d17e358a11e  /tmp/master_data/kddcup99_10/samples
0888a9e9b09a04af8c5d542973ed1950  /tmp/master_data/lfw_home/joblib/sklearn/datasets/lfw/_fetch_lfw_pairs/4dacd9a65737d5fda3ff6c02e3dd167b/output.pkl
0a761020de3d898d00cdb69e37ada346  /tmp/master_data/lfw_home/lfw_funneled/pairs_05.txt
0e896d78ea4c95b72b2914089e9503a6  /tmp/master_data/RCV1/sample_topics.pkl
0f97d450ebcc31d9489bc40ea7db1de1  /tmp/master_data/RCV1/sample_id.pkl
10ff5148d371ea69a620692665553612  /tmp/master_data/lfw_home/lfw_funneled/pairs_06.txt
25455304fc74c07fd031297c8fffd4e8  /tmp/master_data/kddcup99_10/targets
2a5f838e615f5423947945a28a3ef5bd  /tmp/master_data/olivetti.pkz
321690c7b33b99fdfbcf20caabc9fd32  /tmp/master_data/lfw_home/lfw_funneled/pairs_01.txt
3a17d79fe25a12233872ae721603020c  /tmp/master_data/lfw_home/lfw_funneled/pairs_08.txt
3c19f7377094e966cc5ee26f1ed40534  /tmp/master_data/lfw_home/lfw_funneled/pairs_02.txt
41e98d056b196a9ffe00b09c2565e162  /tmp/master_data/covertype/samples_01.npy.z
4f27cbf15b2da4a85c1907eb4181ad21  /tmp/master_data/lfw_home/pairsDevTrain.txt
5132f7440eb68cf58910c8a45a2ac10b  /tmp/master_data/lfw_home/pairsDevTest.txt
564261c7565ce0b57ed44f1f77ff0ea0  /tmp/master_data/lfw_home/joblib/sklearn/datasets/lfw/_fetch_lfw_people/040ffd485691ec83d5af84299b39ea70/output.pkl_01.npy.z
5739836ebdbe6adaf6a4e499ee6fd6d1  /tmp/master_data/lfw_home/joblib/sklearn/datasets/lfw/_fetch_lfw_pairs/func_code.py
5a05ba128212f5b4169f7ee2f460e81a  /tmp/master_data/RCV1/samples.pkl_01.npy.z
64ab790f52560900b85e4869eaf61c15  /tmp/master_data/lfw_home/lfw_funneled/pairs_09.txt
64bbe259a6e08b050332b74809e8f0ee  /tmp/master_data/species_coverage.pkz
78fbf6d80131f3c66e5cc7ac0ddc03c2  /tmp/master_data/lfw_home/joblib/sklearn/datasets/lfw/_fetch_lfw_people/func_code.py
88012b0d731694b416dec9b85ca3c800  /tmp/master_data/lfw_home/joblib/sklearn/datasets/lfw/_fetch_lfw_people/040ffd485691ec83d5af84299b39ea70/metadata.json
98b8efef98c2c3602517ef6fb2990539  /tmp/master_data/RCV1/topics_names.pkl
9d15045fb2719e6b19dd158972982b13  /tmp/master_data/lfw_home/lfw_funneled/pairs_04.txt
9f1ba174e4e1c508ff7cdf10ac338a7d  /tmp/master_data/lfw_home/lfw_funneled/pairs.txt
9f1ba174e4e1c508ff7cdf10ac338a7d  /tmp/master_data/lfw_home/pairs.txt
a0d5d66ebb457948fd502f71d7ffdb6e  /tmp/master_data/lfw_home/lfw_funneled/pairs_03.txt
a5420a1cae0b2270d33fec1075341d38  /tmp/master_data/covertype/samples
a9d6eb2c2b080e238b89cc996eb328af  /tmp/master_data/RCV1/samples.pkl
b23cf9a382bc3a9d8352ae3fdc98b878  /tmp/master_data/covertype/targets
b6911731dd9f956d1f4df612fd127538  /tmp/master_data/RCV1/samples.pkl_02.npy.z
be6dacd276a598cd17755331054ca7b1  /tmp/master_data/lfw_home/lfw_funneled/pairs_10.txt
d7e080432f2e9ba4f1d3d6171f6113c8  /tmp/master_data/lfw_home/joblib/sklearn/datasets/lfw/_fetch_lfw_pairs/4dacd9a65737d5fda3ff6c02e3dd167b/metadata.json
d8aff5fdc363ff82950145dabd2343bd  /tmp/master_data/20newsgroup_vectorized.pkl
e08bb31595a445c56a991c0da7d85e83  /tmp/master_data/cal_housing.pkz
eeab2af266f3bfb0767b7ea2d7fafe3c  /tmp/master_data/lfw_home/lfw_funneled/pairs_07.txt
f445b493da79987ef5cb2b26d09e2f88  /tmp/master_data/20news-bydate.pkz
f6c3e4ff0acc44604bc7efb457559775  /tmp/master_data/lfw_home/joblib/sklearn/datasets/lfw/_fetch_lfw_people/040ffd485691ec83d5af84299b39ea70/output.pkl

I'm also concerned about what we lose by throwing away the original URLs. Should we use a mapping rather than replacing them?

@nelson-liu (Contributor Author)

@jnothman good idea, but I'm not sure where you want me to add the MD5 checks. Could you elaborate a bit more?

I'm also concerned about what we lose by throwing away the original URLs. Should we use a mapping rather than replacing them?

Hmm...perhaps I'm being slow, but how would such a mapping work?

@jnothman (Member)

I would create a helper function that fetches a file from a particular URL and stores it at a path relative to data home, with an MD5 check. So it would be passed the URL and the expected MD5 among other args. Not sure if that's sufficient.
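
A minimal sketch of what such a helper could look like (illustrative only; _md5 and _fetch_remote are hypothetical names, not code from this PR):

import hashlib
from os.path import join
from urllib.request import urlretrieve


def _md5(path, chunk_size=8192):
    # stream the file so large archives need not fit in memory
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def _fetch_remote(url, data_home, filename, expected_md5):
    # store the file at a path relative to data home, then verify it
    path = join(data_home, filename)
    urlretrieve(url, path)
    if _md5(path) != expected_md5:
        raise IOError("{} has an MD5 hash differing from expected, "
                      "file may be corrupted.".format(path))
    return path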

@jnothman (Member)

Not sure the mapping is necessary.

@nelson-liu (Contributor Author)

@jnothman what I ended up doing instead was running MD5 on the downloaded archives rather than the raw data files. I basically wrote a general fetch_dataset function that, given a URL, path, and MD5, downloads the URL to a temp location, checks the archive or raw data against the given MD5, then moves it to the proper location. Do you think this is suitable, or is it greatly preferred to run MD5 on the extracted files?
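
Roughly, that flow (a sketch under the same caveats, reusing the hypothetical _md5 helper from above; fetch_dataset here is not the PR's exact code):

import shutil
import tempfile
from os.path import join
from urllib.request import urlretrieve


def fetch_dataset(url, path, expected_md5):
    # download to a temp location first, so an interrupted or corrupted
    # download never leaves a bad file at the final path
    tmp_dir = tempfile.mkdtemp()
    tmp_path = join(tmp_dir, "download.tmp")
    try:
        urlretrieve(url, tmp_path)
        if _md5(tmp_path) != expected_md5:
            raise IOError("Checksum mismatch for {}".format(url))
        # move into place only once the checksum has been validated
        shutil.move(tmp_path, path)
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)
    return path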

@jnothman (Member) commented Dec 21, 2016 via email

@nelson-liu (Contributor Author)

@jnothman I initially wrote a function to just check the archives, but I decided to check the hashes of the generated files as well. However, it seems I get different MD5 hashes than you do. For example, cal_housing.pkz has an MD5 hash of 39c2dc70c4aad72e44b741c37163e6cc on my machine as calculated by md5sum...

@nelson-liu nelson-liu force-pushed the use_figshare_in_datasets branch from d23e1d2 to 7186af8 Compare December 24, 2016 07:31
@nelson-liu nelson-liu changed the title [MRG] Uploaded datasets we fetch to figshare [MRG] RFC/ENH: Update dataset-fetching utilities Dec 24, 2016
@jnothman (Member) commented Apr 4, 2017

I agree. We should aim to merge this soon and certainly ensure it's available for the next release. Are you able to wrap it up @nelson-liu? Fix merge conflicts and I'll give it a final look.

@nelson-liu (Contributor Author) commented Apr 4, 2017

When were you planning the next release? I'm in a bit of a crunch right now, but I should be able to get to this next weekend (unfortunately not the coming one).

@jnothman (Member) commented Apr 4, 2017 via email

@jnothman (Member)

Bump?

@nelson-liu (Contributor Author)

Oops, this fell off my radar; I'll do it today. Thanks for the reminder, @jnothman.

@nelson-liu (Contributor Author)

@jnothman ok, I think it should be good now (pending CI). I manually reran some of the fetch_* functions and was thinking it might be good to have some more logging messages on them (since I spent a lot of time waiting and wasn't quite sure what it was doing). That's probably out of scope for this PR, though...

@lesteve (Member) commented Apr 28, 2017

The CircleCI error seems genuine:

Unexpected failing examples:
../examples/applications/plot_model_complexity_influence.py failed leaving traceback:
Traceback (most recent call last):
  File "/home/ubuntu/scikit-learn/doc/sphinxext/sphinx_gallery/gen_rst.py", line 450, in execute_code_block
    exec(code_block, example_globals)
  File "<string>", line 2, in <module>
  File "<string>", line 13, in generate_data
  File "/home/ubuntu/scikit-learn/sklearn/datasets/twenty_newsgroups.py", line 328, in fetch_20newsgroups_vectorized
    remove=remove)
  File "/home/ubuntu/scikit-learn/sklearn/datasets/twenty_newsgroups.py", line 215, in fetch_20newsgroups
    cache_path=cache_path)
  File "/home/ubuntu/scikit-learn/sklearn/datasets/twenty_newsgroups.py", line 94, in download_20newsgroups
    _validate_file_md5(expected_checksum, cache_path)
  File "/home/ubuntu/scikit-learn/sklearn/datasets/base.py", line 881, in _validate_file_md5
    "corrupted.".format(path))
ValueError: /home/ubuntu/scikit_learn_data/20news-bydate.pkz has an MD5 hash differing from expected, file may be corrupted.

@nelson-liu (Contributor Author)

@lesteve thanks! I think it'd be best to remove the validation on the actual processed file, since I also realized that the hash of the generated file would change depending on Python version (2 vs 3)...

@nelson-liu (Contributor Author)

The AppVeyor failure looks spurious... does someone want to rerun that test, or should I just push a spurious commit?

@jnothman (Member)

I'm sure we'll find something to nitpick about to make your next commit not spurious.

@jnothman (Member) left a comment

I would appreciate a bit more consistency across fetch functions in how the checksum, URL and local path are related in code. We could of course just store this in a table globally, but I'm not sure that's so elegant either.
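
One way to make that relation explicit is a per-file record kept next to each fetcher (a sketch only; RemoteFileMetadata is an illustrative name and the checksum value is a placeholder):

from collections import namedtuple

# one record per remote file: local name, source URL, expected MD5
RemoteFileMetadata = namedtuple("RemoteFileMetadata",
                                ["filename", "url", "checksum"])

COVTYPE_ARCHIVE = RemoteFileMetadata(
    filename="covtype.data.gz",
    url="https://ndownloader.figshare.com/files/5976039",
    checksum="<expected md5 of the downloaded archive>")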

if expected_checksum != _md5(path):
    # remove the corrupted file
    remove(path)
    raise ValueError("{} has an MD5 hash differing "
(Member)

IOError?

"{}-").format(existing_size)):
raise IOError("Server does not support the HTTP Range "
"header, cannot resume download.")
except:
(Member)

Use except Exception. Otherwise it blocks KeyboardInterrupt.
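
For reference, a sketch of the resume logic with these review comments applied (illustrative names, not the PR's exact code):

import sys
from os import remove
from os.path import getsize
from urllib.request import Request, urlopen


def _resume_download(url, path_temp):
    # ask the server only for the bytes we don't have yet
    existing_size = getsize(path_temp)
    request = Request(url,
                      headers={"Range": "bytes={}-".format(existing_size)})
    response = urlopen(request)
    # a server that honours Range replies 206 Partial Content; a plain
    # 200 means it would resend the whole file, so we cannot resume
    if response.getcode() != 206:
        raise IOError("Server does not support the HTTP Range "
                      "header, cannot resume download.")
    # progress messages go to stderr, as suggested below
    print("Resuming download of {}, already have {} bytes".format(
        url, existing_size), file=sys.stderr)
    with open(path_temp, "ab") as temp_file:
        for chunk in iter(lambda: response.read(8192), b""):
            temp_file.write(chunk)


def _download_with_resume(url, path_temp):
    try:
        _resume_download(url, path_temp)
    except Exception:
        # "except Exception", not a bare "except:", so KeyboardInterrupt
        # still propagates; drop the temp file before retrying from scratch
        remove(path_temp)
        raise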

# get the amount of path_temp we've downloaded
existing_size = getsize(path_temp)
print("Resuming download from previous temp file, "
      "already have {} bytes".format(existing_size))
(Member)

I'd prefer these on stderr

except:
    # delete the temp file and retry download of whole file
    remove(path_temp)
    print("Attempting to re-download file.")
(Member)

stderr?

(Member)

mention "after {!r}".format(exc)?

temp_file = open(path_temp, "ab")
# get the amount of path_temp we've downloaded
existing_size = getsize(path_temp)
print("Resuming download from previous temp file, "
(Member)

mention URL?

 from ..externals import joblib
 
 
-DATA_URL = "http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz"
(Member)

Perhaps it is appropriate to leave the original URL in a comment.

(Member)

(Or indeed to have it available as a mirror??)

@@ -95,20 +88,21 @@ def fetch_california_housing(data_home=None, download_if_missing=True):
         raise IOError("Data not found and `download_if_missing` is False")
 
     print('downloading Cal. housing from %s to %s' % (DATA_URL, data_home))
-    archive_fileobj = BytesIO(urlopen(DATA_URL).read())
+    archive_path = join(data_home, "cal_housing.tgz")
+    expected_checksum = "130d0eececf165046ec4dc621d121d80"
(Member)

It seems appropriate to me for this to appear as a global with the URL.
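
i.e. something like (a sketch; the figshare mirror URL for cal_housing does not appear in this thread, so it is left as a placeholder):

# original source, kept for reference as suggested above:
# http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz
DATA_URL = "<figshare mirror URL>"
DATA_ARCHIVE_MD5 = "130d0eececf165046ec4dc621d121d80"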


-URL = ('http://archive.ics.uci.edu/ml/'
-       'machine-learning-databases/kddcup99-mld/kddcup.data.gz')
+URL10 = 'https://ndownloader.figshare.com/files/5976042'
+
(Member)

Why the blank line?


-URL = ('http://archive.ics.uci.edu/ml/'
-       'machine-learning-databases/kddcup99-mld/kddcup.data.gz')
+URL10 = 'https://ndownloader.figshare.com/files/5976042'
(Member)

I think it is now less clear what 10 means. Rename the variable to URL_10_PERCENT.

 files = []
-for file_url in file_urls:
+for file_name, file_url in zip(FILE_NAMES, FILE_URLS):
(Member)

If you already have zip, why not zip the checksums in too, rather than using a different data structure?
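
i.e. something like (a sketch; FILE_CHECKSUMS and _fetch_remote are hypothetical names):

# keep name, URL and checksum in lockstep with a single zip
for file_name, file_url, file_checksum in zip(FILE_NAMES, FILE_URLS,
                                              FILE_CHECKSUMS):
    files.append(_fetch_remote(file_url, data_home, file_name,
                               file_checksum))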

(Member)

Also, a needless inconsistency with lfw.

@jnothman jnothman changed the title [MRG] RFC/ENH: Update dataset-fetching utilities [MRG] ENH: dataset-fetching with figshare, partial download and checksum Apr 30, 2017
@jnothman (Member)

Good work, @nelson-liu

@jnothman (Member)

Would you be able to fix this up to merge for the upcoming release, @nelson-liu? Or should we get someone else to do the final touch-ups?

@nelson-liu (Contributor Author)

Sorry for being so flaky --- it'd be good to get someone else to finish the last bits. I'll see if I can pitch in, though...

@ogrisel (Member) commented Jul 7, 2017

Closing in favor of #9240.

@ogrisel ogrisel closed this Jul 7, 2017
@nelson-liu nelson-liu deleted the use_figshare_in_datasets branch July 7, 2017 17:23