[MRG] ENH: dataset-fetching with use figshare, partial download and checksum #7429
Conversation
sorry for not following up on this. Will review shortly.
@amueller no problem, this sort of thing is pretty tedious to check.
From a code perspective this otherwise LGTM
sklearn/datasets/covtype.py (outdated)
@@ -33,8 +33,7 @@
 from ..utils import check_random_state

-URL = ('http://archive.ics.uci.edu/ml/'
-       'machine-learning-databases/covtype/covtype.data.gz')
+URL = ('https://ndownloader.figshare.com/files/5976039')
no need to keep parentheses. They were there for line-wrapped strings only.
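For illustration, the parentheses were only ever doing work for the old line-wrapped, implicitly-concatenated literals; the new single-line URLs can drop them:

```python
# Parentheses were needed only to wrap an implicitly-concatenated,
# line-wrapped string across two lines; a single literal needs none.
URL_WRAPPED = ('http://archive.ics.uci.edu/ml/'
               'machine-learning-databases/covtype/covtype.data.gz')
URL = 'https://ndownloader.figshare.com/files/5976039'
```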
sklearn/datasets/lfw.py (outdated)
@@ -44,14 +44,15 @@
 logger = logging.getLogger(__name__)

 BASE_URL = "http://vis-www.cs.umass.edu/lfw/"
Hmm. Can we assume no backwards compatibility issues with these names? I'm fine with this.
hmm, I believe so. I specifically made an attempt to maintain backwards compatibility by changing the associated variable names
sklearn/datasets/kddcup99.py (outdated)
@@ -29,11 +29,9 @@
 from ..utils import shuffle as shuffle_method

-URL10 = ('http://archive.ics.uci.edu/ml/'
-         'machine-learning-databases/kddcup99-mld/kddcup.data_10_percent.gz')
+URL10 = ('https://ndownloader.figshare.com/files/5976042')
parens
sklearn/datasets/kddcup99.py (outdated)
-URL = ('http://archive.ics.uci.edu/ml/'
-       'machine-learning-databases/kddcup99-mld/kddcup.data.gz')
+URL = ('https://ndownloader.figshare.com/files/5976045')
parens
sklearn/datasets/rcv1.py (outdated)
     'https://ndownloader.figshare.com/files/5976060',
     'https://ndownloader.figshare.com/files/5976057'
 ]
+URL_topics = ('https://ndownloader.figshare.com/files/5976048')
parens
@@ -64,8 +64,7 @@
 logger = logging.getLogger(__name__)

-URL = ("http://people.csail.mit.edu/jrennie/"
-       "20Newsgroups/20news-bydate.tar.gz")
+URL = ("https://ndownloader.figshare.com/files/5975967")
parens
I've then compared the MD5 sums produced and found them identical except for generated images. However, I've realised that those MD5 sum checks should be part of the fetch. Please add this:
I'm also concerned about what we lose by throwing away the original URLs. Should we use a mapping rather than replacing them?
@jnothman good idea, but I'm not too sure where you want me to add the MD5 checks. Could you elaborate a bit more?
Hmm... perhaps I'm being slow, but how would such a mapping work?
I would create a helper function that fetches a file from a particular URL and stores it at a path relative to data home, with an MD5 check. So it would be passed the URL and the expected MD5 among other args. Not sure if that's sufficient.
Not sure if mapping is necessary.
@jnothman what i ended up doing instead was running md5 on the downloaded archives rather than the raw datafiles. I basically just wrote a general fetch_dataset function that, given a URL, path, and md5, downloads the URL to a temp location, checks the archive / raw data with the given MD5, then moves it to the proper location. Do you think this would be suitable, or is it greatly preferred to run md5 on the extracted files?
That's okay, except that it doesn't catch the case where files later get corrupted. (Reply sent by email on 21 December 2016.)
@jnothman I initially had written a function to just check the archives, but I decided to check the hashes of the generated files as well. However, it seems like I get different md5 hashes than you do. For example,
(branch updated from d23e1d2 to 7186af8)
I agree. We should aim to merge this soon and certainly ensure it's available for the next release. Are you able to wrap it up @nelson-liu? Fix merge conflicts and I'll give it a final look.
When were you planning for the next release? In a bit of a crunch right now, but I should be able to get to this next weekend (unfortunately not the coming one)
Current vague release plan is "before June". I hope we can actually make it happen. So this is urgent only from the perspective of users seeking solutions in master to current broken functionality. We can get someone else to put in any finishing touches if you'd rather. (Reply sent by email on 5 April 2017.)
Bump?
Oops, this fell off my radar; I'll do it today. Thanks for the reminder @jnothman
@jnothman ok, I think it should be good now (pending CI)... I manually reran some of the
CircleCI error seems genuine:
@lesteve thanks! I think it'd be best to remove the validation on the actual processed file, since I also realized that the hash of the generated file would change depending on python version (2 vs 3)...
appveyor failure looks spurious... does someone want to rerun that test or should i just push a spurious commit?
I'm sure we'll find something to nitpick about to make your next commit not spurious.
I would appreciate if there were a bit more consistency across fetch functions as to how the checksum, URL and local path are related in code. We could of course just store this in a table globally, but I'm not sure that's so elegant either.
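One way to get that consistency across fetchers would be a small shared record type rather than a global table; purely a sketch (the type name, field names, and checksum placeholder are all hypothetical):

```python
from collections import namedtuple

# One record per remote resource ties local filename, URL and checksum
# together, so every fetch function can consume the same structure.
RemoteFile = namedtuple("RemoteFile", ["filename", "url", "checksum"])

COVTYPE_ARCHIVE = RemoteFile(
    filename="covtype.data.gz",
    url="https://ndownloader.figshare.com/files/5976039",
    checksum="<md5-of-archive>",  # hypothetical placeholder, not a real hash
)
```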
if expected_checksum != _md5(path):
    # remove the corrupted file
    remove(path)
    raise ValueError("{} has an MD5 hash differing "
IOError?
                  "{}-").format(existing_size)):
        raise IOError("Server does not support the HTTP Range "
                      "header, cannot resume download.")
except:
Use except Exception: instead; a bare except: also catches KeyboardInterrupt.
# get the amount of path_temp we've downloaded
existing_size = getsize(path_temp)
print("Resuming download from previous temp file, "
      "already have {} bytes".format(existing_size))
I'd prefer these on stderr
except:
    # delete the temp file and retry download of whole file
    remove(path_temp)
    print("Attempting to re-download file.")
stderr?
mention "after {!r}".format(exc)?
temp_file = open(path_temp, "ab")
# get the amount of path_temp we've downloaded
existing_size = getsize(path_temp)
print("Resuming download from previous temp file, "
mention URL?
from ..externals import joblib

DATA_URL = "http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz"
Perhaps it is appropriate to leave the original URL in a comment.
(Or indeed to have it available as a mirror??)
@@ -95,20 +88,21 @@ def fetch_california_housing(data_home=None, download_if_missing=True):
     raise IOError("Data not found and `download_if_missing` is False")

     print('downloading Cal. housing from %s to %s' % (DATA_URL, data_home))
     archive_fileobj = BytesIO(urlopen(DATA_URL).read())
     archive_path = join(data_home, "cal_housing.tgz")
     expected_checksum = "130d0eececf165046ec4dc621d121d80"
It seems appropriate to me for this to appear as a global with the URL.
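i.e. something along these lines at module level (the checksum is the one quoted in the diff above; the constant name is a suggestion, not the PR's code):

```python
# Keeping the URL and its expected checksum side by side at module level
# makes the pairing obvious and easy to update together.
DATA_URL = "http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz"
DATA_CHECKSUM = "130d0eececf165046ec4dc621d121d80"  # MD5 of cal_housing.tgz
```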
URL = ('http://archive.ics.uci.edu/ml/'
       'machine-learning-databases/kddcup99-mld/kddcup.data.gz')
URL10 = 'https://ndownloader.figshare.com/files/5976042'

why the blank line
URL = ('http://archive.ics.uci.edu/ml/'
       'machine-learning-databases/kddcup99-mld/kddcup.data.gz')
URL10 = 'https://ndownloader.figshare.com/files/5976042'
I think it is now less clear what 10 means. Rename the variable to URL_10_PERCENT
 files = []
-for file_url in file_urls:
+for file_name, file_url in zip(FILE_NAMES, FILE_URLS):
If you already have zip, why not zip the checksums in too, rather than using a different data structure?
also, a needless inconsistency with lfw.
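Zipping the checksums in alongside the names and URLs would keep the three parallel sequences in lockstep; a sketch with made-up values (the real rcv1 lists and hashes differ):

```python
# Hypothetical parallel lists standing in for the module-level constants.
FILE_NAMES = ["part0.dat.gz", "part1.dat.gz"]
FILE_URLS = ["https://ndownloader.figshare.com/files/5976060",
             "https://ndownloader.figshare.com/files/5976057"]
CHECKSUMS = ["md5-of-part0", "md5-of-part1"]  # placeholder hashes

files = []
for file_name, file_url, checksum in zip(FILE_NAMES, FILE_URLS, CHECKSUMS):
    # each iteration sees one coherent (name, url, checksum) triple,
    # instead of carrying the checksums in a separate data structure
    files.append((file_name, file_url, checksum))
```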
Good work, @nelson-liu
Would you be able to fix this up to merge for the upcoming release, @nelson-liu? Or should we get someone else to do the final touch-ups?
sorry for being so flaky --- it'd be good to get someone else to finish the last bits. I'll see if I can pitch in, though...
Closing in favor of #9240.
Reference Issue
addresses #7425, #8089

What does this implement/fix? Explain your changes.
Moved the datasets downloaded by the fetchers (fetch_...) to figshare and changed their URLs accordingly.

Any other comments?
I'm pretty sure CI doesn't test datasets that aren't already downloaded to my machine; is there a way to test this PR besides going into all of the tests and setting download_if_missing=True?
@amueller I emailed the figshare details to your gmail (t3kcit)