[MRG+1] ENH: dataset-fetching with use figshare and checksum #9240

massich · 2017-06-28T15:08:51Z

Reference Issue

Fixes #7425. Fixes #8089

What does this implement/fix? Explain your changes.

Adds all of the datasets (fetch_...) to figshare and changes their URLs accordingly.
Implements SHA256 checking of both the raw downloaded files and the produced datasets.
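
The checksum step described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the helper names `_sha256` and `check_file`, the chunk size, and the throwaway sample file are all hypothetical.

```python
import hashlib
import os
import tempfile


def _sha256(path, chunk_size=8192):
    """Return the hex SHA256 digest of the file at `path` (hypothetical helper)."""
    sha256hash = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha256hash.update(chunk)
    return sha256hash.hexdigest()


def check_file(path, expected_checksum):
    """Raise IOError if the file's digest differs from the expected one."""
    actual = _sha256(path)
    if actual != expected_checksum:
        raise IOError(
            "{} has SHA256 {} but {} was expected; the file may be "
            "corrupted.".format(path, actual, expected_checksum)
        )
    return path


# Usage with a throwaway file standing in for a downloaded archive.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example payload")
expected = hashlib.sha256(b"example payload").hexdigest()
checked_path = check_file(tmp.name, expected)
os.remove(tmp.name)
```

Storing the expected digest next to the URL means a corrupted or silently changed file is reported as an error rather than being loaded.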

Any other comments?

This PR takes over (Fixes #7429)

cc: @nelson-liu, @jnothman

jnothman · 2017-06-28T15:11:57Z

Thanks for this. Much appreciated.

amueller · 2017-06-28T15:13:20Z

do we want this for the release?

massich · 2017-06-28T15:21:59Z

it is labeled as such

jnothman · 2017-06-28T15:22:29Z

I would like it if it's close enough: it avoids people complaining when less reliable servers break.

lesteve · 2017-08-02T07:24:33Z

Before we merge: one problem with logging is that it does not behave the same way on Python 2.7 and Python 3 (I have to say I was not aware of that, or maybe I forgot):

import logging

logger = logging.getLogger('logger')
logger.warn('warning')

Python 2.7:

No handlers could be found for logger "logger"

Python 3:

warning

For this PR in particular that means that on Python 2.7 no messages will be printed about the datasets being downloaded.
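
On Python 3 the message appears because the logging module falls back on a handler of last resort when no handler is configured. A small sketch of that mechanism, for Python 3 only; the logger name `demo_no_handlers` is made up for illustration:

```python
import logging

logger = logging.getLogger("demo_no_handlers")  # hypothetical name

# No handler has been attached to this logger explicitly.
has_own_handlers = bool(logger.handlers)

# Python 3.2+ falls back on logging.lastResort, a stderr handler whose level
# is WARNING, when no handler is found anywhere in the logger hierarchy.
fallback = logging.lastResort
fallback_level = logging.getLevelName(fallback.level)

# With no level set on the logger, the effective level comes from the root
# logger (WARNING unless the application changed it), which is why
# logger.info(...) stays silent by default.
effective_level = logging.getLevelName(logger.getEffectiveLevel())
```

Python 2.7 has no such fallback, hence the "No handlers could be found" notice in place of the message.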

Here is what I propose (after a quick chat with @ogrisel):

we keep using logging in datasets
we add a logger in sklearn/__init__.py with level logging.INFO. The reason for this is that I would not categorize a dataset being downloaded as a warning. The only reason we use logger.warning is that we want the message to be visible to the user and that the default logging level is logging.WARNING.

Note that logging is only used in sklearn/datasets (and a few examples). I'll push a commit to this branch and then people can complain if they feel like it.

ogrisel · 2017-08-02T08:01:47Z

sklearn/__init__.py

+logger = logging.getLogger(__name__)
+if six.PY2:
+    logger.addHandler(logging.StreamHandler())
+logger.level = logging.INFO


Maybe add a small comment stating that this ensures we get an informative message on stderr by default when downloading a dataset on Python 2 (so as to replicate the default behavior of the logging module in Python 3).
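
A sketch of what such a commented configuration could look like, written for a hypothetical package named `mypkg` so that it does not touch the real `sklearn` logger. It uses `sys.version_info` instead of `six` to stay self-contained, and `setLevel` instead of assigning to `logger.level`; both are substitutions made here, not the code in this PR.

```python
import logging
import sys

logger = logging.getLogger("mypkg")  # stands in for logging.getLogger(__name__)

# Python 2 prints nothing (beyond a "No handlers could be found" notice) when
# a logger has no handler, whereas Python 3 falls back on a last-resort
# handler. Attaching a StreamHandler on Python 2 makes download messages
# visible there as well.
if sys.version_info[0] == 2:
    logger.addHandler(logging.StreamHandler())

# INFO rather than the default WARNING, so that logger.info(...) messages
# about a dataset being downloaded pass the logger's own level check.
logger.setLevel(logging.INFO)
```

One caveat worth keeping in mind: the Python 3 last-resort handler only emits records at WARNING and above, so with this conditional form an INFO message passes the logger's level check on Python 3 but is still not printed unless some handler is attached there too.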

jnothman · 2017-08-02T08:18:00Z

is there a way to make the handler only apply to logging emitted by scikit-learn?

ogrisel · 2017-08-02T09:31:35Z

is there a way to make the handler only apply to logging emitted by scikit-learn?

This is already the case:

logger = logging.getLogger(__name__)

is only used by sklearn components (__name__ is "sklearn" in the above snippet).

lesteve · 2017-08-02T09:37:22Z

is there a way to make the handler only apply to logging emitted by scikit-learn?

I think we get this behaviour as long as we use logger = logging.getLogger(__name__) consistently. If we configure the handler in sklearn/__init__.py, the logger called sklearn will have its handler and level configured correctly (a StreamHandler and a logging.INFO level). A logger in a scikit-learn module will be called something like sklearn.subpackage.module. Because logging works hierarchically, it will end up using the sklearn logger's configuration.
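
The hierarchical behaviour described here can be checked with a short sketch. The package name `examplepkg` and the `ListHandler` class are hypothetical and exist only so the example can capture records without touching the `sklearn` logger:

```python
import logging


class ListHandler(logging.Handler):
    """Collect (logger name, message) pairs in a list instead of printing."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append((record.name, record.getMessage()))


# Parent logger, configured once (as a package __init__ would do).
parent = logging.getLogger("examplepkg")
handler = ListHandler()
parent.addHandler(handler)
parent.setLevel(logging.INFO)

# Child logger, as created by logging.getLogger(__name__) in a submodule.
child = logging.getLogger("examplepkg.subpackage.module")
child.info("downloading dataset")

# The child has no handler or level of its own; it inherits the effective
# level from "examplepkg" and its records propagate up to the parent's handler.
child_level = child.getEffectiveLevel()
captured = handler.messages
```

Loggers outside the `examplepkg` namespace are unaffected, since the handler and level are set on that named logger and not on the root logger.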

amueller · 2017-08-02T16:14:13Z

sklearn/__init__.py

@@ -17,6 +17,11 @@
 import warnings
 import os
 from contextlib import contextmanager as _contextmanager
+import logging
+
+logger = logging.getLogger(__name__)


why not "sklearn" instead of __name__?

I guess this is just the general convention, right?

I found this in the Python docs:

A good convention to use when naming loggers is to use a module-level logger, in each module which uses logging, named as follows:

logger = logging.getLogger(__name__)

and this from the Hitchhiker's guide to Python:

Best practice when instantiating loggers in a library is to only create them using the __name__ global variable: the logging module creates a hierarchy of loggers using dot notation, so using __name__ ensures no name collisions.

Move logger.warning to logger.info and logger.info to logger.debug [doc build]

ogrisel

LGTM. +1 for merge and backport to 0.19.X. @amueller any further comment?

lesteve · 2017-08-03T12:30:52Z

Not sure whether it is entirely necessary, but 5d040fc may need to be backported to 0.19.X (Add download_if_missing argument to fetch_20newsgroups_vectorized).

amueller · 2017-08-03T16:20:27Z

Happy to merge now and then I'll go through all the necessary backports.

ogrisel · 2017-08-03T16:50:54Z

Merged, will do the backport to 0.19.X.

amueller · 2017-08-03T17:08:18Z

Cool, though I think it'd be good to do a final pass to see if everything is backported that we want.
Do you think we should do another RC or just release? We never released the RC conda-forge package...
I think I'm leaning towards "just release", we can always do 0.19.1. @jnothman ?

Though an issue in scikit-optimize kinda points toward re-thinking the decision of making _y_train_mean in GaussianProcessRegressor private :-/

ogrisel · 2017-08-04T11:30:50Z

+1 for just releasing, but I will have to look at the pickling error in scikit-optimize/scikit-optimize#461.

jnothman · 2017-08-06T02:41:34Z

I think one of us should compile a PR with everything from master that is safe to go into 0.19, to be reviewed, merged and released. I'm happy to have a go.

nelson-liu and others added 25 commits December 23, 2016 21:27

add 20newsgroups dataset to figshare

773f0c5

made link less verbose

a61c20f

add olivetti to figshare

9e64651

add lfw to figshare

b4866e6

add california housing dataset to figshare

7068152

add covtype dataset to figshare

2082655

add kddcup99 dataset to figshare

ff83bd1

add species distribution dataset to figshare

59eae87

add rcv1 dataset

f33a52c

remove extraneous parens from url strings

dfe24f9

check md5 of datasets and add resume functionality to downloads

7186af8

remove extraneous print statements

4dc8946

fix flake8 violations

7260f73

add docstrings to new dataset fetching functions

f2c44ee

consolidate imports in base and use md5 check function in dl

f6e6ce7

remove accidentally removed import

983544e

attempt to fix docstring conventions / handle case where range header not supported

03f7f82

change functions to used renamed, privatized utilities

9d39dd0

fix flake8 indentation error

5eadb3a

remove checks for joblib dumped files

79a0325

fix error in lfw

29deaa5

Merge branch 'master' into use_figshare_in_datasets

269d028

Add missing Bunch import in california housing

773aa48

Remove hash validation of 20news output pkl

11c15db

Remove unused import

f367815

jnothman added this to the 0.19 milestone Jun 28, 2017

ogrisel reviewed Aug 2, 2017

lesteve force-pushed the use_figshare_in_datasets branch 2 times, most recently from 0d0664d to 2037f34 Compare August 2, 2017 08:06

lesteve force-pushed the use_figshare_in_datasets branch from 2037f34 to 91b7c3b Compare August 2, 2017 09:41

amueller reviewed Aug 2, 2017

Configure root logger in sklearn/__init__.py

6daa256

Move logger.warning to logger.info and logger.info to logger.debug [doc build]

lesteve force-pushed the use_figshare_in_datasets branch from 91b7c3b to 6daa256 Compare August 2, 2017 18:07

ogrisel approved these changes Aug 3, 2017

ogrisel merged commit c1eee27 into scikit-learn:master Aug 3, 2017

ogrisel pushed a commit that referenced this pull request Aug 3, 2017

ENH: dataset-fetching with use figshare and checksum (#9240)

41ae36c

TomDLT mentioned this pull request Aug 4, 2017

HTTP Error 404: Not Found when calling fetch_rcv1() #9490

Closed

dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

ENH: dataset-fetching with use figshare and checksum (scikit-learn#9240)

b79ec41

dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

ENH: dataset-fetching with use figshare and checksum (scikit-learn#9240)

b26f71d

paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

ENH: dataset-fetching with use figshare and checksum (scikit-learn#9240)

58a3902

massich deleted the use_figshare_in_datasets branch August 24, 2017 16:13

AishwaryaRK pushed a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017

ENH: dataset-fetching with use figshare and checksum (scikit-learn#9240)

76b0479

maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

ENH: dataset-fetching with use figshare and checksum (scikit-learn#9240)

a5fb260

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

ENH: dataset-fetching with use figshare and checksum (scikit-learn#9240)

f9b358f

mdickinson mentioned this pull request Oct 3, 2019

Please don't configure logging in the top-level __init__.py #15122

Closed

thomasjpfan mentioned this pull request Oct 28, 2019

[MRG] MNT Replaces stream handler with null handler #15386

Closed
