Skip to content

[MRG+1] ENH: dataset-fetching with use figshare and checksum #9240

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 69 commits into from
Aug 3, 2017

Conversation

massich
Copy link
Contributor

@massich massich commented Jun 28, 2017

Reference Issue

Fixes #7425. Fixes #8089

What does this implement/fix? Explain your changes.

  • adds all of the datasets (fetch_...) to figshare and changed their URLs accordingly.
  • implements sha256 checking of both the raw downloaded files and the produced datasets

Any other comments?

This PR takes over (Fixes #7429)

cc: @nelson-liu, @jnothman

@jnothman
Copy link
Member

Thanks for this. Much appreciated.

@amueller
Copy link
Member

do we want this for the release?

@massich
Copy link
Contributor Author

massich commented Jun 28, 2017 via email

@jnothman jnothman added this to the 0.19 milestone Jun 28, 2017
@jnothman
Copy link
Member

I would like it if it's close enough: it avoids people complaining when less reliable servers break.

@lesteve
Copy link
Member

lesteve commented Aug 2, 2017

Before we merge, one of the problem with logging is that it does not have the same behaviour with python 2.7 and python 3 (I was not aware of that I have to say or maybe I forgot):

import logging

logger = logging.getLogger('logger')
logger.warn('warning')

Python 2.7:

No handlers could be found for logger "logger"

Python 3:

warning

For this PR in particular that means that on Python 2.7 no messages will be printed about the datasets being downloaded.

Here is what I propose (after a quick chat with @ogrisel):

  • we keep using logging in datasets
  • we add a logger in sklearn/__init__.py with level logging.INFO. The reason for this is that I would not categorize a dataset being downloaded as a warning. The only reason we use logger.warning is that we want the message to be visible by the user and that the default logging level is logging.WARNING.

Note logging is only used in sklearn/datasets (and a few examples) I'll push a commit in this branch and then people can complain if they feel like it.

logger = logging.getLogger(__name__)
if six.PY2:
logger.addHandler(logging.StreamHandler())
logger.level = logging.INFO
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a small comment to state that this ensures that we get a informative message on stdout by default when downloading a dataset on Python 2 (so as to replicate the default behavior of the logging module in Python 3).

@lesteve lesteve force-pushed the use_figshare_in_datasets branch 2 times, most recently from 0d0664d to 2037f34 Compare August 2, 2017 08:06
@jnothman
Copy link
Member

jnothman commented Aug 2, 2017 via email

@ogrisel
Copy link
Member

ogrisel commented Aug 2, 2017

is there a way to make the handler only apply to logging emitted by scikit-learn?

This is already the case:

logger = logging.getLogger(__name__)

is only used by sklearn components (__name__ is "sklearn" in the above snippet).

@lesteve
Copy link
Member

lesteve commented Aug 2, 2017

is there a way to make the handler only apply to logging emitted by scikit-learn?

I think this is going to have this behaviour as long as we use logger = logging.getLogger(__name__) consistently. If we configure the handler in sklearn/__init__.py the logger called sklearn will have its handlers configured correctly (a StreamHandler + a logging.INFO level). A logger in a scikit-learn module will be called something like sklearn.subpackage.module. Because logging works in a hierarchical way it will end-up using the sklearn logger.

@lesteve lesteve force-pushed the use_figshare_in_datasets branch from 2037f34 to 91b7c3b Compare August 2, 2017 09:41
@@ -17,6 +17,11 @@
import warnings
import os
from contextlib import contextmanager as _contextmanager
import logging

logger = logging.getLogger(__name__)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not "sklearn" instead of __name__?

Copy link
Member

@lesteve lesteve Aug 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is the just the general convention right?

I found this from the Python doc

A good convention to use when naming loggers is to use a module-level logger, in each module which uses logging, named as follows:

logger = logging.getLogger(__name__)

and this from the Hitchhiker's guide to Python:

Best practice when instantiating loggers in a library is to only create them using the __name__ global variable: the logging module creates a hierarchy of loggers using dot notation, so using __name__ ensures no name collisions.

Move logger.warning to logger.info and logger.info to logger.debug

[doc build]
@lesteve lesteve force-pushed the use_figshare_in_datasets branch from 91b7c3b to 6daa256 Compare August 2, 2017 18:07
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. +1 for merge and backport to 0.19.X. @amueller any further comment?

@lesteve
Copy link
Member

lesteve commented Aug 3, 2017

Not sure whether it is entirely necessary but 5d040fc may need to be backported in 0.19.X (Add download_if_missing argument to fetch_20newsgroups_vectorized)

@amueller
Copy link
Member

amueller commented Aug 3, 2017

Happy to merge now and then I'll go through all the necessary backports.

@ogrisel ogrisel merged commit c1eee27 into scikit-learn:master Aug 3, 2017
@ogrisel
Copy link
Member

ogrisel commented Aug 3, 2017

Merged, will do the backport to 0.19.X.

@amueller
Copy link
Member

amueller commented Aug 3, 2017

Cool, though I think it'd be good to do a final pass to see if everything is backported that we want.
Do you think we should do another RC or just release? We never released the RC conda-forge package...
I think I'm leaning towards "just release", we can always do 0.19.1. @jnothman ?

Though an issue in scikit-optimize kinda points toward re-thinking the decision of making _y_train_mean in GaussianProcessRegressor private :-/

@ogrisel
Copy link
Member

ogrisel commented Aug 4, 2017

+1 for just release but I will have to have a look at the pickling error of scikit-optimize/scikit-optimize#461.

@jnothman
Copy link
Member

jnothman commented Aug 6, 2017 via email

dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
@massich massich deleted the use_figshare_in_datasets branch August 24, 2017 16:13
AishwaryaRK pushed a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Perform MD5 checks for fetched datasets where possible Upload all datasets we have loaders for to figshare?
7 participants