[MRG] Download and test datasets in cron job #16348


Merged
merged 20 commits into scikit-learn:master from VarIr:datasets on Mar 2, 2020

Conversation

@VarIr (Contributor) commented Jan 31, 2020

Reference Issues/PRs

Fixes #16340 (Dataset tests always skipped in CI)

What does this implement/fix? Explain your changes.

Download and test datasets in the Travis cron job. Otherwise, these tests are never run automatically.

Any other comments?

@VarIr changed the title from [WIP] Download and test datasets in cron job to [MRG] Download and test datasets in cron job on Feb 3, 2020
@VarIr requested a review from rth on February 3, 2020 10:48
@rth requested a review from thomasjpfan on February 5, 2020 17:17
@thomasjpfan (Member):

With this update to master, PRs may have a negative coverage result (when compared to master), because they will not run the network tests. This would be a false positive.

@rth (Member) commented Feb 11, 2020

> With this update to master, PRs may have a negative coverage result (when compared to master), because they will not run the network tests. This would be a false positive.

One alternative is to disable uploading of codecov results from nightly builds. It's already confusing since it doesn't run for each commit.

I feel that running currently-untested code is more important than having 100% accurate coverage results (which are already not fully trustworthy).

@thomasjpfan (Member):

Looking at the travis.yml, it does not look like we upload the coverage after all. (COVERAGE is not set)

@VarIr (Contributor, Author) commented Feb 14, 2020

Thanks for the feedback. In addition, I realized that tests might be skipped again if their order changes at some point. I therefore added wrappers to fetch the datasets, similar to what test_california_housing.py already does. These make sure that all tests download the datasets if required by the environment.
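
For illustration, roughly the pattern meant here (a hedged sketch with hypothetical names, not the exact code in this PR; test_california_housing.py wraps its fetcher in a similar way):

from os import environ

import pytest

from sklearn.datasets import fetch_california_housing


def fetch(*args, **kwargs):
    # Only download when the environment enables network tests
    # (SKLEARN_SKIP_NETWORK_TESTS=0); otherwise rely on a local copy.
    download_if_missing = environ.get('SKLEARN_SKIP_NETWORK_TESTS', '1') == '0'
    try:
        return fetch_california_housing(
            *args, download_if_missing=download_if_missing, **kwargs)
    except IOError:
        pytest.skip("California housing dataset can not be loaded.")


def test_fetch():
    data = fetch()
    assert data.data.shape == (20640, 8)

This way a test is skipped instead of failing when the data is not available locally and downloads are disabled.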

EDIT: I don't know what is going on with the failing doc-min-dependencies job; it seems unrelated?

@thomasjpfan (Member):

Merging with master should resolve the CircleCI issue.

@thomasjpfan (Member) left a comment:

How do you feel about putting the logic in sklearn/datasets/tests/conftest.py as a fixture?

# sklearn/datasets/tests/conftest.py
from os import environ

import pytest

from sklearn.datasets import fetch_kddcup99 as _fetch_kddcup99


def _wrapped_fetch(f, dataset_name):
    """Fetch a dataset, downloading it only if the environment allows network tests."""
    download_if_missing = environ.get('SKLEARN_SKIP_NETWORK_TESTS', '1') == '0'

    def wrapped(*args, **kwargs):
        kwargs['download_if_missing'] = download_if_missing
        try:
            return f(*args, **kwargs)
        except IOError:
            pytest.skip("Download {} to run this test".format(dataset_name))

    return wrapped


@pytest.fixture
def fetch_kddcup99():
    # Return the wrapped *imported* fetcher, not this fixture itself.
    return _wrapped_fetch(_fetch_kddcup99, dataset_name='kddcup99')

# Define the other fixtures here as well.

And then in the test we can do

def test_percent10(fetch_kddcup99):
    data = fetch_kddcup99()

@VarIr (Contributor, Author) commented Feb 17, 2020

Gladly; this is much cleaner. Thanks!

Some tests were still skipped because pandas is available (e.g. test_pandas_dependency_message). I introduced a fixture that makes importing pandas raise an ImportError for a specific test, even if pandas is installed, so those tests now run. It's a bit out of scope, though, since datasets that don't require downloading are involved as well. I can move these changes to a new PR if you prefer.
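
Roughly, such a fixture can be implemented by monkeypatching the built-in import. This is only an illustrative sketch (the fixture name and details are hypothetical, not necessarily what this PR adds):

import builtins

import pytest


@pytest.fixture
def hide_available_pandas(monkeypatch):
    """Make 'import pandas' raise ImportError even when pandas is installed."""
    import_orig = builtins.__import__

    def mocked_import(name, *args, **kwargs):
        if name == 'pandas':
            raise ImportError()
        return import_orig(name, *args, **kwargs)

    monkeypatch.setattr(builtins, '__import__', mocked_import)

A test that exercises the no-pandas code path can then simply request hide_available_pandas as an argument.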

@thomasjpfan (Member):

> Some tests were still skipped because pandas is available (e.g. test_pandas_dependency_message). I introduced a fixture that makes importing pandas raise an ImportError for a specific test, even if pandas is installed, so those tests now run. It's a bit out of scope, though, since datasets that don't require downloading are involved as well. I can move these changes to a new PR if you prefer.

I think it is out of scope; I would be happy to include this feature in another PR. On our CI, we have builds that do not have pandas installed, precisely to test checks that require pandas to be absent.

@VarIr (Contributor, Author) commented Feb 25, 2020

I reverted the pandas-related changes and created a new PR for them.
I think this is now ready for another round of reviews.

@rth (Member) left a comment:

Thanks @VarIr! I'm not fully convinced this is the right use case for fixtures. To me, fixtures are for setting up an environment (e.g. creating test data) in which a test can run. Having tests that test fixtures is more confusing, as it means the pytest machinery will show up in the traceback for errors.

Still, since it was suggested in the earlier review, let's go forward with it. LGTM, aside from one comment below.

@VarIr (Contributor, Author) commented Feb 27, 2020

Renamed the fixtures. Unfortunately, the PR labeler still fails after merging master.

@rth (Member) commented Feb 27, 2020

> Unfortunately, the PR labeler still fails after merging master.

Thanks. Never mind that: cf. #16520

@ogrisel (Member) left a comment:

This PR is already quite advanced so please feel free to ignore this comment. Still:

In the long run I would rather like to consolidate everything under Azure Pipelines CI and stop using Travis (and maybe even CircleCI). That would make it easier for maintainers, who would not have to deal with multiple logins on multiple CI systems with different permission schemes.

Azure Pipelines allows scheduled jobs, so we could run daily or weekly checks there.

@rth (Member) commented Mar 1, 2020

> In the long run I would rather like to consolidate everything under Azure Pipelines CI and stop using Travis (and maybe even CircleCI).

I think we should still merge this PR as is: most of the changes are independent of the CI provider. We can then migrate to Azure Pipelines once #16603 is merged.

@rth (Member) left a comment:

Thanks @VarIr!

@thomasjpfan (Member) left a comment:

Thank you @VarIr!

@thomasjpfan merged commit cd622df into scikit-learn:master on Mar 2, 2020
@VarIr deleted the datasets branch on March 2, 2020 07:48
@VarIr (Contributor, Author) commented Mar 2, 2020

Thank you all for reviewing!
