[MRG] Download and test datasets in cron job #16348


Merged
merged 20 commits into scikit-learn:master from VarIr:datasets on Mar 2, 2020

Conversation

@VarIr (Contributor) commented Jan 31, 2020

Reference Issues/PRs

Fixes #16340 (Dataset tests always skipped in CI)

What does this implement/fix? Explain your changes.

Download and test datasets in the Travis cron job. Otherwise, these tests are never run automatically.

Any other comments?

@VarIr changed the title from [WIP] Download and test datasets in cron job to [MRG] Download and test datasets in cron job on Feb 3, 2020
@VarIr requested a review from rth on February 3, 2020 10:48
@rth requested a review from thomasjpfan on February 5, 2020 17:17
@thomasjpfan (Member):

With this update to master, PRs may have a negative coverage result (when compared to master), because they will not run the network tests. This would be a false positive.

@rth (Member) commented Feb 11, 2020

> With this update to master, PRs may have a negative coverage result (when compared to master), because they will not run the network tests. This would be a false positive.

One alternative is to disable uploading of codecov results from nightly builds. It's already confusing since it doesn't run for each commit.

I feel that running currently-untested code is more important than having 100% accurate coverage results (which are already not fully trustworthy).

@thomasjpfan (Member):

Looking at the travis.yml, it does not look like we upload the coverage after all. (COVERAGE is not set)

@VarIr (Contributor, Author) commented Feb 14, 2020

Thanks for the feedback. In addition, I realized that tests might be skipped again if their order changes at some point. I therefore added wrappers to fetch the datasets, similar to what test_california_housing.py already does. These make sure that all tests download the datasets if required by the environment.
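
For illustration, roughly the pattern meant here (a hedged sketch with hypothetical names, not the exact code in this PR; test_california_housing.py wraps its fetcher in a similar way):

from os import environ

import pytest

from sklearn.datasets import fetch_california_housing


def fetch(*args, **kwargs):
    # Only download when the environment enables network tests
    # (SKLEARN_SKIP_NETWORK_TESTS=0); otherwise rely on a local copy.
    download_if_missing = environ.get('SKLEARN_SKIP_NETWORK_TESTS', '1') == '0'
    try:
        return fetch_california_housing(
            *args, download_if_missing=download_if_missing, **kwargs)
    except IOError:
        pytest.skip("California housing dataset can not be loaded.")


def test_fetch():
    data = fetch()
    assert data.data.shape == (20640, 8)

This way a test is skipped instead of failing when the data is not available locally and downloads are disabled.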

EDIT: I don't know what is going on with the failing doc-min-dependencies job; it seems unrelated?

@thomasjpfan (Member):

Merging with master should resolve the CircleCI issue.

@thomasjpfan (Member) left a comment:

How do you feel about putting the logic in sklearn/datasets/tests/conftest.py as a fixture?

# sklearn/datasets/tests/conftest.py
from os import environ

import pytest

from sklearn.datasets import fetch_kddcup99 as _fetch_kddcup99


def _wrapped_fetch(f, dataset_name):
    """Fetch a dataset, downloading it only if the environment allows network tests."""
    download_if_missing = environ.get('SKLEARN_SKIP_NETWORK_TESTS', '1') == '0'

    def wrapped(*args, **kwargs):
        kwargs['download_if_missing'] = download_if_missing
        try:
            return f(*args, **kwargs)
        except IOError:
            pytest.skip("Download {} to run this test".format(dataset_name))

    return wrapped


@pytest.fixture
def fetch_kddcup99():
    # Return the wrapped *imported* fetcher, not this fixture itself.
    return _wrapped_fetch(_fetch_kddcup99, dataset_name='kddcup99')

# Define the other fixtures here as well.

And then in the test we can do

def test_percent10(fetch_kddcup99):
    data = fetch_kddcup99()

@VarIr (Contributor, Author) commented Feb 17, 2020

Gladly; this is much cleaner. Thanks!

Some tests were still skipped because pandas is available (e.g. test_pandas_dependency_message). I introduced a fixture that makes importing pandas raise an ImportError for a specific test, even if pandas is installed, so those tests now run. It's a bit out of scope, though, since datasets that don't require downloading are involved as well. I can move these changes to a new PR if you prefer.
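
Roughly, such a fixture can be implemented by monkeypatching the built-in import. This is only an illustrative sketch (the fixture name and details are hypothetical, not necessarily what this PR adds):

import builtins

import pytest


@pytest.fixture
def hide_available_pandas(monkeypatch):
    """Make 'import pandas' raise ImportError even when pandas is installed."""
    import_orig = builtins.__import__

    def mocked_import(name, *args, **kwargs):
        if name == 'pandas':
            raise ImportError()
        return import_orig(name, *args, **kwargs)

    monkeypatch.setattr(builtins, '__import__', mocked_import)

A test that exercises the no-pandas code path can then simply request hide_available_pandas as an argument.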

@thomasjpfan (Member):

> Some tests were still skipped because pandas is available (e.g. test_pandas_dependency_message). I introduced a fixture that makes importing pandas raise an ImportError for a specific test, even if pandas is installed, so those tests now run. It's a bit out of scope, though, since datasets that don't require downloading are involved as well. I can move these changes to a new PR if you prefer.

I think it is out of scope; I would be happy to include this feature in another PR. On our CI, we have builds that do not have pandas installed, precisely to test checks that require pandas to be absent.

@VarIr (Contributor, Author) commented Feb 25, 2020

I reverted the pandas-related changes and created a new PR for them.
I think this is now ready for another round of reviews.

@rth (Member) left a comment:

Thanks @VarIr! I'm not fully convinced this is the right use case for fixtures. To me, fixtures are for setting up an environment (e.g. creating test data) in which a test can run. Having tests that test fixtures is more confusing, as it means the pytest machinery will show up in the traceback for errors.

Still, since it was suggested in the earlier review, let's go forward with it. LGTM, aside from one comment below.

@VarIr (Contributor, Author) commented Feb 27, 2020

Renamed the fixtures. Unfortunately, the PR labeler still fails after merging master.

@rth (Member) commented Feb 27, 2020

> Unfortunately, the PR labeler still fails after merging master.

Thanks. Never mind that: cf. #16520

@ogrisel (Member) left a comment:

This PR is already quite advanced so please feel free to ignore this comment. Still:

In the long run I would rather like to consolidate everything under Azure Pipelines CI and stop using Travis (and maybe even CircleCI). That would make it easier for maintainers, who would not have to deal with multiple logins on multiple CI systems with different permission schemes.

Azure Pipelines allows scheduled jobs, so we could run daily or weekly checks there.

@rth (Member) commented Mar 1, 2020

> In the long run I would rather like to consolidate everything under Azure Pipelines CI and stop using Travis (and maybe even CircleCI).

I think we should still merge this PR as is: most of the changes are independent of the CI provider. We can then migrate to Azure Pipelines once #16603 is merged.

@rth (Member) left a comment:

Thanks @VarIr!

@thomasjpfan (Member) left a comment:

Thank you @VarIr!

@thomasjpfan merged commit cd622df into scikit-learn:master on Mar 2, 2020
@VarIr deleted the datasets branch on March 2, 2020 07:48
@VarIr (Contributor, Author) commented Mar 2, 2020

Thank you all for reviewing!
