Skip to content

[MRG] Fix fetch_openml when ignore attributes are numeric #12330

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Oct 9, 2018

Conversation

janvanrijn
Copy link
Contributor

@janvanrijn janvanrijn commented Oct 8, 2018

Fixes #12329

I modularized the function that determines per column whether it is a valid data column (opposed to a target or ignore attribute). This functionality was before (in slightly different manifestations) on 2 different places in the code. This simplifies the main function.

I did not add an additional unit test, as that would require packaging more openml files etc. Personally I think that the fact that the code is more modular and the bug is removed adds to the readability and quality of the code. Let me know if you think otherwise, I can of course add a unit test.

@janvanrijn janvanrijn changed the title Fix fetch_openml when ignore attributes are numeric [MRG] Fix fetch_openml when ignore attributes are numeric Oct 8, 2018
@NicolasHug
Copy link
Member

If all goes green LGTM.

Not sure if it's worth adding a dataset to sklearn/datasets/tests/data/openml/ to test this (I checked and this indeed fixes the issue for sklearn.datasets.fetch_openml(data_id=1119)).

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best to test this, to avoid a regression.

X
Merge remote-tracking branch 'upstream/master' into fix_#12329
@janvanrijn
Copy link
Contributor Author

Btw, should I add a note in the what's new doc? In 0.20 or 0.21? I could append it to the previous entry that was merged by #12246

@jnothman
Copy link
Member

jnothman commented Oct 9, 2018 via email

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not checked that the test falls in master. I've also not checked how big the test data is (is it reasonably small?)

Otherwise LGTM

@@ -410,6 +410,25 @@ def test_fetch_openml_australian(monkeypatch, gzip_response):
)


@pytest.mark.parametrize('gzip_response', [True, False])
def test_fetch_openml_adultcensus(monkeypatch, gzip_response):
# Check because of the numeric row attribute
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note the issue number.

@@ -555,13 +567,12 @@ def fetch_openml(name=None, version='active', data_id=None, data_home=None,
arff = _download_data_arff(data_description['file_id'], return_sparse,
data_home)
arff_data = arff['data']
# nominal attributes is a dict mapping from the attribute name to the
# possible values. Includes also the target column
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note however that the target is popped off below

@janvanrijn
Copy link
Contributor Author

I left 10 observations in the dataset, making it to my opinion reasonably small.

@amueller amueller added this to the 0.20.1 milestone Oct 9, 2018
@amueller
Copy link
Member

amueller commented Oct 9, 2018

regression test fails on master, whole data is 24k, looks good.

@amueller amueller merged commit 03c3af5 into scikit-learn:master Oct 9, 2018
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Oct 15, 2018
…rn#12330)

* modularized data column functionality

* small bugfix

* removes redundant line breaks

* added some documentation on the added fn

* added additional comment on advice of Nicholas Hug

* added test case

* merged master into branch, and added small comments by Joel

* added doc item
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants