-
-
Notifications
You must be signed in to change notification settings - Fork 26k
[MRG] Fix fetch_openml when ignore attributes are numeric #12330
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
If all goes green LGTM. Not sure if it's worth adding a dataset to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Best to test this, to avoid a regression.
Btw, should I add a note in the what's new doc? In 0.20 or 0.21? I could append it to the previous entry that was merged by #12246 |
It can be a separate note in 0.20.1
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've not checked that the test falls in master. I've also not checked how big the test data is (is it reasonably small?)
Otherwise LGTM
@@ -410,6 +410,25 @@ def test_fetch_openml_australian(monkeypatch, gzip_response): | |||
) | |||
|
|||
|
|||
@pytest.mark.parametrize('gzip_response', [True, False]) | |||
def test_fetch_openml_adultcensus(monkeypatch, gzip_response): | |||
# Check because of the numeric row attribute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note the issue number.
sklearn/datasets/openml.py
Outdated
@@ -555,13 +567,12 @@ def fetch_openml(name=None, version='active', data_id=None, data_home=None, | |||
arff = _download_data_arff(data_description['file_id'], return_sparse, | |||
data_home) | |||
arff_data = arff['data'] | |||
# nominal attributes is a dict mapping from the attribute name to the | |||
# possible values. Includes also the target column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note however that the target is popped off below
I left 10 observations in the dataset, making it to my opinion reasonably small. |
regression test fails on master, whole data is 24k, looks good. |
…rn#12330) * modularized data column functionality * small bugfix * removes redundant line breaks * added some documentation on the added fn * added additional comment on advice of Nicholas Hug * added test case * merged master into branch, and added small comments by Joel * added doc item
Fixes #12329
I modularized the function that determines per column whether it is a valid data column (opposed to a target or ignore attribute). This functionality was before (in slightly different manifestations) on 2 different places in the code. This simplifies the main function.
I did not add an additional unit test, as that would require packaging more openml files etc. Personally I think that the fact that the code is more modular and the bug is removed adds to the readability and quality of the code. Let me know if you think otherwise, I can of course add a unit test.