Skip to content

ENH add as_frame functionality for toy datasets #15980

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 35 commits into from
Feb 2, 2020

Conversation

wconnell
Copy link
Contributor

@wconnell wconnell commented Dec 27, 2019

Reference Issues/PRs

Handles part of #10733
Related to #13902, #15486 and #15950

Work from WiMLDS SF sprint that was waiting for #15486 to close, thanks for the helpful work @gitsteph and @reshamas. Co-authored with @reganconnell.

What does this implement/fix? Explain your changes.

Adds as_frame functionality to all toy datasets including load_wine(), load_iris(), load_breast_cancer(), load_diabetes(), load_linnerud(), load_boston(), load_digits().

Any other comments?

Documentation conventions for fetch_california_housing() from #15950 were followed, although the doc structure describing attributes of the returned toy data Bunch and fetched data Bunch are different so needs review to enhance consistency. In relation, Bunch.target dataframe column name defaults to "target" when unspecified.

Previously there was inconsistency in what Bunch.target_names represents, sometimes returning the classes of a target variable, sometimes the target variable name(s). I specified this attribute only for classification datasets, so it returns the classes of a variable.

Some tests are generalized from those in sklearn/datasets/tests/test_california_housing.py and may be able to supersede those for future use.

wconnell and others added 11 commits December 26, 2019 14:25
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
…es, load_linnerud, load_boston

Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
@reshamas
Copy link
Member

Hello @wconnell: good to see this progressing.

This is the linting error I see here:

Running flake8 on the diff in the range 2287d2d22..4a4915566 (12 commit(s)):
--------------------------------------------------------------------------------
sklearn/datasets/tests/test_common.py:13:80: E501 line too long (100 > 79 characters)
        expected_msg = ('{} with as_frame=True requires pandas'.format(fetch_func_partial.__name__))
   

wconnell and others added 3 commits December 27, 2019 11:57
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a first pass. A couple of comments regarding the documentation.
I am a bit surprised that we always return DataFrame for the target while the original API in fetch_openml was to return a Series if we have a single column.

This change was introduced in fetch_california_housing. I will open an issue to discuss what is the desired output. We can wait for a consensus before to change anything on this side.

@cmarmo
Copy link
Contributor

cmarmo commented Jan 13, 2020

@wconnel do you plan to continue this PR?
A consensus has been reached about having DataFrames or Series when the data are 1D (see #16012). Thanks for you patience.

@wconnell
Copy link
Contributor Author

@cmarmo Yes, still planning to continue. Sounds good, thanks for the update!

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good but here are some (very redundant :) comments:

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple more nitpicks but otherwise LGTM. Please feel free to merge the current master into this branch to get the coverage report to be up to date.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @wconnell !

@ogrisel
Copy link
Member

ogrisel commented Jan 31, 2020

I do not understand why the codecov reports states that the check_as_frame utility function is never executed. Furthermore the diff does not show the full changes of this PR:

https://codecov.io/gh/scikit-learn/scikit-learn/compare/b24c61cc3cd639291e3a29c766af1c45fca68504...f532e650df892f27d9f1019eb95b975655b8f9d7/diff

Anyway I am still +1 for merge.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually codecov is right: those tests are always skipped because check_pandas_dependency_message skips the test if pandas is installed and check_as_frame skips the test if pandas is not installed.

The tests for the toy datasets that now support as_frame need to be refactored into 2 sets of tests, those who run when pandas is installed and the the others.

@ogrisel
Copy link
Member

ogrisel commented Jan 31, 2020

You can run the test with pytest -v sklearn/datasets/tests/test_base.py to see the details.

$ pytest -v sklearn/datasets/tests/test_base.py 
============================================================================= test session starts ==============================================================================
platform linux -- Python 3.7.6, pytest-5.3.2, py-1.8.1, pluggy-0.13.1 -- /home/ogrisel/miniconda3/envs/pylatest/bin/python
cachedir: .pytest_cache
rootdir: /home/ogrisel/code/scikit-learn, inifile: setup.cfg
plugins: dash-1.7.0, cov-2.8.1, timeout-1.3.4
collected 19 items                                                                                                                                                             

sklearn/datasets/tests/test_base.py::test_data_home PASSED                                                                                                               [  5%]
sklearn/datasets/tests/test_base.py::test_default_empty_load_files PASSED                                                                                                [ 10%]
sklearn/datasets/tests/test_base.py::test_default_load_files PASSED                                                                                                      [ 15%]
sklearn/datasets/tests/test_base.py::test_load_files_w_categories_desc_and_encoding PASSED                                                                               [ 21%]
sklearn/datasets/tests/test_base.py::test_load_files_wo_load_content PASSED                                                                                              [ 26%]
sklearn/datasets/tests/test_base.py::test_load_sample_images PASSED                                                                                                      [ 31%]
sklearn/datasets/tests/test_base.py::test_load_digits SKIPPED                                                                                                            [ 36%]
sklearn/datasets/tests/test_base.py::test_load_digits_n_class_lt_10 PASSED                                                                                               [ 42%]
sklearn/datasets/tests/test_base.py::test_load_sample_image PASSED                                                                                                       [ 47%]
sklearn/datasets/tests/test_base.py::test_load_missing_sample_image_error PASSED                                                                                         [ 52%]
sklearn/datasets/tests/test_base.py::test_load_diabetes SKIPPED                                                                                                          [ 57%]
sklearn/datasets/tests/test_base.py::test_load_linnerud SKIPPED                                                                                                          [ 63%]
sklearn/datasets/tests/test_base.py::test_load_iris SKIPPED                                                                                                              [ 68%]
sklearn/datasets/tests/test_base.py::test_load_wine SKIPPED                                                                                                              [ 73%]
sklearn/datasets/tests/test_base.py::test_load_breast_cancer SKIPPED                                                                                                     [ 78%]
sklearn/datasets/tests/test_base.py::test_load_boston PASSED                                                                                                             [ 84%]
sklearn/datasets/tests/test_base.py::test_loads_dumps_bunch PASSED                                                                                                       [ 89%]
sklearn/datasets/tests/test_base.py::test_bunch_pickle_generated_with_0_16_and_read_with_0_17 PASSED                                                                     [ 94%]
sklearn/datasets/tests/test_base.py::test_bunch_dir PASSED                                                                                                               [100%]

=========================================================================== short test summary info ============================================================================
SKIPPED [6] /home/ogrisel/code/scikit-learn/sklearn/datasets/tests/test_common.py:10: This test requires pandas to be not installed
======================================================================== 13 passed, 6 skipped in 0.65s =========================================================================

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I splitted the as_frame tests into to groups (when pandas is installed vs when pandas is missing). Now all the lines are covered and codecov is happy.

I also added an entry to document the new feature in what's new. LGTM on my end.

@glemaitre any more review / comment?

@wconnell
Copy link
Contributor Author

wconnell commented Feb 2, 2020

Thanks for taking care of that, learned quite a bit from your help. Hope to contribute more in the future!

@thomasjpfan thomasjpfan merged commit ce91b6a into scikit-learn:master Feb 2, 2020
@reshamas
Copy link
Member

reshamas commented Feb 2, 2020

This is great, @wconnell! This also means that all PRs from the SF sprint are now merged or closed. Thanks, everyone.

cc: @cmarmo @amueller

@reshamas
Copy link
Member

reshamas commented Feb 2, 2020

@ogrisel Are you able to remove the "Waiting for reviewer" label? Thank you.

@wconnell wconnell deleted the load_data-as_frame branch February 10, 2020 02:39
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 22, 2020
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants