ENH add as_frame functionality for toy datasets #15980

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Merged

thomasjpfan merged 35 commits into scikit-learn:master from wconnell:load_data-as_frame

Feb 2, 2020

Contributor

wconnell commented Dec 27, 2019 •

edited

Loading

Reference Issues/PRs

Handles part of #10733
Related to #13902, #15486 and #15950

Work from WiMLDS SF sprint that was waiting for #15486 to close, thanks for the helpful work @gitsteph and @reshamas. Co-authored with @reganconnell.

What does this implement/fix? Explain your changes.

Adds as_frame functionality to all toy datasets including load_wine(), load_iris(), load_breast_cancer(), load_diabetes(), load_linnerud(), load_boston(), load_digits().

Any other comments?

Documentation conventions for fetch_california_housing() from #15950 were followed, although the doc structure describing attributes of the returned toy data Bunch and fetched data Bunch are different so needs review to enhance consistency. In relation, Bunch.target dataframe column name defaults to "target" when unspecified.

Previously there was inconsistency in what Bunch.target_names represents, sometimes returning the classes of a target variable, sometimes the target variable name(s). I specified this attribute only for classification datasets, so it returns the classes of a variable.

Some tests are generalized from those in sklearn/datasets/tests/test_california_housing.py and may be able to supersede those for future use.

wconnell and others added 11 commits

December 26, 2019 14:25


          as_frame functionality, test

2c88d93

Co-authored-by: @reganconnell <reganconnell@berkeley.edu>


          as_frame functionality for load_iris, load_breast_cancer, load_diabet…

29ce0be

…es, load_linnerud, load_boston

Co-authored-by: @reganconnell <reganconnell@berkeley.edu>


          generalized check_as_frame test

66c9612

Co-authored-by: @reganconnell <reganconnell@berkeley.edu>


          implemented tests for as_frame

12578ae

Co-authored-by: @reganconnell <reganconnell@berkeley.edu>


          Merge remote-tracking branch 'upstream/master' into load_data-as_frame

c8c5662


          fix target column headers

7b91b4b

Co-authored-by: @reganconnell <reganconnell@berkeley.edu>


          fix target column names of load_boston

4b3f421

Co-authored-by: @reganconnell <reganconnell@berkeley.edu>


          add as_frame to load_digits

8f88370

Co-authored-by: @reganconnell <reganconnell@berkeley.edu>


          test coverage

6014fdf

Co-authored-by: @reganconnell <reganconnell@berkeley.edu>


          linting revisions

90947e6

Co-authored-by: @reganconnell <reganconnell@berkeley.edu>


          target_names specified only for classifcation datasets

eb68ef0

Co-authored-by: @reganconnell <reganconnell@berkeley.edu>

Member

reshamas commented Dec 27, 2019

Hello @wconnell: good to see this progressing.

This is the linting error I see here:

Running flake8 on the diff in the range 2287d2d22..4a4915566 (12 commit(s)):
--------------------------------------------------------------------------------
sklearn/datasets/tests/test_common.py:13:80: E501 line too long (100 > 79 characters)
        expected_msg = ('{} with as_frame=True requires pandas'.format(fetch_func_partial.__name__))

wconnell and others added 3 commits

December 27, 2019 11:57


          remove test for linnerud target names

f2a2791

Co-authored-by: @reganconnell <reganconnell@berkeley.edu>


          remove partial from pandas_dependency test

822ae80

Co-authored-by: @reganconnell <reganconnell@berkeley.edu>


          linting issue, line length

767c8c2

Co-authored-by: @reganconnell <reganconnell@berkeley.edu>

glemaitre requested changes

View reviewed changes

Member

glemaitre left a comment

I made a first pass. A couple of comments regarding the documentation.
I am a bit surprised that we always return DataFrame for the target while the original API in fetch_openml was to return a Series if we have a single column.

This change was introduced in fetch_california_housing. I will open an issue to discuss what is the desired output. We can wait for a consensus before to change anything on this side.

sklearn/datasets/_base.py Outdated Show resolved Hide resolved

sklearn/datasets/_base.py Outdated Show resolved Hide resolved

sklearn/datasets/_base.py Outdated Show resolved Hide resolved

sklearn/datasets/_base.py Outdated Show resolved Hide resolved

sklearn/datasets/_base.py Outdated Show resolved Hide resolved

sklearn/datasets/_base.py Outdated Show resolved Hide resolved

sklearn/datasets/_base.py Outdated Show resolved Hide resolved

sklearn/datasets/_base.py Outdated Show resolved Hide resolved

sklearn/datasets/tests/test_common.py Outdated Show resolved Hide resolved

sklearn/datasets/tests/test_common.py Outdated Show resolved Hide resolved

Contributor

cmarmo commented Jan 13, 2020

@wconnel do you plan to continue this PR?
A consensus has been reached about having DataFrames or Series when the data are 1D (see #16012). Thanks for you patience.

Contributor Author

wconnell commented Jan 13, 2020

@cmarmo Yes, still planning to continue. Sounds good, thanks for the update!

wconnell added 6 commits

January 20, 2020 14:44


          Merge remote-tracking branch 'upstream/master' into load_data-as_frame

b72e520


          return series when target is 1D

ca67513


          reviewed changes for test

088374e


          documentation reflects descriptive bunch attributes

0689d32


          linting

a6f15ca


          documenation guideline fixes

a63341b

jnothman mentioned this pull request

[WIP] Add parameter as_frame to load_data_xxx to return data frames #10972

Closed

ogrisel added the Waiting for Reviewer label

ogrisel reviewed

View reviewed changes

Member

ogrisel left a comment

This is looking good but here are some (very redundant :) comments:

sklearn/datasets/_base.py Outdated Show resolved Hide resolved

sklearn/datasets/_base.py Outdated Show resolved Hide resolved

sklearn/datasets/_base.py Outdated Show resolved Hide resolved

sklearn/datasets/_base.py Outdated Show resolved Hide resolved

sklearn/datasets/_base.py Outdated Show resolved Hide resolved

sklearn/datasets/_base.py Outdated Show resolved Hide resolved

sklearn/datasets/_base.py Outdated Show resolved Hide resolved

sklearn/datasets/_base.py Show resolved Hide resolved

sklearn/datasets/_base.py Outdated Show resolved Hide resolved

sklearn/datasets/tests/test_base.py Outdated Show resolved Hide resolved

ogrisel reviewed

View reviewed changes

sklearn/datasets/tests/test_common.py Outdated Show resolved Hide resolved

wconnell added 2 commits

January 29, 2020 15:08


          documentation fixes per review

3eee3d7


          testing revisions

870b4db

wconnell added 4 commits

January 29, 2020 15:52


          rename digits df by pixel row and columns

35c5bf0


          rollback boston dataset as_frame ENH and tests

7a283cb


          Merge remote-tracking branch 'upstream/master' into load_data-as_frame

17f9365


          correct naming of digits pixel columns

45b4faa

ogrisel approved these changes

View reviewed changes

Member

ogrisel left a comment

Couple more nitpicks but otherwise LGTM. Please feel free to merge the current master into this branch to get the coverage report to be up to date.

sklearn/datasets/tests/test_common.py Outdated Show resolved Hide resolved

sklearn/datasets/tests/test_common.py Outdated Show resolved Hide resolved

wconnell added 2 commits

January 30, 2020 09:23


          check column dtypes for bunch as_frame

4e1888b


          Merge remote-tracking branch 'upstream/master' into load_data-as_frame

b24c61c

thomasjpfan reviewed

View reviewed changes

Member

thomasjpfan left a comment

Thank you for the PR @wconnell !

sklearn/datasets/tests/test_common.py Outdated Show resolved Hide resolved

wconnell added 2 commits

January 30, 2020 10:58


          Merge remote-tracking branch 'upstream/master' into load_data-as_frame

9c9eac0


          check bunch as_frame dtype with paramaterized test

f532e65

Member

ogrisel commented Jan 31, 2020

I do not understand why the codecov reports states that the check_as_frame utility function is never executed. Furthermore the diff does not show the full changes of this PR:

https://codecov.io/gh/scikit-learn/scikit-learn/compare/b24c61cc3cd639291e3a29c766af1c45fca68504...f532e650df892f27d9f1019eb95b975655b8f9d7/diff

Anyway I am still +1 for merge.

ogrisel requested changes

View reviewed changes

Member

ogrisel left a comment

Actually codecov is right: those tests are always skipped because check_pandas_dependency_message skips the test if pandas is installed and check_as_frame skips the test if pandas is not installed.

The tests for the toy datasets that now support as_frame need to be refactored into 2 sets of tests, those who run when pandas is installed and the the others.

Member

ogrisel commented Jan 31, 2020

You can run the test with pytest -v sklearn/datasets/tests/test_base.py to see the details.

$ pytest -v sklearn/datasets/tests/test_base.py 
============================================================================= test session starts ==============================================================================
platform linux -- Python 3.7.6, pytest-5.3.2, py-1.8.1, pluggy-0.13.1 -- /home/ogrisel/miniconda3/envs/pylatest/bin/python
cachedir: .pytest_cache
rootdir: /home/ogrisel/code/scikit-learn, inifile: setup.cfg
plugins: dash-1.7.0, cov-2.8.1, timeout-1.3.4
collected 19 items                                                                                                                                                             

sklearn/datasets/tests/test_base.py::test_data_home PASSED                                                                                                               [  5%]
sklearn/datasets/tests/test_base.py::test_default_empty_load_files PASSED                                                                                                [ 10%]
sklearn/datasets/tests/test_base.py::test_default_load_files PASSED                                                                                                      [ 15%]
sklearn/datasets/tests/test_base.py::test_load_files_w_categories_desc_and_encoding PASSED                                                                               [ 21%]
sklearn/datasets/tests/test_base.py::test_load_files_wo_load_content PASSED                                                                                              [ 26%]
sklearn/datasets/tests/test_base.py::test_load_sample_images PASSED                                                                                                      [ 31%]
sklearn/datasets/tests/test_base.py::test_load_digits SKIPPED                                                                                                            [ 36%]
sklearn/datasets/tests/test_base.py::test_load_digits_n_class_lt_10 PASSED                                                                                               [ 42%]
sklearn/datasets/tests/test_base.py::test_load_sample_image PASSED                                                                                                       [ 47%]
sklearn/datasets/tests/test_base.py::test_load_missing_sample_image_error PASSED                                                                                         [ 52%]
sklearn/datasets/tests/test_base.py::test_load_diabetes SKIPPED                                                                                                          [ 57%]
sklearn/datasets/tests/test_base.py::test_load_linnerud SKIPPED                                                                                                          [ 63%]
sklearn/datasets/tests/test_base.py::test_load_iris SKIPPED                                                                                                              [ 68%]
sklearn/datasets/tests/test_base.py::test_load_wine SKIPPED                                                                                                              [ 73%]
sklearn/datasets/tests/test_base.py::test_load_breast_cancer SKIPPED                                                                                                     [ 78%]
sklearn/datasets/tests/test_base.py::test_load_boston PASSED                                                                                                             [ 84%]
sklearn/datasets/tests/test_base.py::test_loads_dumps_bunch PASSED                                                                                                       [ 89%]
sklearn/datasets/tests/test_base.py::test_bunch_pickle_generated_with_0_16_and_read_with_0_17 PASSED                                                                     [ 94%]
sklearn/datasets/tests/test_base.py::test_bunch_dir PASSED                                                                                                               [100%]

=========================================================================== short test summary info ============================================================================
SKIPPED [6] /home/ogrisel/code/scikit-learn/sklearn/datasets/tests/test_common.py:10: This test requires pandas to be not installed
======================================================================== 13 passed, 6 skipped in 0.65s =========================================================================

ogrisel added 4 commits

February 2, 2020 16:40


          Split as_frame tests: pandas installed vs pandas missing

50f3246


          Merge branch 'master' into load_data-as_frame

37b3a8a


          PEP8

349af1c


          Document new feature in 0.23 changelog

7cf774f

ogrisel approved these changes

View reviewed changes

Member

ogrisel left a comment

I splitted the as_frame tests into to groups (when pandas is installed vs when pandas is missing). Now all the lines are covered and codecov is happy.

I also added an entry to document the new feature in what's new. LGTM on my end.

@glemaitre any more review / comment?


          Forgot load_digits in last commits

037fe04

Contributor Author

wconnell commented Feb 2, 2020

Thanks for taking care of that, learned quite a bit from your help. Hope to contribute more in the future!

thomasjpfan approved these changes

View reviewed changes

thomasjpfan merged commit ce91b6a into scikit-learn:master

Member

reshamas commented Feb 2, 2020

This is great, @wconnell! This also means that all PRs from the SF sprint are now merged or closed. Thanks, everyone.

cc: @cmarmo @amueller

Member

reshamas commented Feb 2, 2020 •

edited

Loading

@ogrisel Are you able to remove the "Waiting for reviewer" label? Thank you.

cmarmo removed the Waiting for Reviewer label

wconnell deleted the load_data-as_frame branch

February 10, 2020 02:39

thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request


          ENH add as_frame functionality for toy datasets (scikit-learn#15980)

7b08dd5

panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request


          ENH add as_frame functionality for toy datasets (scikit-learn#15980)

48699c6

ogrisel mentioned this pull request

API for returning datasets as DataFrames #10733

Closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet