-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
ENH add as_frame functionality for toy datasets #15980
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
…es, load_linnerud, load_boston Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Hello @wconnell: good to see this progressing. This is the linting error I see here: Running flake8 on the diff in the range 2287d2d22..4a4915566 (12 commit(s)):
--------------------------------------------------------------------------------
sklearn/datasets/tests/test_common.py:13:80: E501 line too long (100 > 79 characters)
expected_msg = ('{} with as_frame=True requires pandas'.format(fetch_func_partial.__name__))
|
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a first pass. A couple of comments regarding the documentation.
I am a bit surprised that we always return DataFrame
for the target while the original API in fetch_openml
was to return a Series
if we have a single column.
This change was introduced in fetch_california_housing
. I will open an issue to discuss what is the desired output. We can wait for a consensus before to change anything on this side.
@wconnel do you plan to continue this PR? |
@cmarmo Yes, still planning to continue. Sounds good, thanks for the update! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good but here are some (very redundant :) comments:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple more nitpicks but otherwise LGTM. Please feel free to merge the current master into this branch to get the coverage report to be up to date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR @wconnell !
I do not understand why the codecov reports states that the check_as_frame utility function is never executed. Furthermore the diff does not show the full changes of this PR: Anyway I am still +1 for merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually codecov is right: those tests are always skipped because check_pandas_dependency_message
skips the test if pandas is installed and check_as_frame
skips the test if pandas is not installed.
The tests for the toy datasets that now support as_frame
need to be refactored into 2 sets of tests, those who run when pandas is installed and the the others.
You can run the test with
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I splitted the as_frame tests into to groups (when pandas is installed vs when pandas is missing). Now all the lines are covered and codecov is happy.
I also added an entry to document the new feature in what's new. LGTM on my end.
@glemaitre any more review / comment?
Thanks for taking care of that, learned quite a bit from your help. Hope to contribute more in the future! |
@ogrisel Are you able to remove the "Waiting for reviewer" label? Thank you. |
Reference Issues/PRs
Handles part of #10733
Related to #13902, #15486 and #15950
Work from WiMLDS SF sprint that was waiting for #15486 to close, thanks for the helpful work @gitsteph and @reshamas. Co-authored with @reganconnell.
What does this implement/fix? Explain your changes.
Adds
as_frame
functionality to all toy datasets includingload_wine(), load_iris(), load_breast_cancer(), load_diabetes(), load_linnerud(), load_boston(), load_digits()
.Any other comments?
Documentation conventions for
fetch_california_housing()
from #15950 were followed, although the doc structure describing attributes of the returned toy dataBunch
and fetched dataBunch
are different so needs review to enhance consistency. In relation,Bunch.target
dataframe column name defaults to "target" when unspecified.Previously there was inconsistency in what
Bunch.target_names
represents, sometimes returning the classes of a target variable, sometimes the target variable name(s). I specified this attribute only for classification datasets, so it returns the classes of a variable.Some tests are generalized from those in
sklearn/datasets/tests/test_california_housing.py
and may be able to supersede those for future use.