-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[WIP] Add parameter as_frame to load_data_xxx to return data frames #10972
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should do the transformation before return_X_y is handled, and still allow returning a Bunch. Alternatively, we could consider allow return_Xy='pandas'
as how we activate this feature instead of a separate parameter.
I'm not sure whether y should be a Series. Please argue for/against it.
sklearn/datasets/base.py
Outdated
@@ -21,6 +21,7 @@ | |||
from ..utils import check_random_state | |||
|
|||
import numpy as np | |||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be an optional import. Perhaps it should only be loaded locally within each function as part of the validation of as_frame
.
Thanks for the draft. It also needs tests. |
@jnothman Added a test and implemented partial import as you suggested. I assume that I need to do this for every dataset before it is ready to be merged right? How does it look so far? Thanks a lot for your help. |
sklearn/datasets/base.py
Outdated
target_series = Series(target, name="class") | ||
return data_frame, target_series | ||
except ImportError: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If as_frame is true, then we should indeed raise the Importer of, not pass it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i guess i am confused because i am thinking at the tests passing at the same time.
@@ -202,6 +203,17 @@ def test_load_iris(): | |||
check_return_X_y(res, partial(load_iris)) | |||
|
|||
|
|||
def test_load_iris_as_frame(): | |||
try: | |||
data_frame, target_series = load_iris(as_frame=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get why this would raise SkipTest. Why not use importorskip?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while looking at other fragments of the code i though this was the way to skip the pandas test. let me check importorskip.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pytest is relatively new to us. But it provides importorskip for this kind of application.
@@ -202,6 +203,17 @@ def test_load_iris(): | |||
check_return_X_y(res, partial(load_iris)) | |||
|
|||
|
|||
def test_load_iris_as_frame(): | |||
try: | |||
data_frame, target_series = load_iris(as_frame=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pytest is relatively new to us. But it provides importorskip for this kind of application.
data_frame, target_series = load_iris(as_frame=True) | ||
assert_equal(data_frame.shape, (150, 4)) | ||
assert_equal(target_series.shape[0], 150) | ||
except IOError as : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is invalid syntax. You need to remove as
@@ -322,7 +322,7 @@ def load_wine(return_X_y=False): | |||
'proline']) | |||
|
|||
|
|||
def load_iris(return_X_y=False): | |||
def load_iris(return_X_y=False, as_frame=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please document the parameter
@nsorros marking this as stalled for now, we'd be happy if you continue the work on this though. |
@nsorros @adrinjalali @daniel-cortez-stevenson anyone working on this? if it's stalled I'm interested to try working on it. |
go ahead from me. |
@nsorros Thanks. I'll start now. |
#15980 is open and active, so make sure to avoid duplication.
|
@jnothman Thanks for the warning. I will go slowly. |
This has been done in the meantime, so I suggest to close the PR. |
Thank you @bsipocz . Closing PR as this is resolved. |
Reference Issues/PRs
Fixes #10733
What does this implement/fix? Explain your changes.
This PR proposes a way to return datasets as data frames that incorporate information about the feature names and classes potentially. It follows a suggestion made by @jnothman but can go any direction if the community feels this is a valuable feature.