Skip to content

[WIP] Add parameter as_frame to load_data_xxx to return data frames #10972

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

nsorros
Copy link

@nsorros nsorros commented Apr 13, 2018

Reference Issues/PRs

Fixes #10733

What does this implement/fix? Explain your changes.

This PR proposes a way to return datasets as data frames that incorporate information about the feature names and classes potentially. It follows a suggestion made by @jnothman but can go any direction if the community feels this is a valuable feature.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do the transformation before return_X_y is handled, and still allow returning a Bunch. Alternatively, we could consider allow return_Xy='pandas' as how we activate this feature instead of a separate parameter.

I'm not sure whether y should be a Series. Please argue for/against it.

@@ -21,6 +21,7 @@
from ..utils import check_random_state

import numpy as np
import pandas as pd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be an optional import. Perhaps it should only be loaded locally within each function as part of the validation of as_frame.

@jnothman
Copy link
Member

Thanks for the draft. It also needs tests.

@nsorros
Copy link
Author

nsorros commented Apr 17, 2018

@jnothman Added a test and implemented partial import as you suggested. I assume that I need to do this for every dataset before it is ready to be merged right? How does it look so far? Thanks a lot for your help.

target_series = Series(target, name="class")
return data_frame, target_series
except ImportError:
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If as_frame is true, then we should indeed raise the Importer of, not pass it

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess i am confused because i am thinking at the tests passing at the same time.

@@ -202,6 +203,17 @@ def test_load_iris():
check_return_X_y(res, partial(load_iris))


def test_load_iris_as_frame():
try:
data_frame, target_series = load_iris(as_frame=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get why this would raise SkipTest. Why not use importorskip?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while looking at other fragments of the code i though this was the way to skip the pandas test. let me check importorskip.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pytest is relatively new to us. But it provides importorskip for this kind of application.

@@ -202,6 +203,17 @@ def test_load_iris():
check_return_X_y(res, partial(load_iris))


def test_load_iris_as_frame():
try:
data_frame, target_series = load_iris(as_frame=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pytest is relatively new to us. But it provides importorskip for this kind of application.

data_frame, target_series = load_iris(as_frame=True)
assert_equal(data_frame.shape, (150, 4))
assert_equal(target_series.shape[0], 150)
except IOError as :
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is invalid syntax. You need to remove as

@@ -322,7 +322,7 @@ def load_wine(return_X_y=False):
'proline'])


def load_iris(return_X_y=False):
def load_iris(return_X_y=False, as_frame=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document the parameter

@adrinjalali
Copy link
Member

@nsorros marking this as stalled for now, we'd be happy if you continue the work on this though.

@PimwipaV
Copy link

PimwipaV commented Jan 21, 2020

@nsorros @adrinjalali @daniel-cortez-stevenson anyone working on this? if it's stalled I'm interested to try working on it.

@nsorros
Copy link
Author

nsorros commented Jan 21, 2020

@nsorros @adrinjalali @daniel-cortez-stevenson anyone working on this? if it's stalled I'm interested to try working on it.

go ahead from me.

@PimwipaV
Copy link

@nsorros Thanks. I'll start now.

@jnothman
Copy link
Member

jnothman commented Jan 21, 2020 via email

@PimwipaV
Copy link

@jnothman Thanks for the warning. I will go slowly.

@bsipocz
Copy link
Contributor

bsipocz commented Jun 6, 2020

This has been done in the meantime, so I suggest to close the PR.

@thomasjpfan
Copy link
Member

Thank you @bsipocz . Closing PR as this is resolved.

@adrinjalali adrinjalali closed this Jun 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API for returning datasets as DataFrames
7 participants