Skip to content

[MRG] adding as_frame functionality for california housing dataset loader #15486

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 17 commits into from

Conversation

gitsteph
Copy link
Contributor

@gitsteph gitsteph commented Nov 2, 2019

Reference Issues/PRs

Handles part of #10733
(Loosely following conventions found in #13902)

What does this implement/fix? Explain your changes.

Adds as_frame functionality for the California Housing dataset loader (fetch_california_housing)

Any other comments?

No

@amueller
Copy link
Member

amueller commented Nov 2, 2019

the dataframe should also contain the target, right?

@gitsteph
Copy link
Contributor Author

gitsteph commented Nov 2, 2019

Ah, fixing a couple things now

@gitsteph gitsteph closed this Nov 2, 2019
@amueller
Copy link
Member

amueller commented Nov 2, 2019

you can make the change in this PR.

@gitsteph
Copy link
Contributor Author

gitsteph commented Nov 2, 2019

Will do; just noticed a couple additional things to fix so wanted to close it until they're handled. It's not ready for viewing yet 😅

@gitsteph gitsteph reopened this Nov 2, 2019
@gitsteph
Copy link
Contributor Author

gitsteph commented Nov 2, 2019

@amueller hope this is better 🙏
thanks for the feedback :)

target_names = ["MedHouseVal", ]
if as_frame:
columns = feature_names + target_names
adjusted_data = np.hstack((data, target[:,np.newaxis]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a good idea in case X and y have different data types. numpy doesn't allow different types, so you're making y be float here (if it was int) or X be object (if y is object).

And I think the approach of wrapping this in a reusable function is good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation with numpy worked for this specific dataset, but I agree that it wouldn't generalize well.
I'll try making this more generalizable -- maybe by constructing separate dataframes (one for X and one for y), and using pandas to concatenate them together instead of numpy. 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes something like that would probably work well. Have you tried working with @wconnell as well?

Copy link
Contributor Author

@gitsteph gitsteph Nov 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! I just made that proposed change (in the latest commit below) and tested it locally. The _convert_data_dataframe method in this PR hopefully will be able to generalize to some other cases.

I haven't tried working with @wconnell yet but am open to it! Will try to find them on gitter.

frame = None
target_names = ["MedHouseVal", ]
if as_frame:
frame, X, y = _convert_data_dataframe("fetch_california_housing", data, target, feature_names, target_names)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

break at 79 chars please

@amueller
Copy link
Member

amueller commented Nov 2, 2019

please add a test that the attribute exists and has the expected content. Otherwise looks good!

@@ -48,8 +49,17 @@
logger = logging.getLogger(__name__)


def _convert_data_dataframe(caller_name, data, target, feature_names, target_names):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this probably should live in sklearn/datasets/base.py

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just moved it after chatting with @wconnell :)
(I left it here temporarily since I wasn't sure it would work for all cases; it's in _base.py in the most recent commit)

@wconnell
Copy link
Contributor

wconnell commented Nov 2, 2019

When we implement this in the various load_* functions we will also have to cover the case when return_X_y=True and as_frame=True.

@amueller
Copy link
Member

amueller commented Nov 2, 2019

Indeed. In this case, the function should (apparently) return X and y separately as a DataFrame and a Series respectively.

@@ -252,6 +253,18 @@ def test_loads_dumps_bunch():
assert bunch_from_pkl['x'] == bunch_from_pkl.x


def test_fetch_asframe():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the test should be in test_california_housing.py I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha; thanks! Wasn't sure since it aims to test a function in the _base.py file. I'll move it now.

@@ -181,6 +179,9 @@ def fetch_california_housing(data_home=None, download_if_missing=True,
feature_names,
target_names)

if return_X_y:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this to better handle the case where both return_X_y and as_frame are True. If they are both true, this will return X and y as pandas objects.

Copy link
Contributor Author

@gitsteph gitsteph Nov 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TY to @wconnell for this comment => #15486 (comment)

@amueller
Copy link
Member

amueller commented Nov 2, 2019

test are failing

@gitsteph
Copy link
Contributor Author

gitsteph commented Nov 3, 2019

👌 I'll fix the remaining linting issues now.

🔍 As for the other failed tests -- it seems that the test environments don't have pandas installed (?). I'm not sure how to test the new functionality in that case... would you have any advice on how to proceed with that? (e.g. finding a way to install pandas into those test environments, or some other workaround?) . The openml test (test_openml.py) example uses pytest.importorskip. I'll... try doing that here too then, but doing so would mean that test would be skipped for the environments its missing pandas in.

@gitsteph
Copy link
Contributor Author

gitsteph commented Nov 3, 2019

For test_fetch_asframe -- I'm failing the coverage test (codecov/patch) because I added @pytest.mark.skipif in the last commit (45940b2), following the pattern of how Olivetti faces dataloading test handled the data not being available at test time.

The original, merged California housing dataloading test (test_fetch) also did not run if the data was not available.

I'm not sure the best way to proceed on this. Any guidance here would be appreciated. 🙏

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM

gitsteph and others added 2 commits November 4, 2019 08:50
Removed unneeded parens

Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
Removed unneeded parens

Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
@gitsteph gitsteph requested a review from amueller November 7, 2019 21:21
@reshamas
Copy link
Member

reshamas commented Dec 8, 2019

@gitsteph thanks for the email. I added this PR to the LIST

@cmarmo Adding this PR to the SF sprint list.

@thomasjpfan Are you able to assist on this one? Do we know why the check is failing?

@cmarmo
Copy link
Contributor

cmarmo commented Dec 21, 2019

@gitsteph , the two failing builds are failing because of ubuntu issues:

E: Failed to fetch https://packages.microsoft.com/ubuntu/16.04/prod/dists/xenial/main/binary-amd64/Packages.bz2  Hash Sum mismatch

Would you please sync with master and push... hope that the issues have been resolved.
Then, maybe @amueller will be able to finalize his review?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants