[MRG] adding as_frame functionality for california housing dataset loader #15486

gitsteph · 2019-11-02T19:35:25Z

Reference Issues/PRs

Handles part of #10733
(Loosely following conventions found in #13902)

What does this implement/fix? Explain your changes.

Adds as_frame functionality for the California Housing dataset loader (fetch_california_housing)

Any other comments?

No

amueller · 2019-11-02T19:41:07Z

the dataframe should also contain the target, right?

gitsteph · 2019-11-02T19:44:36Z

Ah, fixing a couple things now

amueller · 2019-11-02T19:45:32Z

you can make the change in this PR.

gitsteph · 2019-11-02T19:48:57Z

Will do; just noticed a couple additional things to fix so wanted to close it until they're handled. It's not ready for viewing yet 😅

gitsteph · 2019-11-02T21:13:45Z

@amueller hope this is better 🙏
thanks for the feedback :)

amueller · 2019-11-02T21:15:27Z

sklearn/datasets/_california_housing.py

+    target_names = ["MedHouseVal", ]
+    if as_frame:
+        columns = feature_names + target_names
+        adjusted_data = np.hstack((data, target[:,np.newaxis]))


This is not a good idea in case X and y have different data types. A NumPy array has a single dtype, so you're making y float here (if it was int) or X object (if y is object).

And I think the approach of wrapping this in a reusable function is good.
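To illustrate the dtype concern, here is a minimal sketch using toy arrays (the values and column names are made up, not taken from the California housing data). It shows `np.hstack` upcasting an integer target, while a column-wise `pd.concat` keeps per-column dtypes:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for a float feature matrix and an integer target.
data = np.array([[0.5, 1.5], [2.5, 3.5]])
target = np.array([1, 2])

# A NumPy array holds one dtype, so stacking upcasts the int target to float.
stacked = np.hstack((data, target[:, np.newaxis]))
print(stacked.dtype)

# Building two frames and concatenating column-wise keeps each column's dtype.
frame = pd.concat(
    [pd.DataFrame(data, columns=["f0", "f1"]),
     pd.DataFrame(target, columns=["y"])],
    axis=1,
)
print(frame.dtypes)
```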

The current implementation with numpy worked for this specific dataset, but I agree that it wouldn't generalize well.
I'll try making this more generalizable -- maybe by constructing separate dataframes (one for X and one for y), and using pandas to concatenate them together instead of numpy. 🤔

yes something like that would probably work well. Have you tried working with @wconnell as well?

Cool! I just made that proposed change (in the latest commit below) and tested it locally. The _convert_data_dataframe method in this PR hopefully will be able to generalize to some other cases.

I haven't tried working with @wconnell yet but am open to it! Will try to find them on gitter.
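A rough sketch of what such a reusable helper might look like, assuming it only needs the arrays and the column names. The name `convert_to_frame` and the toy inputs are hypothetical, and this is not necessarily the code in the PR:

```python
import numpy as np
import pandas as pd


def convert_to_frame(data, target, feature_names, target_names):
    """Hypothetical helper: return (frame, X, y) as pandas objects."""
    data_df = pd.DataFrame(data, columns=feature_names)
    target_df = pd.DataFrame(target, columns=target_names)
    # Concatenate in pandas so features and target keep their own dtypes.
    frame = pd.concat([data_df, target_df], axis=1)
    X = frame[feature_names]
    y = frame[target_names]
    # A single target column is more convenient as a Series.
    if y.shape[1] == 1:
        y = y.iloc[:, 0]
    return frame, X, y


# Toy inputs for illustration only.
data = np.arange(6, dtype=float).reshape(3, 2)
target = np.array([10, 20, 30])
frame, X, y = convert_to_frame(data, target, ["a", "b"], ["MedHouseVal"])
print(frame)
```

Returning the combined frame together with X and y lets a loader expose all three without repeating the conversion.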

amueller · 2019-11-02T22:05:08Z

sklearn/datasets/_california_housing.py

+    frame = None
+    target_names = ["MedHouseVal", ]
+    if as_frame:
+        frame, X, y = _convert_data_dataframe("fetch_california_housing", data, target, feature_names, target_names)


break at 79 chars please

amueller · 2019-11-02T22:05:50Z

please add a test that the attribute exists and has the expected content. Otherwise looks good!

amueller · 2019-11-02T22:06:09Z

sklearn/datasets/_california_housing.py

@@ -48,8 +49,17 @@
 logger = logging.getLogger(__name__)


+def _convert_data_dataframe(caller_name, data, target, feature_names, target_names):


this probably should live in sklearn/datasets/base.py

Just moved it after chatting with @wconnell :)
(I left it here temporarily since I wasn't sure it would work for all cases; it's in _base.py in the most recent commit)

wconnell · 2019-11-02T22:38:04Z

When we implement this in the various load_* functions we will also have to cover the case when return_X_y=True and as_frame=True.

amueller · 2019-11-02T22:41:16Z

Indeed. In this case, the function should (apparently) return X and y separately as a DataFrame and a Series respectively.

amueller · 2019-11-02T23:00:40Z

sklearn/datasets/tests/test_base.py

@@ -252,6 +253,18 @@ def test_loads_dumps_bunch():
    assert bunch_from_pkl['x'] == bunch_from_pkl.x


+def test_fetch_asframe():


the test should be in test_california_housing.py I think.

Gotcha; thanks! Wasn't sure since it aims to test a function in the _base.py file. I'll move it now.

gitsteph · 2019-11-02T23:12:32Z

sklearn/datasets/_california_housing.py

@@ -181,6 +179,9 @@ def fetch_california_housing(data_home=None, download_if_missing=True,
                                              feature_names,
                                              target_names)

+    if return_X_y:


Moved this to better handle the case where both return_X_y and as_frame are True. If they are both true, this will return X and y as pandas objects.

TY to @wconnell for this comment => #15486 (comment)
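The ordering described above can be sketched with a hypothetical loader (`load_toy` and its data are invented for illustration). Converting before the `return_X_y` check means that branch returns pandas objects when both flags are set:

```python
import numpy as np
import pandas as pd


def load_toy(return_X_y=False, as_frame=False):
    """Hypothetical loader showing the order of the two checks."""
    data = np.array([[1.0, 2.0], [3.0, 4.0]])
    target = np.array([0.5, 1.5])
    feature_names = ["f0", "f1"]
    frame = None
    X, y = data, target
    # Convert first, so the return_X_y branch below sees pandas objects.
    if as_frame:
        X = pd.DataFrame(data, columns=feature_names)
        y = pd.Series(target, name="target")
        frame = pd.concat([X, y], axis=1)
    if return_X_y:
        return X, y
    return {"data": X, "target": y, "frame": frame}


X, y = load_toy(return_X_y=True, as_frame=True)
print(type(X).__name__, type(y).__name__)
```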

amueller · 2019-11-02T23:53:12Z

tests are failing

gitsteph · 2019-11-03T16:52:52Z

👌 I'll fix the remaining linting issues now.

🔍 As for the other failed tests -- it seems that the test environments don't have pandas installed (?). I'm not sure how to test the new functionality in that case... would you have any advice on how to proceed with that (e.g. finding a way to install pandas into those test environments, or some other workaround)? The openml test (test_openml.py) uses pytest.importorskip. I'll try doing that here too, but doing so means the test will be skipped in the environments that are missing pandas.
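For reference, a minimal sketch of the `pytest.importorskip` pattern. The test name and toy data are hypothetical, and the real test in this PR may look different:

```python
import numpy as np
import pytest


def test_frame_contains_target():
    # Skips (rather than fails) when pandas cannot be imported.
    pd = pytest.importorskip("pandas")
    data = np.array([[1.0, 2.0], [3.0, 4.0]])
    target = np.array([0.5, 1.5])
    frame = pd.concat(
        [pd.DataFrame(data, columns=["f0", "f1"]),
         pd.Series(target, name="MedHouseVal")],
        axis=1,
    )
    assert frame.shape == (2, 3)
    assert "MedHouseVal" in frame.columns
```

The trade-off is the one noted above: in an environment without pandas the test is reported as skipped, so the new code path is not exercised there.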

gitsteph · 2019-11-03T18:40:33Z

For test_fetch_asframe -- I'm failing the coverage test (codecov/patch) because I added @pytest.mark.skipif in the last commit (45940b2), following the pattern the Olivetti faces data-loading test uses when the data is not available at test time.

The original, merged California housing dataloading test (test_fetch) also did not run if the data was not available.

I'm not sure of the best way to proceed on this. Any guidance here would be appreciated. 🙏

jnothman

Otherwise LGTM

sklearn/datasets/tests/test_california_housing.py

Removed unneeded parens Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>

reshamas · 2019-12-08T19:51:34Z

@gitsteph thanks for the email. I added this PR to the LIST

@cmarmo Adding this PR to the SF sprint list.

@thomasjpfan Are you able to assist on this one? Do we know why the check is failing?

cmarmo · 2019-12-21T11:26:09Z

@gitsteph, the two failing builds are failing because of Ubuntu issues:

E: Failed to fetch https://packages.microsoft.com/ubuntu/16.04/prod/dists/xenial/main/binary-amd64/Packages.bz2  Hash Sum mismatch

Would you please sync with master and push? Hopefully the issues have been resolved by now.
Then, maybe @amueller will be able to finalize his review?

adding as_frame functionality for california housing dataset loader

78a196f

gitsteph mentioned this pull request Nov 2, 2019

API for returning datasets as DataFrames #10733

Closed

gitsteph closed this Nov 2, 2019

gitsteph added 2 commits November 2, 2019 13:51

fixes for as_frame to fetch_california_housing

d604f4b

minor change

0a01ecf

gitsteph reopened this Nov 2, 2019

forgot to rename feature_names to columns here

e00f4f4

amueller reviewed Nov 2, 2019

changes to use pandas concat and make conversion to df more generalizeable

3bda9b5

amueller reviewed Nov 2, 2019

gitsteph added 2 commits November 2, 2019 15:15

moved _convert_data_dataframe to _base.py

0dba66a

breaking at 79 chars

df9f086

added test_fetch_asframe

5019c9e

amueller reviewed Nov 2, 2019

gitsteph added 2 commits November 2, 2019 16:10

moved test to test_california_housing.py

587be05

moved return_X_y handling below as_frame check

f8eeba9

gitsteph commented Nov 2, 2019

gitsteph added 2 commits November 3, 2019 09:09

fixed linting issues caught by flake8

ee19ac7

using pytest.importorskip to handle test environments without pandas installed

19ed2f3

gitsteph force-pushed the df_cahousing branch from e4d547f to 19ed2f3 Compare November 3, 2019 17:26

skip test if dataset is not available in test env

45940b2

jnothman approved these changes Nov 4, 2019

gitsteph and others added 2 commits November 4, 2019 08:50

Update sklearn/datasets/tests/test_california_housing.py

a6fcb50

Removed unneeded parens Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>

Update sklearn/datasets/tests/test_california_housing.py

e3db481

Removed unneeded parens Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>

gitsteph requested a review from amueller November 7, 2019 21:21

thomasjpfan added 2 commits December 10, 2019 20:32

Merge remote-tracking branch 'upstream/master' into pr/15486

4c946e0

Merge remote-tracking branch 'upstream/master' into pr/15486

20daa13

reshamas mentioned this pull request Dec 22, 2019

ENH adding as_frame functionality for CA housing dataset loader #15950

Merged

rth closed this in #15950 Dec 26, 2019

wconnell mentioned this pull request Dec 27, 2019

ENH add as_frame functionality for toy datasets #15980

Merged
