-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] adding as_frame functionality for california housing dataset loader #15486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
the dataframe should also contain the target, right? |
Ah, fixing a couple things now |
you can make the change in this PR. |
Will do; just noticed a couple additional things to fix so wanted to close it until they're handled. It's not ready for viewing yet 😅 |
@amueller hope this is better 🙏 |
target_names = ["MedHouseVal", ] | ||
if as_frame: | ||
columns = feature_names + target_names | ||
adjusted_data = np.hstack((data, target[:,np.newaxis])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a good idea in case X and y have different data types. numpy doesn't allow different types, so you're making y
be float here (if it was int) or X
be object (if y is object).
And I think the approach of wrapping this in a reusable function is good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation with numpy worked for this specific dataset, but I agree that it wouldn't generalize well.
I'll try making this more generalizable -- maybe by constructing separate dataframes (one for X
and one for y
), and using pandas to concatenate them together instead of numpy. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes something like that would probably work well. Have you tried working with @wconnell as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool! I just made that proposed change (in the latest commit below) and tested it locally. The _convert_data_dataframe
method in this PR hopefully will be able to generalize to some other cases.
I haven't tried working with @wconnell yet but am open to it! Will try to find them on gitter.
frame = None | ||
target_names = ["MedHouseVal", ] | ||
if as_frame: | ||
frame, X, y = _convert_data_dataframe("fetch_california_housing", data, target, feature_names, target_names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
break at 79 chars please
please add a test that the attribute exists and has the expected content. Otherwise looks good! |
@@ -48,8 +49,17 @@ | |||
logger = logging.getLogger(__name__) | |||
|
|||
|
|||
def _convert_data_dataframe(caller_name, data, target, feature_names, target_names): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this probably should live in sklearn/datasets/base.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just moved it after chatting with @wconnell :)
(I left it here temporarily since I wasn't sure it would work for all cases; it's in _base.py
in the most recent commit)
When we implement this in the various |
Indeed. In this case, the function should (apparently) return X and y separately as a DataFrame and a Series respectively. |
sklearn/datasets/tests/test_base.py
Outdated
@@ -252,6 +253,18 @@ def test_loads_dumps_bunch(): | |||
assert bunch_from_pkl['x'] == bunch_from_pkl.x | |||
|
|||
|
|||
def test_fetch_asframe(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the test should be in test_california_housing.py I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha; thanks! Wasn't sure since it aims to test a function in the _base.py
file. I'll move it now.
@@ -181,6 +179,9 @@ def fetch_california_housing(data_home=None, download_if_missing=True, | |||
feature_names, | |||
target_names) | |||
|
|||
if return_X_y: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved this to better handle the case where both return_X_y
and as_frame
are True
. If they are both true, this will return X
and y
as pandas objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TY to @wconnell for this comment => #15486 (comment)
test are failing |
👌 I'll fix the remaining linting issues now. 🔍 As for the other failed tests -- it seems that the test environments don't have pandas installed (?). I'm not sure how to test the new functionality in that case... would you have any advice on how to proceed with that? (e.g. finding a way to install pandas into those test environments, or some other workaround?) . The openml test ( |
e4d547f
to
19ed2f3
Compare
For The original, merged California housing dataloading test ( I'm not sure the best way to proceed on this. Any guidance here would be appreciated. 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise LGTM
Removed unneeded parens Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
Removed unneeded parens Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
@gitsteph thanks for the email. I added this PR to the LIST @cmarmo Adding this PR to the SF sprint list. @thomasjpfan Are you able to assist on this one? Do we know why the check is failing? |
@gitsteph , the two failing builds are failing because of ubuntu issues:
Would you please sync with master and push... hope that the issues have been resolved. |
Reference Issues/PRs
Handles part of #10733
(Loosely following conventions found in #13902)
What does this implement/fix? Explain your changes.
Adds
as_frame
functionality for the California Housing dataset loader (fetch_california_housing
)Any other comments?
No