Skip to content

ENH adding as_frame functionality for CA housing dataset loader #15950

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 27 commits into from
Dec 26, 2019
Merged

ENH adding as_frame functionality for CA housing dataset loader #15950

merged 27 commits into from
Dec 26, 2019

Conversation

reshamas
Copy link
Member

Reference Issues/PRs

Handles part of #10733
(Loosely following conventions found in #13902)
Closes #15486

What does this implement/fix? Explain your changes.

Adds as_frame functionality for the California Housing dataset loader (fetch_california_housing)

Any other comments?

Continuing work begun by @gitsteph at SF scikit sprint (Nov 2019)

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM. The coverage is failing because the dataset is not downloaded on the CI machine but the tests pass as expected on my local machine with the prefetched data.

Just a few comments / improvements:

assert hasattr(bunch, 'frame') is True
assert frame.shape == (20640, 9)
assert isinstance(bunch.data, pd.DataFrame)
assert isinstance(bunch.target, pd.DataFrame)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a test that checks that ImportError is raised if pandas is missing and that the message is " fetch_california_housing with as_frame=True requires pandas."?

frame : pandas DataFrame
Only present when `as_frame=True`. DataFrame with ``data`` and
``target``.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pleased add the directive:

.. versionadded:: 0.23

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the right way to do it?

def _pandas_is_missing():
    try:
        import pandas
    except ImportError:
        raise SkipTest('fetch_california_housing with as_frame=True'
                       ' requires pandas')


@pytest.mark.skipif(
    _pandas_is_missing(),
    reason='Import pandas library to run this test'
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #15950 (comment) I mentioned to add a new test to check that the warning is raised without skipping the test when pandas is missing. Something like the following (untested):

@pytest.mark.skipif(
    _is_california_housing_dataset_not_available(),
    reason='Download California Housing dataset to run this test'
)
def test_pandas_dependency_message()
    try:
        import pandas
        pytest.skip("This test requires pandas to be not installed")
    except ImportError:
        # Check that pandas is imported lazily and that an informative error message is
        # raised when pandas is missing:
        expected_msg = "fetch_california_housing with as_frame=True requires pandas"
        with pytest.raises(ImportError, match=expected_msg):
            fetch_california_housing(as_frame=True)

@ogrisel
Copy link
Member

ogrisel commented Dec 24, 2019

The CI linter failed for no good reason. Let's ignore the code style checks for that specific line:

import pandas  # noqa

@reshamas
Copy link
Member Author

The CI linter failed for no good reason. Let's ignore the code style checks for that specific line:

import pandas  # noqa

I am not sure what that means. VSC does show a squiggly line under import but not sure what it prefers to see.

Screen Shot 2019-12-24 at 1 59 06 PM

@ogrisel
Copy link
Member

ogrisel commented Dec 24, 2019

Yes this is it. It's flake8 or some other linter that's reporting a pep8 violation on that line. You can mouse over the squiggly line and wait a bit to see a pop-up that gives the reason the linter complains for.

If you append " # noqa" at the end of the line, then save the edited file, this line will be ignored by the linter.

When I said "the CI linter failed for no good reason." I meant the Continuous Integration linter tasks that display red crosses in the github interface on this PR.

@reshamas
Copy link
Member Author

It looks like it is because pandas is unused.

Screen Shot 2019-12-24 at 2 48 23 PM

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gitsteph and @reshamas !

I'll just add a what's new entry, otherwise good to merge.

@rth rth changed the title adding as_frame functionality for CA housing dataset loader ENH adding as_frame functionality for CA housing dataset loader Dec 26, 2019
@rth rth merged commit 1cda8ed into scikit-learn:master Dec 26, 2019
@reshamas reshamas deleted the df_housing branch December 26, 2019 13:18
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020
…it-learn#15950)

Co-authored-by: Stephanie Andrews <byronic.string@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants