[MRG+2] Update BaggingRegressor to relax checking X for finite values #9707


Merged: 9 commits into scikit-learn:master on May 31, 2018

Conversation

@jimmywan (Contributor) commented Sep 7, 2017

I'd like to make this change in order to support using BaggingRegressor on top of pipelines that handle imputation.

Reference Issue

#9708

What does this implement/fix? Explain your changes.

Removes the finiteness check on X. This check seems unnecessary, since whichever regressor BaggingRegressor wraps should already perform its own input validation.

@ramhiser commented Sep 7, 2017

+1. The call to check_X_y(..., force_all_finite=False) makes more sense when a Pipeline has a missing data imputer in it.

Maybe the check should be removed entirely so that the base estimator handles the check?
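The use case being enabled can be sketched as follows — a minimal example of a BaggingRegressor wrapping an imputation pipeline, using the modern `SimpleImputer` (at the time of this PR the class was `sklearn.preprocessing.Imputer`; names and parameters here are illustrative):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

# X contains NaNs; the imputer inside the pipeline fills them
# before the tree ever sees the data.
X = np.array([[1.0, 3.0, 5.0],
              [2.0, np.nan, 6.0],
              [2.0, 4.0, np.nan],
              [3.0, 4.0, 7.0]])
y = np.array([2.0, 3.0, 3.0, 4.0])

pipeline = make_pipeline(SimpleImputer(strategy="mean"),
                         DecisionTreeRegressor())
model = BaggingRegressor(pipeline, n_estimators=5, random_state=0)
pred = model.fit(X, y).predict(X)
```

Before this change, `BaggingRegressor.fit` rejected the NaNs itself, so the pipeline's imputer never got a chance to run.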

@jnothman (Member) commented Sep 8, 2017

We can't remove the check entirely: Bagging requires pulling out selected rows and columns, and that requires 2d data with an array-like interface. I suppose we don't need to require that it's numeric, though...? We also do not support Pandas DataFrames here, as that would require use of iloc.

Please add a test.
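The indexing requirement above can be illustrated with a numpy-only sketch of the kind of slicing bagging performs (variable names are illustrative, not scikit-learn's internals):

```python
import numpy as np

rng = np.random.RandomState(0)
X = np.arange(20, dtype=float).reshape(5, 4)

# Each bagged estimator trains on a bootstrap sample of rows
# and (optionally) a random subset of feature columns.
sample_idx = rng.randint(0, X.shape[0], size=X.shape[0])
feature_idx = rng.choice(X.shape[1], size=2, replace=False)

# This fancy indexing needs 2d, array-like input; a pandas
# DataFrame would require .iloc instead.
X_sub = X[sample_idx][:, feature_idx]
```

This is why the `check_X_y` call can be relaxed but not removed: the data must still be 2d and indexable on both axes, even if it need not be finite.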

@jimmywan (Contributor, author) commented Sep 8, 2017

Updated the PR to drop the type requirement and allow multiple output columns. Will work on a test case.

We also do not support Pandas DataFrames here, as that would require use of iloc.

Hmm, can you walk me through how to make this Pandas-compatible?
A pointer to existing code would be fine if it exists.

@jimmywan (Contributor, author) commented Sep 8, 2017

Added some test cases for classification and regression, and had to make some minor tweaks to accept-multi-output regression.

@amueller (Member) commented Sep 8, 2017

check_array will convert a dataframe to a numpy array
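A quick sketch of that conversion (assuming pandas is available):

```python
import numpy as np
import pandas as pd
from sklearn.utils import check_array

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# check_array accepts the DataFrame but hands back a plain ndarray,
# so any DataFrame-specific structure is gone past this point.
arr = check_array(df)
```

Hence a DataFrame can be passed in, but everything downstream of the check — including the bagging row/column slicing — operates on a numpy array.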

@jimmywan (Contributor, author) commented Sep 8, 2017

check_array will convert a dataframe to a numpy array

So nothing else required to pass in a pandas DataFrame?

Context is that my pipeline is capable of handling the conversion of categorical string values to numerics before it hits the actual regressor/classifier at the end of the pipeline.

@amueller (Member) commented Sep 8, 2017

Well, if you do that then the pipeline will receive a numpy array and not a dataframe; but as @jnothman said, the slicing doesn't deal with dataframes, so there is no way to have dataframes in the pipeline.

Is there a reason you do the feature expansion in the pipeline and not before the bagging estimator?

@jimmywan (Contributor, author) commented:

well if you do that then the pipeline will receive a numpy array and not a dataframe, but as @jnothman said, the slicing doesn't deal with dataframes, so there is no way to have dataframes in the pipeline.

Assuming that everything before my custom components supports things like get_feature_names or masks, my components can deal with converting numpy arrays back to DataFrames where I need them. I should be good to go.

@jimmywan (Contributor, author) commented:

I suppose we don't need to require that it's numeric, though...?

I removed that check as well and wanted to test it, but given the lack of a built-in pipeline friendly categorical encoder, I decided not to add test cases for that particular input.

@jnothman (Member) left a comment:

Thanks

@@ -280,7 +279,10 @@ def _fit(self, X, y, max_samples=None, max_depth=None, sample_weight=None):
random_state = check_random_state(self.random_state)

# Convert data
X, y = check_X_y(X, y, ['csr', 'csc'])
X, y = check_X_y(
@jnothman (Member) commented:

Perhaps add a comment that we require X to be 2d and indexable on both axes.

@jimmywan (Contributor, author) replied:

Added



def test_bagging_pipeline_with_sparse_inputs():
# Check that BaggingRegressor can accept sparse pipelines inputs
@jnothman (Member) commented:

What you're testing with is not what we call sparse. I presume sparse inputs are already tested too. Do you mean to test nans and multioutput (which probably belongs in a separate test function)?

@jimmywan (Contributor, author) replied Sep 12, 2017:

Renamed "sparse" to "missing", and created separate tests for single and multioutput BaggingRegressors.

# Check that BaggingRegressor can accept sparse pipelines inputs
X = [
[1, 3, 5],
[2, None, 6],
@jnothman (Member) commented:

I think you want np.nan rather than None?

@jimmywan (Contributor, author) replied:

TBH, I generally test against both as I'm not entirely sure what happens under the covers. I know I can initialize a Pandas DataFrame either way.
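For reference, numpy does treat the two differently; a short sketch of what happens under the covers:

```python
import numpy as np

# None forces an object-dtype array; np.nan keeps a float array.
a = np.array([[1, None, 3]])
b = np.array([[1, np.nan, 3]])

# np.isnan works on the float array but raises TypeError on the
# object array, which is why NaN-aware checks care about dtype.
has_nan = np.isnan(b).any()
```

So a list containing `None` only behaves like missing data once something (e.g. pandas or an imputer) converts it to a float array with NaN.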

@jimmywan (Contributor, author) replied:

Added separate imputers/inputs for None/np.nan, np.inf, np.NINF.

@jimmywan (Contributor, author) commented:

@jnothman Updated as per your PR comments and have updated test cases as well. Please review at your earliest convenience.

@jnothman (Member) left a comment:

Thanks

)
pipeline.fit(X, Y).predict(X)
bagging_regressor = BaggingRegressor(pipeline)
bagging_regressor.fit(X, Y).predict(X)
@jnothman (Member) commented:

should check decision_function etc also, no?

@jimmywan (Contributor, author) replied:

There is no decision_function on BaggingRegressor.
I'll add it to the BaggingClassifier test.

@jimmywan (Contributor, author) replied Sep 14, 2017:

I actually can't add it to the test for BaggingClassifier because DecisionTreeRegressor does not implement decision_function. I don't normally work with classification and this PR was meant to be a patch for the BaggingRegressor. Feel free to add additional tests if you think it's necessary.

pipeline = make_pipeline(regressor)
assert_raises(ValueError, pipeline.fit, X, Y)
bagging_regressor = BaggingRegressor(pipeline)
assert_raises(ValueError, bagging_regressor.fit, X, Y)
@jnothman (Member) commented:

Might be good to check the message here, but okay.

@jimmywan (Contributor, author) replied:

Since this is just a test for BaggingRegressor, I don't really want to couple the behavior of the underlying regressor to this test case. The fact that it raises an error seems sufficient to me.

[2, np.inf, 6],
[2, np.NINF, 6],
]
Y = [
@jnothman (Member) commented:

I'd be okay with this test not dealing with missing data, and just being about 2d. I'd also be okay with the previous test just looping over 1d, 2d.

@jimmywan (Contributor, author) replied:

Changed, though now I have to reuse y for 1D and 2D...

[2, np.inf, 6],
[2, np.NINF, 6],
]
Y = [2, 3, 3, 3, 3]
@jnothman (Member) commented:

lowercase y if 1d please

@jimmywan (Contributor, author) replied:

Done.

)
pipeline.fit(X, Y).predict(X)
bagging_regressor = BaggingRegressor(pipeline)
bagging_regressor.fit(X, Y).predict(X)
@jnothman (Member) commented:

please check that the prediction shape matches the input shape

@jimmywan (Contributor, author) replied:

Done.

assert_raises(ValueError, bagging_regressor.fit, X, Y)


def test_bagging_classifier_with_missing_inputs():
@jnothman (Member) commented:

What does this test add?

@jimmywan (Contributor, author) replied:

This tests BaggingClassifier while the other tests BaggingRegressor.

@jnothman (Member) left a comment:

LGTM

@jnothman jnothman changed the title Update BaggingRegressor to relax checking X for finite values [MRG+1] Update BaggingRegressor to relax checking X for finite values Sep 14, 2017
@amueller (Member) commented:

LGTM. Please add a whatsnew entry.

@jnothman jnothman changed the title [MRG+1] Update BaggingRegressor to relax checking X for finite values [MRG+2] Update BaggingRegressor to relax checking X for finite values May 23, 2018
@jnothman (Member) commented:

Please add an entry to the change log at doc/whats_new/v0.20.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:
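Such an entry might look roughly like the following (the wording here is illustrative, not the exact merged text):

```rst
- :class:`ensemble.BaggingRegressor` and :class:`ensemble.BaggingClassifier`
  can now be fit with missing/non-finite values in ``X`` when the underlying
  base estimator supports them. :issue:`9707` by :user:`jimmywan`.
```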

@jnothman (Member) commented:

Ping @jimmywan?

@jimmywan (Contributor, author) commented:

please add a whatsnew entry

Merged scikit-learn:master into my branch and added requested documentation.

@jnothman jnothman merged commit a31a906 into scikit-learn:master May 31, 2018
@jnothman (Member) commented:

Thanks @jimmywan!

4 participants