[MRG+2] Update BaggingRegressor to relax checking X for finite values #9707
Conversation
I'd like to make this change in order to support using BaggingRegressor on top of pipelines that handle imputation.
+1. Maybe the check should be removed entirely so that the base estimator handles the check?
We can't remove the check entirely: Bagging requires pulling out selected rows and columns, and that requires 2d data with an array-like interface. I suppose we don't need to require that it's numeric, though...? We also do not support Pandas DataFrames here, as that would require use of iloc. Please add a test.
Updated the PR to drop the type requirement and allow multiple output columns. Will work on a test case.
Hmm, can you walk me through how to make this Pandas-compatible?
Added some test cases for classification and regression, and had to make some minor tweaks to accept multi-output regression.
So nothing else required to pass in a pandas DataFrame? Context is that my pipeline is capable of handling the conversion of categorical string values to numerics before it hits the actual regressor/classifier at the end of the pipeline.
Well, if you do that, then the pipeline will receive a numpy array and not a DataFrame. But as @jnothman said, the slicing doesn't deal with DataFrames, so there is no way to have DataFrames in the pipeline. Is there a reason you do the feature expansion in the pipeline and not before the bagging estimator?
Assuming that everything before my custom components supports things like …
I removed that check as well and wanted to test it, but given the lack of a built-in pipeline-friendly categorical encoder, I decided not to add test cases for that particular input.
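For what it's worth, a sketch of doing the expansion before the bagging estimator, as suggested above. This assumes the era's Imputer transformer and uses pandas.get_dummies as a stand-in for the missing built-in categorical encoder; it is an illustration, not code from this PR.

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import BaggingRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Imputer
    from sklearn.tree import DecisionTreeRegressor

    df = pd.DataFrame({'color': ['red', 'blue', 'red', 'blue'],
                       'size': [1.0, np.nan, 3.0, 4.0]})
    y = [2, 3, 3, 4]

    # Expand categorical strings to numerics *before* bagging, so the
    # bagging estimator only ever slices a plain numeric array.
    X = pd.get_dummies(df).values

    model = BaggingRegressor(make_pipeline(Imputer(), DecisionTreeRegressor()))
    model.fit(X, y).predict(X)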
Thanks
sklearn/ensemble/bagging.py (Outdated)
@@ -280,7 +279,10 @@ def _fit(self, X, y, max_samples=None, max_depth=None, sample_weight=None):
         random_state = check_random_state(self.random_state)

         # Convert data
-        X, y = check_X_y(X, y, ['csr', 'csc'])
+        X, y = check_X_y(
Perhaps add a comment that we require X to be 2d and indexable on both axes.
Added
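For context, the relaxed call in _fit looks roughly like this; a sketch reconstructed from the truncated diff above, so the exact keyword arguments are an assumption rather than the merged code:

    from sklearn.utils import check_X_y

    # Convert data: X must be 2d and indexable on both axes, but we no
    # longer require a numeric dtype or finite values; the wrapped base
    # estimator performs its own input validation.
    X, y = check_X_y(
        X, y, ['csr', 'csc'], dtype=None, force_all_finite=False,
        multi_output=True
    )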
def test_bagging_pipeline_with_sparse_inputs():
    # Check that BaggingRegressor can accept sparse pipelines inputs
What you're testing with is not what we call sparse. I presume sparse inputs are already tested too. Do you mean to test nans and multioutput (which probably belongs in a separate test function)?
Renamed "sparse" to "missing"
I created separate tests for single and multioutput BaggingRegressors.
    # Check that BaggingRegressor can accept sparse pipelines inputs
    X = [
        [1, 3, 5],
        [2, None, 6],
I think you want np.nan rather than None?
TBH, I generally test against both as I'm not entirely sure what happens under the covers. I know I can initialize a Pandas DataFrame either way.
Added separate imputers/inputs for None/np.nan, np.inf, np.NINF.
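For reference, both spellings end up equivalent once the data is coerced to float, because numpy converts None to nan during the cast; a quick standalone check:

    import numpy as np

    # With an explicit float dtype, None is coerced to nan, so
    # [2, None, 6] and [2, np.nan, 6] are identical after conversion.
    a = np.array([[1, 3, 5], [2, None, 6]], dtype=float)
    assert np.isnan(a[1, 1])

    # Without a dtype, the same list yields an object array, and the
    # coercion is deferred to whichever estimator calls check_array first.
    b = np.array([[1, 3, 5], [2, None, 6]])
    assert b.dtype == object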
Add comments. Rename things for consistency. Add additional test case.
…es not try to enforce finite inputs.
@jnothman Updated as per your PR comments and updated the test cases as well. Please review at your earliest convenience.
Thanks
    )
    pipeline.fit(X, Y).predict(X)
    bagging_regressor = BaggingRegressor(pipeline)
    bagging_regressor.fit(X, Y).predict(X)
should check decision_function etc also, no?
There is no decision_function on BaggingRegressor. I'll add it to the BaggingClassifier test.
I actually can't add it to the test for BaggingClassifier because DecisionTreeRegressor does not implement decision_function. I don't normally work with classification and this PR was meant to be a patch for BaggingRegressor. Feel free to add additional tests if you think it's necessary.
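As an aside, decision_function on BaggingClassifier is delegated to the base estimator, so it only exists when the base estimator provides one. A sketch (illustrative, not part of this PR) with an SVC base estimator:

    from sklearn.ensemble import BaggingClassifier
    from sklearn.svm import SVC

    # SVC implements decision_function, so the delegated method is exposed;
    # with a DecisionTreeClassifier base estimator it would not be.
    X = [[i] for i in range(20)]
    y = [0] * 10 + [1] * 10
    clf = BaggingClassifier(SVC(), random_state=0).fit(X, y)
    print(clf.decision_function([[9.5]]))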
    pipeline = make_pipeline(regressor)
    assert_raises(ValueError, pipeline.fit, X, Y)
    bagging_regressor = BaggingRegressor(pipeline)
    assert_raises(ValueError, bagging_regressor.fit, X, Y)
Might be good to check the message here, but okay.
Since this is just a test for BaggingRegressor, I don't really want to couple the behavior of the underlying regressor to this test case. The fact that it raises an error seems sufficient to me.
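For completeness, if the message were worth pinning, the test utilities of the time offered a regex assertion; a hypothetical tightening (the message text is an assumption about what the inner estimator's validation raises):

    from sklearn.utils.testing import assert_raises_regex

    # Couples the test to the inner estimator's validation message.
    assert_raises_regex(ValueError, 'Input contains NaN',
                        bagging_regressor.fit, X, Y)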
        [2, np.inf, 6],
        [2, np.NINF, 6],
    ]
    Y = [
I'd be okay with this test not dealing with missing data, and just being about 2d. I'd also be okay with the previous test just looping over 1d, 2d.
Changed, though now I have to reuse y for 1D and 2D...
        [2, np.inf, 6],
        [2, np.NINF, 6],
    ]
    Y = [2, 3, 3, 3, 3]
lowercase y if 1d please
Done.
    )
    pipeline.fit(X, Y).predict(X)
    bagging_regressor = BaggingRegressor(pipeline)
    bagging_regressor.fit(X, Y).predict(X)
please check that the prediction shape matches the input shape
Done.
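Taken together, the requests above (lowercase 1d y, looping over 1d/2d targets, and the shape assertion) suggest a test shaped roughly like this sketch, with illustrative names and data rather than the merged test verbatim:

    import numpy as np
    from sklearn.ensemble import BaggingRegressor
    from sklearn.tree import DecisionTreeRegressor

    def test_bagging_regressor_prediction_shapes():
        X = np.array([[1, 3, 5], [2, 4, 6], [2, 5, 6], [3, 5, 7], [4, 6, 8]])
        y = np.array([2, 3, 3, 3, 3])
        # Loop the same body over a 1d target and a 2d (multi-output) target.
        for y_variant in [y, np.column_stack([y, y])]:
            reg = BaggingRegressor(DecisionTreeRegressor(), random_state=0)
            y_hat = reg.fit(X, y_variant).predict(X)
            # The prediction shape must match the target shape.
            assert np.shape(y_hat) == np.shape(y_variant)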
    assert_raises(ValueError, bagging_regressor.fit, X, Y)


def test_bagging_classifier_with_missing_inputs():
What does this test add?
This tests BaggingClassifier while the other tests BaggingRegressor.
LGTM
LGTM, please add a whatsnew entry.
Please add an entry to the change log at
Ping @jimmywan?
Merged scikit-learn:master into my branch and added requested documentation.
Thanks @jimmywan!
I'd like to make this change in order to support using BaggingRegressor on top of pipelines that handle imputation.
Reference Issue
#9708
What does this implement/fix? Explain your changes.
Removes finite checking on X. This seems wholly unnecessary, as whatever regressor BaggingRegressor wraps should already handle its own consistency checking.
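A minimal sketch of the motivating use case, assuming the era's Imputer (since replaced by SimpleImputer). Before this change, the outer fit raised on the NaN values before the pipeline's imputer ever saw them:

    import numpy as np
    from sklearn.ensemble import BaggingRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Imputer
    from sklearn.tree import DecisionTreeRegressor

    X = [[1, 3, 5], [2, None, 6], [2, np.nan, 6], [3, 4, 5], [4, 5, 6]]
    y = [2, 3, 3, 4, 5]

    # The inner pipeline handles its own imputation, so BaggingRegressor
    # should not reject non-finite values before delegating to it.
    pipeline = make_pipeline(Imputer(), DecisionTreeRegressor())
    bagging_regressor = BaggingRegressor(pipeline)
    print(bagging_regressor.fit(X, y).predict(X))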