[MRG+2] Update BaggingRegressor to relax checking X for finite values #9707
Conversation
I'd like to make this change in order to support using BaggingRegressor on top of pipelines that handle imputation.
+1. Maybe the check should be removed entirely so that the base estimator handles the check?
We can't remove the check entirely: Bagging requires pulling out selected rows and columns, and that requires 2d data with an array-like interface. I suppose we don't need to require that it's numeric, though...? We also do not support Pandas DataFrames here, as that would require use of iloc. Please add a test.
Updated the PR to drop the type requirement and allow multiple output columns. Will work on a test case.
Hmm, can you walk me through how to make this Pandas-compatible?
Added some test cases for classification and regression, and had to make some minor tweaks to accept multi-output regression.
So nothing else required to pass in a pandas DataFrame? Context is that my pipeline is capable of handling the conversion of categorical string values to numerics before it hits the actual regressor/classifier at the end of the pipeline.
Well, if you do that, then the pipeline will receive a numpy array and not a DataFrame. But as @jnothman said, the slicing doesn't deal with DataFrames, so there is no way to have DataFrames in the pipeline. Is there a reason you do the feature expansion in the pipeline and not before the bagging estimator?
Assuming that everything before my custom components supports things like …
I removed that check as well and wanted to test it, but given the lack of a built-in pipeline-friendly categorical encoder, I decided not to add test cases for that particular input.
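For what it's worth, a sketch of doing the expansion before the bagging estimator, as suggested above. This assumes the era's Imputer transformer and uses pandas.get_dummies as a stand-in for the missing built-in categorical encoder; it is an illustration, not code from this PR.

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import BaggingRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Imputer
    from sklearn.tree import DecisionTreeRegressor

    df = pd.DataFrame({'color': ['red', 'blue', 'red', 'blue'],
                       'size': [1.0, np.nan, 3.0, 4.0]})
    y = [2, 3, 3, 4]

    # Expand categorical strings to numerics *before* bagging, so the
    # bagging estimator only ever slices a plain numeric array.
    X = pd.get_dummies(df).values

    model = BaggingRegressor(make_pipeline(Imputer(), DecisionTreeRegressor()))
    model.fit(X, y).predict(X)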
Thanks
sklearn/ensemble/bagging.py (Outdated)
@@ -280,7 +279,10 @@ def _fit(self, X, y, max_samples=None, max_depth=None, sample_weight=None):
         random_state = check_random_state(self.random_state)

         # Convert data
-        X, y = check_X_y(X, y, ['csr', 'csc'])
+        X, y = check_X_y(
Perhaps add a comment that we require X to be 2d and indexable on both axes.
Added
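For context, the relaxed call in _fit looks roughly like this; a sketch reconstructed from the truncated diff above, so the exact keyword arguments are an assumption rather than the merged code:

    from sklearn.utils import check_X_y

    # Convert data: X must be 2d and indexable on both axes, but we no
    # longer require a numeric dtype or finite values; the wrapped base
    # estimator performs its own input validation.
    X, y = check_X_y(
        X, y, ['csr', 'csc'], dtype=None, force_all_finite=False,
        multi_output=True
    )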
def test_bagging_pipeline_with_sparse_inputs():
    # Check that BaggingRegressor can accept sparse pipelines inputs
What you're testing with is not what we call sparse. I presume sparse inputs are already tested too. Do you mean to test nans and multioutput (which probably belongs in a separate test function)?
Renamed "sparse" to "missing"
I created separate tests for single and multioutput BaggingRegressors.
    # Check that BaggingRegressor can accept sparse pipelines inputs
    X = [
        [1, 3, 5],
        [2, None, 6],
I think you want np.nan rather than None?
TBH, I generally test against both as I'm not entirely sure what happens under the covers. I know I can initialize a Pandas DataFrame either way.
Added separate imputers/inputs for None/np.nan, np.inf, np.NINF.
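For reference, both spellings end up equivalent once the data is coerced to float, because numpy converts None to nan during the cast; a quick standalone check:

    import numpy as np

    # With an explicit float dtype, None is coerced to nan, so
    # [2, None, 6] and [2, np.nan, 6] are identical after conversion.
    a = np.array([[1, 3, 5], [2, None, 6]], dtype=float)
    assert np.isnan(a[1, 1])

    # Without a dtype, the same list yields an object array, and the
    # coercion is deferred to whichever estimator calls check_array first.
    b = np.array([[1, 3, 5], [2, None, 6]])
    assert b.dtype == object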
Add comments. Rename things for consistency. Add additional test case.
…es not try to enforce finite inputs.
@jnothman Updated as per your PR comments and updated the test cases as well. Please review at your earliest convenience.
Thanks
    )
    pipeline.fit(X, Y).predict(X)
    bagging_regressor = BaggingRegressor(pipeline)
    bagging_regressor.fit(X, Y).predict(X)
should check decision_function etc also, no?
There is no decision_function on BaggingRegressor. I'll add it to the BaggingClassifier test.
I actually can't add it to the test for BaggingClassifier because DecisionTreeRegressor does not implement decision_function. I don't normally work with classification and this PR was meant to be a patch for BaggingRegressor. Feel free to add additional tests if you think it's necessary.
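As an aside, decision_function on BaggingClassifier is delegated to the base estimator, so it only exists when the base estimator provides one. A sketch (illustrative, not part of this PR) with an SVC base estimator:

    from sklearn.ensemble import BaggingClassifier
    from sklearn.svm import SVC

    # SVC implements decision_function, so the delegated method is exposed;
    # with a DecisionTreeClassifier base estimator it would not be.
    X = [[i] for i in range(20)]
    y = [0] * 10 + [1] * 10
    clf = BaggingClassifier(SVC(), random_state=0).fit(X, y)
    print(clf.decision_function([[9.5]]))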
    pipeline = make_pipeline(regressor)
    assert_raises(ValueError, pipeline.fit, X, Y)
    bagging_regressor = BaggingRegressor(pipeline)
    assert_raises(ValueError, bagging_regressor.fit, X, Y)
Might be good to check the message here, but okay.
Since this is just a test for BaggingRegressor, I don't really want to couple the behavior of the underlying regressor to this test case. The fact that it raises an error seems sufficient to me.
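For completeness, if the message were worth pinning, the test utilities of the time offered a regex assertion; a hypothetical tightening (the message text is an assumption about what the inner estimator's validation raises):

    from sklearn.utils.testing import assert_raises_regex

    # Couples the test to the inner estimator's validation message.
    assert_raises_regex(ValueError, 'Input contains NaN',
                        bagging_regressor.fit, X, Y)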
        [2, np.inf, 6],
        [2, np.NINF, 6],
    ]
    Y = [
I'd be okay with this test not dealing with missing data, and just being about 2d. I'd also be okay with the previous test just looping over 1d, 2d.
Changed, though now I have to reuse y for 1D and 2D...
        [2, np.inf, 6],
        [2, np.NINF, 6],
    ]
    Y = [2, 3, 3, 3, 3]
lowercase y if 1d please
Done.
    )
    pipeline.fit(X, Y).predict(X)
    bagging_regressor = BaggingRegressor(pipeline)
    bagging_regressor.fit(X, Y).predict(X)
please check that the prediction shape matches the input shape
Done.
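Taken together, the requests above (lowercase 1d y, looping over 1d/2d targets, and the shape assertion) suggest a test shaped roughly like this sketch, with illustrative names and data rather than the merged test verbatim:

    import numpy as np
    from sklearn.ensemble import BaggingRegressor
    from sklearn.tree import DecisionTreeRegressor

    def test_bagging_regressor_prediction_shapes():
        X = np.array([[1, 3, 5], [2, 4, 6], [2, 5, 6], [3, 5, 7], [4, 6, 8]])
        y = np.array([2, 3, 3, 3, 3])
        # Loop the same body over a 1d target and a 2d (multi-output) target.
        for y_variant in [y, np.column_stack([y, y])]:
            reg = BaggingRegressor(DecisionTreeRegressor(), random_state=0)
            y_hat = reg.fit(X, y_variant).predict(X)
            # The prediction shape must match the target shape.
            assert np.shape(y_hat) == np.shape(y_variant)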
    assert_raises(ValueError, bagging_regressor.fit, X, Y)


def test_bagging_classifier_with_missing_inputs():
What does this test add?
This tests BaggingClassifier while the other tests BaggingRegressor.
LGTM
LGTM, please add a whatsnew entry.
Please add an entry to the change log at
Ping @jimmywan?
Merged scikit-learn:master into my branch and added requested documentation.
Thanks @jimmywan!
I'd like to make this change in order to support using BaggingRegressor on top of pipelines that handle imputation.
Reference Issue
#9708
What does this implement/fix? Explain your changes.
Removes finite checking on X. This seems wholly unnecessary, as whatever regressor BaggingRegressor wraps should already handle its own consistency checking.
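A minimal sketch of the motivating use case, assuming the era's Imputer (since replaced by SimpleImputer). Before this change, the outer fit raised on the NaN values before the pipeline's imputer ever saw them:

    import numpy as np
    from sklearn.ensemble import BaggingRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Imputer
    from sklearn.tree import DecisionTreeRegressor

    X = [[1, 3, 5], [2, None, 6], [2, np.nan, 6], [3, 4, 5], [4, 5, 6]]
    y = [2, 3, 3, 4, 5]

    # The inner pipeline handles its own imputation, so BaggingRegressor
    # should not reject non-finite values before delegating to it.
    pipeline = make_pipeline(Imputer(), DecisionTreeRegressor())
    bagging_regressor = BaggingRegressor(pipeline)
    print(bagging_regressor.fit(X, y).predict(X))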