[MRG] ENH: Make StackingRegressor support Multioutput #27704


Open

hmasdev wants to merge 14 commits into scikit-learn:main from hmasdev:multioutput-stacking-regressor-available

Contributor

hmasdev commented Nov 2, 2023

Reference Issues/PRs

Related to #25597
Similar to #8547
Similar to #19223

What does this implement/fix? Explain your changes.

Added support for multioutput targets in StackingRegressor.
Added tests for the above changes.
Updated the docstring of StackingRegressor.
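For context, here is a minimal sketch of how multioutput regression can be obtained from stacking without this change, assuming the existing `MultiOutputRegressor` wrapper is acceptable. It fits one independent `StackingRegressor` per target column, which is not the same as the native support this PR proposes. The data, estimator choices, and variable names are illustrative only.

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Illustrative data: two regression targets derived from two features.
rng = np.random.RandomState(0)
X = rng.randn(40, 2)
Y = X ** 2

stack = StackingRegressor(
    estimators=[("lr", LinearRegression()), ("svr", SVR())],
    final_estimator=Ridge(),
)

# MultiOutputRegressor clones the stack and fits one copy per target column,
# so each underlying StackingRegressor only ever sees a 1D target.
model = MultiOutputRegressor(stack).fit(X, Y)
Y_pred = model.predict(X)
print(Y_pred.shape)
```

Because each target gets its own stack, this wrapper cannot share base estimators or a final estimator across outputs.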

Any other comments?

I have the following concern:

Do we need any other tests?

hmasdev added 2 commits

November 2, 2023 12:45


          make StackingRegressor support Multioutput

4edc110


          update docstring of StackingRegressor

dd947bd

github-actions bot added the module:ensemble label

github-actions bot commented Nov 2, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: bcf4e0e.

hmasdev and others added 9 commits

February 3, 2024 11:04


          Merge branch 'main' into multioutput-stacking-regressor-available


          Merge branch 'main' into multioutput-stacking-regressor-available

7a7acc7


          test: update stacking regressor test with Ridge

f323593


          Merge branch 'main' into multioutput-stacking-regressor-available

b022b7b


          Merge branch 'main' into multioutput-stacking-regressor-available

afedb2d


          Merge branch 'main' into multioutput-stacking-regressor-available

4056ce5


          Merge branch 'main' into multioutput-stacking-regressor-available

08270f6


          Merge branch 'main' into multioutput-stacking-regressor-available

ae4afba


          Merge branch 'main' into multioutput-stacking-regressor-available

36232dc

Member

adrinjalali commented Apr 17, 2024

@OmarManzoor would you maybe have time to have a look here?

OmarManzoor reviewed


Contributor

OmarManzoor left a comment

Thank you for the PR @hmasdev. A few comments. I think we will also need a changelog entry for this.

sklearn/ensemble/_stacking.py Outdated

sklearn/ensemble/_stacking.py


          Commit suggestion: try-except to if syntax

da2f4b7

OmarManzoor reviewed


Contributor

OmarManzoor left a comment

Thank you for the changes. Here are a few more suggestions. Also, I think we are still not handling the case where an estimator/regressor that does not support multioutput is specified. Or do we not need to worry about such a case?

sklearn/ensemble/tests/test_stacking.py Outdated

Comment on lines 888 to 892

+                  # NOTE: In this case the estimator can predict almost exactly the target
+                  assert_allclose(
+                      y_pred,
+                      # NOTE: when the target is 2D but with a single output,
+                      #       the predictions are 1D because of column_or_1d

Contributor

OmarManzoor Apr 25, 2024

Suggested change

      
                # NOTE: In this case the estimator can predict almost exactly the target
          
                assert_allclose(
          
                    y_pred,
          
                    # NOTE: when the target is 2D but with a single output,
          
                    #       the predictions are 1D because of column_or_1d
          
                # NOTE: In this case the estimator can predict almost exactly the target.
          
                # When the target is 2D but with a single output the predictions are 1D 
          
                # because of column_or_1d
          
                assert_allclose(
          
                    y_pred,
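The test comment attributes the 1D predictions to `column_or_1d`. A small sketch of that utility's behavior, assuming it is the flattening step the comment refers to (the array names here are illustrative):

```python
import numpy as np
from sklearn.utils import column_or_1d

# A 2D target with a single column is flattened to 1D.
y_single = np.arange(5.0).reshape(-1, 1)
y_flat = column_or_1d(y_single)
print(y_flat.shape)

# A 2D target with several columns is rejected with a ValueError.
y_multi = np.arange(10.0).reshape(5, 2)
try:
    column_or_1d(y_multi)
    raised = False
except ValueError:
    raised = True
print(raised)
```

This is why a target of shape `(n_samples, 1)` and a target of shape `(n_samples,)` can end up producing predictions of the same 1D shape.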

sklearn/ensemble/tests/test_stacking.py Outdated

+                      rtol=acceptable_relative_tolerance,
+                      atol=acceptable_aboslute_tolerance,
+                  )
+                  # transform

Contributor

OmarManzoor Apr 25, 2024

Suggested change

# transform

sklearn/ensemble/tests/test_stacking.py Outdated

+                  )
+                  reg.fit(X_train, y_train)
+                  # predict

Contributor

OmarManzoor Apr 25, 2024

Suggested change

# predict

sklearn/ensemble/tests/test_stacking.py Outdated



def test_stacking_regressor_multioutput_with_passthrough():
    """Check that a stacking regressor with multioutput works"""

Contributor

OmarManzoor Apr 25, 2024

Suggested change

      
                """Check that a stacking regressor with multioutput works"""
          
                """Check that a stacking regressor with passthrough works with multioutput"""

sklearn/ensemble/tests/test_stacking.py Outdated

+                      rtol=acceptable_relative_tolerance,
+                      atol=acceptable_aboslute_tolerance,
+                  )
+                  # transform

Contributor

OmarManzoor Apr 25, 2024

Suggested change

# transform

sklearn/ensemble/tests/test_stacking.py Outdated



def test_stacking_regressor_multioutput():
    """Check that a stacking regressor with multioutput works"""

Contributor

OmarManzoor Apr 25, 2024

Suggested change

      
                """Check that a stacking regressor with multioutput works"""
          
                """Check that a stacking regressor works with multioutput"""

hmasdev added 2 commits

May 7, 2024 23:10


          Commit suggestion: update comments in test code

2c2a49c


          update docstring of _BaseStacking.fit

bcf4e0e

Contributor Author

hmasdev commented May 8, 2024

@OmarManzoor
Thank you for the additional comments. I applied your suggestions to the test code.

Also, I think we are still not handling the case where an estimator/regressor that does not support multioutput is specified. Or do we not need to worry about such a case?

Actually, I don't yet have a good idea for how to handle an estimator that does not support multiple outputs when it is used in a multioutput problem.
In the current implementation, if such an estimator is used in a multioutput problem, a ValueError is raised inside that estimator, as shown below.
If there were an API to determine whether an estimator supports multioutput, this case could be handled in StackingRegressor.fit.

Do you know such an API?

Python 3.10.13 (main, Feb 22 2024, 10:50:12) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from sklearn.ensemble import StackingRegressor
>>> from sklearn.svm import SVR
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> svr = SVR()
>>> model = StackingRegressor(estimators=[('lr', lr), ('svr', svr)])
>>> import numpy as np
>>> X = np.random.randn(10, 2)
>>> Y = X ** 2
>>> model.fit(X, Y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/workspace/scikit-learn/sklearn/ensemble/_stacking.py", line 973, in fit
    return super().fit(X, y, sample_weight)
  File "/root/workspace/scikit-learn/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/root/workspace/scikit-learn/sklearn/ensemble/_stacking.py", line 224, in fit
    self.estimators_ = Parallel(n_jobs=self.n_jobs)(
  File "/root/workspace/scikit-learn/sklearn/utils/parallel.py", line 67, in __call__
    return super().__call__(iterable_with_config)
  File "/root/workspace/scikit-learn/sklearn-env/lib/python3.10/site-packages/joblib/parallel.py", line 1918, in __call__
    return output if self.return_generator else list(output)
  File "/root/workspace/scikit-learn/sklearn-env/lib/python3.10/site-packages/joblib/parallel.py", line 1847, in _get_sequential_output
    res = func(*args, **kwargs)
  File "/root/workspace/scikit-learn/sklearn/utils/parallel.py", line 129, in __call__
    return self.function(*args, **kwargs)
  File "/root/workspace/scikit-learn/sklearn/ensemble/_base.py", line 40, in _fit_single_estimator
    estimator.fit(X, y, **fit_params)
  File "/root/workspace/scikit-learn/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/root/workspace/scikit-learn/sklearn/svm/_base.py", line 190, in fit
    X, y = self._validate_data(
  File "/root/workspace/scikit-learn/sklearn/base.py", line 650, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/root/workspace/scikit-learn/sklearn/utils/validation.py", line 1282, in check_X_y
    y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
  File "/root/workspace/scikit-learn/sklearn/utils/validation.py", line 1303, in _check_y
    y = column_or_1d(y, warn=True)
  File "/root/workspace/scikit-learn/sklearn/utils/validation.py", line 1370, in column_or_1d
    raise ValueError(
ValueError: y should be a 1d array, got an array of shape (10, 2) instead.
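One possible way to detect this situation ahead of time is to probe each estimator with a tiny 2D target. The sketch below is only an illustration of that idea: `accepts_2d_target` is a hypothetical helper, not a scikit-learn API, and it treats any `ValueError` raised during fitting as a lack of multioutput support, which may be too coarse. scikit-learn's estimator tags also record multioutput support, but the way to read them has differed between releases, so the sketch avoids them.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR


def accepts_2d_target(estimator, n_outputs=2, random_state=0):
    """Hypothetical helper: try fitting a clone on a tiny multioutput problem.

    Returns True when the fit succeeds and False when the estimator raises a
    ValueError (assumed here to mean that a 2D target is not supported).
    """
    rng = np.random.RandomState(random_state)
    X = rng.randn(10, 2)
    Y = rng.randn(10, n_outputs)
    try:
        # clone() leaves the caller's estimator untouched.
        clone(estimator).fit(X, Y)
    except ValueError:
        return False
    return True


lr_ok = accepts_2d_target(LinearRegression())
svr_ok = accepts_2d_target(SVR())
print(lr_ok, svr_ok)
```

The probe costs one extra fit per estimator, so it would fit better in an early validation step than in a hot path.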

Contributor Author

hmasdev commented May 8, 2024

Note that StackingClassifier already works for multilabel classification problems but not for multiclass-multioutput classification. I think the latter is out of scope for this PR.

>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.ensemble import StackingClassifier
>>> from sklearn.multioutput import MultiOutputClassifier
>>> X = np.random.randn(10, 2)
>>> Y = X > 0  # multilabel classification
>>> model = StackingClassifier(estimators=[('lr', MultiOutputClassifier(LogisticRegression(C=1e3))), ('lr2', MultiOutputClassifier(LogisticRegression(C=1e3)))], final_estimator=MultiOutputClassifier(LogisticRegression(C=1e3)))
>>> model.fit(X, Y)
StackingClassifier(estimators=[('lr',
                                MultiOutputClassifier(estimator=LogisticRegression(C=1000.0))),
                               ('lr2',
                                MultiOutputClassifier(estimator=LogisticRegression(C=1000.0)))],
                   final_estimator=MultiOutputClassifier(estimator=LogisticRegression(C=1000.0)))
>>> model.predict(X)[:3]
array([[ True,  True],
       [False,  True],
       [False, False]])
>>> model.predict_proba(X)[:3]
array([[1.55247883e-03, 7.05602027e-04],
       [9.99983741e-01, 2.31839536e-03],
       [9.99873471e-01, 9.99473360e-01]])
>>> Z = np.random.choice(range(3), size=X.shape)  # multiclass-multioutput classification
>>> model.fit(X, Z)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/workspace/scikit-learn/sklearn/ensemble/_stacking.py", line 669, in fit
    self._label_encoder = LabelEncoder().fit(y)
  File "/root/workspace/scikit-learn/sklearn/preprocessing/_label.py", line 97, in fit
    y = column_or_1d(y, warn=True)
  File "/root/workspace/scikit-learn/sklearn/utils/validation.py", line 1370, in column_or_1d
    raise ValueError(
ValueError: y should be a 1d array, got an array of shape (10, 2) instead.

Ref. https://scikit-learn.org/stable/modules/multiclass.html
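To make the distinction between these target types concrete, here is a short sketch using `sklearn.utils.multiclass.type_of_target`. The arrays are small hand-written examples rather than the random data from the session above.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Binary indicator matrix: each column is an independent yes/no label.
Y_multilabel = np.array([[True, False], [False, True], [True, True]])

# Each column takes one of three classes: multiclass-multioutput.
Z_multiclass = np.array([[0, 1], [2, 0], [1, 2]])

# Real-valued 2D target, as in multioutput regression.
Y_regression = np.array([[0.5, 1.5], [2.5, 0.1], [1.2, 3.4]])

kinds = [
    type_of_target(Y_multilabel),
    type_of_target(Z_multiclass),
    type_of_target(Y_regression),
]
print(kinds)
```

The second kind is the one that fails in the traceback above, where `LabelEncoder` rejects the 2D target.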


Labels

module:ensemble