MultiOutputClassifier rely on estimator to provide pairwise tag #30236
Conversation
There are quite a few failing tests here:
FAILED tests/test_common.py::test_estimators[MultiOutputClassifier(estimator=LogisticRegression(C=1))-check_sample_weight_equivalence] - TypeError: MultiOutputClassifier.fit() missing 1 required positional argume...
FAILED tests/test_common.py::test_estimators[MultiOutputClassifier(estimator=LogisticRegression(C=1))-check_estimator_sparse_array] - AssertionError: Estimator MultiOutputClassifier doesn't seem to fail gracef...
FAILED tests/test_common.py::test_estimators[MultiOutputClassifier(estimator=LogisticRegression(C=1))-check_estimator_sparse_matrix] - AssertionError: Estimator MultiOutputClassifier doesn't seem to fail gracef...
FAILED tests/test_common.py::test_estimators[MultiOutputClassifier(estimator=LogisticRegression(C=1))-check_classifiers_one_label] - AssertionError: Classifier can't train when only one class is present.
FAILED tests/test_common.py::test_estimators[MultiOutputClassifier(estimator=LogisticRegression(C=1))-check_classifiers_one_label_sample_weights] - AssertionError: MultiOutputClassifier failed when fitted on one label after...
FAILED tests/test_common.py::test_estimators[MultiOutputClassifier(estimator=LogisticRegression(C=1))-check_classifiers_classes] - ValueError: y must have at least two dimensions for multi-output regression...
FAILED tests/test_common.py::test_estimators[MultiOutputClassifier(estimator=LogisticRegression(C=1))-check_classifiers_train] - AssertionError
FAILED tests/test_common.py::test_estimators[MultiOutputClassifier(estimator=LogisticRegression(C=1))-check_classifiers_train(readonly_memmap=True)] - AssertionError
FAILED tests/test_common.py::test_estimators[MultiOutputClassifier(estimator=LogisticRegression(C=1))-check_classifiers_train(readonly_memmap=True,X_dtype=float32)] - AssertionError
FAILED tests/test_common.py::test_estimators[MultiOutputClassifier(estimator=LogisticRegression(C=1))-check_methods_sample_order_invariance] - IndexError: list index out of range
@adrinjalali Thank you for the comment! These tests were failing because the `_skip_tests` tag was removed. I tested what happens when it is present and when it is not, and found that, ignoring the other changes implemented in this PR, simply removing the `_skip_tests` tag makes these tests fail. I have checked and seen that under the class
I will leave this tag as is and reset `_skip_tests` to True, although I could not find any documentation as to why it was originally set to True.
@Sean-Jay-M The issue is that our common tests were developed after this estimator. What you'd need to do is to figure out why each one fails, and either fix them or add them to the tests that are expected to fail:
We would greatly appreciate this work.
@adrinjalali Understood, working on it!
Two things -
Note: It has some issues which I will fix. Sorry for the slow progress, I'm busy and also trying to better learn the project at the same time.
"check_classifiers_one_label": ( | ||
"fit expects y to have at least two dimensions, tests provides y with one dimension", | ||
), |
this seems like there should be a tag controlling this? As in, we have `one_d_labels` and `two_d_labels` tags, don't they control the test?
You are correct, they do control the tests; however, I had to edit the estimator checks so that multi-label output models avoid the single-label output checks, using the tag `tags.target_tags.two_d_labels`.
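For context, a rough sketch of how a wrapper like MultiOutputClassifier could advertise two-dimensional labels through the tags. This assumes the scikit-learn >= 1.6 tags dataclass with `target_tags`; it illustrates the mechanism discussed above rather than the exact diff in this PR:

    def __sklearn_tags__(self):
        # Sketch: advertise multi-output, 2-D labels so the estimator checks
        # can route this estimator away from single-label-only checks.
        tags = super().__sklearn_tags__()
        tags.target_tags.single_output = False
        tags.target_tags.multi_output = True
        tags.target_tags.two_d_labels = True  # y is expected as (n_samples, n_outputs)
        return tags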
"fit expects y to have at least two dimensions, tests provides y with one dimension", | ||
), | ||
"check_classifiers_train": ( | ||
"predict returns an array of arrays, test expects array", |
I feel like `predict` at least should be returning the right form.
Apologies, the test expects `predict` to return `(n_samples,)` but it returns `(n_samples, n_outputs)`. I have updated this.
I have updated the PR. I have updated two tests and one check group:
I further updated the strings for skipping `check_classifiers_train`; you are correct. The test expects `predict` to return `(n_samples,)` but it returns `(n_samples, n_outputs)`.
@@ -1017,6 +1017,24 @@ def _yield_instances_for_check(check, estimator_orig):
        "sample_weight is not equivalent to removing/repeating samples."
    ),
},
MultiOutputClassifier: {
    "check_param_validation": ("fit expects argument Y, test provides argument y"),
I'd say we should deprecate `Y` and accept `y`.
Yes, we discussed this earlier. Would you like me to do this as part of this PR, or raise an issue and work on it in a dedicated PR?
this can be in this PR yes
"check_estimator_sparse_array": ( | ||
"predict_proba returns a list of arrays, test expects array" | ||
), | ||
"check_estimator_sparse_matrix": ( | ||
"predict_proba returns a list of arrays, test expects array" | ||
), |
didn't we have an issue on this, @glemaitre? I can only find #19880
The one that I recall is this one. Basically it is to decide whether we return a compact form (an array) or a list of arrays.
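For anyone following along, a small illustration of the shapes being discussed. This is the documented behaviour of MultiOutputClassifier; the toy data is only for demonstration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.multioutput import MultiOutputClassifier

    rng = np.random.RandomState(0)
    X = rng.rand(20, 4)
    y = rng.randint(0, 2, size=(20, 2))  # two binary outputs

    clf = MultiOutputClassifier(LogisticRegression()).fit(X, y)

    print(clf.predict(X).shape)        # (20, 2), i.e. (n_samples, n_outputs)
    proba = clf.predict_proba(X)       # a list with one array per output
    print(len(proba), proba[0].shape)  # 2 (20, 2), i.e. (n_samples, n_classes) per output

Whether to keep this list-of-arrays form or move to a single compact array is exactly the question in the issue referenced above.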
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Thank you for this advice, it's a tool I was unaware of! @adrinjalali
I have added a changelog entry as an API change. @glemaitre Other comments: I have re-implemented the changes discussed above, and I have undone the previous changes I made to `test_check_param_validation` as they are no longer needed given the approach above. One check has failed due to codecov coverage in the estimator checks; I'll need a pointer on how to fix this, or whether it needs fixing at all. If I remove the uncovered line, `LabelBinarizer` will expectedly fail the test because it expects two arguments, not three. Thank you :)
sklearn/multioutput.py
@@ -513,15 +513,15 @@ class MultiOutputClassifier(ClassifierMixin, _MultiOutputEstimator):
    def __init__(self, estimator, *, n_jobs=None):
        super().__init__(estimator, n_jobs=n_jobs)

-    def fit(self, X, Y, sample_weight=None, **fit_params):
+    def fit(self, X, y, sample_weight=None, **fit_params):
We need to deprecate `Y` away instead of simply renaming it, to avoid breaking code which was written as `est.fit(X=X, Y=y)`. We then need to handle the deprecation also inside the function. Have a look at our contributing guides regarding deprecations.
-    def fit(self, X, y, sample_weight=None, **fit_params):
+    @_deprecate_positional_args(version="1.9")
+    def fit(self, X, y, *, sample_weight=None, Y=None, **fit_params):
I believe I have completed this. I will leave you to mark it as resolved.
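For reference, a minimal sketch of what the handling inside `MultiOutputClassifier.fit` could look like under the suggested deprecation, assuming a FutureWarning-based transition. This is a method-body sketch meant to live on the class; the error message wording and version numbers are placeholders rather than the merged code:

    import warnings

    # Sketch only: handle the deprecated `Y` alias inside fit.
    def fit(self, X, y=None, *, sample_weight=None, Y=None, **fit_params):
        if Y is not None:
            if y is not None:
                raise ValueError("Pass the targets as `y`; the deprecated `Y` cannot be given together with `y`.")
            warnings.warn(
                "`Y` was renamed to `y` in 1.9 and will be removed in 2.1.",
                FutureWarning,
            )
            y = Y
        return super().fit(X, y, sample_weight=sample_weight, **fit_params)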
            getattr(estimator, method)(X, y)
        else:
            # The estimator is a label transformer and takes only `y`
            getattr(estimator, method)(y)
probably @jeremiedbb knows better as to why this line is not covered here, but the one below is.
For me this is a false positive, because locally `LabelBinarizer` in this PR will get through that branch and raise:
E sklearn.utils._param_validation.InvalidParameterError: The 'neg_label' parameter of LabelBinarizer must be an instance of 'int'. Got <sklearn.utils.estimator_checks.BadType object at 0x13ea44500> instead.
@adrinjalali @glemaitre Thank you for the pointers. I believe I followed the correct process for deprecation; please let me know if not. I followed the existing implementation, and I made a test case to cover all cases related to the interactions between `y` and the deprecated `Y`.
This test is named accordingly. I further had to alter two tests to use the keyword argument. Tests implemented:
I hope this is the last set of commits!
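As a rough idea of what such a test could look like: the test name, data, and match string below are illustrative placeholders, not necessarily the ones added in this PR:

    import numpy as np
    import pytest
    from sklearn.linear_model import LogisticRegression
    from sklearn.multioutput import MultiOutputClassifier

    def test_fit_deprecated_Y_alias_warns():
        # Hypothetical sketch: passing the deprecated `Y` alias should warn
        # and behave the same as passing `y`.
        rng = np.random.RandomState(0)
        X = rng.rand(10, 3)
        y = rng.randint(0, 2, size=(10, 2))

        clf = MultiOutputClassifier(LogisticRegression())
        with pytest.warns(FutureWarning, match="renamed to `y`"):
            clf.fit(X, Y=y)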
sklearn/multioutput.py
        super().fit(X, Y, sample_weight=sample_weight, **fit_params)
        if Y is not None:
            warnings.warn(
                "`Y` was renamed to `y` in 1.9 and will be removed in 2.1",
same here
done
"predict_proba returns a list of arrays, test expects array" | ||
), | ||
"check_classifiers_train": ( | ||
"predict returns (n_samples, n_outputs) test expects (n_samples,)" |
mmm, this is suspicious, needs either fixing the test or the code
I have fixed this by altering the test to suit the `predict` and `predict_proba` methods of `MultiOutputClassifier`. I reviewed and saw that other tests accounted for the output datatype of the class, so I have adapted to the existing tests.
Just back from break. Will work on this in the coming week.
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Thanks, some nits, and a bit that @glemaitre needs to check. Otherwise LGTM.
construction and fitting of estimators in this file needs quite a bit of revamp to use the same machinery as the new common tests, but best left for a different PR.
if isinstance(y_prob, list):
    for prob in y_prob:
        assert prob.shape[0] == n_samples
        assert prob.shape[1] == len(np.unique(y))
else:
    # Original check for single output classifiers
    assert y_prob.shape == (n_samples, n_classes)

if isinstance(y_prob, list):
    for prob_array in y_prob:
        assert_array_equal(np.argmax(prob_array, axis=1), y_pred)

        # check that probas for all classes sum to one
        assert_array_almost_equal(
            np.sum(prob_array, axis=1), np.ones(n_samples)
        )
else:
    assert_array_equal(np.argmax(y_prob, axis=1), y_pred)

    # check that probas for all classes sum to one
    assert_array_almost_equal(np.sum(y_prob, axis=1), np.ones(n_samples))
I'll need @glemaitre to check this section.
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
@glemaitre Should I drop or close this PR? Or is it possible this will get merged? Thanks!
Reference Issues/PRs
Fixes: #29016
What does this implement/fix? Explain your changes.
Enables `MultiOutputClassifier` to rely on a pairwise estimator.
This has been implemented by altering `__sklearn_tags__` through the introduction of the `get_tags` method. This change is similar to what is present in `multiclass.py`: `OneVsRestClassifier`. Implemented `single_output` and `multi_output` as present in `MultiOutputEstimator`.
Any other comments?
The discussion in #29016 mentions a possible fix utilizing features/methods that have since been altered/deprecated.
Conducted some basic tests swapping the SVC kernel value.
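As a sketch of the idea, the tag delegation could look roughly like the snippet below. This assumes the scikit-learn >= 1.6 tags API with `get_tags` and `input_tags.pairwise`; it illustrates the approach rather than the exact diff in this PR:

    from sklearn.utils import get_tags

    # Inside MultiOutputClassifier (sketch): inherit the pairwise tag from the
    # wrapped estimator, so that tools like cross-validation slice a
    # precomputed kernel matrix along both axes.
    def __sklearn_tags__(self):
        tags = super().__sklearn_tags__()
        tags.input_tags.pairwise = get_tags(self.estimator).input_tags.pairwise
        return tags

With that in place, a smoke test along the lines of the SVC kernel swap mentioned above would be expected to pass. Again a sketch, only meaningful once the pairwise tag is actually propagated:

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.model_selection import cross_val_score
    from sklearn.multioutput import MultiOutputClassifier
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = rng.rand(30, 5)
    y = rng.randint(0, 2, size=(30, 2))
    K = rbf_kernel(X)  # precomputed kernel, shape (n_samples, n_samples)

    clf = MultiOutputClassifier(SVC(kernel="precomputed"))
    print(cross_val_score(clf, K, y, cv=3))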