MultiOutputClassifier rely on estimator to provide pairwise tag #30236
Conversation
There are quite a few failing tests here:
FAILED tests/test_common.py::test_estimators[MultiOutputClassifier(estimator=LogisticRegression(C=1))-check_sample_weight_equivalence] - TypeError: MultiOutputClassifier.fit() missing 1 required positional argume...
FAILED tests/test_common.py::test_estimators[MultiOutputClassifier(estimator=LogisticRegression(C=1))-check_estimator_sparse_array] - AssertionError: Estimator MultiOutputClassifier doesn't seem to fail gracef...
FAILED tests/test_common.py::test_estimators[MultiOutputClassifier(estimator=LogisticRegression(C=1))-check_estimator_sparse_matrix] - AssertionError: Estimator MultiOutputClassifier doesn't seem to fail gracef...
FAILED tests/test_common.py::test_estimators[MultiOutputClassifier(estimator=LogisticRegression(C=1))-check_classifiers_one_label] - AssertionError: Classifier can't train when only one class is present.
FAILED tests/test_common.py::test_estimators[MultiOutputClassifier(estimator=LogisticRegression(C=1))-check_classifiers_one_label_sample_weights] - AssertionError: MultiOutputClassifier failed when fitted on one label after...
FAILED tests/test_common.py::test_estimators[MultiOutputClassifier(estimator=LogisticRegression(C=1))-check_classifiers_classes] - ValueError: y must have at least two dimensions for multi-output regression...
FAILED tests/test_common.py::test_estimators[MultiOutputClassifier(estimator=LogisticRegression(C=1))-check_classifiers_train] - AssertionError
FAILED tests/test_common.py::test_estimators[MultiOutputClassifier(estimator=LogisticRegression(C=1))-check_classifiers_train(readonly_memmap=True)] - AssertionError
FAILED tests/test_common.py::test_estimators[MultiOutputClassifier(estimator=LogisticRegression(C=1))-check_classifiers_train(readonly_memmap=True,X_dtype=float32)] - AssertionError
FAILED tests/test_common.py::test_estimators[MultiOutputClassifier(estimator=LogisticRegression(C=1))-check_methods_sample_order_invariance] - IndexError: list index out of range
@adrinjalali Thank you for the comment! These tests were failing because the `_skip_tests` tag was removed. I tested what happens when it is present and when it is not, and found that, ignoring the other changes implemented in this PR, simply removing the `_skip_tests` tag makes these tests fail. I have checked and seen that under the class
I will leave this tag as is and reset `_skip_tests` to True, although I could not find any documentation as to why it was originally set to True.
@Sean-Jay-M The issue is that our common tests were developed after this estimator. What you'd need to do is to figure out why each one fails, and either fix them or add them to the tests that are expected to fail:
We would greatly appreciate this work.
@adrinjalali Understood, working on it!
Two things -
Note: It has some issues which I will fix. Sorry for the slow progress, I'm busy and also trying to better learn the project at the same time.
"check_classifiers_one_label": ( | ||
"fit expects y to have at least two dimensions, tests provides y with one dimension", | ||
), |
this seems like there should be a tag controlling this? As in, we have `one_d_labels` and `two_d_labels` tags, don't they control the test?
You are correct, they do control the tests; however, I had to edit the estimator checks so that multi-label output models avoid the single-label output checks, using the tag `tags.target_tags.two_d_labels`.
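For context, a rough sketch of how a wrapper like MultiOutputClassifier could advertise two-dimensional labels through the tags. This assumes the scikit-learn >= 1.6 tags dataclass with `target_tags`; it illustrates the mechanism discussed above rather than the exact diff in this PR:

    def __sklearn_tags__(self):
        # Sketch: advertise multi-output, 2-D labels so the estimator checks
        # can route this estimator away from single-label-only checks.
        tags = super().__sklearn_tags__()
        tags.target_tags.single_output = False
        tags.target_tags.multi_output = True
        tags.target_tags.two_d_labels = True  # y is expected as (n_samples, n_outputs)
        return tags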
"fit expects y to have at least two dimensions, tests provides y with one dimension", | ||
), | ||
"check_classifiers_train": ( | ||
"predict returns an array of arrays, test expects array", |
I feel like `predict` at least should be returning the right form.
Apologies, the test expects `predict` to return `(n_samples,)` but it returns `(n_samples, n_outputs)`. I have updated this.
I have updated the PR. I have updated two tests and one check group:
I further updated the strings for skipping `check_classifiers_train`; you are correct. The test expects `predict` to return `(n_samples,)` but it returns `(n_samples, n_outputs)`.
@@ -1017,6 +1017,24 @@ def _yield_instances_for_check(check, estimator_orig):
        "sample_weight is not equivalent to removing/repeating samples."
    ),
},
MultiOutputClassifier: {
    "check_param_validation": ("fit expects argument Y, test provides argument y"),
I'd say we should deprecate `Y` and accept `y`.
Yes, we discussed this earlier. Would you like me to do this as part of this PR, or raise an issue and work on it in a dedicated PR?
this can be in this PR yes
"check_estimator_sparse_array": ( | ||
"predict_proba returns a list of arrays, test expects array" | ||
), | ||
"check_estimator_sparse_matrix": ( | ||
"predict_proba returns a list of arrays, test expects array" | ||
), |
didn't we have an issue on this, @glemaitre? I can only find #19880
The one that I recall is this one. Basically it is to decide whether we return a compact form (an array) or a list of arrays.
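For anyone following along, a small illustration of the shapes being discussed. This is the documented behaviour of MultiOutputClassifier; the toy data is only for demonstration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.multioutput import MultiOutputClassifier

    rng = np.random.RandomState(0)
    X = rng.rand(20, 4)
    y = rng.randint(0, 2, size=(20, 2))  # two binary outputs

    clf = MultiOutputClassifier(LogisticRegression()).fit(X, y)

    print(clf.predict(X).shape)        # (20, 2), i.e. (n_samples, n_outputs)
    proba = clf.predict_proba(X)       # a list with one array per output
    print(len(proba), proba[0].shape)  # 2 (20, 2), i.e. (n_samples, n_classes) per output

Whether to keep this list-of-arrays form or move to a single compact array is exactly the question in the issue referenced above.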
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Thank you for this advice, it's a tool I was unaware of! @adrinjalali
I have added a changelog entry as an API change. @glemaitre Other comments: I have re-implemented the changes discussed above, and I have undone the previous changes I made to `test_check_param_validation` as they are no longer needed given the approach above. One check has failed due to codecov coverage in the estimator checks; I'll need a pointer on how to fix this, or whether it needs fixing at all. If I remove the uncovered line, `LabelBinarizer` will expectedly fail the test because it expects two arguments, not three. Thank you :)
sklearn/multioutput.py
@@ -513,15 +513,15 @@ class MultiOutputClassifier(ClassifierMixin, _MultiOutputEstimator):
    def __init__(self, estimator, *, n_jobs=None):
        super().__init__(estimator, n_jobs=n_jobs)

-    def fit(self, X, Y, sample_weight=None, **fit_params):
+    def fit(self, X, y, sample_weight=None, **fit_params):
We need to deprecate `Y` away instead of simply renaming it, to avoid breaking code which was written as `est.fit(X=X, Y=y)`. We then need to handle the deprecation also inside the function. Have a look at our contributing guides regarding deprecations.
-    def fit(self, X, y, sample_weight=None, **fit_params):
+    @_deprecate_positional_args(version="1.9")
+    def fit(self, X, y, *, sample_weight=None, Y=None, **fit_params):
I believe I have completed this. I will leave you to mark it as resolved.
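For reference, a minimal sketch of what the handling inside `MultiOutputClassifier.fit` could look like under the suggested deprecation, assuming a FutureWarning-based transition. This is a method-body sketch meant to live on the class; the error message wording and version numbers are placeholders rather than the merged code:

    import warnings

    # Sketch only: handle the deprecated `Y` alias inside fit.
    def fit(self, X, y=None, *, sample_weight=None, Y=None, **fit_params):
        if Y is not None:
            if y is not None:
                raise ValueError("Pass the targets as `y`; the deprecated `Y` cannot be given together with `y`.")
            warnings.warn(
                "`Y` was renamed to `y` in 1.9 and will be removed in 2.1.",
                FutureWarning,
            )
            y = Y
        return super().fit(X, y, sample_weight=sample_weight, **fit_params)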
            getattr(estimator, method)(X, y)
        else:
            # The estimator is a label transformer and takes only `y`
            getattr(estimator, method)(y)
probably @jeremiedbb knows better as to why this line is not covered here, but the one below is.
For me this is a false positive, because locally `LabelBinarizer` in this PR will get through that branch and raise:
E sklearn.utils._param_validation.InvalidParameterError: The 'neg_label' parameter of LabelBinarizer must be an instance of 'int'. Got <sklearn.utils.estimator_checks.BadType object at 0x13ea44500> instead.
@adrinjalali @glemaitre Thank you for the pointers. I believe I followed the correct process for deprecation; please let me know if not. I followed the existing implementation, and I made a test case to cover all cases related to the interactions between `y` and the deprecated `Y`.
This test is named accordingly. I further had to alter two tests to use the keyword argument. Tests implemented:
I hope this is the last set of commits!
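As a rough idea of what such a test could look like: the test name, data, and match string below are illustrative placeholders, not necessarily the ones added in this PR:

    import numpy as np
    import pytest
    from sklearn.linear_model import LogisticRegression
    from sklearn.multioutput import MultiOutputClassifier

    def test_fit_deprecated_Y_alias_warns():
        # Hypothetical sketch: passing the deprecated `Y` alias should warn
        # and behave the same as passing `y`.
        rng = np.random.RandomState(0)
        X = rng.rand(10, 3)
        y = rng.randint(0, 2, size=(10, 2))

        clf = MultiOutputClassifier(LogisticRegression())
        with pytest.warns(FutureWarning, match="renamed to `y`"):
            clf.fit(X, Y=y)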
sklearn/multioutput.py
        super().fit(X, Y, sample_weight=sample_weight, **fit_params)
        if Y is not None:
            warnings.warn(
                "`Y` was renamed to `y` in 1.9 and will be removed in 2.1",
same here
done
"predict_proba returns a list of arrays, test expects array" | ||
), | ||
"check_classifiers_train": ( | ||
"predict returns (n_samples, n_outputs) test expects (n_samples,)" |
mmm, this is suspicious, needs either fixing the test or the code
I have fixed this by altering the test to suit the `predict` and `predict_proba` methods of `MultiOutputClassifier`. I reviewed and saw that other tests accounted for the output datatype of the class, so I have adapted to the existing tests.
Just back from break. Will work on this in the coming week.
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Thanks, some nits, and a bit that @glemaitre needs to check. Otherwise LGTM.
construction and fitting of estimators in this file needs quite a bit of revamp to use the same machinery as the new common tests, but best left for a different PR.
if isinstance(y_prob, list):
    for prob in y_prob:
        assert prob.shape[0] == n_samples
        assert prob.shape[1] == len(np.unique(y))
else:
    # Original check for single output classifiers
    assert y_prob.shape == (n_samples, n_classes)

if isinstance(y_prob, list):
    for prob_array in y_prob:
        assert_array_equal(np.argmax(prob_array, axis=1), y_pred)

        # check that probas for all classes sum to one
        assert_array_almost_equal(
            np.sum(prob_array, axis=1), np.ones(n_samples)
        )
else:
    assert_array_equal(np.argmax(y_prob, axis=1), y_pred)

    # check that probas for all classes sum to one
    assert_array_almost_equal(np.sum(y_prob, axis=1), np.ones(n_samples))
I'll need @glemaitre to check this section.
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
@glemaitre Should I drop or close this PR? Or is it possible this will get merged? Thanks!
Reference Issues/PRs
Fixes: #29016
What does this implement/fix? Explain your changes.
Enables `MultiOutputClassifier` to rely on a pairwise estimator.
This has been implemented by altering `__sklearn_tags__` through the introduction of the `get_tags` method. This change is similar to what is present in `multiclass.py`: `OneVsRestClassifier`. Implemented `single_output` and `multi_output` as present in `MultiOutputEstimator`.
Any other comments?
The discussion in #29016 mentions a possible fix utilizing features/methods that have since been altered/deprecated.
Conducted some basic tests swapping the SVC kernel value.
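As a sketch of the idea, the tag delegation could look roughly like the snippet below. This assumes the scikit-learn >= 1.6 tags API with `get_tags` and `input_tags.pairwise`; it illustrates the approach rather than the exact diff in this PR:

    from sklearn.utils import get_tags

    # Inside MultiOutputClassifier (sketch): inherit the pairwise tag from the
    # wrapped estimator, so that tools like cross-validation slice a
    # precomputed kernel matrix along both axes.
    def __sklearn_tags__(self):
        tags = super().__sklearn_tags__()
        tags.input_tags.pairwise = get_tags(self.estimator).input_tags.pairwise
        return tags

With that in place, a smoke test along the lines of the SVC kernel swap mentioned above would be expected to pass. Again a sketch, only meaningful once the pairwise tag is actually propagated:

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.model_selection import cross_val_score
    from sklearn.multioutput import MultiOutputClassifier
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = rng.rand(30, 5)
    y = rng.randint(0, 2, size=(30, 2))
    K = rbf_kernel(X)  # precomputed kernel, shape (n_samples, n_samples)

    clf = MultiOutputClassifier(SVC(kernel="precomputed"))
    print(cross_val_score(clf, K, y, cv=3))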