FIX `RecursionError` bug with metadata routing in metaestimators with scoring (#28712)
Conversation
LGTM. Thank you for handling this fix @StefanieSenger
This is not the right solution. The issue is that _PassthroughScorer
returns what's returned by the estimator, and that's never cached. I'm looking at the code to see what we can do to fix it.
```python
X = np.array([[0, 1], [2, 2], [4, 6], [9, 0], [2, 4]])
y = [1, 2, 3, 4, 5]

pipe = SimplePipeline(
    [
        ConsumingTransformer()
        .set_fit_request(sample_weight=True)
        .set_transform_request(sample_weight=True),
        ConsumingTransformer()
        .set_fit_request(sample_weight=True)
        .set_transform_request(sample_weight=True),
        metaestimator().set_fit_request(sample_weight=True),
    ]
)

params = {"sample_weight": [1, 1, 1, 1, 1]}

pipe.fit(X, y, **params)
```
The actual minimal reproducer would be:
metaestimator().get_metadata_routing()
So we can remove the rest.
Yes, this is another (and more basic) way to trigger the error.
Hm, I think we could also allow `None` to be returned before creating the `_PassthroughScorer` in `check_scoring`. Even though I like the previous solution more, because it preserves the behaviour before metadata routing was introduced.
```python
def _get_scorer(self):
    if self.scoring is not None:
        return check_scoring(self, scoring=self.scoring, allow_none=True)
```

It will return `None` (as before) when `self.scoring` is `None`.
What do you think, @adrinjalali and @OmarManzoor?
```diff
@@ -48,7 +48,6 @@
     ignore_warnings,
 )
 from sklearn.utils.fixes import (
-    _IS_32BIT,
```
I wonder why, after my last push, this recent addition isn't imported from `fixes` anymore... Ruff moved it up.
Okay, that unsurprisingly raised an error. I moved it back to where it was and ruff seems to be content with it. 🤷
Another issue with this solution as is: it replaces the existing scorer request with an empty one, and this scorer happens to support …
Yes, true. I will revert this then, since the other patch did at least keep …
Title changed: `RecursionError` bug with metadata routing in `RidgeCV` and `RidgeClassifierCV` → `RecursionError` bug with metadata routing in metaestimators with scoring
About the pickling error: with … I think that no estimator would have been picklable since metadata routing was introduced, so I wonder how these metaestimators have not failed this check before. I also wonder why only a few metaestimators are tested for this check. Do we simply exclude estimators that would not pass this test? @adrinjalali, can you help here?
The current solution fails on … I am not sure if this is doing any harm, though.
I've pushed a commit with the fix, making sure `set_score_request` is a method instead of an attribute, which cannot be pickled.
sklearn/metrics/_scorer.py
Outdated
```python
if hasattr(estimator, "set_score_request"):
    self.set_score_request = estimator.set_score_request

requests = get_routing_for_object(self._estimator)
```
this is exactly the culprit causing the recursion error that we're trying to avoid.
Yes, I have seen it and changed it back.
sklearn/metrics/_scorer.py
Outdated
```python
def __init__(self, estimator):
    self._estimator = estimator
    if hasattr(estimator, "set_score_request"):
        self.set_score_request = estimator.set_score_request
```
this makes a descriptor (a class attribute, which is not pickled) into an instance attribute, which does need to be pickled but cannot be, since it returns a function.
So the `set_score_request` method builds some specific data for each instance, which can be pickled. But it seems to me that `estimator.set_score_request` has already been an instance attribute/method before, only belonging to the estimator instead of to the scorer instance. Edit: okay, it is not.
That's expected now, you can fix the test.
LGTM.
@OmarManzoor wanna have another look?
LGTM. I think there are two instances of the added lines not being covered by tests. Can we add them?
For testing the new code, I think we can basically check that …
Thanks, @adrinjalali. But I'm not sure: do you mean to do the …
Is this a check for correct routing? There is also …
I have tried to write a test for the new `set_score_request` method. I have to admit that I am not sure if … Would you like to have a look, @OmarManzoor or @adrinjalali?
```python
@pytest.mark.usefixtures("enable_slep006")
def test_PassthroughScorer_set_score_request():
    """Test that _PassthroughScorer.set_score_request adds the correct metadata request
    on itself."""
    meta_est = GridSearchCV(estimator=LinearSVC(), param_grid={"C": [0.1, 1]})

    # make a `_PassthroughScorer` with `check_scoring`:
    scorer = check_scoring(meta_est, None)
    scorer.set_score_request(sample_weight=True)

    assert str(scorer.get_metadata_routing()) == str(scorer._metadata_request)
```
This could be made more minimal:
```diff
--- a/sklearn/metrics/tests/test_score_objects.py
+++ b/sklearn/metrics/tests/test_score_objects.py
@@ -1297,13 +1297,11 @@ def test_PassthroughScorer_metadata_request():
 def test_PassthroughScorer_set_score_request():
     """Test that _PassthroughScorer.set_score_request adds the correct metadata request
     on itself."""
-    meta_est = GridSearchCV(estimator=LinearSVC(), param_grid={"C": [0.1, 1]})
-
     # make a `_PassthroughScorer` with `check_scoring`:
-    scorer = check_scoring(meta_est, None)
-    scorer.set_score_request(sample_weight=True)
+    scorer = check_scoring(LogisticRegression(), None)
+    scorer.set_score_request(sample_weight='my_weights')

-    assert str(scorer.get_metadata_routing()) == str(scorer._metadata_request)
+    assert scorer.get_metadata_routing().score.requests['sample_weight'] == 'my_weights'
```
Thanks @adrinjalali, but I find this very confusing. What is the behaviour we want `_PassthroughScorer` to have, and WHY?
Consider this test, where I have tried to unify this new test with the one right above:
"""Test that _PassthroughScorer.set_score_request adds the correct metadata request
on itself."""
# make a `_PassthroughScorer` with `check_scoring`:
scorer = check_scoring(LogisticRegression().set_score_request(sample_weight="alias"), None)
scorer.set_score_request(sample_weight='my_weights')
assert scorer.get_metadata_routing().score.requests['sample_weight'] == 'my_weights'
If we did not do the `scorer.set_score_request(sample_weight='my_weights')`, then we would want to assert that `sample_weight == "alias"`. At least this is what is done in the test above. Obviously, `_PassthroughScorer`'s functionality has changed and maybe the test above is not valid anymore. But what are we trying to achieve here?
First, we detected a `RecursionError` on some metaestimators (Stacking* is now also affected), then we fixed this by making `_PassthroughScorer` less passy-through, but here we are testing on normal consumers (`LogisticRegression`) and anything that has a `score` method.
When an estimator uses the default scoring, `_PassthroughScorer` should actually not change routed metadata, but if we do a `set_score_request` on it, it definitely changes routed metadata, and we like that and want to test if it really does?? Is this rather a by-product that will never be used, and that we only maintain to keep codecov happy?
You raise a good point there about carrying the request from the original estimator. So the more complete test would be:
```python
est = LogisticRegression().set_score_request(sample_weight="alias")
scorer = check_scoring(est, None)
assert scorer.get_metadata_routing().score.requests['sample_weight'] == 'alias'

scorer.set_score_request(sample_weight='my_weights')
assert scorer.get_metadata_routing().score.requests['sample_weight'] == 'my_weights'

# making sure changing the passthrough object doesn't affect the estimator
assert est.get_metadata_routing().score.requests['sample_weight'] == 'alias'
```
This is about having a correct public API, it's not about just making codecov happy.
Note that we're also making sure we have a non-regression test for the original issue. This is another test to make sure the added functionality behaves correctly.
Okay, I think I now see what we want. Let me sum up:

We want a default scorer (`_PassthroughScorer`) that adopts the routing set on the estimator it scores for, but that doesn't force the estimator to change its own routing.

Since the default scorer now inherits from `_MetadataRequester`, it acts as a real consumer (while before it only behaved as if it were one) and we can set its own routings. This is part of the public API, because `check_scoring` is.

Unfortunately, however, `_PassthroughScorer` does change its estimator's routing. The last assert fails. I will try to find out why.
sklearn/metrics/_scorer.py
Outdated
```python
def __init__(self, estimator):
    self._estimator = estimator

    requests = MetadataRequest(owner=self._estimator.__class__.__name__)
    try:
        requests.score = estimator._metadata_request.score
```
This will fix the issue:

```diff
-        requests.score = estimator._metadata_request.score
+        requests.score = deepcopy(estimator._metadata_request.score)
```
Wow, yes. So simple.
This looks good to me. Another look @OmarManzoor ?
I found a bug that causes a `RecursionError` whenever `RidgeCV` or `RidgeClassifierCV` are routing metadata without defining the `scoring` init param (so it defaults to `None`).

I wrote a fix for that: adding a condition in `_BaseRidgeCV._get_scorer()` now results in `_get_scorer` returning `None` (instead of entering a recursive loop by creating a new `_PassthroughScorer` in `check_scoring()`). This was the behaviour before metadata routing was introduced, so I think this is what we want in this case. This was not tested for yet, and I have added a test.

I have also improved the documentation a bit along the way by adding a link.