Skip to content

FIX RecursionError bug with metadata routing in metaestimators with scoring #28712

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
Apr 23, 2024

Conversation

StefanieSenger
Copy link
Contributor

I found a bug that causes a RecursionError whenever RidgeCV or RidgeClassifierCV are routing metadata without defining the scoring init param (so it defaults to None).

I wrote a fix for that: adding a condition in _BaseRidgeCV._get_scorer() now results in _get_scorer to return None (instead of entering a recursive loop via creating an new _PassthroughScorer in check_scoring()). This was the behaviour before metadata routing was introduced, so I think this is what we want in this case.

This was not tested for yet and I have added a test.

I have also improved the documentation a bit along the way by adding a link.

Copy link

github-actions bot commented Mar 27, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 2ea8b32. Link to the linter CI: here

Copy link
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you for handling this fix @StefanieSenger

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the right solution. The issue is that _PassthroughScorer returns what's returned by the estimator, and that's never cached. I'm looking at the code to see what we can do to fix it.

Comment on lines 2101 to 2118
X = np.array([[0, 1], [2, 2], [4, 6], [9, 0], [2, 4]])
y = [1, 2, 3, 4, 5]

pipe = SimplePipeline(
[
ConsumingTransformer()
.set_fit_request(sample_weight=True)
.set_transform_request(sample_weight=True),
ConsumingTransformer()
.set_fit_request(sample_weight=True)
.set_transform_request(sample_weight=True),
metaestimator().set_fit_request(sample_weight=True),
]
)

params = {"sample_weight": [1, 1, 1, 1, 1]}

pipe.fit(X, y, **params)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual minimal reproducible would be:

    metaestimator().get_metadata_routing()

So we can remove the rest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is another (and more basic) way to trigger the error.

Copy link
Contributor Author

@StefanieSenger StefanieSenger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I think we could also allow None to be returned before creating the _PassthroughScorer in check_scoring. Even though I like the privious solution more, because is preserves the behaviour before metadata routing was introduced.

def _get_scorer(self):
    if self.scoring is not None:
        return check_scoring(self, scoring=self.scoring, allow_none=True)

will return None (as before) when self.scoring is None.

What do you think @adrinjalali and @OmarManzoor

@@ -48,7 +48,6 @@
ignore_warnings,
)
from sklearn.utils.fixes import (
_IS_32BIT,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why after my last push, this recent addition isn't imported from fixes anymore... Ruff moved it up.

Copy link
Contributor Author

@StefanieSenger StefanieSenger Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that not surprisingly raised. I moved it back to where it was and ruff seems to be content with it. 🤷

@adrinjalali
Copy link
Member

Another issue with this solution as is, is that it replaces existing scorer request with an empty one, and this scorer happens to support sample_weight.

@StefanieSenger
Copy link
Contributor Author

Another issue with this solution as is, is that it replaces existing scorer request with an empty one, and this scorer happens to support sample_weight.

Yes, true. I will revert this then, since the other patch did at least keep check_scoring intact and also imitated the old behaviour before metadata routing was introduced (with returning None).

@StefanieSenger StefanieSenger changed the title FIX RecursionError bug with metadata routing in RidgeCV and RidgeClassifierCV FIX RecursionError bug with metadata routing in metaestimators with scoring Apr 18, 2024
@StefanieSenger
Copy link
Contributor Author

About the pickling error:

With _PassthroughScorer now inheriting from _MetadataRequester, many metaestimators now have estimator.scorer_.set_score_request, which makes them unpicklable.

check_estimators_pickle thus fails.

I think that no estimator would have been pickable since metadata routing was introduced, so I wonder how come that these metaestimators have not failed this check before. I also wonder why only a few metaestimators are tested for this check? Do we simply exclude estimators, that would not pass this test?

@adrinjalali, can you help here?

@StefanieSenger
Copy link
Contributor Author

The current solution fails on test_PassthroughScorer_metadata_request , because now _PassthrouScorer doesn't pass through all the routed metadata anymore, but only its own.

I am not sure if this is doing any harm though.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed a commit with the fix. Making sure set_score_attribute is a method instead of an attribute, which cannot be pickled.

if hasattr(estimator, "set_score_request"):
self.set_score_request = estimator.set_score_request

requests = get_routing_for_object(self._estimator)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is exactly the culprit causing the recursion error that we're trying to avoid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have seen it and changed it back.

def __init__(self, estimator):
self._estimator = estimator
if hasattr(estimator, "set_score_request"):
self.set_score_request = estimator.set_score_request
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is making a descriptor (a class attribute which is not pickled) an instance attribute which needs to be pickled which cannot since it returns a function.

Copy link
Contributor Author

@StefanieSenger StefanieSenger Apr 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the set_score_request method builds some specific data for each instance which can be pickled.

But for me it seems that estimator.set_score_request has been an instance attribute/method before already, only belonging to estimator, instead of to the scorer instance. Edit: okay they are not.

@adrinjalali
Copy link
Member

doesn't pass through all the routed metadata anymore, but only its own.

That's expected now, you can fix the test.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@OmarManzoor wanna have another look?

Copy link
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I think there are two instances of the added lines not being covered by tests. Can we add them?

@adrinjalali
Copy link
Member

For testing the new code, I think we can basically check that check_scoring(estimator, None) returns something where we can both do set_score_request and get_metadata_routing is correct.

@StefanieSenger
Copy link
Contributor Author

For testing the new code, I think we can basically check that check_scoring(estimator, None) returns something where we can both do set_score_request and get_metadata_routing is correct.

Thanks, @adrinjalali. But I'm not sure: Do you meant to do the set_score_request on an estimator, then pass it through check_scoring(estimator, None) and then do a similar check as in test_PassthroughScorer_metadata_request with something like this:

    assert_request_equal(
        scorer.get_metadata_routing(),
        {"score": {"sample_weight": "alias"}},
    )

Is this a check for correct routing?

There is also _BaseScorer.set_score_request(), which is very similar code and I would be curious to find out how this is tested. It seems I don't really know how to look for it.

@StefanieSenger
Copy link
Contributor Author

I have tried do write a test for the new set_score_request method. I have to admit that I am not sure if scorer.get_metadata_routing() and scorer._metadata_request ever could have different values and if this a valid test.

Do you like to have a look, @OmarManzoor or @adrinjalali?

Comment on lines 1296 to 1306
@pytest.mark.usefixtures("enable_slep006")
def test_PassthroughScorer_set_score_request():
"""Test that _PassthroughScorer.set_score_request adds the correct metadata request
on itself."""
meta_est = GridSearchCV(estimator=LinearSVC(), param_grid={"C": [0.1, 1]})

# make a `_PassthroughScorer` with `check_scoring`:
scorer = check_scoring(meta_est, None)
scorer.set_score_request(sample_weight=True)

assert str(scorer.get_metadata_routing()) == str(scorer._metadata_request)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be made more minimal:

--- a/sklearn/metrics/tests/test_score_objects.py
+++ b/sklearn/metrics/tests/test_score_objects.py
@@ -1297,13 +1297,11 @@ def test_PassthroughScorer_metadata_request():
 def test_PassthroughScorer_set_score_request():
     """Test that _PassthroughScorer.set_score_request adds the correct metadata request
     on itself."""
-    meta_est = GridSearchCV(estimator=LinearSVC(), param_grid={"C": [0.1, 1]})
-
     # make a `_PassthroughScorer` with `check_scoring`:
-    scorer = check_scoring(meta_est, None)
-    scorer.set_score_request(sample_weight=True)
+    scorer = check_scoring(LogisticRegression(), None)
+    scorer.set_score_request(sample_weight='my_weights')
 
-    assert str(scorer.get_metadata_routing()) == str(scorer._metadata_request)
+    assert scorer.get_metadata_routing().score.requests['sample_weight'] == 'my_weights'

Copy link
Contributor Author

@StefanieSenger StefanieSenger Apr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @adrinjalali, but I find this very confusing. What is the behaviour we want _PassthroughScorer to have and WHY?

Consider this test, where I have tried to unify this new test with the one right above:

    """Test that _PassthroughScorer.set_score_request adds the correct metadata request
    on itself."""
    # make a `_PassthroughScorer` with `check_scoring`:
    scorer = check_scoring(LogisticRegression().set_score_request(sample_weight="alias"), None)
    scorer.set_score_request(sample_weight='my_weights')
    assert scorer.get_metadata_routing().score.requests['sample_weight'] == 'my_weights'

When we would not do the scorer.set_score_request(sample_weight='my_weights'), then we would want to assert that sample_weight == "alias". At least this is what is done in the test above. Obviously, _PassthroughScorers functionality has changed and maybe the test above is not valid anymore. But what are we trying to archive here?

First, we had detected a RecursionError on some (Stacking* is now also affected) metaestimators, then we fixed this by making _PassthroughScorer less passy-through, but here we are testing on normal consumers (LogisticRegression) and anything that has a score method.

When an estimator uses the default scoring, _PassthroughScorer should actually not change routed metadata, but if we do a set_score_request on it, it definitely changes routed metadata and we like that and want to test if it really does??

Is this rather a by-product, that will never be used, and we only maintain to keep codecov happy?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You raise a good point there about carrying the request from the original estimator. So the more complete test would be:

    est = LogisticRegression().set_score_request(sample_weight="alias")
    scorer = check_scoring(est, None)
    assert scorer.get_metadata_routing().score.requests['sample_weight'] == 'alias'
    scorer.set_score_request(sample_weight='my_weights')
    assert scorer.get_metadata_routing().score.requests['sample_weight'] == 'my_weights'
    # making sure changing the passthrough object doesn't affect the estimator.
    assert est.get_metadata_routing().score.requests['sample_weight'] == 'alias'

This is about having a correct public API, it's not about just making codecov happy.

Note that we're also making sure we have a non-regression test for the original issue. This is another test to make sure the added functionality behaves correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think I now see what we want. Let me sum up:

We want a default scorer (_PassthroughScorer), that adopts the routing set on the estimator it scores for, but that doesn't force the estimator to change its own routing.
Since the default scorer is now newly inheriting from _MetadataRequester it now newly acts as a real consumer (while before this was only as if) and we can set its own routings. This is part of the public API, because check_scoring is.

Unfortunately however, _PassthroughScorer does change its estimators routing. The last assert fails.

I will try to find out why.

def __init__(self, estimator):
self._estimator = estimator

requests = MetadataRequest(owner=self._estimator.__class__.__name__)
try:
requests.score = estimator._metadata_request.score
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fix the issue

Suggested change
requests.score = estimator._metadata_request.score
requests.score = deepcopy(estimator._metadata_request.score)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, yes. So simple.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. Another look @OmarManzoor ?

@OmarManzoor OmarManzoor merged commit 78675d1 into scikit-learn:main Apr 23, 2024
30 checks passed
@StefanieSenger StefanieSenger deleted the routing_bug branch April 23, 2024 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants