Skip to content

FIX add metadata routing to CV splitters in RidgeCV and RidgeClassifierCV #29634

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Oct 7, 2024

Conversation

StefanieSenger
Copy link
Contributor

@StefanieSenger StefanieSenger commented Aug 6, 2024

What does this implement/fix? Explain your changes.

This PR fixes a bug by adding metadata routing to CV splitters in RidgeCV and RidgeClassifierCV.

It was not discovered before by the tests, because
a) it only happens with grouped splitters such as GroupKFold
b) there are some automatisms at play when routing groups (for instance we don't need to request it for grouped splitters) as long as it's not crossing several layers of nestedness and the bug appears when we use RidgeCV and RidgeClassifierCV nested in another router (like cross_validate).

Edit: While working on the test to this (in test_metaestimators_metadata_routing.py), a second bug became visible and it was fixed by adding scorer.set_score_request(sample_weight=True) in fit() for the case the default scoring was used (because here the user doesn't have the chance to set it themselves). The second bug-fix also got a test added in test_ridge.py.

----- rest of this text describes a situation that is outdated now ----
I have tried to add a (somewhat clumsy) test, that tests that the fix works that also tests all the other estimators with internal splitting/CV for the same issue.

But alas, the test also uncovers an unrelated issue with RidgeClassifierCV, since the set_score_request syntax doesn't seem to work here and RidgeClassifierCV still raises and wants me to set a request that I have already set.

minimal reproducible for the original bug:

import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.model_selection import cross_validate
from sklearn.linear_model import RidgeCV

import sklearn
sklearn.set_config(enable_metadata_routing=True)

rng = np.random.RandomState(42)
X = rng.rand(200, 5)
y = rng.randint(0, 2, size=X.shape[0])
groups = rng.randint(0, 10, size=X.shape[0])

ridge = RidgeCV(cv=GroupKFold(n_splits=2))
cross_validate(ridge, X, y, 
    params={"groups": groups},
    # we are interested in the error that comes up with using GroupKFold within RidgeCV above, but we also need to put this to prevent a different error shown before the one we are now interested in: 
    cv=GroupKFold(n_splits=2) # (although this other error should not raise either, I think)
)

sklearn.set_config(enable_metadata_routing=False)

ValueError: The 'groups' parameter should not be None.

Traceback
<p>File "/home/stefanie/code/scikit-learn_dev/scikit-learn/zzzzz_metatdata_routing.py", line 23, in <module>
cross_validate(ridge, X, y, 
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/utils/_param_validation.py", line 216, in wrapper
return func(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/model_selection/_validation.py", line 435, in cross_validate
_warn_or_raise_about_fit_failures(results, error_score)
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/model_selection/_validation.py", line 521, in              _warn_or_raise_about_fit_failures
raise ValueError(all_fits_failed_message)
ValueError: 
All the 2 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/model_selection/_validation.py", line 885, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/base.py", line 1514, in wrapper
return fit_method(estimator, *args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/linear_model/_ridge.py", line 2673, in fit
super().fit(X, y, sample_weight=sample_weight, **params)
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/linear_model/_ridge.py", line 2433, in fit
grid_search.fit(X, y, **params)
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/base.py", line 1514, in wrapper
return fit_method(estimator, *args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/model_selection/_search.py", line 1015, in fit
self._run_search(evaluate_candidates)
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/model_selection/_search.py", line 1569, in _run_search
evaluate_candidates(ParameterGrid(self.param_grid))
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/model_selection/_search.py", line 973, in evaluate_candidates
for (cand_idx, parameters), (split_idx, (train, test)) in product(
                                                          ^^^^^^^^
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/model_selection/_split.py", line 411, in split
for train, test in super().split(X, y, groups):
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/model_selection/_split.py", line 142, in split
for test_index in self._iter_test_masks(X, y, groups):
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/model_selection/_split.py", line 154, in _iter_test_masks
for test_index in self._iter_test_indices(X, y, groups):
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/model_selection/_split.py", line 597, in _iter_test_indices
raise ValueError("The 'groups' parameter should not be None.")
ValueError: The 'groups' parameter should not be None. </p>

The new bug you can see in the CI.

CC @adrinjalali, your expertise will be of great help. :)

Copy link

github-actions bot commented Aug 6, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 6c73dd2. Link to the linter CI: here

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@glemaitre @OmarManzoor I think we can finalize / merge this one, and then open a PR to fix the issues we've been discussing on this thread: #27560 (comment)

Copy link
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @StefanieSenger

Copy link
Contributor Author

@StefanieSenger StefanieSenger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @OmarManzoor and @ogrisel. I have adressed your suggestions in my newest commits.

@glemaitre glemaitre self-requested a review September 17, 2024 16:09
Copy link
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @StefanieSenger

@OmarManzoor
Copy link
Contributor

@glemaitre Since you self requested a review I'll let you merge 😄

@glemaitre
Copy link
Member

Thanks @OmarManzoor I'll quickly check. We might end-up with a working RidgeCV at some point :).

Comment on lines +2400 to +2401
if self.scoring is None:
scorer = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should be missing something. Why we would not combine the above call in the same statement?

if self.scoring is not None:
    scorer = self._get_scorer()
else:
    scorer = None

Or is it because we want still to set the score_request but in which case I would expect

scorer = self._get_scorer()
if self.scoring is None:
    scorer = None

I would probably add a small comment on the top to explain we do this procedure to avoid someone reverting to the first version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this. I am not sure if I understand what you mean, @glemaitre.

I will try to explain what I meant:

We have to call self._get_scorer() in any case (even if scorer=None, because in this case we need to use RidgeCV's score() method via Passthrougscorer.

So the code needs to be like this:

scorer = self._get_scorer()  
...
if self.scoring is None:  
    scorer = None  

The scorer needs to be defined (scorer = self._get_scorer()) pretty early in fit(), because it is used in process_routing() already if metadata routing is enabled and cv is None.

But then below we need to reset scorer to None if the original user-inted was to not pass a scoring:

if self.scoring is None:  
    scorer = None 

If we would not do that, we would get an AttributeError: '_IdentityRegressor' object has no attribute 'score' because the scorer is already something else than None and we would pass something else than None into _RidgeGCV() and we go down the wrong path of the code.

Basically, we are misusing the variable scorer by having it have too many meanings at the same time. But maybe that's better than messing with the conventions?

I have added a comment to explain what I mean. Does it all make sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes completely. I missed that we were going to get the combination of _PassthroughScorer + _IdentityRegressor that end-up in a non-compatible scikit-learn estimator.

I think we can merge as-is. Maybe one clean-up could be to check that we get an _Identity... estimator that follow the scikit-learn API but I'm not sure that we can end-up with something working.

@glemaitre glemaitre self-requested a review October 7, 2024 16:00
@glemaitre glemaitre merged commit e749dd9 into scikit-learn:main Oct 7, 2024
30 checks passed
@glemaitre
Copy link
Member

Thanks @StefanieSenger I think that we get closer to have a working RidgeCV :)

@StefanieSenger StefanieSenger deleted the routing_RidgeCV branch October 8, 2024 06:34
BenJourdan pushed a commit to gregoryschwartzman/scikit-learn that referenced this pull request Oct 11, 2024
…ifierCV` (scikit-learn#29634)

Co-authored-by: adrinjalali <adrin.jalali@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants