FIX add metadata routing to CV splitters in `RidgeCV` and `RidgeClassifierCV` #29634

StefanieSenger · 2024-08-06T23:51:40Z

What does this implement/fix? Explain your changes.

This PR fixes a bug by adding metadata routing to CV splitters in RidgeCV and RidgeClassifierCV.

It was not discovered before by the tests, because
a) it only happens with grouped splitters such as GroupKFold
b) there are some automatisms at play when routing groups (for instance we don't need to request it for grouped splitters) as long as it's not crossing several layers of nestedness and the bug appears when we use RidgeCV and RidgeClassifierCV nested in another router (like cross_validate).

Edit: While working on the test to this (in test_metaestimators_metadata_routing.py), a second bug became visible and it was fixed by adding scorer.set_score_request(sample_weight=True) in fit() for the case the default scoring was used (because here the user doesn't have the chance to set it themselves). The second bug-fix also got a test added in test_ridge.py.

----- rest of this text describes a situation that is outdated now ----
I have tried to add a (somewhat clumsy) test, that tests that the fix works that also tests all the other estimators with internal splitting/CV for the same issue.

But alas, the test also uncovers an unrelated issue with RidgeClassifierCV, since the set_score_request syntax doesn't seem to work here and RidgeClassifierCV still raises and wants me to set a request that I have already set.

minimal reproducible for the original bug:

import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.model_selection import cross_validate
from sklearn.linear_model import RidgeCV

import sklearn
sklearn.set_config(enable_metadata_routing=True)

rng = np.random.RandomState(42)
X = rng.rand(200, 5)
y = rng.randint(0, 2, size=X.shape[0])
groups = rng.randint(0, 10, size=X.shape[0])

ridge = RidgeCV(cv=GroupKFold(n_splits=2))
cross_validate(ridge, X, y, 
    params={"groups": groups},
    # we are interested in the error that comes up with using GroupKFold within RidgeCV above, but we also need to put this to prevent a different error shown before the one we are now interested in: 
    cv=GroupKFold(n_splits=2) # (although this other error should not raise either, I think)
)

sklearn.set_config(enable_metadata_routing=False)

ValueError: The 'groups' parameter should not be None.

Traceback

<p>File "/home/stefanie/code/scikit-learn_dev/scikit-learn/zzzzz_metatdata_routing.py", line 23, in <module>
cross_validate(ridge, X, y, 
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/utils/_param_validation.py", line 216, in wrapper
return func(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/model_selection/_validation.py", line 435, in cross_validate
_warn_or_raise_about_fit_failures(results, error_score)
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/model_selection/_validation.py", line 521, in              _warn_or_raise_about_fit_failures
raise ValueError(all_fits_failed_message)
ValueError: 
All the 2 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/model_selection/_validation.py", line 885, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/base.py", line 1514, in wrapper
return fit_method(estimator, *args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/linear_model/_ridge.py", line 2673, in fit
super().fit(X, y, sample_weight=sample_weight, **params)
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/linear_model/_ridge.py", line 2433, in fit
grid_search.fit(X, y, **params)
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/base.py", line 1514, in wrapper
return fit_method(estimator, *args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/model_selection/_search.py", line 1015, in fit
self._run_search(evaluate_candidates)
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/model_selection/_search.py", line 1569, in _run_search
evaluate_candidates(ParameterGrid(self.param_grid))
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/model_selection/_search.py", line 973, in evaluate_candidates
for (cand_idx, parameters), (split_idx, (train, test)) in product(
                                                          ^^^^^^^^
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/model_selection/_split.py", line 411, in split
for train, test in super().split(X, y, groups):
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/model_selection/_split.py", line 142, in split
for test_index in self._iter_test_masks(X, y, groups):
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/model_selection/_split.py", line 154, in _iter_test_masks
for test_index in self._iter_test_indices(X, y, groups):
File "/home/stefanie/code/scikit-learn_dev/scikit-learn/sklearn/model_selection/_split.py", line 597, in _iter_test_indices
raise ValueError("The 'groups' parameter should not be None.")
ValueError: The 'groups' parameter should not be None. </p>

The new bug you can see in the CI.

CC @adrinjalali, your expertise will be of great help. :)

github-actions · 2024-08-06T23:52:54Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

_{Generated for commit: 6c73dd2. Link to the linter CI: here}

sklearn/tests/test_metaestimators_metadata_routing.py

adrinjalali

@glemaitre @OmarManzoor I think we can finalize / merge this one, and then open a PR to fix the issues we've been discussing on this thread: #27560 (comment)

sklearn/linear_model/tests/test_ridge.py

sklearn/tests/test_metaestimators_metadata_routing.py

OmarManzoor

Thanks for the PR @StefanieSenger

sklearn/linear_model/_ridge.py

sklearn/linear_model/tests/test_ridge.py

sklearn/tests/test_metaestimators_metadata_routing.py

StefanieSenger

Thank you @OmarManzoor and @ogrisel. I have adressed your suggestions in my newest commits.

sklearn/linear_model/tests/test_ridge.py

sklearn/tests/test_metaestimators_metadata_routing.py

OmarManzoor

LGTM. Thanks @StefanieSenger

OmarManzoor · 2024-09-18T06:22:23Z

@glemaitre Since you self requested a review I'll let you merge 😄

glemaitre · 2024-09-18T16:47:12Z

Thanks @OmarManzoor I'll quickly check. We might end-up with a working RidgeCV at some point :).

glemaitre · 2024-09-18T17:13:02Z

sklearn/linear_model/_ridge.py

+            if self.scoring is None:
+                scorer = None


I should be missing something. Why we would not combine the above call in the same statement?

if self.scoring is not None: scorer = self._get_scorer() else: scorer = None

Or is it because we want still to set the score_request but in which case I would expect

scorer = self._get_scorer() if self.scoring is None: scorer = None

I would probably add a small comment on the top to explain we do this procedure to avoid someone reverting to the first version.

Thanks for looking into this. I am not sure if I understand what you mean, @glemaitre.

I will try to explain what I meant:

We have to call self._get_scorer() in any case (even if scorer=None, because in this case we need to use RidgeCV's score() method via Passthrougscorer.

So the code needs to be like this:

scorer = self._get_scorer() ... if self.scoring is None: scorer = None

The scorer needs to be defined (scorer = self._get_scorer()) pretty early in fit(), because it is used in process_routing() already if metadata routing is enabled and cv is None.

But then below we need to reset scorer to None if the original user-inted was to not pass a scoring:

if self.scoring is None: scorer = None

If we would not do that, we would get an AttributeError: '_IdentityRegressor' object has no attribute 'score' because the scorer is already something else than None and we would pass something else than None into _RidgeGCV() and we go down the wrong path of the code.

Basically, we are misusing the variable scorer by having it have too many meanings at the same time. But maybe that's better than messing with the conventions?

I have added a comment to explain what I mean. Does it all make sense?

Yes completely. I missed that we were going to get the combination of _PassthroughScorer + _IdentityRegressor that end-up in a non-compatible scikit-learn estimator.

I think we can merge as-is. Maybe one clean-up could be to check that we get an _Identity... estimator that follow the scikit-learn API but I'm not sure that we can end-up with something working.

glemaitre · 2024-10-07T16:13:14Z

Thanks @StefanieSenger I think that we get closer to have a working RidgeCV :)

…ifierCV` (scikit-learn#29634) Co-authored-by: adrinjalali <adrin.jalali@gmail.com>

It's not ideal to require the very latest major release of sklearn. I have looked into downgrading before. It's not trivial, since there are features from 1.6 that I'm using. In particular, functionality related to custom estimators and metadata routing. This addresses some issues: - `validate_data` -> `self._validate_data` - No `expected_failed_checks` arg to `parametrize_with_checks` But we still have more to fix. E.g. we now have tests failing due to [this issue with metadata routing](scikit-learn/scikit-learn#29634).

fix routing to splitter in RidgeCV

e3d4c1a

github-actions bot added the module:linear_model label Aug 6, 2024

Properly set score request in RidgeCV

3956f2c

adrinjalali reviewed Aug 7, 2024

View reviewed changes

sklearn/tests/test_metaestimators_metadata_routing.py Show resolved Hide resolved

StefanieSenger and others added 8 commits September 9, 2024 11:55

Merge branch 'main' into routing_RidgeCV

bc4804e

fix scoring when scoring and cv are both default

8125194

route scorer from get_scoring to build register

2c2db3d

add _routing_enabled and revert previous non-fix

336057a

distinct behaviour if routing_enabled

feca19f

add another condition before setting scorer=None

1b67c17

correct condition for scorer = None and changelog

250b977

add test for fix on falsy UnsetMetadataPassedError

142e23c

adrinjalali reviewed Sep 12, 2024

View reviewed changes

sklearn/linear_model/tests/test_ridge.py Outdated Show resolved Hide resolved

sklearn/tests/test_metaestimators_metadata_routing.py Outdated Show resolved Hide resolved

OmarManzoor reviewed Sep 12, 2024

View reviewed changes

sklearn/linear_model/_ridge.py Show resolved Hide resolved

sklearn/linear_model/tests/test_ridge.py Outdated Show resolved Hide resolved

sklearn/tests/test_metaestimators_metadata_routing.py Outdated Show resolved Hide resolved

StefanieSenger added 2 commits September 14, 2024 15:56

fix test for metadata routing with group splitter

72f4286

fix test scoring error with SequentialFeatureSelector

d5d3a1c

StefanieSenger commented Sep 14, 2024

View reviewed changes

sklearn/linear_model/tests/test_ridge.py Outdated Show resolved Hide resolved

sklearn/tests/test_metaestimators_metadata_routing.py Outdated Show resolved Hide resolved

wording

06cddaa

glemaitre mentioned this pull request Sep 16, 2024

DOC add more developer doc in RidgeGCV #29857

Merged

adrinjalali approved these changes Sep 17, 2024

View reviewed changes

glemaitre self-requested a review September 17, 2024 16:09

OmarManzoor approved these changes Sep 18, 2024

View reviewed changes

glemaitre reviewed Sep 18, 2024

View reviewed changes

StefanieSenger and others added 3 commits October 2, 2024 06:45

avoid calling self._get_scorer() twice if metadata is routed

3ad1190

add comment to explain reset of scorer variable

8ee822d

Merge branch 'main' into routing_RidgeCV

6c73dd2

glemaitre self-requested a review October 7, 2024 16:00

glemaitre approved these changes Oct 7, 2024

View reviewed changes

glemaitre merged commit e749dd9 into scikit-learn:main Oct 7, 2024
30 checks passed

StefanieSenger deleted the routing_RidgeCV branch October 8, 2024 06:34

BenJourdan pushed a commit to gregoryschwartzman/scikit-learn that referenced this pull request Oct 11, 2024

FIX add metadata routing to CV splitters in RidgeCV and `RidgeClass…

39a3ca7

…ifierCV` (scikit-learn#29634) Co-authored-by: adrinjalali <adrin.jalali@gmail.com>

Uh oh!

FIX add metadata routing to CV splitters in RidgeCV and RidgeClassifierCV #29634

FIX add metadata routing to CV splitters in RidgeCV and RidgeClassifierCV #29634

Uh oh!

Conversation

StefanieSenger commented Aug 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this implement/fix? Explain your changes.

Uh oh!

github-actions bot commented Aug 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✔️ Linting Passed

Uh oh!

Uh oh!

adrinjalali left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

OmarManzoor left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

StefanieSenger left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

OmarManzoor left a comment

Choose a reason for hiding this comment

Uh oh!

OmarManzoor commented Sep 18, 2024

Uh oh!

glemaitre commented Sep 18, 2024

Uh oh!

glemaitre Sep 18, 2024

Choose a reason for hiding this comment

Uh oh!

StefanieSenger Oct 2, 2024

Choose a reason for hiding this comment

Uh oh!

glemaitre Oct 7, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

glemaitre commented Oct 7, 2024

Uh oh!

Uh oh!

FIX add metadata routing to CV splitters in `RidgeCV` and `RidgeClassifierCV` #29634

FIX add metadata routing to CV splitters in `RidgeCV` and `RidgeClassifierCV` #29634

StefanieSenger commented Aug 6, 2024 •

edited

Loading

github-actions bot commented Aug 6, 2024 •

edited

Loading