
ENH default routing policy for sample weight #30887


Draft: antoinebaker wants to merge 1 commit into main

Conversation

antoinebaker (Contributor)

Reference Issues/PRs

Towards #26179 (default policy for sample_weight) and meta-issue #16298 (making metaestimators satisfy the sample_weight equivalence check #29818 out of the box when metadata routing is enabled).

What does this implement/fix? Explain your changes.

In this draft PR, the sample_weight metadata is requested by default in all estimators (for the fit and score methods) and in scorers (for the score method).

This mirrors the strategy used for the groups metadata, which is requested by default in groups-consuming splitters such as GroupKFold.
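
For reference, the groups default is implemented with a small mixin; below is a condensed sketch of how it looks in sklearn/model_selection/_split.py and how the resulting request can be inspected (the exact repr may differ across versions):

from sklearn.model_selection import GroupKFold

# Groups-consuming splitters inherit a mixin that pre-sets the `split`
# request, roughly:
#
#     class GroupsConsumerMixin(_MetadataRequester):
#         __metadata_request__split = {"groups": True}
#
# so `groups` is requested without any user call:
print(GroupKFold().get_metadata_routing())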

Motivation

The motivation is to make metaestimators handle sample_weight correctly by default, without the user having to spell out every request. For example, consider this metaestimator:

import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import SplineTransformer, StandardScaler

scaler = StandardScaler()
spline = SplineTransformer(knots="quantile")
union = FeatureUnion([("scaler", scaler), ("spline", spline)])
logistic = LogisticRegression()
pipe = Pipeline(steps=[("union", union), ("logistic", logistic)])
param_grid = {
    "union__spline__n_knots": [5, 6, 7],
    "logistic__C": np.logspace(-4, 4, 4),
}
search = GridSearchCV(pipe, param_grid, n_jobs=2)

We would like the search metaestimator to handle sample_weight correctly, for instance to respect the repeated/reweighted equivalence check #29818. In scikit-learn, this is only possible with metadata routing enabled.
With the current routing policy, one needs to explicitly request sample_weight for all objects/methods:

scaler.set_fit_request(sample_weight=True)
spline.set_fit_request(sample_weight=True)
logistic.set_fit_request(sample_weight=True)
logistic.set_score_request(sample_weight=True)
pipe.set_score_request(sample_weight=True)
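
For completeness, metadata routing must be enabled globally before the set_*_request calls above will work at all; a full run then looks roughly like this (X, y and sw stand in for the user's data):

import sklearn

sklearn.set_config(enable_metadata_routing=True)  # must come before set_*_request
search.fit(X, y, sample_weight=sw)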

This works: the metaestimator satisfies the repeated/reweighted equivalence check. The downside is that it is very verbose, and the user can easily forget a request or add an unnecessary one. It requires the user to know the inner mechanics of each sub-estimator and router precisely, which somewhat defeats the purpose of the metaestimator. Ideally, we would like the metaestimator to "just work" when using sample_weight.

In this draft PR, sample_weight is requested by default, so search.fit handles sample_weight correctly out of the box and satisfies the repeated/reweighted equivalence check.
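
To make the check concrete, here is a minimal sketch of the repeated/reweighted equivalence property from #29818, shown on a plain estimator rather than the full search: fitting with integer weights should match fitting on a dataset where each row is repeated that many times.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
y = rng.randn(20)
sw = rng.randint(1, 4, size=20)

# The weighted fit and the fit on the row-repeated dataset minimize the
# same objective, so the coefficients should coincide.
weighted = Ridge().fit(X, y, sample_weight=sw)
repeated = Ridge().fit(np.repeat(X, sw, axis=0), np.repeat(y, sw))
np.testing.assert_allclose(weighted.coef_, repeated.coef_)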

Any other comments?

This is a very crude draft and it breaks many tests! The current "solution" is wrong, but I hope it can serve as a basis for criticism, feedback on desired goals, and implementation ideas going forward.

For instance:

  • what about estimators that don't support sample_weight?
  • what about metaestimators like BaggingRegressor, which should consume sample_weight for sampling but not forward it to the sub-estimators (SLEP006: default routing #26179 (comment))?
  • should we instead implement a default sample_weight="auto" in the routing API?
  • what would the intended behavior of the "auto" option be?

cc @adrinjalali @ogrisel

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: a8b91cc.

    __metadata_request__score: dict[str, Union[bool, str]] = {"sample_weight": True}


class BaseEstimator(_HTMLDocumentationLinkMixin, _SampleWeightEstimatorMixin):
A member commented:
I think this should apply to the estimators individually, not to BaseEstimator. Applying it here would have a lot of impact on external libraries.

@@ -153,7 +154,20 @@ def _clone_parametrized(estimator, *, safe=True):
     return new_object
 
 
-class BaseEstimator(_HTMLDocumentationLinkMixin, _MetadataRequester):
+class _SampleWeightEstimatorMixin(_MetadataRequester):
@adrinjalali (Member) commented on Feb 24, 2025:
We probably want to have two mixins, one for fit and one for score.

    .. versionadded:: 1.7
    """

    __metadata_request__fit: dict[str, Union[bool, str]] = {"sample_weight": True}
A member commented:

IIUC, I don't think we want to have a constant default routing w/o user input. By that, I mean we want to have a "default routing" setting which is enabled with a global config, something along the lines of:

sklearn.set_config(enable_metadata_routing="default_routing")

That would enable a default-routing kind of behavior. Also, it seems to me that the default routing should be at the instance level, not the class level; I remember cases where routing depends on constructor arguments (the sub-estimator, specifically). It could be something along the lines of:

class MyEstimator(...):
    def __sklearn_default_routing__(self):
        return {"fit": {"sample_weight": True}}

We need to build that in the routing machinery, and then enable that for our estimators.
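
To make the instance-level idea concrete, a hypothetical sketch (neither __sklearn_default_routing__ nor the machinery behind it exists yet, and the signature check is just one possible heuristic):

import inspect

from sklearn.base import BaseEstimator


class MyMetaEstimator(BaseEstimator):
    def __init__(self, estimator):
        self.estimator = estimator

    def __sklearn_default_routing__(self):
        # Route sample_weight to fit only if the wrapped sub-estimator
        # accepts it; this is why a class-level constant is not enough:
        # the answer depends on a constructor argument.
        fit_params = inspect.signature(self.estimator.fit).parameters
        return {"fit": {"sample_weight": "sample_weight" in fit_params}}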

Another member replied:

I think the point of having a default routing for sample weight is that it's enabled by default (similarly to what we do for the CV groups).

@adrinjalali (Member) commented:

Notes from our call related to tags and documenting sample_weight support:

  • When an estimator doesn't support sample_weight by design, meaning the correct way to deal with sample_weight is to ignore it, we can add a note to the docstring of the relevant method. For instance:
--- a/sklearn/preprocessing/_polynomial.py
+++ b/sklearn/preprocessing/_polynomial.py
@@ -167,6 +167,7 @@ class PolynomialFeatures(TransformerMixin, BaseEstimator):
 
     Notes
     -----
+
     Be aware that the number of features in the output array scales
     polynomially in the number of features of the input array, and
     exponentially in the degree. High degrees can cause overfitting.
@@ -311,6 +312,10 @@ class PolynomialFeatures(TransformerMixin, BaseEstimator):
         """
         Compute number of output features.
 
+        .. note::
+            `sample_weight` is not used in this estimator since the correct way to
+            handle it here is to ignore it.
+
         Parameters
         ----------
         X : {array-like, sparse matrix} of shape (n_samples, n_features)
  • Estimators declare, via tags, how they handle sample_weight, and whether the correct behavior is to not have sample_weight at all; something along the lines of the diff below (a small signature-introspection sketch follows it):
--- a/sklearn/utils/_tags.py
+++ b/sklearn/utils/_tags.py
@@ -217,6 +217,13 @@ class Tags:
     regressor_tags : :class:`RegressorTags` or None
         The regressor tags.
 
+    sample_weight or how_sample_weight_is_used : dict, or SampleWeightUsageTags
+        This tag is here to show what the estimator does with sample_weight.
+        It can for instance be `{"fit": True, "score": False}` if the estimator
+        uses sample_weight in `fit` but not in `score`.
+        By default, the info is fetched from the signature of the methods,
+        and estimators can override it.
+
     array_api_support : bool, default=False
         Whether the estimator supports Array API compatible inputs.
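
As a rough illustration of the "fetched from the signature of the methods" default, a hypothetical helper (sample_weight_usage is not part of scikit-learn):

import inspect

from sklearn.linear_model import LogisticRegression


def sample_weight_usage(estimator, methods=("fit", "score")):
    # Infer sample_weight support from method signatures, as the proposed
    # tag default would; estimators could still override the result.
    usage = {}
    for name in methods:
        method = getattr(estimator, name, None)
        usage[name] = (
            method is not None
            and "sample_weight" in inspect.signature(method).parameters
        )
    return usage


print(sample_weight_usage(LogisticRegression()))  # {'fit': True, 'score': True}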
