
ENH default routing policy for sample weight #30887


Draft: antoinebaker wants to merge 1 commit into main

Conversation

antoinebaker (Contributor)

Reference Issues/PRs

Towards #26179 (default policy for sample_weight) and meta-issue #16298 (making metaestimators satisfy the sample_weight equivalence check #29818 out of the box when metadata routing is enabled).

What does this implement/fix? Explain your changes.

In this draft PR, the sample_weight metadata is requested by default in all estimators (for the fit and score methods) and in scorers (for the score method).

This mirrors the strategy used for the groups metadata, which is requested by default in groups-consuming splitters such as GroupKFold.
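
For reference, the groups default is implemented with a small mixin; below is a condensed sketch of how it looks in sklearn/model_selection/_split.py and how the resulting request can be inspected (the exact repr may differ across versions):

from sklearn.model_selection import GroupKFold

# Groups-consuming splitters inherit a mixin that pre-sets the `split`
# request, roughly:
#
#     class GroupsConsumerMixin(_MetadataRequester):
#         __metadata_request__split = {"groups": True}
#
# so `groups` is requested without any user call:
print(GroupKFold().get_metadata_routing())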

Motivation

The motivation is to make metaestimators handle sample_weight correctly by default, without the user having to spell out every request. For example, consider this metaestimator:

import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import SplineTransformer, StandardScaler

scaler = StandardScaler()
spline = SplineTransformer(knots="quantile")
union = FeatureUnion([("scaler", scaler), ("spline", spline)])
logistic = LogisticRegression()
pipe = Pipeline(steps=[("union", union), ("logistic", logistic)])
param_grid = {
    "union__spline__n_knots": [5, 6, 7],
    "logistic__C": np.logspace(-4, 4, 4),
}
search = GridSearchCV(pipe, param_grid, n_jobs=2)

We would like the search metaestimator to handle sample_weight correctly, for instance to respect the repeated/reweighted equivalence check #29818. In scikit-learn, this is only possible with metadata routing enabled.
With the current routing policy, one needs to explicitly request sample_weight for all objects/methods:

scaler.set_fit_request(sample_weight=True)
spline.set_fit_request(sample_weight=True)
logistic.set_fit_request(sample_weight=True)
logistic.set_score_request(sample_weight=True)
pipe.set_score_request(sample_weight=True)
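
For completeness, metadata routing must be enabled globally before the set_*_request calls above will work at all; a full run then looks roughly like this (X, y and sw stand in for the user's data):

import sklearn

sklearn.set_config(enable_metadata_routing=True)  # must come before set_*_request
search.fit(X, y, sample_weight=sw)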

This works: the metaestimator satisfies the repeated/reweighted equivalence check. The downside is that it is very verbose, and the user can easily forget a request or add an unnecessary one. It requires the user to know the inner mechanics of each sub-estimator and router precisely, which somewhat defeats the purpose of the metaestimator. Ideally, we would like the metaestimator to "just work" when using sample_weight.

In this draft PR, sample_weight is requested by default, so search.fit handles sample_weight correctly out of the box and satisfies the repeated/reweighted equivalence check.
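
To make the check concrete, here is a minimal sketch of the repeated/reweighted equivalence property from #29818, shown on a plain estimator rather than the full search: fitting with integer weights should match fitting on a dataset where each row is repeated that many times.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
y = rng.randn(20)
sw = rng.randint(1, 4, size=20)

# The weighted fit and the fit on the row-repeated dataset minimize the
# same objective, so the coefficients should coincide.
weighted = Ridge().fit(X, y, sample_weight=sw)
repeated = Ridge().fit(np.repeat(X, sw, axis=0), np.repeat(y, sw))
np.testing.assert_allclose(weighted.coef_, repeated.coef_)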

Any other comments?

This is a very crude draft and it breaks many tests! The current "solution" is wrong, but I hope it can serve as a basis for criticism, feedback on desired goals, and implementation ideas going forward.

For instance:

  • what about estimators that don't support sample_weight?
  • what about metaestimators like BaggingRegressor, which should consume sample_weight for sampling but not forward it to the sub-estimators (SLEP006: default routing #26179 (comment))?
  • should we instead implement a default sample_weight="auto" in the routing API?
  • what would the intended behavior of the "auto" option be?

cc @adrinjalali @ogrisel

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: a8b91cc.

    __metadata_request__score: dict[str, Union[bool, str]] = {"sample_weight": True}


class BaseEstimator(_HTMLDocumentationLinkMixin, _SampleWeightEstimatorMixin):
A member commented:
I think this should apply to the estimators individually, not to BaseEstimator. Applying it here would have a lot of impact on external libraries.

@@ -153,7 +154,20 @@ def _clone_parametrized(estimator, *, safe=True):
     return new_object
 
 
-class BaseEstimator(_HTMLDocumentationLinkMixin, _MetadataRequester):
+class _SampleWeightEstimatorMixin(_MetadataRequester):
@adrinjalali (Member) commented on Feb 24, 2025:
We probably want to have two mixins, one for fit and one for score.

    .. versionadded:: 1.7
    """

    __metadata_request__fit: dict[str, Union[bool, str]] = {"sample_weight": True}
A member commented:

IIUC, I don't think we want to have a constant default routing w/o user input. By that, I mean we want to have a "default routing" setting which is enabled with a global config, something along the lines of:

sklearn.set_config(enable_metadata_routing="default_routing")

That would enable a default-routing kind of behavior. Also, it seems to me that the default routing should be at the instance level, not the class level; I remember cases where routing depends on constructor arguments (the sub-estimator, specifically). It could be something along the lines of:

class MyEstimator(...):
    def __sklearn_default_routing__(self):
        return {"fit": {"sample_weight": True}}

We need to build that in the routing machinery, and then enable that for our estimators.
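
To make the instance-level idea concrete, a hypothetical sketch (neither __sklearn_default_routing__ nor the machinery behind it exists yet, and the signature check is just one possible heuristic):

import inspect

from sklearn.base import BaseEstimator


class MyMetaEstimator(BaseEstimator):
    def __init__(self, estimator):
        self.estimator = estimator

    def __sklearn_default_routing__(self):
        # Route sample_weight to fit only if the wrapped sub-estimator
        # accepts it; this is why a class-level constant is not enough:
        # the answer depends on a constructor argument.
        fit_params = inspect.signature(self.estimator.fit).parameters
        return {"fit": {"sample_weight": "sample_weight" in fit_params}}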

Another member replied:

I think the point of having a default routing for sample weight is that it's enabled by default (similarly to what we do for the CV groups).

@adrinjalali (Member) commented:

Notes from our call related to tags and documenting sample_weight support:

  • When an estimator doesn't support sample_weight by design, meaning the correct way to deal with sample_weight is to ignore it, we can add a note to the docstring of the relevant method. For instance:
--- a/sklearn/preprocessing/_polynomial.py
+++ b/sklearn/preprocessing/_polynomial.py
@@ -167,6 +167,7 @@ class PolynomialFeatures(TransformerMixin, BaseEstimator):
 
     Notes
     -----
+
     Be aware that the number of features in the output array scales
     polynomially in the number of features of the input array, and
     exponentially in the degree. High degrees can cause overfitting.
@@ -311,6 +312,10 @@ class PolynomialFeatures(TransformerMixin, BaseEstimator):
         """
         Compute number of output features.
 
+        .. note::
+            `sample_weight` is not used in this estimator since the correct way to
+            handle it here is to ignore it.
+
         Parameters
         ----------
         X : {array-like, sparse matrix} of shape (n_samples, n_features)
  • Estimators declare, via tags, how they handle sample_weight, and whether the correct behavior is to not have sample_weight at all; something along the lines of the diff below (a small signature-introspection sketch follows it):
--- a/sklearn/utils/_tags.py
+++ b/sklearn/utils/_tags.py
@@ -217,6 +217,13 @@ class Tags:
     regressor_tags : :class:`RegressorTags` or None
         The regressor tags.
 
+    sample_weight or how_sample_weight_is_used : dict, or SampleWeightUsageTags
+        This tag is here to show what the estimator does with sample_weight.
+        It can for instance be `{"fit": True, "score": False}` if the estimator
+        uses sample_weight in `fit` but not in `score`.
+        By default, the info is fetched from the signature of the methods,
+        and estimators can override it.
+
     array_api_support : bool, default=False
         Whether the estimator supports Array API compatible inputs.
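
As a rough illustration of the "fetched from the signature of the methods" default, a hypothetical helper (sample_weight_usage is not part of scikit-learn):

import inspect

from sklearn.linear_model import LogisticRegression


def sample_weight_usage(estimator, methods=("fit", "score")):
    # Infer sample_weight support from method signatures, as the proposed
    # tag default would; estimators could still override the result.
    usage = {}
    for name in methods:
        method = getattr(estimator, name, None)
        usage[name] = (
            method is not None
            and "sample_weight" in inspect.signature(method).parameters
        )
    return usage


print(sample_weight_usage(LogisticRegression()))  # {'fit': True, 'score': True}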
