-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
ENH default routing policy for sample weight #30887
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
ENH default routing policy for sample weight #30887
Conversation
__metadata_request__score: dict[str, Union[bool, str]] = {"sample_weight": True} | ||
|
||
|
||
class BaseEstimator(_HTMLDocumentationLinkMixin, _SampleWeightEstimatorMixin): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should apply to the estimators individually, and not BaseEstimator
. This here would have a lot of impact on external libraries.
@@ -153,7 +154,20 @@ def _clone_parametrized(estimator, *, safe=True): | |||
return new_object | |||
|
|||
|
|||
class BaseEstimator(_HTMLDocumentationLinkMixin, _MetadataRequester): | |||
class _SampleWeightEstimatorMixin(_MetadataRequester): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we probably want to have two mixins, for fit
and score
.
.. versionadded:: 1.7 | ||
""" | ||
|
||
__metadata_request__fit: dict[str, Union[bool, str]] = {"sample_weight": True} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, I don't think we want to have a constant default routing w/o user input. By that, I mean we want to have a "default routing" setting which is enabled with a global config. Something along the lines of
sklearn.set_config(enable_metadata_routing="default_routing")
And that would enable a default routing kinda thing. Also, the default routing seems to me that it should be on the instance level and not the class level. I remember cases where routing would depend on constructor arguments (sub-estimator specifically). It could be something along the lines of:
class MyEstimator(...):
def __sklearn_default_routing__(self):
return {"fit": {"sample_weight": True}}
We need to build that in the routing machinery, and then enable that for our estimators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the point of having a default routing for sample weight is that it's enabled by default (similarly to what we do for the CV groups).
Notes from our call related to tags and documenting sample_weight support:
--- a/sklearn/preprocessing/_polynomial.py
+++ b/sklearn/preprocessing/_polynomial.py
@@ -167,6 +167,7 @@ class PolynomialFeatures(TransformerMixin, BaseEstimator):
Notes
-----
+
Be aware that the number of features in the output array scales
polynomially in the number of features of the input array, and
exponentially in the degree. High degrees can cause overfitting.
@@ -311,6 +312,10 @@ class PolynomialFeatures(TransformerMixin, BaseEstimator):
"""
Compute number of output features.
+ .. note::
+ `sample_weight` is not used in this estimator since the correct way to
+ handle it here is to ignore it.
+
Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
--- a/sklearn/utils/_tags.py
+++ b/sklearn/utils/_tags.py
@@ -217,6 +217,13 @@ class Tags:
regressor_tags : :class:`RegressorTags` or None
The regressor tags.
+ sample_weight or how_sample_weight_is_used : dict, or SampleWeightUsageTags
+ This tag is here to show what the estimator does with sample_weight.
+ It can for instance be `{"fit": True, "score": False}` if the estimator
+ uses sample_weight in `fit` but not in `score`.
+ By default, the info is fetched from the signature of the methods,
+ and estimators can override it.
+
array_api_support : bool, default=False
Whether the estimator supports Array API compatible inputs. |
Reference Issues/PRs
Towards #26179 (default policy for
sample_weight
) and meta-issue #16298 (making metaestimators satisfy thesample_weight
equivalence check #29818 out of the box when metadata routing is enabled).What does this implement/fix? Explain your changes.
In this draft PR, the
sample_weight
metadata is requested by default in all estimators (for fit and score methods) and scorers (for the score method).It's a similar strategy to the
groups
metadata which is requested by default ingroups
-consuming splitters likeGroupKFold
.Motivation
The motivation behind is to make metaestimators handle
sample_weight
correctly by default, without the user needing to specify all requests. For example consider this metaestimator:We would like the
search
metaestimator to handlesample_weight
correctly, for instance to respect the repeated/reweighted equivalence check #29818. In scikit-learn, this is only possible with metadata routing enabled.With the current routing policy one needs to explicitly requests
sample_weight
for all objects/methods:This works, the metaestimator satisfies the repeated/reweighted equivalence check. On the downside it is very verbose, and the user might easily forget a request or add an unnecessary one. It requires the user to know each subestimator and router inner mechanics precisely, which kind of defeats the purpose of the metaestimator. Ideally we would like the metaestimator to "just work" when using
sample_weight
.In this draft PR,
sample_weight
is requested by default, andsearch.fit
handlessample_weight
correctly out of the box and satisfy the repeated/reweighted equivalence check.Any other comments?
This is a very crude draft and it breaks many tests! The current "solution" is wrong, but I hope it can serve as a basis for criticism, feedback on desired goals, and implementation ideas going forward.
For instance:
sample_weight
?BaggingRegressor
which should consumesample_weight
for sampling but not forward them to subestimator SLEP006: default routing #26179 (comment) ?sample_weight="auto"
in the routing API ?cc @adrinjalali @ogrisel