[WIP] sample props (proposal 4) #16079
Conversation
Yay! I've been trying to find more time to work on the SLEP, but my break
was consumed visiting family. Thanks
Here's an example with a pipeline. @jnothman would you please check if it's [almost] what you had in mind? @NicolasHug you were interested in this one as well IIRC.

```python
from sklearn.model_selection import GroupKFold, cross_validate
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import make_scorer
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification()
sw = np.random.rand(len(X))
sw2 = [5, 6]
brand = ['my brand']

from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin
from sklearn.utils.validation import _validate_required_props
from sklearn.svm import SVC


class MyEst(ClassifierMixin, BaseEstimator):
    def __init__(self):
        self._props_request = {'fit': ['sample_weight', 'brand']}

    def fit(self, X, y, **fit_params):
        _validate_required_props(self, fit_params, 'fit')
        self._estimator = SVC().fit(X, y, fit_params['sample_weight'])
        print("{} received params {} of lengths {}".format(
            self.__class__.__name__,
            list(fit_params.keys()), [len(v) for v in fit_params.values()]))


class MyTrs(TransformerMixin, BaseEstimator):
    def __init__(self):
        self._props_request = {'fit': ['sample_weight']}

    def fit(self, X, y=None, **fit_params):
        _validate_required_props(self, fit_params, 'fit')
        self._estimator = SelectKBest().fit(X, y)
        print("{} received params {} of lengths {}".format(
            self.__class__.__name__,
            list(fit_params.keys()), [len(v) for v in fit_params.values()]))
        return self

    def transform(self, X, y=None):
        return self._estimator.transform(X)


clf = make_pipeline(MyTrs(), MyEst())
clf.fit(X, y, sample_weight=sw, brand=brand)

trs = MyTrs().set_props_request(
    None).set_props_request(
    {'fit': {'new_param': 'my_sw'}})

clf = make_pipeline(trs, MyEst())
clf.fit(X, y, sample_weight=sw, brand=brand, my_sw=sw2)
clf.fit(X, y, brand=brand)
```
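For reference, here is a minimal standalone sketch of what a `_validate_required_props`-style check could do; the helper name and behavior below are my own reconstruction (the real helper in this branch may differ). It verifies that exactly the props requested for a given method were passed:

```python
def validate_required_props(estimator, props, method):
    """Check that exactly the props requested for `method` were passed.

    Hypothetical stand-in for the private helper used in the example above:
    requested props are mandatory, and unrequested props are rejected.
    """
    requested = set(getattr(estimator, "_props_request", {}).get(method, []))
    received = set(props)
    missing = requested - received
    extra = received - requested
    if missing or extra:
        raise ValueError(
            f"{estimator.__class__.__name__}.{method}: "
            f"missing props {sorted(missing)}, unexpected props {sorted(extra)}"
        )


class Dummy:
    _props_request = {"fit": ["sample_weight", "brand"]}


# Passing exactly the requested props succeeds; anything else raises.
validate_required_props(Dummy(), {"sample_weight": [1, 2], "brand": ["x"]}, "fit")
```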
That example looks nice! I think I had not envisaged that all requested props would be required, i.e. an estimator could request sample_weight by default but only receive it if available. What are your thoughts on that? What I'm not sure about is whether we need to specify that the request is for fit; I suppose that makes things more flexible. I have looked at your modification to clone. Okay. But I've not yet looked at much else.
I also first thought of having it as "please give me that prop if you have it" rather than "I really need it". The reason I have it as a mandatory requirement is that it removes the chance of a meta-estimator not passing a required prop by mistake. My gut feeling was that it removes some of those bugs.
I think your other proposals had a more flexible system for routing the props around, especially when it comes to the fit/score/split distinction. Also, I thought this makes it easier to have some props going to transform/predict as well.
But this would be a mistake in the meta-estimator implementation, which should not be the user's problem?
This looks good! Edit: my previous comment was irrelevant after I finally read all of #4497 and SLEP006. My initial conception (and hacky implementation in a personal project) looked very similar to the single-argument, no-routing approach. Now for some comments (sorry if these are still uninformed):
I guess my main question/point is: can this be done in two phases? Phase 1 would be implementing the basic infrastructure (i.e. changing a lot of signatures throughout the API) so that… That said, how can I be of use to get this done?
This is good, but it needs work :)
sklearn/pipeline.py (outdated):

```diff
@@ -257,6 +258,24 @@ def _log_message(self, step_idx):
             len(self.steps),
             name)

+    def get_props_request(self):
```
There will be cases where the meta-estimator requests a prop, such as `sample_weight`, regardless of whether its descendant estimators also request props. The implementation here implies that the caller of `get_props_request` always wants the "deep" props of its descendants. Do we have use cases that would want us to be able to inspect only the locally consumed props?
I can't think of such a case off the top of my head, but if we do at some point, we can add a `deep` parameter here.
Happy to work through this more or assist with the changes
I think the consensus at the last meeting was to rename the prop request to a metadata request. How is this going? Is there something I can help with here?
The code is now a bit simpler, tests on `sklearn.metrics` pass, and `test_props.py` also passes.
I need to support the old behavior in grid search and pipeline somehow as well.
> I can't think of such a case off the top of my head, but if we do at some point, we can add a `deep` parameter here.
Continuing on this thought that only some estimators are PropConsumers, and those that are support only a small set of props, I wonder if we should make setting the prop consumer's request simpler / more explicit. For example, instead of `set_prop_request`, should we just have a parameter `request_sample_weight` in each sample_weight-consuming estimator?
What do you think of the idea above, to get rid of `set_metadata_request` and instead just add `request_sample_weight` parameters?

```
request_sample_weight : bool or str
    When this estimator is in a Pipeline or other meta-estimator, its fit
    method will receive the metadata passed as 'sample_weight' if
    request_sample_weight is True. If it is a string, it will receive the
    metadata passed with that key. If False, no value of sample_weight
    will be requested.
```
With this approach, the keys supported by a consumer become explicit, giving us:
- Free error checking of the expected prop name
- Free cloning
- Higher visibility
- Familiar API
- No need to distinguish API for setting vs updating an estimator's request
- No need to burden users with deciding which verbs are relevant to a request, i.e. is this a fit param or a transform param, because this is now controlled by what the estimator designer encodes about how the estimator consumes each specific parameter.
Disadvantages include needing to add a parameter to every estimator that supports weighting, and hence increased friction if we want to add more user-configurable requests.
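For concreteness, here is a toy sketch of how the constructor-parameter variant might behave; the class and method names are hypothetical illustrations, not the API in this PR:

```python
class WeightedEstimator:
    """Toy consumer with an explicit request_sample_weight constructor parameter."""

    def __init__(self, request_sample_weight=False):
        self.request_sample_weight = request_sample_weight

    def get_metadata_request(self):
        fit_request = {}
        if self.request_sample_weight is True:
            # request sample_weight under its own name
            fit_request["sample_weight"] = "sample_weight"
        elif isinstance(self.request_sample_weight, str):
            # request sample_weight under an aliased key
            fit_request["sample_weight"] = self.request_sample_weight
        # False: nothing is requested
        return {"fit": fit_request}


est = WeightedEstimator(request_sample_weight="my_weight")
est.get_metadata_request()  # {'fit': {'sample_weight': 'my_weight'}}
```

Validation of the expected prop name and cloning then come for free, since the request is an ordinary `get_params`/`set_params` constructor parameter.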
sklearn/pipeline.py (outdated):

```python
                fit_params_steps[step][param] = pval
    """
    # Remove old_behavior in 0.26
    old_behavior = np.any(['__' in pname for pname in fit_params.keys()])
```
Really anything not requested by a child should raise an error in the future.
`_validate_required_props` does that, or do you mean something else?
You're right. Though what I mean is that a child could request a parameter with a `__` in its name, and in the future it should not raise a warning or an error. I'm okay for it to warn now as well.
sklearn/pipeline.py
Outdated
for _, name, transformer in self._iter(filter_passthrough=True): | ||
if transformer is None: | ||
continue | ||
transformer_fit_props = transformer.get_metadata_request().fit | ||
try: | ||
transformer_fit_props = transformer.get_metadata_request().fit |
Do we want to require attribute access to get `fit`?
I don't understand, could you please elaborate?
I mean this assumes that the metadata request has an attribute called `fit`. Should we assume that?
`get_metadata_request()` will always return a `fit` attribute, even if it's empty, with the current implementation.
I don't like adding these parameters to the constructor. I'm still hoping we'd use this as a starting point for not having all parameters as constructor parameters. Other than that, the API you propose also doesn't have the flexibility of the routing we have right now. You can't route different sample weights to different estimators, or different metadata to different verbs (fit, transform, etc.), which the user might want to control and not leave to the developer of the estimator. I'm happy to introduce helper functions such as
I also tried another way to have an easy way to handle common fit parameters like sample_weight. You can see the remnants of the work here. I thought of having a Mixin for each parameter, like `FitRequestsSampleWeight`, which would set a parameter such as `_metadata_request__sample_weight`, and `get_metadata_request` would merge all such attributes together. The issue I encountered was how the user would change the default settings. If they do it through the `_metadata_request`, then some sort of overriding mechanism should exist, which is tricky if we allow users to pass a given prop to multiple ones, which they can do now: a single `sample_weight` can be passed to `fancy_sample_weight` and `nonfancy_sample_weight`, for example, if the estimator would accept both.
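A minimal sketch of that mixin idea, as I understand it (my own reconstruction: each mixin contributes a hypothetical `_metadata_request__<name>` class attribute, and `get_metadata_request` merges everything found on the MRO):

```python
class BaseConsumer:
    def get_metadata_request(self):
        # Merge every `_metadata_request__<name>` attribute found anywhere
        # in the class hierarchy into one {method: {param: alias}} dict.
        request = {"fit": {}}
        for klass in type(self).__mro__:
            for attr, value in vars(klass).items():
                if attr.startswith("_metadata_request__"):
                    for method, params in value.items():
                        request.setdefault(method, {}).update(params)
        return request


class FitRequestsSampleWeight:
    # Hypothetical per-parameter mixin: "my fit consumes sample_weight".
    _metadata_request__sample_weight = {"fit": {"sample_weight": "sample_weight"}}


class MyClassifier(FitRequestsSampleWeight, BaseConsumer):
    pass


MyClassifier().get_metadata_request()
# {'fit': {'sample_weight': 'sample_weight'}}
```

The overriding problem described above shows up exactly here: a user-level `set_metadata_request` would somehow have to take precedence over these class-level defaults.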
Yes, the assumption is that the verbs relevant to a parameter are determined by the meaning of the parameter to that consumer. So that parameter knows how to consume… A possibility: a method called
> then some sort of overriding mechanism should exist, which is tricky if we allow users to pass a given prop to multiple ones, which they can do now: a single sample_weight can be passed to fancy_sample_weight and nonfancy_sample_weight for example, if the estimator would accept both.
I'm not entirely sure what the issue is here. The consumer has a notion of sample_weight. If it had a notion of multiple different weights, like fitting_weight and calibration_weight or something, then each of them is a separate request. The semantics of something like `_metadata_request__sample_weight` are that the consumer knows about something called `sample_weight`, but the request instructs a metaestimator where to get it from.
Should we validate expected props in `set_metadata_request`? We validate param names in `set_params`. We could, a bit like set_params, check against the signature of the targeted method if `**kw` is not used.
I'm not loving how `_get_props_from_objs` is used in *SearchCV. The logic doesn't appear very clear. Is the following a helpful tool to make this more explicit and less error-prone?

```python
def build_router_metadata_request(children: Dict[str, Container[object]],
                                  routing: List[Tuple[str, str, str]]):
    """Compose the request that a router should return from get_metadata_request

    For example, GridSearchCV might use:

        def get_metadata_request(self):
            score_from = "base" if self.scoring is None else "scorers"
            return build_router_metadata_request(
                children={"base": [self.estimator],
                          "cv": [check_cv(self.cv)],
                          "scorers": check_multimetric_scoring(...)[0].values()},
                routing=[
                    ("base", "fit", "fit"),
                    ("cv", "fit", "split"),
                    (score_from, "fit", "score"),
                    (score_from, "score", "score"),
                    # XXX: we might want a special way to handle 'remainder'
                    ("base", "transform", "transform"),
                    ("base", "inverse_transform", "inverse_transform"),
                    ("base", "predict", "predict"),
                ])

    Parameters
    ----------
    children : dict
        Key is a role name (e.g. 'step') used as an identifier in the routing
        parameter. Value is a collection of objects that may provide
        get_metadata_request.
    routing : list of tuples
        Tuples are a triple of strings (child_role, router_method, child_method).
    """
    children_props = {k: _get_props_from_objs(v) for k, v in children.items()}
    out = {method: set() for method in _empty_metadata_request()}
    for child_role, router_method, child_method in routing:
        # note that we get rid of what the child wants each item to be called
        # and only keep the names we expect to be passed here
        out[router_method].update(children_props[child_role][child_method].keys())
    return out
```
Also: just thought of a tricky case that we probably won't handle for a while: a grid search over two different estimators where one of the estimators requests sample_weight and the other does not:

```python
GridSearchCV(Pipeline([('step', None)]),
             {"step": [LogisticRegression().set_metadata_request({'fit': ['sample_weight']}),
                       LogisticRegression()]})
```
Apologies for my patchwork reviews. Here I'm trying to make the tests more complete.

I still think the `set_metadata_request` API needs some work. I find it funny that for an estimator not supporting `sample_weight`, I can still run `.set_metadata_request({'fit': {'sample_weight': 'sample_weight'}})` without error, and that for an estimator supporting `sample_weight`, I can run `.set_metadata_request({'fit': {'sample_weight': 'ample_weight'}})` without error. (Hence my suggestion of `request_sample_weight` instead. I'd accept another solution, but am not really happy with the present one.)

The rest is shaping up very nicely!
How about this kind of thing for consumers?

```python
class MetadataConsumer:
    def _request_key_for_method(self, method, key, value):
        method_dict = self._metadata_request[method]
        if value is True:
            method_dict[key] = key
        elif value is False:
            method_dict.pop(key, None)
        elif isinstance(value, str):
            method_dict[key] = value
        else:
            raise TypeError


class SampleWeightConsumer(MetadataConsumer):
    def request_sample_weight(self, *, fit=True, score=False):
        """Define how to receive sample_weight from a parent meta-estimator

        When called with default arguments, fitting will be weighted,
        and the meta-estimator should be passed a fit parameter by the name
        'sample_weight'.

        Parameters
        ----------
        fit : str or bool, default=True
            The fit parameter name that a meta-estimator should pass to this
            estimator as sample_weight. If True, the name will be
            'sample_weight'. If False, no fit parameter will be passed.
        score : str or bool, default=False
            The parameter name that a meta-estimator should pass to this
            estimator as sample_weight when calling its `score` method.
            If True, the name will be 'sample_weight'. If False, no score
            parameter will be passed.

        Returns
        -------
        self

        Examples
        --------
        >>> import numpy as np
        >>> from sklearn.linear_model import LogisticRegression
        >>> from sklearn.model_selection import GridSearchCV
        >>> from sklearn.datasets import load_iris
        >>> X, y = load_iris(return_X_y=True)
        >>> sample_weight = np.random.rand(len(X))
        >>> clf = LogisticRegression()
        >>> gs = GridSearchCV(clf, {"C": [.1, 1, 10]})

        # Unweighted fitting and scoring
        >>> gs.fit(X, y)

        # Weighted fitting, unweighted scoring
        >>> clf.request_sample_weight()
        >>> gs.fit(X, y, sample_weight=sample_weight)

        # Weighted fitting and scoring
        >>> clf.request_sample_weight(fit=True, score=True)
        >>> gs.fit(X, y, sample_weight=sample_weight)

        # Weighted scoring only
        >>> clf.request_sample_weight(fit=False, score=True)
        >>> gs.fit(X, y, sample_weight=sample_weight)

        # Distinct weights for fit and score
        >>> score_sample_weight = np.random.rand(len(X))
        >>> clf.request_sample_weight(fit='fit_sample_weight',
        ...                           score='score_sample_weight')
        >>> gs.fit(X, y, fit_sample_weight=sample_weight,
        ...        score_sample_weight=score_sample_weight)
        """
        self._request_key_for_method('fit', 'sample_weight', fit)
        self._request_key_for_method('score', 'sample_weight', score)
        return self
```
I think we need that `deep` parameter for clone. `set_metadata_request` needs to only pertain to the needs of that estimator as consumer, not as router, and this is what should be copied in clone. However, for the purpose of actually doing the routing, there needs to be a way to get the "deep" metadata request.

On the topic of `request_*` methods: I think we should keep `set_metadata_request`, if only because of its use in `clone`, but we might emphasise that most users should not use it, as `request_*` methods should be a simpler, more explicit way of accessing the same functionality. But I'm struggling to work out how to neatly make `request_sample_weight` appear in only the consumers of `sample_weight`.

I'd now tweak the example in the previous comment such that neither `fit` nor `score` defaults to True or False. Both should default to None, and no change should occur.
I copy the variable itself now, this should solve the deep/shallow issue.
Need the method in metrics, but seems like we could do w/o for the other estimators. Happy to put it back. Made it private for now, since it's only used in
We just need to use the mixin only where it's appropriate. I'm not sure if I get where you see the challenge here.
Yes, agreed. There's another challenge, which is to have a set of defaults in different mixins, and have… I'm not totally happy with how it's done in… The tests now include the 4 use cases presented in the SLEP and they run, but I need to add tests to see if they're run correctly.
It works fine if there's no ambiguity. A more complicated model, e.g. with stacking, would mess that up.

Governance is not clear re updates to the SLEP. Does the proposed SLEP000 cover it?
SLEP000 has a "provisionally accepted":
I feel the safe choice would be to reject and stay in "Draft":
This way some of the concerns here can be resolved in the SLEP. https://github.com/scikit-learn/enhancement_proposals/pull/30/files
Hey @adrinjalali, I'd like to have this closer to the SLEP when we next bring it to vote, and I wonder whether:
@jnothman I'm happy to do both. I'm working on a cleaner solution for the same API. I'd be happy to have a sprint to work on this together as well.
Just to leave our scribbles from yesterday:

```python
# A messy case
# One option in grid search is weighted, the other is unweighted
# Messy because when do you work out all the requests, and that all metadata
# are consumed?
GridSearchCV(
    Pipeline([
        ('clf', None)
    ]),
    param_grid={"clf": [SVM().fit_requests(sample_weight=True),
                        SVM().fit_requests(sample_weight=False)],
                "clf__C": [1, 2, 3],
                }
).fit(X, y, sample_weight=sample_weight)


###############
# Request type behaviour

class UnchangedSingleton:
    __dict__ = None

    def __repr__(self):
        return "<UNCHANGED>"


from enum import Enum

class RequestType(Enum):
    UNREQUESTED = False
    REQUESTED = True
    ERROR_IF_PASSED = None
    UNCHANGED = UnchangedSingleton()


def fit_requests(self,
                 sample_weight: Union[RequestType, str, Literal[UNCHANGED]] = UNCHANGED):
    ...

fit_requests(sample_weight=metadata.REQUESTED)
fit_requests(sample_weight=metadata.UNREQUESTED)
fit_requests(sample_weight=metadata.ERROR_IF_PASSED)
fit_requests(sample_weight='my_weight')


###############
# Magic creation of `fit_requests` etc.

class RequestMethod:
    def __init__(self, name, keys):
        self.name = name
        self.keys = keys

    def __get__(self, instance, owner):
        # what we are avoiding
        def func(**kw):
            if set(kw) - set(self.keys):
                raise TypeError("Unexpected kws")
            instance._metadata_request = ...
            return instance
        func.__name__ = f"{self.name}_requests"
        func.__signature__ = inspect.Signature([
            inspect.Parameter(k,
                              inspect.Parameter.KEYWORD_ONLY,
                              default=metadata.UNCHANGED,
                              annotation=Optional[Union[RequestType, str]])
            for k in self.keys
        ], return_annotation=type(instance))
        func.__doc__ = "Something useful"
        return func


class BaseEstimator:
    def __init_subclass__(cls, **kwargs):
        metadata = defaultdict(set)
        for k, v in vars(cls).items():
            if k.startswith("_metadata_request__"):
                for request_method, request_keys in v.items():
                    metadata[request_method].update(request_keys)
        # now we have something like
        # metadata = {"fit": {"sample_weight", "groups"}}
        for request_method, request_keys in metadata.items():
            # set a fit_requests method
            setattr(cls, f"{request_method}_requests",
                    RequestMethod(request_method, sorted(request_keys)))
        """
        REQUEST_METHOD_MAPPING = [
            ('fit', "fit"),
            ("fit_transform", "fit"),
            ("fit_transform", "transform"),
            ("transform", "transform"),
        ]
        metadata = defaultdict(set)
        for method, request_method_name in REQUEST_METHOD_MAPPING:
            if method in attrs:
                # for example, if the class being instantiated has `fit`
                sig = inspect.signature(attrs[method])
                metadata[request_method_name].add()
        """
        super().__init_subclass__(**kwargs)
```
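Distilling those scribbles into something runnable (class and attribute names are my own guesses at the intended design, not the final API): `__init_subclass__` collects hypothetical `_metadata_request__<name>` class attributes and installs a descriptor-generated `fit_requests` method per routing method.

```python
import inspect
from collections import defaultdict


class RequestMethod:
    """Descriptor that fabricates a `<method>_requests` setter on access."""

    def __init__(self, name, keys):
        self.name = name
        self.keys = keys

    def __get__(self, instance, owner):
        def func(**kw):
            if set(kw) - set(self.keys):
                raise TypeError(f"Unexpected arguments: {set(kw) - set(self.keys)}")
            requests = getattr(instance, "_metadata_request", {})
            # True means "request under the parameter's own name";
            # a string is an alias; False drops the entry.
            requests.setdefault(self.name, {}).update(
                {k: (k if v is True else v) for k, v in kw.items() if v is not False}
            )
            instance._metadata_request = requests
            return instance

        func.__name__ = f"{self.name}_requests"
        func.__signature__ = inspect.Signature(
            [inspect.Parameter(k, inspect.Parameter.KEYWORD_ONLY, default=False)
             for k in self.keys],
            return_annotation=owner,
        )
        return func


class Base:
    def __init_subclass__(cls, **kwargs):
        # Collect _metadata_request__<name> = {"fit": {...}} class attributes
        # and expose one <method>_requests setter per routing method.
        merged = defaultdict(set)
        for attr, value in vars(cls).items():
            if attr.startswith("_metadata_request__"):
                for method, params in value.items():
                    merged[method].update(params)
        for method, keys in merged.items():
            setattr(cls, f"{method}_requests", RequestMethod(method, sorted(keys)))
        super().__init_subclass__(**kwargs)


class SVMLike(Base):
    _metadata_request__sample_weight = {"fit": {"sample_weight": "sample_weight"}}


est = SVMLike().fit_requests(sample_weight=True)
est._metadata_request  # {'fit': {'sample_weight': 'sample_weight'}}
```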
The three request states make sense. I do not see a clean way to avoid using the sentinel value (it reminds me of PEP 661).
Well, we can avoid the use of None for ERROR_IF_PASSED, and use a sentinel for that instead. That would allow us to use
What do we use for… I'm assuming we still want… Edit: I meant
Yes, True is fine. Strings used for key aliasing should usually be valid identifiers, so we can use a string sentinel containing a non-alphanumeric character as the "error if passed" sentinel without getting into too much trouble. It will only be seen by users who call get_metadata_request. I've not yet decided which is more friendly, that or None.
I think "error if passed" is more explicit than…

From the original SLEP, I was under the impression that

```python
# No metadata requested
log_reg = LogisticRegression()
log_reg.get_metadata_request()
# {'fit': {}, 'predict': {}, 'transform': {}, 'score': {}, 'split': {}, 'inverse_transform': {}}
```

Do we want

```python
log_reg.get_metadata_request()
# {'fit': {'sample_weight': 'error-if-passed'}, ...}
```
Yes, I think that's the right default request.
I've put an alternate implementation here: #20350. It is closer to what we have in the SLEP, and includes
* initial base implementation commit * fix test_props and the issue with attribute starting with __ * skip doctest in metadata_routing.rst for now * DOC explain why aliasing on sub-estimator of a consumer/router is useful * reduce diff * DOC add user guide link to method docstrings * DOC apply Thomas's suggestions to the rst file * CLN address a few comments in docs * ignore sentinel docstring check * handling backward compatibility and deprecation prototype * Update examples/plot_metadata_routing.py Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com> * make __metadata_request__* format more intuitive and less redundant * metadata_request_factory always returns a copy * fix tests for the changed __metadata_request__* format * in example: foo->sample_weight, bar->groups * get_method_input->get_input * minor comments from Guillaume * fix estimator checks tests * Improved sample props developer API * fixes, updated doc, decorator * Add docstrings and some API cleanup * unify serialize/deserialize methods * Add more docstring to process_routing * fix MetadataRouter.get_params parameter mismatch * DOC add missing name to MethodMetadataRequest.deserialize docstring * DOC add MethodMapping.add docstring * DOC fix colons after versionadded * fix {method}_requests return type annotation * metadata_request_factory -> metadata_router_factory and docstring fixes * move 'me' out of the map in MetadataRouter * more docstring refinements * cleanup API addresses and create a utils.metadata_routing sub-folder * fix module import issue * more tests and a few bug fixes * Joel's comments * make process_routing a function * docstring fix * ^type -> $type * remove deserialize, return instance, and add type as an attribute * remove sentinels and use strings instead * make RequestType searchable and check for valid identifier * Route -> MethodPair * remove unnecessary sorted * clarification on usage of the process_routing func in the example * only print methods with non-empty 
requests * fix test_string_representations * remove source build cache from CircleCI (temporarily) * Trigger CI * Invalidate linux-arm64 ccache my changing the key * Trigger CI * method, used_in -> callee, caller * show RequestType instead of RequestType.value in _serialize() * more informative error messages * fix checking for conflicting keys * get_router_for_object -> get_routing_for_object * \{method\}_requests -> set_\{method\}_request * address metadata_routing.rst comments * some test enhancements * TypeError for extra arguments * add_request: prop -> param * original_names -> return_alias * add more tests for MetadataRouter and MethodMapping * more suggestions from Joel's review * fix return type * apply more suggestions from Joel's review * Christian\'s suggestions * more notes from Christian * test_get_routing_for_object returns empty requests on unknown objects * more notes from Christian * remove double line break * more notes from Christian * more notes from Christian * make type private * add more comments/docs * fix test * fix nits * add forgotten nit Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
* FEAT add metadata routing to splitters * forgot black * add end of file line * ignore mypy error * SIMPLE_SPLITTERS -> NO_GROUP_SPLITTERS * simplify comment * add notes on what the tests do * improve test and avoid using repr
Taking a stab at proposal 4 of the sample prop proposal (SLEP006 Routing sample-aligned meta-data).
Clearly a WIP; just putting it here for now for us to see how it'd look.
The first change I've made compared to the proposal is that the dict storing the required props includes a key indicating which method requires which props. This way score, fit, and predict can all have different requirements.

You can see the SLEP examples implemented in `test_props.py` as `test_slep_case*` functions.

Closes #9566, Closes #15425, Closes #3524,
Fixes #4497, Fixes #7136, Fixes #13432, Fixes #4632, Fixes #7646 (add test), Fixes #8127 (add test), Fixes #8158,
Enables #6322,
TODO:
#12052, #8710, #15282, #2630, #8950, #11429, #18028