[WIP] sample props (proposal 4) #16079


Closed
wants to merge 12 commits

Conversation

@adrinjalali (Member) commented Jan 9, 2020

Taking a stab at proposal 4 of the sample prop proposal (SLEP006 Routing sample-aligned meta-data).

Clear WIP, just putting it here for now for us to see how it'd look.

The first change I've made compared to the proposal is that the dict storing the required props is keyed by the method that requires them. This way fit, predict, and score can each have different requirements.
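
To illustrate the shape of that dict, here's a minimal sketch (the class and the 'score' entry are hypothetical; compare the real examples further down this thread):

from sklearn.base import BaseEstimator, ClassifierMixin

class WeightedClassifier(ClassifierMixin, BaseEstimator):
    def __init__(self):
        # Keyed by method name, so each method can require different props.
        self._props_request = {
            'fit': ['sample_weight', 'brand'],
            'score': ['sample_weight'],
        }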

You can see SLEP examples implemented in test_props.py as test_slep_case* functions.

Closes #9566, Closes #15425, Closes #3524,
Fixes #4497, Fixes #7136, Fixes #13432, Fixes #4632, Fixes #7646 (add test), Fixes #8127 (add test), Fixes #8158,
Enables #6322,
TODO:
#12052, #8710, #15282, #2630, #8950, #11429, #18028

@jnothman (Member) commented Jan 9, 2020 via email

@adrinjalali (Member, Author) commented:

Here's an example with a pipeline. @jnothman would you please check if it's [almost] what you had in mind?

@NicolasHug you were interested in this one as well IIRC.

from sklearn.model_selection import GroupKFold, cross_validate
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import make_scorer
from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin
from sklearn.utils.validation import _validate_required_props
from sklearn.svm import SVC
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification()
sw = np.random.rand(len(X))
sw2 = [5, 6]
brand = ['my brand']

class MyEst(ClassifierMixin, BaseEstimator):
    def __init__(self):
        self._props_request = {'fit': ['sample_weight', 'brand']}

    def fit(self, X, y, **fit_params):
        _validate_required_props(self, fit_params, 'fit')
        self._estimator = SVC().fit(X, y, fit_params['sample_weight'])
        print("{} recieved params {} of lengths {}".format(
            self.__class__.__name__,
            list(fit_params.keys()), [len(v) for v in fit_params.values()]))

class MyTrs(TransformerMixin, BaseEstimator):
    def __init__(self):
        self._props_request = {'fit': ['sample_weight']}

    def fit(self, X, y=None, **fit_params):
        _validate_required_props(self, fit_params, 'fit')
        self._estimator = SelectKBest().fit(X, y)
        print("{} recieved params {} of lengths {}".format(
            self.__class__.__name__,
            list(fit_params.keys()), [len(v) for v in fit_params.values()]))
        return self

    def transform(self, X, y=None):
        return self._estimator.transform(X)
clf = make_pipeline(MyTrs(), MyEst())
clf.fit(X, y, sample_weight=sw, brand=brand)
MyTrs received params ['sample_weight'] of lengths [100]
MyEst received params ['sample_weight', 'brand'] of lengths [100, 1]

Pipeline(memory=None, steps=[('mytrs', MyTrs()), ('myest', MyEst())],
         verbose=False)
trs = MyTrs().set_props_request(
    None).set_props_request(
    {'fit': {'new_param': 'my_sw'}})
clf = make_pipeline(trs, MyEst())
clf.fit(X, y, sample_weight=sw, brand=brand, my_sw=sw2)
MyTrs received params ['new_param'] of lengths [2]
MyEst received params ['sample_weight', 'brand'] of lengths [100, 1]

Pipeline(memory=None, steps=[('mytrs', MyTrs()), ('myest', MyEst())],
         verbose=False)
clf.fit(X, y, brand=brand)
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-6-b2a96948c309> in <module>
----> 1 clf.fit(X, y, brand=brand)


~/Projects/sklearn/scikit-learn/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    364             This estimator
    365         """
--> 366         Xt, fit_params = self._fit(X, y, **fit_params)
    367         with _print_elapsed_time('Pipeline',
    368                                  self._log_message(len(self.steps) - 1)):


~/Projects/sklearn/scikit-learn/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
    288         fit_transform_one_cached = memory.cache(_fit_transform_one)
    289 
--> 290         _validate_required_props(self, fit_params, 'fit')
    291 
    292         for (step_idx,


~/Projects/sklearn/scikit-learn/sklearn/utils/validation.py in _validate_required_props(obj, given_props, method)
   1337     given_props = set(given_props.keys())
   1338     if required_props != given_props:
-> 1339         raise ValueError("Requested properties are: {}, but {} "
   1340                          "provided".format(list(required_props),
   1341                                            list(given_props)))


ValueError: Requested properties are: ['brand', 'my_sw', 'sample_weight'], but ['brand'] provided

@jnothman (Member) commented:

That example looks nice! I think I had not envisaged that all requested props were required, i.e. an estimator could request sample_weight by default but only receive it if available. What are your thoughts on that?

What I'm not sure about is whether we need to specify that the request is for fit. I suppose that makes things more flexible.

I have looked at your modification to clone. Okay. But I've not yet looked at much else.

@adrinjalali (Member, Author) commented:

I think I had not envisaged that all requested props were required, i.e. an estimator could request sample_weight by default but only receive it if available. What are your thoughts on that?

I also first thought of having it as "please give me that prop if you have it" rather than "I really need it". The reason I have it as a mandatory requirement is that it removes the chance of a meta-estimator not passing a required prop by mistake. My gut feeling was that it removes some of those bugs.

What I'm not sure about is whether we need to specify that the request is for fit. I suppose that makes things more flexible.

I think your other proposals had a more flexible system for routing the props around. Especially when it comes to fit/score/split distinction. Also, I thought this makes it easier to have some props going to transform/predict as well.

@jnothman (Member) commented:

The reason I have it as a mandatory requirement, is that it kinda removes the chances of a meta estimator not passing a required prop by mistake. My gut feeling was that it removes some of those bugs.

But this would be a mistake in the meta-estimator implementation, which should not be the user's problem?

@adriangb (Contributor) commented Apr 29, 2020

This looks good!

Edit: previous comment was irrelevant after I finally read all of #4497 and SLEP006. My initial conception (and hacky implementation in a personal project) looked very similar to the single-argument, no-routing approach.

Now for some comments (sorry if these are still uninformed):

  • single argument vs. kwargs: I feel strongly that the single-argument approach is best.
  • routing vs. blind passing of all: I see pros and cons to each, but I think it should be possible to implement them in a non-exclusive manner. E.g., if the key starts with route_ it gets routed (however that ends up working), and if not it always gets passed. That would also allow the simpler behavior to be implemented immediately and the routing to be done separately once that basic infrastructure is in place.
  • unexpected behavior changes: very good points were raised in [API] Consistent API for attaching properties to samples #4497 regarding changes in behavior without changes in data. I think that is definitely a concern, but I cannot immediately think of a way around it.

I guess my main question/point is: can this be done in two phases? Phase 1 would be implementing the basic infrastructure (i.e. changing a lot of signatures throughout the API) so that sample_props gets passed correctly. Changes would also have to be made to fold in sample_weight and other existing sample properties. Phase 2 would be routing, requesting/requiring, etc.

That said, how can I be of use to get this done?

@jnothman (Member) left a review:

This is good, but it needs work :)

@@ -257,6 +258,24 @@ def _log_message(self, step_idx):
len(self.steps),
name)

def get_props_request(self):
@jnothman (Member) commented:

There will be cases where the meta-estimator requests a prop, such as sample_weight regardless of whether its descendant estimators also request props.

The implementation here implies that the caller of get_props_request always wants the "deep" props of its descendants. Do we have use cases that would want us to be able to inspect only the locally consumed props?

@adrinjalali (Member, Author) replied:

I can't think of such a case off the top of my head, but if we do at some point, we can add a deep parameter here.

@jnothman (Member) left a review:

Happy to work through this more or assist with the changes

@jnothman (Member) commented Jul 2, 2020

I think the consensus at the last meeting was to rename the prop request to a metadata request.

How is this going? Is there something I can help with here?

@adrinjalali (Member, Author) left a review:

The code is now a bit simpler, tests on sklearn.metrics pass, and test_props.py also passes.

I need to support the old behavior in grid search and pipeline somehow as well.


@jnothman (Member) commented Jul 4, 2020 via email

@jnothman (Member) left a review:

What do you think of the idea above, to get rid of set_metadata_request and instead just add request_sample_weight parameters:

request_sample_weight : bool, str
    When this estimator is in a Pipeline or other meta estimator, its fit 
    method will receive the metadata passed as 'sample_weight' if
    request_sample_weight is True. If it is a string, it will receive the
    metadata passed with that key. If False, no value of sample_weight 
    will be requested.

With this approach, the keys supported by a consumer become explicit, giving us:

  • Free error checking of the expected prop name
  • Free cloning
  • Higher visibility
  • Familiar API
  • No need to distinguish API for setting vs updating an estimator's request
  • No need to burden users with deciding which verbs are relevant to a request (i.e. is this a fit param or a transform param?), because this is now controlled by what the estimator designer encodes about how the estimator consumes each specific parameter.

Disadvantages include needing to add a parameter to every estimator that supports weighting, and hence increased friction if we want to add more user-configurable requests.
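
For a sense of how this would read at the call site, here's a hedged sketch (the request_sample_weight constructor parameter is the proposal above, not an adopted API; X, y, and sw are as in the earlier example):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Hypothetical: the estimator declares its request in the constructor.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(request_sample_weight=True),
)
# The pipeline would then route sample_weight only to steps that requested it.
pipe.fit(X, y, sample_weight=sw)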

fit_params_steps[step][param] = pval
"""
# Remove old_behavior in 0.26
old_behavior = np.any(['__' in pname for pname in fit_params.keys()])
@jnothman (Member) commented:

Really anything not requested by a child should raise an error in the future.

@adrinjalali (Member, Author) replied:

_validate_required_props does that, or do you mean something else?

@jnothman (Member) replied:

You're right.

Though what I mean is that a child could request a parameter with a __ in its name, and in the future it should not raise a warning or an error. I'm okay for it to warn now as well.
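
For context, the legacy routing that this __ check detects addresses fit params to steps by name, along these lines (reusing MyTrs, MyEst, and the data from the example above):

clf = make_pipeline(MyTrs(), MyEst())
# Old behavior: '<step>__<param>' sends the param to that step only.
clf.fit(X, y, mytrs__sample_weight=sw)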

for _, name, transformer in self._iter(filter_passthrough=True):
if transformer is None:
continue
transformer_fit_props = transformer.get_metadata_request().fit
try:
transformer_fit_props = transformer.get_metadata_request().fit
@jnothman (Member) commented:

Do we want to require attribute access to get fit?

@adrinjalali (Member, Author) replied:

I don't understand, could you please elaborate?

@jnothman (Member) replied:

I mean this assumes that the metadata request has an attribute called 'fit'. Should we assume that?

@adrinjalali (Member, Author) replied:

get_metadata_request() will always return a fit attribute even if it's empty with the current implementation.

@adrinjalali (Member, Author) commented:

I don't like adding these parameters to the constructor. I'm still hoping we'd use this as a starting point for not having all parameters as constructor parameters.

Other than that, the API you propose also doesn't have the flexibility of the routing we have right now. You can't route different sample weights to different estimators, or different metadata to different verbs (fit, transform, etc.), which the user might want to control rather than leave to the developer of the estimator.

I'm happy to introduce helper functions such as set_sampleweight_request etc. to make it easier for people to set the routing for the usual cases, though.

@adrinjalali (Member, Author) left a review:

I also tried another way to have an easy way to handle common fit parameters like sample_weight. You can see the remnants of the work here.

I thought of having a mixin for each parameter, like FitRequestsSampleWeight, which would set an attribute such as _metadata_request__sample_weight, and get_metadata_request would merge all such attributes together. The issue I encountered was how the user would change the default settings. If they do it through _metadata_request, then some sort of overriding mechanism has to exist, which is tricky if we allow users to route a given prop to multiple parameters, which they can do now: a single sample_weight can be passed as both fancy_sample_weight and nonfancy_sample_weight, for example, if the estimator accepts both.
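
Concretely, in the {method: {param: incoming_key}} request format used earlier in this thread, such a double mapping would look like this (the two parameter names are hypothetical):

est.set_metadata_request({
    'fit': {
        # Both consumer params are fed from the single incoming key.
        'fancy_sample_weight': 'sample_weight',
        'nonfancy_sample_weight': 'sample_weight',
    }
})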


@jnothman (Member) commented Jul 7, 2020

Other than that, the API you propose also doesn't have the flexibility for the routing we have right now. You can't route different sample weights to different estimators or different metadata to different verbs (fit, transform, etc), which the user might want to control and not leave it to the developer of the estimator.

Yes, the assumption is that the verbs relevant to a parameter are determined by the meaning of the parameter to that consumer. So if the estimator knows how to consume sample_weight in fit, it will consume it in fit. You're right that this might be a problem for sample_weight in score.

A possibility: A method called request_sample_weight could more explicitly define which verbs sample_weight can be requested for ...?

@jnothman (Member) left a review:

then some sort of overriding mechanism should exist, which is tricky if we allow users to pass a given prop to multiple ones, which they can do now: a single sample_weight can be passed to fancy_sample_weight and nonfancy_sample_weight for example, if the estimator would accept both.

I'm not entirely sure what the issue is here. The consumer has a notion of sample_weight. If it had a notion of multiple different weights, like fitting_weight and calibration_weight or something, then each of them is a separate request. The semantics of something like _metadata_request__sample_weight are that the consumer knows about something called sample_weight, but the request instructs a meta-estimator where to get it from.


@jnothman (Member) left a review:

Should we validate expected props in set_metadata_request? We validate param names in set_params. We could, a bit like set_params, check against the signature of the targeted method if **kw is not used.
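
A minimal sketch of that check, assuming a request given as prop names per method (the helper name is made up for illustration):

import inspect

def _validate_request_keys(estimator, method, requested_props):
    # If the target method accepts **kwargs, any prop name could be consumed.
    sig = inspect.signature(getattr(estimator, method))
    params = list(sig.parameters.values())
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
        return
    unknown = set(requested_props) - {p.name for p in params}
    if unknown:
        raise ValueError(
            f"{method}() does not accept requested props: {sorted(unknown)}")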

@jnothman (Member) left a review:

I'm not loving how _get_props_from_objs is used in *SearchCV. The logic doesn't appear very clear. Is the following a helpful tool to make this more explicit and less error-prone?

def build_router_metadata_request(children : Dict[str, Container[object]],
                                  routing : List[Tuple[str, str, str]]):
    """Compose the request that a router should return from get_metadata_request

    For example, GridSearchCV might use:

    def get_metadata_request(self):
        score_from = "base" if self.scoring is None else "scorers"
        return build_router_metadata_request(
            children={"base": [self.estimator],
                      "cv": [check_cv(self.cv)],
                      "scorers": check_multimetric_scoring(...)[0].values()},
            routing=[
                ("base", "fit", "fit"),
                ("cv", "fit", "split"),
                (score_from, "fit", "score"),
                (score_from, "score", "score"),
                # XXX: we might want a special way to handle 'remainder'
                ("base", "transform", "transform"),
                ("base", "inverse_transform", "inverse_transform"),
                ("base", "predict", "predict"),
            ])

    Parameters
    ----------
    children : dict
        Key is a role name (e.g. 'step') used as an identifier in the routing parameter.
        Value is a collection of objects that may provide get_metadata_request.
    routing : list of tuples
        Tuples are a triple of strings (child_role, router_method, child_method).
    """
    children_props = {k: _get_props_from_objs(v) for k, v in children.items()}
    out = {method: set() for method in _empty_metadata_request()}
    for child_role, router_method, child_method in routing:
        # note that we get rid of what the child wants each item to be called
        # and only keep the names we expect to be passed here
        out[router_method].update(children_props[child_role][child_method].keys())
    return out

Also:
Just thought of a tricky case that we probably won't handle for a while: a grid search over two different estimators where one of the estimators requests sample_weight and the other does not:

GridSearchCV(Pipeline([('step', None)]),
             {"step": [LogisticRegression().set_metadata_request({'fit': ['sample_weight']}),
                       LogisticRegression()]})

@jnothman (Member) left a review:

Apologies for my patchwork reviews.

Here I'm trying to make the tests more complete.

I still think the set_metadata_request API needs some work. I find it funny that for an estimator not supporting sample_weight, I can still run .set_metadata_request({'fit': {'sample_weight': 'sample_weight'}}) without error, and that for an estimator supporting sample_weight, I can run .set_metadata_request({'fit': {'sample_weight': 'ample_weight'}}) without error. (Hence my suggestion of request_sample_weight instead. I'd accept another solution, but am not really happy with the present one.)

The rest is shaping up very nicely!


@jnothman (Member) commented Jul 23, 2020

How about this kind of thing for consumers?

class MetadataConsumer:
    def _request_key_for_method(self, method, key, value):
        method_dict = self._metadata_request[method]
        if value is True:
            method_dict[key] = key
        elif value is False:
            # Drop the request if present, without erroring when absent.
            method_dict.pop(key, None)
        elif isinstance(value, str):
            method_dict[key] = value
        else:
            raise TypeError(f"Invalid request value for {key!r}: {value!r}")

class SampleWeightConsumer(MetadataConsumer):
    def request_sample_weight(self, *, fit=True, score=False):
        """Define how to receive sample_weight from a parent meta-estimator

        When called with default arguments, fitting will be weighted,
        and the meta-estimator should be passed a fit parameter by the name
        'sample_weight'.

        Parameters
        ----------
        fit : string or bool, default=True
            The fit parameter name that a meta-estimator should pass to this
            estimator as sample_weight. If True, the name will be
            'sample_weight'. If False, no fit parameter will be passed.

        score : string or bool, default=False
            The parameter name that a meta-estimator should pass to this
            estimator as sample_weight when calling its `score` method.
            If True, the name will be 'sample_weight'. If False, no score
            parameter will be passed.

        Returns
        -------
        self

        Examples
        --------
        >>> import numpy as np
        >>> from sklearn.linear_model import LogisticRegression
        >>> from sklearn.model_selection import GridSearchCV
        >>> from sklearn.datasets import load_iris
        >>> X, y = load_iris(return_X_y=True)
        >>> sample_weight = np.random.rand(len(X))
        >>> clf = LogisticRegression()
        >>> gs = GridSearchCV(clf, {"C": [.1, 1, 10]})
        >>> # Unweighted fitting and scoring
        >>> gs.fit(X, y)
        >>> # Weighted fitting, unweighted scoring
        >>> clf.request_sample_weight()
        >>> gs.fit(X, y, sample_weight=sample_weight)
        >>> # Weighted fitting and scoring
        >>> clf.request_sample_weight(fit=True, score=True)
        >>> gs.fit(X, y, sample_weight=sample_weight)
        >>> # Weighted scoring only
        >>> clf.request_sample_weight(fit=False, score=True)
        >>> gs.fit(X, y, sample_weight=sample_weight)
        >>> # Distinct weights for fit and score
        >>> score_sample_weight = np.random.rand(len(X))
        >>> clf.request_sample_weight(fit='fit_sample_weight',
        ...                           score='score_sample_weight')
        >>> gs.fit(X, y, fit_sample_weight=sample_weight,
        ...        score_sample_weight=score_sample_weight)
        """
        self._request_key_for_method('fit', 'sample_weight', fit)
        self._request_key_for_method('score', 'sample_weight', score)
        return self

@jnothman (Member) left a review:

I think we need that deep parameter for clone. set_metadata_request needs to only pertain to the needs of that estimator as consumer, not as router, and this is what should be copied in clone. However, for the purpose of actually doing the routing, there needs to be a way to get the "deep" metadata request.

On the topic of request_* methods:

I think we should keep set_metadata_request, if only because of its use in clone, but we might emphasise that most users should not use it, as request_* methods should be a simpler, more explicit way of accessing the same functionality. But I'm struggling to work out how to neatly make request_sample_weight appear in only the consumers of sample_weight.

I'd now tweak the example in the previous comment such that neither fit nor score defaults to True or False. Both should default to None, and no change should occur.
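
A minimal sketch of that tweak, building on the SampleWeightConsumer sketch above (None now means "leave the existing request unchanged"):

class SampleWeightConsumer(MetadataConsumer):
    def request_sample_weight(self, *, fit=None, score=None):
        # None: do not touch the current request for that method.
        if fit is not None:
            self._request_key_for_method('fit', 'sample_weight', fit)
        if score is not None:
            self._request_key_for_method('score', 'sample_weight', score)
        return self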

@adrinjalali (Member, Author) commented Oct 19, 2020

I think we need that deep parameter for clone. set_metadata_request needs to only pertain to the needs of that estimator as consumer, not as router, and this is what should be copied in clone. However, for the purpose of actually doing the routing, there needs to be a way to get the "deep" metadata request.

I copy the variable itself now; this should solve the deep/shallow issue.

I think we should keep set_metadata_request, if only because of its use in clone, but we might emphasise that most users should not use it as request_* methods should be a simpler, more explicit way of accessing the same functionality.

We need the method in metrics, but it seems we could do without it for the other estimators. Happy to put it back. I made it private for now, since it's only used in make_scorer.

But I'm struggling to work out how to neatly make request_sample_weight appear in only the consumers of sample_weight.

We just need to use the mixin only where it's appropriate. I'm not sure if I get where you see the challenge here.

I'd now tweak the example in the previous comment such that neither fit nor score defaults to True or False. Both should default to None, and no change should occur.

Yes, agreed. There's another challenge, which is to have a set of defaults in different mixins and have get_metadata_request look for them in a general way, without the user ever calling a set_* method. Update: done in 6d27c79. See the sketch below.
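
A minimal sketch of that general lookup, assuming the _metadata_request__* class-attribute convention from earlier in the thread (the mixin and merging helper are illustrative only, not the merged implementation):

class _SampleWeightFitRequestMixin:
    # Default request contributed by this mixin.
    _metadata_request__sample_weight = {'fit': {'sample_weight': 'sample_weight'}}

def get_metadata_request(est):
    # Merge every _metadata_request__* default found on the class.
    requests = {}
    for attr in dir(type(est)):
        if attr.startswith('_metadata_request__'):
            for method, params in getattr(type(est), attr).items():
                requests.setdefault(method, {}).update(params)
    return requests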

I'm not totally happy with how it's done in cross_validate now; it'd be nice if you could let me know what you think @jnothman.

The tests now include the four use cases presented in the SLEP, and they run, but I need to add tests to check that they run correctly.

@jnothman (Member) commented Mar 16, 2021 via email

@thomasjpfan (Member) commented:

SLEP000 has a "Provisionally Accepted" state:

To allow gathering of additional design and interface feedback before
committing to long term stability for a feature or API, a SLEP may also be
marked as Provisional. This is short for "Provisionally Accepted", and
indicates that the proposal has been accepted for inclusion in the reference
implementation, but additional user feedback is needed before the full design
can be considered Final. Unlike regular accepted SLEPs, provisionally
accepted SLEPs may still be Rejected or Withdrawn even after the
related changes have been included in a scikit-learn release.

I feel the safe choice would be to reject and stay in "Draft":

If the vote does not achieve a required majority, then the SLEP remains in
Draft state, discussion continues as normal, and it can be proposed for
acceptance again later once the objections are resolved.

This way some of the concerns here can be resolved in the SLEP.

https://github.com/scikit-learn/enhancement_proposals/pull/30/files

@jnothman (Member) commented:

Hey @adrinjalali, I'd like to have this closer to the SLEP when we next bring it to vote, and I wonder whether:

  • you could put together a to do list so it's clear to us all
  • we could consider management strategies to expedite it, e.g. allocating a few days to sprint on it, with either pair programming or fast review turnaround

@adrinjalali (Member, Author) commented:

@jnothman I'm happy to do both. I'm working on a cleaner solution for the same API. I'd be happy to have a sprint to work on this together as well.

@jnothman (Member) commented Jun 3, 2021

Just to leave our scribbles from yesterday:

# A messy case

# One option in grid search is weighted, the other is unweighted

# Messy because when do you work out all the requests, and that all metadata
# are consumed?

GridSearchCV(
    Pipeline([
        ('clf', None)
    ]),
    param_grid={"clf": [SVM().fit_requests(sample_weight=True),
                        SVM().fit_requests(sample_weight=False)],
                "clf__C": [1,2,3],
                }
).fit(X, y, sample_weight=sample_weight)



###############

# Request type behaviour

class UnchangedSingleton:
    __dict__ = None

    def __repr__(self):
        return "<UNCHANGED>"


# imports used across the sketches below
import inspect
from collections import defaultdict
from enum import Enum
from typing import Literal, Optional, Union

class RequestType(Enum):
    UNREQUESTED = False
    REQUESTED = True
    ERROR_IF_PASSED = None

UNCHANGED = UnchangedSingleton()


def fit_requests(self,
                 sample_weight: Union[RequestType, str, Literal[UNCHANGED]] = UNCHANGED):
    ...


fit_requests(sample_weight=metadata.REQUESTED)
fit_requests(sample_weight=metadata.UNREQUESTED)
fit_requests(sample_weight=metadata.ERROR_IF_PASSED)
fit_requests(sample_weight='my_weight')

###############

# Magic creation of `fit_requests` etc.

class RequestMethod:
    def __init__(self, name, keys):
        self.name = name
        self.keys = keys

    def __get__(self, instance, owner):
        # what we are avoiding
        def func(**kw):
            if set(kw) - set(self.keys):
                raise TypeError("Unexpected kws")

            instance._metadata_request = ...  # update the request (elided)

            return instance

        func.__name__ = f"{self.name}_requests"
        func.__signature__ = inspect.Signature([
            inspect.Parameter(k,
                              inspect.Parameter.KEYWORD_ONLY,
                              default=metadata.UNCHANGED,
                              annotation=Optional[Union[RequestType, str]])
            for k in self.keys
        ], return_annotation=type(instance))
        func.__doc__ = "Something useful"
        return func


class BaseEstimator:
    def __init_subclass__(cls, **kwargs):
        metadata = defaultdict(set)
        for k, v in vars(cls).items():
            if k.startswith("_metadata_request__"):
                for request_method, request_keys in v.items():
                    metadata[request_method].update(request_keys)

        # now we have something like
        # metadata = {"fit": {"sample_weight", "groups"}}

        for request_method, request_keys in metadata.items():
            # set a fit_requests method on the class
            setattr(cls, f"{request_method}_requests",
                    RequestMethod(request_method, sorted(request_keys)))


        """
        REQUEST_METHOD_MAPPING = [
            ('fit', "fit"),
            ("fit_transform", "fit"),
            ("fit_transform", "transform"),
            ("transform", "transform"),
        ]
        metadata = defaultdict(set)
        for method, request_method_name in REQUEST_METHOD_MAPPING:
            if method in attrs:
                # for example, if the class being instantiated has `fit`
                sig = inspect.signature(attrs[method])
                metadata[request_method_name].add()
        """
        super().__init_subclass__(**kwargs)

@thomasjpfan (Member) commented:

The three request states make sense. I do not see a clean way to avoid using a sentinel value (reminds me of PEP 661).

@jnothman (Member) commented Jun 7, 2021

I do not see a clean way to avoid using the sentinel value

Well, we can avoid the use of None for ERROR_IF_CHANGED, and use a sentinel for that instead. That would allow us to use Optional

@thomasjpfan (Member) commented Jun 7, 2021

Well, we can avoid the use of None for ERROR_IF_CHANGED, and use a sentinel for that instead. That would allow us to use Optional

What do we use for ERROR_IF_PASSED instead of None?

I'm assuming we still want fit_requests(sample_weight=True) to mean fit_requests(sample_weight=metadata.REQUESTED), so users do not need to import metadata to set their request. Strings are already used for key aliasing.

Edit: I meant ERROR_IF_PASSED
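
A minimal sketch of that coercion, reusing the RequestType enum from the scribbles above (enum lookup by value maps True to REQUESTED, False to UNREQUESTED, and None to ERROR_IF_PASSED, while strings pass through as aliases):

def normalize_request_value(value):
    if isinstance(value, str):
        # Strings are aliases for the incoming key.
        return value
    # Enum lookup by value: RequestType(True) is RequestType.REQUESTED.
    return RequestType(value)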

@jnothman (Member) commented Jun 7, 2021 via email

@thomasjpfan (Member) commented:

I think "error if passed" is more explicit than None. Maybe error-if-passed since dashes have limited use in python for naming things.

It will only be seen by users who call get_metadata_request.

From the original SLEP, I was under the impression that get_metadata_request is empty when no metadata is requested.

# No metadata requested
log_reg = LogisticRegression()
log_reg.get_metadata_request()
# {'fit': {}, 'predict': {}, 'transform': {}, 'score': {}, 'split': {}, 'inverse_transform': {}}

Do we want fit to contain "error-if-passed" for sample weights?

log_reg.get_metadata_request()
# {'fit': {'sample_weight': 'error-if-passed'}, ...}

@jnothman (Member) commented Jun 10, 2021 via email

@adrinjalali (Member, Author) commented:

I've put an alternate implementation here: #20350

It is closer to what we have in the SLEP, and includes ERROR_IF_PASSED.

adrinjalali and others added 12 commits March 10, 2022 15:39
* initial base implementation commit

* fix test_props and the issue with attribute starting with __

* skip doctest in metadata_routing.rst for now

* DOC explain why aliasing on sub-estimator of a consumer/router is useful

* reduce diff

* DOC add user guide link to method docstrings

* DOC apply Thomas's suggestions to the rst file

* CLN address a few comments in docs

* ignore sentinel docstring check

* handling backward compatibility and deprecation prototype

* Update examples/plot_metadata_routing.py

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

* make __metadata_request__* format more intuitive and less redundant

* metadata_request_factory always returns a copy

* fix tests for the changed __metadata_request__* format

* in example: foo->sample_weight, bar->groups

* get_method_input->get_input

* minor comments from Guillaume

* fix estimator checks tests

* Improved sample props developer API

* fixes, updated doc, decorator

* Add docstrings and some API cleanup

* unify serialize/deserialize methods

* Add more docstring to process_routing

* fix MetadataRouter.get_params parameter mismatch

* DOC add missing name to MethodMetadataRequest.deserialize docstring

* DOC add MethodMapping.add docstring

* DOC fix colons after versionadded

* fix {method}_requests return type annotation

* metadata_request_factory -> metadata_router_factory and docstring fixes

* move 'me' out of the map in MetadataRouter

* more docstring refinements

* cleanup API addresses and create a utils.metadata_routing sub-folder

* fix module import issue

* more tests and a few bug fixes

* Joel's comments

* make process_routing a function

* docstring fix

* ^type -> $type

* remove deserialize, return instance, and add type as an attribute

* remove sentinels and use strings instead

* make RequestType searchable and check for valid identifier

* Route -> MethodPair

* remove unnecessary sorted

* clarification on usage of the process_routing func in the example

* only print methods with non-empty requests

* fix test_string_representations

* remove source build cache from CircleCI (temporarily)

* Trigger CI

* Invalidate linux-arm64 ccache by changing the key

* Trigger CI

* method, used_in -> callee, caller

* show RequestType instead of RequestType.value in _serialize()

* more informative error messages

* fix checking for conflicting keys

* get_router_for_object -> get_routing_for_object

* \{method\}_requests -> set_\{method\}_request

* address metadata_routing.rst comments

* some test enhancements

* TypeError for extra arguments

* add_request: prop -> param

* original_names -> return_alias

* add more tests for MetadataRouter and MethodMapping

* more suggestions from Joel's review

* fix return type

* apply more suggestions from Joel's review

* Christian\'s suggestions

* more notes from Christian

* test_get_routing_for_object returns empty requests on unknown objects

* more notes from Christian

* remove double line break

* more notes from Christian

* more notes from Christian

* make type private

* add more comments/docs

* fix test

* fix nits

* add forgotten nit

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
* FEAT add metadata routing to splitters

* forgot black

* add end of file line

* ignore mypy error

* SIMPLE_SPLITTERS -> NO_GROUP_SPLITTERS

* simplify comment

* add notes on what the tests do

* improve test and avoid using repr
@adrinjalali adrinjalali closed this Jun 6, 2023
@adrinjalali adrinjalali deleted the sample-props branch June 6, 2023 14:19
Labels
API · Breaking Change (issue resolution would not be easily handled by the usual deprecation cycle)
8 participants