Skip to content

Base sample-prop implementation and docs (alternative to #21284) #22083

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

adrinjalali
Copy link
Member

@adrinjalali adrinjalali commented Dec 27, 2021

This is an alternative to #21284.

The main motivation behind the changes is what @glemaitre noticed as repetition of some of the routing logic between get_metadata_request methods and what would go in fit, transform, etc.

To review this PR, you probably should start with the plot_metadata_routing file.

This PR simplifies the developer API substantially, such that a simple pipeline can be implemented as:

class SimplePipeline(ClassifierMixin, BaseEstimator):
    _required_parameters = ["estimator"]

    def __init__(self, transformer, classifier):
        self.transformer = transformer
        self.classifier = classifier

    def fit(self, X, y, **fit_params):
        params = process_routing(self, "fit", fit_params)

        self.transformer_ = clone(self.transformer).fit(X, y, **params.transformer.fit)
        X_transformed = self.transformer_.transform(X, **params.transformer.transform)

        self.classifier_ = clone(self.classifier).fit(
            X_transformed, y, **params.classifier.fit
        )
        return self

    def predict(self, X, **predict_params):
        params = process_routing(self, "predict", predict_params)

        X_transformed = self.transformer_.transform(X, **params.transformer.transform)
        return self.classifier_.predict(X_transformed, **params.classifier.predict)

    def get_metadata_routing(self):
        router = (
            MetadataRouter()
            .add(
                transformer=self.transformer,
                method_mapping=MethodMapping()
                .add(method="fit", used_in="fit")
                .add(method="transform", used_in="fit")
                .add(method="transform", used_in="predict"),
            )
            .add(classifier=self.classifier, method_mapping="one-to-one")
        )
        return router.serialize()

This PR does NOT change the user API compared to #21284.

cc @jnothman @glemaitre @thomasjpfan @agramfort


note from the other PR

This PR is into the sample-props branch and not main. The idea is to break #20350 into smaller PRs for easier review and discussion rounds.

This PR adds the base implementation, and some documentation and a few tests. The tests are re-done from the previous PR. You can probably start with examples/metadata_routing.py to get a sense of how things work, and then check the implementation.

This PR does NOT touch splitters and scorers, those and all meta-estimators will be done in future PRs.

EDIT: process_routing is now a function rather than a decorator.
EDIT: Some updates and summary of current discussions: #22083 (comment)
EDIT: a timeline of this feature being merged into main is written under scikit-learn/enhancement_proposals#65 (comment)
EDIT: this PR creates set_{method}_request instead of {method}_requests now.

adrinjalali and others added 25 commits October 8, 2021 17:27
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more.

@adrinjalali
Copy link
Member Author

@jnothman per your suggestion I changed the get_routing_for_object to raise an error instead of returning an empty request if it doesn't recognize the object. Now I remember the reason the function wasn't raising an error was that I didn't want to break people's code if they used a third party estimator which doesn't implement get_metadata_routing (yet). What's your take on that?

@jnothman
Copy link
Member

Now I remember the reason the function wasn't raising an error was that I didn't want to break people's code if they used a third party estimator which doesn't implement get_metadata_routing (yet). What's your take on that?

I think you're right that this should not raise an error, such that route_params will return an empty dict for that case. (And this should be tested.)

Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slowly progressing...


The default (UNCHANGED) retains the existing request. This allows
you to change the request for some parameters and not others.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good discussion!
I try to summarize:

  1. Handle fit_transform as separate method with its own set_fit_transform_requests.
  2. Merge the requests of fit and transform (error if inconsistent)
  3. Distinguish between fit_transform that only calls .fit(X).transform(X) and the rest (where it does something meaningful).

Even if it's appealing, I don't like the 3rd option. I have a slight preference for the 2nd option: merge the requests.

Comment on lines +161 to +165
owner : str
A display name for the object owning these requests.

method : str
The name of the method to which these requests belong.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgive my stupid questions. What's the difference between owner, method and callee, caller?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not stupid at all. owner is only used to enhance the error message if an error is raised. method is the method for which this MethodMetadataRequest is used to store the request data.

If a meta-estimator is a consumer and a router, e.g. fit (caller) consumes sample_weight but also routes sample_weight to the sub-estimator's fit (callee), then both caller and callee methods will have a MethodMetdataRequest attached to them, with the corresponding method name.

self.method = method

@property
def requests(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A docstring would be nice, something like "Dictionary of key values pairs: param, alias."

Is it a property without setter for write protection?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, users should use add_request to modify this dict.

Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished with class MethodMetadataRequest.

Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

class MetadataRequest

# this is here for us to use this attribute's value instead of doing
# `isinstance`` in our checks, so that we avoid issues when people vendor
# this file instad of using it directly from scikit-learn.
type = "request"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like an important name. How about metadata_type, routing_type? Just type seems like provoking name clashes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant the attribute name type, not its values like "request". As this attribute is not inherited, e.g. via _MetadataRequester, it seems fine. I was worried about name clashes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I see. Yeah I wouldn't want to have two different names for it since this way I can just check for the value of the variable w/o having to care which instance type it is. I think it should be private anyway, so making it private.

Comment on lines 559 to 561
# `isinstance`` in our checks, so that we avoid issues when people vendor
# this file instad of using it directly from scikit-learn.
type = "router"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same typos and same comment about naming of type as in MetadataRequest.

self._self = None
self.owner = owner

def add_self(self, obj):
Copy link
Member

@lorentzenchr lorentzenchr Mar 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this very confusing at first read. IIUK, it adds obj so _self and _self is an attribute with special treatment/meaning in routing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried adding some clarification here and above.

Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adrinjalali Really great work!

I would appreciate the cast of a searching 3rd eye (=reviewer) as I might very well have overlooked or not understood something. This is an important change that would deserve such an investment.

@jnothman
Copy link
Member

jnothman commented Mar 9, 2022

I would appreciate the cast of a searching 3rd eye (=reviewer) as I might very well have overlooked or not understood something. This is an important change that would deserve such an investment.

I am of the opinion that since we have had the SLEP accepted, and this is not being merged to master, we have enough layers of assurance; rather, we will get more helpful review once we're implementing specific routers. @adrinjalali please ping when there are follow-up PRs or issues.

@lorentzenchr
Copy link
Member

I would appreciate the cast of a searching 3rd eye (=reviewer) as I might very well have overlooked or not understood something. This is an important change that would deserve such an investment.

I am of the opinion that since we have had the SLEP accepted, and this is not being merged to master, we have enough layers of assurance; rather, we will get more helpful review once we're implementing specific routers. @adrinjalali please ping when there are follow-up PRs or issues.

That's totally fine with me. I just expressed "my feelings". Let's merge as soon as the last nitpicks are addressed.

@adrinjalali
Copy link
Member Author

Nice. I'll merge when the CI is green (applied the latest nits I think).

I think the only major thing we have here is the fit_transform issue, and will open a separate issue to talk about that specific topic.

@adrinjalali adrinjalali merged commit 0b298ed into scikit-learn:sample-props Mar 10, 2022
@adrinjalali adrinjalali deleted the sample-props-base-alternate2 branch March 10, 2022 14:39
Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get a TODO list of what proceeds from here, and where additional contributors can play a part? We haven't made those changes to the glossary, for instance.

adrinjalali added a commit that referenced this pull request Mar 22, 2022
* initial base implementation commit

* fix test_props and the issue with attribute starting with __

* skip doctest in metadata_routing.rst for now

* DOC explain why aliasing on sub-estimator of a consumer/router is useful

* reduce diff

* DOC add user guide link to method docstrings

* DOC apply Thomas's suggestions to the rst file

* CLN address a few comments in docs

* ignore sentinel docstring check

* handling backward compatibility and deprecation prototype

* Update examples/plot_metadata_routing.py

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

* make __metadata_request__* format more intuitive and less redundant

* metadata_request_factory always returns a copy

* fix tests for the changed __metadata_request__* format

* in example: foo->sample_weight, bar->groups

* get_method_input->get_input

* minor comments from Guillaume

* fix estimator checks tests

* Improved sample props developer API

* fixes, updated doc, decorator

* Add docstrings and some API cleanup

* unify serialize/deserialize methods

* Add more docstring to process_routing

* fix MetadataRouter.get_params parameter mismatch

* DOC add missing name to MethodMetadataRequest.deserialize docstring

* DOC add MethodMapping.add docstring

* DOC fix colons after versionadded

* fix {method}_requests return type annotation

* metadata_request_factory -> metadata_router_factory and docstring fixes

* move 'me' out of the map in MetadataRouter

* more docstring refinements

* cleanup API addresses and create a utils.metadata_routing sub-folder

* fix module import issue

* more tests and a few bug fixes

* Joel's comments

* make process_routing a function

* docstring fix

* ^type -> $type

* remove deserialize, return instance, and add type as an attribute

* remove sentinels and use strings instead

* make RequestType searchable and check for valid identifier

* Route -> MethodPair

* remove unnecessary sorted

* clarification on usage of the process_routing func in the example

* only print methods with non-empty requests

* fix test_string_representations

* remove source build cache from CircleCI (temporarily)

* Trigger CI

* Invalidate linux-arm64 ccache my changing the key

* Trigger CI

* method, used_in -> callee, caller

* show RequestType instead of RequestType.value in _serialize()

* more informative error messages

* fix checking for conflicting keys

* get_router_for_object -> get_routing_for_object

* \{method\}_requests -> set_\{method\}_request

* address metadata_routing.rst comments

* some test enhancements

* TypeError for extra arguments

* add_request: prop -> param

* original_names -> return_alias

* add more tests for MetadataRouter and MethodMapping

* more suggestions from Joel's review

* fix return type

* apply more suggestions from Joel's review

* Christian\'s suggestions

* more notes from Christian

* test_get_routing_for_object returns empty requests on unknown objects

* more notes from Christian

* remove double line break

* more notes from Christian

* more notes from Christian

* make type private

* add more comments/docs

* fix test

* fix nits

* add forgotten nit

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Should cross-validation scoring take sample-weights into account?
8 participants