-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Base sample-prop implementation and docs (alternative to #21284) #22083
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Base sample-prop implementation and docs (alternative to #21284) #22083
Conversation
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
…rn into sample-props-base
…s-base-alternate2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more.
@jnothman per your suggestion I changed the |
I think you're right that this should not raise an error, such that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slowly progressing...
|
||
The default (UNCHANGED) retains the existing request. This allows | ||
you to change the request for some parameters and not others. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good discussion!
I try to summarize:
- Handle
fit_transform
as separate method with its ownset_fit_transform_requests
. - Merge the requests of
fit
andtransform
(error if inconsistent) - Distinguish between
fit_transform
that only calls.fit(X).transform(X)
and the rest (where it does something meaningful).
Even if it's appealing, I don't like the 3rd option. I have a slight preference for the 2nd option: merge the requests.
owner : str | ||
A display name for the object owning these requests. | ||
|
||
method : str | ||
The name of the method to which these requests belong. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgive my stupid questions. What's the difference between owner, method and callee, caller?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not stupid at all. owner is only used to enhance the error message if an error is raised. method
is the method for which this MethodMetadataRequest
is used to store the request data.
If a meta-estimator is a consumer and a router, e.g. fit
(caller) consumes sample_weight
but also routes sample_weight
to the sub-estimator's fit
(callee), then both caller and callee methods will have a MethodMetdataRequest
attached to them, with the corresponding method name.
self.method = method | ||
|
||
@property | ||
def requests(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A docstring would be nice, something like "Dictionary of key values pairs: param, alias."
Is it a property without setter for write protection?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, users should use add_request
to modify this dict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finished with class MethodMetadataRequest
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
class MetadataRequest
sklearn/utils/_metadata_requests.py
Outdated
# this is here for us to use this attribute's value instead of doing | ||
# `isinstance`` in our checks, so that we avoid issues when people vendor | ||
# this file instad of using it directly from scikit-learn. | ||
type = "request" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like an important name. How about metadata_type
, routing_type
? Just type
seems like provoking name clashes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant the attribute name type
, not its values like "request"
. As this attribute is not inherited, e.g. via _MetadataRequester
, it seems fine. I was worried about name clashes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah I see. Yeah I wouldn't want to have two different names for it since this way I can just check for the value of the variable w/o having to care which instance type it is. I think it should be private anyway, so making it private.
sklearn/utils/_metadata_requests.py
Outdated
# `isinstance`` in our checks, so that we avoid issues when people vendor | ||
# this file instad of using it directly from scikit-learn. | ||
type = "router" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same typos and same comment about naming of type
as in MetadataRequest
.
self._self = None | ||
self.owner = owner | ||
|
||
def add_self(self, obj): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this very confusing at first read. IIUK, it adds obj
so _self
and _self
is an attribute with special treatment/meaning in routing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried adding some clarification here and above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adrinjalali Really great work!
I would appreciate the cast of a searching 3rd eye (=reviewer) as I might very well have overlooked or not understood something. This is an important change that would deserve such an investment.
I am of the opinion that since we have had the SLEP accepted, and this is not being merged to master, we have enough layers of assurance; rather, we will get more helpful review once we're implementing specific routers. @adrinjalali please ping when there are follow-up PRs or issues. |
That's totally fine with me. I just expressed "my feelings". Let's merge as soon as the last nitpicks are addressed. |
Nice. I'll merge when the CI is green (applied the latest nits I think). I think the only major thing we have here is the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get a TODO list of what proceeds from here, and where additional contributors can play a part? We haven't made those changes to the glossary, for instance.
* initial base implementation commit * fix test_props and the issue with attribute starting with __ * skip doctest in metadata_routing.rst for now * DOC explain why aliasing on sub-estimator of a consumer/router is useful * reduce diff * DOC add user guide link to method docstrings * DOC apply Thomas's suggestions to the rst file * CLN address a few comments in docs * ignore sentinel docstring check * handling backward compatibility and deprecation prototype * Update examples/plot_metadata_routing.py Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com> * make __metadata_request__* format more intuitive and less redundant * metadata_request_factory always returns a copy * fix tests for the changed __metadata_request__* format * in example: foo->sample_weight, bar->groups * get_method_input->get_input * minor comments from Guillaume * fix estimator checks tests * Improved sample props developer API * fixes, updated doc, decorator * Add docstrings and some API cleanup * unify serialize/deserialize methods * Add more docstring to process_routing * fix MetadataRouter.get_params parameter mismatch * DOC add missing name to MethodMetadataRequest.deserialize docstring * DOC add MethodMapping.add docstring * DOC fix colons after versionadded * fix {method}_requests return type annotation * metadata_request_factory -> metadata_router_factory and docstring fixes * move 'me' out of the map in MetadataRouter * more docstring refinements * cleanup API addresses and create a utils.metadata_routing sub-folder * fix module import issue * more tests and a few bug fixes * Joel's comments * make process_routing a function * docstring fix * ^type -> $type * remove deserialize, return instance, and add type as an attribute * remove sentinels and use strings instead * make RequestType searchable and check for valid identifier * Route -> MethodPair * remove unnecessary sorted * clarification on usage of the process_routing func in the example * only print methods with non-empty requests * fix test_string_representations * remove source build cache from CircleCI (temporarily) * Trigger CI * Invalidate linux-arm64 ccache my changing the key * Trigger CI * method, used_in -> callee, caller * show RequestType instead of RequestType.value in _serialize() * more informative error messages * fix checking for conflicting keys * get_router_for_object -> get_routing_for_object * \{method\}_requests -> set_\{method\}_request * address metadata_routing.rst comments * some test enhancements * TypeError for extra arguments * add_request: prop -> param * original_names -> return_alias * add more tests for MetadataRouter and MethodMapping * more suggestions from Joel's review * fix return type * apply more suggestions from Joel's review * Christian\'s suggestions * more notes from Christian * test_get_routing_for_object returns empty requests on unknown objects * more notes from Christian * remove double line break * more notes from Christian * more notes from Christian * make type private * add more comments/docs * fix test * fix nits * add forgotten nit Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
This is an alternative to #21284.
The main motivation behind the changes is what @glemaitre noticed as repetition of some of the routing logic between
get_metadata_request
methods and what would go infit
,transform
, etc.To review this PR, you probably should start with the
plot_metadata_routing
file.This PR simplifies the developer API substantially, such that a simple pipeline can be implemented as:
This PR does NOT change the user API compared to #21284.
cc @jnothman @glemaitre @thomasjpfan @agramfort
note from the other PR
This PR is into the
sample-props
branch and notmain
. The idea is to break #20350 into smaller PRs for easier review and discussion rounds.This PR adds the base implementation, and some documentation and a few tests. The tests are re-done from the previous PR. You can probably start with
examples/metadata_routing.py
to get a sense of how things work, and then check the implementation.This PR does NOT touch splitters and scorers, those and all meta-estimators will be done in future PRs.
EDIT:
process_routing
is now a function rather than a decorator.EDIT: Some updates and summary of current discussions: #22083 (comment)
EDIT: a timeline of this feature being merged into
main
is written under scikit-learn/enhancement_proposals#65 (comment)EDIT: this PR creates
set_{method}_request
instead of{method}_requests
now.