-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
DOC Clarify metadata routing docs from _metadata_requests.py
module
#31419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
DOC Clarify metadata routing docs from _metadata_requests.py
module
#31419
Conversation
If the router is also a consumer, it also checks for warnings of | ||
`self`'s/consumer's requested metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed the two lines on checking warnings, since this method only does the smaller part of the work of checking for warnings and thus this might not be as relevant here. But it would actually be nice, to do this all in one place. The rest of the warning checking is done in MethodMetadataRequest._route_params()
.
For another PR, I was thinking of trying to put all the warning checking into here (MetadataRouter.route_params()
) and then documenting it with:
Metadata requests are also validated and may trigger warnings for newly added
metadata. This includes the router’s own requests if it also acts as a consumer.
Would that be a good idea?
Specifies which metadata should be routed to `param` | ||
Specifies which metadata should be routed to the method that owns this | ||
`MethodMetadataRequest`. | ||
- str: the name (or alias) of metadata given to a meta-estimator that | ||
should be routed to this parameter. | ||
should be routed to the method that owns this `MethodMetadataRequest`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was wrong before.
sklearn/utils/_metadata_requests.py
Outdated
This class is used by router objects to store and handle metadata routing. | ||
Routing information is stored as a dictionary of the form ``{"object_name": | ||
This class is used by :term:`meta-estimators` or functions that can route metadata | ||
to store and handle their metadata routing. Routing information is stored in a | ||
dictionary-like structure of the form ``{"object_name": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finding different words for "router objects" to prevent confusion with this classes' own name.
sklearn/utils/_metadata_requests.py
Outdated
The routing is coordinated by building ``MetadataRequest`` objects | ||
for objects that consume metadata, and ``MetadataRouter`` objects for objects that | ||
can route metadata, which are then aligned during a call to `process_routing()`." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The part explaining that MetadataRequest
objects and MetadataRouter
objects are aligned during a call to process_routing()
is crucial to communicate the general idea.
sklearn/utils/_metadata_requests.py
Outdated
item : object | ||
The given item to be checked if it can be an alias. | ||
item : str | ||
The given name to be checked if it can be an alias for the metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be a copy-paste from the docstring of request_is_valid
and even though a "str" is an "object", it'll help users more to be more specific here. I would also rename the item
param into name
.
But do request_is_alias
and request_is_valid
need to be in the public API at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's still an object
here, particularly, anything in VALID_REQUEST_VALUES
. Also, this function checks if the given object is a string, so the input doesn't have to be a string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, that's true.
@adrinjalali, would you like to have a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a very thoughtful change and a nice improvement! I feel like I understand this stuff better (though still some confusion).
The module docstring _metadata_requests.py
is quite long, it maybe nice to just visually break it up by adding a section header (even though this will never be rendered) for the part that talks about MetadataRequest
and MetadataRouter
.
Additionally, it feels like the last paragraph:
scikit-learn/sklearn/utils/_metadata_requests.py
Lines 69 to 74 in 4560abc
The ``set_{method}_request`` methods are dynamically generated for estimators | |
which inherit from the ``BaseEstimator``. This is done by attaching instances | |
of the ``RequestMethod`` descriptor to classes, which is done in the | |
``_MetadataRequester`` class, and ``BaseEstimator`` inherits from this mixin. | |
This mixin also implements the ``get_metadata_routing``, which meta-estimators | |
need to override, but it works for simple consumers as is. |
belongs further up, with the section that talks about MetadataRequest
.
Another suggestion which doesn't have to be made by you/in this PR:
I feel like the 'alias' part is half explained here:
scikit-learn/sklearn/utils/_metadata_requests.py
Lines 39 to 43 in 4560abc
The ``alias`` above in the ``add`` method has to be either a string (an alias), | |
or a {True (requested), False (unrequested), None (error if passed)}``. There | |
are some other special values such as ``UNUSED`` and ``WARN`` which are used | |
for purposes such as warning of removing a metadata in a child class, but not | |
used by the end users. |
and half explained here (too long to show all of it, but here is the start):
# Each request value needs to be one of the following values, or an alias. |
It would be nice to have it in one place. Also it would be nice to more explicitly explain things "error if passed" -> raise error if this parameter is passed (I think this is what is meant?)
sklearn/utils/_metadata_requests.py
Outdated
can route metadata, which are then aligned during a call to `process_routing()`." | ||
The ``MetadataRequest`` and ``MetadataRouter`` objects are constructed via a | ||
``get_metadata_routing`` method, which all estimators (should) therefore implement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the brackets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was, I think, supposed to express that third party developers and users who implement custom meta-estimators should also implement a get_metadata_routing
method, while it is not strictly required to write a scikit-learn compatible estimator.
I have now re-phrased it to only match scikit-learn's native estimators. A guide on how third party developers can use scikit-learns metadata routing API is already existing in the user guide and better explained there as well.
Co-authored-by: Lucy Liu <jliu176@gmail.com>
…ikit-learn into metadata_internal_docs
Co-authored-by: Lucy Liu <jliu176@gmail.com>
…ikit-learn into metadata_internal_docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reviewing, @lucyleeow!
I went through all your comments and have further refined the docs, especially the module docstring. I agree it is quite long, but I have broken it up with section headings and also re-ordered it a bit.
Now, it might be also exportable / usable in the docs that get rendered (maybe in the example linked below), and I will think about it.
The last paragraph I have left where it is for now, putting "implementation details" as a sub-heading, because it seems not so relevant to understand how to implement a custom estimator that routes metadata, which was the lens I tried to read this through. I can also move it up, if that helps though.
I will keep your comment on the alias in mind. I have not fully wrapped my head around it yet.
sklearn/utils/_metadata_requests.py
Outdated
can route metadata, which are then aligned during a call to `process_routing()`." | ||
The ``MetadataRequest`` and ``MetadataRouter`` objects are constructed via a | ||
``get_metadata_routing`` method, which all estimators (should) therefore implement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was, I think, supposed to express that third party developers and users who implement custom meta-estimators should also implement a get_metadata_routing
method, while it is not strictly required to write a scikit-learn compatible estimator.
I have now re-phrased it to only match scikit-learn's native estimators. A guide on how third party developers can use scikit-learns metadata routing API is already existing in the user guide and better explained there as well.
sklearn/utils/_metadata_requests.py
Outdated
``get_metadata_routing`` returns a ``MetadataRouter`` object. It stores | ||
information on which method from the router object calls which in a consumer's | ||
object, because this is how the metadata is to be passed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, exactly! It would look something like this:
{'estimator': {'mapping': [{'caller': 'fit', 'callee': 'fit'}, {'caller': 'predict', 'callee': 'predict'}, {'caller': 'score', 'callee': 'score'}], 'router': {'fit': {'sample_weight': True}, 'predict': {'groups': None}, 'score': {'sample_weight': None}}}}
Though, now I am wondering why the second part is called "router" when in other parts of the docs the same concept is referred to as "request"...
Do you think it makes sense to change that, too?
@adrinjalali, would you verify this can be changed? What would be your take on that?
sklearn/utils/_metadata_requests.py
Outdated
also returns a ``MetadataRouter`` object. Its | ||
routing information includes both information about the object itself (added | ||
via ``MetadataRouter.add_self_request``), as well as the routing information | ||
for its sub-estimators. | ||
A ``MetadataRequest`` instance includes one ``MethodMetadataRequest`` per | ||
method in ``METHODS``, which includes ``fit``, ``score``, etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I had spared that out until now. But you are right to ask me to clarify this as well. So I have tried to re-write this paragraph. Please let me know what you think. :)
My understanding is that each consuming estimator communicates their requests by their _get_metadata_request
method, which in the simplest case inspects the methods' signatures and builds a dictionary with all the requested metadata from it, which is then stored as an estimator's _metadata_request
attribute. This attribute is created while running set_{method}_request
via RequestMethod.__get__
(not sure how it gets there though, just inspecting the output of my debugging session).
Something like metadatarequest.fit.add
does not exist in the code as far as I can tell. I would assume the syntax has changed and metadatarequest.fit.add(param="sample_weight", alias="my_weights")
refers to method_metadata_request.add_request(param=prop, alias=alias)
, which - yes - is done per method and add
became add_request
.
the router itself needs to be included in the routing. The passed object | ||
can be an estimator or a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my current understanding, it will mostly be an estimator and I wonder more about how this would work / what could happen for this to be another MetadataRequest
.
If you look at the example for developers writing their own meta-estimators, there we would add self
into .add_self_request()
when the MetadataRouter
is build:
class RouterConsumerClassifier(MetaEstimatorMixin, ClassifierMixin, BaseEstimator):
def __init__(self, estimator):
self.estimator = estimator
def get_metadata_routing(self):
router = (
[MetadataRouter](https://scikit-learn.org/stable/modules/generated/sklearn.utils.metadata_routing.MetadataRouter.html#sklearn.utils.metadata_routing.MetadataRouter)(owner=self.__class__.__name__)
# defining metadata routing request values for usage in the meta-estimator
.add_self_request(self)
# defining metadata routing request values for usage in the sub-estimator
.add(
estimator=self.estimator,
method_mapping=[MethodMapping](https://scikit-learn.org/stable/modules/generated/sklearn.utils.metadata_routing.MethodMapping.html#sklearn.utils.metadata_routing.MethodMapping)()
.add(caller="fit", callee="fit")
.add(caller="predict", callee="predict")
.add(caller="score", callee="score"),
)
)
return router
``MetadataRouter`` includes information about sub-objects' routing and how | ||
methods are mapped together. For instance, the information about which methods | ||
of a sub-estimator are called in which methods of the meta-estimator are all | ||
stored here. Conceptually, this information looks like: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoiding the word "store" for MetadataRouter
, since it gets re-build on demand and is not attached to any object.
the router itself needs to be included in the routing. The passed object | ||
can be an estimator or a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay that makes more sense, and combined with your explanation in #31419 (comment) I understand better.
The MetadataRequest
use case is indeed intriguing. I can imagine 'how' it could work but not sure in what situation it would be used.
Anyway thanks!
Co-authored-by: Lucy Liu <jliu176@gmail.com>
sklearn/utils/_metadata_requests.py
Outdated
REQUESTER_DOC = """ Request metadata passed to the ``{method}`` method. | ||
Note that this method is only relevant if | ||
``enable_metadata_routing=True`` (see :func:`sklearn.set_config`). | ||
Please see :ref:`User Guide <metadata_routing>` on how the routing | ||
# the methods. | ||
REQUESTER_DOC = """Configure whether metadata should be requested and passed to the | ||
``{method}`` method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rephrased to avoid reading 'request' as a noun and to emphasize that this method configures metadata handling not definitely passes any.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
sklearn/utils/_metadata_requests.py
Outdated
The ``MetadataRouter`` object is created anew whenever a router calls a method from a | ||
sub-estimator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know what you mean here, but it's somewhat misleading.
I would maybe phrase is as:
The
MetadataRouter
objects are never stored and are always recreated whenever
the object'sget_metadata_routing
method is called.
sklearn/utils/_metadata_requests.py
Outdated
item : object | ||
The given item to be checked if it can be an alias. | ||
item : str | ||
The given name to be checked if it can be an alias for the metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's still an object
here, particularly, anything in VALID_REQUEST_VALUES
. Also, this function checks if the given object is a string, so the input doesn't have to be a string.
the router itself needs to be included in the routing. The passed object | ||
can be an estimator or a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we've had a usecase for it being anything other than self
. You could argue that, that information can be retrieved from owner
and instead of having a add_self_request
we could add a owner_is_consumer
to the constructor.
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @adrinjalali, I've addressed your suggestions.
sklearn/utils/_metadata_requests.py
Outdated
item : object | ||
The given item to be checked if it can be an alias. | ||
item : str | ||
The given name to be checked if it can be an alias for the metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, that's true.
the router itself needs to be included in the routing. The passed object | ||
can be an estimator or a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In both cases developers who write their own meta-estimators would have to act, so having a param instead doesn't really facilitate their work. I think it's fine as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I'll let @lucyleeow have another look and merge if there's nothing left.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, minor nits (and some questions) but okay with this as is.
sklearn/utils/_metadata_requests.py
Outdated
The routing is coordinated by building ``MetadataRequest`` objects | ||
for objects that consume metadata, and ``MetadataRouter`` objects for objects that | ||
can route metadata, which are then aligned during a call to `process_routing()`. The | ||
actual metadata values are held out until this point. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry question, what does "held out" mean here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I means that it doesn't get used / stored / routed anywhere before process_routing
is called. Maybe we should write that just like this? Or, more intuitively, without negation. I've pushed a suggestion.
This function returns a Bunch object (dictionary-like) with all the information on the
consumers and which metadata they had requested and the actual metadata values. A
routing method (such as `fit` in a meta-estimator) can now provide the metadata to the
relevant consuming method (such as `fit` in a sub-estimator).
sklearn/utils/_metadata_requests.py
Outdated
~~~~~~~~~~~~~~~ | ||
In non-routing consumers, the simplest case, e.g. ``SVM``, ``get_metadata_routing`` | ||
returns a ``MetadataRequest`` object as the consumer's `_metadata_request` attribute. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This somehow reads funny to me but I don't know what to suggest.
"returns a MetadataRequest
object, which is stored in the consumer's _metadata_request
attribute." ?
but we use the word 'store' again in the next sentence so that is not ideal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe "which is assigned to"?
sklearn/utils/_metadata_requests.py
Outdated
``get_metadata_routing`` returns a ``MetadataRouter`` object. It stores | ||
information on which method from the router object calls which in a consumer's | ||
object, because this is how the metadata is to be passed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. So does this mean that not only do you need
information about which method, from the router object, calls which in a consumer's
object
you also need the 'routing' info (i.e. what is requested)? Is this second part also important info that a MetadataRouter
object needs? Or is that part not relevant? (I am confused about this part - and just clarifying)
Co-authored-by: Lucy Liu <jliu176@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @lucyleeow, your feeback, again, triggered some good improvements.
Github doesn't allow me to comment on this comment or yours:
Thanks. So does this mean that not only do you need
information about which method, from the router object, calls which in a consumer's
object
you also need the 'routing' info (i.e. what is requested)? Is this second part also important info that a MetadataRouter object needs? Or is that part not relevant? (I am confused about this part - and just clarifying)
You're exactly right. Also, I have looked further into the MetadataRouter
and found that these two parts had both been re-named. So I have fixed that in the docstring as well.
There is more I want to do to the API docs and I see some work to be done for the MetadataRouter
specifically. I will go back to it in another PR, especially to add examples, but also to write better docstrings.
sklearn/utils/_metadata_requests.py
Outdated
The routing is coordinated by building ``MetadataRequest`` objects | ||
for objects that consume metadata, and ``MetadataRouter`` objects for objects that | ||
can route metadata, which are then aligned during a call to `process_routing()`. The | ||
actual metadata values are held out until this point. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I means that it doesn't get used / stored / routed anywhere before process_routing
is called. Maybe we should write that just like this? Or, more intuitively, without negation. I've pushed a suggestion.
This function returns a Bunch object (dictionary-like) with all the information on the
consumers and which metadata they had requested and the actual metadata values. A
routing method (such as `fit` in a meta-estimator) can now provide the metadata to the
relevant consuming method (such as `fit` in a sub-estimator).
sklearn/utils/_metadata_requests.py
Outdated
~~~~~~~~~~~~~~~ | ||
In non-routing consumers, the simplest case, e.g. ``SVM``, ``get_metadata_routing`` | ||
returns a ``MetadataRequest`` object as the consumer's `_metadata_request` attribute. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe "which is assigned to"?
What does this implement/fix? Explain your changes.
This PR aims to clarify the documentation (internal and public) on the
_metadata_requests.py
module.It's also meant to simplify further work on this module for us (me).
This PR entails:
object
,param
,prop
) with what is meant in this context in method docstrings, when a special term is needed to explain the usecaseobject
andparam
in parts of the docstrings where they seem to refer to more technical aspects (objects
as class instances,params
as something to pass to or accepted by methods)prop
andproperty
as terms completely, since I find this is very ambiguous