-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
FEAT SLEP006: metadata routing infrastructure #24027
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* initial base implementation commit * fix test_props and the issue with attribute starting with __ * skip doctest in metadata_routing.rst for now * DOC explain why aliasing on sub-estimator of a consumer/router is useful * reduce diff * DOC add user guide link to method docstrings * DOC apply Thomas's suggestions to the rst file * CLN address a few comments in docs * ignore sentinel docstring check * handling backward compatibility and deprecation prototype * Update examples/plot_metadata_routing.py Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com> * make __metadata_request__* format more intuitive and less redundant * metadata_request_factory always returns a copy * fix tests for the changed __metadata_request__* format * in example: foo->sample_weight, bar->groups * get_method_input->get_input * minor comments from Guillaume * fix estimator checks tests * Improved sample props developer API * fixes, updated doc, decorator * Add docstrings and some API cleanup * unify serialize/deserialize methods * Add more docstring to process_routing * fix MetadataRouter.get_params parameter mismatch * DOC add missing name to MethodMetadataRequest.deserialize docstring * DOC add MethodMapping.add docstring * DOC fix colons after versionadded * fix {method}_requests return type annotation * metadata_request_factory -> metadata_router_factory and docstring fixes * move 'me' out of the map in MetadataRouter * more docstring refinements * cleanup API addresses and create a utils.metadata_routing sub-folder * fix module import issue * more tests and a few bug fixes * Joel's comments * make process_routing a function * docstring fix * ^type -> $type * remove deserialize, return instance, and add type as an attribute * remove sentinels and use strings instead * make RequestType searchable and check for valid identifier * Route -> MethodPair * remove unnecessary sorted * clarification on usage of the process_routing func in the example * only print methods with non-empty requests * fix test_string_representations * remove source build cache from CircleCI (temporarily) * Trigger CI * Invalidate linux-arm64 ccache my changing the key * Trigger CI * method, used_in -> callee, caller * show RequestType instead of RequestType.value in _serialize() * more informative error messages * fix checking for conflicting keys * get_router_for_object -> get_routing_for_object * \{method\}_requests -> set_\{method\}_request * address metadata_routing.rst comments * some test enhancements * TypeError for extra arguments * add_request: prop -> param * original_names -> return_alias * add more tests for MetadataRouter and MethodMapping * more suggestions from Joel's review * fix return type * apply more suggestions from Joel's review * Christian\'s suggestions * more notes from Christian * test_get_routing_for_object returns empty requests on unknown objects * more notes from Christian * remove double line break * more notes from Christian * more notes from Christian * make type private * add more comments/docs * fix test * fix nits * add forgotten nit Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
* FEAT add metadata routing to splitters * forgot black * add end of file line * ignore mypy error * SIMPLE_SPLITTERS -> NO_GROUP_SPLITTERS * simplify comment * add notes on what the tests do * improve test and avoid using repr
* initial base implementation commit * fix test_props and the issue with attribute starting with __ * skip doctest in metadata_routing.rst for now * DOC explain why aliasing on sub-estimator of a consumer/router is useful * reduce diff * DOC add user guide link to method docstrings * DOC apply Thomas's suggestions to the rst file * CLN address a few comments in docs * ignore sentinel docstring check * handling backward compatibility and deprecation prototype * Update examples/plot_metadata_routing.py Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com> * make __metadata_request__* format more intuitive and less redundant * metadata_request_factory always returns a copy * fix tests for the changed __metadata_request__* format * in example: foo->sample_weight, bar->groups * get_method_input->get_input * minor comments from Guillaume * fix estimator checks tests * Improved sample props developer API * fixes, updated doc, decorator * Add docstrings and some API cleanup * unify serialize/deserialize methods * Add more docstring to process_routing * fix MetadataRouter.get_params parameter mismatch * DOC add missing name to MethodMetadataRequest.deserialize docstring * DOC add MethodMapping.add docstring * DOC fix colons after versionadded * fix {method}_requests return type annotation * metadata_request_factory -> metadata_router_factory and docstring fixes * move 'me' out of the map in MetadataRouter * more docstring refinements * cleanup API addresses and create a utils.metadata_routing sub-folder * fix module import issue * more tests and a few bug fixes * Joel's comments * make process_routing a function * docstring fix * ^type -> $type * remove deserialize, return instance, and add type as an attribute * remove sentinels and use strings instead * make RequestType searchable and check for valid identifier * Route -> MethodPair * remove unnecessary sorted * clarification on usage of the process_routing func in the example * only print methods with non-empty requests * fix test_string_representations * remove source build cache from CircleCI (temporarily) * Trigger CI * Invalidate linux-arm64 ccache my changing the key * Trigger CI * method, used_in -> callee, caller * show RequestType instead of RequestType.value in _serialize() * more informative error messages * fix checking for conflicting keys * get_router_for_object -> get_routing_for_object * \{method\}_requests -> set_\{method\}_request * address metadata_routing.rst comments * some test enhancements * TypeError for extra arguments * add_request: prop -> param * original_names -> return_alias * add more tests for MetadataRouter and MethodMapping * more suggestions from Joel's review * fix return type * apply more suggestions from Joel's review * Christian\'s suggestions * more notes from Christian * test_get_routing_for_object returns empty requests on unknown objects * more notes from Christian * remove double line break * more notes from Christian * more notes from Christian * make type private * add more comments/docs * fix test * fix nits * add forgotten nit Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
* FEAT add metadata routing to splitters * forgot black * add end of file line * ignore mypy error * SIMPLE_SPLITTERS -> NO_GROUP_SPLITTERS * simplify comment * add notes on what the tests do * improve test and avoid using repr
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: adrinjalali <adrin.jalali@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thomasjpfan I added a documentation section to _metadata_requests.py
, and reordered the file into sections. It should make it much easier to read / maintain the file.
I wonder if we should move sklearn.utils.metadata_routing
to sklearn.metadata_routing
. I don't have a strong preference.
sklearn/utils/_metadata_requests.py
Outdated
- ``str``: metadata should be passed to the meta-estimator with \ | ||
this given alias instead of the original name. | ||
|
||
The default (``UNCHANGED``) retains the existing request. This allows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed it to sklearn.utils.metadata_routing.UNCHANGED
.
from .. import get_config | ||
from ..exceptions import UnsetMetadataPassedError | ||
from ._bunch import Bunch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These relative imports will make it harder to vendor. I'm okay with leaving this as is and fixing it as a follow up.
self._check_warnings(params=params) | ||
unrequested = dict() | ||
params = {} if params is None else params | ||
args = {arg: value for arg, value in params.items() if value is not None} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests passed without needing to filter out if value is not None
. Is it required to filter them out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the tests pass because we use process_routing
which applies the same filter. I removed the filter in process_routing
and kept the one here now, and added a test to explicitly test passing None
has no effect.
sklearn/utils/_metadata_requests.py
Outdated
A string representing the mapping, it can be: | ||
|
||
- `"one-to-one"`: a one to one mapping for all methods. | ||
- `"method"`: the name of a single method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A developer can be confused and think the string "method" is a valid route. We can dynamically list out the METHODS
to make it clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Expanded the description to make it more clear.
|
||
validate_keys : bool, default=True | ||
Whether to check if the requested parameters fit the actual parameters | ||
of the method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this kwarg is only used for testing:
of the method. | |
of the method. Only used for testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, in the mocking there's the only place we have validate_keys=False
but that kind of usecase can happen in the wild too, where you want to accept any kind of request w/o having them pre-defined.
if "__metadata_request__" in attr | ||
} | ||
defaults.update(base_defaults) | ||
defaults = dict(sorted(defaults.items())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this line required?
defaults = dict(sorted(defaults.items())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to make the output deterministic, otherwise I saw cases where the output would change and the repr would not give a consistent result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thomasjpfan should be good now.
self._check_warnings(params=params) | ||
unrequested = dict() | ||
params = {} if params is None else params | ||
args = {arg: value for arg, value in params.items() if value is not None} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the tests pass because we use process_routing
which applies the same filter. I removed the filter in process_routing
and kept the one here now, and added a test to explicitly test passing None
has no effect.
route : str | ||
A string representing the mapping, it can be: | ||
|
||
- `"one-to-one"`: a one to one mapping for all methods. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me a one to one mapping is a bijection, doesn't really need all
in the name.
sklearn/utils/_metadata_requests.py
Outdated
A string representing the mapping, it can be: | ||
|
||
- `"one-to-one"`: a one to one mapping for all methods. | ||
- `"method"`: the name of a single method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Expanded the description to make it more clear.
|
||
validate_keys : bool, default=True | ||
Whether to check if the requested parameters fit the actual parameters | ||
of the method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, in the mocking there's the only place we have validate_keys=False
but that kind of usecase can happen in the wild too, where you want to accept any kind of request w/o having them pre-defined.
if "__metadata_request__" in attr | ||
} | ||
defaults.update(base_defaults) | ||
defaults = dict(sorted(defaults.items())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to make the output deterministic, otherwise I saw cases where the output would change and the repr would not give a consistent result.
In routers, e.g. meta-estimators and a multi metric scorer, | ||
``get_metadata_routing`` returns a ``MetadataRouter`` object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
What about a dedicated new section: "For 3rd party maintainers" in the usual changelog. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last concern, otherwise this looks ready to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think further improvements to the docs and code can be done in follow up PRs.
routed_params = Bunch( | ||
estimator=Bunch(partial_fit=Bunch(sample_weight=sample_weight)) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a blocker: Nesting Bunches like this is a little ugly. I'm considering not using Bunch at all and going with dicts. It would make the _metadata_request.py
slightly easier to vendor. Then developers would need to do:
routed_params["estimator"]["partial_fit"]
(I'm +0.25 on using regular dicts)
@thomasjpfan done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I added a comment to the meta-issue on SLEP006: #22893 (comment) |
❤️ @adrinjalali ❤️ |
Thanks, @adrinjalali and everyone involved for this massive contribution. 🥳 👏 🤝 |
How about a blog post like like the one for pandas in-out? |
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
This PR is to merge the
sample-props
branch intomain
.All contributions here are already reviewed in their corresponding PRs, but to make it easier, we can review this now directly here and do modifications directly on this branch rather than through PRs to this branch.
This feature is not enabled by default, and the userguide says that it's not ready, and the link to the user guide is also in an "under development" section.
EDIT (26.05.2023): calling
set_{method}_request
now raises if feature flag is not enabled, to prevent accidental behaviors for users thinking they're setting requests but those requests not having an effect.