Skip to content

FEAT SLEP006: metadata routing infrastructure #24027

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 79 commits into from
Jun 2, 2023
Merged

FEAT SLEP006: metadata routing infrastructure #24027

merged 79 commits into from
Jun 2, 2023

Conversation

adrinjalali
Copy link
Member

@adrinjalali adrinjalali commented Jul 27, 2022

This PR is to merge the sample-props branch into main.

All contributions here are already reviewed in their corresponding PRs, but to make it easier, we can review this now directly here and do modifications directly on this branch rather than through PRs to this branch.

This feature is not enabled by default, and the userguide says that it's not ready, and the link to the user guide is also in an "under development" section.

EDIT (26.05.2023): calling set_{method}_request now raises if feature flag is not enabled, to prevent accidental behaviors for users thinking they're setting requests but those requests not having an effect.

adrinjalali and others added 16 commits March 10, 2022 15:39
* initial base implementation commit

* fix test_props and the issue with attribute starting with __

* skip doctest in metadata_routing.rst for now

* DOC explain why aliasing on sub-estimator of a consumer/router is useful

* reduce diff

* DOC add user guide link to method docstrings

* DOC apply Thomas's suggestions to the rst file

* CLN address a few comments in docs

* ignore sentinel docstring check

* handling backward compatibility and deprecation prototype

* Update examples/plot_metadata_routing.py

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

* make __metadata_request__* format more intuitive and less redundant

* metadata_request_factory always returns a copy

* fix tests for the changed __metadata_request__* format

* in example: foo->sample_weight, bar->groups

* get_method_input->get_input

* minor comments from Guillaume

* fix estimator checks tests

* Improved sample props developer API

* fixes, updated doc, decorator

* Add docstrings and some API cleanup

* unify serialize/deserialize methods

* Add more docstring to process_routing

* fix MetadataRouter.get_params parameter mismatch

* DOC add missing name to MethodMetadataRequest.deserialize docstring

* DOC add MethodMapping.add docstring

* DOC fix colons after versionadded

* fix {method}_requests return type annotation

* metadata_request_factory -> metadata_router_factory and docstring fixes

* move 'me' out of the map in MetadataRouter

* more docstring refinements

* cleanup API addresses and create a utils.metadata_routing sub-folder

* fix module import issue

* more tests and a few bug fixes

* Joel's comments

* make process_routing a function

* docstring fix

* ^type -> $type

* remove deserialize, return instance, and add type as an attribute

* remove sentinels and use strings instead

* make RequestType searchable and check for valid identifier

* Route -> MethodPair

* remove unnecessary sorted

* clarification on usage of the process_routing func in the example

* only print methods with non-empty requests

* fix test_string_representations

* remove source build cache from CircleCI (temporarily)

* Trigger CI

* Invalidate linux-arm64 ccache my changing the key

* Trigger CI

* method, used_in -> callee, caller

* show RequestType instead of RequestType.value in _serialize()

* more informative error messages

* fix checking for conflicting keys

* get_router_for_object -> get_routing_for_object

* \{method\}_requests -> set_\{method\}_request

* address metadata_routing.rst comments

* some test enhancements

* TypeError for extra arguments

* add_request: prop -> param

* original_names -> return_alias

* add more tests for MetadataRouter and MethodMapping

* more suggestions from Joel's review

* fix return type

* apply more suggestions from Joel's review

* Christian\'s suggestions

* more notes from Christian

* test_get_routing_for_object returns empty requests on unknown objects

* more notes from Christian

* remove double line break

* more notes from Christian

* more notes from Christian

* make type private

* add more comments/docs

* fix test

* fix nits

* add forgotten nit

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
* FEAT add metadata routing to splitters

* forgot black

* add end of file line

* ignore mypy error

* SIMPLE_SPLITTERS -> NO_GROUP_SPLITTERS

* simplify comment

* add notes on what the tests do

* improve test and avoid using repr
* initial base implementation commit

* fix test_props and the issue with attribute starting with __

* skip doctest in metadata_routing.rst for now

* DOC explain why aliasing on sub-estimator of a consumer/router is useful

* reduce diff

* DOC add user guide link to method docstrings

* DOC apply Thomas's suggestions to the rst file

* CLN address a few comments in docs

* ignore sentinel docstring check

* handling backward compatibility and deprecation prototype

* Update examples/plot_metadata_routing.py

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

* make __metadata_request__* format more intuitive and less redundant

* metadata_request_factory always returns a copy

* fix tests for the changed __metadata_request__* format

* in example: foo->sample_weight, bar->groups

* get_method_input->get_input

* minor comments from Guillaume

* fix estimator checks tests

* Improved sample props developer API

* fixes, updated doc, decorator

* Add docstrings and some API cleanup

* unify serialize/deserialize methods

* Add more docstring to process_routing

* fix MetadataRouter.get_params parameter mismatch

* DOC add missing name to MethodMetadataRequest.deserialize docstring

* DOC add MethodMapping.add docstring

* DOC fix colons after versionadded

* fix {method}_requests return type annotation

* metadata_request_factory -> metadata_router_factory and docstring fixes

* move 'me' out of the map in MetadataRouter

* more docstring refinements

* cleanup API addresses and create a utils.metadata_routing sub-folder

* fix module import issue

* more tests and a few bug fixes

* Joel's comments

* make process_routing a function

* docstring fix

* ^type -> $type

* remove deserialize, return instance, and add type as an attribute

* remove sentinels and use strings instead

* make RequestType searchable and check for valid identifier

* Route -> MethodPair

* remove unnecessary sorted

* clarification on usage of the process_routing func in the example

* only print methods with non-empty requests

* fix test_string_representations

* remove source build cache from CircleCI (temporarily)

* Trigger CI

* Invalidate linux-arm64 ccache my changing the key

* Trigger CI

* method, used_in -> callee, caller

* show RequestType instead of RequestType.value in _serialize()

* more informative error messages

* fix checking for conflicting keys

* get_router_for_object -> get_routing_for_object

* \{method\}_requests -> set_\{method\}_request

* address metadata_routing.rst comments

* some test enhancements

* TypeError for extra arguments

* add_request: prop -> param

* original_names -> return_alias

* add more tests for MetadataRouter and MethodMapping

* more suggestions from Joel's review

* fix return type

* apply more suggestions from Joel's review

* Christian\'s suggestions

* more notes from Christian

* test_get_routing_for_object returns empty requests on unknown objects

* more notes from Christian

* remove double line break

* more notes from Christian

* more notes from Christian

* make type private

* add more comments/docs

* fix test

* fix nits

* add forgotten nit

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
* FEAT add metadata routing to splitters

* forgot black

* add end of file line

* ignore mypy error

* SIMPLE_SPLITTERS -> NO_GROUP_SPLITTERS

* simplify comment

* add notes on what the tests do

* improve test and avoid using repr
@lorentzenchr lorentzenchr linked an issue Aug 6, 2022 that may be closed by this pull request
Copy link
Member Author

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thomasjpfan I added a documentation section to _metadata_requests.py, and reordered the file into sections. It should make it much easier to read / maintain the file.

I wonder if we should move sklearn.utils.metadata_routing to sklearn.metadata_routing. I don't have a strong preference.

- ``str``: metadata should be passed to the meta-estimator with \
this given alias instead of the original name.

The default (``UNCHANGED``) retains the existing request. This allows
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed it to sklearn.utils.metadata_routing.UNCHANGED.

Comment on lines +80 to +82
from .. import get_config
from ..exceptions import UnsetMetadataPassedError
from ._bunch import Bunch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These relative imports will make it harder to vendor. I'm okay with leaving this as is and fixing it as a follow up.

self._check_warnings(params=params)
unrequested = dict()
params = {} if params is None else params
args = {arg: value for arg, value in params.items() if value is not None}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests passed without needing to filter out if value is not None. Is it required to filter them out?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the tests pass because we use process_routing which applies the same filter. I removed the filter in process_routing and kept the one here now, and added a test to explicitly test passing None has no effect.

A string representing the mapping, it can be:

- `"one-to-one"`: a one to one mapping for all methods.
- `"method"`: the name of a single method.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A developer can be confused and think the string "method" is a valid route. We can dynamically list out the METHODS to make it clear.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expanded the description to make it more clear.


validate_keys : bool, default=True
Whether to check if the requested parameters fit the actual parameters
of the method.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this kwarg is only used for testing:

Suggested change
of the method.
of the method. Only used for testing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, in the mocking there's the only place we have validate_keys=False but that kind of usecase can happen in the wild too, where you want to accept any kind of request w/o having them pre-defined.

if "__metadata_request__" in attr
}
defaults.update(base_defaults)
defaults = dict(sorted(defaults.items()))
Copy link
Member

@thomasjpfan thomasjpfan Jun 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this line required?

Suggested change
defaults = dict(sorted(defaults.items()))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to make the output deterministic, otherwise I saw cases where the output would change and the repr would not give a consistent result.

Copy link
Member Author

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thomasjpfan should be good now.

self._check_warnings(params=params)
unrequested = dict()
params = {} if params is None else params
args = {arg: value for arg, value in params.items() if value is not None}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the tests pass because we use process_routing which applies the same filter. I removed the filter in process_routing and kept the one here now, and added a test to explicitly test passing None has no effect.

route : str
A string representing the mapping, it can be:

- `"one-to-one"`: a one to one mapping for all methods.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me a one to one mapping is a bijection, doesn't really need all in the name.

A string representing the mapping, it can be:

- `"one-to-one"`: a one to one mapping for all methods.
- `"method"`: the name of a single method.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expanded the description to make it more clear.


validate_keys : bool, default=True
Whether to check if the requested parameters fit the actual parameters
of the method.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, in the mocking there's the only place we have validate_keys=False but that kind of usecase can happen in the wild too, where you want to accept any kind of request w/o having them pre-defined.

if "__metadata_request__" in attr
}
defaults.update(base_defaults)
defaults = dict(sorted(defaults.items()))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to make the output deterministic, otherwise I saw cases where the output would change and the repr would not give a consistent result.

Comment on lines +20 to +21
In routers, e.g. meta-estimators and a multi metric scorer,
``get_metadata_routing`` returns a ``MetadataRouter`` object.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@lorentzenchr
Copy link
Member

lorentzenchr commented Jun 2, 2023

Changelog reflects changes that users should care about, but we kinda don't want end users to start using this immediately, while we want third party developers to care about it.

I could think to advertise it inside an "Experimental" section.

What about a dedicated new section: "For 3rd party maintainers" in the usual changelog.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last concern, otherwise this looks ready to go.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think further improvements to the docs and code can be done in follow up PRs.

Comment on lines +181 to +183
routed_params = Bunch(
estimator=Bunch(partial_fit=Bunch(sample_weight=sample_weight))
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker: Nesting Bunches like this is a little ugly. I'm considering not using Bunch at all and going with dicts. It would make the _metadata_request.py slightly easier to vendor. Then developers would need to do:

routed_params["estimator"]["partial_fit"]

(I'm +0.25 on using regular dicts)

@adrinjalali
Copy link
Member Author

One last concern, otherwise this looks ready to go.

@thomasjpfan done.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thomasjpfan thomasjpfan merged commit 62671a7 into main Jun 2, 2023
@thomasjpfan thomasjpfan deleted the sample-props branch June 2, 2023 20:43
@thomasjpfan
Copy link
Member

I added a comment to the meta-issue on SLEP006: #22893 (comment)

@agramfort
Copy link
Member

❤️ @adrinjalali ❤️

@jjerphan
Copy link
Member

jjerphan commented Jun 4, 2023

Thanks, @adrinjalali and everyone involved for this massive contribution. 🥳 👏 🤝

@lorentzenchr
Copy link
Member

How about a blog post like like the one for pandas in-out?

manudarmi pushed a commit to primait/scikit-learn that referenced this pull request Jun 12, 2023
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add sample_weight fit param for Pipeline
9 participants