
[WIP] Sample property routing #9566


Closed · wants to merge 7 commits

Conversation

jnothman
Member

@jnothman commented Aug 16, 2017

Reference issues

In an ideal world:

Functionality

As derived from my ranting at Thierry's scikit-learn/enhancement_proposals#6, this:

  • maintains backwards compatibility with the routing of kwargs to Pipeline.fit
  • maintains backwards compatibility with the routing of kwargs and groups to *SearchCV.fit
  • allows the user to specify a custom routing scheme with a fairly straightforward notation (see the check_routing docs); the *SearchCV default policy can be expressed in this notation, but the Pipeline default policy cannot
    • each destination name takes one of: all props, all but specified props (blacklist), only specified props (whitelist) optionally renamed, no props
  • a meta-estimator has a prop_routing parameter to facilitate this
  • a meta-estimator defines:
    • a set of destinations (e.g. steps' fit methods in Pipeline; perhaps also steps' transform methods in Pipeline; {cv, estimator, scoring} in *SearchCV)
    • a set of aliases for each destination (e.g. in Pipeline the step name to route to each fit method; "*" to route to all steps' fit methods)
    • a default routing policy (a specialised function in Pipeline; {'estimator': '-groups', 'cv': 'groups'} in *SearchCV)
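The notation described in the bullets above can be sketched as follows. This is a hypothetical reimplementation guessed from the description, not the PR's actual check_routing/Router code; the spec values ('*', '-name', a prop name, a rename dict, None) follow the per-destination options listed above.

```python
def route(routing, props):
    """Split props among destinations per a shallow routing spec.

    Each destination's spec is one of:
      '*'            -> all props
      '-name'        -> all props except 'name' (blacklist)
      'name'         -> only 'name' (whitelist)
      {'new': 'old'} -> only 'old', renamed to 'new'
      None           -> no props
    """
    routed = {}
    for dest, spec in routing.items():
        if spec is None:
            routed[dest] = {}
        elif spec == '*':
            routed[dest] = dict(props)
        elif isinstance(spec, dict):
            routed[dest] = {new: props[old]
                            for new, old in spec.items() if old in props}
        elif spec.startswith('-'):
            routed[dest] = {k: v for k, v in props.items() if k != spec[1:]}
        else:
            routed[dest] = {k: v for k, v in props.items() if k == spec}
    return routed

# The *SearchCV default policy from the last bullet:
routed = route({'estimator': '-groups', 'cv': 'groups'},
               {'groups': [0, 0, 1], 'sample_weight': [1.0, 2.0, 3.0]})
```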

Example?

Ideally we should be able to get nested CV with groups and weighted scoring working:

cross_val_score(GridSearchCV(..., cv=GroupKFold(),
                             prop_routing={'estimator': '-groups',
                                           'cv': 'groups',
                                           'scoring': 'sample_weight'}),
                X, y, groups,
                fit_params={'sample_weight': sample_weight},
                cv=GroupKFold(),
                prop_routing={'estimator': '*',
                              'cv': 'groups',
                              'scoring': 'sample_weight'})
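To see how the shallow specs above compose level by level, here is a toy trace with assumed semantics; route is a stand-in for the Router that check_routing would produce, not code from this PR:

```python
def route(routing, props):
    # minimal stand-in: '*' = all props, '-name' = all but name,
    # 'name' = only that prop
    out = {}
    for dest, spec in routing.items():
        if spec == '*':
            out[dest] = dict(props)
        elif spec.startswith('-'):
            out[dest] = {k: v for k, v in props.items() if k != spec[1:]}
        else:
            out[dest] = {k: v for k, v in props.items() if k == spec}
    return out

props = {'groups': [0, 0, 1], 'sample_weight': [1.0, 2.0, 3.0]}

# outer level (cross_val_score): the estimator is the GridSearchCV
outer = route({'estimator': '*', 'cv': 'groups',
               'scoring': 'sample_weight'}, props)

# inner level (GridSearchCV) re-routes only what it received
inner = route({'estimator': '-groups', 'cv': 'groups',
               'scoring': 'sample_weight'}, outer['estimator'])
```

Each level only routes one level down, which is what makes the specification shallow.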

It's pretty intricate, but I'm not sure we're going to get better than this!

TODO (and help wanted!)

  • ensure check_routing API is as we want (please comment!)
  • documentation for check_routing
  • documentation for Pipeline and *SearchCV
  • test "*" destination alias in Pipeline.
  • test prop_routing in *SearchCV, particularly to metrics
  • tests for error cases in Router and check_routing
  • more tests for valid cases in Router and check_routing
  • make prop_routing available in all CV routines
  • make prop_routing available in all metaestimators
  • make prop_routing available in gaussian processes (for kernels)
  • get rid of has_fit_parameter where possible

The last steps there are the most arduous, and if we decide this is the way forward, I would welcome collaboration when we get to that stage.

@amueller
Member

Can you show an example with a pipeline, where some estimators support sample weights and others don't? How does it look with FeatureUnion? Why did Pipeline get a routing argument, but FeatureUnion did not? This looks pretty similar to the meta-estimator based proposal we discussed at the sprint. One of my main questions was "how does routing nest and how does that affect specification".

I think you did what looks to me like "shallow" routing in that you specify routing at every level for one level below, right? (Not sure how that deals with FeatureUnion right now because that doesn't do routing).

Did you want to allow renaming? What would that look like?

@jnothman
Member Author

I've not implemented this in FeatureUnion (or anywhere but *SearchCV and Pipeline) yet. I can do so, but would rather work out the best way to test the existing changes, and ensure the API is right, before I go on a rampage.

Currently FeatureUnion.fit doesn't take fit_params (except, strangely, in fit_transform, which may well be used by someone somewhere because Pipeline will use fit_transform when available). So we'll adopt a default policy there of routing all props to fit/fit_transform of all transformers. That would be achieved with:

router = check_routing(self.prop_routing,
                       dests=[[name, '*']
                              for name, _ in self.transformer_list],
                       default={'*': '*'})
props_per_transformer, unused = router(fit_params)
if unused:
    raise ValueError(...)
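Under that default, every transformer would receive every prop. A toy simulation of the intended result (check_routing itself is not reimplemented here, and the transformer names are made up):

```python
transformer_list = [('tfidf', None), ('counts', None)]  # hypothetical names
fit_params = {'sample_weight': [0.5, 1.5]}

# default={'*': '*'}: the '*' alias fans all props out to every transformer
props_per_transformer = {name: dict(fit_params)
                         for name, _ in transformer_list}

# anything no destination consumed would be reported as unused
used = set()
for props in props_per_transformer.values():
    used |= set(props)
unused = set(fit_params) - used
```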

Yes, it is shallow. I've not yet come up with a deep approach, but perhaps it is possible. I think the present solution is going to be much easier if we want to maintain backwards compatibility.

Yes, it supports renaming (see the check_routing examples). However, I am a bit concerned about renaming...

One thing I remain confused about is what happens when someone calls GridSearchCV(..., prop_routing={...}).score with some props. As it stands, prop_routing specifically defines how props passed into fit are routed. Do we:

  • pass all the props to the scorer?
  • do any renaming due to prop_routing['scoring'] and raise an error if there are any unused props?

Not supporting renaming is simpler in this case. However, you can imagine we might want to support routing in score, in order to pass some things to the prediction method in the scorer, and others to the metric... This overall design is flexible to addition of new destinations like that, but I'm not sure how to deal with params provided to methods other than fit.

@jnothman
Member Author

We can withhold both renaming and passing props to score for now, too.

I'll have to think about the mechanics of a "deep" spec, but I think I'd prefer local ("shallow") routing, as it makes it easier to wrap things afterwards.

@jnothman
Member Author

I forgot to respond to this:

Can you show an example with a pipeline, where some estimators support sample weights and others don't?

Supporting a prop is not sufficient for it to be passed. The user needs to be explicit about where a prop should be passed. If routing were based on support rather than on an explicit user request, then changing an estimator from def fit(self, X, y) to def fit(self, X, y, sample_weight) would change the behaviour of callers.
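To make that hazard concrete, here is a sketch of signature-based "support" detection in the spirit of sklearn.utils.validation.has_fit_parameter (reimplemented with inspect rather than imported). A meta-estimator routing on support would silently start forwarding weights the moment the parameter appears:

```python
import inspect

def has_fit_parameter(estimator, param):
    # signature-based "support" check, in the spirit of
    # sklearn.utils.validation.has_fit_parameter
    return param in inspect.signature(estimator.fit).parameters

class Before:
    def fit(self, X, y):
        self.got_weight_ = False
        return self

class After:
    # the same estimator after gaining a sample_weight parameter
    def fit(self, X, y, sample_weight=None):
        self.got_weight_ = sample_weight is not None
        return self

def fit_with_support_based_routing(est, X, y, sample_weight):
    # routing on "support" rather than an explicit user request:
    # the caller's behaviour changes when the signature changes
    if has_fit_parameter(est, 'sample_weight'):
        return est.fit(X, y, sample_weight=sample_weight)
    return est.fit(X, y)

b = fit_with_support_based_routing(Before(), [[0]], [0], [2.0])
a = fit_with_support_based_routing(After(), [[0]], [0], [2.0])
```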

Rather, if you want to route a single sample_weight prop to some but not all elements of a pipeline:

p = Pipeline([('a', MyTransformer()), ('b', MyTransformer()), ('c', MyTransformer())],
             prop_routing={'a': 'sample_weight', 'c': 'sample_weight'})
p.fit(X, y, sample_weight=sample_weight)

Currently Pipeline defines the destination alias * to pass props to fit of all steps, so passing to all can be achieved with:

p = Pipeline([('a', MyTransformer()), ('b', MyTransformer()), ('c', MyTransformer())],
             prop_routing={'*': 'sample_weight'})
p.fit(X, y, sample_weight=sample_weight)

As use-cases become apparent, we may decide to add other destination aliases to Pipeline including:

  • *nolast: pass these props to fit of all but the last step
  • <STEP NAME>:transform: pass these props to transform for the named step
  • <STEP NAME>:fit: as a more explicit alias for <STEP NAME>
  • *:transform: pass these props to transform for all steps
  • *:fit: an alias for *
  • *nolast:transform etc.
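A guess at how such aliases could be resolved; the alias names ('*', '*nolast', ':transform') come from the list above, but the parsing itself is hypothetical:

```python
def resolve_alias(alias, step_names):
    """Return (step names, method) that a prop routed to ``alias`` reaches."""
    name, _, method = alias.partition(':')
    method = method or 'fit'              # bare '<STEP NAME>' means ':fit'
    if name == '*':
        steps = list(step_names)
    elif name == '*nolast':
        steps = list(step_names)[:-1]     # all but the last step
    else:
        steps = [name]
    return steps, method

steps = ['a', 'b', 'c']
```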

@adrinjalali
Member

Took me a while to go through all the comments on #4497, some other threads, and here. Is this proposal now the way to go (with possible minor changes)? If yes, I could give you a hand if you wanna prioritize and revive it.

@shoshijak

Same as @adrinjalali, if this proposal is the way to go, I would help to make it happen.

@amueller
Member

amueller commented Oct 9, 2018

I think there is still no consensus.

@jnothman
Member Author

I think there is still no consensus.

But facts on the ground might make consensus happen.

There are limitations to this approach:

  1. A "deep" approach might be easier for users to understand conceptually.
  2. A "deep" approach might lend itself better to routing in methods other than fit, such as score.
  3. Having samplewise metadata controlled by keyword arguments limits reuse of keyword arguments for other things, including passing featurewise metadata or metadata relating to the target.

@amueller
Member

(I'm not very excited about this part of the roadmap and would rather think about freezing first - also estimator tags are ready for review ;)

@jnothman
Member Author

jnothman commented Oct 10, 2018 via email

@amueller
Member

I think there is a need, but I don't think it impacts a lot of users. I think other API issues impact more users.

@adrinjalali adrinjalali mentioned this pull request Feb 8, 2019
@amueller amueller added Needs Decision Requires decision Stalled labels Aug 6, 2019
@hermidalc
Contributor

Sorry that I duplicated some work in #15370; I didn't see this PR until now. But maybe it will help give some ideas: there are common use cases where one also needs sample properties passed to transform, for both train and test data. That's why I didn't name it score_props: properties needed by transform methods are also being passed.
