
[WIP] Sample property routing #9566


Closed · wants to merge 7 commits

Conversation

jnothman
Member

@jnothman commented Aug 16, 2017

Reference issues

In an ideal world:

Functionality

As derived from my ranting at Thierry's scikit-learn/enhancement_proposals#6, this:

  • maintains backwards compatibility with the routing of kwargs to Pipeline.fit
  • maintains backwards compatibility with the routing of kwargs and groups to *SearchCV.fit
  • allows the user to specify a custom routing scheme with a fairly straightforward notation (see the check_routing docs); the *SearchCV default policy can be expressed in this notation, but the Pipeline default policy cannot
    • each destination name takes one of: all props, all but specified props (blacklist), only specified props (whitelist) optionally renamed, no props
  • a meta-estimator has a prop_routing parameter to facilitate this
  • a meta-estimator defines:
    • a set of destinations (e.g. steps' fit methods in Pipeline; perhaps also steps' transform methods in Pipeline; {cv, estimator, scoring} in *SearchCV)
    • a set of aliases for each destination (e.g. in Pipeline the step name to route to each fit method; "*" to route to all steps' fit methods)
    • a default routing policy (a specialised function in Pipeline; {'estimator': '-groups', 'cv': 'groups'} in *SearchCV)
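The notation described in the bullets above can be sketched as follows. This is a hypothetical reimplementation guessed from the description, not the PR's actual check_routing/Router code; the spec values ('*', '-name', a prop name, a rename dict, None) follow the per-destination options listed above.

```python
def route(routing, props):
    """Split props among destinations per a shallow routing spec.

    Each destination's spec is one of:
      '*'            -> all props
      '-name'        -> all props except 'name' (blacklist)
      'name'         -> only 'name' (whitelist)
      {'new': 'old'} -> only 'old', renamed to 'new'
      None           -> no props
    """
    routed = {}
    for dest, spec in routing.items():
        if spec is None:
            routed[dest] = {}
        elif spec == '*':
            routed[dest] = dict(props)
        elif isinstance(spec, dict):
            routed[dest] = {new: props[old]
                            for new, old in spec.items() if old in props}
        elif spec.startswith('-'):
            routed[dest] = {k: v for k, v in props.items() if k != spec[1:]}
        else:
            routed[dest] = {k: v for k, v in props.items() if k == spec}
    return routed

# The *SearchCV default policy from the last bullet:
routed = route({'estimator': '-groups', 'cv': 'groups'},
               {'groups': [0, 0, 1], 'sample_weight': [1.0, 2.0, 3.0]})
```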

Example?

Ideally we should be able to get nested CV with groups and weighted scoring working:

cross_val_score(GridSearchCV(..., cv=GroupKFold(),
                             prop_routing={'estimator': '-groups',
                                           'cv': 'groups',
                                           'scoring': 'sample_weight'}),
                X, y, groups,
                fit_params={'sample_weight': sample_weight},
                cv=GroupKFold(),
                prop_routing={'estimator': '*',
                              'cv': 'groups',
                              'scoring': 'sample_weight'})
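To see how the shallow specs above compose level by level, here is a toy trace with assumed semantics; route is a stand-in for the Router that check_routing would produce, not code from this PR:

```python
def route(routing, props):
    # minimal stand-in: '*' = all props, '-name' = all but name,
    # 'name' = only that prop
    out = {}
    for dest, spec in routing.items():
        if spec == '*':
            out[dest] = dict(props)
        elif spec.startswith('-'):
            out[dest] = {k: v for k, v in props.items() if k != spec[1:]}
        else:
            out[dest] = {k: v for k, v in props.items() if k == spec}
    return out

props = {'groups': [0, 0, 1], 'sample_weight': [1.0, 2.0, 3.0]}

# outer level (cross_val_score): the estimator is the GridSearchCV
outer = route({'estimator': '*', 'cv': 'groups',
               'scoring': 'sample_weight'}, props)

# inner level (GridSearchCV) re-routes only what it received
inner = route({'estimator': '-groups', 'cv': 'groups',
               'scoring': 'sample_weight'}, outer['estimator'])
```

Each level only routes one level down, which is what makes the specification shallow.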

It's pretty intricate, but I'm not sure we're going to get better than this!

TODO (and help wanted!)

  • ensure check_routing API is as we want (please comment!)
  • documentation for check_routing
  • documentation for Pipeline and *SearchCV
  • test "*" destination alias in Pipeline.
  • test prop_routing in *SearchCV, particularly to metrics
  • tests for error cases in Router and check_routing
  • more tests for valid cases in Router and check_routing
  • make prop_routing available in all CV routines
  • make prop_routing available in all metaestimators
  • make prop_routing available in gaussian processes (for kernels)
  • get rid of has_fit_parameter where possible

The last steps there are the most arduous, and if we decide this is the way forward, I would welcome collaboration when we get to that stage.

@amueller
Member

Can you show an example with a pipeline, where some estimators support sample weights and others don't? How does it look with FeatureUnion? Why did Pipeline get a routing argument, but FeatureUnion did not? This looks pretty similar to the meta-estimator based proposal we discussed at the sprint. One of my main questions was "how does routing nest and how does that affect specification".

I think you did what looks to me like "shallow" routing in that you specify routing at every level for one level below, right? (Not sure how that deals with FeatureUnion right now because that doesn't do routing).

Did you want to allow renaming? What would that look like?

@jnothman
Member Author

I've not implemented this in FeatureUnion (or anywhere but *SearchCV and Pipeline) yet. I can do so, but would rather work out the best way to test the existing changes, and ensure the API is right, before I go on a rampage.

Currently FeatureUnion.fit doesn't take fit_params (except, strangely, in fit_transform, which may well be used by someone somewhere because Pipeline will use fit_transform when available). So we'll adopt a default policy there of routing all props to fit/fit_transform of all transformers. That would be achieved with:

router = check_routing(self.prop_routing,
                       dests=[[name, '*']
                              for name, _ in self.transformer_list],
                       default={'*': '*'})
props_per_transformer, unused = router(fit_params)
if unused:
    raise ValueError(...)
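Under that default, every transformer would receive every prop. A toy simulation of the intended result (check_routing itself is not reimplemented here, and the transformer names are made up):

```python
transformer_list = [('tfidf', None), ('counts', None)]  # hypothetical names
fit_params = {'sample_weight': [0.5, 1.5]}

# default={'*': '*'}: the '*' alias fans all props out to every transformer
props_per_transformer = {name: dict(fit_params)
                         for name, _ in transformer_list}

# anything no destination consumed would be reported as unused
used = set()
for props in props_per_transformer.values():
    used |= set(props)
unused = set(fit_params) - used
```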

Yes, it is shallow. I've not yet come up with a deep approach, but perhaps it is possible. I think the present solution is going to be much easier if we want to maintain backwards compatibility.

Yes, it supports renaming (see the check_routing examples). However, I am a bit concerned about renaming...

One thing I remain confused about is what happens when someone calls GridSearchCV(..., prop_routing={...}).score with some props. As it stands, prop_routing specifically defines how props passed into fit are routed. Do we:

  • pass all the props to the scorer?
  • do any renaming due to prop_routing['scoring'] and raise an error if there are any unused props?

Not supporting renaming is simpler in this case. However, you can imagine we might want to support routing in score, in order to pass some things to the prediction method in the scorer, and others to the metric... This overall design is flexible to addition of new destinations like that, but I'm not sure how to deal with params provided to methods other than fit.

@jnothman
Member Author

We can withhold both renaming and passing props to score for now, too.

I'll have to think about the mechanics of a "deep" spec, but I think I'd prefer local ("shallow") routing, as it makes it easier to wrap things afterwards.

@jnothman
Member Author

I forgot to respond to this:

Can you show an example with a pipeline, where some estimators support sample weights and others don't?

Supporting a prop is not sufficient for it to be passed. The user needs to be explicit about where a prop should be passed. If routing were based on support rather than on an explicit user request, then changing an estimator from def fit(self, X, y) to def fit(self, X, y, sample_weight) would change the behaviour of callers.
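To make that hazard concrete, here is a sketch of signature-based "support" detection in the spirit of sklearn.utils.validation.has_fit_parameter (reimplemented with inspect rather than imported). A meta-estimator routing on support would silently start forwarding weights the moment the parameter appears:

```python
import inspect

def has_fit_parameter(estimator, param):
    # signature-based "support" check, in the spirit of
    # sklearn.utils.validation.has_fit_parameter
    return param in inspect.signature(estimator.fit).parameters

class Before:
    def fit(self, X, y):
        self.got_weight_ = False
        return self

class After:
    # the same estimator after gaining a sample_weight parameter
    def fit(self, X, y, sample_weight=None):
        self.got_weight_ = sample_weight is not None
        return self

def fit_with_support_based_routing(est, X, y, sample_weight):
    # routing on "support" rather than an explicit user request:
    # the caller's behaviour changes when the signature changes
    if has_fit_parameter(est, 'sample_weight'):
        return est.fit(X, y, sample_weight=sample_weight)
    return est.fit(X, y)

b = fit_with_support_based_routing(Before(), [[0]], [0], [2.0])
a = fit_with_support_based_routing(After(), [[0]], [0], [2.0])
```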

Rather, if you want to route a single sample_weight prop to some but not all elements of a pipeline:

p = Pipeline([('a', MyTransformer()), ('b', MyTransformer()), ('c', MyTransformer())],
             prop_routing={'a': 'sample_weight', 'c': 'sample_weight'})
p.fit(X, y, sample_weight=sample_weight)

Currently Pipeline defines the destination alias * to pass props to fit of all steps, so passing to all can be achieved with:

p = Pipeline([('a', MyTransformer()), ('b', MyTransformer()), ('c', MyTransformer())],
             prop_routing={'*': 'sample_weight'})
p.fit(X, y, sample_weight=sample_weight)

As use-cases become apparent, we may decide to add other destination aliases to Pipeline including:

  • *nolast: pass these props to fit of all but the last step
  • <STEP NAME>:transform: pass these props to transform for the named step
  • <STEP NAME>:fit: as a more explicit alias for <STEP NAME>
  • *:transform: pass these props to transform for all steps
  • *:fit: an alias for *
  • *nolast:transform etc.
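A guess at how such aliases could be resolved; the alias names ('*', '*nolast', ':transform') come from the list above, but the parsing itself is hypothetical:

```python
def resolve_alias(alias, step_names):
    """Return (step names, method) that a prop routed to ``alias`` reaches."""
    name, _, method = alias.partition(':')
    method = method or 'fit'              # bare '<STEP NAME>' means ':fit'
    if name == '*':
        steps = list(step_names)
    elif name == '*nolast':
        steps = list(step_names)[:-1]     # all but the last step
    else:
        steps = [name]
    return steps, method

steps = ['a', 'b', 'c']
```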

@adrinjalali
Member

Took me a while to go through all the comments on #4497, some other threads, and here. Is this proposal now the way to go (with possible minor changes)? If yes, I could give you a hand if you wanna prioritize and revive it.

@shoshijak

Same as @adrinjalali, if this proposal is the way to go, I would help to make it happen.

@amueller
Member

amueller commented Oct 9, 2018

I think there is still no consensus.

@jnothman
Member Author

I think there is still no consensus.

But facts on the ground might make consensus happen.

There are limitations to this approach:

  1. A "deep" approach might be easier for users to understand conceptually.
  2. A "deep" approach might lend itself better to routing in methods other than fit, such as score.
  3. Having samplewise metadata controlled by keyword arguments limits reuse of keyword arguments for other things, including passing featurewise metadata or metadata relating to the target.

@amueller
Member

(I'm not very excited about this part of the roadmap and would rather think about freezing first - also estimator tags are ready for review ;)

@jnothman
Member Author

jnothman commented Oct 10, 2018 via email

@amueller
Member

I think there is a need, but I don't think it impacts a lot of users. I think other API issues impact more users.

@adrinjalali adrinjalali mentioned this pull request Feb 8, 2019
@amueller amueller added Needs Decision Requires decision Stalled labels Aug 6, 2019
@hermidalc
Contributor

Sorry that I duplicated some work in #15370; I didn't see this PR until now. But maybe it will help give some ideas: there are common use cases where one also needs sample properties passed to transform, for both train and test data. That's why I didn't name it score_props: properties needed by transform methods are also being passed.
