Completing sample props SLEP006 #43

Merged: 4 commits into master, Aug 31, 2020

Conversation

jnothman (Member, Author) commented Aug 4, 2020

This PR fills in several TODOs in SLEP006, with the aim of having it ready for voting in September 2020, in conjunction with refinements to scikit-learn/scikit-learn#16079.

Comment on lines +448 to +450
By default, `sample_weight` will not be requested by estimators that support
it. This ensures that addition of `sample_weight` support to an estimator will
not change its behaviour.

Member:

Are you sure? I think right now if the underlying estimator supports sample_weight and the user passes sample_weight to the meta-estimator's fit, it'll be passed to the estimator. With this default you suggest, it wouldn't.

jnothman (Author):

> right now if the underlying estimator supports sample_weight and the user passes sample_weight to the meta-estimator's fit, it'll be passed to the estimator. With this default you suggest, it wouldn't.

That's true. My suggestion here sacrifices some usability for explicitness. The idea would be that (after a deprecation period, or from version 1.0) passing sample_weight to the meta-estimator where no consumer has requested it (the default) would raise an "unexpected fit parameter" error. Thus the explicit use of require_sample_weight would be forced.

As well as the ongoing compatibility issues (code invisibly starting to use weights once someone implements support for them), my concern with requesting sample_weight by default is that it may be confusing if estimators request sample_weight by default but scorers do not.

Making it explicit sucks a little because people need to express sample_weight usage in two places rather than one, but I think it won't hurt too much, and it will make sure the reader of the code knows what's going on.
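
For illustration, a minimal sketch of the explicit behaviour being argued for here, assuming steps expose the set of props they have requested (the attribute name is invented for this sketch):

    def route_fit_params(steps, fit_params):
        # steps: list of (name, estimator) pairs. A prop is forwarded only
        # to steps that requested it; a prop no step requested raises,
        # rather than being silently dropped or implicitly consumed.
        routed = {name: {} for name, _ in steps}
        for key, value in fit_params.items():
            consumers = [name for name, est in steps
                         if key in getattr(est, "requested_props", ())]
            if not consumers:
                raise TypeError(f"unexpected fit parameter: {key!r}")
            for name in consumers:
                routed[name][key] = value
        return routed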

Member:

Looking at the conversation in scikit-learn/scikit-learn#18159, I'd say folks would rather have this routing done implicitly?

I think I'd be in favor of that implicit routing too.

And I think scorers should ask for sample_weight by default too.

jnothman (Author):

> Looking at the conversation in scikit-learn/scikit-learn#18159, I'd say folks would rather have this routing done implicitly?

I think @rth's motivation there is explicitly about avoiding the coupling of the param name with the composite structure of the estimator, i.e. you need to prefix things differently depending on how many layers of pipeline it's in, or depending on what the estimator's called. I've long been frustrated by that, for parameter names too, and it's a big reason I made searchgrid. Only Solution 2 here retains this issue. It is no longer seriously being considered.
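
To illustrate the coupling, here is the legacy dunder routing in action; the keys below use the auto-generated step names, and X, y, w are placeholder data:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    X = np.random.RandomState(0).randn(20, 3)
    y = (X[:, 0] > 0).astype(int)
    w = np.ones(20)

    # The key that routes the same weights depends on the step's name and
    # on nesting depth, so restructuring the pipeline breaks the call.
    inner = make_pipeline(StandardScaler(), LogisticRegression())
    inner.fit(X, y, logisticregression__sample_weight=w)

    outer = make_pipeline(StandardScaler(), inner)
    outer.fit(X, y, pipeline__logisticregression__sample_weight=w)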

> I think I'd be in favor of that implicit routing too. And I think scorers should ask for sample_weight by default too.

I get that these days we don't have a PR called "Add sample_weight support to ..." very often, but it used to be very common, and I fear that we will fit very different models if we change support for sample_weight in an estimator or metric. It will become ever more important for people to pin their version, and ever harder to know what some code does just by reading the latest docs.

Member:

I understand the desire to have the behavior not change. But if for a moment we forget about the current behavior and think about what the intuitive desired behavior would be, which one would you go for?

jnothman (Author):

From the developer's perspective it would be great to have sample weights always passed automatically: for scoring and fitting to be weighted by default.

From the maintainer's or reader's perspective, however, this is hell. To know if weighted fitting is used, you need to explicitly look up whether weights are supported in each estimator or scorer used. Switching estimators in a grid search can enable or disable weighted fitting or scoring implicitly. Upgrading scikit-learn or third-party libraries could enable weighted fitting or scoring implicitly.

All this tells me that the straw-man developer above was mistaken. All code must be written to be read, and to be read unambiguously. This developer had the intuition that implicit is better than explicit, but they were wrong.
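
As a sketch of that hazard under the (hypothetical) implicit default, consider a search over candidate estimators; SVC.fit accepts sample_weight while KNeighborsClassifier.fit does not:

    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])
    search = GridSearchCV(pipe, {"clf": [SVC(), KNeighborsClassifier()]})

    # Under implicit routing this would weight the SVC candidate but
    # silently fit the KNeighborsClassifier candidate unweighted:
    # search.fit(X, y, sample_weight=w)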

Member:

OK, fair enough. Pinging @rth, @glemaitre and @NicolasHug since this is very much related to scikit-learn/scikit-learn#18159.

Comment on lines +452 to +455
During a deprecation period, fit_params will be handled dually: Keys that are
requested will be passed through the new request mechanism, while keys that are
not known will be routed using legacy mechanisms. At completion of the
deprecation period, the legacy handling will cease.

Member:

Do we have to support the combination? It'd probably be cleaner if we only supported the legacy behavior when everything follows that protocol?

Although that may have the issue that third-party estimators won't all adopt the new API at the same time.

jnothman (Author):

How will the meta-estimator know whether or not to use legacy mode?

Member:

Here I'm assuming you're talking about the estimator__param format, and for that, at least in Pipeline, I thought of checking whether any parameter follows that format. If yes, then we fall back to the old behavior completely. If no parameter follows that format, then we do things the new way.

I think this would cause less confusion than handling them both at the same time.
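
A sketch of that all-or-nothing check (the helper name is invented here):

    def uses_legacy_routing(fit_params):
        # If any key uses the legacy "<step_name>__<param>" dunder format,
        # fall back to the old behavior wholesale; otherwise route
        # everything through the new request mechanism.
        return any("__" in key for key in fit_params)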

jnothman (Author):

> If yes, then we fall back to the old behavior completely. If no parameter follows that format, then we do things the new way.

Then that disables support for a prop explicitly requested as step_name__*?

I don't really see what we lose by supporting both during deprecation.
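
For comparison, a sketch of the dual handling proposed in the SLEP text quoted above (names illustrative):

    import warnings

    def split_fit_params(fit_params, requested):
        # Requested keys go through the new request mechanism; unknown keys
        # fall back to legacy routing, with a warning, until the
        # deprecation period ends.
        new_style, legacy = {}, {}
        for key, value in fit_params.items():
            if key in requested:
                new_style[key] = value
            else:
                warnings.warn(
                    f"{key!r} is routed via legacy fit_params handling; "
                    "this will become an error after the deprecation "
                    "period", FutureWarning)
                legacy[key] = value
        return new_style, legacy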

Member:

I find it confusing to read code that passes some parameters to a pipeline as step_name__* and others as param_name. Otherwise, I'm just worried about the code complexity (and potential bugs) of supporting them side by side.

jnothman (Author):

I'm not yet sure how error-prone or complex this code would be.

We could have users switch on the new routing mechanism with a config until v1.0?
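
A sketch of what such an opt-in could look like (the flag name is invented for illustration, not an existing scikit-learn option):

    import sklearn

    # Hypothetical global switch: legacy dunder routing stays the default,
    # and users opt in to the new request-based routing until it becomes
    # the default in v1.0.
    sklearn.set_config(enable_metadata_routing=True)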

jnothman (Author):

Do we merge this so that the full SLEP is rendered on rtd?

adrinjalali (Member):

Sure, it's easier to follow up on the rest of the issues if we have this up there anyway.

adrinjalali merged commit 4f17c3a into master on Aug 31, 2020.

rth (Member) commented Sep 2, 2020

Thanks for doing this SLEP and proposing associated implementations! It's certainly a challenging topic and I don't have good visibility into all the implications/constraints. My comments are mostly on usability with pipelines, as outlined in scikit-learn/scikit-learn#18159; in particular:

  • With simple pipelines one would have to do something like

        pipe = make_pipeline(
            StandardScaler().request_sample_weight(fit='sample_weight'),
            LogisticRegression().request_sample_weight(fit='sample_weight')
        )

    which can become quite verbose for complex pipelines, particularly assuming that more and more estimators are going to support sample weights. And since one could argue that sample weights should be passed to all estimators (scikit-learn#18159 (comment)), the unfortunate reason for this verbosity is mostly to preserve backward compatibility, as opposed to really needing fine-grained control over where sample weights are sent at fit time.
  • For the "5.4. Keyword arguments vs. a single argument" discussion, the single-argument form

        est.fit(X, y, prop={'groups': groups, 'weight': weight})

    goes kind of against the idea of a common scikit-learn API for estimators, as it would create two types of objects: pipeline-like/meta-estimator objects and the rest. Some would be fitted with est.fit(X, y, prop={'weight': weight}) and others with est.fit(X, y, sample_weight=weight), which goes against the idea that, say, StandardScaler() + Ridge() will behave the same as Ridge(normalize=True), or that CountVectorizer() + TfidfTransformer() = TfidfVectorizer(). Granted, these don't support sample_weight at the moment, but they very well could.

I would be interested in exploring a mixture of the option proposed in scikit-learn/scikit-learn#18159 (comment) and option 4. For sample weights that would be:

  • Add sample_weight to Pipeline, ColumnTransformer, FeatureUnion. Pass it to all estimators; fail otherwise.
  • Whether an estimator supports sample_weight is stored in an estimator tag (see also scikit-learn#13565, "[WIP] Add a supports_sample_weight tag").
  • One can override this tag, e.g. StandardScaler().set_estimator_tag(fit_sample_weight=None), to allow customization if needed.

So for sample weights that would be a little similar to option 4, but less verbose and with reasonable defaults. I'm not yet sure how this would generalize to other sample props; maybe keeping a props argument for the others is fine.
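
A sketch of how a meta-estimator could consult such a tag; supports_sample_weight is the proposed tag, not an existing one, and _get_tags is the private tag accessor of this era:

    def fit_step(est, X, y, sample_weight):
        # Pass sample_weight to any step whose (proposed) tag declares
        # support; fail loudly otherwise, unless the user overrode the tag
        # to opt the step out of weighting.
        if est._get_tags().get("supports_sample_weight", False):
            return est.fit(X, y, sample_weight=sample_weight)
        raise TypeError(
            f"{type(est).__name__} does not support sample_weight")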

> Why treat sample_weight specially? (scikit-learn/scikit-learn#18159 (comment))

At least for fit, estimators/pipelines that support sample weights need to verify invariance properties (the check_sample_weights_invariance(..., kind="zeros") common check), so that should constrain where they need to be passed in a pipeline. Also, they are probably the most common sample prop, and it would be nice to be able to use them without much added code.

adrinjalali (Member):

Another option for the verbosity issue is to have a utility function, something like propagate_metadata(pipeline, 'all', 'sample_weights').
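
A sketch of such a helper, assuming each step exposes a request_<prop>-style method as discussed above (signature and behaviour illustrative only):

    def propagate_metadata(pipeline, steps, prop):
        # Mark `prop` as requested on the given steps ("all" for every
        # step in the pipeline).
        for name, est in pipeline.steps:
            if steps == "all" or name in steps:
                getattr(est, f"request_{prop}")(fit=True)
        return pipeline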

jnothman (Author) commented Sep 2, 2020

> Some would be fitted with est.fit(X, y, prop={'weight': weight}) and others with est.fit(X, y, sample_weight=weight)

No, I'd see this as choosing to deprecate est.fit(X, y, sample_weight=weight) support, and always using prop.

> pipe = make_pipeline(
>     StandardScaler().request_sample_weight(fit='sample_weight'),
>     LogisticRegression().request_sample_weight(fit='sample_weight')
> )
>
> which can become quite verbose for complex pipelines, particularly assuming that more and more estimators are going to support sample weights.

I'd now see the more likely syntax as StandardScaler().request_sample_weight(fit=True) or StandardScaler(request_sample_weight=True).

I appreciate that this could be more usable still. Making sample_weight requested by default only works, as far as I'm concerned (see #43 (comment)), if every estimator (and perhaps scorer) requests sample_weight and then raises an error if it does not actually support it: the user needs to explicitly turn it off every time sample_weight is used and some component does not support it. That does not seem a reasonable approach for other props at this point.
