Completing sample props SLEP006 #43

Merged: 4 commits into master, Aug 31, 2020

Conversation

jnothman (Member, Author) commented Aug 4, 2020

This PR fills in several TODOs in SLEP006, with the aim of having it ready for voting in September 2020, in conjunction with refinements to scikit-learn/scikit-learn#16079.

Comment on lines +448 to +450
By default, `sample_weight` will not be requested by estimators that support
it. This ensures that addition of `sample_weight` support to an estimator will
not change its behaviour.

Member:

Are you sure? I think right now if the underlying estimator supports sample_weight and the user passes sample_weight to the meta-estimator's fit, it'll be passed to the estimator. With this default you suggest, it wouldn't.

jnothman (Author):

> right now if the underlying estimator supports sample_weight and the user passes sample_weight to the meta-estimator's fit, it'll be passed to the estimator. With this default you suggest, it wouldn't.

That's true. My suggestion here sacrifices some usability for explicitness. The idea would be that (after a deprecation period, or from version 1.0) passing sample_weight to the meta-estimator where no consumer has requested it (the default) would raise an "unexpected fit parameter" error. Thus the explicit use of require_sample_weight would be forced.

As well as the ongoing compatibility issues (code invisibly starting to use weights once someone implements support for them), my concern with requesting sample_weight by default is that it may be confusing if estimators request sample_weight by default but scorers do not.

Making it explicit sucks a little because people need to express sample_weight usage in two places rather than one, but I think it won't hurt too much, and it will make sure the reader of the code knows what's going on.
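
For illustration, a minimal sketch of the explicit behaviour being argued for here, assuming steps expose the set of props they have requested (the attribute name is invented for this sketch):

    def route_fit_params(steps, fit_params):
        # steps: list of (name, estimator) pairs. A prop is forwarded only
        # to steps that requested it; a prop no step requested raises,
        # rather than being silently dropped or implicitly consumed.
        routed = {name: {} for name, _ in steps}
        for key, value in fit_params.items():
            consumers = [name for name, est in steps
                         if key in getattr(est, "requested_props", ())]
            if not consumers:
                raise TypeError(f"unexpected fit parameter: {key!r}")
            for name in consumers:
                routed[name][key] = value
        return routed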

Member:

Looking at the conversation in scikit-learn/scikit-learn#18159, I'd say folks would rather have this routing done implicitly?

I think I'd be in favor of that implicit routing too.

And I think scorers should ask for sample_weight by default too.

jnothman (Author):

> Looking at the conversation in scikit-learn/scikit-learn#18159, I'd say folks would rather have this routing done implicitly?

I think @rth's motivation there is explicitly about avoiding the coupling of the param name with the composite structure of the estimator, i.e. you need to prefix things differently depending on how many layers of pipeline it's in, or depending on what the estimator's called. I've long been frustrated by that, for parameter names too, and it's a big reason I made searchgrid. Only Solution 2 here retains this issue. It is no longer seriously being considered.
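
To illustrate the coupling, here is the legacy dunder routing in action; the keys below use the auto-generated step names, and X, y, w are placeholder data:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    X = np.random.RandomState(0).randn(20, 3)
    y = (X[:, 0] > 0).astype(int)
    w = np.ones(20)

    # The key that routes the same weights depends on the step's name and
    # on nesting depth, so restructuring the pipeline breaks the call.
    inner = make_pipeline(StandardScaler(), LogisticRegression())
    inner.fit(X, y, logisticregression__sample_weight=w)

    outer = make_pipeline(StandardScaler(), inner)
    outer.fit(X, y, pipeline__logisticregression__sample_weight=w)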

> I think I'd be in favor of that implicit routing too. And I think scorers should ask for sample_weight by default too.

I get that these days we don't have a PR called "Add sample_weight support to ..." very often, but it used to be very common, and I fear that we will fit very different models if we change support for sample_weight in an estimator or metric. It will become ever more important for people to pin their version, and ever harder to know what some code does just by reading the latest docs.

Member:

I understand the desire to have the behavior not change. But if for a moment we forget about the current behavior and think about what the intuitive desired behavior would be, which one would you go for?

jnothman (Author):

From the developer's perspective it would be great to have sample weights always passed automatically: for scoring and fitting to be weighted by default.

From the maintainer's or reader's perspective, however, this is hell. To know if weighted fitting is used, you need to explicitly look up whether weights are supported in each estimator or scorer used. Switching estimators in a grid search can enable or disable weighted fitting or scoring implicitly. Upgrading scikit-learn or third-party libraries could enable weighted fitting or scoring implicitly.

All this tells me that the straw-man developer above was mistaken. All code must be written to be read, and to be read unambiguously. This developer had the intuition that implicit is better than explicit, but they were wrong.
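
As a sketch of that hazard under the (hypothetical) implicit default, consider a search over candidate estimators; SVC.fit accepts sample_weight while KNeighborsClassifier.fit does not:

    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])
    search = GridSearchCV(pipe, {"clf": [SVC(), KNeighborsClassifier()]})

    # Under implicit routing this would weight the SVC candidate but
    # silently fit the KNeighborsClassifier candidate unweighted:
    # search.fit(X, y, sample_weight=w)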

Member:

OK, fair enough. Pinging @rth, @glemaitre and @NicolasHug since this is very much related to scikit-learn/scikit-learn#18159.

Comment on lines +452 to +455
During a deprecation period, fit_params will be handled dually: Keys that are
requested will be passed through the new request mechanism, while keys that are
not known will be routed using legacy mechanisms. At completion of the
deprecation period, the legacy handling will cease.

Member:

Do we have to support the combination? It'd probably be cleaner if we only supported the legacy behavior when everything follows that protocol?

Although that may have the issue that third-party estimators won't all adopt the new API at the same time.

jnothman (Author):

How will the meta-estimator know whether or not to use legacy mode?

Member:

Here I'm assuming you're talking about the estimator__param format, and for that, at least in Pipeline, I thought of checking whether any parameter follows that format. If yes, then we fall back to the old behavior completely. If no parameter follows that format, then we do things the new way.

I think this would cause less confusion than handling them both at the same time.
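
A sketch of that all-or-nothing check (the helper name is invented here):

    def uses_legacy_routing(fit_params):
        # If any key uses the legacy "<step_name>__<param>" dunder format,
        # fall back to the old behavior wholesale; otherwise route
        # everything through the new request mechanism.
        return any("__" in key for key in fit_params)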

jnothman (Author):

> If yes, then we fall back to the old behavior completely. If no parameter follows that format, then we do things the new way.

Then that disables support for a prop explicitly requested as step_name__*?

I don't really see what we lose by supporting both during deprecation.
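
For comparison, a sketch of the dual handling proposed in the SLEP text quoted above (names illustrative):

    import warnings

    def split_fit_params(fit_params, requested):
        # Requested keys go through the new request mechanism; unknown keys
        # fall back to legacy routing, with a warning, until the
        # deprecation period ends.
        new_style, legacy = {}, {}
        for key, value in fit_params.items():
            if key in requested:
                new_style[key] = value
            else:
                warnings.warn(
                    f"{key!r} is routed via legacy fit_params handling; "
                    "this will become an error after the deprecation "
                    "period", FutureWarning)
                legacy[key] = value
        return new_style, legacy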

Member:

I find it confusing to read code that passes some parameters to a pipeline as step_name__* and others as param_name. Otherwise, I'm just worried about the code complexity (and potential bugs) of supporting them side by side.

jnothman (Author):

I'm not yet sure how error-prone or complex this code would be.

We could have users switch on the new routing mechanism with a config until v1.0?
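
A sketch of what such an opt-in could look like (the flag name is invented for illustration, not an existing scikit-learn option):

    import sklearn

    # Hypothetical global switch: legacy dunder routing stays the default,
    # and users opt in to the new request-based routing until it becomes
    # the default in v1.0.
    sklearn.set_config(enable_metadata_routing=True)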

jnothman (Author):

Do we merge this so that the full SLEP is rendered on rtd?

adrinjalali (Member):

Sure, it's easier to follow up on the rest of the issues if we have this up there anyway.

adrinjalali merged commit 4f17c3a into master on Aug 31, 2020.

rth (Member) commented Sep 2, 2020

Thanks for doing this SLEP and proposing associated implementations! It's certainly a challenging topic and I don't have good visibility into all the implications/constraints. My comments are mostly on usability with pipelines, as outlined in scikit-learn/scikit-learn#18159; in particular:

  • With simple pipelines one would have to do something like

        pipe = make_pipeline(
            StandardScaler().request_sample_weight(fit='sample_weight'),
            LogisticRegression().request_sample_weight(fit='sample_weight')
        )

    which can become quite verbose for complex pipelines, particularly assuming that more and more estimators are going to support sample weights. And since one could argue that sample weights should be passed to all estimators (scikit-learn#18159 (comment)), the unfortunate reason for this verbosity is mostly to preserve backward compatibility, as opposed to really needing fine-grained control over where sample weights are sent at fit time.
  • For the "5.4. Keyword arguments vs. a single argument" discussion, the single-argument form

        est.fit(X, y, prop={'groups': groups, 'weight': weight})

    goes kind of against the idea of a common scikit-learn API for estimators, as it would create two types of objects: pipeline-like/meta-estimator objects and the rest. Some would be fitted with est.fit(X, y, prop={'weight': weight}) and others with est.fit(X, y, sample_weight=weight), which goes against the idea that, say, StandardScaler() + Ridge() will behave the same as Ridge(normalize=True), or that CountVectorizer() + TfidfTransformer() = TfidfVectorizer(). Granted, these don't support sample_weight at the moment, but they very well could.

I would be interested in exploring a mixture of the option proposed in scikit-learn/scikit-learn#18159 (comment) and option 4. For sample weights that would be:

  • Add sample_weight to Pipeline, ColumnTransformer, FeatureUnion. Pass it to all estimators; fail otherwise.
  • Whether an estimator supports sample_weight is stored in an estimator tag (see also scikit-learn#13565, "[WIP] Add a supports_sample_weight tag").
  • One can override this tag, e.g. StandardScaler().set_estimator_tag(fit_sample_weight=None), to allow customization if needed.

So for sample weights that would be a little similar to option 4, but less verbose and with reasonable defaults. I'm not yet sure how this would generalize to other sample props; maybe keeping a props argument for the others is fine.
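
A sketch of how a meta-estimator could consult such a tag; supports_sample_weight is the proposed tag, not an existing one, and _get_tags is the private tag accessor of this era:

    def fit_step(est, X, y, sample_weight):
        # Pass sample_weight to any step whose (proposed) tag declares
        # support; fail loudly otherwise, unless the user overrode the tag
        # to opt the step out of weighting.
        if est._get_tags().get("supports_sample_weight", False):
            return est.fit(X, y, sample_weight=sample_weight)
        raise TypeError(
            f"{type(est).__name__} does not support sample_weight")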

> Why treat sample_weight specially? (scikit-learn/scikit-learn#18159 (comment))

At least for fit, estimators/pipelines that support sample weights need to verify invariance properties (the check_sample_weights_invariance(..., kind="zeros") common check), so that should constrain where they need to be passed in a pipeline. Also, they are probably the most common sample prop, and it would be nice to be able to use them without much added code.

adrinjalali (Member):

Another option for the verbosity issue is to have a utility function, something like propagate_metadata(pipeline, 'all', 'sample_weights').
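
A sketch of such a helper, assuming each step exposes a request_<prop>-style method as discussed above (signature and behaviour illustrative only):

    def propagate_metadata(pipeline, steps, prop):
        # Mark `prop` as requested on the given steps ("all" for every
        # step in the pipeline).
        for name, est in pipeline.steps:
            if steps == "all" or name in steps:
                getattr(est, f"request_{prop}")(fit=True)
        return pipeline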

jnothman (Author) commented Sep 2, 2020

> Some would be fitted with est.fit(X, y, prop={'weight': weight}) and others with est.fit(X, y, sample_weight=weight)

No, I'd see this as choosing to deprecate est.fit(X, y, sample_weight=weight) support, and always using prop.

> pipe = make_pipeline(
>     StandardScaler().request_sample_weight(fit='sample_weight'),
>     LogisticRegression().request_sample_weight(fit='sample_weight')
> )
>
> which can become quite verbose for complex pipelines, particularly assuming that more and more estimators are going to support sample weights.

I'd now see the more likely syntax as StandardScaler().request_sample_weight(fit=True) or StandardScaler(request_sample_weight=True).

I appreciate that this could be more usable still. Making sample_weight requested by default only works, as far as I'm concerned (see #43 (comment)), if every estimator (and perhaps scorer) requests sample_weight and then raises an error if it does not actually support it: the user needs to explicitly turn it off every time sample_weight is used and some component does not support it. That does not seem a reasonable approach for other props at this point.
