
[MRG] FIX Modify the API of Pipeline and FeatureUnion to match common scikit-learn estimators conventions #8350


Closed
wants to merge 16 commits into master from refactor_pipeline_api

Conversation

glemaitre
Member

@glemaitre glemaitre commented Feb 13, 2017

Reference Issue

#8157

What does this implement/fix? Explain your changes.

Modify the Pipeline API such that the steps given by the user at initialization are not modified.

Any other comments?

  • steps is replaced by steps_ at fit time, to follow the scikit-learn API.
  • named_steps keeps its previous behaviour but raises a FutureWarning; it returns the unfitted estimators from steps.
  • named_steps_ returns the fitted estimators from steps_.
  • steps becomes _steps internally, in order to expose steps as a property with a setter. (A short usage sketch follows below.)

Update: FeatureUnion has a nearly identical design issue.
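
For illustration, here is a minimal usage sketch of the attribute layout proposed above. This is hypothetical: the PR was never merged, so the trailing-underscore attributes shown in the comments do not exist in released scikit-learn, and the data is synthetic.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=0)
pipe = Pipeline([('scale', StandardScaler()), ('clf', LogisticRegression())])
pipe.fit(X, y)

# Under the proposal in this PR (NOT available in released scikit-learn):
# pipe.steps                # the estimators given at construction, left unmodified by fit
# pipe.steps_               # fitted clones, created at fit time
# pipe.named_steps_['clf']  # the fitted classifier, looked up from steps_
# pipe.named_steps['clf']   # the unfitted estimator from steps, now with a FutureWarning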

@jnothman
Member

jnothman commented Feb 13, 2017 via email

@glemaitre glemaitre force-pushed the refactor_pipeline_api branch from ea97581 to d96f0f5 Compare February 14, 2017 17:49
@glemaitre
Member Author

> Not looked at code yet, but from your description, I'd be worried that you are not maintaining access to fitted steps with steps, get_params, set_params and that it might not be possible to construct a pipeline from a list of fitted steps. All of these behaviours must remain during deprecation.

I made the changes.

We also must provide an official way to construct a pipeline from fitted components (e.g. a substring of a fitted pipeline).

@jnothman was proposing a get_steps().

Any suggestions @jnothman @ogrisel @raghavrv @GaelVaroquaux @agramfort ?

@glemaitre glemaitre changed the title from [WIP] FIX Modify API Pipeline to [WIP] FIX Modify Pipeline API on Feb 14, 2017
@codecov

codecov bot commented Feb 14, 2017

Codecov Report

Merging #8350 into master will increase coverage by 0.74%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #8350      +/-   ##
==========================================
+ Coverage   94.75%   95.49%   +0.74%     
==========================================
  Files         342      342              
  Lines       60801    61026     +225     
==========================================
+ Hits        57609    58274     +665     
+ Misses       3192     2752     -440
Impacted Files Coverage Δ
sklearn/feature_extraction/tests/test_text.py 98.72% <100%> (ø)
sklearn/pipeline.py 99.36% <100%> (+0.1%)
sklearn/tests/test_pipeline.py 99.83% <100%> (+0.21%)
sklearn/ensemble/gradient_boosting.py 95.79% <0%> (ø)
sklearn/cluster/tests/test_dbscan.py 100% <0%> (ø)
sklearn/ensemble/forest.py 98.16% <0%> (ø)
sklearn/utils/tests/test_class_weight.py 100% <0%> (ø)
sklearn/neighbors/tests/test_kde.py 98.85% <0%> (ø)
sklearn/ensemble/tests/test_weight_boosting.py 100% <0%> (ø)
sklearn/linear_model/tests/test_randomized_l1.py 100% <0%> (ø)
... and 55 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8694278...c30b8ff. Read the comment docs.

@GaelVaroquaux
Member

GaelVaroquaux commented Feb 15, 2017 via email

@jnothman
Member

> Do you need to construct it in the official scikit-learn way, i.e. from the constructor, or is it enough to instantiate the Pipeline and modify the .steps_ afterwards?

Other estimators constructed from existing models are made by modifying coef_ or whatever. But it's not clear to what extent this is recommended. For instance, when replacing an attribute in __dict__ with a property, we've rarely bothered implementing setters. When adding an attribute in version x, we've not checked that a user setting only the attributes available in version x-1 will suffice to avoid a NotFittedError. Basically, we've given the user no assurances that this is okay, let alone recommended it.

I often find myself wanting to slice up pipelines. It deserves both how-tos and tests. Hacking steps_ is probably enough, but we should write out what that looks like and see if we're happy with it.
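
For concreteness, under the API proposed in this PR, "hacking steps_" might look roughly like the sketch below. This is hypothetical: fitted_pipe is assumed to be an already-fitted pipeline, X the data to transform, and steps_ only exists under this proposal, so none of it runs against released scikit-learn.

# Reuse the fitted preprocessing of an existing pipeline as a transform-only
# sub-pipeline, dropping the final estimator.
sub = Pipeline(fitted_pipe.steps[:-1])   # constructed from the unfitted step specs
sub.steps_ = fitted_pipe.steps_[:-1]     # graft the fitted clones in directly
Xt = sub.transform(X)                    # would use steps_, so no refitting needed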

> If it's not enough, I support the addition of a "Frozen" class that makes an estimator robust to cloning

This either requires clone to be aware of the specific frozen case, or for clone to more generally support polymorphism or singledispatch. I'm still in favour of the latter -- and given this change have an ancient implementation of a wrapper for a fitted estimator -- though I know you feel it opens floodgates to misuse and error.

as well as a "freeze" function that would apply this to a list of estimator.

as in copy/wrap the estimators? or mutate them to mark them frozen?

I do think that the ability to freeze things is valuable all over the place, notably for semi-supervision (e.g. decomposition transformation trained on lots of unlabelled data incorporated in feature processing pipeline for supervised task).

@GaelVaroquaux
Member

GaelVaroquaux commented Feb 15, 2017 via email

@jnothman
Member

jnothman commented Feb 15, 2017

Thinking about this instead of sleeping. Cloning is only part of the problem with freezing. The other part is overwriting fit so it does nothing. This can only be generically achieved with a wrapper.

  • clone adds if isinstance(obj, FrozenModel): return obj (see the sketch after this list).
  • FrozenModel has fit(self, *args, **kwargs): return self
  • FrozenModel delegates __getattr__ and __setattr__ to its wrapped estimator.
    • hence its estimator cannot be accessible at FrozenModel().estimator but at some more munged name.
  • freeze(obj, copy=True)
    • if copy, performs deepcopy
    • if an estimator, returns FrozenModel(obj)
    • if a list or object array, returns list(map(FrozenModel, obj))
  • thus Pipeline(freeze(my_pipeline.steps_[:-1])).fit(X, y).transform(X) sort of does what Pipeline(my_pipeline.steps[:-1]).transform(X) used to.
    • The need to call fit outside of a CV context or similar where it's done for you is a bit annoying. This still may not remain the nicest interface for Pipeline substring extraction.
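
Spelled out as code, that design might look roughly like the sketch below. This is a sketch only, using the names from the list above; FrozenModel and freeze are not scikit-learn API, object-array handling is omitted, and in practice fit_transform / fit_predict would need the same no-op treatment as fit.

from copy import deepcopy

# clone() would additionally need the short-circuit from the first bullet:
#     if isinstance(obj, FrozenModel): return obj

class FrozenModel:
    """Wrapper that makes a fitted estimator ignore further calls to fit."""

    def __init__(self, estimator):
        # store under a munged name so that plain attribute access can be delegated
        object.__setattr__(self, '_frozen_estimator', estimator)

    def fit(self, *args, **kwargs):
        # freezing means fit is a no-op
        return self

    def __getattr__(self, name):
        # only called when normal lookup fails, i.e. for everything the wrapped
        # estimator provides (predict, transform, get_params, ...)
        return getattr(object.__getattribute__(self, '_frozen_estimator'), name)

    def __setattr__(self, name, value):
        setattr(object.__getattribute__(self, '_frozen_estimator'), name, value)


def freeze(obj, copy=True):
    """Freeze an estimator, or each estimator in a list of estimators."""
    if copy:
        obj = deepcopy(obj)
    if isinstance(obj, (list, tuple)):
        return [FrozenModel(est) for est in obj]
    return FrozenModel(obj)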

@jnothman
Member

(I've made some edits where I'd failed to fix up a previous version of that comment)

@GaelVaroquaux
Member

GaelVaroquaux commented Feb 15, 2017 via email

@jnothman
Member

Not tonight :p

@glemaitre glemaitre force-pushed the refactor_pipeline_api branch from d8fb218 to cfd41cf Compare February 16, 2017 01:16
@jnothman
Member

jnothman commented Feb 16, 2017 via email

@jnothman
Member

I vomited my panicked thoughts about freezing into issue #8370.

Guillaume Lemaitre and others added 7 commits February 17, 2017 14:17
* create 'steps_' at fit time
* create 'named_steps_' to return step from 'steps_'
* add future warning for 'named_steps'
* modify all examples and tests accordingly
@glemaitre glemaitre force-pushed the refactor_pipeline_api branch from cfd41cf to c8b10e1 Compare February 17, 2017 13:18
@glemaitre
Member Author

@jnothman @GaelVaroquaux Shall I change [WIP] to [MRG] to get some reviews?
Any potential merging of this PR should wait for the introduction of the 'freezing business', right?

@jnothman
Member

jnothman commented Feb 19, 2017 via email

jnothman added a commit to jnothman/scikit-learn that referenced this pull request Feb 22, 2017
Conceptually Fixes scikit-learn#8414 and related issues. Alternative to scikit-learn#2568
without __getitem__ and mixed semantics.

Designed to assist in model inspection and particularly to replicate the
composite transformer represented by steps of the pipeline with the
exception of the last. I.e. pipe.get_subsequence(0, -1) is a common
idiom. I feel like this becomes more necessary when considering more
API-consistent clone behaviour as per scikit-learn#8350 as Pipeline(pipe.steps[:-1])
is no longer possible.
@glemaitre glemaitre changed the title from [WIP] FIX Modify Pipeline API to [MRG] FIX Modify Pipeline API on Feb 22, 2017
@glemaitre
Member Author

What is the desired behavior of cloning a compose.Pipeline when the final step sets warm_start=True?

You don't want to clone the final estimator; you want to keep using the one passed in.

@NicolasHug
Member

I have a few questions:

  1. It seems that Joel's concern about being able to build pipelines out of other pipelines is addressed by the new slicing mechanism. Is that correct or am I missing something?
  2. Why is warm_start + cloning a problem with pipelines? Cloning an estimator resets its state (regardless of warm_start), so why should it be any different for a pipeline?
  3. What is stalling the PR now? Is there anything I can help with to move this forward?

cc @thomasjpfan

@jnothman
Member

jnothman commented Sep 16, 2019 via email

@NicolasHug
Member

NicolasHug commented Sep 17, 2019

Thanks @jnothman .

We discussed it with @thomasjpfan and we thought about the following solution:

  1. instead of always cloning the estimators in fit, only clone them the first time fit is called (with e.g. self.steps_ = getattr(self, 'steps_', [clone(est) for est in self.steps]))
    This way you can still do stuff like
    pipe = make_pipeline(...)
    pipe.fit(...)
    pipe.set_params(gbdt__max_iter=100, gbdt__warm_start=True)
    pipe.fit(...)  # warm start works ok
  2. in __init__, set self.steps = [clone(est) for est in steps]. This is not breaking our API convention, since self.steps still behaves exactly like the steps parameter.
    We think this is needed because it prevents unexpected behaviors/bugs. If we don't clone, changing the pipeline with set_params will change an estimator passed as a parameter, and changing a parameter of that estimator will also change the pipeline (we want to avoid that). A combined sketch of both points follows this list.
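
A minimal sketch combining the two points above. This is illustrative only: the class name and the simplified fit logic are placeholders rather than the real Pipeline, and note that sklearn.base.clone's constructor-parameter check would currently reject the cloning done in __init__, so the real change would need to accommodate that (or go through a property, as discussed below).

from sklearn.base import BaseEstimator, clone

class ClonedStepsPipeline(BaseEstimator):
    """Illustration of points (1) and (2) above; not the real Pipeline."""

    def __init__(self, steps):
        # (2) clone in __init__ so the pipeline no longer shares state with
        #     the estimator objects the user passed in
        self.steps = [(name, clone(est)) for name, est in steps]

    def fit(self, X, y=None):
        # (1) clone into steps_ only on the first call to fit, so a second
        #     fit can reuse warm-started estimators
        self.steps_ = getattr(self, 'steps_',
                              [(name, clone(est)) for name, est in self.steps])
        Xt = X
        for name, transformer in self.steps_[:-1]:
            Xt = transformer.fit_transform(Xt, y)
        self.steps_[-1][1].fit(Xt, y)
        return self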

@jnothman
Member

jnothman commented Sep 17, 2019 via email

@NicolasHug
Member

Yes, set_params would keep both steps and steps_ synchronized.

@thomasjpfan
Member

> And I can't say I like the idea of not fitting from scratch in Pipeline by default, but I can't quite put my finger on a specific problem with it.

It is still fitting from scratch as long as the items in steps follow our API. The nice thing is that if one of the steps has warm_start=True, this also works. Pipeline is delegating what it means to fit to steps_.

> For behaviour (2), rather than cloning in init, you would need to clone every time a step parameter (or steps) was set, if I'm not mistaken.

Agreed. It can be a @property with a setter, where it is initially set at __init__.
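
Roughly, that could look like the following sketch. PipelineSketch is a placeholder name; the real Pipeline would additionally need validation, full get_params/set_params support, and so on.

from sklearn.base import clone

class PipelineSketch:
    """Sketch of a steps property whose setter clones; not the real Pipeline."""

    def __init__(self, steps):
        self.steps = steps              # goes through the setter below

    @property
    def steps(self):
        return self._steps

    @steps.setter
    def steps(self, steps):
        # clone every time steps is assigned: from __init__, from set_params
        # (which uses setattr), or from direct attribute assignment
        self._steps = [(name, clone(est)) for name, est in steps]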

@NicolasHug
Member

> And I can't say I like the idea of not fitting from scratch in Pipeline by default, but I can't quite put my finger on a specific problem with it.

I feel like we always want to fit a pipeline from scratch? Not fitting from scratch is an extreme use case that we don't want to support (at least not explicitly?). It means that your final estimator has already been fitted on whatever the pre-processors of your pipeline can output... How would you do that without the pipeline in the first place?

> For behaviour (2), rather than cloning in init, you would need to clone every time a step parameter (or steps) was set, if I'm not mistaken.

You mean someone doing e.g. pipe.steps[2] = some_new_est? Shouldn't this be totally forbidden? Users should be using set_params.

@thomasjpfan
Member

> Shouldn't this be totally forbidden? users should be using set_params

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

pipe = Pipeline(
    [('pre', StandardScaler()),
     ('est', LogisticRegression())]
)
pipe.set_params(est=MLPClassifier())

@NicolasHug
Member

yeah so we can just clone in set_params too
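
For concreteness, a rough sketch of what "clone in set_params too" could mean. This is a hypothetical mixin, not scikit-learn API, and the hasattr check is a crude stand-in for a proper estimator test.

from sklearn.base import clone

class CloneOnSetParamsMixin:
    """Clone any estimator handed to set_params before storing it."""

    def set_params(self, **params):
        # clone estimator-valued params (e.g. set_params(est=MLPClassifier()));
        # scalar params such as est__C pass through unchanged
        params = {key: clone(value) if hasattr(value, 'fit') else value
                  for key, value in params.items()}
        return super().set_params(**params)

# intended to be combined with an estimator class, e.g.:
# class MyPipeline(CloneOnSetParamsMixin, Pipeline): pass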

@NicolasHug
Member

NicolasHug commented Sep 25, 2019

Question: how bad is the problem we are trying to solve here? Do we really care that pipelines don't respect our API conventions? Does it prevent us from doing more cool things?

I can't say I have a compelling reason to go through all this, and it seems that the original motivation #8157 was addressed already.

But if we do I'd like to move forward and try to implement #8350 (comment).

@jnothman
Member

I agree the current behaviour is mostly innocuous, and I have never thought the change was a priority, at least not without checking that all the use cases are going to be handled.

We have had bugs raised where the fundamental issue was the non-cloning in Pipeline... but I can't find them. And they're more "surprising" than "broken".

The other reason to do this is to set an example of how to build such things in a compliant way!

@amueller
Member

So... should we remove it from the roadmap?

@amueller
Member

Actually, this issue is quite weird/annoying and is caused by this: #10063

We could also try and fix it by cloning the params in GridSearchCV but that would be somewhat strange as well...

@amueller amueller removed this from the 0.22 milestone Oct 17, 2019
@amueller
Member

Removing the milestone; I think this needs a SLEP or a close.

Base automatically changed from master to main January 22, 2021 10:49
@ogrisel ogrisel changed the title from [MRG] FIX Modify Pipeline API to [MRG] FIX Modify the API of Pipeline and FeatureUnion to match common scikit-learn estimators conventions on Oct 25, 2022
@adrinjalali
Member

We can certainly have another look at this, but I think we can close this one and a fresh PR at some point would make more sense.

@glemaitre
Member Author

Yep, it will be scikit-learn 2.0 anyway.
