
[MRG] FIX Modify the API of Pipeline and FeatureUnion to match common scikit-learn estimators conventions #8350


Closed
wants to merge 16 commits into master from refactor_pipeline_api

Conversation

glemaitre
Member

@glemaitre glemaitre commented Feb 13, 2017

Reference Issue

#8157

What does this implement/fix? Explain your changes.

Modify the Pipeline API such that the steps given by the user at initialization are not modified.

Any other comments?

  • steps is replaced by steps_ at fit time, to follow the scikit-learn API.
  • named_steps keeps its previous behaviour but raises a FutureWarning; it returns the unfitted estimators from steps.
  • named_steps_ returns the fitted estimators from steps_.
  • steps becomes _steps internally, in order to expose steps as a property with a setter. (A short usage sketch follows below.)

Update: FeatureUnion has a nearly identical design issue.
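
For illustration, here is a minimal usage sketch of the attribute layout proposed above. This is hypothetical: the PR was never merged, so the trailing-underscore attributes shown in the comments do not exist in released scikit-learn, and the data is synthetic.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=0)
pipe = Pipeline([('scale', StandardScaler()), ('clf', LogisticRegression())])
pipe.fit(X, y)

# Under the proposal in this PR (NOT available in released scikit-learn):
# pipe.steps                # the estimators given at construction, left unmodified by fit
# pipe.steps_               # fitted clones, created at fit time
# pipe.named_steps_['clf']  # the fitted classifier, looked up from steps_
# pipe.named_steps['clf']   # the unfitted estimator from steps, now with a FutureWarning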

@jnothman
Member

jnothman commented Feb 13, 2017 via email

@glemaitre glemaitre force-pushed the refactor_pipeline_api branch from ea97581 to d96f0f5 Compare February 14, 2017 17:49
@glemaitre
Member Author

> Not looked at code yet, but from your description, I'd be worried that you are not maintaining access to fitted steps with steps, get_params, set_params and that it might not be possible to construct a pipeline from a list of fitted steps. All of these behaviours must remain during deprecation.

I made the changes.

We also must provide an official way to construct a pipeline from fitted components (e.g. a substring of a fitted pipeline).

@jnothman was proposing a get_steps().

Any suggestions @jnothman @ogrisel @raghavrv @GaelVaroquaux @agramfort ?

@glemaitre glemaitre changed the title from [WIP] FIX Modify API Pipeline to [WIP] FIX Modify Pipeline API on Feb 14, 2017
@codecov

codecov bot commented Feb 14, 2017

Codecov Report

Merging #8350 into master will increase coverage by 0.74%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #8350      +/-   ##
==========================================
+ Coverage   94.75%   95.49%   +0.74%     
==========================================
  Files         342      342              
  Lines       60801    61026     +225     
==========================================
+ Hits        57609    58274     +665     
+ Misses       3192     2752     -440
Impacted Files Coverage Δ
sklearn/feature_extraction/tests/test_text.py 98.72% <100%> (ø)
sklearn/pipeline.py 99.36% <100%> (+0.1%)
sklearn/tests/test_pipeline.py 99.83% <100%> (+0.21%)
sklearn/ensemble/gradient_boosting.py 95.79% <0%> (ø)
sklearn/cluster/tests/test_dbscan.py 100% <0%> (ø)
sklearn/ensemble/forest.py 98.16% <0%> (ø)
sklearn/utils/tests/test_class_weight.py 100% <0%> (ø)
sklearn/neighbors/tests/test_kde.py 98.85% <0%> (ø)
sklearn/ensemble/tests/test_weight_boosting.py 100% <0%> (ø)
sklearn/linear_model/tests/test_randomized_l1.py 100% <0%> (ø)
... and 55 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8694278...c30b8ff. Read the comment docs.

@GaelVaroquaux
Member

GaelVaroquaux commented Feb 15, 2017 via email

@jnothman
Member

> Do you need to construct it in the official scikit-learn way, i.e. from the constructor, or is it enough to instantiate the Pipeline and modify the .steps_ afterwards?

Other estimators constructed from existing models are made by modifying coef_ or whatever. But it's not clear to what extent this is recommended. For instance, when replacing an attribute in __dict__ with a property, we've rarely bothered implementing setters. When adding an attribute in version x, we've not checked that a user setting only the attributes available in version x-1 will suffice to avoid a NotFittedError. Basically, we've given the user no assurances that this is okay, let alone recommended it.

I often find myself wanting to slice up pipelines. It deserves both how-tos and tests. Hacking steps_ is probably enough, but we should write out what that looks like and see if we're happy with it.
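
For concreteness, under the API proposed in this PR, "hacking steps_" might look roughly like the sketch below. This is hypothetical: fitted_pipe is assumed to be an already-fitted pipeline, X the data to transform, and steps_ only exists under this proposal, so none of it runs against released scikit-learn.

# Reuse the fitted preprocessing of an existing pipeline as a transform-only
# sub-pipeline, dropping the final estimator.
sub = Pipeline(fitted_pipe.steps[:-1])   # constructed from the unfitted step specs
sub.steps_ = fitted_pipe.steps_[:-1]     # graft the fitted clones in directly
Xt = sub.transform(X)                    # would use steps_, so no refitting needed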

> If it's not enough, I support the addition of a "Frozen" class that makes an estimator robust to cloning

This either requires clone to be aware of the specific frozen case, or for clone to more generally support polymorphism or singledispatch. I'm still in favour of the latter -- and given this change have an ancient implementation of a wrapper for a fitted estimator -- though I know you feel it opens floodgates to misuse and error.

as well as a "freeze" function that would apply this to a list of estimator.

as in copy/wrap the estimators? or mutate them to mark them frozen?

I do think that the ability to freeze things is valuable all over the place, notably for semi-supervision (e.g. decomposition transformation trained on lots of unlabelled data incorporated in feature processing pipeline for supervised task).

@GaelVaroquaux
Member

GaelVaroquaux commented Feb 15, 2017 via email

@jnothman
Member

jnothman commented Feb 15, 2017

Thinking about this instead of sleeping. Cloning is only part of the problem with freezing. The other part is overwriting fit so it does nothing. This can only be generically achieved with a wrapper.

  • clone adds if isinstance(obj, FrozenModel): return obj (see the sketch after this list).
  • FrozenModel has fit(self, *args, **kwargs): return self
  • FrozenModel delegates __getattr__ and __setattr__ to its wrapped estimator.
    • hence its estimator cannot be accessible at FrozenModel().estimator but at some more munged name.
  • freeze(obj, copy=True)
    • if copy, performs deepcopy
    • if an estimator, returns FrozenModel(obj)
    • if a list or object array, returns list(map(FrozenModel, obj))
  • thus Pipeline(freeze(my_pipeline.steps_[:-1])).fit(X, y).transform(X) sort of does what Pipeline(my_pipeline.steps[:-1]).transform(X) used to.
    • The need to call fit outside of a CV context or similar where it's done for you is a bit annoying. This still may not remain the nicest interface for Pipeline substring extraction.
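
Spelled out as code, that design might look roughly like the sketch below. This is a sketch only, using the names from the list above; FrozenModel and freeze are not scikit-learn API, object-array handling is omitted, and in practice fit_transform / fit_predict would need the same no-op treatment as fit.

from copy import deepcopy

# clone() would additionally need the short-circuit from the first bullet:
#     if isinstance(obj, FrozenModel): return obj

class FrozenModel:
    """Wrapper that makes a fitted estimator ignore further calls to fit."""

    def __init__(self, estimator):
        # store under a munged name so that plain attribute access can be delegated
        object.__setattr__(self, '_frozen_estimator', estimator)

    def fit(self, *args, **kwargs):
        # freezing means fit is a no-op
        return self

    def __getattr__(self, name):
        # only called when normal lookup fails, i.e. for everything the wrapped
        # estimator provides (predict, transform, get_params, ...)
        return getattr(object.__getattribute__(self, '_frozen_estimator'), name)

    def __setattr__(self, name, value):
        setattr(object.__getattribute__(self, '_frozen_estimator'), name, value)


def freeze(obj, copy=True):
    """Freeze an estimator, or each estimator in a list of estimators."""
    if copy:
        obj = deepcopy(obj)
    if isinstance(obj, (list, tuple)):
        return [FrozenModel(est) for est in obj]
    return FrozenModel(obj)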

@jnothman
Member

(I've made some edits where I'd failed to fix up a previous version of that comment)

@GaelVaroquaux
Member

GaelVaroquaux commented Feb 15, 2017 via email

@jnothman
Member

Not tonight :p

@glemaitre glemaitre force-pushed the refactor_pipeline_api branch from d8fb218 to cfd41cf Compare February 16, 2017 01:16
@jnothman
Member

jnothman commented Feb 16, 2017 via email

@jnothman
Member

I vomited my panicked thoughts about freezing into issue #8370.

Guillaume Lemaitre and others added 7 commits February 17, 2017 14:17
* create 'steps_' at fit time
* create 'named_steps_' to return step from 'steps_'
* add future warning for 'named_steps'
* modify all examples and tests accordingly
@glemaitre glemaitre force-pushed the refactor_pipeline_api branch from cfd41cf to c8b10e1 Compare February 17, 2017 13:18
@glemaitre
Member Author

@jnothman @GaelVaroquaux Shall I change [WIP] to [MRG] to get some reviews?
Any potential merging of this PR should wait for the introduction of the 'freezing business', right?

@jnothman
Member

jnothman commented Feb 19, 2017 via email

jnothman added a commit to jnothman/scikit-learn that referenced this pull request Feb 22, 2017
Conceptually Fixes scikit-learn#8414 and related issues. Alternative to scikit-learn#2568
without __getitem__ and mixed semantics.

Designed to assist in model inspection and particularly to replicate the
composite transformer represented by steps of the pipeline with the
exception of the last. I.e. pipe.get_subsequence(0, -1) is a common
idiom. I feel like this becomes more necessary when considering more
API-consistent clone behaviour as per scikit-learn#8350 as Pipeline(pipe.steps[:-1])
is no longer possible.
@glemaitre glemaitre changed the title from [WIP] FIX Modify Pipeline API to [MRG] FIX Modify Pipeline API on Feb 22, 2017
@glemaitre
Member Author

What is the desired behavior of cloning a compose.Pipeline when the final step sets warm_start=True?

You don't want to clone the final estimator; you want to keep using the one passed in.

@NicolasHug
Member

I have a few questions:

  1. It seems that Joel's concern about being able to build pipelines out of other pipelines is addressed by the new slicing mechanism. Is that correct or am I missing something?
  2. Why is warm_start + cloning a problem with pipelines? Cloning an estimator resets its state (regardless of warm_start), so why should it be any different for a pipeline?
  3. What is stalling the PR now? Is there anything I can help with to move this forward?

cc @thomasjpfan

@jnothman
Member

jnothman commented Sep 16, 2019 via email

@NicolasHug
Member

NicolasHug commented Sep 17, 2019

Thanks @jnothman .

We discussed it with @thomasjpfan and we thought about the following solution:

  1. instead of always cloning the estimators in fit, only clone them the first time fit is called (with e.g. self.steps_ = getattr(self, 'steps_', [clone(est) for est in self.steps]))
    This way you can still do stuff like
    pipe = make_pipeline(...)
    pipe.fit(...)
    pipe.set_params(gbdt__max_iter=100, gbdt__warm_start=True)
    pipe.fit(...)  # warm start works ok
  2. in __init__, set self.steps = [clone(est) for est in steps]. This is not breaking our API convention, since self.steps still behaves exactly like the steps parameter.
    We think this is needed because it prevents unexpected behaviors/bugs. If we don't clone, changing the pipeline with set_params will change an estimator passed as a parameter, and changing a parameter of that estimator will also change the pipeline (we want to avoid that). A combined sketch of both points follows this list.
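
A minimal sketch combining the two points above. This is illustrative only: the class name and the simplified fit logic are placeholders rather than the real Pipeline, and note that sklearn.base.clone's constructor-parameter check would currently reject the cloning done in __init__, so the real change would need to accommodate that (or go through a property, as discussed below).

from sklearn.base import BaseEstimator, clone

class ClonedStepsPipeline(BaseEstimator):
    """Illustration of points (1) and (2) above; not the real Pipeline."""

    def __init__(self, steps):
        # (2) clone in __init__ so the pipeline no longer shares state with
        #     the estimator objects the user passed in
        self.steps = [(name, clone(est)) for name, est in steps]

    def fit(self, X, y=None):
        # (1) clone into steps_ only on the first call to fit, so a second
        #     fit can reuse warm-started estimators
        self.steps_ = getattr(self, 'steps_',
                              [(name, clone(est)) for name, est in self.steps])
        Xt = X
        for name, transformer in self.steps_[:-1]:
            Xt = transformer.fit_transform(Xt, y)
        self.steps_[-1][1].fit(Xt, y)
        return self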

@jnothman
Member

jnothman commented Sep 17, 2019 via email

@NicolasHug
Member

Yes, set_params would keep both steps and steps_ synchronized.

@thomasjpfan
Member

> And I can't say I like the idea of not fitting from scratch in Pipeline by default, but I can't quite put my finger on a specific problem with it.

It is still fitting from scratch as long as the items in steps follow our API. The nice thing is that if one of the steps has warm_start=True, this also works. Pipeline is delegating what it means to fit to steps_.

> For behaviour (2), rather than cloning in init, you would need to clone every time a step parameter (or steps) was set, if I'm not mistaken.

Agreed. It can be a @property with a setter, where it is initially set at __init__.
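
Roughly, that could look like the following sketch. PipelineSketch is a placeholder name; the real Pipeline would additionally need validation, full get_params/set_params support, and so on.

from sklearn.base import clone

class PipelineSketch:
    """Sketch of a steps property whose setter clones; not the real Pipeline."""

    def __init__(self, steps):
        self.steps = steps              # goes through the setter below

    @property
    def steps(self):
        return self._steps

    @steps.setter
    def steps(self, steps):
        # clone every time steps is assigned: from __init__, from set_params
        # (which uses setattr), or from direct attribute assignment
        self._steps = [(name, clone(est)) for name, est in steps]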

@NicolasHug
Member

> And I can't say I like the idea of not fitting from scratch in Pipeline by default, but I can't quite put my finger on a specific problem with it.

I feel like we always want to fit a pipeline from scratch? Not fitting from scratch is an extreme use case that we don't want to support (at least not explicitly?). It means that your final estimator has already been fitted on whatever the pre-processors of your pipeline can output... How would you do that without the pipeline in the first place?

> For behaviour (2), rather than cloning in init, you would need to clone every time a step parameter (or steps) was set, if I'm not mistaken.

You mean someone doing e.g. pipe.steps[2] = some_new_est? Shouldn't this be totally forbidden? Users should be using set_params.

@thomasjpfan
Member

> Shouldn't this be totally forbidden? users should be using set_params

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

pipe = Pipeline(
    [('pre', StandardScaler()),
     ('est', LogisticRegression())]
)
pipe.set_params(est=MLPClassifier())

@NicolasHug
Member

yeah so we can just clone in set_params too
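
For concreteness, a rough sketch of what "clone in set_params too" could mean. This is a hypothetical mixin, not scikit-learn API, and the hasattr check is a crude stand-in for a proper estimator test.

from sklearn.base import clone

class CloneOnSetParamsMixin:
    """Clone any estimator handed to set_params before storing it."""

    def set_params(self, **params):
        # clone estimator-valued params (e.g. set_params(est=MLPClassifier()));
        # scalar params such as est__C pass through unchanged
        params = {key: clone(value) if hasattr(value, 'fit') else value
                  for key, value in params.items()}
        return super().set_params(**params)

# intended to be combined with an estimator class, e.g.:
# class MyPipeline(CloneOnSetParamsMixin, Pipeline): pass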

@NicolasHug
Member

NicolasHug commented Sep 25, 2019

Question: how bad is the problem we are trying to solve here? Do we really care that pipelines don't respect our API conventions? Does it prevent us from doing more cool things?

I can't say I have a compelling reason to go through all this, and it seems that the original motivation #8157 was addressed already.

But if we do I'd like to move forward and try to implement #8350 (comment).

@jnothman
Member

I agree the current behaviour is mostly innocuous, and I have never thought the change was a priority, at least not without checking that all the use cases are going to be handled.

We have had bugs raised where the fundamental issue was the non-cloning in Pipeline... but I can't find them. And they're more "surprising" than "broken".

The other reason to do this is to set an example of how to build such things in a compliant way!

@amueller
Member

So... should we remove it from the roadmap?

@amueller
Member

Actually, this issue is quite weird/annoying and is caused by this: #10063

We could also try and fix it by cloning the params in GridSearchCV but that would be somewhat strange as well...

@amueller amueller removed this from the 0.22 milestone Oct 17, 2019
@amueller
Member

Removing the milestone; I think this needs a SLEP or a close.

Base automatically changed from master to main January 22, 2021 10:49
@ogrisel ogrisel changed the title from [MRG] FIX Modify Pipeline API to [MRG] FIX Modify the API of Pipeline and FeatureUnion to match common scikit-learn estimators conventions on Oct 25, 2022
@adrinjalali
Member

We can certainly have another look at this, but I think we can close this one and a fresh PR at some point would make more sense.

@glemaitre
Member Author

Yep, it will be scikit-learn 2.0 anyway.
