[MRG] FIX Modify the API of Pipeline and FeatureUnion to match common scikit-learn estimators conventions #8350
Conversation
Not looked at code yet, but from your description I'd be worried that you
are not maintaining access to fitted steps with `steps`, `get_params`,
`set_params` and that it might not be possible to construct a pipeline from
a list of fitted steps. All of these behaviours must remain during
deprecation.
We also must provide an official way to construct a pipeline from fitted
components (e.g. a substring of a fitted pipeline).
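For reference, a minimal sketch of what constructing a "substring" of a fitted pipeline looks like under the current (pre-PR) behaviour, where `steps` holds the fitted estimators after `fit`; the data and estimators below are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression())]).fit(X, y)

# Today `pipe.steps` contains the fitted estimators, so a "substring" of
# the fitted pipeline can be rebuilt directly from a slice of it.
preprocessing = Pipeline(pipe.steps[:-1])
Xt = preprocessing.transform(X)
```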
|
(force-pushed from ea97581 to d96f0f5)
I made the changes.
@jnothman was proposing a [...]. Any suggestions @jnothman @ogrisel @raghavrv @GaelVaroquaux @agramfort? |
Codecov Report

@@            Coverage Diff             @@
##           master    #8350      +/-   ##
==========================================
+ Coverage   94.75%   95.49%   +0.74%
==========================================
  Files         342      342
  Lines       60801    61026     +225
==========================================
+ Hits        57609    58274     +665
+ Misses       3192     2752     -440

Continue to review the full report at Codecov.
|
We also must provide an official way to construct a pipeline from fitted components (e.g. a substring of a fitted pipeline).
Do you need to construct it in the official scikit-learn way, i.e. from
the constructor, or is it enough to instantiate the Pipeline and modify
the .steps_ afterwards?
If it's not enough, I support the addition of a "Frozen" class that makes
an estimator robust to cloning, as well as a "freeze" function that would
apply this to a list of estimators.
I am aware that you have been asking for something in this direction for
a while. Now seems like a good moment to think about it.
|
Other estimators being constructed with existing models are done by modifying [...] I often find myself wanting to slice up pipelines. It deserves both how-tos and tests. Hacking [...]
This either requires clone to be aware of the specific frozen case, or for clone to more generally support polymorphism or singledispatch.
as in copy/wrap the estimators? or mutate them to mark them frozen? I do think that the ability to freeze things is valuable all over the place, notably for semi-supervision (e.g. a decomposition transformation trained on lots of unlabelled data, incorporated in a feature processing pipeline for a supervised task). |
Basically, we've given the user no assurances that this is okay, let alone recommended it.
Indeed, so far yes.
This either requires clone to be aware of the specific frozen case, or for
clone to more generally support polymorphism or singledispatch.
Well, we can have very simple solutions, like adding a "frozen" attribute
in BaseEstimator and checking whether frozen is True or False in clone.
That would be a very simple solution, giving much less cognitive and
design overhead than polymorphism or singledispatch.
as in copy/wrap the estimators? or mutate them to mark them frozen?
This should be discussed, I believe.
I do think that the ability to freeze things is valuable all over the place, notably for semi-supervision
Indeed, or transfer learning.
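A minimal sketch of the "frozen attribute checked in clone" idea above, assuming a user-set `frozen` flag and a hypothetical `clone_respecting_frozen` helper (neither exists in scikit-learn; this only illustrates the proposal):

```python
from copy import deepcopy
from sklearn.base import clone

def clone_respecting_frozen(estimator, safe=True):
    # Hypothetical helper: an estimator carrying a truthy `frozen` flag is
    # returned as a deep copy, preserving its fitted state, instead of
    # being rebuilt from its constructor parameters as `clone` does.
    if getattr(estimator, "frozen", False):
        return deepcopy(estimator)
    return clone(estimator, safe=safe)
```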
|
Thinking about this instead of sleeping. Cloning is only part of the problem with freezing. The other part is overwriting
|
(I've made some edits where I'd failed to fix up a previous version of that comment) |
Sounds right to me. Do you want to start an RFC/API issue?
|
Not tonight :p |
(force-pushed from d8fb218 to cfd41cf)
This freezing business gets tricky if we want `isinstance(FrozenModel(obj),
type(obj)) == True`. That, I think, we can only resolve with dynamic
creation of classes. :(
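A rough sketch of what that dynamic class creation could look like (the `freeze` name and behaviour are illustrative only, not an agreed design):

```python
from sklearn.linear_model import LogisticRegression

def freeze(estimator):
    # Illustrative only: build a subclass of the estimator's own class on
    # the fly, so isinstance(frozen, type(original)) still holds, and turn
    # fit into a no-op so refitting cannot overwrite the learned state.
    cls = type(estimator)
    frozen_cls = type("Frozen" + cls.__name__, (cls,), {
        "fit": lambda self, X=None, y=None, **fit_params: self,
    })
    estimator.__class__ = frozen_cls
    return estimator

clf = freeze(LogisticRegression().fit([[0.0], [1.0]], [0, 1]))
assert isinstance(clf, LogisticRegression)
clf.fit(None, None)  # no-op: the learned coefficients are untouched
```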
|
I vomited my panicked thoughts about freezing into issue #8370. |
* create 'steps_' at fit time
* create 'named_steps_' to return steps from 'steps_'
* add future warning for 'named_steps'
* modify all examples and tests accordingly
(force-pushed from cfd41cf to c8b10e1)
@jnothman @GaelVaroquaux Shall I change [WIP] to [MRG] to get some reviews? Any potential merging of this PR should wait for the introduction of the 'freezing business', right? |
I don't think it's contingent on the freezing business in particular, but
we've got to have some documentation describing how to extract parts of
pipelines in the new regime.
|
Conceptually Fixes scikit-learn#8414 and related issues. Alternative to scikit-learn#2568 without __getitem__ and mixed semantics. Designed to assist in model inspection and particularly to replicate the composite transformer represented by steps of the pipeline with the exception of the last. I.e. pipe.get_subsequence(0, -1) is a common idiom. I feel like this becomes more necessary when considering more API-consistent clone behaviour as per scikit-learn#8350 as Pipeline(pipe.steps[:-1]) is no longer possible.
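A hypothetical usage example of the get_subsequence API described above (the method comes from that referenced proposal, not from released scikit-learn), assuming a fitted `pipe` as in the earlier sketch:

```python
# Hypothetical: extract every step except the final estimator as an
# already-fitted composite transformer.
preprocessing = pipe.get_subsequence(0, -1)
Xt = preprocessing.transform(X)
```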
You don't want to clone the final estimator and use the one passed in estimator. |
I have a few questions:
cc @thomasjpfan |
1. Yes
2. The issue with warm_start isn't about cloning the Pipeline, it's about
set_params not working to exploit warm_start in an estimator in such a
pipeline. The parameter will be set on the estimator parameter, not on the
fitted estimator... And running fit again will re-clone that estimator,
losing the warmth.
3. I see 2 as the problem. Also I'm sure there are changes to merge in here.
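An illustrative sketch of the warm_start concern; the `steps` / `steps_` split referenced in the comments is the one proposed in this PR, not current scikit-learn behaviour:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

X, y = make_classification(random_state=0)
pipe = Pipeline([("scale", StandardScaler()),
                 ("sgd", SGDClassifier(max_iter=5, warm_start=True))])
pipe.fit(X, y)

# Under the proposed clone-on-fit semantics, this would change the
# unfitted estimator stored in `steps`, not the fitted clone in `steps_`:
pipe.set_params(sgd__max_iter=10)

# The next fit would then re-clone the SGDClassifier from `steps`,
# discarding the "warm" coefficients learned above instead of reusing them.
pipe.fit(X, y)
```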
|
Thanks @jnothman. We discussed it with @thomasjpfan and we thought about the following solution: [...]
|
You're also implying in (1) that set_params is redefined to have effect on
both the fitted and unfitted versions of the estimators...? And I can't say
I like the idea of not fitting from scratch in Pipeline by default, but I
can't quite put my finger on a specific problem with it.
For behaviour (2), rather than cloning in __init__, you would need to clone
every time a step parameter (or steps) was set, if I'm not mistaken.
|
Yes, set_params would keep both steps and steps_ synchronized. |
It is still fitting from scratch as long as the items in [...]
Agreed. It can be a [...] |
I feel like we always want to fit a pipeline from scratch? Not fitting from scratch is an extreme use case that we don't want to support (at least explicitly?). It means that your final estimator has been already fitted on whatever the pre-processors of your pipeline can output... How do you do that without the pipeline in the first place?
You mean someone doing e.g.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

pipe = Pipeline(
    [('pre', StandardScaler()),
     ('est', LogisticRegression())]
)
pipe.set_params(est=MLPClassifier())
|
yeah so we can just clone in [...] |
Question: how bad is the problem we are trying to solve here? Do we really care that pipelines don't respect our API conventions? Does it prevent us from doing more cool things? I can't say I have a compelling reason to go through all this, and it seems that the original motivation #8157 was addressed already. But if we do I'd like to move forward and try to implement #8350 (comment). |
I agree the current behaviour is mostly innocuous and have never thought the change was a priority, without checking that all the use cases are going to be handled. We have had bugs raised where the fundamental issue was the non-cloning in Pipeline... but I can't find them. And they're more "surprising" than "broken". The other reason to do this is to set an example of how to build such things in a compliant way! |
So... should we remove it from the roadmap? |
Actually, this issue is quite weird/annoying and is caused by this: #10063. We could also try and fix it by cloning the params in [...] |
Removing the milestone; I think this needs a SLEP or a close. |
We can certainly have another look at this, but I think we can close this one; a fresh PR at some point would make more sense. |
Yep, it will be scikit-learn 2.0 anyway. |
Reference Issue
#8157
What does this implement/fix? Explain your changes.
Modify the Pipeline API such that `steps` given by the user at initialization is not modified.

Any other comments?

* `steps` is replaced by `steps_` at fit time to follow the scikit-learn API.
* `named_steps` has the same behaviour as before, with a `FutureWarning`. It will return unfitted estimators from `steps`.
* `named_steps_` returns fitted estimators from `steps_`.
* `steps` becomes `_steps` in order to create a property and setter `steps`.

Update: `FeatureUnion` has a nearly identical design issue.
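A condensed sketch of the design described above, using the proposed names (`_steps`, `steps`, `steps_`, `named_steps_`); it illustrates the idea only and is not the PR's actual diff:

```python
from sklearn.base import BaseEstimator, clone

class PipelineSketch(BaseEstimator):
    """Illustration only: the user-given steps are never mutated by fit."""

    def __init__(self, steps):
        self.steps = steps  # stored via the property setter below

    @property
    def steps(self):
        # Unfitted estimators, exactly as given by the user.
        return self._steps

    @steps.setter
    def steps(self, steps):
        self._steps = steps

    @property
    def named_steps_(self):
        # Fitted estimators, available only after fit.
        return dict(self.steps_)

    def fit(self, X, y=None):
        Xt = X
        self.steps_ = []
        for name, est in self._steps:
            fitted = clone(est).fit(Xt, y)  # the user's estimator is untouched
            if hasattr(fitted, "transform"):
                Xt = fitted.transform(Xt)
            self.steps_.append((name, fitted))
        return self
```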