
[WIP] ENH allow extraction of subsequence pipeline #8431


Closed

Conversation

@jnothman (Member) commented Feb 22, 2017

Conceptually Fixes #8414 and related issues. An alternative to #2568, #2561 and #2562, without __getitem__ and the different return types for different arguments.

Designed to assist in model inspection, and particularly to replicate the
composite transformer represented by all steps of the pipeline except the
last; i.e. pipe.get_subsequence(0, -1) is a common idiom. I feel this becomes
more necessary with the more API-consistent clone behaviour of #8350, since
Pipeline(pipe.steps[:-1]) is then no longer possible. I still find the
pipe[:-1] of #2568 more readable than .get_subsequence(0, -1).

I'm happy to forgo the handling of subtypes if that is seen as too magical, though I thought imblearn would appreciate it.

Ping @glemaitre, @GaelVaroquaux

TODO:

  • narrative docs
  • use in existing example?

jnothman changed the title from "ENH allow extraction of subsequence pipeline" to "[WIP] ENH allow extraction of subsequence pipeline" on Feb 22, 2017
@glemaitre (Member) commented:

I like the pipeline slicing as long as it behaves like a Python list: pipe[1] should return the tuple (str, estimator), and pipe[:-1] should return a Pipeline, fitted or unfitted, whatever the state of the original pipeline was. That would mean copying steps, and steps_ if available.
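
A minimal sketch of those list-like semantics (hypothetical; this is not what the PR implements, and steps_ is the fitted-steps attribute proposed in #8350):

from sklearn.pipeline import Pipeline

class SliceablePipeline(Pipeline):
    """Hypothetical subclass sketching the semantics described above."""

    def __getitem__(self, ind):
        if isinstance(ind, slice):
            out = Pipeline(self.steps[ind])
            if hasattr(self, 'steps_'):
                # copy the fitted steps (if any) so the slice needs no refit
                out.steps_ = self.steps_[ind]
            return out
        # an integer index returns the (name, estimator) tuple, like a list
        return self.steps[ind]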

Not sure that @GaelVaroquaux would agree.

@jnothman (Member, Author) commented Feb 22, 2017 via email

@GaelVaroquaux (Member) commented Feb 22, 2017 via email

@glemaitre (Member) commented:

After playing with the examples below, I have the impression that having a get method is not enough.
I would expect a concatenate method to plug in one or more additional estimators. Using the memory feature (or _Frozen), it then becomes possible to fit only the new estimator(s), which also improves readability.

Example without concatenation

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=42)

# create a pipeline
pipe_rf = Pipeline(steps=[
    ('sc', RobustScaler()),
    ('pca', PCA()),
    ('clf', RandomForestClassifier())
])

pipe_rf.fit(X_train, y_train)

# interested in the score of the pipeline
print('Score for pipeline with RF: ', pipe_rf.score(X_test, y_test))

# do some transfer learning using the first two steps
X_trans_train = pipe_rf.get_subsequence(0, -1).transform(X_train)
X_trans_test = pipe_rf.get_subsequence(0, -1).transform(X_test)

# classification using an SVM
svc = LinearSVC()
print('Score for pipeline with an SVM: ',
      svc.fit(X_trans_train, y_train).score(X_trans_test, y_test))

Example with concatenation

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.externals.joblib import Memory

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=42)

# create a pipeline with caching enabled
cachedir = './'
memory = Memory(cachedir=cachedir, verbose=10)
pipe_rf = Pipeline(steps=[
    ('sc', RobustScaler()),
    ('pca', PCA()),
    ('clf', RandomForestClassifier())
], memory=memory)

pipe_rf.fit(X_train, y_train)

# interested in the score of the pipeline
print('Score for pipeline with RF: ', pipe_rf.score(X_test, y_test))

# new pipeline: reuse the fitted transformers and append an SVM
pipe_svm = pipe_rf.get_subsequence(0, -1).concatenate_estimator(LinearSVC())
print('Score for pipeline with an SVM: ',
      pipe_svm.fit(X_train, y_train).score(X_test, y_test))

with

def concatenate_estimator(self, *estimators):
    """Concatenate some estimators to the current pipeline.

    This method can be used for transfer learning. To take advantage
    of it, you need to activate the cache to avoid refitting the
    previous steps.
    """
    name_estimators = _name_estimators(estimators)
    self.steps += name_estimators

    return self

@jnothman (Member, Author) commented:

concatenate_estimator would traditionally be called extend. But the sort of thing you illustrate can be performed easily with memory and set_params(clf=LinearSVC()).
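
For illustration, a sketch of that alternative, reusing the names from the cached example above (pipe_rf with memory already set):

# set_params replaces the step named 'clf' in place; since memory is set,
# refitting reuses the cached transformer fits and only fits the new SVM
pipe_rf.set_params(clf=LinearSVC())
print('Score for pipeline with an SVM: ',
      pipe_rf.fit(X_train, y_train).score(X_test, y_test))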

The main point of the proposal here is to support model inspection cases, where the model has been fit and we may not even easily be able to (or care to) reproduce the X and y used to fit it (due to models being dumped by a cross-validation routine, for instance). Usually model inspection will involve getting the feature importances from the last step of the pipeline and deriving, from the earlier steps, the feature names or features input to that last step; sometimes one will also want to remove a feature extraction step for inspection. If we have X and y, these use cases are also possible with memory and set_params(clf=None) and calling fit again, but I think that is highly unintuitive for what I consider to be routine model inspection cases.

The point is that after #8350 it becomes much harder to get parts of a fitted pipeline without calling fit again. I'm not sure the present PR is the right solution, but I think we need one. I don't think freezing is it, because it requires a new fit.

I'm now inclining towards something like:

    def pop(self, pos=-1):
        """Return (pipeline of remaining steps, the popped estimator)."""
        if isinstance(pos, string_types):  # string_types as in six: str on Py3
            # retrieve the index for a step given by name
            pos = [name for name, _ in self.steps].index(pos)
        if pos < 0:
            pos = len(self.steps) + pos
        out = Pipeline(self.steps[:pos] + self.steps[pos + 1:])
        if hasattr(self, 'steps_'):
            # propagate the fitted steps so the result needs no refit
            out.steps_ = self.steps_[:pos] + self.steps_[pos + 1:]
        return out, self.steps[pos][1]

This makes it easy for the user to do:

import numpy as np

pipe.fit(X, y)
transformer, predictor = pipe.pop()
weights = predictor.feature_importances_
top_feats = np.argsort(weights)[::-1][:10]
print(list(zip(np.take(transformer.get_feature_names(), top_feats),
               weights.take(top_feats))))

By far, this seems the most common use case for what I'm talking about, and returning the last estimator, as well as a pipeline containing everything else, is a nice boon.

I don't mind a general "insert(before, name, estimator, inplace=False)" to make pipeline modification easier, but this pertains exclusively to unfitted models.

@glemaitre (Member) commented:

I misunderstood the distinction between model inspection and simply reusing the transformers (which I had also misused). I am more inclined towards the pop solution since you get both the model and the transformers.

How far is the model inspection that you give as an example from SelectFromModel?

@jnothman (Member, Author) commented:

I'm not sure how this relates to SelectFromModel. Could you clarify your question?

The issue is that model inspection in a pipeline (see #2562, #2561) requires interpreting the model attributes of the last step of the pipeline. But these attributes are calculated with respect to the output of the feature transformation pipeline that precedes it. So interpreting the model either involves taking information from that final step and passing it to inverse_transform of all previous steps, or it involves taking information from that final step and calculating properties of the features output by the transformation pipeline, such as feature names (given #6425 or TeamHG-Memex/eli5#158).
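
As a hedged illustration of the inverse_transform route, assuming the proposed pop above and that the steps before the final estimator are feature selectors (whose inverse_transform scatters values back to the original input columns):

import numpy as np

# assumes `pipe` has been fit; push per-feature importances of the final
# step back through the transformer pipeline's inverse_transform
transformer, predictor = pipe.pop()
importances = np.abs(predictor.coef_).sum(axis=0, keepdims=True)
input_importances = transformer.inverse_transform(importances)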

@GaelVaroquaux what do you think of the pop described above rather than the more generic get_subsequence implemented here?

@glemaitre (Member) commented:

@jnothman I was recalling an experiment I did, but I forgot that I didn't have any transformers before the model itself, so my question is irrelevant. Nevertheless, thanks for taking the time to clarify.

jnothman mentioned this pull request on Feb 24, 2017
@amueller (Member) commented:

Coming back to this (or maybe here for the first time?). I actually would have called it pop, coming from #12627. I think removing the last element will be the most common use case. I would call it pop but would implement an interface similar to the one here, with the default behaviour that the last step is -1 and the first step is 0.

@jnothman (Member, Author) left a comment:


The problem with calling it pop is that the standard Python pop modifies the object in place and returns the popped item.

You just want to call it pop so that there's a syntactically simple way to get a pipeline that is all-but-last? I think head would be a more appropriate name for that. (But if there's head there should be tail, and then it is not identical in meaning to *nix head, which not all our users will be familiar with anyway ... so why not just get_subsequence or extract or something?) How about my_pipe.until(-1)? my_pipe.stop_at(-1)?

@amueller (Member) commented:

I like pop because it's short and mnemonic. I agree the difference in semantics with standard Python pop is a problem.

head and tail are also commonly used in pandas, btw.
And what's the difference from the *nix ones? That we return both parts? As I said in the other thread, there's already an easy way to get a single step. I'm not sure there are good applications for slicing a pipeline into two pipelines.
In #8431 (comment) you slice into a pipeline and a single estimator even when slicing in the middle of a pipeline. That seems a bit counter-intuitive to me because you're dropping some parts.

@jnothman (Member, Author) commented Nov 28, 2018 via email

@amueller (Member) commented:

import pandas as pd

df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
                              'monkey', 'parrot', 'shark', 'whale', 'zebra']})
df.head(-1)
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
6      shark
7      whale
so only the default value is different.

@jnothman (Member, Author) commented Dec 3, 2018 via email

@amueller (Member) commented:

So rename to head and finish up? ;) Happy to help

@jnothman (Member, Author) commented:

I'd still rather have the slicing syntax, personally, but I know it gets strong push-back from @GaelVaroquaux.

If I changed this to head, I would be inclined to remove the start parameter and set stop=-1 by default.
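
A minimal sketch of that variant (hypothetical; the name, default and steps_ handling are assumptions, not this PR's code):

def head(self, stop=-1):
    """Return a Pipeline of the steps before `stop` (default: all but last)."""
    out = Pipeline(self.steps[:stop])
    if hasattr(self, 'steps_'):
        # keep the fitted steps so the result is usable without refitting
        out.steps_ = self.steps_[:stop]
    return out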

@amueller (Member) commented:

I'm also happy with slicing. We can wait to finalize the governance doc and test our resolution mechanism ;)
