
[WIP] ENH allow extraction of subsequence pipeline #8431


Closed

Conversation

@jnothman (Member) commented Feb 22, 2017

Conceptually Fixes #8414 and related issues. An alternative to #2568, #2561 and #2562, without __getitem__ and the different return types for different arguments.

Designed to assist in model inspection, and particularly to replicate the
composite transformer represented by all steps of the pipeline except the
last; i.e. pipe.get_subsequence(0, -1) is a common idiom. I feel this becomes
more necessary with the more API-consistent clone behaviour of #8350, since
Pipeline(pipe.steps[:-1]) is then no longer possible. I still find the
pipe[:-1] of #2568 more readable than .get_subsequence(0, -1).

I'm happy to forgo the handling of subtypes if that is seen as too magical, though I thought imblearn would appreciate it.

Ping @glemaitre, @GaelVaroquaux

TODO:

  • narrative docs
  • use in existing example?

jnothman changed the title from "ENH allow extraction of subsequence pipeline" to "[WIP] ENH allow extraction of subsequence pipeline" on Feb 22, 2017
@glemaitre (Member) commented:

I like the pipeline slicing as long as it behaves like a Python list: pipe[1] should return the tuple (str, estimator), and pipe[:-1] should return a Pipeline, fitted or unfitted, whatever the state of the original pipeline was. That would mean copying steps, and steps_ if available.
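
A minimal sketch of those list-like semantics (hypothetical; this is not what the PR implements, and steps_ is the fitted-steps attribute proposed in #8350):

from sklearn.pipeline import Pipeline

class SliceablePipeline(Pipeline):
    """Hypothetical subclass sketching the semantics described above."""

    def __getitem__(self, ind):
        if isinstance(ind, slice):
            out = Pipeline(self.steps[ind])
            if hasattr(self, 'steps_'):
                # copy the fitted steps (if any) so the slice needs no refit
                out.steps_ = self.steps_[ind]
            return out
        # an integer index returns the (name, estimator) tuple, like a list
        return self.steps[ind]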

Not sure that @GaelVaroquaux would agree.

@jnothman (Member, Author) commented Feb 22, 2017 via email

@GaelVaroquaux (Member) commented Feb 22, 2017 via email

@glemaitre (Member) commented:

After playing with the examples below, I have the impression that having a get method is not enough.
I would expect a concatenate method to plug in one or more additional estimators. Using the memory feature (or _Frozen), it then becomes possible to fit only the new estimator(s), which also improves readability.

Example without concatenation

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=42)

# create a pipeline
pipe_rf = Pipeline(steps=[
    ('sc', RobustScaler()),
    ('pca', PCA()),
    ('clf', RandomForestClassifier())
])

pipe_rf.fit(X_train, y_train)

# interested in the score of the pipeline
print('Score for pipeline with RF: ', pipe_rf.score(X_test, y_test))

# do some transfer learning using the first two steps
X_trans_train = pipe_rf.get_subsequence(0, -1).transform(X_train)
X_trans_test = pipe_rf.get_subsequence(0, -1).transform(X_test)

# classification using an SVM
svc = LinearSVC()
print('Score for pipeline with an SVM: ',
      svc.fit(X_trans_train, y_train).score(X_trans_test, y_test))

Example with concatenation

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.externals.joblib import Memory

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=42)

# create a pipeline with caching enabled
cachedir = './'
memory = Memory(cachedir=cachedir, verbose=10)
pipe_rf = Pipeline(steps=[
    ('sc', RobustScaler()),
    ('pca', PCA()),
    ('clf', RandomForestClassifier())
], memory=memory)

pipe_rf.fit(X_train, y_train)

# interested in the score of the pipeline
print('Score for pipeline with RF: ', pipe_rf.score(X_test, y_test))

# new pipeline: reuse the fitted transformers and append an SVM
pipe_svm = pipe_rf.get_subsequence(0, -1).concatenate_estimator(LinearSVC())
print('Score for pipeline with an SVM: ',
      pipe_svm.fit(X_train, y_train).score(X_test, y_test))

with

def concatenate_estimator(self, *estimators):
    """Concatenate some estimators to the current pipeline.

    This method can be used for transfer learning. To take advantage
    of it, you need to activate the cache to avoid refitting the
    previous steps.
    """
    name_estimators = _name_estimators(estimators)
    self.steps += name_estimators

    return self

@jnothman (Member, Author) commented:

concatenate_estimator would traditionally be called extend. But the sort of thing you illustrate can be performed easily with memory and set_params(clf=LinearSVC()).
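
For illustration, a sketch of that alternative, reusing the names from the cached example above (pipe_rf with memory already set):

# set_params replaces the step named 'clf' in place; since memory is set,
# refitting reuses the cached transformer fits and only fits the new SVM
pipe_rf.set_params(clf=LinearSVC())
print('Score for pipeline with an SVM: ',
      pipe_rf.fit(X_train, y_train).score(X_test, y_test))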

The main point of the proposal here is to support model inspection cases, where the model has been fit and we may not even easily be able to (or care to) reproduce the X and y used to fit it (due to models being dumped by a cross-validation routine, for instance). Usually model inspection will involve getting the feature importances from the last step of the pipeline and deriving, from the earlier steps, the feature names or features input to that last step; sometimes one will also want to remove a feature extraction step for inspection. If we have X and y, these use cases are also possible with memory and set_params(clf=None) and calling fit again, but I think that is highly unintuitive for what I consider to be routine model inspection cases.

The point is that after #8350 it becomes much harder to get parts of a fitted pipeline without calling fit again. I'm not sure the present PR is the right solution, but I think we need one. I don't think freezing is it, because it requires a new fit.

I'm now inclining towards something like:

    def pop(self, pos=-1):
        """Return (pipeline of remaining steps, the popped estimator)."""
        if isinstance(pos, string_types):  # string_types as in six: str on Py3
            # retrieve the index for a step given by name
            pos = [name for name, _ in self.steps].index(pos)
        if pos < 0:
            pos = len(self.steps) + pos
        out = Pipeline(self.steps[:pos] + self.steps[pos + 1:])
        if hasattr(self, 'steps_'):
            # propagate the fitted steps so the result needs no refit
            out.steps_ = self.steps_[:pos] + self.steps_[pos + 1:]
        return out, self.steps[pos][1]

This makes it easy for the user to do:

import numpy as np

pipe.fit(X, y)
transformer, predictor = pipe.pop()
weights = predictor.feature_importances_
top_feats = np.argsort(weights)[::-1][:10]
print(list(zip(np.take(transformer.get_feature_names(), top_feats),
               weights.take(top_feats))))

By far, this seems the most common use case for what I'm talking about, and returning the last estimator, as well as a pipeline containing everything else, is a nice boon.

I don't mind a general "insert(before, name, estimator, inplace=False)" to make pipeline modification easier, but this pertains exclusively to unfitted models.

@glemaitre (Member) commented:

I misunderstood the distinction between model inspection and simply reusing the transformers (which I had also misused). I am more inclined towards the pop solution since you get both the model and the transformers.

How far is the model inspection that you give as an example from SelectFromModel?

@jnothman (Member, Author) commented:

I'm not sure how this relates to SelectFromModel. Could you clarify your question?

The issue is that model inspection in a pipeline (see #2562, #2561) requires interpreting the model attributes of the last step of the pipeline. But these attributes are calculated with respect to the output of the feature transformation pipeline that precedes it. So interpreting the model either involves taking information from that final step and passing it to inverse_transform of all previous steps, or it involves taking information from that final step and calculating properties of the features output by the transformation pipeline, such as feature names (given #6425 or TeamHG-Memex/eli5#158).
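
As a hedged illustration of the inverse_transform route, assuming the proposed pop above and that the steps before the final estimator are feature selectors (whose inverse_transform scatters values back to the original input columns):

import numpy as np

# assumes `pipe` has been fit; push per-feature importances of the final
# step back through the transformer pipeline's inverse_transform
transformer, predictor = pipe.pop()
importances = np.abs(predictor.coef_).sum(axis=0, keepdims=True)
input_importances = transformer.inverse_transform(importances)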

@GaelVaroquaux what do you think of the pop described above rather than the more generic get_subsequence implemented here?

@glemaitre (Member) commented:

@jnothman I was recalling an experiment I did, but I forgot that I didn't have any transformers before the model itself, so my question is irrelevant. Nevertheless, thanks for taking the time to clarify.

jnothman mentioned this pull request on Feb 24, 2017
@amueller (Member) commented:

Coming back to this (or maybe here for the first time?). I actually would have called it pop, coming from #12627. I think removing the last element will be the most common use case. I would call it pop but would implement an interface similar to the one here, with the default behaviour that the last step is -1 and the first step is 0.

@jnothman (Member, Author) left a comment:


The problem with calling it pop is that the standard Python pop modifies the object in place and returns the popped item.

You just want to call it pop so that there's a syntactically simple way to get a pipeline that is all-but-last? I think head would be a more appropriate name for that. (But if there's head there should be tail, and then it is not identical in meaning to *nix head, which not all our users will be familiar with anyway ... so why not just get_subsequence or extract or something?) How about my_pipe.until(-1)? my_pipe.stop_at(-1)?

@amueller (Member) commented:

I like pop because it's short and mnemonic. I agree the difference in semantics with standard Python pop is a problem.

head and tail are also commonly used in pandas, btw.
And what's the difference from the *nix ones? That we return both parts? As I said in the other thread, there's already an easy way to get a single step. I'm not sure there are good applications for slicing a pipeline into two pipelines.
In #8431 (comment) you slice into a pipeline and a single estimator even when slicing in the middle of a pipeline. That seems a bit counter-intuitive to me because you're dropping some parts.

@jnothman (Member, Author) commented Nov 28, 2018 via email

@amueller (Member) commented:

import pandas as pd

df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
                              'monkey', 'parrot', 'shark', 'whale', 'zebra']})
df.head(-1)
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
6      shark
7      whale
so only the default value is different.

@jnothman (Member, Author) commented Dec 3, 2018 via email

@amueller (Member) commented:

So rename to head and finish up? ;) Happy to help

@jnothman (Member, Author) commented:

I'd still rather have the slicing syntax, personally, but I know it gets strong push-back from @GaelVaroquaux.

If I changed this to head, I would be inclined to remove the start parameter and set stop=-1 by default.
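
A minimal sketch of that variant (hypothetical; the name, default and steps_ handling are assumptions, not this PR's code):

def head(self, stop=-1):
    """Return a Pipeline of the steps before `stop` (default: all but last)."""
    out = Pipeline(self.steps[:stop])
    if hasattr(self, 'steps_'):
        # keep the fitted steps so the result is usable without refitting
        out.steps_ = self.steps_[:stop]
    return out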

@amueller (Member) commented:

I'm also happy with slicing. We can wait to finalize the governance doc and test our resolution mechanism ;)
