[WIP] ENH allow extraction of subsequence pipeline #8431
Conversation
Conceptually Fixes scikit-learn#8414 and related issues. Alternative to scikit-learn#2568 without __getitem__ and mixed semantics.

Designed to assist in model inspection and particularly to replicate the composite transformer represented by the steps of the pipeline with the exception of the last, i.e. pipe.get_subsequence(0, -1) is a common idiom. I feel like this becomes more necessary when considering more API-consistent clone behaviour as per scikit-learn#8350, as Pipeline(pipe.steps[:-1]) is no longer possible.
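A minimal sketch of the proposed usage, assuming the get_subsequence API of this PR (the dataset and estimators are only illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
pipe = Pipeline([('pca', PCA(n_components=2)),
                 ('clf', LogisticRegression())])
pipe.fit(X, y)

# Everything except the final step, as an already-fitted transformer,
# without refitting (proposed API, not in released scikit-learn):
head = pipe.get_subsequence(0, -1)
X_reduced = head.transform(X)
```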
I like the pipeline slicing as long as it behaves like a Python list. Meaning pipe[1] should return the tuple (str, estimator), and pipe[:-1] should return a Pipeline, fitted or unfitted, whatever the stage of the previous pipeline was. It would mean copying steps and steps_ if available. Not sure that @GaelVaroquaux would agree.
pipe[1] becomes ambiguous once we distinguish between the fitted and
unfitted versions. I'd rather leave it alone.
"pipe[1] becomes ambiguous once we distinguish between the fitted and unfitted versions. I'd rather leave it alone."
+1. I prefer Joel's approach.
As we were discussing over lunch, I think that this is a good direction to go in. My major comment is that we need to explain well what this is useful for, otherwise people won't find it / use it.
After playing with the examples below, I have the impression that having a method like concatenate_estimator (defined at the end of this comment) could be useful.

Example without concatenation:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=42)

# create a pipeline
pipe_rf = Pipeline(steps=[
    ('sc', RobustScaler()),
    ('pca', PCA()),
    ('clf', RandomForestClassifier())
])
pipe_rf.fit(X_train, y_train)

# interested in the score of the pipeline
print('Score for pipeline with RF: ', pipe_rf.score(X_test, y_test))

# make some transfer learning using the first two steps
X_trans_train = pipe_rf.get_subsequence(0, -1).transform(X_train)
X_trans_test = pipe_rf.get_subsequence(0, -1).transform(X_test)

# classification using an SVM
svc = LinearSVC()
print('Score for pipeline with an SVM: ',
      svc.fit(X_trans_train, y_train).score(X_trans_test, y_test))
```

Example with concatenation:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.externals.joblib import Memory

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=42)

# create a pipeline with caching of the transformers
cachedir = './'
memory = Memory(cachedir=cachedir, verbose=10)
pipe_rf = Pipeline(steps=[
    ('sc', RobustScaler()),
    ('pca', PCA()),
    ('clf', RandomForestClassifier())
], memory=memory)
pipe_rf.fit(X_train, y_train)

# interested in the score of the pipeline
print('Score for pipeline with RF: ', pipe_rf.score(X_test, y_test))

# new pipeline (the cached transformer fits are reused)
pipe_svm = pipe_rf.get_subsequence(0, -1).concatenate_estimator(LinearSVC())
print('Score for pipeline with an SVM: ',
      pipe_svm.fit(X_train, y_train).score(X_test, y_test))
```

with

```python
def concatenate_estimator(self, *estimators):
    """Concatenate some estimators to the current pipeline.

    This method can be used for transfer learning. To take advantage
    of this method, you need to activate the cache to avoid refitting
    the previous steps.
    """
    name_estimators = _name_estimators(estimators)
    self.steps += name_estimators
    return self
```
The main point of the proposal here is to support model inspection cases, where the model has already been fitted. The point is that after #8350 it becomes much harder to get parts of a fitted pipeline without calling fit again. I'm not sure the present PR is the right solution, but I think we need one. I don't think freezing is it, because it requires a new … I'm now inclining towards something like:

```python
def pop(self, pos=-1):
    if isinstance(pos, string_types):
        ...  # retrieve index for name
    if pos < 0:
        pos = len(self.steps) + pos
    # TODO: handle `steps_`
    out = Pipeline(self.steps[:pos] + self.steps[pos + 1:])
    out.steps_ = self.steps_[:pos] + self.steps_[pos + 1:]
    return out, self.steps[pos][1]
```

This makes it easy for the user to do:

```python
pipe.fit(X, y)
transformer, predictor = pipe.pop()
weights = predictor.feature_importances_
top_feats = np.argsort(weights)[::-1][:10]
print(zip(np.take(transformer.get_feature_names(), top_feats),
          weights.take(top_feats)))
```

By far, this seems the most common use case for what I'm talking about, and returning the last estimator, as well as a pipeline containing everything else, is a nice boon. I don't mind a general insert(before, name, estimator, inplace=False) to make pipeline modification easier, but this pertains exclusively to unfitted models.
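A rough sketch of what such an insert could look like (illustrative only; neither the exact name nor the semantics are settled in this thread):

```python
def insert(self, before, name, estimator, inplace=False):
    """Return a pipeline with (name, estimator) inserted before index `before`.

    As noted above, this only makes sense for unfitted pipelines.
    """
    steps = list(self.steps)
    steps.insert(before, (name, estimator))
    if inplace:
        self.steps = steps
        return self
    return Pipeline(steps)
```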
I misunderstood the usage between model inspection and just transfer of the transformers (which I also misused). I am more inclined to the … approach. How far is the model inspection that you provide as an example from the …?
I'm not sure how this relates to … The issue is that model inspection in a pipeline (see #2562, #2561) requires interpreting the model attributes of the last step of the pipeline. But these attributes are calculated with respect to the output of the feature transformation pipeline that precedes it. So interpreting the model either involves taking information from that final step and passing it to … @GaelVaroquaux, what do you think?
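A small illustration of the inspection problem described above, reusing the pipe_rf example from earlier in this thread (step names assumed from that example):

```python
# The attributes of the final step are expressed in the transformed space:
importances = pipe_rf.named_steps['clf'].feature_importances_
# len(importances) equals the number of PCA components, not the number of
# original iris features, so interpreting them requires access to the
# preceding 'sc' -> 'pca' transformation steps.
```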
@jnothman I was recalling an experiment that I did, but I forgot that I didn't have any transformers before the model itself. So my question is irrelevant. Nevertheless, thanks for taking the time to clarify.
Coming back to this (or maybe here for the first time?). I actually would have called it pop.
Problem with calling it pop is that Python's standard data structure pop modifies the object in place, and returns the popped item. You just want to call it pop so that there's a syntactically simple way to get a pipeline that is all-but-last? I think head would be a more appropriate name for that. (But if there's head there should be tail, and then that is not identical in meaning to *nix head, which not all our users will be familiar with ... why not just get_subsequence or extract or something.) How about my_pipe.until(-1)? my_pipe.stop_at(-1)?
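For contrast, a quick reminder of the standard Python semantics this comment refers to (plain list, nothing scikit-learn specific):

```python
steps = ['sc', 'pca', 'clf']
last = steps.pop()   # mutates the list and returns only the popped item
# steps is now ['sc', 'pca'], whereas the Pipeline.pop proposed above would
# leave the original pipeline untouched and return (head_pipeline, last_estimator).
```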
I like pop because it's short and mnemonic. I agree the difference in semantics with standard Python pop is a problem. head and tail are also commonly used in pandas, btw.
The difference from head and tail in pandas and Unix is that they take a head of fixed size rather than all but one element.
```python
import pandas as pd

df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
                              'monkey', 'parrot', 'shark', 'whale', 'zebra']})
df.head(-1)   # returns all rows except the last
```

so only the default value is different.
Yes, the default value is different, but so is the way users think about it. Having said that, I don't mind head.
So rename to head and finish up? ;) Happy to help
I'd still rather the slicing syntax, personally, but I know that gets strong push-back from @GaelVaroquaux. If I changed this to …
I'm also happy with slicing. We can wait to finalize the governance doc and test our resolution mechanism ;)
Conceptually Fixes #8414 and related issues. Alternative to #2568, #2561, #2562, without __getitem__ and the different return types for different arguments.

Designed to assist in model inspection and particularly to replicate the composite transformer represented by the steps of the pipeline with the exception of the last, i.e. pipe.get_subsequence(0, -1) is a common idiom. I feel like this becomes more necessary when considering more API-consistent clone behaviour as per #8350, as Pipeline(pipe.steps[:-1]) is no longer possible. I still find pipe[:-1] of #2568 more readable than .get_subsequence(0, -1).

I'm happy to forego the handling of subtypes if seen as too magical, though I thought imblearn would appreciate it.

Ping @glemaitre, @GaelVaroquaux

TODO: