RFC Implement Pipeline get feature names #12627
Conversation
@@ -336,8 +336,12 @@ def get_feature_names(self):
                raise AttributeError("Transformer %s (type %s) does not "
                                     "provide get_feature_names."
                                     % (str(name), type(trans).__name__))
            try:
this is duck typing to support both transformative and non-transformative get_feature_names.
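For illustration, the duck typing could look like this (a hypothetical helper, not necessarily the PR's exact code): try the transformative signature first and fall back to the zero-argument one used e.g. by the vectorizers.

def _transformer_feature_names(trans, input_features):
    # Hypothetical helper supporting both get_feature_names signatures.
    try:
        # "transformative" signature: maps input names to output names
        return trans.get_feature_names(input_features)
    except TypeError:
        # non-transformative signature: takes no input names
        return trans.get_feature_names()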
@@ -531,6 +531,20 @@ def _pairwise(self):
        # check if first estimator expects pairwise input
        return getattr(self.steps[0][1], '_pairwise', False)

    def get_feature_names(self, input_features=None):
this is the actual implementation that enables everything. It's pretty short, and would be even shorter if we force get_feature_names to accept input_features (which I think we should).
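For illustration, the core could be as short as threading the names through the fitted steps (a sketch assuming every transformer accepts input_features, and ignoring the final-estimator question discussed below):

def get_feature_names(self, input_features=None):
    # Sketch: fold feature names through each step of the pipeline.
    feature_names = input_features
    for name, transform in self.steps:
        if transform is None or transform == 'passthrough':
            continue
        feature_names = transform.get_feature_names(feature_names)
    return feature_names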
sklearn/pipeline.py (outdated)
        Transformed feature names
        """
        feature_names = input_features
        with_final = hasattr(self._final_estimator, "transform")
This is pretty controversial
do you have an alternative that's not slicing?
The other option would be to give all supervised models a pass-through get_feature_names, maybe?
I don't like the idea of get_feature_names sometimes being about input and sometimes about output. We need to consider some kind of interface for slicing if we switch to a new cloning Pipeline anyway (because with the current Pipeline you can do it easily, just without syntactic sugar).
yeah I also think it's a bit confusing. But the most common use-case should be syntactically nice.
Maybe like pipe.pop() or something.
Have you thought about slicing cloned pipelines?
The "easiest" way I could think of was freezing all steps, making the sliced pipeline immutable. That's probably what we want, right? Creating a clone of the sliced pipeline in case someone wants to refit it should be pretty simple, right?
I didn't think doing this would link to slicing pipelines and freezing. I thought this one was the easy one :-/ Though I guess we really need the freezing "only" once we have the cloning pipeline. So it might be possible to implement this without solving the other issues first?
Also, raising an error with non-transformers at the end of a pipeline is perfectly good for a start.
I think that would be a very ugly solution, because it requires the user to either specify the pipeline in a weird way or to slice after the fact.
Would you allow fitting on a "view" of that slice? That might lead to unexpected consequences - though I guess not if cloning happens in fit.
See #8448 (comment)
@jnothman Can you explain a bit more why you find this controversial?
Is it the way the last estimator is skipped, or the actual skipping itself? If the latter: do you think it should be an explicit action of the user to slice the full pipeline to remove the last estimator, in order to get the names of the features used in the last estimator?
My understanding is that @jnothman would like the user to explicitly skip the last estimator, say via slicing. I think that actually makes more logical sense.
There is really no way for the pipeline to know whether the user meant to include the last model or not. For example LinearDiscriminantAnalysis has a transform and a predict, same for KMeans, and the pipeline doesn't know if the user is using predict on the pipeline (in which case they wanted the input to the last step) or transform (in which case they wanted the output of the last step).
I addressed this comment, now that we have slicing ;)
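Usage could then look like this (hypothetical toy data; assumes the get_feature_names from this PR on a sliced pipeline):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"sex": ["f", "m", "f"], "pclass": ["1", "3", "2"]})
y = [1, 0, 1]

pipe = make_pipeline(OneHotEncoder(), LogisticRegression()).fit(X, y)
# Explicitly slice off the final classifier, then ask the remaining
# all-transformer pipeline for its output feature names.
names = pipe[:-1].get_feature_names()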
You are welcome to reuse my implementations and tests from eli5, if you like. https://github.com/TeamHG-Memex/eli5/blob/master/eli5/sklearn/transform.py
No, it is not. But the singledispatch approach in eli5 is more suggestive that users can create their own feature name functions...
I think this is sensible. Btw, I'm a fan of introducing it; I have proposed similar before. I can't imagine it would be very hard to implement, and to enforce in a common test, except perhaps where it applies to meta-estimators. ... And there would be tricky cases with vectorizers and FunctionTransformer that take non-array-like inputs. Hmm. But we can certainly require …
I agree. I don't think we can use duck typing of the last estimator safely. I have also been a proponent of estimator slicing.
This is hard. Rank and log (and tfidf) seem more important...
I think for now we need to let the user do this as input to …
Taking the tests would be good. What part of the implementation do you think is relevant? I guess adding …
can you elaborate?
I'd also be happy to add …
Yes, FunctionTransformer is not so much of a problem... It just needs to run a custom feature names function on X at fit time and store the results.
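A sketch of that idea (NamedFunctionTransformer and names_func are hypothetical, not existing sklearn API):

import numpy as np
from sklearn.preprocessing import FunctionTransformer

class NamedFunctionTransformer(FunctionTransformer):
    """Sketch: a FunctionTransformer that computes feature names at fit time.

    names_func is a hypothetical user-supplied callable mapping the
    training data to a list of output feature names.
    """
    def __init__(self, func=None, names_func=None):
        super().__init__(func=func)
        self.names_func = names_func

    def fit(self, X, y=None):
        super().fit(X)
        if self.names_func is not None:
            # run the custom feature-names function on X and store the result
            self.feature_names_ = np.asarray(self.names_func(X))
        return self

    def get_feature_names(self, input_features=None):
        return self.feature_names_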
Cool! Some quick feedback on the top-level explanation:
You also said it later in the top comment as well, but the third option you mention here about tighter integration with pandas is not an alternative, right, but rather a possible extension? Nothing in the current PR prevents adding this later, I think?
Just to clarify: is there a fourth case missing in that list, for the simple transformer that doesn't change the number of inputs (like StandardScaler -> your OneToOneMixin)? And then one actual implementation-related comment: one problem with the current implementation in the …
Added that.
yes.
# Conflicts:
#	sklearn/base.py
#	sklearn/impute.py
#	sklearn/preprocessing/data.py
I placed this comment in the wrong issue before: After reviewing all the approaches to get the feature names through, I think the following approach combined with this PR could work:

Fitting a pipeline:

clf = MyPipeline([
    ('recorder', ColumnNameRecorder()),
    ('preprocessor', preprocessor),
    ('selector', MySelectKBest(k=5)),
    ('classifier', LogisticRegression())])
_ = clf.fit(X, y)

where the recorder is defined as:

class ColumnNameRecorder(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        if hasattr(X, "columns"):
            self.feature_names_in_ = np.asarray(X.columns)
        return self

    def transform(self, X, y=None):
        return X

    def get_feature_names(self, input_features=None):
        # uses input_features if self.feature_names_in_ is None
        return self.feature_names_in_

Get input features of the classifier:

clf[:-1].get_feature_names()
# array(['num__fare', 'cat__sex_female', 'cat__sex_male', 'cat__pclass_1.0',
#        'cat__pclass_3.0'], dtype=object)

# if we want some sugar this can work as well
clf[:'classifier'].get_feature_names()

Get input features of the selector:

clf[:-2].get_feature_names()
# array(['num__age', 'num__fare', 'cat__embarked_C', 'cat__embarked_Q',
#        'cat__embarked_S', 'cat__embarked_missing', 'cat__sex_female',
#        'cat__sex_male', 'cat__pclass_1.0', 'cat__pclass_2.0',
#        'cat__pclass_3.0'], dtype=object)

# with more sugar
clf[:'selector'].get_feature_names()

The benefits of this approach are: …

Here is a quick implementation of this idea. (Note the implementation uses …) (We can also come up with a better name for …)
This PR is very incomplete and doesn't work. You can check out #13307 for all the issues that happen if you actually try to do this. Another issue is the distinction between input and output feature names.
…it-learn into pipeline_get_feature_names
@@ -689,6 +690,45 @@ def fit_transform(self, X, y=None, **fit_params):
        # fit method of arity 2 (supervised transformation)
        return self.fit(X, y, **fit_params).transform(X)

    def get_feature_names(self, input_features=None):
We can push this down if people think having it here is ugly.
            # because n_components_ means something else
            # in agglomerative clustering
            n_features = self.n_clusters
        elif hasattr(self, '_max_components'):
whoops this can be removed, it's in the class now
Can we make a small step with this PR by having a smaller PR adding …?
Sounds good! Do you want to take the lead? Or do you want me to simplify this one?
Superseded by #18444.
Reference Issues/PRs
This is a draft implementation of #6424.
It doesn't really introduce anything new in the API, but I'm happy to move this to a SLEP.
Below is an initial description. Happy to include feedback in the SLEP.
Fixes #6425
What does this implement/fix? Explain your changes.
The main idea of this is to make compound scikit-learn estimators less opaque by providing "feature names" as strings.
Motivation
We've been making it easier to build complex workflows with the ColumnTransformer and I expect it will find wide adoption. However, using it results in very opaque models, even more so than before.
We have a great usage example in the gallery that applies a classifier to the titanic data set. To me this is a very simple, standard use case.
Markdown doesn't let me paste this as details, so just look here:
https://scikit-learn.org/dev/auto_examples/compose/plot_column_transformer_mixed_types.html
However, it's impossible to interpret or even sanity-check the LogisticRegression instance that's produced here, because the correspondence of the coefficients to the input features is basically impossible to figure out.

This PR enables using get_feature_names to obtain the semantics for the coefficients. I think this is essential information in any machine learning workflow, and it's imperative that we allow the user to get to this information in some way.
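Concretely, the kind of inspection this enables (a sketch; clf is assumed to be the fitted titanic pipeline from the linked example, sliced as proposed below):

# Pair each coefficient of the final LogisticRegression with the name
# of the preprocessed feature it applies to.
feature_names = clf[:-1].get_feature_names()
coefs = clf[-1].coef_.ravel()
for name, coef in sorted(zip(feature_names, coefs), key=lambda t: -abs(t[1])):
    print("%s: %.3f" % (name, coef))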
The proposed API adds a method get_feature_names to all supported (see below) transformers, with a (possibly optional, see below) parameter "input features", which is an array-like of strings.

Alternative Interfaces

To me there are four main options for interfaces to enable this:

1) get_feature_names as in this PR
2) …
3) tighter integration with pandas
   a) to output feature semantics
   b) to determine feature semantics
4) no dedicated API, leaving users to track feature semantics themselves

While I think 2) and 3a) are valid options for the future, I think trying to implement them now will probably result in a gridlock and/or take too much time. I think we should iterate and provide something that solves the 80% use case quickly. We can create a more elaborate solution later, in particular since this proposal/PR doesn't introduce any concepts that are not in sklearn already.

3b) is discussed below.

I don't think 4) is a realistic option. I assume we can agree that the titanic example above is a valid use case, and that getting the semantics of features is important. Below is the code that the user would have to write to do this themselves. This will become even harder in the future if the pipeline does cloning.
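Roughly, the manual bookkeeping looks like this (a hedged reconstruction; the step names 'preprocessor', 'num', 'cat' and 'onehot' are assumed from the gallery example):

# Manually reconstructing feature names for the titanic pipeline,
# without any get_feature_names support on Pipeline/ColumnTransformer.
ct = clf.named_steps['preprocessor']            # the ColumnTransformer
feature_names = []
for name, trans, cols in ct.transformers_:
    if name == 'num':
        # the numeric sub-pipeline keeps one output column per input
        feature_names.extend(cols)
    elif name == 'cat':
        # dig the OneHotEncoder out of the categorical sub-pipeline
        ohe = trans.named_steps['onehot']
        feature_names.extend(ohe.get_feature_names(cols))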
Scope

I suggest we limit get_feature_names to transformers that either: …

Also, I want the string to only convey presence or absence of features, or constant functions of the features. So scaling would not change a feature name, while a log-transformation (or polynomial) might. This limits the complexity of the string (but also its usefulness somewhat).

Together, these mean that there will be no support for multivariate transformations like PCA or NMF or KMeans.
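Illustrative examples of names under this scope (not output from the PR):

# In scope: presence/absence and subsets of features.
#   StandardScaler:   ['age', 'fare'] -> ['age', 'fare']   (unchanged)
#   SelectKBest(k=1): ['age', 'fare'] -> ['fare']          (subset)
#   OneHotEncoder:    ['sex']         -> ['sex_female', 'sex_male']
# Out of scope: multivariate combinations of columns.
#   PCA(n_components=2): ['age', 'fare'] -> no meaningful per-input names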
Implementation

Given the above scope and API, and the current implementation of get_feature_names in ColumnTransformer, there are two main mechanisms that need to be implemented.

There are basically three cases the meta-estimators need to take care of:
a) The transformer does a non-trivial column transformation, like OneHotEncoder or feature selection.
b) The transformer does "nothing" to the columns, like StandardScaler.
c) The transformer does a "too complex" operation on the columns, like PCA.

For a), only the estimator can handle this case, so the estimator needs to provide a function to do that - already implemented in several cases as a transformative get_feature_names. For b), the meta-estimator can simply do a pass-through, so we need to "flag" these in some way. There is no way for the meta-estimator to really handle c), so if the estimator is not "tagged" as being trivial and doesn't implement get_feature_names, the meta-estimator needs to bail in some way.

I added a "OneToOneMixin" to tag the trivial transformations. It would be possible to just use this as a tag, and let the meta-estimators handle the pass-through. Given that we already have the mechanism to handle the pass-through, I thought it would be simpler to just implement a pass-through get_feature_names (another alternative would be to add an estimator tag, but that also seems less elegant).

Right now the bail in case c) is a TypeError.
Limitations

The general API requires "input features". In PolynomialFeatures this was optional. Unfortunately we have no way to know the input dimensionality of a fitted transformer in general, so automatically generating x1, x2, etc. is not possible. This could be fixed by adding a required n_features_ to the API, which would probably be helpful but also would be a relatively heavy addition.

Because we don't know the number of input features, there's no way to ensure the user passed the right length of input_features.

The implementation of get_feature_names in Pipeline is a hack, because it includes or excludes the last step based on whether the last step has transform. The reason for this is that given a trained pipeline with a classifier at the end, I want to be able to get the feature names, which would not include the last step. In preprocessing pipelines we always want to include all the steps, though. The real solution to this, in my opinion, is to always include the last step and allow slicing the pipeline ([MRG+1] Pipeline can now be sliced or indexed #2568) to get the feature names for a pipeline with a final supervised step.

Bailing to a TypeError if any "complex" transformation happens is a bit of a bummer. We could try to generate names like pca_1, pca_2, ... but to do this automatically we would need to know the output dimensionality, which we don't (unless we add n_outputs_ as a required attribute to the API, similar to n_features_ above).
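With those attributes in place, generating default names would be straightforward (a sketch; n_features_ and n_outputs_ are the proposed, not existing, API):

import numpy as np

def _default_input_names(est):
    # requires the proposed n_features_ attribute
    return np.array(['x%d' % i for i in range(est.n_features_)])

def _default_output_names(est, prefix):
    # requires the proposed n_outputs_ attribute, e.g. pca_1, pca_2, ...
    return np.array(['%s_%d' % (prefix, i + 1) for i in range(est.n_outputs_)])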
above)Open Questions
Do we want to require
get_feature_names
to acceptinput_features
? Right now the vectorizers don't and it makes the code slightly more complex.How do we want to handle the hack in Pipeline.get_feature_names for the last step?
Do we want to encode fixed univariate transformations ("scale", "log", "rank"?)
Possible Extensions

- Add n_features_ to the API, so we don't need to require input_features and can generate names.
- Add n_outputs_ to the API.
- Get input_features from pandas column names if available (3b above).

I already discussed the requirements for the first two extensions (adding n_features_ and n_outputs_). The last one would require storing the input column names if the input is a pandas dataframe. It shouldn't be hard to do, and would also enable solving #7242, and I'd like to do that, but it's not required for this proposal to be useful.
Todo

- … get_feature_names methods?