
RFC Implement Pipeline get feature names #12627


Closed
wants to merge 30 commits

Conversation

@amueller (Member) commented Nov 20, 2018

Reference Issues/PRs

This is a draft implementation of #6424.
It doesn't really introduce anything new in the API, but I'm happy to move this to a SLEP.
Below is an initial description. Happy to include feedback in the SLEP.

Fixes #6425

What does this implement/fix? Explain your changes.

The main idea of this is to make compound scikit-learn estimators less opaque by providing "feature names" as strings.

Motivation

We've been making it easier to build complex workflows with the ColumnTransformer and I expect it will find wide adoption. However, using it results in very opaque models, even more so than before.
We have a great usage example in the gallery that applies a classifier to the Titanic data set. To me this is a very simple, standard use case.

Markdown doesn't let me paste this as details, so just look here:
https://scikit-learn.org/dev/auto_examples/compose/plot_column_transformer_mixed_types.html

However, it's impossible to interpret or even sanity-check the LogisticRegression instance that's produced here, because the correspondence of the coefficients to the input features is basically impossible to figure out.
This PR enables using get_feature_names to obtain the semantics for the coefficients:

preprocessor.get_feature_names(X.columns)
['num__age',
 'num__fare',
 'cat__embarked_C',
 'cat__embarked_Q',
 'cat__embarked_S',
 'cat__embarked_missing',
 'cat__sex_female',
 'cat__sex_male',
 'cat__pclass_1',
 'cat__pclass_2',
 'cat__pclass_3']

I think this is essential information in any machine learning workflow and it's imperative that we allow the user to get to this information in some way.

The proposed API will add a method get_feature_names to all supported (see below) transformers, with a (possibly optional, see below) parameter input_features, which is an array-like of strings.
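
For concreteness, here is what that convention looks like with OneHotEncoder, which already implements a transformative get_feature_names (a small illustrative snippet; the data is made up):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"sex": ["female", "male"], "embarked": ["C", "S"]})
enc = OneHotEncoder().fit(X)
enc.get_feature_names(input_features=["sex", "embarked"])
# array(['sex_female', 'sex_male', 'embarked_C', 'embarked_S'], dtype=object)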

Alternative Interfaces

To me there are four main options for interfaces to enable this:

  1. Implement transformative get_feature_names as in this PR
  2. Implement a more comprehensive feature description language (as done in ELI-5, I think)
  3. Tie in more strongly with pandas and use dataframes / column names
    a) to output feature semantics.
    b) to determine feature semantics
  4. Leave it to the user.

While I think 2) and 3a) are valid options for the future, I think trying to implement this now will probably result in a gridlock and/or take too much time. I think we should iterate and provide something that solves the 80% use case quickly. We can create a more elaborate solution later, in particular since this proposal/PR doesn't introduce any concepts that are not in sklearn already.
3b) is discussed below.

I don't think 4) is a realistic option. I assume we can agree that the Titanic example above is a valid use case, and that getting the semantics of features is important. Below is the code that the user would have to write to do this themselves. This will become even harder in the future if the pipeline does cloning.

I'm hardcoding that the second imputer uses "constant" and the first one doesn't, otherwise this would get way too messy. This also hard-codes several other things, like the order of the transformers in the column transformer. It also already makes use of a transformative ``get_feature_names`` in the ``OneHotEncoder``, without which it would be completely impossible.

# numeric features: keep only the columns the imputer did not drop
numeric_fitted = preprocessor.named_transformers_.num
not_dropped = np.logical_not(
    np.isnan(numeric_fitted.named_steps.imputer.statistics_))
num_features_transformed = np.array(numeric_features)[not_dropped]

# categorical features: expand the names through the OneHotEncoder
categorical_fitted = preprocessor.named_transformers_.cat
cat_features_transformed = categorical_fitted.named_steps.onehot.get_feature_names(
    categorical_features)

feature_names = np.hstack([num_features_transformed, cat_features_transformed])

Scope

I suggest we limit get_feature_names to transformers that either:

  • leave columns unchanged,
  • select a subset of columns,
  • create new columns where each column depends on at most one input column, or
  • combine O(1) input features (like PolynomialFeatures does).

Also, I want the string to only convey presence or absence of features, or constant functions of the features. So scaling would not change a feature name, while a log-transformation (or polynomial) might. This limits the complexity of the string (but also its usefulness somewhat).

Together, these mean that there will be no support for multivariate transformations like PCA or NMF or KMeans.
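
For reference, PolynomialFeatures is the existing instance of the last case; its generated names encode the (constant) function applied to each combination of input features:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2).fit(np.arange(6).reshape(3, 2))
poly.get_feature_names(input_features=['age', 'fare'])
# ['1', 'age', 'fare', 'age^2', 'age fare', 'fare^2']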

Implementation

Given the above scope and API, and the current implementation of get_feature_names in ColumnTransformer, there are two main mechanisms that need to be implemented:

  1. allow pipeline to pass around names
  2. provide a mechanism for meta-estimators (ColumnTransformer, Pipeline and FeatureUnion) to "know what to do".

There are basically three cases the meta-estimators need to take care of:
a) The transformer does a non-trivial column transformation, like OneHotEncoder or feature selection
b) The transformer does "nothing" to the columns, like StandardScaler.
c) The transformer does a "too complex" operation to the columns, like PCA.

For a), only the estimator can handle this case, so the estimator needs to provide a function to do that - already implemented in several cases as a transformative get_feature_names. For b), the meta-estimator can simply do a pass-through, so we need to "flag" these in some way. There is no way for the meta-estimator to really handle c), so if the estimator is not "tagged" as being trivial and doesn't implement get_feature_names, the meta-estimator needs to bail in some way.

I added a "OneToOneMixin" to tag the trivial transformations. It would be possible to just use this as a tag, and let the meta-estimators handle the pass-through. Given that we already have the mechanism to handle the pass-through, I thought it would be simpler to just implement a pass-through get_feature_names (another alternative would be to add an estimator tag, but that also seems less elegant).

Right now the bail in case c) is a TypeError.

Limitations

  • The general API requires "input features". In PolynomialFeatures this was optional. Unfortunately we have no way to know the input dimensionality of a fitted transformer in general, so automatically generating x1, x2, etc is not possible. This could be fixed by adding a required n_features_ to the API, which would probably be helpful but also would be a relatively heavy addition.

  • Because we don't know the number of input features, there's no way to ensure the user passed the right length of input_features

  • The implementation of get_feature_names in Pipeline is a hack, because it includes or excludes the last step based on whether the last step has transform. The reason for this is that given a trained pipeline with a classifier in the end, I want to be able to get the feature names, which would not include the last step. In preprocessing pipelines we always want to include all the steps, though.
    The real solution to this in my opinion is always to include the last step, and allow slicing the pipeline ([MRG+1] Pipeline can now be sliced or indexed #2568) to get the feature names for a pipeline with a final supervised step.

  • Bailing to a TypeError if any "complex" transformation happens is a bit of a bummer. We could try to generate names like pca_1, pca_2, ... but to do this automatically we would need to know the output dimensionality, which we don't (unless we add n_outputs_ as a required attribute to the API, similar to n_features_ above). A sketch of such name generation follows this list.
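
As referenced in the last point, here is a sketch of how default output names could be generated if a (hypothetical) n_outputs_ attribute existed:

def _generated_feature_names(estimator, name):
    # Relies on an ``n_outputs_`` attribute that the current API does not
    # require, so this is only a sketch of a possible extension.
    n_outputs = getattr(estimator, "n_outputs_", None)
    if n_outputs is None:
        # mirror the current behavior: bail if the output dimensionality
        # is unknown
        raise TypeError("Cannot generate feature names for %s."
                        % type(estimator).__name__)
    return ["%s_%d" % (name, i) for i in range(n_outputs)]

# e.g. _generated_feature_names(fitted_pca, "pca") -> ['pca_0', 'pca_1', ...]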

Open Questions

  • Do we want to require get_feature_names to accept input_features? Right now the vectorizers don't, which makes the code slightly more complex.

  • How do we want to handle the hack in Pipeline.get_feature_names for the last step?

  • Do we want to encode fixed univariate transformations ("scale", "log", "rank"?)

Possible Extensions

  • don't require input_features and generate names
  • generate names for "complex transformations"
  • Use pandas column names as input_features if available (3b above)

I already discussed the requirements for the first two extensions (adding n_features_ and n_outputs_).
The last one would require storing the input column names if the input is a pandas dataframe. It shouldn't be hard to do, and would also enable solving #7242 and I'd like to do that, but it's not required for this proposal to be useful.

Todo

  • add to narrative docs
  • add to remaining estimators (feature selection is the only one left?)
  • allow input_features for all get_feature_names methods?

@@ -336,8 +336,12 @@ def get_feature_names(self):
                raise AttributeError("Transformer %s (type %s) does not "
                                     "provide get_feature_names."
                                     % (str(name), type(trans).__name__))
            try:
amueller (Member Author):

this is ducktyping to support both transformative and non-transformative get_feature_names.

@@ -531,6 +531,20 @@ def _pairwise(self):
        # check if first estimator expects pairwise input
        return getattr(self.steps[0][1], '_pairwise', False)

    def get_feature_names(self, input_features=None):
amueller (Member Author):

this is the actual implementation that enables everything. It's pretty short and would be even shorter if we force get_feature_names to accept input_features (which I think we should).

            Transformed feature names
        """
        feature_names = input_features
        with_final = hasattr(self._final_estimator, "transform")
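
(For readability, here is a rough reconstruction of the whole method from the fragments shown above and the surrounding discussion; a sketch, not the verbatim PR code:)

def get_feature_names(self, input_features=None):
    """Get the feature names of the pipeline's output (or of the input
    to a final non-transformer step).

    Returns
    -------
    feature_names : array-like of str
        Transformed feature names
    """
    feature_names = input_features
    # include the last step only if it is a transformer; for a final
    # classifier/regressor, return the names of its *input* features
    with_final = hasattr(self._final_estimator, "transform")
    steps = self.steps if with_final else self.steps[:-1]
    for name, transformer in steps:
        if transformer is None or transformer == 'passthrough':
            continue
        if not hasattr(transformer, "get_feature_names"):
            # the "too complex" case (e.g. PCA): bail with a TypeError
            raise TypeError("Transformer %s does not provide "
                            "get_feature_names." % name)
        try:
            # transformative get_feature_names
            feature_names = transformer.get_feature_names(feature_names)
        except TypeError:
            # non-transformative get_feature_names (e.g. vectorizers)
            feature_names = transformer.get_feature_names()
    return feature_names
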
jnothman (Member):

This is pretty controversial

amueller (Member Author):

do you have an alternative that's not slicing?

amueller (Member Author):

The other option would be to give all supervised models a pass-through get_feature_names maybe?

jnothman (Member):

I don't like the idea of get_feature_names sometimes being about input and sometimes about output. We need to consider some kind of interface for slicing if we switch to a new cloning Pipeline anyway (because the current Pipeline you can do it easily, just without syntactic sugar).

amueller (Member Author):

yeah I also think it's a bit confusing. But the most common use-case should be syntactically nice.
Maybe like pipe.pop() or something.
Have you thought about slicing cloned pipelines?
The "easiest" way I could think of was freezing all steps, making the sliced pipeline immutable. That's probably what we want, right? Creating a clone of the sliced pipeline in case someone wants to refit it should be pretty simple, right?

I didn't think doing this would link to slicing pipelines and freezing. I thought this one was the easy one :-/ Though I guess we really need the freezing "only" once we have the cloning pipeline. So it might be possible to implement this without solving the other issues first?

amueller (Member Author):

Also, raising an error with non-transformers at the end of a pipeline is perfectly good for a start.

I think that would be a very ugly solution, because it requires the user to either specify the pipeline in a weird way or to slice after the fact.

Would you allow fitting on a "view" of that slice? That might lead to unexpected consequences - though I guess not if cloning happens in fit.

Member:

@jnothman Can you explain a bit more why you find this controversial?
Is it the way the last estimator is skipped, or the actual skipping itself? In case of the second: do you think it should be an explicit action of the user to slice the full pipeline to remove the last estimator, to know the feature names of the features used in the last estimator?

amueller (Member Author):

My understanding is that @jnothman would like the user to explicitly skip the last estimator, say via slicing. I think that actually makes more logical sense.

There is really no way for the pipeline to know whether the user meant to include the last model or not. For example LinearDiscriminantAnalysis has a transform and a predict, same for KMeans, and the pipeline doesn't know if the user is using predict on the pipeline (in which case they wanted the input to the last step) or transform (in which case they wanted the output of the last step).
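
(To illustrate the ambiguity, a toy sketch using the iris data; a pipeline ending in LinearDiscriminantAnalysis supports both usages:)

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis()).fit(X, y)
pipe.predict(X)    # LDA used as classifier: user wants its *input* names?
pipe.transform(X)  # LDA used as transformer: user wants its *output* names?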

amueller (Member Author):

I addressed this comment, now that we have slicing ;)

@jnothman (Member)

You are welcome to reuse my implementations and tests from eli5, if you like.

https://github.com/TeamHG-Memex/eli5/blob/master/eli5/sklearn/transform.py
https://github.com/TeamHG-Memex/eli5/blob/master/tests/test_sklearn_transform.py
https://github.com/TeamHG-Memex/eli5/pull/208/files

Implement a more comprehensive feature description language (as done in ELI-5, I think)

No, it is not. But the singledispatch approach in eli5 is more suggestive that users can create their own feature name functions...

We could try to generate names like pca_1, pca_2,

I think this is sensible.

Btw, I'm a fan of introducing n_features_in_ and n_features_out_ as general API for transformers. This solves #1952 and makes ColumnTransformer.inverse_transform simpler.

I have proposed similar before. I can't imagine it would be very hard to implement, and to enforce in a common test, except perhaps where it applies to meta-estimators. ... And there would be tricky cases with vectorizers and FunctionTransformer that take non-array-like inputs. Hmm.

But we can certainly require input_features for now and extend this without breaking much...

The real solution to this in my opinion is always to include the last step, and allow slicing the pipeline (#2568) to get the feature names for a pipeline with a final supervised step.

I agree. I don't think we can use ducktyping of the last estimator safely. I have also been a proponent of estimator slicing.

Do we want to encode fixed univariate transformations ("scale", "log", "rank"?)

This is hard. Rank and log (and tfidf) seem more important...

Use pandas column names as input_features if available (3b above)

I think for now we need to let the user do this as input to get_feature_names. Perhaps we need to support get_feature_names in FunctionTransformer too somehow.

@amueller (Member Author)

Taking the tests would be good. What part of the implementation do you think is relevant? I guess adding get_feature_names to SelectorMixin is the part that's not here yet.

And there would be tricky cases with vectorizers and FunctionTransformer that take non-array-like inputs. Hmm.

can you elaborate?
For Vectorizers, I think 1 is a good number of input features ;) [or 0 possibly]. There's only so much we can do in FunctionTransformer if the functions are completely arbitrary, I think, and we shouldn't let that dictate our API.
If validate=True we can probably do most things because X will be 2d, right?

@amueller (Member Author)

I'd also be happy to add n_features_in_ and n_features_out_

@jnothman (Member)

Yes, FunctionTransformer is not so much of a problem... It just needs to run a custom feature names function on X at fit time and store the results.
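
A sketch of that idea (the feature_names_func parameter is hypothetical, not an existing FunctionTransformer argument):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class NamedFunctionTransformer(BaseEstimator, TransformerMixin):
    """Toy FunctionTransformer-like class with a user-supplied names function."""

    def __init__(self, func, feature_names_func=None):
        self.func = func
        self.feature_names_func = feature_names_func

    def fit(self, X, y=None):
        # run the custom feature names function on X at fit time
        # and store the result, as suggested above
        if self.feature_names_func is not None:
            self.feature_names_ = np.asarray(self.feature_names_func(X))
        return self

    def transform(self, X):
        return self.func(X)

    def get_feature_names(self, input_features=None):
        return getattr(self, "feature_names_", input_features)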

@jorisvandenbossche (Member)

Cool! Some quick feedback on the top-level explanation:

To me there are four main options for interfaces to enable this:

You also said it later in the top comment, but the third option you mention here, about tighter integration with pandas, is not really an alternative but rather a possible extension, right? Nothing in the current PR prevents adding this later, I think?
(I think this would be really nice, personally, but it's indeed a good idea to keep this for another PR.)

I suggest we limit get_feature_names to transformers that either:

Just to clarify: is there a fourth case missing in that list for the simple transformer that doesn't change the number of inputs (like StandardScaler -> your OneToOneMixin)?


And then one actual implementation-related comment:

One problem with the current implementation of ColumnTransformer.get_feature_names is that it uses the columns from the transformer tuple specification. But this doesn't necessarily need to be the column names (it can also be integer locations or a boolean mask).
That's a problem in general (also if you have dataframes as input but use a mask as column specifier), but also for having it work with arrays; the passed input_features should be used there as well, I think. It also makes it a bit inconsistent with the other get_feature_names methods, which all have an input_features parameter.

@amueller (Member Author)

Just to clarify: is there a fourth case missing in that list for the simple transformer that doesn't change the number of inputs (like StandardScaler -> your OneToOneMixin)?

Added that.

That's a problem in general (also if you have dataframes as input but use a mask as column specifier), but also for having it work with arrays; the passed input_features should be used there as well, I think?

yes.

@thomasjpfan (Member)

I placed this comment in the wrong issue before:

After reviewing all the approaches for getting the feature names through, I think the following approach, combined with this PR, could work:

Fitting a pipeline

clf = MyPipeline([
    ('recorder', ColumnNameRecorder()),
    ('preprocessor', preprocessor),
    ('selector', MySelectKBest(k=5)),
    ('classifier', LogisticRegression())])
_ = clf.fit(X, y)

where the ColumnNameRecorder just saves the column names:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnNameRecorder(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        # remember the column names if X is a DataFrame
        if hasattr(X, "columns"):
            self.feature_names_in_ = np.asarray(X.columns)
        else:
            self.feature_names_in_ = None
        return self

    def transform(self, X, y=None):
        return X

    def get_feature_names(self, input_features=None):
        # falls back to input_features if no names were seen in fit
        if self.feature_names_in_ is None:
            return input_features
        return self.feature_names_in_

Get input features of classifier

clf[:-1].get_feature_names()
# array(['num__fare', 'cat__sex_female', 'cat__sex_male', 'cat__pclass_1.0',
#       'cat__pclass_3.0'], dtype=object)

# if we want some sugar this can work as well
clf[:'classifier'].get_feature_names()

Get input features of selector

clf[:-2].get_feature_names()
# array(['num__age', 'num__fare', 'cat__embarked_C', 'cat__embarked_Q',
#      'cat__embarked_S', 'cat__embarked_missing', 'cat__sex_female',
#      'cat__sex_male', 'cat__pclass_1.0', 'cat__pclass_2.0',
#      'cat__pclass_3.0'], dtype=object)

# with more sugar
clf[:'selector'].get_feature_names()

The benefits of this approach are:

  1. Individual estimators do not need to store the feature names. The feature names are only stored in one object, ColumnNameRecorder, which means they are stored in the model itself.
  2. Users do not need to pass input feature names into get_feature_names, and thus do not need to keep track of them after fitting.
  3. With a pipeline, a user has the flexibility of getting the input features of any step in the pipeline.
  4. If a user only needs the feature-names behavior, they can create a pipeline with two steps: ColumnNameRecorder and an estimator.

Here is a quick implementation of this idea. (Note: the implementation uses n_features_in_ to generate default input feature names when input_features=None.)

(We can also come up with better names for get_feature_names and ColumnNameRecorder.)

@amueller (Member Author) commented Feb 18, 2020

This PR is very incomplete and doesn't work. You can check out #13307 for all the issues that happen if you actually try to do this.
You're assuming a pipeline. What if the outermost layer is any other meta-estimator?

Another issue is the distinction between input and output features: get_feature_names produces output features. Say you do BaggingClassifier(LogisticRegression(), max_features=5): if you want the input feature names of any of the models, what do you call, and where?

@@ -689,6 +690,45 @@ def fit_transform(self, X, y=None, **fit_params):
            # fit method of arity 2 (supervised transformation)
            return self.fit(X, y, **fit_params).transform(X)

    def get_feature_names(self, input_features=None):
amueller (Member Author):

We can push this down if people think having it here is ugly.

            # because n_components_ means something else
            # in agglomerative clustering
            n_features = self.n_clusters
        elif hasattr(self, '_max_components'):
amueller (Member Author):

whoops this can be removed, it's in the class now

@thomasjpfan (Member)

Can we make a small step with this PR by splitting out a smaller PR that adds get_feature_names to all transformers? For simplicity we can restrict it to non-meta-estimators.

@amueller (Member Author) commented Sep 16, 2020

Sounds good! Do you want to take the lead? Or do you want me to simplify this one?

@amueller removed the "Superseded" label (PR has been replaced by a newer PR) on Sep 16, 2020
Base automatically changed from master to main January 22, 2021 10:50
@amueller (Member Author)

Superseded by #18444.

Successfully merging this pull request may close these issues.

Transformative get_feature_names for various transformers