RFC Implement Pipeline get feature names #12627
Conversation
@@ -336,8 +336,12 @@ def get_feature_names(self):
                raise AttributeError("Transformer %s (type %s) does not "
                                     "provide get_feature_names."
                                     % (str(name), type(trans).__name__))
            try:
this is duck typing to support both transformative and non-transformative get_feature_names.
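For illustration, the duck typing could look like this (a hypothetical helper, not necessarily the PR's exact code): try the transformative signature first and fall back to the zero-argument one used e.g. by the vectorizers.

def _transformer_feature_names(trans, input_features):
    # Hypothetical helper supporting both get_feature_names signatures.
    try:
        # "transformative" signature: maps input names to output names
        return trans.get_feature_names(input_features)
    except TypeError:
        # non-transformative signature: takes no input names
        return trans.get_feature_names()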
@@ -531,6 +531,20 @@ def _pairwise(self):
        # check if first estimator expects pairwise input
        return getattr(self.steps[0][1], '_pairwise', False)

    def get_feature_names(self, input_features=None):
this is the actual implementation that enables everything. It's pretty short, and would be even shorter if we force get_feature_names to accept input_features (which I think we should).
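For illustration, the core could be as short as threading the names through the fitted steps (a sketch assuming every transformer accepts input_features, and ignoring the final-estimator question discussed below):

def get_feature_names(self, input_features=None):
    # Sketch: fold feature names through each step of the pipeline.
    feature_names = input_features
    for name, transform in self.steps:
        if transform is None or transform == 'passthrough':
            continue
        feature_names = transform.get_feature_names(feature_names)
    return feature_names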
sklearn/pipeline.py (outdated)
        Transformed feature names
        """
        feature_names = input_features
        with_final = hasattr(self._final_estimator, "transform")
This is pretty controversial
do you have an alternative that's not slicing?
The other option would be to give all supervised models a pass-through get_feature_names, maybe?
I don't like the idea of get_feature_names sometimes being about input and sometimes about output. We need to consider some kind of interface for slicing if we switch to a new cloning Pipeline anyway (because with the current Pipeline you can do it easily, just without syntactic sugar).
yeah I also think it's a bit confusing. But the most common use-case should be syntactically nice.
Maybe like pipe.pop() or something.
Have you thought about slicing cloned pipelines?
The "easiest" way I could think of was freezing all steps, making the sliced pipeline immutable. That's probably what we want, right? Creating a clone of the sliced pipeline in case someone wants to refit it should be pretty simple, right?
I didn't think doing this would link to slicing pipelines and freezing. I thought this one was the easy one :-/ Though I guess we really need the freezing "only" once we have the cloning pipeline. So it might be possible to implement this without solving the other issues first?
Also, raising an error with non-transformers at the end of a pipeline is perfectly good for a start.
I think that would be a very ugly solution, because it requires the user to either specify the pipeline in a weird way or to slice after the fact.
Would you allow fitting on a "view" of that slice? That might lead to unexpected consequences - though I guess not if cloning happens in fit.
See #8448 (comment)
@jnothman Can you explain a bit more why you find this controversial?
Is it the way the last estimator is skipped, or the actual skipping itself? If the latter: do you think it should be an explicit action of the user to slice the full pipeline to remove the last estimator, in order to get the names of the features used in the last estimator?
My understanding is that @jnothman would like the user to explicitly skip the last estimator, say via slicing. I think that actually makes more logical sense.
There is really no way for the pipeline to know whether the user meant to include the last model or not. For example LinearDiscriminantAnalysis has a transform and a predict, same for KMeans, and the pipeline doesn't know if the user is using predict on the pipeline (in which case they wanted the input to the last step) or transform (in which case they wanted the output of the last step).
I addressed this comment, now that we have slicing ;)
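Usage could then look like this (hypothetical toy data; assumes the get_feature_names from this PR on a sliced pipeline):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"sex": ["f", "m", "f"], "pclass": ["1", "3", "2"]})
y = [1, 0, 1]

pipe = make_pipeline(OneHotEncoder(), LogisticRegression()).fit(X, y)
# Explicitly slice off the final classifier, then ask the remaining
# all-transformer pipeline for its output feature names.
names = pipe[:-1].get_feature_names()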
You are welcome to reuse my implementations and tests from eli5, if you like. https://github.com/TeamHG-Memex/eli5/blob/master/eli5/sklearn/transform.py
No, it is not. But the singledispatch approach in eli5 is more suggestive that users can create their own feature name functions...
I think this is sensible. Btw, I'm a fan of introducing it; I have proposed similar before. I can't imagine it would be very hard to implement, and to enforce in a common test, except perhaps where it applies to meta-estimators. ... And there would be tricky cases with vectorizers and FunctionTransformer that take non-array-like inputs. Hmm. But we can certainly require …
I agree. I don't think we can use duck typing of the last estimator safely. I have also been a proponent of estimator slicing.
This is hard. Rank and log (and tfidf) seem more important...
I think for now we need to let the user do this as input to …
Taking the tests would be good. What part of the implementation do you think is relevant? I guess adding …
can you elaborate?
I'd also be happy to add …
Yes, FunctionTransformer is not so much of a problem... It just needs to run a custom feature names function on X at fit time and store the results.
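A sketch of that idea (NamedFunctionTransformer and names_func are hypothetical, not existing sklearn API):

import numpy as np
from sklearn.preprocessing import FunctionTransformer

class NamedFunctionTransformer(FunctionTransformer):
    """Sketch: a FunctionTransformer that computes feature names at fit time.

    names_func is a hypothetical user-supplied callable mapping the
    training data to a list of output feature names.
    """
    def __init__(self, func=None, names_func=None):
        super().__init__(func=func)
        self.names_func = names_func

    def fit(self, X, y=None):
        super().fit(X)
        if self.names_func is not None:
            # run the custom feature-names function on X and store the result
            self.feature_names_ = np.asarray(self.names_func(X))
        return self

    def get_feature_names(self, input_features=None):
        return self.feature_names_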
Cool! Some quick feedback on the top-level explanation:
You also said it later in the top comment as well, but the third option you mention here about tighter integration with pandas is not an alternative, right, but rather a possible extension? Nothing in the current PR prevents adding this later, I think?
Just to clarify: is there a fourth case missing in that list, for the simple transformer that doesn't change the number of inputs (like StandardScaler -> your OneToOneMixin)? And then one actual implementation-related comment: one problem with the current implementation in the …
Added that.
yes.
# Conflicts:
#	sklearn/base.py
#	sklearn/impute.py
#	sklearn/preprocessing/data.py
I placed this comment in the wrong issue before: After reviewing all the approaches to get the feature names through, I think the following approach combined with this PR could work:

Fitting a pipeline:

clf = MyPipeline([
    ('recorder', ColumnNameRecorder()),
    ('preprocessor', preprocessor),
    ('selector', MySelectKBest(k=5)),
    ('classifier', LogisticRegression())])
_ = clf.fit(X, y)

where the recorder is defined as:

class ColumnNameRecorder(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        if hasattr(X, "columns"):
            self.feature_names_in_ = np.asarray(X.columns)
        return self

    def transform(self, X, y=None):
        return X

    def get_feature_names(self, input_features=None):
        # uses input_features if self.feature_names_in_ is None
        return self.feature_names_in_

Get input features of the classifier:

clf[:-1].get_feature_names()
# array(['num__fare', 'cat__sex_female', 'cat__sex_male', 'cat__pclass_1.0',
#        'cat__pclass_3.0'], dtype=object)

# if we want some sugar this can work as well
clf[:'classifier'].get_feature_names()

Get input features of the selector:

clf[:-2].get_feature_names()
# array(['num__age', 'num__fare', 'cat__embarked_C', 'cat__embarked_Q',
#        'cat__embarked_S', 'cat__embarked_missing', 'cat__sex_female',
#        'cat__sex_male', 'cat__pclass_1.0', 'cat__pclass_2.0',
#        'cat__pclass_3.0'], dtype=object)

# with more sugar
clf[:'selector'].get_feature_names()

The benefits of this approach are: …

Here is a quick implementation of this idea. (Note the implementation uses …) (We can also come up with a better name for …)
This PR is very incomplete and doesn't work. You can check out #13307 for all the issues that happen if you actually try to do this. Another issue is the distinction between input and output feature names.
…it-learn into pipeline_get_feature_names
@@ -689,6 +690,45 @@ def fit_transform(self, X, y=None, **fit_params):
        # fit method of arity 2 (supervised transformation)
        return self.fit(X, y, **fit_params).transform(X)

    def get_feature_names(self, input_features=None):
We can push this down if people think having it here is ugly.
            # because n_components_ means something else
            # in agglomerative clustering
            n_features = self.n_clusters
        elif hasattr(self, '_max_components'):
whoops this can be removed, it's in the class now
Can we make a small step with this PR by having a smaller PR adding …?
Sounds good! Do you want to take the lead? Or do you want me to simplify this one?
Superseded by #18444.
Reference Issues/PRs
This is a draft implementation of #6424.
It doesn't really introduce anything new in the API, but I'm happy to move this to a SLEP.
Below is an initial description. Happy to include feedback in the SLEP.
Fixes #6425
What does this implement/fix? Explain your changes.
The main idea of this is to make compound scikit-learn estimators less opaque by providing "feature names" as strings.
Motivation
We've been making it easier to build complex workflows with the ColumnTransformer and I expect it will find wide adoption. However, using it results in very opaque models, even more so than before.
We have a great usage example in the gallery that applies a classifier to the titanic data set. To me this is a very simple, standard use case.
Markdown doesn't let me paste this as details, so just look here:
https://scikit-learn.org/dev/auto_examples/compose/plot_column_transformer_mixed_types.html
However, it's impossible to interpret or even sanity-check the LogisticRegression instance that's produced here, because the correspondence of the coefficients to the input features is basically impossible to figure out.

This PR enables using get_feature_names to obtain the semantics for the coefficients. I think this is essential information in any machine learning workflow, and it's imperative that we allow the user to get to this information in some way.
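Concretely, the kind of inspection this enables (a sketch; clf is assumed to be the fitted titanic pipeline from the linked example, sliced as proposed below):

# Pair each coefficient of the final LogisticRegression with the name
# of the preprocessed feature it applies to.
feature_names = clf[:-1].get_feature_names()
coefs = clf[-1].coef_.ravel()
for name, coef in sorted(zip(feature_names, coefs), key=lambda t: -abs(t[1])):
    print("%s: %.3f" % (name, coef))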
The proposed API adds a method get_feature_names to all supported (see below) transformers, with a (possibly optional, see below) parameter "input features", which is an array-like of strings.

Alternative Interfaces

To me there are four main options for interfaces to enable this:

1) get_feature_names as in this PR
2) …
3) tighter integration with pandas
   a) to output feature semantics
   b) to determine feature semantics
4) no dedicated API, leaving users to track feature semantics themselves

While I think 2) and 3a) are valid options for the future, I think trying to implement them now will probably result in a gridlock and/or take too much time. I think we should iterate and provide something that solves the 80% use case quickly. We can create a more elaborate solution later, in particular since this proposal/PR doesn't introduce any concepts that are not in sklearn already.

3b) is discussed below.

I don't think 4) is a realistic option. I assume we can agree that the titanic example above is a valid use case, and that getting the semantics of features is important. Below is the code that the user would have to write to do this themselves. This will become even harder in the future if the pipeline does cloning.
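Roughly, the manual bookkeeping looks like this (a hedged reconstruction; the step names 'preprocessor', 'num', 'cat' and 'onehot' are assumed from the gallery example):

# Manually reconstructing feature names for the titanic pipeline,
# without any get_feature_names support on Pipeline/ColumnTransformer.
ct = clf.named_steps['preprocessor']            # the ColumnTransformer
feature_names = []
for name, trans, cols in ct.transformers_:
    if name == 'num':
        # the numeric sub-pipeline keeps one output column per input
        feature_names.extend(cols)
    elif name == 'cat':
        # dig the OneHotEncoder out of the categorical sub-pipeline
        ohe = trans.named_steps['onehot']
        feature_names.extend(ohe.get_feature_names(cols))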
Scope

I suggest we limit get_feature_names to transformers that either: …

Also, I want the string to only convey presence or absence of features, or constant functions of the features. So scaling would not change a feature name, while a log-transformation (or polynomial) might. This limits the complexity of the string (but also its usefulness somewhat).

Together, these mean that there will be no support for multivariate transformations like PCA or NMF or KMeans.
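Illustrative examples of names under this scope (not output from the PR):

# In scope: presence/absence and subsets of features.
#   StandardScaler:   ['age', 'fare'] -> ['age', 'fare']   (unchanged)
#   SelectKBest(k=1): ['age', 'fare'] -> ['fare']          (subset)
#   OneHotEncoder:    ['sex']         -> ['sex_female', 'sex_male']
# Out of scope: multivariate combinations of columns.
#   PCA(n_components=2): ['age', 'fare'] -> no meaningful per-input names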
Implementation

Given the above scope and API, and the current implementation of get_feature_names in ColumnTransformer, there are two main mechanisms that need to be implemented.

There are basically three cases the meta-estimators need to take care of:
a) The transformer does a non-trivial column transformation, like OneHotEncoder or feature selection.
b) The transformer does "nothing" to the columns, like StandardScaler.
c) The transformer does a "too complex" operation on the columns, like PCA.

For a), only the estimator can handle this case, so the estimator needs to provide a function to do that - already implemented in several cases as a transformative get_feature_names. For b), the meta-estimator can simply do a pass-through, so we need to "flag" these in some way. There is no way for the meta-estimator to really handle c), so if the estimator is not "tagged" as being trivial and doesn't implement get_feature_names, the meta-estimator needs to bail in some way.

I added a "OneToOneMixin" to tag the trivial transformations. It would be possible to just use this as a tag, and let the meta-estimators handle the pass-through. Given that we already have the mechanism to handle the pass-through, I thought it would be simpler to just implement a pass-through get_feature_names (another alternative would be to add an estimator tag, but that also seems less elegant).

Right now the bail in case c) is a TypeError.
Limitations

The general API requires "input features". In PolynomialFeatures this was optional. Unfortunately we have no way to know the input dimensionality of a fitted transformer in general, so automatically generating x1, x2, etc. is not possible. This could be fixed by adding a required n_features_ to the API, which would probably be helpful but also would be a relatively heavy addition.

Because we don't know the number of input features, there's no way to ensure the user passed the right length of input_features.

The implementation of get_feature_names in Pipeline is a hack, because it includes or excludes the last step based on whether the last step has transform. The reason for this is that given a trained pipeline with a classifier at the end, I want to be able to get the feature names, which would not include the last step. In preprocessing pipelines we always want to include all the steps, though. The real solution to this, in my opinion, is to always include the last step and allow slicing the pipeline ([MRG+1] Pipeline can now be sliced or indexed #2568) to get the feature names for a pipeline with a final supervised step.

Bailing to a TypeError if any "complex" transformation happens is a bit of a bummer. We could try to generate names like pca_1, pca_2, ... but to do this automatically we would need to know the output dimensionality, which we don't (unless we add n_outputs_ as a required attribute to the API, similar to n_features_ above).
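With those attributes in place, generating default names would be straightforward (a sketch; n_features_ and n_outputs_ are the proposed, not existing, API):

import numpy as np

def _default_input_names(est):
    # requires the proposed n_features_ attribute
    return np.array(['x%d' % i for i in range(est.n_features_)])

def _default_output_names(est, prefix):
    # requires the proposed n_outputs_ attribute, e.g. pca_1, pca_2, ...
    return np.array(['%s_%d' % (prefix, i + 1) for i in range(est.n_outputs_)])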
above)Open Questions
Do we want to require
get_feature_names
to acceptinput_features
? Right now the vectorizers don't and it makes the code slightly more complex.How do we want to handle the hack in Pipeline.get_feature_names for the last step?
Do we want to encode fixed univariate transformations ("scale", "log", "rank"?)
Possible Extensions

- Add n_features_ to the API, so we don't need to require input_features and can generate names.
- Add n_outputs_ to the API.
- Get input_features from pandas column names if available (3b above).

I already discussed the requirements for the first two extensions (adding n_features_ and n_outputs_). The last one would require storing the input column names if the input is a pandas dataframe. It shouldn't be hard to do, and would also enable solving #7242, and I'd like to do that, but it's not required for this proposal to be useful.
Todo

- … get_feature_names methods?