get_feature_names support for pipelines #2007
Conversation
Interesting... I am not sure I'm happy with this meaning of get_feature_names:

    def get_feature_names(self):
        """Assuming a single transformer in the `Pipeline` has `get_feature_names`,
        call it and transform its result through the remainder of the pipeline."""
        names = None
        for name, step in self.steps:  # iterate over the pipeline's (name, transformer) pairs
            if hasattr(step, 'get_feature_names'):
                if names is not None:
                    raise ValueError('Multiple steps with get_feature_names')
                names = step.get_feature_names()
            elif names is not None:
                names = step.transform(names)
        if names is None:
            raise ValueError('No step with get_feature_names')
        return names
Just faced this issue again. I still think that feature names could be very helpful. For example, if LinearSVC or LogisticRegression is used for text categorization, it is convenient to look at the features whose coefficients have the largest absolute values to see what the classifier learned and why it is making errors; the feature names that come from CountVectorizer/DictVectorizer/TfidfVectorizer can give this insight. Your get_feature_names trick is very smart :) But it could fail if some step can't transform data in the format feature_names uses. What about adding a 'previous_feature_names' argument to the get_feature_names() functions, and implementing get_feature_names for SelectKBest, SelectPercentile, etc.?
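As an illustration of that debugging workflow, here is a minimal, library-free sketch. The coefficients and feature names below are made up; in a real setting they would come from `clf.coef_` and the vectorizer's `get_feature_names()`:

```python
# Hypothetical stand-ins for clf.coef_ and vectorizer.get_feature_names()
# (invented values, for illustration only).
coefs = [0.1, -2.3, 0.7, 1.9, -0.2]
feature_names = ["the", "spam", "hello", "viagra", "meeting"]

def top_features(coefs, names, k=3):
    """Return the k (name, coef) pairs with the largest |coef|."""
    pairs = sorted(zip(names, coefs), key=lambda nc: abs(nc[1]), reverse=True)
    return pairs[:k]

for name, coef in top_features(coefs, feature_names):
    print(f"{name}: {coef:+.1f}")
# spam: -2.3
# viagra: +1.9
# hello: +0.7
```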
I agree that having to trace feature provenance manually can be a real pain in practice. Maybe we could think about using record arrays as an alternative to regular numpy arrays in some cases, and subclasses of scipy.sparse matrices with string metadata to store the column names. Feature selectors and other transformers that preserve the feature meaning (like scalers) could take care of outputting transformed data structures that preserve this information when it is available on the input.
I didn't have a chance to use record arrays yet, but won't using them incur overhead even if feature names are not interesting to the caller? Passing previous feature_names to get_feature_names functions doesn't have this problem.
I have not tried myself either. My plan is not to make the use of record arrays mandatory, but just to make sure that the info is preserved when available. I think we need to experiment with various options to better understand the practical tradeoffs.
Having played a bit with recarrays, I think supporting them across sklearn as a data format would be an enormous change. For example, numpy does not (and cannot) treat the fields as an axis, so you can't perform vectorised operations without changing the view dtype:

    >>> a = np.array([(0, 1)], dtype=[('a', 'f'), ('b', 'f')])
    >>> a.mean()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: cannot perform reduce with flexible type
    >>> a.view(dtype='f').mean()
    0.5

Yes, it would be very nice to have a way to pass around named columns, but without a wrapper like the one pandas provides, support will not come easily. @kmike wrote:

> Note that if the
Indeed, so recarrays are no solution either... I wish numpy arrays were not a builtin class and allowed plugging in arbitrary custom metadata that we could then update manually where suitable in sklearn...
Perhaps the solution here is that get_feature_names should accept the input feature names as a parameter.
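To make that idea concrete, here is a toy sketch of a selector-like transformer whose get_feature_names takes the input names. The class and its fixed support mask are invented for illustration; a real selector such as SelectKBest would derive the mask from univariate scores (e.g. via its get_support() method):

```python
class ToySelector:
    """Illustrative stand-in for a feature selector such as SelectKBest.

    A real selector would compute `support_` from scores on the data;
    here the mask is fixed so the example stays self-contained.
    """

    def __init__(self, support):
        self.support_ = support  # boolean mask over input columns

    def transform(self, row):
        # Keep only the columns flagged by the mask.
        return [v for v, keep in zip(row, self.support_) if keep]

    def get_feature_names(self, input_feature_names):
        # Map the input names through the same mask used by transform().
        return [n for n, keep in zip(input_feature_names, self.support_) if keep]

sel = ToySelector(support=[True, False, True])
print(sel.get_feature_names(["f0", "f1", "f2"]))  # -> ['f0', 'f2']
print(sel.transform([10, 20, 30]))                # -> [10, 30]
```

The point is that name propagation reuses exactly the masking logic of transform, so names and data can never fall out of sync.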
But taking "input feature names" as a parameter will be awkward for the user who just has e.g. a vectorizer. (The call would be ...) Maybe it should be a new API point.
Closing as the other solutions will fix this. |
Hi,
A Pipeline.get_feature_names() method is added in this pull request. This fixes FeatureUnion.get_feature_names() when one of the transformers is a Pipeline.
I tried to provide an example in the tests. I'm not entirely sure there isn't a better way to write the code in the example - please double-check it.
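For reference, the name-propagation logic discussed in this thread can be sketched with toy objects. All of the classes below are invented stand-ins, not the actual scikit-learn implementation; they only demonstrate how a pipeline can thread feature names through its steps:

```python
class ToyVectorizer:
    """Stand-in for CountVectorizer: knows its output feature names."""
    def get_feature_names(self):
        return ["word_a", "word_b", "word_c"]

class ToyDropper:
    """Stand-in for a selector step: drops the middle column."""
    def transform(self, columns):
        return [c for i, c in enumerate(columns) if i != 1]

class ToyPipeline:
    """Minimal pipeline that propagates feature names through its steps."""
    def __init__(self, steps):
        self.steps = steps  # list of (name, transformer) pairs

    def get_feature_names(self):
        names = None
        for name, step in self.steps:
            if hasattr(step, 'get_feature_names'):
                if names is not None:
                    raise ValueError('Multiple steps with get_feature_names')
                names = step.get_feature_names()
            elif names is not None:
                # Later steps transform the names just as they would the data.
                names = step.transform(names)
        if names is None:
            raise ValueError('No step with get_feature_names')
        return names

pipe = ToyPipeline([("vec", ToyVectorizer()), ("sel", ToyDropper())])
print(pipe.get_feature_names())  # -> ['word_a', 'word_c']
```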