get_feature_names support for pipelines #2007
Conversation
Interesting... I am not sure I'm happy with this meaning of get_feature_names:

    def get_feature_names(self):
        """Assuming a single transformer in the `Pipeline` has `get_feature_names`,
        call it and transform its result through the remainder of the pipeline."""
        names = None
        for name, step in self.steps:  # iterate over the pipeline's (name, transformer) pairs
            if hasattr(step, 'get_feature_names'):
                if names is not None:
                    raise ValueError('Multiple steps with get_feature_names')
                names = step.get_feature_names()
            elif names is not None:
                names = step.transform(names)
        if names is None:
            raise ValueError('No step with get_feature_names')
        return names
Just faced this issue again. I still think that feature names could be very helpful. For example, if LinearSVC or LogisticRegression is used for text categorization, it is convenient to look at the features whose coefficients have the largest absolute values to see what the classifier learned and why it is making errors; the feature names that come from CountVectorizer/DictVectorizer/TfidfVectorizer can give this insight. Your get_feature_names trick is very smart :) But it could fail if some step can't transform data in the format feature_names uses. What about adding a 'previous_feature_names' argument to the get_feature_names() functions, and implementing get_feature_names for SelectKBest, SelectPercentile, etc.?
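As an illustration of that debugging workflow, here is a minimal, library-free sketch. The coefficients and feature names below are made up; in a real setting they would come from `clf.coef_` and the vectorizer's `get_feature_names()`:

```python
# Hypothetical stand-ins for clf.coef_ and vectorizer.get_feature_names()
# (invented values, for illustration only).
coefs = [0.1, -2.3, 0.7, 1.9, -0.2]
feature_names = ["the", "spam", "hello", "viagra", "meeting"]

def top_features(coefs, names, k=3):
    """Return the k (name, coef) pairs with the largest |coef|."""
    pairs = sorted(zip(names, coefs), key=lambda nc: abs(nc[1]), reverse=True)
    return pairs[:k]

for name, coef in top_features(coefs, feature_names):
    print(f"{name}: {coef:+.1f}")
# spam: -2.3
# viagra: +1.9
# hello: +0.7
```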
I agree that having to trace feature provenance manually can be a real pain in practice. Maybe we could think about using record arrays as an alternative to regular numpy arrays in some cases, and subclasses of scipy.sparse matrices with string metadata to store the column names. Feature selectors and other transformers that preserve the feature meaning (like scalers) could take care of outputting transformed data structures that preserve this information when it is available on the input.
I didn't have a chance to use record arrays yet, but won't using them incur overhead even if feature names are not interesting to the caller? Passing previous feature_names to get_feature_names functions doesn't have this problem.
I have not tried myself either. My plan is not to make the use of record arrays mandatory, but just to make sure that the info is preserved when available. I think we need to experiment with various options to better understand the practical tradeoffs.
Having played a bit with recarrays, I think supporting them across sklearn as a data format would be an enormous change. For example, numpy does not (and cannot) treat the fields as an axis, so you can't perform vectorised operations without changing the view dtype:

    >>> a = np.array([(0, 1)], dtype=[('a', 'f'), ('b', 'f')])
    >>> a.mean()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: cannot perform reduce with flexible type
    >>> a.view(dtype='f').mean()
    0.5

Yes, it would be very nice to have a way to pass around named columns, but without a wrapper like the one pandas provides, support will not come easily. @kmike wrote:

> Note that if the
Indeed, so recarrays are no solution either... I wish numpy arrays were not a builtin class and allowed plugging in arbitrary custom metadata that we could then update manually where suitable in sklearn...
Perhaps the solution here is that get_feature_names should accept the input feature names as a parameter.
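To make that idea concrete, here is a toy sketch of a selector-like transformer whose get_feature_names takes the input names. The class and its fixed support mask are invented for illustration; a real selector such as SelectKBest would derive the mask from univariate scores (e.g. via its get_support() method):

```python
class ToySelector:
    """Illustrative stand-in for a feature selector such as SelectKBest.

    A real selector would compute `support_` from scores on the data;
    here the mask is fixed so the example stays self-contained.
    """

    def __init__(self, support):
        self.support_ = support  # boolean mask over input columns

    def transform(self, row):
        # Keep only the columns flagged by the mask.
        return [v for v, keep in zip(row, self.support_) if keep]

    def get_feature_names(self, input_feature_names):
        # Map the input names through the same mask used by transform().
        return [n for n, keep in zip(input_feature_names, self.support_) if keep]

sel = ToySelector(support=[True, False, True])
print(sel.get_feature_names(["f0", "f1", "f2"]))  # -> ['f0', 'f2']
print(sel.transform([10, 20, 30]))                # -> [10, 30]
```

The point is that name propagation reuses exactly the masking logic of transform, so names and data can never fall out of sync.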
But taking "input feature names" as a parameter will be awkward for the user who just has e.g. a vectorizer. (The call would be ...) Maybe it should be a new API point.
Closing as the other solutions will fix this. |
Hi,
A Pipeline.get_feature_names() method is added in this pull request. This fixes FeatureUnion.get_feature_names() when one of the transformers is a Pipeline.
I tried to provide an example in the tests. I'm not entirely sure there isn't a better way to write the code in the example - please double-check it.
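For reference, the name-propagation logic discussed in this thread can be sketched with toy objects. All of the classes below are invented stand-ins, not the actual scikit-learn implementation; they only demonstrate how a pipeline can thread feature names through its steps:

```python
class ToyVectorizer:
    """Stand-in for CountVectorizer: knows its output feature names."""
    def get_feature_names(self):
        return ["word_a", "word_b", "word_c"]

class ToyDropper:
    """Stand-in for a selector step: drops the middle column."""
    def transform(self, columns):
        return [c for i, c in enumerate(columns) if i != 1]

class ToyPipeline:
    """Minimal pipeline that propagates feature names through its steps."""
    def __init__(self, steps):
        self.steps = steps  # list of (name, transformer) pairs

    def get_feature_names(self):
        names = None
        for name, step in self.steps:
            if hasattr(step, 'get_feature_names'):
                if names is not None:
                    raise ValueError('Multiple steps with get_feature_names')
                names = step.get_feature_names()
            elif names is not None:
                # Later steps transform the names just as they would the data.
                names = step.transform(names)
        if names is None:
            raise ValueError('No step with get_feature_names')
        return names

pipe = ToyPipeline([("vec", ToyVectorizer()), ("sel", ToyDropper())])
print(pipe.get_feature_names())  # -> ['word_a', 'word_c']
```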