Closed
Description
Describe the workflow you want to enable
The current version of Pipeline.get_feature_names_out() iterates through its transformers but does not pass the output of the previous transformer's get_feature_names_out() to the next transformer as its input_features parameter. Every step receives the pipeline-level input_features instead.
As a result, if the pipeline contains a transformer in the middle that generates features (like TfidfVectorizer) and it is followed by a transformer that does not change the set of features (one inheriting from base._OneToOneFeatureMixin), the names of the generated features are lost.
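The mechanism can be illustrated without scikit-learn. The following is a minimal sketch, not library code: VocabularyStep, OneToOneStep and names_without_chaining are hypothetical stand-ins that only mimic how a feature-generating step and a one-to-one step answer get_feature_names_out().

```python
# Toy stand-ins (hypothetical names, not scikit-learn classes).

class VocabularyStep:
    """Generates its own feature names, like a text vectorizer would."""

    def __init__(self, vocabulary):
        self.vocabulary = list(vocabulary)

    def get_feature_names_out(self, input_features=None):
        # The generated names do not depend on the incoming names.
        return list(self.vocabulary)


class OneToOneStep:
    """Keeps the feature set unchanged, so it echoes the incoming names."""

    def __init__(self, n_features):
        self.n_features = n_features

    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            # No names received: fall back to generic placeholders.
            return [f"x{i}" for i in range(self.n_features)]
        return list(input_features)


def names_without_chaining(steps, input_features=None):
    """Every step receives the pipeline-level input_features."""
    feature_names = input_features
    for step in steps:
        feature_names = step.get_feature_names_out(input_features)
    return feature_names


steps = [VocabularyStep(["feature", "pipeline", "test"]), OneToOneStep(3)]
print(names_without_chaining(steps))  # ['x0', 'x1', 'x2']
```

Because the last step never sees the vocabulary produced before it, it can only return placeholder names.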
Check the example:
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import _OneToOneFeatureMixin, BaseEstimator, TransformerMixin
from datetime import datetime as dt
import sklearn


class ShowInfoTransformer(BaseEstimator, TransformerMixin, _OneToOneFeatureMixin):
    def __init__(self, message):
        self.message = message
        self.n_features_in_ = 1

    def fit(self, x, y=None):
        print(self.message, dt.now())
        self._check_n_features(x, reset=True)
        if hasattr(x, 'shape'):
            print(f'... shape X: {x.shape}')
        else:
            print(f'... len X: {len(x)}')
        return self

    @staticmethod
    def transform(x, y=None):
        return x


sklearn.show_versions()
texts = ["pipeline feature test", "transformer works well"]

p = make_pipeline(ShowInfoTransformer('Starts 1'), TfidfVectorizer())
p.fit(texts)
print("Names are ok", p.get_feature_names_out())

p1 = make_pipeline(ShowInfoTransformer('Starts 2'), TfidfVectorizer(), ShowInfoTransformer('Finish 2'))
p1.fit(texts)
print("Names are not ok", p1.get_feature_names_out())
Output:
System:
    python: 3.8.6 (tags/v3.8.6:db45529, Sep 23 2020, 15:52:53) [MSC v.1927 64 bit (AMD64)]
    executable: C:\Work\Projects\Experimental\PDF\venv\Scripts\python.exe
    machine: Windows-10-10.0.19041-SP0

Python dependencies:
    pip: 21.2.4
    setuptools: 58.2.0
    sklearn: 1.0
    numpy: 1.21.2
    scipy: 1.7.1
    Cython: None
    pandas: 1.3.3
    matplotlib: None
    joblib: 1.1.0
    threadpoolctl: 3.0.0

Built with OpenMP: True

Starts 1 2021-10-16 15:14:32.658655
... len X: 2
Names are ok ['feature' 'pipeline' 'test' 'transformer' 'well' 'works']
Starts 2 2021-10-16 15:14:32.659606
... len X: 2
Finish 2 2021-10-16 15:14:32.661607
... shape X: (2, 6)
Names are not ok ['x0' 'x1' 'x2' 'x3' 'x4' 'x5']
Describe your proposed solution
A minor modification of Pipeline.get_feature_names_out() resolves the issue: initialize feature_names with input_features, then pass the names returned by each step to the next step.
Before:
def get_feature_names_out(self, input_features=None):
    """Get output feature names for transformation.

    Transform input features using the pipeline.

    Parameters
    ----------
    input_features : array-like of str or None, default=None
        Input features.

    Returns
    -------
    feature_names_out : ndarray of str objects
        Transformed feature names.
    """
    for _, name, transform in self._iter():
        if not hasattr(transform, "get_feature_names_out"):
            raise AttributeError(
                "Estimator {} does not provide get_feature_names_out. "
                "Did you mean to call pipeline[:-1].get_feature_names_out"
                "()?".format(name)
            )
        feature_names = transform.get_feature_names_out(input_features)
    return feature_names
After:
def get_feature_names_out(self, input_features=None):
    """Get output feature names for transformation.

    Transform input features using the pipeline.

    Parameters
    ----------
    input_features : array-like of str or None, default=None
        Input features.

    Returns
    -------
    feature_names_out : ndarray of str objects
        Transformed feature names.
    """
    feature_names = input_features
    for _, name, transform in self._iter():
        if not hasattr(transform, "get_feature_names_out"):
            raise AttributeError(
                "Estimator {} does not provide get_feature_names_out. "
                "Did you mean to call pipeline[:-1].get_feature_names_out"
                "()?".format(name)
            )
        feature_names = transform.get_feature_names_out(feature_names)
    return feature_names
Describe alternatives you've considered, if relevant
No response
Additional context
No response