Pipeline.get_feature_names_out() to push feature names from previous to next transformer #21349

@serhit

Describe the workflow you want to enable

The current implementation of Pipeline.get_feature_names_out() iterates over the pipeline's transformers but does not pass the output of one transformer's get_feature_names_out() to the next transformer as its input_features argument. Every step receives the pipeline's original input_features instead.

As a result, if a transformer in the middle of the pipeline generates features (like TfidfVectorizer) and is followed by a transformer that does not change the set of features (one inheriting from base._OneToOneFeatureMixin), the names of the generated features are lost.

The following example reproduces the problem:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import _OneToOneFeatureMixin, BaseEstimator, TransformerMixin
from datetime import datetime as dt
import sklearn


class ShowInfoTransformer(BaseEstimator, TransformerMixin, _OneToOneFeatureMixin):
    def __init__(self, message):
        self.message = message
        self.n_features_in_ = 1

    def fit(self, x, y=None):
        print(self.message, dt.now())
        self._check_n_features(x, reset=True)
        if hasattr(x, 'shape'):
            print(f'... shape X: {x.shape}')
        else:
            print(f'... len X: {len(x)}')

        return self

    @staticmethod
    def transform(x, y=None):
        return x


sklearn.show_versions()

texts = ["pipeline feature test", "transformer works well"]

p = make_pipeline(ShowInfoTransformer('Starts 1'), TfidfVectorizer())
p.fit(texts)
print("Names are ok", p.get_feature_names_out())

p1 = make_pipeline(ShowInfoTransformer('Starts 2'), TfidfVectorizer(), ShowInfoTransformer('Finish 2'))
p1.fit(texts)
print("Names are not ok", p1.get_feature_names_out())

Output:

System:
    python: 3.8.6 (tags/v3.8.6:db45529, Sep 23 2020, 15:52:53) [MSC v.1927 64 bit (AMD64)]
executable: C:\Work\Projects\Experimental\PDF\venv\Scripts\python.exe
   machine: Windows-10-10.0.19041-SP0

Python dependencies:
          pip: 21.2.4
   setuptools: 58.2.0
      sklearn: 1.0
        numpy: 1.21.2
        scipy: 1.7.1
       Cython: None
       pandas: 1.3.3
   matplotlib: None
       joblib: 1.1.0
threadpoolctl: 3.0.0

Built with OpenMP: True
Starts 1 2021-10-16 15:14:32.658655
... len X: 2
Names are ok ['feature' 'pipeline' 'test' 'transformer' 'well' 'works']
Starts 2 2021-10-16 15:14:32.659606
... len X: 2
Finish 2 2021-10-16 15:14:32.661607
... shape X: (2, 6)
Names are not ok ['x0' 'x1' 'x2' 'x3' 'x4' 'x5']

Describe your proposed solution

A minor modification of Pipeline.get_feature_names_out() may resolve the issue: initialize feature_names from input_features and pass it on to each step in turn.

Before:

    def get_feature_names_out(self, input_features=None):
        """Get output feature names for transformation.

        Transform input features using the pipeline.

        Parameters
        ----------
        input_features : array-like of str or None, default=None
            Input features.

        Returns
        -------
        feature_names_out : ndarray of str objects
            Transformed feature names.
        """
        for _, name, transform in self._iter():
            if not hasattr(transform, "get_feature_names_out"):
                raise AttributeError(
                    "Estimator {} does not provide get_feature_names_out. "
                    "Did you mean to call pipeline[:-1].get_feature_names_out"
                    "()?".format(name)
                )
            feature_names = transform.get_feature_names_out(input_features)
        return feature_names

After:

    def get_feature_names_out(self, input_features=None):
        """Get output feature names for transformation.

        Transform input features using the pipeline.

        Parameters
        ----------
        input_features : array-like of str or None, default=None
            Input features.

        Returns
        -------
        feature_names_out : ndarray of str objects
            Transformed feature names.
        """
        feature_names = input_features
        for _, name, transform in self._iter():
            if not hasattr(transform, "get_feature_names_out"):
                raise AttributeError(
                    "Estimator {} does not provide get_feature_names_out. "
                    "Did you mean to call pipeline[:-1].get_feature_names_out"
                    "()?".format(name)
                )
            feature_names = transform.get_feature_names_out(feature_names)
        return feature_names
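
To make the difference between the two loops easier to see without installing scikit-learn, here is a minimal sketch using toy stand-ins. ToyVectorizer, ToyPassthrough, names_without_chaining and names_with_chaining are hypothetical names invented for this illustration; they only mimic the get_feature_names_out(input_features) contract and are not scikit-learn classes.

```python
class ToyVectorizer:
    """Hypothetical stand-in for a feature-generating step (e.g. a text vectorizer)."""

    def __init__(self, vocabulary):
        self.vocabulary = sorted(vocabulary)

    def get_feature_names_out(self, input_features=None):
        # Generates its own names, so input_features is ignored.
        return list(self.vocabulary)


class ToyPassthrough:
    """Hypothetical stand-in for a one-to-one step that keeps features unchanged."""

    def __init__(self, n_features):
        self.n_features = n_features

    def get_feature_names_out(self, input_features=None):
        # Without input names, fall back to generic x0, x1, ... names.
        if input_features is None:
            return [f"x{i}" for i in range(self.n_features)]
        return list(input_features)


def names_without_chaining(steps, input_features=None):
    # Mirrors the "Before" loop: every step sees the original input_features.
    for step in steps:
        feature_names = step.get_feature_names_out(input_features)
    return feature_names


def names_with_chaining(steps, input_features=None):
    # Mirrors the "After" loop: each step sees the previous step's output names.
    feature_names = input_features
    for step in steps:
        feature_names = step.get_feature_names_out(feature_names)
    return feature_names


vocabulary = ["pipeline", "feature", "test", "transformer", "works", "well"]
steps = [ToyPassthrough(1), ToyVectorizer(vocabulary), ToyPassthrough(6)]

before = names_without_chaining(steps)
after = names_with_chaining(steps)
print("Without chaining:", before)
print("With chaining:   ", after)
```

In this sketch the last one-to-one step only returns the vocabulary when it receives the names produced by the step before it, which is the behavior the proposed change is meant to enable.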

Describe alternatives you've considered, if relevant

No response

Additional context

No response
