ENH Adds feature_names_out to preprocessing module #21079

Merged

Conversation

@thomasjpfan (Member) commented Sep 17, 2021

Reference Issues/PRs

Continues #18444

What does this implement/fix? Explain your changes.

This PR adds feature names out for the preprocessing module.

Any other comments?

It feels like Normalizer, OrdinalEncoder, and Binarizer could have made it into 1.0, but it's most likely too late now.
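
For context, a minimal usage sketch of what this PR enables (StandardScaler shown here; the same pattern applies across the preprocessing module, assuming scikit-learn >= 1.0):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0, 1.0], [2.0, 3.0]])
scaler = StandardScaler().fit(X)

# With no input names, generic names are generated.
scaler.get_feature_names_out()
# array(['x0', 'x1'], dtype=object)

# With input names, they are passed through one-to-one.
scaler.get_feature_names_out(input_features=["height", "weight"])
# array(['height', 'weight'], dtype=object)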

@ogrisel (Member) left a comment

LGTM.

@ogrisel (Member) left a comment

LGTM!

@mayer79 (Contributor) commented Oct 19, 2021

@thomasjpfan: Thanks for the work - this will be so useful in practice! Does this also cover FunctionTransformer?

@thomasjpfan (Member, Author) commented

FunctionTransformer will need its own PR, because FunctionTransformer's API is a bit more flexible. For the normal case, FunctionTransformer(np.log), the features are one-to-one. On the other hand, the API can also be used to output multiple features for every input feature:

import numpy as np
from sklearn.preprocessing import FunctionTransformer

def two_columns(X):
    # Each input feature yields two output features: x and 2 * x.
    return np.concatenate([X, 2 * X], axis=1)

transformer = FunctionTransformer(two_columns)
X = np.array([[0, 1], [2, 3]])

transformer.transform(X)
# array([[0, 1, 0, 2],
#        [2, 3, 4, 6]])

I have some thoughts on what API to use for this and it will be in a follow-up PR.
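
For the one-to-one case, a sketch of the behavior in question (this assumes scikit-learn >= 1.1, where the follow-up landed as the feature_names_out="one-to-one" option):

import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Assumes scikit-learn >= 1.1; feature_names_out="one-to-one" passes the
# input feature names through unchanged.
log_transformer = FunctionTransformer(np.log, feature_names_out="one-to-one")
log_transformer.fit(np.array([[1.0, 2.0], [3.0, 4.0]]))

log_transformer.get_feature_names_out(["a", "b"])
# array(['a', 'b'], dtype=object)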

@adrinjalali (Member) left a comment

Nit, otherwise LGTM

@@ -2268,6 +2269,26 @@ def transform(self, K, copy=True):

return K

def get_feature_names_out(self, input_features=None):
Member:

Aren't the contents of this method copy/pasted across every transformer that uses the [f"{class_name}{i}" for i in range(self.n_features_in_)] pattern? Shouldn't we move it to a _ClassNameFeatureNameMixin kind of thing?
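
A minimal sketch of what such a mixin could look like (the name _ClassNameFeatureNameMixin and the Demo class are hypothetical; the helper scikit-learn eventually adopted differs in naming and details):

import numpy as np

class _ClassNameFeatureNameMixin:
    def get_feature_names_out(self, input_features=None):
        # Prefix each output index with the lowercased class name.
        class_name = type(self).__name__.lower()
        return np.asarray(
            [f"{class_name}{i}" for i in range(self.n_features_in_)],
            dtype=object,
        )

class Demo(_ClassNameFeatureNameMixin):
    n_features_in_ = 3

Demo().get_feature_names_out()
# array(['demo0', 'demo1', 'demo2'], dtype=object)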

Member:

Do we want to do it now, or make a final refactoring once we have all the different ways to generate names out?

Member:

I would do it now and iteratively improve it, rather than do a big PR at the end. But also check #21334, which goes in this direction.

Member:

Fine with me.

Member Author:

For _ClassNameFeatureNameMixin to work in general, it needs a way to get the "number of feature names out" from the actual class. KernelCenterer is a special case where it turns out that n_features_in_ == n_features_out_ and the names are prefixed.

In general, as in #21334, the feature_names_out_ are different from the feature names going in.

If we want to work toward a mixin, we can wait and work out a solution in #21334 and apply it here.
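
For reference, a sketch of how a mixin can read the output width from the estimator instead of assuming n_features_in_ (all names here are illustrative; #21334 adopted a similar _n_features_out convention):

import numpy as np

class _PrefixFeaturesOutMixin:
    def get_feature_names_out(self, input_features=None):
        class_name = type(self).__name__.lower()
        return np.asarray(
            [f"{class_name}{i}" for i in range(self._n_features_out)],
            dtype=object,
        )

class ExampleCenterer(_PrefixFeaturesOutMixin):
    # KernelCenterer-like case: the output width equals the input width.
    def __init__(self, n_features_in):
        self.n_features_in_ = n_features_in

    @property
    def _n_features_out(self):
        return self.n_features_in_

ExampleCenterer(2).get_feature_names_out()
# array(['examplecenterer0', 'examplecenterer1'], dtype=object)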

Member:

#21334 is now merged, so this PR can be updated accordingly.


@ageron (Contributor) commented Oct 25, 2021

> FunctionTransformer will need its own PR, because FunctionTransformer's API is a bit more flexible.
> [...]
> I have some thoughts on what API to use for this and it will be in a follow-up PR.

Hi @thomasjpfan, I'd love to know your plan for this. For example, suppose you have a DataFrame with features A, B, C, D, and you'd like to create a simple pipeline that runs a SimpleImputer on all columns and adds two new features equal to A/B and C/D. It doesn't sound too hard in principle, but I can't find a simple way to do it using pipelines and column transformers while still getting nice feature names out. Here's the least horrible approach I found:

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def compute_ratio(X):
    # Accept either a DataFrame or an ndarray.
    X = getattr(X, "values", X)
    return X[:, [0]] / X[:, [1]]

def feature_ratio_transformer(ratio_name):
    return make_pipeline(SimpleImputer(),
                         FunctionTransformer(compute_ratio,
                                             feature_names_out=[ratio_name]))

preprocessing = make_column_transformer(
    ("passthrough", ["A", "B", "C", "D"]),
    (feature_ratio_transformer("A/B ratio"), ["A", "B"]),
    (feature_ratio_transformer("C/D ratio"), ["C", "D"]),
)

output = preprocessing.fit_transform(pd.DataFrame({
    "A":[1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9], "D": [10, 11, 12]}))
>>> preprocessing.get_feature_names_out()
array(['passthrough__A', 'passthrough__B', 'passthrough__C',
       'passthrough__D', 'pipeline-1__A/B ratio', 'pipeline-2__C/D ratio'],
      dtype=object)

It feels simpler to just write a custom transformer that does everything, but that would go against Scikit-Learn's core principle of keeping things composable.

Perhaps in this example it would be simpler if the ColumnTransformer let us specify the feature names out:

preprocessing = make_column_transformer(
    ("passthrough", ["A", "B", "C", "D"]),
    (feature_ratio_transformer(), ["A", "B"], ["A/B ratio"]),
    (feature_ratio_transformer(), ["C", "D"], ["C/D ratio"]),
)

Wdyt?

@glemaitre (Member) commented

Since FunctionTransformer is indeed different, we should probably amend SLEP007 (maybe first accept it :)) to specify exactly what we intend as an implementation for this case.

@ageron (Contributor) commented Oct 25, 2021

Adding a feature_names_out hyperparameter to FunctionTransformer makes the following pipeline possible. It doesn't look too bad:

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer

def ratio_transformer(ratio_name):
    def compute_ratio(X):
        X = getattr(X, "values", X)
        return X[:, [0]] / X[:, [1]]

    return FunctionTransformer(compute_ratio,
                               feature_names_out=[ratio_name])

preprocessing = make_column_transformer(
    ("passthrough", ["A", "B", "C", "D"]),
    (ratio_transformer("A/B ratio"), ["A", "B"]),
    (ratio_transformer("C/D ratio"), ["C", "D"]),
)

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9],
                   "D": [10, 11, 12]})

output = preprocessing.fit_transform(df)
>>> preprocessing.get_feature_names_out()
array(['passthrough__A', 'passthrough__B', 'passthrough__C',
       'passthrough__D', 'functiontransformer-1__A/B ratio',
       'functiontransformer-2__C/D ratio'], dtype=object)

FYI, I currently use the following monkey-patching function to add the feature_names_out hyperparameter to the constructor of FunctionTransformer and to use it in get_feature_names_out(). It also patches SimpleImputer to add get_feature_names_out(), and patches Pipeline.get_feature_names_out() so that feature names propagate through the pipeline.

def monkey_patch_get_signature_names_out():
    """Monkey patch some classes which did not handle get_feature_names_out()
       correctly in 1.0.0."""
    from inspect import Signature, signature, Parameter
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline, Pipeline
    from sklearn.preprocessing import FunctionTransformer, StandardScaler

    default_get_feature_names_out = StandardScaler.get_feature_names_out

    if not hasattr(SimpleImputer, "get_feature_names_out"):
        print("Monkey-patching SimpleImputer.get_feature_names_out()")
        SimpleImputer.get_feature_names_out = default_get_feature_names_out

    if not hasattr(FunctionTransformer, "get_feature_names_out"):
        print("Monkey-patching FunctionTransformer.get_feature_names_out()")
        orig_init = FunctionTransformer.__init__
        orig_sig = signature(orig_init)

        def __init__(*args, feature_names_out=None, **kwargs):
            orig_sig.bind(*args, **kwargs)
            orig_init(*args, **kwargs)
            args[0].feature_names_out = feature_names_out

        __init__.__signature__ = Signature(
            list(signature(orig_init).parameters.values()) + [
                Parameter("feature_names_out", Parameter.KEYWORD_ONLY)])

        def get_feature_names_out(self, names=None):
            if self.feature_names_out is None:
                return default_get_feature_names_out(self, names)
            elif callable(self.feature_names_out):
                return self.feature_names_out(names)
            else:
                return self.feature_names_out

        FunctionTransformer.__init__ = __init__
        FunctionTransformer.get_feature_names_out = get_feature_names_out

    # Fit a trivial pipeline to detect whether Pipeline loses feature names.
    p = make_pipeline(SimpleImputer(), SimpleImputer())
    p.fit_transform(pd.DataFrame({"A": [1., 2.], "B": [3., 4.]}))
    if list(p.get_feature_names_out()) == ["x0", "x1"]:
        print("Monkey-patching Pipeline.get_feature_names_out()")
        def get_feature_names_out(self, names=None):
            names = default_get_feature_names_out(self, names)
            for transformer in self:
                names = transformer.get_feature_names_out(names)
            return names

        Pipeline.get_feature_names_out = get_feature_names_out

monkey_patch_get_signature_names_out()

@ogrisel (Member) commented Nov 5, 2021

@ageron would you be interested in opening a PR for the case of FunctionTransformer?

@ogrisel (Member) left a comment

Still +1 for merging this, with or without the integration of #21334 for KernelCenterer.

@adrinjalali (Member) commented

I'd be happier using the Mixin, so we have a more coherent solution across the codebase.

@ageron (Contributor) commented Nov 6, 2021

@ogrisel, sure, I'll give it a shot.

@ogrisel (Member) commented Dec 6, 2021

> I'd be happier using the Mixin, so we have a more coherent solution across the codebase.

@adrinjalali I made the requested change in 2cd55e9.

@adrinjalali (Member) left a comment

LGTM, happy for this to be merged once the conflict is resolved.

@lesteve (Member) commented Feb 7, 2022

Merging this one since CI is green and there were already two approvals.

@lesteve lesteve merged commit d7feac0 into scikit-learn:main Feb 7, 2022
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Feb 9, 2022
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: 赵丰 (Zhao Feng) <616545598@qq.com>
Co-authored-by: Niket Jain <51831161+nikJ13@users.noreply.github.com>
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>