ENH Adds feature_names_out to preprocessing module #21079
Conversation
LGTM.
LGTM!
@thomasjpfan: Thanks for the work - this will be so useful in practice! Does this also cover the FunctionTransformer? For example:

import numpy as np
from sklearn.preprocessing import FunctionTransformer

def two_columns(X):
    # Doubles the number of columns by appending 2 * X.
    return np.concatenate([X, 2 * X], axis=1)

transformer = FunctionTransformer(two_columns)
X = np.array([[0, 1], [2, 3]])
transformer.transform(X)
# array([[0, 1, 0, 2],
#        [2, 3, 4, 6]])

I have some thoughts on what API to use for this and it will be in a follow-up PR.
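Purely as an illustration of one possible naming API for such a column-expanding FunctionTransformer (nothing here was decided in this PR; the feature_names_out callable parameter is a hypothetical follow-up idea), a names callable could mirror the transform:

import numpy as np

def two_columns_names(input_features):
    # One name per original column, plus one per doubled column.
    input_features = list(input_features)
    return np.asarray(input_features + [f"2*{name}" for name in input_features],
                      dtype=object)

# Hypothetical usage if FunctionTransformer later accepts a names callable:
# transformer = FunctionTransformer(two_columns, feature_names_out=two_columns_names)
# transformer.fit(X)
# transformer.get_feature_names_out(["x0", "x1"])
# -> array(['x0', 'x1', '2*x0', '2*x1'], dtype=object)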
Nit, otherwise LGTM
sklearn/preprocessing/_data.py (Outdated)

@@ -2268,6 +2269,26 @@ def transform(self, K, copy=True):

        return K

    def get_feature_names_out(self, input_features=None):
Isn't the content of this method copy/pasted across every transformer that uses the [f"{class_name}{i}" for i in range(self.n_features_in_)] pattern? Shouldn't we move it to a _ClassNameFeatureNameMixin kind of thing?
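A rough sketch of what such a mixin could look like (the mixin name comes from the comment above; the _n_features_out attribute is an assumption for illustration, not the final scikit-learn implementation):

import numpy as np

class _ClassNameFeatureNameMixin:
    """Generate output names of the form '<lowercased class name><i>'."""

    def get_feature_names_out(self, input_features=None):
        # Assumes the estimator records its number of output features at fit
        # time, e.g. in an _n_features_out attribute (assumption).
        class_name = self.__class__.__name__.lower()
        return np.asarray(
            [f"{class_name}{i}" for i in range(self._n_features_out)],
            dtype=object,
        )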
Do we want to do it now, or make a final refactoring once we have all the different ways to generate names out?
I would do it now, and iteratively improve it, rather than do a big PR at the end. But also check #21334, which goes in this direction.
Fine with me.
For _ClassNameFeatureNameMixin to work in general, it needs a way to get the "number of feature names out" from the actual class. KernelCenterer is a special case where it turns out that n_features_in_ == n_features_out_ and the names are prefixed. In general (#21334), the feature names out are different from the feature names going in. If we want to work toward a mixin, we can wait, work out a solution in #21334, and apply it here.
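To make the KernelCenterer special case concrete, here is a small usage sketch (assuming this PR's class-name-prefixed naming is in place; the printed strings are an expectation, not verified output):

import numpy as np
from sklearn.metrics.pairwise import linear_kernel
from sklearn.preprocessing import KernelCenterer

X = np.array([[1., 2.], [2., 3.], [3., 4.]])
K = linear_kernel(X)  # (n_samples, n_samples) kernel matrix

centerer = KernelCenterer().fit(K)
# Expected: one name per column of K, prefixed with the lowercased class name,
# e.g. array(['kernelcenterer0', 'kernelcenterer1', 'kernelcenterer2'], dtype=object)
print(centerer.get_feature_names_out())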
#21334 is now merged so this PR can be updated accordingly.
Hi @thomasjpfan, I'd love to know your plan for this. For example, suppose you have a DataFrame with features A, B, C, D, and you'd like to create a simple pipeline that runs a SimpleImputer followed by a FunctionTransformer computing the ratio of two columns:

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def compute_ratio(X):
    X = getattr(X, "values", X)
    return X[:, [0]] / X[:, [1]]

def feature_ratio_transformer(ratio_name):
    return make_pipeline(SimpleImputer(),
                         FunctionTransformer(compute_ratio,
                                             feature_names_out=[ratio_name]))

preprocessing = make_column_transformer(
    ("passthrough", ["A", "B", "C", "D"]),
    (feature_ratio_transformer("A/B ratio"), ["A", "B"]),
    (feature_ratio_transformer("C/D ratio"), ["C", "D"]),
)
output = preprocessing.fit_transform(pd.DataFrame({
    "A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9], "D": [10, 11, 12]}))

>>> preprocessing.get_feature_names_out()
array(['passthrough__A', 'passthrough__B', 'passthrough__C',
       'passthrough__D', 'pipeline-1__A/B ratio', 'pipeline-2__C/D ratio'],
      dtype=object)

It feels simpler to just write a custom transformer that does everything, but that would go against the core principle of Scikit-Learn of keeping things composable. Perhaps in this example it would be simpler if the column transformer accepted the output feature names directly:

preprocessing = make_column_transformer(
    ("passthrough", ["A", "B", "C", "D"]),
    (feature_ratio_transformer(), ["A", "B"], ["A/B ratio"]),
    (feature_ratio_transformer(), ["C", "D"], ["C/D ratio"]),
)

Wdyt?
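For comparison, here is a minimal sketch (not part of the original comment) of the "custom transformer that does everything" alternative mentioned above; the class name RatioAdder and its hard-coded column names are made up for illustration:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Append A/B and C/D ratio columns to the original DataFrame."""

    def fit(self, X, y=None):
        self.feature_names_in_ = np.asarray(X.columns, dtype=object)
        return self

    def transform(self, X):
        X = X.copy()
        X["A/B ratio"] = X["A"] / X["B"]
        X["C/D ratio"] = X["C"] / X["D"]
        return X

    def get_feature_names_out(self, input_features=None):
        return np.concatenate([self.feature_names_in_,
                               np.array(["A/B ratio", "C/D ratio"], dtype=object)])

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6],
                   "C": [7, 8, 9], "D": [10, 11, 12]})
ratio_adder = RatioAdder().fit(df)
output = ratio_adder.transform(df)
print(ratio_adder.get_feature_names_out())

It works, but it hard-codes the column names and gives up the composability that the pipeline-based version keeps.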
Adding a feature_names_out argument to FunctionTransformer would work well for this use case:

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer

def ratio_transformer(ratio_name):
    def compute_ratio(X):
        X = getattr(X, "values", X)
        return X[:, [0]] / X[:, [1]]
    return FunctionTransformer(compute_ratio,
                               feature_names_out=[ratio_name])

preprocessing = make_column_transformer(
    ("passthrough", ["A", "B", "C", "D"]),
    (ratio_transformer("A/B ratio"), ["A", "B"]),
    (ratio_transformer("C/D ratio"), ["C", "D"]),
)
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9],
                   "D": [10, 11, 12]})
output = preprocessing.fit_transform(df)

>>> preprocessing.get_feature_names_out()
array(['passthrough__A', 'passthrough__B', 'passthrough__C',
       'passthrough__D', 'functiontransformer-1__A/B ratio',
       'functiontransformer-2__C/D ratio'], dtype=object)

FYI, I currently use the following monkey-patching function to add the missing get_feature_names_out() support:

def monkey_patch_get_signature_names_out():
    """Monkey patch some classes which did not handle get_feature_names_out()
    correctly in 1.0.0."""
    from inspect import Signature, signature, Parameter
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline, Pipeline
    from sklearn.preprocessing import FunctionTransformer, StandardScaler

    default_get_feature_names_out = StandardScaler.get_feature_names_out

    if not hasattr(SimpleImputer, "get_feature_names_out"):
        print("Monkey-patching SimpleImputer.get_feature_names_out()")
        SimpleImputer.get_feature_names_out = default_get_feature_names_out

    if not hasattr(FunctionTransformer, "get_feature_names_out"):
        print("Monkey-patching FunctionTransformer.get_feature_names_out()")
        orig_init = FunctionTransformer.__init__
        orig_sig = signature(orig_init)

        def __init__(*args, feature_names_out=None, **kwargs):
            orig_sig.bind(*args, **kwargs)
            orig_init(*args, **kwargs)
            args[0].feature_names_out = feature_names_out

        __init__.__signature__ = Signature(
            list(signature(orig_init).parameters.values()) + [
                Parameter("feature_names_out", Parameter.KEYWORD_ONLY)])

        def get_feature_names_out(self, names=None):
            if self.feature_names_out is None:
                return default_get_feature_names_out(self, names)
            elif callable(self.feature_names_out):
                return self.feature_names_out(names)
            else:
                return self.feature_names_out

        FunctionTransformer.__init__ = __init__
        FunctionTransformer.get_feature_names_out = get_feature_names_out

    p = make_pipeline(SimpleImputer(), SimpleImputer())
    p.fit_transform(pd.DataFrame({"A": [1., 2.], "B": [3., 4.]}))
    if list(p.get_feature_names_out()) == ["x0", "x1"]:
        print("Monkey-patching Pipeline.get_feature_names_out()")

        def get_feature_names_out(self, names=None):
            names = default_get_feature_names_out(self, names)
            for transformer in self:
                names = transformer.get_feature_names_out(names)
            return names

        Pipeline.get_feature_names_out = get_feature_names_out

monkey_patch_get_signature_names_out()
@ageron would you be interested in opening a PR for the case of FunctionTransformer?
Still +1 for merging this, with or without the integration of #21334 for KernelCenterer.
I'd be happier with using the Mixin, so that we have a more coherent solution across the codebase.
@ogrisel, sure, I'll give it a shot.
@adrinjalali I made the requested change in 2cd55e9.
LGTM, happy for this to be merged once the conflict is resolved.
Merging this one since CI is green and there were already two approvals.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: 赵丰 (Zhao Feng) <616545598@qq.com>
Co-authored-by: Niket Jain <51831161+nikJ13@users.noreply.github.com>
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
Reference Issues/PRs
Continues #18444
What does this implement/fix? Explain your changes.
This PR adds feature names out for the preprocessing module.
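As a brief usage sketch of what this adds (assuming the one-to-one preprocessing transformers gain get_feature_names_out through this PR; the outputs noted in the comments are the expected default naming, not verified results):

import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
normalizer = Normalizer().fit(X)

# Without input names, generated defaults are expected, e.g. ['x0', 'x1', 'x2'].
print(normalizer.get_feature_names_out())
# For one-to-one transformers, provided input names should be returned unchanged.
print(normalizer.get_feature_names_out(["a", "b", "c"]))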
Any other comments?
Feels like Normalizer, OrdinalEncoder, and Binarizer could be in 1.0, but it's most likely too late now.