-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
ENH Adds array_out="pandas" to transformers in preprocessing module #20100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should get_feature_names
or get_feature_names_out
be deprecated and become private?
Overall it looks quite good to me.
X : array-like | ||
The input samples. | ||
reset : bool, default=True | ||
If True, the `n_feature_names_` attribute is set to the feature |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If True, the `n_feature_names_` attribute is set to the feature | |
If True, the `feature_names_in_` attribute is set to the feature |
returned. If "default", an array-like without feature names is | ||
returned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's probably worth explaining why this is not a numpy array, and sometimes a sparse array.
@@ -706,10 +740,20 @@ def fit_transform(self, X, y=None, **fit_params): | |||
# method is possible for a given clustering algorithm | |||
if y is None: | |||
# fit method of arity 1 (unsupervised transformation) | |||
return self.fit(X, **fit_params).transform(X) | |||
fitted = self.fit(X, **fit_params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is fitted
ever not self
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that also confused me. I think it's always self, right? Or at least should be.
@@ -269,7 +269,8 @@ def test_fit_docstring_attributes(name, Estimator): | |||
est.fit(X, y) | |||
|
|||
skipped_attributes = {'x_scores_', # For PLS, TODO remove in 1.1 | |||
'y_scores_'} # For PLS, TODO remove in 1.1 | |||
'y_scores_', # For PLS, TODO remove in 1.1 | |||
'feature_names_in_'} # Ignore for now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we gonna fix this in this PR? We don't have to, but if we don't should create a separate issue for a sprint or something.
return get_feature_names_out_callable | ||
|
||
|
||
def _make_array_out(X_out, index, get_feature_names_out, *, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def _make_array_out(X_out, index, get_feature_names_out, *, | |
def _make_array_out(X_out, *, index, get_feature_names_out, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, wouldn't it make more sense for this function to accept the feature_names_out
instead of get_feature names_out
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess not, because of the way it's used in the decorator?
|
||
def _make_array_out(X_out, index, get_feature_names_out, *, | ||
array_out="default"): | ||
"""Construct array container based on global configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""Construct array container based on global configuration. | |
"""Construct array container based on the value of `array_out`. |
I think?
returned. If "default", an array-like without feature names is | ||
returned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
returned. If "default", an array-like without feature names is | |
returned. | |
returned. If "default", `X_out` is returned unmodified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or should we actually make sure the output is not a dataframe?
if array_out == "default": | ||
return X_out | ||
|
||
estimator, X_orig = args[0], args[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
silly question, what would these values be if the user calls est.transform(array_out='dataframe', X=X)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tuple unpacking error? args
is only est in that case, right?
From the dev meeting: We discussed how we want to see how the API of this PR compares to using a I'll come up this other implementation within a week and write up how the API compares to this PR when used with meta-estimators. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was vaguely remembering having an implementation with a constructor parameter, but I think it was actually a constructor parameter that gets the input feature names, not whether to produce pandas dataframes lol... there have been sooo many iterations of this now. This looks good though.
@@ -438,6 +438,10 @@ Changelog | |||
- |Feature| :class:`preprocessing.OrdinalEncoder` supports passing through | |||
missing values by default. :pr:`19069` by `Thomas Fan`_. | |||
|
|||
- |Feature| Transformers in the :mod:`sklearn.preprocessing` have a `array_out` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is array_out
a good name? that seems to imply... arrays. output_format
? or output
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be happy with either output
or output_format
@@ -706,10 +740,20 @@ def fit_transform(self, X, y=None, **fit_params): | |||
# method is possible for a given clustering algorithm | |||
if y is None: | |||
# fit method of arity 1 (unsupervised transformation) | |||
return self.fit(X, **fit_params).transform(X) | |||
fitted = self.fit(X, **fit_params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that also confused me. I think it's always self, right? Or at least should be.
return fitted.transform(X) | ||
|
||
# array_out != "default" | ||
transform_params = inspect.signature(fitted.transform).parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not just pass it, or is the error message not good in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mean "can we just pass it"? If we don't pass it when it doesn't exist, it's a silent bug, and I think just passing it gives an error message which can be confusing to many people.
if array_out == "default": | ||
return X_out | ||
|
||
estimator, X_orig = args[0], args[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tuple unpacking error? args
is only est in that case, right?
There is still a use case for
poly = PolynomialFeatures().fit(X_df)
# Uses `feature_names_in_` by default as input
poly.get_feature_names()
# without `get_feature_names` a user would need to pass in some dummy data
# to get the feature names:
poly.transform(X_df).columns
# Another API would be to store `feature_names_out_`, which can be a property
# so we do not need to store another array of strings:
poly.feature_names_out_ I prefer a |
I don't have a strong opinion between |
This is superseded by #23734, right? Can we close then? |
Yes this has been superseded by #23734 and we can close. |
Reference Issues/PRs
Toward #5523
Alternative to #16772
What does this implement/fix? Explain your changes.
Adds
array_out="pandas"
to the transformers in preprocessing module. Here are the decision decisions I made:transform
uses the index of the input.feature_names_in_
is a sequence of names and does not need to be a numpy array.pd.DataFrame(X)
uses integer column names andDictVectorizer.get_feature_names
can output numerical feature names.Any other comments?
Binarizer
now have state because ofn_features_in_
and nowfeature_names_in_
. ShouldBinarizer.transform(df_test, array_out="pandas")
use the column names ofdf_test
ifBinarizer.fit
is not called?