-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
ENH Introduces set_output API for pandas output #23734
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Introduces set_output API for pandas output #23734
Conversation
sklearn/utils/set_output.py
Outdated
output.columns = columns | ||
if index is not None: | ||
output.index = index | ||
return output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this if-block ever used in our current scikit-learn transformers? Or is it just to add some compat with third-party estimators that output dataframes by default in case they plan to inherit from the SetOutputMixin
?
One possible use-case I would see would be to enforce that the output dataframe of a FunctionTransformer
that naturally generates a dataframe has consistent column names with what is returned by its get_feature_names_out
method. However FunctionTransformer
does not inherit from the SetOutputMixin
class at the moment so it does not apply.
But I am wondering, for such transformers that naturally outputs dataframe, should it be the responsibility of the implementer of that transformer to ensure consistency with get_feature_names_out
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it be the responsibility of the implementer of that transformer to ensure consistency with get_feature_names_out
I'll prefer get_feature_names_out
to already be consistent when it gets to wrapping the container. In that case, we can have SetOutputMixin
be a no-op if the return of transform
is already a DataFrame.
Or is it just to add some compat with third-party estimators that output dataframes by default in case they plan to inherit from the SetOutputMixin?
This if-block
is for compat, but it's also to make sense of the API. If the output
is already a dataframe and columns
is passed in, then I thought a reasonable thing to do is to set the columns
as well. Given the point, above I am okay with doing a no-op if the container is already a DataFrame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the PR to make it a noop when the original_data
is a pandas dataframe. This way SetOutputMixin
does not do anything if estimator.transform
already returns a DataFrame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this resolved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is still a lingering issue with FunctionTransformer
.
This PR has been updated so that TranformerMixin
inherits the SetOutputMixin
. This means FunctionTransformer
will get the set_output
API if feature_names_out
is defined. To be concrete, here is the current behavior for FunctionTransformer
:
-
If
feature_names_out
is defined, thenset_output
is available. The output columns will always be consistent withget_feature_names_out
no matter iffunc
returns a dataframe or ndarray. -
If
feature_names_out
is not defined thenset_output
is not available.func
is allow to return a dataframe or a ndarray. Not havingset_output
is inconvenient, specifically whenfunc
returns a dataframe. For example, if theFunctionTransformer
withoutset_output
is in complex pipeline, thenpipeline.set_output(transform="pandas")
would fail to configureFunctionTransformer
.
A workaround for 2 is to have a custom set_output
raise a warning when feature_names_out
is not defined. The warning states that func
must return a dataframe if set_output(transform="pandas")
. Technically, we can "learn" the feature names out by running func
in fit
, but that introduces more computation compared to main
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A workaround for 2 is to have a custom set_output raise a warning when feature_names_out is not defined. The warning states that func must return a dataframe if set_output(transform="pandas").
Sounds reasonable. We can also raise a TypeError
at transform time if set_ouput(transform="pandas")
was previously called and that func
does not naturally return a pandas dataframe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should move the SetOutputMixin
into the TransformerMixin
so we don't have to add it everywhere manually and to keep the inheritance list more reasonable.
That should work in theory, though things that output sparse will error on transform
if someone did set_output
on them. It might be nicer to error earlier, but we can still do that after merging the initial thing.
Ok so the problem with inheriting from If we add Let the bikeshedding begin! |
I see two possible paths: Support it directly in scikit-learn or an API to configure "arrow".
If they inherit from
If it's a global
The default setting will keep the behavior on |
What if they do not inherit from |
@thomasjpfan The only open point for me is the name of the default option of Once that is settled, I will give my review approval. |
For now, I prefer not to require the API from third party estimators. Our meta-estimators such as Adopting the
|
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit split regarding the choice between None
, "default"
(why not "native"
). I find that None
does not specify the output type but this is also the case for "default"
or "native"
. It means that the output type should be provided somewhere else (I assume the documentation).
I am therefore fine with any option.
Regarding the other changes. LGTM.
Looking through the code, we can not use I think the best option is to find a string that best describes the behavior. I think @lorentzenchr WDYT? |
Yes |
I updated this PR to use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@thomasjpfan A great thank you for all your (years long) efforts to make this happen!
This reverts commit 5313958.
I merge. In case that #78 concludes to still change the default value, we can do that in a new PR. |
Yaaaaay!! |
* Introduces set_output API for all transformers * TransformerMixin inherits from _SetOutputMixin * Adds tests * Adds whatsnew * Adds example on using set_output API * Adds developer docs for set_output
Reference Issues/PRs
Closes #23001
Implements SLEP018: scikit-learn/enhancement_proposals#68
What does this implement/fix? Explain your changes.
This PR introduces:
set_output
forPipeline
and transformers inpreprocessing
.set_output
, which will be used for follow up PRs that addsset_output
to all other transformers.OutputTypeMixin
, where most transformers only need to subclass it to get theset_output
API.set_config(output_transform="pandas")
to set the output globally.