Skip to content

ENH Introduces set_output API for pandas output #23734

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 86 commits into from
Oct 12, 2022

Conversation

thomasjpfan
Copy link
Member

@thomasjpfan thomasjpfan commented Jun 23, 2022

Reference Issues/PRs

Closes #23001
Implements SLEP018: scikit-learn/enhancement_proposals#68

What does this implement/fix? Explain your changes.

This PR introduces:

  1. set_output for Pipeline and transformers in preprocessing.
  2. Common test to check the behavior of set_output, which will be used for follow up PRs that adds set_output to all other transformers.
  3. OutputTypeMixin, where most transformers only need to subclass it to get the set_output API.
  4. Global configuration option set_config(output_transform="pandas") to set the output globally.

output.columns = columns
if index is not None:
output.index = index
return output
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this if-block ever used in our current scikit-learn transformers? Or is it just to add some compat with third-party estimators that output dataframes by default in case they plan to inherit from the SetOutputMixin?

One possible use-case I would see would be to enforce that the output dataframe of a FunctionTransformer that naturally generates a dataframe has consistent column names with what is returned by its get_feature_names_out method. However FunctionTransformer does not inherit from the SetOutputMixin class at the moment so it does not apply.

But I am wondering, for such transformers that naturally outputs dataframe, should it be the responsibility of the implementer of that transformer to ensure consistency with get_feature_names_out?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be the responsibility of the implementer of that transformer to ensure consistency with get_feature_names_out

I'll prefer get_feature_names_out to already be consistent when it gets to wrapping the container. In that case, we can have SetOutputMixin be a no-op if the return of transform is already a DataFrame.

Or is it just to add some compat with third-party estimators that output dataframes by default in case they plan to inherit from the SetOutputMixin?

This if-block is for compat, but it's also to make sense of the API. If the output is already a dataframe and columns is passed in, then I thought a reasonable thing to do is to set the columns as well. Given the point, above I am okay with doing a no-op if the container is already a DataFrame.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the PR to make it a noop when the original_data is a pandas dataframe. This way SetOutputMixin does not do anything if estimator.transform already returns a DataFrame.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this resolved?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is still a lingering issue with FunctionTransformer.

This PR has been updated so that TranformerMixin inherits the SetOutputMixin. This means FunctionTransformer will get the set_output API if feature_names_out is defined. To be concrete, here is the current behavior for FunctionTransformer:

  1. If feature_names_out is defined, then set_output is available. The output columns will always be consistent with get_feature_names_out no matter if func returns a dataframe or ndarray.

  2. If feature_names_out is not defined then set_output is not available. func is allow to return a dataframe or a ndarray. Not having set_output is inconvenient, specifically when func returns a dataframe. For example, if the FunctionTransformer without set_output is in complex pipeline, then pipeline.set_output(transform="pandas") would fail to configure FunctionTransformer.

A workaround for 2 is to have a custom set_output raise a warning when feature_names_out is not defined. The warning states that func must return a dataframe if set_output(transform="pandas"). Technically, we can "learn" the feature names out by running func in fit, but that introduces more computation compared to main.

Copy link
Member

@ogrisel ogrisel Oct 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A workaround for 2 is to have a custom set_output raise a warning when feature_names_out is not defined. The warning states that func must return a dataframe if set_output(transform="pandas").

Sounds reasonable. We can also raise a TypeError at transform time if set_ouput(transform="pandas") was previously called and that func does not naturally return a pandas dataframe.

Copy link
Member

@amueller amueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should move the SetOutputMixin into the TransformerMixin so we don't have to add it everywhere manually and to keep the inheritance list more reasonable.
That should work in theory, though things that output sparse will error on transform if someone did set_output on them. It might be nicer to error earlier, but we can still do that after merging the initial thing.

@amueller
Copy link
Member

amueller commented Jul 17, 2022

Ok so the problem with inheriting from TransformerMixin is that it's hard to overwrite the behavior done in the mixin.
With the implementation in this PR, transformers can decide not to inherit from SetOutputMixin and implement transform and fit_transform themselves to support set_output.

If we add SetOutputMixin to TransformerMixin everybody inheriting from TransformerMixin will automatically get SetOutputMixin but it's harder to customize.
There would be two ways to allow customization, either via making _wrap_method_output a method that someone could overwrite, or by adding a **kwargs to __init_subclass__ which would allow someone to disable the automatic wrapping behavior.

Let the bikeshedding begin!

@thomasjpfan
Copy link
Member Author

How difficult/easy would it be, to extend the set_output options to, e.g., "arrow".

I see two possible paths: Support it directly in scikit-learn or an API to configure "arrow".

  1. Add another function _wrap_in_arrow_container and dispatch to it in _wrap_data_with_container when set_output(transform="arrow").
  2. Allow set_output(transform=callable), where a users defines a callable that constructs an arrow container. This will likely have the same signature of _wrap_in_pandas_container: (data, columns, index).

What happens with 3rd party transformers in a pipeline if global set_config(transform_output="pandas") has been set?

If they inherit from TransformerMixin, then they will auto opt-into set_output and transform will output DataFrames. They can opt-out by setting auto_wrap_output_keys=None.

If set_config(transform_output="pandas") has been set, and a model is saved via pickle and later loaded in a different python process, will it ouput pandas?

If it's a global set_config, then the global option needs to be set again to output pandas. If it's the local option, i.e. transform.set_output, then transform outputs pandas.

In the default setting, what happens with other Array API compliant data containers than numpy, e.g. cupy, see #22352?

The default setting will keep the behavior on main. The transformer decides what it wants to output, ndarray, cupy, dataframes, etc.

@lorentzenchr
Copy link
Member

What happens with 3rd party transformers in a pipeline if global set_config(transform_output="pandas") has been set?

If they inherit from TransformerMixin, then they will auto opt-into set_output and transform will output DataFrames. They can opt-out by setting auto_wrap_output_keys=None.

What if they do not inherit from TransformerMixin? Or stated otherwise: Are we clear enough on the expected API for a 3rd party transformer (that - for some reasons - does not want to depend on scikit-learn)?

@lorentzenchr
Copy link
Member

@thomasjpfan The only open point for me is the name of the default option of transform_output, see #23734 (comment). Your suggestion to use None instead of "default" is good for me. As this would be a deviation from the SLEP, other opinions are highly appreciated.

Once that is settled, I will give my review approval.

@thomasjpfan
Copy link
Member Author

What if they do not inherit from TransformerMixin? Or stated otherwise: Are we clear enough on the expected API for a 3rd party transformer (that - for some reasons - does not want to depend on scikit-learn)?

For now, I prefer not to require the API from third party estimators. Our meta-estimators such as ColumnTransformer will complain when transformers set_output. As for the global option, I do not think we can require third parties to respect it.

Adopting the set_output API fully, requires at least a soft dependency on scikit-learn and pandas:

  • scikit-learn: To follow scikit-learn's global config, one needs to get the global config from scikit-learn.
  • pandas: Outputting DataFrames requires it.

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit split regarding the choice between None, "default" (why not "native"). I find that None does not specify the output type but this is also the case for "default" or "native". It means that the output type should be provided somewhere else (I assume the documentation).

I am therefore fine with any option.

Regarding the other changes. LGTM.

@thomasjpfan
Copy link
Member Author

thomasjpfan commented Oct 11, 2022

Looking through the code, we can not use None instead of "default". Currently, the signature is: set_output(self, transform=None), where None is a sentinel to mean do "not configure transform". This is required if we have set_output(predict="pandas"), which configures the container for predict but leaves transform alone. Secondly, est.set_output() with no input configures nothing. I can use another sentinel object, but that complicates the set_output API.

I think the best option is to find a string that best describes the behavior. I think "native" is okay. Another option is "any"

@lorentzenchr WDYT?

@lorentzenchr
Copy link
Member

Yes "native" is fine, even good! I'm happy with any other default value than "default", which I would consider bad design because it does not tell anything about the actual behaviour.

@thomasjpfan
Copy link
Member Author

I updated this PR to use "native". I also opened a PR to update SLEP: scikit-learn/enhancement_proposals#78

Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
@thomasjpfan A great thank you for all your (years long) efforts to make this happen!

@lorentzenchr
Copy link
Member

I merge. In case that #78 concludes to still change the default value, we can do that in a new PR.

@lorentzenchr lorentzenchr merged commit 2a6703d into scikit-learn:main Oct 12, 2022
@amueller
Copy link
Member

Yaaaaay!!

avm19 pushed a commit to avm19/scikit-learn that referenced this pull request Oct 18, 2022
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Oct 31, 2022
* Introduces set_output API for all transformers

* TransformerMixin inherits from _SetOutputMixin

* Adds tests

* Adds whatsnew

* Adds example on using set_output API

* Adds developer docs for set_output
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pandas Output Proposal Outline
5 participants