ENH Introduces set_output API for pandas output #23734

Merged: 86 commits, Oct 12, 2022
Commits (86)
e1ea0a9
ENH Introduces set_output API
thomasjpfan Jun 22, 2022
07078a1
CLN Reduces API surface
thomasjpfan Jun 22, 2022
1faf347
ENH Expand test for failing case
thomasjpfan Jun 22, 2022
9f9680a
CLN Rename mixin
thomasjpfan Jun 22, 2022
a6a4b59
ENH Add full support for get_output in preprocessing
thomasjpfan Jun 22, 2022
4ae72c5
DOC Adds comment in clone
thomasjpfan Jun 23, 2022
beca084
DOC Adds whats new number
thomasjpfan Jun 23, 2022
021d36c
CLN Less diff
thomasjpfan Jun 23, 2022
de0db34
ENH Use keyword only arguments for public API
thomasjpfan Jun 27, 2022
63c4204
CLN Address comments
thomasjpfan Jul 4, 2022
ee4cdff
Merge remote-tracking branch 'upstream/main' into pandas_out_prototyp…
thomasjpfan Jul 4, 2022
9d318b1
FIX Fixes typo
thomasjpfan Jul 4, 2022
609f4f0
CLN Use dictionary instead
thomasjpfan Jul 5, 2022
64c761a
CLN Better error message
thomasjpfan Jul 5, 2022
471e2d5
TST Adds more code coverage
thomasjpfan Jul 5, 2022
63c2011
CLN Simplifies implementation
thomasjpfan Jul 9, 2022
89a854e
CLN Remove unneeded parameter
thomasjpfan Jul 9, 2022
20fed9e
CLN Simplify validation
thomasjpfan Jul 9, 2022
91e2448
Merge remote-tracking branch 'upstream/main' into pandas_out_prototyp…
thomasjpfan Jul 17, 2022
d63f059
CLN Rename output_transform transform_output
thomasjpfan Jul 17, 2022
32d9252
CLN Fix name
thomasjpfan Jul 17, 2022
126a9aa
DOC Update whats new
thomasjpfan Jul 17, 2022
0d02e50
Merge remote-tracking branch 'upstream/main' into pandas_out_prototyp…
thomasjpfan Aug 17, 2022
fb0abaa
ENH More flexible pandas output
thomasjpfan Aug 18, 2022
1c5c2ef
ENH Fixes set_output in other transformers
thomasjpfan Aug 18, 2022
e4a663f
TST Adds failing test for column transformer
thomasjpfan Aug 18, 2022
c8667b9
ENH Give column transformer desired behavior
thomasjpfan Aug 19, 2022
390e257
Merge remote-tracking branch 'upstream/main' into pandas_out_prototyp…
thomasjpfan Aug 19, 2022
19b6032
TST Fixes future warning
thomasjpfan Aug 19, 2022
0d2610a
CLN Refactor wrapper to take original input
thomasjpfan Aug 20, 2022
531c9c7
CLN Clean up column transformer
thomasjpfan Aug 20, 2022
321ede0
CLN Improve function and parameter names
thomasjpfan Aug 20, 2022
865edf5
ENH Adds support for cross decomposition
thomasjpfan Aug 20, 2022
110e50d
CLN Smaller diff by removing unneeded feature
thomasjpfan Aug 20, 2022
1c658ed
TST Adds tests for no get_feature_names_out
thomasjpfan Aug 20, 2022
5ae531f
CLN Make get_output_config more flexible
thomasjpfan Aug 20, 2022
4c7fefa
Merge remote-tracking branch 'upstream/main' into pandas_out_prototyp…
thomasjpfan Aug 26, 2022
4f8c2ac
Merge remote-tracking branch 'upstream/main' into pandas_out_prototyp…
thomasjpfan Sep 9, 2022
50fd9c1
DOC Update whats new
thomasjpfan Sep 9, 2022
0f63fa2
CLN Reduce diff
thomasjpfan Sep 9, 2022
c9fc072
Merge remote-tracking branch 'upstream/main' into pandas_out_prototyp…
thomasjpfan Sep 12, 2022
09d2359
WIP Move avaliable_if it its own file
thomasjpfan Sep 14, 2022
c59d800
CLN Be more strict about columns existing
thomasjpfan Sep 14, 2022
128ee66
CLN Address comments
thomasjpfan Sep 20, 2022
3477d51
CLN Address more comments
thomasjpfan Sep 20, 2022
f94870e
REV Revert unneeded change
thomasjpfan Sep 21, 2022
9cbb47c
TST Ust assert_from_equal
thomasjpfan Sep 21, 2022
cf0c916
Merge remote-tracking branch 'upstream/main' into pandas_out_prototyp…
thomasjpfan Sep 21, 2022
94c4ff5
TST Adds OneHotEncoder(sparse_output=True) to common test
thomasjpfan Sep 21, 2022
4e56880
CLN Better column transformer logic
thomasjpfan Sep 21, 2022
980caf3
CLN More strict language in error message
thomasjpfan Sep 21, 2022
9888bdd
DOC Remove docstring for unused parameter
thomasjpfan Sep 21, 2022
2b238aa
TST More tests
thomasjpfan Sep 21, 2022
2db0dd4
DOC Better docstring
thomasjpfan Sep 22, 2022
77511b5
Merge remote-tracking branch 'upstream/main' into pandas_out_prototyp…
thomasjpfan Sep 22, 2022
903ad04
TST Fix failing tests
thomasjpfan Sep 22, 2022
4d7f594
DOC Adds example on using set_output API
thomasjpfan Sep 24, 2022
26853ab
DOC Add developer docs for set_output
thomasjpfan Sep 25, 2022
7f13efb
DOC Adds more developer docs
thomasjpfan Sep 26, 2022
f64b2f5
FIX Fixes bug with set_output and transform=None
thomasjpfan Sep 26, 2022
96ae074
Merge remote-tracking branch 'upstream/main' into pandas_out_prototyp…
thomasjpfan Sep 26, 2022
99f9497
ENH Moves set_output mixin into BaseEstimator
thomasjpfan Sep 26, 2022
2fc486d
API Make more API private
thomasjpfan Sep 26, 2022
fe87f71
DOC Remove unneeded link
thomasjpfan Sep 26, 2022
cca5548
API Move back to TransformerMixin
thomasjpfan Sep 26, 2022
54964dd
ENH FunctionTransformer.set_output warns about func when transform=pa…
thomasjpfan Sep 27, 2022
3f56922
Merge remote-tracking branch 'upstream/main' into pandas_out_prototyp…
thomasjpfan Sep 27, 2022
88e17ff
Apply suggestions from code review
thomasjpfan Oct 5, 2022
072b1a3
Merge remote-tracking branch 'upstream/main' into pandas_out_prototyp…
thomasjpfan Oct 5, 2022
78f4a8b
CLN Smaller diff compared to main
thomasjpfan Oct 5, 2022
598a94f
STY Flake8 error
thomasjpfan Oct 5, 2022
08e01d0
Merge remote-tracking branch 'upstream/main' into pandas_out_prototyp…
thomasjpfan Oct 6, 2022
0009dc3
FIX Merge conflict
thomasjpfan Oct 6, 2022
244c002
CLN Address more comments
thomasjpfan Oct 6, 2022
4b1f1e5
Merge remote-tracking branch 'upstream/main' into pandas_out_prototyp…
thomasjpfan Oct 6, 2022
c33a307
FIX Fixes doc build
thomasjpfan Oct 6, 2022
51fc045
FIX Fixes test for new message
thomasjpfan Oct 6, 2022
fad1fa5
CLN Address comments
thomasjpfan Oct 7, 2022
c8bb076
DOC Adds comments about data_to_wrap
thomasjpfan Oct 7, 2022
d25ba8b
DOC Adds behavior with multiple array output
thomasjpfan Oct 7, 2022
01771ae
STY Slight formatting
thomasjpfan Oct 7, 2022
1421e8a
TST Adds testing for fit_transform and fit.transform
thomasjpfan Oct 7, 2022
24c3fc1
Update doc/developers/develop.rst
thomasjpfan Oct 10, 2022
b87ad84
Merge remote-tracking branch 'upstream/main' into pandas_out_prototyp…
thomasjpfan Oct 11, 2022
5313958
ENH Rename 'default' to 'native'
thomasjpfan Oct 11, 2022
48add35
Revert "ENH Rename 'default' to 'native'"
thomasjpfan Oct 12, 2022
29 changes: 29 additions & 0 deletions doc/developers/develop.rst
@@ -635,6 +635,35 @@ instantiated with an instance of ``LogisticRegression`` (or
of these two models is somewhat idiosyncratic but both should provide robust
closed-form solutions.

.. _developer_api_set_output:

Developer API for `set_output`
==============================

With
`SLEP018 <https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep018/proposal.html>`__,
scikit-learn introduces the `set_output` API for configuring transformers to
output pandas DataFrames. The `set_output` API is automatically defined if the
transformer defines :term:`get_feature_names_out` and subclasses
:class:`base.TransformerMixin`. :term:`get_feature_names_out` is used to get the
column names of pandas output. You can opt out of the `set_output` API by
setting `auto_wrap_output_keys=None` when defining a custom subclass::

    class MyTransformer(TransformerMixin, BaseEstimator, auto_wrap_output_keys=None):

        def fit(self, X, y=None):
            return self

        def transform(self, X, y=None):
            return X

        def get_feature_names_out(self, input_features=None):
            ...

For transformers that return multiple arrays in `transform`, auto wrapping will
only wrap the first array and not alter the other arrays.

See :ref:`sphx_glr_auto_examples_miscellaneous_plot_set_output.py`
for an example on how to use the API.
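For completeness, a sketch of the opposite case (not part of the diff): a transformer
that defines :term:`get_feature_names_out` and subclasses :class:`base.TransformerMixin`
gets `set_output` automatically. The class and feature names below are illustrative and
assume a scikit-learn version with this PR merged and pandas installed::

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin

    class AddOne(TransformerMixin, BaseEstimator):

        def fit(self, X, y=None):
            # Record the number of input features for get_feature_names_out.
            self.n_features_in_ = X.shape[1]
            return self

        def transform(self, X, y=None):
            return np.asarray(X) + 1

        def get_feature_names_out(self, input_features=None):
            return np.asarray(
                [f"x{i}" for i in range(self.n_features_in_)], dtype=object
            )

    # `set_output` is available because get_feature_names_out is defined; after
    # this call, transform and fit_transform return pandas DataFrames.
    adder = AddOne().set_output(transform="pandas")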

.. _coding-guidelines:

Coding guidelines
7 changes: 7 additions & 0 deletions doc/whats_new/v1.2.rst
@@ -52,6 +52,13 @@ random sampling procedures.
Changes impacting all modules
-----------------------------

- |MajorFeature| The `set_output` API has been adopted by all transformers.
Meta-estimators that contain transformers such as :class:`pipeline.Pipeline`
or :class:`compose.ColumnTransformer` also define a `set_output`.
For details, see
`SLEP018 <https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep018/proposal.html>`__.
:pr:`23734` by `Thomas Fan`_.

- |Enhancement| Finiteness checks (detection of NaN and infinite values) in all
estimators are now significantly more efficient for float32 data by leveraging
NumPy's SIMD optimized primitives.
111 changes: 111 additions & 0 deletions examples/miscellaneous/plot_set_output.py
@@ -0,0 +1,111 @@
"""
================================
Introducing the `set_output` API
================================

.. currentmodule:: sklearn

This example demonstrates the `set_output` API, which configures transformers to
output pandas DataFrames. The output can be configured per estimator by calling
the `set_output` method, or globally by calling `set_config(transform_output="pandas")`.
For details, see
`SLEP018 <https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep018/proposal.html>`__.
""" # noqa

# %%
# First, we load the iris dataset as a DataFrame to demonstrate the `set_output` API.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
X_train.head()

# %%
# To configure an estimator such as :class:`preprocessing.StandardScaler` to return
# DataFrames, call `set_output`. This feature requires pandas to be installed.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().set_output(transform="pandas")

scaler.fit(X_train)
X_test_scaled = scaler.transform(X_test)
X_test_scaled.head()

# %%
# `set_output` can be called after `fit` to configure `transform` after the fact.
scaler2 = StandardScaler()

scaler2.fit(X_train)
X_test_np = scaler2.transform(X_test)
print(f"Default output type: {type(X_test_np).__name__}")

scaler2.set_output(transform="pandas")
X_test_df = scaler2.transform(X_test)
print(f"Configured pandas output type: {type(X_test_df).__name__}")

# %%
# In a :class:`pipeline.Pipeline`, `set_output` configures all steps to output
# DataFrames.
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectPercentile

clf = make_pipeline(
    StandardScaler(), SelectPercentile(percentile=75), LogisticRegression()
)
clf.set_output(transform="pandas")
clf.fit(X_train, y_train)

# %%
# Each transformer in the pipeline is configured to return DataFrames. This
# means that the final logistic regression step contains the feature names.
clf[-1].feature_names_in_

# %%
# Next we load the titanic dataset to demonstrate `set_output` with
# :class:`compose.ColumnTransformer` and heterogeneous data.
from sklearn.datasets import fetch_openml

X, y = fetch_openml(
    "titanic", version=1, as_frame=True, return_X_y=True, parser="pandas"
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

# %%
# The `set_output` API can be configured globally by using :func:`set_config` and
# setting `transform_output` to `"pandas"`.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn import set_config

set_config(transform_output="pandas")

num_pipe = make_pipeline(SimpleImputer(), StandardScaler())
ct = ColumnTransformer(
    (
        ("numerical", num_pipe, ["age", "fare"]),
        (
            "categorical",
            OneHotEncoder(
                sparse_output=False, drop="if_binary", handle_unknown="ignore"
            ),
            ["embarked", "sex", "pclass"],
        ),
    ),
    verbose_feature_names_out=False,
)
clf = make_pipeline(ct, SelectPercentile(percentile=50), LogisticRegression())
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

# %%
# With the global configuration, all transformers output DataFrames. This allows us to
# easily plot the logistic regression coefficients with the corresponding feature names.
import pandas as pd

log_reg = clf[-1]
coef = pd.Series(log_reg.coef_.ravel(), index=log_reg.feature_names_in_)
_ = coef.sort_values().plot.barh()
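As a small usage sketch (not part of the example file): the same API accepts the
"default" option to go back to the unwrapped output container, either per estimator
or globally. This reuses the `scaler` and `set_config` names defined above and
assumes scikit-learn >= 1.2.

scaler.set_output(transform="default")   # this estimator returns NumPy arrays again
set_config(transform_output="default")   # restore the global default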
16 changes: 16 additions & 0 deletions sklearn/_config.py
@@ -14,6 +14,7 @@
    ),
    "enable_cython_pairwise_dist": True,
    "array_api_dispatch": False,
    "transform_output": "default",
Review thread:

Member: Could we use a more descriptive name than "default" as the default value? I know it's written in SLEP 018, but something like "numpy" or "array-like" would better describe this option, IMHO.

Member Author: This option used to be "numpy", but that was inconsistent with transformers that can output sparse data. I was also thinking about third-party transformers that already output DataFrames, where a "numpy" default would be strange.

I can get behind "array-like". My only concern is that sparse data is a weird "array-like", because asarray on a sparse matrix returns an object dtype:

from scipy.sparse import csr_matrix
import numpy as np

mat = csr_matrix([[1, 2, 0]])
print(np.asarray(mat).dtype)
# object

Member Author: Looking at our glossary entry for "array-like", we exclude "sparse matrix" from "array-like". In that case, "array-like" would not be a good default, because it does not cover sparse matrices.

Member: Do you have a proposal? Naming the default "default" just seems wrong to me. What if we change it in the future?

Member Author: The semantics of "default" is "the transformer does anything it wants". Here are some options:

  1. None
  2. "undefined"
  3. "unmodified"
  4. "unchanged"

I am in favor of None. I think it's the most pythonic way to say "use the default".

Member: I'm fine with None. What do others think?

}
_threadlocal = threading.local()

@@ -52,6 +53,7 @@ def set_config(
    pairwise_dist_chunk_size=None,
    enable_cython_pairwise_dist=None,
    array_api_dispatch=None,
    transform_output=None,
):
    """Set global scikit-learn configuration

@@ -120,6 +122,11 @@

        .. versionadded:: 1.2

    transform_output : str, default=None
        Configure the output container for transform.

        .. versionadded:: 1.2

    See Also
    --------
    config_context : Context manager for global scikit-learn configuration.
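A minimal usage sketch of the new `transform_output` option of `set_config` (not part
of the diff; assumes scikit-learn >= 1.2 and pandas installed):

from sklearn import set_config

set_config(transform_output="pandas")   # transformers now return DataFrames
set_config(transform_output="default")  # restore the default behavior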
@@ -141,6 +148,8 @@
        local_config["enable_cython_pairwise_dist"] = enable_cython_pairwise_dist
    if array_api_dispatch is not None:
        local_config["array_api_dispatch"] = array_api_dispatch
    if transform_output is not None:
        local_config["transform_output"] = transform_output


@contextmanager
Expand All @@ -153,6 +162,7 @@ def config_context(
pairwise_dist_chunk_size=None,
enable_cython_pairwise_dist=None,
array_api_dispatch=None,
transform_output=None,
):
"""Context manager for global scikit-learn configuration.

@@ -220,6 +230,11 @@

        .. versionadded:: 1.2

    transform_output : str, default=None
        Configure the output container for transform.

        .. versionadded:: 1.2

    Yields
    ------
    None.
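The same option can also be scoped with the context manager; a small sketch (not part
of the diff; assumes scikit-learn >= 1.2 and pandas installed):

import numpy as np
from sklearn import config_context
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0], [3.0, 4.0]])
with config_context(transform_output="pandas"):
    # Inside the context, fit_transform returns a DataFrame.
    X_df = StandardScaler().fit_transform(X)
# Outside the context, the previous global setting applies again.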
@@ -256,6 +271,7 @@
        pairwise_dist_chunk_size=pairwise_dist_chunk_size,
        enable_cython_pairwise_dist=enable_cython_pairwise_dist,
        array_api_dispatch=array_api_dispatch,
        transform_output=transform_output,
    )

    try:
17 changes: 15 additions & 2 deletions sklearn/base.py
@@ -15,6 +15,7 @@
from . import __version__
from ._config import get_config
from .utils import _IS_32BIT
from .utils._set_output import _SetOutputMixin
from .utils._tags import (
    _DEFAULT_TAGS,
)
@@ -98,6 +99,13 @@ def clone(estimator, *, safe=True):
                "Cannot clone object %s, as the constructor "
                "either does not set or modifies parameter %s" % (estimator, name)
            )

    # _sklearn_output_config is used by `set_output` to configure the output
    # container of an estimator.
    if hasattr(estimator, "_sklearn_output_config"):
        new_object._sklearn_output_config = copy.deepcopy(
            estimator._sklearn_output_config
        )
    return new_object
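A small sketch of what the change above means in practice, namely that the
per-estimator output configuration survives `clone` (not part of the diff; the
estimator choice is illustrative and assumes scikit-learn >= 1.2):

from sklearn.base import clone
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().set_output(transform="pandas")
cloned = clone(scaler)
# _sklearn_output_config was deep-copied, so the clone also returns
# DataFrames from transform once it has been fitted.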


@@ -798,8 +806,13 @@ def get_submatrix(self, i, data):
        return data[row_ind[:, np.newaxis], col_ind]


class TransformerMixin:
    """Mixin class for all transformers in scikit-learn."""
class TransformerMixin(_SetOutputMixin):
    """Mixin class for all transformers in scikit-learn.

    If :term:`get_feature_names_out` is defined and `auto_wrap_output` is True,
    then `BaseEstimator` will automatically wrap `transform` and `fit_transform` to
    follow the `set_output` API. See the :ref:`developer_api_set_output` for details.
    """

    def fit_transform(self, X, y=None, **fit_params):
        """
63 changes: 60 additions & 3 deletions sklearn/compose/_column_transformer.py
@@ -20,6 +20,8 @@
from ..utils import Bunch
from ..utils import _safe_indexing
from ..utils import _get_column_indices
from ..utils._set_output import _get_output_config, _safe_set_output
from ..utils import check_pandas_support
from ..utils.metaestimators import _BaseComposition
from ..utils.validation import check_array, check_is_fitted, _check_feature_names_in
from ..utils.fixes import delayed
@@ -252,6 +254,35 @@ def _transformers(self, value):
        except (TypeError, ValueError):
            self.transformers = value

    def set_output(self, transform=None):
        """Set the output container when `"transform"` and `"fit_transform"` are called.

        Calling `set_output` will set the output of all estimators in `transformers`
        and `transformers_`.

        Parameters
        ----------
        transform : {"default", "pandas"}, default=None
            Configure output of `transform` and `fit_transform`.

        Returns
        -------
        self : estimator instance
            Estimator instance.
        """
        super().set_output(transform=transform)
        transformers = (
            trans
            for _, trans, _ in chain(
                self.transformers, getattr(self, "transformers_", [])
            )
            if trans not in {"passthrough", "drop"}
        )
        for trans in transformers:
            _safe_set_output(trans, transform=transform)

        return self
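A usage sketch of the propagation described above (not part of the diff; the
transformers and column names are illustrative, assumes scikit-learn >= 1.2):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

ct = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(sparse_output=False), ["city"]),
    ]
)
# Calls set_output on the ColumnTransformer itself and forwards it to
# StandardScaler and OneHotEncoder (and to transformers_ once fitted).
ct.set_output(transform="pandas")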

    def get_params(self, deep=True):
        """Get parameters for this estimator.

@@ -302,7 +333,19 @@ def _iter(self, fitted=False, replace_strings=False, column_as_strings=False):

        """
        if fitted:
            transformers = self.transformers_
            if replace_strings:
                # Replace "passthrough" with the fitted version in
                # _name_to_fitted_passthrough
                def replace_passthrough(name, trans, columns):
                    if name not in self._name_to_fitted_passthrough:
                        return name, trans, columns
                    return name, self._name_to_fitted_passthrough[name], columns

                transformers = [
                    replace_passthrough(*trans) for trans in self.transformers_
                ]
            else:
                transformers = self.transformers_
        else:
            # interleave the validated column specifiers
            transformers = [
@@ -314,12 +357,17 @@ def _iter(self, fitted=False, replace_strings=False, column_as_strings=False):
            transformers = chain(transformers, [self._remainder])
        get_weight = (self.transformer_weights or {}).get

        output_config = _get_output_config("transform", self)
        for name, trans, columns in transformers:
            if replace_strings:
                # replace 'passthrough' with identity transformer and
                # skip in case of 'drop'
                if trans == "passthrough":
                    trans = FunctionTransformer(accept_sparse=True, check_inverse=False)
                    trans = FunctionTransformer(
                        accept_sparse=True,
                        check_inverse=False,
                        feature_names_out="one-to-one",
                    ).set_output(transform=output_config["dense"])
                elif trans == "drop":
                    continue
                elif _is_empty_column_selection(columns):
@@ -505,15 +553,20 @@ def _update_fitted_transformers(self, transformers):
        # transformers are fitted; excludes 'drop' cases
        fitted_transformers = iter(transformers)
        transformers_ = []
        self._name_to_fitted_passthrough = {}

        for name, old, column, _ in self._iter():
            if old == "drop":
                trans = "drop"
            elif old == "passthrough":
                # FunctionTransformer is present in list of transformers,
                # so get next transformer, but save original string
                next(fitted_transformers)
                func_transformer = next(fitted_transformers)
                trans = "passthrough"

                # The fitted FunctionTransformer is saved in another attribute,
                # so it can be used during transform for set_output.
                self._name_to_fitted_passthrough[name] = func_transformer
            elif _is_empty_column_selection(column):
                trans = old
            else:
@@ -765,6 +818,10 @@ def _hstack(self, Xs):
            return sparse.hstack(converted_Xs).tocsr()
        else:
            Xs = [f.toarray() if sparse.issparse(f) else f for f in Xs]
            config = _get_output_config("transform", self)
            if config["dense"] == "pandas" and all(hasattr(X, "iloc") for X in Xs):
                pd = check_pandas_support("transform")
                return pd.concat(Xs, axis=1)
            return np.hstack(Xs)

    def _sk_visual_block_(self):