Commit e91011b

DOC update FAQ regarding dataframe I/O and support for categorical variable (#29957)
1 parent 019e953 commit e91011b

1 file changed, +27 -15 lines changed

doc/faq.rst

@@ -181,21 +181,33 @@ discussed in :ref:`preprocessing_categorical_features`.
 See also :ref:`sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py` for an
 example of working with heterogeneous (e.g. categorical and numeric) data.
 
-Why does scikit-learn not directly work with, for example, :class:`pandas.DataFrame`?
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The homogeneous NumPy and SciPy data objects currently expected are most
-efficient to process for most operations. Extensive work would also be needed
-to support Pandas categorical types. Restricting input to homogeneous
-types therefore reduces maintenance cost and encourages usage of efficient
-data structures.
-
-Note however that :class:`~sklearn.compose.ColumnTransformer` makes it
-convenient to handle heterogeneous pandas dataframes by mapping homogeneous subsets of
-dataframe columns selected by name or dtype to dedicated scikit-learn transformers.
-Therefore :class:`~sklearn.compose.ColumnTransformer` are often used in the first
-step of scikit-learn pipelines when dealing
-with heterogeneous dataframes (see :ref:`pipeline` for more details).
+Note that recently, :class:`~sklearn.ensemble.HistGradientBoostingClassifier` and
+:class:`~sklearn.ensemble.HistGradientBoostingRegressor` gained native support for
+categorical features through the option `categorical_features="from_dtype"`. This
+option relies on inferring which columns of the data are categorical based on the
+:class:`pandas.CategoricalDtype` and :class:`polars.datatypes.Categorical` dtypes.
+
+Does scikit-learn work natively with various types of dataframes?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Scikit-learn has limited support for :class:`pandas.DataFrame` and
+:class:`polars.DataFrame`. Scikit-learn estimators can accept both these dataframe types
+as input, and scikit-learn transformers can output dataframes using the `set_output`
+API. For more details, refer to
+:ref:`sphx_glr_auto_examples_miscellaneous_plot_set_output.py`.
+
+However, the internal computations in scikit-learn estimators rely on numerical
+operations that are more efficiently performed on homogeneous data structures such as
+NumPy arrays or SciPy sparse matrices. As a result, most scikit-learn estimators will
+internally convert dataframe inputs into these homogeneous data structures. Similarly,
+dataframe outputs are generated from these homogeneous data structures.
+
+Also note that :class:`~sklearn.compose.ColumnTransformer` makes it convenient to handle
+heterogeneous pandas dataframes by mapping homogeneous subsets of dataframe columns
+selected by name or dtype to dedicated scikit-learn transformers. Therefore
+:class:`~sklearn.compose.ColumnTransformer` are often used in the first step of
+scikit-learn pipelines when dealing with heterogeneous dataframes (see :ref:`pipeline`
+for more details).
 
 See also :ref:`sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py`
 for an example of working with heterogeneous (e.g. categorical and numeric) data.
