@@ -181,21 +181,33 @@ discussed in :ref:`preprocessing_categorical_features`.
See also :ref:`sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py` for an
example of working with heterogeneous (e.g. categorical and numeric) data.
- Why does scikit-learn not directly work with, for example, :class:`pandas.DataFrame`?
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
- The homogeneous NumPy and SciPy data objects currently expected are most
- efficient to process for most operations. Extensive work would also be needed
- to support Pandas categorical types. Restricting input to homogeneous
- types therefore reduces maintenance cost and encourages usage of efficient
- data structures.
-
- Note however that :class:`~sklearn.compose.ColumnTransformer` makes it
- convenient to handle heterogeneous pandas dataframes by mapping homogeneous subsets of
- dataframe columns selected by name or dtype to dedicated scikit-learn transformers.
- Therefore :class:`~sklearn.compose.ColumnTransformer` are often used in the first
- step of scikit-learn pipelines when dealing
- with heterogeneous dataframes (see :ref:`pipeline` for more details).
+ Note that recently, :class:`~sklearn.ensemble.HistGradientBoostingClassifier` and
+ :class:`~sklearn.ensemble.HistGradientBoostingRegressor` gained native support for
+ categorical features through the option `categorical_features="from_dtype"`. This
+ option relies on inferring which columns of the data are categorical based on the
+ :class:`pandas.CategoricalDtype` and :class:`polars.datatypes.Categorical` dtypes.
+
+ Does scikit-learn work natively with various types of dataframes?
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ Scikit-learn has limited support for :class:`pandas.DataFrame` and
+ :class:`polars.DataFrame`. Scikit-learn estimators can accept both these dataframe types
+ as input, and scikit-learn transformers can output dataframes using the `set_output`
+ API. For more details, refer to
+ :ref:`sphx_glr_auto_examples_miscellaneous_plot_set_output.py`.
+
+ However, the internal computations in scikit-learn estimators rely on numerical
+ operations that are more efficiently performed on homogeneous data structures such as
+ NumPy arrays or SciPy sparse matrices. As a result, most scikit-learn estimators will
+ internally convert dataframe inputs into these homogeneous data structures. Similarly,
+ dataframe outputs are generated from these homogeneous data structures.
+
+ Also note that :class:`~sklearn.compose.ColumnTransformer` makes it convenient to handle
+ heterogeneous pandas dataframes by mapping homogeneous subsets of dataframe columns
+ selected by name or dtype to dedicated scikit-learn transformers. Therefore
+ :class:`~sklearn.compose.ColumnTransformer` are often used in the first step of
+ scikit-learn pipelines when dealing with heterogeneous dataframes (see :ref:`pipeline`
+ for more details).
See also :ref:`sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py`
for an example of working with heterogeneous (e.g. categorical and numeric) data.