
Commit b824c72

Authored by ArturoAmorQ and lorentzenchr
DOC Improve wording in Categorical Feature support example (#31864)
Co-authored-by: ArturoAmorQ <arturo.amor-quiroz@polytechnique.edu>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
1 parent adb1ae7 commit b824c72

File tree

1 file changed (+50 -41 lines)


examples/ensemble/plot_gradient_boosting_categorical.py

Lines changed: 50 additions & 41 deletions
@@ -5,26 +5,29 @@
 
 .. currentmodule:: sklearn
 
-In this example, we will compare the training times and prediction
-performances of :class:`~ensemble.HistGradientBoostingRegressor` with
-different encoding strategies for categorical features. In
-particular, we will evaluate:
+In this example, we compare the training times and prediction performances of
+:class:`~ensemble.HistGradientBoostingRegressor` with different encoding
+strategies for categorical features. In particular, we evaluate:
 
 - "Dropped": dropping the categorical features;
 - "One Hot": using a :class:`~preprocessing.OneHotEncoder`;
 - "Ordinal": using an :class:`~preprocessing.OrdinalEncoder` and treat
   categories as ordered, equidistant quantities;
-- "Native": using an :class:`~preprocessing.OrdinalEncoder` and rely on the
-  :ref:`native category support <categorical_support_gbdt>` of the
+- "Native": relying on the :ref:`native category support
+  <categorical_support_gbdt>` of the
   :class:`~ensemble.HistGradientBoostingRegressor` estimator.
 
-We will work with the Ames Iowa Housing dataset which consists of numerical
-and categorical features, where the houses' sales prices is the target.
+For such purpose we use the Ames Iowa Housing dataset, which consists of
+numerical and categorical features, where the target is the house sale price.
 
 See :ref:`sphx_glr_auto_examples_ensemble_plot_hgbt_regression.py` for an
 example showcasing some other features of
 :class:`~ensemble.HistGradientBoostingRegressor`.
 
+See :ref:`sphx_glr_auto_examples_preprocessing_plot_target_encoder.py` for a
+comparison of encoding strategies in the presence of high cardinality
+categorical features.
+
 """
 
 # Authors: The scikit-learn developers
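
For context, before the sections diffed below the example loads the Ames
housing data. A minimal sketch, assuming the OpenML copy of the dataset (the
`data_id` here is an assumption, not part of this commit):

    from sklearn.datasets import fetch_openml

    # data_id=42165 is an assumed OpenML id for the Ames housing data; the
    # example may also subset columns and cast them to a categorical dtype.
    X, y = fetch_openml(data_id=42165, as_frame=True, return_X_y=True)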
@@ -97,8 +100,8 @@
 # %%
 # Gradient boosting estimator with one-hot encoding
 # -------------------------------------------------
-# Next, we create a pipeline that will one-hot encode the categorical features
-# and let the rest of the numerical data to passthrough:
+# Next, we create a pipeline to one-hot encode the categorical features,
+# while letting the remaining features `"passthrough"` unchanged:
 
 from sklearn.preprocessing import OneHotEncoder
 
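
The pipeline being described, reconstructed as a sketch from the surrounding
code (the exact `OneHotEncoder` options are assumptions, not part of this
commit):

    from sklearn.compose import make_column_selector, make_column_transformer
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder

    # One-hot encode the categorical columns; all other columns pass through.
    one_hot_encoder = make_column_transformer(
        (
            OneHotEncoder(sparse_output=False, handle_unknown="ignore"),
            make_column_selector(dtype_include="category"),
        ),
        remainder="passthrough",
    )

    hist_one_hot = make_pipeline(
        one_hot_encoder, HistGradientBoostingRegressor(random_state=42)
    )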

@@ -118,9 +121,9 @@
 # %%
 # Gradient boosting estimator with ordinal encoding
 # -------------------------------------------------
-# Next, we create a pipeline that will treat categorical features as if they
-# were ordered quantities, i.e. the categories will be encoded as 0, 1, 2,
-# etc., and treated as continuous features.
+# Next, we create a pipeline that treats categorical features as ordered
+# quantities, i.e. the categories are encoded as 0, 1, 2, etc., and treated as
+# continuous features.
 
 import numpy as np
 
@@ -132,10 +135,6 @@
         make_column_selector(dtype_include="category"),
     ),
     remainder="passthrough",
-    # Use short feature names to make it easier to specify the categorical
-    # variables in the HistGradientBoostingRegressor in the next step
-    # of the pipeline.
-    verbose_feature_names_out=False,
 )
 
 hist_ordinal = make_pipeline(
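
Assembled from the hunk above, the ordinal pipeline after this change reads
roughly as follows; the `OrdinalEncoder` options are assumptions consistent
with the narration:

    import numpy as np

    from sklearn.compose import make_column_selector, make_column_transformer
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OrdinalEncoder

    # Encode categories as 0, 1, 2, ...; categories unseen during fit are
    # mapped to NaN, which HistGradientBoostingRegressor handles natively.
    ordinal_encoder = make_column_transformer(
        (
            OrdinalEncoder(
                handle_unknown="use_encoded_value", unknown_value=np.nan
            ),
            make_column_selector(dtype_include="category"),
        ),
        remainder="passthrough",
    )

    hist_ordinal = make_pipeline(
        ordinal_encoder, HistGradientBoostingRegressor(random_state=42)
    )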
@@ -147,14 +146,23 @@
 # Gradient boosting estimator with native categorical support
 # -----------------------------------------------------------
 # We now create a :class:`~ensemble.HistGradientBoostingRegressor` estimator
-# that will natively handle categorical features. This estimator will not treat
-# categorical features as ordered quantities. We set
-# `categorical_features="from_dtype"` such that features with categorical dtype
-# are considered categorical features.
+# that can natively handle categorical features without explicit encoding. Such
+# functionality can be enabled by setting `categorical_features="from_dtype"`,
+# which automatically detects features with categorical dtypes, or more
+# explicitly by `categorical_features=categorical_columns_subset`.
+#
+# Unlike previous encoding approaches, the estimator natively deals with the
+# categorical features. At each split, it partitions the categories of such a
+# feature into disjoint sets using a heuristic that sorts them by their effect
+# on the target variable, see `Split finding with categorical features
+# <https://scikit-learn.org/stable/modules/ensemble.html#split-finding-with-categorical-features>`_
+# for details.
 #
-# The main difference between this estimator and the previous one is that in
-# this one, we let the :class:`~ensemble.HistGradientBoostingRegressor` detect
-# which features are categorical from the DataFrame columns' dtypes.
+# While ordinal encoding may work well for low-cardinality features even if
+# categories have no natural order, reaching meaningful splits requires deeper
+# trees as the cardinality increases. The native categorical support avoids this
+# by directly working with unordered categories. The advantage over one-hot
+# encoding is the omitted preprocessing and faster fit and predict time.
 
 hist_native = HistGradientBoostingRegressor(
     random_state=42, categorical_features="from_dtype"
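
The explicit alternative mentioned in the new comment would look like the
sketch below; the column names in `categorical_columns_subset` are
hypothetical:

    # Hypothetical subset of columns known to be categorical.
    categorical_columns_subset = ["MS_Zoning", "Neighborhood", "Exter_Qual"]

    hist_native_explicit = HistGradientBoostingRegressor(
        random_state=42, categorical_features=categorical_columns_subset
    )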
@@ -167,7 +175,7 @@
 # Here we use :term:`cross validation` to compare the models performance in
 # terms of :func:`~metrics.mean_absolute_percentage_error` and fit times. In the
 # upcoming plots, error bars represent 1 standard deviation as computed across
-# folds.
+# cross-validation splits.
 
 from sklearn.model_selection import cross_validate
 
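
A sketch of the comparison described here, assuming `X`, `y` and the four
estimators from the earlier steps (`hist_dropped`, the name of the baseline
pipeline that drops the categorical columns, is an assumption):

    from sklearn.model_selection import cross_validate

    scoring = "neg_mean_absolute_percentage_error"
    results = {
        name: cross_validate(est, X, y, cv=5, scoring=scoring)
        for name, est in [
            ("Dropped", hist_dropped),
            ("One Hot", hist_one_hot),
            ("Ordinal", hist_ordinal),
            ("Native", hist_native),
        ]
    }
    # Each value holds "fit_time" and "test_score" arrays, one entry per
    # cross-validation split, from which the plotted means and standard
    # deviations are computed.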

@@ -258,18 +266,18 @@ def plot_performance_tradeoff(results, title):
 # down-left corner, as indicated by the arrow. Those models would indeed
 # correspond to faster fitting and lower error.
 #
-# We see that the model with one-hot-encoded data is by far the slowest. This
-# is to be expected, since one-hot-encoding creates one additional feature per
-# category value (for each categorical feature), and thus more split points
-# need to be considered during fitting. In theory, we expect the native
-# handling of categorical features to be slightly slower than treating
-# categories as ordered quantities ('Ordinal'), since native handling requires
-# :ref:`sorting categories <categorical_support_gbdt>`. Fitting times should
-# however be close when the number of categories is small, and this may not
-# always be reflected in practice.
+# The model using one-hot encoded data is the slowest. This is to be expected,
+# as one-hot encoding creates an additional feature for each category value of
+# every categorical feature, greatly increasing the number of split candidates
+# during training. In theory, we expect the native handling of categorical
+# features to be slightly slower than treating categories as ordered quantities
+# ('Ordinal'), since native handling requires :ref:`sorting categories
+# <categorical_support_gbdt>`. Fitting times should however be close when the
+# number of categories is small, and this may not always be reflected in
+# practice.
 #
-# In terms of prediction performance, dropping the categorical features leads
-# to poorer performance. The three models that use categorical features have
+# In terms of prediction performance, dropping the categorical features leads to
+# the worst performance. The three models that use categorical features have
 # comparable error rates, with a slight edge for the native handling.
 
 # %%
@@ -322,8 +330,9 @@ def plot_performance_tradeoff(results, title):
 )
 
 # %%
-# The results for these under-fitting models confirm our previous intuition:
-# the native category handling strategy performs the best when the splitting
-# budget is constrained. The two other strategies (one-hot encoding and
-# treating categories as ordinal values) lead to error values comparable
-# to the baseline model that just dropped the categorical features altogether.
+# The results for these underfitting models confirm our previous intuition: the
+# native category handling strategy performs the best when the splitting budget
+# is constrained. The two explicit encoding strategies (one-hot and ordinal
+# encoding) lead to slightly larger errors than the estimator's native handling,
+# but still perform better than the baseline model that just dropped the
+# categorical features altogether.
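
One way to constrain the splitting budget consistent with this discussion is
to cap the tree size and the number of boosting iterations; the exact values
below are assumptions, not taken from the example:

    # Assumed caps for illustration: shallow trees, few boosting iterations.
    underfit_native = HistGradientBoostingRegressor(
        random_state=42,
        categorical_features="from_dtype",
        max_depth=3,  # limits the number of splits per tree
        max_iter=15,  # limits the number of trees
    )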
