5 | 5 |
6 | 6 | .. currentmodule:: sklearn
7 | 7 |
8 | | -In this example, we will compare the training times and prediction
9 | | -performances of :class:`~ensemble.HistGradientBoostingRegressor` with
10 | | -different encoding strategies for categorical features. In
11 | | -particular, we will evaluate:
| 8 | +In this example, we compare the training times and prediction performances of
| 9 | +:class:`~ensemble.HistGradientBoostingRegressor` with different encoding
| 10 | +strategies for categorical features. In particular, we evaluate:
12 | 11 |
13 | 12 | - "Dropped": dropping the categorical features;
14 | 13 | - "One Hot": using a :class:`~preprocessing.OneHotEncoder`;
15 | 14 | - "Ordinal": using an :class:`~preprocessing.OrdinalEncoder` and treating
16 | 15 |   categories as ordered, equidistant quantities;
17 | | -- "Native": using an :class:`~preprocessing.OrdinalEncoder` and rely on the
18 | | -  :ref:`native category support <categorical_support_gbdt>` of the
| 16 | +- "Native": relying on the :ref:`native category support
| 17 | +  <categorical_support_gbdt>` of the
19 | 18 |   :class:`~ensemble.HistGradientBoostingRegressor` estimator.
20 | 19 |
21 | | -We will work with the Ames Iowa Housing dataset which consists of numerical
22 | | -and categorical features, where the houses' sales prices is the target.
| 20 | +For this purpose we use the Ames Iowa Housing dataset, which consists of
| 21 | +numerical and categorical features, where the target is the house sale price.
23 | 22 |
24 | 23 | See :ref:`sphx_glr_auto_examples_ensemble_plot_hgbt_regression.py` for an
25 | 24 | example showcasing some other features of
26 | 25 | :class:`~ensemble.HistGradientBoostingRegressor`.
27 | 26 |
| 27 | +See :ref:`sphx_glr_auto_examples_preprocessing_plot_target_encoder.py` for a
| 28 | +comparison of encoding strategies in the presence of high cardinality
| 29 | +categorical features.
| 30 | +
28 | 31 | """
29 | 32 |
30 | 33 | # Authors: The scikit-learn developers
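The hunks that load the data and build the baseline without categorical features are not shown in this diff. Below is a minimal sketch of how that setup might look; the OpenML dataset id, the dtype handling and the variable names are assumptions made for illustration, not code taken from the example.

from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.datasets import fetch_openml
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline

# Assumed OpenML id for the Ames housing data; the example may load it differently.
X, y = fetch_openml(data_id=42165, as_frame=True, return_X_y=True)

# Store string columns with a categorical dtype so that dtype-based selection
# (and `categorical_features="from_dtype"`) can pick them up later.
object_columns = X.select_dtypes(include="object").columns
X[object_columns] = X[object_columns].astype("category")

# "Dropped" baseline: simply discard the categorical columns.
dropper = make_column_transformer(
    ("drop", make_column_selector(dtype_include="category")),
    remainder="passthrough",
)
hist_dropped = make_pipeline(dropper, HistGradientBoostingRegressor(random_state=42))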
97 | 100 | # %%
98 | 101 | # Gradient boosting estimator with one-hot encoding
99 | 102 | # -------------------------------------------------
100 | | -# Next, we create a pipeline that will one-hot encode the categorical features
101 | | -# and let the rest of the numerical data to passthrough:
| 103 | +# Next, we create a pipeline to one-hot encode the categorical features,
| 104 | +# while leaving the remaining numerical features unchanged (`"passthrough"`):
102 | 105 |
103 | 106 | from sklearn.preprocessing import OneHotEncoder
104 | 107 |
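The rest of the one-hot pipeline falls outside this hunk. As a rough sketch of what such a pipeline could look like, assuming the same `make_column_transformer`/`make_pipeline` pattern as the ordinal variant shown further down; the encoder parameters are illustrative assumptions, not the exact code from the example.

from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# One-hot encode only the columns with a categorical dtype; all other columns
# are passed through to the gradient boosting model unchanged.
one_hot_encoder = make_column_transformer(
    (
        OneHotEncoder(sparse_output=False, handle_unknown="ignore"),
        make_column_selector(dtype_include="category"),
    ),
    remainder="passthrough",
)

hist_one_hot = make_pipeline(
    one_hot_encoder, HistGradientBoostingRegressor(random_state=42)
)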
118 | 121 | # %%
119 | 122 | # Gradient boosting estimator with ordinal encoding
120 | 123 | # -------------------------------------------------
121 | | -# Next, we create a pipeline that will treat categorical features as if they
122 | | -# were ordered quantities, i.e. the categories will be encoded as 0, 1, 2,
123 | | -# etc., and treated as continuous features.
| 124 | +# Next, we create a pipeline that treats categorical features as ordered
| 125 | +# quantities, i.e. the categories are encoded as 0, 1, 2, etc., and treated as
| 126 | +# continuous features.
124 | 127 |
125 | 128 | import numpy as np
126 | 129 |
132 | 135 |         make_column_selector(dtype_include="category"),
133 | 136 |     ),
134 | 137 |     remainder="passthrough",
135 | | -    # Use short feature names to make it easier to specify the categorical
136 | | -    # variables in the HistGradientBoostingRegressor in the next step
137 | | -    # of the pipeline.
138 | | -    verbose_feature_names_out=False,
139 | 138 | )
140 | 139 |
141 | 140 | hist_ordinal = make_pipeline(
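Only a fragment of the ordinal transformer is visible above. Here is a hedged sketch of how the full transformer and pipeline around that fragment might look; the `OrdinalEncoder` parameters follow the common pattern of mapping unknown categories to NaN and are an assumption, not a verbatim copy of the example.

import numpy as np

from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = make_column_transformer(
    (
        # Unknown categories seen at predict time are encoded as NaN, which the
        # downstream HistGradientBoostingRegressor treats as missing values.
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan),
        make_column_selector(dtype_include="category"),
    ),
    remainder="passthrough",
)

hist_ordinal = make_pipeline(
    ordinal_encoder, HistGradientBoostingRegressor(random_state=42)
)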
147 | 146 | # Gradient boosting estimator with native categorical support
148 | 147 | # -----------------------------------------------------------
149 | 148 | # We now create a :class:`~ensemble.HistGradientBoostingRegressor` estimator
150 | | -# that will natively handle categorical features. This estimator will not treat
151 | | -# categorical features as ordered quantities. We set
152 | | -# `categorical_features="from_dtype"` such that features with categorical dtype
153 | | -# are considered categorical features.
| 149 | +# that can natively handle categorical features without explicit encoding. Such
| 150 | +# functionality can be enabled by setting `categorical_features="from_dtype"`,
| 151 | +# which automatically detects features with categorical dtypes, or, more
| 152 | +# explicitly, by passing `categorical_features=categorical_columns_subset`.
| 153 | +#
| 154 | +# Unlike previous encoding approaches, the estimator natively deals with the
| 155 | +# categorical features. At each split, it partitions the categories of such a
| 156 | +# feature into disjoint sets using a heuristic that sorts them by their effect
| 157 | +# on the target variable; see `Split finding with categorical features
| 158 | +# <https://scikit-learn.org/stable/modules/ensemble.html#split-finding-with-categorical-features>`_
| 159 | +# for details.
154 | 160 | #
155 | | -# The main difference between this estimator and the previous one is that in
156 | | -# this one, we let the :class:`~ensemble.HistGradientBoostingRegressor` detect
157 | | -# which features are categorical from the DataFrame columns' dtypes.
| 161 | +# While ordinal encoding may work well for low-cardinality features even if
| 162 | +# categories have no natural order, reaching meaningful splits requires deeper
| 163 | +# trees as the cardinality increases. The native categorical support avoids this
| 164 | +# by directly working with unordered categories. The advantage over one-hot
| 165 | +# encoding is that it requires no preprocessing and is faster to fit and predict.
158 | 166 |
159 | 167 | hist_native = HistGradientBoostingRegressor(
160 | 168 |     random_state=42, categorical_features="from_dtype"
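As a small, hypothetical illustration of the more explicit spelling mentioned above: the variable `categorical_columns_subset` and the way it is derived here are assumptions, not code from the diff.

from sklearn.ensemble import HistGradientBoostingRegressor

# Assuming `X` is the Ames feature dataframe with categorical dtypes set, an
# explicit list of column names can be passed instead of "from_dtype".
categorical_columns_subset = X.select_dtypes(include="category").columns.tolist()

hist_native_explicit = HistGradientBoostingRegressor(
    random_state=42, categorical_features=categorical_columns_subset
)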
167 | 175 | # Here we use :term:`cross validation` to compare the models' performance in
168 | 176 | # terms of :func:`~metrics.mean_absolute_percentage_error` and fit times. In the
169 | 177 | # upcoming plots, error bars represent 1 standard deviation as computed across
170 | | -# folds.
| 178 | +# cross-validation splits.
171 | 179 |
172 | 180 | from sklearn.model_selection import cross_validate
173 | 181 |
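The benchmarking loop itself is not part of the shown hunks. A rough sketch of how the scores and fit times could be collected with `cross_validate`, assuming `X`, `y` and the four estimators (`hist_dropped`, `hist_one_hot`, `hist_ordinal`, `hist_native`) are defined as in the sketches above; the number of folds is an assumption.

from sklearn.model_selection import cross_validate

scoring = "neg_mean_absolute_percentage_error"

results = {}
for name, estimator in [
    ("Dropped", hist_dropped),
    ("One Hot", hist_one_hot),
    ("Ordinal", hist_ordinal),
    ("Native", hist_native),
]:
    cv_result = cross_validate(estimator, X, y, scoring=scoring, cv=5)
    results[name] = {
        # cross_validate reports per-split fit times and negated MAPE scores.
        "fit_time": cv_result["fit_time"],
        "mape": -cv_result["test_score"],
    }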
@@ -258,18 +266,18 @@ def plot_performance_tradeoff(results, title):
258 | 266 | # down-left corner, as indicated by the arrow. Those models would indeed
259 | 267 | # correspond to faster fitting and lower error.
260 | 268 | #
261 | | -# We see that the model with one-hot-encoded data is by far the slowest. This
262 | | -# is to be expected, since one-hot-encoding creates one additional feature per
263 | | -# category value (for each categorical feature), and thus more split points
264 | | -# need to be considered during fitting. In theory, we expect the native
265 | | -# handling of categorical features to be slightly slower than treating
266 | | -# categories as ordered quantities ('Ordinal'), since native handling requires
267 | | -# :ref:`sorting categories <categorical_support_gbdt>`. Fitting times should
268 | | -# however be close when the number of categories is small, and this may not
269 | | -# always be reflected in practice.
| 269 | +# The model using one-hot encoded data is the slowest. This is to be expected,
| 270 | +# as one-hot encoding creates an additional feature for each category value of
| 271 | +# every categorical feature, greatly increasing the number of split candidates
| 272 | +# during training. In theory, we expect the native handling of categorical
| 273 | +# features to be slightly slower than treating categories as ordered quantities
| 274 | +# ('Ordinal'), since native handling requires :ref:`sorting categories
| 275 | +# <categorical_support_gbdt>`. Fitting times should however be close when the
| 276 | +# number of categories is small, and this may not always be reflected in
| 277 | +# practice.
270 | 278 | #
271 | | -# In terms of prediction performance, dropping the categorical features leads
272 | | -# to poorer performance. The three models that use categorical features have
| 279 | +# In terms of prediction performance, dropping the categorical features leads to
| 280 | +# the worst performance. The three models that use categorical features have
273 | 281 | # comparable error rates, with a slight edge for the native handling.
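The `plot_performance_tradeoff` helper referenced in the hunk headers is not shown. A hedged sketch of a similar fit-time versus error plot, assuming the `results` dictionary from the cross-validation sketch above; the styling is purely illustrative.

import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
for name, res in results.items():
    # One point per model: mean fit time vs. mean MAPE, with 1-std error bars
    # computed across the cross-validation splits.
    ax.errorbar(
        np.mean(res["fit_time"]),
        np.mean(res["mape"]),
        xerr=np.std(res["fit_time"]),
        yerr=np.std(res["mape"]),
        fmt="o",
        label=name,
    )
ax.set_xlabel("Mean fit time (s)")
ax.set_ylabel("Mean absolute percentage error")
ax.legend()
plt.show()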
274 | 282 |
275 | 283 | # %%
@@ -322,8 +330,9 @@ def plot_performance_tradeoff(results, title):
322 | 330 | )
323 | 331 |
324 | 332 | # %%
325 | | -# The results for these under-fitting models confirm our previous intuition:
326 | | -# the native category handling strategy performs the best when the splitting
327 | | -# budget is constrained. The two other strategies (one-hot encoding and
328 | | -# treating categories as ordinal values) lead to error values comparable
329 | | -# to the baseline model that just dropped the categorical features altogether.
| 333 | +# The results for these underfitting models confirm our previous intuition: the
| 334 | +# native category handling strategy performs the best when the splitting budget
| 335 | +# is constrained. The two explicit encoding strategies (one-hot and ordinal
| 336 | +# encoding) lead to slightly larger errors than the estimator's native handling,
| 337 | +# but still perform better than the baseline model that just dropped the
| 338 | +# categorical features altogether.
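How the splitting budget is constrained is not visible in this hunk. One plausible way to build such deliberately underfitting variants, with parameter values that are assumptions rather than the example's actual settings:

# Reduce the boosting budget so that each model underfits; 3 and 15 are
# illustrative values, not necessarily those used in the example.
hist_native.set_params(max_depth=3, max_iter=15)
for pipeline in (hist_dropped, hist_one_hot, hist_ordinal):
    pipeline.set_params(
        histgradientboostingregressor__max_depth=3,
        histgradientboostingregressor__max_iter=15,
    )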