ENH Automatic handling of categorical columns in Hist Gradient Boosting models #24907
Closed
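For context, here is a minimal sketch of the usage this PR aims to enable, based on the example diff further down this page: a dataframe whose categorical columns carry the pandas "category" dtype is passed directly to HistGradientBoostingRegressor, with no OrdinalEncoder step and no explicit categorical_features argument. The toy data and column names below are made up, and running it as-is assumes a scikit-learn version that detects the category dtype automatically (in later scikit-learn releases a similar behavior is exposed as categorical_features="from_dtype").

import pandas as pd

from sklearn.ensemble import HistGradientBoostingRegressor

# Made-up toy data; "BldgType" uses the pandas categorical dtype.
X = pd.DataFrame(
    {
        "BldgType": pd.Series(
            ["1Fam", "TwnhsE", "1Fam", "Duplex", "1Fam"], dtype="category"
        ),
        "LotArea": [8450, 9600, 11250, 9550, 14260],
    }
)
y = [208500.0, 181500.0, 223500.0, 140000.0, 250000.0]

# With the enhancement discussed in this PR, the estimator picks up the
# "category" dtyped column on its own: no OrdinalEncoder step and no
# explicit categorical_features argument.
model = HistGradientBoostingRegressor(random_state=42)
model.fit(X, y)
print(model.predict(X))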
Commits (16)
39b3aaa  Automatically handle categorical columns of pandas dataframes in HGBRT (ogrisel)
753c101  Simplify examples by using the enhanced GBRT models (ogrisel)
beb397f  Changelog (ogrisel)
4aec8f2  Update examples/ensemble/plot_gradient_boosting_categorical.py (ogrisel)
b8e0f1f  Merge remote-tracking branch 'upstream/main' into pr/24907 (thomasjpfan)
098895e  STY Fixes spelling (thomasjpfan)
00f6a18  ENH Check features with pandas categoricals are consistent with categ… (thomasjpfan)
9ea3b80  more specific comment (ogrisel)
b87b3b3  Merge branch 'main' into pandas-categorical-gbrt (ogrisel)
2c9f172  docstring improvements + treat all negatives as missing as stated in … (ogrisel)
f41fa86  CLN Only make one dataframe copy (thomasjpfan)
5fa30ec  Merge branch 'main' into pandas-categorical-gbrt (ogrisel)
75b1afe  Update user guide and linked example to match (ogrisel)
787501d  Always ordinal encoded category columns while always respecting the c… (ogrisel)
645154c  Improve comment on auto mode in plot_cyclical_feature_engineering.py (ogrisel)
05535f7  Update doc/modules/ensemble.rst (ogrisel)
examples/ensemble/plot_gradient_boosting_categorical.py

@@ -33,7 +33,8 @@
 X, y = fetch_openml(data_id=42165, as_frame=True, return_X_y=True, parser="pandas")
 
 # Select only a subset of features of X to make the example faster to run
-categorical_columns_subset = [
+column_subset = [
+    # Categorical features:
     "BldgType",
     "GarageFinish",
     "LotConfig",
@@ -44,9 +45,7 @@
     "ExterCond",
     "ExterQual",
     "PoolQC",
-]
-
-numerical_columns_subset = [
+    # Numerical features:
     "3SsnPorch",
     "Fireplaces",
     "BsmtHalfBath",
@@ -59,8 +58,12 @@
     "ScreenPorch",
 ]
 
-X = X[categorical_columns_subset + numerical_columns_subset]
-X[categorical_columns_subset] = X[categorical_columns_subset].astype("category")
+# Comment the line below to run the example on the full dataset:
+X = X[column_subset]
+
+# Explicitly type the categorical columns as such.
+object_dtyped_columns = X.select_dtypes(include=["object"]).columns
+X[object_dtyped_columns] = X[object_dtyped_columns].astype("category")
 
 categorical_columns = X.select_dtypes(include="category").columns
 n_categorical_features = len(categorical_columns)
@@ -123,10 +126,6 @@
         make_column_selector(dtype_include="category"),
     ),
     remainder="passthrough",
-    # Use short feature names to make it easier to specify the categorical
-    # variables in the HistGradientBoostingRegressor in the next step
-    # of the pipeline.
-    verbose_feature_names_out=False,
 )
 
 hist_ordinal = make_pipeline(
@@ -140,24 +139,35 @@
 # that will natively handle categorical features. This estimator will not treat
 # categorical features as ordered quantities.
 #
-# Since the :class:`~ensemble.HistGradientBoostingRegressor` requires category
-# values to be encoded in `[0, n_unique_categories - 1]`, we still rely on an
-# :class:`~preprocessing.OrdinalEncoder` to pre-process the data.
-#
-# The main difference between this pipeline and the previous one is that in
-# this one, we let the :class:`~ensemble.HistGradientBoostingRegressor` know
-# which features are categorical.
+# To benefit from this, one option is to encode the categorical features using the
+# pandas categorical dtype which we already did at the beginning of this
+# example with the call to `.astype("category")`.
+hist_native = HistGradientBoostingRegressor(random_state=42)
 
-# The ordinal encoder will first output the categorical features, and then the
-# continuous (passed-through) features
+# %%
+# Note that this is equivalent to using the ordinal encoder that output pandas
+# dataframe with unchanged column names and then passing the name of the
+# categorical features to the ``categorical_features`` constructor parameter of
+# :class:`~ensemble.HistGradientBoostingRegressor`:
+
+ordinal_encoder = make_column_transformer(
+    (
+        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan),
+        categorical_columns,
+    ),
+    remainder="passthrough",
+    # Use short feature names to make it easier to specify the categorical
+    # variables in the HistGradientBoostingRegressor in the next step
+    # of the pipeline.
+    verbose_feature_names_out=False,
+).set_output(transform="pandas")
 
-hist_native = make_pipeline(
+hist_native2 = make_pipeline(
     ordinal_encoder,
     HistGradientBoostingRegressor(
-        random_state=42,
-        categorical_features=categorical_columns,
+        categorical_features=categorical_columns, random_state=42
     ),
-).set_output(transform="pandas")
+)
 
 # %%
 # Model comparison
@@ -254,11 +264,12 @@ def plot_results(figure_title):
 # we artificially limit the total number of splits by both limiting the number
 # of trees and the depth of each tree.
 
-for pipe in (hist_dropped, hist_one_hot, hist_ordinal, hist_native):
+for pipe in (hist_dropped, hist_one_hot, hist_ordinal):
     pipe.set_params(
         histgradientboostingregressor__max_depth=3,
         histgradientboostingregressor__max_iter=15,
     )
+hist_native.set_params(max_depth=3, max_iter=15)
 
 dropped_result = cross_validate(hist_dropped, X, y, cv=n_cv_folds, scoring=scoring)
 one_hot_result = cross_validate(hist_one_hot, X, y, cv=n_cv_folds, scoring=scoring)
@@ -272,6 +283,5 @@ def plot_results(figure_title):
 # %%
 # The results for these under-fitting models confirm our previous intuition:
 # the native category handling strategy performs the best when the splitting
-# budget is constrained. The two other strategies (one-hot encoding and
-# treating categories as ordinal values) lead to error values comparable
-# to the baseline model that just dropped the categorical features altogether.
+# budget is constrained. Note that this effect is even more pronounced when
+# we include all the features from the original dataset.

Review comment on the lines removed in this hunk:

This remark was no longer true since the example was made to run on a subset of the columns to make it run faster. Compare the last plot of: and: So I fixed this as part of the PR (along with a simplification of the column indexing code).
A further review comment: I changed this because of: https://github.com/scikit-learn/scikit-learn/pull/24907/files#r1020949896
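As a reading aid (not part of the PR diff), here is the equivalence described in the example hunk above, condensed into one self-contained sketch. The toy data is the same made-up frame as in the sketch near the top of this page, and option 1 assumes the same automatic dtype detection discussed there; per the example's note, both estimators end up treating "BldgType" as a categorical feature.

import numpy as np
import pandas as pd

from sklearn.compose import make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Made-up toy data standing in for the Ames housing columns of the example.
X = pd.DataFrame(
    {
        "BldgType": pd.Series(
            ["1Fam", "TwnhsE", "1Fam", "Duplex", "1Fam"], dtype="category"
        ),
        "LotArea": [8450, 9600, 11250, 9550, 14260],
    }
)
y = [208500.0, 181500.0, 223500.0, 140000.0, 250000.0]
categorical_columns = X.select_dtypes(include="category").columns

# Option 1 (what this PR enables): rely on the "category" dtype of the columns.
hist_native = HistGradientBoostingRegressor(random_state=42).fit(X, y)

# Option 2 (spelled out, as in the example's note): ordinal-encode the
# categorical columns while keeping the original column names, then name the
# categorical features explicitly.
ordinal_encoder = make_column_transformer(
    (
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan),
        categorical_columns,
    ),
    remainder="passthrough",
    # Keep short feature names so the names passed to categorical_features
    # below still match the transformed output.
    verbose_feature_names_out=False,
).set_output(transform="pandas")

hist_native2 = make_pipeline(
    ordinal_encoder,
    HistGradientBoostingRegressor(
        categorical_features=categorical_columns, random_state=42
    ),
).fit(X, y)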