ENH Automatic handling of categorical columns in Hist Gradient Boosting models #24907
Conversation
```
@@ -272,6 +266,5 @@ def plot_results(figure_title):
# %%
# The results for these under-fitting models confirm our previous intuition:
# the native category handling strategy performs the best when the splitting
# budget is constrained. The two other strategies (one-hot encoding and
# treating categories as ordinal values) lead to error values comparable
# to the baseline model that just dropped the categorical features altogether.
```
This remark was no longer true since the example was made to run on a subset of the columns to make it run faster. Compare the last plot of:
and:
So I fixed this as part of the PR (along with a simplification of the column indexing code).
```python
# Explicitly type the categorical columns as such.
object_dtyped_columns = X.select_dtypes(include=["object"]).columns
X[object_dtyped_columns] = X[object_dtyped_columns].astype("category")
```
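For context, a minimal sketch of what this conversion does on a toy dataframe (the column names and values are made up for illustration, they are not from the example's dataset):

```python
import pandas as pd

X = pd.DataFrame({"pet": ["dog", "cat", "dog"], "age": [2, 5, 7]})
print(X.dtypes)  # pet: object, age: int64

# Select the string-typed columns and retype them as pandas categoricals.
object_dtyped_columns = X.select_dtypes(include=["object"]).columns
X[object_dtyped_columns] = X[object_dtyped_columns].astype("category")

print(X.dtypes)                 # pet: category, age: int64
print(X["pet"].cat.categories)  # Index(['cat', 'dog'], dtype='object')
```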
I changed this because of: https://github.com/scikit-learn/scikit-learn/pull/24907/files#r1020949896
Note: there is no backward compat to handle here because calling the fit method of an HGBRT model directly on a non-numerical dataframe would have raised an exception.
@ogrisel Thanks for working on this issue. At the moment, I'm short on review time. One thought that pops up (see discussion in the issue): this PR makes the ColumnTransformer unnecessary for many use cases. Another possibility would be to automatically apply an OrdinalEncoder (OE) in a pipeline and pass on the info about which features were OE-encoded. (Is this now possible with pandas?) What are the pros and cons of each approach?
```diff
@@ -1240,7 +1286,8 @@ class HistGradientBoostingRegressor(RegressorMixin, BaseHistGradientBoosting):
     or shape (n_categorical_features,), default=None
         Indicates the categorical features.
 
-        - None : no feature will be considered categorical.
+        - None : no feature will be considered categorical unless explicitly
```
I am wondering if we should keep `None` as-is and have an option `"auto"` that could become the default in 2 versions. `None` could be a bit less explicit than a str option (I proposed `"auto"` but it could be something more linked to the "categorical" dtype from pandas). Just a thought.
We could as well avoid the delay of 2 versions since I don't think we have a corner case where our users could be surprised.
As I said here (#24907 (comment)), since this case would previously raise an exception, I don't really see the point of trying to enforce backward compat.
I think that this is indeed an expected feature for people using `LightGBM` and `XGBoost`. I will review this PR tomorrow.
Not a real con, but we had something similar with
This is possible in the sense that you can type the categorical columns explicitly in pandas with `.astype("category")`.
I don't see the point of introducing an OE in a pipeline that would convert a categorical column into the same categorical column. In any case, the HGBDT models drop the meaning of the categories. The only problem I see now is that in
```python
# dtype to encode the categories.
# XXX: using integers would be more efficient but would
# make the code more complex.
codes = X[col].values.codes.astype(np.float64)
```
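As a side note, a tiny illustrative snippet (toy data, not the PR's code) of what `.values.codes` yields, including the `-1` code that pandas uses for missing values:

```python
import numpy as np
import pandas as pd

col = pd.Series(pd.Categorical(["dog", None, "cat"]))
print(col.cat.categories)  # Index(['cat', 'dog'], dtype='object')

# .values is the underlying Categorical; .codes are integer positions into
# .categories, with -1 marking missing values.
codes = col.values.codes.astype(np.float64)
print(codes)  # [ 1. -1.  0.]
```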
Here are my thoughts on using pandas' `codes`:

1. Mismatched categories in terms of order: I think it is important to check that the categorical dtypes are the same during `fit` and `transform`, otherwise the `codes` encoding may represent different categories. For example, if the fit-time categorical dtype is `['dog', 'cat', 'turtle']` and the predict-time dtype is `['cat', 'turtle', 'dog']`, then the codes `[0, 1, 2]` will mean different things.

2. Unknown categories: In the following, "bear" is unknown:

   ```python
   X_train["pet"].cat.categories
   # ["cat", "dog", "snake", ...]
   X_test["pet"].cat.categories
   # ["cat", "bear", "dog", "snake", ...]
   ```

   During train time `"dog"` is 1, but during predict time `"bear"` is 1. We can do some preprocessing here to map the unknown categories to missing, although the simplest solution is to check that the categorical dtypes are the same during `fit` and `predict`.

3. Infrequent categories: In the following, "bear" is infrequent and happened to only be in the train set:

   ```python
   X_train["pet"].cat.categories
   # ["cat", "bear", "dog", "snake", ...]
   X_test["pet"].cat.categories
   # ["cat", "dog", "snake", ...]
   ```

   We can handle this with some preprocessing to correctly remap the codes during test to the ones seen in train. Again, the simplest solution is to check that the categorical dtypes are the same during `fit` and `predict`.

TLDR: I think checking that the categories match during `fit` and `predict` is the simplest way to work around all three situations.
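To make the first pitfall concrete, here is a small illustrative snippet (toy data, not from the PR): identical values get different codes when the category order differs, and comparing dtypes alone does not catch it, because equality of unordered `CategoricalDtype`s ignores category order.

```python
import pandas as pd

train = pd.Series(pd.Categorical(["dog", "cat"], categories=["dog", "cat", "turtle"]))
test = pd.Series(pd.Categorical(["dog", "cat"], categories=["cat", "turtle", "dog"]))

print(train.cat.codes.tolist())  # [0, 1]
print(test.cat.codes.tolist())   # [2, 0] -- same values, different codes

# Caveat: for unordered categoricals, dtype equality ignores category order,
# so comparing dtypes alone would miss this mismatch.
print(train.dtype == test.dtype)  # True
print(list(train.cat.categories) == list(test.cat.categories))  # False
```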
> TLDR: I think checking that the categories match during fit and predict is the simplest way to work around all three situations.

I agree. I will update this PR later this week. Or feel free to push to this PR directly if you prefer, as I don't plan to work on this today or tomorrow.
I am on vacation this week, so I am trying my best not to write code. I will be available next week.
I added the consistency check for pandas categorical features here: 00f6a18 (#24907)
Maybe we could also raise a
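For concreteness, a hypothetical sketch of what such a fit/predict consistency check could look like; the function name and error message are illustrative guesses, not the PR's actual code:

```python
import pandas as pd

def check_categories_match(fitted_categories: pd.Index, X_test: pd.DataFrame, col: str) -> None:
    """Raise if the categories of ``col`` differ from those seen at fit time.

    Comparing the categories index directly (rather than the dtypes) also
    catches order mismatches, which change the meaning of the codes.
    """
    seen = X_test[col].cat.categories
    if not seen.equals(fitted_categories):  # Index.equals is order-sensitive
        raise ValueError(
            f"Categories of column {col!r} changed between fit and predict: "
            f"{list(fitted_categories)} != {list(seen)}"
        )
```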
I agree.
I agree with not doing any encoding based on rare categories in
Thanks for working on this @ogrisel!
I won't have time for a proper review (sorry), so I just looked at the docs.
```python
# Note that this is equivalent to using the ordinal encoder and then passing
# the name of the categorical features to the ``categorical_features``
# constructor parameter of :class:`~ensemble.HistGradientBoostingRegressor`.
```
Some users may not want or be able to rely on pandas, so I wonder if we should still leave an example with the `OrdinalEncoder` somewhere? Perhaps just as a commented-out snippet here?
Also, in the User Guide we currently make a reference to this example, and it probably needs to be updated:

> The cardinality of each categorical feature should be less than the max_bins parameter, and each categorical feature is expected to be encoded in [0, max_bins - 1]. To that end, it might be useful to pre-process the data with an OrdinalEncoder as done in Categorical Feature Support in Gradient Boosting.
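For reference, a sketch of the explicit `OrdinalEncoder` route discussed here, assuming `X` is a pandas dataframe whose categorical columns are typed as `"category"` (the mask construction relies on `make_column_transformer` placing the transformed columns before the passthrough ones):

```python
import numpy as np
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Encode "category"-typed columns as integers; unknown categories seen at
# predict time are mapped to NaN, which HGBDT treats as a missing value.
ordinal_encoder = make_column_transformer(
    (
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan),
        make_column_selector(dtype_include="category"),
    ),
    remainder="passthrough",
)

# The encoded categorical columns come first in the transformed output, so
# the categorical mask is True for the first n_categorical positions.
n_categorical = X.select_dtypes(include="category").shape[1]
categorical_mask = [i < n_categorical for i in range(X.shape[1])]

model = make_pipeline(
    ordinal_encoder,
    HistGradientBoostingRegressor(categorical_features=categorical_mask),
)
```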
Good points, thanks.
I pushed 75b1afe to address this point.
Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>
The flake8 issue is unrelated and fixed in #25017.
We won't have time to finish the review on this one before the 1.2 release. Moving it to 1.3.
Left field question: why not use
This PR is trying to use the encoding from
Overall, I am okay with using a
Thanks for explaining. I didn't realise that reusing the categories from the dataframe would save a lot of runtime. From a quick look at the diff in #15396 it seems like it is a "non trivial amount of code", which makes me think investing effort to get that PR merged and then use the
I sneakily pushed 787501d while not testing for this slight behavioral change. It's on my todo list to add proper testing for edge cases.
Co-authored-by: Tim Head <betatim@gmail.com>
As much as I would like to have this PR's feature as fast as possible in the next release, this sounds like a reasonable plan to me.
Closing in favor of #26411, which is much cleaner.
More UX improvements related to categorical features in HGBDT, following #24889.
Related to:
This PR makes HGBDT models automatically detect categorical columns of an input pandas dataframe and ordinal-encode those columns internally.
See the tests and examples to see the kind of code simplification this leads to.
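A minimal sketch of the usage this would enable (toy data; under the behavior proposed in this PR, no `ColumnTransformer` and no explicit `categorical_features` argument is needed):

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

X = pd.DataFrame(
    {
        "pet": pd.Categorical(["dog", "cat", "dog", "snake"]),
        "age": [2.0, 5.0, 7.0, 1.0],
    }
)
y = [10.0, 3.0, 9.0, 2.0]

# The "category"-typed column is detected and ordinal-encoded internally.
model = HistGradientBoostingRegressor(min_samples_leaf=1).fit(X, y)
print(model.predict(X))
```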
Note: the `_ordinal_encode_df` method could logically be collapsed into a `_validate_data` method that overrides the one inherited from `BaseEstimator` because it always needs to be called first. However, at fit time it also returns an additional `categorical_features` array, so it does not really match the API of `_validate_data`.

Alternatively I could introduce a `_validate_hist_gradient_boosting_data` method that internally calls both `self._ordinal_encode_df` and `self._validate_data` and returns the `categorical_features` array when needed. Let me know what you prefer.
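For concreteness, a hypothetical sketch of that second option; the body and signature are illustrative guesses, not the PR's actual code:

```python
import numpy as np

def _validate_hist_gradient_boosting_data(self, X, y, reset=True):
    """Ordinal-encode pandas categorical columns, then run standard validation.

    Illustrative only: ``_ordinal_encode_df`` is the helper introduced in this
    PR and ``_validate_data`` is the ``BaseEstimator`` method it wraps.
    """
    categorical_features = None
    if hasattr(X, "dtypes"):  # duck-type check for a pandas DataFrame
        X, categorical_features = self._ordinal_encode_df(X)
    X, y = self._validate_data(X, y, dtype=[np.float64, np.float32], reset=reset)
    return X, y, categorical_features
```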
Final note: it would probably be possible to adapt this code to make it work for any dataframe object that implements the `__dataframe__` protocol instead of just pandas, but I think we should tackle support for `__dataframe__` objects in a dedicated PR instead.

I milestoned this as 1.2 but if you think this is rushing things too much we can remilestone to 1.3.