ENH Adds native pandas categorical support to gradient boosting #26411
Conversation
Could you please update the existing examples to show the benefits of using this PR? I think the following 3 examples would benefit from this treatment:
There is also […]
Thank you @thomasjpfan. I think this PR should significantly improve the usability of Hist Gradient Boosting models when dealing with mixed categorical and numerical features, which is a very common use case in practice; a usage sketch follows below. Here are a few discussion points.
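To illustrate, here is a minimal usage sketch (not from the thread; it assumes a scikit-learn version that ships the `"from_dtype"` option this PR converged on, and the column names are made up):

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier

# A dataframe mixing a numerical column with a category-dtyped column.
X = pd.DataFrame(
    {
        "age": [25, 32, 47, 51, 38, 29],
        "city": pd.Categorical(["Paris", "Lille", "Paris", "Lyon", "Lille", "Lyon"]),
    }
)
y = [0, 1, 0, 1, 1, 0]

# The estimator picks up "city" as categorical from its dtype; no manual
# mask or column-name list is needed.
clf = HistGradientBoostingClassifier(categorical_features="from_dtype")
clf.fit(X, y)
print(clf.predict(X))
```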
I updated the PR to use […]
I did another pass. In retrospect, let's be conservative about the default behavior and issue a warning when the input dataframe has categorical columns and `categorical_features` has been left at its default value of `categorical_features="warn"`; a sketch of the intended user-facing behavior follows below.
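For illustration, a hedged sketch of how that transitional behavior could surface for users (the exact warning class, message, and default handling depend on the merged deprecation path):

```python
import warnings

import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

X = pd.DataFrame({"f_cat": pd.Categorical([0, 1, 0, 1, 0, 1])})
y = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

# Leaving categorical_features at its default while the dataframe has a
# category-dtyped column is expected to warn about the upcoming change...
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    HistGradientBoostingRegressor().fit(X, y)
print([str(w.message) for w in caught if issubclass(w.category, FutureWarning)])

# ...while setting the parameter explicitly keeps the fit silent.
HistGradientBoostingRegressor(categorical_features="from_dtype").fit(X, y)
```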
Other than that, here is a bit more feedback to improve the test.
I think the examples would have benefited from being updated as part of this PR, to make sure we get the UX right from the users' point of view. See: #26411 (comment)
Other than that, LGTM.
@lorentzenchr any opinion on this PR in general and the discussion on the default behavior in particular?
Thanks for the last batch of updates. LGTM besides the following sphinx-gallery formatting problem:
Overall looking very good.
@thomasjpfan Could you add a test with `pd.Categorical(["a", "b"], categories=["a", "b"])` and `pd.Categorical(["a", "b"], categories=["b", "a"])`? A sketch of what that test could look like follows below.
Then a question: why not first enable `pd.Categorical` support (#14953, #15396) in `OrdinalEncoder` and then use it here? Will it be possible to replace the code here once we have that?
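A possible shape for that test (a sketch, assuming the `"from_dtype"` option; the test name is made up):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier


def test_category_order_does_not_change_predictions():
    # Same values, same fitted data, but the dtype declares the categories
    # in two different orders; the model should not care.
    rng = np.random.RandomState(0)
    values = rng.choice(["a", "b"], size=100)
    y = (values == "a").astype(int)

    predictions = []
    for categories in (["a", "b"], ["b", "a"]):
        X = pd.DataFrame({"f_cat": pd.Categorical(values, categories=categories)})
        clf = HistGradientBoostingClassifier(
            categorical_features="from_dtype", random_state=0
        )
        predictions.append(clf.fit(X, y).predict(X))

    np.testing.assert_array_equal(predictions[0], predictions[1])
```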
```diff
@@ -1267,6 +1394,8 @@ class HistGradientBoostingRegressor(RegressorMixin, BaseHistGradientBoosting):
             features.
         - str array-like: names of categorical features (assuming the training
           data has feature names).
+        - `"by_dtype"`: Pandas categorical dtypes are considered categorical.
```
How about `"infer_by_dtype"`? It is a bit longer but a bit more explicit. Just for discussion.
I like this. Or maybe `"infer_from_dtype"`.
Or maybe `"from_dtype"`. No strong opinion in retrospect.
"infer" might be a bit misleading if the estimator just selects columns that are explicitly category-dtyped, while more ambiguous dtypes such as str-based or object-based ones are not considered, as discussed below (in #26411 (comment)).
In 673c76d (#26411), I updated the keyword argument to `"from_dtype"`.
```diff
@@ -1267,6 +1394,8 @@ class HistGradientBoostingRegressor(RegressorMixin, BaseHistGradientBoosting):
             features.
         - str array-like: names of categorical features (assuming the training
           data has feature names).
+        - `"by_dtype"`: Pandas categorical dtypes are considered categorical.
+          The input must be a pandas DataFrame to use this feature.
```
I guess and hope that we can extend this in the future and also treat string columns of any dataframe the same way.
String columns might contain free text (e.g. a title, a description, people's full names, ...), which is not really a categorical feature. Such columns would require some kind of dedicated text preprocessing involving string tokenization followed by some form of sparse multi-hot encoding, dense embedding, or topic modelling (e.g. NMF/LDA). I wouldn't consider such columns directly as categorical columns.
I have the feeling that it's better to error on those and let the user either type those columns explicitly as categorical or do some appropriate preprocessing to find a suitable numerical representation.
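For illustration, the explicit opt-in this leaves to the user is a one-line cast before fitting (a sketch; the column names are made up):

```python
import pandas as pd

X = pd.DataFrame(
    {"city": ["Paris", "Lille", "Paris"], "description": ["...", "...", "..."]}
)
# "city" really is categorical, so type it as such; "from_dtype" will then
# pick it up. "description" stays free text and needs its own preprocessing.
X["city"] = X["city"].astype("category")
print(X.dtypes)
```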
I would skip the deprecation and warning as it is annoying. It is, however, scikit-learn's promise and de-facto standard for a long time now. So +1 for directly setting the new default, but I accept being overruled by other opinions.
It would be annoying only for people who fit on a raw dataframe with at least one category-dtyped column. People fitting GBDT models on preprocessed categorical features and/or numerical-only data would not see any warning.
Could we replace this PR's code in a backwards-compatible way once #15396 is merged? I'm asking for the 1.3 release. For 1.4, `OrdinalEncoder` should come first.
#15396 adds […]. Although, there will be an issue with […].
I just pushed a commit by merging […].
I only have a couple of nitpicks: I would add a new test to be sure that we always sort numerical categories and to make it explicit via a unit test. Otherwise LGTM.
```
Parameters
----------
X : {array-like, pandas DataFrame} of shape (n_samples, n_features)
    Input data
```
Suggested change:
```diff
-    Input data
+    Input data.
```
```diff
@@ -207,10 +301,42 @@ def _check_categories(self, X):
         - None if the feature is not categorical
           None if no feature is categorical.
         """
+        if self.categorical_features is None:
+            X_is_dataframe = hasattr(X, "dtypes")  # sufficient here
```
I am wondering if we should just use `_is_pandas_df`, for consistency, even if this is more involved.
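For reference, a small sketch contrasting the two checks (it assumes sklearn's private helper `_is_pandas_df` from `sklearn.utils.validation`, which may move or change between versions):

```python
import pandas as pd
from sklearn.utils.validation import _is_pandas_df  # private API, may change

X = pd.DataFrame({"a": [1, 2, 3]})
print(hasattr(X, "dtypes"))  # duck-typing check used in the diff above
print(_is_pandas_df(X))      # stricter, and consistent with the rest of sklearn
```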
```python
if X_is_dataframe and (X.dtypes == "category").any():
    warnings.warn(
        (
            "The categorical_features parameter will change to 'by_dtype'"
```
"The categorical_features parameter will change to 'by_dtype'" | |
"The categorical_features parameter will change to 'from_dtype'" |
```python
    warnings.warn(
        (
            "The categorical_features parameter will change to 'by_dtype'"
            " in v1.6. The 'by_dtype' option automatically treats"
```
" in v1.6. The 'by_dtype' option automatically treats" | |
" in v1.6. The 'from_dtype' option automatically treats" |
```diff
@@ -230,17 +356,18 @@ def _check_categories(self, X):
         )

         n_features = X.shape[1]
+        feature_names_in_ = getattr(X, "columns", None)
```
I think it could be worth adding a small comment on why `self.feature_names_in_` is not yet defined at this stage.
```python
hist_np = Hist(categorical_features=[False, True], **hist_kwargs)
hist_np.fit(X_train, y_train)

hist_pd = Hist(categorical_features="from_dtype", **hist_kwargs)
```
Suggested change:
```diff
-hist_pd = Hist(categorical_features="from_dtype", **hist_kwargs)
+hist_pd = HistGradientBoosting(categorical_features="from_dtype", **hist_kwargs)
```
```python
@pytest.mark.parametrize(
    "Hist", [HistGradientBoostingClassifier, HistGradientBoostingRegressor]
```
"Hist", [HistGradientBoostingClassifier, HistGradientBoostingRegressor] | |
"HistGradientBoosting", [HistGradientBoostingClassifier, HistGradientBoostingRegressor] |
```python
@pytest.mark.parametrize(
    "Hist", [HistGradientBoostingClassifier, HistGradientBoostingRegressor]
)
def test_pandas_categorical_errors(Hist):
```
Suggested change:
```diff
-def test_pandas_categorical_errors(Hist):
+def test_pandas_categorical_errors(HistGradientBoosting):
```
```python
pd = pytest.importorskip("pandas")

msg = "Categorical feature 'f_cat' is expected to have a cardinality <= 16"
hist = Hist(categorical_features="from_dtype", max_bins=16)
```
Suggested change:
```diff
-hist = Hist(categorical_features="from_dtype", max_bins=16)
+hist = HistGradientBoosting(categorical_features="from_dtype", max_bins=16)
```
```python
# OrdinalEncoder requires categories backed by numerical values
# to be sorted
if categories.dtype.kind not in "OUS":
    categories = np.sort(categories)
```
Do you think it would be possible to get a specific test for this? I assume we test it indirectly, since codecov does not complain that this line is uncovered, but it would be nice to have an explicit test. A possible shape for such a test follows below.
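A sketch of what such a test could look like, exercising the same sorting logic as the snippet above without reaching into estimator internals (the test name is made up):

```python
import numpy as np
import pandas as pd


def test_numerical_categories_are_sorted():
    # Numeric categories deliberately declared out of order.
    col = pd.Categorical([3, 1, 2], categories=[3, 1, 2])
    categories = np.asarray(col.categories)
    # Mirror of the snippet above: numeric-backed categories get sorted,
    # while string/object-backed ones (kind in "OUS") are left as declared.
    if categories.dtype.kind not in "OUS":
        categories = np.sort(categories)
    np.testing.assert_array_equal(categories, [1, 2, 3])
```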
In 5fea2dd I updated the test to directly check the categories.
> I only have a couple of nitpicks: I would add a new test to be sure that we always sort numerical categories and to make it explicit via a unit test. Otherwise LGTM.

Uhm, the test is unrelated to this PR. I will have a look (since I merged that one).
The issue here is solved in #27802.
LGTM. Merging this PR then. Thanks @thomasjpfan.
This is amazing, thank you @thomasjpfan.
Reference Issues/PRs
Closes #24907
What does this implement/fix? Explain your changes.
This PR adds `categorical_features="pandas"`, which infers the categorical features from the dtype. Unlike #26268, the cardinality of each categorical feature is still bounded above by `max_bins`.
Any other comments?
Given the mixed reaction to #26268, I opened this PR because it is less magic. Essentially, this PR runs `OrdinalEncoder` on the categorical features.
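For illustration, a hedged sketch of the manual pipeline this PR effectively internalizes (column names and data are made up):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame(
    {
        "num": [0.1, 0.5, 0.3, 0.9, 0.2, 0.7],
        "cat": pd.Categorical(["a", "b", "a", "c", "b", "c"]),
    }
)
y = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0]

# Select the category-dtyped columns and ordinal-encode them; everything
# else passes through unchanged.
cat_columns = X.columns[X.dtypes == "category"].tolist()
preprocessor = ColumnTransformer(
    [("cat", OrdinalEncoder(), cat_columns)],
    remainder="passthrough",
)
# After the transform, the encoded categorical columns come first, so the
# boolean mask marks them as categorical for the estimator.
is_categorical = [True] * len(cat_columns) + [False] * (X.shape[1] - len(cat_columns))
model = make_pipeline(
    preprocessor,
    HistGradientBoostingRegressor(categorical_features=is_categorical),
)
model.fit(X, y)
```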