
ENH Adds native pandas categorical support to gradient boosting #26411


Merged · 28 commits · Nov 17, 2023

Conversation

@thomasjpfan (Member)

Reference Issues/PRs

Closes #24907

What does this implement/fix? Explain your changes.

This PR adds categorical_features="pandas", which infers the categorical features from the dtype. Unlike #26268, the cardinality of each categorical feature is still bounded above by max_bins.

Any other comments?

Given the mixed reaction to #26268, I opened this PR because it is less magic. Essentially, this PR runs OrdinalEncoder on the categorical features.
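For context, a minimal usage sketch of the feature as merged, with toy data; note that the option was renamed to "from_dtype" during the review below:

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier

X = pd.DataFrame(
    {
        "f_num": [0.1, 0.5, 0.3, 0.9] * 10,
        "f_cat": pd.Series(["a", "b", "a", "c"] * 10, dtype="category"),
    }
)
y = [0, 1, 0, 1] * 10

# Categorical columns are detected from the pandas dtype; no manual
# boolean mask or list of column names is needed.
clf = HistGradientBoostingClassifier(categorical_features="from_dtype")
clf.fit(X, y)
```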

@ogrisel (Member) commented May 24, 2023

Could you please update the existing examples to show the benefits of using this PR? I think the following 3 examples would benefit from this treatment:

  • examples/applications/plot_cyclical_feature_engineering.py
  • examples/ensemble/plot_gradient_boosting_categorical.py
  • examples/inspection/plot_partial_dependence.py

There is also examples/preprocessing/plot_target_encoder.py, which could probably be reworked a bit to use this new feature, but I am not sure that would simplify it, so maybe we can leave it as is.

@ogrisel (Member) left a comment

Thank you @thomasjpfan. I think this PR should significantly improve the usability of Hist Gradient Boosting models when dealing with mixed categorical and numerical features, which is a very common use case in practice.

A few discussion points below:

@thomasjpfan (Member, Author)

I updated the PR to use categorical_features="by_dtype" and made it the default. I do have concerns about changing the default, as noted in #26411 (comment), but I think it is worth it.

@ogrisel (Member) left a comment

I did another pass. In retrospect, let's be conservative about the default behavior and issue a warning when the input dataframe has categorical columns and categorical_features has been left at its default value of "warn".

Other than that, here is a bit more feedback to improve the test.

I think the examples would have benefited from being updated as part of this PR, to make sure we get the UX right from the users' point of view. See: #26411 (comment)

Other than that, LGTM.

@ogrisel (Member) commented May 30, 2023

@lorentzenchr any opinion on this PR in general and the discussion on the default behavior in particular?

@ogrisel (Member) left a comment

Thanks for the last batch of updates. LGTM besides the following sphinx-gallery formatting problem:

lorentzenchr self-requested a review on June 1, 2023
@lorentzenchr (Member) left a comment

Overall looking very good.

@thomasjpfan Could you add a test with pd.Categorical(["a", "b"], categories=["a", "b"]) and pd.Categorical(["a", "b"], categories=["b", "a"])?

Then a question: why not first enable pd.Categorical (#14953, #15396) in OrdinalEncoder and then use it here? Will it be possible to replace the code here once we have that?
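A minimal sketch of the kind of test being requested, assuming the merged option name "from_dtype" (the test actually added in the PR may differ):

```python
import numpy as np
import pytest

from sklearn.ensemble import HistGradientBoostingClassifier


def test_category_order_does_not_change_predictions():
    # Both declared category orders describe the same data, so the
    # fitted models should produce identical predictions.
    pd = pytest.importorskip("pandas")
    values = ["a", "b"] * 50
    y = np.array([0, 1] * 50)
    predictions = []
    for categories in (["a", "b"], ["b", "a"]):
        X = pd.DataFrame(
            {"f_cat": pd.Categorical(values, categories=categories)}
        )
        clf = HistGradientBoostingClassifier(
            categorical_features="from_dtype", random_state=0
        )
        predictions.append(clf.fit(X, y).predict(X))
    np.testing.assert_array_equal(predictions[0], predictions[1])
```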

@@ -1267,6 +1394,8 @@ class HistGradientBoostingRegressor(RegressorMixin, BaseHistGradientBoosting):
features.
- str array-like: names of categorical features (assuming the training
data has feature names).
- `"by_dtype"`: Pandas categorical dtypes are considered categorical.
Member

How about "infer_by_dtype"? It is a bit longer but a bit more explicit.
Just for discussion.

Member

I like this. Or maybe "infer_from_dtype".

@ogrisel (Member) commented Jun 7, 2023

Or maybe "from_dtype". No strong opinion, in retrospect.

"infer" might be a bit misleading if the estimator just selects columns that are explicitly category dtyped and more ambiguous dtypes such as str-based or object-based are not considered as discussed below (in #26411 (comment)).

@thomasjpfan (Member, Author)

In 673c76d (#26411), I updated the keyword argument to "from_dtype".

@@ -1267,6 +1394,8 @@ class HistGradientBoostingRegressor(RegressorMixin, BaseHistGradientBoosting):
features.
- str array-like: names of categorical features (assuming the training
data has feature names).
- `"by_dtype"`: Pandas categorical dtypes are considered categorical.
The input must be a pandas DataFrame to use this feature.
Member

I guess and hope that we can extend this in the future and also treat string columns of any dataframe the same way.

@ogrisel (Member) commented Jun 5, 2023

String columns might contain free text (e.g. a title, a description, people's full names, ...), which is not really categorical data. Such columns would require some kind of dedicated text preprocessing involving string tokenization, followed by some form of sparse multi-hot encoding, dense embedding, or topic modelling (e.g. NMF/LDA). I wouldn't consider such columns directly as categorical.

I have the feeling that it's better to error on those and let the user either type those columns explicitly as categorical or do some appropriate preprocessing to find a suitable numerical representation.
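For a string column that really is categorical, the explicit typing suggested here is a one-liner (column name illustrative):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})
# Declare the column as categorical explicitly; with this dtype it is
# picked up by categorical_features="from_dtype".
df["color"] = df["color"].astype("category")
```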

@lorentzenchr (Member)

> any opinion on ... the discussion on the default behavior in particular?

I would skip the deprecation and warning, as it is annoying. It is, however, scikit-learn's promise and de facto standard, and has been for a long time.

So +1 for directly setting the new default, but I am happy to be overruled by other opinions.

@ogrisel (Member) commented Jun 5, 2023

> I would skip the deprecation and warning, as it is annoying. It is, however, scikit-learn's promise and de facto standard, and has been for a long time.

It would be annoying only for people who fit with a raw dataframe with at least one column with a category dtype.

People fitting GBDT models on preprocessed categorical features and/or numerical only data would not see any warning.

@thomasjpfan (Member, Author)

> Then a question: why not first enable pd.Categorical (#14953, #15396) in OrdinalEncoder and then use it here? Will it be possible to replace the code here once we have that?

Yes, if #15396 gets adopted then we can reduce the amount of code here.

@lorentzenchr (Member)

> Then a question: why not first enable pd.Categorical (#14953, #15396) in OrdinalEncoder and then use it here? Will it be possible to replace the code here once we have that?

> Yes, if #15396 gets adopted then we can reduce the amount of code here.

Could we replace this PR's code in a backwards-compatible way once #15396 is merged? I'm asking with the 1.3 release in mind. For 1.4, OrdinalEncoder should come first.

@thomasjpfan (Member, Author)

> Could we replace this PR's code in a backwards-compatible way once #15396 is merged?

#15396 adds a categories="from_dtype" option which infers the categories from the dtype. By itself, #15396 will not break this PR. This PR precomputes the categories and passes them to OrdinalEncoder(categories=...), which does not go through any of the new code from #15396.

There will, however, be an issue with OrdinalEncoder when the categories are float ndarrays. Currently, gradient boosting treats negative float values as missing values, while OrdinalEncoder would encode them as a positive integer, which ends up being a behavior change. If we want to keep backward compatibility, we'll still need to precompute the categories and pass them into OrdinalEncoder, as done in this PR.
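A small illustration of the clash, with toy values; the -1.0 here plays the role of the missing-value marker:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Gradient boosting's binning code treats negative floats as missing,
# but to OrdinalEncoder -1.0 is just another category: it receives a
# valid ordinal code, so the missing-value semantics would be lost.
X = np.array([[-1.0], [2.0], [5.0]])
enc = OrdinalEncoder().fit(X)
print(enc.categories_)           # [array([-1.,  2.,  5.])]
print(enc.transform(X).ravel())  # [0. 1. 2.]
```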

@github-actions (bot) commented Jun 22, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 5fea2dd.

glemaitre self-requested a review on November 15, 2023
@glemaitre (Member)

I just pushed a commit by merging main into this branch. I will give it a review.

@glemaitre (Member) left a comment

I only have a couple of nitpicks: I would add a new test to make sure that we always sort numerical categories, making that behavior explicit via a unit test. Otherwise LGTM.

Parameters
----------
X : {array-like, pandas DataFrame} of shape (n_samples, n_features)
Input data
Member

Suggested change:
- Input data
+ Input data.

@@ -207,10 +301,42 @@ def _check_categories(self, X):
- None if the feature is not categorical
None if no feature is categorical.
"""
if self.categorical_features is None:
X_is_dataframe = hasattr(X, "dtypes") # sufficient here
Member

I am wondering if we should use _is_pandas_df here, for consistency, even if it is more involved.
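For contrast, a sketch of the two checks; the body of is_pandas_df below is a plausible reconstruction of the private scikit-learn helper, not its exact code:

```python
def looks_like_dataframe(X):
    # Duck-typed check used in the diff above: anything exposing
    # `.dtypes` is treated as a dataframe, which is sufficient here.
    return hasattr(X, "dtypes")


def is_pandas_df(X):
    # Stricter check: only a real pandas.DataFrame qualifies, and
    # pandas is only imported when the object looks like one.
    if hasattr(X, "columns") and hasattr(X, "iloc"):
        try:
            import pandas as pd
        except ImportError:
            return False
        return isinstance(X, pd.DataFrame)
    return False
```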

if X_is_dataframe and (X.dtypes == "category").any():
warnings.warn(
(
"The categorical_features parameter will change to 'by_dtype'"
Member

Suggested change:
- "The categorical_features parameter will change to 'by_dtype'"
+ "The categorical_features parameter will change to 'from_dtype'"

warnings.warn(
(
"The categorical_features parameter will change to 'by_dtype'"
" in v1.6. The 'by_dtype' option automatically treats"
Member

Suggested change:
- " in v1.6. The 'by_dtype' option automatically treats"
+ " in v1.6. The 'from_dtype' option automatically treats"

@@ -230,17 +356,18 @@ def _check_categories(self, X):
)

n_features = X.shape[1]
feature_names_in_ = getattr(X, "columns", None)
Member

I think it would be worth adding a small comment explaining why self.feature_names_in_ is not yet defined at this stage.

hist_np = Hist(categorical_features=[False, True], **hist_kwargs)
hist_np.fit(X_train, y_train)

hist_pd = Hist(categorical_features="from_dtype", **hist_kwargs)
Member

Suggested change:
- hist_pd = Hist(categorical_features="from_dtype", **hist_kwargs)
+ hist_pd = HistGradientBoosting(categorical_features="from_dtype", **hist_kwargs)



@pytest.mark.parametrize(
"Hist", [HistGradientBoostingClassifier, HistGradientBoostingRegressor]
Member

Suggested change:
- "Hist", [HistGradientBoostingClassifier, HistGradientBoostingRegressor]
+ "HistGradientBoosting", [HistGradientBoostingClassifier, HistGradientBoostingRegressor]

@pytest.mark.parametrize(
"Hist", [HistGradientBoostingClassifier, HistGradientBoostingRegressor]
)
def test_pandas_categorical_errors(Hist):
Member

Suggested change:
- def test_pandas_categorical_errors(Hist):
+ def test_pandas_categorical_errors(HistGradientBoosting):

pd = pytest.importorskip("pandas")

msg = "Categorical feature 'f_cat' is expected to have a cardinality <= 16"
hist = Hist(categorical_features="from_dtype", max_bins=16)
Member

Suggested change:
- hist = Hist(categorical_features="from_dtype", max_bins=16)
+ hist = HistGradientBoosting(categorical_features="from_dtype", max_bins=16)

# OrdinalEncoder requires categories backed by numerical values
# to be sorted
if categories.dtype.kind not in "OUS":
categories = np.sort(categories)
Member

Do you think it would be possible to add a specific test for this? I assume we test it indirectly, since codecov does not complain that this line is uncovered, but it would be good to have an explicit test.

@thomasjpfan (Member, Author)

In 5fea2dd I updated the test to directly check the categories.
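For illustration, here is the sorting rule from the quoted snippet exercised standalone on toy values (not the actual test added in 5fea2dd):

```python
import numpy as np


def sort_numerical_categories(categories):
    # Mirrors the quoted snippet: numerical categories are sorted,
    # while object/str categories (dtype kinds "O", "U", "S") keep
    # their declared order.
    if categories.dtype.kind not in "OUS":
        categories = np.sort(categories)
    return categories


assert sort_numerical_categories(np.array([3.0, 1.0, 2.0])).tolist() == [1.0, 2.0, 3.0]
assert sort_numerical_categories(np.array(["b", "a"])).tolist() == ["b", "a"]
```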


glemaitre added this to the 1.4 milestone on Nov 15, 2023
@glemaitre (Member)

Uhm, the failing test is unrelated to this PR. I will have a look (since I merged that one).

@glemaitre (Member)

The issue here is solved in #27802.

@glemaitre (Member)

LGTM. Merging this PR then. Thanks @thomasjpfan.

@amueller (Member)

This is amazing, thank you @thomasjpfan
