ENH Automatic handling of categorical columns in Hist Gradient Boosting models #24907


Closed
wants to merge 16 commits into from

Conversation

ogrisel
Member

@ogrisel ogrisel commented Nov 13, 2022

More UX improvements related to categorical features in HGBDT, following #24889.

Related to:

This PR makes HGBDT models automatically detect categorical columns of an input pandas dataframe and automatically ordinal-encode those columns internally.

See the tests and examples to see the kind of code simplification this leads to.
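The detection step this PR relies on can be sketched with plain pandas (toy data and hypothetical variable names; the actual internal code differs):

```python
import pandas as pd

X = pd.DataFrame(
    {
        "temperature": [10.0, 25.0, 30.0],
        "weather": pd.Series(["rain", "sun", "snow"], dtype="category"),
    }
)

# Boolean mask over the columns, analogous to the categorical_features
# mask the estimator would build internally to decide which columns to
# ordinal-encode before binning.
categorical_features = (X.dtypes == "category").to_numpy()
print(categorical_features.tolist())  # [False, True]
```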

Note: the _ordinal_encode_df method could logically be collapsed into a _validate_data method overriding the one inherited from BaseEstimator, since it always needs to be called first. However, at fit time it also returns an additional categorical_features array, so it does not really match the API of _validate_data.

Alternatively, I could introduce a _validate_hist_gradient_boosting_data method that internally calls both self._ordinal_encode_df and self._validate_data and returns the categorical_features array when needed. Let me know what you prefer.

Final note: it might be possible to adapt this code to make it work for any dataframe object that implements the __dataframe__ protocol instead of just pandas, but I think we should tackle the support of __dataframe__ objects in a dedicated PR instead.

I milestoned this as 1.2 but if you think this is rushing things too much we can remilestone to 1.3.

@@ -272,6 +266,5 @@ def plot_results(figure_title):
# %%
# The results for these under-fitting models confirm our previous intuition:
# the native category handling strategy performs the best when the splitting
# budget is constrained. The two other strategies (one-hot encoding and
# treating categories as ordinal values) lead to error values comparable
# to the baseline model that just dropped the categorical features altogether.
Member Author

This remark is no longer true since the example was made to run on a subset of the columns to make it run faster. Compare the last plot of:

and:

So I fixed this as part of the PR (along with a simplification of the column indexing code).


# Explicitly type the categorical columns as such.
object_dtyped_columns = X.select_dtypes(include=["object"]).columns
X[object_dtyped_columns] = X[object_dtyped_columns].astype("category")
Member Author

@ogrisel
Member Author

ogrisel commented Nov 13, 2022

Note: there is no backward compat to handle here because calling the fit method of an HGBRT model directly on a non-numerical dataframe would have raised a ValueError in scikit-learn 1.1.

@lorentzenchr
Member

@ogrisel Thanks for working on this issue. At the moment, I'm short on review time.

One thought that pops up (see discussion in the issue): this PR makes the ColumnTransformer unnecessary for many use cases. Another possibility would be to automatically apply OE in a pipeline and pass on the info about which features were OE-encoded. (Is this now possible with pandas?) What are the pros and cons of each approach?

@@ -1240,7 +1286,8 @@ class HistGradientBoostingRegressor(RegressorMixin, BaseHistGradientBoosting):
or shape (n_categorical_features,), default=None
Indicates the categorical features.

- None : no feature will be considered categorical.
- None : no feature will be considered categorical unless explicitly
Member

I am wondering if we should keep None as is and have an option "auto" that could become the default in 2 versions. None could be a bit less explicit than a str option (I proposed "auto" but it could be something more linked to the "categorical" dtype from pandas).

Just a thought.

Member

We could as well avoid the delay of 2 versions, since I don't think we have a corner case where our users could be surprised.

Member Author

As I said here (#24907 (comment)), since this case would previously raise an exception, I don't really see the point of trying to enforce backward compat.

Member

@glemaitre glemaitre left a comment

I think that this is indeed an expected feature from people using LightGBM and XGBoost. I will review this PR tomorrow.

@glemaitre
Member

Another possibility would be to automatically apply OE in a pipeline and pass on the info which features were OE encoded.

It might imply that the Pipeline is no longer "step"-agnostic. I would think that it is more appropriate to have a meta-estimator such as SuperVectorizer that could be used in the pipeline.

What are the pros and cons of each approach?

Users from xgboost and lightgbm would expect not to have to use a pipeline and ColumnTransformer. It might be more intuitive to automatically encode in the GBDT.

Not a real con, but we had something similar with normalize for linear models, and we made the inverse move. We should be certain that encoding the categories does not have any side effect on other parameters of the model (e.g. order of categories - I assume that it does not matter).

@ogrisel
Member Author

ogrisel commented Nov 14, 2022

Another possibility would be to automatically apply OE in a pipeline and pass on the info which features were OE encoded. (Is this now possible with pandas?)

This is possible in the sense that you can type the categorical columns explicitly in pandas with .astype("category"), which uses the categorical dtype. But then there is no need for an OE step in a column transformer pipeline beforehand because:

  • the downstream estimator still needs specific code to detect which columns have a categorical dtype and which have regular numerical dtype;
  • a categorical dtype in pandas is already ordinally encoded (see the .codes attribute I look up in this PR).
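The second bullet can be checked with plain pandas (toy data): a categorical Series already stores an ordinal integer encoding in its .cat.codes attribute:

```python
import pandas as pd

s = pd.Series(["dog", "cat", "dog", "turtle"], dtype="category")

# Categories are sorted lexicographically by default ...
print(list(s.cat.categories))  # ['cat', 'dog', 'turtle']

# ... and each value is already stored as an integer code into that list
# (missing values would be encoded as -1).
print(s.cat.codes.tolist())  # [1, 0, 1, 2]
```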

What are the pros and cons of each approach?

I don't see the point of introducing an OE step in a pipeline that would convert a categorical column into the same categorical column. In any case, the HGBDT models drop the meaning of the categories.

The only problem I see now is that in fit I do not record the categories to check that they are the same at predict time. That would indeed be useful to raise a meaningful error message in case X_test has a categorical column with the same name as one in X_train but with different categories.

# dtype to encode the categories.
# XXX: using integers would be more efficient but would
# make the code more complex.
codes = X[col].values.codes.astype(np.float64)
Member

@thomasjpfan thomasjpfan Nov 15, 2022

Here are my thoughts on using pandas' codes:

  1. Mismatched Categories in terms of order: I think it is important to check that the categorical dtypes are the same during fit and transform, otherwise the codes encoding may represent different categories. For example, if the fit time categorical dtype is ['dog', 'cat', 'turtle'] and predict time is ['cat', 'turtle', 'dog'], then codes [0, 1, 2] will mean different things.

  2. Unknown Categories: In the following, "bear" is unknown:

X_train["pet"].cat.categories
# ["cat", "dog", "snake", ...]

X_test["pet"].cat.categories
# ["cat", "bear", "dog", "snake", ...]

During train time "dog" is 1, but during predict time "bear" is 1. We can do some preprocessing here to map the unknown categories to missing. Still, the simplest solution is to check that the categorical dtypes are the same during fit and predict.

  3. Infrequent Categories: In the following, "bear" is infrequent and happened to only be in the train set:

X_train["pet"].cat.categories
# ["cat", "bear", "dog", "snake", ...]

X_test["pet"].cat.categories
# ["cat", "dog", "snake", ...]

We can handle this with some preprocessing to correctly remap the codes during test to the ones seen in train. Again, the simplest solution is to check that the categorical dtypes are the same during fit and predict.

TLDR: I think checking that the categories match during fit and predict is the simplest way to work around all three situations.
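The first pitfall above is easy to reproduce with plain pandas (illustrative values only): identical labels get different integer codes when the declared category order differs, so comparing the dtypes between fit and predict is the simple guard discussed here:

```python
import pandas as pd

s_train = pd.Series(
    ["dog", "cat"], dtype=pd.CategoricalDtype(["dog", "cat", "turtle"])
)
s_test = pd.Series(
    ["dog", "cat"], dtype=pd.CategoricalDtype(["cat", "turtle", "dog"])
)

# Same labels, different codes: feeding raw codes to the model without a
# dtype consistency check would silently corrupt the predictions.
print(s_train.cat.codes.tolist())  # [0, 1]
print(s_test.cat.codes.tolist())   # [2, 0]

# Comparing the categories index (order-sensitive) reveals the mismatch.
print(s_train.cat.categories.equals(s_test.cat.categories))  # False
```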

Member Author

TLDR: I think checking that the categories match during fit and predict is the simplest way to work around all three situations.

I agree. I will update this PR later this week. Or feel free to push to this PR directly if you prefer, as I don't plan to work on this today or tomorrow.

Member

I am on vacation this week, so I am trying my best not to write code. I will be available next week.

Member

I added the consistency check for pandas categorical features here: 00f6a18 (#24907)

@ogrisel
Member Author

ogrisel commented Nov 15, 2022

Maybe we could also raise a ValueError if there are more than 255 categories. In this case we could point the user to explicitly use OrdinalEncoder and collapse the rarest categories together, for instance. Ideally we would also point the user to a target/impact encoder to deal with this case, but we need to finalize it first... Otherwise we could collapse the rarest categories together as part of this PR, but this is extra work and maybe it's too magical; better let the user deal with this case explicitly.
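A minimal sketch of such a guard, assuming a hypothetical helper name and the 255-bin limit mentioned above (the actual check would live in the estimator's fit/validation code):

```python
import pandas as pd

MAX_CARDINALITY = 255  # histogram bins are limited to 255 non-missing bins


def check_categorical_cardinality(X: pd.DataFrame) -> None:
    """Raise ValueError if a categorical column has too many categories.

    Hypothetical helper for illustration only.
    """
    for col in X.select_dtypes(include="category").columns:
        n_categories = len(X[col].cat.categories)
        if n_categories > MAX_CARDINALITY:
            raise ValueError(
                f"Categorical column {col!r} has {n_categories} categories, "
                f"more than the maximum of {MAX_CARDINALITY}. Consider "
                "grouping rare categories together with an explicit "
                "OrdinalEncoder before fitting."
            )


# A 300-category column triggers the error.
X = pd.DataFrame(
    {"c": pd.Series([str(i) for i in range(300)], dtype="category")}
)
try:
    check_categorical_cardinality(X)
except ValueError as exc:
    print(exc)
```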

@thomasjpfan
Member

Maybe we could also raise a ValueError if there are more than 255 categories.

I agree.

Otherwise we could collapse the rarest categories together as part of this PR but this is extra work and maybe it's too magical, better let the user deal with this case explicitly.

I agree with not doing any encoding based on rare categories in HistGradientBoosting* itself. If we want to support this use case in scikit-learn, then I would add infrequent-category support to OrdinalEncoder. (OrdinalEncoder would also need to support pandas categorical output to work with this PR.)

Member

@NicolasHug NicolasHug left a comment

Thanks for working on this @ogrisel !

I won't have time for a proper review (sorry), so I just looked at the docs.

Comment on lines 150 to 152
# Note that this is equivalent to using the ordinal encoder and then passing
# the name of the categorical features to the ``categorical_features``
# constructor parameter of :class:`~ensemble.HistGradientBoostingRegressor`.
Member

Some users may not want or be able to rely on pandas, so I wonder if we should still leave an example with the OrdinalEncoder somewhere? Perhaps just as a commented-out snippet here?

Also, in the User Guide we currently make a reference to this example, and it probably needs to be updated:

The cardinality of each categorical feature should be less than the max_bins parameter, and each categorical feature is expected to be encoded in [0, max_bins - 1]. To that end, it might be useful to pre-process the data with an OrdinalEncoder as done in Categorical Feature Support in Gradient Boosting.

Member Author

Good points, thanks.

Member Author

I pushed 75b1afe to address this point.

Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>
@thomasjpfan
Member

The flake8 issue is unrelated and fixed in #25017.

@jeremiedbb
Member

We won't have time to finish the review on this one before the 1.2 release. Moving it to 1.3.

@jeremiedbb jeremiedbb modified the milestones: 1.2, 1.3 Nov 24, 2022
@betatim
Member

betatim commented Nov 28, 2022

Left field question: why not use OrdinalEncoder here instead of implementing your own and having to handle all the cases mentioned in #24907 (comment) ?

@thomasjpfan
Member

why not use OrdinalEncoder here instead of implementing your own and having to handle all the cases mentioned in #24907 (comment) ?

This PR is trying to use the encoding from cat.codes directly. If we use OrdinalEncoder it will learn its own encoding, which is slower. (I have a PR that adds using cat.codes to OrdinalEncoder: #15396)

Overall, I am okay with using an OrdinalEncoder and do not have a strong opinion on it.

@betatim
Member

betatim commented Nov 30, 2022

Thanks for explaining. I didn't realise that reusing the categories from the dataframe would save a lot of runtime.

From a quick look at the diff in #15396 it seems like it is a "non trivial amount of code", which makes me think investing effort to get that PR merged and then use the OrdinalEncoder here would be a good investment. Kind of a "two birds with one stone" move (hist boosting gets it and all the other users of encoders).

@ogrisel
Member Author

ogrisel commented Dec 1, 2022

I sneakily pushed 787501d while not testing for this slight behavioral change. It's on my todo list to add proper testing for edge cases.

Co-authored-by: Tim Head <betatim@gmail.com>
@lorentzenchr
Member

From a quick look at the diff in #15396 it seems like it is a "non trivial amount of code", which makes me think investing effort to get that PR merged and then use the OrdinalEncoder here would be a good investment. Kind of a "two birds with one stone" move (hist boosting gets it and all the other users of encoders).

As much as I would like this PR's feature as fast as possible in the next release, this sounds like a reasonable plan to me.

@ogrisel
Member Author

ogrisel commented May 24, 2023

Closing in favor of #26411 which is much cleaner.

@ogrisel ogrisel closed this May 24, 2023