
Transformer for nominal categories, with the goal of improving category support in decision trees #24967


Open
betatim opened this issue Nov 17, 2022 · 11 comments

Comments

@betatim
Member

betatim commented Nov 17, 2022

I'd like to find out how keen people would be on adding a transformer like this.

Describe the workflow you want to enable

Improved support for nominal categories in tree-based models. Nominal categories are ones with no inherent order, for example colour or country (cf. ordinal categories, which are ordered).

Describe your proposed solution

In #12866 @amueller linked to Splitting on categorical predictors in random forests. The paper proposes replacing categorical values (red, green, blue, ...) with cleverly computed numerical values (blue -> 0.643, green -> 0.123, ...) so that a tree node splitting on < achieves similar or equal performance to an exhaustive search over all possible categorical splits.

I would implement this as a transformer that you use together with a random forest or another tree-based model in a pipeline.

Describe alternatives you've considered, if relevant

There have been several PRs attempting to add native category support to decision trees; #12866 is the latest one. My impression is that it would be great to have this in the trees themselves, but it is not easy to do: several people have tried and no PR has landed yet. The "Breiman trick" used there is also limited in the number of categorical values it supports, whereas this new idea seems to support an unlimited number of categorical values as well as multiclass classification.

Additional context

An earlier paper that is cited in "Splitting categorical predictors...": https://link.springer.com/article/10.1023/A:1009869804967

An open question for me is how to measure how good this kind of transformation is compared to the exhaustive split (which is "the best possible" but too expensive to compute in practice). The method proposed in the paper seems simple (famous last words), and the authors make claims about its performance, but it would be nice to know how close it actually gets us. Trust, but verify ;)

@betatim betatim added the New Feature and Needs Triage labels Nov 17, 2022
@betatim
Member Author

betatim commented Nov 17, 2022

Some notes on how to compute the various quantities, mostly so I don't lose them. Based on the example given in "Splitting categorical predictors ...".

import numpy as np

# Contingency matrix, given in the example; in general it would be
# constructed from the data (rows = categories, columns = class labels).
labels = ["PS", "NP", "PR"]  # unused, just for completeness
categories = ["G", "B", "P", "DO", "LO", "Y", "R"]

C = np.array([[20, 3, 1], [19, 4, 2], [25, 4, 7], [64, 4, 20],
              [16, 1, 5], [100, 2, 3], [200, 3, 4]])

# Per-category class-probability vectors and the overall class distribution.
P = C / C.sum(axis=1).reshape(-1, 1)
pbar = C.sum(axis=0) / C.sum()

# Covariance of the class-probability vectors around pbar, weighted by the
# number of samples in each category.
sigma = np.zeros((3, 3))
for a in range(P.shape[0]):
    sigma += C.sum(axis=1)[a] * np.outer(P[a, :] - pbar, P[a, :] - pbar)

# Principal eigenvector of sigma (it is symmetric, so np.linalg.eigh works too).
w, v = np.linalg.eig(sigma)
v0 = v[:, np.argmax(w)]

for a, category in enumerate(categories):
    print(category, a, v0 @ P[a, :])

Each categorical value of a feature should be replaced by v0 @ P[a, :], where a is the index that corresponds to the value of the category.
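For completeness, here is a rough sketch of how this could be wrapped as a scikit-learn transformer. The class name and all details are made up: it handles a single column, assumes every category seen at transform time was also seen during fit, and skips input validation.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class EigenCategoryEncoder(BaseEstimator, TransformerMixin):
    # Score each category by projecting its class-probability vector onto
    # the principal eigenvector of the weighted covariance matrix sigma.
    def fit(self, X, y):
        x, y = np.asarray(X).ravel(), np.asarray(y)
        self.categories_ = np.unique(x)
        classes = np.unique(y)
        # Contingency matrix: rows = categories, columns = classes.
        C = np.array([[np.sum((x == c) & (y == k)) for k in classes]
                      for c in self.categories_])
        P = C / C.sum(axis=1, keepdims=True)
        pbar = C.sum(axis=0) / C.sum()
        d = P - pbar
        # Sum of n_a * outer(P[a] - pbar, P[a] - pbar) over categories a.
        sigma = (C.sum(axis=1)[:, None, None] * d[:, :, None] * d[:, None, :]).sum(axis=0)
        w, v = np.linalg.eigh(sigma)  # sigma is symmetric
        v0 = v[:, np.argmax(w)]
        self.encoding_ = dict(zip(self.categories_, P @ v0))
        return self

    def transform(self, X):
        x = np.asarray(X).ravel()
        return np.array([self.encoding_[c] for c in x]).reshape(-1, 1)

This would then be applied per column, e.g. via a ColumnTransformer, in front of a RandomForestClassifier in a pipeline.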

@glemaitre
Member

> An open question for me is how to measure how good this kind of transformation is compared to the exhaustive split

I assume that we should at least benchmark as in:

https://scikit-learn.org/dev/auto_examples/ensemble/plot_gradient_boosting_categorical.html#sphx-glr-auto-examples-ensemble-plot-gradient-boosting-categorical-py

I am not against having this encoder, as long as we can show that it works in practice and that it scales reasonably. However, I think that we should aim for a similar user experience for tree-based models (DT and RF, even exact GBDT) as for the HGBDT, meaning going towards the implementation proposed in #24907, where the magic happens within the trees.
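For concreteness, a minimal sketch of that kind of benchmark on synthetic data (illustrative only; the dataset and the encoder choices are placeholders, and the proposed encoder would be added as another entry):

import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "colour": rng.choice(["red", "green", "blue"], size=300),
    "x0": rng.normal(size=300),
})
y = ((X["colour"] == "red") ^ (X["x0"] > 0)).astype(int)

for name, encoder in [("one-hot", OneHotEncoder(handle_unknown="ignore")),
                      ("ordinal", OrdinalEncoder())]:
    pipe = make_pipeline(
        # sparse_threshold=0 keeps the output dense for the HGBDT
        make_column_transformer((encoder, ["colour"]),
                                remainder="passthrough", sparse_threshold=0),
        HistGradientBoostingClassifier(random_state=0),
    )
    scores = cross_validate(pipe, X, y, cv=5)["test_score"]
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")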

@glemaitre glemaitre removed the Needs Triage label Nov 20, 2022
@betatim
Member Author

betatim commented Nov 21, 2022

From a quick read of #18894, which seems to be the "source" of #24907, the proposal there is to use a category encoder inside hist gradient boosting. This means users won't have to create a pipeline to do the encoding themselves; the estimator takes care of it (kind of like a "built-in" pipeline). Having an additional transformer like the one proposed here, which can deal with nominal (unordered) categories, still seems like a good thing.

@mohitthakur13
Contributor

Hey @betatim - is there a possibility I could help contribute to this? Would love to help and learn. Thank you.

@betatim
Member Author

betatim commented Nov 22, 2022

My current plan is to write a bit of code and open a PR tomorrow. I'm not sure yet how we could best collaborate on the PR (maybe PRs to the PR?). Getting another pair of eyes on it is good though, because reviewer time is often the bottleneck, so maybe that is a way to collaborate.

@mohitthakur13
Contributor

Totally up for it @betatim, a PR to the PR or anything else that could help you save time. Looking forward to it.

@thomasjpfan
Member

For linking purposes, #17323 introduces an "impact encoder" or "target encoder" that converts categories into numerical values. I plan on revisiting it soon and will likely bring it closer to cuml's TargetEncoder.

As for this issue, I suspect there are not enough citations to adopt the proposed feature.

@betatim
Member Author

betatim commented Nov 23, 2022

Do you know how many citations the target encoding in cuml (or really any of the many variations of target encoding) has? The cuml docs link to https://maxhalford.github.io/blog/target-encoding/ which, after a quick browse, doesn't link to a paper.

@betatim
Member Author

betatim commented Nov 23, 2022

I think one problem with the approach from Splitting on categorical predictors in random forests is that it has "target leakage": the values of y are used during the construction of the features, which means the features contain information about the value of y. The same is true of all the other target encoders. So we need a way to solve that, some form of splitting the data like the cuml TargetEncoder does?
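A toy illustration of the effect (not the paper's method, just a naive full-data target encoding): on a high-cardinality feature with a pure-noise target, cross-validated accuracy ends up above the 50% chance level, because each row's own y leaked into its encoding.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.integers(0, 200, size=500)  # high-cardinality nominal feature
y = rng.integers(0, 2, size=500)    # target is pure noise

# Naive target encoding computed on the *full* data, before any CV split.
means = {c: y[x == c].mean() for c in np.unique(x)}
x_enc = np.array([means[c] for c in x]).reshape(-1, 1)

# Scores above 0.5 on noise reveal the leak.
print(cross_val_score(DecisionTreeClassifier(random_state=0), x_enc, y, cv=5).mean())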

@thomasjpfan
Member

> So we need a way to solve that, some form of splitting the data like the cuml TargetEncoder does?

Yeah, the implementations of TargetEncoder that do not leak target data are the ones that split the data multiple times during training. In the past, I ran benchmarks showing that data splitting is required for TargetEncoder to work. There is also this benchmark paper that concludes the same thing.

For reference, here are some repos containing benchmarks from different sources:

With the data splitting, we will unfortunately end up with fit(X, y).transform(X) != fit_transform(X, y). During fit, the training data is split so that one subset is used to encode the categories in the other subset. At the end of fit, all the training data is used to compute the encoding, and that encoding is the one used at prediction time.

TL;DR: if we include a target encoder in scikit-learn, it will likely be one that splits the data in fit, so that fit(X, y).transform(X) != fit_transform(X, y). This breaks our API convention, but it is required for a target encoder to work. Our glossary entry for fit_transform does allow for this inconsistency.
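A minimal sketch of the cross-fitting idea for a single column (the function name and details are made up; real implementations also shrink the per-category means towards the global mean):

import numpy as np
from sklearn.model_selection import KFold

def cross_fit_target_encode(x, y, n_splits=5):
    # Encode each row using target means computed on the *other* folds,
    # so a row's own y never leaks into its encoding.
    x, y = np.asarray(x), np.asarray(y, dtype=float)
    encoded = np.empty(len(x))
    global_mean = y.mean()
    for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=0).split(x):
        means = {c: y[train_idx][x[train_idx] == c].mean()
                 for c in np.unique(x[train_idx])}
        # Categories unseen in the training folds fall back to the global mean.
        encoded[test_idx] = [means.get(c, global_mean) for c in x[test_idx]]
    return encoded

This corresponds to the fit_transform path; fit would additionally store means computed on all of the training data for use on new data at transform time, which is exactly why fit(X, y).transform(X) != fit_transform(X, y).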

@betatim
Copy link
Member Author

betatim commented Nov 28, 2022

Another big library that contains many different categorical encoders: https://contrib.scikit-learn.org/category_encoders/
