
Transformer for nominal categories, with the goal of improving category support in decision trees #24967


Open
betatim opened this issue Nov 17, 2022 · 11 comments

Comments

@betatim
Member

betatim commented Nov 17, 2022

I'd like to find out how keen people would be on adding a transformer like this.

Describe the workflow you want to enable

Improved support for nominal categories in tree-based models. Nominal categories are ones with no inherent order, for example colour or country (cf. ordinal categories, which are ordered).

Describe your proposed solution

In #12866 @amueller linked to Splitting on categorical predictors in random forests. The paper proposes replacing categorical values (red, green, blue, ...) with cleverly computed numerical values (blue -> 0.643, green -> 0.123, ...) so that a tree node splitting on < achieves similar or equal performance to an exhaustive search over all possible categorical splits.

I would implement this as a transformer that you use together with a random forest or another tree-based model in a pipeline.

Describe alternatives you've considered, if relevant

There have been several PRs attempting to add native category support to decision trees; #12866 is the latest one. My impression is that it would be great to have this in the trees themselves, but it is not easy to do: several people have tried and no PR has landed yet. The "Breiman trick" used there is also limited in the number of categorical values it supports, whereas this new idea seems to support an unlimited number of categorical values as well as multiclass classification.

Additional context

An earlier paper that is cited in "Splitting categorical predictors...": https://link.springer.com/article/10.1023/A:1009869804967

An open question for me is how to measure how good this kind of transformation is compared to the exhaustive split (which is "the best possible" but too expensive to compute in practice). The method proposed in the paper seems simple (famous last words), and the authors make claims about its performance, but it would be nice to know how close it actually gets us. Trust, but verify ;)

@betatim betatim added the New Feature and Needs Triage labels Nov 17, 2022
@betatim
Member Author

betatim commented Nov 17, 2022

Some notes on how to compute the various quantities, mostly so I don't lose them. Based on the example given in "Splitting categorical predictors ...".

import numpy as np

# Contingency matrix, given in the example; in general it would be
# constructed from the data (rows = categories, columns = class labels).
labels = ["PS", "NP", "PR"]  # unused, just for completeness
categories = ["G", "B", "P", "DO", "LO", "Y", "R"]

C = np.array([[20, 3, 1], [19, 4, 2], [25, 4, 7], [64, 4, 20],
              [16, 1, 5], [100, 2, 3], [200, 3, 4]])

# Per-category class-probability vectors and the overall class distribution.
P = C / C.sum(axis=1).reshape(-1, 1)
pbar = C.sum(axis=0) / C.sum()

# Covariance of the class-probability vectors around pbar, weighted by the
# number of samples in each category.
sigma = np.zeros((3, 3))
for a in range(P.shape[0]):
    sigma += C.sum(axis=1)[a] * np.outer(P[a, :] - pbar, P[a, :] - pbar)

# Principal eigenvector of sigma (it is symmetric, so np.linalg.eigh works too).
w, v = np.linalg.eig(sigma)
v0 = v[:, np.argmax(w)]

for a, category in enumerate(categories):
    print(category, a, v0 @ P[a, :])

Each categorical value of a feature should be replaced by v0 @ P[a, :], where a is the index that corresponds to the value of the category.
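For completeness, here is a rough sketch of how this could be wrapped as a scikit-learn transformer. The class name and all details are made up: it handles a single column, assumes every category seen at transform time was also seen during fit, and skips input validation.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class EigenCategoryEncoder(BaseEstimator, TransformerMixin):
    # Score each category by projecting its class-probability vector onto
    # the principal eigenvector of the weighted covariance matrix sigma.
    def fit(self, X, y):
        x, y = np.asarray(X).ravel(), np.asarray(y)
        self.categories_ = np.unique(x)
        classes = np.unique(y)
        # Contingency matrix: rows = categories, columns = classes.
        C = np.array([[np.sum((x == c) & (y == k)) for k in classes]
                      for c in self.categories_])
        P = C / C.sum(axis=1, keepdims=True)
        pbar = C.sum(axis=0) / C.sum()
        d = P - pbar
        # Sum of n_a * outer(P[a] - pbar, P[a] - pbar) over categories a.
        sigma = (C.sum(axis=1)[:, None, None] * d[:, :, None] * d[:, None, :]).sum(axis=0)
        w, v = np.linalg.eigh(sigma)  # sigma is symmetric
        v0 = v[:, np.argmax(w)]
        self.encoding_ = dict(zip(self.categories_, P @ v0))
        return self

    def transform(self, X):
        x = np.asarray(X).ravel()
        return np.array([self.encoding_[c] for c in x]).reshape(-1, 1)

This would then be applied per column, e.g. via a ColumnTransformer, in front of a RandomForestClassifier in a pipeline.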

@glemaitre
Member

> An open question for me is how to measure how good this kind of transformation is compared to the exhaustive split

I assume that we should at least benchmark as in:

https://scikit-learn.org/dev/auto_examples/ensemble/plot_gradient_boosting_categorical.html#sphx-glr-auto-examples-ensemble-plot-gradient-boosting-categorical-py

I am not against having this encoder, as long as we can show that it works in practice and that it scales reasonably. However, I think that we should aim for a similar user experience for tree-based models (DT and RF, even exact GBDT) as for the HGBDT, meaning going towards the implementation proposed in #24907, where the magic happens within the trees.
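For concreteness, a minimal sketch of that kind of benchmark on synthetic data (illustrative only; the dataset and the encoder choices are placeholders, and the proposed encoder would be added as another entry):

import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "colour": rng.choice(["red", "green", "blue"], size=300),
    "x0": rng.normal(size=300),
})
y = ((X["colour"] == "red") ^ (X["x0"] > 0)).astype(int)

for name, encoder in [("one-hot", OneHotEncoder(handle_unknown="ignore")),
                      ("ordinal", OrdinalEncoder())]:
    pipe = make_pipeline(
        # sparse_threshold=0 keeps the output dense for the HGBDT
        make_column_transformer((encoder, ["colour"]),
                                remainder="passthrough", sparse_threshold=0),
        HistGradientBoostingClassifier(random_state=0),
    )
    scores = cross_validate(pipe, X, y, cv=5)["test_score"]
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")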

@glemaitre glemaitre removed the Needs Triage label Nov 20, 2022
@betatim
Member Author

betatim commented Nov 21, 2022

From a quick read of #18894, which seems to be the "source" of #24907, the proposal there is to use a category encoder inside hist gradient boosting. This means users won't have to create a pipeline to do the encoding themselves; the estimator takes care of it (kind of like a "built-in" pipeline). Having an additional transformer like the one proposed here, which can deal with nominal (unordered) categories, still seems like a good thing.

@mohitthakur13
Contributor

Hey @betatim - is there a possibility I could help contribute to this? Would love to help and learn. Thank you.

@betatim
Member Author

betatim commented Nov 22, 2022

My current plan is to write a bit of code and open a PR tomorrow. I'm not sure yet how we could best collaborate on the PR (maybe PRs to the PR?). Getting another pair of eyes on it is good though, because reviewer time is often the bottleneck, so maybe that is a way to collaborate.

@mohitthakur13
Contributor

Totally up for it @betatim, a PR to the PR or anything else that could help you save time. Looking forward to it.

@thomasjpfan
Member

For linking purposes, #17323 introduces an "impact encoder" or "target encoder" that converts categories into numerical values. I plan on revisiting it soon and will likely bring it closer to cuml's TargetEncoder.

As for this issue, I suspect there are not enough citations to adopt the proposed feature.

@betatim
Member Author

betatim commented Nov 23, 2022

Do you know how many citations the target encoding in cuml (or really any of the many variations of target encoding) has? The cuml docs link to https://maxhalford.github.io/blog/target-encoding/ which, after a quick browse, doesn't link to a paper.

@betatim
Member Author

betatim commented Nov 23, 2022

I think one problem with the approach from Splitting on categorical predictors in random forests is that it has "target leakage": the values of y are used during the construction of the features, which means the features contain information about the value of y. The same is true of all the other target encoders. So we need a way to solve that, some form of splitting the data like the cuml TargetEncoder does?
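A toy illustration of the effect (not the paper's method, just a naive full-data target encoding): on a high-cardinality feature with a pure-noise target, cross-validated accuracy ends up above the 50% chance level, because each row's own y leaked into its encoding.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.integers(0, 200, size=500)  # high-cardinality nominal feature
y = rng.integers(0, 2, size=500)    # target is pure noise

# Naive target encoding computed on the *full* data, before any CV split.
means = {c: y[x == c].mean() for c in np.unique(x)}
x_enc = np.array([means[c] for c in x]).reshape(-1, 1)

# Scores above 0.5 on noise reveal the leak.
print(cross_val_score(DecisionTreeClassifier(random_state=0), x_enc, y, cv=5).mean())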

@thomasjpfan
Member

> So we need a way to solve that, some form of splitting the data like the cuml TargetEncoder does?

Yeah, the implementations of TargetEncoder that do not leak target data are the ones that split the data multiple times during training. In the past, I ran benchmarks showing that data splitting is required for TargetEncoder to work. There is also this benchmark paper that concludes the same thing.

For reference, here are some repos containing benchmarks from different sources:

With the data splitting, we will unfortunately end up with fit(X, y).transform(X) != fit_transform(X, y). During fit, the training data is split so that one subset is used to encode the categories in the other subset. At the end of fit, all the training data is used to compute the encoding, and that encoding is the one used at prediction time.

TL;DR: if we include a target encoder in scikit-learn, it will likely be one that splits the data in fit, so that fit(X, y).transform(X) != fit_transform(X, y). This breaks our API convention, but it is required for a target encoder to work. Our glossary entry for fit_transform does allow for this inconsistency.
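A minimal sketch of the cross-fitting idea for a single column (the function name and details are made up; real implementations also shrink the per-category means towards the global mean):

import numpy as np
from sklearn.model_selection import KFold

def cross_fit_target_encode(x, y, n_splits=5):
    # Encode each row using target means computed on the *other* folds,
    # so a row's own y never leaks into its encoding.
    x, y = np.asarray(x), np.asarray(y, dtype=float)
    encoded = np.empty(len(x))
    global_mean = y.mean()
    for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=0).split(x):
        means = {c: y[train_idx][x[train_idx] == c].mean()
                 for c in np.unique(x[train_idx])}
        # Categories unseen in the training folds fall back to the global mean.
        encoded[test_idx] = [means.get(c, global_mean) for c in x[test_idx]]
    return encoded

This corresponds to the fit_transform path; fit would additionally store means computed on all of the training data for use on new data at transform time, which is exactly why fit(X, y).transform(X) != fit_transform(X, y).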

@betatim
Copy link
Member Author

betatim commented Nov 28, 2022

Another big library that contains many different categorical encoders: https://contrib.scikit-learn.org/category_encoders/
