Transformer for nominal categories, with the goal of improving category support in decision trees #24967
Some notes on how to compute the various quantities, mostly so I don't lose them. Based on the example given in "Splitting categorical predictors ...":

```python
import numpy as np

# Contingency matrix, given in the example; rows are categories, columns are
# classes. In practice it would have to be constructed from the data.
labels = ["PS", "NP", "PR"]  # unused, just for completeness
categories = ["G", "B", "P", "DO", "LO", "Y", "R"]
C = np.array([[20, 3, 1], [19, 4, 2], [25, 4, 7], [64, 4, 20],
              [16, 1, 5], [100, 2, 3], [200, 3, 4]])

P = C / C.sum(axis=1).reshape(-1, 1)   # per-category class probabilities
pbar = C.sum(axis=0) / C.sum()         # weighted mean probability vector

# Weighted covariance of the probability vectors across categories.
sigma = np.zeros((3, 3))
for a in range(P.shape[0]):
    sigma += C.sum(axis=1)[a] * np.outer(P[a, :] - pbar, P[a, :] - pbar)

# Score each category by projecting onto the leading eigenvector.
w, v = np.linalg.eig(sigma)
v0 = v[:, np.argmax(w)]
for a, category in enumerate(categories):
    print(category, a, v0 @ P[a, :])
```

Each categorical value of a feature should be replaced by its score `v0 @ P[a, :]`.
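In symbols (my notation, just restating what the snippet computes; `C` is the category-by-class contingency matrix and `n_a` the sample count of category `a`):

```latex
p_a = \frac{C_{a\cdot}}{n_a}, \qquad
\bar{p} = \frac{\sum_a n_a\, p_a}{\sum_a n_a}, \qquad
\Sigma = \sum_a n_a\, (p_a - \bar{p})(p_a - \bar{p})^{\top}, \qquad
s_a = v_0^{\top} p_a
```

where `v_0` is the eigenvector of `\Sigma` with the largest eigenvalue and `s_a` is the numeric value that category `a` gets mapped to.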
I assume that we should at least benchmark this. I am not against having this encoder, provided we can show that it works in practice and that it scales reasonably well. However, I think that we should aim for a similar user experience for tree-based models (DT and RF, even exact GBDT) as for the HGBDT, meaning going towards the implementation proposed in #24907, where the magic happens within the trees.
From a quick read of #18894, which seems to be the "source" of #24907, the proposal there is to use a category encoder inside hist gradient boosting. This means users won't have to create a pipeline anymore to do the encoding themselves; the estimator takes care of it (kind of a "built-in" pipeline). Having an additional transformer like the one proposed here, which can deal with nominal (unordered) categories, still seems like a good thing.
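For illustration, roughly what the two user experiences look like (a sketch with made-up data; `categorical_features` is the existing parameter of `HistGradientBoostingClassifier`):

```python
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

X = [["red", 1.0], ["blue", 0.5], ["green", 2.0], ["red", 1.5]]
y = [0, 1, 0, 1]

# Today: the user assembles the encoding themselves in a pipeline
# (a ColumnTransformer would normally restrict the encoder to the
# categorical columns; skipped here for brevity).
pipe = make_pipeline(
    OrdinalEncoder(),
    HistGradientBoostingClassifier(categorical_features=[0]),
)
pipe.fit(X, y)

# The #18894 / #24907 direction: the estimator accepts raw categories
# itself, so no user-built pipeline is needed for the encoding step.
```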
Hey @betatim - is there a possibility I could help contribute to this? I would love to help and learn. Thank you.
My current plan is to write a bit of code and open a PR tomorrow. I'm not sure yet how we could best collaborate on the PR (maybe PRs to the PR?). Getting another pair of eyes on it is good though, because reviewer time is often the bottleneck. So maybe that is a way to collaborate.
Totally up for it @betatim, a PR to the PR or anything else that could help you save time. Looking forward to it.
For linking purposes, #17323 introduces an "impact encoder" or "target encoder" that converts categories into numerical values. I plan on revisiting it soon and will likely bring it closer to cuml's `TargetEncoder`. As for this issue, I suspect there are not enough citations to adopt the proposed feature.
Do you know how many citations the target encoding in cuml (or really any of the many variations of target encoding) has? The cuml docs link to https://maxhalford.github.io/blog/target-encoding/ which, after a quick browse, doesn't link to a paper.
I think one problem with the approach from "Splitting on categorical predictors in random forests" is that it has "target leakage": the encoded values are computed from the targets of the same samples the tree is then trained on.
Yeah, the implementations of target encoding I have seen vary quite a bit. For reference, here are some repos containing benchmarks from different sources:
With the data splitting, we will unfortunately end up with `fit(X, y).transform(X)` differing from `fit_transform(X, y)`. TLDR: if we include a target encoder in scikit-learn, it likely will be the one that splits the data internally to avoid the target leakage.
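A minimal sketch of what the data splitting could look like (out-of-fold target means via `KFold`; the function name and signature are mine, not a proposed API):

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_fit_target_encode(X_col, y, n_splits=5):
    """Encode one categorical column by out-of-fold target means.

    Each sample's encoding only uses targets from the other folds, so a
    sample's own target never influences its encoded value.
    """
    X_col, y = np.asarray(X_col), np.asarray(y, dtype=float)
    encoded = np.empty_like(y)
    global_mean = y.mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(X_col):
        # Per-category target means computed on the training folds only.
        means = {c: y[train_idx][X_col[train_idx] == c].mean()
                 for c in np.unique(X_col[train_idx])}
        # Categories unseen in the training folds fall back to the global mean.
        encoded[test_idx] = [means.get(c, global_mean) for c in X_col[test_idx]]
    return encoded

colours = np.array(["red", "blue", "red", "green", "blue", "red", "green", "blue"])
y = np.array([1, 0, 1, 0, 0, 1, 1, 0])
print(cross_fit_target_encode(colours, y, n_splits=2))
```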
Another big library that contains many different categorical encoders: https://contrib.scikit-learn.org/category_encoders/
I'd like to find out how keen people would be on adding a transformer like this.
Describe the workflow you want to enable
Improved support for nominal categories in tree-based models. Nominal categories are ones with no order to them, for example colour or country (cf. ordinal categories, which are ordered).
Describe your proposed solution
In #12866 @amueller linked to "Splitting on categorical predictors in random forests". This paper proposes replacing categorical values (red, green, blue, ...) with cleverly computed numerical values (blue -> 0.643, green -> 0.123, ...) that allow you to achieve similar/equal performance using a plain `<` comparison as the splitting decision in a tree node, compared to performing an exhaustive search over all possible categorical splits.

I would implement this as a transformer that you use together with a random forest or another tree-based model in a pipeline.
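To make the proposal concrete, here is a rough sketch of what such a transformer could look like for a single column (class and attribute names are placeholders, not a proposed API; no handling of unseen categories or regression targets):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class NominalPCAEncoder(BaseEstimator, TransformerMixin):
    """Sketch: encode a nominal column with the paper's PCA-based scores."""

    def fit(self, X, y):
        X, y = np.asarray(X).ravel(), np.asarray(y)
        self.categories_ = np.unique(X)
        classes = np.unique(y)
        # Contingency matrix: rows are categories, columns are classes.
        C = np.array([[np.sum((X == cat) & (y == cls)) for cls in classes]
                      for cat in self.categories_], dtype=float)
        n_a = C.sum(axis=1)
        P = C / n_a[:, None]
        pbar = C.sum(axis=0) / C.sum()
        # Weighted covariance of the class-probability vectors, then the
        # leading eigenvector (eigh, since the matrix is symmetric).
        sigma = (P - pbar).T @ ((P - pbar) * n_a[:, None])
        w, v = np.linalg.eigh(sigma)
        v0 = v[:, np.argmax(w)]
        self.encoding_ = {c: v0 @ P[a] for a, c in enumerate(self.categories_)}
        return self

    def transform(self, X):
        X = np.asarray(X).ravel()
        return np.array([self.encoding_[c] for c in X]).reshape(-1, 1)
```

In a pipeline this would sit in front of the forest, e.g. `make_pipeline(NominalPCAEncoder(), RandomForestClassifier())`, applied per categorical column via a `ColumnTransformer`.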
Describe alternatives you've considered, if relevant
There have been several PRs attempting to add native category support to decision trees; #12866 is the latest one. My impression is that it would be cool to have this in trees, but it is not easy to do: several people have tried and no PR has landed yet. The "Breiman trick" used there is also limited in the number of categorical values it supports, whereas this new idea seems to support an unlimited number of categorical values as well as multi-class classification.
Additional context
An earlier paper that is cited in "Splitting categorical predictors...": https://link.springer.com/article/10.1023/A:1009869804967
An open question for me is how to measure how good this kind of transformation is compared to the exhaustive split (which is "the best possible" but too expensive to compute in practice). The method proposed by the paper seems simple (famous last words), and the authors make claims about its performance, but it would be nice to know how close it actually gets us. Trust, but verify ;)
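One cheap way to check this on a single node (brute force is feasible for a handful of categories): enumerate every binary partition of the category set, compute the best achievable weighted impurity, and compare it with the best threshold split on the encoded values. A sketch, assuming binary classification with 0/1 labels and Gini impurity:

```python
import itertools
import numpy as np

def gini(y):
    p = np.bincount(y, minlength=2) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_exhaustive_impurity(X_col, y):
    """Lowest weighted impurity over all binary partitions of the categories."""
    cats = np.unique(X_col)
    best = np.inf
    # Proper subsets only; complements yield the same split, which is harmless.
    for r in range(1, len(cats)):
        for left in itertools.combinations(cats, r):
            mask = np.isin(X_col, left)
            imp = mask.mean() * gini(y[mask]) + (~mask).mean() * gini(y[~mask])
            best = min(best, imp)
    return best

def best_threshold_impurity(x_encoded, y):
    """Lowest weighted impurity over all `<=` thresholds on the encoded values."""
    best = np.inf
    for t in np.unique(x_encoded)[:-1]:
        mask = x_encoded <= t
        imp = mask.mean() * gini(y[mask]) + (~mask).mean() * gini(y[~mask])
        best = min(best, imp)
    return best
```

If the two numbers agree at a node, the encoding recovered the exhaustive optimum there; the gap between them is a direct measure of what the shortcut costs.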