Description
I'd like to find out how keen people would be on adding a transformer like this.
Describe the workflow you want to enable
Improved support for nominal categories in tree-based models. Nominal categories are ones with no inherent order, for example colour or country (cf. ordinal categories, which are ordered).
Describe your proposed solution
In #12866 @amueller linked to "Splitting on categorical predictors in random forests". The paper proposes replacing categorical values (red, green, blue, ...) with cleverly computed numerical values (blue -> 0.643, green -> 0.123, ...). With these values, a tree node can use `<` as its splitting decision and achieve similar or equal performance compared to an exhaustive search over all possible categorical splits.
I would implement this as a transformer that you use in a pipeline together with a random forest or other tree-based model.
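To make the pipeline idea concrete, here is a minimal sketch of what such a transformer could look like. The class name `MeanTargetOrderEncoder` is hypothetical, and the encoding (each category is replaced by its mean training target, with unseen categories falling back to the global mean) is my simplifying assumption for the regression case, not necessarily the exact scheme from the paper. It does not cover multi-class classification.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline


class MeanTargetOrderEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: replace each category by its mean training target.

    Assumption: regression target and squared-error splits. This is an
    illustration of the idea, not the paper's exact encoding.
    """

    def fit(self, X, y):
        X = np.asarray(X, dtype=object)
        y = np.asarray(y, dtype=float)
        # Fallback value for categories that were not seen during fit.
        self.global_mean_ = float(y.mean())
        # One {category: mean target} mapping per input column.
        self.maps_ = []
        for j in range(X.shape[1]):
            col = X[:, j]
            self.maps_.append({c: float(y[col == c].mean()) for c in set(col)})
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=object)
        out = np.empty(X.shape, dtype=float)
        for j, mapping in enumerate(self.maps_):
            out[:, j] = [mapping.get(c, self.global_mean_) for c in X[:, j]]
        return out


# Toy data: a single nominal feature and a numeric target.
X_toy = [["red"], ["green"], ["blue"], ["red"]]
y_toy = [1.0, 2.0, 3.0, 3.0]

# The encoder sits in front of the forest, so the trees only ever see numbers.
model = make_pipeline(
    MeanTargetOrderEncoder(),
    RandomForestRegressor(n_estimators=10, random_state=0),
)
model.fit(X_toy, y_toy)
```

Because the encoding is learned in `fit`, putting it inside the pipeline means cross-validation recomputes it on each training fold, which avoids leaking the target from held-out data.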
Describe alternatives you've considered, if relevant
There have been several PRs attempting to add native category support to decision trees; #12866 is the latest one. My impression is that it would be cool to have this in trees but that it is not easy to do: several people have tried, but no PR has landed yet. The "Breiman trick" that is used is also limited in the number of categorical values it supports, whereas this new idea seems to support an unlimited number of categorical values as well as multi-class classification.
Additional context
An earlier paper that is cited in "Splitting categorical predictors...": https://link.springer.com/article/10.1023/A:1009869804967
An open question for me is how to measure how good this kind of transformation is compared to the exhaustive split (which is "the best possible" but too expensive to compute to be used in practice). The method proposed by the paper seems simple (famous last words), and they make claims about the performance but it would be nice to know how close this gets us. Trust, but verify ;)
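One way to start on the "verify" part is a brute-force check on a single node with few categories, where the exhaustive search is still cheap. The sketch below is my own illustration with hypothetical helper names (`best_exhaustive`, `best_ordered`). It assumes a regression target with squared error as the split criterion and orders categories by their mean target. For that setting the ordered search is, as far as I know, a classical result that recovers the exhaustive optimum, so the interesting comparison is the multi-class case, which this sketch does not cover.

```python
from itertools import combinations

import numpy as np


def sse(y):
    """Sum of squared deviations from the mean (0 for an empty node)."""
    return float(((y - y.mean()) ** 2).sum()) if y.size else 0.0


def split_cost(cats, y, left):
    """Total squared error after sending the categories in `left` to one child."""
    mask = np.isin(cats, list(left))
    return sse(y[mask]) + sse(y[~mask])


def best_exhaustive(cats, y):
    """Best cost over all 2**(k-1) - 1 two-way partitions of the k categories."""
    levels = sorted(set(cats.tolist()))
    first, rest = levels[0], levels[1:]
    best = np.inf
    # Pin the first level to the left child to skip mirrored partitions;
    # r < len(rest) keeps the right child non-empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            best = min(best, split_cost(cats, y, (first, *extra)))
    return best


def best_ordered(cats, y):
    """Best cost over the k - 1 splits of categories sorted by mean target."""
    levels = sorted(set(cats.tolist()), key=lambda c: y[cats == c].mean())
    return min(split_cost(cats, y, levels[:i]) for i in range(1, len(levels)))


# Synthetic node: 6 integer-coded categories, target depends on the category.
rng = np.random.default_rng(0)
cats = rng.integers(0, 6, size=200)
y = rng.normal(size=200) + (cats % 3)

gap = best_ordered(cats, y) - best_exhaustive(cats, y)
```

A `gap` of zero means the ordered search found the same split quality as the exhaustive one on this node. Repeating this over many random nodes, and with a multi-class criterion, would give a rough picture of how close the transformation gets in practice.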