Transformer for nominal categories, with the goal of improving category support in decision trees #24967

Open
@betatim

I'd like to find out how keen people would be for adding a transformer like this.

Describe the workflow you want to enable

Improved support for nominal categories in tree-based models. Nominal categories have no inherent order, for example colour or country (cf. ordinal categories, which are ordered).

Describe your proposed solution

In #12866 @amueller linked to the paper "Splitting on categorical predictors in random forests". It proposes replacing categorical values (red, green, blue, ...) with cleverly computed numerical values (blue -> 0.643, green -> 0.123, ...), so that a tree node can use < as its splitting decision and still achieve similar or equal performance to an exhaustive search over all possible categorical splits.

I would implement this as a transformer that you use together with a random forest or other tree-based model in a pipeline.
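To make the idea concrete, here is a minimal sketch of how such a transformer could slot into a pipeline. The class name `MeanOrderEncoder` is hypothetical, and mapping each category to the mean of the target is an assumed simplification that only makes sense for regression or a 0/1 target; it is not necessarily the exact method from the paper. Sending unseen categories to the global mean is also just an assumption for the sketch.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline


class MeanOrderEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical sketch: replace each category by the mean target
    observed for that category during fit."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=object)
        y = np.asarray(y, dtype=float)
        # Assumption: unseen categories fall back to the global mean.
        self.global_mean_ = y.mean()
        self.mappings_ = []
        for j in range(X.shape[1]):
            col = X[:, j]
            self.mappings_.append({c: y[col == c].mean() for c in set(col)})
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=object)
        out = np.empty(X.shape, dtype=float)
        for j, mapping in enumerate(self.mappings_):
            out[:, j] = [mapping.get(c, self.global_mean_) for c in X[:, j]]
        return out


# Toy data: one nominal feature, numeric target.
X = np.array([["red"], ["green"], ["blue"], ["red"], ["green"], ["blue"]],
             dtype=object)
y = np.array([1.0, 2.0, 3.0, 3.0, 4.0, 5.0])

# The tree model only ever sees numbers and splits on them with "<".
model = make_pipeline(
    MeanOrderEncoder(),
    RandomForestRegressor(n_estimators=10, random_state=0),
)
model.fit(X, y)
predictions = model.predict(X)
```

Because the encoding is learned from `y`, it has to live inside the pipeline so that cross-validation refits it on each training fold.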

Describe alternatives you've considered, if relevant

There have been several PRs attempting to add native category support to decision trees. #12866 is the latest one. My impression is that it would be cool to have this in trees but that it is not easy to do: several people have tried, but no PR has landed yet. The "Breiman trick" that is used is also limited in the number of categorical values it supports, whereas this new idea seems to support an unlimited number of categorical values as well as multi-class classification.

Additional context

An earlier paper that is cited in "Splitting categorical predictors...": https://link.springer.com/article/10.1023/A:1009869804967

An open question for me is how to measure how good this kind of transformation is compared to the exhaustive split (which is "the best possible" but too expensive to compute to be used in practice). The method proposed by the paper seems simple (famous last words), and they make claims about the performance but it would be nice to know how close this gets us. Trust, but verify ;)
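One way to start on that question, at least for a single feature, is to compare the best split found by brute force with the best split found after ordering the categories. The sketch below does this for Gini impurity on a table of per-category class counts. All function names are hypothetical, and the ordering used (projection of each category's class-probability vector onto the first principal direction of the count-weighted covariance) is my reading of the approach in the cited papers, so treat it as an assumption rather than a faithful reimplementation.

```python
from itertools import combinations

import numpy as np


def gini_of_split(counts, left):
    """Weighted Gini impurity when the categories in `left` go to one child."""
    mask = np.zeros(counts.shape[0], dtype=bool)
    mask[list(left)] = True
    total = counts.sum()
    impurity = 0.0
    for child in (counts[mask], counts[~mask]):
        n = child.sum()
        p = child.sum(axis=0) / n
        impurity += n / total * (1.0 - np.sum(p ** 2))
    return impurity


def best_exhaustive(counts):
    """Minimum impurity over all 2**(k-1) - 1 two-way partitions."""
    k = counts.shape[0]
    best = np.inf
    # Category 0 always goes left, so each partition is visited once.
    for size in range(k - 1):
        for rest in combinations(range(1, k), size):
            best = min(best, gini_of_split(counts, (0,) + rest))
    return best


def best_ordered(counts, scores):
    """Minimum impurity over the k - 1 threshold splits on `scores`."""
    order = np.argsort(scores)
    return min(gini_of_split(counts, order[:i]) for i in range(1, len(order)))


def pca_scores(counts):
    """Assumed ordering: score on the first principal direction of the
    count-weighted class-probability vectors."""
    n = counts.sum(axis=1)
    probs = counts / n[:, None]
    centered = probs - np.average(probs, axis=0, weights=n)
    cov = (centered * n[:, None]).T @ centered
    _, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return probs @ eigvecs[:, -1]


# Rows are categories, columns are classes; every category is non-empty.
rng = np.random.default_rng(0)
counts = rng.integers(1, 20, size=(6, 3)).astype(float)

exhaustive = best_exhaustive(counts)
ordered = best_ordered(counts, pca_scores(counts))
print(f"exhaustive: {exhaustive:.4f}  ordered: {ordered:.4f}  "
      f"gap: {ordered - exhaustive:.4f}")
```

The ordered search can never beat the exhaustive one, so the gap is non-negative by construction. Repeating this over many random or real count tables would give a first idea of how often, and by how much, the ordered split falls short. The exhaustive search is only feasible for a small number of categories.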
