
Tree Optimal split - from LightGBM #9960

Closed
@rspadim

Description


Implement optimal split logic for categorical features, like LightGBM does.
Optimal split: https://media.readthedocs.org/pdf/lightgbm/latest/lightgbm.pdf

LightGBM commits: microsoft/LightGBM#762


Optimal Split for Categorical Features

Categorical features are often converted into one-hot coding. However, this is not a good solution for tree learners: for high-cardinality categorical features, the tree tends to grow very unbalanced and needs to be very deep to achieve good accuracy.

Actually, the optimal solution is to partition the categories into 2 subsets, and there are 2^(k-1) - 1 possible partitions (for example, k = 3 categories {a, b, c} admit 3 partitions: {a}|{b, c}, {b}|{a, c}, and {c}|{a, b}). But there is an efficient solution for regression trees [7]: finding the optimal partition takes only about O(k log k) time.

The basic idea is to reorder the categories according to their relevance to the training target. More specifically, the histogram (of the categorical feature) is reordered according to each category's accumulated value (sum_gradient / sum_hessian), and the best split is then found on the sorted histogram.
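
To make the trick concrete, here is a minimal NumPy sketch (not the LightGBM or scikit-learn implementation; the function name, the `lam` regularization term, and the G^2 / (H + lam) leaf score are assumptions borrowed from the usual second-order gradient-boosting split gain):

```python
import numpy as np

def best_categorical_split(sum_gradient, sum_hessian, lam=1e-9):
    """Best binary partition of k >= 2 categories in O(k log k).

    sum_gradient[c] / sum_hessian[c] are the per-category histogram
    statistics; hessian sums are assumed positive. Returns the best
    gain and the categories assigned to the left child.
    """
    g = np.asarray(sum_gradient, dtype=float)
    h = np.asarray(sum_hessian, dtype=float)

    # Reorder categories by their accumulated value sum_gradient / sum_hessian.
    order = np.argsort(g / h)
    g, h = g[order], h[order]

    # Prefix sums give the left-child statistics at every split point
    # of the sorted histogram; the right child is the complement.
    G, H = g.sum(), h.sum()
    g_left, h_left = np.cumsum(g)[:-1], np.cumsum(h)[:-1]
    g_right, h_right = G - g_left, H - h_left

    # Split gain with leaf score G^2 / (H + lam), as in the usual
    # second-order gradient-boosting objective (lam avoids division
    # by zero and acts as L2 regularization).
    gain = (g_left ** 2 / (h_left + lam)
            + g_right ** 2 / (h_right + lam)
            - G ** 2 / (H + lam))

    best = int(np.argmax(gain))
    return gain[best], order[:best + 1]
```

For example, `best_categorical_split([-4.0, 1.0, 3.0], [2.0, 1.0, 2.0])` sorts the three categories by their ratios -2.0 < 1.0 < 1.5 and scans only the 2 candidate prefix splits. This works because, per Fisher's result [7], for this kind of one-dimensional criterion the optimal binary partition is always a contiguous prefix of the sorted order, so only k - 1 candidates need to be checked instead of 2^(k-1) - 1.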


[7] Walter D. Fisher. "On Grouping for Maximum Homogeneity." Journal of the American Statistical Association, Vol. 53, No. 284 (Dec. 1958), pp. 789-798. http://amstat.tandfonline.com/doi/abs/10.1080/01621459.1958.10501479
