Description
Implement optimal split logic for categorical features, as done in LightGBM.
Optimal split: https://media.readthedocs.org/pdf/lightgbm/latest/lightgbm.pdf
LightGBM commits: microsoft/LightGBM#762
Optimal Split for Categorical Features
Categorical features are often converted into one-hot encoding. However, this is not a good solution for tree learners: for high-cardinality categorical features, it tends to grow a very unbalanced tree that must become very deep to achieve good accuracy.
The optimal solution is to partition the categories into 2 subsets, of which there are 2^(k-1) - 1 possibilities. However, there is an efficient solution for regression trees [7]: it needs about O(k * log(k)) to find the optimal partition.
The basic idea is to reorder the categories according to their relevance to the training target. More specifically, sort the histogram (of the categorical feature) by its accumulated values (sum_gradient / sum_hessian), then find the best split on the sorted histogram.
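A minimal sketch of the sorted-histogram idea above, not LightGBM's actual implementation. It assumes per-category sums of gradients and hessians are already accumulated; the function name `best_categorical_split` and parameters `lam` (L2 regularization) and `eps` are illustrative only.

```python
import numpy as np

def best_categorical_split(grad_hist, hess_hist, lam=1.0, eps=1e-12):
    """Return (gain, left_categories) for the best 2-subset partition.

    Categories are sorted by sum_gradient / sum_hessian, then a single
    left-to-right scan over the sorted order finds the best split point,
    for O(k log k) total work over k categories.
    """
    grad_hist = np.asarray(grad_hist, dtype=float)
    hess_hist = np.asarray(hess_hist, dtype=float)

    # Sort categories by their accumulated gradient/hessian ratio.
    order = np.argsort(grad_hist / (hess_hist + eps))

    G, H = grad_hist.sum(), hess_hist.sum()
    parent_score = G * G / (H + lam)

    best_gain, best_k = -np.inf, 0
    GL = HL = 0.0
    # Scan split points on the sorted order; never put all categories on one side.
    for i, c in enumerate(order[:-1]):
        GL += grad_hist[c]
        HL += hess_hist[c]
        GR, HR = G - GL, H - HL
        gain = GL * GL / (HL + lam) + GR * GR / (HR + lam) - parent_score
        if gain > best_gain:
            best_gain, best_k = gain, i + 1

    left_categories = set(order[:best_k].tolist())
    return best_gain, left_categories

# Example: 4 categories with hypothetical per-category gradient/hessian sums.
gain, left = best_categorical_split([-8.0, 2.0, 5.0, -1.0], [4.0, 3.0, 5.0, 2.0])
print(gain, left)  # categories in `left` would go to the left child
```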
[7] Walter D. Fisher. "On Grouping for Maximum Homogeneity." Journal of the American Statistical Association. Vol. 53, No. 284 (Dec., 1958), pp. 789-798.
http://amstat.tandfonline.com/doi/abs/10.1080/01621459.1958.10501479