Description
Implement optimal split logic for categorical features, as done in LightGBM.
Optimal split: https://media.readthedocs.org/pdf/lightgbm/latest/lightgbm.pdf
LightGBM commits: microsoft/LightGBM#762
Optimal Split for Categorical Features
Categorical features are often converted into one-hot encoding. However, this is not a good solution for tree learners: for high-cardinality categorical features, it tends to grow a very unbalanced tree that must become very deep to achieve good accuracy.
The optimal solution is to partition the categories into 2 subsets, of which there are 2^(k-1) - 1 possibilities. However, there is an efficient solution for regression trees [7]: it needs about O(k * log(k)) to find the optimal partition.
The basic idea is to reorder the categories according to their relevance to the training target. More specifically, sort the histogram (of the categorical feature) by its accumulated values (sum_gradient / sum_hessian), then find the best split on the sorted histogram.
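A minimal sketch of the sorted-histogram idea above, not LightGBM's actual implementation. It assumes per-category sums of gradients and hessians are already accumulated; the function name `best_categorical_split` and parameters `lam` (L2 regularization) and `eps` are illustrative only.

```python
import numpy as np

def best_categorical_split(grad_hist, hess_hist, lam=1.0, eps=1e-12):
    """Return (gain, left_categories) for the best 2-subset partition.

    Categories are sorted by sum_gradient / sum_hessian, then a single
    left-to-right scan over the sorted order finds the best split point,
    for O(k log k) total work over k categories.
    """
    grad_hist = np.asarray(grad_hist, dtype=float)
    hess_hist = np.asarray(hess_hist, dtype=float)

    # Sort categories by their accumulated gradient/hessian ratio.
    order = np.argsort(grad_hist / (hess_hist + eps))

    G, H = grad_hist.sum(), hess_hist.sum()
    parent_score = G * G / (H + lam)

    best_gain, best_k = -np.inf, 0
    GL = HL = 0.0
    # Scan split points on the sorted order; never put all categories on one side.
    for i, c in enumerate(order[:-1]):
        GL += grad_hist[c]
        HL += hess_hist[c]
        GR, HR = G - GL, H - HL
        gain = GL * GL / (HL + lam) + GR * GR / (HR + lam) - parent_score
        if gain > best_gain:
            best_gain, best_k = gain, i + 1

    left_categories = set(order[:best_k].tolist())
    return best_gain, left_categories

# Example: 4 categories with hypothetical per-category gradient/hessian sums.
gain, left = best_categorical_split([-8.0, 2.0, 5.0, -1.0], [4.0, 3.0, 5.0, 2.0])
print(gain, left)  # categories in `left` would go to the left child
```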
[7] Walter D. Fisher. "On Grouping for Maximum Homogeneity." Journal of the American Statistical Association. Vol. 53, No. 284 (Dec., 1958), pp. 789-798.
http://amstat.tandfonline.com/doi/abs/10.1080/01621459.1958.10501479