Feature importance for mixed data and/or high cardinality data #16023

@hugwi

Description

In this article (https://link.springer.com/article/10.1186/1471-2105-8-25), concerns are raised about using Gini impurity for feature importance:

We found that for the original random forest method the variable importance measures are affected by the number of categories and scale of measurement of the predictor variables, which are no direct indicators of the true importance of the variable.

However, in studies where continuous variables, such as the folding energy, are used in combination with categorical information from the neighboring nucleotides, or when categorical predictors, as in amino acid sequence data, vary in their number of categories present in the sample variable selection with random forest variable importance measures is unreliable and may even be misleading.

I'm not entirely sure, but as I understand it from reading the documentation and code, feature importance is calculated using either Gini impurity or entropy (both of which, I suppose, suffer from the same issue). Is there any way to circumvent this problem, or are there steps that should be taken before computing feature importance?
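Not a fix from the article, but scikit-learn also provides permutation importance (`sklearn.inspection.permutation_importance`), which is computed on held-out data and is commonly suggested as being less susceptible to the cardinality bias of impurity-based importances. A minimal sketch of the contrast, with a synthetic high-cardinality noise feature (all data and feature names here are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
n = 1000
X = np.column_stack([
    rng.randint(0, 2, n),    # informative binary feature
    rng.randint(0, 100, n),  # pure noise with 100 categories
])
y = (X[:, 0] + rng.rand(n) > 0.7).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances: the high-cardinality noise feature
# can pick up a non-trivial share simply because it offers many splits.
print("impurity-based:", clf.feature_importances_)

# Permutation importance on held-out data: the noise feature
# should score near zero here.
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
print("permutation:", result.importances_mean)
```

This doesn't implement the paper's proposed splitting criterion; it just sidesteps the impurity-based measure entirely.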

In the article, they propose a new splitting criterion to combat this problem.
