Feature importance for mixed data and/or high cardinality data #16023

@hugwi

Description

In this article (https://link.springer.com/article/10.1186/1471-2105-8-25), concerns are raised about using Gini impurity for feature importance:

We found that for the original random forest method the variable importance measures are affected by the number of categories and scale of measurement of the predictor variables, which are no direct indicators of the true importance of the variable.

However, in studies where continuous variables, such as the folding energy, are used in combination with categorical information from the neighboring nucleotides, or when categorical predictors, as in amino acid sequence data, vary in their number of categories present in the sample variable selection with random forest variable importance measures is unreliable and may even be misleading.

I'm not entirely sure, but as I understand it from reading the documentation and code, feature importance is calculated using either Gini impurity or entropy (both of which, I suppose, suffer from the same issue). Is there any way to circumvent this problem, or are there steps that should be taken before computing feature importance?
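Not a fix from the article, but scikit-learn also provides permutation importance (`sklearn.inspection.permutation_importance`), which is computed on held-out data and is commonly suggested as being less susceptible to the cardinality bias of impurity-based importances. A minimal sketch of the contrast, with a synthetic high-cardinality noise feature (all data and feature names here are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
n = 1000
X = np.column_stack([
    rng.randint(0, 2, n),    # informative binary feature
    rng.randint(0, 100, n),  # pure noise with 100 categories
])
y = (X[:, 0] + rng.rand(n) > 0.7).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances: the high-cardinality noise feature
# can pick up a non-trivial share simply because it offers many splits.
print("impurity-based:", clf.feature_importances_)

# Permutation importance on held-out data: the noise feature
# should score near zero here.
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
print("permutation:", result.importances_mean)
```

This doesn't implement the paper's proposed splitting criterion; it just sidesteps the impurity-based measure entirely.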

In the article, they propose a new splitting criterion to combat this problem.
