In this article, https://link.springer.com/article/10.1186/1471-2105-8-25, concerns are raised about using Gini-based measures for feature importance:
> We found that for the original random forest method the variable importance measures are affected by the number of categories and scale of measurement of the predictor variables, which are no direct indicators of the true importance of the variable.

> However, in studies where continuous variables, such as the folding energy, are used in combination with categorical information from the neighboring nucleotides, or when categorical predictors, as in amino acid sequence data, vary in their number of categories present in the sample, variable selection with random forest variable importance measures is unreliable and may even be misleading.
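For illustration, here is a minimal sketch (mine, not from the article) of the cardinality effect: two predictors that are pure noise with respect to the labels, one continuous and one binary. Impurity-based importances tend to favor the continuous feature simply because it offers far more candidate split points.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n = 1000
X = np.column_stack([
    rng.rand(n),           # continuous noise: many candidate split points
    rng.randint(0, 2, n),  # binary noise: a single candidate split point
])
y = rng.randint(0, 2, n)   # labels drawn independently of both features

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# Both features are uninformative, yet the continuous one typically
# receives a much larger impurity-based importance score.
print(clf.feature_importances_)
```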
I'm not entirely sure, but as I understand from reading the documentation and code, feature importance is calculated using either the Gini or the entropy criterion (both presumably suffering from the same issue). Is there any way to circumvent this problem, or are there steps that should be taken before computing feature importance?
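One workaround I'm aware of (my suggestion, not something from the article) is permutation importance, available in recent scikit-learn versions as `sklearn.inspection.permutation_importance`. It measures the drop in a held-out score when a column is shuffled, so it does not depend on the impurity criterion and is less sensitive to how many split points a feature offers. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Shuffle each column n_repeats times and record the drop in test score;
# importances_mean holds the per-feature average drop.
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```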
In the article, the authors propose a new splitting criterion to combat this problem.