I think it would be nice to implement Altmann's permutation importance (or PIMP) in scikit - original paper is https://academic.oup.com/bioinformatics/article/26/10/1340/193348. The idea was already proposed in #11187 (comment) but I think nothing came of it.
The method is already implemented in at least three R packages that I know of: the original release by the authors, the Vita package by Janitza et al., and ranger. However, I believe its adoption is hampered by the very slow random forest implementations it leverages in R (both randomForest and ranger, in my experience), and it could benefit and see greater use if it took advantage of scikit's RF implementation (in my experiments, the same dataset gives a training time of ~70 seconds in scikit versus ~30 minutes in ranger, and likely double that in randomForest).
The benefits are that it is easier/faster to implement than the conditional permutation scheme by Strobl et al. while leaving the dependence between features untouched, and that for a large number of features it would be faster to compute than standard permutation importance (although PIMP requires retraining the model for each permutation, which classical permutation importance does not).
As for the drawbacks of this method, I believe the main one is the number of permutations needed to get reliable p-values in the end. The original paper suggests between 50 and 100 permutations of the response vector; however, one of the implementations I found (I believe the one in ranger) suggests that >>100 are needed for accurate p-values.
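To make the proposal concrete, here is a minimal sketch of the scheme described above (permute the response vector, retrain, and compare the observed impurity-based importances against the resulting null distribution). This is my own illustration, not code from the paper or any of the R packages; the function name `pimp_p_values` and all parameter choices are placeholders, and a real implementation would need the refinements the paper discusses (e.g. fitting a parametric null distribution instead of using raw empirical p-values).

```python
# Hypothetical sketch of Altmann's PIMP on top of scikit-learn's RF.
# Not the reference implementation - illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def pimp_p_values(X, y, n_permutations=50, random_state=0):
    rng = np.random.RandomState(random_state)
    model = RandomForestClassifier(n_estimators=100, random_state=random_state)

    # Impurity-based importances on the original response vector.
    observed = model.fit(X, y).feature_importances_

    # Null distribution: retrain once per permutation of y, so any
    # apparent importance is due to chance structure in X alone.
    null = np.empty((n_permutations, X.shape[1]))
    for i in range(n_permutations):
        null[i] = model.fit(X, rng.permutation(y)).feature_importances_

    # Empirical p-value per feature: fraction of null importances that
    # reach or exceed the observed one (with the usual +1 correction).
    return (1 + (null >= observed).sum(axis=0)) / (1 + n_permutations)

X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)
p = pimp_p_values(X, y, n_permutations=20)
```

Note that the retraining loop is embarrassingly parallel, so an actual implementation could reuse scikit's `n_jobs` conventions across permutations.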
Also, it is still based on Random Forests and Gini Impurity measures, so I believe there is still a bias towards high cardinality features and correlated features, but the whole distribution estimation/p-value procedure should help filter out some redundant or otherwise statistically insignificant importance scores.
On the note of speed I should also mention the paper by Janitza et al. (https://link.springer.com/article/10.1007/s11634-016-0276-4), which proposes an algorithm (NTA) that is 100 to 125 times faster than Altmann's PIMP, although it needs a high number of features in the dataset and for many of them to have non-positive importance scores. I believe this method doesn't meet scikit's guidelines for the inclusion of a new algorithm given its number of citations, but it could still be useful to implement; furthermore, Degenhardt et al. indicate it as the best feature selection method along with Boruta in their "Evaluation of variable selection methods for random forests and omics data sets" (https://europepmc.org/article/med/29045534).