
[WIP] Permutation importances for Random Forest #8027


Closed

Conversation


@dmwinslow commented Dec 9, 2016

This provides an alternative method for calculating feature importances for Random Forests, by means of a permutation test. The test was described in the original random forest paper [1] and codified in subsequent literature [2]. We based this work in part on a previous pull request, #3436, but opted to start a new pull request as our implementation differs substantially and uses somewhat different methodology in performing the actual permutation test (we could not verify the method used in #3436 anywhere in the literature).

Relevant to issue #6024, which requests permutation importances specifically for RandomForestClassifier, but also makes the point that permutation importances could be applied more generally to classifiers and regressors. A minimal sketch of the procedure follows the references below.

References
.. [1] L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32, 2001.
.. [2] Jerome Paul and Pierre Dupont, "Inferring statistically significant features from random forests", Neurocomputing, 150, Part B:471–480, 2015.
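
For readers unfamiliar with the method, here is a minimal, self-contained sketch of the OOB permutation test described in [1]. It draws the bootstrap samples by hand (so the OOB indices are known without relying on scikit-learn internals) and defines a feature's importance as the mean increase in OOB mislabel rate after shuffling that feature. The function name and defaults are illustrative, not this PR's API:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def oob_permutation_importances(X, y, n_trees=100, random_state=0):
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    importances = np.zeros(n_features)
    used_trees = 0
    for _ in range(n_trees):
        # Draw a bootstrap sample; the rows never drawn form the OOB set.
        boot = rng.randint(0, n_samples, n_samples)
        oob = np.setdiff1d(np.arange(n_samples), boot)
        if oob.size == 0:
            continue
        tree = DecisionTreeClassifier(random_state=rng.randint(2 ** 31 - 1))
        tree.fit(X[boot], y[boot])
        # Baseline OOB mislabel rate with the data intact.
        baseline = np.mean(tree.predict(X[oob]) != y[oob])
        for j in range(n_features):
            # Shuffle one feature among the OOB rows and re-measure the error.
            X_perm = X[oob].copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            permuted = np.mean(tree.predict(X_perm) != y[oob])
            importances[j] += permuted - baseline
        used_trees += 1
    return importances / max(used_trees, 1)

X, y = load_iris(return_X_y=True)
print(oob_permutation_importances(X, y))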

Outstanding issues

  • No support for RandomForestRegressor
    -- Regression would require a different accuracy metric. R^2 score is one option, but we should consider which metric to use as default, and if we want to support alternative accuracy metrics for permutation importance calculations.
  • No support for multiple outputs
    -- This is easy enough to rectify, but we will need to decide how to aggregate accuracy over multiple outputs. One option is a simple average over all outputs (sketched after this list). We should consider how this aggregation might interact with other accuracy metrics, if we choose to support other options.
  • No support for class/sample weights
    -- This implementation does not currently account for class weights or sample weights in calculating feature importances. This may be desirable, but it seems more sensible to use the same weights that were used to train the forest. Need to do a quick literature search on this, happy to accept comments if anyone else has an opinion.
  • Use of static methods for helper functions
    -- We opted to implement a number of helper functions as static methods on the ForestClassifier class in order to make the code more readable and avoid repeated code (i.e., _get_oob_data, _predict_oob, _calc_mislabel_rate). This doesn't seem like standard practice, so any advice on the preferred method for doing this would be appreciated.
  • Use of _feature_importances at the tree level
    -- The changes made to the BaseDecisionTree are a bit of a hack, but were necessary to ensure that the feature importances reported by the tree are self-consistent with those calculated at the Forest level (e.g., for the std bars reported in the example we added). We could also accomplish this by moving the permutation test logic into the tree itself, but the trees being aware of the out-of-bag sampling seems wrong as well. Suggestions definitely welcome here.
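
For the multi-output point above, one hypothetical aggregation (not part of this PR) is to compute a mislabel rate per output and then take a simple unweighted average:

import numpy as np

def multioutput_mislabel_rate(expected, predicted):
    # expected and predicted are (n_samples, n_outputs) label arrays.
    per_output = np.mean(np.asarray(expected) != np.asarray(predicted), axis=0)
    # Simple average across outputs; weighting schemes are another option.
    return per_output.mean()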

y_oob_pred = self._predict_oob(X_oob, tree)

feature_importances = np.zeros(n_features)
for feature_ind in xrange(n_features):

Python 3 renamed xrange() to range(). Please make the change.
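
For reference, the corrected loop; range exists on both Python 2 and 3, so this stays compatible with both:

feature_importances = np.zeros(n_features)
for feature_ind in range(n_features):  # was: xrange(n_features)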

@jnothman (Member) left a comment

This looks interesting. Some initial code comments while I fall asleep...

n_jobs=n_jobs,
random_state=random_state,
verbose=verbose,
warm_start=warm_start,
class_weight=class_weight)

def _set_permutation_feature_importances(self, X, y):
# Convert data to arrays, y to row vector
X = np.array(X)

X will either be an array or a sparse matrix already. If it is a sparse matrix, we should be raising a NotImplementedError, I think.
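
One way to act on this suggestion, sketched as a standalone helper (the name and message are illustrative, not this PR's code):

import numpy as np
from scipy.sparse import issparse

def _check_dense_input(X):
    # Reject sparse input up front rather than densifying it silently.
    if issparse(X):
        raise NotImplementedError(
            "Permutation importances are not implemented for sparse input.")
    return np.asarray(X)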

def _set_permutation_feature_importances(self, X, y):
# Convert data to arrays, y to row vector
X = np.array(X)
y = np.squeeze(np.array(y))

y will already be an array.

y_oob_pred_perm = self._predict_oob(X_oob, tree,
shuffle_ind=feature_ind,
random_state=random_state)
n_wrong = self._calc_mislabel_rate(y_oob, y_oob_pred)

It looks like this can be taken out of the loop.
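
To illustrate, a sketch of the hoisting using the PR's own helper names (a fragment of the diff, not standalone code): the baseline rate depends only on the unpermuted predictions, so it is loop-invariant.

n_wrong = self._calc_mislabel_rate(y_oob, y_oob_pred)  # computed once per tree
for feature_ind in range(n_features):
    y_oob_pred_perm = self._predict_oob(X_oob, tree,
                                        shuffle_ind=feature_ind,
                                        random_state=random_state)
    n_wrong_perm = self._calc_mislabel_rate(y_oob, y_oob_pred_perm)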


@staticmethod
def _calc_mislabel_rate(expected, predicted):
bool_mislabel = np.squeeze(expected) != np.squeeze(predicted)

In what case should we expect expected.shape != predicted.shape such that we would want to squeeze?
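
If both arguments are guaranteed to arrive as 1-D arrays of equal length, the squeeze calls can be dropped; a sketch, assuming the rate is simply the mean of the mismatch mask:

import numpy as np

def _calc_mislabel_rate(expected, predicted):
    # Inputs are 1-D arrays of equal length, so no squeeze is needed.
    return np.mean(np.asarray(expected) != np.asarray(predicted))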

@ryanvarley

Given the recent discussion in #8898 regarding eli5 implementing permutation importances, is it worth progressing with this PR?

This technique for random forests is more standard (it is available in R and was mentioned in the original random forest paper) and can use the OOB data, meaning the API stays the same (no additional input is needed for this importance measure). It would be a nice addition to scikit-learn, especially given some of the issues with the existing importance measure (Gini importance); many users won't encounter this alternative if it is only available in an external package.

That said, the eli5 implementation is very nice and much more useful, as it can easily be applied to any classifier under multiple configurations, so the extra value we would gain here is minimal.

Thoughts @jnothman @kmike?

@jnothman (Member) commented Sep 7, 2017 via email

@ryanvarley

In that case I suggest we close this PR and the inactive PR it was based on (#3436).

@jnothman (Member)

Thanks for the reminder.
