
[WIP] Permutation importances for Random Forest #8027


Closed

Conversation


@dmwinslow commented Dec 9, 2016

This provides an alternative method for calculating feature importances for Random Forests, by means of a permutation test. The test was described in the original random forest paper [1] and codified in subsequent literature [2]. We based this work in part on a previous pull request, #3436, but opted to start a new pull request as our implementation differs substantially and uses somewhat different methodology in performing the actual permutation test (we could not verify the method used in #3436 anywhere in the literature).

Relevant to issue #6024, which requests permutation importances specifically for RandomForestClassifier, but also makes the point that permutation importances could be applied more generally to classifiers and regressors. A minimal sketch of the procedure follows the references below.

References
.. [1] L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32, 2001.
.. [2] Jerome Paul and Pierre Dupont, "Inferring statistically significant features from random forests", Neurocomputing, 150, Part B:471–480, 2015.
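
For readers unfamiliar with the method, here is a minimal, self-contained sketch of the OOB permutation test described in [1]. It draws the bootstrap samples by hand (so the OOB indices are known without relying on scikit-learn internals) and defines a feature's importance as the mean increase in OOB mislabel rate after shuffling that feature. The function name and defaults are illustrative, not this PR's API:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def oob_permutation_importances(X, y, n_trees=100, random_state=0):
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    importances = np.zeros(n_features)
    used_trees = 0
    for _ in range(n_trees):
        # Draw a bootstrap sample; the rows never drawn form the OOB set.
        boot = rng.randint(0, n_samples, n_samples)
        oob = np.setdiff1d(np.arange(n_samples), boot)
        if oob.size == 0:
            continue
        tree = DecisionTreeClassifier(random_state=rng.randint(2 ** 31 - 1))
        tree.fit(X[boot], y[boot])
        # Baseline OOB mislabel rate with the data intact.
        baseline = np.mean(tree.predict(X[oob]) != y[oob])
        for j in range(n_features):
            # Shuffle one feature among the OOB rows and re-measure the error.
            X_perm = X[oob].copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            permuted = np.mean(tree.predict(X_perm) != y[oob])
            importances[j] += permuted - baseline
        used_trees += 1
    return importances / max(used_trees, 1)

X, y = load_iris(return_X_y=True)
print(oob_permutation_importances(X, y))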

Outstanding issues

  • No support for RandomForestRegressor
    -- Regression would require a different accuracy metric. R^2 score is one option, but we should consider which metric to use as default, and if we want to support alternative accuracy metrics for permutation importance calculations.
  • No support for multiple outputs
    -- This is easy enough to rectify, but we will need to decide how to aggregate accuracy over multiple outputs. One option is a simple average over all outputs (sketched after this list). We should consider how this aggregation might interact with other accuracy metrics, if we choose to support other options.
  • No support for class/sample weights
    -- This implementation does not currently account for class weights or sample weights in calculating feature importances. This may be desirable, but it seems more sensible to use the same weights that were used to train the forest. Need to do a quick literature search on this, happy to accept comments if anyone else has an opinion.
  • Use of static methods for helper functions
    -- We opted to implement a number of helper functions as static methods on the ForestClassifier class in order to make the code more readable and avoid repeated code (i.e., _get_oob_data, _predict_oob, _calc_mislabel_rate). This doesn't seem like standard practice, so any advice on the preferred method for doing this would be appreciated.
  • Use of _feature_importances at the tree level
    -- The changes made to the BaseDecisionTree are a bit of a hack, but were necessary to ensure that the feature importances reported by the tree are self-consistent with those calculated at the Forest level (e.g., for the std bars reported in the example we added). We could also accomplish this by moving the permutation test logic into the tree itself, but the trees being aware of the out-of-bag sampling seems wrong as well. Suggestions definitely welcome here.
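
For the multi-output point above, one hypothetical aggregation (not part of this PR) is to compute a mislabel rate per output and then take a simple unweighted average:

import numpy as np

def multioutput_mislabel_rate(expected, predicted):
    # expected and predicted are (n_samples, n_outputs) label arrays.
    per_output = np.mean(np.asarray(expected) != np.asarray(predicted), axis=0)
    # Simple average across outputs; weighting schemes are another option.
    return per_output.mean()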

y_oob_pred = self._predict_oob(X_oob, tree)

feature_importances = np.zeros(n_features)
for feature_ind in xrange(n_features):

Python 3 renamed xrange() to range(). Please make the change.
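
For reference, the corrected loop; range exists on both Python 2 and 3, so this stays compatible with both:

feature_importances = np.zeros(n_features)
for feature_ind in range(n_features):  # was: xrange(n_features)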

@jnothman (Member) left a comment

This looks interesting. Some initial code comments while I fall asleep...

n_jobs=n_jobs,
random_state=random_state,
verbose=verbose,
warm_start=warm_start,
class_weight=class_weight)

def _set_permutation_feature_importances(self, X, y):
# Convert data to arrays, y to row vector
X = np.array(X)

X will either be an array or a sparse matrix already. If it is a sparse matrix, we should be raising a NotImplementedError, I think.
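
One way to act on this suggestion, sketched as a standalone helper (the name and message are illustrative, not this PR's code):

import numpy as np
from scipy.sparse import issparse

def _check_dense_input(X):
    # Reject sparse input up front rather than densifying it silently.
    if issparse(X):
        raise NotImplementedError(
            "Permutation importances are not implemented for sparse input.")
    return np.asarray(X)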

def _set_permutation_feature_importances(self, X, y):
# Convert data to arrays, y to row vector
X = np.array(X)
y = np.squeeze(np.array(y))

y will already be an array.

y_oob_pred_perm = self._predict_oob(X_oob, tree,
shuffle_ind=feature_ind,
random_state=random_state)
n_wrong = self._calc_mislabel_rate(y_oob, y_oob_pred)

It looks like this can be taken out of the loop.
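
To illustrate, a sketch of the hoisting using the PR's own helper names (a fragment of the diff, not standalone code): the baseline rate depends only on the unpermuted predictions, so it is loop-invariant.

n_wrong = self._calc_mislabel_rate(y_oob, y_oob_pred)  # computed once per tree
for feature_ind in range(n_features):
    y_oob_pred_perm = self._predict_oob(X_oob, tree,
                                        shuffle_ind=feature_ind,
                                        random_state=random_state)
    n_wrong_perm = self._calc_mislabel_rate(y_oob, y_oob_pred_perm)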


@staticmethod
def _calc_mislabel_rate(expected, predicted):
bool_mislabel = np.squeeze(expected) != np.squeeze(predicted)

In what case should we expect expected.shape != predicted.shape such that we would want to squeeze?
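
If both arguments are guaranteed to arrive as 1-D arrays of equal length, the squeeze calls can be dropped; a sketch, assuming the rate is simply the mean of the mismatch mask:

import numpy as np

def _calc_mislabel_rate(expected, predicted):
    # Inputs are 1-D arrays of equal length, so no squeeze is needed.
    return np.mean(np.asarray(expected) != np.asarray(predicted))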

@ryanvarley

Given the recent discussion in #8898 regarding eli5 implementing permutation importances, is it worth progressing with this PR?

This technique for random forests is more standard (it is available in R and was mentioned in the original random forest paper) and can use the OOB data, meaning the API stays the same (no additional input is needed for this importance measure). It would be a nice addition to scikit-learn, especially given some of the issues with the existing importance measure (Gini importance); many users won't encounter this alternative if it is only available in an external package.

That said, the eli5 implementation is very nice and much more useful, as it can easily be applied to any classifier under multiple configurations, so the extra value we would gain here is minimal.

Thoughts @jnothman @kmike?

@jnothman (Member) commented Sep 7, 2017 via email

@ryanvarley

In that case I suggest we close this PR and the inactive PR it was based on (#3436).

@jnothman (Member)

Thanks for the reminder.
