[WIP] Permutation importances for Random Forest #8027
Conversation
sklearn/ensemble/forest.py
Outdated
y_oob_pred = self._predict_oob(X_oob, tree)

feature_importances = np.zeros(n_features)
for feature_ind in xrange(n_features):
Python 3 renamed xrange() to range(). Please make the change.
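For clarity, the drop-in fix looks like the following (a simplified standalone version of the loop in the diff above, not the PR's actual code):

```python
import numpy as np

n_features = 4
feature_importances = np.zeros(n_features)

# range() is the Python 3 replacement for xrange(); it is lazy in
# Python 3, so the loop behaves identically.
for feature_ind in range(n_features):
    feature_importances[feature_ind] += 0.0  # per-feature work goes here
```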
This looks interesting. Some initial code comments while I fall asleep...
sklearn/ensemble/forest.py
Outdated
                         n_jobs=n_jobs,
                         random_state=random_state,
                         verbose=verbose,
                         warm_start=warm_start,
                         class_weight=class_weight)

    def _set_permutation_feature_importances(self, X, y):
        # Convert data to arrays, y to row vector
        X = np.array(X)
X will either be an array or a sparse matrix already. If it is a sparse matrix, we should be raising a NotImplementedError, I think.
sklearn/ensemble/forest.py
Outdated
    def _set_permutation_feature_importances(self, X, y):
        # Convert data to arrays, y to row vector
        X = np.array(X)
        y = np.squeeze(np.array(y))
y will already be an array.
sklearn/ensemble/forest.py
Outdated
        y_oob_pred_perm = self._predict_oob(X_oob, tree,
                                            shuffle_ind=feature_ind,
                                            random_state=random_state)
        n_wrong = self._calc_mislabel_rate(y_oob, y_oob_pred)
It looks like this can be taken out of the loop.
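To illustrate the point: the baseline OOB error does not depend on which feature is permuted, so it can be computed once before the per-feature loop. A simplified standalone sketch (names and signature are hypothetical, not the PR's code):

```python
import numpy as np

def permutation_importances(y_true, y_pred_baseline, X_oob, predict, rng):
    # Baseline mislabel rate is loop-invariant: compute it once.
    baseline_err = np.mean(y_true != y_pred_baseline)

    n_features = X_oob.shape[1]
    importances = np.zeros(n_features)
    for j in range(n_features):
        X_perm = X_oob.copy()
        rng.shuffle(X_perm[:, j])  # permute one feature column in place
        err_perm = np.mean(y_true != predict(X_perm))
        importances[j] = err_perm - baseline_err
    return importances
```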
sklearn/ensemble/forest.py
Outdated
    @staticmethod
    def _calc_mislabel_rate(expected, predicted):
        bool_mislabel = np.squeeze(expected) != np.squeeze(predicted)
In what case should we expect expected.shape != predicted.shape, such that we would want to squeeze?
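For context on why the squeeze matters at all: if one side arrives as an (n, 1) column vector, the comparison broadcasts to (n, n) and silently gives a wrong mislabel rate. A small illustration (not the PR's code):

```python
import numpy as np

expected = np.array([0, 1, 1, 0])           # shape (4,)
predicted = np.array([[0], [1], [0], [0]])  # shape (4, 1) column vector

# Without squeeze, (4,) != (4, 1) broadcasts to a (4, 4) matrix.
broadcast_shape = (expected != predicted).shape

# With squeeze, both sides are (4,) and the comparison is elementwise.
mislabel = np.squeeze(expected) != np.squeeze(predicted)
```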
Given the recent discussion in #8898 regarding eli5 implementing permutation importances, is it worth progressing with this PR? This technique for random forests is more standard (it is available in R and was mentioned in the original random forest paper) and can use the OOB data, meaning the API stays the same (no additional input is needed for this importance measure). It would be a nice addition to scikit-learn, especially given some of the issues with the existing importance measure (Gini importance), which many won't see if it's only available in an external package. That said, the eli5 implementation is very nice and is much more useful, as it can easily be applied to any classifier under multiple configurations, so the extra value we gain here is minimal. Thoughts @jnothman @kmike?
I don't think we have a good API in scikit-learn for alternative feature importances (see #9606), I think eli5 is doing some good work in this space, and I think we've got plenty to review. So I'm inclined to agree that this might be worth closing. But it's been a long time since I've looked at this method in detail.
(In reply to Ryan Varley's comment above, 7 September 2017.)
In that case I suggest we close this PR, along with the inactive PR this was based on (#3436).
Thanks for the reminder.
This provides an alternative method for calculating feature importances for Random Forests, by means of a permutation test. This test was described in the original random forest paper [1] and codified in subsequent literature [2]. We based this work in part on a previous pull request, #3436, but opted to start a new pull request as our implementation differs substantially and uses somewhat different methodology in performing the actual permutation test (we could not verify the method used in #3436 anywhere in the literature).
Relevant to issue #6024, which requests permutation importances specifically for the RandomForestClassifier, but also makes the point that permutation importances could be applied more generally to classifiers/regressors.
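For reference, the forest-level aggregation step can be sketched as follows. This is a simplified illustration, not the PR's code: it assumes the per-tree OOB permutation importances have already been computed, and that the forest-level importance is their mean, with the per-tree spread giving the std bars shown in the added example.

```python
import numpy as np

def forest_importances(per_tree_importances):
    # per_tree_importances: array-like of shape (n_trees, n_features),
    # each row the OOB permutation importances of one tree.
    per_tree = np.asarray(per_tree_importances)
    mean = per_tree.mean(axis=0)  # forest-level importance per feature
    std = per_tree.std(axis=0)    # spread across trees (for error bars)
    return mean, std
```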
References
.. [1] L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32, 2001.
.. [2] Jerome Paul and Pierre Dupont, "Inferring statistically significant features from random forests", Neurocomputing, 150, Part B:471–480, 2015.
Outstanding issues
-- Regression would require a different accuracy metric. The R^2 score is one option, but we should consider which metric to use as the default, and whether we want to support alternative accuracy metrics for permutation importance calculations.
-- This is easy enough to rectify, but we will need to decide how to handle aggregate accuracy over multiple outputs. One way to do this might be a simple average over all outputs. We should consider how this aggregation might interact with other accuracy metrics, if we choose to support other options.
-- This implementation does not currently account for class weights or sample weights in calculating feature importances. This may be desirable, but it seems more sensible to use the same weights that were used to train the forest. Need to do a quick literature search on this; happy to accept comments if anyone else has an opinion.
-- We opted to implement a number of helper functions as static methods on the ForestClassifier class (i.e., _get_oob_data, _predict_oob, _calc_mislabel_rate) in order to make the code more readable and avoid repeated code. This doesn't seem like standard practice, so any advice on the preferred way to do this would be appreciated.
-- The changes made to the BaseDecisionTree are a bit of a hack, but were necessary to ensure that the feature importances reported by the tree are self-consistent with those calculated at the forest level (e.g., for the std bars reported in the example we added). We could also accomplish this by moving the permutation test logic into the tree itself, but the trees being aware of the out-of-bag sampling seems wrong as well. Suggestions definitely welcome here.
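On the first outstanding issue above: if R^2 were chosen as the regression metric, the permuted-minus-baseline comparison would simply use score decrease instead of error increase. A plain-numpy sketch of the metric itself (for illustration; scikit-learn already ships an equivalent in sklearn.metrics):

```python
import numpy as np

def r2_score(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot
```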