
Conversation

@sebp commented Jul 18, 2014
Contributor

This follows Breiman's proposed way of computing feature importances
on the out-of-bag samples for each tree (see section 10 in
Breiman, Random Forests, Machine Learning, 2001).

For the j-th feature, each tree outputs two predictions.
The first one is based on the original data and the second one
on data where the j-th feature is randomly permuted.
Both predictions are compared to the true value, and the difference
in performance is used as an importance measure.
For regression, the R^2 score is used to assess performance;
for classification, accuracy is used.

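In rough pseudocode terms, the per-tree computation looks like the following minimal sketch. It uses only public scikit-learn APIs, with a held-out validation set standing in for each tree's out-of-bag samples; all names here are illustrative, not the PR's actual code.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Toy data; the held-out set plays the role of each tree's OOB samples.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

rng = np.random.RandomState(0)
importances = np.zeros(X_val.shape[1])

for tree in forest.estimators_:
    # Score on the unperturbed data (R^2 here; accuracy for classification).
    baseline = r2_score(y_val, tree.predict(X_val))
    for j in range(X_val.shape[1]):
        X_perm = X_val.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])  # permute only feature j
        permuted = r2_score(y_val, tree.predict(X_perm))
        importances[j] += baseline - permuted  # drop in performance

importances /= len(forest.estimators_)
print(importances)
```
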
@coveralls

Coverage increased (+0.02%) when pulling 97754df on sebp:random_forest_oob_feature_importances into 4113dfe on scikit-learn:master.

This avoids infinite and NaN feature importances
@coveralls

Coverage increased (+0.02%) when pulling da1df33 on sebp:random_forest_oob_feature_importances into 4113dfe on scikit-learn:master.

@ryanvarley

Was there any fundamental reason why this wasn't merged in?

@jnothman
Member

> Was there any fundamental reason why this wasn't merged in?

Well, we never merge without review. But it looks like this was never reviewed, probably for no fundamental reason.

@ryanvarley

Can I get a review on it then? I'm happy to take over if @sebp doesn't come back. The addition made here is pretty important if you are using continuous and binary features together (as continuous features are nearly always considered more important, since they can be split on multiple times).
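
A small synthetic illustration of that bias with the existing impurity-based `feature_importances_` (hypothetical data and names, for illustration only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n = 1000
binary_informative = rng.randint(0, 2, size=n)  # carries the signal
continuous_noise = rng.rand(n)                  # pure noise
# Labels mostly follow the binary feature, with a little label noise.
y = np.where(rng.rand(n) < 0.9, binary_informative, 1 - binary_informative)
X = np.column_stack([binary_informative, continuous_noise])

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# Impurity-based importances: the continuous noise column usually receives a
# non-trivial share despite carrying no signal, simply because it offers many
# candidate split thresholds -- the bias OOB permutation importances would avoid.
print(dict(zip(["binary (informative)", "continuous (noise)"],
               clf.feature_importances_)))
```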

@jnothman
Member

I am not familiar enough with the technique myself. I assume it is quite standard. It looks straightforward enough, but are you able to review the code and the parameter description for correctness, please? When I find time, I will try to review it for idiom, API, documentation, and testing. If @sebp does not step up to bring it to completion, then you will be welcome to champion the cause.

@ryanvarley

Sure, thanks.

@jnothman
Member

To whoever works on this:

  • we now need to use _generate_unsampled_indices instead of indices_
  • I think we can avoid duplicating the code by using the scoring API. Given a scorer (e.g. `scorer = get_scorer(oob_feature_importances)`, where `oob_feature_importances` defaults to 'accuracy' or 'r2' if True), we should be able to do `score[j] += scorer(estimator, X[mask, :], y[mask]); ...; score[j] -= scorer(estimator, X_modified[mask, :], y[mask])` (see the sketch after this comment)
  • we should be using sklearn.utils.safe_indexing for indexing with masks to support some versions of scipy.sparse (and to use the faster .take)
  • feature_importances_ has special meaning in, at least, SelectFromModel. Perhaps we should be reusing that existing name rather than setting a new attribute. But I'm not certain about this.
  • we might want to use multithreading, although currently X is not being copied when feature j is modified and hence is not thread-safe. Making it thread-safe may counteract benefits of parallelism.

Testing this exactly seems a challenge, but could be done if we want to...
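
A rough sketch of that scorer-based update; the function name is invented, and `mask` is a placeholder for a tree's out-of-bag indices (which the PR would obtain via the private `_generate_unsampled_indices` helper):

```python
import numpy as np
from sklearn.metrics import get_scorer

def permutation_score_drops(estimator, X, y, mask, scoring="r2", random_state=0):
    """Per-tree drop in score when each feature is permuted on the masked rows.

    `mask` is assumed to be a boolean or index array selecting the tree's
    out-of-bag samples; `scoring` is any name get_scorer understands,
    e.g. "accuracy" for classifiers or "r2" for regressors.
    """
    scorer = get_scorer(scoring)
    rng = np.random.RandomState(random_state)
    baseline = scorer(estimator, X[mask], y[mask])

    drops = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        X_modified = X.copy()  # copy so the original X is never mutated
        X_modified[:, j] = rng.permutation(X_modified[:, j])
        drops[j] = baseline - scorer(estimator, X_modified[mask], y[mask])
    return drops
```

Accumulating these drops over all trees and dividing by the number of trees would give the forest-level importances.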

@jnothman
Member

As per #8027 (comment) we'll give this a miss for now.
