
Conversation

@sebp commented Jul 18, 2014
Contributor

This follows Breiman's proposed way of computing feature importances
on the out-of-bag samples for each tree (see section 10 in
Breiman, Random Forests, Machine Learning, 2001).

For the j-th feature, each tree outputs two predictions.
The first one is based on the original data and the second one
on data where the j-th feature is randomly permuted.
Both predictions are compared to the true value, and the difference
in performance is used as an importance measure.
For regression, the R^2 score is used to assess performance;
for classification, accuracy is used.

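In rough pseudocode terms, the per-tree computation looks like the following minimal sketch. It uses only public scikit-learn APIs, with a held-out validation set standing in for each tree's out-of-bag samples; all names here are illustrative, not the PR's actual code.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Toy data; the held-out set plays the role of each tree's OOB samples.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

rng = np.random.RandomState(0)
importances = np.zeros(X_val.shape[1])

for tree in forest.estimators_:
    # Score on the unperturbed data (R^2 here; accuracy for classification).
    baseline = r2_score(y_val, tree.predict(X_val))
    for j in range(X_val.shape[1]):
        X_perm = X_val.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])  # permute only feature j
        permuted = r2_score(y_val, tree.predict(X_perm))
        importances[j] += baseline - permuted  # drop in performance

importances /= len(forest.estimators_)
print(importances)
```
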
@coveralls

Coverage increased (+0.02%) when pulling 97754df on sebp:random_forest_oob_feature_importances into 4113dfe on scikit-learn:master.

This avoids infinite and NaN feature importances
@coveralls

Coverage increased (+0.02%) when pulling da1df33 on sebp:random_forest_oob_feature_importances into 4113dfe on scikit-learn:master.

@ryanvarley

Was there any fundamental reason why this wasn't merged in?

@jnothman
Member

> Was there any fundamental reason why this wasn't merged in?

Well, we never merge without review. But it looks like this was never reviewed, probably for no fundamental reason.

@ryanvarley

Can I get a review on it then? I'm happy to take over if @sebp doesn't come back. The addition made here is pretty important if you are using continuous and binary features together (as continuous features are nearly always considered more important, since they can be split on multiple times).
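
A small synthetic illustration of that bias with the existing impurity-based `feature_importances_` (hypothetical data and names, for illustration only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n = 1000
binary_informative = rng.randint(0, 2, size=n)  # carries the signal
continuous_noise = rng.rand(n)                  # pure noise
# Labels mostly follow the binary feature, with a little label noise.
y = np.where(rng.rand(n) < 0.9, binary_informative, 1 - binary_informative)
X = np.column_stack([binary_informative, continuous_noise])

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# Impurity-based importances: the continuous noise column usually receives a
# non-trivial share despite carrying no signal, simply because it offers many
# candidate split thresholds -- the bias OOB permutation importances would avoid.
print(dict(zip(["binary (informative)", "continuous (noise)"],
               clf.feature_importances_)))
```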

@jnothman
Member

I am not familiar enough with the technique myself. I assume it is quite standard. It looks straightforward enough, but are you able to review the code and the parameter description for correctness, please? When I find time, I will try to review it for idiom, API, documentation, and testing. If @sebp does not step up to bring it to completion, then you will be welcome to champion the cause.

@ryanvarley

Sure, thanks.

@jnothman
Member

To whoever works on this:

  • we now need to use _generate_unsampled_indices instead of indices_
  • I think we can avoid duplicating the code by using the scoring API. Given a scorer (e.g. `scorer = get_scorer(oob_feature_importances)`, where `oob_feature_importances` defaults to 'accuracy' or 'r2' if True), we should be able to do `score[j] += scorer(estimator, X[mask, :], y[mask]); ...; score[j] -= scorer(estimator, X_modified[mask, :], y[mask])` (see the sketch after this comment)
  • we should be using sklearn.utils.safe_indexing for indexing with masks to support some versions of scipy.sparse (and to use the faster .take)
  • feature_importances_ has special meaning in, at least, SelectFromModel. Perhaps we should be reusing that existing name rather than setting a new attribute. But I'm not certain about this.
  • we might want to use multithreading, although currently X is not being copied when feature j is modified and hence is not thread-safe. Making it thread-safe may counteract benefits of parallelism.

Testing this exactly seems a challenge, but could be done if we want to...
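
A rough sketch of that scorer-based update; the function name is invented, and `mask` is a placeholder for a tree's out-of-bag indices (which the PR would obtain via the private `_generate_unsampled_indices` helper):

```python
import numpy as np
from sklearn.metrics import get_scorer

def permutation_score_drops(estimator, X, y, mask, scoring="r2", random_state=0):
    """Per-tree drop in score when each feature is permuted on the masked rows.

    `mask` is assumed to be a boolean or index array selecting the tree's
    out-of-bag samples; `scoring` is any name get_scorer understands,
    e.g. "accuracy" for classifiers or "r2" for regressors.
    """
    scorer = get_scorer(scoring)
    rng = np.random.RandomState(random_state)
    baseline = scorer(estimator, X[mask], y[mask])

    drops = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        X_modified = X.copy()  # copy so the original X is never mutated
        X_modified[:, j] = rng.permutation(X_modified[:, j])
        drops[j] = baseline - scorer(estimator, X_modified[mask], y[mask])
    return drops
```

Accumulating these drops over all trees and dividing by the number of trees would give the forest-level importances.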

@jnothman
Member

As per #8027 (comment) we'll give this a miss for now.
