ENH: Added Random Forest feature importance based on out of bag data #3436
Conversation
This follows Breiman's proposed way of computing feature importances on the out-of-bag samples for each tree (see section 10 in Breiman, Random Forests, Machine Learning, 2001). For the j-th feature, each tree outputs two predictions: the first is based on the original data, the second on data where the j-th feature is randomly permuted. Both predictions are compared to the true values, and the difference in performance is used as an importance measure. The R^2 score is used to assess performance in the case of regression, and accuracy in the case of classification.
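For context, here is a minimal, self-contained sketch of that idea, not the PR's code. It hand-rolls the bootstrap step so each tree's out-of-bag indices are known exactly, without reaching into the forest's internals, and reports the mean per-tree drop in OOB accuracy when a feature is permuted. The helper name `oob_permutation_importance` is invented for this sketch; a regression variant would substitute R^2 for accuracy.

```python
# A sketch of Breiman's OOB permutation importance using only public
# scikit-learn APIs. Not the PR's implementation: bagging is hand-rolled
# so each tree's out-of-bag indices are known exactly. The helper name
# `oob_permutation_importance` is invented for this sketch.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_permutation_importance(X, y, n_trees=100, random_state=0):
    """Mean per-tree drop in OOB accuracy when each feature is permuted."""
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    importances = np.zeros(n_features)
    for _ in range(n_trees):
        # Bootstrap sample; the OOB set is every index never drawn.
        idx = rng.randint(0, n_samples, n_samples)
        oob = np.setdiff1d(np.arange(n_samples), idx)
        if oob.size == 0:  # vanishingly rare for non-trivial n_samples
            continue
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=rng)
        tree.fit(X[idx], y[idx])
        base_acc = tree.score(X[oob], y[oob])  # accuracy on intact OOB data
        for j in range(n_features):
            X_perm = X[oob].copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break feature j
            # Per-tree importance of j: how much OOB accuracy drops.
            importances[j] += base_acc - tree.score(X_perm, y[oob])
    return importances / n_trees

if __name__ == "__main__":
    from sklearn.datasets import make_classification
    X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                               random_state=0)
    print(oob_permutation_importance(X, y).round(3))
```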
This avoids infinite and NaN feature importances
Was there any fundamental reason why this wasn't merged in?
Well, we never merge without review. But it looks like this was never reviewed, probably for no fundamental reason.
Can I get a review on it then? I'm happy to take over if @sebp doesn't come back. The addition made here is pretty important if you are using continuous and binary features together (since continuous features are nearly always considered more important, as they can be split on multiple times).
I am not familiar enough with the technique myself. I assume it is quite standard. It looks straightforward enough, but could you please review the code and the parameter description for correctness? When I find time, I will try to review it for idiom, API, documentation, and testing. If @sebp does not step up to bring it to completion, then you will be welcome to champion the cause.
Sure, thanks.
To whoever works on this:
Testing this exactly seems a challenge, but could be done if we want to... |
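For what it's worth, a loose sanity check along these lines seems feasible, reusing the hypothetical `oob_permutation_importance` sketch above: with `shuffle=False`, `make_classification` places the informative columns first, so a trailing pure-noise column should score well below them.

```python
from sklearn.datasets import make_classification

def test_noise_feature_has_low_oob_importance():
    # Column 3 is pure noise: the 3 informative columns come first
    # because shuffle=False, and there are no redundant columns.
    X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                               n_redundant=0, shuffle=False, random_state=0)
    imp = oob_permutation_importance(X, y, random_state=0)
    # Permuting noise should cost (almost) nothing; permuting any
    # informative column should cost noticeably more.
    assert imp[3] < imp[:3].min()
```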
As per #8027 (comment) we'll give this a miss for now. |