Unbiased MDI-like feature importance measure for random forests #31271
…as "extreme value" issues
…d that they coincide with feature_importances_ on inbag samples
…n different from gini/mse
Note that this PR is meant to stay in draft mode while we conduct empirical experiments to better understand the pros and cons of each method and decide whether we want to keep only one or both.
Closed because of a wrong branch name; the new PR is #31279.
Reference Issues/PRs
Fixes #20059
What does this implement/fix? Explain your changes.
This implements two methods that correct the cardinality bias of the `feature_importances_` attribute of random forest estimators by leveraging out-of-bag (oob) samples.

The first method is derived from "Unbiased Measurement of Feature Importance in Tree-Based Methods" by Zhengze Zhou & Giles Hooker. The corresponding attribute is named `ufi_feature_importances_`.

The second method is derived from "A Debiased MDI Feature Importance Measure for Random Forests" by Xiao Li et al. The corresponding attribute is named `mdi_oob_feature_importances_`.

The names are temporary: we are still seeking a way to favor one method over the other (currently investigating whether one of the two reaches its asymptotic behavior faster than the other).

These attributes are set by the `fit` method after training, if the parameter `oob_score` is set to `True`. In this case we send the oob samples to a Cython method at tree level that propagates them through the tree and returns the corresponding oob prediction function and feature importance measure.

This new feature importance measure behaves like regular Mean Decrease Impurity (MDI) but mixes the in-bag and out-of-bag values of each node instead of using the in-bag impurity alone. The two proposed methods differ in the way they mix in-bag and oob samples.
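As a minimal usage sketch of the draft API described above (the attribute names are the temporary ones proposed in this PR and may change before merging):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,  # required: the debiased measures are computed from oob samples
    random_state=0,
).fit(X, y)

forest.feature_importances_          # classical MDI (cardinality-biased)
forest.ufi_feature_importances_      # correction after Zhou & Hooker
forest.mdi_oob_feature_importances_  # correction after Li et al.
```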
This PR also adds these two new feature importance measures to the test suite, specifically in test_forest.py. Existing tests are extended to cover the two measures, and new tests are added to make sure they behave correctly (e.g. they coincide with the values given by the code of the cited papers, and they recover traditional MDI when used on in-bag samples).
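To make the node-level mixing more concrete, here is a simplified, pure-Python sketch (our own illustration, not the PR's Cython code) of a ufi-style mixed Gini impurity for a single tree, computed as `1 - sum_c p_c_in * p_c_oob` from the in-bag and oob class proportions routed through each node. The exact formulas in the PR may differ; only the mixing idea is taken from the description above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
rng = np.random.default_rng(0)

# Emulate one bagging round: bootstrap indices are "in-bag", the rest is oob.
inbag = rng.integers(0, len(X), size=len(X))
oob = np.setdiff1d(np.arange(len(X)), inbag)

tree = DecisionTreeClassifier(random_state=0).fit(X[inbag], y[inbag])

def node_class_proportions(tree, X, y, n_classes):
    # decision_path returns a sparse (n_samples, n_nodes) indicator matrix.
    path = tree.decision_path(X).toarray()
    counts = np.stack(
        [path[y == c].sum(axis=0) for c in range(n_classes)], axis=1
    ).astype(float)
    totals = counts.sum(axis=1, keepdims=True)
    # Nodes that receive no sample get all-zero proportions.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

n_classes = len(np.unique(y))
p_in = node_class_proportions(tree, X[inbag], y[inbag], n_classes)
p_oob = node_class_proportions(tree, X[oob], y[oob], n_classes)

# Mixed Gini per node: 1 - <p_in, p_oob>. With p_oob == p_in this reduces to
# the standard in-bag Gini, consistent with the test mentioned above that the
# new measures recover traditional MDI on in-bag samples.
mixed_gini = 1.0 - (p_in * p_oob).sum(axis=1)
print(mixed_gini[:5])
```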
Any other comments?
The papers only suggest fixes for trees built with the Gini (classification) and Mean Squared Error (regression) criteria, but we would like the new methods to support the other criteria available in scikit-learn. `log_loss` support was added for classification with the ufi method by generalizing the idea of mixing in-bag and oob samples (see the sketch below).

Some CPU and memory profiling was done to check that, for large enough datasets, the computational overhead remains small compared to the cost of fitting the model.
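For illustration, one natural reading of that generalization (an assumption on our part; the PR's exact formula may differ) replaces the in-bag entropy at each node with a cross-entropy between in-bag and oob class proportions:

```python
import numpy as np

def mixed_log_loss(p_in, p_oob, eps=1e-12):
    """Hypothetical mixed counterpart of the log_loss (entropy) criterion.

    Cross-entropy between in-bag and oob class proportions at one node;
    with p_oob == p_in it reduces to the usual in-bag entropy.
    """
    return -(p_in * np.log(np.clip(p_oob, eps, None))).sum()
```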
Support for sparse matrix input and for sample weights should be added soon.
Tests on `oob_score_` currently fail; this is under investigation.

This work is done in close collaboration with @ogrisel.
TODO:

- [x] Fix the failing `oob_score_` tests: done in d198f20.
- [ ] Can the `"mdi_oob"` method be naturally expanded to support `criterion="log_loss"`, as seems to be the case for the `"ufi"` method?
- [ ] Add support for `sample_weight`.
- [ ] Extend to `GradientBoostingClassifier` and `GradientBoostingRegressor` when row-wise (sub)sampling is enabled at training time.