-
-
Notifications
You must be signed in to change notification settings - Fork 26.2k
Unbiased MDI-like feature importance measure for random forests #31279
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Unbiased MDI-like feature importance measure for random forests #31279
Conversation
…as "extreme value" issues
…d that they coincide with feature_importances_ on inbag samples
…n different from gini/mse
cd131bc
to
83c0a1a
Compare
I added |
You could maybe condition on the joblib version ? |
We opened a PR to bump to the next version anyway as it will be old enough by next release 🤷♂️ |
Can someone give me a short summary and in particular the formulae of the implemented features importances? |
The classical MDI uses the impurity of the nodes computed during training, which in binary classification is : Once these modified impurity measures are computed with the oob samples, we compute the same decrease in impurity as in traditional MDI : This can be extended to multi-class classification and in regression we use a similar idea to compute node impurity that uses in and out of bag information : I went into more details in this document, where I also compare this method to the alternative proposed in this paper. |
@GaetandeCast Thanks a lot for your summary. TLDR (summary)What is the difference to the out-of-bag (oob) score of a random forest? DetailsThe Brier score in binary classification ( If one wants to use the oob sample instead of the training sample then one simply can use the Brier score. The same logic applies to all (strictly consistent) loss functions (and to regression). From the formula of the UFI suggestion, I am unable to recover the Brier score on the oob sample. Neither can I find any derivation in their paper. I also think it is plain wrong because they mix the empirical means of 2 independent data samples, i.e., Edit: If one restricts to the oob sample that flows through that fixed node, meaning p_pred is constant, than one has |
If I understand correctly @lorentzenchr, you suggest computing the impurity (Gini/Brier/MSE/...) of the oob samples directly, and use those to measure feature importance. However this is not sufficient to solve the cardinality bias of the MDI, which is why the two proposed methods suggest mixing metrics based on both in and out of bag samples. One of the reasons the bias is still seen on purely out-of-bag samples is that an impurity measure is always positive and even a random feature will manage to create purity. Therefore even when computed on oob samples, random/uninformative features will be assigned a non-zero importance. This is actually the first approach I tested and I got feature importance measures very close to the default (biased) ones on the docs example about MDI vs permutation importance. This idea is also mentioned in the paper of the second method, MDI-oob, in section 3.2 where they also advise against it. UFI on the other hand is proven to give zero-importance to features independent of the target (Theorems 2 & 4), which is minimum for a feature importance measure. |
@GaetandeCast Yes. To be very honest but without intent to be disrepectful to any researcher, I find the literature that I have seen on bias of MDI feature importance1 partially poor (again, may be due to my ignorance). Does anybody actually define the bias they talke about? Why should in-sample (training) impurity be biased for measuring feature importance of a tree? I know that the loss is biased, but the feature importance? How is feature importance defined statistically (not empirically), I mean in terms of statistical quantities like distributions, expectations, etc? If you can't define what you are talking about, how can you talk about bias of that object. Why do we interprete the high importance of random features (or high cardinality) as bias? Why don't interprete it as a tree/random forest (RF) poorly overfitting those features? If so, the whole question would shift away from bias of feature importance to fixing the fitting algorithms of trees/RF to avoid overfitting. We would even interprete the high importance not as a sign of bias of the importance but as a valuable tool to measure overfitting. If you want to use the statistical risk/expected loss as measure for feature importance, out-of-bag is the best cheapest thing you have for random forests, otherwise you would need to use a so far unused (test) sample (similar to honest trees). And yes, all impurity measures are non-negative. They are generalized entropies and a free constant term is usually set such that the minimum is 0, hence non-negative. Example: The entropy of the squared error (=minimum of expected squared error) is the variance, entropy of log loss is Shannon entropy. Why blame the loss/entropy (impurity), why not define feature importance in a different way? I have not looked yet deeper into the 2nd paper (MDI-oob). 1 Strobl et al (2007) https://doi.org/10.1186/1471-2105-8-25 correctly identifies the root cause: variable selection bias (not MDI bias). |
I understand the concern and I agree that the UFI paper does not do a great job of defining the issue and setting a theoretical framework. This is done in a much better way in section 2 of MDI-oob, where they give non asymptotic bounds on the sum of importance of irrelevant features. The bounds are proportional to the size of the leafs (deeper trees lead to more biased FI) and the upper bound is stricter when features are categorical (matches empirical result about cardinality). The most convincing framework to define feature importance mathematically is that of Shapley values from game theory. In this framework, we want to satisfy the four axioms of TU-values, among which is the null player: a feature independent of the output should receive zero importance. This paper shows that in asymptotic regime and under quite specific conditions (Extra-tree with In a nutshell, the two methods modify the MDI to try to get the desirable properties of Shapley values, while being much cheaper to compute. |
To be more specific, MDI recovers Shapley values of the loss improvement w.r.t the null model (i.e. MSE, Brier score or log-loss depending on the choice of the splitting criterion: MSE, Gini or Shannon entropy). @GaetandeCast has already conducted some experiments to validate that this is empirically the case. I think this is an interesting property, but it's only valid under restrictive conditions: extra trees with asymptotic regime, binary valued features, max_features=1, ... For regular random forests trained on continuous features, we can sometimes empirically observe small but significant discrepancy with the estimated SAGE values. By the way, in the asymptotic regime, all three methods (train MDI, UFI and MDI-oob) converge to the same values. So the Shapley value interpretation of any MDI variant is rarely possible in general. One could argue that UFI has a faster convergence than train MDI towards SAGE values, for many datasets/model combinations, but I am not sure if this is always the case. In my opinion, UFI (and MDI-oob) are mostly interesting as a very cheap yet robust way to find out which features can be safely discarded because they do not help the model generalize, irrespective of the relative cardinality of the features. Train MDI (or naive OOB MDI) cannot achieve this because they can assign larger importance values to large cardinality features that are independent of the target than lower cardinality features with a small to medium association with the target. SAGE values in particular are computationally expensive to compute and require a test set. Permutation importance are also quite expensive to compute (but cheaper than SAGE) and also require a test set (although we could compute them on OOB samples, see #18603). MDI-based importances are much cheaper (nearly free) and do not require extra training data (when computed for bagging ensemble of trees). |
Let me rephrase: Are we trying to solve the wrong problem? |
How would you translate this into an implementable setting? One possible way would be to user recommend to always use Random Forests on |
Note that UFI values (as well as MDI-oob values) converge to (train) MDI values in the large sample limit. So all 3 are correct in that respect. The problem is that whenever you use a random forest, you almost never are in the asymptotic regime (otherwise you wouldn't need an RF and could just train a single deep tree). In the finite sample regime, the (train) MDI values do not reflect which features are important to get predictions that yield accurate predictions on held-out data observations (as illustrated in the scikit-learn example). Therefore, I think UFI is useful to very cheaply identify features that are useless for the RF model and therefore help our users build faster-to-train / faster-to-predict / more efficient predictive with similar test set performance as the original model. |
@lorentzenchr, we summarize here the main arguments that support this inclusion in scikit-learn: The issue:
Why UFI:
|
Also note: the purpose of this PR is not to remove MDI, but to add an alternative that is safer for the purpose of feature selection and explain in the documentation and examples how the two are complementary when trying to understand how the features are used by a given model. |
Reference Issues/PRs
Fixes #20059
What does this implement/fix? Explain your changes.
This PR implements a methods called UFI (Unbiased Feature Importance) that correct the cardinality bias of the
feature_importances_
attribute of random forest estimators by leveraging out-of-bag (oob) samples.It is derived from Unbiased Measurement of Feature Importance in Tree-Based Methods, Zhengze Zhou & Giles Hooker.
There is an other method called MDI-oob introduced in A Debiased MDI Feature Importance Measure for Random Forests, Xiao Li et al.. We show in this report that the two methods are extremely similar when written under the same notations, so we chose UFI because it enjoys an additional property for irrelevant features: if a feature$X_j$ is independent of the target, conditionally on all other features, then it receives zero importance in expectation with UFI. This is proven in the original paper.
The attribute
unbiased_feature_importance
is set by thefit
method after training, if the parameteroob_score
is set toTrue
, for forests trained with the MSE or Gini impurity criteria. We send the oob samples to a Cython method at tree level that propagates them through the tree and returns the corresponding oob prediction function and feature importance measure.This new feature importance measure has a similar behavior to regular Mean Decrease Impurity but mixes the in-bag and out-of-bag values of each node instead of using the in-bag impurity. It also converges to the MDI in the large sample limit.
This PR also includes this new feature importance measure to the test suite, specifically in test_forest.py. Existing tests are widened to test the new method and new tests are added to make sure it behaves correctly (e.g. it coincides with values given by the code of the cited paper, it recovers traditional MDI when used on in-bag samples, it converges to MDI etc.).
We also mention this new feature importance method in the permutation importance vs MDI doc example.
Any other comments?
The papers only suggest fixes for trees built with the Gini (classification) and Mean Squared Error (regression) criteria, so we are limited to these for now.
Some CPU and memory profiling was done to ensure that the computational overhead was controlled enough compared to the cost of model fitting for large enough datasets.
This work is done in colaboration with @ogrisel and @antoinebaker.