Unbiased MDI-like feature importance measure for random forests #31279

GaetandeCast · 2025-04-30T16:01:48Z

Reference Issues/PRs

Fixes #20059

What does this implement/fix? Explain your changes.

This implements two methods that correct the cardinality bias of the feature_importances_ attribute of random forest estimators by leveraging out-of-bag (oob) samples.
The first method is derived from Unbiased Measurement of Feature Importance in Tree-Based Methods, Zhengze Zhou & Giles Hooker. The corresponding attribute is named ufi_feature_importances_.
The second method is derived from A Debiased MDI Feature Importance Measure for Random Forests, Xiao Li et al.. The corresponding attribute is named mdi_oob_feature_importances_.
The names are temporary, we are still seeking a way of favoring one method over the other (currently investigating whether one of the two reaches asymptotic behavior faster than the other).

EDIT: since the above description was written, this PR has been updated to only implement the UFI method and only expose the unbiased_feature_importances_ fitted attribute as a result.

These attributes are set by the fit method after training, if the parameter oob_score is set to True. In this case we send the oob samples to a Cython method at tree level that propagates them through the tree and returns the corresponding oob prediction function and feature importance measure.

This new feature importance measure has a similar behavior to regular Mean Decrease Impurity but mixes the in-bag and out-of-bag values of each node instead of using the in-bag impurity. The two proposed method differ in the way they mix in-bag and oob samples.

This PR also includes these two new feature importance measures to the test suite, specifically in test_forest.py. Existing tests are widened to test these two measures and new tests are added to make sure they behave correctly (e.g. they coincide with values given by the code of the cited papers, they recover traditional MDI when used on in-bag samples).

Any other comments?

The papers only suggest fixes for trees built with the Gini (classification) and Mean Squared Error (regression) criteria, but we would like the new methods to support the other available criteria in scikit-learn. log_loss support was added for classification with the ufi method by generalizing the idea of mixing in-bag and oob samples.

Some CPU and memory profiling was done to ensure that the computational overhead was controlled enough compared to the cost of model fitting for large enough datasets.

Support for sparse matrix input should be added soon.

This work is done in close colaboration with @ogrisel.

TODO:

Fix the tests related to oob_score_
Done in d198f20
Add support for sparse input data (scipy sparse matrix and scipy sparse array containers).
support: 8329b3b
test: 0b48af4
Add support and tests for sample_weight
Support added in f10721e. Test in 241de66
Expose the feature for GradientBoostingClassifier and GradientBoostintRegressor when row-wise (sub)sampling is enabled at training time.
Done in ce52159
Shall we expose some public method to allow the user to pass held-out data instead of just computing the importance using OOB samples identified at training time?
Separate gradient boosting from this pr
8a09b39
Update doc example on permutation vs mdi to include ufi & mdi_oob
229cc4d
Think about an API to expose feature importance confidence intervals based on tree level booststraping

Edit: We noticed a discrepancy between the formula defined by the authors of mdi_oob and what their code does. This is detailed here, in part 5. Therefore we only implement UFI for now. Furthermore we could not find an equivalent of ufi for the entropy impurity criterion so we compute ufi with gini whatever the impurity criterion in classification, and with mse for classification

…as "extreme value" issues

…lity

…d that they coincide with feature_importances_ on inbag samples

…ut Gini

…s_ on X_train

…n different from gini/mse

antoinebaker

Here a first pass ! I didn't look at the cython code, examples or tests yet, I'll try at another time :)

sklearn/ensemble/_forest.py

sklearn/tree/_classes.py

Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>

GaetandeCast · 2025-06-25T12:57:45Z

I added return_as="generator" for memory performance which will break the tests on the CI machines that use joblib version 1.2, until the minimum version is bumped to 1.3

antoinebaker · 2025-06-25T13:15:16Z

I added return_as="generator" for memory performance which will break the tests on the CI machines that use joblib version 1.2, until the minimum version is bumped to 1.3

You could maybe condition on the joblib version ?

GaetandeCast · 2025-06-25T13:21:23Z

You could maybe condition on the joblib version ?

We opened a PR to bump to the next version anyway as it will be old enough by next release 🤷‍♂️

lorentzenchr · 2025-07-01T19:07:15Z

Can someone give me a short summary and in particular the formulae of the implemented features importances?

GaetandeCast · 2025-07-02T08:43:18Z

Can someone give me a short summary and in particular the formulae of the implemented features importances?

The classical MDI uses the impurity of the nodes computed during training, which in binary classification is :
$H(t) = 1 - p_{t,0}^2 - (1 - p_{t,0})^2$. The authors of UFI suggest using a modified impurity measure: $H'(t) = 1 - p_{t,0} p^\prime_{t,0} - (1 - p_{t,0})(1 - p^\prime_{t,0})$ where $p_{t,0}$ and $p^\prime_{t,0}$ are the proportion of training and testing samples respectively of class 0 that go through the evaluated node $t$. Since we are doing bagging in Random Forests, these test samples are readily available in the out-of-bag (oob) samples.

Once these modified impurity measures are computed with the oob samples, we compute the same decrease in impurity as in traditional MDI : $\Delta^\prime (t) = \omega_t H^\prime (t) - \omega_l H^\prime (l) - \omega_r H^\prime (r)$ where $l$ and $r$ designate the left and right children of $t$ and $\omega_t$ is the proportion of training samples going through $t$.

This can be extended to multi-class classification and in regression we use a similar idea to compute node impurity that uses in and out of bag information : $H^\prime(t) = \frac{1}{n^\prime_t} \sum_{x^\prime_i \in R_t}(y^\prime_i - \bar{y}_t)^2$. $\bar{y}_t$ is the node value, $(x^\prime_i, y^\prime_i)_i$ are oob samples.

I went into more details in this document, where I also compare this method to the alternative proposed in this paper.

@lorentzenchr

lorentzenchr · 2025-07-02T14:16:14Z

@GaetandeCast Thanks a lot for your summary.

TLDR (summary)

What is the difference to the out-of-bag (oob) score of a random forest?

Details

The Brier score in binary classification ($y \in {0, 1}$) is a different name for the squared error: $BS(y_{obs}, p_{pred}) = (y_{obs} - p_{pred})^2$.
Each tree node minimizes this score $\sum_{i \in node} BS(y_{obs, i}, p_{pred, i})$, meaning it predicts $p_{pred} = \bar{y} = average(y \in node) = \frac{1}{n_{node}}\sum_{i \in node} y_i$ (so we predict probabilities like good statisticians).
Plugging it back into the Brier score gives the Brier score entropy, aka Gini score, $Gini = \sum_{i \in node} \bar{y}(1 - \bar{y})$.

If one wants to use the oob sample instead of the training sample then one simply can use the Brier score. The same logic applies to all (strictly consistent) loss functions (and to regression).

From the formula of the UFI suggestion, I am unable to recover the Brier score on the oob sample. Neither can I find any derivation in their paper. I also think it is plain wrong because they mix the empirical means of 2 independent data samples, i.e., $p$ and $p^\prime$. Unless I have not understood something fundamental, that, honestly, seems like nonsense to me.

Edit: If one restricts to the oob sample that flows through that fixed node, meaning p_pred is constant, than one has $\bar{BS}=\bar{y}(1-2p_{pred})+p_{pred}^2$.

GaetandeCast · 2025-07-03T09:23:48Z

If I understand correctly @lorentzenchr, you suggest computing the impurity (Gini/Brier/MSE/...) of the oob samples directly, and use those to measure feature importance.

However this is not sufficient to solve the cardinality bias of the MDI, which is why the two proposed methods suggest mixing metrics based on both in and out of bag samples. One of the reasons the bias is still seen on purely out-of-bag samples is that an impurity measure is always positive and even a random feature will manage to create purity. Therefore even when computed on oob samples, random/uninformative features will be assigned a non-zero importance.

This is actually the first approach I tested and I got feature importance measures very close to the default (biased) ones on the docs example about MDI vs permutation importance.

This idea is also mentioned in the paper of the second method, MDI-oob, in section 3.2 where they also advise against it.

UFI on the other hand is proven to give zero-importance to features independent of the target (Theorems 2 & 4), which is minimum for a feature importance measure.

lorentzenchr · 2025-07-03T11:45:34Z

If I understand correctly, you suggest computing the impurity (Gini/Brier/MSE/...) of the oob samples directly, and use those to measure feature importance.

@GaetandeCast Yes.

To be very honest but without intent to be disrepectful to any researcher, I find the literature that I have seen on bias of MDI feature importance¹ partially poor (again, may be due to my ignorance). Does anybody actually define the bias they talke about? Why should in-sample (training) impurity be biased for measuring feature importance of a tree? I know that the loss is biased, but the feature importance? How is feature importance defined statistically (not empirically), I mean in terms of statistical quantities like distributions, expectations, etc? If you can't define what you are talking about, how can you talk about bias of that object. Why do we interprete the high importance of random features (or high cardinality) as bias? Why don't interprete it as a tree/random forest (RF) poorly overfitting those features? If so, the whole question would shift away from bias of feature importance to fixing the fitting algorithms of trees/RF to avoid overfitting. We would even interprete the high importance not as a sign of bias of the importance but as a valuable tool to measure overfitting.

If you want to use the statistical risk/expected loss as measure for feature importance, out-of-bag is the best cheapest thing you have for random forests, otherwise you would need to use a so far unused (test) sample (similar to honest trees).

And yes, all impurity measures are non-negative. They are generalized entropies and a free constant term is usually set such that the minimum is 0, hence non-negative. Example: The entropy of the squared error (=minimum of expected squared error) is the variance, entropy of log loss is Shannon entropy. Why blame the loss/entropy (impurity), why not define feature importance in a different way?

I have not looked yet deeper into the 2nd paper (MDI-oob).

¹ Strobl et al (2007) https://doi.org/10.1186/1471-2105-8-25 correctly identifies the root cause: variable selection bias (not MDI bias).

GaetandeCast · 2025-07-03T12:48:06Z

I understand the concern and I agree that the UFI paper does not do a great job of defining the issue and setting a theoretical framework. This is done in a much better way in section 2 of MDI-oob, where they give non asymptotic bounds on the sum of importance of irrelevant features. The bounds are proportional to the size of the leafs (deeper trees lead to more biased FI) and the upper bound is stricter when features are categorical (matches empirical result about cardinality).

The most convincing framework to define feature importance mathematically is that of Shapley values from game theory. In this framework, we want to satisfy the four axioms of TU-values, among which is the null player: a feature independent of the output should receive zero importance. This paper shows that in asymptotic regime and under quite specific conditions (Extra-tree with max_feature=1, categorical features and output) the MDI recovers Shapley values. In this context, empirical results show that UFI converges quite sooner than MDI to these asymptotic values. I am currently investigating how well this result holds when relaxing the conditions.

In a nutshell, the two methods modify the MDI to try to get the desirable properties of Shapley values, while being much cheaper to compute.

ogrisel · 2025-07-04T13:08:04Z

This paper shows that in asymptotic regime and under quite specific conditions (Extra-tree with max_feature=1, categorical features and output) the MDI recovers Shapley values.

To be more specific, MDI recovers Shapley values of the loss improvement w.r.t the null model (i.e. MSE, Brier score or log-loss depending on the choice of the splitting criterion: MSE, Gini or Shannon entropy). @GaetandeCast has already conducted some experiments to validate that this is empirically the case. I think this is an interesting property, but it's only valid under restrictive conditions: extra trees with asymptotic regime, binary valued features, max_features=1, ... For regular random forests trained on continuous features, we can sometimes empirically observe small but significant discrepancy with the estimated SAGE values.

By the way, in the asymptotic regime, all three methods (train MDI, UFI and MDI-oob) converge to the same values. So the Shapley value interpretation of any MDI variant is rarely possible in general. One could argue that UFI has a faster convergence than train MDI towards SAGE values, for many datasets/model combinations, but I am not sure if this is always the case.

In my opinion, UFI (and MDI-oob) are mostly interesting as a very cheap yet robust way to find out which features can be safely discarded because they do not help the model generalize, irrespective of the relative cardinality of the features. Train MDI (or naive OOB MDI) cannot achieve this because they can assign larger importance values to large cardinality features that are independent of the target than lower cardinality features with a small to medium association with the target.

SAGE values in particular are computationally expensive to compute and require a test set.

Permutation importance are also quite expensive to compute (but cheaper than SAGE) and also require a test set. MDI-based importances are much cheaper (nearly free) and do not require extra training data (when computed for bagging ensemble of trees).

lorentzenchr · 2025-07-04T14:46:02Z

Let me rephrase: Are we trying to solve the wrong problem?
Decision trees have a training problem. They prefer splits of features with many split points (even when the feature has no correlation with y). Shouldn’t we rather fix the tree learning instead of the (in my point of view - correct) feature importance measure like MDI.

ogrisel · 2025-07-04T15:26:08Z

Shouldn’t we rather fix the tree learning instead of the (in my point of view - correct) feature importance measure like MDI.

How would you translate this into an implementable setting?

One possible way would be to user recommend to always use Random Forests on KBinsDiscretizer pre-preprocessed data and tune the number of bins jointly with the RF hyper-parameters. Then look at the MDI of the resulting pipeline and see if we still have possibly misleading MDI values or not.

lorentzenchr · 2025-07-04T17:15:38Z

@glouppe @mnwright @mayer79 your advice and insights would be very valuable for this issue in scikit-learn.

GaetandeCast and others added 30 commits April 14, 2025 17:43

First working implementation of UFI, does not support multi output, h…

c0e22ea

…as "extreme value" issues

Removed the normalization inherited from the old MDI to avoid instabi…

b1e9df8

…lity

added multi output support

2a694b6

removed redundant cross_impurity computations

fd0abfb

added mdi_oob

ef9f48d

redesigned ufi for better memory management

a225a42

removed a debug import

83f3880

added mdi_oob, cleaned the code

27618db

better unified the code between ufi and mdi_oob

5ad9636

fixed a call oversight

21d2e04

fixed an error in mdi_oob computations

8194d6e

changed tests on feature_importances_ to use unbiased FI too

9e16a09

add tests to check that the added methods coincide with the papers an…

8991d79

…d that they coincide with feature_importances_ on inbag samples

added support for regression (only MSE split)

a9d2983

added warning for unbiased feature importance in classification witho…

710d42c

…ut Gini

merged test_non_OOB_unbiased_feature_importances_class & _reg

ddedf27

Fixed a few mistake so that ufi-regression matches feature_importance…

1de98fc

…s_ on X_train

Extended the tests on matching the paper values to regression

c7c5d76

re added tests on oob_score for dense X. They fail

a44084d

revert a small change to a test

082206c

raise an error when calling unbiased feature importance with criterio…

b028cb9

…n different from gini/mse

adapted the tests to the previous commit

dcb3106

Added log_loss ufi

c61c8dc

fixed the oob_score_ issue, simplified the self.value accesses

d198f20

updated api and tests for ufi with 'log_loss'

f2acf5f

divide by 2 ufi 'log_loss' and improve tests

f41cf3f

fix some linting

af785d6

fixed Cython linting

ccd4f18

added inline function for clarity and comments on available criteria

ac36aaa

Merge branch 'main' into main

fda8349

GaetandeCast added 3 commits June 17, 2025 17:56

update changelog

a7821f6

Add paragraph in tree user guide

2f9afc5

refactor doc example

abe0493

GaetandeCast marked this pull request as ready for review June 18, 2025 14:00

antoinebaker reviewed Jun 18, 2025

View reviewed changes

GaetandeCast and others added 4 commits June 19, 2025 10:04

fix docstring on scoring function

6157f08

call unnormalized fi only once

7ef1da6

Apply suggestions from code review

227ec3d

Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>

remove unused code for oob_pred, clean aggregation of oob_pred

83c0a1a

GaetandeCast force-pushed the unbiased-feature-importance branch from cd131bc to 83c0a1a Compare June 19, 2025 09:50

GaetandeCast added 2 commits June 19, 2025 12:44

improve references in docstrings and add versionadded

e0313d3

mention the new method in the class attribute docstring

cbfc53a

GaetandeCast mentioned this pull request Jun 25, 2025

bump min dependencies anticipating next major release on 2025-11-01 #31656

Open

return_as generator and shorten method names

f4ac8e2

raise error when calling ufi with oob_score=False and corresponding test

dfbb104

MarieSacksick mentioned this pull request Jun 27, 2025

docs: In the plot_feature_importance example, also compute the MDI on the test set probabl-ai/skore#1419

Open

adam2392 self-requested a review June 30, 2025 14:53

Uh oh!

Unbiased MDI-like feature importance measure for random forests #31279

Are you sure you want to change the base?

Unbiased MDI-like feature importance measure for random forests #31279

Conversation

GaetandeCast commented Apr 30, 2025 • edited by ogrisel Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

TODO:

Uh oh!

antoinebaker left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

GaetandeCast commented Jun 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

antoinebaker commented Jun 25, 2025

Uh oh!

GaetandeCast commented Jun 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lorentzenchr commented Jul 1, 2025

Uh oh!

GaetandeCast commented Jul 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lorentzenchr commented Jul 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TLDR (summary)

Details

Uh oh!

GaetandeCast commented Jul 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lorentzenchr commented Jul 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

GaetandeCast commented Jul 3, 2025

Uh oh!

ogrisel commented Jul 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lorentzenchr commented Jul 4, 2025

Uh oh!

ogrisel commented Jul 4, 2025

Uh oh!

lorentzenchr commented Jul 4, 2025

Uh oh!

Uh oh!

GaetandeCast commented Apr 30, 2025 •

edited by ogrisel

Loading

GaetandeCast commented Jun 25, 2025 •

edited

Loading

GaetandeCast commented Jun 25, 2025 •

edited

Loading

GaetandeCast commented Jul 2, 2025 •

edited

Loading

lorentzenchr commented Jul 2, 2025 •

edited

Loading

GaetandeCast commented Jul 3, 2025 •

edited

Loading

lorentzenchr commented Jul 3, 2025 •

edited

Loading

ogrisel commented Jul 4, 2025 •

edited

Loading