Unbiased MDI-like feature importance measure for random forests #31271
…as "extreme value" issues
…d that they coincide with feature_importances_ on inbag samples
…n different from gini/mse
Note that this PR is meant to stay in draft mode while we conduct empirical experiments to better understand the pros and cons of each method and decide whether we want to keep only one or both.
Closed because of a wrong branch name; the new PR is #31279.
Reference Issues/PRs
Fixes #20059
What does this implement/fix? Explain your changes.
This implements two methods that correct the cardinality bias of the `feature_importances_` attribute of random forest estimators by leveraging out-of-bag (oob) samples.

The first method is derived from "Unbiased Measurement of Feature Importance in Tree-Based Methods" by Zhengze Zhou & Giles Hooker. The corresponding attribute is named `ufi_feature_importances_`.

The second method is derived from "A Debiased MDI Feature Importance Measure for Random Forests" by Xiao Li et al. The corresponding attribute is named `mdi_oob_feature_importances_`.

The names are temporary: we are still seeking a way to favor one method over the other (currently investigating whether one of the two reaches its asymptotic behavior faster than the other).

These attributes are set by the `fit` method after training, if the parameter `oob_score` is set to `True`. In this case we send the oob samples to a Cython method at tree level that propagates them through the tree and returns the corresponding oob prediction function and feature importance measure.

This new feature importance measure behaves like regular Mean Decrease Impurity (MDI) but mixes the in-bag and out-of-bag values of each node instead of using the in-bag impurity alone. The two proposed methods differ in the way they mix in-bag and oob samples.
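As a minimal usage sketch of the draft API described above (the attribute names are the temporary ones proposed in this PR and may change before merging):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,  # required: the debiased measures are computed from oob samples
    random_state=0,
).fit(X, y)

forest.feature_importances_          # classical MDI (cardinality-biased)
forest.ufi_feature_importances_      # correction after Zhou & Hooker
forest.mdi_oob_feature_importances_  # correction after Li et al.
```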
This PR also adds these two new feature importance measures to the test suite, specifically in test_forest.py. Existing tests are extended to cover the two measures, and new tests are added to make sure they behave correctly (e.g. they coincide with the values given by the code of the cited papers, and they recover traditional MDI when used on in-bag samples).
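To make the node-level mixing more concrete, here is a simplified, pure-Python sketch (our own illustration, not the PR's Cython code) of a ufi-style mixed Gini impurity for a single tree, computed as `1 - sum_c p_c_in * p_c_oob` from the in-bag and oob class proportions routed through each node. The exact formulas in the PR may differ; only the mixing idea is taken from the description above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
rng = np.random.default_rng(0)

# Emulate one bagging round: bootstrap indices are "in-bag", the rest is oob.
inbag = rng.integers(0, len(X), size=len(X))
oob = np.setdiff1d(np.arange(len(X)), inbag)

tree = DecisionTreeClassifier(random_state=0).fit(X[inbag], y[inbag])

def node_class_proportions(tree, X, y, n_classes):
    # decision_path returns a sparse (n_samples, n_nodes) indicator matrix.
    path = tree.decision_path(X).toarray()
    counts = np.stack(
        [path[y == c].sum(axis=0) for c in range(n_classes)], axis=1
    ).astype(float)
    totals = counts.sum(axis=1, keepdims=True)
    # Nodes that receive no sample get all-zero proportions.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

n_classes = len(np.unique(y))
p_in = node_class_proportions(tree, X[inbag], y[inbag], n_classes)
p_oob = node_class_proportions(tree, X[oob], y[oob], n_classes)

# Mixed Gini per node: 1 - <p_in, p_oob>. With p_oob == p_in this reduces to
# the standard in-bag Gini, consistent with the test mentioned above that the
# new measures recover traditional MDI on in-bag samples.
mixed_gini = 1.0 - (p_in * p_oob).sum(axis=1)
print(mixed_gini[:5])
```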
Any other comments?
The papers only suggest fixes for trees built with the Gini (classification) and Mean Squared Error (regression) criteria, but we would like the new methods to support the other criteria available in scikit-learn. `log_loss` support was added for classification with the ufi method by generalizing the idea of mixing in-bag and oob samples (see the sketch below).

Some CPU and memory profiling was done to check that, for large enough datasets, the computational overhead remains small compared to the cost of fitting the model.
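For illustration, one natural reading of that generalization (an assumption on our part; the PR's exact formula may differ) replaces the in-bag entropy at each node with a cross-entropy between in-bag and oob class proportions:

```python
import numpy as np

def mixed_log_loss(p_in, p_oob, eps=1e-12):
    """Hypothetical mixed counterpart of the log_loss (entropy) criterion.

    Cross-entropy between in-bag and oob class proportions at one node;
    with p_oob == p_in it reduces to the usual in-bag entropy.
    """
    return -(p_in * np.log(np.clip(p_oob, eps, None))).sum()
```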
Support for sparse matrix input and for sample weights should be added soon.
Tests on `oob_score_` currently fail; this is under investigation.

This work is done in close collaboration with @ogrisel.
TODO:

- [x] Fix the failing `oob_score_` tests: done in d198f20.
- [ ] Can the `"mdi_oob"` method be naturally expanded to support `criterion="log_loss"`, as seems to be the case for the `"ufi"` method?
- [ ] Add support for `sample_weight`.
- [ ] Extend to `GradientBoostingClassifier` and `GradientBoostingRegressor` when row-wise (sub)sampling is enabled at training time.