ENH: OOB Permutation Importance for Random Forests #18603
Conversation
Anyone have any thoughts on this?
I think this would be a good addition. However, the feature should be shared for:
I think that we should take into consideration the following comment: #3436 (comment). Using a scorer should be possible: have a private function computing the importance given a scorer. The scorer would be specific and given by the

In addition, we should add documentation. I think that we should update the example to show the difference between random forest feature importance and permutation importance. I quickly checked, and it is true that we can show that the importance of the random feature will be really low.

Regarding the parameter, I think that I would be explicit by making it contain
@glemaitre Thanks for reviewing. I moved
I don't think it makes sense to implement this for
I also added to the example comparing
The tests are failing. I think that there is something wrong with the example that you modified (
An extra
I would add a test to check that the implemented permutation importance does not suffer from the same bias as the impurity-based `feature_importances_`. You can create a random variable and check that it is important with the impurity measure but not important with the permutation technique.
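A minimal sketch of such a test, assuming this PR's proposed `importance_type` parameter on `RandomForestClassifier` (not part of any released scikit-learn; the dataset and threshold are arbitrary illustration choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Informative features plus one pure-noise feature appended as the last column.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           random_state=0)
rng = np.random.RandomState(0)
X = np.hstack([X, rng.uniform(size=(X.shape[0], 1))])

# `importance_type` is the parameter proposed in this PR; it does not exist
# in released scikit-learn versions.
forest = RandomForestClassifier(importance_type="permutation", random_state=0)
forest.fit(X, y)

# The noise feature (last column) should receive near-zero permutation
# importance, while the impurity-based measure can still assign it a
# sizeable share. The 0.05 threshold is an arbitrary illustration value.
assert forest.feature_importances_[-1] < 0.05
```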
I attempted to address all comments. I realized we're basically replicating

All your other suggestions were pretty straightforward. Let me know if I missed anything or messed anything up.
I will review it soon. With @ogrisel, we looked at the paper in more detail, and indeed the phrasing in Breiman's paper is quite confusing. I think that the current implementation is in line with the one stated in different papers from the literature and with the Fortran implementation of Breiman and Cutler (https://math.usu.edu/~adele/forests/cc_home.htm#varimp).
I still have to check in more detail, but it was my comment regarding using
In the meanwhile, I have been working on #19162 to improve the way the OOB score is computed, and I will probably have a new look to see if it allows for better modularity with your code (using
```python
# Alternative to MDI using Permutation Feature Importance
# -------------------------------------------------------
# The limitations of MDI pointed out in the previous section can be mitigated
# using an alternative strategy: Feature Permutation Importance.
```
Small nitpick. 🙂

Suggested change:

```diff
-# using an alternative strategy: Feature Permutation Importance.
+# using an alternative strategy: Permutation Feature Importance.
```
Instructions given in #20301 will help you to solve the conflicts caused by the code style reformatting on
Here is my first pass on the code, including a few more comments on the examples.

I find the two examples a bit interlinked, with a bit of redundancy (this was already the case prior to this PR). To me, another PR could merge them and introduce another example using permutation feature importance on algorithms other than `BaseForest`'s, clarifying its use and value.

What do you think?
```python
# Please see :ref:`permutation_importance` for more details. We can now plot
# the importance ranking.

# The permutation importances is more computationally costly. Indeed, it
```
Suggested change:

```diff
-# The permutation importances is more computationally costly. Indeed, it
+# The permutation importance is more computationally costly. Indeed, it
```
```python
# %%
# Tree's Feature Importance from Mean Decrease in Impurity (MDI)
# --------------------------------------------------------------
# We plot the feature importances computed across all trees of the forest.
# We use a box plot representation to show the information. The mean is in the
# box plot would corresponds to the value reported by the fitted attribute
```
Suggested change:

```diff
-# box plot would corresponds to the value reported by the fitted attribute
+# box plot would correspond to the value reported by the fitted attribute
```
```python
# information. Thus, one tree could use the feature `sex_male` and ignore
# `sex_female` to create split while another tree could could make the opposite
# choice. We will see that the permutation feature importance does not solve
# this issue.
```
OK, I think one can remove this last sentence, as permutation feature importance has not been covered yet in this example and as it is explained in its dedicated part.
```
The higher, the more important the feature. There is two possible
strategies:
```
Suggested change:

```diff
-The higher, the more important the feature. There is two possible
+The higher, the more important the feature. There are two possible
 strategies:
```
```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_encoder = OneHotEncoder(handle_unknown='ignore')
numerical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
```
I can't suggest on line 77, but I would make it explicit as mentioned in the previous example:

```python
('classifier', RandomForestClassifier(feature_importances="impurity",
                                      random_state=42))
```
```python
return permutation_importance(
    estimator,
    X[unsampled_indices, :],
    y[unsampled_indices],
    scoring=None,
    n_repeats=1,
    n_jobs=1,
    random_state=random_state,
    sample_weight=sample_weight[unsampled_indices],
).importances[:, 0]
```
I do not know `permutation_importance` sufficiently, but at first sight, explaining why `n_jobs=1` is used would help: IIUC, as `_permutation_importances_oob` is called within `joblib.Parallel.__call__` and as `permutation_importance` also calls it, we explicitly force sequential execution by setting `n_jobs=1`.

Also, why does one only take `.importances[:, 0]`? Does `permutation_importance` compute more than what we need here?
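For context, here is a minimal sketch against the public `sklearn.inspection.permutation_importance` API, showing why the `[:, 0]` slice recovers a flat vector when `n_repeats=1` (the dataset and estimator are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(random_state=0).fit(X, y)

# `permutation_importance` returns a Bunch whose `importances` array has
# shape (n_features, n_repeats); with n_repeats=1, the slice [:, 0] extracts
# the single repetition as a flat (n_features,) vector.
result = permutation_importance(forest, X, y, n_repeats=1, random_state=0)
print(result.importances.shape)   # (4, 1)
print(result.importances[:, 0])   # per-feature score decreases
```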
sklearn/ensemble/_forest.py (outdated)
```python
with config_context(assume_finite=True):
    # avoid redundant checking performed on X in the permutation
    # importance function.
    oob_importances = np.transpose(Parallel(n_jobs=self.n_jobs)(
        delayed(_permutation_importances_oob)(
            estimator,
            X,
            y,
            sample_weight,
            n_samples,
            n_samples_bootstrap,
            random_state,
        )
        for estimator in self.estimators_
    ))

return oob_importances
```
Should we prefer threads here?

Suggested change:

```python
with config_context(assume_finite=True):
    # avoid redundant checking performed on X in the permutation
    # importance function.
    parallel_args = {
        **_joblib_parallel_args(prefer="threads"),
        "n_jobs": self.n_jobs,
    }
    oob_importances = Parallel(**parallel_args)(
        delayed(_permutation_importances_oob)(
            estimator,
            X,
            y,
            sample_weight,
            n_samples,
            n_samples_bootstrap,
            random_state,
        )
        for estimator in self.estimators_
    )
    oob_importances = np.transpose(oob_importances)

return oob_importances
```
I did look at it before, and I was under the impression that threading was not the best choice (not sure that we are releasing the GIL enough).
```
Instead of storing the OOB sample indices in the forest, it is more memory
efficient to rebuild the indices given the random state used to create the
bootstrap. This operation can be neglected in terms of computation time
compared to other processes when it is used (e.g. scoring).
```
Thanks for clarifying this.
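As a side note, a minimal sketch of how the OOB indices can be rebuilt from each tree's random state, using scikit-learn's private helpers `_generate_unsampled_indices` and `_get_n_samples_bootstrap` (private API, so their location and signatures may change between releases):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
# Private helpers: subject to change between scikit-learn versions.
from sklearn.ensemble._forest import (
    _generate_unsampled_indices,
    _get_n_samples_bootstrap,
)

X, y = make_regression(n_samples=100, n_features=4, random_state=0)
forest = RandomForestRegressor(n_estimators=3, bootstrap=True, random_state=0)
forest.fit(X, y)

n_samples = X.shape[0]
n_samples_bootstrap = _get_n_samples_bootstrap(n_samples, forest.max_samples)

# The OOB indices are never stored on the forest; each tree's `random_state`
# seeded its bootstrap draw, so the same indices can be regenerated on demand.
for tree in forest.estimators_:
    oob_indices = _generate_unsampled_indices(
        tree.random_state, n_samples, n_samples_bootstrap
    )
    print(f"{len(oob_indices)} out-of-bag samples for this tree")
```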
I agree the examples are a little redundant and that combining them would be outside the scope of this PR. Combining them seems like a good idea to me, but I also don't know what the intended purpose of the
doc/modules/ensemble.rst (outdated)
```rst
.. [Strobl07] `Strobl, C., Boulesteix, AL., Zeileis, A. et al.
   Bias in random forest variable importance measures: Illustrations,
   sources and a solution. BMC Bioinformatics 8, 25 (2007).
   <https://doi.org/10.1186/1471-2105-8-25>`_

.. [White94] `White, A.P., Liu, W.Z. Technical Note:
   Bias in Information-Based Measures in Decision Tree Induction.
   Machine Learning 15, 321–329 (1994).
   <https://doi.org/10.1023/A:1022694010754>`_
```
Suggested change:

```diff
 .. [Strobl07] `Strobl, C., Boulesteix, AL., Zeileis, A. et al.
    Bias in random forest variable importance measures: Illustrations,
-   sources and a solution. BMC Bioinformatics 8, 25 (2007).
+   sources and a solution.
+   BMC Bioinformatics 8, 25 (2007).
    <https://doi.org/10.1186/1471-2105-8-25>`_
 .. [White94] `White, A.P., Liu, W.Z. Technical Note:
    Bias in Information-Based Measures in Decision Tree Induction.
    Machine Learning 15, 321–329 (1994).
    <https://doi.org/10.1023/A:1022694010754>`_
```
I think that this should fix the error reported by Sphinx.
```diff
 # %%
-# Feature importance based on mean decrease in impurity
+# Feature importance based on Mean Decrease in Impurity (MDI)
 # -----------------------------------------------------
```
This line should have as many characters as the previous one.
We could keep it for another PR, for sure. One issue with removing examples boils down to referencing in Google: I am not sure, but we could end up with outdated links that no longer work.
I agree. The content of `plot_permutation_importance` could complete the one on forests' feature importance, `plot_forest_importances`. This would allow replacing its content with a generic example for permutation feature importance using other interfaces, while keeping the link as it is, even though the content would differ between versions' documentation. What do you think?
This sounds good to me.
Without reading this whole PR: note that OOB permutation importance is biased and should be corrected, see "The revival of the Gini importance". This corrected version (AIR) is (among other alternatives) implemented in the ranger package.
I think you have impurity importance and OOB permutation importance confused. I don't believe that paper found permutation importance to be biased. The following is a quote from the introduction:
Their experiments confirm this: the impurity importance is biased towards variables with more split points, while the permutation importance is unbiased. The only negative comment they make about permutation importance is that positive outliers are more likely due to overlap in the out-of-bag observations. AIR is not corrected OOB permutation importance but corrected impurity importance (it stands for Actual Impurity Reduction). I would be all in favor of implementing that as well, but I think that would probably have to be in a subsequent PR, as it would require some pretty significant changes:
Let me know if my understanding is incorrect in any way.
@robert-robison Thanks for your clarifications and for clearing up my confusion. I'm keen to see the progress of this PR; explainability tools are very useful.
This looks good to me after merge conflict resolution and one remaining comment.
@jjerphan I am not sure we settled on a possible API though, since we might have other metrics that would provide other feature importances: #20059 (comment)
Removing the milestone; will have this in once the API discussion is resolved :)
This implements out-of-bag permutation importance in random forests. This is why it provides value in addition to `inspection.permutation_importance`:

- it follows the approach of the established R packages `randomForest` and `Ranger`;
- it compared favorably to `inspection.permutation_importance` across repeated runs on the same data in the tests I've run (see here);
- it is exposed through `feature_importances_` and can therefore be used with `SelectFromModel` (see the discussion in "feature_importances_ should be a method in the ideal design", #9606).

Reference Issues/PRs

There are no open issues this directly addresses or closes to my knowledge. This has been discussed in multiple issues and PRs over the past several years, some of which are referenced above. Issues/PRs that have suggested/implemented some type of RF permutation importance have been closed based on `eli5`'s permutation importance or, more recently, `sklearn.inspection.permutation_importance`. I believe this PR still adds value, as demonstrated by the reasons stated above.
What does this implement/fix? Explain your changes.
The specific method implemented is the one shown in Breiman (2001), which is described in more detail by the developers of the `Ranger` R package in the paper "Do Little Interactions get Lost in Dark Random Forests?". The algorithm works like this: for each tree, the prediction accuracy is computed on the tree's out-of-bag samples; each feature is then permuted in turn in the out-of-bag sample and the accuracy is recomputed; a feature's importance is the resulting decrease in accuracy, averaged over all trees. For regression, the `r2_score` is used instead of accuracy.

This was implemented by adding an `importance_type` parameter that can take the values `{"impurity", "permutation"}`. While impurity-based importances aren't calculated until you need them, you need the dataset to calculate permutation importances, so they are calculated during the fit process if specified.

Any other comments?
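A minimal usage sketch against the API proposed in this PR (the `importance_type` parameter is this branch's addition and does not exist in released scikit-learn versions; the dataset is an arbitrary illustration choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Default behaviour: impurity-based (MDI) importances, computed lazily
# when `feature_importances_` is accessed.
forest_mdi = RandomForestClassifier(random_state=0).fit(X, y)

# Proposed in this PR: OOB permutation importances. They need the training
# data, so they are computed eagerly during fit.
forest_perm = RandomForestClassifier(
    importance_type="permutation", random_state=0
).fit(X, y)

# Both variants expose the result through the same fitted attribute.
print(forest_mdi.feature_importances_[:3])
print(forest_perm.feature_importances_[:3])
```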
The changes made here are functionally equivalent to the ones made in #3436, so I did reference that PR often while making these changes.
Also, this is my first pull request and this is kind of a big change, so I completely understand if I'm overstepping bounds here. Thank you!