ENH: OOB Permutation Importance for Random Forests #18603

Open · robert-robison wants to merge 60 commits into base: main

Conversation

robert-robison

This implements out-of-bag (OOB) permutation importance for random forests. Here is why it provides value in addition to inspection.permutation_importance:

  • It's standard in both main R Random Forest packages: randomForest and Ranger.
  • OOB permutation importances do not require a significant change to the workflow (creating an additional validation set)
  • Although it's covered in the documentation, many users likely do not realize just how biased impurity-based metrics are (see here; it marks a random feature as most important).
    • For HistGradientBoostingClassifier/Regressor, impurity-based feature importances were left out entirely for this reason (see #15132 (comment))
  • OOB permutation importances are faster and more stable than inspection.permutation_importance across repeated runs on the same data in the tests I've run (see here)
  • Would work more seamlessly with SelectFromModel (see discussion in "feature_importances_ should be a method in the ideal design" #9606)

Reference Issues/PRs

There are no open issues this directly addresses or closes to my knowledge. This has been discussed in multiple issues and PRs over the past several years, some of which are referenced above.

Issues/PRs that have suggested/implemented some type of RF permutation importance have been closed based on eli5 permutation importance or, more recently, sklearn.inspection.permutation_importance:

As we have a model agnostic solution to assert feature importances without suffering from cardinality bias, I would be in favor of not extending our Random Forest implementation just to tackle bias issues in their feature importances, unless the method described in this paper is significantly better than permutation based feature importances but this would need to be demonstrated on some example.

I believe this is demonstrated by the reasons stated above, namely:

  • Minimizes misinterpretation due to impurity-based importance
  • Simplifies workflow by removing need for additional validation set
  • Is more stable and faster

What does this implement/fix? Explain your changes.

The specific method implemented is the one described in Breiman (2001), which is explained in more detail by the developers of the ranger R package in the paper "Do Little Interactions get Lost in Dark Random Forests?". The algorithm works like this:

  • Compute the OOB accuracy for each tree.
  • Randomly shuffle a feature in the OOB data.
  • Compute the OOB accuracy again. The average drop in accuracy across all trees is the permutation importance for that feature.
  • Repeat for all features.

For regression, the r2_score is used instead of accuracy.
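
A minimal, self-contained sketch of this procedure (my illustration, not the PR's actual code), written against the public scikit-learn API by drawing the bootstrap samples manually so the OOB indices are known:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
n_samples, n_features = X.shape
n_trees = 100

importances = np.zeros((n_trees, n_features))
for t in range(n_trees):
    boot = rng.randint(0, n_samples, n_samples)       # bootstrap sample indices
    oob = np.setdiff1d(np.arange(n_samples), boot)    # out-of-bag indices
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=t)
    tree.fit(X[boot], y[boot])
    baseline = tree.score(X[oob], y[oob])             # OOB accuracy before shuffling
    for j in range(n_features):
        X_perm = X[oob].copy()
        rng.shuffle(X_perm[:, j])                     # permute feature j among OOB rows
        importances[t, j] = baseline - tree.score(X_perm, y[oob])

# average accuracy drop over trees = OOB permutation importance
oob_permutation_importance = importances.mean(axis=0)
print(oob_permutation_importance)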

This was implemented by adding an importance_type parameter that can take the values {"impurity", "permutation"}. While impurity-based importances aren't calculated until you request them, permutation importances require the training data, so they are calculated during the fit process when that option is selected.
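
For illustration, hypothetical usage of the parameter as proposed at this point in the PR (it is later renamed feature_importances_type during review, and it is not part of any released scikit-learn version):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

# importance_type is the parameter proposed by this PR (an assumption here,
# not a released scikit-learn option); OOB permutation importances are
# computed during fit, so they are available right after fitting.
rf = RandomForestClassifier(
    n_estimators=200,
    importance_type="permutation",
    random_state=0,
).fit(X, y)
print(rf.feature_importances_)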

Any other comments?

The changes made here are functionally equivalent to the ones made in #3436, so I did reference that PR often while making these changes.

Also, this is my first pull request and this is kind of a big change, so I completely understand if I'm overstepping bounds here. Thank you!

@robert-robison
Author

Anyone have any thoughts on this?

@glemaitre
Member

I think this would be a good addition. However, the feature should be shared for:

  • RandomForest
  • RandomTreeEmbedding
  • ExtraTrees

I think that we should take into consideration the following comment: #3436 (comment)

Using a scorer should be possible: having a private function computing the importance given a scorer. The scorer would be specific and given by the ForestRegressor (using r2_score) and ForestClassifier (using the accuracy).
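
A rough sketch of what such a scorer-based private helper could look like (the helper name and exact shape are my assumptions, not this PR's code); check_scoring falls back to estimator.score when scoring is None, i.e. accuracy for classifiers and R^2 for regressors:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import check_scoring
from sklearn.tree import DecisionTreeRegressor

def _tree_permutation_importance(tree, X_oob, y_oob, scoring=None, random_state=0):
    # score drop per permuted feature, evaluated on the tree's OOB data
    rng = np.random.RandomState(random_state)
    scorer = check_scoring(tree, scoring=scoring)   # defaults to tree.score
    baseline = scorer(tree, X_oob, y_oob)
    importances = np.empty(X_oob.shape[1])
    for j in range(X_oob.shape[1]):
        X_perm = X_oob.copy()
        rng.shuffle(X_perm[:, j])
        importances[j] = baseline - scorer(tree, X_perm, y_oob)
    return importances

# usage example, with a held-out split standing in for the OOB sample
X, y = make_regression(n_samples=300, n_features=5, random_state=0)
tree = DecisionTreeRegressor(random_state=0).fit(X[:200], y[:200])
print(_tree_permutation_importance(tree, X[200:], y[200:]))  # default: R^2
print(_tree_permutation_importance(tree, X[200:], y[200:], scoring="neg_mean_absolute_error"))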

In addition, we should add documentation. I think that we should update the example showing the difference between random forest feature importance and permutation importance. I quickly checked and it is true that we can show that the importance of the random feature will be really low.

Regarding the parameter, I think that I would be explicit by making it contain feature_importances_* (e.g. feature_importances_type). In the future, we can see if we can change the default to permutation.

@robert-robison
Author

@glemaitre Thanks for reviewing. I moved _set_oob_permutation_importance to BaseForest and added a scoring parameter to all subclasses for use with check_scoring. That way it will default to using the DecisionTree*.score() method if nothing is explicitly passed.

I don't think it makes sense to implement this for RandomTreeEmbedding because it's unsupervised. Correct me if I'm wrong.

I also added to the example comparing inspection.permutation_importance to default Random Forest feature importance, and changed the parameter name to feature_importances_type like you suggested. Let me know your thoughts

@glemaitre
Member

The tests are failing. I think that there is something wrong with the example that you modified (name_steps instead of named_steps)?

@robert-robison
Author

An extra named_steps got in there somehow, now fixed.

Member

@glemaitre glemaitre left a comment

I would add a test to check that the permutation importance implemented here does not suffer from the same bias as the impurity-based feature_importances_. You can create a random variable and check that it is important with the impurity but not important with the permutation technique.
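
For reference, a sketch of that kind of test using only the released public API (so it is runnable as-is); the actual test in this PR would assert on the new OOB permutation importances rather than on inspection.permutation_importance:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def test_random_feature_important_with_mdi_but_not_permutation():
    rng = np.random.RandomState(0)
    n = 1000
    X_informative = rng.randint(0, 2, size=(n, 3)).astype(float)   # low-cardinality signal
    X_random = rng.rand(n, 1)                                      # high-cardinality noise
    X = np.hstack([X_informative, X_random])
    y = (X_informative.sum(axis=1) >= 2).astype(int)
    flip = rng.rand(n) < 0.3                                       # label noise so trees overfit
    y[flip] = 1 - y[flip]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

    mdi = rf.feature_importances_
    perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)

    # impurity importance is dominated by the uninformative high-cardinality column ...
    assert mdi[-1] > 0.2
    # ... while held-out permutation importance reports it as negligible
    assert perm.importances_mean[-1] < 0.05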

@robert-robison
Author

I attempted to address all comments. I realized we're basically replicating inspection.permutation_importance with a Decision Tree as the estimator on the out-of-bag data, so I changed it to call permutation_importance on every estimator in the RF in the new _get_tree_oob_performance method. I ran into a circular import error, though, so the import is temporarily in that method. Not sure the best way around that. I also attempted to make the outer for loop parallel; not sure I did that correctly.

inspection.permutation_importance doesn't allow sparse, so I left sparse=False for the moment.

All your other suggestions were pretty straightforward. Let me know if I missed anything or messed anything up.

@glemaitre
Member

I will review it soon. With @ogrisel, we checked the paper in more detail, and indeed the phrasing in Breiman's paper is quite confusing. I think that the current implementation is in line with the one stated in other papers from the literature and with the Fortran implementation of Breiman and Cutler (https://math.usu.edu/~adele/forests/cc_home.htm#varimp).

I attempted to address all comments. I realized we're basically replicating inspection.permutation_importance with a Decision Tree as the estimator on the out-of-bag data, so I changed it to call permutation_importance on every estimator in the RF in the new _get_tree_oob_performance method. I ran into a circular import error, though, so the import is temporarily in that method. Not sure the best way around that. I also attempted to make the outer for loop parallel; not sure I did that correctly.

I have to check in more detail as well, but that was the point of my comment regarding using _compute_permutation_importance. I think that we should use this function instead of permutation_importance if possible because it should avoid a couple of useless checks (but I have to verify).

In the meantime, I have been working on #19162 to improve the way the OOB score is computed, and I will probably have another look to see if it allows for better modularity with your code (using _get_oob_predictions, for instance).

# Alternative to MDI using Permutation Feature Importance
# -------------------------------------------------------
# The limitations of MDI pointed out in the previous section can be mitigated
# using an alternative strategy: Feature Permutation Importance.
Member

Small nitpick. 🙂

Suggested change
# using an alternative strategy: Feature Permutation Importance.
# using an alternative strategy: Permutation Feature Importance.

@jjerphan
Member

Instructions given in #20301 will help you to solve the conflicts caused by code style reformatting on main.

Member

@jjerphan jjerphan left a comment

Here is my first pass on the code, including a few more comments on the examples.

I find the two examples a bit interlinked, with some redundancy (this was already the case prior to this PR).

To me, another PR could merge them and introduce another example using permutation feature importance on algorithms other than BaseForest's, clarifying its use and value.

What do you think?

# Please see :ref:`permutation_importance` for more details. We can now plot
# the importance ranking.

# The permutation importances is more computationally costly. Indeed, it
Member

Suggested change
# The permutation importances is more computationally costly. Indeed, it
# The permutation importance is more computationally costly. Indeed, it

@@ -106,6 +113,33 @@
# %%
# Tree's Feature Importance from Mean Decrease in Impurity (MDI)
# --------------------------------------------------------------
# We plot the feature importances computed across all trees of the forest.
# We use a box plot representation to show the information. The mean is in the
# box plot would corresponds to the value reported by the fitted attribute
Member

Suggested change
# box plot would corresponds to the value reported by the fitted attribute
# box plot would correspond to the value reported by the fitted attribute

Comment on lines 162 to 165
# information. Thus, one tree could use the feature `sex_male` and ignore
# `sex_female` to create split while another tree could could make the opposite
# choice. We will see that the permutation feature importance does not solve
# this issue.
Member

OK, I think one can remove this last sentence as permutation feature importance has not been covered yet in this example and as this is explained in its dedicated part.

Comment on lines +658 to +659
The higher, the more important the feature. There is two possible
strategies:
Member

Suggested change
The higher, the more important the feature. There is two possible
strategies:
The higher, the more important the feature. There are two possible
strategies:

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_encoder = OneHotEncoder(handle_unknown='ignore')
numerical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
Member

I can't suggest on line 77, but I would make it explicit as mentioned in the previous example:

('classifier', RandomForestClassifier(feature_importances="impurity",
                                      random_state=42))

Comment on lines +203 to +212
return permutation_importance(
    estimator,
    X[unsampled_indices, :],
    y[unsampled_indices],
    scoring=None,
    n_repeats=1,
    n_jobs=1,
    random_state=random_state,
    sample_weight=sample_weight[unsampled_indices],
).importances[:, 0]
Member

I do not know permutation_importance sufficiently, but at first sight, explaining why n_jobs=1 is used would help: IIUC, as _permutation_importances_oob is called within joblib.Parallel.__apply__ and as permutation_importance also calls it, we explicitly force sequential execution by setting n_jobs=1.

Also, why do we only take .importances[:, 0]? Does permutation_importance compute more than what we need here?

Comment on lines 563 to 579
with config_context(assume_finite=True):
    # avoid redundant checking performed on X in the permutation
    # importance function.
    oob_importances = np.transpose(Parallel(n_jobs=self.n_jobs)(
        delayed(_permutation_importances_oob)(
            estimator,
            X,
            y,
            sample_weight,
            n_samples,
            n_samples_bootstrap,
            random_state,
        )
        for estimator in self.estimators_
    ))

return oob_importances
Member

Should we prefer threads here?

Suggested change
with config_context(assume_finite=True):
    # avoid redundant checking performed on X in the permutation
    # importance function.
    oob_importances = np.transpose(Parallel(n_jobs=self.n_jobs)(
        delayed(_permutation_importances_oob)(
            estimator,
            X,
            y,
            sample_weight,
            n_samples,
            n_samples_bootstrap,
            random_state,
        )
        for estimator in self.estimators_
    ))
return oob_importances
with config_context(assume_finite=True):
    # avoid redundant checking performed on X in the permutation
    # importance function.
    parallel_args = {
        **_joblib_parallel_args(prefer="threads"),
        "n_jobs": self.n_jobs
    }
    oob_importances = Parallel(**parallel_args)(
        delayed(_permutation_importances_oob)(
            estimator,
            X,
            y,
            sample_weight,
            n_samples,
            n_samples_bootstrap,
            random_state,
        )
        for estimator in self.estimators_
    )
    oob_importances = np.transpose(oob_importances)
return oob_importances

Member

I was under the impression that I had looked at this before and that threading was not the best choice (not sure that we are releasing the GIL enough).

Comment on lines +131 to +134
Instead of storing the OOB sample indices in the forest, it is more memory
efficient to rebuild the indices given the random state used to create the
bootstrap. This operation can be neglected in terms of computation time
compared to other processes when it is used (e.g. scoring).
Member

Thanks for clarifying it.

@robert-robison
Author

I agree the examples are a little redundant and that combining them would be outside the scope of this PR. Combining them seems like a good idea to me, but I also don't know what the intended purpose of the plot_forest_importances.py example is.

Comment on lines 358 to 365
.. [Strobl07] `Strobl, C., Boulesteix, AL., Zeileis, A. et al.
Bias in random forest variable importance measures: Illustrations,
sources and a solution. BMC Bioinformatics 8, 25 (2007).
<https://doi.org/10.1186/1471-2105-8-25>`_
.. [White94] `White, A.P., Liu, W.Z. Technical Note:
Bias in Information-Based Measures in Decision Tree Induction.
Machine Learning 15, 321–329 (1994).
<https://doi.org/10.1023/A:1022694010754>`_
Member

Suggested change
.. [Strobl07] `Strobl, C., Boulesteix, AL., Zeileis, A. et al.
Bias in random forest variable importance measures: Illustrations,
sources and a solution. BMC Bioinformatics 8, 25 (2007).
<https://doi.org/10.1186/1471-2105-8-25>`_
.. [White94] `White, A.P., Liu, W.Z. Technical Note:
Bias in Information-Based Measures in Decision Tree Induction.
Machine Learning 15, 321–329 (1994).
<https://doi.org/10.1023/A:1022694010754>`_
.. [Strobl07] `Strobl, C., Boulesteix, AL., Zeileis, A. et al.
Bias in random forest variable importance measures: Illustrations,
sources and a solution.
BMC Bioinformatics 8, 25 (2007).
<https://doi.org/10.1186/1471-2105-8-25>`_
.. [White94] `White, A.P., Liu, W.Z. Technical Note:
Bias in Information-Based Measures in Decision Tree Induction.
Machine Learning 15, 321–329 (1994).
<https://doi.org/10.1023/A:1022694010754>`_

I think that this should fix the error given by Sphinx.


# %%
# Feature importance based on mean decrease in impurity
# Feature importance based on Mean Decrease in Impurity (MDI)
# -----------------------------------------------------
Member

This line should have as many characters as the previous one.

@glemaitre
Member

To me, another PR could merge them and introduce another example using permutation feature importance on algorithms other than BaseForest's, clarifying its use and value.

We could keep that for another PR for sure. One issue with removing examples is that they are referenced by Google; we could end up with outdated links that no longer work.

@jjerphan
Member

We could keep that for another PR for sure. One issue with removing examples is that they are referenced by Google; we could end up with outdated links that no longer work.

I agree. The content of plot_permutation_importance could complete the one on forests' feature importance, plot_forest_importances.

This would allow replacing its content with a generic example for permutation feature importance using other interfaces.
This would also allow keeping the link as is, even though the content would differ between different versions of the documentation.

What do you think?

@glemaitre
Member

glemaitre commented Jun 23, 2021 via email

@lorentzenchr
Member

Without reading this whole PR: Note that OOB permutation importance is biased and should be corrected, see "The revival of the Gini importance". This corrected version (AIR) is (among other alternatives) implemented in the ranger package.

@robert-robison
Author

Without reading this whole PR: Note that OOB permutation importance is biased and should be corrected, see "The revival of the Gini importance". This corrected version (AIR) is (among other alternatives) implemented in the ranger package.

I think you have impurity importance and OOB permutation importance confused. I don't believe that paper found permutation importance to be biased. The following is a quote from the introduction:

The impurity importance is known to be biased in favor of variables with many possible split points...The permutation importance does not suffer from these kinds of bias and is consequently generally preferred

Their experiments confirm this: the impurity importance is biased towards variables with more split points, while the permutation importance is unbiased. The only criticism they make of permutation importance is that positive outliers are more likely due to overlap in the out-of-bag observations.

AIR is not corrected OOB permutation importance, but corrected impurity importance (it stands for Actual Impurity Reduction). I would be all in favor of implementing that as well, but I think that would probably have to be in a subsequent PR, as it would require some pretty significant changes:

As explained in Section 2.4, the tree growing procedure is slightly altered when the AIR importance is computed. Because a fraction of the splits are uninformative, this could lead to a loss of prediction accuracy of the RF, in particular if the mtry value is low. Consequently, we recommend to grow a separate RF for prediction when the AIR importance is used and the ranger package issues a warning when prediction is performed with a forest grown with the AIR importance.

Let me know if my understanding is incorrect in any way

@lorentzenchr
Member

@robert-robison Thanks for your clarifications and for clearing up my confusion. I'm keen to see the progress of this PR. Explainability tools are very useful.

Member

@jjerphan jjerphan left a comment

This looks good to me, pending merge conflict resolution and one remaining comment.

@glemaitre
Member

@jjerphan I am not sure if we settled on a possible API though, since we might have other metrics that would provide other feature importances: #20059 (comment)

@adrinjalali
Member

Removing the milestone, will have this in once the API discussion is resolved :)
