ENH: OOB Permutation Importance for Random Forests #18603
Conversation
Anyone have any thoughts on this?
I think this would be a good addition. However, the feature should be shared for:
I think that we should take into consideration the following comment: #3436 (comment). Using a scorer should be possible: have a private function computing the importance given a scorer. The scorer would be specific and given by the

In addition, we should add documentation. I think that we should update the example to show the difference between random forest feature importance and permutation importance. I quickly checked, and it is true that we can show that the importance of the random feature will be really low.

Regarding the parameter, I think that I would be explicit by making it contain
@glemaitre Thanks for reviewing. I moved
I don't think it makes sense to implement this for
I also added to the example comparing
The tests are failing. I think that there is something wrong with the example that you modified (
An extra
I would add a test to check that the implemented permutation importance does not suffer from the same bias as the impurity-based `feature_importances_`. You can create a random variable and check that it is important with the impurity measure but not important with the permutation technique.
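A minimal sketch of such a test, assuming this PR's proposed `importance_type` parameter on `RandomForestClassifier` (not part of any released scikit-learn; the dataset and threshold are arbitrary illustration choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Informative features plus one pure-noise feature appended as the last column.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           random_state=0)
rng = np.random.RandomState(0)
X = np.hstack([X, rng.uniform(size=(X.shape[0], 1))])

# `importance_type` is the parameter proposed in this PR; it does not exist
# in released scikit-learn versions.
forest = RandomForestClassifier(importance_type="permutation", random_state=0)
forest.fit(X, y)

# The noise feature (last column) should receive near-zero permutation
# importance, while the impurity-based measure can still assign it a
# sizeable share. The 0.05 threshold is an arbitrary illustration value.
assert forest.feature_importances_[-1] < 0.05
```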
I attempted to address all comments. I realized we're basically replicating

All your other suggestions were pretty straightforward. Let me know if I missed anything or messed anything up.
I will review it soon. With @ogrisel, we looked at the paper in more detail, and indeed the phrasing in Breiman's paper is quite confusing. I think that the current implementation is in line with the one stated in different papers from the literature and with the Fortran implementation of Breiman and Cutler (https://math.usu.edu/~adele/forests/cc_home.htm#varimp).
I still have to check in more detail, but it was my comment regarding using
In the meanwhile, I have been working on #19162 to improve the way the OOB score is computed, and I will probably have a new look to see if it allows for better modularity with your code (using
```python
# Alternative to MDI using Permutation Feature Importance
# -------------------------------------------------------
# The limitations of MDI pointed out in the previous section can be mitigated
# using an alternative strategy: Feature Permutation Importance.
```
Small nitpick. 🙂

Suggested change:

```diff
-# using an alternative strategy: Feature Permutation Importance.
+# using an alternative strategy: Permutation Feature Importance.
```
Instructions given in #20301 will help you to solve the conflicts caused by the code style reformatting on
Here is my first pass on the code, including a few more comments on the examples.

I find the two examples a bit interlinked, with a bit of redundancy (this was already the case prior to this PR). To me, another PR could merge them and introduce another example using permutation feature importance on algorithms other than `BaseForest`'s, clarifying its use and value.

What do you think?
```python
# Please see :ref:`permutation_importance` for more details. We can now plot
# the importance ranking.

# The permutation importances is more computationally costly. Indeed, it
```
Suggested change:

```diff
-# The permutation importances is more computationally costly. Indeed, it
+# The permutation importance is more computationally costly. Indeed, it
```
```python
# %%
# Tree's Feature Importance from Mean Decrease in Impurity (MDI)
# --------------------------------------------------------------
# We plot the feature importances computed across all trees of the forest.
# We use a box plot representation to show the information. The mean is in the
# box plot would corresponds to the value reported by the fitted attribute
```
Suggested change:

```diff
-# box plot would corresponds to the value reported by the fitted attribute
+# box plot would correspond to the value reported by the fitted attribute
```
```python
# information. Thus, one tree could use the feature `sex_male` and ignore
# `sex_female` to create split while another tree could could make the opposite
# choice. We will see that the permutation feature importance does not solve
# this issue.
```
OK, I think one can remove this last sentence, as permutation feature importance has not been covered yet in this example and as it is explained in its dedicated part.
```
The higher, the more important the feature. There is two possible
strategies:
```
Suggested change:

```diff
-The higher, the more important the feature. There is two possible
+The higher, the more important the feature. There are two possible
 strategies:
```
```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_encoder = OneHotEncoder(handle_unknown='ignore')
numerical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
```
I can't suggest on line 77, but I would make it explicit as mentioned in the previous example:

```python
('classifier', RandomForestClassifier(feature_importances="impurity",
                                      random_state=42))
```
```python
return permutation_importance(
    estimator,
    X[unsampled_indices, :],
    y[unsampled_indices],
    scoring=None,
    n_repeats=1,
    n_jobs=1,
    random_state=random_state,
    sample_weight=sample_weight[unsampled_indices],
).importances[:, 0]
```
I do not know `permutation_importance` sufficiently, but at first sight, explaining why `n_jobs=1` is used would help: IIUC, as `_permutation_importances_oob` is called within `joblib.Parallel.__call__` and as `permutation_importance` also calls it, we explicitly force sequential execution by setting `n_jobs=1`.

Also, why does one only take `.importances[:, 0]`? Does `permutation_importance` compute more than what we need here?
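For context, here is a minimal sketch against the public `sklearn.inspection.permutation_importance` API, showing why the `[:, 0]` slice recovers a flat vector when `n_repeats=1` (the dataset and estimator are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(random_state=0).fit(X, y)

# `permutation_importance` returns a Bunch whose `importances` array has
# shape (n_features, n_repeats); with n_repeats=1, the slice [:, 0] extracts
# the single repetition as a flat (n_features,) vector.
result = permutation_importance(forest, X, y, n_repeats=1, random_state=0)
print(result.importances.shape)   # (4, 1)
print(result.importances[:, 0])   # per-feature score decreases
```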
sklearn/ensemble/_forest.py (outdated)
```python
with config_context(assume_finite=True):
    # avoid redundant checking performed on X in the permutation
    # importance function.
    oob_importances = np.transpose(Parallel(n_jobs=self.n_jobs)(
        delayed(_permutation_importances_oob)(
            estimator,
            X,
            y,
            sample_weight,
            n_samples,
            n_samples_bootstrap,
            random_state,
        )
        for estimator in self.estimators_
    ))

return oob_importances
```
Should we prefer threads here?

Suggested change:

```python
with config_context(assume_finite=True):
    # avoid redundant checking performed on X in the permutation
    # importance function.
    parallel_args = {
        **_joblib_parallel_args(prefer="threads"),
        "n_jobs": self.n_jobs,
    }
    oob_importances = Parallel(**parallel_args)(
        delayed(_permutation_importances_oob)(
            estimator,
            X,
            y,
            sample_weight,
            n_samples,
            n_samples_bootstrap,
            random_state,
        )
        for estimator in self.estimators_
    )
    oob_importances = np.transpose(oob_importances)

return oob_importances
```
I did look at it before, and I was under the impression that threading was not the best choice (not sure that we are releasing the GIL enough).
```
Instead of storing the OOB sample indices in the forest, it is more memory
efficient to rebuild the indices given the random state used to create the
bootstrap. This operation can be neglected in terms of computation time
compared to other processes when it is used (e.g. scoring).
```
Thanks for clarifying this.
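As a side note, a minimal sketch of how the OOB indices can be rebuilt from each tree's random state, using scikit-learn's private helpers `_generate_unsampled_indices` and `_get_n_samples_bootstrap` (private API, so their location and signatures may change between releases):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
# Private helpers: subject to change between scikit-learn versions.
from sklearn.ensemble._forest import (
    _generate_unsampled_indices,
    _get_n_samples_bootstrap,
)

X, y = make_regression(n_samples=100, n_features=4, random_state=0)
forest = RandomForestRegressor(n_estimators=3, bootstrap=True, random_state=0)
forest.fit(X, y)

n_samples = X.shape[0]
n_samples_bootstrap = _get_n_samples_bootstrap(n_samples, forest.max_samples)

# The OOB indices are never stored on the forest; each tree's `random_state`
# seeded its bootstrap draw, so the same indices can be regenerated on demand.
for tree in forest.estimators_:
    oob_indices = _generate_unsampled_indices(
        tree.random_state, n_samples, n_samples_bootstrap
    )
    print(f"{len(oob_indices)} out-of-bag samples for this tree")
```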
I agree the examples are a little redundant and that combining them would be outside the scope of this PR. Combining them seems like a good idea to me, but I also don't know what the intended purpose of the
doc/modules/ensemble.rst (outdated)
```rst
.. [Strobl07] `Strobl, C., Boulesteix, AL., Zeileis, A. et al.
   Bias in random forest variable importance measures: Illustrations,
   sources and a solution. BMC Bioinformatics 8, 25 (2007).
   <https://doi.org/10.1186/1471-2105-8-25>`_

.. [White94] `White, A.P., Liu, W.Z. Technical Note:
   Bias in Information-Based Measures in Decision Tree Induction.
   Machine Learning 15, 321–329 (1994).
   <https://doi.org/10.1023/A:1022694010754>`_
```
Suggested change:

```diff
 .. [Strobl07] `Strobl, C., Boulesteix, AL., Zeileis, A. et al.
    Bias in random forest variable importance measures: Illustrations,
-   sources and a solution. BMC Bioinformatics 8, 25 (2007).
+   sources and a solution.
+   BMC Bioinformatics 8, 25 (2007).
    <https://doi.org/10.1186/1471-2105-8-25>`_
 .. [White94] `White, A.P., Liu, W.Z. Technical Note:
    Bias in Information-Based Measures in Decision Tree Induction.
    Machine Learning 15, 321–329 (1994).
    <https://doi.org/10.1023/A:1022694010754>`_
```
I think that this should fix the error reported by Sphinx.
```diff
 # %%
-# Feature importance based on mean decrease in impurity
+# Feature importance based on Mean Decrease in Impurity (MDI)
 # -----------------------------------------------------
```
This line should have as many characters as the previous one.
We could keep it for another PR, for sure. One issue with removing examples boils down to referencing in Google: I am not sure, but we could end up with outdated links that no longer work.
I agree. The content of `plot_permutation_importance` could complete the one on forests' feature importance, `plot_forest_importances`. This would allow replacing its content with a generic example for permutation feature importance using other interfaces, while keeping the link as it is, even though the content would differ between versions' documentation. What do you think?
This sounds good to me.
Without reading this whole PR: note that OOB permutation importance is biased and should be corrected, see "The revival of the Gini importance". This corrected version (AIR) is (among other alternatives) implemented in the ranger package.
I think you have impurity importance and OOB permutation importance confused. I don't believe that paper found permutation importance to be biased. The following is a quote from the introduction:
Their experiments confirm this: the impurity importance is biased towards variables with more split points, while the permutation importance is unbiased. The only negative comment they make about permutation importance is that positive outliers are more likely due to overlap in the out-of-bag observations. AIR is not corrected OOB permutation importance but corrected impurity importance (it stands for Actual Impurity Reduction). I would be all in favor of implementing that as well, but I think that would probably have to be in a subsequent PR, as it would require some pretty significant changes:
Let me know if my understanding is incorrect in any way.
@robert-robison Thanks for your clarifications and for clearing up my confusion. I'm keen to see the progress of this PR; explainability tools are very useful.
This looks good to me after merge conflict resolution and one remaining comment.
@jjerphan I am not sure we settled on a possible API though, since we might have other metrics that would provide other feature importances: #20059 (comment)
Removing the milestone; will have this in once the API discussion is resolved :)
This implements out-of-bag permutation importance in random forests. This is why it provides value in addition to `inspection.permutation_importance`:

- it follows the approach of the established R packages `randomForest` and `Ranger`;
- it compared favorably to `inspection.permutation_importance` across repeated runs on the same data in the tests I've run (see here);
- it is exposed through `feature_importances_` and can therefore be used with `SelectFromModel` (see the discussion in "feature_importances_ should be a method in the ideal design", #9606).

Reference Issues/PRs

There are no open issues this directly addresses or closes to my knowledge. This has been discussed in multiple issues and PRs over the past several years, some of which are referenced above. Issues/PRs that have suggested/implemented some type of RF permutation importance have been closed based on `eli5`'s permutation importance or, more recently, `sklearn.inspection.permutation_importance`. I believe this PR still adds value, as demonstrated by the reasons stated above.
What does this implement/fix? Explain your changes.
The specific method implemented is the one shown in Breiman (2001), which is described in more detail by the developers of the `Ranger` R package in the paper "Do Little Interactions get Lost in Dark Random Forests?". The algorithm works like this: for each tree, the prediction accuracy is computed on the tree's out-of-bag samples; each feature is then permuted in turn in the out-of-bag sample and the accuracy is recomputed; a feature's importance is the resulting decrease in accuracy, averaged over all trees. For regression, the `r2_score` is used instead of accuracy.

This was implemented by adding an `importance_type` parameter that can take the values `{"impurity", "permutation"}`. While impurity-based importances aren't calculated until you need them, you need the dataset to calculate permutation importances, so they are calculated during the fit process if specified.

Any other comments?
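A minimal usage sketch against the API proposed in this PR (the `importance_type` parameter is this branch's addition and does not exist in released scikit-learn versions; the dataset is an arbitrary illustration choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Default behaviour: impurity-based (MDI) importances, computed lazily
# when `feature_importances_` is accessed.
forest_mdi = RandomForestClassifier(random_state=0).fit(X, y)

# Proposed in this PR: OOB permutation importances. They need the training
# data, so they are computed eagerly during fit.
forest_perm = RandomForestClassifier(
    importance_type="permutation", random_state=0
).fit(X, y)

# Both variants expose the result through the same fitted attribute.
print(forest_mdi.feature_importances_[:3])
print(forest_perm.feature_importances_[:3])
```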
The changes made here are functionally equivalent to the ones made in #3436, so I did reference that PR often while making these changes.
Also, this is my first pull request and this is kind of a big change, so I completely understand if I'm overstepping bounds here. Thank you!