FEA Add variable importance to linear models #21170
Comments
@GaelVaroquaux @rth @NicolasHug @TomDLT friendly ping in case of interest as you've been involved in earlier issues.
I think that this is a very slippery slope: t-statistics are not well controlled outside of maximum-likelihood estimates. Either people know what they are doing, and it's trivial to compute the above, or they don't, and they will misinterpret it (that's true of much of the model interpretation literature, which has been going around in circles for years because it tries to give simple answers to problems that have no good solution in statistics). I'm -1 on this line.
We don't have to call it "t-statistic", just "native linear model feature importance". @GaelVaroquaux Your arguments could be used against random forest feature importance, or even any feature importance measure. What do you propose instead for answering: "How important is feature X in your model? Could we drop it (for whatever good reason, maybe it costs money)?" I think we should have answers for the most simple, most taught and most trusted model class: linear models. I also think that the recent focus on model interpretation was very important for building trust in ML and for showing that predictive performance is not necessarily the most important thing. Though, admittedly, model interpretation might be a hard nut to crack in its own right.
By reusing the analysis that we did with the tree-based models, it seems that we had an understanding that there is no single good feature importance, but rather feature importance methods that each have some pros and cons. I assume that the same can be said about linear models, e.g. permutation importance vs. weight-based importance; adding a default would mean favouring one of them. So there is probably a choice of API to think about: use a native importance vs. use helper functions to compute the importance. If we want to avoid legitimizing a particular feature importance, each model should provide a parameter/a method to compute a specific type of importance; however, we would still have some default feature importance. If we use helper functions instead, the choice of importance will be user-specified. The issue, in this case, will be the integration with estimators that relied on a feature_importances_ attribute.
@GaelVaroquaux Your arguments could be used against random forest feature importance, or even any feature importance measure.
No, these are hard to implement :).
I think we should have answers for the most simple, most taught and most trusted model class: linear models.
But the answer is valid only for maximum likelihood in low dimensions.
We have a didactic example on this topic. I think that we did the best that we could.
Though, admittedly, model interpretation might be a hard nut to crack in its own right.
As far as I am concerned, it's still an open research question.
I think adding something specific to linear models to the inspection module would be a good way to not have it as easily accessible as a feature_importances_ attribute.
I think part of the problem is that providing a utility with a generic name such as "feature importance" could imply that what we propose is "The Way" to assess the contributions of input features to a model. Some of this problem would go away if we provided more specific names for the different methods to compute local (per sample) and global (per dataset) "explanations" of model decisions. For instance, we could provide a utility function to compute "feature effects" for linear models that decomposes the decision function for individual predictions as follows:
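A minimal sketch of what such a utility could look like; the helper name feature_effects is purely illustrative and not an existing scikit-learn API:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
model = LinearRegression().fit(X, y)

def feature_effects(linear_model, X):
    """Per-sample, per-feature contributions coef_[j] * x[j] to the prediction."""
    return X * linear_model.coef_  # shape (n_samples, n_features)

effects = feature_effects(model, X)
# each prediction decomposes into the sum of its feature effects plus the intercept
np.testing.assert_allclose(model.predict(X), effects.sum(axis=1) + model.intercept_)
```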
And then this same function could be aggregated across a dataset to compute a feature effect plot such as: https://christophm.github.io/interpretable-ml-book/limo.html#effect-plot This would be similar to the request to implement feature effects / impacts for:
- a decision on an individual sample (local explanation)
- a given feature computed on a test set (global explanation)
If we want this utility to reflect the uncertainty caused by the sampling of the training set and by the training procedure, we could recompute the effects on models refit on resampled training sets (e.g. via bootstrapping or cross-validation).
As someone said elsewhere on a different topic
I think the same applies here 😏
Describe the workflow you want to enable
I'd like to have a feature importance method native to linear models (without L1 penalty) that is calculated on the training set:
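Something along these lines; the attribute and helper names below are hypothetical and do not exist in scikit-learn, they only sketch the desired workflow:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

X, y = load_diabetes(return_X_y=True)
model = Ridge(alpha=1.0).fit(X, y)

# hypothetical attribute filled at fit time from the training data ...
# model.feature_importances_
# ... or a hypothetical helper in the inspection module
# from sklearn.inspection import linear_model_importance
# importances = linear_model_importance(model, X, y)
```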
Describe your proposed solution
New proposal
Evaluate whether the LMG measure (Lindeman, Merenda and Gold, see [1, 2]) is applicable and feasible for L2-penalized regression and for GLMs. Otherwise, consider other measures from [1, 2].
In short, LMG is the Shapley value decomposition of the model's R2 among the features.
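For intuition, here is a brute-force sketch of LMG using plain OLS (assumed; the penalized and GLM cases are exactly what would need evaluation). It averages the R2 gain of each feature over all feature orderings and is only feasible for a handful of features:

```python
from itertools import permutations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def lmg_importance(X, y):
    """Average R2 gain of each feature over all orderings (Shapley value of R2)."""
    n_features = X.shape[1]

    def r2_of(subset):
        if not subset:
            return 0.0
        Xs = X[:, list(subset)]
        return r2_score(y, LinearRegression().fit(Xs, y).predict(Xs))

    importances = np.zeros(n_features)
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        seen = []
        for j in order:
            before = r2_of(seen)
            seen.append(j)
            importances[j] += r2_of(seen) - before
    # importances sum to the full-model R2
    return importances / len(orderings)
```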
References:
Original proposal
Compute the t-statistic of the coefficients and use its absolute value, i.e. |t|, as a measure of (in-sample) importance. For GLMs like logistic regression, see section 5.3 in https://arxiv.org/pdf/1509.09169.pdf for a formula of Var[coef].
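A rough sketch for the unpenalized OLS case (assumed; the penalized and GLM cases need the covariance formula from the reference above):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = LinearRegression().fit(X, y)

n, p = X.shape
residuals = y - model.predict(X)
sigma2 = residuals @ residuals / (n - p - 1)   # unbiased estimate of the noise variance
Xc = np.column_stack([np.ones(n), X])          # design matrix with intercept column
cov_coef = sigma2 * np.linalg.inv(Xc.T @ Xc)   # Var[coef] for OLS
se = np.sqrt(np.diag(cov_coef))[1:]            # standard errors, intercept dropped
importance = np.abs(model.coef_ / se)          # |t| per feature
```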
Describe alternatives you've considered, if relevant
Any general importance measure (permutation importance, SHAP values, ...) also works.
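For reference, the permutation-based alternative is already available in scikit-learn's inspection module:

```python
from sklearn.datasets import load_diabetes
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = Ridge().fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```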
Additional context
Given the great and legitimate need for interpretability, I would favor having a native importance measure for linear models. Random Forests have their own native feature_importances_ with a cautionary note in the docstring about when it can be misleading. We could add a similar warning for collinear features.
I guess, in the end, this is true for all feature importance measures, even for SHAP (see also our multicollinear example).
Prior discussions like #16802, #6773 and #13048 focused on p-values, which seem out of scope for scikit-learn for different reasons. I hope we can circumvent these reasons by focusing on feature importance only and not considering p-values.