
ENH HistGradientBoosting estimators should have .feature_importances_ attribute #25707


Open
vitaliset opened this issue Feb 26, 2023 · 2 comments

@vitaliset (Contributor)

Describe the workflow you want to enable

Using the .feature_importances_ attribute for a quick, global look at which features drive your classifier is really handy, IMHO.

Right after evaluating the model, running something like

pd.DataFrame(model.feature_importances_, index=X_train.columns, columns=["feat_imp"]).sort_values("feat_imp", ascending=False)

is the first thing I do.

Furthermore, some variable selection strategies assume that your estimator has this attribute, and it seems suboptimal to rule out using HistGradientBoosting with these techniques.

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import HistGradientBoostingClassifier
from lightgbm import LGBMClassifier

X = [[ 0.87, -1.34,  0.31 ],
     [-2.79, -0.02, -0.85 ],
     [-1.34, -0.48, -2.55 ],
     [ 1.92,  1.48,  0.65 ]]
y = [0, 1, 0, 1]
selector = SelectFromModel(estimator=LGBMClassifier()).fit(X, y)
selector.get_support()
>>> array([ True,  True,  True])

selector = SelectFromModel(estimator=HistGradientBoostingClassifier()).fit(X, y)
selector.get_support()
>>> ValueError: when `importance_getter=='auto'`, the underlying estimator HistGradientBoostingClassifier should have `coef_` or `feature_importances_` attribute. Either pass a fitted estimator to feature selector or call fit before calling transform.

I understand the concerns raised by @ogrisel's comment in the original PR that implemented HistGradientBoosting, but I believe there are more points in favor of implementing it than omitting it. I'd love to hear your thoughts. :)

Describe your proposed solution

No response

Describe alternatives you've considered, if relevant

No response

Additional context

No response

@vitaliset added the Needs Triage and New Feature labels Feb 26, 2023
@glemaitre (Member)

but I believe we have more points in favor of its implementation than its omission. I'd love to hear your thoughts. :)

I don't think that we should implement the feature. Indeed, there is no single variable/feature importance measure that rules them all, so I think we should not commit to a single attribute.

Right now, the SelectFromModel API is a bit limited: importance_getter accepts a callable, but the callable receives only the fitted estimator as an argument. So you could still do something ugly like:

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import HistGradientBoostingClassifier

X = [[ 0.87, -1.34,  0.31 ],
     [-2.79, -0.02, -0.85 ],
     [-1.34, -0.48, -2.55 ],
     [ 1.92,  1.48,  0.65 ]]
y = [0, 1, 0, 1]

# In practice these would be held-out data; the training data is
# reused here just to keep the example short.
X_test = X.copy()
y_test = y.copy()

def importance_getter(estimator):
    # Compute permutation importance on the (held-out) data captured
    # from the enclosing scope, instead of relying on a
    # `feature_importances_` attribute.
    from sklearn.inspection import permutation_importance
    results = permutation_importance(estimator, X_test, y_test)
    return results.importances_mean

selector = SelectFromModel(
    estimator=HistGradientBoostingClassifier(),
    importance_getter=importance_getter,
).fit(X, y)
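
For completeness, the fitted selector can then be queried as usual; note that permutation_importance shuffles features at random, so the selected mask can vary between runs unless a random_state is fixed:

mask = selector.get_support()       # boolean mask over the three features
X_selected = selector.transform(X)  # keep only the selected columns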

The broader issue here is that we need an API discussion on how to compute variable importance as discussed here: #20059 (comment)

@thomasjpfan added the Needs Decision - API label and removed the Needs Triage label Mar 9, 2023
@ogrisel (Member) commented Mar 9, 2023

I think it could be useful to find a way to compute the mean decrease in impurity (MDI), but the feature_importances_ API is too limiting because it forces us to only consider training-set statistics.

So let's find a way to implement/fix #20059 in general (and in particular how to compute MDI on test data #20059 (comment)) and then we can discuss how to make this available for HistGradientBoosting* models.
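
For context, a minimal sketch of the contrast described above, comparing the MDI-based feature_importances_ that RandomForestClassifier already exposes (computed from the training set only) with permutation importance evaluated on held-out data; the dataset is synthetic and purely illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# MDI importances: derived purely from training-set statistics.
print(clf.feature_importances_)

# Permutation importance: can be evaluated on test data instead.
print(permutation_importance(clf, X_test, y_test, random_state=0).importances_mean)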
