
ENH HistGradientBoosting estimators should have .feature_importances_ attribute #25707


Open
vitaliset opened this issue Feb 26, 2023 · 2 comments

@vitaliset (Contributor)

Describe the workflow you want to enable

Using the .feature_importances_ attribute for a quick, global look at which features drive your classifier is really handy, IMHO.

Right after evaluating the model, running something like

pd.DataFrame(model.feature_importances_, index=X_train.columns, columns=["feat_imp"]).sort_values("feat_imp", ascending=False)

is the first thing I do.

Furthermore, some variable selection strategies assume that your estimator has this attribute, and it seems suboptimal to rule out using HistGradientBoosting with these techniques.

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import HistGradientBoostingClassifier
from lightgbm import LGBMClassifier

X = [[ 0.87, -1.34,  0.31 ],
     [-2.79, -0.02, -0.85 ],
     [-1.34, -0.48, -2.55 ],
     [ 1.92,  1.48,  0.65 ]]
y = [0, 1, 0, 1]
selector = SelectFromModel(estimator=LGBMClassifier()).fit(X, y)
selector.get_support()
>>> array([ True,  True,  True])

selector = SelectFromModel(estimator=HistGradientBoostingClassifier()).fit(X, y)
selector.get_support()
>>> ValueError: when `importance_getter=='auto'`, the underlying estimator HistGradientBoostingClassifier should have `coef_` or `feature_importances_` attribute. Either pass a fitted estimator to feature selector or call fit before calling transform.

I understand the concerns raised by @ogrisel's comment in the original PR that implemented HistGradientBoosting, but I believe there are more points in favor of implementing it than omitting it. I'd love to hear your thoughts. :)

Describe your proposed solution

No response

Describe alternatives you've considered, if relevant

No response

Additional context

No response

@vitaliset added the Needs Triage and New Feature labels Feb 26, 2023
@glemaitre (Member)

but I believe we have more points in favor of its implementation than its omission. I'd love to hear your thoughts. :)

I don't think that we should implement the feature. Indeed, there is no single variable/feature importance measure that rules them all, so I think we should not commit to a single attribute.

Right now, the SelectFromModel API is a bit limited: importance_getter accepts a callable, but the callable receives only the fitted estimator as an argument. So you could still do something ugly like:

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import HistGradientBoostingClassifier

X = [[ 0.87, -1.34,  0.31 ],
     [-2.79, -0.02, -0.85 ],
     [-1.34, -0.48, -2.55 ],
     [ 1.92,  1.48,  0.65 ]]
y = [0, 1, 0, 1]

# In practice these would be held-out data; the training data is
# reused here just to keep the example short.
X_test = X.copy()
y_test = y.copy()

def importance_getter(estimator):
    # Compute permutation importance on the (held-out) data captured
    # from the enclosing scope, instead of relying on a
    # `feature_importances_` attribute.
    from sklearn.inspection import permutation_importance
    results = permutation_importance(estimator, X_test, y_test)
    return results.importances_mean

selector = SelectFromModel(
    estimator=HistGradientBoostingClassifier(),
    importance_getter=importance_getter,
).fit(X, y)
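
For completeness, the fitted selector can then be queried as usual; note that permutation_importance shuffles features at random, so the selected mask can vary between runs unless a random_state is fixed:

mask = selector.get_support()       # boolean mask over the three features
X_selected = selector.transform(X)  # keep only the selected columns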

The broader issue here is that we need an API discussion on how to compute variable importance as discussed here: #20059 (comment)

@thomasjpfan added the Needs Decision - API label and removed the Needs Triage label Mar 9, 2023
@ogrisel (Member) commented Mar 9, 2023

I think it could be useful to find a way to compute the mean decrease in impurity (MDI), but the feature_importances_ API is too limiting because it forces us to only consider training-set statistics.

So let's find a way to implement/fix #20059 in general (and in particular how to compute MDI on test data #20059 (comment)) and then we can discuss how to make this available for HistGradientBoosting* models.
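
For context, a minimal sketch of the contrast described above, comparing the MDI-based feature_importances_ that RandomForestClassifier already exposes (computed from the training set only) with permutation importance evaluated on held-out data; the dataset is synthetic and purely illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# MDI importances: derived purely from training-set statistics.
print(clf.feature_importances_)

# Permutation importance: can be evaluated on test data instead.
print(permutation_importance(clf, X_test, y_test, random_state=0).importances_mean)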
