
[WIP] Extending AdaBoost's base estimator's feature_importances_ attr to support coef_ attr of an estimator #12326


Open
wants to merge 9 commits into
base: main

Conversation


@dsprahul dsprahul commented Oct 8, 2018

Modified feature_importances_ attribute in AdaBoost's base estimator to work with coef_ attributes of given estimators if they don't provide feature_importances_

Reference Issues/PRs

Fixes #12137

What does this implement/fix? Explain your changes.

The feature_importances_ attribute of the BaseWeightBoosting class now converts the coef_ attribute of the input classifiers to their respective feature_importances_ using the existing _get_feature_importances method in feature_selection/from_model.py.

This method accepts an optional argument, norm_order, which defaults to 2 (the L2 norm).

norm_order : {non-zero int, inf, -inf, "fro"}, optional
        Same as the `ord` parameter of `numpy.linalg.norm`
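
For illustration, a minimal sketch of how such a conversion can work (the helper name here is hypothetical; the PR itself reuses _get_feature_importances rather than defining a new function):

import numpy as np

def coef_to_importances(estimator, norm_order=2):
    """Illustrative only: derive feature importances from coef_."""
    coef = getattr(estimator, "coef_", None)
    if coef is None:
        raise AttributeError("estimator exposes neither coef_ nor "
                             "feature_importances_")
    if coef.ndim == 1:
        # Binary / single-output case: magnitude of each coefficient.
        return np.abs(coef)
    # Multiclass case: aggregate coefficients across classes with the norm.
    return np.linalg.norm(coef, axis=0, ord=norm_order)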

Any other comments?

This method is now renamed to get_feature_importances and moved to utils/__init__.py as it is a shared resource.

…r to accept "norm_type" parameter for converting `coef_` to `feature_importances`.
Made a basic version of _coef_to_feature_importances function to handle this conversion
@sklearn-lgtm

This pull request introduces 1 alert when merging e2d069b into 63e5ae6 - view on LGTM.com

new alerts:

  • 1 for Unused local variable

Comment posted by LGTM.com

norm = non_negative_coefs_.sum()

return non_negative_coefs_ / norm
Member

If the problem is multiclass, this return value will still, as far as I can tell, be 2-dimensional. feature_importances_ must be 1d. I had intended that the norm would be used to aggregate coefficients across classes, as is done in _get_feature_importances.
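
For example (hypothetical values), aggregating a 2-dimensional multiclass coef_ across classes with a norm yields the required 1-d array:

import numpy as np

# coef_ of a hypothetical 3-class linear model with 4 features
coef = np.array([[0.5, -1.0, 0.0, 2.0],
                 [-0.5, 0.5, 1.0, 0.0],
                 [1.0, 0.5, 0.5, 1.0]])

importances = np.linalg.norm(coef, axis=0, ord=2)  # L2 norm over classes
print(importances.shape)  # (4,) -- one value per feature, as required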

Author

Alright, I will reuse the _get_feature_importances function from feature_selection to extend the feature_importances_ attribute of the AdaBoost base class. Thanks for the link.

…` to support estimators with `coef_` parameter while calculating `feature_importances_` in AdaBoost based estimators. doc skip
@sklearn-lgtm

This pull request introduces 1 alert when merging 61d9811 into 5fd9e03 - view on LGTM.com

new alerts:

  • 1 for Encoding error

Comment posted by LGTM.com

Member

@jnothman jnothman left a comment

Thanks

@@ -33,6 +33,7 @@
 from ..externals.six.moves import zip
 from ..externals.six.moves import xrange as range
 from .forest import BaseForest
+from ..feature_selection.from_model import _get_feature_importances
Member

If it is a shared utility, it should live in sklearn.utils

Member

The implementation currently mentions SelectFromModel and should not

Author

@dsprahul dsprahul Oct 9, 2018

I see two solutions:

  1. Move _get_feature_importances to sklearn.utils.__init__.py, or create a new file (feature_importances.py) within utils and move it there, then import this function into all the modules where it is used (2 places).
  2. If moving it is unnecessary, skip reusing the function and duplicate its functionality within weight_boosting.py.

@jnothman Which one should I choose?

@jnothman
Member

jnothman commented Oct 9, 2018

Put it in utils/__init__.py for now. We may also be able to reuse it in RFE (I think there may be an issue somewhere about it only supporting one norm?)

@dsprahul
Author

dsprahul commented Oct 9, 2018

Alright, done.
Yes, I think it's issue #2121. I will look into it or pass on the update once this one is merged.

@@ -599,3 +599,39 @@ def is_scalar_nan(x):
     # Redondant np.floating is needed because numbers can't match np.float32
     # in python 2.
     return bool(isinstance(x, (numbers.Real, np.floating)) and np.isnan(x))
+
+
+def get_feature_importances(estimator, norm_order=1):
Member

we still might want to name this _get_feature_importances, otherwise we are promising API stability to users.

Author

Sure, will change it back to _get_feature_importances. Didn't think of that.
Also, should I write more tests to pass codecov? (Currently it's 84.21%)

Member

@jnothman jnothman left a comment

Code coverage would be nice too, yes.

Sorry for all this messing about. It would be great to have an API for parameterized feature importances (I made an issue about it at some point), but it is not a small thing to change. I'm undecided on what to advise here.

    def feature_importances_(self):
        """Return the feature importances (the higher, the more important the
        feature).
    def get_feature_importances_(self, norm_order=2):
Member

I'm not sure that we should do this. We might need to make the norm a class parameter. Or just choose a norm and not give users an option.

Author

It's okay. I will come up with a solution by Monday.

Author

@jnothman: How about we make a mixin for feature_importances_ and inherit it wherever it's required (AdaBoost estimators, RFE, SelectFromModel, etc.)? This would require norm_order to be declared as a class variable in whichever class inherits the mixin.

The job of this mixin is to determine whether a metaestimator or an ordinary estimator is inheriting it and then calculate feature_importances_ accordingly.

Something like this that can go into sklearn/base.py:

import numpy as np  # already available as np in sklearn/base.py


class FeatureImportancesMixin(object):
    """Mixin class that generates a ``feature_importances_`` attribute."""

    def _is_metaestimator(self):
        """Determine whether `FeatureImportancesMixin` is inherited by a
        metaestimator (i.e. one exposing an ``estimators_`` attribute).
        """
        return hasattr(self, "estimators_")

    def _get_feature_importances(self, estimator, norm_order):
        """Retrieve or aggregate feature importances from an estimator.

        Parameters
        ----------
        estimator : any estimator object with either a `feature_importances_`
            attribute or a `coef_` attribute

        norm_order : {non-zero int, inf, -inf, "fro"}, optional
            Same as the `ord` parameter of `numpy.linalg.norm`

        Returns
        -------
        importances : 1d array, shape [n_features]
        """
        importances = getattr(estimator, "feature_importances_", None)

        coef_ = getattr(estimator, "coef_", None)
        if importances is None and coef_ is not None:
            if coef_.ndim == 1:
                importances = np.abs(coef_)
            else:
                # Aggregate multiclass coefficients across classes so the
                # result is 1d, one value per feature.
                importances = np.linalg.norm(coef_, axis=0, ord=norm_order)

        elif importances is None:
            raise AttributeError(
                "The underlying estimator %s has no `coef_` or "
                "`feature_importances_` attribute. Either pass a fitted "
                "estimator or call fit before requesting feature importances."
                % estimator.__class__.__name__)

        return importances

    @property
    def feature_importances_(self):
        """Return the feature importances.

        The higher the value, the more important the feature.

        Returns
        -------
        feature_importances_ : array, shape = [n_features]
        """
        try:
            if self._is_metaestimator():
                if self.estimators_ is None or len(self.estimators_) == 0:
                    raise ValueError(
                        "Estimator not fitted, "
                        "call `fit` before `feature_importances_`.")

                # Weighted average of each fitted estimator's importances.
                norm = self.estimator_weights_.sum()
                estimator_feature_importances_ = [
                    self._get_feature_importances(estimator, self.norm_order)
                    for estimator in self.estimators_]

                return (sum(weight * feature_importances_
                            for weight, feature_importances_ in zip(
                                self.estimator_weights_,
                                estimator_feature_importances_)) / norm)
            else:
                return self._get_feature_importances(
                    estimator=self.estimator_, norm_order=self.norm_order)

        except AttributeError:
            raise AttributeError(
                "Unable to compute feature importances "
                "since the base estimator has neither a "
                "feature_importances_ attribute nor a "
                "coef_ attribute")

@jnothman
Member

That design is not quite right: I think you're confusing the metaestimator that uses the importances with the underlying estimator that provides only coef_.

Ultimately I think the norm is something users should be aware of when writing about how a model works, but it is unlikely (?) to be something they would care to tweak.

This could be solved by deciding on one norm and sticking to it... Except that for a while we used different norms in SelectFromModel and RFE with no option to change. This is relatively hard to document... Although if we actually defined feature_importances_ on all estimators with coef_, that would be easy to document.

And if we went down the path of implementing get_importances() and deprecating feature_importances_, it would potentially involve adding an importances_params parameter to the meta estimators, but at least that would be clear and generic.

I think for now we should just choose a norm. Yes, we could make it configurable through a class-level attribute, and yes we should reuse that function from SelectFromModel, but I now think that over-engineering the meta-estimator is the wrong thing to do when the design issue is in the constituent estimator.
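
A minimal sketch of that direction, with a single fixed norm and the coef_ fallback handled per constituent estimator (names are illustrative, not the PR's final code):

import numpy as np

def _importances_with_coef_fallback(est):
    # Stand-in for the shared helper discussed above; fixed L2 norm.
    imp = getattr(est, "feature_importances_", None)
    if imp is not None:
        return imp
    coef = est.coef_
    return np.abs(coef) if coef.ndim == 1 else np.linalg.norm(coef, axis=0,
                                                              ord=2)

class WeightBoostingSketch(object):
    """Illustrative skeleton: weighted average of per-estimator importances."""

    @property
    def feature_importances_(self):
        if not getattr(self, "estimators_", None):
            raise ValueError("Estimator not fitted, call `fit` before "
                             "`feature_importances_`.")
        total = self.estimator_weights_.sum()
        return sum(w * _importances_with_coef_fallback(est)
                   for w, est in zip(self.estimator_weights_,
                                     self.estimators_)) / total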

Wrote a test case to test the feature_importances_ calculation using coef_ attr of underlying estimator
@dsprahul
Author

Alright, I made the L2 norm the default for AdaBoost, configurable via the class variable norm_order. I also added a test for the coef_ to feature_importances_ conversion.
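
A rough idea of what such a test can look like, assuming the PR's behaviour where feature_importances_ falls back to coef_ (illustrative, not the exact test added here):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

def test_feature_importances_from_coef():
    # SVC(kernel='linear') exposes coef_ but not feature_importances_.
    X, y = make_classification(n_samples=100, n_features=10, random_state=0)
    clf = AdaBoostClassifier(base_estimator=SVC(kernel='linear'),
                             algorithm='SAMME', random_state=0)
    clf.fit(X, y)
    importances = clf.feature_importances_
    assert importances.shape == (10,)
    assert np.all(importances >= 0)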

@hermidalc
Contributor

Actually @jnothman @RahulDamineni @alexsieusahai I don't think we should provide a parameter for the norm used to make coef_ values non-negative; RFE and RFECV do not. We should use the exact same code that is already there, at least as a first implementation:

# Get coefs
if hasattr(estimator, 'coef_'):
    coefs = estimator.coef_
else:
    coefs = getattr(estimator, 'feature_importances_', None)
if coefs is None:
    raise RuntimeError('The classifier does not expose '
                       '"coef_" or "feature_importances_" '
                       'attributes')

# Get ranks
if coefs.ndim > 1:
    ranks = np.argsort(safe_sqr(coefs).sum(axis=0))
else:
    ranks = np.argsort(safe_sqr(coefs))
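
Adapted to produce importances rather than ranks, that same convention would look roughly like this (a sketch; safe_sqr is sklearn.utils.safe_sqr):

import numpy as np
from sklearn.utils import safe_sqr

def importances_rfe_style(estimator):
    # Same convention as RFE: squared coefficients, summed over classes.
    if hasattr(estimator, 'coef_'):
        coefs = estimator.coef_
    else:
        coefs = getattr(estimator, 'feature_importances_', None)
    if coefs is None:
        raise RuntimeError('The estimator does not expose "coef_" or '
                           '"feature_importances_" attributes')
    if coefs.ndim > 1:
        return safe_sqr(coefs).sum(axis=0)
    return safe_sqr(coefs)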

Development

Successfully merging this pull request may close these issues.

AdaBoost feature_importances_ should support base estimators with coef_
6 participants