Skip to content

[MRG] ENH Add support for dataframe in PDP #14028

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 110 commits into from
Oct 31, 2019

Conversation

glemaitre
Copy link
Member

@glemaitre glemaitre commented Jun 5, 2019

build on:

This PR adds support for:

  • pandas DataFrame
  • ColumnTransformer as a preprocessing within a Pipeline.

Additional features for future PRs:

  • Request of PDP for augmented features. For instance, one might compute the interaction between features (e.g. PolynomialFeatures) and would be interested in the PDP of these new features. It seems that we will need support for get_feature_names for the pipeline to ease such use case.
  • Support of categorical data (bar plot should be OK for those).

@glemaitre
Copy link
Member Author

Some questions regarding the expected behaviour:

  1. When using a pipeline, do we expect values to reflect the data scale after or before preprocessing. Values before preprocessing will require support of inverse_transform which will be challenging for some transformer (e.g., ColumnTransformer, PolynomialFeature, etc.);
  2. In the case of data augmentation within a pipeline (e.g., PolynomialFeature), one would be interested to know the interaction of an augmented feature.

@glemaitre glemaitre closed this Jun 6, 2019
@glemaitre glemaitre reopened this Jun 6, 2019
@jnothman
Copy link
Member

jnothman commented Jun 6, 2019 via email

@glemaitre
Copy link
Member Author

If a pipeline is passed, we should be quantifying in terms of its input features... If the user wants otherwise they should slice it.

After couple of iterations this morning is what I thought as well. This will make thing so difficult otherwise. get_feature_names will be handy in case a user need to slice it.

@NicolasHug
Copy link
Member

Feel free to ping when you need reviews! (I'll look at #14035)

@glemaitre glemaitre changed the title [WIP] Add support for dataframe in PDP [MRG] Add support for dataframe in PDP Jul 23, 2019
@glemaitre
Copy link
Member Author

I think this is good to be reviewed @thomasjpfan @NicolasHug

I will probably add an example to illustrate to we can use the ColumnTransformer

@glemaitre
Copy link
Member Author

We would need to merge #15354 before then.

@glemaitre
Copy link
Member Author

I dropped the support and put a meaningful error message depending of the type of X

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started to review but I have now the following question:

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

I think the example https://scikit-learn.org/dev/auto_examples/inspection/plot_partial_dependence.html should be updated to use a data frame and to pass features names instead of feature indices to make it more readable. But this can be done in a later PR if your prefer. As you wish.

``X`` is used both to generate a grid of values for the
``features``, and to compute the averaged predictions when
method is 'brute'.
features : list or array-like of int
features : array-like of {int, str}
The target features for which the partial dependency should be
computed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe make it explicit that it should be either a single feature or a pair of 2 features.

Also "target features" does not really mean anything.

May I suggest the following: "The feature or pair of interacting features for which the partial dependency should be computed."

@ogrisel
Copy link
Member

ogrisel commented Oct 30, 2019

@glemaitre
Copy link
Member Author

I updated the examples. However, be aware that we still have to specify features and features_names. I only updated partial_dependence and not plot_partial_dependence. So the next upcoming PR is to make plot_partial_dependence infer the features_names for dataframe such that one has only to give a list of features names.

@glemaitre
Copy link
Member Author

@NicolasHug Can we merge this one.

@glemaitre
Copy link
Member Author

@thomasjpfan as well :P

@thomasjpfan thomasjpfan merged commit 7ea7861 into scikit-learn:master Oct 31, 2019
@NicolasHug
Copy link
Member

Thanks @glemaitre !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.