
[WIP] Generalized partial dependence plots #5653


Closed

Conversation

trevorstephens
Contributor

@trevorstephens trevorstephens commented Nov 1, 2015

Still a fair bit to do on this one, but the bones are there now for the partial_dependence function.

I have implemented an "exact" formulation of the partial dependence function, based on doing predictions on the original training data. I believe this can work on any regression model, and any classification model supporting predict_proba. It is based on https://github.com/cran/randomForest/blob/master/R/partialPlot.R

Additionally, I added an "estimated" method that uses the column means of X instead of looping over all the grid points and re-evaluating every time. It is many times faster than the other methods and gives reasonable results in terms of showing the "direction" in which a variable pushes the model's predictions.
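
Roughly, the two methods compare like this (a minimal sketch with hypothetical helper names, not the PR's actual code):

import numpy as np

def exact_partial_dependence(est, X, feature, grid):
    # "exact": for each grid value, overwrite the target feature's column
    # in the full training data and average the model's predictions
    pdp = []
    for value in grid:
        X_eval = X.copy()
        X_eval[:, feature] = value
        pdp.append(est.predict(X_eval).mean())
    return np.array(pdp)

def estimated_partial_dependence(est, X, feature, grid):
    # "estimated": hold every other feature at its column mean, so the
    # model is evaluated on only one synthetic row per grid value
    X_eval = np.tile(X.mean(axis=0), (len(grid), 1))
    X_eval[:, feature] = grid
    return est.predict(X_eval)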

RandomForestRegressor is now also supported using the recursive method that is already working for GBMs. I am still toying with ways to recurse classification trees, but that might be another PR after this merges.
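
For the single-feature case, that weighted recursion looks roughly like this (a sketch against sklearn's tree internals, not the PR's actual code, which is vectorised): splits on the target feature follow the branch the grid value dictates, while other splits descend both children weighted by training-sample fractions.

def tree_partial_dependence(tree, feature, value, node=0, weight=1.0):
    # weighted average of the tree's prediction at `value` of `feature`
    t = tree.tree_
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf: contribute its value, weighted
        return weight * t.value[node][0][0]
    if t.feature[node] == feature:
        # split on the target feature: follow the branch `value` dictates
        child = left if value <= t.threshold[node] else right
        return tree_partial_dependence(tree, feature, value, child, weight)
    # split on another feature: descend both children, weighted by the
    # fraction of training samples that followed each branch
    w_left = t.n_node_samples[left] / t.n_node_samples[node]
    return (tree_partial_dependence(tree, feature, value, left,
                                    weight * w_left)
            + tree_partial_dependence(tree, feature, value, right,
                                      weight * (1.0 - w_left)))

For a forest, average this quantity over forest.estimators_.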

TODO:

  • Do something sensible with multioutput estimators
  • Support for Pipeline as well as BaseSearchCV estimators
  • Update the plot_partial_dependence function
  • Add some tests
  • Clean up exception logic
  • Deprecate sklearn.ensemble.partial_dependence
  • Move & enhance docs (add example for non-GBRT classification/regression, narrative docs on new methods, etc.)

High-level questions:

  • Should 'exact' or 'estimated' be the default for estimators that are not regression-tree-based?
  • Is sklearn.partial_dependence the proper place for this utility?
  • Is demeaning the regression output a smart idea? This would need y to be passed to the function as well.

Timing:

The 'exact' method is extremely slow for ensembles. The 'estimated' method is very quick on just about anything and gives surprisingly similar results. The 'recursive' method is a fair bit slower on deep forests compared to GBMs, which tend to use quite shallow trees. Here's the California housing example from the website run with both GBM and RF regressors, timings in the headings, both models with 100 trees.

[Six figures: California housing partial dependence plots for the GBM and RF regressors under each method, timings in the headings.]

Any comments on the methods being used? @pprett, since you wrote the original, I would especially value your comments.

@trevorstephens
Contributor Author

Reference issue #4405

    pdp = np.subtract(pdp, np.mean(pdp, 1)[:, np.newaxis])  # demean each row of the pdp
    pdp = pdp.transpose()
else:
    raise ValueError('est must be a fitted regressor or classifier model.')


Is there any reason you can't let est be an instance of sklearn.pipeline.Pipeline? So that you can look at the partial dependence as it applies to a predictor before some transformation is applied.

Contributor Author

I'd have to look into it; at first thought, it may get tricky to get a transformed X on which to construct the pdplot grid and calculate the function. Can you give an example of such a pipeline you would want to use, and where you'd want the pdplot to be calculated?


For example: say you apply a spline to a predictor. Then you might want to look at the partial dependence of the target on the raw predictor, even though the regression uses the transformed one.

Contributor Author

Doesn't a pipeline require the final step to be the predictor? Unless I misunderstand you... Can you give an actual example, i.e. with the actual pipeline constructor code?

In general, I don't see any reason why I can't support a pipeline or gridsearch object as input.


I meant predictor as in column, for example:

pipeline = Pipeline(
    steps=[
        ("apply_spline_to_a_feature", ApplySplineToFeature(feature_num=1)),
        ("run_a_linear_regression", LinearRegression())
    ]
)

where ApplySplineToFeature is a TransformerMixin subclass.

Then I would hope to be able to calculate the partial dependence on the entire pipeline:

partial_dependence(pipeline, X, target_features=(0,))

I would be interested in seeing the partial dependence where the horizontal axis represents the "raw" data, even though some transformation is applied before scoring. Hope this makes a little more sense!

Contributor Author

@pauljacksonrodgers is this what you were going for? Same example as above... Pipeline for discussion purposes only... I don't generally PCA my forests :-) Note that this can't work with the "recursive" option as the X's within the tree splits are all modified.

estimators = [('scaling', StandardScaler()),
              ('reduce_dim', PCA(n_components=6)),
              ('rfr', RandomForestRegressor(n_estimators=100))]
clf = Pipeline(estimators)
clf.fit(X_train, y_train)

[Two figures: partial dependence plots computed through the Pipeline.]

Contributor Author

It also works for GridSearchCV...

parameters = {'learning_rate': [0.1, 0.5]}
gbr = GradientBoostingRegressor(n_estimators=100, max_depth=4,
                                loss='huber', random_state=1)
clf = GridSearchCV(gbr, parameters)
clf.fit(X_train, y_train)

[Two figures: partial dependence plots computed through GridSearchCV.]

Contributor Author

FYI, both the Pipeline and GridSearchCV examples above were generated on my local branch with

partial_dependence(clf, target_feature,
                   X=X_train, grid_resolution=50, method=method)

where X_train is the same as the second half of http://scikit-learn.org/stable/auto_examples/ensemble/plot_partial_dependence.html


This is exactly what I meant, thanks! I hadn't even considered using GridSearchCV here, also very cool. And yes, it makes sense that you can't use the "recursive" method with either of these, since the "raw" data isn't even used in the computation.

Contributor Author

Great. Sorry if I wasn't following you earlier. It's a good enhancement and easy to support. Will push changes as soon as I stop pulling my hair out about multi-output :-)

@vene
Member

vene commented Nov 2, 2015

Here's a related reference that I ran into recently; it seemed very useful at a skim, but I haven't thoroughly looked into it or used it yet. It seems some interaction effects can be masked when you average. The paper has some nice visualisations. http://arxiv.org/pdf/1309.6392.pdf
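
That's the ICE (Individual Conditional Expectation) paper, I believe; the gist is to keep the per-sample curves that the 'exact' method averages away, so heterogeneous effects stay visible. A minimal sketch (hypothetical helper name, not from this PR):

import numpy as np

def ice_curves(est, X, feature, grid):
    # one curve per training sample: the same evaluations as the
    # "exact" method, just without the final averaging step
    curves = np.empty((X.shape[0], len(grid)))
    for j, value in enumerate(grid):
        X_eval = X.copy()
        X_eval[:, feature] = value
        curves[:, j] = est.predict(X_eval)
    return curves

Averaging them back, curves.mean(axis=0), recovers the ordinary partial dependence curve.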

    else:
        n_features = est.n_features_
elif X is None:
    raise ValueError('X is required for method="exact" or "estimated".')


Why is this so? Seems like you should be able to do exact/estimated computation on the grid too, right?

Contributor Author

The 'estimated' and 'exact' methods use X to evaluate the function via predict or predict_proba; the grid only supplies values for the target variables.
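
(The grid itself is typically built from X as well, e.g. something like

grid = np.linspace(X[:, target_feature].min(),
                   X[:, target_feature].max(),
                   grid_resolution)

though the exact construction here may differ.)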

@trevorstephens
Contributor Author

Thanks @vene, will have a flick through this when I have some time.

if isinstance(est, RegressorMixin):
    try:
        pdp = est.predict(X_eval)
    except:
Member

Never! Use except Exception. A bare except: also catches KeyboardInterrupt.

But besides, we now have common tests to ensure that calling predict before fit raises NotFittedError.
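
A minimal sketch of that pattern (note: NotFittedError's import location has moved between versions; it's in sklearn.exceptions in recent releases):

from sklearn.exceptions import NotFittedError

try:
    pdp = est.predict(X_eval)
except NotFittedError:
    raise ValueError('est must be a fitted regressor or classifier model.')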

Contributor Author

A lot of this is just placeholders, @jnothman :-) I need to figure out which errors to let the estimator pass through and which to explicitly catch in this function. Not ready for code reviews yet; just structural stuff for now, please.

Member

okay np!

@jnothman
Member

jnothman commented Nov 3, 2015

Add to your TODOs: an example for GBRT and for non-GBRT classification/regression, plus narrative docs if not already in your TODO.

@trevorstephens
Contributor Author

I had "enhance docs" in the todos. There is a fair bit of work left, this is just the bare bones. Appreciate your input though! Can you comment on the high-level questions in the original post?

@asjedh

asjedh commented Jan 26, 2016

Do we have any updates here? Partial dependence plots for random forests would be very helpful :)

@trevorstephens
Contributor Author

Hi @asjedh , been a bit busy since the holidays but will get back on the pull request soon. Might have a bit of time for it this weekend in fact!

@darribas

+1 on PDP for random forests! :)

@chris60201

Hey guys! Do we have an update here? Looks like checks have passed...

@amueller amueller changed the title [WIP] Generalized partial dependence plots [MRG] Generalized partial dependence plots Oct 11, 2016
@amueller amueller changed the title [MRG] Generalized partial dependence plots [WIP] Generalized partial dependence plots Oct 11, 2016
@amueller
Member

@trevorstephens do you want to keep working on this or have someone else take it up?

@trevorstephens
Contributor Author

Hey @amueller ... thanks for the ping. I know this is pretty stale, but I'd like to keep working on it actually.

Work being very busy and then recently moving back home to Australia made the year to date totally bonkers, but I have time now to come back to this. Will check in some code this weekend hopefully.

Looks like I will have to check the new model validation module to see how it plays with this feature after the recent update.

@amueller
Member

@trevorstephens sure, no worries. Thanks for the update. I've been out for a bit myself, but trying to do some late spring cleaning here ;)

@jph00

jph00 commented Jan 5, 2017

I'm hoping to use this in part 2 of http://course.fast.ai, otherwise I'll have to use R :( Is this still being worked on? Any chance of having it done in the next couple of weeks?
(Don't mean to hassle anybody! Just need to finalize plans for the course.)

@jnothman
Member

jnothman commented Jan 6, 2017 via email

@trevorstephens
Contributor Author

Hey @jph00 ... If I packaged the code up and pushed it to PyPI (pip-installable) as a temporary measure, would that help? Untested and unreviewed, of course... You are also more than welcome to just copy the function from the PR and use it directly; you would just need to change the relative imports to point to sklearn instead.

@Irene-GM

Hi guys! Is there any update on the partial dependence plots for Random Forest? :-)
Can we have access to the source code of the generic function, or have it as a standalone package on PyPI?
Thanks for the great effort!

@jnothman
Member

@trevorstephens, would you welcome another contributor trying to finish off your work?

@lucianoviola
Contributor

@trevorstephens @jnothman I could volunteer; I have some previous experience with partial dependence from my job.

@jnothman
Member

jnothman commented Jul 20, 2017 via email

@trevorstephens
Contributor Author

Hi @lucianoviola, thanks for the offer! I had actually been working on this locally after @Irene-GM woke me up :-D I should have some commits in this weekend and would welcome comments and suggestions from your experience soon! @jnothman, sorry for dragging it out; I've been very busy lately, but it should be ready for review in the coming days!

@amueller
Member

amueller commented Nov 6, 2018

The plotting module seems unlikely to happen soon. I think we should move this forward, possibly without the approximation.

@amueller
Member

amueller commented Nov 7, 2018

@trevorstephens would you mind if I put @NicolasHug on this to wrap up your work? I'd really like to see this get into 0.21 ;)

@trevorstephens
Contributor Author

Sorry, been hard to find the time to work on this @amueller :-( The documentation part especially will require a fair bit of effort! I was recently thinking of spinning it out as a contrib library so I could finish it up bit-by-bit for later inclusion in the main repo rather than going straight to master, but if @NicolasHug has the capacity to finish it up, then go for it 👍

@NicolasHug
Member

Thanks @trevorstephens, I'll finish this in another PR.

@NicolasHug
Member

NicolasHug commented Nov 15, 2018

Hi @trevorstephens, just FYI: I've opened #12599 (not finished yet, but soon) and of course credited you.

Here is a list of the main changes. Some of them actually come from the original implementation, not necessarily from your PR:

  • removed the approx method after discussing it with @amueller
  • removed support for multiclass-multioutput classifiers
  • merged label and output into a single target parameter
  • removed support for RandomForestRegressor with recursion, because it doesn't work in multioutput settings
  • renamed exact to brute
  • renamed axes to values, which is arguably not much better... but I find 'axes' confusing since it's not directly related to a matplotlib Axes instance
  • renamed pdp to averaged_predictions
  • factorized the tests and added a lot more of them
  • updated the docs

@trevorstephens
Contributor Author

Nice work @NicolasHug 👍

@trevorstephens trevorstephens deleted the general-pplot branch May 15, 2019 10:20