[WIP] Add return_std option to ensembles #5532


Closed
glouppe wants to merge 6 commits from the bagging-variancewq branch

Conversation

glouppe
Contributor

@glouppe glouppe commented Oct 22, 2015

This supersedes #3645.

  • Implement http://arxiv.org/pdf/1311.4555.pdf for confidence intervals in RFs
  • Update GBRT's API for quantile regression
  • Update and polish example
  • Add tests
  • Add warnings in GPs if fixed parameters and n_restarts > 0

@@ -856,11 +855,13 @@ def __init__(self,
             random_state=random_state,
             verbose=verbose)

-    def predict(self, X):
+    def predict(self, X, with_std=False):
Member

The GP module uses return_std.

@glouppe glouppe changed the title [WIP] Add with_std option to ensembles [WIP] Add return_std option to ensembles Oct 22, 2015
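(For reference, a minimal sketch of the return_std pattern already exposed by the new GP module, with made-up data; the commented-out last line is only the analogous ensemble call proposed in this PR, not part of released scikit-learn.)

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

# Existing GP interface: optionally return the predictive std with the mean.
gp = GaussianProcessRegressor(alpha=0.1 ** 2).fit(X, y)
y_mean, y_std = gp.predict(X[:5], return_std=True)

# The analogous call this PR proposes for ensembles (hypothetical):
# y_mean, y_std = RandomForestRegressor().fit(X, y).predict(X[:5], return_std=True)
```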
# its standard deviation
x = np.atleast_2d(np.linspace(0, 10, 1000)).T

regrs = {"Gaussian Process": GaussianProcessRegressor(alpha=(dy / y) ** 2),
Contributor Author

@jmetzen is this correct (it was previously nugget)? Shall we reuse the same theta0, thetaL and thetaU that were set previously? The plot is a bit different for GP now.

@glouppe glouppe force-pushed the bagging-variancewq branch 2 times, most recently from 017f412 to 8f604b6 on October 26, 2015 15:12
Jan Hendrik Metzen and others added 4 commits October 26, 2015 16:16
…t and Bagging regressors

REFACTOR Add option with_std to GaussianProcess.predict

This is consistent with the interface of RandomForestRegressor.predict
The old way of requesting the predictive variance via eval_MSE is deprecated

REFACTOR Tests and examples of GaussianProcess use with_std instead of eval_MSE

ADD Example comparing the predictive distributions of different regressors

DOC Improved documentation of with_std parameter of predict() method

FIX Bug in BaggingRegressor using _parallel_predict_regression

DOC More consistent documentation of optional return-value y_std of predict

DOC Updated doc of predict() of BaggingRegressor and RandomForestRegressor

ENH Extending example plot_predictive_standard_deviation.py
@glouppe glouppe force-pushed the bagging-variancewq branch 3 times, most recently from c9c43f4 to 27b5859 on October 26, 2015 18:04
@glouppe glouppe force-pushed the bagging-variancewq branch from 27b5859 to 2e93e4d on October 27, 2015 07:22
@glouppe
Contributor Author

glouppe commented Oct 27, 2015

Here is the current output for the example, using bagged extra-trees and GPs.

[Imgur screenshot: example output]

There is one important thing that I need to clearly mention somewhere: return_std for forests is the sampling standard deviation of the predictions. That is, how different the predictions would be had the forest been trained on another training set. This is fundamentally different from the standard deviation values returned for GPs, which are computed from the modeled conditional distribution of the output.

(In particular, since a random forest is consistent, the larger the training sample, the more its sampling variance will tend towards 0, since the forest will tend towards the "Bayes" model (i.e., always predict the true mean output value).)
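(A rough sketch of the distinction, with made-up data; the per-tree spread below is only a crude proxy for the infinitesimal-jackknife estimator from the paper that this PR implements.)

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_test = np.linspace(0, 10, 50).reshape(-1, 1)

# Sampling std of the forest: spread of the individual trees' predictions,
# i.e. how much the prediction would move under a different training sample.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
tree_preds = np.stack([tree.predict(X_test) for tree in forest.estimators_])
forest_std = tree_preds.std(axis=0)

# GP std: width of the modeled conditional distribution at each test point.
gp = GaussianProcessRegressor(alpha=0.3 ** 2).fit(X, y)
gp_mean, gp_std = gp.predict(X_test, return_std=True)
```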

@glouppe
Contributor Author

glouppe commented Oct 27, 2015

Also, if we want instead to model the conditional distribution of the output in forests, then we would have to switch to quantile regression forests, which would require significant changes in the forest code.

@glouppe
Contributor Author

glouppe commented Oct 27, 2015

@jmetzen In light of my previous comments, I would suggest removing the comparison with respect to the mean log probability of noise-free samples. In particular, increasing the size of the training data can drive this value arbitrarily low, since the predictions of the forest will tend towards the noise-free values themselves with arbitrarily high probability. (Again, this stems from the fact that GP's returned stds and Bagging's returned stds correspond to different quantities...)

@glouppe
Contributor Author

glouppe commented Oct 28, 2015

A better example might be to merge this one with ensemble/plot_gradient_boosting_quantile.py, which shows how to do quantile regression with GBRT. Then both GP's and GBRT's returned stds would correspond to comparable quantities.

There is just one thing that I don't like regarding the API, which is that GPs and GBRTs cannot currently be used in the same way to compute prediction intervals. I'll make a proposal to update GBRT's API.
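(For illustration, a minimal sketch of GBRT quantile regression producing a prediction interval, with made-up data and assuming the current GradientBoostingRegressor API; this is the kind of quantity the merged example could put next to the GP's return_std.)

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=500)
X_test = np.linspace(0, 10, 50).reshape(-1, 1)

# One model per quantile; the 5% and 95% models together give a 90% interval.
models = {q: GradientBoostingRegressor(loss="quantile", alpha=q,
                                       n_estimators=200).fit(X, y)
          for q in (0.05, 0.5, 0.95)}

y_lower = models[0.05].predict(X_test)
y_median = models[0.5].predict(X_test)
y_upper = models[0.95].predict(X_test)
```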

@glouppe glouppe force-pushed the bagging-variancewq branch from 4e3f802 to 523bbf9 Compare October 28, 2015 12:50
@glouppe
Contributor Author

glouppe commented Oct 30, 2015

Does anybody have an opinion about what the semantics of return_std should be? (Either the std of the modeled conditional distribution, as in GPs or quantile regression, or the sampling standard deviation?) CC: @jmetzen @arjoly @mblondel @GaelVaroquaux @fabianp @pprett

@jmetzen
Member

jmetzen commented Oct 31, 2015

I agree that the comparison of GP and bagging with regard to the mean log probability should be removed as the returned stds correspond to different quantities. Having a unified interface for GP and GBRT for return_std would be great and the comparison of GPs and quantile regression with GBRT would be nice. Not sure how much effort this would be?

Regarding your last question: When an estimator "only" supports returning the sampling standard deviation, we could use the keyword return_sampling_std instead of return_std. This would make it more explicit that the two quantities are not directly comparable.

@MechCoder
Member

Can we split this PR into separate ones for RandomForests and the other ensembles? I want to implement SMAC, and I feel it would be great to have a return_std option for RandomForests at least.

@glouppe
Contributor Author

glouppe commented Mar 17, 2016

Oh sorry, I totally forgot about this PR. Yes, please go ahead. We should just agree on the semantics of return_std (which is very critical to get right for SMAC to work properly, as far as I understand). Note that you can already implement SMAC for GBRT, using the quantile regression mode.

@lesshaste

This PR looks awesome. Not a very helpful comment but nonetheless... :)

@MechCoder
Member

Great. I'll have a look over the weekend.

@betatim
Member

betatim commented Mar 19, 2016

@MechCoder I am interested in SMAC (or tree based black box optimisation), can we work together on that?

@MechCoder
Member

@glouppe I am okay with both options, that is, either return_std with documentation stating that these values are not comparable, or return_sampling_std to make this explicit. But ultimately the notion is to capture the variance in each prediction, so I do not really understand why we need to make it explicit that this is captured in different ways, i.e., deriving the conditional distribution in the case of GPs and the Infinitesimal Jackknife estimate in the case of Random Forests.

On a related note, should we add return_std to predict_proba as well? The paper has some interesting observations using the variations in the predicted probability values.
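(One conceivable notion of a std for predict_proba, sketched with made-up data; this is not an existing scikit-learn option, just the spread of the per-tree class probabilities.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# predict_proba averages the per-tree class frequencies; their spread is one
# crude measure of how uncertain that averaged probability is.
per_tree = np.stack([tree.predict_proba(X[:5]) for tree in clf.estimators_])
proba_mean = per_tree.mean(axis=0)   # equals clf.predict_proba(X[:5])
proba_std = per_tree.std(axis=0)
```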

@MechCoder
Member

@betatim Sure. I plan to implement a separate repo with a scipy.optimize like interface. I've done it for GP-based stuff. I'll let you know when I create it. We can work on various enhancements, if that is okay with you...

@glouppe
Contributor Author

glouppe commented Mar 20, 2016

But ultimately the notion it to capture the variance in each prediction

Variance comes from various sources and it should be clear which one we are referring to. In our case, there are two main sources of variance:

  • The variance of Y|X, irrespective of the supervised learning algorithm used to model this quantity. This is the quantity usually modeled by a GP.
  • The variance of the predictions \hat Y|X. This one is tied to the supervised learning algorithm and can itself often be decomposed into several variance terms (e.g., variance due to randomness, variance due to the training data, etc.). This is what the Jackknife measures.

In the case of SMAC and other model-driven approaches, I believe what we are looking for is a measure of the certainty of the prediction, in particular in regions where you have not yet sampled. So ideally, it is certainly a mix of both sources of variance... not sure what is best. It would be worth exploring in practice on a few problems.
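(Spelled out under squared error, this is the usual decomposition; not stated in the thread, added here only to make the two terms explicit:)

E[(Y - \hat Y(x))^2 \mid X = x] = \operatorname{Var}(Y \mid X = x) + \operatorname{Var}(\hat Y(x)) + \operatorname{Bias}(\hat Y(x))^2

The first term is what a GP's or a quantile regressor's std describes; the second is the sampling variance that jackknife-type estimators target.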

@betatim Sure. I plan to implement a separate repo with a scipy.optimize like interface. I've done it for GP-based stuff. I'll let you know when I create it. We can work on various enhancements, if that is okay with you...

Nice! I would love to contribute too. I am willing to give some help and explore a few things regarding tree-based approaches. @betatim and I have nice applications at CERN :)

@MechCoder
Member

Is it straightforward to model the conditional distribution of new data given the training data in the case of RandomForests (your first option)? I'm asking because in GPs it is easier to interpret, since the conditional distribution is a multivariate Gaussian.

@glouppe
Contributor Author

glouppe commented Mar 20, 2016

Yes, the proper way is to do it through quantile regression (which is not currently supported in our RF implementation, but is already available in GBRT). It requires some work to have this in RF, but it is not that difficult to do either.
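(A rough sketch of the quantile-regression-forest idea, Meinshausen-style: weight the training targets by how often they share a leaf with the query point. Illustrative code only, with made-up data; a real implementation inside the forest code would be more careful and much faster.)

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=10,
                               random_state=0).fit(X, y)

def forest_quantile(forest, X_train, y_train, X_query, q=0.9):
    """Estimate the q-th conditional quantile by weighting training targets
    by how often they fall in the same leaf as the query point."""
    train_leaves = forest.apply(X_train)      # shape (n_train, n_trees)
    query_leaves = forest.apply(X_query)      # shape (n_query, n_trees)
    order = np.argsort(y_train)
    y_sorted = y_train[order]
    out = np.empty(len(X_query))
    for i, leaves in enumerate(query_leaves):
        weights = np.zeros(len(y_train))
        for t, leaf in enumerate(leaves):
            in_leaf = train_leaves[:, t] == leaf
            weights[in_leaf] += 1.0 / in_leaf.sum()
        cdf = np.cumsum(weights[order]) / weights.sum()
        out[i] = y_sorted[min(np.searchsorted(cdf, q), len(y_sorted) - 1)]
    return out

y_q90 = forest_quantile(forest, X, y, X[:5], q=0.9)
```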

@MechCoder
Member

@glouppe @betatim The repository is here https://github.com/MechCoder/BlackBox. I named it BlackBox because of my low creativity levels. Right now there is support for GP-based minimizers. It seems to work according to the tests but is slow.

I would be glad to give you push access if you want to push directly, and we can move it to somewhere more noticeable later. (I also think we can do away with the MRG+2 rule for now ;) )

@amueller
Member

@glouppe is this PR still relevant? Or will this live in scikit-optimize?

@glouppe
Contributor Author

glouppe commented Oct 11, 2016

@amueller Not sure we converged on which quantities we would like to return.

@adrinjalali
Member

I was gonna close this citing scikit-optimize, but it seems scikit-optimize hasn't been maintained for the past 3 years and the repo is now in archive mode.

@lorentzenchr WDYT of this?

@betatim
Member

betatim commented Apr 18, 2024

scikit-optimize (and the projects it relies on) hasn't been maintained for ages. Please don't send people there; they end up making me feel guilty for no longer maintaining it :-/

@lorentzenchr
Member

From https://arxiv.org/abs/1311.4555

The error bars shown in Figure 1 give an estimate of the sampling variance of the random forest; in other words, they tell us how much the random forest's predictions might change if we trained it on a new training set. [...] The goal of our paper is to study the sampling variance of bagged learners.

So this measures the sampling variance, and what a user might expect is the standard error of the prediction. So I would rather close this as "not solved".

@adrinjalali
Member

Works for me.

9 participants