[WIP] Add return_std option to ensembles #5532


Closed
glouppe wants to merge 6 commits from the bagging-variancewq branch

Conversation

glouppe
Contributor

@glouppe glouppe commented Oct 22, 2015

This supersedes #3645.

  • Implement http://arxiv.org/pdf/1311.4555.pdf for confidence intervals in RFs
  • Update GBRT's API for quantile regression
  • Update and polish example
  • Add tests
  • Add warnings in GPs if fixed parameters and n_restarts > 0

@@ -856,11 +855,13 @@ def __init__(self,
             random_state=random_state,
             verbose=verbose)

-    def predict(self, X):
+    def predict(self, X, with_std=False):
Member

The GP module uses return_std.

@glouppe glouppe changed the title [WIP] Add with_std option to ensembles [WIP] Add return_std option to ensembles Oct 22, 2015
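(For reference, a minimal sketch of the return_std pattern already exposed by the new GP module, with made-up data; the commented-out last line is only the analogous ensemble call proposed in this PR, not part of released scikit-learn.)

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

# Existing GP interface: optionally return the predictive std with the mean.
gp = GaussianProcessRegressor(alpha=0.1 ** 2).fit(X, y)
y_mean, y_std = gp.predict(X[:5], return_std=True)

# The analogous call this PR proposes for ensembles (hypothetical):
# y_mean, y_std = RandomForestRegressor().fit(X, y).predict(X[:5], return_std=True)
```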
# its standard deviation
x = np.atleast_2d(np.linspace(0, 10, 1000)).T

regrs = {"Gaussian Process": GaussianProcessRegressor(alpha=(dy / y) ** 2),
Contributor Author

@jmetzen is this correct (it was previously nugget)? Shall we reuse the same theta0, thetaL and thetaU that were set previously? The plot is a bit different for GP now.

@glouppe glouppe force-pushed the bagging-variancewq branch 2 times, most recently from 017f412 to 8f604b6 on October 26, 2015 15:12
Jan Hendrik Metzen and others added 4 commits October 26, 2015 16:16
…t and Bagging regressors

REFACTOR Add option with_std to GaussianProcess.predict

This is consistent with the interface of RandomForestRegressor.predict
The old way of requesting the predictive variance via eval_MSE is deprecated

REFACTOR Tests and examples of GaussianProcess use with_std instead of eval_MSE

ADD Example comparing the predictive distributions of different regressors

DOC Improved documentation of with_std parameter of predict() method

FIX Bug in BaggingRegressor using _parallel_predict_regression

DOC More consistent documentation of optional return-value y_std of predict

DOC Updated doc of predict() of BaggingRegressor and RandomForestRegressor

ENH Extending example plot_predictive_standard_deviation.py
@glouppe glouppe force-pushed the bagging-variancewq branch 3 times, most recently from c9c43f4 to 27b5859 on October 26, 2015 18:04
@glouppe glouppe force-pushed the bagging-variancewq branch from 27b5859 to 2e93e4d on October 27, 2015 07:22
@glouppe
Contributor Author

glouppe commented Oct 27, 2015

Here is the current output for the example, using bagged extra-trees and GPs.

[Imgur screenshot: example output]

There is one important thing that I need to clearly mention somewhere: return_std for forests is the sampling standard deviation of the predictions. That is, how different the predictions would be had the forest been trained on another training set. This is fundamentally different from the standard deviation values returned for GPs, which are computed from the modeled conditional distribution of the output.

(In particular, since a random forest is consistent, the larger the training sample, the more its sampling variance will tend towards 0, since the forest will tend towards the "Bayes" model (i.e., always predict the true mean output value).)
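(A rough sketch of the distinction, with made-up data; the per-tree spread below is only a crude proxy for the infinitesimal-jackknife estimator from the paper that this PR implements.)

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_test = np.linspace(0, 10, 50).reshape(-1, 1)

# Sampling std of the forest: spread of the individual trees' predictions,
# i.e. how much the prediction would move under a different training sample.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
tree_preds = np.stack([tree.predict(X_test) for tree in forest.estimators_])
forest_std = tree_preds.std(axis=0)

# GP std: width of the modeled conditional distribution at each test point.
gp = GaussianProcessRegressor(alpha=0.3 ** 2).fit(X, y)
gp_mean, gp_std = gp.predict(X_test, return_std=True)
```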

@glouppe
Contributor Author

glouppe commented Oct 27, 2015

Also, if we want instead to model the conditional distribution of the output in forests, then we would have to switch to quantile regression forests, which would require significant changes in the forest code.

@glouppe
Contributor Author

glouppe commented Oct 27, 2015

@jmetzen In light of my previous comments, I would suggest removing the comparison with respect to the mean log probability of noise-free samples. In particular, increasing the size of the training data can drive this value arbitrarily low, since the predictions of the forest will tend towards the noise-free values themselves with arbitrarily high probability. (Again, this stems from the fact that GP's returned stds and Bagging's returned stds correspond to different quantities...)

@glouppe
Contributor Author

glouppe commented Oct 28, 2015

A better example might be to merge this one with ensemble/plot_gradient_boosting_quantile.py, which shows how to do quantile regression with GBRT. Then both GP's and GBRT's returned stds would correspond to comparable quantities.

There is just one thing that I don't like regarding the API, which is that GPs and GBRTs cannot currently be used in the same way to compute prediction intervals. I'll make a proposal to update GBRT's API.
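(For illustration, a minimal sketch of GBRT quantile regression producing a prediction interval, with made-up data and assuming the current GradientBoostingRegressor API; this is the kind of quantity the merged example could put next to the GP's return_std.)

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=500)
X_test = np.linspace(0, 10, 50).reshape(-1, 1)

# One model per quantile; the 5% and 95% models together give a 90% interval.
models = {q: GradientBoostingRegressor(loss="quantile", alpha=q,
                                       n_estimators=200).fit(X, y)
          for q in (0.05, 0.5, 0.95)}

y_lower = models[0.05].predict(X_test)
y_median = models[0.5].predict(X_test)
y_upper = models[0.95].predict(X_test)
```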

@glouppe glouppe force-pushed the bagging-variancewq branch from 4e3f802 to 523bbf9 Compare October 28, 2015 12:50
@glouppe
Contributor Author

glouppe commented Oct 30, 2015

Does anybody have an opinion about what the semantics of return_std should be? (Either the std of the modeled conditional distribution, as in GPs or quantile regression, or the sampling standard deviation?) CC: @jmetzen @arjoly @mblondel @GaelVaroquaux @fabianp @pprett

@jmetzen
Member

jmetzen commented Oct 31, 2015

I agree that the comparison of GP and bagging with regard to the mean log probability should be removed as the returned stds correspond to different quantities. Having a unified interface for GP and GBRT for return_std would be great and the comparison of GPs and quantile regression with GBRT would be nice. Not sure how much effort this would be?

Regarding your last question: When an estimator "only" supports returning the sampling standard deviation, we could use the keyword return_sampling_std instead of return_std. This would make it more explicit that the two quantities are not directly comparable.

@MechCoder
Member

Can we split this PR into separate ones for RandomForests and the other ensembles? I want to implement SMAC, and I feel it would be great to have a return_std option for RandomForests at least.

@glouppe
Contributor Author

glouppe commented Mar 17, 2016

Oh sorry, I totally forgot about this PR. Yes, please go ahead. We should just agree on the semantics of return_std (which is very critical to get right for SMAC to work properly, as far as I understand). Note that you can already implement SMAC for GBRT, using the quantile regression mode.

@lesshaste

This PR looks awesome. Not a very helpful comment but nonetheless... :)

@MechCoder
Member

Great. I'll have a look over the weekend.

@betatim
Member

betatim commented Mar 19, 2016

@MechCoder I am interested in SMAC (or tree based black box optimisation), can we work together on that?

@MechCoder
Member

@glouppe I am okay with both options, that is, either return_std with documentation stating that these values are not comparable, or return_sampling_std to make this explicit. But ultimately the notion is to capture the variance in each prediction, so I do not really understand why we need to make it explicit that this is captured in different ways, i.e., deriving the conditional distribution in the case of GPs and the Infinitesimal Jackknife estimate in the case of Random Forests.

On a related note, should we add return_std to predict_proba as well? The paper has some interesting observations using the variations in the predicted probability values.
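(One conceivable notion of a std for predict_proba, sketched with made-up data; this is not an existing scikit-learn option, just the spread of the per-tree class probabilities.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# predict_proba averages the per-tree class frequencies; their spread is one
# crude measure of how uncertain that averaged probability is.
per_tree = np.stack([tree.predict_proba(X[:5]) for tree in clf.estimators_])
proba_mean = per_tree.mean(axis=0)   # equals clf.predict_proba(X[:5])
proba_std = per_tree.std(axis=0)
```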

@MechCoder
Member

@betatim Sure. I plan to implement a separate repo with a scipy.optimize like interface. I've done it for GP-based stuff. I'll let you know when I create it. We can work on various enhancements, if that is okay with you...

@glouppe
Contributor Author

glouppe commented Mar 20, 2016

But ultimately the notion it to capture the variance in each prediction

Variance comes from various sources and it should be clear which one we are referring to. In our case, there are two main sources of variance:

  • The variance of Y|X, irrespective of the supervised learning algorithm used to model this quantity. This is the quantity usually modeled by a GP.
  • The variance of the predictions \hat Y|X. This one is tied to the supervised learning algorithm and can itself often be decomposed into several variance terms (e.g., variance due to randomness, variance due to the training data, etc.). This is what the Jackknife measures.

In the case of SMAC and other model-driven approaches, I believe what we are looking for is a measure of the certainty of the prediction, in particular in regions where you have not yet sampled. So ideally, it is certainly a mix of both sources of variance... not sure what is best. It would be worth exploring in practice on a few problems.
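(Spelled out under squared error, this is the usual decomposition; not stated in the thread, added here only to make the two terms explicit:)

E[(Y - \hat Y(x))^2 \mid X = x] = \operatorname{Var}(Y \mid X = x) + \operatorname{Var}(\hat Y(x)) + \operatorname{Bias}(\hat Y(x))^2

The first term is what a GP's or a quantile regressor's std describes; the second is the sampling variance that jackknife-type estimators target.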

@betatim Sure. I plan to implement a separate repo with a scipy.optimize like interface. I've done it for GP-based stuff. I'll let you know when I create it. We can work on various enhancements, if that is okay with you...

Nice! I would love to contribute too. I am willing to give some help and explore a few things regarding tree-based approaches. @betatim and I have nice applications at CERN :)

@MechCoder
Member

Is it straightforward to model the conditional distribution of new data given the training data in the case of RandomForests (your first option)? I'm asking because in GPs it is easier to interpret, since the conditional distribution is a multivariate Gaussian.

@glouppe
Contributor Author

glouppe commented Mar 20, 2016

Yes, the proper way is to do it through quantile regression (which is not currently supported in our RF implementation, but is already available in GBRT). It requires some work to have this in RF, but it is not that difficult to do either.
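(A rough sketch of the quantile-regression-forest idea, Meinshausen-style: weight the training targets by how often they share a leaf with the query point. Illustrative code only, with made-up data; a real implementation inside the forest code would be more careful and much faster.)

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=10,
                               random_state=0).fit(X, y)

def forest_quantile(forest, X_train, y_train, X_query, q=0.9):
    """Estimate the q-th conditional quantile by weighting training targets
    by how often they fall in the same leaf as the query point."""
    train_leaves = forest.apply(X_train)      # shape (n_train, n_trees)
    query_leaves = forest.apply(X_query)      # shape (n_query, n_trees)
    order = np.argsort(y_train)
    y_sorted = y_train[order]
    out = np.empty(len(X_query))
    for i, leaves in enumerate(query_leaves):
        weights = np.zeros(len(y_train))
        for t, leaf in enumerate(leaves):
            in_leaf = train_leaves[:, t] == leaf
            weights[in_leaf] += 1.0 / in_leaf.sum()
        cdf = np.cumsum(weights[order]) / weights.sum()
        out[i] = y_sorted[min(np.searchsorted(cdf, q), len(y_sorted) - 1)]
    return out

y_q90 = forest_quantile(forest, X, y, X[:5], q=0.9)
```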

@MechCoder
Member

@glouppe @betatim The repository is here https://github.com/MechCoder/BlackBox. I named it BlackBox because of my low creativity levels. Right now there is support for GP-based minimizers. It seems to work according to the tests but is slow.

I would be glad to give you push access if you want to push directly, and we can move it to somewhere more noticeable later. (I also think we can do away with the MRG+2 rule for now ;) )

@amueller
Member

@glouppe is this PR still relevant? Or will this live in scikit-optimize?

@glouppe
Contributor Author

glouppe commented Oct 11, 2016

@amueller Not sure we converged on which quantities we would like to return.

@adrinjalali
Member

I was gonna close this citing scikit-optimize, but it seems scikit-optimize hasn't been maintained for the past 3 years and the repo is now in archive mode.

@lorentzenchr WDYT of this?

@betatim
Member

betatim commented Apr 18, 2024

scikit-optimize (and the projects it relies on) hasn't been maintained for ages. Please don't send people there; they end up making me feel guilty for no longer maintaining it :-/

@lorentzenchr
Member

From https://arxiv.org/abs/1311.4555

The error bars shown in Figure 1 give an estimate of the sampling variance of the random forest; in other words, they tell us how much the random forest's predictions might change if we trained it on a new training set. [...] The goal of our paper is to study the sampling variance of bagged learners.

So this measures the sampling variance, and what a user might expect is the standard error of the prediction. So I would rather close this as "not solved".

@adrinjalali
Member

Works for me.

9 participants