
[MRG] Prediction variance in bagging-based regressors #3645


Closed

Conversation

jmetzen
Member

@jmetzen jmetzen commented Sep 7, 2014

This addresses issue #3271. The main points are

  • Added an option with_std to the predict method of BaggingRegressor and RandomForestRegressor. If True, the standard deviation of the predictions is returned in addition to the mean.
  • Added the option with_std to GaussianProcess.predict() and deprecated eval_MSE, which returned the predictive variance.
  • Added one example comparing the predictive distributions of the three methods.

Please also check the planned deprecation path of the parameter eval_MSE in GaussianProcess.
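
For concreteness: as discussed below, the standard deviation returned by with_std=True is the spread of the individual estimators' predictions around the ensemble mean. A minimal sketch of the same computation using only the existing public attributes (the toy dataset is made up for illustration):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data, purely illustrative
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(100)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# What predict(X, with_std=True) is proposed to return: the mean and the
# standard deviation of the per-tree predictions
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
y_mean, y_std = per_tree.mean(axis=0), per_tree.std(axis=0)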

Jan Hendrik Metzen added 4 commits September 7, 2014 10:20
This is consistent with the interface of RandomForestRegressor.predict
The old way of requesting the predictive variance via eval_MSE is deprecated
@jnothman
Member

jnothman commented Sep 7, 2014

This might benefit from a mention in the narrative docs too.

@jmetzen
Member Author

jmetzen commented Sep 7, 2014

I updated the documentation of the with_std option as proposed and fixed the bug in BaggingRegressor. Regarding the narrative docs: any opinions on where this could be added?

y_mean : array of shape = [n_samples] or [n_samples, n_outputs]
The mean of the predicted values.

y_std : array of shape = [n_samples]
Contributor

Please mention that this is optional, like it is done below for GPs.

@coveralls

Coverage Status

Coverage decreased (-0.02%) when pulling 1780535 on jmetzen:variance_ensemble_regressors into 57f67d0 on scikit-learn:master.

@jmetzen
Member Author

jmetzen commented Sep 14, 2014

@glouppe I extended the example (quantitative evaluation based on MSE and log probability under the predictive distribution). The main point of the example is to illustrate the quite different predictive distributions of the ensemble-based regressors and the Gaussian Process.

@jmetzen
Member Author

jmetzen commented Sep 14, 2014

@mblondel is this what you had in mind in issue #3271?

@coveralls

Coverage Status

Coverage decreased (-0.03%) when pulling 6043345 on jmetzen:variance_ensemble_regressors into 57f67d0 on scikit-learn:master.

@mblondel
Member

Could you show what the new example looks like, for lazy reviewers like me? :)

@jmetzen
Member Author

jmetzen commented Sep 15, 2014

Sure, here is the generated figure and the terminal output:

[Figure: mean predictions and pointwise 95% confidence intervals of the three regressors]

Comparison of predictive distributions of different regressors

A simple one-dimensional, noisy regression problem addressed by three different regressors:

  1. A Gaussian Process
  2. A Random Forest
  3. A Bagging-based Regressor

The regressors are fitted on noisy observations where the magnitude of the noise at the different training points is constant and known. Plotted are both the mean and the pointwise 95% confidence interval of the predictions. The mean predictions are evaluated on noise-less test data using the mean squared error. The mean log probabilities of the noise-less test data are used to evaluate the predictive distributions (a normal distribution with the predicted mean and standard deviation) of the three regressors.

This example is based on the example gaussian_process/plot_gp_regression.py.

Mean-squared error of predictors on 1000 equidistant noise-less test datapoints:
* Random Forest: 0.78
* Bagging: 0.71
* Gaussian Process: 0.57
Mean log-probability of 1000 equidistant noise-less test datapoints under the (normal) predictive distribution of the predictors, i.e., log N(y_true| y_pred_mean, y_pred_std) [less is better]:
* Random Forest: -12.44
* Bagging: -14.20
* Gaussian Process: -26.62
In summary, the mean predictions of the Gaussian Process are slightly better than those of Random Forest and Bagging. The predictive distributions (taking into account also the predictive variance) of the Gaussian Process are considerably better.

@glouppe
Contributor

glouppe commented Sep 15, 2014

I am still not convinced by this example. It is misleading in my opinion. The mean squared error and the prediction variance of the individual trees should not be mistaken for the mean squared error and the prediction variance of the forest.


@jmetzen
Member Author

jmetzen commented Sep 15, 2014

Could you be a bit more specific about what you find misleading? What do you mean by the prediction variance of the individual trees? DecisionTreeRegressor does not define a prediction variance. Or do you mean the variance from the bias-variance decomposition of the MSE?

@glouppe
Contributor

glouppe commented Sep 16, 2014

The predictive distributions (taking into account also the predictive variance) of the Gaussian Process are considerably better.

This is where I don't agree. The variance of the predictions of the individual trees (as computed when with_std=True) does not correspond to the variance of the predictions of the ensemble. Hence it is incorrect and misleading to use this as a way to characterize the predictive distribution of the forest (at least not like that). In fact, the beauty of making an ensemble is that it erases the part of the error that is specifically due to the variance of the individual models.
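
To make the distinction concrete, here is a small sketch (with a made-up dataset) contrasting the two quantities: the spread of the individual trees' predictions, which is what with_std=True returns, versus the variability of the forest's mean prediction under refitting on resampled data:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + 0.5 * rng.randn(200)
X_test = np.linspace(0, 10, 5).reshape(-1, 1)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Spread of the individual trees (what with_std=True returns)
per_tree = np.stack([tree.predict(X_test) for tree in forest.estimators_])
std_trees = per_tree.std(axis=0)

# Variability of the forest's *mean* prediction under refitting on
# bootstrap resamples of the data, typically much smaller than std_trees
means = []
for seed in range(20):
    idx = np.random.RandomState(seed).randint(0, len(X), size=len(X))
    refit = RandomForestRegressor(n_estimators=100, random_state=seed)
    means.append(refit.fit(X[idx], y[idx]).predict(X_test))
std_forest = np.asarray(means).std(axis=0)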

@mblondel
Member

So do you think that adding with_std to random forests is a bad idea?


print "Mean log-probability of 1000 equidistant noise-less test datapoints " \
"under the (normal) predictive distribution of the predictors, i.e., " \
"log N(y_true| y_pred_mean, y_pred_std) [less is better]:"\
Member

I might be missing the point but I find it strange to evaluate non-parametric models such as RF like this.

Member Author

You mean, assuming that its predictive distribution is normal?

@glouppe
Copy link
Contributor

glouppe commented Sep 16, 2014

@mblondel: No, it may be used as a good proxy for characterizing the certainty of the predictions and identifying hard cases. Yet, caution should be taken regarding the exact interpretation of these values.

log_pdf_loss[name] = \
    norm(y_pred, sigma).logpdf(f(x)).mean()

if name == "Random Forest": # Skip because RF is very similar to Bagging
Member

If you skip RF, just don't mention it in the example description; this is confusing. Or you could also use bagging with an SVR (using an RBF or polynomial kernel), which should produce different results from RF.

Member Author

Sounds reasonable, I will change the example accordingly.

@mblondel
Member

@glouppe Could you suggest a small paragraph which would explain your point for the purpose of this example?

print "In summary, the mean predictions of the Gaussian Process are slightly "\
"better than those of Random Forest and Bagging. The predictive " \
"distributions (taking into account also the predictive variance) " \
"of the Gaussian Process are considerably better."
Member

It's hard to tell which method has better variance estimates without an objective metric, so I would rather remove this sentence. I would add in the example description that the GP models the predictive distribution in a Bayesian way, while RF and Bagging use frequentist variance estimates and make no parametric assumptions about the predictive distribution.

Member Author

Yes, that makes sense. It was not intended that the take-home message be "GPs are better than ensembles", even though it admittedly sounds a bit like that. The target function matches the assumptions of the squared-exponential kernel very nicely, which surely helps the GP. I will adapt this once I find time.

@GaelVaroquaux
Member

As a side remark, somewhat relevant to this PR :), I should say that I have always been uncomfortable with the 'predict' method of Gaussian processes, which has a different signature from the standard predict. I am afraid that this opens the door to more and more API slippage.

@mblondel
Member

I am afraid that this opens the door to more and more API slippage.

What do you propose? :)

@GaelVaroquaux
Member

What do you propose? :)

I honestly have no good solution, which is why I phrase this as a side comment. I just want to mention that, to me, this is a code smell, and whoever has a proposal here would be greatly appreciated.

I don't think that this should be an argument against merging this PR.

@jmetzen
Member Author

jmetzen commented Sep 17, 2014

@glouppe I think I got your point now. We should probably distinguish between the variance of the inferential procedure and what a fitted estimator can return as the standard deviation of the prediction. The decision tree inference procedure has high variance, but a single fitted decision tree provides no way of estimating this variance (does it?). The inference procedure for an ensemble like RF has less variance than that of an individual tree. What is returned by predict of a fitted RF/BaggingRegressor (via with_std) is related to the variance of the decision tree inference procedure. Does that make sense?

@jmetzen
Member Author

jmetzen commented Sep 17, 2014

@GaelVaroquaux One alternative to the with_std/eval_MSE option of predict would be to add a predict_proba method to regressors (just as in classifiers) which returns, for every query point, an object representing the predictive distribution, for example an instance of scipy.stats.norm. The main disadvantage of this would be the complex return type of the method. I am not sure I like it; I just mention it as an option.
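
A rough sketch of what that might look like (NormalPredictiveMixin and predict_dist are made-up names, and the sketch assumes the with_std option from this PR):

from scipy.stats import norm

class NormalPredictiveMixin(object):
    # Hypothetical mixin; relies on the with_std option proposed here.
    def predict_dist(self, X):
        # Return a frozen scipy.stats.norm over the query points.
        y_mean, y_std = self.predict(X, with_std=True)
        return norm(loc=y_mean, scale=y_std)

# Usage sketch: dist = reg.predict_dist(X_test)
#               dist.logpdf(y_true).mean()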

@mblondel
Member

@jmetzen This is an interesting idea but for RF I am not sure we can assume any parametric form for the predictive distribution...

@glouppe
Contributor

glouppe commented Sep 19, 2014

@glouppe I think I got your point now. We should probably distinguish between the variance of the inferential procedure and what a fitted estimator can return as the standard deviation of the prediction. The decision tree inference procedure has high variance, but a single fitted decision tree provides no way of estimating this variance (does it?). The inference procedure for an ensemble like RF has less variance than that of an individual tree. What is returned by predict of a fitted RF/BaggingRegressor (via with_std) is related to the variance of the decision tree inference procedure. Does that make sense?

Yes, that is exactly it. Accordingly, I would either not extend the GP example with RF/Bagging, or try to explain this as well as possible in the example.

@dsjoerg

dsjoerg commented Feb 6, 2015

Looking forward to seeing this merged someday! I am too noob to help sadly TT

@oddskool
Contributor

oddskool commented Apr 1, 2015

Joining the discussion a bit late, but wouldn't a bootstrap approach be usable here?

Such a method would be able to estimate the variance of the prediction for any estimator, at the cost of fitting the model and computing predictions on a number of resampled datasets.

I can't seem to find a proper reference for this method, but you can find a sketch here
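
Something along these lines would work for any estimator (a sketch, not an existing scikit-learn helper):

import numpy as np
from sklearn.base import clone

def bootstrap_prediction_std(estimator, X, y, X_test, n_rounds=50,
                             random_state=0):
    # Refit a fresh clone on each bootstrap resample and collect the
    # predictions; their spread estimates the prediction variance.
    rng = np.random.RandomState(random_state)
    preds = []
    for _ in range(n_rounds):
        idx = rng.randint(0, len(X), size=len(X))
        preds.append(clone(estimator).fit(X[idx], y[idx]).predict(X_test))
    preds = np.asarray(preds)
    return preds.mean(axis=0), preds.std(axis=0)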

@glouppe
Contributor

glouppe commented Oct 21, 2015

@jmetzen What shall we do with this PR? The changes to the old GPs are no longer relevant, and the comparison with forests was tricky to explain.

@mblondel
Member

Personally, I would still be interested in having this functionality in bagging estimators.

@jmetzen
Member Author

jmetzen commented Oct 21, 2015

True, some parts of this PR are now obsolete. We could still add the option "return_std" (that's how it is called in GPR now) to BaggingRegressor and RandomForestRegressor. The extra lines of code and computation are fairly modest; explaining how the interpretation of the returned value differs from the value returned by GPR's predict is slightly more involved.

@amueller's comment regarding how well-calibrated the estimate would be is also valid; I have no experience with this.

Does anyone have an idea for an example which illustrates a valid use-case of return_std for ensembles?
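
For reference, this is the shape of the API on the GPR side (return_std as it exists in the rewritten GP module; the toy data is illustrative):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

X = np.linspace(0, 10, 50).reshape(-1, 1)
y = np.sin(X).ravel()

gpr = GaussianProcessRegressor().fit(X, y)
y_mean, y_std = gpr.predict(X, return_std=True)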

@andosa

andosa commented Oct 21, 2015

Shouldn't this be done via quantile regression forests (http://www.jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf)? An implementation seems relatively straightforward: record all values in the tree leaves, not just the mean. For a prediction interval or a quantile estimate, the requested quantile can then be computed from the stored values.
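
A simplified sketch of that idea (qrf_quantile is a made-up name; it pools each tree's training targets for the leaf a query point falls into, ignores the per-leaf weighting of the original paper, and assumes X_train/y_train are numpy arrays):

import numpy as np

def qrf_quantile(forest, X_train, y_train, X_test, q=0.9):
    # apply() gives, per sample, the index of the leaf it lands in for
    # every tree: shape (n_samples, n_trees)
    train_leaves = forest.apply(X_train)
    test_leaves = forest.apply(X_test)
    out = np.empty(len(X_test))
    for i, leaves in enumerate(test_leaves):
        # Pool the training targets that share a leaf with the query point
        vals = np.concatenate([y_train[train_leaves[:, t] == leaf]
                               for t, leaf in enumerate(leaves)])
        out[i] = np.percentile(vals, 100 * q)
    return out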

@glouppe
Contributor

glouppe commented Oct 22, 2015

Closing in favour of #5532.

@glouppe glouppe closed this Oct 22, 2015