
[MRG] Prediction variance in bagging-based regressors #3645


Closed

Conversation

jmetzen
Member

@jmetzen jmetzen commented Sep 7, 2014

This addresses issue #3271. The main points are

  • Added an option with_std to the predict method of BaggingRegressor and RandomForestRegressor. If True, the standard deviation of the predictions is returned in addition to the mean.
  • Added the option with_std to GaussianProcess.predict() and deprecated eval_MSE, which returned the predictive variance.
  • Added one example comparing the predictive distributions of the three methods.

Please also check the planned deprecation path of the parameter eval_MSE in GaussianProcess.
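
For concreteness: as discussed below, the standard deviation returned by with_std=True is the spread of the individual estimators' predictions around the ensemble mean. A minimal sketch of the same computation using only the existing public attributes (the toy dataset is made up for illustration):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data, purely illustrative
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(100)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# What predict(X, with_std=True) is proposed to return: the mean and the
# standard deviation of the per-tree predictions
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
y_mean, y_std = per_tree.mean(axis=0), per_tree.std(axis=0)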

Jan Hendrik Metzen added 4 commits September 7, 2014 10:20
This is consistent with the interface of RandomForestRegressor.predict
The old way of requesting the predictive variance via eval_MSE is deprecated
@jnothman
Member

jnothman commented Sep 7, 2014

This might benefit from a mention in the narrative docs too.

@jmetzen
Member Author

jmetzen commented Sep 7, 2014

I updated the documentation of the with_std option as proposed and fixed the bug in BaggingRegressor. Regarding the narrative docs: any opinions on where this could be added?

y_mean : array of shape = [n_samples] or [n_samples, n_outputs]
The mean of the predicted values.

y_std : array of shape = [n_samples]
Contributor

Please mention that this is optional, like it is done below for GPs.

@coveralls

Coverage Status

Coverage decreased (-0.02%) when pulling 1780535 on jmetzen:variance_ensemble_regressors into 57f67d0 on scikit-learn:master.

@jmetzen
Member Author

jmetzen commented Sep 14, 2014

@glouppe I extended the example (quantitative evaluation based on MSE and log probability under the predictive distribution). The main point of the example is to illustrate the quite different predictive distributions of the ensemble-based regressors and the Gaussian Process.

@jmetzen
Member Author

jmetzen commented Sep 14, 2014

@mblondel is this what you had in mind in issue #3271?

@coveralls

Coverage Status

Coverage decreased (-0.03%) when pulling 6043345 on jmetzen:variance_ensemble_regressors into 57f67d0 on scikit-learn:master.

@mblondel
Member

Could you show what the new example looks like, for lazy reviewers like me? :)

@jmetzen
Member Author

jmetzen commented Sep 15, 2014

Sure, here is the generated figure and the terminal output:

[Figure: mean predictions and pointwise 95% confidence intervals of the three regressors]

Comparison of predictive distributions of different regressors

A simple one-dimensional, noisy regression problem addressed by three different regressors:

  1. A Gaussian Process
  2. A Random Forest
  3. A Bagging-based Regressor

The regressors are fitted on noisy observations where the magnitude of the noise at the different training points is constant and known. Plotted are both the mean and the pointwise 95% confidence interval of the predictions. The mean predictions are evaluated on noise-less test data using the mean squared error. The mean log probabilities of the noise-less test data are used to evaluate the predictive distributions (a normal distribution with the predicted mean and standard deviation) of the three regressors.

This example is based on the example gaussian_process/plot_gp_regression.py.

Mean-squared error of predictors on 1000 equidistant noise-less test datapoints:
* Random Forest: 0.78
* Bagging: 0.71
* Gaussian Process: 0.57
Mean log-probability of 1000 equidistant noise-less test datapoints under the (normal) predictive distribution of the predictors, i.e., log N(y_true| y_pred_mean, y_pred_std) [less is better]:
* Random Forest: -12.44
* Bagging: -14.20
* Gaussian Process: -26.62
In summary, the mean predictions of the Gaussian Process are slightly better than those of Random Forest and Bagging. The predictive distributions (taking into account also the predictive variance) of the Gaussian Process are considerably better.

@glouppe
Contributor

glouppe commented Sep 15, 2014

I am still not convinced by this example. It is misleading in my opinion. The mean squared error and the prediction variance of the individual trees should not be mistaken for the mean squared error and the prediction variance of the forest.


@jmetzen
Member Author

jmetzen commented Sep 15, 2014

Could you be a bit more specific about what you find misleading? What do you mean by the prediction variance of the individual trees? DecisionTreeRegressor does not define a prediction variance. Or do you mean the variance from the bias-variance decomposition of the MSE?

@glouppe
Contributor

glouppe commented Sep 16, 2014

The predictive distributions (taking into account also the predictive variance) of the Gaussian Process are considerably better.

This is where I don't agree. The variance of the predictions of the individual trees (as computed when with_std=True) does not correspond to the variance of the predictions of the ensemble. Hence it is incorrect and misleading to use this as a way to characterize the predictive distribution of the forest (at least not like that). In fact, the beauty of making an ensemble is that it erases the part of the error that is specifically due to the variance of the individual models.
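
To make the distinction concrete, here is a small sketch (with a made-up dataset) contrasting the two quantities: the spread of the individual trees' predictions, which is what with_std=True returns, versus the variability of the forest's mean prediction under refitting on resampled data:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + 0.5 * rng.randn(200)
X_test = np.linspace(0, 10, 5).reshape(-1, 1)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Spread of the individual trees (what with_std=True returns)
per_tree = np.stack([tree.predict(X_test) for tree in forest.estimators_])
std_trees = per_tree.std(axis=0)

# Variability of the forest's *mean* prediction under refitting on
# bootstrap resamples of the data, typically much smaller than std_trees
means = []
for seed in range(20):
    idx = np.random.RandomState(seed).randint(0, len(X), size=len(X))
    refit = RandomForestRegressor(n_estimators=100, random_state=seed)
    means.append(refit.fit(X[idx], y[idx]).predict(X_test))
std_forest = np.asarray(means).std(axis=0)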

@mblondel
Member

So do you think that adding with_std to random forests is a bad idea?


print "Mean log-probability of 1000 equidistant noise-less test datapoints " \
"under the (normal) predictive distribution of the predictors, i.e., " \
"log N(y_true| y_pred_mean, y_pred_std) [less is better]:"\
Member

I might be missing the point but I find it strange to evaluate non-parametric models such as RF like this.

Member Author

You mean, assuming that its predictive distribution is normal?

@glouppe
Copy link
Contributor

glouppe commented Sep 16, 2014

@mblondel: No, it may be used as a good proxy for characterizing the certainty of the predictions and identifying hard cases. Yet, caution should be taken regarding the exact interpretation of these values.

log_pdf_loss[name] = \
    norm(y_pred, sigma).logpdf(f(x)).mean()

if name == "Random Forest": # Skip because RF is very similar to Bagging
Member

If you skip RF, just don't mention it in the example description; this is confusing. Or you could also use bagging with an SVR (using an RBF or polynomial kernel), which should produce different results from RF.

Member Author

Sounds reasonable, I will change the example accordingly.

@mblondel
Member

@glouppe Could you suggest a small paragraph which would explain your point for the purpose of this example?

print "In summary, the mean predictions of the Gaussian Process are slightly "\
"better than those of Random Forest and Bagging. The predictive " \
"distributions (taking into account also the predictive variance) " \
"of the Gaussian Process are considerably better."
Member

It's hard to tell which method has better variance estimates without an objective metric, so I would rather remove this sentence. I would add in the example description that the GP models the predictive distribution in a Bayesian way, while RF and Bagging use frequentist variance estimates and make no parametric assumptions about the predictive distribution.

Member Author

Yes, that makes sense. It was not intended that the take-home message be "GPs are better than ensembles", even though it admittedly sounds a bit like that. The target function matches the assumptions of the squared-exponential kernel very nicely, which surely helps the GP. I will adapt this once I find time.

@GaelVaroquaux
Member

As a side remark, somewhat relevant to this PR :), I should say that I have always been uncomfortable with the 'predict' method of Gaussian processes, which has a different signature from the standard predict. I am afraid that this opens the door to more and more API slippage.

@mblondel
Member

I am afraid that this opens the door to more and more API slippage.

What do you propose? :)

@GaelVaroquaux
Member

What do you propose? :)

I honestly have no good solution, which is why I phrase this as a side comment. I just want to mention that, to me, this is a code smell, and whoever has a proposal here would be greatly appreciated.

I don't think that this should be an argument against merging this PR.

@jmetzen
Member Author

jmetzen commented Sep 17, 2014

@glouppe I think I got your point now. We should probably distinguish between the variance of the inferential procedure and what a fitted estimator can return as the standard deviation of the prediction. The decision tree inference procedure has high variance, but a single fitted decision tree provides no way of estimating this variance (does it?). The inference procedure for an ensemble like RF has less variance than that of an individual tree. What is returned by predict of a fitted RF/BaggingRegressor (via with_std) is related to the variance of the decision tree inference procedure. Does that make sense?

@jmetzen
Member Author

jmetzen commented Sep 17, 2014

@GaelVaroquaux One alternative to the with_std/eval_MSE option of predict would be to add a predict_proba method to regressors (just as in classifiers) which returns, for every query point, an object representing the predictive distribution, for example an instance of scipy.stats.norm. The main disadvantage of this would be the complex return type of the method. I am not sure I like it; I just mention it as an option.
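
A rough sketch of what that might look like (NormalPredictiveMixin and predict_dist are made-up names, and the sketch assumes the with_std option from this PR):

from scipy.stats import norm

class NormalPredictiveMixin(object):
    # Hypothetical mixin; relies on the with_std option proposed here.
    def predict_dist(self, X):
        # Return a frozen scipy.stats.norm over the query points.
        y_mean, y_std = self.predict(X, with_std=True)
        return norm(loc=y_mean, scale=y_std)

# Usage sketch: dist = reg.predict_dist(X_test)
#               dist.logpdf(y_true).mean()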

@mblondel
Member

@jmetzen This is an interesting idea but for RF I am not sure we can assume any parametric form for the predictive distribution...

@glouppe
Contributor

glouppe commented Sep 19, 2014

@glouppe I think I got your point now. We should probably distinguish between the variance of the inferential procedure and what a fitted estimator can return as the standard deviation of the prediction. The decision tree inference procedure has high variance, but a single fitted decision tree provides no way of estimating this variance (does it?). The inference procedure for an ensemble like RF has less variance than that of an individual tree. What is returned by predict of a fitted RF/BaggingRegressor (via with_std) is related to the variance of the decision tree inference procedure. Does that make sense?

Yes, that is exactly it. Accordingly, I would either not extend the GP example with RF/Bagging, or try to explain this as well as possible in the example.

@dsjoerg

dsjoerg commented Feb 6, 2015

Looking forward to seeing this merged someday! I am too noob to help sadly TT

@oddskool
Contributor

oddskool commented Apr 1, 2015

Joining the discussion a bit late, but wouldn't a bootstrap approach be usable here?

Such a method would be able to estimate the variance of the prediction for any estimator, at the cost of fitting the model and computing predictions on a number of resampled datasets.

I can't seem to find a proper reference for this method, but you can find a sketch here
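
Something along these lines would work for any estimator (a sketch, not an existing scikit-learn helper):

import numpy as np
from sklearn.base import clone

def bootstrap_prediction_std(estimator, X, y, X_test, n_rounds=50,
                             random_state=0):
    # Refit a fresh clone on each bootstrap resample and collect the
    # predictions; their spread estimates the prediction variance.
    rng = np.random.RandomState(random_state)
    preds = []
    for _ in range(n_rounds):
        idx = rng.randint(0, len(X), size=len(X))
        preds.append(clone(estimator).fit(X[idx], y[idx]).predict(X_test))
    preds = np.asarray(preds)
    return preds.mean(axis=0), preds.std(axis=0)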

@glouppe
Contributor

glouppe commented Oct 21, 2015

@jmetzen What shall we do with this PR? The changes to the old GPs are no longer relevant, and the comparison with forests was tricky to explain.

@mblondel
Member

Personally, I would still be interested in having this functionality in bagging estimators.

@jmetzen
Member Author

jmetzen commented Oct 21, 2015

True, some parts of this PR are now obsolete. We could still add the option "return_std" (that's how it is called in GPR now) to BaggingRegressor and RandomForestRegressor. The extra lines of code and computation are fairly modest; explaining how the interpretation of the returned value differs from the value returned by GPR's predict is slightly more involved.

@amueller's comment regarding how well-calibrated the estimate would be is also valid; I have no experience with this.

Does anyone have an idea for an example which illustrates a valid use-case of return_std for ensembles?
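
For reference, this is the shape of the API on the GPR side (return_std as it exists in the rewritten GP module; the toy data is illustrative):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

X = np.linspace(0, 10, 50).reshape(-1, 1)
y = np.sin(X).ravel()

gpr = GaussianProcessRegressor().fit(X, y)
y_mean, y_std = gpr.predict(X, return_std=True)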

@andosa

andosa commented Oct 21, 2015

Shouldn't this be done via quantile regression forests (http://www.jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf)? An implementation seems relatively straightforward: record all values in the tree leaves, not just the mean. For a prediction interval or a quantile estimate, the requested quantile can then be computed from the stored values.
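
A simplified sketch of that idea (qrf_quantile is a made-up name; it pools each tree's training targets for the leaf a query point falls into, ignores the per-leaf weighting of the original paper, and assumes X_train/y_train are numpy arrays):

import numpy as np

def qrf_quantile(forest, X_train, y_train, X_test, q=0.9):
    # apply() gives, per sample, the index of the leaf it lands in for
    # every tree: shape (n_samples, n_trees)
    train_leaves = forest.apply(X_train)
    test_leaves = forest.apply(X_test)
    out = np.empty(len(X_test))
    for i, leaves in enumerate(test_leaves):
        # Pool the training targets that share a leaf with the query point
        vals = np.concatenate([y_train[train_leaves[:, t] == leaf]
                               for t, leaf in enumerate(leaves)])
        out[i] = np.percentile(vals, 100 * q)
    return out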

@glouppe
Contributor

glouppe commented Oct 22, 2015

Closing in favour of #5532.

@glouppe glouppe closed this Oct 22, 2015