[MRG] Prediction variance in bagging-based regressors #3645
Conversation
…t and Bagging regressors
This is consistent with the interface of RandomForestRegressor.predict. The old way of requesting the predictive variance via eval_MSE is deprecated.
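For reference, a minimal sketch of the proposed call (the with_std keyword is what this PR proposes and is not part of any released API; data and estimator below are illustrative):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X).ravel()

reg = BaggingRegressor(random_state=0).fit(X, y)

y_mean = reg.predict(X)                        # default: mean only, as before
y_mean, y_std = reg.predict(X, with_std=True)  # proposed: mean plus per-sample std
```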
This might benefit from a mention in the narrative docs too.
I updated the documentation of the with_std option as proposed and fixed the bug in BaggingRegressor. Regarding the narrative docs: any opinions on where this could be added?
y_mean : array of shape = [n_samples] or [n_samples, n_outputs]
    The mean of the predicted values.

y_std : array of shape = [n_samples]
Please mention that this is optional, like it is done below for GPs.
@glouppe I extended the example (quantitative evaluation based on MSE and log probability under the predictive distribution). The main point of the example is to illustrate the quite different predictive distributions of the ensemble-based regressors and the Gaussian Process.
Could you show what the new example looks like, for lazy reviewers like me? :)
I am still not convinced by this example. It is misleading in my opinion.
Could you be a bit more specific about what you find misleading? What do you mean by the prediction variance of the individual trees? DecisionTreeRegressor does not define a prediction variance. Or do you mean the variance from the bias-variance decomposition of the MSE?
This is where I don't agree. The variance of the predictions of the individual trees (as computed when …
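For concreteness, a minimal sketch of the quantity presumably under discussion (assuming a fitted forest-style ensemble; the per-estimator feature subsets used by bagging are ignored here for simplicity):

```python
import numpy as np

def ensemble_mean_std(ensemble, X):
    # Predictions of each fitted ensemble member on the query points.
    per_member = np.array([est.predict(X) for est in ensemble.estimators_])
    # The mean is the usual ensemble prediction; the std is the spread of
    # the individual members' predictions around that mean.
    return per_member.mean(axis=0), per_member.std(axis=0)
```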
So do you think that adding …
print "Mean log-probability of 1000 equidistant noise-less test datapoints " \
      "under the (normal) predictive distribution of the predictors, i.e., " \
      "log N(y_true | y_pred_mean, y_pred_std) [higher is better]:"
I might be missing the point, but I find it strange to evaluate non-parametric models such as RF like this.
You mean, assuming that its predictive distribution is normal?
@mblondel: No, it may be used as a good proxy for characterizing the certainty of the predictions and for identifying hard cases. Yet, caution should be taken regarding the exact interpretation of these values.
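As a small illustration of that proxy use (a hypothetical helper; it assumes an estimator exposing the with_std keyword proposed in this PR):

```python
import numpy as np

def hardest_cases(regressor, X, n=10):
    # Rank query points by predictive std; a large std flags a "hard" case.
    y_mean, y_std = regressor.predict(X, with_std=True)
    return np.argsort(y_std)[::-1][:n]  # indices of the n most uncertain points
```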
log_pdf_loss[name] = \
    norm(y_pred, sigma).logpdf(f(x)).mean()
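For context, here is the metric computed in the snippet above, written as a self-contained helper (the helper name is illustrative):

```python
from scipy.stats import norm

def mean_log_probability(y_true, y_pred_mean, y_pred_std):
    # Mean log-density of the true targets under the normal predictive
    # distribution N(y_pred_mean, y_pred_std); higher values are better.
    return norm(y_pred_mean, y_pred_std).logpdf(y_true).mean()
```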
if name == "Random Forest": # Skip because RF is very similar to Bagging |
If you skip RF, just don't mention it in the example description; this is confusing. Or you could use bagging with an SVR (using an RBF or polynomial kernel), which should produce different results from RF.
Sounds reasonable, I will change the example accordingly.
@glouppe Could you suggest a small paragraph that would explain your point for the purpose of this example?
print "In summary, the mean predictions of the Gaussian Process are slightly "\ | ||
"better than those of Random Forest and Bagging. The predictive " \ | ||
"distributions (taking into account also the predictive variance) " \ | ||
"of the Gaussian Process are considerably better." |
It's hard to tell which method has better variance estimates without an objective metric, so I would rather remove this sentence. I would add in the example description that the GP models the predictive distribution in a Bayesian way, while RF and Bagging use frequentist variance estimates, and that no parametric assumptions are made on the predictive distribution.
Yes, that makes sense. It was not intended that the take-home message be "GPs are better than ensembles", even though it admittedly sounds a bit like that. The target function matches the assumptions of the squared-exponential kernel very nicely, which surely helps the GP. I will adapt this once I find time.
As a side remark, somewhat relevant to this PR :), I should say that I have always been uncomfortable with the predict method of Gaussian processes, which has a different signature than the standard predict. I am afraid that this opens the door to more and more API slippage.
What do you propose? :)
I honestly have no good solution; this is why I phrase this as a side remark. I don't think that this should be an argument against merging this PR.
@glouppe I think I got your point now. We should probably distinguish between the variance of the inference procedure and what a fitted estimator can return as the standard deviation of its predictions. The decision tree inference procedure has high variance, but a single fitted decision tree provides no way of estimating this variance (does it?). The inference procedure for an ensemble like RF has less variance than that of an individual tree. What is returned by predict of a fitted RF/BaggingRegressor (via with_std) is related to the variance of the decision tree inference procedure. Does that make sense?
@GaelVaroquaux One alternative to the with_std/eval_MSE option of predict could be to add a predict_proba method to regressors (just as in classifiers) which returns, for every query point, an object representing the predictive distribution. For example, an instance of scipy.stats.norm.
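A hypothetical sketch of that idea (nothing like this exists in scikit-learn; the mixin name and the with_std keyword are assumptions of this thread):

```python
from scipy.stats import norm

class NormalPredictiveDistributionMixin(object):
    """Hypothetical mixin giving regressors a predict_proba-style method."""

    def predict_proba(self, X):
        # One frozen normal distribution per query point, built from the
        # mean/std pair that predict(..., with_std=True) would expose.
        y_mean, y_std = self.predict(X, with_std=True)
        return [norm(loc=m, scale=s) for m, s in zip(y_mean, y_std)]
```

A caller could then evaluate, for example, dist.logpdf(y_true) or dist.interval(0.95) on each returned distribution.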
@jmetzen This is an interesting idea, but for RF I am not sure we can assume any parametric form for the predictive distribution...
Yes, that is exactly it. Accordingly, I would either not extend the GP example with RF/Bagging, or try to explain this as well as possible in the example.
Looking forward to seeing this merged in someday! I am too noob to help, sadly TT
Joining the discussion a bit late, but wouldn't a bootstrap approach be usable here? Such a method would be able to estimate the variance of the prediction for any estimator, at the cost of fitting the model and computing predictions on a number of resampled datasets. I can't seem to find a proper reference for this method, but you can find a sketch here.
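A minimal sketch of that bootstrap idea (the function name and defaults are illustrative; it works for any scikit-learn estimator but refits the model n_rounds times):

```python
import numpy as np
from sklearn.base import clone

def bootstrap_prediction_std(estimator, X_train, y_train, X_test,
                             n_rounds=50, random_state=0):
    # Refit a fresh clone on each bootstrap resample of the training set
    # and take the spread of the resulting predictions as a variance proxy.
    rng = np.random.RandomState(random_state)
    preds = []
    for _ in range(n_rounds):
        idx = rng.randint(0, len(X_train), size=len(X_train))  # with replacement
        preds.append(clone(estimator).fit(X_train[idx], y_train[idx])
                                     .predict(X_test))
    return np.std(preds, axis=0)
```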
@jmetzen What shall we do with this PR? The changes to the old GPs are no longer relevant, and the comparison with forests was tricky to explain.
Personally, I would still be interested in having this functionality in bagging estimators.
True, some parts of this PR are now obsolete. We could still add the option return_std (that's how it is called in GPR now) to BaggingRegressor and RandomForestRegressor. The extra LOCs and computation are fairly modest; explaining how the interpretation of the returned value differs from the value returned by GPR's predict is slightly more involved. @amueller's comment regarding how well-calibrated the estimate would be is also valid; I have no experience with this. Does anyone have an idea for an example that illustrates a valid use case of return_std for ensembles?
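For illustration, a rough sketch of what return_std could compute for a fitted BaggingRegressor (this is not the PR's actual diff; multi-output targets are ignored):

```python
import numpy as np

def bagging_predict_return_std(bagging, X):
    # Each base estimator was trained on its own feature subset, recorded
    # in estimators_features_, so predict on the matching columns only.
    all_pred = np.array([est.predict(X[:, feats])
                         for est, feats in zip(bagging.estimators_,
                                               bagging.estimators_features_)])
    return all_pred.mean(axis=0), all_pred.std(axis=0)
```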
Shouldn't this be done via quantile regression forests (http://www.jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf)? |
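For completeness, a crude sketch in the spirit of Meinshausen's quantile regression forests, weighting training targets by how often they share a leaf with the query point (the published method differs in its details; all names here are illustrative):

```python
import numpy as np

def forest_quantiles(forest, X_train, y_train, X_test,
                     quantiles=(0.05, 0.5, 0.95)):
    train_leaves = forest.apply(X_train)  # leaf id per sample and per tree
    test_leaves = forest.apply(X_test)
    order = np.argsort(y_train)
    y_sorted = y_train[order]
    out = np.empty((len(X_test), len(quantiles)))
    for i, leaves in enumerate(test_leaves):
        # Weight each training point by the fraction of trees in which it
        # lands in the same leaf as the query point, then read quantiles
        # off the weighted empirical CDF of the training targets.
        weights = (train_leaves == leaves).mean(axis=1)[order]
        cdf = np.cumsum(weights) / weights.sum()
        out[i] = [y_sorted[np.searchsorted(cdf, q)] for q in quantiles]
    return out
```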
Closing in favour of #5532. |
This addresses issue #3271. The main points are the new with_std option of predict, consistent with the interface of RandomForestRegressor.predict, and the deprecation of the old eval_MSE way of requesting the predictive variance. Please also check the planned deprecation path of the parameter eval_MSE in GaussianProcess.