RandomForestRegressor quantile Criterion #18540
There's no plan to include a quantile criterion as far as I know (unless @lorentzenchr had this in his backlog?). But I think we discussed implementing a quantile loss in HistGradientBoostingRegressor before. There was also #11086.
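For reference, a pinball/quantile loss already exists in the non-histogram `GradientBoostingRegressor` (`loss="quantile"` with the `alpha` parameter, available in 0.23.2). A minimal sketch; the dataset and quantile levels here are illustrative:

```python
# One model per quantile; the 5% and 95% models together bound an
# interval in the quantile-estimate sense discussed below.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)

gbr_lo = GradientBoostingRegressor(loss="quantile", alpha=0.05).fit(X, y)
gbr_hi = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X, y)

y_lo = gbr_lo.predict(X[:5])  # estimated 5% quantile of y given x
y_hi = gbr_hi.predict(X[:5])  # estimated 95% quantile of y given x
```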
I don't have it on my personal roadmap, but I'd be interested in exploring the idea. The MAE loss for trees is non-trivial. How much code change would it require? Is there some elegant way? If we decide to have quantile regression, with which model type would we start? Note that quantiles do NOT give you prediction intervals, only estimates/predictions of the target distribution, i.e. they do not account for the estimation error. At least, that is my understanding.
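For context, the pinball loss is the objective such a quantile criterion would minimize; a minimal sketch (the helper name and toy data are illustrative), showing that tau = 0.5 recovers half the MAE, which is why MAE corresponds to median regression:

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Quantile (pinball) loss: under-predictions are weighted by tau,
    over-predictions by (1 - tau)."""
    residual = y_true - y_pred
    return np.mean(np.where(residual >= 0, tau * residual, (tau - 1) * residual))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.5, 1.5])

# tau = 0.5 gives half the mean absolute error.
assert np.isclose(
    pinball_loss(y_true, y_pred, tau=0.5),
    0.5 * np.mean(np.abs(y_true - y_pred)),
)
```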
On top of estimation error (overfitting), there could also be approximation error (underfitting) caused by the inductive bias of the trees and their hyperparameters (for instance with shallow trees).
My 2 cents here: I'm working on that subject. I found a paper mentioning two types of uncertainties when creating prediction intervals: model uncertainty and data uncertainty.
They create overall prediction intervals by summing the contributions from both uncertainties, estimated with bootstrap sampling. Note: I do not understand exactly the scientific foundations behind this distinction they make between model and data uncertainty; I'd be happy to find references on this if anyone has some.
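A rough sketch of the bootstrap idea described above, under simplifying assumptions (Gaussian, independent error components; training-set residuals as a stand-in for data noise, which will be optimistic for deep forests). This is not the paper's exact procedure:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)
rng = np.random.RandomState(0)

# Model uncertainty: refit on bootstrap resamples and look at the
# spread of the predictions.
B = 20
preds = []
for b in range(B):
    idx = rng.randint(0, len(X), len(X))  # bootstrap resample
    model = RandomForestRegressor(n_estimators=50, random_state=b)
    preds.append(model.fit(X[idx], y[idx]).predict(X[:10]))
preds = np.asarray(preds)  # shape (B, 10)
model_std = preds.std(axis=0)

# Data uncertainty: residual spread, here crudely estimated on the
# training set (out-of-bag or held-out residuals would be less biased).
base = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
resid_std = np.std(y - base.predict(X))

# Sum the two contributions (as variances) for an overall interval.
total_std = np.sqrt(model_std**2 + resid_std**2)
center = preds.mean(axis=0)
lower, upper = center - 1.96 * total_std, center + 1.96 * total_std
```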
In order to generally predict a confidence interval, we would need to store the training samples within each leaf node. This is what's done in quantile regression forests, for instance. There is a trade-off to doing so, since it makes the stored trees much bigger on disk/in RAM. However, I could see the appeal in low-sample-size settings (i.e. < 10k samples), where random forests already excel. I think it's an important issue. I don't see a way to do it without storing the leaf node samples, but I could be wrong.
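A minimal sketch of that idea built on the existing estimator, using `apply()` to recover which training targets fall into each leaf. Pooling the raw leaf samples like this is only an approximation of Meinshausen's quantile regression forests (which weight samples by leaf membership frequency), and the dataset is illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, noise=10.0, random_state=0)
rf = RandomForestRegressor(
    n_estimators=100, min_samples_leaf=20, random_state=0
).fit(X, y)

# Leaf index of every training sample in every tree: (n_samples, n_trees).
# Keeping this around is exactly the extra storage cost mentioned above.
train_leaves = rf.apply(X)

def predict_quantile(rf, X_new, q):
    """Pool the training targets from the leaves X_new falls into,
    per tree, and return the q-th quantile of the pooled values."""
    new_leaves = rf.apply(X_new)  # (n_new, n_trees)
    out = np.empty(len(X_new))
    for i in range(len(X_new)):
        pooled = np.concatenate([
            y[train_leaves[:, t] == new_leaves[i, t]]
            for t in range(rf.n_estimators)
        ])
        out[i] = np.quantile(pooled, q)
    return out

lower = predict_quantile(rf, X[:5], 0.05)
upper = predict_quantile(rf, X[:5], 0.95)
```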
For `sklearn` version 0.23.2, the parameter `criterion` of `RandomForestRegressor` only supports `mse` and `mae`. Is there any plan to extend these two options to a general quantile (e.g. 95% quantile, 25% quantile, etc.), such that a prediction interval can be made for random forest regression?
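In the meantime, a crude workaround within the current API is to take empirical quantiles over the per-tree predictions. Note that this only captures the spread of the ensemble, not the conditional distribution of the target, so the resulting intervals tend to be too narrow:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Per-tree predictions: shape (n_trees, n_points).
per_tree = np.stack([tree.predict(X[:5]) for tree in rf.estimators_])
lower = np.quantile(per_tree, 0.05, axis=0)
upper = np.quantile(per_tree, 0.95, axis=0)
```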