RandomForestRegressor quantile Criterion #18540

Open
newTypeGeek opened this issue Oct 6, 2020 · 5 comments

Comments

@newTypeGeek

For sklearn version 0.23.2, the parameter criterion of RandomForestRegressor only supports mse and mae. Is there any plan to extend these two options to a general quantile (e.g. the 95% quantile, the 25% quantile, etc.), so that a prediction interval can be constructed for random forest regression?
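
For reference, such a criterion would minimize the pinball (quantile) loss rather than the squared or absolute error. A minimal sketch of the loss itself (not a tree criterion), assuming NumPy:

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for quantile level q in (0, 1).

    Underestimates are weighted by q and overestimates by (1 - q),
    so minimizing it pushes y_pred toward the q-th conditional quantile.
    """
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

# The constant minimizer of this loss over a sample is (approximately)
# the empirical q-th quantile of that sample.
rng = np.random.default_rng(0)
y = rng.normal(size=1000)
print(pinball_loss(y, np.quantile(y, 0.95), q=0.95))
```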

@NicolasHug
Member

NicolasHug commented Oct 6, 2020

There's no plan to include a quantile criterion as far as I know (unless @lorentzenchr had this in his backlog?)

But I think we discussed implementing a quantile loss in HistGradientBoostingRegressor before. There was also #11086
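
For context, GradientBoostingRegressor already exposes a pinball loss via loss="quantile", so quantile predictions with boosted trees are possible today; a minimal sketch (an approximate interval, not a RandomForestRegressor criterion), assuming scikit-learn 0.23+:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# One model per quantile level: 5% and 95% bounds.
lower = GradientBoostingRegressor(loss="quantile", alpha=0.05).fit(X, y)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X, y)

print(list(zip(lower.predict(X[:3]), upper.predict(X[:3]))))
```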

@NicolasHug NicolasHug changed the title RandomForestRegressor Criterion RandomForestRegressor quantile Criterion Oct 6, 2020
@lorentzenchr
Member

I don't have it on my personal roadmap, but I'd be interested in exploring the idea. The MAE loss for trees is already non-trivial. How much code change would it require? Is there some elegant way? If we decide to have quantile regression, with which model type would we start?

Note that quantiles do NOT give you prediction intervals, only estimates/predictions of the target distribution, i.e. they do not account for the estimation error. At least, that is my understanding.

@ogrisel
Member

ogrisel commented Feb 12, 2021

Note that quantiles do NOT give you prediction intervals, only estimates/predictions of the target distribution, i.e. they do not account for the estimation error. At least, that is my understanding.

On top of the estimation error (overfitting), there could also be an approximation error (underfitting) caused by the inductive bias of the trees and their hyperparameters (for instance with shallow trees).

@ThomasBourgeois

My 2 cents here: I'm working on that subject. I found a paper mentioning two types of uncertainty when creating prediction intervals: model uncertainty and data uncertainty.

They create overall prediction intervals by summing the contributions from both uncertainties, estimated with bootstrap sampling, roughly as in the sketch below.

Note: I do not understand exactly the scientific foundations behind this distinction between model and data uncertainty; I'd be happy to find references on this if anyone has some.
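
Not the paper's exact procedure, but a rough sketch of the bootstrap idea described above (model uncertainty from refitting on resampled data, data uncertainty from a residual noise term), assuming scikit-learn and NumPy:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.utils import resample

X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)
x_new = X[:5]  # points to predict

rng = np.random.default_rng(0)
preds = []
for b in range(50):
    # Model uncertainty: refit on a bootstrap resample of the training data.
    Xb, yb = resample(X, y, random_state=b)
    model = RandomForestRegressor(n_estimators=100, random_state=b).fit(Xb, yb)
    # Data uncertainty: add noise drawn from the residuals
    # (in-sample residuals understate the noise; a held-out set would be better).
    residuals = yb - model.predict(Xb)
    preds.append(model.predict(x_new) + rng.choice(residuals, size=len(x_new)))

preds = np.asarray(preds)
lower, upper = np.percentile(preds, [2.5, 97.5], axis=0)
print(np.c_[lower, upper])
```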

@adam2392
Member

adam2392 commented Jul 2, 2024

In order to generally predict a confidence interval, we would need to store the training samples within each leaf node. This is what's done in quantile-forest and scikit-tree when max_samples_leaf>1.

There is a trade-off to doing so, since it makes the stored trees much bigger on disk and in RAM. However, I could see the appeal in low-sample-size settings (i.e. < 10k samples), where random forests already excel.

I think it's an important issue. I don't see a way to do it without storing the leaf node samples, but I could be wrong.
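
For illustration, here is roughly what that leaf-sample approach looks like when emulated on top of a fitted RandomForestRegressor via its apply() method. This is a sketch of the Meinshausen-style quantile forest idea, not the quantile-forest package's actual implementation, and it ignores per-tree bootstrap (in-bag) membership for brevity:

```python
import numpy as np
from collections import defaultdict
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Store the training targets that fall into each (tree, leaf) pair.
# This is the memory cost mentioned above: every leaf keeps its samples.
leaf_values = defaultdict(list)
train_leaves = rf.apply(X)  # shape (n_samples, n_trees), leaf index per tree
for i, row in enumerate(train_leaves):
    for t, leaf in enumerate(row):
        leaf_values[(t, leaf)].append(y[i])

def predict_quantile(X_new, q):
    """Pool the stored training targets from the leaves each point falls
    into, and return the empirical q-th quantile of that pool."""
    out = []
    for row in rf.apply(X_new):
        pooled = np.concatenate([leaf_values[(t, leaf)] for t, leaf in enumerate(row)])
        out.append(np.quantile(pooled, q))
    return np.asarray(out)

print(predict_quantile(X[:3], 0.05))
print(predict_quantile(X[:3], 0.95))
```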
