Description
Classifiers have a predict_proba method that makes it possible to quantify probabilistically the certainty in the predictions for a given input X_i.
Currently, most regressors in scikit-learn only predict a conditional expectation E[Y|X]. Some have a return_std option that also makes it possible to estimate sqrt(Var[Y|X]), which can be used to quantify the certainty when assuming a Gaussian predictive distribution (typically for Gaussian processes, which estimate a Gaussian posterior predictive distribution).
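As a minimal sketch of what return_std already allows, the snippet below fits a Gaussian process on synthetic data and converts the predicted mean and standard deviation into an array of quantiles. The data, the kernel choice and the quantile levels are illustrative assumptions, and the conversion is only meaningful under the Gaussian assumption mentioned above.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic 1D regression data (illustrative only).
rng = np.random.RandomState(0)
X = rng.uniform(0, 5, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=40)

gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), random_state=0)
gpr.fit(X, y)

X_new = np.array([[1.0], [2.5], [4.0]])
y_mean, y_std = gpr.predict(X_new, return_std=True)

# Under a Gaussian predictive distribution, any quantile is
# mean + std * z_q, where z_q is the standard normal quantile.
quantiles = np.array([0.05, 0.5, 0.95])
y_quantiles = y_mean[:, None] + y_std[:, None] * norm.ppf(quantiles)[None, :]
# y_quantiles has shape (n_samples, n_quantiles)
```

This only works for estimators whose predictive distribution is (assumed to be) Gaussian, which is part of the motivation for a more general API.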
We do have pointwise quantile estimators (linear models, gradient boosting, hist gradient boosting) where the predict method returns a single point estimate for a target quantile passed as a hyper-parameter, instead of estimating an expectation.
Several people have expressed the need for a more generic API that can return an array of quantile estimates for a given input X_i.
The goal of this issue is to centralize the discussion of an API extension that would make this possible more uniformly in scikit-learn, either via a meta-estimator that wraps an array of pointwise quantile estimators to turn them into a quantile-array estimator, or by having the base estimators do this directly (and sometimes more efficiently).
A non-exhaustive list of related PRs and issues (feel free to add or suggest new ones):
- quantile regression with dynamic or multiple quantiles #19851
- Draft implementation of non-parametric quantile methods (RF, Extra Trees and Nearest Neighbors) #19754
- Show the std of parameters posterior distribution for Bayesian ridge regression #20964
Also related:
- conformal predictions: https://github.com/scikit-learn-contrib/MAPIE
- XGBoostLSS (location, scale and shape) https://github.com/StatMixedML/XGBoostLSS
Furthermore, for models like Poisson regression that make a specific assumption about the conditional distribution of Y|X, it would be possible to return estimates of the inverse-CDF values of the estimated Y|X distribution, for instance. Those could probably also benefit from an expanded API.
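As an illustration of the Poisson case, the sketch below does by hand what an expanded API could expose: it takes the conditional mean predicted by PoissonRegressor and feeds it to the Poisson inverse CDF from SciPy. The synthetic data and the quantile levels are assumptions made for the example, and the result is only as good as the Poisson assumption itself.

```python
import numpy as np
from scipy.stats import poisson
from sklearn.linear_model import PoissonRegressor

# Synthetic count data with a log-linear mean (illustrative only).
rng = np.random.RandomState(0)
X = rng.uniform(0, 2, size=(300, 1))
y = rng.poisson(lam=np.exp(0.5 + 1.0 * X.ravel()))

reg = PoissonRegressor(alpha=0.0).fit(X, y)
mu = reg.predict(X[:4])  # estimated E[Y|X], the Poisson rate

# Inverse CDF of Poisson(mu) at each quantile level, broadcast to
# shape (n_samples, n_quantiles).
quantiles = np.array([0.05, 0.5, 0.95])
y_quantiles = poisson.ppf(quantiles[None, :], mu[:, None])
```

Since the Poisson distribution is discrete, the returned quantiles are integer-valued counts.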
If we do this, we then have the side question of how to evaluate such multi-quantile models. We could probably extend the pinball loss scorer (mean_pinball_loss) to average the pinball scores over an array of quantiles, for instance.