API to predict multiple quantiles at once #23334
Thanks for creating this issue.
My thoughts on this problem-space: as you mention, I think that we are better off focusing on an API for quantiles, and moving the code that we have on std to quantiles (in GPs, for instance). The reason is that quantiles are less parametric and more universal.
About names, MAPIE uses the term "prediction intervals", which is nice because it can be used either in a frequentist or a Bayesian context (instead of confidence vs credible intervals) and makes it explicit that this is for a specific prediction. However "interval" is annoying because we might want to grid the inverse-CDF of the predictive distribution instead of using a pair of quantiles. So maybe we should never speak about intervals in our API (and maybe we should focus on names such as "prediction quantiles" or "gridded inverse-CDF of the predictive distribution").

Then we have the problem of when to pass the quantiles: some estimators require them at fit time, and therefore the quantiles should be passed as constructor parameters. Others can accept them at predict time. I think we should strive to find an API that can handle both cases. For the former, something like:

```python
regressor = FitTimeQuantileAwareRegressor(quantiles=[0.05, 0.5, 0.95])
regressor.fit(X_train, y_train_observations)
y_test_quantiles = regressor.predict(X)
# or with a dedicated method:
y_test_quantiles = regressor.predict_quantiles(X)
```

If the latter, what would it look like? Perhaps:

```python
regressor = PredictTimeQuantileAwareRegressor()
regressor.fit(X_train, y_train_observations)
y_pred, y_quantiles = regressor.predict(X_test, return_quantiles=[0.05, 0.5, 0.95])
# or with a dedicated method:
y_quantiles = regressor.predict_quantiles(X_test, quantiles=[0.05, 0.5, 0.95])
```
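To illustrate the predict-time variant with estimators that exist today: a model of the full predictive distribution can produce arbitrary quantiles after fitting. The sketch below assumes a Gaussian posterior from `GaussianProcessRegressor`; the helper `gp_predict_quantiles` is hypothetical, not an existing scikit-learn API.

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import make_regression
from sklearn.gaussian_process import GaussianProcessRegressor


def gp_predict_quantiles(gp, X, quantiles=(0.05, 0.5, 0.95)):
    """Hypothetical helper: read quantiles off a GP's Gaussian posterior.

    Returns an array of shape (n_samples, n_quantiles).
    """
    mean, std = gp.predict(X, return_std=True)
    # Inverse-CDF of N(mean, std) at each requested level, per sample.
    return np.column_stack([norm.ppf(q, loc=mean, scale=std) for q in quantiles])


X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
gp = GaussianProcessRegressor(alpha=1e-1).fit(X, y)
print(gp_predict_quantiles(gp, X[:10]).shape)  # (10, 3)
```

This is why the predict-time signature is attractive for such models: the quantile levels need not be known when `fit` is called.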
I was just going through this in my head again and came up with a very similar API. I agree with @ogrisel's suggestions, and I agree that doing it based on quantiles seems good. We should deprecate the …
AFAIU, our API is general enough to support multiple outputs from `predict`. If a model predicts quantiles, I guess a keyword like … would be needed.

I would not, however, mix quantiles and point estimates of the expectation/mean, because if you specify two quantile levels, say 25% and 75%, the estimated expectation might still lie outside this interval. In this case, I would propose to use expectiles of different probability levels instead.

I guess the problem is then to pass this meta-info ("hi there, here come 3 quantiles for levels 5%, 50% and 95%") to scorers and other model diagnostic tools.
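A tiny worked example of why mixing quantiles and the mean is risky: for a sufficiently skewed sample, the mean falls outside the 25%–75% quantile interval (plain numpy, not tied to any estimator).

```python
import numpy as np

# Heavily right-skewed sample: nine small values and one large outlier.
y = np.array([0.0] * 9 + [100.0])
q25, q75 = np.quantile(y, [0.25, 0.75])
print(q25, q75, y.mean())  # 0.0 0.0 10.0 -> the mean is outside [q25, q75]
```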
To be precise, …
Personally, it would be great to see the scikit-learn package take some direction on what a future multi-quantile API might look like. Scikit-learn has such great influence that it basically sets the standard for Python ML best practice. Here are a few packages with differing APIs based on scikit-learn: …

Other well-known packages using quantiles, such as MAPIE and LightGBM, often resort to looping over a list of quantiles. NGBoost, PGBM and XGBoost-Distribution could all easily add multi-quantile predictions to their scikit-learn APIs - my guess is that the main reason they haven't is that there is no standard API coming from scikit-learn on how best to do this! I believe a …

I would love to hear your input on this, as I am looking to work on some of these projects and think it would be good to conform to a somewhat standard multi-quantile API.
It integrates with …. The specific multiple-quantiles interface that is proposed here - and widely in use with … - …
Perhaps to add, where …. Instead, the architectural principle adopted is one of mini-packages, i.e. dependency management at the level of the individual estimator. For instance, MAPIE proper is maintained in its own repository, and so is …. Of course, there are also natively implemented estimators a la vanilla ….
Thanks @fkiraly
As such, I think I will adopt this for my use case: XGBoostLSS and then LightGBMLSS. It would be great if the sklearn core team adopted the skpro API as the longer-term solution.
Could you be more concrete? For which model/estimator do you propose which extension of its API?
Updated my previous comment for added clarity. Essentially, skpro's API meets the requirements for quantile and/or distribution regression on top of the sklearn API. @fkiraly Correct me if I am wrong, but it seems that the only thing skpro does not satisfy (that is mentioned in the issue) is something like the FitTimeQuantileAwareRegressor.
There was in fact one such case, namely the …. The solution we adopted was: …
An alternative solution discussed was to allow passing parameters such as …. We decided against it, as there are multiple ….
Here's an interesting factoid which imo highlights the importance of having both (a) a clear interface definition and (b) stringent tests for it. Some …
just wondering - is this discussion stale, or what are the next steps?
Ping …. In case it helps, if you were to make me a core dev, I'd be happy to devote substantial time to fold the API into scikit-learn.
I'm not 100% convinced of introducing such a method. For quantile regressors, @ogrisel wrote in the initial statement: …
@scikit-learn/core-devs ping for API discussion.
I'm +1 with the API proposed here: #23334 (comment). Now that we have metadata routing, we can properly support …
Classifiers have a `predict_proba` method that makes it possible to probabilistically quantify the certainty in the predictions for a given input `X_i`.

Currently most regressors in scikit-learn only predict a conditional expectation E[Y|X], and some have a `return_std` option that also makes it possible to estimate sqrt(VAR[Y|X]), which can be used to quantify the certainty when assuming a Gaussian predictive distribution (typically for Gaussian processes, which estimate a Gaussian predictive posterior distribution).

We do have pointwise quantile estimators (linear models, gradient boosting, hist gradient boosting) where the `predict` method returns a single point estimate for a target quantile passed as a hyper-parameter instead of estimating the expectation.

Several people have expressed the need for a more generic API that can return an array of quantile estimates for a given input `X_i`.

The goal of this issue is to centralize the discussion of an API extension to be able to do this more uniformly in scikit-learn, either via a meta-estimator that wraps an array of pointwise quantile estimators to turn them into a quantile-array estimator, or by having the base estimators do this directly (and sometimes more efficiently).
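To make the meta-estimator option concrete, here is a minimal sketch, not a proposed implementation: the class name `MultiQuantileRegressor` and its `predict_quantiles` method are made up, and it assumes `GradientBoostingRegressor` with `loss="quantile"` as the wrapped pointwise estimator.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor


class MultiQuantileRegressor(RegressorMixin, BaseEstimator):
    """Illustrative meta-estimator: one pointwise quantile model per level."""

    def __init__(self, estimator=None, quantiles=(0.05, 0.5, 0.95)):
        self.estimator = estimator
        self.quantiles = quantiles

    def fit(self, X, y):
        base = self.estimator
        if base is None:
            base = GradientBoostingRegressor(loss="quantile")
        # One clone of the base estimator per requested quantile level.
        self.estimators_ = [
            clone(base).set_params(alpha=q).fit(X, y) for q in self.quantiles
        ]
        return self

    def predict_quantiles(self, X):
        # Shape (n_samples, n_quantiles), one column per quantile level.
        return np.column_stack([est.predict(X) for est in self.estimators_])

    def predict(self, X):
        # Point prediction: use the estimator whose level is closest to 0.5.
        idx = int(np.argmin(np.abs(np.asarray(self.quantiles) - 0.5)))
        return self.estimators_[idx].predict(X)


X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
model = MultiQuantileRegressor().fit(X, y)
print(model.predict_quantiles(X[:5]).shape)  # (5, 3)
```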
A non-exhaustive list of related PRs and issues (feel free to add or suggest new ones):
Also related:
Furthermore, for models like Poisson regression that make a specific assumption about the conditional Y|X distribution, it would be possible to compute inverse-CDF values of the estimated Y|X distribution, for instance. Those models could probably also benefit from an expanded API.
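As an illustration of that point (my own sketch, not an existing API): once `PoissonRegressor` has estimated the conditional mean, quantiles of the assumed Poisson predictive distribution can be read off its inverse CDF with scipy:

```python
import numpy as np
from scipy.stats import poisson
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = rng.poisson(lam=np.exp(X @ np.array([0.3, -0.2, 0.1])))

reg = PoissonRegressor().fit(X, y)
mu = reg.predict(X[:5])  # estimated conditional mean E[Y|X]

# Quantiles of the assumed Poisson(mu) predictive distribution, shape (5, 3).
print(np.column_stack([poisson.ppf(q, mu) for q in (0.05, 0.5, 0.95)]))
```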
If we do this, then we have the side question of how to evaluate such multi-quantile models. We could probably extend the pinball_loss scorer to average the pinball scores for an array of quantiles for instance.
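As a sketch of that idea (the function name `mean_multi_pinball_loss` is made up; only `mean_pinball_loss` exists in scikit-learn today), one could simply average the existing metric over the requested levels, assuming `y_pred_quantiles` has shape (n_samples, n_quantiles):

```python
import numpy as np
from sklearn.metrics import mean_pinball_loss


def mean_multi_pinball_loss(y_true, y_pred_quantiles, quantiles):
    """Hypothetical metric: average the existing pinball loss over levels."""
    return float(np.mean([
        mean_pinball_loss(y_true, y_pred_quantiles[:, i], alpha=q)
        for i, q in enumerate(quantiles)
    ]))
```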
/cc @GaelVaroquaux @amueller @lorentzenchr