Describe the workflow you want to enable
Scikit-learn has `predict` and `predict_proba` methods for classifiers but only a `predict` method for regressors, with quantile output available as an option. Scikit-learn is adding more quantile output functionality (`HistGradientBoostingRegressor` and `QuantileRegressor`), and no doubt more will come in due course. The single `quantile` parameter is set at the class init step.
LightGBM and other packages also follow a similar API.
MAPIE allows `alpha` to be set at class init and in `predict`.
XGBoost also has this option and additionally allows multiple outputs with e.g. `alpha=np.array([0.05, 0.5, 0.95])`. Currently, this isn't documented in the scikit-learn documentation. Clearly, this is far superior functionality where it is possible.
Additionally, distributional regression packages each take their own approach:
XGBoostLSS allows options on the `predict` method such as `pred_type` = `quantiles`, `parameters`, `expectiles`. This returns an m x n array.
PGBM uses `predict`, which returns just the mean, with a `return_std=True` option, giving a 1 x n or 2 x n array. XGBD returns the mean and std as a namedtuple.
NGBoost has `predict` and `pred_dist`, which return point predictions and the distribution parameters respectively. The parameters can be passed to a scipy.stats distribution object, e.g. normal.
All of these packages use scikit-learn-style APIs or aim to add this as a feature.
Describe your proposed solution
All this is to say, I think scikit-learn has the authority and opportunity to lead the way on unifying an API to cover both distributional outputs and quantile outputs. Therefore, I would like to open a discussion with the following points:
Should this be added to the core scikit-learn API/package? Is this within scope?
If not, should it be a scikit-learn-contrib package, or is this something the "distributional python ML community" should sort out themselves?
If this is something scikit-learn would like to take a lead on, what should this API look like? If the scikit-learn admins/owners think this is out of scope but still have insights or opinions on it, it would still be extremely valuable to hear them!
Describe alternatives you've considered, if relevant
No response
Additional context
No response
I believe this should be led by the scikit-learn API. This sentiment is clearly shared in "API to predict multiple quantiles at once" #23334. However, I believe it should be extended to include the distribution parameters for those models that can handle them.
I have no strong opinions, but my initial ideas are as follows:
Multioutput Quantiles
This is well covered in #23334. However, there could also be a `sklearn.multioutput.MultiOutputQuantile` class to handle quantile regression models that only produce a single quantile output.
Distribution Parameters
These should have a separate method (like `predict_proba` for classification). For example, a `predict_dist_params` or `predict_dist` method with outputs in an m x n array, with m determined by the number of parameters for the given distribution.
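```python
# Proposed API for a distributional regressor (predict_quantile and
# predict_dist_params do not exist yet).
regressor = XGBD().fit(X, y)  # default normal dist
regressor.predict(X)  # point prediction
regressor.predict_quantile(X, quantile=[0.05, 0.5, 0.95])  # 3 x n
regressor.predict_dist_params(X)  # 2 x n as mean/loc, std/scale
```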
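To make the proposed output shapes concrete, here is a minimal pure-Python sketch. `ToyNormalRegressor` is a hypothetical stand-in (not XGBD or any real library class) that ignores `X` and fits one normal distribution to `y`. The method names follow the proposal above and are assumptions, not an existing scikit-learn API.

```python
from statistics import NormalDist, fmean, stdev


class ToyNormalRegressor:
    """Hypothetical distributional regressor, for illustration only.

    It ignores X and fits a single normal distribution to y, which is
    enough to show the shapes returned by the three proposed methods.
    """

    def fit(self, X, y):
        self.loc_ = fmean(y)
        self.scale_ = stdev(y)
        return self  # scikit-learn convention: fit returns the estimator

    def predict(self, X):
        # Point prediction: one value per sample.
        return [self.loc_ for _ in X]

    def predict_dist_params(self, X):
        # 2 x n: first row is mean/loc, second row is std/scale.
        n = len(X)
        return [[self.loc_] * n, [self.scale_] * n]

    def predict_quantile(self, X, quantile):
        # m x n: one row per requested quantile, derived from the parameters.
        loc, scale = self.predict_dist_params(X)
        return [
            [NormalDist(mu, sigma).inv_cdf(q) for mu, sigma in zip(loc, scale)]
            for q in quantile
        ]


X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 2.0, 3.0, 4.0]

regressor = ToyNormalRegressor().fit(X, y)
params = regressor.predict_dist_params(X)  # 2 x 4
quantiles = regressor.predict_quantile(X, quantile=[0.05, 0.5, 0.95])  # 3 x 4
```

One design point this illustrates: once a model exposes its distribution parameters, any set of quantiles can be derived at predict time without refitting.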
```python
# Proposed wrapper for regressors that only output a single quantile.
regressor = MultiOutputQuantile(QuantileRegressor, quantile=[0.05, 0.5, 0.95]).fit(X, y)
regressor.predict(X)  # 3 x n
regressor.predict_quantile(X)  # 3 x n
regressor.predict_dist_params(X)  # raises error
```
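A rough sketch of how such a wrapper could work, assuming it simply fits one copy of the base estimator per quantile. Both classes here are hypothetical: `ToyQuantileRegressor` is a toy stand-in that ignores `X`, and this `MultiOutputQuantile` is only one possible reading of the proposal, not an existing scikit-learn class.

```python
import math


class ToyQuantileRegressor:
    """Hypothetical single-quantile estimator with the quantile fixed at init.

    It ignores X and predicts an empirical quantile of y (nearest-rank rule).
    """

    def __init__(self, quantile=0.5):
        self.quantile = quantile

    def fit(self, X, y):
        ordered = sorted(y)
        # Nearest-rank index, clamped so that very small quantiles stay valid.
        rank = max(1, math.ceil(self.quantile * len(ordered)))
        self.value_ = ordered[rank - 1]
        return self

    def predict(self, X):
        return [self.value_ for _ in X]


class MultiOutputQuantile:
    """Sketch of the proposed wrapper: one fitted estimator per quantile."""

    def __init__(self, estimator_cls, quantile):
        self.estimator_cls = estimator_cls
        self.quantile = list(quantile)

    def fit(self, X, y):
        self.estimators_ = [
            self.estimator_cls(quantile=q).fit(X, y) for q in self.quantile
        ]
        return self

    def predict_quantile(self, X):
        # m x n: one row per quantile, one column per sample.
        return [est.predict(X) for est in self.estimators_]

    def predict(self, X):
        # In this sketch, predict mirrors predict_quantile.
        return self.predict_quantile(X)

    def predict_dist_params(self, X):
        raise NotImplementedError(
            "wrapped estimators expose no distribution parameters"
        )


X = [[float(i)] for i in range(10)]
y = [float(i) for i in range(1, 11)]

regressor = MultiOutputQuantile(ToyQuantileRegressor, quantile=[0.05, 0.5, 0.95])
preds = regressor.fit(X, y).predict_quantile(X)  # 3 x 10
```

The cost of this approach is one full fit per quantile, which is the main argument for native multi-quantile support where the underlying model allows it.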