Regression Probability Distribution & Multi-Quantile Output API #28060

Closed
@joshdunnlime

Describe the workflow you want to enable

Scikit-learn has both predict and predict_proba methods for classifiers, but only a predict method for regressors, with quantile output available as an option on some estimators. Scikit-learn is adding more quantile output functionality, e.g. in HistGradientBoostingRegressor and QuantileRegressor, and no doubt more will come in due course. In both cases, the single quantile parameter is set at the class init step.

LightGBM and other packages also follow a similar API.

MAPIE allows alpha being set on class init and predict.

XGBoost has this option too, and additionally allows multiple outputs with e.g. alpha=np.array([0.05, 0.5, 0.95]). Currently, this isn't documented in the scikit-learn documentation. Where it is possible, this is clearly far superior functionality.

Additionally, there are distributional regression packages, each with its own conventions:

XGBoostLSS allows options on the predict method such as pred_type = quantiles, parameters, expectiles. This returns an m x n array.

PGBM uses predict to return just the mean, with a return_std=True option, giving a 1 x n or 2 x n array.

XGBD returns the mean and std as a namedtuple.

NGBoost has predict and pred_dist, which return point predictions and distribution parameters respectively. The parameters can be passed to a scipy.stats distribution object, e.g. a normal distribution.
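The mean/std outputs and the quantile outputs described above are closely related. As a rough sketch, assuming a normal predictive distribution, per-sample means and standard deviations can be turned into an n x m array of quantiles. The helper name normal_quantiles and the numbers are hypothetical, and the code uses the standard library's NormalDist rather than calling any of the packages mentioned.

```python
from statistics import NormalDist


def normal_quantiles(means, stds, quantiles):
    """Return one row per sample, one column per requested quantile.

    Hypothetical helper: assumes each prediction is Normal(mean, std).
    """
    return [
        [NormalDist(mu, sigma).inv_cdf(q) for q in quantiles]
        for mu, sigma in zip(means, stds)
    ]


# Made-up distribution parameters standing in for a model's output.
means = [1.0, 2.5]
stds = [0.5, 1.0]

rows = normal_quantiles(means, stds, [0.05, 0.5, 0.95])
for row in rows:
    print([round(v, 3) for v in row])
```

This suggests a distributional output is the more general one, since quantiles can be derived from it, while the reverse requires assuming a distribution family.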

All of these packages use scikit-learn-style APIs or aim to add this as a feature.

Describe your proposed solution

All this is to say, I think scikit-learn has the authority and opportunity to lead the way on unifying an API to cover both distributional outputs and quantile outputs. Therefore, I would like to open a discussion with the following points:

  1. Should this be added to the core scikit-learn API/package? Is this within scope?

  2. If not, should it be a scikit-learn-contrib package, or is this something the "distributional Python ML community" should sort out themselves?

  3. If this is something scikit-learn would like to take a lead on, what should this API look like? If scikit-learn admin/owners think this is outside of scope but still have insights/opinions on this it would still be extremely valuable to hear them!
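As a starting point for the third question, here is one possible shape, purely for discussion: quantiles are passed at predict time and the result has one row per sample and one column per quantile. The method name predict_quantiles, the Protocol, and the toy estimator are all hypothetical and are not part of scikit-learn or any package listed above.

```python
from typing import Protocol, Sequence


class SupportsQuantiles(Protocol):
    """Hypothetical interface: point predictions plus quantiles chosen at predict time."""

    def predict(self, X: Sequence) -> list: ...

    def predict_quantiles(self, X: Sequence, quantiles: Sequence[float]) -> list: ...


class ConstantQuantileRegressor:
    """Toy baseline that ignores X and returns empirical quantiles of the training targets."""

    def fit(self, X, y):
        self.sorted_y_ = sorted(y)
        return self

    def _quantile(self, q):
        # Nearest-rank lookup in the sorted training targets.
        idx = round(q * (len(self.sorted_y_) - 1))
        return self.sorted_y_[idx]

    def predict(self, X):
        # Point prediction is the empirical median.
        return [self._quantile(0.5) for _ in X]

    def predict_quantiles(self, X, quantiles):
        # One row per sample, one column per requested quantile.
        row = [self._quantile(q) for q in quantiles]
        return [list(row) for _ in X]


X = [[0], [1], [2], [3], [4]]
y = [10.0, 20.0, 30.0, 40.0, 50.0]

model = ConstantQuantileRegressor().fit(X, y)
out = model.predict_quantiles(X[:2], [0.0, 0.5, 1.0])
print(out)
```

Whether quantiles belong at init, at predict, or both (as MAPIE does with alpha) is exactly the kind of decision a shared API would need to settle.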

Describe alternatives you've considered, if relevant

No response

Additional context

No response
