Regression Probability Distribution & Multi-Quantile Output API #28060

Closed
joshdunnlime opened this issue Jan 3, 2024 · 3 comments
Labels: Needs Triage, New Feature

joshdunnlime commented Jan 3, 2024

Describe the workflow you want to enable

Scikit-learn has predict and predict_proba methods for classifiers but only a predict method for regressors, some of which offer a quantile option. Scikit-learn is adding more quantile output functionality, e.g. in HistGradientBoostingRegressor and QuantileRegressor, and no doubt more will come in due course. In each case, a single quantile parameter is set at the class init step.

LightGBM and other packages also follow a similar API.

MAPIE allows alpha to be set both at class init and in predict.

XGBoost has this option too, but additionally allows multiple quantile outputs from a single model with e.g. alpha=np.array([0.05, 0.5, 0.95]). Currently, this isn't documented in the scikit-learn documentation. Where it is possible, this is clearly the more useful functionality.

Additionally, distributional regression packages each take their own approach:

XGBoostLSS allows options on the predict method such as pred_type = quantiles, parameters, expectiles. This returns an m x n array.

PGBM uses predict, returning just the mean or, with a return_std=True option, the mean and std, as a 1 x n or 2 x n array.

XGBD returns the mean and std as a namedtuple.

NGBoost has predict and pred_dist, which return point predictions and the distribution parameters, respectively. The latter can be passed to a scipy.stats distribution object, e.g. a normal distribution.

All of these packages use scikit-learn-style APIs or aim to add this as a feature.
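To illustrate why distribution parameters are the more general output, here is a minimal sketch showing that a 2 x n array of normal parameters (mean/loc and std/scale) is enough to recover any set of quantiles. It uses the standard-library statistics.NormalDist rather than scipy.stats to stay dependency-free; the parameter values and the helper name quantiles_from_normal_params are made up for illustration and do not come from any of the packages above.

```python
from statistics import NormalDist

# Hypothetical per-sample distribution parameters as a 2 x n layout
# (row 0: mean/loc, row 1: std/scale). Values are invented for illustration.
dist_params = [
    [10.0, 12.5, 9.0],  # mean / loc
    [1.0, 2.0, 0.5],    # std / scale
]


def quantiles_from_normal_params(params, quantiles):
    """Return an m x n nested list with one row per requested quantile."""
    means, stds = params
    dists = [NormalDist(mu, sigma) for mu, sigma in zip(means, stds)]
    # inv_cdf(q) is the q-th quantile of each sample's predictive distribution.
    return [[d.inv_cdf(q) for d in dists] for q in quantiles]


quantile_preds = quantiles_from_normal_params(dist_params, [0.05, 0.5, 0.95])
for q, row in zip([0.05, 0.5, 0.95], quantile_preds):
    print(q, [round(v, 3) for v in row])
```

The reverse direction does not hold in general: a handful of quantiles does not determine the distribution parameters, which is one argument for treating the two outputs as separate methods.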

Describe your proposed solution

All this is to say, I think scikit-learn has the authority and opportunity to lead the way on unifying an API to cover both distributional outputs and quantile outputs. Therefore, I would like to open a discussion with the following points:

  1. Should this be added to the core scikit learn API/package? Is this within scope?

  2. If not, should it be a scikit-learn-contrib package, or is this something the "distributional python ML community" should sort out themselves?

  3. If this is something scikit-learn would like to take a lead on, what should this API look like? If scikit-learn admin/owners think this is outside of scope but still have insights/opinions on this it would still be extremely valuable to hear them!

Describe alternatives you've considered, if relevant

No response

Additional context

No response

joshdunnlime added the Needs Triage and New Feature labels Jan 3, 2024
@joshdunnlime
Author

To answer my own questions:

  1. I believe this should be led by the scikit-learn API. This sentiment is clearly shared here: API to predict multiple quantiles at once #23334. However, I believe it should be extended to include the distribution parameters for those models that can handle them.

  2. I have no strong opinions, but my initial ideas are as follows:

Multioutput Quantiles
This is well covered in #23334, however there could also be a sklearn.multioutput.MultiOutputQuantile class to handle those quantile regression models with single quantile outputs.

Distribution Parameters
These should have a separate method (like predict_proba for classification). For example, a predict_dist_params or predict_dist method with outputs in an m x n array, with m determined by the number of parameters of the given distribution.

regressor = XGBD().fit(X, y)  # default normal dist
regressor.predict(X)  # point prediction
regressor.predict_quantile(X, quantile=[0.05, 0.5, 0.95])  # 3 x n
regressor.predict_dist_params(X)  # 2 x n as mean/loc, std/scale

regressor = MultiOutputQuantile(QuantileRegressor, quantile=[0.05, 0.5, 0.95]).fit(X, y)  # wraps a single-output regressor
regressor.predict(X)  # 3 x n
regressor.predict_quantile(X)  # 3 x n
regressor.predict_dist_params(X)  # raises error
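
For concreteness, a rough, dependency-free sketch of how such a MultiOutputQuantile wrapper might behave is given below. Nothing here exists in scikit-learn: MultiOutputQuantile is the hypothetical class named above, and EmpiricalQuantileRegressor is a toy stand-in (it ignores X and predicts one empirical quantile of y) for any estimator whose quantile is fixed at init.

```python
import math


class EmpiricalQuantileRegressor:
    """Toy single-quantile regressor (illustrative stand-in, not a real estimator).

    Ignores X and predicts the nearest-rank empirical quantile of y.
    """

    def __init__(self, quantile=0.5):
        self.quantile = quantile

    def fit(self, X, y):
        ordered = sorted(y)
        rank = max(1, math.ceil(self.quantile * len(ordered)))
        self.value_ = ordered[rank - 1]
        return self

    def predict(self, X):
        return [self.value_ for _ in X]


class MultiOutputQuantile:
    """Hypothetical wrapper: fits one single-quantile estimator per quantile."""

    def __init__(self, estimator_cls, quantile):
        self.estimator_cls = estimator_cls
        self.quantile = list(quantile)

    def fit(self, X, y):
        # One independent fit per requested quantile.
        self.estimators_ = [
            self.estimator_cls(quantile=q).fit(X, y) for q in self.quantile
        ]
        return self

    def predict_quantile(self, X):
        # Shape: len(quantile) x n, matching the "3 x n" comments above.
        return [est.predict(X) for est in self.estimators_]

    def predict(self, X):
        return self.predict_quantile(X)

    def predict_dist_params(self, X):
        raise NotImplementedError("wrapped estimators expose no distribution")


X = [[i] for i in range(10)]
y = list(range(1, 11))
model = MultiOutputQuantile(EmpiricalQuantileRegressor, quantile=[0.05, 0.5, 0.95])
model.fit(X, y)
preds = model.predict_quantile(X[:4])
```

The sketch takes an estimator class to mirror the snippet above. Existing scikit-learn meta-estimators conventionally take an estimator instance and clone it, so a real implementation would probably follow that pattern instead.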

@glemaitre
Member

I am going to close this issue in favor of #23334. @joshdunnlime Could you bring your discussion over to the other issue?

This topic is part of the roadmap, but we still need to settle on some aspects of it.

@joshdunnlime
Author

I'll add any additional information from this issue to the previous one. Thanks for checking back in.
