RFC Principled metrics for scoring and calibration of supervised learning #21718
Comments
@lorentzenchr Thanks a lot for opening this issue. I really have concerns each time we get a feature request for a new metric, because I don't really know whether it is a legitimate request or not. Basically, I lack insight regarding evaluation, and our documentation does not help there. I think that your RFC is framing what should become a recommendation section and should help us specify inclusion criteria specifically for metrics. As I said, I don't think that I am the right person to define those, but I will really be happy to see a discussion going on here (and to learn from it). Such a discussion will benefit our documentation and we could probably come up with a recommendation. I assume that we should have similar recommendations in the classification setting (and subcases like imbalanced classification and so on).
Teaching of ML often underemphasises metrics relative to models, but at the end of the day all the assumptions that make a predictive model credible are about evaluation: data, CV, metrics and model selection. In this vein, I do think we can take some responsibility in giving guidance to our users. In principle, I think we should be open to including various metrics in Scikit-learn. We do so for at least two reasons:
However, I think a main reason we have been hesitant to add metrics is clutter, which I think @lorentzenchr is seeking to solve by better structuring or documenting the metrics. While over the years we have attempted to consolidate the implementation of metrics (especially with common tests and the like), some of this clutter comes from mixing multiple different tasks (binary classification, other classification, regression, ...) as well as different purposes or applications. While documentation will help, an API like #12385 could be extended to explicitly allow someone to navigate the library of metrics by their properties and identify those most appropriate to their task.
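To make that navigation idea a bit more tangible, here is a purely hypothetical sketch: none of these names exist in scikit-learn today, and the property tags simply mirror the distinctions drawn in this RFC (task, target functional, purpose).

```python
# Hypothetical sketch only -- not an existing scikit-learn API.
# It illustrates how metrics could be tagged with properties so that
# users can filter for the ones appropriate to their problem.
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricInfo:
    name: str
    task: str        # e.g. "regression" or "binary-classification"
    functional: str  # target functional, e.g. "mean", "median", "quantile"
    purpose: str     # "scoring" (consistent scoring function) or "calibration"


# A toy registry with a few illustrative entries.
METRICS = [
    MetricInfo("mean_squared_error", "regression", "mean", "scoring"),
    MetricInfo("mean_pinball_loss", "regression", "quantile", "scoring"),
    MetricInfo("bias (mean residual)", "regression", "mean", "calibration"),
    MetricInfo("brier_score_loss", "binary-classification", "probability", "scoring"),
]


def find_metrics(task, purpose):
    """Return the registered metrics matching a task and a purpose."""
    return [m for m in METRICS if m.task == task and m.purpose == purpose]


print(find_metrics("regression", "scoring"))
```

Whether something like this would live as a registry, as metric tags, or purely as documentation tables is exactly the kind of structuring question raised here.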
We receive a lot of feature requests to add new metrics for classification and regression tasks, each shedding light on a different aspect of supervised learning, e.g. (incomplete list):
- Fmax score (or maximum of F1/Fbeta) #26026

Scikit-learn estimators for regression and classification give, with rare exceptions, point forecasts, i.e. a single number as prediction for a single row of input data. Point forecasts usually aim at predicting a certain property/functional of the predictive distribution, such as the expectation `E[Y|X]` or the median `Median(Y|X)`. Such point forecasts are well understood and there is good theory on how to validate them, see [1].

Examples: `LinearRegression`, `Ridge`, `HistGradientBoostingRegressor(loss='squared_error')` and classifiers with a `predict_proba` like `LogisticRegression` estimate the expectation.

There are 2 main aspects of model validation/selection:
1. Calibration: does the prediction hit the target (on average)? This can be assessed with identification functions, which in the case of the expectation amounts to checking the bias `sum(y_predicted - y_observed)`, see [2, 4].
2. Resolution/predictive performance: this can be assessed with (strictly consistent) scoring functions, which in the case of the expectation include the squared error `sum((y_predicted - y_observed)^2)`, see [1, 4].
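As a minimal NumPy sketch of these two aspects for an expectation forecast: the data are made up, and means are used instead of the sums above (they differ only by a constant factor).

```python
import numpy as np

# Toy data: y_observed are outcomes, y_predicted are expectation forecasts.
y_observed = np.array([1.0, 0.0, 3.0, 2.0])
y_predicted = np.array([1.2, 0.1, 2.5, 2.2])

# Calibration aspect: the identification function for the mean is
# (prediction - outcome); its average is the bias. A bias of zero means
# the forecast is (unconditionally) calibrated.
bias = np.mean(y_predicted - y_observed)

# Scoring aspect: the squared error is a strictly consistent scoring function
# for the mean, so a lower average score indicates better predictions overall.
mean_squared_error = np.mean((y_predicted - y_observed) ** 2)

print(f"bias={bias:.3f}, mean squared error={mean_squared_error:.3f}")
```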
This leads to two questions.

A. How to structure `metrics` in a principled way to make it clear which metric assesses what?

Note that (consistent) scoring functions assess calibration and resolution at the same time, see [3, 4]. They are therefore a good fit for cross-validation.
For example, #20162 proposes scoring functions and calibration functions for 2 quantiles, i.e. the prediction of 2 estimators at the same time (a 2-point forecast).
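For a single quantile (a simpler setting than the 2-quantile proposal in #20162), the same two aspects can be sketched as follows. The quantile level and data are arbitrary, and scikit-learn's `mean_pinball_loss` serves as the consistent scoring function.

```python
import numpy as np
from sklearn.metrics import mean_pinball_loss

rng = np.random.default_rng(0)
alpha = 0.8  # target quantile level, chosen arbitrarily for this illustration

y_observed = rng.normal(size=1000)
# Pretend these are predictions of the alpha-quantile coming from some model.
y_quantile_pred = np.full_like(y_observed, np.quantile(y_observed, alpha))

# Scoring aspect: the pinball loss is a consistent scoring function for the
# alpha-quantile; lower is better.
score = mean_pinball_loss(y_observed, y_quantile_pred, alpha=alpha)

# Calibration aspect: the identification function of the alpha-quantile is
# 1{y <= prediction} - alpha, so the empirical coverage should be close to alpha.
coverage = np.mean(y_observed <= y_quantile_pred)

print(f"pinball loss={score:.3f}, coverage={coverage:.3f} (target {alpha})")
```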
B. How to assess calibration?
Currently, we have `calibration_curve` to assess (auto-)calibration for classifiers. It would be desirable to …

Note that calibration scores, by nature, often do not comply with "larger is better" or "smaller is better" and are thus not suitable for cross-validation/grid search (for hyper-parameter tuning). An example is the bias proposed in #17854, also mentioned above, which is sign sensitive (a larger absolute value means worse, and the sign indicates under- or overshooting of the prediction).
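As a small sketch of the current state and of why such diagnostics do not fit the "smaller is better" mold: the dataset and model below are chosen arbitrarily, and the signed bias on the predicted probabilities is an illustration, not an existing scikit-learn metric.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
prob_pred_test = clf.predict_proba(X_test)[:, 1]

# Reliability-diagram data: mean observed frequency vs. mean predicted
# probability per bin; this is what calibration_curve provides today.
prob_true, prob_pred = calibration_curve(y_test, prob_pred_test, n_bins=10)

# A signed calibration diagnostic: its sign tells the direction of
# miscalibration (over- vs. under-prediction of the positive class),
# so it cannot be ranked by "smaller is better" alone.
bias = np.mean(prob_pred_test - y_test)

print(prob_true, prob_pred, f"bias={bias:+.4f}")
```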
Further examples: #11096 is for (auto-) calibration of classification only, #20162 proposes (among others) a metric for the calibration of 2 quantiles.
References
[1] Gneiting (2009) https://arxiv.org/abs/0912.0902
[2] Nolde & Ziegel (2016) https://arxiv.org/abs/1608.05498
[3] Pohle (2020) https://arxiv.org/abs/2005.01835
[4] Fissler, Lorentzen & Mayer (2022) https://arxiv.org/abs/2202.12780