
RFC Principled metrics for scoring and calibration of supervised learning #21718


lorentzenchr opened this issue Nov 19, 2021 · 2 comments

lorentzenchr (Member) commented Nov 19, 2021

We receive a lot of feature requests to add new metrics for classification and regression tasks, each shedding light on different aspects of supervised learning, e.g. (incomplete list)

Scikit-learn estimators for regression and classification give, with rare exceptions, point forecasts, i.e. a single number as prediction for a single row of input data. Point forecasts usually aim at predicting a certain property/functional of the predictive distribution such as the expectation E[Y|X] or the median Median(Y|X). Such point forecasts are well understood and there is good theory on how to validate them, see [1].

Examples: LinearRegression, Ridge, HistGradientBoostingRegressor(loss='squared_error'), and classifiers with a predict_proba method such as LogisticRegression estimate the expectation.
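
As a minimal illustration (my sketch, assuming scikit-learn >= 1.1, where HistGradientBoostingRegressor accepts loss="quantile"), the chosen training loss determines which functional the point forecast targets:

```python
# Illustrative sketch: the loss determines the functional being estimated.
# Assumes scikit-learn >= 1.1 for loss="quantile".
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=1_000, n_features=5, noise=10.0, random_state=0)

# loss="squared_error" targets the expectation E[Y|X] ...
mean_model = HistGradientBoostingRegressor(loss="squared_error").fit(X, y)
# ... while loss="quantile" with quantile=0.5 targets the median Median(Y|X).
median_model = HistGradientBoostingRegressor(loss="quantile", quantile=0.5).fit(X, y)

print(mean_model.predict(X[:3]))    # point forecasts of the conditional mean
print(median_model.predict(X[:3]))  # point forecasts of the conditional median
```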

There are 2 main aspects of model validation/selection:

  1. Calibration: How well does a model take into account the available information in the form of features? Is there (conditional) bias?
    This can be assessed with identification functions, which in the case of the expectation amounts to checking the bias sum(y_predicted - y_observed), see [2, 4].
  2. Scoring: How good is the predictive performance of a model A compared to a model B? The aim is often to select the better model. This can be assessed by (consistent) scoring functions, like the (mean) squared error for the expectation sum((y_predicted - y_observed)^2), see [1, 4]. A small numerical sketch of both aspects follows this list.
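
A small numerical sketch of both aspects for the expectation, on made-up data (the toy "predictions" below are deliberately slightly biased):

```python
# Toy illustration of the two aspects for the expectation functional.
import numpy as np

rng = np.random.default_rng(0)
y_observed = rng.normal(loc=2.0, scale=1.0, size=1_000)
# A deliberately (slightly) biased set of predictions.
y_predicted = y_observed + rng.normal(loc=0.1, scale=0.5, size=1_000)

# 1. Calibration: the identification function of the expectation is
#    y_predicted - y_observed; its mean should be close to zero.
bias = np.mean(y_predicted - y_observed)

# 2. Scoring: the squared error is a consistent scoring function for the
#    expectation; a smaller mean score indicates better predictive performance.
mse = np.mean((y_predicted - y_observed) ** 2)

print(f"mean bias (calibration check): {bias:+.3f}")
print(f"mean squared error (score):    {mse:.3f}")
```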

A. How to structure metrics in a principled way to make it clear which metric assesses what?

  1. How to make the distinction between calibration and scoring more explicit?
  2. How to make it clear which functional is assessed, e.g. expectation or a quantile, or 2 different quantiles?
  3. What is in-scope, what is out-of-scope?

Note that (consistent) scoring functions assess calibration and resolution at the same time, see [3, 4]. They are a good fit for cross validation.
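
For instance, the negated mean squared error, an existing scorer built on a consistent scoring function for the expectation, plugs directly into cross-validation (sketch on synthetic data):

```python
# Sketch: a consistent scoring function used for cross-validation / model selection.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

# "neg_mean_squared_error" negates the squared error so that larger is better,
# as required by scikit-learn's scorer convention.
scores = cross_val_score(Ridge(), X, y, cv=5, scoring="neg_mean_squared_error")
print(scores.mean())
```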

For example, #20162 proposes scoring functions and calibration functions for 2 quantiles, i.e. the prediction of 2 estimators at the same time (a 2-point forecast).

B. How to assess calibration?

Currently, we have calibration_curve to assess (auto-)calibration for classifiers. It would be desirable to

  1. also assess regression tasks;
  2. also look at conditional calibration, i.e. aggregates grouped by some feature (either bins or categorical levels); a sketch of such a check follows this list;
  3. handle different functionals, e.g. the expectation and quantiles.
  4. What is in-scope, what is out-of-scope?
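
As an illustration of point 2, here is a sketch of what a conditional calibration check for a regression model could look like: the bias aggregated per bin of a chosen feature. No such helper exists in scikit-learn today; the data and "model" below are made up.

```python
# Sketch only: conditional calibration check for the expectation,
# i.e. the bias aggregated per bin of a chosen feature.
import numpy as np

rng = np.random.default_rng(42)
x_feature = rng.uniform(0.0, 1.0, size=2_000)
y_observed = 3.0 * x_feature + rng.normal(scale=0.5, size=2_000)
y_predicted = 3.0 * x_feature + 0.2 * (x_feature > 0.5)  # biased for x > 0.5

edges = np.quantile(x_feature, np.linspace(0.0, 1.0, 6))  # 5 equal-frequency bins
bin_ids = np.digitize(x_feature, edges[1:-1])

for b in range(5):
    mask = bin_ids == b
    bias = np.mean(y_predicted[mask] - y_observed[mask])
    print(f"bin {b}: conditional bias = {bias:+.3f}")
```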

Note that calibration scores, by nature, often do not comply with "larger is better" or "smaller is better" and are thus not suitable for cross-validation/grid search (for hyper-parameter tuning). An example is the bias proposed in #17854, also mentioned above, which is sign sensitive (a larger absolute value means worse, and the sign indicates whether the prediction under- or overshoots).

Further examples: #11096 is for (auto-)calibration of classification only; #20162 proposes, among other things, a metric for the calibration of 2 quantiles.

References

[1] Gneiting (2009) https://arxiv.org/abs/0912.0902
[2] Nolde & Ziegel (2016) https://arxiv.org/abs/1608.05498
[3] Pohle (2020) https://arxiv.org/abs/2005.01835
[4] Fissler, Lorentzen & Mayer (2022) https://arxiv.org/abs/2202.12780

glemaitre (Member) commented

@lorentzenchr Thanks a lot for opening this issue.

I really have concerns each time we get a feature request for a new metric because I don't really know whether it is a legitimate request or not. Basically, I lack insight regarding evaluation, and our documentation does not help there. I think your RFC frames what should become a recommendation section and should help us specify inclusion criteria specifically for metrics.

As I said, I don't think I am the right person to define those, but I will really be happy to see a discussion going on here (and to learn from it). Such a discussion will benefit our documentation, and we could probably come up with a recommendation. I assume we should have similar recommendations in the classification setting (and sub-cases like imbalanced classification and so on).

jnothman (Member) commented Nov 21, 2021

Teaching of ML often underemphasises metrics relative to models, but at the end of the day all the assumptions that make a predictive model credible are about evaluation: data, CV, metrics and model selection. In this vein, I do think we can take some responsibility in giving guidance to our users.

In principle, I think we should be open to including various metrics in Scikit-learn. We do so for at least two reasons:

  • to support users to critically choose evaluation and diagnostic tools that are appropriate to their task
  • to help users adopt high quality implementations of metrics they have seen in the literature, rather than facing the pitfalls of reimplementation

However, I think a main reason we have been hesitant to add metrics is clutter, which I think @lorentzenchr is seeking to solve by better structuring or documenting the metrics. While over the years we have attempted to consolidate the implementation of metrics (especially with common tests and things like multilabel_confusion_matrix), the namespace of metrics is still very large (and computing multiple metrics is unnecessarily costly at runtime, since they often recompute the same sufficient statistics).

Some of this clutter comes from mixing multiple different tasks (binary classification, other classification, regression, ...) as well as different purposes or applications. While documentation will help, could an API like #12385 be extended to explicitly allow someone to navigate the library of metrics by their properties and identify those most appropriate to their task?
