RFC Principled metrics for scoring and calibration of supervised learning #21718
Comments
@lorentzenchr Thanks a lot for opening this issue. I really have concerns each time we get a feature request for a new metric, because I don't really know whether it is a legitimate request or not. Basically, I lack insight regarding evaluation, and our documentation does not help there. I think that your RFC is framing what should become a recommendation section and should help us specify inclusion criteria specifically for metrics. As I said, I don't think that I am the right person to define those, but I will really be happy to see a discussion going on here (and to learn from it). Such a discussion will benefit our documentation and we could probably come up with a recommendation. I assume that we should have similar recommendations in the classification setting (and subcases like imbalanced classification and so on).
Teaching of ML often underemphasises metrics relative to models, but at the end of the day all the assumptions that make a predictive model credible are about evaluation: data, CV, metrics and model selection. In this vein, I do think we can take some responsibility in giving guidance to our users. In principle, I think we should be open to including various metrics in Scikit-learn. We do so for at least two reasons:
However, I think a main reason we have been hesitant to add metrics is clutter, which I think @lorentzenchr is seeking to solve by better structuring or documenting the metrics. While over the years we have attempted to consolidate the implementation of metrics (especially with common tests and the like), some of this clutter comes from mixing multiple different tasks (binary classification, other classification, regression, ...) as well as different purposes or applications. While documentation will help, an API like #12385 could be extended to explicitly allow someone to navigate the library of metrics by their properties and identify those most appropriate to their task.
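To make that navigation idea a bit more tangible, here is a purely hypothetical sketch: none of these names exist in scikit-learn today, and the property tags simply mirror the distinctions drawn in this RFC (task, target functional, purpose).

```python
# Hypothetical sketch only -- not an existing scikit-learn API.
# It illustrates how metrics could be tagged with properties so that
# users can filter for the ones appropriate to their problem.
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricInfo:
    name: str
    task: str        # e.g. "regression" or "binary-classification"
    functional: str  # target functional, e.g. "mean", "median", "quantile"
    purpose: str     # "scoring" (consistent scoring function) or "calibration"


# A toy registry with a few illustrative entries.
METRICS = [
    MetricInfo("mean_squared_error", "regression", "mean", "scoring"),
    MetricInfo("mean_pinball_loss", "regression", "quantile", "scoring"),
    MetricInfo("bias (mean residual)", "regression", "mean", "calibration"),
    MetricInfo("brier_score_loss", "binary-classification", "probability", "scoring"),
]


def find_metrics(task, purpose):
    """Return the registered metrics matching a task and a purpose."""
    return [m for m in METRICS if m.task == task and m.purpose == purpose]


print(find_metrics("regression", "scoring"))
```

Whether something like this would live as a registry, as metric tags, or purely as documentation tables is exactly the kind of structuring question raised here.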
We receive a lot of feature requests to add new metrics for classification and regression tasks, each shedding light on a different aspect of supervised learning, e.g. (incomplete list):
- Fmax score (or maximum of F1/Fbeta) #26026

Scikit-learn estimators for regression and classification give, with rare exceptions, point forecasts, i.e. a single number as prediction for a single row of input data. Point forecasts usually aim at predicting a certain property/functional of the predictive distribution, such as the expectation `E[Y|X]` or the median `Median(Y|X)`. Such point forecasts are well understood and there is good theory on how to validate them, see [1].

Examples: `LinearRegression`, `Ridge`, `HistGradientBoostingRegressor(loss='squared_error')` and classifiers with a `predict_proba` like `LogisticRegression` estimate the expectation.

There are 2 main aspects of model validation/selection:
1. Calibration: does the prediction hit the target (on average)? This can be assessed with identification functions, which in the case of the expectation amounts to checking the bias `sum(y_predicted - y_observed)`, see [2, 4].
2. Resolution/predictive performance: this can be assessed with (strictly consistent) scoring functions, which in the case of the expectation include the squared error `sum((y_predicted - y_observed)^2)`, see [1, 4].
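As a minimal NumPy sketch of these two aspects for an expectation forecast: the data are made up, and means are used instead of the sums above (they differ only by a constant factor).

```python
import numpy as np

# Toy data: y_observed are outcomes, y_predicted are expectation forecasts.
y_observed = np.array([1.0, 0.0, 3.0, 2.0])
y_predicted = np.array([1.2, 0.1, 2.5, 2.2])

# Calibration aspect: the identification function for the mean is
# (prediction - outcome); its average is the bias. A bias of zero means
# the forecast is (unconditionally) calibrated.
bias = np.mean(y_predicted - y_observed)

# Scoring aspect: the squared error is a strictly consistent scoring function
# for the mean, so a lower average score indicates better predictions overall.
mean_squared_error = np.mean((y_predicted - y_observed) ** 2)

print(f"bias={bias:.3f}, mean squared error={mean_squared_error:.3f}")
```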
This leads to two questions.

A. How to structure `metrics` in a principled way to make it clear which metric assesses what?

Note that (consistent) scoring functions assess calibration and resolution at the same time, see [3, 4]. They are therefore a good fit for cross-validation.
For example, #20162 proposes scoring functions and calibration functions for 2 quantiles, i.e. the prediction of 2 estimators at the same time (a 2-point forecast).
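For a single quantile (a simpler setting than the 2-quantile proposal in #20162), the same two aspects can be sketched as follows. The quantile level and data are arbitrary, and scikit-learn's `mean_pinball_loss` serves as the consistent scoring function.

```python
import numpy as np
from sklearn.metrics import mean_pinball_loss

rng = np.random.default_rng(0)
alpha = 0.8  # target quantile level, chosen arbitrarily for this illustration

y_observed = rng.normal(size=1000)
# Pretend these are predictions of the alpha-quantile coming from some model.
y_quantile_pred = np.full_like(y_observed, np.quantile(y_observed, alpha))

# Scoring aspect: the pinball loss is a consistent scoring function for the
# alpha-quantile; lower is better.
score = mean_pinball_loss(y_observed, y_quantile_pred, alpha=alpha)

# Calibration aspect: the identification function of the alpha-quantile is
# 1{y <= prediction} - alpha, so the empirical coverage should be close to alpha.
coverage = np.mean(y_observed <= y_quantile_pred)

print(f"pinball loss={score:.3f}, coverage={coverage:.3f} (target {alpha})")
```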
B. How to assess calibration?
Currently, we have `calibration_curve` to assess (auto-)calibration for classifiers. It would be desirable to …

Note that calibration scores, by nature, often do not comply with "larger is better" or "smaller is better" and are thus not suitable for cross-validation/grid search (for hyper-parameter tuning). An example is the bias proposed in #17854, also mentioned above, which is sign sensitive (a larger absolute value means worse, and the sign indicates under- or overshooting of the prediction).
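As a small sketch of the current state and of why such diagnostics do not fit the "smaller is better" mold: the dataset and model below are chosen arbitrarily, and the signed bias on the predicted probabilities is an illustration, not an existing scikit-learn metric.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
prob_pred_test = clf.predict_proba(X_test)[:, 1]

# Reliability-diagram data: mean observed frequency vs. mean predicted
# probability per bin; this is what calibration_curve provides today.
prob_true, prob_pred = calibration_curve(y_test, prob_pred_test, n_bins=10)

# A signed calibration diagnostic: its sign tells the direction of
# miscalibration (over- vs. under-prediction of the positive class),
# so it cannot be ranked by "smaller is better" alone.
bias = np.mean(prob_pred_test - y_test)

print(prob_true, prob_pred, f"bias={bias:+.4f}")
```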
Further examples: #11096 is for (auto-) calibration of classification only, #20162 proposes (among others) a metric for the calibration of 2 quantiles.
References
[1] Gneiting (2009) https://arxiv.org/abs/0912.0902
[2] Nolde & Ziegel (2016) https://arxiv.org/abs/1608.05498
[3] Pohle (2020) https://arxiv.org/abs/2005.01835
[4] Fissler, Lorentzen & Mayer (2022) https://arxiv.org/abs/2202.12780