Commit a508dab

DOC add guideline for choosing a scoring function (#11430)
Co-authored-by: Christian Lorentzen <lorentzen.ch@googlemail.com>
Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com>
1 parent f54c98e commit a508dab

1 file changed (+137, -0 lines)

doc/modules/model_evaluation.rst

@@ -6,6 +6,143 @@

Metrics and scoring: quantifying the quality of predictions
===========================================================

.. _which_scoring_function:

Which scoring function should I use?
====================================

Before we take a closer look at the details of the many scores and
:term:`evaluation metrics`, we want to give some guidance, inspired by statistical
decision theory, on the choice of **scoring functions** for **supervised learning**
(see [Gneiting2009]_):

- *Which scoring function should I use?*
- *Which scoring function is a good one for my task?*

In a nutshell, if the scoring function is given, e.g. in a Kaggle competition
or in a business context, use that one.
If you are free to choose, start by considering the ultimate goal and application
of the prediction. It is useful to distinguish two steps:

* Predicting
* Decision making

**Predicting:**
Usually, the response variable :math:`Y` is a random variable, in the sense that there
is *no deterministic* function :math:`Y = g(X)` of the features :math:`X`.
Instead, there is a probability distribution :math:`F` of :math:`Y`.
One can aim to predict the whole distribution, known as *probabilistic prediction*,
or---more the focus of scikit-learn---issue a *point prediction* (or point forecast)
by choosing a property or functional of that distribution :math:`F`.
Typical examples are the mean (expected value), the median or a quantile of the
response variable :math:`Y` (conditionally on :math:`X`).
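
As a quick illustration of targeting different functionals, consider the following
minimal sketch (the synthetic dataset and the estimator choice are assumptions made
for this example, not part of the original guidance): the same model class, fit under
different losses, yields point predictions of different properties of :math:`F`::

    from sklearn.datasets import make_regression
    from sklearn.ensemble import HistGradientBoostingRegressor

    X, y = make_regression(n_samples=1_000, noise=10.0, random_state=0)

    # The squared error is strictly consistent for the conditional mean ...
    model_mean = HistGradientBoostingRegressor(loss="squared_error").fit(X, y)
    # ... while the absolute error is strictly consistent for the conditional median.
    model_median = HistGradientBoostingRegressor(loss="absolute_error").fit(X, y)

    # Two point predictions, targeting two different functionals of the same
    # conditional distribution of Y given X.
    pred_mean = model_mean.predict(X)
    pred_median = model_median.predict(X)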

Once that is settled, use a **strictly consistent** scoring function for that
(target) functional; see [Gneiting2009]_.
This means using a scoring function that is aligned with *measuring the distance
between predictions* `y_pred` *and the true target functional using observations of*
:math:`Y`, i.e. `y_true`.
For classification, **strictly proper scoring rules** (see the
`Wikipedia entry for Scoring rule <https://en.wikipedia.org/wiki/Scoring_rule>`_
and [Gneiting2007]_) coincide with strictly consistent scoring functions.
The table further below provides examples.
One could say that consistent scoring functions act as *truth serum* in that
they guarantee *"that truth telling [...] is an optimal strategy in
expectation"* [Gneiting2014]_.

Once a strictly consistent scoring function is chosen, it is best used for both: as
loss function for model training and as metric/score in model evaluation and model
comparison.

Note that for regressors, the prediction is done with :term:`predict` while for
classifiers it is usually :term:`predict_proba`.
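
As a minimal sketch of this advice (the synthetic, strictly positive target and the
estimator choice below are assumptions for illustration), one can train with the
Poisson deviance and evaluate with the very same strictly consistent function::

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.metrics import mean_poisson_deviance
    from sklearn.model_selection import train_test_split

    # Build a strictly positive target, as required by the Poisson deviance.
    X, y = make_regression(n_samples=1_000, random_state=0)
    y = np.exp((y - y.mean()) / y.std())

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Use the Poisson deviance as training loss ...
    model = HistGradientBoostingRegressor(loss="poisson", random_state=0)
    model.fit(X_train, y_train)

    # ... and as evaluation metric for the (conditional) mean of Y.
    print(mean_poisson_deviance(y_test, model.predict(X_test)))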

**Decision Making:**
The most common decisions are made in binary classification tasks, where the result of
:term:`predict_proba` is turned into a single outcome, e.g., from the predicted
probability of rain a decision is made on how to act (whether to take mitigating
measures like an umbrella or not).
For classifiers, this is what :term:`predict` returns.
See also :ref:`TunedThresholdClassifierCV`.
There are many scoring functions which measure different aspects of such a
decision; most of them are covered by or derived from the
:func:`metrics.confusion_matrix`.
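
The following sketch separates the two steps (the synthetic dataset and the choice of
``balanced_accuracy`` as the decision metric are assumptions for this illustration):
the classifier predicts probabilities, while `TunedThresholdClassifierCV` tunes the
threshold that turns them into decisions, which the confusion matrix then summarizes::

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import TunedThresholdClassifierCV, train_test_split

    X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Predicting: the classifier models the probability of the positive class.
    # Decision making: tune the threshold that turns probabilities into a
    # yes/no decision, here to optimize the balanced accuracy.
    decision = TunedThresholdClassifierCV(
        LogisticRegression(), scoring="balanced_accuracy"
    )
    decision.fit(X_train, y_train)

    # Inspect the resulting decisions with the confusion matrix.
    print(confusion_matrix(y_test, decision.predict(X_test)))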

**List of strictly consistent scoring functions:**
Here, we list some of the most relevant statistical functionals and corresponding
strictly consistent scoring functions for tasks in practice. Note that this list is
not exhaustive; for further criteria on how to select a specific one, see
[Fissler2022]_.

================== =================================================== ==================== =================================
functional         scoring or loss function                            response `y`         prediction
================== =================================================== ==================== =================================
**Classification**
mean               :ref:`Brier score <brier_score_loss>` :sup:`1`      multi-class          ``predict_proba``
mean               :ref:`log loss <log_loss>`                          multi-class          ``predict_proba``
mode               :ref:`zero-one loss <zero_one_loss>` :sup:`2`       multi-class          ``predict``, categorical
**Regression**
mean               :ref:`squared error <mean_squared_error>` :sup:`3`  all reals            ``predict``, all reals
mean               :ref:`Poisson deviance <mean_tweedie_deviance>`     non-negative         ``predict``, strictly positive
mean               :ref:`Gamma deviance <mean_tweedie_deviance>`       strictly positive    ``predict``, strictly positive
mean               :ref:`Tweedie deviance <mean_tweedie_deviance>`     depends on ``power`` ``predict``, depends on ``power``
median             :ref:`absolute error <mean_absolute_error>`         all reals            ``predict``, all reals
quantile           :ref:`pinball loss <pinball_loss>`                  all reals            ``predict``, all reals
mode               no consistent one exists                            reals
================== =================================================== ==================== =================================

:sup:`1` The Brier score is just a different name for the squared error in case of
classification.

:sup:`2` The zero-one loss is only consistent but not strictly consistent for the
mode. The zero-one loss is equivalent to one minus the accuracy score, meaning it
gives different score values but the same ranking.

:sup:`3` R² gives the same ranking as squared error.
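
Footnote 1 can be checked numerically: for binary labels, the Brier score of the
predicted probabilities equals their mean squared error (the probabilities below are
made up for this check)::

    import numpy as np
    from sklearn.metrics import brier_score_loss, mean_squared_error

    y_true = np.array([0, 1, 1, 0])
    y_prob = np.array([0.1, 0.9, 0.8, 0.3])  # predicted P(y=1)

    # Both compute the mean of (y_true - y_prob) ** 2.
    print(brier_score_loss(y_true, y_prob))    # 0.0375
    print(mean_squared_error(y_true, y_prob))  # 0.0375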

**Fictitious Example:**
Let's make the above arguments more tangible. Consider a setting in network
reliability engineering, such as maintaining stable internet or Wi-Fi connections.
As the network provider, you have access to a dataset of log entries of network
connections containing network load over time and many interesting features.
Your goal is to improve the reliability of the connections.
In fact, you promise your customers that on at least 99% of all days there are no
connection interruptions longer than 1 minute.
Therefore, you are interested in a prediction of the 99% quantile (of the longest
connection interruption duration per day) in order to know in advance when to add
more bandwidth and thereby satisfy your customers. So the *target functional* is the
99% quantile. From the table above, you choose the pinball loss as scoring function
(fair enough, not much choice given), both for model training (e.g.
`HistGradientBoostingRegressor(loss="quantile", quantile=0.99)`) and for model
evaluation (`mean_pinball_loss(..., alpha=0.99)`; we apologize for the different
argument names, `quantile` and `alpha`), be it in grid search for finding
hyperparameters or in comparing to other models like
`QuantileRegressor(quantile=0.99)`.
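
A sketch of this workflow, with purely synthetic data standing in for the
hypothetical connection logs (the features, sample sizes and data-generating
process below are our own assumptions)::

    import numpy as np
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.linear_model import QuantileRegressor
    from sklearn.metrics import mean_pinball_loss
    from sklearn.model_selection import train_test_split

    # Stand-in for the log data: daily longest interruption duration.
    rng = np.random.RandomState(0)
    X = rng.uniform(size=(1_000, 5))
    y = rng.exponential(scale=1 + 2 * X[:, 0])

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Train two candidate models of the 99% quantile with the pinball loss ...
    hgbt = HistGradientBoostingRegressor(loss="quantile", quantile=0.99)
    qreg = QuantileRegressor(quantile=0.99, alpha=0)
    for model in (hgbt, qreg):
        model.fit(X_train, y_train)

    # ... and compare them with the very same strictly consistent function
    # (note the differing argument names, quantile vs. alpha).
    for model in (hgbt, qreg):
        print(mean_pinball_loss(y_test, model.predict(X_test), alpha=0.99))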

.. rubric:: References

.. [Gneiting2007] T. Gneiting and A. E. Raftery. :doi:`Strictly Proper
   Scoring Rules, Prediction, and Estimation <10.1198/016214506000001437>`.
   In: Journal of the American Statistical Association 102 (2007),
   pp. 359–378.
   `link to pdf <https://www.stat.washington.edu/people/raftery/Research/PDF/Gneiting2007jasa.pdf>`_

.. [Gneiting2009] T. Gneiting. :arxiv:`Making and Evaluating Point Forecasts
   <0912.0902>`.
   In: Journal of the American Statistical Association 106 (2011),
   pp. 746–762.

.. [Gneiting2014] T. Gneiting and M. Katzfuss. :doi:`Probabilistic Forecasting
   <10.1146/annurev-statistics-062713-085831>`.
   In: Annual Review of Statistics and Its Application 1.1 (2014), pp. 125–151.

.. [Fissler2022] T. Fissler, C. Lorentzen and M. Mayer. :arxiv:`Model
   Comparison and Calibration Assessment: User Guide for Consistent Scoring
   Functions in Machine Learning and Actuarial Practice <2202.12780>`.

.. _scoring_api_overview:

Scoring API overview
====================

There are 3 different APIs for evaluating the quality of a model's
predictions: