Metrics and scoring: quantifying the quality of predictions
===========================================================

.. _which_scoring_function:

Which scoring function should I use?
====================================

Before we take a closer look into the details of the many scores and
:term:`evaluation metrics`, we want to give some guidance, inspired by statistical
decision theory, on the choice of **scoring functions** for **supervised learning**,
see [Gneiting2009]_:

- *Which scoring function should I use?*
- *Which scoring function is a good one for my task?*

In a nutshell, if the scoring function is given, e.g. in a Kaggle competition
or in a business context, use that one.
If you are free to choose, start by considering the ultimate goal and application
of the prediction. It is useful to distinguish two steps:

* Predicting
* Decision making

**Predicting:**
Usually, the response variable :math:`Y` is a random variable, in the sense that there
is *no deterministic* function :math:`Y = g(X)` of the features :math:`X`.
Instead, there is a probability distribution :math:`F` of :math:`Y`.
One can aim to predict the whole distribution, known as *probabilistic prediction*,
or---more the focus of scikit-learn---issue a *point prediction* (or point forecast)
by choosing a property or functional of that distribution :math:`F`.
Typical examples are the mean (expected value), the median or a quantile of the
response variable :math:`Y` (conditionally on :math:`X`).

Once that is settled, use a **strictly consistent** scoring function for that
(target) functional, see [Gneiting2009]_.
This means using a scoring function that is aligned with *measuring the distance
between predictions* `y_pred` *and the true target functional using observations of*
:math:`Y`, i.e. `y_true`.
For classification, **strictly proper scoring rules** (see the
`Wikipedia entry for Scoring rule <https://en.wikipedia.org/wiki/Scoring_rule>`_
and [Gneiting2007]_) coincide with strictly consistent scoring functions.
The table further below provides examples.
One could say that consistent scoring functions act as *truth serum* in that
they guarantee *"that truth telling [. . .] is an optimal strategy in
expectation"* [Gneiting2014]_.

Once a strictly consistent scoring function is chosen, it is best used for both: as
loss function for model training and as metric/score in model evaluation and model
comparison.

Note that for regressors, the prediction is done with :term:`predict` while for
classifiers it is usually :term:`predict_proba`.

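The distinction can be sketched in a few lines; the models and synthetic data below
are illustrative choices, not prescribed by the text:

```python
# Illustrative sketch (models and data are arbitrary choices): a regressor's
# `predict` issues point predictions of a functional such as the mean, while
# a classifier's `predict_proba` estimates the conditional class distribution.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

X_r, y_r = make_regression(n_samples=100, n_features=4, random_state=0)
reg = LinearRegression().fit(X_r, y_r)
point_pred = reg.predict(X_r[:3])      # three point predictions (of the mean)

X_c, y_c = make_classification(n_samples=100, random_state=0)
clf = LogisticRegression().fit(X_c, y_c)
proba = clf.predict_proba(X_c[:3])     # rows are class distributions summing to 1
```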
**Decision Making:**
The most common decisions are made on binary classification tasks, where the result of
:term:`predict_proba` is turned into a single outcome, e.g., from the predicted
probability of rain a decision is made on how to act (whether to take mitigating
measures like an umbrella or not).
For classifiers, this is what :term:`predict` returns.
See also :ref:`TunedThresholdClassifierCV`.
There are many scoring functions which measure different aspects of such a
decision; most of them are covered by or derived from
:func:`metrics.confusion_matrix`.

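As a hedged sketch of this decision step (the dataset and the 0.5 threshold are
illustrative assumptions), probabilities can be thresholded into actions and the
resulting decisions summarized with a confusion matrix:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)[:, 1]     # predicted probability of class 1
decision = (proba > 0.5).astype(int)   # decision step; matches clf.predict(X) here
cm = confusion_matrix(y, decision)     # rows: true class, columns: decision
```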
**List of strictly consistent scoring functions:**
Here, we list some of the most relevant statistical functionals and corresponding
strictly consistent scoring functions for tasks in practice. Note that the list is
not exhaustive.
For further criteria on how to select a specific one, see [Fissler2022]_.

================== =================================================== ==================== =================================
functional         scoring or loss function                            response `y`         prediction
================== =================================================== ==================== =================================
**Classification**
mean               :ref:`Brier score <brier_score_loss>` :sup:`1`      multi-class          ``predict_proba``
mean               :ref:`log loss <log_loss>`                          multi-class          ``predict_proba``
mode               :ref:`zero-one loss <zero_one_loss>` :sup:`2`       multi-class          ``predict``, categorical
**Regression**
mean               :ref:`squared error <mean_squared_error>` :sup:`3`  all reals            ``predict``, all reals
mean               :ref:`Poisson deviance <mean_tweedie_deviance>`     non-negative         ``predict``, strictly positive
mean               :ref:`Gamma deviance <mean_tweedie_deviance>`       strictly positive    ``predict``, strictly positive
mean               :ref:`Tweedie deviance <mean_tweedie_deviance>`     depends on ``power`` ``predict``, depends on ``power``
median             :ref:`absolute error <mean_absolute_error>`         all reals            ``predict``, all reals
quantile           :ref:`pinball loss <pinball_loss>`                  all reals            ``predict``, all reals
mode               no consistent one exists                            reals
================== =================================================== ==================== =================================

:sup:`1` The Brier score is just a different name for the squared error in case of
classification.

:sup:`2` The zero-one loss is only consistent but not strictly consistent for the
mode. The zero-one loss is equivalent to one minus the accuracy score, meaning it
gives different score values but the same ranking.

:sup:`3` R² gives the same ranking as squared error.

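Footnotes 1 and 2 can be checked numerically; the labels and probabilities below are
made-up illustration data:

```python
from sklearn.metrics import accuracy_score, brier_score_loss, zero_one_loss

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
# Footnote 2: the zero-one loss equals one minus the accuracy.
loss = zero_one_loss(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

y_proba = [0.1, 0.9, 0.8, 0.3, 0.6]
# Footnote 1: the Brier score is the mean squared error of the probabilities.
brier = brier_score_loss(y_true, y_proba)
mse = sum((p - t) ** 2 for p, t in zip(y_proba, y_true)) / len(y_true)
```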
**Fictitious Example:**
Let's make the above arguments more tangible. Consider a setting in network reliability
engineering, such as maintaining stable internet or Wi-Fi connections.
As the provider of the network, you have access to a dataset of log entries of network
connections containing network load over time and many interesting features.
Your goal is to improve the reliability of the connections.
In fact, you promise your customers that on at least 99% of all days there are no
connection discontinuities larger than 1 minute.
Therefore, you are interested in a prediction of the 99% quantile (of longest
connection interruption duration per day) in order to know in advance when to add
more bandwidth and thereby satisfy your customers. So the *target functional* is the
99% quantile. From the table above, you choose the pinball loss as scoring function
(fair enough, not much choice given), for model training (e.g.
`HistGradientBoostingRegressor(loss="quantile", quantile=0.99)`) as well as model
evaluation (`mean_pinball_loss(..., alpha=0.99)`; we apologize for the different
argument names, `quantile` and `alpha`), be it in grid search for finding
hyperparameters or in comparing to other models like
`QuantileRegressor(quantile=0.99)`.

.. rubric:: References

.. [Gneiting2007] T. Gneiting and A. E. Raftery. :doi:`Strictly Proper
   Scoring Rules, Prediction, and Estimation <10.1198/016214506000001437>`
   In: Journal of the American Statistical Association 102 (2007),
   pp. 359–378.
   `link to pdf <https://www.stat.washington.edu/people/raftery/Research/PDF/Gneiting2007jasa.pdf>`_

.. [Gneiting2009] T. Gneiting. :arxiv:`Making and Evaluating Point Forecasts
   <0912.0902>`
   Journal of the American Statistical Association 106 (2009): 746–762.

.. [Gneiting2014] T. Gneiting and M. Katzfuss. :doi:`Probabilistic Forecasting
   <10.1146/annurev-statistics-062713-085831>`. In: Annual Review of Statistics
   and Its Application 1.1 (2014), pp. 125–151.

.. [Fissler2022] T. Fissler, C. Lorentzen and M. Mayer. :arxiv:`Model
   Comparison and Calibration Assessment: User Guide for Consistent Scoring
   Functions in Machine Learning and Actuarial Practice. <2202.12780>`

.. _scoring_api_overview:

Scoring API overview
====================

There are 3 different APIs for evaluating the quality of a model's
predictions:
