Metrics and scoring: quantifying the quality of predictions
===========================================================

.. _which_scoring_function:

Which scoring function should I use?
====================================

Before we take a closer look into the details of the many scores and
:term:`evaluation metrics`, we want to give some guidance, inspired by statistical
decision theory, on the choice of **scoring functions** for **supervised learning**,
see [Gneiting2009]_:

- *Which scoring function should I use?*
- *Which scoring function is a good one for my task?*

In a nutshell, if the scoring function is given, e.g. in a Kaggle competition
or in a business context, use that one.
If you are free to choose, start by considering the ultimate goal and application
of the prediction. It is useful to distinguish two steps:

* Predicting
* Decision making

**Predicting:**
Usually, the response variable :math:`Y` is a random variable, in the sense that there
is *no deterministic* function :math:`Y = g(X)` of the features :math:`X`.
Instead, there is a probability distribution :math:`F` of :math:`Y`.
One can aim to predict the whole distribution, known as *probabilistic prediction*,
or---more the focus of scikit-learn---issue a *point prediction* (or point forecast)
by choosing a property or functional of that distribution :math:`F`.
Typical examples are the mean (expected value), the median or a quantile of the
response variable :math:`Y` (conditionally on :math:`X`).

Once that is settled, use a **strictly consistent** scoring function for that
(target) functional, see [Gneiting2009]_.
This means using a scoring function that is aligned with *measuring the distance
between predictions* `y_pred` *and the true target functional using observations of*
:math:`Y`, i.e. `y_true`.
For classification, **strictly proper scoring rules** (see the
`Wikipedia entry for Scoring rule <https://en.wikipedia.org/wiki/Scoring_rule>`_
and [Gneiting2007]_) coincide with strictly consistent scoring functions.
The table further below provides examples.
One could say that consistent scoring functions act as *truth serum* in that
they guarantee *"that truth telling [. . .] is an optimal strategy in
expectation"* [Gneiting2014]_.

Once a strictly consistent scoring function is chosen, it is best used for both: as
loss function for model training and as metric/score in model evaluation and model
comparison.

Note that for regressors, the prediction is done with :term:`predict` while for
classifiers it is usually :term:`predict_proba`.

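The distinction can be sketched in a few lines; the models and synthetic data below
are illustrative choices, not prescribed by the text:

```python
# Illustrative sketch (models and data are arbitrary choices): a regressor's
# `predict` issues point predictions of a functional such as the mean, while
# a classifier's `predict_proba` estimates the conditional class distribution.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

X_r, y_r = make_regression(n_samples=100, n_features=4, random_state=0)
reg = LinearRegression().fit(X_r, y_r)
point_pred = reg.predict(X_r[:3])      # three point predictions (of the mean)

X_c, y_c = make_classification(n_samples=100, random_state=0)
clf = LogisticRegression().fit(X_c, y_c)
proba = clf.predict_proba(X_c[:3])     # rows are class distributions summing to 1
```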
**Decision Making:**
The most common decisions are made on binary classification tasks, where the result of
:term:`predict_proba` is turned into a single outcome, e.g., from the predicted
probability of rain a decision is made on how to act (whether to take mitigating
measures like an umbrella or not).
For classifiers, this is what :term:`predict` returns.
See also :ref:`TunedThresholdClassifierCV`.
There are many scoring functions which measure different aspects of such a
decision; most of them are covered by or derived from
:func:`metrics.confusion_matrix`.

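As a hedged sketch of this decision step (the dataset and the 0.5 threshold are
illustrative assumptions), probabilities can be thresholded into actions and the
resulting decisions summarized with a confusion matrix:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)[:, 1]     # predicted probability of class 1
decision = (proba > 0.5).astype(int)   # decision step; matches clf.predict(X) here
cm = confusion_matrix(y, decision)     # rows: true class, columns: decision
```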
**List of strictly consistent scoring functions:**
Here, we list some of the most relevant statistical functionals and corresponding
strictly consistent scoring functions for tasks in practice. Note that the list is
not exhaustive.
For further criteria on how to select a specific one, see [Fissler2022]_.

================== =================================================== ==================== =================================
functional         scoring or loss function                            response `y`         prediction
================== =================================================== ==================== =================================
**Classification**
mean               :ref:`Brier score <brier_score_loss>` :sup:`1`      multi-class          ``predict_proba``
mean               :ref:`log loss <log_loss>`                          multi-class          ``predict_proba``
mode               :ref:`zero-one loss <zero_one_loss>` :sup:`2`       multi-class          ``predict``, categorical
**Regression**
mean               :ref:`squared error <mean_squared_error>` :sup:`3`  all reals            ``predict``, all reals
mean               :ref:`Poisson deviance <mean_tweedie_deviance>`     non-negative         ``predict``, strictly positive
mean               :ref:`Gamma deviance <mean_tweedie_deviance>`       strictly positive    ``predict``, strictly positive
mean               :ref:`Tweedie deviance <mean_tweedie_deviance>`     depends on ``power`` ``predict``, depends on ``power``
median             :ref:`absolute error <mean_absolute_error>`         all reals            ``predict``, all reals
quantile           :ref:`pinball loss <pinball_loss>`                  all reals            ``predict``, all reals
mode               no consistent one exists                            reals
================== =================================================== ==================== =================================

:sup:`1` The Brier score is just a different name for the squared error in case of
classification.

:sup:`2` The zero-one loss is only consistent but not strictly consistent for the
mode. The zero-one loss is equivalent to one minus the accuracy score, meaning it
gives different score values but the same ranking.

:sup:`3` R² gives the same ranking as squared error.

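Footnotes 1 and 2 can be checked numerically; the labels and probabilities below are
made-up illustration data:

```python
from sklearn.metrics import accuracy_score, brier_score_loss, zero_one_loss

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
# Footnote 2: the zero-one loss equals one minus the accuracy.
loss = zero_one_loss(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

y_proba = [0.1, 0.9, 0.8, 0.3, 0.6]
# Footnote 1: the Brier score is the mean squared error of the probabilities.
brier = brier_score_loss(y_true, y_proba)
mse = sum((p - t) ** 2 for p, t in zip(y_proba, y_true)) / len(y_true)
```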
**Fictitious Example:**
Let's make the above arguments more tangible. Consider a setting in network reliability
engineering, such as maintaining stable internet or Wi-Fi connections.
As the provider of the network, you have access to a dataset of log entries of network
connections containing network load over time and many interesting features.
Your goal is to improve the reliability of the connections.
In fact, you promise your customers that on at least 99% of all days there are no
connection discontinuities larger than 1 minute.
Therefore, you are interested in a prediction of the 99% quantile (of longest
connection interruption duration per day) in order to know in advance when to add
more bandwidth and thereby satisfy your customers. So the *target functional* is the
99% quantile. From the table above, you choose the pinball loss as scoring function
(fair enough, not much choice given), for model training (e.g.
`HistGradientBoostingRegressor(loss="quantile", quantile=0.99)`) as well as model
evaluation (`mean_pinball_loss(..., alpha=0.99)`; we apologize for the different
argument names, `quantile` and `alpha`), be it in grid search for finding
hyperparameters or in comparing to other models like
`QuantileRegressor(quantile=0.99)`.

.. rubric:: References

.. [Gneiting2007] T. Gneiting and A. E. Raftery. :doi:`Strictly Proper
   Scoring Rules, Prediction, and Estimation <10.1198/016214506000001437>`
   In: Journal of the American Statistical Association 102 (2007),
   pp. 359–378.
   `link to pdf <https://www.stat.washington.edu/people/raftery/Research/PDF/Gneiting2007jasa.pdf>`_

.. [Gneiting2009] T. Gneiting. :arxiv:`Making and Evaluating Point Forecasts
   <0912.0902>`
   Journal of the American Statistical Association 106 (2009): 746–762.

.. [Gneiting2014] T. Gneiting and M. Katzfuss. :doi:`Probabilistic Forecasting
   <10.1146/annurev-statistics-062713-085831>`. In: Annual Review of Statistics
   and Its Application 1.1 (2014), pp. 125–151.

.. [Fissler2022] T. Fissler, C. Lorentzen and M. Mayer. :arxiv:`Model
   Comparison and Calibration Assessment: User Guide for Consistent Scoring
   Functions in Machine Learning and Actuarial Practice. <2202.12780>`

.. _scoring_api_overview:

Scoring API overview
====================

There are 3 different APIs for evaluating the quality of a model's
predictions:
