Improve documentation: consistent scoring functions 

#### Abstract
Improve the documentation of [3.3. Model Evaluation](http://scikit-learn.org/stable/modules/model_evaluation.html) by advising users to prefer strictly proper scoring functions.

#### Explanation
The documentation of scikit-learn is amazing and a strong argument for its usage (and fame). Still, I think one can get a bit lost when choosing _the right_ scoring function or metric out of the many options for model evaluation. There are some influential papers that advocate the usage of **strictly proper scoring functions**:

1. [Making and Evaluating Point Forecasts](https://arxiv.org/abs/0912.0902), [alt. link](https://www.bundesbank.de/Redaktion/EN/Downloads/Bundesbank/Research_Centre/Conferences/2012/2012_06_01_eltville_11_gneiting_paper.pdf)

2. [Strictly Proper Scoring Rules, Prediction, and Estimation](https://doi.org/10.1198/016214506000001437), [alt. link](https://www.stat.washington.edu/raftery/Research/PDF/Gneiting2007jasa.pdf)

For classification and regression, most of the time one is interested in the mean functional of the distribution of the target variable `y`. In that case, the scoring function used to compare the predictive power of different models (for example, when choosing the regularization strength via cross-validation) should be strictly consistent for the mean functional.
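As a rough illustration of the cross-validation case, the sketch below selects the regularization strength `C` of a logistic regression by log loss. The synthetic dataset, the grid of `C` values and the choice of estimator are assumptions made only for this example. It relies on the existing `scoring="neg_log_loss"` option of `GridSearchCV`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (an assumption for illustration only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate regularization strengths (hypothetical grid).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# "neg_log_loss" scores predicted probabilities with the log loss,
# negated so that larger values are better.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="neg_log_loss",
    cv=5,
)
search.fit(X, y)

print("selected C:", search.best_params_["C"])
print("cross-validated log loss:", -search.best_score_)
```

Swapping in `scoring="accuracy"` would run just as well, but models would then be ranked only by their thresholded labels and not by their predicted probabilities.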

#### Examples
For binary classification, knowing the mean of `y` is equivalent to knowing the whole distribution of `y`. Brier (squared error) and logistic (log) loss are strictly consistent for the mean. Accuracy and 0-1 loss are only consistent, but not strictly consistent, for the mean (they are strictly consistent for the mode, which is less informative than the mean for classification). ROC, precision, recall etc. are not consistent for the mean, as far as I know.
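A small numerical sketch of this difference, assuming a Bernoulli target with true probability `p_true = 0.3` and a constant probability forecast `q`. The helper functions are hypothetical and written only for this illustration. They compute expected scores in closed form, with no sampling.

```python
import math

def expected_brier(p, q):
    # E[(y - q)^2] for y ~ Bernoulli(p)
    return p * (1 - q) ** 2 + (1 - p) * q ** 2

def expected_log_loss(p, q):
    # E[-(y log q + (1 - y) log(1 - q))] for y ~ Bernoulli(p)
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

def expected_zero_one(p, q, threshold=0.5):
    # The forecast q is turned into a label first, then scored with 0-1 loss.
    predicted_label = 1 if q > threshold else 0
    return (1 - p) if predicted_label == 1 else p

p_true = 0.3
grid = [i / 100 for i in range(1, 100)]  # candidate forecasts 0.01 ... 0.99

# Brier and log loss each have a unique minimizer on the grid: the true mean.
best_brier = min(grid, key=lambda q: expected_brier(p_true, q))
best_log = min(grid, key=lambda q: expected_log_loss(p_true, q))

# 0-1 loss cannot tell apart any two forecasts on the same side of 0.5.
zero_one_values = {expected_zero_one(p_true, q) for q in grid if q <= 0.5}

print(best_brier, best_log, zero_one_values)
```

Every forecast `q <= 0.5` gets the same expected 0-1 loss here, so a model predicting 0.01 and one predicting the true 0.3 are ranked equally by accuracy, while Brier and log loss both prefer the latter.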

For regression, Bregman functions (eq. (18) of 1.) are strictly consistent for the mean. For targets `y` on the whole real line, the squared error is one example. For positive-valued targets `y`, the squared error (b=2), Gamma deviance (b=0) and Poisson deviance (b=1) are examples, see eq. (20) of 1.
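The following sketch checks this numerically on a tiny positive-valued sample. The sample `[1, 2, 6]`, the grid search and the helper names are assumptions made for illustration. The deviances use their usual unit forms, and absolute error is included only as a contrast, since it targets the median and not the mean.

```python
import math

def squared_error(y, x):
    return (y - x) ** 2

def poisson_deviance(y, x):
    # Unit Poisson deviance, valid for y > 0 and x > 0
    return 2 * (y * math.log(y / x) - (y - x))

def gamma_deviance(y, x):
    # Unit Gamma deviance, valid for y > 0 and x > 0
    return 2 * (y / x - math.log(y / x) - 1)

def absolute_error(y, x):
    return abs(y - x)

def best_constant(score, sample, grid):
    # Constant prediction x that minimizes the average score over the sample.
    return min(grid, key=lambda x: sum(score(y, x) for y in sample) / len(sample))

sample = [1.0, 2.0, 6.0]                      # mean 3.0, median 2.0
grid = [i / 100 for i in range(50, 801)]      # candidates 0.50 ... 8.00

best_se = best_constant(squared_error, sample, grid)
best_poisson = best_constant(poisson_deviance, sample, grid)
best_gamma = best_constant(gamma_deviance, sample, grid)
best_abs = best_constant(absolute_error, sample, grid)

print(best_se, best_poisson, best_gamma, best_abs)
```

The three Bregman-type scores agree on the sample mean as the optimal constant prediction, whereas absolute error picks the sample median.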

It may be worth checking whether the many metrics already provided by scikit-learn are (strictly) consistent for a certain functional, and whether some of them are equivalent to one another.

#### Disclaimer
I do not know the authors of the cited papers and I'm not pushing my own research, just my opinion on how to improve this fantastic library 😏  

Improve documentation: consistent scoring functions #10584
