
Commit 810b920

OmarManzoor and ogrisel authored
FEA D2 Brier Score (#28971)
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
1 parent a589342 commit 810b920

File tree

6 files changed: +369 -3 lines changed


doc/modules/model_evaluation.rst

Lines changed: 48 additions & 2 deletions
@@ -233,6 +233,7 @@ Scoring string name Function
 'roc_auc_ovr_weighted'       :func:`metrics.roc_auc_score`
 'roc_auc_ovo_weighted'       :func:`metrics.roc_auc_score`
 'd2_log_loss_score'          :func:`metrics.d2_log_loss_score`
+'d2_brier_score'             :func:`metrics.d2_brier_score`

 **Clustering**
 'adjusted_mutual_info_score' :func:`metrics.adjusted_mutual_info_score`
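
The new 'd2_brier_score' entry in the scoring table means the metric can be requested by name wherever a `scoring` string is accepted. A minimal sketch, not part of this diff, assuming the scorer registration that this PR adds in files not shown in this excerpt:

    # Sketch: request the new metric by its scoring string in cross-validation.
    # Assumes the "d2_brier_score" scorer registration added elsewhere in this PR.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=200, random_state=0)
    scores = cross_val_score(
        LogisticRegression(), X, y, scoring="d2_brier_score", cv=5
    )
    # Each fold's score is 1 - brier(model) / brier(null model); higher is better.
    print(scores.mean())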
@@ -506,6 +507,7 @@ Some of these are restricted to the binary classification case:
    roc_curve
    class_likelihood_ratios
    det_curve
+   d2_brier_score


 Others also work in the multiclass case:
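
For the binary case the function also accepts 1-D probabilities of the positive class, mirroring :func:`brier_score_loss` (see the docstring added in _classification.py further below). A small sketch of that binary form, not part of the diff:

    # Sketch: binary usage with 1-D probabilities of the positive class.
    from sklearn.metrics import d2_brier_score

    y_true = [0, 1, 1, 0]
    y_proba = [0.1, 0.8, 0.7, 0.2]  # estimated P(y == 1) for each sample

    # The null model constantly predicts the 50/50 class balance of y_true.
    print(d2_brier_score(y_true, y_proba))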
@@ -2156,15 +2158,15 @@ D² score for classification
 The D² score computes the fraction of deviance explained.
 It is a generalization of R², where the squared error is generalized and replaced
 by a classification deviance of choice :math:`\text{dev}(y, \hat{y})`
-(e.g., Log loss). D² is a form of a *skill score*.
+(e.g., Log loss, Brier score). D² is a form of a *skill score*.
 It is calculated as

 .. math::

   D^2(y, \hat{y}) = 1 - \frac{\text{dev}(y, \hat{y})}{\text{dev}(y, y_{\text{null}})} \,.

 Where :math:`y_{\text{null}}` is the optimal prediction of an intercept-only model
-(e.g., the per-class proportion of `y_true` in the case of the Log loss).
+(e.g., the per-class proportion of `y_true` in the case of the Log loss and Brier score).

 Like R², the best possible score is 1.0 and it can be negative (because the
 model can be arbitrarily worse). A constant model that always predicts
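
To make the definition concrete for the Brier deviance introduced by this diff, here is a minimal NumPy sketch, not part of the diff, that reproduces the third usage example added further below, where the deviance is the multiclass Brier score, i.e. the mean over samples of the squared error against the one-hot encoded targets:

    import numpy as np

    y_true = np.array([1, 2, 3])
    y_proba = np.array([
        [0.1, 0.6, 0.3],
        [0.1, 0.6, 0.3],
        [0.4, 0.5, 0.1],
    ])

    classes = np.unique(y_true)
    y_onehot = (y_true[:, None] == classes[None, :]).astype(float)

    # Deviance of the model: multiclass Brier score.
    dev_model = np.mean(np.sum((y_onehot - y_proba) ** 2, axis=1))

    # Deviance of the null model: constantly predict the class proportions of y_true.
    y_null = np.tile(y_onehot.mean(axis=0), (len(y_true), 1))
    dev_null = np.mean(np.sum((y_onehot - y_null) ** 2, axis=1))

    print(1 - dev_model / dev_null)  # -0.37, matching the -0.370... doctest added below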
@@ -2210,6 +2212,50 @@ of 0.0.
   -0.552


+|details-start|
+**D2 Brier score**
+|details-split|
+
+The :func:`d2_brier_score` function implements the special case
+of D² with the Brier score, see :ref:`brier_score_loss`, i.e.:
+
+.. math::
+
+  \text{dev}(y, \hat{y}) = \text{brier_score_loss}(y, \hat{y}).
+
+This is also referred to as the Brier Skill Score (BSS).
+
+Here are some usage examples of the :func:`d2_brier_score` function::
+
+  >>> from sklearn.metrics import d2_brier_score
+  >>> y_true = [1, 1, 2, 3]
+  >>> y_pred = [
+  ...    [0.5, 0.25, 0.25],
+  ...    [0.5, 0.25, 0.25],
+  ...    [0.5, 0.25, 0.25],
+  ...    [0.5, 0.25, 0.25],
+  ... ]
+  >>> d2_brier_score(y_true, y_pred)
+  0.0
+  >>> y_true = [1, 2, 3]
+  >>> y_pred = [
+  ...    [0.98, 0.01, 0.01],
+  ...    [0.01, 0.98, 0.01],
+  ...    [0.01, 0.01, 0.98],
+  ... ]
+  >>> d2_brier_score(y_true, y_pred)
+  0.9991
+  >>> y_true = [1, 2, 3]
+  >>> y_pred = [
+  ...    [0.1, 0.6, 0.3],
+  ...    [0.1, 0.6, 0.3],
+  ...    [0.4, 0.5, 0.1],
+  ... ]
+  >>> d2_brier_score(y_true, y_pred)
+  -0.370...
+
+|details-end|
+
 .. _multilabel_ranking_metrics:

 Multilabel ranking metrics
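
The hunk above notes that this quantity is also known as the Brier Skill Score (BSS). A quick binary sanity check of that identity, not part of the diff, using only :func:`brier_score_loss` from the existing API:

    import numpy as np
    from sklearn.metrics import brier_score_loss, d2_brier_score

    y_true = np.array([0, 1, 1, 0, 1])
    y_proba = np.array([0.2, 0.9, 0.6, 0.3, 0.8])

    # Reference forecast: constantly predict the positive-class frequency (3/5).
    y_proba_ref = np.full_like(y_proba, y_true.mean())

    bss = 1 - brier_score_loss(y_true, y_proba) / brier_score_loss(y_true, y_proba_ref)
    print(bss)                              # ~0.7167
    print(d2_brier_score(y_true, y_proba))  # should agree with the BSS above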
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+- :func:`metrics.d2_brier_score` has been added, which computes the D^2 score with the Brier score as the deviance.
+  By :user:`Omar Salman <OmarManzoor>`.

sklearn/metrics/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -12,6 +12,7 @@
     classification_report,
     cohen_kappa_score,
     confusion_matrix,
+    d2_brier_score,
     d2_log_loss_score,
     f1_score,
     fbeta_score,
@@ -124,6 +125,7 @@
     "consensus_score",
     "coverage_error",
     "d2_absolute_error_score",
+    "d2_brier_score",
     "d2_log_loss_score",
     "d2_pinball_score",
     "d2_tweedie_score",

sklearn/metrics/_classification.py

Lines changed: 102 additions & 0 deletions
@@ -3744,3 +3744,105 @@ def d2_log_loss_score(y_true, y_pred, *, sample_weight=None, labels=None):
     )

     return float(1 - (numerator / denominator))
+
+
+@validate_params(
+    {
+        "y_true": ["array-like"],
+        "y_proba": ["array-like"],
+        "sample_weight": ["array-like", None],
+        "pos_label": [Real, str, "boolean", None],
+        "labels": ["array-like", None],
+    },
+    prefer_skip_nested_validation=True,
+)
+def d2_brier_score(
+    y_true,
+    y_proba,
+    *,
+    sample_weight=None,
+    pos_label=None,
+    labels=None,
+):
+    """:math:`D^2` score function, fraction of Brier score explained.
+
+    Best possible score is 1.0 and it can be negative because the model can
+    be arbitrarily worse than the null model. The null model, also known as the
+    optimal intercept model, is a model that constantly predicts the per-class
+    proportions of `y_true`, disregarding the input features. The null model
+    gets a D^2 score of 0.0.
+
+    Read more in the :ref:`User Guide <d2_score_classification>`.
+
+    Parameters
+    ----------
+    y_true : array-like of shape (n_samples,)
+        True targets.
+
+    y_proba : array-like of shape (n_samples,) or (n_samples, n_classes)
+        Predicted probabilities. If `y_proba.shape = (n_samples,)`
+        the probabilities provided are assumed to be that of the
+        positive class. If `y_proba.shape = (n_samples, n_classes)`
+        the columns in `y_proba` are assumed to correspond to the
+        labels in alphabetical order, as done by
+        :class:`~sklearn.preprocessing.LabelBinarizer`.
+
+    sample_weight : array-like of shape (n_samples,), default=None
+        Sample weights.
+
+    pos_label : int, float, bool or str, default=None
+        Label of the positive class. `pos_label` will be inferred in the
+        following manner:
+
+        * if `y_true` in {-1, 1} or {0, 1}, `pos_label` defaults to 1;
+        * else if `y_true` contains string, an error will be raised and
+          `pos_label` should be explicitly specified;
+        * otherwise, `pos_label` defaults to the greater label,
+          i.e. `np.unique(y_true)[-1]`.
+
+    labels : array-like of shape (n_classes,), default=None
+        Class labels when `y_proba.shape = (n_samples, n_classes)`.
+        If not provided, labels will be inferred from `y_true`.
+
+    Returns
+    -------
+    d2 : float
+        The D^2 score.
+
+    References
+    ----------
+    .. [1] `Wikipedia entry for the Brier Skill Score (BSS)
+           <https://en.wikipedia.org/wiki/Brier_score>`_.
+    """
+    if _num_samples(y_proba) < 2:
+        msg = "D^2 score is not well-defined with less than two samples."
+        warnings.warn(msg, UndefinedMetricWarning)
+        return float("nan")
+
+    # brier score of the fitted model
+    brier_score = brier_score_loss(
+        y_true=y_true,
+        y_proba=y_proba,
+        sample_weight=sample_weight,
+        pos_label=pos_label,
+        labels=labels,
+    )
+
+    # brier score of the reference or baseline model
+    y_true = column_or_1d(y_true)
+    weights = _check_sample_weight(sample_weight, y_true)
+    labels = np.unique(y_true if labels is None else labels)
+
+    mask = y_true[None, :] == labels[:, None]
+    label_counts = (mask * weights).sum(axis=1)
+    y_prob = label_counts / weights.sum()
+    y_proba_ref = np.tile(y_prob, (len(y_true), 1))
+    brier_score_ref = brier_score_loss(
+        y_true=y_true,
+        y_proba=y_proba_ref,
+        sample_weight=sample_weight,
+        pos_label=pos_label,
+        labels=labels,
+    )
+
+    return 1 - brier_score / brier_score_ref
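
The reference model above is built directly from weighted class frequencies. A small standalone sketch, not part of the diff, of just those lines, showing the shape of the resulting `y_proba_ref` for a toy weighted sample:

    import numpy as np

    y_true = np.array([0, 0, 1, 2])
    weights = np.array([1.0, 1.0, 2.0, 1.0])
    labels = np.unique(y_true)

    mask = y_true[None, :] == labels[:, None]        # (n_classes, n_samples) membership
    label_counts = (mask * weights).sum(axis=1)      # weighted count per class: [2., 2., 1.]
    y_prob = label_counts / weights.sum()            # class proportions: [0.4, 0.4, 0.2]
    y_proba_ref = np.tile(y_prob, (len(y_true), 1))  # every row predicts the same proportions

    print(y_proba_ref.shape)  # (4, 3)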

0 commit comments
