[MRG] Reliability curves for calibration of predict_proba #3574
Conversation
"set to False.") | ||
|
||
bin_width = 1.0 / bins | ||
bin_centers = np.linspace(0, 1.0 - bin_width, bins) + bin_width / 2 |
I suspect this algorithm is equivalent to:

bin_width = 1.0 / bins
# TODO: check boundary cases
binned = np.searchsorted(np.linspace(0, 1.0, bins), y_score)
bin_sums = np.bincount(binned, weights=y_score, minlength=bins)
bin_positives = np.bincount(binned, weights=y_true, minlength=bins)
# cast to float so that empty bins can be marked with NaN
bin_total = np.bincount(binned, minlength=bins).astype(float)
bin_total[bin_total == 0] = np.nan
y_score_bin_mean = bin_sums / bin_total
empirical_prob_pos = bin_positives / bin_total
You need a test. I think you should also follow the convention that we don't know whether
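A minimal, self-contained sketch of what such an equivalence test could look like; the helper functions, the synthetic data, and the pytest-style layout are illustrative assumptions, not code from this PR:

import numpy as np


def binned_means_loop(y_true, y_score, bins=10):
    # Reference implementation: average y_score and y_true per equal-width
    # bin using an explicit Python loop.
    edges = np.linspace(0.0, 1.0 + 1e-8, bins + 1)
    y_score_bin_mean = np.full(bins, np.nan)
    empirical_prob_pos = np.full(bins, np.nan)
    for i in range(bins):
        mask = (y_score >= edges[i]) & (y_score < edges[i + 1])
        if mask.any():
            y_score_bin_mean[i] = y_score[mask].mean()
            empirical_prob_pos[i] = y_true[mask].mean()
    return y_score_bin_mean, empirical_prob_pos


def binned_means_vectorized(y_true, y_score, bins=10):
    # Vectorized variant in the spirit of the np.bincount suggestion above,
    # using the same bin edges as the loop version.
    edges = np.linspace(0.0, 1.0 + 1e-8, bins + 1)
    binids = np.digitize(y_score, edges) - 1
    bin_sums = np.bincount(binids, weights=y_score, minlength=bins)
    bin_positives = np.bincount(binids, weights=y_true, minlength=bins)
    bin_total = np.bincount(binids, minlength=bins).astype(float)
    bin_total[bin_total == 0] = np.nan  # empty bins yield NaN means
    return bin_sums / bin_total, bin_positives / bin_total


def test_binned_means_equivalent():
    rng = np.random.RandomState(0)
    y_score = rng.uniform(0.0, 1.0, size=1000)
    y_true = (rng.uniform(0.0, 1.0, size=1000) < y_score).astype(float)
    loop = binned_means_loop(y_true, y_score)
    vec = binned_means_vectorized(y_true, y_score)
    for expected, got in zip(loop, vec):
        np.testing.assert_allclose(got, expected)

Running this under pytest (or calling test_binned_means_equivalent() directly) exercises the loop and the bincount formulations on the same bin edges and checks that the per-bin means agree.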
This was already implemented in PR #1176 as calibration_plot.
Plotting the probability histogram below the calibration plot as shown in your notebook is a good idea!
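For illustration, here is a minimal matplotlib sketch of that two-panel layout; the prob_true, prob_pred, and y_prob arrays are synthetic placeholders standing in for the output of a reliability/calibration curve and a classifier's predicted probabilities:

import numpy as np
import matplotlib.pyplot as plt

# Synthetic placeholders: replace with real predicted probabilities and the
# per-bin values returned by a reliability/calibration curve function.
rng = np.random.RandomState(0)
y_prob = rng.uniform(0, 1, size=1000)                       # predicted probabilities
prob_pred = np.linspace(0.05, 0.95, 10)                     # mean predicted prob per bin
prob_true = np.clip(prob_pred + 0.05 * rng.randn(10), 0, 1) # fraction of positives per bin

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True,
                               gridspec_kw={"height_ratios": [3, 1]})
ax1.plot([0, 1], [0, 1], "k:", label="perfectly calibrated")
ax1.plot(prob_pred, prob_true, "s-", label="classifier")
ax1.set_ylabel("Fraction of positives")
ax1.legend(loc="lower right")

# Histogram of predicted probabilities below the reliability curve.
ax2.hist(y_prob, bins=10, range=(0, 1), histtype="step")
ax2.set_xlabel("Mean predicted probability")
ax2.set_ylabel("Count")
plt.show()

The histogram makes it clear which parts of the reliability curve are estimated from many predictions and which bins are only sparsely populated.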
I'll "steal" your idea of histograms and update the PR ASAP. I'll ping you
Sure, feel free to reuse anything you find useful. I didn't know about PR #1176; it's really cool. Two things from reliability_curve of this PR might be useful for calibration_plot from the other:
Let me know if I can help out with any of these issues.
Closing as duplicate, then?
Jan, I won't find the time to work on this in the next few days. Feel free
I would keep it in classification.py
I sent you a PR with the discussed content.
@agramfort, are there any plans for adding sample_weight support to calibration_curve?
Using sample_weight would make sense, yes.
@agramfort, how would sample_weight be incorporated into calibration_curve? For reference, the current implementation is:

import numpy as np

from sklearn.metrics.classification import _check_binary_probabilistic_predictions
from sklearn.utils import column_or_1d


def calibration_curve(y_true, y_prob, normalize=False, n_bins=5):
    """Compute true and predicted probabilities for a calibration curve.

    Read more in the :ref:`User Guide <calibration>`.

    Parameters
    ----------
    y_true : array, shape (n_samples,)
        True targets.

    y_prob : array, shape (n_samples,)
        Probabilities of the positive class.

    normalize : bool, optional, default=False
        Whether y_prob needs to be normalized into the bin [0, 1], i.e. is not
        a proper probability. If True, the smallest value in y_prob is mapped
        onto 0 and the largest one onto 1.

    n_bins : int
        Number of bins. A bigger number requires more data.

    Returns
    -------
    prob_true : array, shape (n_bins,)
        The true probability in each bin (fraction of positives).

    prob_pred : array, shape (n_bins,)
        The mean predicted probability in each bin.

    References
    ----------
    Alexandru Niculescu-Mizil and Rich Caruana (2005) Predicting Good
    Probabilities With Supervised Learning, in Proceedings of the 22nd
    International Conference on Machine Learning (ICML).
    See section 4 (Qualitative Analysis of Predictions).
    """
    y_true = column_or_1d(y_true)
    y_prob = column_or_1d(y_prob)

    if normalize:  # Normalize predicted values into interval [0, 1]
        y_prob = (y_prob - y_prob.min()) / (y_prob.max() - y_prob.min())
    elif y_prob.min() < 0 or y_prob.max() > 1:
        raise ValueError("y_prob has values outside [0, 1] and normalize is "
                         "set to False.")

    y_true = _check_binary_probabilistic_predictions(y_true, y_prob)

    bins = np.linspace(0., 1. + 1e-8, n_bins + 1)
    binids = np.digitize(y_prob, bins) - 1

    bin_sums = np.bincount(binids, weights=y_prob, minlength=len(bins))
    bin_true = np.bincount(binids, weights=y_true, minlength=len(bins))
    bin_total = np.bincount(binids, minlength=len(bins))

    nonzero = bin_total != 0
    prob_true = (bin_true[nonzero] / bin_total[nonzero])
    prob_pred = (bin_sums[nonzero] / bin_total[nonzero])

    return prob_true, prob_pred

By the way, my best guess would be to make the following changes:

bin_sums = np.bincount(binids, weights=sample_weight * y_prob, minlength=len(bins))
bin_true = np.bincount(binids, weights=sample_weight * y_true, minlength=len(bins))
bin_total = np.bincount(binids, weights=sample_weight, minlength=len(bins))

But for some reason my results do not look right after such changes.
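For reference, here is a self-contained sketch of a weighted variant along those lines; weighted_calibration_curve is a hypothetical name, and this is only an illustration of the weighted-bincount idea, not scikit-learn's API:

import numpy as np


def weighted_calibration_curve(y_true, y_prob, sample_weight=None, n_bins=5):
    # Illustrative sketch of a sample_weight-aware calibration curve.
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones_like(y_prob)
    else:
        sample_weight = np.asarray(sample_weight, dtype=float)

    bins = np.linspace(0., 1. + 1e-8, n_bins + 1)
    binids = np.digitize(y_prob, bins) - 1

    # Weighted per-bin sums; each sample contributes in proportion to its weight.
    bin_sums = np.bincount(binids, weights=sample_weight * y_prob, minlength=len(bins))
    bin_true = np.bincount(binids, weights=sample_weight * y_true, minlength=len(bins))
    bin_total = np.bincount(binids, weights=sample_weight, minlength=len(bins))

    nonzero = bin_total > 0
    prob_true = bin_true[nonzero] / bin_total[nonzero]
    prob_pred = bin_sums[nonzero] / bin_total[nonzero]
    return prob_true, prob_pred

A quick sanity check on such a variant: integer weights should give the same result as repeating each sample weight times, and sample_weight=None should reproduce the unweighted curve.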
Your idea of just multiplying the weights by sample_weight would also have been my first guess. In which sense do the results not look as expected? BTW: could you open an issue about adding sample_weight to calibration_curve?
@jmetzen, thank you for your reply. I will open up a new issue. Hopefully when I put in a new PR I can give an explanation that clarifies the motivation behind my request to add sample_weight. By the way, it was your personal blog, jmetzen.github, that got me started on calibration curves. Thanks for the great post; it helped me catch an issue in my analysis that had almost gone unnoticed.
This PR adds the reliability_curve metric to metrics/ranking.py. Reliability diagrams allow checking whether the predicted probabilities of a binary classifier are well calibrated. The PR also contains an example comparing how well the predicted probabilities of different classifiers are calibrated. A notebook with the same example can be found at http://jmetzen.github.io/2014-08-16/reliability-diagram.html
For some background on reliability diagrams, please refer to the paper "Predicting Good Probabilities with Supervised Learning".
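For completeness, a short usage sketch of the API this work eventually fed into, sklearn.calibration.calibration_curve; the dataset, classifier, and plotting choices below are arbitrary illustrations:

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Toy binary problem; Gaussian naive Bayes is known to produce poorly
# calibrated probabilities, so its curve deviates from the diagonal.
X, y = make_classification(n_samples=10000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = GaussianNB().fit(X_train, y_train)
y_prob = clf.predict_proba(X_test)[:, 1]

prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)

plt.plot([0, 1], [0, 1], "k:", label="perfectly calibrated")
plt.plot(prob_pred, prob_true, "s-", label="GaussianNB")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()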