DOC Expand on sigmoid and isotonic in calibration.rst #17725

Merged: 13 commits, Jul 22, 2020.
doc/modules/calibration.rst (176 changes: 132 additions & 44 deletions)

When performing classification you often want not only to predict the class
label, but also obtain a probability of the respective label. This probability
gives you some kind of confidence on the prediction. Some models can give you
poor estimates of the class probabilities and some even do not support
probability prediction (e.g., some instances of
:class:`~sklearn.linear_model.SGDClassifier`).
The calibration module allows you to better calibrate
the probabilities of a given model, or to add support for probability
prediction.

Well calibrated classifiers are probabilistic classifiers for which the output
of the :term:`predict_proba` method can be directly interpreted as a confidence
level.
For instance, a well calibrated (binary) classifier should classify the samples
such that among the samples to which it gave a :term:`predict_proba` value
close to 0.8,
approximately 80% actually belong to the positive class.
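
This property can be checked empirically with
:func:`~sklearn.calibration.calibration_curve` (detailed in the next
section). A minimal sketch, assuming an illustrative dataset and model::

    from sklearn.calibration import calibration_curve
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Illustrative binary classification task.
    X, y = make_classification(n_samples=10000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression().fit(X_train, y_train)
    prob_pos = clf.predict_proba(X_test)[:, 1]

    # Bin the predicted probabilities and compare the mean prediction in
    # each bin with the observed fraction of positives; for a well
    # calibrated model, prob_true is close to prob_pred in every bin.
    prob_true, prob_pred = calibration_curve(y_test, prob_pos, n_bins=10)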

.. _calibration_curve:

Calibration curves
------------------

The y axis is the *fraction of positives*, i.e. the proportion of samples whose
class is the positive class (in each bin).
.. currentmodule:: sklearn.linear_model

:class:`LogisticRegression` returns well calibrated predictions by default as it directly
optimizes :ref:`log_loss`. In contrast, the other methods return biased probabilities;
with different biases per method:

.. currentmodule:: sklearn.naive_bayes
to 0 or 1 typically.
.. currentmodule:: sklearn.svm

Linear Support Vector Classification (:class:`LinearSVC`) shows an even more
sigmoid curve than :class:`~sklearn.ensemble.RandomForestClassifier`, which is
typical for maximum-margin methods (compare Niculescu-Mizil and Caruana [1]_),
which focus on difficult to classify samples that are close to the decision
boundary (the support vectors).

Calibrating a classifier
------------------------

.. currentmodule:: sklearn.calibration

Calibrating a classifier consists of fitting a regressor (called a
*calibrator*) that maps the output of the classifier (as given by
:term:`decision_function` or :term:`predict_proba`) to a calibrated probability
in [0, 1]. Denoting the output of the classifier for a given sample by :math:`f_i`,
the calibrator tries to predict :math:`p(y_i = 1 | f_i)`.

The samples that are used to fit the calibrator should not be the same
samples used to fit the classifier, as this would
introduce bias. The classifier performance on its training data would be
better than for novel data. Using the classifier output from training data
to fit the calibrator would thus result in a biased calibrator that maps to
probabilities closer to 0 and 1 than it should.

Usage
-----

The :class:`CalibratedClassifierCV` class is used to calibrate a classifier.

:class:`CalibratedClassifierCV` uses a cross-validation approach to fit both
the classifier and the regressor. The data is split into k
`(train_set, test_set)` couples (as determined by `cv`). The classifier
(`base_estimator`) is trained on the train set, and its predictions on the
test set are used to fit a regressor. This ensures that the data used to fit
the classifier is always disjoint from the data used to fit the calibrator.
After fitting, we end up with k
`(classifier, regressor)` couples where each regressor maps the output of
its corresponding classifier into [0, 1]. Each couple is exposed in the
`calibrated_classifiers_` attribute, where each entry is a calibrated
classifier with a :term:`predict_proba` method that outputs calibrated
probabilities. The output of :term:`predict_proba` is the average of the
predicted probabilities of the `k` estimators in the
`calibrated_classifiers_` list. The output of :term:`predict` is the class
that has the highest probability.
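
A minimal usage sketch; the base estimator and `cv` value below are only
illustrative (:class:`~sklearn.svm.LinearSVC` has no :term:`predict_proba` of
its own, so calibration also adds probability prediction)::

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=2000, random_state=0)

    # Fit k=3 (classifier, regressor) couples, one per CV fold.
    calibrated_clf = CalibratedClassifierCV(LinearSVC(), cv=3)
    calibrated_clf.fit(X, y)

    len(calibrated_clf.calibrated_classifiers_)  # 3
    # predict_proba averages the probabilities of the 3 calibrated couples.
    proba = calibrated_clf.predict_proba(X[:5])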

Alternatively an already fitted classifier can be calibrated by setting
`cv="prefit"`. In this case, the data is not split and all of it is used to
fit the regressor. It is up to the user to
make sure that the data used for fitting the classifier is disjoint from the
data used for fitting the regressor.
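
A sketch of this workflow, using an illustrative held-out split to keep the
two datasets disjoint::

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=2000, random_state=0)
    # The calibration set must be disjoint from the training set.
    X_train, X_calib, y_train, y_calib = train_test_split(X, y, random_state=0)

    clf = LinearSVC().fit(X_train, y_train)

    # cv="prefit": clf is left untouched and (X_calib, y_calib) is used
    # only to fit the calibration regressor.
    calibrated_clf = CalibratedClassifierCV(clf, cv="prefit")
    calibrated_clf.fit(X_calib, y_calib)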

:func:`sklearn.metrics.brier_score_loss` may be used to assess how
well a classifier is calibrated. However, this metric should be used with care
because a lower Brier score does not always mean a better calibrated model.
This is because the Brier score metric is a combination of calibration loss
and refinement loss. Calibration loss is defined as the mean squared deviation
from empirical probabilities derived from the slope of ROC segments.
Refinement loss can be defined as the expected optimal loss as measured by the
area under the optimal cost curve. As refinement loss can change
independently from calibration loss, a lower Brier score does not necessarily
mean a better calibrated model.
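
For example, with hand-picked values (the probabilities below are
illustrative)::

    from sklearn.metrics import brier_score_loss

    y_true = [0, 1, 1, 0]
    y_prob = [0.1, 0.9, 0.8, 0.3]

    # Mean squared difference between predicted probabilities and actual
    # outcomes: ((0.1)**2 + (0.1)**2 + (0.2)**2 + (0.3)**2) / 4 = 0.0375
    brier_score_loss(y_true, y_prob)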

:class:`CalibratedClassifierCV` supports the use of two 'calibration'
regressors: 'sigmoid' and 'isotonic'.

Sigmoid
^^^^^^^

The sigmoid regressor is based on Platt's logistic model [3]_:

.. math::
   p(y_i = 1 | f_i) = \frac{1}{1 + \exp(A f_i + B)}

where :math:`y_i` is the true label of sample :math:`i` and :math:`f_i`
is the output of the un-calibrated classifier for sample :math:`i`. :math:`A`
and :math:`B` are real numbers to be determined when fitting the regressor via
maximum likelihood.

The sigmoid method assumes the :ref:`calibration curve <calibration_curve>`
can be corrected by applying a sigmoid function to the raw predictions. This
assumption has been empirically justified in the case of :ref:`svm` with
common kernel functions on various benchmark datasets in section 2.1 of Platt
1999 [3]_ but does not necessarily hold in general. Additionally, the
logistic model works best if the calibration error is symmetrical, meaning
the classifier output for each binary class is normally distributed with
the same variance [6]_. This can be a problem for highly imbalanced
classification problems, where outputs do not have equal variance.

In general this method is most effective when the un-calibrated model is
under-confident and has similar calibration errors for both high and low
outputs.
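
Conceptually, fitting :math:`A` and :math:`B` by maximum likelihood amounts to
a one-dimensional logistic regression on the classifier outputs. The sketch
below approximates this with an (effectively) unregularized
:class:`~sklearn.linear_model.LogisticRegression`; the actual 'sigmoid'
calibrator additionally regularizes the targets as described in [3]_, which is
omitted here::

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=2000, random_state=0)
    X_train, X_calib, y_train, y_calib = train_test_split(X, y, random_state=0)

    clf = LinearSVC().fit(X_train, y_train)
    f = clf.decision_function(X_calib).reshape(-1, 1)  # raw scores f_i

    # A 1-d logistic regression on the raw scores learns sigma(w*f_i + b),
    # which matches the Platt model up to the signs of A and B.
    platt = LogisticRegression(C=1e10)  # effectively no regularization
    platt.fit(f, y_calib)
    p_calibrated = platt.predict_proba(f)[:, 1]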

Review comment (Member): Maybe we could also mention the fact that Platt
scaling assumes symmetric calibration errors, that is, it assumes that the
over-confidence errors for low values of f_i have the same magnitude as for
high values of f_i. This is not necessarily the case for highly imbalanced
classification problems where the un-calibrated classifier can have asymmetric
calibration errors.

This is just an intuition (I have not run experiments to confirm this happens
in practice) though.


Review comment (@ogrisel, Jun 26, 2020): Thanks for the reference, Beta
calibration looks very nice, I did not know about it. Unfortunately it does
not meet the criterion for inclusion in scikit-learn in terms of citations,
but honestly I wouldn't mind considering a PR to add it as a third option to
CalibratedClassifierCV.

Isotonic
^^^^^^^^

The 'isotonic' method fits a non-parametric isotonic regressor, which outputs
a step-wise non-decreasing function (see :mod:`sklearn.isotonic`). It
minimizes:

.. math::
   \sum_{i=1}^{n} (y_i - \hat{f}_i)^2

subject to :math:`\hat{f}_i \geq \hat{f}_j` whenever
:math:`f_i \geq f_j`. :math:`y_i` is the true
label of sample :math:`i` and :math:`\hat{f}_i` is the output of the
calibrated classifier for sample :math:`i` (i.e., the calibrated probability).
This method is more general when compared to 'sigmoid' as the only restriction
is that the mapping function is monotonically increasing. It is thus more
powerful as it can correct any monotonic distortion of the un-calibrated model.
However, it is more prone to overfitting, especially on small datasets [5]_.

Overall, 'isotonic' will perform as well as or better than 'sigmoid' when
there is enough data (greater than ~ 1000 samples) to avoid overfitting [1]_.
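
A sketch using :class:`~sklearn.isotonic.IsotonicRegression` directly on toy
scores (the values are illustrative)::

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    # Toy un-calibrated scores f_i and true labels y_i.
    f = np.array([0.1, 0.3, 0.35, 0.6, 0.8, 0.9])
    y = np.array([0, 0, 1, 0, 1, 1])

    # out_of_bounds="clip" keeps predictions for unseen scores in [0, 1].
    iso = IsotonicRegression(y_min=0, y_max=1, out_of_bounds="clip")
    iso.fit(f, y)

    # The learned mapping is a step-wise, non-decreasing function of f.
    iso.predict([0.2, 0.5, 0.85])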

Multiclass support
^^^^^^^^^^^^^^^^^^

Both isotonic and sigmoid regressors only
support 1-dimensional data (e.g., binary classification output) but are
extended for multiclass classification if the `base_estimator` supports
multiclass predictions. For multiclass predictions,
:class:`CalibratedClassifierCV` calibrates for
each class separately in a :ref:`ovr_classification` fashion [4]_. When
predicting probabilities, the calibrated probabilities for each class
are predicted separately. As those probabilities do not necessarily sum to
one, a postprocessing step is performed to normalize them.

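A sketch on an illustrative 3-class problem, checking that the normalized
probabilities sum to one::

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC

    # Illustrative 3-class task (LinearSVC handles multiclass natively).
    X, y = make_classification(n_samples=3000, n_classes=3,
                               n_informative=6, random_state=0)

    calibrated_clf = CalibratedClassifierCV(LinearSVC(), cv=3)
    calibrated_clf.fit(X, y)

    proba = calibrated_clf.predict_proba(X[:5])
    # After the one-vs-rest calibration and normalization step:
    proba.sum(axis=1)  # array([1., 1., 1., 1., 1.])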

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_calibration_plot_calibration_curve.py`

.. topic:: References:

.. [1] `Predicting Good Probabilities with Supervised Learning
<https://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf>`_,
A. Niculescu-Mizil & R. Caruana, ICML 2005

.. [2] `On the combination of forecast probabilities for
consecutive precipitation periods.
<https://journals.ametsoc.org/waf/article/5/4/640/40179>`_
Wea. Forecasting, 5, 640–650., Wilks, D. S., 1990a

.. [3] `Probabilistic Outputs for Support Vector Machines and Comparisons
to Regularized Likelihood Methods.
<https://www.cs.colorado.edu/~mozer/Teaching/syllabi/6622/papers/Platt1999.pdf>`_
J. Platt, (1999)

.. [4] `Transforming Classifier Scores into Accurate Multiclass
Probability Estimates.
<https://dl.acm.org/doi/pdf/10.1145/775047.775151>`_
B. Zadrozny & C. Elkan, (KDD 2002)

.. [5] `Predicting accurate probabilities with a ranking loss.
<https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4180410/>`_
Menon AK, Jiang XJ, Vembu S, Elkan C, Ohno-Machado L.
Proc Int Conf Mach Learn. 2012;2012:703-710

.. [6] `Beyond sigmoids: How to obtain well-calibrated probabilities from
binary classifiers with beta calibration
<https://projecteuclid.org/euclid.ejs/1513306867>`_
Kull, M., Silva Filho, T. M., & Flach, P. (2017).