
Commit 6a27d4d

DOC Better UG for calibration (#16175)

1 parent 10e7b2b commit 6a27d4d

3 files changed: +122 −182 lines

doc/modules/calibration.rst

Lines changed: 107 additions & 160 deletions
@@ -19,9 +19,16 @@ Well calibrated classifiers are probabilistic classifiers for which the output
 of the predict_proba method can be directly interpreted as a confidence level.
 For instance, a well calibrated (binary) classifier should classify the samples
 such that among the samples to which it gave a predict_proba value close to 0.8,
-approximately 80% actually belong to the positive class. The following plot compares
-how well the probabilistic predictions of different classifiers are calibrated,
-using :func:`calibration_curve`:
+approximately 80% actually belong to the positive class.
+
+Calibration curves
+------------------
+
+The following plot compares how well the probabilistic predictions of
+different classifiers are calibrated, using :func:`calibration_curve`.
+The x axis represents the average predicted probability in each bin. The
+y axis is the *fraction of positives*, i.e. the proportion of samples whose
+class is the positive class (in each bin).

 .. figure:: ../auto_examples/calibration/images/sphx_glr_plot_compare_calibration_001.png
    :target: ../auto_examples/calibration/plot_compare_calibration.html
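
As a sketch of how the data behind such a curve can be computed with
:func:`calibration_curve` (the synthetic dataset, classifier and `n_bins`
below are illustrative, not taken from this commit)::

    from sklearn.calibration import calibration_curve
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=10000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    clf = GaussianNB().fit(X_train, y_train)
    prob_pos = clf.predict_proba(X_test)[:, 1]  # probability of positive class

    # fraction_of_positives is the y axis, mean_predicted_value the x axis
    fraction_of_positives, mean_predicted_value = calibration_curve(
        y_test, prob_pos, n_bins=10)
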
@@ -35,177 +42,117 @@ with different biases per method:

 .. currentmodule:: sklearn.naive_bayes

-* :class:`GaussianNB` tends to push probabilities to 0 or 1 (note the
-  counts in the histograms). This is mainly because it makes the assumption
-  that features are conditionally independent given the class, which is not
-  the case in this dataset which contains 2 redundant features.
+:class:`GaussianNB` tends to push probabilities to 0 or 1 (note the counts
+in the histograms). This is mainly because it makes the assumption that
+features are conditionally independent given the class, which is not the
+case in this dataset which contains 2 redundant features.

 .. currentmodule:: sklearn.ensemble

-* :class:`RandomForestClassifier` shows the opposite behavior: the histograms
-  show peaks at approximately 0.2 and 0.9 probability, while probabilities close to
-  0 or 1 are very rare. An explanation for this is given by Niculescu-Mizil
-  and Caruana [4]_: "Methods such as bagging and random forests that average
-  predictions from a base set of models can have difficulty making predictions
-  near 0 and 1 because variance in the underlying base models will bias
-  predictions that should be near zero or one away from these values. Because
-  predictions are restricted to the interval [0,1], errors caused by variance
-  tend to be one-sided near zero and one. For example, if a model should
-  predict p = 0 for a case, the only way bagging can achieve this is if all
-  bagged trees predict zero. If we add noise to the trees that bagging is
-  averaging over, this noise will cause some trees to predict values larger
-  than 0 for this case, thus moving the average prediction of the bagged
-  ensemble away from 0. We observe this effect most strongly with random
-  forests because the base-level trees trained with random forests have
-  relatively high variance due to feature subsetting." As a result, the
-  calibration curve also referred to as the reliability diagram (Wilks 1995 [5]_) shows a
-  characteristic sigmoid shape, indicating that the classifier could trust its
-  "intuition" more and return probabilities closer to 0 or 1 typically.
+:class:`RandomForestClassifier` shows the opposite behavior: the histograms
+show peaks at approximately 0.2 and 0.9 probability, while probabilities
+close to 0 or 1 are very rare. An explanation for this is given by
+Niculescu-Mizil and Caruana [1]_: "Methods such as bagging and random
+forests that average predictions from a base set of models can have
+difficulty making predictions near 0 and 1 because variance in the
+underlying base models will bias predictions that should be near zero or one
+away from these values. Because predictions are restricted to the interval
+[0,1], errors caused by variance tend to be one-sided near zero and one. For
+example, if a model should predict p = 0 for a case, the only way bagging
+can achieve this is if all bagged trees predict zero. If we add noise to the
+trees that bagging is averaging over, this noise will cause some trees to
+predict values larger than 0 for this case, thus moving the average
+prediction of the bagged ensemble away from 0. We observe this effect most
+strongly with random forests because the base-level trees trained with
+random forests have relatively high variance due to feature subsetting." As
+a result, the calibration curve, also referred to as the reliability diagram
+(Wilks 1995 [2]_), shows a characteristic sigmoid shape, indicating that the
+classifier could trust its "intuition" more and typically return
+probabilities closer to 0 or 1.

 .. currentmodule:: sklearn.svm

-* Linear Support Vector Classification (:class:`LinearSVC`) shows an even more sigmoid curve
-  as the RandomForestClassifier, which is typical for maximum-margin methods
-  (compare Niculescu-Mizil and Caruana [4]_), which focus on hard samples
-  that are close to the decision boundary (the support vectors).
-
-.. currentmodule:: sklearn.calibration
-
-Two approaches for performing calibration of probabilistic predictions are
-provided: a parametric approach based on Platt's sigmoid model and a
-non-parametric approach based on isotonic regression (:mod:`sklearn.isotonic`).
-Probability calibration should be done on new data not used for model fitting.
-The class :class:`CalibratedClassifierCV` uses a cross-validation generator and
-estimates for each split the model parameter on the train samples and the
-calibration of the test samples. The probabilities predicted for the
-folds are then averaged. Already fitted classifiers can be calibrated by
-:class:`CalibratedClassifierCV` via the parameter cv="prefit". In this case,
-the user has to take care manually that data for model fitting and calibration
-are disjoint.
-
-The following images demonstrate the benefit of probability calibration.
-The first image present a dataset with 2 classes and 3 blobs of
-data. The blob in the middle contains random samples of each class.
-The probability for the samples in this blob should be 0.5.
-
-.. figure:: ../auto_examples/calibration/images/sphx_glr_plot_calibration_001.png
-   :target: ../auto_examples/calibration/plot_calibration.html
-   :align: center
-
-The following image shows on the data above the estimated probability
-using a Gaussian naive Bayes classifier without calibration,
-with a sigmoid calibration and with a non-parametric isotonic
-calibration. One can observe that the non-parametric model
-provides the most accurate probability estimates for samples
-in the middle, i.e., 0.5.
-
-.. figure:: ../auto_examples/calibration/images/sphx_glr_plot_calibration_002.png
-   :target: ../auto_examples/calibration/plot_calibration.html
-   :align: center
-
-.. currentmodule:: sklearn.metrics
-
-The following experiment is performed on an artificial dataset for binary
-classification with 100,000 samples (1,000 of them are used for model fitting)
-with 20 features. Of the 20 features, only 2 are informative and 10 are
-redundant. The figure shows the estimated probabilities obtained with
-logistic regression, a linear support-vector classifier (SVC), and linear SVC with
-both isotonic calibration and sigmoid calibration.
-The Brier score is a metric which is a combination of calibration loss and refinement loss,
-:func:`brier_score_loss`, reported in the legend (the smaller the better).
-Calibration loss is defined as the mean squared deviation from empirical probabilities
-derived from the slope of ROC segments. Refinement loss can be defined as the expected
-optimal loss as measured by the area under the optimal cost curve.
-
-.. figure:: ../auto_examples/calibration/images/sphx_glr_plot_calibration_curve_002.png
-   :target: ../auto_examples/calibration/plot_calibration_curve.html
-   :align: center
+Linear Support Vector Classification (:class:`LinearSVC`) shows an even more
+sigmoid curve than the RandomForestClassifier, which is typical for
+maximum-margin methods (compare Niculescu-Mizil and Caruana [1]_), which
+focus on hard samples that are close to the decision boundary (the support
+vectors).

-One can observe here that logistic regression is well calibrated as its curve is
-nearly diagonal. Linear SVC's calibration curve or reliability diagram has a
-sigmoid curve, which is typical for an under-confident classifier. In the case of
-LinearSVC, this is caused by the margin property of the hinge loss, which lets
-the model focus on hard samples that are close to the decision boundary
-(the support vectors). Both kinds of calibration can fix this issue and yield
-nearly identical results. The next figure shows the calibration curve of
-Gaussian naive Bayes on the same data, with both kinds of calibration and also
-without calibration.
-
-.. figure:: ../auto_examples/calibration/images/sphx_glr_plot_calibration_curve_001.png
-   :target: ../auto_examples/calibration/plot_calibration_curve.html
-   :align: center
-
-One can see that Gaussian naive Bayes performs very badly but does so in an
-other way than linear SVC: While linear SVC exhibited a sigmoid calibration
-curve, Gaussian naive Bayes' calibration curve has a transposed-sigmoid shape.
-This is typical for an over-confident classifier. In this case, the classifier's
-overconfidence is caused by the redundant features which violate the naive Bayes
-assumption of feature-independence.
-
-Calibration of the probabilities of Gaussian naive Bayes with isotonic
-regression can fix this issue as can be seen from the nearly diagonal
-calibration curve. Sigmoid calibration also improves the brier score slightly,
-albeit not as strongly as the non-parametric isotonic calibration. This is an
-intrinsic limitation of sigmoid calibration, whose parametric form assumes a
-sigmoid rather than a transposed-sigmoid curve. The non-parametric isotonic
-calibration model, however, makes no such strong assumptions and can deal with
-either shape, provided that there is sufficient calibration data. In general,
-sigmoid calibration is preferable in cases where the calibration curve is sigmoid
-and where there is limited calibration data, while isotonic calibration is
-preferable for non-sigmoid calibration curves and in situations where large
-amounts of data are available for calibration.
+Calibrating a classifier
+------------------------

 .. currentmodule:: sklearn.calibration

-:class:`CalibratedClassifierCV` can also deal with classification tasks that
-involve more than two classes if the base estimator can do so. In this case,
-the classifier is calibrated first for each class separately in an one-vs-rest
-fashion. When predicting probabilities for unseen data, the calibrated
-probabilities for each class are predicted separately. As those probabilities
-do not necessarily sum to one, a postprocessing is performed to normalize them.
-
-The next image illustrates how sigmoid calibration changes predicted
-probabilities for a 3-class classification problem. Illustrated is the standard
-2-simplex, where the three corners correspond to the three classes. Arrows point
-from the probability vectors predicted by an uncalibrated classifier to the
-probability vectors predicted by the same classifier after sigmoid calibration
-on a hold-out validation set. Colors indicate the true class of an instance
-(red: class 1, green: class 2, blue: class 3).
-
-.. figure:: ../auto_examples/calibration/images/sphx_glr_plot_calibration_multiclass_001.png
-   :target: ../auto_examples/calibration/plot_calibration_multiclass.html
-   :align: center
-
-The base classifier is a random forest classifier with 25 base estimators
-(trees). If this classifier is trained on all 800 training datapoints, it is
-overly confident in its predictions and thus incurs a large log-loss.
-Calibrating an identical classifier, which was trained on 600 datapoints, with
-method='sigmoid' on the remaining 200 datapoints reduces the confidence of the
-predictions, i.e., moves the probability vectors from the edges of the simplex
-towards the center:
-
-.. figure:: ../auto_examples/calibration/images/sphx_glr_plot_calibration_multiclass_002.png
-   :target: ../auto_examples/calibration/plot_calibration_multiclass.html
-   :align: center
-
-This calibration results in a lower log-loss. Note that an alternative would
-have been to increase the number of base estimators which would have resulted in
-a similar decrease in log-loss.
+Calibrating a classifier consists in fitting a regressor (called a
+*calibrator*) that maps the output of the classifier (as given by
+:term:`predict` or :term:`predict_proba`) to a calibrated probability in [0,
+1]. Denoting the output of the classifier for a given sample by :math:`f_i`,
+the calibrator tries to predict :math:`p(y_i = 1 | f_i)`.
+
+The samples that are used to train the calibrator should not be used to
+train the target classifier.
+
+Usage
+-----
+
+The :class:`CalibratedClassifierCV` class is used to calibrate a classifier.
+
+:class:`CalibratedClassifierCV` uses a cross-validation approach to fit both
+the classifier and the regressor. For each of the k `(trainset, testset)`
+couples, a classifier is trained on the train set, and its predictions on
+the test set are used to fit a regressor. We end up with k
+`(classifier, regressor)` couples where each regressor maps the output of
+its corresponding classifier into [0, 1]. Each couple is exposed in the
+`calibrated_classifiers_` attribute, where each entry is a calibrated
+classifier with a :term:`predict_proba` method that outputs calibrated
+probabilities. The output of :term:`predict_proba` for the main
+:class:`CalibratedClassifierCV` instance corresponds to the average of the
+predicted probabilities of the `k` estimators in the
+`calibrated_classifiers_` list. The output of :term:`predict` is the class
+that has the highest probability.
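
A minimal sketch of this workflow on a synthetic dataset (the base
estimator, `cv=3` and the data below are illustrative)::

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=1000, random_state=0)

    # 3-fold cross-validation yields 3 (classifier, regressor) couples.
    calibrated_clf = CalibratedClassifierCV(GaussianNB(), method='sigmoid', cv=3)
    calibrated_clf.fit(X, y)

    len(calibrated_clf.calibrated_classifiers_)  # 3
    proba = calibrated_clf.predict_proba(X)  # average over the 3 couples
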
+
+The regressor that is used for calibration depends on the `method`
+parameter. `'sigmoid'` corresponds to a parametric approach based on Platt's
+logistic model [3]_, i.e. :math:`p(y_i = 1 | f_i)` is modeled as
+:math:`\sigma(A f_i + B)` where :math:`\sigma` is the logistic function, and
+:math:`A` and :math:`B` are real numbers to be determined when fitting the
+regressor via maximum likelihood. `'isotonic'` will instead fit a
+non-parametric isotonic regressor, which outputs a step-wise non-decreasing
+function (see :mod:`sklearn.isotonic`).
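
The `'sigmoid'` method is in essence a one-dimensional logistic regression
on the classifier output. A hand-rolled sketch of that idea (an illustration
only, not the internal implementation of :class:`CalibratedClassifierCV`)::

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=2000, random_state=0)
    X_train, X_calib, y_train, y_calib = train_test_split(X, y, random_state=0)

    clf = LinearSVC(max_iter=10000).fit(X_train, y_train)
    f = clf.decision_function(X_calib).reshape(-1, 1)  # classifier outputs f_i

    # Fit sigma(A * f_i + B) by maximum likelihood: a 1-d logistic regression
    # (C is set large to approximate the unpenalized likelihood).
    calibrator = LogisticRegression(C=1e6).fit(f, y_calib)
    calibrated_proba = calibrator.predict_proba(f)[:, 1]
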
+
+An already fitted classifier can be calibrated by setting `cv="prefit"`. In
+this case, the data is only used to fit the regressor. It is up to the user
+to make sure that the data used for fitting the classifier is disjoint from
+the data used for fitting the regressor.
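
For example (keeping the two sets disjoint is the user's responsibility;
the split below is illustrative)::

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_calib, y_train, y_calib = train_test_split(X, y, random_state=0)

    clf = GaussianNB().fit(X_train, y_train)  # already fitted classifier

    # cv="prefit": fit() below only trains the calibrator, on held-out data.
    calibrated_clf = CalibratedClassifierCV(clf, method='isotonic', cv='prefit')
    calibrated_clf.fit(X_calib, y_calib)
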
+
+:class:`CalibratedClassifierCV` can calibrate probabilities in a multiclass
+setting if the base estimator supports multiclass predictions. The classifier
+is calibrated first for each class separately in a one-vs-rest fashion [4]_.
+When predicting probabilities, the calibrated probabilities for each class
+are predicted separately. As those probabilities do not necessarily sum to
+one, a postprocessing is performed to normalize them.
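
The normalization step simply divides each row of one-vs-rest probabilities
by its sum; schematically (the `ovr_proba` values are hypothetical)::

    import numpy as np

    # Hypothetical calibrated one-vs-rest probabilities, 2 samples x 3 classes.
    ovr_proba = np.array([[0.7, 0.4, 0.1],
                          [0.2, 0.3, 0.4]])

    # Normalize each row so that the class probabilities sum to one.
    proba = ovr_proba / ovr_proba.sum(axis=1, keepdims=True)
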
+
+The :func:`sklearn.metrics.brier_score_loss` may be used to evaluate how
+well a classifier is calibrated.
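
For instance, on a few illustrative predictions (the Brier score is the
mean squared difference between predicted probabilities and outcomes)::

    from sklearn.metrics import brier_score_loss

    y_true = [0, 1, 1, 0]
    y_prob = [0.1, 0.9, 0.8, 0.3]
    brier_score_loss(y_true, y_prob)  # 0.0375, the smaller the better
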
+
+.. topic:: Examples:
+
+   * :ref:`sphx_glr_auto_examples_calibration_plot_calibration_curve.py`
+   * :ref:`sphx_glr_auto_examples_calibration_plot_calibration_multiclass.py`
+   * :ref:`sphx_glr_auto_examples_calibration_plot_calibration.py`
+   * :ref:`sphx_glr_auto_examples_calibration_plot_compare_calibration.py`

 .. topic:: References:

-   * Obtaining calibrated probability estimates from decision trees
-     and naive Bayesian classifiers, B. Zadrozny & C. Elkan, ICML 2001
-
-   * Transforming Classifier Scores into Accurate Multiclass
-     Probability Estimates, B. Zadrozny & C. Elkan, (KDD 2002)
-
-   * Probabilistic Outputs for Support Vector Machines and Comparisons to
-     Regularized Likelihood Methods, J. Platt, (1999)
-
-   .. [4] Predicting Good Probabilities with Supervised Learning,
+   .. [1] Predicting Good Probabilities with Supervised Learning,
          A. Niculescu-Mizil & R. Caruana, ICML 2005

-   .. [5] On the combination of forecast probabilities for
+   .. [2] On the combination of forecast probabilities for
         consecutive precipitation periods. Wea. Forecasting, 5, 640–650.,
         Wilks, D. S., 1990a
+
+   .. [3] Probabilistic Outputs for Support Vector Machines and Comparisons
+          to Regularized Likelihood Methods, J. Platt, (1999)
+
+   .. [4] Transforming Classifier Scores into Accurate Multiclass
+          Probability Estimates, B. Zadrozny & C. Elkan, (KDD 2002)
