@@ -19,9 +19,16 @@ Well calibrated classifiers are probabilistic classifiers for which the output
of the predict_proba method can be directly interpreted as a confidence level.
For instance, a well calibrated (binary) classifier should classify the samples
such that among the samples to which it gave a predict_proba value close to 0.8,
- approximately 80% actually belong to the positive class. The following plot compares
- how well the probabilistic predictions of different classifiers are calibrated,
- using :func:`calibration_curve`:
+ approximately 80% actually belong to the positive class.
+
+ Calibration curves
+ ------------------
+
+ The following plot compares how well the probabilistic predictions of
+ different classifiers are calibrated, using :func:`calibration_curve`.
+ The x axis represents the average predicted probability in each bin. The
+ y axis is the *fraction of positives*, i.e. the proportion of samples whose
+ class is the positive class (in each bin).
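+
+ A minimal sketch of how the points of such a curve can be computed (the
+ synthetic dataset and the logistic regression model below are illustrative
+ assumptions, not part of the example script)::
+
+     from sklearn.datasets import make_classification
+     from sklearn.model_selection import train_test_split
+     from sklearn.linear_model import LogisticRegression
+     from sklearn.calibration import calibration_curve
+
+     X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
+     X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
+
+     clf = LogisticRegression().fit(X_train, y_train)
+     prob_pos = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
+
+     # y axis values (fraction of positives) and x axis values (mean predicted
+     # probability), one pair per bin
+     fraction_of_positives, mean_predicted_value = calibration_curve(
+         y_test, prob_pos, n_bins=10)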

.. figure:: ../auto_examples/calibration/images/sphx_glr_plot_compare_calibration_001.png
   :target: ../auto_examples/calibration/plot_compare_calibration.html
@@ -35,177 +42,117 @@ with different biases per method:

.. currentmodule:: sklearn.naive_bayes

- * :class:`GaussianNB` tends to push probabilities to 0 or 1 (note the
-   counts in the histograms). This is mainly because it makes the assumption
-   that features are conditionally independent given the class, which is not
-   the case in this dataset which contains 2 redundant features.
+ :class:`GaussianNB` tends to push probabilities to 0 or 1 (note the counts
+ in the histograms). This is mainly because it makes the assumption that
+ features are conditionally independent given the class, which is not the
+ case in this dataset which contains 2 redundant features.

.. currentmodule:: sklearn.ensemble

- * :class:`RandomForestClassifier` shows the opposite behavior: the histograms
-   show peaks at approximately 0.2 and 0.9 probability, while probabilities close to
-   0 or 1 are very rare. An explanation for this is given by Niculescu-Mizil
-   and Caruana [4]_: "Methods such as bagging and random forests that average
-   predictions from a base set of models can have difficulty making predictions
-   near 0 and 1 because variance in the underlying base models will bias
-   predictions that should be near zero or one away from these values. Because
-   predictions are restricted to the interval [0,1], errors caused by variance
-   tend to be one-sided near zero and one. For example, if a model should
-   predict p = 0 for a case, the only way bagging can achieve this is if all
-   bagged trees predict zero. If we add noise to the trees that bagging is
-   averaging over, this noise will cause some trees to predict values larger
-   than 0 for this case, thus moving the average prediction of the bagged
-   ensemble away from 0. We observe this effect most strongly with random
-   forests because the base-level trees trained with random forests have
-   relatively high variance due to feature subsetting." As a result, the
-   calibration curve also referred to as the reliability diagram (Wilks 1995 [5]_) shows a
-   characteristic sigmoid shape, indicating that the classifier could trust its
-   "intuition" more and return probabilities closer to 0 or 1 typically.
+ :class:`RandomForestClassifier` shows the opposite behavior: the histograms
+ show peaks at approximately 0.2 and 0.9 probability, while probabilities
+ close to 0 or 1 are very rare. An explanation for this is given by
+ Niculescu-Mizil and Caruana [1]_: "Methods such as bagging and random
+ forests that average predictions from a base set of models can have
+ difficulty making predictions near 0 and 1 because variance in the
+ underlying base models will bias predictions that should be near zero or one
+ away from these values. Because predictions are restricted to the interval
+ [0,1], errors caused by variance tend to be one-sided near zero and one. For
+ example, if a model should predict p = 0 for a case, the only way bagging
+ can achieve this is if all bagged trees predict zero. If we add noise to the
+ trees that bagging is averaging over, this noise will cause some trees to
+ predict values larger than 0 for this case, thus moving the average
+ prediction of the bagged ensemble away from 0. We observe this effect most
+ strongly with random forests because the base-level trees trained with
+ random forests have relatively high variance due to feature subsetting." As
+ a result, the calibration curve, also referred to as the reliability diagram
+ (Wilks 1995 [2]_), shows a characteristic sigmoid shape, indicating that the
+ classifier could trust its "intuition" more and typically return
+ probabilities closer to 0 or 1.

.. currentmodule:: sklearn.svm

- * Linear Support Vector Classification (:class:`LinearSVC`) shows an even more sigmoid curve
-   as the RandomForestClassifier, which is typical for maximum-margin methods
-   (compare Niculescu-Mizil and Caruana [4]_), which focus on hard samples
-   that are close to the decision boundary (the support vectors).
-
- .. currentmodule:: sklearn.calibration
-
- Two approaches for performing calibration of probabilistic predictions are
- provided: a parametric approach based on Platt's sigmoid model and a
- non-parametric approach based on isotonic regression (:mod:`sklearn.isotonic`).
- Probability calibration should be done on new data not used for model fitting.
- The class :class:`CalibratedClassifierCV` uses a cross-validation generator and
- estimates for each split the model parameter on the train samples and the
- calibration of the test samples. The probabilities predicted for the
- folds are then averaged. Already fitted classifiers can be calibrated by
- :class:`CalibratedClassifierCV` via the parameter cv="prefit". In this case,
- the user has to take care manually that data for model fitting and calibration
- are disjoint.
-
- The following images demonstrate the benefit of probability calibration.
- The first image present a dataset with 2 classes and 3 blobs of
- data. The blob in the middle contains random samples of each class.
- The probability for the samples in this blob should be 0.5.
-
- .. figure:: ../auto_examples/calibration/images/sphx_glr_plot_calibration_001.png
-    :target: ../auto_examples/calibration/plot_calibration.html
-    :align: center
-
- The following image shows on the data above the estimated probability
- using a Gaussian naive Bayes classifier without calibration,
- with a sigmoid calibration and with a non-parametric isotonic
- calibration. One can observe that the non-parametric model
- provides the most accurate probability estimates for samples
- in the middle, i.e., 0.5.
-
- .. figure:: ../auto_examples/calibration/images/sphx_glr_plot_calibration_002.png
-    :target: ../auto_examples/calibration/plot_calibration.html
-    :align: center
-
- .. currentmodule:: sklearn.metrics
-
- The following experiment is performed on an artificial dataset for binary
- classification with 100,000 samples (1,000 of them are used for model fitting)
- with 20 features. Of the 20 features, only 2 are informative and 10 are
- redundant. The figure shows the estimated probabilities obtained with
- logistic regression, a linear support-vector classifier (SVC), and linear SVC with
- both isotonic calibration and sigmoid calibration.
- The Brier score is a metric which is a combination of calibration loss and refinement loss,
- :func:`brier_score_loss`, reported in the legend (the smaller the better).
- Calibration loss is defined as the mean squared deviation from empirical probabilities
- derived from the slope of ROC segments. Refinement loss can be defined as the expected
- optimal loss as measured by the area under the optimal cost curve.
-
- .. figure:: ../auto_examples/calibration/images/sphx_glr_plot_calibration_curve_002.png
-    :target: ../auto_examples/calibration/plot_calibration_curve.html
-    :align: center
+ Linear Support Vector Classification (:class:`LinearSVC`) shows an even more
+ sigmoid curve than the RandomForestClassifier, which is typical for
+ maximum-margin methods (compare Niculescu-Mizil and Caruana [1]_), which
+ focus on hard samples that are close to the decision boundary (the support
+ vectors).

- One can observe here that logistic regression is well calibrated as its curve is
- nearly diagonal. Linear SVC's calibration curve or reliability diagram has a
- sigmoid curve, which is typical for an under-confident classifier. In the case of
- LinearSVC, this is caused by the margin property of the hinge loss, which lets
- the model focus on hard samples that are close to the decision boundary
- (the support vectors). Both kinds of calibration can fix this issue and yield
- nearly identical results. The next figure shows the calibration curve of
- Gaussian naive Bayes on the same data, with both kinds of calibration and also
- without calibration.
-
- .. figure:: ../auto_examples/calibration/images/sphx_glr_plot_calibration_curve_001.png
-    :target: ../auto_examples/calibration/plot_calibration_curve.html
-    :align: center
-
- One can see that Gaussian naive Bayes performs very badly but does so in an
- other way than linear SVC: While linear SVC exhibited a sigmoid calibration
- curve, Gaussian naive Bayes' calibration curve has a transposed-sigmoid shape.
- This is typical for an over-confident classifier. In this case, the classifier's
- overconfidence is caused by the redundant features which violate the naive Bayes
- assumption of feature-independence.
-
- Calibration of the probabilities of Gaussian naive Bayes with isotonic
- regression can fix this issue as can be seen from the nearly diagonal
- calibration curve. Sigmoid calibration also improves the brier score slightly,
- albeit not as strongly as the non-parametric isotonic calibration. This is an
- intrinsic limitation of sigmoid calibration, whose parametric form assumes a
- sigmoid rather than a transposed-sigmoid curve. The non-parametric isotonic
- calibration model, however, makes no such strong assumptions and can deal with
- either shape, provided that there is sufficient calibration data. In general,
- sigmoid calibration is preferable in cases where the calibration curve is sigmoid
- and where there is limited calibration data, while isotonic calibration is
- preferable for non-sigmoid calibration curves and in situations where large
- amounts of data are available for calibration.
+ Calibrating a classifier
+ ------------------------

.. currentmodule:: sklearn.calibration

- :class:`CalibratedClassifierCV` can also deal with classification tasks that
- involve more than two classes if the base estimator can do so. In this case,
- the classifier is calibrated first for each class separately in an one-vs-rest
- fashion. When predicting probabilities for unseen data, the calibrated
- probabilities for each class are predicted separately. As those probabilities
- do not necessarily sum to one, a postprocessing is performed to normalize them.
-
- The next image illustrates how sigmoid calibration changes predicted
- probabilities for a 3-class classification problem. Illustrated is the standard
- 2-simplex, where the three corners correspond to the three classes. Arrows point
- from the probability vectors predicted by an uncalibrated classifier to the
- probability vectors predicted by the same classifier after sigmoid calibration
- on a hold-out validation set. Colors indicate the true class of an instance
- (red: class 1, green: class 2, blue: class 3).
-
- .. figure:: ../auto_examples/calibration/images/sphx_glr_plot_calibration_multiclass_001.png
-    :target: ../auto_examples/calibration/plot_calibration_multiclass.html
-    :align: center
-
- The base classifier is a random forest classifier with 25 base estimators
- (trees). If this classifier is trained on all 800 training datapoints, it is
- overly confident in its predictions and thus incurs a large log-loss.
- Calibrating an identical classifier, which was trained on 600 datapoints, with
- method='sigmoid' on the remaining 200 datapoints reduces the confidence of the
- predictions, i.e., moves the probability vectors from the edges of the simplex
- towards the center:
-
- .. figure:: ../auto_examples/calibration/images/sphx_glr_plot_calibration_multiclass_002.png
-    :target: ../auto_examples/calibration/plot_calibration_multiclass.html
-    :align: center
-
- This calibration results in a lower log-loss. Note that an alternative would
- have been to increase the number of base estimators which would have resulted in
- a similar decrease in log-loss.
+ Calibrating a classifier consists in fitting a regressor (called a
+ *calibrator*) that maps the output of the classifier (as given by
+ :term:`predict` or :term:`predict_proba`) to a calibrated probability in [0,
+ 1]. Denoting the output of the classifier for a given sample by :math:`f_i`,
+ the calibrator tries to predict :math:`p(y_i = 1 | f_i)`.
+
+ The samples that are used to train the calibrator should not be used to
+ train the target classifier.
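+
+ As a rough illustration of what a calibrator is, the sketch below fits a
+ one-dimensional logistic regression on held-out classifier outputs, in the
+ spirit of Platt's sigmoid method. The dataset, the
+ :class:`~sklearn.svm.LinearSVC` base model and the use of
+ :class:`~sklearn.linear_model.LogisticRegression` as calibrator are
+ illustrative assumptions, not what :class:`CalibratedClassifierCV` literally
+ does internally::
+
+     from sklearn.datasets import make_classification
+     from sklearn.model_selection import train_test_split
+     from sklearn.svm import LinearSVC
+     from sklearn.linear_model import LogisticRegression
+
+     X, y = make_classification(n_samples=5000, random_state=0)
+     # the calibration set is kept disjoint from the classifier's training set
+     X_train, X_calib, y_train, y_calib = train_test_split(X, y, random_state=0)
+
+     clf = LinearSVC().fit(X_train, y_train)
+     # f_i: uncalibrated outputs of the classifier on the calibration set
+     f_calib = clf.decision_function(X_calib).reshape(-1, 1)
+
+     # the calibrator maps f_i to an estimate of p(y_i = 1 | f_i)
+     calibrator = LogisticRegression().fit(f_calib, y_calib)
+     calibrated_proba = calibrator.predict_proba(f_calib)[:, 1]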
+
+ Usage
+ -----
+
+ The :class:`CalibratedClassifierCV` class is used to calibrate a classifier.
+
+ :class:`CalibratedClassifierCV` uses a cross-validation approach to fit both
+ the classifier and the regressor. For each of the k `(trainset, testset)`
+ couples, a classifier is trained on the train set, and its predictions on the
+ test set are used to fit a regressor. We end up with k
+ `(classifier, regressor)` couples where each regressor maps the output of
+ its corresponding classifier into [0, 1]. Each couple is exposed in the
+ `calibrated_classifiers_` attribute, where each entry is a calibrated
+ classifier with a :term:`predict_proba` method that outputs calibrated
+ probabilities. The output of :term:`predict_proba` for the main
+ :class:`CalibratedClassifierCV` instance corresponds to the average of the
+ predicted probabilities of the `k` estimators in the
+ `calibrated_classifiers_` list. The output of :term:`predict` is the class
+ that has the highest probability.
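+
+ A minimal usage sketch (the synthetic dataset and the choice of
+ :class:`~sklearn.naive_bayes.GaussianNB` as base estimator are illustrative
+ assumptions)::
+
+     from sklearn.datasets import make_classification
+     from sklearn.naive_bayes import GaussianNB
+     from sklearn.calibration import CalibratedClassifierCV
+
+     X, y = make_classification(n_samples=2000, random_state=0)
+
+     # 3-fold cross-validation: three (classifier, regressor) couples are fitted
+     calibrated_clf = CalibratedClassifierCV(GaussianNB(), cv=3)
+     calibrated_clf.fit(X, y)
+
+     len(calibrated_clf.calibrated_classifiers_)  # 3 calibrated classifiers
+     proba = calibrated_clf.predict_proba(X)      # average of their predictions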
+
+ The regressor that is used for calibration depends on the `method`
+ parameter. `'sigmoid'` corresponds to a parametric approach based on Platt's
+ logistic model [3]_, i.e. :math:`p(y_i = 1 | f_i)` is modeled as
+ :math:`\sigma(A f_i + B)` where :math:`\sigma` is the logistic function, and
+ :math:`A` and :math:`B` are real numbers to be determined when fitting the
+ regressor via maximum likelihood. `'isotonic'` will instead fit a
+ non-parametric isotonic regressor, which outputs a step-wise non-decreasing
+ function (see :mod:`sklearn.isotonic`).
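+
+ Selecting the calibrator is just a matter of passing `method` (the
+ :class:`~sklearn.svm.LinearSVC` base estimator below is an illustrative
+ assumption)::
+
+     from sklearn.svm import LinearSVC
+     from sklearn.calibration import CalibratedClassifierCV
+
+     # Platt's sigmoid model
+     sigmoid_cal = CalibratedClassifierCV(LinearSVC(), method='sigmoid', cv=5)
+     # non-parametric isotonic regression
+     isotonic_cal = CalibratedClassifierCV(LinearSVC(), method='isotonic', cv=5)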
+
+ An already fitted classifier can be calibrated by setting `cv="prefit"`. In
+ this case, the data is only used to fit the regressor. It is up to the user
+ to make sure that the data used for fitting the classifier is disjoint from
+ the data used for fitting the regressor.
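+
+ A sketch of the prefit workflow (the dataset, split and base estimator are
+ illustrative assumptions)::
+
+     from sklearn.datasets import make_classification
+     from sklearn.model_selection import train_test_split
+     from sklearn.svm import LinearSVC
+     from sklearn.calibration import CalibratedClassifierCV
+
+     X, y = make_classification(n_samples=2000, random_state=0)
+     # disjoint sets for fitting the classifier and fitting the regressor
+     X_train, X_calib, y_train, y_calib = train_test_split(X, y, random_state=0)
+
+     clf = LinearSVC().fit(X_train, y_train)        # already fitted classifier
+     calibrated = CalibratedClassifierCV(clf, cv="prefit")
+     calibrated.fit(X_calib, y_calib)               # only the calibrator is fitted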
+
+ :class:`CalibratedClassifierCV` can calibrate probabilities in a multiclass
+ setting if the base estimator supports multiclass predictions. The classifier
+ is calibrated first for each class separately in a one-vs-rest fashion [4]_.
+ When predicting probabilities, the calibrated probabilities for each class
+ are predicted separately. As those probabilities do not necessarily sum to
+ one, a post-processing step is performed to normalize them.
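+
+ Such a normalization can be done by dividing each row of per-class
+ probabilities by its sum; a sketch with hypothetical numbers::
+
+     import numpy as np
+
+     # per-class calibrated probabilities for two samples and three classes
+     proba = np.array([[0.4, 0.3, 0.2],
+                       [0.1, 0.2, 0.1]])
+     # normalize each row so that the probabilities sum to one
+     proba /= proba.sum(axis=1, keepdims=True)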
+
+ The :func:`sklearn.metrics.brier_score_loss` may be used to evaluate how
+ well a classifier is calibrated.
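+
+ For instance (with made-up labels and predicted probabilities)::
+
+     from sklearn.metrics import brier_score_loss
+
+     y_true = [0, 1, 1, 0]
+     y_prob = [0.1, 0.9, 0.8, 0.3]     # predicted probability of the positive class
+     brier_score_loss(y_true, y_prob)  # lower values are better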
+
+ .. topic:: Examples:
+
+    * :ref:`sphx_glr_auto_examples_calibration_plot_calibration_curve.py`
+    * :ref:`sphx_glr_auto_examples_calibration_plot_calibration_multiclass.py`
+    * :ref:`sphx_glr_auto_examples_calibration_plot_calibration.py`
+    * :ref:`sphx_glr_auto_examples_calibration_plot_compare_calibration.py`

.. topic:: References:

-    * Obtaining calibrated probability estimates from decision trees
-      and naive Bayesian classifiers, B. Zadrozny & C. Elkan, ICML 2001
-
-    * Transforming Classifier Scores into Accurate Multiclass
-      Probability Estimates, B. Zadrozny & C. Elkan, (KDD 2002)
-
-    * Probabilistic Outputs for Support Vector Machines and Comparisons to
-      Regularized Likelihood Methods, J. Platt, (1999)
-
-    .. [4] Predicting Good Probabilities with Supervised Learning,
+    .. [1] Predicting Good Probabilities with Supervised Learning,
       A. Niculescu-Mizil & R. Caruana, ICML 2005

-    .. [5] On the combination of forecast probabilities for
+    .. [2] On the combination of forecast probabilities for
       consecutive precipitation periods. Wea. Forecasting, 5, 640–650.,
       Wilks, D. S., 1990a
+
+    .. [3] Probabilistic Outputs for Support Vector Machines and Comparisons
+       to Regularized Likelihood Methods, J. Platt, (1999)
+
+    .. [4] Transforming Classifier Scores into Accurate Multiclass
+       Probability Estimates, B. Zadrozny & C. Elkan, (KDD 2002)