From ddea6651838f5fe115dfa897bcd72a4d02b964fd Mon Sep 17 00:00:00 2001 From: Lucy Liu Date: Thu, 25 Jun 2020 17:26:53 +0200 Subject: [PATCH 01/11] update calb --- doc/modules/calibration.rst | 122 +++++++++++++++++++++++++----------- 1 file changed, 87 insertions(+), 35 deletions(-) diff --git a/doc/modules/calibration.rst b/doc/modules/calibration.rst index 19df08ea3b1fe..7700e84e469eb 100644 --- a/doc/modules/calibration.rst +++ b/doc/modules/calibration.rst @@ -11,16 +11,21 @@ When performing classification you often want not only to predict the class label, but also obtain a probability of the respective label. This probability gives you some kind of confidence on the prediction. Some models can give you poor estimates of the class probabilities and some even do not support -probability prediction. The calibration module allows you to better calibrate +probability prediction (e.g., :mod:`sklearn.svm` estimators). The calibration +module allows you to better calibrate the probabilities of a given model, or to add support for probability prediction. Well calibrated classifiers are probabilistic classifiers for which the output -of the predict_proba method can be directly interpreted as a confidence level. +of the :term:`predict_proba` method can be directly interpreted as a confidence +level. For instance, a well calibrated (binary) classifier should classify the samples -such that among the samples to which it gave a predict_proba value close to 0.8, +such that among the samples to which it gave a :term:`predict_proba` value +close to 0.8, approximately 80% actually belong to the positive class. +.. _calibration_curve: + Calibration curves ------------------ @@ -37,7 +42,7 @@ class is the positive class (in each bin). .. currentmodule:: sklearn.linear_model :class:`LogisticRegression` returns well calibrated predictions by default as it directly -optimizes log-loss. In contrast, the other methods return biased probabilities; +optimizes :ref:`log_loss`. In contrast, the other methods return biased probabilities; with different biases per method: .. currentmodule:: sklearn.naive_bayes @@ -73,10 +78,10 @@ to 0 or 1 typically. .. currentmodule:: sklearn.svm Linear Support Vector Classification (:class:`LinearSVC`) shows an even more -sigmoid curve as the RandomForestClassifier, which is typical for +sigmoid curve than :class:`RandomForestClassifier`, which is typical for maximum-margin methods (compare Niculescu-Mizil and Caruana [1]_), which -focus on hard samples that are close to the decision boundary (the support -vectors). +focus on difficult to classify samples that are close to the decision +boundary (the support vectors). Calibrating a classifier ------------------------ @@ -85,12 +90,16 @@ Calibrating a classifier Calibrating a classifier consists in fitting a regressor (called a *calibrator*) that maps the output of the classifier (as given by -:term:`predict` or :term:`predict_proba`) to a calibrated probability in [0, -1]. Denoting the output of the classifier for a given sample by :math:`f_i`, +:term:`decison_function` or :term:`predict_proba`) to a calibrated probability +in [0, 1]. Denoting the output of the classifier for a given sample by :math:`f_i`, the calibrator tries to predict :math:`p(y_i = 1 | f_i)`. -The samples that are used to train the calibrator should not be used to -train the target classifier. +The samples that are used to fit the calibrator should not be the same +samples used to fit classifier that is to be calibrated, as this would +introduce bias. 
Classifier performance on data used to train it would be +better than for novel data. Using the classifier output from training data +to fit the calibrator would thus result in a biased calibrator that maps to +probabilities closer to 0 and 1 than it should. Usage ----- @@ -100,7 +109,9 @@ The :class:`CalibratedClassifierCV` class is used to calibrate a classifier. :class:`CalibratedClassifierCV` uses a cross-validation approach to fit both the classifier and the regressor. For each of the k `(trainset, testset)` couple, a classifier is trained on the train set, and its predictions on the -test set are used to fit a regressor. We end up with k +test set are used to fit a regressor. This ensures that the data used to fit +the classifier is always disjoint from the data used to fit the calibrator. +After fitting, we end up with k `(classifier, regressor)` couples where each regressor maps the output of its corresponding classifier into [0, 1]. Each couple is exposed in the `calibrated_classifiers_` attribute, where each entry is a calibrated @@ -111,30 +122,60 @@ predicted probabilities of the `k` estimators in the `calibrated_classifiers_` list. The output of :term:`predict` is the class that has the highest probability. -The regressor that is used for calibration depends on the `method` -parameter. `'sigmoid'` corresponds to a parametric approach based on Platt's -logistic model [3]_, i.e. :math:`p(y_i = 1 | f_i)` is modeled as -:math:`\sigma(A f_i + B)` where :math:`\sigma` is the logistic function, and -:math:`A` and :math:`B` are real numbers to be determined when fitting the -regressor via maximum likelihood. `'isotonic'` will instead fit a -non-parametric isotonic regressor, which outputs a step-wise non-decreasing -function (see :mod:`sklearn.isotonic`). - An already fitted classifier can be calibrated by setting `cv="prefit"`. In this case, the data is only used to fit the regressor. It is up to the user make sure that the data used for fitting the classifier is disjoint from the data used for fitting the regressor. -:class:`CalibratedClassifierCV` can calibrate probabilities in a multiclass -setting if the base estimator supports multiclass predictions. The classifier -is calibrated first for each class separately in a one-vs-rest fashion [4]_. -When predicting probabilities, the calibrated probabilities for each class +:class:`CalibratedClassifierCV` supports the use of two 'calibration' +regressors; 'sigmoid' and 'isotonic'. Both these regressors only +support 1-dimensional data (e.g., binary classification output) but can be +extended for multiclass classification if the `base_estimator` supports +multiclass predictions. :class:`CalibratedClassifierCV` calibrates for +each class separately in a One-Vs-The-Rest fashion [4]_. When predicting +probabilities, the calibrated probabilities for each class are predicted separately. As those probabilities do not necessarily sum to one, a postprocessing is performed to normalize them. The :func:`sklearn.metrics.brier_score_loss` may be used to evaluate how well a classifier is calibrated. +Sigmoid +^^^^^^^ + +The sigmoid regressor is based on Platt's logistic model [3]_: + +.. math:: + p(y_i = 1 | f_i) = \frac{1}{1 + \exp(A f_i + B)} + +where :math:`y_i` is the true label of sample :math:`i` and :math:`f_i` +is output of the classifier for sample :math:`i`. :math:`A` and :math:`B` +are real numbers to be determined when fitting the regressor via maximum +likelihood. 
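In practice this corresponds to passing `method="sigmoid"` to
:class:`CalibratedClassifierCV`. A minimal sketch, assuming an illustrative
synthetic dataset (the estimator and settings below are examples only, not
prescriptions)::

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC

    # Illustrative synthetic binary classification data
    X, y = make_classification(n_samples=1000, random_state=42)

    # method="sigmoid" fits Platt's A and B by maximum likelihood on the
    # held-out fold of each of the 3 cross-validation splits
    platt = CalibratedClassifierCV(LinearSVC(random_state=42),
                                   method="sigmoid", cv=3)
    platt.fit(X, y)

    # LinearSVC has no predict_proba of its own; the calibrated wrapper does
    proba = platt.predict_proba(X[:5])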
+ +The sigmoid method is biased in that it assumes the :ref:`calibration curve +` of the un-calibrated model has a sigmoid shape [1]_. It +is thus most effective when the un-calibrated model is over-confident. + +Isotonic +^^^^^^^^ + +The 'isotonic' method fits a non-parametric isotonic regressor, which outputs +a step-wise non-decreasing function (see :mod:`sklearn.isotonic`). It +minimizes: + +.. math:: + \sum_i (y_i - f_i)^2 + +subject to :math:`\f_i \le f_j`. This method is more general when compared to +'sigmoid' as the only restriction is that the mapping function is +monotonically increasing. It is thus more powerful as it can correct any +monotonic distortion of the un-calibrated model. However, it is more prone +to overfitting, especially on small datasets [5]_. + +Overall, 'isotonic' will perform as well as or better than 'sigmoid' when +there is enough data (greater than ~ 1000 samples) to avoid overfitting [1]_. + .. topic:: Examples: * :ref:`sphx_glr_auto_examples_calibration_plot_calibration_curve.py` @@ -144,15 +185,26 @@ well a classifier is calibrated. .. topic:: References: - .. [1] Predicting Good Probabilities with Supervised Learning, + .. [1] `Predicting Good Probabilities with Supervised Learning + `_, A. Niculescu-Mizil & R. Caruana, ICML 2005 - .. [2] On the combination of forecast probabilities for - consecutive precipitation periods. Wea. Forecasting, 5, 640–650., - Wilks, D. S., 1990a - - .. [3] Probabilistic Outputs for Support Vector Machines and Comparisons - to Regularized Likelihood Methods, J. Platt, (1999) - - .. [4] Transforming Classifier Scores into Accurate Multiclass - Probability Estimates, B. Zadrozny & C. Elkan, (KDD 2002) + .. [2] `On the combination of forecast probabilities for + consecutive precipitation periods. + `_ + Wea. Forecasting, 5, 640–650., Wilks, D. S., 1990a + + .. [3] `Probabilistic Outputs for Support Vector Machines and Comparisons + to Regularized Likelihood Methods. + `_ + J. Platt, (1999) + + .. [4] `Transforming Classifier Scores into Accurate Multiclass + Probability Estimates. + `_ + B. Zadrozny & C. Elkan, (KDD 2002) + + .. [5] `Predicting accurate probabilities with a ranking loss. + `_ + Menon AK, Jiang XJ, Vembu S, Elkan C, Ohno-Machado L. + Proc Int Conf Mach Learn. 2012;2012:703-710. \ No newline at end of file From 1ecf4268d1c4425a9a36dd8d2d04e93f136d02b2 Mon Sep 17 00:00:00 2001 From: Lucy Liu Date: Thu, 25 Jun 2020 17:38:44 +0200 Subject: [PATCH 02/11] wording --- doc/modules/calibration.rst | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/doc/modules/calibration.rst b/doc/modules/calibration.rst index 7700e84e469eb..a6f3b89a90cc2 100644 --- a/doc/modules/calibration.rst +++ b/doc/modules/calibration.rst @@ -107,8 +107,9 @@ Usage The :class:`CalibratedClassifierCV` class is used to calibrate a classifier. :class:`CalibratedClassifierCV` uses a cross-validation approach to fit both -the classifier and the regressor. For each of the k `(trainset, testset)` -couple, a classifier is trained on the train set, and its predictions on the +the classifier and the regressor. The data is split into k +`(train_set, test_set)` couples (as determined by `cv`). The classifier +(`base_estimator`) is trained on the train set, and its predictions on the test set are used to fit a regressor. This ensures that the data used to fit the classifier is always disjoint from the data used to fit the calibrator. 
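A minimal sketch of this fitting procedure (the choice of
:class:`~sklearn.naive_bayes.GaussianNB` and the synthetic data are
illustrative only)::

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=600, random_state=0)

    # cv=3 yields three (train_set, test_set) couples: each train set fits
    # a clone of the classifier, each test set fits the matching calibrator
    calibrated = CalibratedClassifierCV(GaussianNB(), cv=3)
    calibrated.fit(X, y)

    len(calibrated.calibrated_classifiers_)  # -> 3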
After fitting, we end up with k @@ -122,17 +123,20 @@ predicted probabilities of the `k` estimators in the `calibrated_classifiers_` list. The output of :term:`predict` is the class that has the highest probability. -An already fitted classifier can be calibrated by setting `cv="prefit"`. In -this case, the data is only used to fit the regressor. It is up to the user +Alternatively an already fitted classifier can be calibrated by setting +`cv="prefit"`. In this case, the data is not split and all of it is used to +fit the regressor. It is up to the user make sure that the data used for fitting the classifier is disjoint from the data used for fitting the regressor. :class:`CalibratedClassifierCV` supports the use of two 'calibration' regressors; 'sigmoid' and 'isotonic'. Both these regressors only -support 1-dimensional data (e.g., binary classification output) but can be +support 1-dimensional data (e.g., binary classification output) but are extended for multiclass classification if the `base_estimator` supports -multiclass predictions. :class:`CalibratedClassifierCV` calibrates for -each class separately in a One-Vs-The-Rest fashion [4]_. When predicting +multiclass predictions. For multiclass predictions, +:class:`CalibratedClassifierCV` calibrates for +each class separately in a :ref:`ovr_classification` fashion [4]_. When +predicting probabilities, the calibrated probabilities for each class are predicted separately. As those probabilities do not necessarily sum to one, a postprocessing is performed to normalize them. From 680f124d28098932a8da6cf2f7359acc96521d2c Mon Sep 17 00:00:00 2001 From: Lucy Liu Date: Thu, 25 Jun 2020 20:50:34 +0200 Subject: [PATCH 03/11] suggestsion --- doc/modules/calibration.rst | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/doc/modules/calibration.rst b/doc/modules/calibration.rst index a6f3b89a90cc2..1c5cda34e02e2 100644 --- a/doc/modules/calibration.rst +++ b/doc/modules/calibration.rst @@ -11,7 +11,7 @@ When performing classification you often want not only to predict the class label, but also obtain a probability of the respective label. This probability gives you some kind of confidence on the prediction. Some models can give you poor estimates of the class probabilities and some even do not support -probability prediction (e.g., :mod:`sklearn.svm` estimators). The calibration +probability prediction (e.g., :class:`SGDClassifier`). The calibration module allows you to better calibrate the probabilities of a given model, or to add support for probability prediction. @@ -90,13 +90,13 @@ Calibrating a classifier Calibrating a classifier consists in fitting a regressor (called a *calibrator*) that maps the output of the classifier (as given by -:term:`decison_function` or :term:`predict_proba`) to a calibrated probability +:term:`decision_function` or :term:`predict_proba`) to a calibrated probability in [0, 1]. Denoting the output of the classifier for a given sample by :math:`f_i`, the calibrator tries to predict :math:`p(y_i = 1 | f_i)`. The samples that are used to fit the calibrator should not be the same -samples used to fit classifier that is to be calibrated, as this would -introduce bias. Classifier performance on data used to train it would be +samples used to fit the classifier, as this would +introduce bias. The classifier performance on its training data would be better than for novel data. 
Using the classifier output from training data to fit the calibrator would thus result in a biased calibrator that maps to probabilities closer to 0 and 1 than it should. @@ -130,7 +130,7 @@ make sure that the data used for fitting the classifier is disjoint from the data used for fitting the regressor. :class:`CalibratedClassifierCV` supports the use of two 'calibration' -regressors; 'sigmoid' and 'isotonic'. Both these regressors only +regressors: 'sigmoid' and 'isotonic'. Both these regressors only support 1-dimensional data (e.g., binary classification output) but are extended for multiclass classification if the `base_estimator` supports multiclass predictions. For multiclass predictions, @@ -153,7 +153,7 @@ The sigmoid regressor is based on Platt's logistic model [3]_: p(y_i = 1 | f_i) = \frac{1}{1 + \exp(A f_i + B)} where :math:`y_i` is the true label of sample :math:`i` and :math:`f_i` -is output of the classifier for sample :math:`i`. :math:`A` and :math:`B` +is the output of the classifier for sample :math:`i`. :math:`A` and :math:`B` are real numbers to be determined when fitting the regressor via maximum likelihood. From 562dc8da2f80a087d1a783bff9abc4b07ed3c6f8 Mon Sep 17 00:00:00 2001 From: Lucy Liu Date: Fri, 26 Jun 2020 11:25:24 +0200 Subject: [PATCH 04/11] suggestions --- doc/modules/calibration.rst | 21 +++++++++++++-------- 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/doc/modules/calibration.rst b/doc/modules/calibration.rst index 1c5cda34e02e2..7c76130244548 100644 --- a/doc/modules/calibration.rst +++ b/doc/modules/calibration.rst @@ -158,8 +158,12 @@ are real numbers to be determined when fitting the regressor via maximum likelihood. The sigmoid method is biased in that it assumes the :ref:`calibration curve -` of the un-calibrated model has a sigmoid shape [1]_. It -is thus most effective when the un-calibrated model is over-confident. +` of the un-calibrated model has a sigmoid shape and is +symmetrical [1]_. It is thus most effective when the un-calibrated model is +over-confident and has similar over-confidence errors for both high and low +output errors. The symmetry assumption is of concern in highly imbalanced +classification as un-calibrated classifiers can have asymmetric calibration +errors. Isotonic ^^^^^^^^ @@ -169,13 +173,14 @@ a step-wise non-decreasing function (see :mod:`sklearn.isotonic`). It minimizes: .. math:: - \sum_i (y_i - f_i)^2 + \sum_{i=1}^{n} (y_i - f_i)^2 : f_i \leq f_{i+1}\quad \forall i \{1,..., n-1\} -subject to :math:`\f_i \le f_j`. This method is more general when compared to -'sigmoid' as the only restriction is that the mapping function is -monotonically increasing. It is thus more powerful as it can correct any -monotonic distortion of the un-calibrated model. However, it is more prone -to overfitting, especially on small datasets [5]_. +where :math:`y_i` is the true label of sample :math:`i` and :math:`f_i` +is the output of the classifier for sample :math:`i`. This method is more +general when compared to 'sigmoid' as the only restriction is that the mapping +function is monotonically increasing. It is thus more powerful as it can +correct any monotonic distortion of the un-calibrated model. However, it is +more prone to overfitting, especially on small datasets [5]_. Overall, 'isotonic' will perform as well as or better than 'sigmoid' when there is enough data (greater than ~ 1000 samples) to avoid overfitting [1]_. 
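As a hedged sketch of the isotonic fit just described (the scores and labels
below are illustrative), :class:`sklearn.isotonic.IsotonicRegression` can be
applied directly to held-out classifier scores; within
:class:`CalibratedClassifierCV` the same behaviour is selected with
`method="isotonic"`::

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    # Illustrative un-calibrated scores and the corresponding true labels
    scores = np.array([0.1, 0.35, 0.4, 0.8, 0.9])
    labels = np.array([0, 1, 0, 1, 1])

    # The learned mapping is step-wise, non-decreasing and clipped to [0, 1]
    iso = IsotonicRegression(y_min=0, y_max=1, out_of_bounds="clip")
    calibrated_scores = iso.fit_transform(scores, labels)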
From 73d060f03f5671866e8d1ecd1da9691b2aad0f3e Mon Sep 17 00:00:00 2001 From: Lucy Liu Date: Fri, 26 Jun 2020 13:35:32 +0200 Subject: [PATCH 05/11] suggestion --- doc/modules/calibration.rst | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/doc/modules/calibration.rst b/doc/modules/calibration.rst index 7c76130244548..92484a473a292 100644 --- a/doc/modules/calibration.rst +++ b/doc/modules/calibration.rst @@ -153,9 +153,9 @@ The sigmoid regressor is based on Platt's logistic model [3]_: p(y_i = 1 | f_i) = \frac{1}{1 + \exp(A f_i + B)} where :math:`y_i` is the true label of sample :math:`i` and :math:`f_i` -is the output of the classifier for sample :math:`i`. :math:`A` and :math:`B` -are real numbers to be determined when fitting the regressor via maximum -likelihood. +is the output of the un-calibrated classifier for sample :math:`i`. :math:`A` +and :math:`B` are real numbers to be determined when fitting the regressor via +maximum likelihood. The sigmoid method is biased in that it assumes the :ref:`calibration curve ` of the un-calibrated model has a sigmoid shape and is @@ -176,10 +176,10 @@ minimizes: \sum_{i=1}^{n} (y_i - f_i)^2 : f_i \leq f_{i+1}\quad \forall i \{1,..., n-1\} where :math:`y_i` is the true label of sample :math:`i` and :math:`f_i` -is the output of the classifier for sample :math:`i`. This method is more -general when compared to 'sigmoid' as the only restriction is that the mapping -function is monotonically increasing. It is thus more powerful as it can -correct any monotonic distortion of the un-calibrated model. However, it is +is the output of the un-calibrated classifier for sample :math:`i`. This method +is more general when compared to 'sigmoid' as the only restriction is that the +mapping function is monotonically increasing. It is thus more powerful as it +can correct any monotonic distortion of the un-calibrated model. However, it is more prone to overfitting, especially on small datasets [5]_. Overall, 'isotonic' will perform as well as or better than 'sigmoid' when From dc50f10615d6ca08e740034f8057bc9dd4197431 Mon Sep 17 00:00:00 2001 From: Lucy Liu Date: Fri, 26 Jun 2020 13:42:04 +0200 Subject: [PATCH 06/11] suggestion --- doc/modules/calibration.rst | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/doc/modules/calibration.rst b/doc/modules/calibration.rst index 92484a473a292..b68b702986946 100644 --- a/doc/modules/calibration.rst +++ b/doc/modules/calibration.rst @@ -173,10 +173,11 @@ a step-wise non-decreasing function (see :mod:`sklearn.isotonic`). It minimizes: .. math:: - \sum_{i=1}^{n} (y_i - f_i)^2 : f_i \leq f_{i+1}\quad \forall i \{1,..., n-1\} + \sum_{i=1}^{n} (y_i - \hat{f}_i)^2 -where :math:`y_i` is the true label of sample :math:`i` and :math:`f_i` -is the output of the un-calibrated classifier for sample :math:`i`. This method +subject to \hat{f}_i >= \hat{f}_j whenever f_i >= f_j. :math:`y_i` is the true +label of sample :math:`i` and :math:`\hat{f}_i` is the output of the +calibrated classifier for sample :math:`i`. This method is more general when compared to 'sigmoid' as the only restriction is that the mapping function is monotonically increasing. It is thus more powerful as it can correct any monotonic distortion of the un-calibrated model. 
However, it is From 8831bd4ac935ecd2b6176ada6d3f2ea8b232b720 Mon Sep 17 00:00:00 2001 From: Lucy Liu Date: Fri, 26 Jun 2020 14:17:17 +0200 Subject: [PATCH 07/11] add math --- doc/modules/calibration.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/doc/modules/calibration.rst b/doc/modules/calibration.rst index b68b702986946..ad8d3a97008d1 100644 --- a/doc/modules/calibration.rst +++ b/doc/modules/calibration.rst @@ -175,7 +175,8 @@ minimizes: .. math:: \sum_{i=1}^{n} (y_i - \hat{f}_i)^2 -subject to \hat{f}_i >= \hat{f}_j whenever f_i >= f_j. :math:`y_i` is the true +subject to :math:`\hat{f}_i >= \hat{f}_j` whenever +:math:`f_i >= f_j`. :math:`y_i` is the true label of sample :math:`i` and :math:`\hat{f}_i` is the output of the calibrated classifier for sample :math:`i`. This method is more general when compared to 'sigmoid' as the only restriction is that the From ba71ca5d5d74dfac81799d670a8d157c21c55baf Mon Sep 17 00:00:00 2001 From: Lucy Liu Date: Fri, 26 Jun 2020 14:27:11 +0200 Subject: [PATCH 08/11] fix link --- doc/modules/calibration.rst | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/doc/modules/calibration.rst b/doc/modules/calibration.rst index ad8d3a97008d1..e68fbbe1c545e 100644 --- a/doc/modules/calibration.rst +++ b/doc/modules/calibration.rst @@ -11,8 +11,8 @@ When performing classification you often want not only to predict the class label, but also obtain a probability of the respective label. This probability gives you some kind of confidence on the prediction. Some models can give you poor estimates of the class probabilities and some even do not support -probability prediction (e.g., :class:`SGDClassifier`). The calibration -module allows you to better calibrate +probability prediction (e.g., :class:`~sklearn.linear_model.SGDClassifier`). +The calibration module allows you to better calibrate the probabilities of a given model, or to add support for probability prediction. @@ -78,9 +78,9 @@ to 0 or 1 typically. .. currentmodule:: sklearn.svm Linear Support Vector Classification (:class:`LinearSVC`) shows an even more -sigmoid curve than :class:`RandomForestClassifier`, which is typical for -maximum-margin methods (compare Niculescu-Mizil and Caruana [1]_), which -focus on difficult to classify samples that are close to the decision +sigmoid curve than :class:`~sklearn.ensemble.RandomForestClassifier`, which is +typical for maximum-margin methods (compare Niculescu-Mizil and Caruana [1]_), +which focus on difficult to classify samples that are close to the decision boundary (the support vectors). Calibrating a classifier From ed6c65d5d7a309afdd28623b0e679a5fa868cf04 Mon Sep 17 00:00:00 2001 From: Lucy Liu Date: Sun, 28 Jun 2020 15:27:26 +0200 Subject: [PATCH 09/11] fix under conf --- doc/modules/calibration.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/modules/calibration.rst b/doc/modules/calibration.rst index e68fbbe1c545e..72641caf37d0a 100644 --- a/doc/modules/calibration.rst +++ b/doc/modules/calibration.rst @@ -160,7 +160,7 @@ maximum likelihood. The sigmoid method is biased in that it assumes the :ref:`calibration curve ` of the un-calibrated model has a sigmoid shape and is symmetrical [1]_. It is thus most effective when the un-calibrated model is -over-confident and has similar over-confidence errors for both high and low +under-confident and has similar errors for both high and low output errors. 
The symmetry assumption is of concern in highly imbalanced classification as un-calibrated classifiers can have asymmetric calibration errors. From cba41cfc6b32de222c268745847cdf582199406a Mon Sep 17 00:00:00 2001 From: Lucy Liu Date: Fri, 3 Jul 2020 13:18:46 +0200 Subject: [PATCH 10/11] suggestions --- doc/modules/calibration.rst | 58 +++++++++++++++++++++++-------------- 1 file changed, 36 insertions(+), 22 deletions(-) diff --git a/doc/modules/calibration.rst b/doc/modules/calibration.rst index 72641caf37d0a..443bf6cae9856 100644 --- a/doc/modules/calibration.rst +++ b/doc/modules/calibration.rst @@ -11,7 +11,8 @@ When performing classification you often want not only to predict the class label, but also obtain a probability of the respective label. This probability gives you some kind of confidence on the prediction. Some models can give you poor estimates of the class probabilities and some even do not support -probability prediction (e.g., :class:`~sklearn.linear_model.SGDClassifier`). +probability prediction (e.g., some instances of +:class:`~sklearn.linear_model.SGDClassifier`). The calibration module allows you to better calibrate the probabilities of a given model, or to add support for probability prediction. @@ -88,7 +89,7 @@ Calibrating a classifier .. currentmodule:: sklearn.calibration -Calibrating a classifier consists in fitting a regressor (called a +Calibrating a classifier consists of fitting a regressor (called a *calibrator*) that maps the output of the classifier (as given by :term:`decision_function` or :term:`predict_proba`) to a calibrated probability in [0, 1]. Denoting the output of the classifier for a given sample by :math:`f_i`, @@ -129,20 +130,19 @@ fit the regressor. It is up to the user make sure that the data used for fitting the classifier is disjoint from the data used for fitting the regressor. -:class:`CalibratedClassifierCV` supports the use of two 'calibration' -regressors: 'sigmoid' and 'isotonic'. Both these regressors only -support 1-dimensional data (e.g., binary classification output) but are -extended for multiclass classification if the `base_estimator` supports -multiclass predictions. For multiclass predictions, -:class:`CalibratedClassifierCV` calibrates for -each class separately in a :ref:`ovr_classification` fashion [4]_. When -predicting -probabilities, the calibrated probabilities for each class -are predicted separately. As those probabilities do not necessarily sum to -one, a postprocessing is performed to normalize them. +:func:`sklearn.metrics.brier_score_loss` may be used to assess how +well a classifier is calibrated. However, this metric should be used with care +because a lower Brier score does not always mean a better calibrated model. +This is because the Brier score metric is a combination of calibration loss +and refinement loss. Calibration loss is defined as the mean squared deviation +from empirical probabilities derived from the slope of ROC segments. +Refinement loss can be defined as the expected optimal loss as measured by the +area under the optimal cost curve. As refinement loss can change +independently from calibration loss, a lower Brier score does not necessarily +mean a better calibrated model. -The :func:`sklearn.metrics.brier_score_loss` may be used to evaluate how -well a classifier is calibrated. +:class:`CalibratedClassifierCV` supports the use of two 'calibration' +regressors: 'sigmoid' and 'isotonic'. Sigmoid ^^^^^^^ @@ -160,8 +160,8 @@ maximum likelihood. 
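Whether the sigmoid form is appropriate for a given model can be checked
empirically before committing to it; a minimal sketch using
:func:`sklearn.calibration.calibration_curve` (the dataset, estimator and
binning are illustrative)::

    from sklearn.calibration import calibration_curve
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=2000, random_state=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    proba = GaussianNB().fit(X_train, y_train).predict_proba(X_test)[:, 1]

    # For a well calibrated model prob_true tracks prob_pred across bins;
    # a sigmoid-shaped deviation suggests method="sigmoid" may help
    prob_true, prob_pred = calibration_curve(y_test, proba, n_bins=10)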
The sigmoid method is biased in that it assumes the :ref:`calibration curve ` of the un-calibrated model has a sigmoid shape and is symmetrical [1]_. It is thus most effective when the un-calibrated model is -under-confident and has similar errors for both high and low -output errors. The symmetry assumption is of concern in highly imbalanced +under-confident and has similar calibration errors for both high and low +outputs. The symmetry assumption is of concern in highly imbalanced classification as un-calibrated classifiers can have asymmetric calibration errors. @@ -178,15 +178,29 @@ minimizes: subject to :math:`\hat{f}_i >= \hat{f}_j` whenever :math:`f_i >= f_j`. :math:`y_i` is the true label of sample :math:`i` and :math:`\hat{f}_i` is the output of the -calibrated classifier for sample :math:`i`. This method -is more general when compared to 'sigmoid' as the only restriction is that the -mapping function is monotonically increasing. It is thus more powerful as it -can correct any monotonic distortion of the un-calibrated model. However, it is -more prone to overfitting, especially on small datasets [5]_. +calibrated classifier for sample :math:`i` (i.e., the calibrated probability). +This method is more general when compared to 'sigmoid' as the only restriction +is that the mapping function is monotonically increasing. It is thus more +powerful as it can correct any monotonic distortion of the un-calibrated model. +However, it is more prone to overfitting, especially on small datasets [5]_. Overall, 'isotonic' will perform as well as or better than 'sigmoid' when there is enough data (greater than ~ 1000 samples) to avoid overfitting [1]_. +Multiclass support +^^^^^^^^^^^^^^^^^^ + +Both isotonic and sigmoid regressors only +support 1-dimensional data (e.g., binary classification output) but are +extended for multiclass classification if the `base_estimator` supports +multiclass predictions. For multiclass predictions, +:class:`CalibratedClassifierCV` calibrates for +each class separately in a :ref:`ovr_classification` fashion [4]_. When +predicting +probabilities, the calibrated probabilities for each class +are predicted separately. As those probabilities do not necessarily sum to +one, a postprocessing is performed to normalize them. + .. topic:: Examples: * :ref:`sphx_glr_auto_examples_calibration_plot_calibration_curve.py` From 83e74768a9680a66a52853485546d67fced6dc80 Mon Sep 17 00:00:00 2001 From: Lucy Liu Date: Wed, 22 Jul 2020 17:45:34 +0200 Subject: [PATCH 11/11] update --- doc/modules/calibration.rst | 25 ++++++++++++++++++------- 1 file changed, 18 insertions(+), 7 deletions(-) diff --git a/doc/modules/calibration.rst b/doc/modules/calibration.rst index 443bf6cae9856..cfc185c854edb 100644 --- a/doc/modules/calibration.rst +++ b/doc/modules/calibration.rst @@ -157,13 +157,19 @@ is the output of the un-calibrated classifier for sample :math:`i`. :math:`A` and :math:`B` are real numbers to be determined when fitting the regressor via maximum likelihood. -The sigmoid method is biased in that it assumes the :ref:`calibration curve -` of the un-calibrated model has a sigmoid shape and is -symmetrical [1]_. It is thus most effective when the un-calibrated model is +The sigmoid method assumes the :ref:`calibration curve ` +can be corrected by applying a sigmoid function to the raw predictions. 
This
+assumption has been empirically justified in the case of :ref:`svm` with
+common kernel functions on various benchmark datasets in section 2.1 of Platt
+1999 [3]_ but does not necessarily hold in general. Additionally, the
+logistic model works best if the calibration error is symmetrical, meaning
+the classifier output for each binary class is normally distributed with
+the same variance [6]_. This can be a problem for highly imbalanced
+classification problems, where outputs do not have equal variance.
+
+In general this method is most effective when the un-calibrated model is
under-confident and has similar calibration errors for both high and low
-outputs.
+outputs.

 Isotonic
 ^^^^^^^^
@@ -232,4 +238,9 @@ one, a postprocessing is performed to normalize them.
   .. [5] `Predicting accurate probabilities with a ranking loss.
          `_
          Menon AK, Jiang XJ, Vembu S, Elkan C, Ohno-Machado L.
-         Proc Int Conf Mach Learn. 2012;2012:703-710. \ No newline at end of file
+         Proc Int Conf Mach Learn. 2012;2012:703-710
+
+  .. [6] `Beyond sigmoids: How to obtain well-calibrated probabilities from
+         binary classifiers with beta calibration
+         `_
+         Kull, M., Silva Filho, T. M., & Flach, P. (2017).
\ No newline at end of file
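To close, a minimal sketch combining the `cv="prefit"` path with a Brier
score check (all split sizes and estimators are illustrative; as noted above,
a lower Brier score does not by itself guarantee better calibration)::

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.metrics import brier_score_loss
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=3000, random_state=0)

    # Three disjoint sets: fit the classifier, fit the calibrator, evaluate
    X_fit, X_held, y_fit, y_held = train_test_split(
        X, y, test_size=0.5, random_state=0)
    X_calib, X_eval, y_calib, y_eval = train_test_split(
        X_held, y_held, test_size=0.5, random_state=0)

    clf = GaussianNB().fit(X_fit, y_fit)

    # cv="prefit": all of (X_calib, y_calib) goes to the sigmoid regressor
    calibrated = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
    calibrated.fit(X_calib, y_calib)

    score = brier_score_loss(y_eval, calibrated.predict_proba(X_eval)[:, 1])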