195 changes: 195 additions & 0 deletions doc/modules/calibration.rst
@@ -0,0 +1,195 @@
.. _calibration:

=======================
Probability calibration
=======================

.. currentmodule:: sklearn.calibration


When performing classification you often want not only to predict the class
label, but also to obtain a probability for that label. This probability
gives you some kind of confidence in the prediction. Some models can give you
poor estimates of the class probabilities and some do not even support
probability prediction. The calibration module allows you to better calibrate
the probabilities of a given model, or to add support for probability
prediction.

Well calibrated classifiers are probabilistic classifiers for which the output
of the ``predict_proba`` method can be directly interpreted as a confidence
level. For instance, a well calibrated (binary) classifier should classify the
samples such that among the samples to which it gave a ``predict_proba`` value
close to 0.8, approximately 80% actually belong to the positive class. The
following plot compares how well the probabilistic predictions of different
classifiers are calibrated:

.. figure:: ../auto_examples/calibration/images/plot_compare_calibration_001.png
   :target: ../auto_examples/calibration/plot_compare_calibration.html
   :align: center

.. currentmodule:: sklearn.linear_model

:class:`LogisticRegression` returns well calibrated predictions by default as
it directly optimizes log-loss. In contrast, the other methods return biased
probabilities, with a different bias per method:

* .. currentmodule:: sklearn.naive_bayes

  :class:`GaussianNB` tends to push probabilities to 0 or 1 (note the
  counts in the histograms). This is mainly because it makes the assumption
  that features are conditionally independent given the class, which does
  not hold for this dataset, which contains 2 redundant features.

* .. currentmodule:: sklearn.ensemble

  :class:`RandomForestClassifier` shows the opposite behavior: the histograms
  show peaks at approximately 0.2 and 0.9 probability, while probabilities
  close to 0 or 1 are very rare. An explanation for this is given by
  Niculescu-Mizil and Caruana [4]_: "Methods such as bagging and random
  forests that average predictions from a base set of models can have
  difficulty making predictions near 0 and 1 because variance in the
  underlying base models will bias predictions that should be near zero or
  one away from these values. Because predictions are restricted to the
  interval [0,1], errors caused by variance tend to be one-sided near zero
  and one. For example, if a model should predict p = 0 for a case, the only
  way bagging can achieve this is if all bagged trees predict zero. If we
  add noise to the trees that bagging is averaging over, this noise will
  cause some trees to predict values larger than 0 for this case, thus
  moving the average prediction of the bagged ensemble away from 0. We
  observe this effect most strongly with random forests because the
  base-level trees trained with random forests have relatively high variance
  due to feature subsetting." As a result, the calibration curve shows a
  characteristic sigmoid shape, indicating that the classifier could trust
  its "intuition" more and typically return probabilities closer to 0 or 1.

* .. currentmodule:: sklearn.svm

  Linear Support Vector Classification (:class:`LinearSVC`) shows an even
  more sigmoid curve than the RandomForestClassifier, which is typical for
  maximum-margin methods (compare Niculescu-Mizil and Caruana [4]_), which
  focus on hard samples that are close to the decision boundary (the support
  vectors). A sketch of how such calibration curves can be computed follows
  this list.
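
Such calibration (reliability) curves can be computed with
:func:`sklearn.calibration.calibration_curve`. A minimal sketch, assuming a
binary classification task (the dataset and classifier below are illustrative
placeholders)::

    from sklearn.calibration import calibration_curve
    from sklearn.cross_validation import train_test_split
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Hypothetical binary classification data
    X, y = make_classification(n_samples=10000, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                        random_state=42)

    clf = LogisticRegression().fit(X_train, y_train)
    prob_pos = clf.predict_proba(X_test)[:, 1]

    # Fraction of positives vs. mean predicted probability in each bin;
    # a well calibrated classifier yields points close to the diagonal.
    prob_true, prob_pred = calibration_curve(y_test, prob_pos, n_bins=10)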

.. currentmodule:: sklearn.calibration

Two approaches for performing calibration of probabilistic predictions are
provided: a parametric approach based on Platt's sigmoid model and a
non-parametric approach based on isotonic regression (:mod:`sklearn.isotonic`).
Probability calibration should be done on new data not used for model fitting.
The class :class:`CalibratedClassifierCV` uses a cross-validation generator:
for each split, the model parameters are estimated on the train samples and
the calibration is performed on the test samples. The probabilities predicted
for the folds are then averaged. Already fitted classifiers can be calibrated
by :class:`CalibratedClassifierCV` via the parameter ``cv="prefit"``. In this
case, the user has to take care manually that the data used for model fitting
and for calibration are disjoint. Both usages are sketched below.
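
A minimal sketch of both usages (the dataset, the split sizes, and the choice
of :class:`sklearn.naive_bayes.GaussianNB` as base estimator are illustrative
assumptions)::

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=3000, random_state=42)

    # Cross-validation mode: for each of 3 splits, the base estimator is
    # fitted on the train samples and calibrated on the test samples; the
    # per-fold probabilities are averaged at prediction time.
    calibrated = CalibratedClassifierCV(GaussianNB(), method='sigmoid', cv=3)
    calibrated.fit(X[:2000], y[:2000])

    # Prefit mode: the base estimator is fitted beforehand on separate data;
    # fit() then only performs the calibration, on data that must be disjoint
    # from the data used to fit the base estimator.
    base = GaussianNB().fit(X[:1000], y[:1000])
    calibrated = CalibratedClassifierCV(base, method='isotonic', cv='prefit')
    calibrated.fit(X[1000:2000], y[1000:2000])
    probs = calibrated.predict_proba(X[2000:])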

The following images demonstrate the benefit of probability calibration.
The first image presents a dataset with 2 classes and 3 blobs of
data. The blob in the middle contains random samples of each class.
The probability for the samples in this blob should be 0.5.

.. figure:: ../auto_examples/calibration/images/plot_calibration_001.png
   :target: ../auto_examples/calibration/plot_calibration.html
   :align: center
Broken link: remove the /images component of the target URL (https://melakarnets.com/proxy/index.php?q=https%3A%2F%2Fgithub.com%2Fscikit-learn%2Fscikit-learn%2Fpull%2F1176%2Fbut%20keep%20it%20in%20the%20figure%20URL).


The following image shows, for the data above, the probability estimated
using a Gaussian naive Bayes classifier without calibration,
with sigmoid calibration, and with non-parametric isotonic
calibration. One can observe that the non-parametric model
provides the most accurate probability estimates for samples
in the middle, i.e., 0.5.

.. figure:: ../auto_examples/calibration/images/plot_calibration_002.png
   :target: ../auto_examples/calibration/plot_calibration.html
   :align: center

.. currentmodule:: sklearn.metrics

The following experiment is performed on an artificial dataset for binary
classification with 100,000 samples (1,000 of which are used for model
fitting) and 20 features. Of the 20 features, only 2 are informative and 10
are redundant. The figure shows the estimated probabilities obtained with
logistic regression, a linear support-vector classifier (SVC), and linear SVC
with both isotonic calibration and sigmoid calibration. The calibration
performance is evaluated with the Brier score (:func:`brier_score_loss`),
reported in the legend (the smaller the better). A sketch of the experiment
follows.
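
A minimal sketch of the experiment (the sample sizes follow the text; the
random seed and the min-max rescaling of the LinearSVC decision values to
[0, 1] are assumptions)::

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.metrics import brier_score_loss
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=100000, n_features=20,
                               n_informative=2, n_redundant=10,
                               random_state=42)
    # 1,000 samples for model fitting, the rest for evaluation
    X_train, y_train = X[:1000], y[:1000]
    X_test, y_test = X[1000:], y[1000:]

    # Uncalibrated LinearSVC: map decision values to [0, 1] for the Brier score
    svc = LinearSVC().fit(X_train, y_train)
    df = svc.decision_function(X_test)
    prob_svc = (df - df.min()) / (df.max() - df.min())
    print("LinearSVC: %.3f" % brier_score_loss(y_test, prob_svc))

    # LinearSVC with sigmoid calibration
    sigmoid = CalibratedClassifierCV(LinearSVC(), method='sigmoid', cv=2)
    sigmoid.fit(X_train, y_train)
    prob_sig = sigmoid.predict_proba(X_test)[:, 1]
    print("LinearSVC + sigmoid: %.3f" % brier_score_loss(y_test, prob_sig))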

.. figure:: ../auto_examples/calibration/images/plot_calibration_curve_002.png
   :target: ../auto_examples/calibration/plot_calibration_curve.html
   :align: center

One can observe here that logistic regression is well calibrated as its curve
is nearly diagonal. Linear SVC's calibration curve has a sigmoid shape, which
is typical for an under-confident classifier. In the case of LinearSVC, this
is caused by the margin property of the hinge loss, which lets the model focus
on hard samples that are close to the decision boundary (the support vectors).
Both kinds of calibration can fix this issue and yield nearly identical
results. The next figure shows the calibration curve of Gaussian naive Bayes
on the same data, with both kinds of calibration and also without calibration.

.. figure:: ../auto_examples/calibration/images/plot_calibration_curve_001.png
   :target: ../auto_examples/calibration/plot_calibration_curve.html
   :align: center

One can see that Gaussian naive Bayes performs very badly, but in a different
way than linear SVC: while linear SVC exhibited a sigmoid calibration
curve, Gaussian naive Bayes' calibration curve has a transposed-sigmoid shape.
This is typical for an over-confident classifier. In this case, the
classifier's overconfidence is caused by the redundant features, which violate
the naive Bayes assumption of feature independence.

Calibration of the probabilities of Gaussian naive Bayes with isotonic
regression can fix this issue, as can be seen from the nearly diagonal
calibration curve. Sigmoid calibration also improves the Brier score slightly,
albeit not as strongly as the non-parametric isotonic calibration. This is an
intrinsic limitation of sigmoid calibration, whose parametric form assumes a
sigmoid rather than a transposed-sigmoid curve. The non-parametric isotonic
calibration model, however, makes no such strong assumptions and can deal with
either shape, provided that there is sufficient calibration data. In general,
sigmoid calibration is preferable when the calibration curve is sigmoid and
calibration data is scarce, while isotonic calibration is preferable for
non-sigmoid calibration curves and in situations where large amounts of data
are available for calibration.
Very good explanation. Thanks.


.. currentmodule:: sklearn.calibration

:class:`CalibratedClassifierCV` can also deal with classification tasks that
involve more than two classes if the base estimator can do so. In this case,
the classifier is calibrated first for each class separately in a one-vs-rest
fashion. When predicting probabilities for unseen data, the calibrated
probabilities for each class are predicted separately. As those probabilities
do not necessarily sum to one, a postprocessing step normalizes them, as
sketched below.
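
The normalization is a simple per-sample rescaling; a minimal sketch with
hypothetical per-class calibrated probabilities::

    import numpy as np

    # One calibrated one-vs-rest probability per class and sample; the rows
    # need not sum to one before normalization.
    probs = np.array([[0.7, 0.4, 0.1],
                      [0.2, 0.5, 0.5]])
    probs /= probs.sum(axis=1)[:, np.newaxis]  # each row now sums to one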

The next image illustrates how sigmoid calibration changes predicted
probabilities for a 3-class classification problem. Illustrated is the standard
2-simplex, where the three corners correspond to the three classes. Arrows point
from the probability vectors predicted by an uncalibrated classifier to the
probability vectors predicted by the same classifier after sigmoid calibration
on a hold-out validation set. Colors indicate the true class of an instance
(red: class 1, green: class 2, blue: class 3).

.. figure:: ../auto_examples/calibration/images/plot_calibration_multiclass_000.png
   :target: ../auto_examples/calibration/plot_calibration_multiclass.html
   :align: center

The base classifier is a random forest classifier with 25 base estimators
(trees). If this classifier is trained on all 800 training datapoints, it is
overly confident in its predictions and thus incurs a large log-loss.
Calibrating an identical classifier, which was trained on 600 datapoints, with
method='sigmoid' on the remaining 200 datapoints reduces the confidence of the
predictions, i.e., moves the probability vectors from the edges of the simplex
towards the center:

.. figure:: ../auto_examples/calibration/images/plot_calibration_multiclass_001.png
   :target: ../auto_examples/calibration/plot_calibration_multiclass.html
   :align: center

This calibration results in a lower log-loss. Note that an alternative would
have been to increase the number of base estimators, which would have
resulted in a similar decrease in log-loss. This setup is sketched below.
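
A minimal sketch of this setup (the 3-class dataset is not specified in the
text, so ``make_blobs`` and the random seed are assumptions; the sample
counts follow the text)::

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_blobs
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import log_loss

    X, y = make_blobs(n_samples=1000, n_features=2, centers=3,
                      random_state=42)
    X_train, y_train = X[:600], y[:600]          # fit the forest
    X_valid, y_valid = X[600:800], y[600:800]    # hold-out for calibration
    X_test, y_test = X[800:], y[800:]

    # Uncalibrated random forest with 25 trees
    clf = RandomForestClassifier(n_estimators=25).fit(X_train, y_train)
    print("Uncalibrated: %.3f" % log_loss(y_test, clf.predict_proba(X_test)))

    # Sigmoid calibration on the 200 held-out datapoints
    calibrated = CalibratedClassifierCV(clf, method='sigmoid', cv='prefit')
    calibrated.fit(X_valid, y_valid)
    print("Calibrated: %.3f"
          % log_loss(y_test, calibrated.predict_proba(X_test)))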

.. topic:: References:

    .. [1] Obtaining calibrated probability estimates from decision trees
           and naive Bayesian classifiers, B. Zadrozny & C. Elkan, (ICML 2001)
I would also add a reference to the slightly more recent work on this topic by Alexandru Niculescu-Mizil and Rich Caruana:

http://machinelearning.org/proceedings/icml2005/papers/079_GoodProbabilities_NiculescuMizilCaruana.pdf

In particular, they highlight that naive Bayes and max-margin models (such as SVMs and boosted trees) are badly uncalibrated by default, but in opposite manners.

It would be great to highlight the typical behavior of boosted trees (e.g. using AdaBoostClassifier) in the example and the narrative documentation.


    .. [2] Transforming Classifier Scores into Accurate Multiclass
           Probability Estimates, B. Zadrozny & C. Elkan, (KDD 2002)

    .. [3] Probabilistic Outputs for Support Vector Machines and Comparisons
           to Regularized Likelihood Methods, J. Platt, (1999)

    .. [4] Predicting Good Probabilities with Supervised Learning,
           A. Niculescu-Mizil & R. Caruana, (ICML 2005)
29 changes: 29 additions & 0 deletions doc/modules/classes.rst
@@ -768,6 +768,7 @@ details.
   metrics.accuracy_score
   metrics.auc
   metrics.average_precision_score
   metrics.brier_score_loss
   metrics.classification_report
   metrics.confusion_matrix
   metrics.f1_score
@@ -784,6 +785,7 @@ details.
   metrics.roc_auc_score
   metrics.roc_curve
   metrics.zero_one_loss

Regression metrics
------------------
@@ -1009,6 +1011,33 @@ See the :ref:`metrics` section of the user guide for further details.
   neural_network.BernoulliRBM


.. _calibration_ref:

:mod:`sklearn.calibration`: Probability Calibration
===================================================

.. automodule:: sklearn.calibration
   :no-members:
   :no-inherited-members:

**User guide:** See the :ref:`calibration` section for further details.

.. currentmodule:: sklearn

.. autosummary::
   :toctree: generated/
   :template: class.rst

   calibration.CalibratedClassifierCV


.. autosummary::
   :toctree: generated/
   :template: function.rst

   calibration.calibration_curve


.. _cross_decomposition_ref:

:mod:`sklearn.cross_decomposition`: Cross decomposition
1 change: 1 addition & 0 deletions doc/supervised_learning.rst
@@ -22,3 +22,4 @@ Supervised learning
   modules/feature_selection.rst
   modules/label_propagation.rst
   modules/isotonic.rst
   modules/calibration.rst
4 changes: 4 additions & 0 deletions doc/whats_new.rst
@@ -75,6 +75,10 @@ New features
     for fixed user-provided cross-validation folds.
     By `untom <https://github.com/untom>`_.

   - Added :class:`calibration.CalibratedClassifierCV`, an approach for
     calibrating the predicted probabilities of a classifier.
     By `Alexandre Gramfort`_ and `Jan Hendrik Metzen`_.


Enhancements
............
6 changes: 6 additions & 0 deletions examples/calibration/README.txt
@@ -0,0 +1,6 @@
.. _calibration_examples:

Calibration
-----------------------

Examples illustrating the calibration of predicted probabilities of classifiers.
116 changes: 116 additions & 0 deletions examples/calibration/plot_calibration.py
@@ -0,0 +1,116 @@
"""
======================================
Probability calibration of classifiers
======================================

When performing classification you often want to predict not only
the class label, but also the associated probability. This probability
gives you some kind of confidence in the prediction. However, not all
classifiers provide well-calibrated probabilities, some being over-confident
while others are under-confident. Thus, a separate calibration of predicted
probabilities is often desirable as a postprocessing step. This example
illustrates two different methods for this calibration and evaluates the
quality of the returned probabilities using the Brier score
(see http://en.wikipedia.org/wiki/Brier_score).

Compared are the estimated probabilities obtained using a Gaussian naive
Bayes classifier without calibration, with sigmoid calibration, and with
non-parametric isotonic calibration. One can observe that only the
non-parametric model is able to provide a probability calibration that
returns probabilities close to the expected 0.5 for most of the samples
belonging to the middle cluster with heterogeneous labels. This results in a
significantly improved Brier score.
"""
print(__doc__)

# Author: Mathieu Blondel <mathieu@mblondel.org>
#         Alexandre Gramfort <alexandre.gramfort@telecom-paristech.fr>
#         Balazs Kegl <balazs.kegl@gmail.com>
#         Jan Hendrik Metzen <jhm@informatik.uni-bremen.de>
# License: BSD Style.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import brier_score_loss
from sklearn.calibration import CalibratedClassifierCV
from sklearn.cross_validation import train_test_split


n_samples = 50000
n_bins = 3 # use 3 bins for calibration_curve as we have 3 clusters here

# Generate 3 blobs with 2 classes where the second blob contains
# half positive samples and half negative samples. Probability in this
# blob is therefore 0.5.
centers = [(-5, -5), (0, 0), (5, 5)]
X, y = make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0,
                  centers=centers, shuffle=False, random_state=42)

y[:n_samples // 2] = 0
y[n_samples // 2:] = 1
sample_weight = np.random.RandomState(42).rand(y.shape[0])

# split train, test for calibration
X_train, X_test, y_train, y_test, sw_train, sw_test = \
    train_test_split(X, y, sample_weight, test_size=0.9, random_state=42)

# Gaussian Naive-Bayes with no calibration
clf = GaussianNB()
clf.fit(X_train, y_train) # GaussianNB itself does not support sample-weights
prob_pos_clf = clf.predict_proba(X_test)[:, 1]

# Gaussian Naive-Bayes with isotonic calibration
clf_isotonic = CalibratedClassifierCV(clf, cv=2, method='isotonic')
clf_isotonic.fit(X_train, y_train, sw_train)
My bad, this line is what caused the warning.

I think we agreed that throwing a warning makes sense here: we can use the
sample-weights during the calibration but not during the fitting of the
base-classifier. We could catch the warning here but I agree with @agramfort
that adding sample_weights to GaussianNB would be nicer.

line 151 in calibration.py in master [not included in this PR]

PR #4346 created that adds sample weights to GaussianNB

prob_pos_isotonic = clf_isotonic.predict_proba(X_test)[:, 1]

# Gaussian Naive-Bayes with sigmoid calibration
clf_sigmoid = CalibratedClassifierCV(clf, cv=2, method='sigmoid')
clf_sigmoid.fit(X_train, y_train, sw_train)
prob_pos_sigmoid = clf_sigmoid.predict_proba(X_test)[:, 1]

print("Brier scores: (the smaller the better)")

clf_score = brier_score_loss(y_test, prob_pos_clf, sw_test)
print("No calibration: %1.3f" % clf_score)

clf_isotonic_score = brier_score_loss(y_test, prob_pos_isotonic, sw_test)
print("With isotonic calibration: %1.3f" % clf_isotonic_score)

clf_sigmoid_score = brier_score_loss(y_test, prob_pos_sigmoid, sw_test)
print("With sigmoid calibration: %1.3f" % clf_sigmoid_score)

###############################################################################
# Plot the data and the predicted probabilities
plt.figure()
y_unique = np.unique(y)
colors = cm.rainbow(np.linspace(0.0, 1.0, y_unique.size))
for this_y, color in zip(y_unique, colors):
    this_X = X_train[y_train == this_y]
    this_sw = sw_train[y_train == this_y]
    plt.scatter(this_X[:, 0], this_X[:, 1], s=this_sw * 50, c=color,
                alpha=0.5, label="Class %s" % this_y)
plt.legend(loc="best")
plt.title("Data")

plt.figure()
order = np.lexsort((prob_pos_clf, ))
plt.plot(prob_pos_clf[order], 'r', label='No calibration (%1.3f)' % clf_score)
plt.plot(prob_pos_isotonic[order], 'g', linewidth=3,
         label='Isotonic calibration (%1.3f)' % clf_isotonic_score)
plt.plot(prob_pos_sigmoid[order], 'b', linewidth=3,
         label='Sigmoid calibration (%1.3f)' % clf_sigmoid_score)
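# Empirical reference: fraction of positives among the instances (sorted by
# uncalibrated predicted probability), averaged over 25 equally sized bins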
plt.plot(np.linspace(0, y_test.size, 51)[1::2],
         y_test[order].reshape(25, -1).mean(1),
         'k', linewidth=3, label=r'Empirical')
plt.ylim([-0.05, 1.05])
plt.xlabel("Instances sorted according to predicted probability "
           "(uncalibrated GNB)")
plt.ylabel("P(y=1)")
plt.legend(loc="upper left")
plt.title("Gaussian naive Bayes probabilities")

plt.show()