[MRG] Add decision threshold calibration wrapper #10117
Changes from all commits
@@ -1,4 +1,4 @@
.. _calibration:
.. _probability_calibration:

=======================
Probability calibration

@@ -208,3 +208,124 @@ a similar decrease in log-loss.
    .. [5] On the combination of forecast probabilities for
           consecutive precipitation periods. Wea. Forecasting, 5, 640–650.,
           Wilks, D. S., 1990a

.. _decision_threshold_calibration:

==============================
Decision Threshold calibration
==============================

.. currentmodule:: sklearn.calibration

Often Machine Learning classifiers base their
predictions on real-valued decision functions or probability estimates that
carry the inherited biases of their models. Additionally when using a machine
learning model the evaluation criteria can differ from the optimisation
objectives used by the model during training.

    [Reviewer] "that" -> ". These"
    Although I think we might land up rewriting some of this. I think these
    three paragraphs would be clearer with something like:

When predicting between two classes it is commonly advised that an appropriate
decision threshold is estimated based on some cutoff criteria rather than
arbitrarily using the midpoint of the space of possible values. Estimating a
decision threshold for a specific use case can help to increase the overall
accuracy of the model and provide better handling for sensitive classes.

    [Reviewer] I don't know what "cutoff criteria" are as distinct from
    "decision threshold".

    [Reviewer] Well, it's not entirely arbitrary. I don't really think that
    first sentence says a lot.

.. currentmodule:: sklearn.calibration

    [Reviewer] this is redundant

:class:`CutoffClassifier` can be used as a wrapper around a model for binary
classification to help obtain a more appropriate decision threshold and use it
for predicting new samples.

Usage
-----

To use the :class:`CutoffClassifier` you need to provide an estimator that has
a ``decision_function`` or a ``predict_proba`` method. The ``method``
parameter controls whether the first will be preferred over the second if both
are available.
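For illustration, the preference described above amounts to something like the
following (a minimal sketch; the helper name ``_get_scores`` and the fallback
to the positive-class probability are assumptions of this sketch, not the PR's
actual code)::

    def _get_scores(estimator, X, method='decision_function'):
        # Prefer the requested method when the estimator provides it...
        if method == 'decision_function' and hasattr(estimator,
                                                     'decision_function'):
            return estimator.decision_function(X)
        # ...otherwise fall back to the probability of the positive class.
        return estimator.predict_proba(X)[:, 1]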

The wrapped estimator can be pre-trained, in which case ``cv = 'prefit'``, or
not. If the classifier is not trained then a cross-validation loop specified by
the parameter ``cv`` can be used to obtain a decision threshold by averaging
all decision thresholds calculated on the hold-out parts of each cross
validation iteration. Finally the model is trained on all the provided data.
When using ``cv = 'prefit'`` you need to make sure to use a hold-out part of
your data for calibration.

    [Reviewer] Is this the right thing / a sensible thing to do? Do we have
    references for this? For CalibratedClassifierCV we just keep all the
    models and average them. We could do the same here. That might make more
    sense, but I don't know of any literature. Have you done any experiments?

    [Author] What experiments do you have in mind? Not more than evaluating
    the classifiers' prediction accuracy, tpr and tnr given the input
    parameters of the CutoffClassifier. Literature-wise I didn't find anything
    related to cross validation and cutoff points, but maybe I haven't
    searched enough. Do you suggest that instead of keeping the decision
    threshold we keep all the underlying trained models and combine / average
    their predictions? What would the combining criteria for the predictions
    be in this case? Just a mean of the predictions?

    [Reviewer] Right, it would be less obvious how to combine, but voting
    would be possible. Maybe some experiments that confirm that the current
    implementation is a sensible thing to do and works in practice? I.e. that
    averaging the thresholds makes it better, not worse, than a single
    hold-out?

    [Author] I've seen this in practice. My understanding was that the cv
    approach is also a way to obtain a threshold on your training data without
    worrying that it is overfit. I didn't expect it to necessarily improve the
    threshold, or significantly so. So what would be the purpose of keeping
    the underlying models? Allowing the user to combine them whatever way they
    want?

    [Reviewer] The point is that I haven't seen it in practice and I don't
    know of any write-up that says it's happening in practice, so I'd like to
    be convinced ;)
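To make the cross-validated behaviour discussed above concrete, here is a
rough sketch of the scheme described in the paragraph: find one threshold per
hold-out fold, average them, and refit on all the data. This is illustrative
only (``find_threshold`` stands in for whichever strategy is used, and
``X``/``y`` are assumed to be NumPy arrays); it is not the PR's
implementation::

    import numpy as np

    from sklearn.base import clone
    from sklearn.model_selection import StratifiedKFold


    def cv_threshold(estimator, X, y, find_threshold, cv=3):
        """Average the thresholds found on each hold-out fold."""
        thresholds = []
        for train_idx, holdout_idx in StratifiedKFold(n_splits=cv).split(X, y):
            fold_est = clone(estimator).fit(X[train_idx], y[train_idx])
            scores = fold_est.predict_proba(X[holdout_idx])[:, 1]
            thresholds.append(find_threshold(y[holdout_idx], scores))
        # the final model is trained on all the provided data
        return clone(estimator).fit(X, y), np.mean(thresholds)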

The strategies, controlled by the parameter ``strategy``, for finding
appropriate decision thresholds are based either on precision-recall estimates
or true positive and true negative rates. Specifically (an illustrative sketch
of these rules follows the list):

.. currentmodule:: sklearn.metrics

* ``f_beta``
  selects a decision threshold that maximizes the :func:`fbeta_score`. The
  value of beta is specified by the parameter ``beta``. The ``beta`` parameter
  determines the weight of precision. When ``beta = 1`` both precision and
  recall get the same weight, therefore the maximization target in this case
  is the :func:`f1_score`. If ``beta < 1`` more weight is given to precision
  whereas if ``beta > 1`` more weight is given to recall.

* ``roc``
  selects the decision threshold for the point on the :func:`roc_curve` that
  is closest to the ideal corner (0, 1)

* ``max_tpr``
  selects the decision threshold for the point that yields the highest true
  positive rate while maintaining a minimum true negative rate, specified by
  the parameter ``threshold``

    [Reviewer] We might want to think about how to avoid confusion with the

    [Author] hmm.. confusing indeed, what if we renamed it to

    [Reviewer] Yes,

* ``max_tnr``
  selects the decision threshold for the point that yields the highest true
  negative rate while maintaining a minimum true positive rate, specified by
  the parameter ``threshold``
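As referenced above, the selection rules for two of these strategies can be
sketched as follows (illustrative helpers, not the PR's code; ``y_score`` is
assumed to come from ``predict_proba`` or ``decision_function``)::

    import numpy as np

    from sklearn.metrics import precision_recall_curve, roc_curve


    def fbeta_threshold(y_true, y_score, beta=1.0):
        """Threshold maximizing F-beta, as in the ``f_beta`` strategy."""
        precision, recall, thresholds = precision_recall_curve(y_true, y_score)
        # precision and recall have one entry more than thresholds; drop it
        precision, recall = precision[:-1], recall[:-1]
        beta2 = beta ** 2
        fbeta = ((1 + beta2) * precision * recall
                 / (beta2 * precision + recall + 1e-12))
        return thresholds[np.argmax(fbeta)]


    def roc_threshold(y_true, y_score):
        """Threshold of the ROC point closest to the ideal corner (0, 1)."""
        fpr, tpr, thresholds = roc_curve(y_true, y_score)
        return thresholds[np.argmin(fpr ** 2 + (1 - tpr) ** 2)]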

Here is a simple usage example::

    >>> from sklearn.calibration import CutoffClassifier
    >>> from sklearn.datasets import load_breast_cancer
    >>> from sklearn.naive_bayes import GaussianNB
    >>> from sklearn.metrics import precision_score
    >>> from sklearn.model_selection import train_test_split

    >>> X, y = load_breast_cancer(return_X_y=True)
    >>> X_train, X_test, y_train, y_test = train_test_split(
    ...     X, y, train_size=0.6, random_state=42)
    >>> clf = CutoffClassifier(GaussianNB(), strategy='f_beta', beta=0.6,
    ...                        cv=3).fit(X_train, y_train)
    >>> y_pred = clf.predict(X_test)
    >>> precision_score(y_test, y_pred)  # doctest: +ELLIPSIS
    0.959...
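After fitting, the chosen cut-off is stored in the ``decision_threshold_``
attribute (the example script below prints it). A possible continuation of the
snippet above, reusing the names already defined there, compares against an
unwrapped ``GaussianNB`` with its default 0.5 probability threshold
(illustrative; exact scores depend on the split)::

    # The cut-off selected on the cross-validation hold-outs:
    threshold = clf.decision_threshold_

    # For comparison, the precision of an unwrapped GaussianNB that predicts
    # with its default 0.5 probability threshold on the same split:
    baseline = GaussianNB().fit(X_train, y_train)
    baseline_precision = precision_score(y_test, baseline.predict(X_test))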

.. topic:: Examples:

  * :ref:`sphx_glr_auto_examples_calibration_plot_decision_threshold_calibration.py`: Decision
    threshold calibration on the breast cancer dataset

.. currentmodule:: sklearn.calibration

The following image shows the results of using the :class:`CutoffClassifier`
for finding a decision threshold for a :class:`LogisticRegression` classifier
and an :class:`AdaBoostClassifier` for two use cases.

.. figure:: ../auto_examples/calibration/images/sphx_glr_plot_decision_threshold_calibration_001.png
   :target: ../auto_examples/calibration/plot_decision_threshold_calibration.html
   :align: center

In the first case we want to increase the overall accuracy of the classifier on
the breast cancer dataset. In the second case we want to find a decision
threshold that yields maximum true positive rate while maintaining a minimum
value for the true negative rate.

    [Reviewer] I think if this is described inside the example file, we do not
    need to repeat it here.

.. topic:: References:

  * Receiver-operating characteristic (ROC) plots: a fundamental
    evaluation tool in clinical medicine, MH Zweig, G Campbell -
    Clinical chemistry, 1993

Notes
-----

Calibrating the decision threshold of a classifier does not guarantee increased
performance. The generalisation ability of the obtained decision threshold has
to be evaluated.
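One simple way to carry out such an evaluation (a sketch, not part of the PR)
is to compare the metric of interest on a held-out test set for a wrapped and
an unwrapped classifier::

    from sklearn.calibration import CutoffClassifier  # added by this PR
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    base = LogisticRegression().fit(X_train, y_train)
    wrapped = CutoffClassifier(LogisticRegression(), strategy='f_beta', beta=1,
                               method='predict_proba',
                               cv=3).fit(X_train, y_train)

    # If the tuned threshold generalises, the wrapped classifier should not do
    # worse than the default threshold on unseen data.
    print('default    f1: %.3f' % f1_score(y_test, base.predict(X_test)))
    print('calibrated f1: %.3f' % f1_score(y_test, wrapped.predict(X_test)))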

@@ -0,0 +1,167 @@
"""
======================================================================
Decision threshold (cutoff point) calibration on breast cancer dataset
======================================================================

Machine learning classifiers often base their predictions on real-valued
decision functions that don't always have accuracy as their objective. Moreover
the learning objective of a model can differ from the user's needs, hence using
an arbitrary decision threshold as defined by the model may not be ideal.

    [Reviewer] This can be more succinct. The user guide says most of this.

The CutoffClassifier can be used to calibrate the decision threshold of a model
in order to increase the classifier's trustworthiness. Optimization objectives
during the decision threshold calibration can be the true positive and / or
the true negative rate as well as the f beta score.

In this example the decision threshold calibration is applied on two
classifiers trained on the breast cancer dataset. The goal in the first case is
to maximize the f1 score of the classifiers whereas in the second the goal is
to maximize the true positive rate while maintaining a minimum true negative
rate.

As you can see, after calibration the f1 score of the LogisticRegression
classifier has increased slightly whereas that of the AdaBoostClassifier has
stayed the same.

For the second goal, as seen after calibration, both classifiers achieve a
better true positive rate while their respective true negative rates have
decreased slightly or remained stable.
"""

# Author: Prokopios Gryllos <prokopis.gryllos@sentiance.com>
#
# License: BSD 3 clause

from __future__ import division

import numpy as np

from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.calibration import CutoffClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split


print(__doc__)

# percentage of the training set that will be used for calibration
calibration_samples_percentage = 0.2

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6,
                                                    random_state=42)

calibration_samples = int(len(X_train) * calibration_samples_percentage)

    [Reviewer] Why not use train_test_split again to get the calibration
    samples?

lr = LogisticRegression().fit(
    X_train[:-calibration_samples], y_train[:-calibration_samples])

    [Reviewer] please use lists and loops rather than this repetitive code and
    hard-to-read variable names like "f_one_lr_f_beta". This should rather be
    in some record-based structure (dicts? arrays? I don't mind) with
    strategy='f1', estimator='logistic', f1=value.

y_pred_lr = lr.predict(X_test)
tn_lr, fp_lr, fn_lr, tp_lr = confusion_matrix(y_test, y_pred_lr).ravel()
tpr_lr = tp_lr / (tp_lr + fn_lr)
tnr_lr = tn_lr / (tn_lr + fp_lr)
f_one_lr = f1_score(y_test, y_pred_lr)

ada = AdaBoostClassifier().fit(
    X_train[:-calibration_samples], y_train[:-calibration_samples])

y_pred_ada = ada.predict(X_test)
tn_ada, fp_ada, fn_ada, tp_ada = confusion_matrix(y_test, y_pred_ada).ravel()
tpr_ada = tp_ada / (tp_ada + fn_ada)
tnr_ada = tn_ada / (tn_ada + fp_ada)
f_one_ada = f1_score(y_test, y_pred_ada)

# objective 1: we want to calibrate the decision threshold in order to achieve
# better f1 score
# (calibration uses the held-out last part of the training set, which was not
# seen by the pre-fitted classifiers above)
lr_f_beta = CutoffClassifier(
    lr, strategy='f_beta', method='predict_proba', beta=1, cv='prefit').fit(
    X_train[-calibration_samples:], y_train[-calibration_samples:])

y_pred_lr_f_beta = lr_f_beta.predict(X_test)
f_one_lr_f_beta = f1_score(y_test, y_pred_lr_f_beta)

ada_f_beta = CutoffClassifier(
    ada, strategy='f_beta', method='predict_proba', beta=1, cv='prefit'
).fit(X_train[-calibration_samples:], y_train[-calibration_samples:])

y_pred_ada_f_beta = ada_f_beta.predict(X_test)
f_one_ada_f_beta = f1_score(y_test, y_pred_ada_f_beta)

# objective 2: we want to maximize the true positive rate while the true
# negative rate is at least 0.7
lr_max_tpr = CutoffClassifier(
    lr, strategy='max_tpr', method='predict_proba', threshold=0.7, cv='prefit'
).fit(X_train[-calibration_samples:], y_train[-calibration_samples:])

y_pred_lr_max_tpr = lr_max_tpr.predict(X_test)
tn_lr_max_tpr, fp_lr_max_tpr, fn_lr_max_tpr, tp_lr_max_tpr = \
    confusion_matrix(y_test, y_pred_lr_max_tpr).ravel()
tpr_lr_max_tpr = tp_lr_max_tpr / (tp_lr_max_tpr + fn_lr_max_tpr)
tnr_lr_max_tpr = tn_lr_max_tpr / (tn_lr_max_tpr + fp_lr_max_tpr)

ada_max_tpr = CutoffClassifier(
    ada, strategy='max_tpr', method='predict_proba', threshold=0.7, cv='prefit'
).fit(X_train[-calibration_samples:], y_train[-calibration_samples:])

y_pred_ada_max_tpr = ada_max_tpr.predict(X_test)
tn_ada_max_tpr, fp_ada_max_tpr, fn_ada_max_tpr, tp_ada_max_tpr = \
    confusion_matrix(y_test, y_pred_ada_max_tpr).ravel()
tpr_ada_max_tpr = tp_ada_max_tpr / (tp_ada_max_tpr + fn_ada_max_tpr)
tnr_ada_max_tpr = tn_ada_max_tpr / (tn_ada_max_tpr + fp_ada_max_tpr)

print('Calibrated threshold')
print('Logistic Regression classifier: {}'.format(
    lr_max_tpr.decision_threshold_))
print('AdaBoost classifier: {}'.format(ada_max_tpr.decision_threshold_))
print('Before calibration')
print('Logistic Regression classifier: tpr = {}, tnr = {}, f1 = {}'.format(
    tpr_lr, tnr_lr, f_one_lr))
print('AdaBoost classifier: tpr = {}, tnr = {}, f1 = {}'.format(
    tpr_ada, tnr_ada, f_one_ada))

print('True positive and true negative rates after calibration')
print('Logistic Regression classifier: tpr = {}, tnr = {}, f1 = {}'.format(
    tpr_lr_max_tpr, tnr_lr_max_tpr, f_one_lr_f_beta))
print('AdaBoost classifier: tpr = {}, tnr = {}, f1 = {}'.format(
    tpr_ada_max_tpr, tnr_ada_max_tpr, f_one_ada_f_beta))

#########
# plots #
#########
bar_width = 0.2

plt.subplot(2, 1, 1)
index = np.asarray([1, 2])
plt.bar(index, [f_one_lr, f_one_ada], bar_width, color='r',
        label='Before calibration')

plt.bar(index + bar_width, [f_one_lr_f_beta, f_one_ada_f_beta], bar_width,
        color='b', label='After calibration')

plt.xticks(index + bar_width / 2, ('f1 logistic', 'f1 adaboost'))

plt.ylabel('scores')
plt.title('f1 score')
plt.legend(bbox_to_anchor=(.5, -.2), loc='center', borderaxespad=0.)

plt.subplot(2, 1, 2)
index = np.asarray([1, 2, 3, 4])
plt.bar(index, [tpr_lr, tnr_lr, tpr_ada, tnr_ada],
        bar_width, color='r', label='Before calibration')

plt.bar(index + bar_width,
        [tpr_lr_max_tpr, tnr_lr_max_tpr, tpr_ada_max_tpr, tnr_ada_max_tpr],
        bar_width, color='b', label='After calibration')

plt.xticks(
    index + bar_width / 2,
    ('tpr logistic', 'tnr logistic', 'tpr adaboost', 'tnr adaboost'))
plt.ylabel('scores')
plt.title('true positive & true negative rate')

plt.subplots_adjust(hspace=0.6)
plt.show()
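
For reference, the reviewer's request above for lists, loops and a
record-based structure could be addressed with something roughly like the
following sketch (the ``evaluate`` helper, the dict layout and the reuse of
``train_test_split`` for the calibration hold-out are illustrative choices,
not the PR's code)::

    from __future__ import division

    from sklearn.calibration import CutoffClassifier
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix, f1_score
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6,
                                                        random_state=42)
    # reuse train_test_split to hold out 20% of the training set for
    # threshold calibration
    X_fit, X_calib, y_fit, y_calib = train_test_split(X_train, y_train,
                                                      test_size=0.2,
                                                      random_state=42)


    def evaluate(clf):
        """Record tpr, tnr and f1 of a fitted classifier on the test set."""
        y_pred = clf.predict(X_test)
        tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
        return {'tpr': tp / (tp + fn), 'tnr': tn / (tn + fp),
                'f1': f1_score(y_test, y_pred)}


    results = []
    for name, estimator in [('logistic', LogisticRegression()),
                            ('adaboost', AdaBoostClassifier())]:
        estimator.fit(X_fit, y_fit)
        results.append(dict(estimator=name, strategy='uncalibrated',
                            **evaluate(estimator)))
        for strategy, kwargs in [('f_beta', {'beta': 1}),
                                 ('max_tpr', {'threshold': 0.7})]:
            wrapped = CutoffClassifier(estimator, strategy=strategy,
                                       method='predict_proba', cv='prefit',
                                       **kwargs).fit(X_calib, y_calib)
            results.append(dict(estimator=name, strategy=strategy,
                                **evaluate(wrapped)))

    for record in results:
        print(record)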

    [Reviewer] This will mean that the page has two top-level headings.
    Rather, at the top of the page, create a heading of this level called
    "Prediction calibration" and then change the heading level of "Probability
    Calibration" and "Decision Threshold calibration" to fall under it.