ENH Add libsvm-like calibration procedure to CalibratedClassifierCV #17856

Merged 52 commits on Oct 29, 2020
52 commits
5b248de
wip
lucyleeow Jun 24, 2020
ccf55fb
refactor
lucyleeow Jun 25, 2020
1f97a53
merge master
lucyleeow Jun 30, 2020
2b87d0b
clean up, tests pass
lucyleeow Jul 1, 2020
2ee24cd
remove debugging test
lucyleeow Jul 1, 2020
4ae878b
lint
lucyleeow Jul 1, 2020
12cee4d
suggestions
lucyleeow Jul 1, 2020
e183131
check class y
lucyleeow Jul 1, 2020
1b411f7
add att docstr
lucyleeow Jul 1, 2020
b724760
wip
lucyleeow Jul 7, 2020
dadefa7
first iter, tests pass
lucyleeow Jul 7, 2020
28fc1c0
lint
lucyleeow Jul 7, 2020
443ae50
rename df to preds
lucyleeow Jul 8, 2020
093105d
lint
lucyleeow Jul 8, 2020
7dcd4e5
lint
lucyleeow Jul 9, 2020
9e1ea37
sep pred and fit calib
lucyleeow Jul 9, 2020
42d405b
lint
lucyleeow Jul 9, 2020
b628835
use partial
lucyleeow Jul 9, 2020
6d30cf8
suggestions, update docstring
lucyleeow Jul 28, 2020
062e804
merge master, amend for joblib
lucyleeow Jul 28, 2020
fec2395
lint
lucyleeow Jul 28, 2020
6c916a7
Merge branch 'master' into calbclf_ensemble
lucyleeow Jul 29, 2020
6701bc7
add tests
lucyleeow Jul 29, 2020
82b885f
lint
lucyleeow Jul 29, 2020
1fa0b31
use kwarg in cross val predict
lucyleeow Jul 29, 2020
aaa8793
wip
lucyleeow Jul 29, 2020
2282478
kwarg cv
lucyleeow Jul 29, 2020
ec81580
fix kwargs partial
lucyleeow Jul 30, 2020
1250cc5
use signature get pred
lucyleeow Jul 30, 2020
6dcac0c
use signature get pred
lucyleeow Jul 30, 2020
5e5f53b
add test ensemble
lucyleeow Jul 30, 2020
9c89a5d
set rand state svc
lucyleeow Jul 30, 2020
06c0088
update docs
lucyleeow Jul 30, 2020
7c873e9
suggestions
lucyleeow Jul 30, 2020
3495343
param test_calib, break into 3 tests, add data fixture
lucyleeow Jul 30, 2020
641fad3
lint
lucyleeow Jul 30, 2020
7fae559
og suggests
lucyleeow Jul 31, 2020
b880d3a
merge master
lucyleeow Jul 31, 2020
900a02f
pred -> predictons
lucyleeow Jul 31, 2020
b48efd3
Merge branch 'master' into calbclf_ensemble
ogrisel Aug 7, 2020
b08d1db
suggestion
lucyleeow Aug 27, 2020
2b92201
Merge branch 'calbclf_ensemble' of github.com:lucyleeow/scikit-learn …
lucyleeow Aug 27, 2020
9cce88c
merge master
lucyleeow Oct 8, 2020
1b2e3a1
suggestions, whats new
lucyleeow Oct 17, 2020
d396c6e
expand benefit
lucyleeow Oct 17, 2020
8c07fd4
formatting
lucyleeow Oct 18, 2020
577c7b3
suggestion
lucyleeow Oct 21, 2020
bfaf3e1
add user
lucyleeow Oct 22, 2020
cb83a96
Merge branch 'master' of github.com:scikit-learn/scikit-learn into ca…
ogrisel Oct 22, 2020
52c821c
suggestions
lucyleeow Oct 29, 2020
deb75fc
More stable multiclass test with Brier score
ogrisel Oct 29, 2020
aff6bb8
Unused import
ogrisel Oct 29, 2020
49 changes: 33 additions & 16 deletions doc/modules/calibration.rst
@@ -96,9 +96,9 @@ in [0, 1]. Denoting the output of the classifier for a given sample by :math:`f_
 the calibrator tries to predict :math:`p(y_i = 1 | f_i)`.

 The samples that are used to fit the calibrator should not be the same
-samples used to fit the classifier, as this would
-introduce bias. The classifier performance on its training data would be
-better than for novel data. Using the classifier output from training data
+samples used to fit the classifier, as this would introduce bias.
+This is because performance of the classifier on its training data would be
+better than for novel data. Using the classifier output of training data
 to fit the calibrator would thus result in a biased calibrator that maps to
 probabilities closer to 0 and 1 than it should.

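The bias described in this hunk can be illustrated with a small experiment. The sketch below is not part of the PR; the synthetic dataset, the random forest classifier, and all variable names are assumptions chosen only for illustration. It compares how confident and how accurate a classifier is on its own training data versus held-out data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary problem with label noise (flip_y), so the held-out
# accuracy cannot reach the training accuracy.
X, y = make_classification(
    n_samples=2000, n_features=20, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# Mean probability assigned to the predicted class ("confidence").
train_conf = clf.predict_proba(X_train).max(axis=1).mean()
test_conf = clf.predict_proba(X_test).max(axis=1).mean()

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)

print(f"train: confidence={train_conf:.3f}, accuracy={train_acc:.3f}")
print(f"test:  confidence={test_conf:.3f}, accuracy={test_acc:.3f}")
```

The classifier's scores on the training samples are more extreme and more often correct than on new data. A calibrator fitted on those training scores would learn the wrong mapping for the scores it sees at prediction time, which is why held-out predictions are used instead.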
@@ -107,22 +107,39 @@ Usage

 The :class:`CalibratedClassifierCV` class is used to calibrate a classifier.

-:class:`CalibratedClassifierCV` uses a cross-validation approach to fit both
-the classifier and the regressor. The data is split into k
-`(train_set, test_set)` couples (as determined by `cv`). The classifier
-(`base_estimator`) is trained on the train set, and its predictions on the
-test set are used to fit a regressor. This ensures that the data used to fit
-the classifier is always disjoint from the data used to fit the calibrator.
-After fitting, we end up with k
-`(classifier, regressor)` couples where each regressor maps the output of
-its corresponding classifier into [0, 1]. Each couple is exposed in the
-`calibrated_classifiers_` attribute, where each entry is a calibrated
+:class:`CalibratedClassifierCV` uses a cross-validation approach to ensure
+unbiased data is always used to fit the calibrator. The data is split into k
+`(train_set, test_set)` couples (as determined by `cv`). When `ensemble=True`
+(default), the following procedure is repeated independently for each
+cross-validation split: a clone of `base_estimator` is first trained on the
+train subset. Then its predictions on the test subset are used to fit a
+calibrator (either a sigmoid or isotonic regressor). This results in an
+ensemble of k `(classifier, calibrator)` couples where each calibrator maps
+the output of its corresponding classifier into [0, 1]. Each couple is exposed
+in the `calibrated_classifiers_` attribute, where each entry is a calibrated
 classifier with a :term:`predict_proba` method that outputs calibrated
 probabilities. The output of :term:`predict_proba` for the main
 :class:`CalibratedClassifierCV` instance corresponds to the average of the
-predicted probabilities of the `k` estimators in the
-`calibrated_classifiers_` list. The output of :term:`predict` is the class
-that has the highest probability.
+predicted probabilities of the `k` estimators in the `calibrated_classifiers_`
+list. The output of :term:`predict` is the class that has the highest
+probability.
+
+When `ensemble=False`, cross-validation is used to obtain 'unbiased'
+predictions for all the data, via
+:func:`~sklearn.model_selection.cross_val_predict`.
+These unbiased predictions are then used to train the calibrator. The attribute
+`calibrated_classifiers_` consists of only one `(classifier, calibrator)`
+couple where the classifier is the `base_estimator` trained on all the data.
+In this case the output of :term:`predict_proba` for
+:class:`CalibratedClassifierCV` is the predicted probabilities obtained
+from the single `(classifier, calibrator)` couple.
+
+The main advantage of `ensemble=True` is to benefit from the traditional
+ensembling effect (similar to :ref:`bagging`). The resulting ensemble should
+both be well calibrated and slightly more accurate than with `ensemble=False`.
+The main advantage of using `ensemble=False` is computational: it reduces the
+overall fit time by training only a single base classifier and calibrator
+pair, decreases the final model size and increases prediction speed.

 Alternatively an already fitted classifier can be calibrated by setting
 `cv="prefit"`. In this case, the data is not split and all of it is used to
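The two modes documented in the hunk above can be compared side by side. This is a minimal sketch, assuming scikit-learn 0.24 or later (where `ensemble` exists); the dataset, the `GaussianNB` base classifier, and the variable names are illustrative choices, not taken from the PR. The base classifier is passed positionally because the keyword shown in this diff (`base_estimator`) was renamed in later releases.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# ensemble=True: one (classifier, calibrator) couple per CV split.
ensemble_model = CalibratedClassifierCV(
    GaussianNB(), method="sigmoid", cv=3, ensemble=True
)
ensemble_model.fit(X_train, y_train)

# ensemble=False: a single couple; the calibrator is fitted on
# out-of-fold predictions and the classifier on all training data.
single_model = CalibratedClassifierCV(
    GaussianNB(), method="sigmoid", cv=3, ensemble=False
)
single_model.fit(X_train, y_train)

n_pairs_ensemble = len(ensemble_model.calibrated_classifiers_)
n_pairs_single = len(single_model.calibrated_classifiers_)
print(n_pairs_ensemble, n_pairs_single)

# With ensemble=True, predict_proba is the average over the couples.
manual_average = np.mean(
    [pair.predict_proba(X_test) for pair in ensemble_model.calibrated_classifiers_],
    axis=0,
)
ensemble_proba = ensemble_model.predict_proba(X_test)
single_proba = single_model.predict_proba(X_test)
```

With `cv=3`, the first model holds three couples and the second holds one, so the second is smaller and faster at prediction time, at the cost of the ensembling effect.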
8 changes: 8 additions & 0 deletions doc/whats_new/v0.24.rst
@@ -58,6 +58,14 @@ Changelog
   sparse matrix or dataframe at the start. :pr:`17546` by
   :user:`Lucy Liu <lucyleeow>`.

+- |Enhancement| Add `ensemble` parameter to
+  :class:`calibration.CalibratedClassifierCV`, which enables implementation
+  of calibration via an ensemble of calibrators (current method) or
+  just one calibrator using all the data (similar to the built-in feature of
+  :mod:`sklearn.svm` estimators with the `probability=True` parameter).
+  :pr:`17856` by :user:`Lucy Liu <lucyleeow>` and
+  :user:`Andrea Esuli <aesuli>`.
+
 :mod:`sklearn.cluster`
 ......................

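The "one calibrator using all the data" procedure mentioned in the changelog entry can be approximated by hand. The sketch below is an assumption-laden illustration, not the code added by this PR: it uses :func:`~sklearn.model_selection.cross_val_predict` to collect out-of-fold decision scores, then fits a plain `LogisticRegression` on those scores as a stand-in for a Platt-style sigmoid calibrator. The dataset and all variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="linear")

# Step 1: out-of-fold decision scores, so every training sample is scored
# by a model that did not see it during fitting.
oof_scores = cross_val_predict(
    clf, X_train, y_train, cv=3, method="decision_function"
)

# Step 2: fit one sigmoid-shaped calibrator on all out-of-fold scores.
# LogisticRegression is a stand-in here, not scikit-learn's internal
# sigmoid calibrator.
calibrator = LogisticRegression()
calibrator.fit(oof_scores.reshape(-1, 1), y_train)

# Step 3: refit the classifier on all training data and calibrate its
# scores on new samples with the single calibrator.
clf.fit(X_train, y_train)
test_scores = clf.decision_function(X_test).reshape(-1, 1)
calibrated_proba = calibrator.predict_proba(test_scores)
```

The result is a single classifier and a single calibrator, which mirrors the structure described for `ensemble=False`; the numbers will not match `CalibratedClassifierCV` exactly because the calibrator here is only an approximation.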