
[MRG] Add decision threshold calibration wrapper #10117


Conversation

@PGryllos (Contributor) commented Nov 12, 2017

Reference Issues/PRs

Fixes #8614

What does this implement/fix? Explain your changes.

This PR adds a decision threshold calibration wrapper for binary classifiers. It calibrates the decision threshold according to one of three objectives:

  • optimise the sum of the true positive and true negative rates
  • optimise the true positive rate while maintaining a minimum true negative rate
  • optimise the true negative rate while maintaining a minimum true positive rate

The wrapper can either receive a pre-trained base estimator or train one itself and calibrate the threshold using cross-validation loops.
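To make the objectives concrete, here is a minimal, runnable sketch of the second objective using only standard scikit-learn. This is generic illustration code, not this PR's API; the 0.95 minimum true negative rate is an arbitrary value chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Toy imbalanced problem; hold out a validation set for threshold selection.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
proba_val = clf.predict_proba(X_val)[:, 1]

fpr, tpr, thresholds = roc_curve(y_val, proba_val)
tnr = 1 - fpr

# Objective 2: maximise the true positive rate subject to a minimum true
# negative rate (here 0.95, an arbitrary illustrative constraint).
feasible = np.where(tnr >= 0.95)[0]
best = feasible[np.argmax(tpr[feasible])]
threshold = thresholds[best]

# Predict with the calibrated threshold instead of the default 0.5.
y_pred = (proba_val >= threshold).astype(int)
```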

Any other comments?

There are two examples available that illustrate the first two points.

The comments below are no longer valid, but I will leave them for the completeness of the conversation.

  1. Since (as discussed in Add wrapper class that changes threshold value for predict #8614) this wrapper focuses on binary classification, I have assumed that the labels will always be 0 or 1. Is this a correct assumption, or can the labels be completely arbitrary? Can there also be more than two labels, as long as one is considered positive and the rest negative?

  2. The current implementation gives the option of choosing the positive label, assuming that the class indices coincide with the labels. If the labels will always be in [0, 1], should we fix the positive label to 1? I thought it makes some sense to be able to choose which class is considered the sensitive one.

  3. In the case of cv != "prefit", the threshold is calibrated by averaging across folds (each trained on different data than that used for calibration), and the base estimator is then trained on all the data (see the sketch below). Do you see anything wrong with that?
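As a rough illustration of the procedure described in point 3, here is a sketch built from generic scikit-learn pieces. It is not the PR's actual code; the ROC-corner criterion is used only for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
base = LogisticRegression()

# For each fold: fit on the training part, pick a threshold on the held-out
# part (here: the ROC point closest to the (0, 1) corner), then average.
fold_thresholds = []
for train_idx, cal_idx in StratifiedKFold(n_splits=5).split(X, y):
    est = clone(base).fit(X[train_idx], y[train_idx])
    proba = est.predict_proba(X[cal_idx])[:, 1]
    fpr, tpr, thr = roc_curve(y[cal_idx], proba)
    fold_thresholds.append(thr[np.argmin(fpr ** 2 + (1 - tpr) ** 2)])

decision_threshold = np.mean(fold_thresholds)

# Finally, the base estimator is refitted on all of the data.
final_est = clone(base).fit(X, y)
y_pred = (final_est.predict_proba(X)[:, 1] >= decision_threshold).astype(int)
```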

@PGryllos force-pushed the feat/8614_add_threshold_calibration_wrapper branch from 465563c to 132864b on November 12, 2017 16:53
@jnothman (Member) commented Nov 12, 2017 via email

@PGryllos (Contributor, Author) commented Nov 13, 2017

@jnothman thanks a lot for the quick response. The visualisations I mentioned in the previous comment only show which point of the ROC curve was chosen (to validate that the chosen point is the one closest to the ideal corner), so there is not a lot to see there. I plan on making a more extensive example and adding it here in the following days.

@amueller (Member)

For your questions:

  1. The labels can be essentially arbitrary Python objects, but you can assume there are exactly two of them. Check other classification code to see how the mapping to 0 and 1 is handled (see the sketch after this comment).

  2. It should behave consistently with other metrics, which assume 1 is the positive label if the labels are [0, 1] or [-1, +1] and otherwise require pos_label to be set.

  3. That's what I would have done. It is somewhat different from what CalibratedClassifierCV does. I'm not sure if there's a standard behaviour in the literature. I guess there is no really great way to combine the different thresholded classifiers. You could vote, but that sounds like it would lose information. On the other hand, combining all the data together could completely change the meaning of the thresholds... hmm... In particular, if you use a grouped CV, I could see that going completely sideways.
    Can you create an example where different groups have different semantics for thresholds and see if your strategy breaks?
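For point 1, this is the usual scikit-learn pattern for mapping two arbitrary class labels to 0 and 1 (a generic sketch, not code from this PR):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Arbitrary (non-numeric) binary labels.
y = np.array(["spam", "ham", "ham", "spam", "ham"])

le = LabelEncoder().fit(y)
y_encoded = le.transform(y)   # array([1, 0, 0, 1, 0]); le.classes_ is ['ham', 'spam']

# Equivalent idiom often used inside estimators:
classes, y_encoded = np.unique(y, return_inverse=True)
```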

@amueller (Member)

tests are failing

@amueller (Member)

as you said in the other thread, we need a user guide and examples. Those often show whether the interface is good ;)



class OptimalCutoffClassifier(BaseEstimator, ClassifierMixin):
"""Optimal cutoff point selection.

Member review comment:
I'm not sure having Optimal in the name is a good idea. Maybe just CutoffClassifier? There is no description in the docstring on what it does.

- 'roc', selects the point on the roc_curve that is closer to the ideal
corner (0, 1)

- 'max_se', selects the point that yields the highest sensitivity with

Member review comment:
I prefer the terms "true positive rate" and "true negative rate" because they are pretty clear in their meaning.

@PGryllos force-pushed the feat/8614_add_threshold_calibration_wrapper branch from 3c7b8c9 to c42564d on December 22, 2017 22:39
@PGryllos force-pushed the feat/8614_add_threshold_calibration_wrapper branch from a695215 to 39e23c6 on December 23, 2017 13:40
@PGryllos force-pushed the feat/8614_add_threshold_calibration_wrapper branch from 39e23c6 to e63657c on December 23, 2017 13:40
@marctorsoc (Contributor)

what's the state of this? I thought this was going to be included in 0.22

@PGryllos (Contributor, Author) commented Dec 3, 2019

@marctorrellas I am sorry, I haven't caught up with the discussion, as I didn't have the capacity to interact with this over the past weeks. I may be able to find some time soon :|

@glemaitre self-assigned this Feb 23, 2020
@glemaitre added the Superseded (PR has been replaced by a newer PR) label Feb 23, 2020
@glemaitre (Member)

I will take over the PR to address the remaining comments.

@PGryllos (Contributor, Author) commented Feb 23, 2020

@glemaitre :) I haven't been able to dedicate any time to this for the past months due to job searching. Happy to see this completed!

@glemaitre (Member)

Don't worry. I think that this is an interesting feature and I want to give it a push.

@PGryllos (Contributor, Author)

Some thoughts that I have had over time:

  • the interface w.r.t. strategy is a bit excessive; it should probably only offer an option for tuning fbeta.
  • it is not very clear why averaging the decision thresholds across folds should work; should we instead select the decision threshold that works best across folds?

@glemaitre (Member)

the interface w.r.t. strategy is a bit excessive; it should probably only offer an option for tuning fbeta.

I would like to see if it could be useful in the case of class imbalance. Actually, @ogrisel and I were interested in having a manual option.

@PGryllos (Contributor, Author)

What would the manual option do? Allow the user to specify the exact FPR and TPR they'd like?

@glemaitre (Member) commented Feb 23, 2020 via email

@PGryllos (Contributor, Author)

That sounds relevant, and more practical than the current interface, which in fact has a similar rationale behind it. I agree; I believe the default should be fbeta.

@marctorsoc (Contributor)

  • it is not very clear why averaging the decision thresholds across folds should work; should we instead select the decision threshold that works best across folds?

I don't think that averaging thresholds is correct. For each training set there is an optimal threshold that will give the best results on unseen data, and these thresholds can be quite different.

I think you're mixing two things:

  • choosing the best threshold
  • estimating the performance of a model for which we have selected the best threshold, i.e. evaluating "the strategy of training on a train set and selecting the threshold on a dev set using criterion X"

For the former I would recommend just a train/dev split: train on the train set and select the threshold on the dev set (see the sketch below).

For the latter I would recommend nested CV, with the inner loop selecting the threshold and the outer loop evaluating. However, this will not tell you which model (i.e. which threshold) to use.
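A minimal sketch of the first recommendation, with F1 as the illustrative criterion (generic scikit-learn code, not this PR's implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, random_state=0)

# Train on the train split only.
clf = LogisticRegression().fit(X_train, y_train)
proba_dev = clf.predict_proba(X_dev)[:, 1]

# Select the threshold that maximises F1 on the dev split.
precision, recall, thresholds = precision_recall_curve(y_dev, proba_dev)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]  # last (P, R) point has no threshold
```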

@ogrisel (Member) commented Feb 25, 2020

I believe the default should be fbeta.

I am not so sure. Balanced accuracy could also be a good default, but it would probably be more computationally intensive than the F1 score. I don't have a strong opinion.

The wrapper can either receive a pre-trained base estimator or train one itself and calibrate the threshold using cross-validation loops.

I am not so sure that cross-validation is really necessary here. Has any of you tried, or read references about, just fitting the base estimator and tuning the threshold on the same entire training set?

@marctorsoc (Contributor)

Balanced accuracy is usually not good for imbalanced datasets. I would give the option of both F1 and balanced accuracy; neither is difficult or expensive to compute.

@ogrisel tuning on the training set is prone to overfitting, at least you need a dev set

@ogrisel (Member) commented Feb 25, 2020

@ogrisel tuning on the training set is prone to overfitting, at least you need a dev set

I was wondering whether that would really be the case in practice, but maybe you are right: the ROC curve could be very different on the training set than on a validation set for overfitting models, and therefore using it for cut-off selection would result in a poor choice of cut-off.

I don't understand the point of cut-off averaging across CV folds: if the cut-off selected on each CV split is not stable enough, then the cut-off classifier procedure would be pointless anyway. Fixing an arbitrary internal train-validation split would not solve the unstable cut-off problem; it would just hide it.

@jnothman (Member) commented Feb 25, 2020 via email

@marctorsoc (Contributor)

(I'm curious, Mark, why balanced accuracy is not good for imbalanced datasets.)

My name is Marc :)

My statement was not exactly what I wanted to say. Balanced accuracy is indeed better than naive accuracy. I'd rather choose F1, as I see it as more relevant for imbalanced datasets, and in the case of this PR let the user choose, maybe with balanced accuracy as the default...

Example:
TP = 5, FP = 50
FN = 10, TN = 10000

balanced accuracy = 66 %
f1 = 14 %

I think that in this case F1 reflects how bad this model is much better, but I'm sure other scenarios would show balanced accuracy as a good metric. Please let me know if there are any errors in the computations (see the quick check below).
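A quick check of these numbers with the usual definitions (plain Python):

```python
# TP = 5, FP = 50, FN = 10, TN = 10000, as in the example above.
tp, fp, fn, tn = 5, 50, 10, 10000

tpr = tp / (tp + fn)                           # recall / sensitivity = 0.333
tnr = tn / (tn + fp)                           # specificity = 0.995
balanced_accuracy = (tpr + tnr) / 2            # 0.664 -> ~66 %

precision = tp / (tp + fp)                     # 0.091
f1 = 2 * precision * tpr / (precision + tpr)   # 0.143 -> ~14 %
```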

@ogrisel tuning on the training set is prone to overfitting, at least you need a dev set

I was wondering whether that would really be the case in practice, but maybe you are right: the ROC curve could be very different on the training set than on a validation set for overfitting models, and therefore using it for cut-off selection would result in a poor choice of cut-off.

I don't understand the point of cut-off averaging across CV folds: if the cut-off selected on each CV split is not stable enough, then the cut-off classifier procedure would be pointless anyway. Fixing an arbitrary internal train-validation split would not solve the unstable cut-off problem; it would just hide it.

You can see the decision threshold as a hyperparameter. Would you choose your HP on the training set?

I don't agree with the last bit. If the cut-off is stable, then either procedure is good. If it's unstable, then you need to select the one that is good for the training portion you have used.

Note: we assume it's unstable w.r.t. the training data you use, not the validation set to tune it

@ogrisel (Member) commented Feb 25, 2020

Thanks for the intuition, I think I get your point. I would still like to see simulations for various models (e.g. overfitting and underfitting models of various kinds) to evaluate the impact of this strategy, and in particular to highlight catastrophic cases where CV-based cut-off averaging is detrimental (if someone would like to volunteer to do such a study, in a notebook for instance).

In any case, CV (+ refit on the full training set) is always significantly more computationally intensive than doing a single train-validation split (without refit on the full training set). So that is also a good reason not to do full-blown CV by default if it never improves upon single-split tuning.

@glemaitre (Member)

Example:
TP = 5, FP = 50
FN = 10, TN = 10000
balanced accuracy = 66 %
f1 = 14 %

I don't see the issue with balanced accuracy. Basically, 50% means randomness, so you are 16% above randomness, which is still not great, just as the 14% F1 score is telling you.

@PGryllos (Contributor, Author)

Why is f1 preferred over fbeta? Since f1 assumes that precision and recall are equally important to the user, and fbeta with the default beta = 1 is the same as f1, I'd think that fbeta makes more sense (see the quick illustration below).
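For reference, a small illustration of the relationship between the two metrics, using the standard scikit-learn functions (the toy labels are made up for the example):

```python
from sklearn.metrics import f1_score, fbeta_score

y_true = [0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0, 0]

f1 = f1_score(y_true, y_pred)
fb1 = fbeta_score(y_true, y_pred, beta=1)    # beta = 1 gives exactly the F1 score

# beta > 1 weights recall more heavily, beta < 1 weights precision more heavily.
recall_heavy = fbeta_score(y_true, y_pred, beta=2)
precision_heavy = fbeta_score(y_true, y_pred, beta=0.5)
```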

@marctorsoc (Contributor)

Example:
TP = 5, FP = 50
FN = 10, TN = 10000
balanced accuracy = 66 %
f1 = 14 %

I don't see the issue with balanced accuracy. Basically, 50% means randomness, so you are 16% above randomness, which is still not great, just as the 14% F1 score is telling you.

What I meant is that, in the example above, I think 14% reflects how bad the model is much better than 66% does; but of course, how well those numbers communicate that depends on whether the reader understands the baselines and the metrics. (Almost) any metric is good as long as you can interpret it correctly.

About F1 vs Fbeta, I was just trying to reduce the scope of the PR. I'm not against Fbeta.

Base automatically changed from master to main January 22, 2021 10:49
@cmarmo (Contributor) commented May 17, 2022

I'm closing this pull request as superseded by #16525, following our triaging rules. Thanks @PGryllos for your work! Feel free to comment in #16525 if you are interested in a follow-up.

Labels: module:utils, Superseded (PR has been replaced by a newer PR)
Linked issue: Add wrapper class that changes threshold value for predict (#8614)
10 participants