Add wrapper class that changes threshold value for predict #8614

amueller · 2017-03-19T22:47:53Z

This was discussed before, but not sure if there's an issue.
We should have a wrapper class that changes the decision threshold based on a cross-validation (or hold-out, pre-fit) estimate.
This is very very common so I think we should have a built-in solution.
Simple rules for selecting a new threshold are:

picking the point on the roc curve that's closest to the ideal corner
picking the point on the precision-recall curve that's closest to the ideal corner
optimize one metric while holding another one constant: find the best threshold to yield the best recall with a precision of at least 10% for example. We could also make this slightly less flexible and just say "the median of the largest threshold over cross-validation that yields at least X precision".

jnothman · 2017-03-19T23:13:11Z

Belongs in sklearn.calibration?

…

On 20 March 2017 at 09:47, Andreas Mueller ***@***.***> wrote: This was discussed before, but not sure if there's an issue. We should have a wrapper class that changes the decision threshold based on a cross-validation (or hold-out, pre-fit) estimate. This is very very common so I think we should have a built-in solution. Simple rules for selecting a new threshold are: - picking the point on the roc curve that's closest to the ideal corner - picking the point on the precision-recall curve that's closest to the ideal corner - optimize one metric while holding another one constant: find the best threshold to yield the best recall with a precision of at least 10% for example. We could also make this slightly less flexible and just say "the median of the largest threshold over cross-validation that yields at least X precision". — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#8614>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz61b8TPwZTL-6YRneiF08me7tE6cdks5rnbCagaJpZM4Mh6Hm> .

amueller · 2017-03-19T23:36:39Z

I wouldn't object to that.

amueller · 2017-03-19T23:38:32Z

There was also a recent paper by some colleagues from nyu how to properly control for different kinds of errors but I need to search for that...

glouppe · 2017-03-20T07:09:42Z

Related to #6663 from @betatim ?

amueller · 2017-03-26T17:35:11Z

Yes, definitely related to #6663. Though I think I would implement it as a meta-estimator and not a transformer and I would add cross-validation to adjust the threshold using mentioned strategies.

PGryllos · 2017-10-02T21:55:58Z

Hey, it seems like no one is working on that; can I give it a try?

jnothman · 2017-10-02T23:13:22Z

you're welcome to give it a go, though we often encourage new contributors to start with something smaller to become familiar with the contribution process. If I were you, I'd start by writing the class docstring and then tests.

PGryllos · 2017-10-03T21:41:13Z

@jnothman that sounds good. Indeed I wouldn't try to commit the whole feature at once. I also see it as a way to become more familiar with the library. I first need to get a good grip of the requested feature though. @amueller did you maybe find the paper you mentioned?
thnx in advance

PGryllos · 2017-10-19T11:37:43Z

to give an update. I went through the mentioned issues and got a better understanding of what the task is. Now I think I can start implementing the api of the class. But I could still use some help on what the exact threshold decision should be based on. Maybe as @amueller said provide all of three options and let the user decide which one to use?

jnothman · 2017-10-19T22:31:29Z

Start with one?

PGryllos · 2017-10-23T21:38:30Z

Ok I have to admit I am still pretty stuck :$ I looked at that the CalibrationClassifier in the calibration module which seems like it's tacking a very similar problem. #6663 and #4822 also work on calibrating probability thresholds. I am not sure how exactly the proposal fits in and I find it difficult to find my way around the problem. I would like to have some guidance or
some paper suggestions to help me understand what is being asked :P but if that is too much let me know if someone else should take over

amueller · 2017-10-24T21:31:56Z

Maybe start with implementing a wrapper that picks the point on the P/R curve that's closest to the top right. Try to understand what that would mean doing it on the training set (or using a "prefit" model).
Then, do the same, but inside a cross-validation loop, and take the average (?) of the thresholds.

amueller · 2017-10-24T21:37:24Z

This seems relevant: https://www.ncbi.nlm.nih.gov/pubmed/16828672 (via Max Kuhn's book) but I haven't really found more?

amueller · 2017-10-24T21:46:27Z

There's some info here: http://ncss.wpengine.netdna-cdn.com/wp-content/themes/ncss/pdf/Procedures/NCSS/One_ROC_Curve_and_Cutoff_Analysis.pdf

and maybe looking at rocr helps: http://rocr.bioinf.mpi-sb.mpg.de/

but I haven't actually found anything explicitly describing a methodology.

amueller · 2017-10-24T21:47:37Z

this seems closer? https://cran.r-project.org/web/packages/OptimalCutpoints/

PGryllos · 2017-10-25T21:47:50Z

@amueller thanks for taking the time.

Try to understand what that would mean doing it on the training set (or using a "prefit" model)

If I have understood correctly if the base_estimator is not prefit we should hold out part of the training set to calibrate the threshold and the rest to fit the classifier otherwise we can use the whole training set for calibration of the decision threshold. The same as in CalibratedClassifierCV class.

What I am actually still not 100% sure is how is this wrapper going to be different than CalibratedClassifierCV? This class already implements the cv folds logic for calibrating. Should this new wrapper be a similar class that just adds the option to calibrate using the ROC curve?

jnothman · 2017-10-25T22:08:48Z

Yes, that may be an appropriate solution. See also #4822 where something like this is implemented. Needs review

PGryllos · 2017-10-29T18:43:28Z

@jnothman @amueller I am thinking about how should the calibrated predict work in the case of multilabel classification.

i.e. we have 3 labels [1, 2, 3] and the thresholds we found after calibration are [.6, .3, .7]

for sample 1 predict_proba predicts condifence values [.5, .4, .6]. In this case only the confidence for label 2 is above the threshold so calibrated predict should return 2
for sample 2 predict_proba predicts condifence values [.7, .5, .6]. In this case we have two confidence values above the corresponding label's threshold. The confidence / threshold ratio for label 2 is higher than label 1. Do you think it would make sense to base the prediction on this ratio for this case?

PGryllos · 2017-10-29T19:28:25Z

also the way I see it the threshold calibration functionality should be offered through the CalibratedClassifierCV class (mainly because the name is so broad that covers all calibration practises), which currently offers probability estimation calibration, and it should be an option to choose whether to use the threshold calibration or the probability calibration or both. What do you think?

jnothman · 2017-10-29T22:07:07Z

in multilabel you should calibrate each output column independently.

…

On 30 Oct 2017 6:28 am, "Gryllos Prokopis" ***@***.***> wrote: also the way I see it the threshold calibration functionality should be offered through the CalibratedClassifierCV class (mainly because the name is so broad that covers all calibration practises), which currently offers probability estimation calibration, and one should be able to choose whether to use the threshold calibration or the probability calibration or both. What do you think? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#8614 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz6_jYMmQmOezCFyldzRv0QUGX95fbks5sxNHbgaJpZM4Mh6Hm> .

jnothman · 2017-10-29T22:12:52Z

I think this should maybe be a separate class but I'm not sure. I'm not really familiar with the algorithms. CalibratedClassifierCV already defines predict in the multiclass case as the argmax over calibrated probabilities. I suppose we could choose threshold calibration without probability calibration in that class with a method switch, but the fact that one handles multiclass and the other is only binary may lead to confusion.

PGryllos · 2017-10-30T08:24:43Z

in multilabel you should calibrate each output column independently

do you mean that the threshold calibration should be used only for binary classification?

jnothman · 2017-10-30T10:59:05Z

does it make sense beyond binary and multilabel classification? only in those do we tend to predict by threshold

PGryllos · 2017-10-30T12:20:30Z

@jnothman sorry I was using multilabel interchangeably with multiclass. When earlier I asked about multilabel I really meant multiclass. So I guess the threshold classification will not concern multiclass problems but only binary and mutilabel. So to answer the original question I made some comments above In the later case the classifier may also predict no label, if no threshold is being met.

I also agree with your point (now that I see multilabel and multiclass differently :p ) that it would be confusing if both functionalities were in the same class

PGryllos · 2017-11-22T12:54:09Z

But in binary case this is exactly equivalent of using 0.5 cutoff...

I am not sure if there are classifiers that produce un-thresholded scores instead of probabilities in which case I am not sure if you can say that the decision threshold is 0.5.

For the second part of the comment, it seems to me that the same can be achieved by allowing the user to decide which of the two metrics to optimise with what minimum value of the other (the third point described in the first comment). I am not confident enough to say whether allowing the user to specify custom cost functions would play nicely, I have to give it further thought, but I don't disagree with the idea per se. Maybe @amueller could give some feedback?

PGryllos · 2017-12-01T08:57:07Z

update I plan on making progress during the weekend; precisely I want to focus on the following

add the other mentioned methods for cut-off estimation
create examples to showcase the implementation

PGryllos · 2017-12-18T12:39:35Z

@amueller @jnothman I extended the implementation with two more methods for picking optimal cutoffs; changed naming and updated docstrings. I plan in the following days to add examples and tests. Do you think that at that point it will qualify for mrg review ?

jnothman · 2017-12-18T12:41:54Z

It's on my pile for review already, I'm just trying to get through a few other things. This one will require some thinking and/or reading up on my part.

PGryllos · 2017-12-18T13:25:17Z

Okay, thnx for the prompt reply. I will also try to make it as clear and comprehensible as possible.

amueller · 2017-12-18T16:59:29Z

@twolodzko @PGryllos
I agree that a cost function is one good way to deal with picking a threshold, though it's not necessarily natural in all cases. If you gamble (or do business, basically same thing) defining the costs is easy. In a medical example, defining the costs is non-obvious. How much more does it cost you if your child dies compared to a misdiagnosis? In these settings I think precision and recall (and associated measures) are more natural.

I don't think we'll introduce a cutoff parameter to predict. I'm not sure what the benefit of that would be - the semantics would be unclear (given that not all models are probabilistic) and it would just move the burden to find the cutoff to the user. And given that we don't have great mechanisms in place for this right now, people would probably tune it on the test set.

@twolodzko what I don't get from your implementation is what the interface of the cost function would be. That's the critical part, I think. You can just have a cost matrix, which is the simplest, and I think we should probably support that. Did you have a more general case in mind?

amueller · 2017-12-18T17:07:57Z

@twolodzko btw do you have references for methods for tuning a cutoff using cross-validation? I didn't really find any.

twolodzko · 2017-12-18T19:55:47Z

@amueller for example the paper you quote earlier on in the thread talks about finding optimal cutoff given the sensitivity + specificity criterion. I can't recall any specific references on that, in most cases they say that in general you should tune it based on the loss function that is specific for given problem, cross validation is rather not discussed but seems to be a natural choice.

My code basically takes as an argument a loss function (say metrics.mean_squared_error) and then uses optimizer with it. It is just an example.

amueller · 2017-12-18T20:09:42Z

@twolodzko The article doesn't talk about cross-validation, and while it might be "an obvious choice" it's not really obvious to me what to do. You could either cross-validate the actual threshold value, or you could keep all the models and apply their thresholds and let them vote. I feel like averaging the threshold value over folds sounds a bit dangerous, but the other option doesn't really seem any better?

Why would you provide a callable for classification? If it's binary, we only need a 2x2 matrix, right? The point is that deciding how to create the interface is the hard part, optimizing it is the easy part ;)

twolodzko · 2017-12-28T14:51:07Z

@PGryllos & @amueller FYI, see AUC: a misleading measure of the
performance of predictive distribution models by Lobo et al:

It has been assumed that in ROC plots the optimal classifier point is the one that maximizes
the sum of sensitivity and specificity (Zweig & Campbell, 1993).
However, Jiménez-Valverde & Lobo (2007) have found that a
threshold that minimizes the difference between sensitivity and
specificity performs slightly better than one that maximizes the
sum if commission and omission errors are equally costly. When
the threshold changes from 0 to 1, the rate of well-predicted
presences diminishes while the rate of well-predicted absences
increases. The point where both curves cross can be considered
the appropriate threshold if both types of errors are equally
weighted (Fig. 1a). In a ROC plot, this point lies at the intersection
of the ROC curve and the line perpendicular to the diagonal of
no discrimination (Fig. 1b), i.e., the ‘northwesternmost’ point of
the ROC curve. The two thresholds can be easily computed
without using the ROC curve. Both thresholds are highly correlated
and, more importantly, they also correlate with prevalence (Liu
et al., 2005; Jiménez-Valverde & Lobo, 2007). As a general rule,
a good classifier needs to minimize the false positive and negative
rates or, similarly, to maximize the true negative and positive
rates. Thus, if we place equal weight on presences and absences
there is only one correct threshold. This optimal threshold, the
one that minimizes the difference between sensitivity and specificity,
achieves this objective and provides a balanced trade-off between
commission and omission errors. Nevertheless, as pointed out
before, if different costs are assigned to false negatives and false
positives, and the prevalence bias is always taken into account,
the threshold should be selected according to the required criteria.
It is also necessary to underline that the transformation of
continuous probabilities into binary maps is frequently necessary
for many practical applications that rely on making decisions
(e.g., reserve selection).

Check also the referred papers for more discussion.

PGryllos · 2018-01-05T09:33:21Z

@twolodzko thnx a lot for the input; I will take a look in the coming days

nizaevka · 2018-11-23T13:47:12Z

So what? is there any solution to use GridSearch to tune the best threshold?
In the current implementation, I use cv in score function to find best th_ for all folds, after that calc average score for folds with that th_. It`s extremely uncomfortable and breaks the structural logic of sklearn.

jnothman · 2018-11-26T07:50:28Z

KNizaev by current implementation do you mean this pull request? Have you tried finding an implementation which does this outside core scikit-learn?

nizaevka · 2018-11-26T13:35:59Z

No, I mean my temporary implementation) Actually, I don`t know how to do it right.
In cv I do :

bind predict_proba for all folds => y_prob vector
get the list of possible thresholds from sklearn.metrics.roc_curve on y_prob
brute force to get the one "th_" that maximize score of y_prob after threshold applied
then use that "th_" to calculate the score of every fold

The problem, that every fold should have the same th_, so before calc score for one fold, we need y_prob from others folds.
Maybe I am wrong, would be appreciated for any solution.

Another way is to use GridSearch(like in #6663), but it is hard to know (without roc_curve) the proper th_ range for search, I assume th_ depends on others hyperparameters of estimator.

jnothman · 2018-11-26T22:26:29Z

All you should need is the roc_curve for each fold, if your optimisation is based on sensitivity/specificity.

nizaevka · 2018-11-27T11:54:43Z

What do you mean? th_ range for #6663 solution or what? Could you write more concretely what steps do you propose?
GridSearch should tune threshold and others hyperparams(hp) simultaneously, in general roc_curve depends on estimator`s hp, so I need roc_curves for each fold for all combinations of hp.

amueller · 2018-11-27T20:55:35Z

@KNizaev this is not a forum for usage questions or how to implement something. Try stackoverflow.

jmwoloso · 2019-03-28T07:01:19Z

The statistic we're looking for here is Youden's J. The scenario arises quite often (all the time in my line of work it seems, lol?) when you have a highly imbalanced dataset.

Our team was looking into implementing something like this as well and optimization via CV (using something like GridSearchCV on a previously-unused portion of the dataset) seemed the natural way to proceed as tuning by hand without CV (a.k.a. guessing) would introduce leakage. We also looked into Matthews' Correlation Coefficient as the metric to use for threshold optimization.

We never implemented it ultimately as we needed to bin the probabilities and turn them into letter grades, so we opted for a quick and dirty method using Jenks' Natural Breaks.

Seems like you'd want to GridSearch the hyper-params while optimizing for Youden's J and allowing a prob_threshold param that will be searched over as well.

Is this on hold for the time being? Thoughts?

jnothman · 2019-03-28T09:32:19Z

(optimising for Youden's J would be the same as for balanced accuracy)

jmwoloso · 2019-03-29T02:02:41Z

@jnothman exactly, sorry, was just trying to put a "name with a face". Looking forward to this and happy to help move it along if needed.

amueller · 2020-05-02T20:33:19Z

Not sure if this was linked before but this is sort-of a duplicate of #4813.

amueller added Enhancement Need Contributor New Feature and removed Enhancement labels Mar 19, 2017

lesteve added help wanted and removed Need Contributor labels Oct 18, 2017

amueller mentioned this issue Dec 12, 2017

[MRG] Support for multi-class roc_auc scores #7663

Closed

4 tasks

elyase mentioned this issue May 2, 2018

Directly handle out of scope messages RasaHQ/rasa#387

Closed

glemaitre mentioned this issue Feb 23, 2020

[WIP] FEA New meta-estimator to post-tune the decision_function/predict_proba threshold for binary classifiers #16525

Closed

cmarmo removed the help wanted label Jun 4, 2020

glemaitre mentioned this issue Jul 29, 2021

ENH: Add the classification models threshold as parameter of __init__ method #20635

Closed

cmarmo added the module:utils label Dec 16, 2021

lorentzenchr mentioned this issue Oct 31, 2022

Precision @ Recall K || Recall @ Precision K #20266

Open

glemaitre mentioned this issue Apr 7, 2023

FEA add TunedThresholdClassifier meta-estimator to post-tune the cut-off threshold #26120

Merged

adrinjalali closed this as completed in #26120 May 3, 2024

Add wrapper class that changes threshold value for predict #8614

Add wrapper class that changes threshold value for predict #8614

Comments

amueller commented Mar 19, 2017

jnothman commented Mar 19, 2017 via email

amueller commented Mar 19, 2017

amueller commented Mar 19, 2017

glouppe commented Mar 20, 2017

amueller commented Mar 26, 2017

PGryllos commented Oct 2, 2017 • edited Loading

jnothman commented Oct 2, 2017 via email

PGryllos commented Oct 3, 2017 • edited Loading

PGryllos commented Oct 19, 2017 • edited Loading

jnothman commented Oct 19, 2017 via email

PGryllos commented Oct 23, 2017 • edited Loading

amueller commented Oct 24, 2017

amueller commented Oct 24, 2017

amueller commented Oct 24, 2017

amueller commented Oct 24, 2017

PGryllos commented Oct 25, 2017 • edited Loading

jnothman commented Oct 25, 2017 via email

PGryllos commented Oct 29, 2017 • edited Loading

PGryllos commented Oct 29, 2017 • edited Loading

jnothman commented Oct 29, 2017 via email

jnothman commented Oct 29, 2017 via email

PGryllos commented Oct 30, 2017

jnothman commented Oct 30, 2017 via email

PGryllos commented Oct 30, 2017 • edited Loading

PGryllos commented Nov 22, 2017 • edited Loading

PGryllos commented Dec 1, 2017 • edited Loading

PGryllos commented Dec 18, 2017

jnothman commented Dec 18, 2017 via email

PGryllos commented Dec 18, 2017 • edited Loading

amueller commented Dec 18, 2017

amueller commented Dec 18, 2017

twolodzko commented Dec 18, 2017

amueller commented Dec 18, 2017 • edited Loading

twolodzko commented Dec 28, 2017 • edited Loading

PGryllos commented Jan 5, 2018

nizaevka commented Nov 23, 2018 • edited Loading

jnothman commented Nov 26, 2018 via email

nizaevka commented Nov 26, 2018

jnothman commented Nov 26, 2018 via email

nizaevka commented Nov 27, 2018

amueller commented Nov 27, 2018

jmwoloso commented Mar 28, 2019

jnothman commented Mar 28, 2019 via email

jmwoloso commented Mar 29, 2019

amueller commented May 2, 2020

PGryllos commented Oct 2, 2017 •

edited

Loading

PGryllos commented Oct 3, 2017 •

edited

Loading

PGryllos commented Oct 19, 2017 •

edited

Loading

PGryllos commented Oct 23, 2017 •

edited

Loading

PGryllos commented Oct 25, 2017 •

edited

Loading

PGryllos commented Oct 29, 2017 •

edited

Loading

PGryllos commented Oct 29, 2017 •

edited

Loading

PGryllos commented Oct 30, 2017 •

edited

Loading

PGryllos commented Nov 22, 2017 •

edited

Loading

PGryllos commented Dec 1, 2017 •

edited

Loading

PGryllos commented Dec 18, 2017 •

edited

Loading

amueller commented Dec 18, 2017 •

edited

Loading

twolodzko commented Dec 28, 2017 •

edited

Loading

nizaevka commented Nov 23, 2018 •

edited

Loading