
Add wrapper class that changes threshold value for predict #8614

Closed
amueller opened this issue Mar 19, 2017 · 52 comments · Fixed by #26120

@amueller
Member

This was discussed before, but I'm not sure there's an issue for it.
We should have a wrapper class that changes the decision threshold based on a cross-validation (or hold-out / pre-fit) estimate.
This is very common, so I think we should have a built-in solution.
Simple rules for selecting a new threshold are:

  • picking the point on the ROC curve that's closest to the ideal corner (see the sketch after this list)
  • picking the point on the precision-recall curve that's closest to the ideal corner
  • optimizing one metric while constraining another: for example, find the threshold that yields the best recall with a precision of at least 10%. We could also make this slightly less flexible and just say "the median, over cross-validation folds, of the largest threshold that yields at least X precision".
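
A minimal sketch of the first rule, assuming a binary problem and only `roc_curve`; the helper name `roc_corner_threshold` is illustrative, not a proposed API:

```python
import numpy as np
from sklearn.metrics import roc_curve


def roc_corner_threshold(y_true, y_score):
    """Pick the cutoff whose (FPR, TPR) point is closest to the ideal corner (0, 1)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    # Squared Euclidean distance of every candidate cutoff to the ideal corner.
    distances = fpr ** 2 + (1 - tpr) ** 2
    return thresholds[np.argmin(distances)]
```
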
@jnothman
Member

jnothman commented Mar 19, 2017 via email

@amueller
Member Author

I wouldn't object to that.

@amueller
Member Author

There was also a recent paper by some colleagues from NYU on how to properly control for different kinds of errors, but I need to search for that...

@glouppe
Contributor

glouppe commented Mar 20, 2017

Related to #6663 from @betatim ?

@amueller
Member Author

Yes, definitely related to #6663. Though I think I would implement it as a meta-estimator rather than a transformer, and I would add cross-validation to adjust the threshold using the strategies mentioned above.

@PGryllos
Contributor

PGryllos commented Oct 2, 2017

Hey, it seems like no one is working on that; can I give it a try?

@jnothman
Member

jnothman commented Oct 2, 2017 via email

@PGryllos
Contributor

PGryllos commented Oct 3, 2017

@jnothman that sounds good. Indeed, I wouldn't try to commit the whole feature at once. I also see it as a way to become more familiar with the library. I first need to get a good grip on the requested feature, though. @amueller, did you maybe find the paper you mentioned?
Thanks in advance.

@PGryllos
Contributor

PGryllos commented Oct 19, 2017

To give an update: I went through the mentioned issues and got a better understanding of what the task is. I think I can now start implementing the API of the class, but I could still use some help on what exactly the threshold decision should be based on. Maybe, as @amueller said, provide all three options and let the user decide which one to use?

@jnothman
Member

jnothman commented Oct 19, 2017 via email

@PGryllos
Contributor

PGryllos commented Oct 23, 2017

OK, I have to admit I am still pretty stuck :$ I looked at CalibratedClassifierCV in the calibration module, which seems to be tackling a very similar problem. #6663 and #4822 also work on calibrating probability thresholds. I am not sure how exactly this proposal fits in, and I find it difficult to find my way around the problem. I would appreciate some guidance or paper suggestions to help me understand what is being asked :P but if that is too much, let me know if someone else should take over.

@amueller
Member Author

Maybe start with implementing a wrapper that picks the point on the P/R curve that's closest to the top right. Try to understand what that would mean doing it on the training set (or using a "prefit" model).
Then, do the same, but inside a cross-validation loop, and take the average (?) of the thresholds.
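
A rough sketch of both steps, assuming a binary classifier with `predict_proba` and array inputs; `pr_corner_threshold` and `cv_threshold` are made-up names, and whether averaging the per-fold thresholds is the right aggregation is exactly the open question here:

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import StratifiedKFold


def pr_corner_threshold(y_true, y_score):
    """Cutoff whose (recall, precision) point is closest to the top-right corner (1, 1)."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision/recall have one more entry than thresholds; drop the final point.
    distances = (1 - precision[:-1]) ** 2 + (1 - recall[:-1]) ** 2
    return thresholds[np.argmin(distances)]


def cv_threshold(estimator, X, y, n_splits=5):
    """Average the per-fold cutoffs found on held-out data."""
    cutoffs = []
    for train, test in StratifiedKFold(n_splits=n_splits).split(X, y):
        est = clone(estimator).fit(X[train], y[train])
        y_score = est.predict_proba(X[test])[:, 1]
        cutoffs.append(pr_corner_threshold(y[test], y_score))
    return np.mean(cutoffs)
```
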

@amueller
Member Author

This seems relevant: https://www.ncbi.nlm.nih.gov/pubmed/16828672 (via Max Kuhn's book), but I haven't really found more.

@amueller
Member Author

There's some info here: http://ncss.wpengine.netdna-cdn.com/wp-content/themes/ncss/pdf/Procedures/NCSS/One_ROC_Curve_and_Cutoff_Analysis.pdf

and maybe looking at ROCR helps: http://rocr.bioinf.mpi-sb.mpg.de/

but I haven't actually found anything explicitly describing a methodology.

@PGryllos
Contributor

PGryllos commented Oct 25, 2017

@amueller thanks for taking the time.

Try to understand what that would mean doing it on the training set (or using a "prefit" model)

If I have understood correctly: if the base_estimator is not prefit, we should hold out part of the training set to calibrate the threshold and use the rest to fit the classifier; otherwise, we can use the whole training set for calibration of the decision threshold. The same as in the CalibratedClassifierCV class.

What I am actually still not 100% sure about is how this wrapper is going to be different from CalibratedClassifierCV. That class already implements the CV-folds logic for calibrating. Should this new wrapper be a similar class that just adds the option to calibrate using the ROC curve?

@jnothman
Member

jnothman commented Oct 25, 2017 via email

@PGryllos
Contributor

PGryllos commented Oct 29, 2017

@jnothman @amueller I am thinking about how the calibrated predict should work in the case of multilabel classification.

e.g. we have 3 labels [1, 2, 3] and the thresholds we found after calibration are [.6, .3, .7]

  • for sample 1, predict_proba gives confidence values [.5, .4, .6]. In this case only the confidence for label 2 is above its threshold, so the calibrated predict should return 2
  • for sample 2, predict_proba gives confidence values [.7, .5, .6]. In this case we have two confidence values above the corresponding label's threshold. The confidence/threshold ratio for label 2 is higher than for label 1. Do you think it would make sense to base the prediction on this ratio in that case?

@PGryllos
Contributor

PGryllos commented Oct 29, 2017

Also, the way I see it, the threshold calibration functionality should be offered through the CalibratedClassifierCV class (mainly because the name is so broad that it covers all calibration practices), which currently offers probability estimate calibration, and there should be an option to choose whether to use threshold calibration, probability calibration, or both. What do you think?

@jnothman
Member

jnothman commented Oct 29, 2017 via email

@jnothman
Member

jnothman commented Oct 29, 2017 via email

@PGryllos
Contributor

in multilabel you should calibrate each output column independently

Do you mean that threshold calibration should be used only for binary classification?

@jnothman
Member

jnothman commented Oct 30, 2017 via email

@PGryllos
Contributor

PGryllos commented Oct 30, 2017

@jnothman sorry, I was using "multilabel" interchangeably with "multiclass"; when I asked about multilabel earlier, I really meant multiclass. So I guess the threshold calibration will not concern multiclass problems, only binary and multilabel. To answer the original question from my comments above: in the latter case the classifier may also predict no label if no threshold is met.

I also agree with your point (now that I see multilabel and multiclass as different things :p) that it would be confusing if both functionalities were in the same class.
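
A tiny sketch of what per-column thresholding looks like in the multilabel case, reusing the illustrative numbers from the earlier comment; note that a sample can end up with several labels or none:

```python
import numpy as np

thresholds = np.array([0.6, 0.3, 0.7])       # one cutoff per label
proba = np.array([[0.5, 0.4, 0.6],           # sample 1
                  [0.7, 0.5, 0.6]])          # sample 2
# Each column is thresholded independently of the others.
predictions = (proba >= thresholds).astype(int)
# -> [[0, 1, 0],
#     [1, 1, 0]]
```
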

@PGryllos
Contributor

PGryllos commented Nov 22, 2017

But in binary case this is exactly equivalent of using 0.5 cutoff...

I am not sure whether there are classifiers that produce un-thresholded scores instead of probabilities; in that case, I am not sure you can say that the decision threshold is 0.5.

For the second part of the comment, it seems to me that the same can be achieved by allowing the user to decide which of the two metrics to optimise, with a minimum value for the other (the third point described in the first comment). I am not confident enough to say whether allowing the user to specify custom cost functions would play nicely; I have to give it further thought, but I don't disagree with the idea per se. Maybe @amueller could give some feedback?
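
A minimal sketch of that third option ("best recall subject to a minimum precision"); the function name and the handling of an infeasible constraint are just illustrative choices:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_for_constrained_recall(y_true, y_score, min_precision=0.1):
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    precision, recall = precision[:-1], recall[:-1]  # align with thresholds
    feasible = precision >= min_precision
    if not feasible.any():
        raise ValueError("No threshold reaches the requested precision.")
    # Among the cutoffs satisfying the precision constraint, keep the one
    # with the highest recall.
    return thresholds[feasible][np.argmax(recall[feasible])]
```
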

@PGryllos
Contributor

PGryllos commented Dec 1, 2017

Update: I plan on making progress during the weekend; specifically, I want to focus on the following:

  1. add the other mentioned methods for cut-off estimation
  2. create examples to showcase the implementation

@PGryllos
Contributor

@amueller @jnothman I extended the implementation with two more methods for picking optimal cutoffs, changed the naming, and updated the docstrings. In the following days I plan to add examples and tests. Do you think that at that point it will qualify for MRG review?

@jnothman
Member

jnothman commented Dec 18, 2017 via email

@PGryllos
Contributor

PGryllos commented Dec 18, 2017

Okay, thanks for the prompt reply. I will also try to make it as clear and comprehensible as possible.

@amueller
Member Author

@twolodzko @PGryllos
I agree that a cost function is one good way to deal with picking a threshold, though it's not necessarily natural in all cases. If you gamble (or do business, which is basically the same thing), defining the costs is easy. In a medical setting, defining the costs is non-obvious: how much more does it cost you if your child dies compared to a misdiagnosis? In these settings I think precision and recall (and associated measures) are more natural.

I don't think we'll introduce a cutoff parameter to predict. I'm not sure what the benefit of that would be: the semantics would be unclear (given that not all models are probabilistic), and it would just move the burden of finding the cutoff to the user. And given that we don't have great mechanisms in place for this right now, people would probably tune it on the test set.

@twolodzko what I don't get from your implementation is what the interface of the cost function would be. That's the critical part, I think. You could just have a cost matrix, which is the simplest option, and I think we should probably support that. Did you have a more general case in mind?
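
One way a 2x2 cost matrix could drive the cutoff (a sketch of the idea, not a proposed interface): evaluate the total cost of every candidate threshold on held-out data and keep the cheapest one.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def min_cost_threshold(y_true, y_score, cost_matrix):
    """cost_matrix[i, j] is the cost of predicting class j when the truth is class i."""
    candidates = np.unique(y_score)
    costs = []
    for t in candidates:
        cm = confusion_matrix(y_true, (y_score >= t).astype(int), labels=[0, 1])
        costs.append((cm * cost_matrix).sum())
    return candidates[int(np.argmin(costs))]


# Example cost matrix: a false negative is ten times as costly as a false positive.
# cost_matrix = np.array([[0, 1],
#                         [10, 0]])
```
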

@amueller
Member Author

@twolodzko btw do you have references for methods for tuning a cutoff using cross-validation? I didn't really find any.

@twolodzko
Contributor

@amueller for example, the paper you quoted earlier in the thread talks about finding the optimal cutoff given the sensitivity + specificity criterion. I can't recall any specific references beyond that; in most cases they say that, in general, you should tune it based on a loss function specific to the given problem. Cross-validation is rarely discussed, but it seems a natural choice.

My code basically takes a loss function as an argument (say metrics.mean_squared_error) and then uses an optimizer with it. It is just an example.

@amueller
Member Author

amueller commented Dec 18, 2017

@twolodzko The article doesn't talk about cross-validation, and while it might be "an obvious choice", it's not really obvious to me what to do. You could either cross-validate the actual threshold value, or you could keep all the models, apply their thresholds, and let them vote. I feel like averaging the threshold value over folds sounds a bit dangerous, but the other option doesn't really seem any better.

Why would you provide a callable for classification? If it's binary, we only need a 2x2 matrix, right? The point is that deciding how to create the interface is the hard part; optimizing it is the easy part ;)

@twolodzko
Contributor

twolodzko commented Dec 28, 2017

@PGryllos & @amueller FYI, see "AUC: a misleading measure of the performance of predictive distribution models" by Lobo et al.:

It has been assumed that in ROC plots the optimal classifier point is the one that maximizes the sum of sensitivity and specificity (Zweig & Campbell, 1993). However, Jiménez-Valverde & Lobo (2007) have found that a threshold that minimizes the difference between sensitivity and specificity performs slightly better than one that maximizes the sum if commission and omission errors are equally costly. When the threshold changes from 0 to 1, the rate of well-predicted presences diminishes while the rate of well-predicted absences increases. The point where both curves cross can be considered the appropriate threshold if both types of errors are equally weighted (Fig. 1a). In a ROC plot, this point lies at the intersection of the ROC curve and the line perpendicular to the diagonal of no discrimination (Fig. 1b), i.e., the ‘northwesternmost’ point of the ROC curve. The two thresholds can be easily computed without using the ROC curve. Both thresholds are highly correlated and, more importantly, they also correlate with prevalence (Liu et al., 2005; Jiménez-Valverde & Lobo, 2007). As a general rule, a good classifier needs to minimize the false positive and negative rates or, similarly, to maximize the true negative and positive rates. Thus, if we place equal weight on presences and absences there is only one correct threshold. This optimal threshold, the one that minimizes the difference between sensitivity and specificity, achieves this objective and provides a balanced trade-off between commission and omission errors. Nevertheless, as pointed out before, if different costs are assigned to false negatives and false positives, and the prevalence bias is always taken into account, the threshold should be selected according to the required criteria. It is also necessary to underline that the transformation of continuous probabilities into binary maps is frequently necessary for many practical applications that rely on making decisions (e.g., reserve selection).

Check also the referred papers for more discussion.
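
Since sensitivity is the TPR and specificity is 1 - FPR, the "difference minimizer" quoted above can be read straight off `roc_curve`'s output. A sketch, with an illustrative function name:

```python
import numpy as np
from sklearn.metrics import roc_curve


def balanced_sens_spec_threshold(y_true, y_score):
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    sensitivity, specificity = tpr, 1 - fpr
    # Cutoff where sensitivity and specificity are (nearly) equal.
    return thresholds[np.argmin(np.abs(sensitivity - specificity))]
```
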

@PGryllos
Contributor

PGryllos commented Jan 5, 2018

@twolodzko thanks a lot for the input; I will take a look in the coming days.

@nizaevka

nizaevka commented Nov 23, 2018

So, is there any solution for using GridSearch to tune the best threshold?
In my current implementation, I use CV in the score function to find the best th_ for all folds, and after that calculate the average score over folds with that th_. It's extremely awkward and breaks the structural logic of sklearn.

@jnothman
Member

jnothman commented Nov 26, 2018 via email

@nizaevka

No, I mean my own temporary implementation :) Actually, I don't know how to do it right.
In CV I do:

  • combine predict_proba for all folds => a y_prob vector
  • get the list of possible thresholds from sklearn.metrics.roc_curve on y_prob
  • brute-force search for the one "th_" that maximizes the score of y_prob after the threshold is applied
  • then use that "th_" to calculate the score of every fold

The problem is that every fold should have the same th_, so before calculating the score for one fold, we need y_prob from the other folds.
Maybe I am wrong; I would appreciate any solution.

Another way is to use GridSearch (like in #6663), but it is hard to know (without roc_curve) the proper th_ range to search; I assume th_ depends on the other hyperparameters of the estimator.
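
A bare-bones sketch of that GridSearch route: wrap the classifier so the cutoff becomes an ordinary hyperparameter. The class name and details are only illustrative, not an existing sklearn API; since the scores here are probabilities, (0, 1) is a natural search range.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone


class ThresholdWrapper(BaseEstimator, ClassifierMixin):
    """Binary classifier wrapper exposing the decision cutoff as a parameter."""

    def __init__(self, estimator, threshold=0.5):
        self.estimator = estimator
        self.threshold = threshold

    def fit(self, X, y):
        self.estimator_ = clone(self.estimator).fit(X, y)
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        proba = self.estimator_.predict_proba(X)[:, 1]
        return self.classes_[(proba >= self.threshold).astype(int)]


# e.g. GridSearchCV(ThresholdWrapper(LogisticRegression()),
#                   {"threshold": np.linspace(0.1, 0.9, 17)}, scoring="f1")
```
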

@jnothman
Member

jnothman commented Nov 26, 2018 via email

@nizaevka

What do you mean? The th_ range for the #6663 solution, or something else? Could you write more concretely what steps you propose?
GridSearch should tune the threshold and the other hyperparameters (hp) simultaneously; in general, roc_curve depends on the estimator's hp, so I need roc_curves for each fold for all combinations of hp.

@amueller
Member Author

@KNizaev this is not a forum for usage questions or how to implement something. Try stackoverflow.

@jmwoloso
Contributor

The statistic we're looking for here is Youden's J. The scenario arises quite often (all the time in my line of work, it seems, lol) when you have a highly imbalanced dataset.

Our team was looking into implementing something like this as well, and optimization via CV (using something like GridSearchCV on a previously unused portion of the dataset) seemed the natural way to proceed, as tuning by hand without CV (a.k.a. guessing) would introduce leakage. We also looked into Matthews' correlation coefficient as the metric to use for threshold optimization.

We ultimately never implemented it, as we needed to bin the probabilities and turn them into letter grades, so we opted for a quick-and-dirty method using Jenks natural breaks.

It seems like you'd want to grid-search the hyperparameters while optimizing for Youden's J and allowing a prob_threshold param to be searched over as well.

Is this on hold for the time being? Thoughts?
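
For reference, Youden's J mentioned here is TPR - FPR, so the corresponding cutoff is a single argmax over `roc_curve`'s output (a sketch; the function name is made up):

```python
import numpy as np
from sklearn.metrics import roc_curve


def youden_j_threshold(y_true, y_score):
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    # Youden's J statistic: sensitivity + specificity - 1 = TPR - FPR.
    return thresholds[np.argmax(tpr - fpr)]
```
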

@jnothman
Member

jnothman commented Mar 28, 2019 via email

@jmwoloso
Contributor

@jnothman exactly, sorry, I was just trying to put a "name with a face". Looking forward to this, and happy to help move it along if needed.

@amueller
Member Author

amueller commented May 2, 2020

Not sure if this was linked before, but this is sort of a duplicate of #4813.
