Add wrapper class that changes threshold value for predict #8614
Belongs in sklearn.calibration?
…On 20 March 2017 at 09:47, Andreas Mueller wrote:
This was discussed before, but I'm not sure if there's an issue.
We should have a wrapper class that changes the decision threshold based on a cross-validation (or hold-out, pre-fit) estimate.
This is very common, so I think we should have a built-in solution.
Simple rules for selecting a new threshold are:
- picking the point on the ROC curve that's closest to the ideal corner
- picking the point on the precision-recall curve that's closest to the ideal corner
- optimizing one metric while holding another one constant: find the threshold that yields the best recall with a precision of at least 10%, for example. We could also make this slightly less flexible and just say "the median of the largest threshold over cross-validation that yields at least X precision".
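For illustration, a rough sketch of the first rule on a held-out split (the toy data, estimator, and variable names are placeholders, not a proposed API):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Toy imbalanced problem; a wrapper class would do this split internally.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
scores = clf.predict_proba(X_val)[:, 1]

# Rule 1: pick the threshold whose (fpr, tpr) point is closest to the
# ideal corner (0, 1) of the ROC curve.
fpr, tpr, thresholds = roc_curve(y_val, scores)
best = np.argmin(np.hypot(fpr, 1 - tpr))
new_threshold = thresholds[best]

y_pred = (scores >= new_threshold).astype(int)
```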
|
I wouldn't object to that. |
There was also a recent paper by some colleagues from NYU on how to properly control for different kinds of errors, but I need to search for that... |
Yes, definitely related to #6663. Though I think I would implement it as a meta-estimator rather than a transformer, and I would add cross-validation to adjust the threshold using the strategies mentioned above. |
Hey, it seems like no one is working on that; can I give it a try? |
you're welcome to give it a go, though we often encourage new contributors
to start with something smaller to become familiar with the contribution
process. If I were you, I'd start by writing the class docstring and then
tests.
|
To give an update: I went through the mentioned issues and got a better understanding of what the task is. Now I think I can start implementing the API of the class. But I could still use some help on what exactly the threshold decision should be based on. Maybe, as @amueller said, provide all three options and let the user decide which one to use? |
Start with one?
|
Ok I have to admit I am still pretty stuck :$ I looked at that the |
Maybe start with implementing a wrapper that picks the point on the P/R curve that's closest to the top right. Try to understand what that would mean doing it on the training set (or using a "prefit" model). |
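Something along these lines, perhaps; the class name and API below are only a strawman for the prefit case, not a settled design:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


class PRCornerThresholdClassifier:
    """Wrap an already-fitted binary classifier and tune its decision threshold.

    The threshold is the one whose (recall, precision) point lies closest to
    the ideal corner (1, 1) of the precision-recall curve computed on the data
    passed to fit (ideally a held-out set).
    """

    def __init__(self, estimator):
        self.estimator = estimator  # assumed prefit

    def fit(self, X, y):
        scores = self.estimator.predict_proba(X)[:, 1]
        precision, recall, thresholds = precision_recall_curve(y, scores)
        # precision/recall have one more entry than thresholds; drop the last point.
        distances = np.hypot(1 - precision[:-1], 1 - recall[:-1])
        self.threshold_ = thresholds[np.argmin(distances)]
        return self

    def predict(self, X):
        scores = self.estimator.predict_proba(X)[:, 1]
        return (scores >= self.threshold_).astype(int)
```

Doing the same selection on the training data itself would bias the threshold optimistically, which is why the hold-out / cross-validation question keeps coming up below.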
This seems relevant: https://www.ncbi.nlm.nih.gov/pubmed/16828672 (via Max Kuhn's book) but I haven't really found more? |
There's some info here: http://ncss.wpengine.netdna-cdn.com/wp-content/themes/ncss/pdf/Procedures/NCSS/One_ROC_Curve_and_Cutoff_Analysis.pdf and maybe looking at rocr helps: http://rocr.bioinf.mpi-sb.mpg.de/ but I haven't actually found anything explicitly describing a methodology. |
this seems closer? https://cran.r-project.org/web/packages/OptimalCutpoints/ |
@amueller thanks for taking the time.
If I have understood correctly: if the base_estimator is not prefit, we should hold out part of the training set to calibrate the threshold and use the rest to fit the classifier; otherwise we can use the whole training set for calibration of the decision threshold. The same as in What I am actually still not 100% sure about is how this wrapper is going to be different from |
Yes, that may be an appropriate solution. See also #4822 where something
like this is implemented. Needs review
|
@jnothman @amueller I am thinking about how the calibrated predict should work in the case of multilabel classification, i.e. we have 3 labels
|
also the way I see it the threshold calibration functionality should be offered through the CalibratedClassifierCV class (mainly because the name is so broad that it covers all calibration practices), which currently offers probability estimation calibration, and one should be able to choose whether to use the threshold calibration or the probability calibration or both. What do you think? |
in multilabel you should calibrate each output column independently.
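i.e. roughly like this, reusing the precision-recall-corner rule per output column (purely illustrative; the helper name is made up):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def per_label_thresholds(Y_true, Y_score):
    """One decision threshold per output column of a multilabel problem."""
    thresholds = []
    for j in range(Y_true.shape[1]):
        precision, recall, thr = precision_recall_curve(Y_true[:, j], Y_score[:, j])
        distances = np.hypot(1 - precision[:-1], 1 - recall[:-1])
        thresholds.append(thr[np.argmin(distances)])
    # predict would then be: (Y_score >= thresholds).astype(int), broadcast per column
    return np.asarray(thresholds)
```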
|
I think this should maybe be a separate class but I'm not sure. I'm not
really familiar with the algorithms. CalibratedClassifierCV already defines
predict in the multiclass case as the argmax over calibrated probabilities.
I suppose we could choose threshold calibration without probability
calibration in that class with a method switch, but the fact that one
handles multiclass and the other is only binary may lead to confusion.
|
do you mean that the threshold calibration should be used only for binary classification? |
does it make sense beyond binary and multilabel classification? only in
those do we tend to predict by threshold
|
@jnothman sorry I was using I also agree with your point (now that I see multilabel and multiclass differently :p ) that it would be confusing if both functionalities were in the same class |
I am not sure if there are classifiers that produce un-thresholded scores instead of probabilities; in that case I am not sure you can say that the decision threshold is 0.5. For the second part of the comment, it seems to me that the same can be achieved by allowing the user to decide which of the two metrics to optimise, and with what minimum value of the other (the third point described in the first comment). I am not confident enough to say whether allowing the user to specify custom cost functions would play nicely, I have to give it further thought, but I don't disagree with the idea per se. Maybe @amueller could give some feedback? |
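For what it's worth, scikit-learn does have classifiers that only expose un-thresholded scores: anything with decision_function but no predict_proba (e.g. LinearSVC, or SVC without probability=True) is effectively thresholded at 0 rather than at 0.5, so the wrapper would probably need to handle both kinds of output. A small illustration (not a proposed API):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, random_state=0)

proba_clf = LogisticRegression(max_iter=1000).fit(X, y)
score_clf = LinearSVC(max_iter=5000).fit(X, y)  # exposes decision_function only

# A probabilistic classifier effectively thresholds predict_proba at 0.5 ...
proba_pred = (proba_clf.predict_proba(X)[:, 1] >= 0.5).astype(int)
# ... while a margin-based classifier thresholds decision_function at 0,
score_pred = (score_clf.decision_function(X) >= 0).astype(int)
# so "the" default threshold depends on which scoring method is being tuned.
```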
Update: I plan on making progress during the weekend; specifically, I want to focus on the following |
|
It's on my pile for review already, I'm just trying to get through a few
other things. This one will require some thinking and/or reading up on my
part.
|
Okay, thnx for the prompt reply. I will also try to make it as clear and comprehensible as possible. |
@twolodzko @PGryllos I don't think we'll introduce a @twolodzko, what I don't get from your implementation is what the interface of the cost function would be. That's the critical part, I think. You can just have a cost matrix, which is the simplest, and I think we should probably support that. Did you have a more general case in mind? |
@twolodzko btw do you have references for methods for tuning a cutoff using cross-validation? I didn't really find any. |
@amueller for example, the paper you quoted earlier in the thread talks about finding the optimal cutoff given the sensitivity + specificity criterion. I can't recall any specific references on that; in most cases they say that in general you should tune it based on the loss function that is specific to the given problem. Cross-validation is rather not discussed, but it seems to be a natural choice. My code basically takes as an argument a loss function (say |
@twolodzko The article doesn't talk about cross-validation, and while it might be "an obvious choice" it's not really obvious to me what to do. You could either cross-validate the actual threshold value, or you could keep all the models and apply their thresholds and let them vote. I feel like averaging the threshold value over folds sounds a bit dangerous, but the other option doesn't really seem any better? Why would you provide a callable for classification? If it's binary, we only need a 2x2 matrix, right? The point is that deciding how to create the interface is the hard part, optimizing it is the easy part ;) |
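To make the interface question a bit more concrete, here is one possible shape of the "cross-validate the actual threshold value" option with a plain 2x2 cost matrix. Everything below (the helper name, the candidate grid, the median aggregation) is hypothetical, just a sketch:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def best_threshold(y_true, scores, cost_matrix, candidates):
    """Return the candidate threshold with the lowest total cost on one fold.

    cost_matrix[i, j] is the cost of predicting class j when the truth is i.
    """
    costs = [cost_matrix[y_true, (scores >= t).astype(int)].sum() for t in candidates]
    return candidates[int(np.argmin(costs))]


X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
cost_matrix = np.array([[0.0, 1.0],   # a false positive costs 1
                        [5.0, 0.0]])  # a false negative costs 5
candidates = np.linspace(0.05, 0.95, 19)

base = LogisticRegression(max_iter=1000)
fold_thresholds = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X, y):
    est = clone(base).fit(X[train_idx], y[train_idx])
    scores = est.predict_proba(X[val_idx])[:, 1]
    fold_thresholds.append(best_threshold(y[val_idx], scores, cost_matrix, candidates))

# The "cross-validate the threshold value itself" option: aggregate the per-fold
# thresholds (median here) and refit the base estimator on all of the data.
threshold_ = float(np.median(fold_thresholds))
final_estimator = clone(base).fit(X, y)
```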
@PGryllos & @amueller FYI, see AUC: a misleading measure of the
Check also the referred papers for more discussion. |
@twolodzko thnx a lot for the input; I will take a look in the coming days |
So what? Is there any solution to use GridSearch to tune the best threshold? |
@KNizaev by "current implementation" do you mean this pull request? Have you tried finding an implementation which does this outside core scikit-learn?
|
No, I mean my temporary implementation :) Actually, I don't know how to do it right.
The problem is that every fold should have the same th_, so before calculating the score for one fold, we need y_prob from the other folds. Another way is to use GridSearch (like in #6663), but it is hard to know (without roc_curve) the proper th_ range to search; I assume th_ depends on the other hyperparameters of the estimator. |
All you should need is the roc_curve for each fold, if your optimisation is
based on sensitivity/specificity.
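For instance, something along these lines, where each fold's cutoff maximises tpr - fpr (sensitivity + specificity - 1, i.e. Youden's J); the median aggregation at the end is only one possible choice:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=600, weights=[0.85, 0.15], random_state=42)
base = LogisticRegression(max_iter=1000)

cutoffs = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X, y):
    est = clone(base).fit(X[train_idx], y[train_idx])
    fpr, tpr, thr = roc_curve(y[val_idx], est.predict_proba(X[val_idx])[:, 1])
    # Youden's J = sensitivity + specificity - 1 = tpr - fpr
    cutoffs.append(thr[np.argmax(tpr - fpr)])

th_ = float(np.median(cutoffs))  # one shared cutoff across folds, as asked about above
```

(Maximising tpr - fpr picks the same cutoff as maximising balanced accuracy, which connects to the Youden's J point raised further down.)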
|
What do you mean? The th_ range for the #6663 solution, or what? Could you describe more concretely what steps you propose? |
@KNizaev this is not a forum for usage questions or how to implement something. Try stackoverflow. |
The statistic we're looking for here is Youden's J. The scenario arises quite often (all the time in my line of work, it seems, lol?) when you have a highly imbalanced dataset. Our team was looking into implementing something like this as well, and optimization via CV (using something like GridSearchCV on a previously-unused portion of the dataset) seemed the natural way to proceed, as tuning by hand without CV (a.k.a. guessing) would introduce leakage. We also looked into Matthews' Correlation Coefficient as the metric to use for threshold optimization. We never implemented it ultimately, as we needed to bin the probabilities and turn them into letter grades, so we opted for a quick and dirty method using Jenks' Natural Breaks. Seems like you'd want to GridSearch the hyper-params while optimizing for Youden's J and allowing a Is this on hold for the time being? Thoughts? |
(optimising for Youden's J would be the same as for balanced accuracy)
|
@jnothman exactly, sorry, was just trying to put a "name with a face". Looking forward to this and happy to help move it along if needed. |
Not sure if this was linked before but this is sort-of a duplicate of #4813. |