[MRG] Add decision threshold calibration wrapper #10117
Conversation
Thanks for the PR. I've not looked in detail yet, but you can post visualisations here! Also, if you include your example in the examples directory, we can view the compiled documentation here.
@jnothman thanks a lot for the quick response. The visualisations I mentioned in the previous comment show just which point of the roc curve was chosen (for validating that the chosen point is the one closest to the ideal corner), so not a lot to see there. I plan on making a more extensive example and adding it here in the following days.
For your questions:
Tests are failing.
As you said in the other thread, we need a user guide and examples. Those often show whether the interface is good ;)
sklearn/calibration.py (Outdated)

    class OptimalCutoffClassifier(BaseEstimator, ClassifierMixin):
        """Optimal cutoff point selection.
I'm not sure having Optimal in the name is a good idea. Maybe just CutoffClassifier? There is no description in the docstring of what it does.
sklearn/calibration.py (Outdated)

    - 'roc', selects the point on the roc_curve that is closest to the ideal
      corner (0, 1)
    - 'max_se', selects the point that yields the highest sensitivity with ...
I prefer the terms "true positive rate" and "true negative rate" because they are pretty clear in their meaning.
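For concreteness, here is a minimal sketch of the 'roc' strategy quoted above, assuming the scores of an already fitted binary classifier are available; it is an illustration, not the PR's code:

```python
import numpy as np
from sklearn.metrics import roc_curve


def threshold_closest_to_corner(y_true, y_score):
    # 'roc' strategy: pick the threshold whose (fpr, tpr) point is closest
    # to the ideal corner (0, 1) of the ROC plane.
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    distances = np.hypot(fpr, 1 - tpr)
    return thresholds[np.argmin(distances)]
```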
What's the state of this? I thought this was going to be included in 0.22.
@marctorrellas I am sorry, I haven't caught up with the discussion as I didn't have the capacity to interact with this in the past weeks. I may be able to find some time soon :|
I will take over the PR to address the remaining comments.
@glemaitre :) I haven't been able to dedicate any time to this for the past months due to job searching. Happy to see this completed!
Don't worry. I think that this is an interesting feature and I want to give it a push.
Some thoughts that I had over time: I would like to see if it could be useful in the case of class imbalance. Actually, we were interested in having a ...
What will the manual option do? Allow the user to specify the exact fpr and tpr they'd like?
Specifying the threshold: in practice (in the medical domain), one wants to select a specific control point (basically as you mentioned). I would think that this is more for an inspection use than an automated learning process (that's why I would not use it as a default).
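A small sketch of the "specific control point" use case mentioned here, assuming a clinically motivated minimum sensitivity; the target value and the helper name are made up for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve


def threshold_for_min_tpr(y_true, y_score, target_tpr=0.95):
    # Pick the highest threshold that still reaches the required sensitivity
    # (TPR); among the points meeting the target, it has the lowest FPR.
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    idx = np.argmax(tpr >= target_tpr)  # tpr is non-decreasing along the curve
    return thresholds[idx]
```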
Sounds relevant, and more practical than the current interface, which in fact has a similar rationale behind it. I agree; I believe the default should be fbeta.
I don't think that averaging thresholds is correct. For every training set there's an optimal threshold that will give the best results on unseen data, and those thresholds can be quite different. I think you're mixing two things: selecting a threshold for the final model, and evaluating how well the threshold-selection procedure works.
For the former I would recommend just a train/dev split: train on train, select the threshold on dev. For the latter I would recommend nested CV, with the inner loop selecting the threshold and the outer loop evaluating. However, this won't give you which model (which threshold) to use.
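A sketch of the train/dev recipe suggested above, with a placeholder dataset, model, and metric (none of these come from the PR itself):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data; fit on train, pick the threshold on dev.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
dev_scores = clf.predict_proba(X_dev)[:, 1]

# Scan the candidate thresholds seen on the dev set and keep the best one.
candidates = np.unique(dev_scores)
best_threshold = max(
    candidates, key=lambda t: f1_score(y_dev, (dev_scores >= t).astype(int))
)
```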
I am not so sure. Balanced accuracy could also be a good default, but it would probably be more computationally intensive than the f1 score. I don't have a strong opinion.
I am not so sure that cross-validation is really necessary here. Has any of you tried, or read references that would, just fit the base estimator and tune the threshold on the same entire training set?
Balanced accuracy is usually not good for imbalanced datasets. I would give the option of f1 and balanced accuracy; they are not difficult or expensive to compute either. @ogrisel tuning on the training set is prone to overfitting; at the least you need a dev set.
I was wondering if that would really be the case in practice, but maybe you are right: the roc curve could be very different on the training set than on a validation set for overfitting models, and therefore using it for cut-off selection would result in a poor choice of the cut-off. I don't understand the need for cut-off averaging across CV folds: if the cut-off selected on each CV split is not stable enough, then the CutoffClassifier procedure would be pointless anyway. Fixing an arbitrary internal train-validation split would not solve the unstable cut-off problem, it would just hide it.
(I'm curious, Mark, why balanced accuracy is not good for imbalanced datasets.)
My name is Marc :) My statement was not exactly what I wanted to say. Balanced accuracy is indeed better than naive accuracy. I'd rather choose f1, as I see it as more relevant for imbalanced datasets, and in the case of this PR let the user choose, with the default being balanced accuracy maybe... Example: balanced accuracy = 66%. I think in this case f1 represents better how bad this model is, but I'm sure other scenarios can show balanced accuracy as a good metric. Please let me know if there are any errors in the computations.
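A sketch with assumed confusion-matrix counts (not the numbers from the example above) showing how balanced accuracy and f1 can diverge on an imbalanced test set:

```python
# Hypothetical counts on a highly imbalanced test set:
# 100 positives, 10_000 negatives.
tp, fn = 40, 60        # true positive rate (recall) = 0.40
tn, fp = 9_200, 800    # true negative rate = 0.92

tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
precision = tp / (tp + fp)

balanced_accuracy = (tpr + tnr) / 2            # 0.66, looks acceptable
f1 = 2 * precision * tpr / (precision + tpr)   # ~0.09, exposes the poor precision
print(f"balanced accuracy = {balanced_accuracy:.2f}, f1 = {f1:.2f}")
```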
You can see the decision threshold as a hyperparameter. Would you choose your hyperparameters on the training set? I don't agree with the last bit. If the cut-off is stable, then either procedure is good. If it's unstable, then you need to select the one that is good for the training portion you have used. Note: we assume it's unstable w.r.t. the training data you use, not the validation set used to tune it.
Thanks for the intuition, I think I get your point. I would still like to see simulations for various models (e.g. overfitting and underfitting models of various kinds) to evaluate the impact of this strategy, and in particular to highlight catastrophic cases where CV-based cut-off averaging is detrimental (if someone would like to volunteer to do a study in a notebook, for instance). In any case, CV (+ refit on the full training set) is always significantly more computationally intensive than doing a single train-validation split (without refit on the full training set). So that is also a good reason not to do full-blown CV by default if it never improves upon single-split tuning.
I don't see the issue with balanced accuracy. Basically 50% means randomness, so you are 16% above randomness, which is still not great, as the 14% of the f1 score is telling you.
Why is f1 preferred over fbeta? Since f1 assumes that precision and recall are equally important for the user, and fbeta with default beta = 1 is the same as f1, I'd think that fbeta makes more sense.
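A quick check of that equivalence, using a toy label vector chosen only for illustration:

```python
from sklearn.metrics import f1_score, fbeta_score

# With beta=1, fbeta_score reduces to f1_score; beta > 1 weights recall more,
# beta < 1 weights precision more.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

assert f1_score(y_true, y_pred) == fbeta_score(y_true, y_pred, beta=1)
recall_heavy = fbeta_score(y_true, y_pred, beta=2)
```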
What I meant is that in the example above I think 14% reflects a lot better how bad the model is compared to 66%, but of course those numbers explain things better or worse depending on whether the reader understands the baselines and metrics. (Almost) any metric is good as long as you can interpret it correctly. About F1 vs Fbeta, I was just trying to reduce the scope of the PR. I'm not against Fbeta.
I'm closing this issue as superseded by #16525, following our triaging rules. Thanks @PGryllos for your work! Feel free to comment in #16525 if you are interested in a follow-up.
Reference Issues/PRs
Fixes #8614
What does this implement/fix? Explain your changes.
This PR adds a decision threshold calibration wrapper for binary classifiers that calibrates the decision threshold based on three objectives:
The classifier gives the option to either receive a pre-trained base estimator or train one and do the calibration using cross validation loops.
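A rough, self-contained sketch of the underlying idea for the "prefit" case follows; the class name ThresholdedClassifier is hypothetical, and this illustrates the mechanism rather than reproducing the PR's actual class or API:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import roc_curve


class ThresholdedClassifier(BaseEstimator, ClassifierMixin):
    """Tune the decision threshold of an already fitted binary classifier."""

    def __init__(self, base_estimator):
        self.base_estimator = base_estimator

    def fit(self, X_val, y_val):
        # "prefit" behaviour: base_estimator is assumed to be fitted already,
        # and (X_val, y_val) is held-out data used only to pick the threshold.
        scores = self.base_estimator.predict_proba(X_val)[:, 1]
        fpr, tpr, thresholds = roc_curve(y_val, scores)
        self.threshold_ = thresholds[np.argmin(np.hypot(fpr, 1 - tpr))]
        return self

    def predict(self, X):
        scores = self.base_estimator.predict_proba(X)[:, 1]
        return (scores >= self.threshold_).astype(int)
```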
Any other comments?
There are two available examples to illustrate the first two points.
The comments below are no longer valid, but I will leave them for the completeness of the conversation below.
Since (as discussed in "Add wrapper class that changes threshold value for predict" #8614) this wrapper focuses on binary classification, I have made the assumption that labels will always be 0 or 1. Is this a correct assumption, or can the labels be completely arbitrary? Can there also be more than two, as long as one is considered positive and the rest negative?
The current implementation gives the option of choosing the positive label, assuming that the indexes of the classes coincide with the labels. If labels will always be in [0, 1], should we fix the positive label to 1? I thought it makes some sense to be able to choose which class is considered sensitive.
In the case of cv != "prefit", after calibrating the threshold by averaging across folds (each fold trained on different data than the data used for calibration), the base estimator is trained using all the data. Do you see anything wrong with that?
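A sketch of that cv != "prefit" procedure as described, with assumed details (fold count, 'roc' objective, NumPy inputs) that are not necessarily those of the PR:

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold


def fit_with_averaged_threshold(estimator, X, y, n_splits=3):
    # Tune a threshold on each held-out fold, average the per-fold
    # thresholds, then refit the base estimator on all the data.
    thresholds = []
    for train_idx, val_idx in StratifiedKFold(n_splits=n_splits).split(X, y):
        fold_clf = clone(estimator).fit(X[train_idx], y[train_idx])
        scores = fold_clf.predict_proba(X[val_idx])[:, 1]
        fpr, tpr, cut = roc_curve(y[val_idx], scores)
        thresholds.append(cut[np.argmin(np.hypot(fpr, 1 - tpr))])
    final_estimator = clone(estimator).fit(X, y)  # refit on all the data
    return final_estimator, float(np.mean(thresholds))
```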