
[MRG] Add decision threshold calibration wrapper #10117


Conversation

@PGryllos (Contributor) commented Nov 12, 2017

Reference Issues/PRs

Fixes #8614

What does this implement/fix? Explain your changes.

This PR adds a decision threshold calibration wrapper for binary classifiers. It calibrates the decision threshold according to one of three objectives:

  • optimise the sum of the true positive and true negative rates
  • optimise the true positive rate while maintaining a minimum true negative rate
  • optimise the true negative rate while maintaining a minimum true positive rate

The wrapper can either receive a pre-trained base estimator or train one itself and calibrate the threshold using cross-validation loops.
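To make the objectives concrete, here is a minimal, runnable sketch of the second objective using only standard scikit-learn. This is generic illustration code, not this PR's API; the 0.95 minimum true negative rate is an arbitrary value chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Toy imbalanced problem; hold out a validation set for threshold selection.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
proba_val = clf.predict_proba(X_val)[:, 1]

fpr, tpr, thresholds = roc_curve(y_val, proba_val)
tnr = 1 - fpr

# Objective 2: maximise the true positive rate subject to a minimum true
# negative rate (here 0.95, an arbitrary illustrative constraint).
feasible = np.where(tnr >= 0.95)[0]
best = feasible[np.argmax(tpr[feasible])]
threshold = thresholds[best]

# Predict with the calibrated threshold instead of the default 0.5.
y_pred = (proba_val >= threshold).astype(int)
```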

Any other comments?

There are two examples available that illustrate the first two points.

The comments below are no longer valid, but I will leave them for the completeness of the conversation.

  1. Since (as discussed in Add wrapper class that changes threshold value for predict #8614) this wrapper focuses on binary classification, I have assumed that the labels will always be 0 or 1. Is this a correct assumption, or can the labels be completely arbitrary? Can there also be more than two labels, as long as one is considered positive and the rest negative?

  2. The current implementation gives the option of choosing the positive label, assuming that the class indices coincide with the labels. If the labels will always be in [0, 1], should we fix the positive label to 1? I thought it makes some sense to be able to choose which class is considered the sensitive one.

  3. In the case of cv != "prefit", the threshold is calibrated by averaging across folds (each trained on different data than that used for calibration), and the base estimator is then trained on all the data (see the sketch below). Do you see anything wrong with that?
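As a rough illustration of the procedure described in point 3, here is a sketch built from generic scikit-learn pieces. It is not the PR's actual code; the ROC-corner criterion is used only for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
base = LogisticRegression()

# For each fold: fit on the training part, pick a threshold on the held-out
# part (here: the ROC point closest to the (0, 1) corner), then average.
fold_thresholds = []
for train_idx, cal_idx in StratifiedKFold(n_splits=5).split(X, y):
    est = clone(base).fit(X[train_idx], y[train_idx])
    proba = est.predict_proba(X[cal_idx])[:, 1]
    fpr, tpr, thr = roc_curve(y[cal_idx], proba)
    fold_thresholds.append(thr[np.argmin(fpr ** 2 + (1 - tpr) ** 2)])

decision_threshold = np.mean(fold_thresholds)

# Finally, the base estimator is refitted on all of the data.
final_est = clone(base).fit(X, y)
y_pred = (final_est.predict_proba(X)[:, 1] >= decision_threshold).astype(int)
```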

@PGryllos force-pushed the feat/8614_add_threshold_calibration_wrapper branch from 465563c to 132864b on November 12, 2017 16:53
@jnothman (Member) commented Nov 12, 2017 via email

@PGryllos (Contributor, Author) commented Nov 13, 2017

@jnothman thanks a lot for the quick response. The visualisations I mentioned in the previous comment only show which point of the ROC curve was chosen (to validate that the chosen point is the one closest to the ideal corner), so there is not a lot to see there. I plan on making a more extensive example and adding it here in the following days.

@amueller (Member)

For your questions:

  1. The labels can be essentially arbitrary Python objects, but you can assume there are exactly two of them. Check other classification code to see how the mapping to 0 and 1 is handled (see the sketch after this comment).

  2. It should behave consistently with other metrics, which assume 1 is the positive label if the labels are [0, 1] or [-1, +1] and otherwise require pos_label to be set.

  3. That's what I would have done. It is somewhat different from what CalibratedClassifierCV does. I'm not sure if there's a standard behaviour in the literature. I guess there is no really great way to combine the different thresholded classifiers. You could vote, but that sounds like it would lose information. On the other hand, combining all the data together could completely change the meaning of the thresholds... hmm... In particular, if you use a grouped CV, I could see that going completely sideways.
    Can you create an example where different groups have different semantics for thresholds and see if your strategy breaks?
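For point 1, this is the usual scikit-learn pattern for mapping two arbitrary class labels to 0 and 1 (a generic sketch, not code from this PR):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Arbitrary (non-numeric) binary labels.
y = np.array(["spam", "ham", "ham", "spam", "ham"])

le = LabelEncoder().fit(y)
y_encoded = le.transform(y)   # array([1, 0, 0, 1, 0]); le.classes_ is ['ham', 'spam']

# Equivalent idiom often used inside estimators:
classes, y_encoded = np.unique(y, return_inverse=True)
```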

@amueller (Member)

tests are failing

@amueller (Member)

as you said in the other thread, we need a user guide and examples. Those often show whether the interface is good ;)



class OptimalCutoffClassifier(BaseEstimator, ClassifierMixin):
"""Optimal cutoff point selection.

Member review comment:
I'm not sure having Optimal in the name is a good idea. Maybe just CutoffClassifier? There is no description in the docstring on what it does.

- 'roc', selects the point on the roc_curve that is closer to the ideal
corner (0, 1)

- 'max_se', selects the point that yields the highest sensitivity with

Member review comment:
I prefer the terms "true positive rate" and "true negative rate" because they are pretty clear in their meaning.

@PGryllos force-pushed the feat/8614_add_threshold_calibration_wrapper branch from 3c7b8c9 to c42564d on December 22, 2017 22:39
@PGryllos force-pushed the feat/8614_add_threshold_calibration_wrapper branch from a695215 to 39e23c6 on December 23, 2017 13:40
@PGryllos force-pushed the feat/8614_add_threshold_calibration_wrapper branch from 39e23c6 to e63657c on December 23, 2017 13:40
@marctorsoc (Contributor)

what's the state of this? I thought this was going to be included in 0.22

@PGryllos (Contributor, Author) commented Dec 3, 2019

@marctorrellas I am sorry, I haven't caught up with the discussion, as I didn't have the capacity to interact with this over the past weeks. I may be able to find some time soon :|

@glemaitre self-assigned this Feb 23, 2020
@glemaitre added the Superseded (PR has been replaced by a newer PR) label Feb 23, 2020
@glemaitre (Member)

I will take over the PR to address the remaining comments.

@PGryllos (Contributor, Author) commented Feb 23, 2020

@glemaitre :) I haven't been able to dedicate any time to this for the past months due to job searching. Happy to see this completed!

@glemaitre (Member)

Don't worry. I think that this is an interesting feature and I want to give it a push.

@PGryllos (Contributor, Author)

Some thoughts that I have had over time:

  • the interface w.r.t. strategy is a bit excessive; it should probably only offer an option for tuning fbeta.
  • it is not very clear why averaging the decision thresholds across folds should work; should we instead select the decision threshold that works best across folds?

@glemaitre (Member)

the interface w.r.t. strategy is a bit excessive; it should probably only offer an option for tuning fbeta.

I would like to see if it could be useful in the case of class imbalance. Actually, @ogrisel and I were interested in having a manual option.

@PGryllos (Contributor, Author)

What would the manual option do? Allow the user to specify the exact FPR and TPR they'd like?

@glemaitre (Member) commented Feb 23, 2020 via email

@PGryllos (Contributor, Author)

That sounds relevant, and more practical than the current interface, which in fact has a similar rationale behind it. I agree; I believe the default should be fbeta.

@marctorsoc (Contributor)

  • it is not very clear why averaging the decision thresholds across folds should work; should we instead select the decision threshold that works best across folds?

I don't think that averaging thresholds is correct. For each training set there is an optimal threshold that will give the best results on unseen data, and these thresholds can be quite different.

I think you're mixing two things:

  • choosing the best threshold
  • estimating the performance of a model for which we have selected the best threshold, i.e. evaluating "the strategy of training on a train set and selecting the threshold on a dev set using criterion X"

For the former I would recommend just a train/dev split: train on the train set and select the threshold on the dev set (see the sketch below).

For the latter I would recommend nested CV, with the inner loop selecting the threshold and the outer loop evaluating. However, this will not tell you which model (i.e. which threshold) to use.
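A minimal sketch of the first recommendation, with F1 as the illustrative criterion (generic scikit-learn code, not this PR's implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, random_state=0)

# Train on the train split only.
clf = LogisticRegression().fit(X_train, y_train)
proba_dev = clf.predict_proba(X_dev)[:, 1]

# Select the threshold that maximises F1 on the dev split.
precision, recall, thresholds = precision_recall_curve(y_dev, proba_dev)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]  # last (P, R) point has no threshold
```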

@ogrisel (Member) commented Feb 25, 2020

I believe the default should be fbeta.

I am not so sure. Balanced accuracy could also be a good default, but it would probably be more computationally intensive than the F1 score. I don't have a strong opinion.

The wrapper can either receive a pre-trained base estimator or train one itself and calibrate the threshold using cross-validation loops.

I am not so sure that cross-validation is really necessary here. Has any of you tried, or read references about, just fitting the base estimator and tuning the threshold on the same entire training set?

@marctorsoc (Contributor)

Balanced accuracy is usually not good for imbalanced datasets. I would give the option of both F1 and balanced accuracy; neither is difficult or expensive to compute.

@ogrisel tuning on the training set is prone to overfitting, at least you need a dev set

@ogrisel (Member) commented Feb 25, 2020

@ogrisel tuning on the training set is prone to overfitting, at least you need a dev set

I was wondering whether that would really be the case in practice, but maybe you are right: the ROC curve could be very different on the training set than on a validation set for overfitting models, and therefore using it for cut-off selection would result in a poor choice of cut-off.

I don't understand the point of cut-off averaging across CV folds: if the cut-off selected on each CV split is not stable enough, then the cut-off classifier procedure would be pointless anyway. Fixing an arbitrary internal train-validation split would not solve the unstable cut-off problem; it would just hide it.

@jnothman (Member) commented Feb 25, 2020 via email

@marctorsoc (Contributor)

(I'm curious, Mark, why balanced accuracy is not good for imbalanced datasets.)

My name is Marc :)

My statement was not exactly what I wanted to say. Balanced accuracy is indeed better than naive accuracy. I'd rather choose F1, as I see it as more relevant for imbalanced datasets, and in the case of this PR let the user choose, maybe with balanced accuracy as the default...

Example:
TP = 5, FP = 50
FN = 10, TN = 10000

balanced accuracy = 66 %
f1 = 14 %

I think that in this case F1 reflects how bad this model is much better, but I'm sure other scenarios would show balanced accuracy as a good metric. Please let me know if there are any errors in the computations (see the quick check below).
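A quick check of these numbers with the usual definitions (plain Python):

```python
# TP = 5, FP = 50, FN = 10, TN = 10000, as in the example above.
tp, fp, fn, tn = 5, 50, 10, 10000

tpr = tp / (tp + fn)                           # recall / sensitivity = 0.333
tnr = tn / (tn + fp)                           # specificity = 0.995
balanced_accuracy = (tpr + tnr) / 2            # 0.664 -> ~66 %

precision = tp / (tp + fp)                     # 0.091
f1 = 2 * precision * tpr / (precision + tpr)   # 0.143 -> ~14 %
```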

@ogrisel tuning on the training set is prone to overfitting, at least you need a dev set

I was wondering whether that would really be the case in practice, but maybe you are right: the ROC curve could be very different on the training set than on a validation set for overfitting models, and therefore using it for cut-off selection would result in a poor choice of cut-off.

I don't understand the point of cut-off averaging across CV folds: if the cut-off selected on each CV split is not stable enough, then the cut-off classifier procedure would be pointless anyway. Fixing an arbitrary internal train-validation split would not solve the unstable cut-off problem; it would just hide it.

You can see the decision threshold as a hyperparameter. Would you choose your HP on the training set?

I don't agree with the last bit. If the cut-off is stable, then either procedure is good. If it's unstable, then you need to select the one that is good for the training portion you have used.

Note: we assume it's unstable w.r.t. the training data you use, not the validation set to tune it

@ogrisel (Member) commented Feb 25, 2020

Thanks for the intuition, I think I get your point. I would still like to see simulations for various models (e.g. overfitting and underfitting models of various kinds) to evaluate the impact of this strategy, and in particular to highlight catastrophic cases where CV-based cut-off averaging is detrimental (if someone would like to volunteer to do such a study, in a notebook for instance).

In any case, CV (+ refit on the full training set) is always significantly more computationally intensive than doing a single train-validation split (without refit on the full training set). So that is also a good reason not to do full-blown CV by default if it never improves upon single-split tuning.

@glemaitre (Member)

Example:
TP = 5, FP = 50
FN = 10, TN = 10000
balanced accuracy = 66 %
f1 = 14 %

I don't see the issue with balanced accuracy. Basically, 50% means randomness, so you are 16% above randomness, which is still not great, just as the 14% F1 score is telling you.

@PGryllos (Contributor, Author)

Why is f1 preferred over fbeta? Since f1 assumes that precision and recall are equally important to the user, and fbeta with the default beta = 1 is the same as f1, I'd think that fbeta makes more sense (see the quick illustration below).
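For reference, a small illustration of the relationship between the two metrics, using the standard scikit-learn functions (the toy labels are made up for the example):

```python
from sklearn.metrics import f1_score, fbeta_score

y_true = [0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0, 0]

f1 = f1_score(y_true, y_pred)
fb1 = fbeta_score(y_true, y_pred, beta=1)    # beta = 1 gives exactly the F1 score

# beta > 1 weights recall more heavily, beta < 1 weights precision more heavily.
recall_heavy = fbeta_score(y_true, y_pred, beta=2)
precision_heavy = fbeta_score(y_true, y_pred, beta=0.5)
```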

@marctorsoc (Contributor)

Example:
TP = 5, FP = 50
FN = 10, TN = 10000
balanced accuracy = 66 %
f1 = 14 %

I don't see the issue with balanced accuracy. Basically, 50% means randomness, so you are 16% above randomness, which is still not great, just as the 14% F1 score is telling you.

What I meant is that, in the example above, I think 14% reflects how bad the model is much better than 66% does; but of course, how well those numbers communicate that depends on whether the reader understands the baselines and the metrics. (Almost) any metric is good as long as you can interpret it correctly.

About F1 vs Fbeta, I was just trying to reduce the scope of the PR. I'm not against Fbeta.

Base automatically changed from master to main January 22, 2021 10:49
@cmarmo (Contributor) commented May 17, 2022

I'm closing this pull request as superseded by #16525, following our triaging rules. Thanks @PGryllos for your work! Feel free to comment in #16525 if you are interested in a follow-up.

Labels: module:utils, Superseded (PR has been replaced by a newer PR)
Linked issue: Add wrapper class that changes threshold value for predict (#8614)
10 participants