
[WIP] FEA New meta-estimator to post-tune the decision_function/predict_proba threshold for binary classifiers #16525


Closed
glemaitre wants to merge 49 commits

Conversation

@glemaitre (Member) commented Feb 23, 2020

closes #8614
closes #10117
supersedes #10117

Description

This meta-estimator is intended to find the decision threshold that maximizes an objective metric. There are two use cases:

  • maximize a metric such as balanced_accuracy_score, fbeta_score, f1_score, etc. In this case, we simply need to maximize the score.
  • find the decision threshold for a pair of metrics where the value of one metric is fixed. For instance, for precision-recall, one would like the threshold maximizing recall for a given precision, or vice-versa. I think we can support precision/recall and TPR/FPR at first.
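
For illustration (not part of the original description), a minimal sketch of what such a meta-estimator automates, written with existing scikit-learn primitives only; the dataset, metric, and threshold grid are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba_val = clf.predict_proba(X_val)[:, 1]

# Scan candidate cut-offs on the validation set and keep the best one.
thresholds = np.linspace(0.01, 0.99, 99)
scores = [balanced_accuracy_score(y_val, (proba_val >= t).astype(int))
          for t in thresholds]
best_threshold = thresholds[np.argmax(scores)]

# Hard predictions now use the tuned cut-off instead of the default 0.5.
y_pred = (clf.predict_proba(X_val)[:, 1] >= best_threshold).astype(int)
```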

Additional work

  • in the future, we could think of adding support for passing a cost-sensitive matrix of the size of the confusion matrix. This would allow injecting business-oriented information about the different classification errors made by a learner.

@ogrisel ogrisel changed the title [WIP] FEA Add decision threshold calibration wrapper [WIP] FEA New meta-estimator to post-tune the decision_function/predict_proba threshold for binary classifiers Feb 24, 2020

@ogrisel (Member) commented Feb 25, 2020

I am not sure it belongs in sklearn.calibration: to me, the sklearn.calibration module is about ensuring that the model's predicted values match the empirical mean of the observed variable on a test set (on groups of samples binned by predicted value).

For instance, a model that always predicts 0.5 on a balanced binary classification problem is well-calibrated but not discriminative. Conversely, a model with a good ROC-AUC curve can be very discriminative (it ranks the samples in the right order most of the time) but does not necessarily predict the expected value of the observation correctly: an arbitrary strictly monotonic transformation of the predict_proba values will preserve the curve but break the calibration.
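
A small self-contained illustration of this point (not from the original comment): a strictly monotonic transform of the predicted probabilities leaves the ROC-AUC unchanged because the ranking is preserved, but degrades a calibration-sensitive score such as the Brier score.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.RandomState(0)
proba = rng.uniform(size=10_000)
y = rng.binomial(1, proba)   # labels drawn from proba, i.e. well calibrated

squashed = proba ** 3        # strictly monotonic, rank-preserving transform

print(roc_auc_score(y, proba), roc_auc_score(y, squashed))        # identical
print(brier_score_loss(y, proba), brier_score_loss(y, squashed))  # much worse
```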

This definition of model calibration matches this:

https://en.wikipedia.org/wiki/Probabilistic_classification

and that:

https://blogs.sas.com/content/iml/2018/05/14/calibration-plots-in-sas.html#prettyPhoto

So to me model calibration, in the context of classification, is a property of the values returned by predict_proba and is therefore independent of a specific cut-off value used to make hard decisions.

I would therefore consider moving the CutoffClassifier class elsewhere, for instance into sklearn.model_selection, as to me it has more in common with grid search than with probability calibration.

That being said, we could also re-scope sklearn.calibration and keep cut-off tuning there, but we should be extra pedagogical about the aim of each meta-estimator to highlight that they do not solve the same problem at all.

@glemaitre (Member Author)

@ogrisel I think that the API is starting to be stable. I added support for the cost matrix since it is really close to the support for any metric. The cost matrix is passed using the parameter objective_value, which, IMO, seems well suited for this. I added a couple of tests to cover the general behaviour plus the common tests.

I will add the documentation tomorrow by merging the previous doc and improving how it blends in, with some examples.

Lastly, I think that we should rename the estimator, all the more so if we support the cost matrix. I was thinking of something like OptimumThresholdClassifier.

@glemaitre (Member Author)

Even though this is still WIP (I need to write the documentation), @adrinjalali and @amueller might be interested in having a look at it.

@glemaitre (Member Author)

One other API question is whether we should use CV or not for selecting the threshold. If yes, should it be a single train/test split, a cross_val_predict (forbidding shuffled CVs), or a full cross_validate?

@glemaitre (Member Author)

OptimumThresholdClassifier

Instead of Optimum it could be a term linked to the fine-tuning using the objective_metric.

@ogrisel (Member) left a comment

W.r.t. CV vs single split, I think the main point from @marctorrellas in #10117 (comment) is "refit" vs "no refit". When you CV, you implicitly refit on the full training set. Alternatively, you would have to keep the k models fitted on the k splits and take a majority vote afterwards, but that feels weird.

If you do a single split, then you have the opportunity to keep the estimator fitted on the train side of the split, and the selected threshold then matches this specific fitted model. If you refit on the full training set, the selected threshold is no longer specifically tuned for this fitted model. I think we should do some experiments to drive this discussion further: choose a dataset, materialize several train/validation splits, and for each of them draw the ROC curves and highlight the thresholds with the optimal f1 score or balanced accuracy.
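
A rough sketch of such an experiment on synthetic data (illustrative only): fit on several train/validation splits, locate the f1-optimal cut-off on each validation side, and look at how much that cut-off moves across splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import StratifiedShuffleSplit

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

best_thresholds = []
for train_idx, val_idx in cv.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    proba = clf.predict_proba(X[val_idx])[:, 1]
    precision, recall, thresholds = precision_recall_curve(y[val_idx], proba)
    # precision/recall have one extra trailing entry compared to thresholds
    f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
    best_thresholds.append(thresholds[np.argmax(f1)])

print(np.mean(best_thresholds), np.std(best_thresholds))
```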

W.r.t. renaming, I don't really like OptimumThresholdClassifier. Maybe TunedCutoffClassifier? But I think CutoffClassifier was a fine name.

@staticmethod
def cost_sensitive_score(y_true, y_pred, cost_matrix):
    cm = confusion_matrix(y_true, y_pred) * cost_matrix
    return np.diag(cm) / cm.sum()

Member

Why pass a full cost matrix if you then only consider the diagonal?

We could define an arbitrary business-specific utility function, such as:

  • gain for true positives
  • gain (negative cost) for false positives
  • gain (negative cost) for false negatives
  • gain for true negatives (generally 0 for imbalanced classification problems).

In which case cost_sensitive_score would be (confusion_matrix * gain_matrix).sum() / n_samples, or alternatively -(confusion_matrix * cost_matrix).sum() / n_samples.

WDYT?

This work could later be extended to parametrized cost-sensitive scorer objects that could be used both for optimal cutoff selection of the final classifier of a pipeline and by the SearchCV objects for hyper-parameter tuning of the full pipeline in general.
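
A sketch of this additive parametrization (the function name gain_score and the example numbers are illustrative), assuming gain_matrix follows the row/column layout of sklearn's confusion_matrix (true class in rows, predicted class in columns):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def gain_score(y_true, y_pred, gain_matrix):
    # gain_matrix[i, j]: gain (or negative cost) of predicting class j
    # when the true class is i.
    cm = confusion_matrix(y_true, y_pred)
    return (cm * gain_matrix).sum() / cm.sum()

# Example: true negatives are worth 0, false positives cost 1,
# false negatives cost 5, true positives are worth 10.
gain_matrix = np.array([[0, -1],
                        [-5, 10]])
```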

Member Author

Why pass a full cost matrix if you then only consider the diagonal?

cm.sum() will include the off-diagonal terms. If you penalize false positives and false negatives differently, the cost will change in this case, won't it?

Member

Indeed. It is really hard to interpret how gains and costs counteract one another. I prefer my simple, purely additive parametrization unless you can show me that yours makes more sense.

Member Author

I went a bit through the paper by Elkan: http://web.cs.iastate.edu/~honavar/elkan.pdf
In section 2, there is an analytic formulation of the optimal threshold. However, it requires using predict_proba (which could be fine since method="auto" by default?).
In this case (if I understand correctly), it means that you can compute the threshold just by looking at the cost matrix.
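
For reference, a sketch of that closed-form result, with cost[i, j] taken as the cost of predicting class i when the true class is j (Elkan's convention, i.e. the transpose of sklearn's confusion_matrix layout):

```python
def elkan_threshold(cost):
    # Predict the positive class whenever predict_proba >= p_star.
    c00, c01 = cost[0, 0], cost[0, 1]  # costs when predicting the negative class
    c10, c11 = cost[1, 0], cost[1, 1]  # costs when predicting the positive class
    return (c10 - c00) / (c10 - c00 + c01 - c11)
```

With zero cost for correct predictions, this reduces to c10 / (c10 + c01), i.e. the cut-off is driven by the ratio of false-positive to false-negative costs.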

Member Author

Otherwise, your additive parametrization is fine with me; it makes sense.

Member

The Elkan paper also mentions the case of non-constant costs, that is, costs that depend on the value of an input feature. This is practically very useful, probably much more useful than constant costs. Maybe we should re-scope this whole cost-matrix thingy for another PR, though. It will be much easier to implement once we have a clean API to deal with feature names.

The classifier, fitted or not fitted, from which we want to optimize
the decision threshold used during `predict`.

objective_metric : {"tpr", "tnr"}, ndarray of shape (2, 2) or callable, \

@ogrisel (Member) commented Feb 26, 2020

It should also include "f1" and "balanced_accuracy", and maybe others such as "fbeta" if we also include a new dedicated "beta" constructor param in CutoffClassifier itself.

Member Author

I was thinking of letting the user pass a sklearn.metrics callable. It makes things easy because you never need an objective_value in this case; you only need an objective_value when a str is provided, because then we work with a predefined search.

Member

I would still make it possible to pass strings for metrics that can be efficiently computed from the output of _binary_clf_curve. This private tool could actually be extended to _binary_clf_confusion_matrix_curve that would compute the CM for each threshold.
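
A rough sketch of the idea (the import path below is a private scikit-learn detail and may change; the wrapper name simply mirrors the suggested extension): _binary_clf_curve returns cumulated false- and true-positive counts per candidate threshold, from which the full confusion matrix at every threshold follows with vectorized arithmetic.

```python
from sklearn.metrics._ranking import _binary_clf_curve

def binary_clf_confusion_matrix_curve(y_true, y_score):
    # fps[k], tps[k]: number of false/true positives when thresholding at
    # thresholds[k]; the last entries give the totals of each class.
    fps, tps, thresholds = _binary_clf_curve(y_true, y_score)
    fns = tps[-1] - tps   # positives still predicted negative
    tns = fps[-1] - fps   # negatives correctly predicted negative
    return thresholds, tps, fps, fns, tns
```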

Member Author

"fbeta" if we also include a new dedicated "beta" constructor param

This could be passed to objective_metric_param indeed.

Member Author

I would still make it possible to pass strings for metrics that can be efficiently computed from the output of _binary_clf_curve. This private tool could actually be extended to _binary_clf_confusion_matrix_curve that would compute the CM for each threshold.

OK. I will first write the documentation with the naive approach, and then I will improve the efficiency.

@ogrisel (Member) left a comment

We also need an example, e.g. on an artificially imbalanced version of Adult Census with one-hot encoded categorical features, scaled numerical features, and Logistic Regression, possibly with class_weight="balanced".

Then plot the ROC and PR curves and add the points for the model with the default cutoff (0.5 on predict_proba), the model with the best balanced accuracy, and the model with the best f1_score.
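
A condensed sketch of such a figure on synthetic data (the Adult Census preprocessing pipeline is left out for brevity): draw the ROC curve and mark the operating point of the default 0.5 cut-off.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, proba)
tn, fp, fn, tp = confusion_matrix(y_test, (proba >= 0.5).astype(int)).ravel()

plt.plot(fpr, tpr, label="ROC curve")
plt.scatter(fp / (fp + tn), tp / (tp + fn), c="red", label="default 0.5 cut-off")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```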

@ogrisel (Member) commented Feb 26, 2020

Also, it would be great to find the model with the best recall under a given precision constraint (e.g. precision >= 0.3), but not all precision levels are feasible. So this would lead to models that could raise a ValueError or FeasibilityError at fit time... I am not sure we want this.

On the other hand, all recall / TPR values are feasible, so what we might want is: given a recall / TPR constraint, find the cutoff that leads to the best precision. But this is not easy to specify with the current set of constructor parameters.

Finally, if we treated the default cutoff as a regular parameter of any classifier, one could also use the SearchCV objects to tune it, but that would be very inefficient (retraining models for nothing). Maybe it's worth highlighting this aspect in the documentation.
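
A sketch of what the precision-constrained variant could look like internally (the function name and the error handling are illustrative): keep only the cut-offs whose precision reaches the requested level, take the one with the best recall, and raise at fit time when none is feasible.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_min_precision(y_true, y_score, min_precision=0.3):
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    feasible = precision[:-1] >= min_precision  # align with thresholds
    if not feasible.any():
        raise ValueError(f"no cut-off reaches precision >= {min_precision}")
    best = np.argmax(np.where(feasible, recall[:-1], -np.inf))
    return thresholds[best]
```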

@marctorsoc (Contributor) commented Sep 23, 2020

Looking forward to this!

Is there any possibility of generalizing this to the multi-class case? I haven't really studied the question, but for instance, for models that have a predict_proba method, instead of defining one threshold, I imagine it would be somewhat equivalent to define the prediction as np.argmax(w * proba, axis=1), where w is a vector of shape (n_classes,) to be optimized under some additional constraint, say sum(w) = 1, and then use the multi-class averaged f1 (etc.) scores?
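
A toy illustration of that weighted-argmax idea (not part of this PR; w would still have to be optimized, e.g. by grid or direct search, against the chosen multi-class metric):

```python
import numpy as np

def predict_with_class_weights(proba, w):
    # proba: (n_samples, n_classes) output of predict_proba
    # w: (n_classes,) non-negative weights, e.g. constrained to sum to 1
    return np.argmax(w * proba, axis=1)
```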

IMHO, let's have it for binary first, and then we can move on from there. This PR has been here for a long time with a lot of discussion, so better not to increase the scope of the task.

what's the state of this @glemaitre ? do you need help with it?

@glemaitre (Member Author)

what's the state of this @glemaitre ? do you need help with it?

I am fixing something related to the scorer in #18114.
I want to get that in first to be sure that we have a clean API regarding the scoring passed.

Base automatically changed from master to main January 22, 2021 10:52
@glemaitre glemaitre added this to the 1.0 milestone Feb 1, 2021

@lorentzenchr (Member)

Just to set the reference, it would be nice to be able to also close #4813 with this meta-estimator.

@adrinjalali (Member)

Not sure if there's time to get this in, since it also seems to depend on #18589. I'll leave it here, but it isn't a blocker for the release.

@UnixJunkie (Contributor)

Supporting MCC as one of the metrics would be nice.

@lorentzenchr (Member)

superseded by #26120

@lorentzenchr added the "Superseded" label (PR has been replaced by a newer PR) Jun 23, 2023

@skanskan

Can this be used at the same time as GridSearchCV?
Is it the same as using TunedThresholdClassifierCV with refit=True?
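
For what it's worth, one way to combine the two with the released API that superseded this PR (TunedThresholdClassifierCV, scikit-learn 1.5+) is to nest the grid search inside the threshold tuner; the dataset and parameter grid below are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TunedThresholdClassifierCV

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Hyper-parameters are tuned by the inner search; the outer wrapper then
# tunes the decision cut-off of the resulting classifier.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid={"C": [0.1, 1.0, 10.0]}
)
model = TunedThresholdClassifierCV(search, scoring="balanced_accuracy", cv=5)
model.fit(X, y)
```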


Successfully merging this pull request may close these issues.

Add wrapper class that changes threshold value for predict