
[WIP] FEA New meta-estimator to post-tune the decision_function/predict_proba threshold for binary classifiers #16525


Closed
glemaitre wants to merge 49 commits

Conversation

@glemaitre (Member) commented Feb 23, 2020

closes #8614
closes #10117
supersedes #10117

Description

This meta-estimator is intended to find the decision threshold that maximizes an objective metric. There are two use cases:

  • maximize a metric such as balanced_accuracy_score, fbeta_score, f1_score, etc. In this case, we simply need to maximize the score.
  • find the decision threshold for a pair of metrics where the value of one metric is fixed. For instance, for precision-recall, one would like the threshold maximizing recall for a given precision, or vice-versa. I think we can support precision/recall and TPR/FPR at first.
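
For illustration (not part of the original description), a minimal sketch of what such a meta-estimator automates, written with existing scikit-learn primitives only; the dataset, metric, and threshold grid are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba_val = clf.predict_proba(X_val)[:, 1]

# Scan candidate cut-offs on the validation set and keep the best one.
thresholds = np.linspace(0.01, 0.99, 99)
scores = [balanced_accuracy_score(y_val, (proba_val >= t).astype(int))
          for t in thresholds]
best_threshold = thresholds[np.argmax(scores)]

# Hard predictions now use the tuned cut-off instead of the default 0.5.
y_pred = (clf.predict_proba(X_val)[:, 1] >= best_threshold).astype(int)
```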

Additional work

  • in the future, we could think of adding support for passing a cost-sensitive matrix of the size of the confusion matrix. This would allow injecting business-oriented information about the different classification errors made by a learner.

@ogrisel ogrisel changed the title [WIP] FEA Add decision threshold calibration wrapper [WIP] FEA New meta-estimator to post-tune the decision_function/predict_proba threshold for binary classifiers Feb 24, 2020

@ogrisel (Member) commented Feb 25, 2020

I am not sure it belongs in sklearn.calibration: to me, the sklearn.calibration module is about ensuring that the model's predicted values match the empirical mean of the observed variable on a test set (on groups of samples binned by predicted value).

For instance, a model that always predicts 0.5 on a balanced binary classification problem is well-calibrated but not discriminative. Conversely, a model with a good ROC-AUC curve can be very discriminative (it ranks the samples in the right order most of the time) but does not necessarily predict the expected value of the observation correctly: an arbitrary strictly monotonic transformation of the predict_proba values will preserve the curve but break the calibration.
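
A small self-contained illustration of this point (not from the original comment): a strictly monotonic transform of the predicted probabilities leaves the ROC-AUC unchanged because the ranking is preserved, but degrades a calibration-sensitive score such as the Brier score.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.RandomState(0)
proba = rng.uniform(size=10_000)
y = rng.binomial(1, proba)   # labels drawn from proba, i.e. well calibrated

squashed = proba ** 3        # strictly monotonic, rank-preserving transform

print(roc_auc_score(y, proba), roc_auc_score(y, squashed))        # identical
print(brier_score_loss(y, proba), brier_score_loss(y, squashed))  # much worse
```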

This definition of model calibration matches this:

https://en.wikipedia.org/wiki/Probabilistic_classification

and that:

https://blogs.sas.com/content/iml/2018/05/14/calibration-plots-in-sas.html#prettyPhoto

So to me model calibration, in the context of classification, is a property of the values returned by predict_proba and is therefore independent of a specific cut-off value used to make hard decisions.

I would therefore consider moving the CutoffClassifier class elsewhere, for instance into sklearn.model_selection, as to me it has more in common with grid search than with probability calibration.

That being said, we could also re-scope sklearn.calibration and keep cut-off tuning there, but we should be extra pedagogical about the aim of each meta-estimator to highlight that they do not solve the same problem at all.

@glemaitre (Member Author)

@ogrisel I think that the API is starting to be stable. I added support for the cost matrix since it is really close to the support for any metric. The cost matrix is passed using the parameter objective_value, which, IMO, seems well suited for this. I added a couple of tests to cover the general behaviour plus the common tests.

I will add the documentation tomorrow by merging the previous doc and improving how it blends in, with some examples.

Lastly, I think that we should rename the estimator, all the more so if we support the cost matrix. I was thinking of something like OptimumThresholdClassifier.

@glemaitre (Member Author)

Even though this is still WIP (I need to write the documentation), @adrinjalali and @amueller might be interested in having a look at it.

@glemaitre (Member Author)

One other API question is whether we should use CV or not for selecting the threshold. If yes, should it be a single train/test split, a cross_val_predict (forbidding shuffled CVs), or a full cross_validate?

@glemaitre (Member Author)

OptimumThresholdClassifier

Instead of Optimum it could be a term linked to the fine-tuning using the objective_metric.

@ogrisel (Member) left a comment

W.r.t. CV vs single split, I think the main point from @marctorrellas in #10117 (comment) is "refit" vs "no refit". When you CV, you implicitly refit on the full training set. Alternatively, you would have to keep the k models fitted on the k splits and take a majority vote afterwards, but that feels weird.

If you do a single split, then you have the opportunity to keep the estimator fitted on the train side of the split, and the selected threshold then matches this specific fitted model. If you refit on the full training set, the selected threshold is no longer specifically tuned for this fitted model. I think we should do some experiments to drive this discussion further: choose a dataset, materialize several train/validation splits, and for each of them draw the ROC curves and highlight the thresholds with the optimal f1 score or balanced accuracy.
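
A rough sketch of such an experiment on synthetic data (illustrative only): fit on several train/validation splits, locate the f1-optimal cut-off on each validation side, and look at how much that cut-off moves across splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import StratifiedShuffleSplit

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

best_thresholds = []
for train_idx, val_idx in cv.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    proba = clf.predict_proba(X[val_idx])[:, 1]
    precision, recall, thresholds = precision_recall_curve(y[val_idx], proba)
    # precision/recall have one extra trailing entry compared to thresholds
    f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
    best_thresholds.append(thresholds[np.argmax(f1)])

print(np.mean(best_thresholds), np.std(best_thresholds))
```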

W.r.t. renaming, I don't really like OptimumThresholdClassifier. Maybe TunedCutoffClassifier? But I think CutoffClassifier was a fine name.

@staticmethod
def cost_sensitive_score(y_true, y_pred, cost_matrix):
    cm = confusion_matrix(y_true, y_pred) * cost_matrix
    return np.diag(cm) / cm.sum()

Member

Why pass a full cost matrix if you then only consider the diagonal?

We could define an arbitrary business-specific utility function, such as:

  • gain for true positives
  • gain (negative cost) for false positives
  • gain (negative cost) for false negatives
  • gain for true negatives (generally 0 for imbalanced classification problems).

In which case cost_sensitive_score would be (confusion_matrix * gain_matrix).sum() / n_samples, or alternatively -(confusion_matrix * cost_matrix).sum() / n_samples.

WDYT?

This work could later be extended to parametrized cost-sensitive scorer objects that could be used both for optimal cutoff selection of the final classifier of a pipeline and by the SearchCV objects for hyper-parameter tuning of the full pipeline in general.
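
A sketch of this additive parametrization (the function name gain_score and the example numbers are illustrative), assuming gain_matrix follows the row/column layout of sklearn's confusion_matrix (true class in rows, predicted class in columns):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def gain_score(y_true, y_pred, gain_matrix):
    # gain_matrix[i, j]: gain (or negative cost) of predicting class j
    # when the true class is i.
    cm = confusion_matrix(y_true, y_pred)
    return (cm * gain_matrix).sum() / cm.sum()

# Example: true negatives are worth 0, false positives cost 1,
# false negatives cost 5, true positives are worth 10.
gain_matrix = np.array([[0, -1],
                        [-5, 10]])
```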

Member Author

Why pass a full cost matrix if you then only consider the diagonal?

cm.sum() will include the off-diagonal terms. If you penalize false positives and false negatives differently, the cost will change in this case, won't it?

Member

Indeed. It is really hard to interpret how gains and costs counteract one another. I prefer my simple, purely additive parametrization unless you can show me that yours makes more sense.

Member Author

I went a bit through the paper by Elkan: http://web.cs.iastate.edu/~honavar/elkan.pdf
In section 2, there is an analytic formulation of the optimal threshold. However, it requires using predict_proba (which could be fine since method="auto" by default?).
In this case (if I understand correctly), it means that you can compute the threshold just by looking at the cost matrix.
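
For reference, a sketch of that closed-form result, with cost[i, j] taken as the cost of predicting class i when the true class is j (Elkan's convention, i.e. the transpose of sklearn's confusion_matrix layout):

```python
def elkan_threshold(cost):
    # Predict the positive class whenever predict_proba >= p_star.
    c00, c01 = cost[0, 0], cost[0, 1]  # costs when predicting the negative class
    c10, c11 = cost[1, 0], cost[1, 1]  # costs when predicting the positive class
    return (c10 - c00) / (c10 - c00 + c01 - c11)
```

With zero cost for correct predictions, this reduces to c10 / (c10 + c01), i.e. the cut-off is driven by the ratio of false-positive to false-negative costs.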

Member Author

Otherwise, your additive parametrization is fine with me; it makes sense.

Member

The Elkan paper also mentions the case of non-constant costs, that is, costs that depend on the value of an input feature. This is practically very useful, probably much more useful than constant costs. Maybe we should re-scope this whole cost-matrix thingy for another PR, though. It will be much easier to implement once we have a clean API to deal with feature names.

The classifier, fitted or not fitted, from which we want to optimize
the decision threshold used during `predict`.

objective_metric : {"tpr", "tnr"}, ndarray of shape (2, 2) or callable, \

@ogrisel (Member) commented Feb 26, 2020

It should also include "f1" and "balanced_accuracy", and maybe others such as "fbeta" if we also include a new dedicated "beta" constructor param in CutoffClassifier itself.

Member Author

I was thinking of letting the user pass a sklearn.metrics callable. It makes things easy because you never need an objective_value in this case; you only need an objective_value when a str is provided, because then we work with a predefined search.

Member

I would still make it possible to pass strings for metrics that can be efficiently computed from the output of _binary_clf_curve. This private tool could actually be extended to _binary_clf_confusion_matrix_curve that would compute the CM for each threshold.
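
A rough sketch of the idea (the import path below is a private scikit-learn detail and may change; the wrapper name simply mirrors the suggested extension): _binary_clf_curve returns cumulated false- and true-positive counts per candidate threshold, from which the full confusion matrix at every threshold follows with vectorized arithmetic.

```python
from sklearn.metrics._ranking import _binary_clf_curve

def binary_clf_confusion_matrix_curve(y_true, y_score):
    # fps[k], tps[k]: number of false/true positives when thresholding at
    # thresholds[k]; the last entries give the totals of each class.
    fps, tps, thresholds = _binary_clf_curve(y_true, y_score)
    fns = tps[-1] - tps   # positives still predicted negative
    tns = fps[-1] - fps   # negatives correctly predicted negative
    return thresholds, tps, fps, fns, tns
```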

Member Author

"fbeta" if we also include a new dedicated "beta" constructor param

This could be passed to objective_metric_param indeed.

Member Author

I would still make it possible to pass strings for metrics that can be efficiently computed from the output of _binary_clf_curve. This private tool could actually be extended to _binary_clf_confusion_matrix_curve that would compute the CM for each threshold.

OK. I will first write the documentation with the naive approach, and then I will improve the efficiency.

@ogrisel (Member) left a comment

We also need an example, e.g. on an artificially imbalanced version of Adult Census with one-hot encoded categorical features, scaled numerical features, and Logistic Regression, possibly with class_weight="balanced".

Then plot the ROC and PR curves and add the points for the model with the default cutoff (0.5 on predict_proba), the model with the best balanced accuracy, and the model with the best f1_score.
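
A condensed sketch of such a figure on synthetic data (the Adult Census preprocessing pipeline is left out for brevity): draw the ROC curve and mark the operating point of the default 0.5 cut-off.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, proba)
tn, fp, fn, tp = confusion_matrix(y_test, (proba >= 0.5).astype(int)).ravel()

plt.plot(fpr, tpr, label="ROC curve")
plt.scatter(fp / (fp + tn), tp / (tp + fn), c="red", label="default 0.5 cut-off")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```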

@ogrisel (Member) commented Feb 26, 2020

Also, it would be great to find the model with the best recall under a given precision constraint (e.g. precision >= 0.3), but not all precision levels are feasible. So this would lead to models that could raise a ValueError or FeasibilityError at fit time... I am not sure we want this.

On the other hand, all recall / TPR values are feasible, so what we might want is: given a recall / TPR constraint, find the cutoff that leads to the best precision. But this is not easy to specify with the current set of constructor parameters.

Finally, if we treated the default cutoff as a regular parameter of any classifier, one could also use the SearchCV objects to tune it, but that would be very inefficient (retraining models for nothing). Maybe it's worth highlighting this aspect in the documentation.
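
A sketch of what the precision-constrained variant could look like internally (the function name and the error handling are illustrative): keep only the cut-offs whose precision reaches the requested level, take the one with the best recall, and raise at fit time when none is feasible.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_min_precision(y_true, y_score, min_precision=0.3):
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    feasible = precision[:-1] >= min_precision  # align with thresholds
    if not feasible.any():
        raise ValueError(f"no cut-off reaches precision >= {min_precision}")
    best = np.argmax(np.where(feasible, recall[:-1], -np.inf))
    return thresholds[best]
```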

@marctorsoc (Contributor) commented Sep 23, 2020

Looking forward to this!

Is there any possibility of generalizing this to the multi-class case? I haven't really studied the question, but for instance, for models that have a predict_proba method, instead of defining one threshold, I imagine it would be somewhat equivalent to define the prediction as np.argmax(w * proba, axis=1), where w is a vector of shape (n_classes,) to be optimized under some additional constraint, say sum(w) = 1, and then use the multi-class averaged f1 (etc.) scores?
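
A toy illustration of that weighted-argmax idea (not part of this PR; w would still have to be optimized, e.g. by grid or direct search, against the chosen multi-class metric):

```python
import numpy as np

def predict_with_class_weights(proba, w):
    # proba: (n_samples, n_classes) output of predict_proba
    # w: (n_classes,) non-negative weights, e.g. constrained to sum to 1
    return np.argmax(w * proba, axis=1)
```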

IMHO, let's have it for binary first, and then we can move on from there. This PR has been here for a long time with a lot of discussion, so better not to increase the scope of the task.

what's the state of this @glemaitre ? do you need help with it?

@glemaitre (Member Author)

what's the state of this @glemaitre ? do you need help with it?

I am fixing something related to the scorer in #18114.
I want to get that in first to be sure that we have a clean API regarding the scoring passed.

Base automatically changed from master to main January 22, 2021 10:52
@glemaitre glemaitre added this to the 1.0 milestone Feb 1, 2021

@lorentzenchr (Member)

Just to set the reference, it would be nice to be able to also close #4813 with this meta-estimator.

@adrinjalali (Member)

Not sure if there's time to get this in, since it also seems to depend on #18589. I'll leave it here, but it isn't a blocker for the release.

@UnixJunkie (Contributor)

Supporting MCC as one of the metrics would be nice.

@lorentzenchr (Member)

superseded by #26120

@lorentzenchr added the "Superseded" label (PR has been replaced by a newer PR) Jun 23, 2023

@skanskan

Can this be used at the same time as GridSearchCV?
Is it the same as using TunedThresholdClassifierCV with refit=True?
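
For what it's worth, one way to combine the two with the released API that superseded this PR (TunedThresholdClassifierCV, scikit-learn 1.5+) is to nest the grid search inside the threshold tuner; the dataset and parameter grid below are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TunedThresholdClassifierCV

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Hyper-parameters are tuned by the inner search; the outer wrapper then
# tunes the decision cut-off of the resulting classifier.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid={"C": [0.1, 1.0, 10.0]}
)
model = TunedThresholdClassifierCV(search, scoring="balanced_accuracy", cv=5)
model.fit(X, y)
```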


Successfully merging this pull request may close these issues.

Add wrapper class that changes threshold value for predict