[WIP] FEA New meta-estimator to post-tune the decision_function/predict_proba threshold for binary classifiers #16525
Conversation
I am not sure it belongs in the calibration module. For instance, a model that always predicts 0.5 on a balanced binary classification problem is well-calibrated but not discriminative. On the contrary, a model with a good ROC-AUC can be very discriminative (it ranks the samples in the right order most of the time) but does not necessarily predict the expected value of the observation: an arbitrary, strictly monotonic transformation of the predicted scores preserves the ranking but not the calibration.

This definition of model calibration matches this: https://en.wikipedia.org/wiki/Probabilistic_classification and that: https://blogs.sas.com/content/iml/2018/05/14/calibration-plots-in-sas.html#prettyPhoto

So to me model calibration, in the context of classification, is a property of the values returned by predict_proba. I would therefore consider moving the CutoffClassifier class elsewhere, outside of the calibration module. That being said, we could also re-scope the module instead.
@ogrisel I think that the API is starting to stabilize. I added support for the cost matrix since it is really close to the support for any metric. The cost matrix is passed using the objective_metric parameter. I will add the documentation tomorrow by merging the previous doc and improving how it blends in with some examples. Lastly, I think that we should rename the estimator, even more so if we support the cost matrix. I was thinking maybe of OptimumThresholdClassifier.
Even if this is still WIP (I still need to write the documentation), @adrinjalali and @amueller might be interested in having a look at it.
One more API question is whether or not we should use cross-validation for selecting the threshold. If yes, should it be a single train/test split or a full cross-validation scheme?
W.r.t. CV vs single split, I think the main point from @marctorrellas in #10117 (comment) is "refit" vs "no refit". When you use CV you implicitly refit on the full training set. Alternatively you would have to keep the k models fitted on each of the k splits and take a majority vote afterwards, but that feels weird.

If you do a single split, then you have the opportunity to keep the estimator fitted on the train side of the split, and the selected threshold then matches that specific fitted model. If you refit on the full training set, the selected threshold is no longer specifically tuned for the final fitted model. I think we should do some experiments to drive this discussion further: choose a dataset, materialize several train / validation splits, and for each of them draw the ROC curve and highlight the threshold with optimal F1 score or balanced accuracy.
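A rough sketch of that experiment, with a synthetic dataset and LogisticRegression as placeholders: plot the validation ROC curve for several splits and mark, on each, the cutoff with the best balanced accuracy.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedShuffleSplit

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
for train, val in cv.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    y_score = clf.predict_proba(X[val])[:, 1]
    fpr, tpr, thresholds = roc_curve(y[val], y_score)
    balanced_acc = (tpr + (1 - fpr)) / 2  # balanced accuracy at each candidate cutoff
    best = np.argmax(balanced_acc)
    plt.plot(fpr, tpr, alpha=0.5)
    plt.plot(fpr[best], tpr[best], "o")  # cutoff with best balanced accuracy on this split
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```

The spread of the marked cutoffs across splits would give a first idea of how sensitive the tuned threshold is to the particular split used.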
W.r.t. renaming, I don't really like "optimum" in OptimumThresholdClassifier. Maybe TunedCutoffClassifier? But I think CutoffClassifier was a fine name.
    @staticmethod
    def cost_sensitive_score(y_true, y_pred, cost_matrix):
        cm = confusion_matrix(y_true, y_pred) * cost_matrix
        return np.diag(cm) / cm.sum()
Why pass a full cost matrix if you then only consider the diagonal?

We could define an arbitrary business-specific utility function, such as:

- a gain for true positives;
- a gain (negative cost) for false positives;
- a gain (negative cost) for false negatives;
- a gain for true negatives (generally 0 for imbalanced classification problems).

In which case cost_sensitive_score would be (confusion_matrix * gain_matrix).sum() / n_samples, or alternatively -(confusion_matrix * cost_matrix).sum() / n_samples.

WDYT?

This work could later be extended to have parametrized cost-sensitive scorer objects that could be used both for optimal cutoff selection of the final classifier of a pipeline and by the SearchCV objects for hyper-parameter tuning of the full pipeline in general.
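A minimal sketch of the additive gain-matrix scoring proposed above; the gain_score name and the example gain values are illustrative, not part of the PR.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def gain_score(y_true, y_pred, gain_matrix):
    # Average business gain per sample: (confusion_matrix * gain_matrix).sum() / n_samples.
    cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
    return (cm * gain_matrix).sum() / cm.sum()

# Illustrative gains: true negatives are worth 0, false positives cost 1,
# false negatives cost 5, true positives gain 10.
gain_matrix = np.array([[0, -1],
                        [-5, 10]])
print(gain_score([0, 1, 1, 0, 1], [0, 1, 0, 0, 1], gain_matrix))
```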
> Why pass a full cost matrix if you then only consider the diagonal?

cm.sum() will take the off-diagonal terms into account. If you penalize false positives and false negatives differently, the cost will change in this case, won't it?
Indeed. It is really hard to interpret how gains and costs counteract one another. I prefer my simple, purely additive parametrization unless you can show me that yours makes more sense.
I went a bit through the Elkan paper: http://web.cs.iastate.edu/~honavar/elkan.pdf

In section 2, there is an analytic formulation of the optimal threshold. However, it requires using predict_proba (which could be fine since method="auto" is the default?). In this case (if I understand correctly), it means that you can compute the threshold just by looking at the cost matrix.
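For reference, a sketch of that analytic formulation; the (2, 2) cost-matrix layout used here is an assumption made for illustration.

```python
def elkan_optimal_threshold(cost_matrix):
    """Closed-form optimal threshold from section 2 of Elkan (2001).

    Assumes cost_matrix[i][j] is the cost of predicting class i when the true
    class is j. Predict the positive class whenever the predicted probability
    of the positive class is >= the returned threshold.
    """
    c00, c01 = cost_matrix[0]  # cost of predicting 0: true negative, false negative
    c10, c11 = cost_matrix[1]  # cost of predicting 1: false positive, true positive
    return (c10 - c00) / (c10 - c00 + c01 - c11)

# Example: a false negative costs 5, a false positive costs 1, correct
# predictions cost nothing -> threshold = 1 / 6.
print(elkan_optimal_threshold([[0, 5], [1, 0]]))
```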
Otherwise, your additive parametrization is fine with me; it makes sense.
The Elkan paper also mentions the case of non-constant costs, that is, costs that depend on the value of an input feature. This is practically very useful, probably much more useful than constant costs. Maybe we should re-scope this whole cost-matrix thing for another PR, though. It will be much easier to implement once we have a clean API to deal with feature names.
        The classifier, fitted or not fitted, from which we want to optimize
        the decision threshold used during `predict`.

    objective_metric : {"tpr", "tnr"}, ndarray of shape (2, 2) or callable, \
It should also include "f1" and "balanced_accuracy", and maybe others such as "fbeta" if we also include a new dedicated "beta" constructor param in CutoffClassifier itself.
I was thinking of letting the user pass a sklearn.metrics callable. It makes things easy because you never need an objective_value in that case; you only need an objective_value when a str is provided, because then we work with a predefined search.
I would still make it possible to pass strings for metrics that can be efficiently computed from the output of _binary_clf_curve. This private tool could actually be extended to _binary_clf_confusion_matrix_curve that would compute the CM for each threshold.
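A rough sketch of what such an extension could compute, assuming the private _binary_clf_curve helper (its location and exact signature may differ across versions); the function name mirrors the suggestion above but is hypothetical.

```python
import numpy as np
from sklearn.metrics._ranking import _binary_clf_curve  # private helper, may move between versions

def binary_clf_confusion_matrix_curve(y_true, y_score):
    # fps/tps are cumulative false/true positive counts obtained when predicting
    # positive for every sample whose score is >= the corresponding threshold.
    fps, tps, thresholds = _binary_clf_curve(y_true, y_score)
    tns = fps[-1] - fps  # negatives still correctly predicted negative
    fns = tps[-1] - tps  # positives still incorrectly predicted negative
    cms = np.stack(
        [np.array([[tn, fp], [fn, tp]]) for tn, fp, fn, tp in zip(tns, fps, fns, tps)]
    )
    return cms, thresholds  # one (2, 2) confusion matrix per candidate threshold
```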
"fbeta" if we also include a new dedicated "beta" constructor param
This could be pass to objective_metric_param
indeed.
> I would still make it possible to pass strings for metrics that can be efficiently computed from the output of _binary_clf_curve. This private tool could actually be extended to _binary_clf_confusion_matrix_curve that would compute the CM for each threshold.

OK. I will first write the documentation with the naive approach and then improve the efficiency.
We also need an example, e.g. on an artificially imbalanced version of Adult Census with one-hot encoded categorical features, scaled numerical features, and LogisticRegression, possibly with class_weight="balanced".

Then plot the ROC and PR curves and add the points for the model with the default cutoff (0.5 on predict_proba), the model with the best balanced accuracy, and the model with the best f1_score.
Also it would be great to find the model with the best recall under a given precision constraint (e.g. precision >= 0.3), but not all precision levels are feasible, so this could lead to models that raise an error. On the other hand, all recall / TPR values are feasible, so what we might want is: given a recall / TPR constraint, find the cutoff that leads to the best precision. However, this is not easy to specify with the current set of constructor parameters. Finally, if we treated the default cutoff as a regular parameter of any classifier, one could also use SearchCV objects to tune it, but that would be a very inefficient thing to do (retraining models for nothing). Maybe it's worth highlighting this aspect in the documentation.
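A minimal illustration of the "best recall under a precision constraint" idea using precision_recall_curve; the helper name is made up, and y_score would be predict_proba(X)[:, 1] of a fitted binary classifier.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def cutoff_for_min_precision(y_true, y_score, min_precision=0.3):
    # Pick the cutoff maximizing recall among those meeting the precision constraint.
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision/recall have one more entry than thresholds; drop the terminal point.
    feasible = precision[:-1] >= min_precision
    if not feasible.any():
        # "Not all precision levels are feasible": fail explicitly in that case.
        raise ValueError(f"no cutoff reaches precision >= {min_precision}")
    best = np.argmax(np.where(feasible, recall[:-1], -np.inf))
    return thresholds[best]
```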
IMHO, let's have it for binary first, and then we can move on from that. This PR has been here for a long time with a lot of discussion, so better not to increase the scope of the task.

What's the state of this, @glemaitre? Do you need help with it?
I am fixing something related to the scorer in #18114.
Just for reference, it would be nice to also be able to close #4813 with this meta-estimator.
Not sure if there's time to get this in, since it also seems to depend on #18589. I'll leave it here, but it isn't a blocker for the release.
Supporting MCC as one of the metrics would be nice.
Superseded by #26120.
Can this be used at the same time as GridSearchCV?
closes #8614
closes #10117
supersedes #10117
Description
This meta-estimator is intended to find the decision threshold which will maximize an objective metric. There are 2 use-cases:

- maximize a score such as balanced_accuracy_score, f_beta, f1_score, etc. In this case, we are required to maximize the score.

Additional work
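For context (not part of the original description), a hypothetical usage sketch following the parameter names discussed in this thread; CutoffClassifier and objective_metric reflect this WIP PR, not the API that was eventually merged.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Hypothetical usage of the WIP meta-estimator from this PR: wrap a classifier
# and tune its decision threshold to maximize the given objective metric.
model = CutoffClassifier(
    LogisticRegression(max_iter=1000),
    objective_metric=balanced_accuracy_score,  # a sklearn.metrics callable
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)  # predictions use the tuned cutoff instead of 0.5
```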