
Add P4 classification metric #31218


Closed
Anderlaxe opened this issue Apr 17, 2025 · 2 comments

@Anderlaxe

Describe the workflow you want to enable

Hi, while working on a classification problem I found out there is no dedicated function in sklearn to compute the P4 metric. As a reminder, P4 is a binary classification metric commonly seen as an extension of the f_beta metric, because it takes into account all four True Positive, False Positive, True Negative and False Negative values, and because it is symmetrical, unlike f_beta.

P4 is defined as follows: P4 = 4 / (1/precision + 1/recall + 1/specificity + 1/NPV)
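
For illustration, here is a minimal sketch (not an existing scikit-learn function) computing P4 for a binary problem directly from the confusion-matrix entries:

```python
# Illustrative only: a direct implementation of the formula above for binary
# labels; not part of scikit-learn.
from sklearn.metrics import confusion_matrix

def p4_binary(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    npv = tn / (tn + fn)
    # Harmonic mean of the four ratios.
    return 4 / (1 / precision + 1 / recall + 1 / specificity + 1 / npv)

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
# Each of the four ratios is 2/3 here, so P4 = 4 / (4 * 3/2) = 2/3.
print(p4_binary(y_true, y_pred))  # 0.666...
```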

Wikipedia page right here

Medium article right there

Describe your proposed solution

My idea was to create a function p4_support similar to precision_recall_fscore_support. Since P4 is a binary metric, multiclass and multilabel inputs would be handled with multilabel_confusion_matrix, so the accepted values for average would be 'macro', 'samples', 'weighted', 'binary' and None.
I would compute all the necessary values, such as 1/precision, 1/recall, 1/specificity and 1/NPV, using _prf_divide. If any of these four ratios is a zero division, then P4 would also return the zero_division argument. Indeed, if for example precision is zero, then 1/precision is +inf and the whole denominator of P4 is +inf, which makes P4 = 0. (Incidentally, this behavior is one reason it is harder to achieve a high P4 score than a high f_score, since all four ratios need to be 1 for P4 to equal 1.) The function would return the tuple (p4_value, support).

A second function, p4_score, which would be the one actually called by users, would return only the first element of the tuple returned by the p4_support function described above.
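
A rough sketch of what these two functions could look like, using the public multilabel_confusion_matrix and the equivalent closed form P4 = 4·TP·TN / (4·TP·TN + (TP + TN)·(FP + FN)). The names and the zero_division handling are only proposals from this issue, not existing scikit-learn API, and the averaging options described above are left out for brevity (the sketch returns per-class values, i.e. average=None behavior):

```python
# Hypothetical sketch of the proposed functions; not part of scikit-learn.
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

def p4_support(y_true, y_pred, *, zero_division=0.0):
    # One 2x2 confusion matrix per class, shape (n_classes, 2, 2).
    mcm = multilabel_confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = mcm[:, 0, 0], mcm[:, 0, 1], mcm[:, 1, 0], mcm[:, 1, 1]
    # Closed form equivalent to 4 / (1/precision + 1/recall + 1/specificity + 1/NPV).
    numerator = 4 * tp * tn
    denominator = 4 * tp * tn + (tp + tn) * (fp + fn)
    p4 = np.where(denominator > 0, numerator / np.maximum(denominator, 1), zero_division)
    support = tp + fn
    return p4, support

def p4_score(y_true, y_pred, *, zero_division=0.0):
    # Only return the per-class P4 values, as described above.
    p4, _ = p4_support(y_true, y_pred, zero_division=zero_division)
    return p4
```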

Describe alternatives you've considered, if relevant

Extras:

Since specificity and NPV are computed anyway, the p4_support function could return the tuple (specificity, NPV, p4_score, support) and then be called specificity_npv_p4_support. It would then also be possible to add specificity and NPV functions as well, using the same scheme as precision or recall.

Responding to issue #21000, P4 could be added to the classification_report function and would be a good summary of all TP, FP, TN, FN values and their combinations.

Additional context

I have checked that this feature is not already in the issues or pull requests.

@Anderlaxe Anderlaxe added Needs Triage Issue requires triage New Feature labels Apr 17, 2025
@ogrisel
Member

ogrisel commented Apr 25, 2025

Thanks for the proposal. I took a look at the paper, and it is quite recent (published in 2023) and cited only 31 times according to Google Scholar. As such, this does not meet our inclusion criteria for scikit-learn.

Personal opinion:

We already have MCC, and the P4 metric does seem to be very similar, in the sense that both metrics penalize models that have at least one bad entry in their confusion matrix. In that respect, P4 seems a bit redundant.

More importantly, I don't think generic metrics computed on thresholded (hard) predictions (MCC, P4, F1, balanced accuracy...) are the best way to choose a binary classifier for a given problem. Instead, I would select the best classifier based on threshold-independent binary classification metrics, either purely discriminative metrics such as ROC AUC or Average Precision (area under the PR curve), or calibration-aware metrics such as log-loss and Brier score, and afterward find the optimal decision threshold based on application-specific constraints:

  • find the threshold that maximizes precision for a given application-specific recall budget,
  • find the threshold that maximizes recall for a given application-specific precision budget,
  • assign application-specific costs (or gains) to the 4 entries of the confusion matrix and derive an application-specific business metric.

All of those can be implemented with the help of the TunedThresholdClassifierCV tool as documented in the following example:
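
For illustration, here is a minimal sketch of that third option (not the example linked above, and with made-up gain/cost values) using TunedThresholdClassifierCV, which requires scikit-learn >= 1.5:

```python
# Minimal sketch: tune the decision threshold to maximize a made-up
# application-specific gain derived from the confusion-matrix entries.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import TunedThresholdClassifierCV, train_test_split

def business_gain(y_true, y_pred):
    # Placeholder gains/costs for TP, FP, FN (true negatives cost nothing here).
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return 5 * tp - 1 * fp - 3 * fn

X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = TunedThresholdClassifierCV(
    LogisticRegression(max_iter=1_000), scoring=make_scorer(business_gain)
).fit(X_train, y_train)

print(f"tuned decision threshold: {model.best_threshold_:.3f}")
print(f"business gain on held-out data: {business_gain(y_test, model.predict(X_test))}")
```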

@ogrisel ogrisel removed the Needs Triage Issue requires triage label Apr 25, 2025
@ogrisel
Member

ogrisel commented Apr 25, 2025

I propose to close this feature request as "not planned" for now. If people disagree with what I wrote above, please feel free to upvote this issue and comment below to extend the analysis, and we can consider reopening once the inclusion criteria are met.

@ogrisel ogrisel closed this as not planned Apr 25, 2025