
Add P4 classification metric #31218


Closed
Anderlaxe opened this issue Apr 17, 2025 · 2 comments

@Anderlaxe

Describe the workflow you want to enable

Hi, while working on a classification problem I found out there is no dedicated function in sklearn to compute the P4 metric. As a reminder, P4 is a binary classification metric commonly seen as an extension of the f_beta metric, because it takes into account all four True Positive, False Positive, True Negative and False Negative values, and because it is symmetrical, unlike f_beta.

P4 is defined as follows: P4 = 4 / (1/precision + 1/recall + 1/specificity + 1/NPV)
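
For illustration, here is a minimal sketch (not an existing scikit-learn function) computing P4 for a binary problem directly from the confusion-matrix entries:

```python
# Illustrative only: a direct implementation of the formula above for binary
# labels; not part of scikit-learn.
from sklearn.metrics import confusion_matrix

def p4_binary(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    npv = tn / (tn + fn)
    # Harmonic mean of the four ratios.
    return 4 / (1 / precision + 1 / recall + 1 / specificity + 1 / npv)

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
# Each of the four ratios is 2/3 here, so P4 = 4 / (4 * 3/2) = 2/3.
print(p4_binary(y_true, y_pred))  # 0.666...
```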

Wikipedia page right here

Medium article right there

Describe your proposed solution

My idea was to create a function p4_support similar to precision_recall_fscore_support. Since P4 is a binary metric, multiclass and multilabel inputs would be handled with multilabel_confusion_matrix, so the accepted values for average would be 'macro', 'samples', 'weighted', 'binary' and None.
I would compute all the necessary values, such as 1/precision, 1/recall, 1/specificity and 1/NPV, using _prf_divide. If any of these four ratios is a zero division, then P4 would also return the zero_division argument. Indeed, if for example precision is zero, then 1/precision is +inf and the whole denominator of P4 is +inf, which makes P4 = 0. (Incidentally, this behavior is one reason it is harder to achieve a high P4 score than a high f_score, since all four ratios need to be 1 for P4 to equal 1.) The function would return the tuple (p4_value, support).

A second function, p4_score, which would be the one actually called by users, would return only the first element of the tuple returned by the p4_support function described above.
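
A rough sketch of what these two functions could look like, using the public multilabel_confusion_matrix and the equivalent closed form P4 = 4·TP·TN / (4·TP·TN + (TP + TN)·(FP + FN)). The names and the zero_division handling are only proposals from this issue, not existing scikit-learn API, and the averaging options described above are left out for brevity (the sketch returns per-class values, i.e. average=None behavior):

```python
# Hypothetical sketch of the proposed functions; not part of scikit-learn.
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

def p4_support(y_true, y_pred, *, zero_division=0.0):
    # One 2x2 confusion matrix per class, shape (n_classes, 2, 2).
    mcm = multilabel_confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = mcm[:, 0, 0], mcm[:, 0, 1], mcm[:, 1, 0], mcm[:, 1, 1]
    # Closed form equivalent to 4 / (1/precision + 1/recall + 1/specificity + 1/NPV).
    numerator = 4 * tp * tn
    denominator = 4 * tp * tn + (tp + tn) * (fp + fn)
    p4 = np.where(denominator > 0, numerator / np.maximum(denominator, 1), zero_division)
    support = tp + fn
    return p4, support

def p4_score(y_true, y_pred, *, zero_division=0.0):
    # Only return the per-class P4 values, as described above.
    p4, _ = p4_support(y_true, y_pred, zero_division=zero_division)
    return p4
```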

Describe alternatives you've considered, if relevant

Extras:

Since specificity and NPV are computed anyway, the p4_support function could return the tuple (specificity, NPV, p4_score, support) and then be called specificity_npv_p4_support. It would then also be possible to add specificity and NPV functions as well, using the same scheme as precision or recall.

Responding to issue #21000, P4 could be added to the classification_report function and would be a good summary of all TP, FP, TN, FN values and their combinations.

Additional context

I have checked that this feature is not already in the issues or pull requests.

@Anderlaxe Anderlaxe added Needs Triage Issue requires triage New Feature labels Apr 17, 2025
@ogrisel
Member

ogrisel commented Apr 25, 2025

Thanks for the proposal. I took a look at the paper, and it is quite recent (published in 2023) and cited only 31 times according to Google Scholar. As such, this does not meet our inclusion criteria for scikit-learn.

Personal opinion:

We already have MCC, and the P4 metric does seem to be very similar, in the sense that both metrics penalize models that have at least one bad entry in their confusion matrix. In that respect, P4 seems a bit redundant.

More importantly, I don't think generic metrics computed on thresholded (hard) predictions (MCC, P4, F1, balanced accuracy...) are the best way to choose a binary classifier for a given problem. Instead, I would select the best classifier based on threshold-independent binary classification metrics, either purely discriminative metrics such as ROC AUC or Average Precision (area under the PR curve), or calibration-aware metrics such as log-loss and Brier score, and afterward find the optimal decision threshold based on application-specific constraints:

  • find the threshold that maximizes precision for a given application-specific recall budget,
  • find the threshold that maximizes recall for a given application-specific precision budget,
  • assign application-specific costs (or gains) to the 4 entries of the confusion matrix and derive an application-specific business metric.

All of those can be implemented with the help of the TunedThresholdClassifierCV tool as documented in the following example:
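
For illustration, here is a minimal sketch of that third option (not the example linked above, and with made-up gain/cost values) using TunedThresholdClassifierCV, which requires scikit-learn >= 1.5:

```python
# Minimal sketch: tune the decision threshold to maximize a made-up
# application-specific gain derived from the confusion-matrix entries.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import TunedThresholdClassifierCV, train_test_split

def business_gain(y_true, y_pred):
    # Placeholder gains/costs for TP, FP, FN (true negatives cost nothing here).
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return 5 * tp - 1 * fp - 3 * fn

X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = TunedThresholdClassifierCV(
    LogisticRegression(max_iter=1_000), scoring=make_scorer(business_gain)
).fit(X_train, y_train)

print(f"tuned decision threshold: {model.best_threshold_:.3f}")
print(f"business gain on held-out data: {business_gain(y_test, model.predict(X_test))}")
```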

@ogrisel ogrisel removed the Needs Triage Issue requires triage label Apr 25, 2025
@ogrisel
Member

ogrisel commented Apr 25, 2025

I propose to close this feature request as "not planned" for now. If people disagree with what I wrote above, please feel free to upvote this issue and comment below to extend the analysis, and we can consider reopening once the inclusion criteria are met.

@ogrisel ogrisel closed this as not planned Apr 25, 2025