-
Dear @fabseb60, computing the F1 score in this scenario, where TP=FP=FN=0, leads to $F_1 = \frac{2 \cdot 0}{2 \cdot 0 + 0 + 0} = \frac{0}{0}$, which is undetermined. By convention, you could set it to 0, to 1, or leave it undefined. However, if you set it to 1 and the classifier was badly trained and classifies everything as negative, you will get the wrong impression by looking at the F1 score. Setting it to 0 by convention seems to me like a good way of identifying that something is wrong, either with the dataset or with both the dataset and the classifier.
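For concreteness, here is a minimal sketch of this convention in scikit-learn (the toy labels are invented for illustration; f1_score and its UndefinedMetricWarning are the actual API):

```python
from sklearn.metrics import f1_score

# A test set with no positives, and a classifier that predicts everything
# as negative: TP = FP = FN = 0, so precision and recall are both 0/0.
y_true = [0, 0, 0, 0]
y_pred = [0, 0, 0, 0]

# Default convention: F1 is reported as 0.0, with an UndefinedMetricWarning.
print(f1_score(y_true, y_pred))  # 0.0
```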
-
Hi Javier,
Thank you for your reply.
F1 is a function that must be defined by cases.
In fact, the original definition that you correctly give, i.e., $F_1 = 2 \cdot \frac{precision \times recall}{precision + recall}$, does not specify what happens when $|TP \cup FP| = 0$ (i.e., when the classifier predicts that there are no positives in the test set; in this case precision is 0/0) or what happens when $|TP \cup FN| = 0$ (i.e., when there are no positives in the test set; in this case recall is 0/0).
Because of this, the definition more commonly given is $F_1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$, which is obtained by substituting $\frac{TP}{TP+FP}$ for precision and $\frac{TP}{TP+FN}$ for recall, and which is equivalent to the above except for the case in which $|TP \cup FP| = 0$; in this case, $F_1$ is 0, provided that FN is not 0.
By itself, this second definition is also incomplete, since it does not specify what happens when TP=FP=FN=0, which is the case we are discussing. Being incomplete is clearly unsatisfactory for an evaluation measure, since an evaluation measure must return an answer in every legitimate case it confronts; and the case in which TP=FP=FN=0 is indeed legitimate, since it is the case in which there are no positives in the test set and the classifier correctly identifies all items as negative. Aside from being legitimate, it is also a situation that arises in many practical cases; e.g., think of a classifier that must detect a rare illness, and that correctly decrees that in a given population sample there are (luckily enough) no cases of this illness. It is thus necessary to specify what F1 evaluates to when TP=FP=FN=0.
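To make the case analysis explicit, here is a minimal sketch of the piecewise definition above, computed directly from the counts (the function name is mine, chosen for illustration):

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Piecewise F1: 1 when TP = FP = FN = 0 (no positives exist and none
    are predicted), and 2*TP / (2*TP + FP + FN) otherwise."""
    if tp == 0 and fp == 0 and fn == 0:
        return 1.0
    return 2 * tp / (2 * tp + fp + fn)

assert f1_from_counts(0, 0, 0) == 1.0  # all-negative test set, all predicted negative
assert f1_from_counts(0, 3, 0) == 0.0  # all-negative test set, 3 predicted positive
```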
> However, if you set it to 1 and the classifier was badly trained and classifies everything as negative, you will get the wrong impression by looking at the F1 score.
The task of an evaluation measure is not to decide whether the classifier is badly trained or not, or whether it classifies everything as negative; its task is to compare, for a set of test items, their true labels with the labels the classifier has predicted, and to score the classifier accordingly. If you submit to the classifier a set of items whose true labels are all negative, the best you can hope for is that the classifier correctly recognizes them all as negative, and in this case there is no reason not to give it a 1.
If a classifier classifies everything as negative, as you say, it will be severely penalized by F1 every time you ask it to classify a set of items that contains at least one positive (in all such cases, F1 evaluates to 0 for a classifier that classifies everything as negative). So there are other (legitimate) ways to spot, and penalize, a classifier that classifies everything as negative.
> Setting it to 0 by convention seems to me like a good way of identifying that something is wrong, either with the dataset or with both the dataset and the classifier.
As I argued above, there is nothing wrong with a test set that contains no positive examples; it is a legitimate set, and there are many such sets in real applications. If, in this case, the classifier identifies all the items as negative, it is not wrong either.
As a result of all this, I hope I have convinced you that there is no real reason why F1 should evaluate to anything different from 1 when $TP=FN=FP=0$.
Best,
Fabrizio Sebastiani
-
Hi @fabseb60, thank you for your message. From the application perspective (monitoring the classifier once in a while), what you describe does make sense, since the classifier is doing a good job whether it has been well trained or not. I agree with you in this situation. From the testing perspective, where we want to evaluate how the classifier will behave in a real situation (following a training-validation-testing or training-testing partitioning), I would never recommend using a test set that does not have positive examples. Wdyt?
-
Hi Javier,
An evaluation measure must do a correct job in any situation, be it the situation in which, as you say, you monitor the classifier once in a while, or the situation in which we want to evaluate how the classifier will behave in a real setting. If you would never recommend using a test set that does not have positive examples, fine, it's your choice; but the evaluation measure should behave correctly on such a test set too.
A machine-learned classifier is usually tested by applying it to MANY test sets; typically, most of them have at least some positive examples and some of them (a minority) have none. After all, we try to test classifiers with realistic data, and it is realistic, as I argued in my previous message, that at least SOME of the time the classifier will have to confront a test set in which there are no positives (e.g., it is realistic that, given an alarm system meant to detect intrusions, there are some days in which there are no intrusions).
E.g., in a recent paper that I co-wrote (still unpublished; I can share it with you privately if you want), we describe experiments in which we test classifiers on 18,900 test sets. About 4.7% of these test sets contain no positives. If, as you suggest, the classifier should always receive an F1=0 score in these situations, we would give a 0 both to the classifier that correctly identifies all negatives as such AND to the classifier that incorrectly identifies all the negatives as positives. Does this sound right? You will probably agree it doesn't.
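To see the effect on aggregated results, here is a small sketch in the spirit of those experiments, with invented toy test sets (only f1_score and zero_division are actual scikit-learn API):

```python
import numpy as np
from sklearn.metrics import f1_score

# Two tiny invented test sets: one with positives, one without.
test_sets = [
    ([0, 1, 1, 0], [0, 1, 0, 0]),  # has positives: TP=1, FP=0, FN=1
    ([0, 0, 0, 0], [0, 0, 0, 0]),  # no positives:  TP=FP=FN=0
]

for zd in (0, 1):
    scores = [f1_score(y, p, zero_division=zd) for y, p in test_sets]
    print(zd, np.mean(scores))
# zero_division=0 -> mean of [0.666..., 0.0] = 0.333...
# zero_division=1 -> mean of [0.666..., 1.0] = 0.833...
```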
Best regards,
Fabrizio
-
Hi Fabrizio, thanks for the discussion. That a classifier is usually tested by applying it to MANY test sets does not necessarily mean that some of them will lack positive or negative samples in a binary classification problem; you are describing a specific scenario here. That said, in such a scenario, where you have multiple test sets and some of them contain no positive examples, it truly makes sense to set the F1 score to 1. I agree with you that we should set it to 1, but one should be aware that in other scenarios, e.g., having all of your test sets with no positive examples at all, this could prevent you from anticipating the general behavior of your classifier. And by saying that, I am not retracting my agreement about setting it to 1; I am just pointing out a specific case.
-
Did you create an issue?
-
Hi Javier,
Sure, I can't really imagine a case in which a thorough experimentation ONLY involves test sets in which all the examples are negative. I'm simply saying that, out of the many test sets a classifier will be confronted with, SOME might consist of negative examples only, and that F1 should deal correctly with these too.
Best,
Fabrizio
-
No, sorry. I am not a scikit-learn user; my younger collaborators and my students are.
Best,
Fabrizio
-
Having TP, FP, and FN all equal to zero is a corner case, since the denominator is zero and leads to a division by zero. For consistency with other metrics, we chose to use 0 and to raise a warning, but the behaviour is tunable using the parameter zero_division to accommodate what you think is reasonable for your application.
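For reference, a small sketch of the tunable behaviour (zero_division is the actual parameter; the np.nan option assumes scikit-learn >= 1.3, where it was added):

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = [0, 0, 0]
y_pred = [0, 0, 0]  # TP = FP = FN = 0

f1_score(y_true, y_pred)                        # 0.0, with an UndefinedMetricWarning
f1_score(y_true, y_pred, zero_division=0)       # 0.0, no warning
f1_score(y_true, y_pred, zero_division=1)       # 1.0, no warning
f1_score(y_true, y_pred, zero_division=np.nan)  # nan, no warning (scikit-learn >= 1.3)
```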
-
Hi Guillaume, thanks for your answer. The definition usually given of F1 is not
$$F_1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$$

but

$$F_1 = \begin{cases} 1 & \text{if } TP = FP = FN = 0 \\ \frac{2 \cdot TP}{2 \cdot TP + FP + FN} & \text{otherwise} \end{cases}$$
In order to show that this is a reasonable and correct definition, suppose there are no positives in the test set (i.e., TP=FN=0 -- a totally legitimate situation, which arises in many practical cases), and take two extreme cases:
Case 1) TN=0, FP different from 0; this means that all the elements in the test set are negative and the classifier has misclassified them all as positive.
Case 2) TN different from 0, FP = 0; this means that all the elements in the test set are negative and the classifier has correctly classified them all as negative.
Choosing to give a score of 0 to Case 2, as you do, means giving the same score to Case 1 (which correctly receives a score of 0) and Case 2, i.e., equating a classifier that has correctly answered in all cases (Case 2) with a classifier that has mistakenly answered in all cases (Case 1).
To me, the case for returning a score of 1 in Case 2 (always, without giving the user a choice) could not be clearer.
Best,
Fabrizio
PS. If the above does not convince you, consider the following example. Our classifier is a spam filter, where "positive" means "spam". Assume that in the last hour no spam messages were received but several legitimate messages were. Spam filter A has wrongly blocked all the legitimate messages, treating them as spam, while Spam filter B has correctly allowed all the legitimate messages into your mailbox, treating them as legitimate. Your proposal is to give Spam filter B a zero, which is the score you also (understandably) give to Spam filter A …
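To put the PS in numbers, a sketch with invented message counts (five legitimate messages, no spam):

```python
from sklearn.metrics import f1_score

# Last hour: 5 legitimate messages (label 0), no spam (label 1).
y_true = [0, 0, 0, 0, 0]

filter_a = [1, 1, 1, 1, 1]  # blocks everything: TP=0, FP=5, FN=0
filter_b = [0, 0, 0, 0, 0]  # lets everything through: TP=FP=FN=0

f1_score(y_true, filter_a, zero_division=0)  # 0.0 -- wrong on every message
f1_score(y_true, filter_b, zero_division=0)  # 0.0 -- right on every message, same score
f1_score(y_true, filter_b, zero_division=1)  # 1.0 -- the convention argued for here
```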
-
Ok, I assume we’ll agree to disagree :-)
Best,
Fabrizio
… On 31 Mar 2024, at 20:10, Guillaume Lemaitre wrote:
> As said, I think this is application specific. The fact that TP=FP=FN=0 is a big red flag if we are in a case where we have many samples to evaluate a model, and a sensible thing to do is to raise a warning.
> I completely agree that your use case is valid; it can be handled by setting zero_division=1, which in addition does not raise a warning, because it is legitimate in this case.
-
Dear FS,
Thank you for your feedback! The choice of defaulting the F1 score to 0 when there are no true positives, false negatives, or false positives aligns with the need for consistency in handling edge cases. While your point about perfect classification in an all-negative scenario is valid, the current behavior reflects a design decision to ensure interpretability across varied contexts. Your suggestion is appreciated, and we will consider clarifying or revisiting this in future updates.
Best regards,
The scikit-learn team
-
Dear scikit-learn team,
Thanks for your reply. Let me comment that:
* "The choice of defaulting the F1 score to 0 when there are no true positives, false negatives, or false positives aligns with the need for consistency in handling edge cases”. Indeed! We are always taught that measures should indeed behave consistently also in edge cases. Shouldn’t they? :-)
* "a design decision to ensure interpretability across varied contexts”: Apologies, but I do not understand what this means. That a measure does not reward with a perfect score a classifier that correctly classifies all the instances it is asked to classify, is hardly interpretable for me.
Anyway, it does not matter.
Best,
FS
-
On 9 Jan 2025, at 01:10, Alsf1994 wrote:
> The choice to assign an F1 score of 0.0 when there are no true positives, false negatives, or false positives aligns with the convention that F1 is undefined in such cases due to a lack of positive class predictions.
It’s not true that F1 is undefined in those cases. F1 is commonly defined as

$$F_1 = \begin{cases} 1 & \text{if } TP = FP = FN = 0 \\ \frac{2 \cdot TP}{2 \cdot TP + FP + FN} & \text{otherwise} \end{cases}$$
If it were undefined for some cases, it would not be a good measure. Every measure must be defined for all legitimate cases it may be asked to evaluate, and the one above is not only legitimate, but one that shows up frequently, e.g., in spotting people affected by a rare disease.
> Assigning a score of 1 could misrepresent the classifier's performance.
Assume you are a kid at primary school. The teacher presents you with a set of photographs of lions, and asks you “Which of these are tigers?”. You answer “None of these are tigers”. The teacher scolds you, and gives you the lowest possible mark.
Best,
FS
-
Let me add one consideration. As you know, evaluation measures are not used only by humans who run their own experiments; they are also used by algorithms, such as hyperparameter optimization procedures. Assigning a score of 0 (or allowing the assignment of 0) to a classifier that has correctly performed the task it has been assigned (i.e., classifying a set that contains no positive examples) also means misguiding the hyperparameter optimization procedure.
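For instance, here is a sketch of how the convention propagates into model selection (make_scorer, GridSearchCV, and zero_division are actual scikit-learn APIs; the estimator and parameter grid are placeholders):

```python
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Here the scorer, not a human, consumes F1: under the default convention,
# every cross-validation fold whose test split happens to contain no
# positives scores 0, dragging down the mean and potentially steering the
# search away from hyperparameters that handled those folds perfectly.
scorer = make_scorer(f1_score, zero_division=1)

search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]},
                      scoring=scorer, cv=5)
```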
Best,
FS
-
Dear scikit-learn authors,
In your definition of the F1 score (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score), you say that “F1 is by default calculated as 0.0 when there are no true positives, false negatives, or false positives”. This choice is IMHO dead wrong; in these cases, F1 should be evaluated as 1, and there should be no option for the user to set it otherwise. In these cases, the true class of all the items is the negative class (since there are no true positives and no false negatives), and the classifier has correctly handled all these items, i.e., it has decreed that they are negative. In other words, the classifier could not have done better with the cases it has been asked to classify, since it has classified all of them correctly, and it should thus be awarded nothing less than a perfect score of 1.
Thanks for all the good work! FS