
ENH Add zero_division param to cohen_kappa_score #29210


Merged
merged 17 commits into scikit-learn:main from cohen_kappa_score
Jun 18, 2024

Conversation

StefanieSenger
Contributor

@StefanieSenger StefanieSenger commented Jun 7, 2024

Reference Issues/PRs

towards #29048

What does this implement/fix? Explain your changes.

Extracts the part that adds a `zero_division` param to `cohen_kappa_score` from the original PR #23183 by @marctorsoc, which we are working through in #29048.
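For readers outside the thread: Cohen's kappa is kappa = (p_o - p_e) / (1 - p_e), so the zero division happens exactly when the expected agreement p_e equals 1. A minimal sketch of the computation with plain numpy (not scikit-learn's implementation; the function name and its `zero_division` handling are illustrative):

```python
import numpy as np

def kappa_from_confusion(cm, zero_division=np.nan):
    """Toy Cohen's kappa from a confusion matrix (illustrative only)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    p_o = np.trace(cm) / n                           # observed agreement
    p_e = (cm.sum(axis=1) @ cm.sum(axis=0)) / n**2   # expected agreement
    denom = 1.0 - p_e
    if denom == 0.0:           # the zero-division case this PR is about
        return zero_division
    return (p_o - p_e) / denom

# Normal case: some agreement and some disagreement between two raters
print(kappa_from_confusion([[2, 1], [1, 2]]))            # a finite kappa (~0.33)

# Degenerate case: only one label ever appears, so p_e == 1
print(kappa_from_confusion([[4]], zero_division=1.0))    # -> 1.0
```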


github-actions bot commented Jun 7, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 6b015c5. Link to the linter CI: here

@StefanieSenger StefanieSenger changed the title ENH Add zero_division=nan for cohen_kappa_score ENH Add zero_division param to cohen_kappa_score Jun 7, 2024
@glemaitre glemaitre self-requested a review June 10, 2024 14:53
Member

@glemaitre glemaitre left a comment


I think it looks good. Now that I see that it still emits a warning, I think we can factorize some code.


sample_weight : array-like of shape (n_samples,), default=None
Sample weights.

zero_division : {"warn", 0.0, 1.0, np.nan}, default="warn"
Sets the return value when there is a zero division, e.g. when
`y1=y2={np.ones, np.zeros}`. If set to "warn", returns 0.0 input, but a
warning is also raised.
Member


Let's express y1=y2={np.ones, np.zeros} with words instead.

Contributor Author

@StefanieSenger StefanieSenger Jun 11, 2024


To be honest, I was wondering about this part, because when I insert data like `y1=[1,1,1]`, `y2=[0,0,0]` into `test_zero_division_nan_no_warning`, the zero division problem is not triggered. In fact, after a lot of trying and thinking, it seems that `([0], [0])` and `([], [])`, as given in the test, are the only inputs that would trigger it. Since those inputs would only occur unintentionally, there would be no value in having a `zero_division` param for intentionally controlling the output. (Besides, in my understanding those edge cases would be handled by an input check.) So even after several days of trying to understand this, I am not sure that I do, which is why I didn't touch this part of the documentation.

What would you put here to express this in words instead?

Edit: I did and wrote the above before I understood that providing `labels` to get an "unfolded" confusion matrix is the only way to test this. Now I am very dizzy and need to re-evaluate another day.

Member


y1=[1,1,1] y2=[0,0,0]

In this case, the call to confusion_matrix will discover that there are two labels, 0 and 1, and that they are in complete disagreement. So it means that we get the following matrix:

In [1]: confusion_matrix([1,1,1], [0,0,0])
Out[1]:
array([[0, 0],
       [3, 0]])

So we have the expected matrix (a 2x2 matrix), and thus no zero division happens in the later statistics.

([0], [0]), ([], [])

In both of these cases, we get a degenerate matrix, because the two labels never both appear in y_true and y_pred.

In [11]: confusion_matrix([0], [0])
/Users/glemaitre/Documents/packages/scikit-learn/sklearn/metrics/_classification.py:395: UserWarning: A single label was found in 'y_true' and 'y_pred'. For the confusion matrix to have the correct shape, use the 'labels' parameter to pass all known labels.
  warnings.warn(
Out[11]: array([[1]])

In [12]: confusion_matrix([], [])
Out[12]: array([], shape=(0, 0), dtype=int64)

Those lead to a zero division later. If we pass the labels, we partially silence some of the warnings (the ones coming from confusion_matrix).
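The shape problem described above can be illustrated with a small sketch that mimics `confusion_matrix` in plain numpy (the function name is my own; this is not scikit-learn's implementation):

```python
import numpy as np

def toy_confusion_matrix(y1, y2, labels=None):
    """Count pairs (y1[i], y2[i]) into a matrix, like confusion_matrix does."""
    if labels is None:
        # Only the labels actually observed: this is what makes the
        # single-label case degenerate (a 1x1 matrix).
        labels = sorted(set(y1) | set(y2))
    idx = {lab: i for i, lab in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for a, b in zip(y1, y2):
        cm[idx[a], idx[b]] += 1
    return cm

print(toy_confusion_matrix([0], [0]))                 # degenerate 1x1: [[1]]
print(toy_confusion_matrix([0], [0], labels=[0, 1]))  # full 2x2 matrix
```

Passing `labels` explicitly is what restores the expected 2x2 shape, just as in the `confusion_matrix` warning quoted above.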

Contributor Author


Yes, I did the same check.

And my question was: if passing `([0], [0])`, `([], [])` (and obviously `([0,0], [0,0])` and so on) are the only ways a zero division can be triggered, then adding the `zero_division` param doesn't make sense here. Reason: a user who passes `y1=y2=[0]` or `y1=y2=[]` doesn't need a `zero_division` param; they didn't properly inspect their data. Those users would benefit from an input check.

So, are there non-trivial ways to trigger a zero division? (I will come back to this question on a new day.) The fate of this PR depends on it.

Member


Actually, there is a tricky case: one could collect a fresh batch of data that contains a single class, and end up with the following:

cohen_kappa_score([0, 0, 0, 0], [0, 0, 0, 0])

In this particular case, the expected call should be:

cohen_kappa_score([0, 0, 0, 0], [0, 0, 0, 0], labels=[0, 1])

to inform the metric that we should have 0/1 labels. This call currently returns nan. However, in the setting described above, we should obtain a score of 1.0, because we are classifying perfectly. So if we are able to pass zero_division:

cohen_kappa_score([0, 0, 0, 0], [0, 0, 0, 0], labels=[0, 1], zero_division=1.0)

This allows us to get the score that we would expect.
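The numbers behind this example can be sketched with plain numpy (an illustration of the intended semantics, not scikit-learn's code; the function name is hypothetical):

```python
import numpy as np

def toy_kappa(y1, y2, labels, zero_division=np.nan):
    """Illustrative kappa: build a confusion matrix over `labels`,
    then return `zero_division` when expected agreement is 1."""
    idx = {lab: i for i, lab in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)))
    for a, b in zip(y1, y2):
        cm[idx[a], idx[b]] += 1
    n = cm.sum()
    p_o = np.trace(cm) / n                           # observed agreement
    p_e = (cm.sum(axis=1) @ cm.sum(axis=0)) / n**2   # expected agreement
    if p_e == 1.0:               # the zero-division case discussed above
        return zero_division
    return (p_o - p_e) / (1.0 - p_e)

# A fresh batch containing only class 0, perfectly "classified":
print(toy_kappa([0, 0, 0, 0], [0, 0, 0, 0], labels=[0, 1], zero_division=1.0))
# -> 1.0 instead of nan
```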

As you mentioned, this is really a weird use case, and we should not treat it as normal by default: warning by default and returning 0.0 would be the safest way.

Contributor Author

@StefanieSenger StefanieSenger Jun 13, 2024


Ok, so I understand this is meant for automated tasks, where users know from experience that batches that would get a nan score are in fact mostly valid, and they therefore choose to declare them as a 1.0 score (or 0.0, or keep nan).

Member


where users know from experience that batches that would get a nan score are in fact mostly valid, and they therefore choose to declare them as a 1.0 score

Yes exactly.

StefanieSenger and others added 3 commits June 11, 2024 14:09
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@glemaitre glemaitre self-requested a review June 12, 2024 14:06
Comment on lines 681 to 683
Sets the return value when there is a zero division, e.g. when
`y1=y2={np.ones, np.zeros}`. If set to "warn", returns 0.0 input, but a
warning is also raised.
Member


I would propose the following. Basically, this is just a single example.

        Sets the return value when there is a zero division, e.g. when a single
        label (or no label) is provided in `y1` and `y2`. If set to "warn",
        returns 0.0 input, but a warning is also raised.

Contributor Author


Yes, and the single label also needs to be the zero class to trigger a zero division error. Maybe like this:

        Sets the return value when there is a zero division. This is the case when both
        labelings `y1` and `y2` exclusively contain the 0 class (e.g. `[0,0,0,0]`), or
        if both are empty. If set to "warn", returns `0.0`, but a warning is also
        raised.

Member


Looks good

Member

@glemaitre glemaitre left a comment


I think that we can avoid calling _check_division_zero twice and instead return a boolean to know whether we should early-return the value.
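The suggested factorization could look roughly like this (a hypothetical sketch; `_check_zero_division`, its signature, and the caller are my invention, not the merged code):

```python
import warnings

import numpy as np

def _check_zero_division(denominator, zero_division):
    """Hypothetical helper: return (replacement_value, should_return_early).

    Called once; the boolean tells the caller to short-circuit, so the
    zero-division check does not have to run twice.
    """
    if denominator != 0:
        return 0.0, False            # value unused; no early return
    if zero_division == "warn":
        warnings.warn("Zero division; returning 0.0.", UserWarning)
        return 0.0, True
    return float(zero_division), True  # 0.0, 1.0 or np.nan

def score(numerator, denominator, zero_division="warn"):
    """Hypothetical caller pattern using the single check above."""
    value, early = _check_zero_division(denominator, zero_division)
    if early:
        return value
    return numerator / denominator

print(score(1.0, 2.0))                        # 0.5
print(score(1.0, 0.0, zero_division=np.nan))  # nan
```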

Member

@glemaitre glemaitre left a comment


A couple of comments but it looks almost good.

Contributor Author

@StefanieSenger StefanieSenger left a comment


Thanks @glemaitre. I have moved the helper function to the front.

Member

@glemaitre glemaitre left a comment


LGTM apart from this little detail.

@glemaitre glemaitre added this to the 1.6 milestone Jun 13, 2024
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Contributor Author

@StefanieSenger StefanieSenger left a comment


I have committed your suggestions, @glemaitre. Thank you for reviewing.

@glemaitre
Member

ping @adrinjalali to have a second look at this one.

Member

@adrinjalali adrinjalali left a comment


nit, otherwise LGTM.

StefanieSenger and others added 2 commits June 18, 2024 10:15
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
@adrinjalali adrinjalali enabled auto-merge (squash) June 18, 2024 08:19
@adrinjalali adrinjalali merged commit 329a1cf into scikit-learn:main Jun 18, 2024
30 checks passed
@StefanieSenger StefanieSenger deleted the cohen_kappa_score branch June 18, 2024 10:13
@jeremiedbb jeremiedbb mentioned this pull request Jul 2, 2024
@glemaitre
Member

Great to see this one merged (I was a bit underwater). I'm trying to catch up. I'll check the follow-up PRs on this topic.
