ENH add a parameter pos_label in roc_auc_score #17594

glemaitre · 2020-06-15T08:13:18Z

Add a parameter pos_label to be able to specify the positive class in the case of binary classification.

glemaitre · 2020-06-15T08:56:39Z

Arfff, this is breaking the backward compatibility. I would consider the current behaviour as a bug because the way the positive class is inferred is really opaque. Here, it depends on the result of np.unique call on the label. When labels are given as str, it seems better to just ask for the pos_label to remove any ambiguity.

Any thoughts @ogrisel @thomasjpfan @amueller

ogrisel · 2020-06-24T13:04:10Z

Ooops, I realised I merged the wrong PR. I wanted to merge #17569 instead but I was on the wrong tab. Let me review this a posteriori...

ogrisel · 2020-06-24T13:23:39Z

Arfff, this is breaking the backward compatibility. I would consider the current behaviour as a bug because the way the positive class is inferred is really opaque. Here, it depends on the result of np.unique call on the label. When labels are given as str, it seems better to just ask for the pos_label to remove any ambiguity.

I agree we should ask for an explicit pos_label in this case. Something that is still unclear is the interaction with the use as a scorer in a grid search. Let me give it a closer look.

glemaitre · 2020-06-24T13:25:42Z

Ooops, I realised I merged the wrong PR. I wanted to merge #17569 instead but I was on the wrong tab. Let me review this a posteriori...

This is good strategy to get PR merged ;)

ogrisel · 2020-06-24T14:32:28Z

I am afraid we will have to revert this bug fix because it breaks the ability to run a GridSearchCV with scoring="roc_auc" and string class labels.

ValueError: y_true takes value in {'cancer', 'not cancer'} and pos_label is not specified: either make y_true take value in {0, 1} or {-1, 1} or pass pos_label explicitly.

when running this script with this PR:

https://gist.github.com/ogrisel/bfadf2dffef144d401b6e7c52d744219

This used to work before so this would be a regression for the model selection use case.

…17594)" This reverts commit fde9212.

thomasjpfan · 2020-06-24T14:52:26Z

I think we need pos_label=None to be a little more magical to handle the string case. (We can not use 'auto' because it is a string)

glemaitre · 2020-06-24T14:56:30Z

Fair enough. So we revert the change and I reopen a PR. We should probably add a test for the case mentioned earlier

) This reverts commit fde9212.

ogrisel · 2020-06-24T15:37:42Z

I think we need pos_label=None to be a little more magical to handle the string case. (We can not use 'auto' because it is a string)

But I am not so sure this is a good solution in general. For ROC AUC swapping the positive and negative classes is fine as the score is symmetric, but this is not the case for other binary classification metrics such as average precision on imbalanced data for instance.

Here is what I mean by "symmetric":

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score


X, y = load_breast_cancer(return_X_y=True)
idx_positive = np.flatnonzero(y == 1)
idx_negative = np.flatnonzero(y == 0)
idx_selected = np.hstack([idx_negative, idx_positive[:25]])
X, y = X[idx_selected], y[idx_selected]
X, y = shuffle(X, y, random_state=42)
X = X[:, :2]
y = np.array(
    ["cancer" if c == 1 else "not cancer" for c in y], dtype=object
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0,
)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
assert classifier.classes_.tolist() == ["cancer", "not cancer"]

y_pred = classifier.predict_proba(X_test)


def compute_score(classifier, score_func, X_test, y_test, pos_label):
    pos_idx = classifier.classes_.tolist().index(pos_label)
    score = score_func(y_test, y_pred[:, pos_idx], pos_label=pos_label)
    print(f"{score_func.__name__} with {pos_label=}: {score=:.4f}")


compute_score(classifier, roc_auc_score, X_test, y_test, "cancer")
compute_score(classifier, roc_auc_score, X_test, y_test, "not cancer")

compute_score(classifier, average_precision_score, X_test, y_test, "cancer")
compute_score(classifier, average_precision_score, X_test, y_test, "not cancer")

output:

roc_auc_score with pos_label='cancer': score=0.9568
roc_auc_score with pos_label='not cancer': score=0.9568
average_precision_score with pos_label='cancer': score=0.6339
average_precision_score with pos_label='not cancer': score=0.9953

ogrisel · 2020-06-24T15:53:09Z

So I believe the only option would be to make it possible to pass pos_label also to GridSearchCV & co (any model that accepts a scoring argument).

Furthermore, we should find a way to let GridSearchCV & co know when classification scorers are not symmetric so as to raise a ValueError when pos_label is not passed explicitly while keeping some kind of auto behavior for the symmetric ones such as roc_auc_score.

…17594)" (scikit-learn#17703) This reverts commit fde9212.

thomasjpfan · 2020-06-24T21:27:11Z

But I am not so sure this is a good solution in general. For ROC AUC swapping the positive and negative classes is fine as the score is symmetric, but this is not the case for other binary classification metrics such as average precision on imbalanced data for instance.

Good point

So I believe the only option would be to make it possible to pass pos_label also to GridSearchCV & co (any model that accepts a scoring argument).

Can this be extended to passing a "scorer_kwargs" that gets passed down to the metric? Concretely:

gs = GridSearchCV(..., scorer_kwargs={'pos_label': 'cancer'}, 
                  scorer='average_precision_scorer')

jnothman · 2020-06-24T22:23:29Z

#12385 proposes to solve the issue of users needing to pass things like labels and pos_label in a different manner. It basically means we have a function which, given a dataset, proposes a range of relevant scorers in a way that it's then easy for users to choose a subset, and ideally giving an unambiguous name and definition to each.

jnothman · 2020-06-24T22:24:19Z

And I've previously noted that the symmetry of roc_auc means that the default heuristic is fine for scoring with it without specifying pos_label.

glemaitre · 2020-06-25T08:34:29Z

And I've previously noted that the symmetry of roc_auc means that the
default heuristic is fine for scoring with it without specifying pos_label.

~~Yes so I have to revisit my initial example.~~

Oh you mean in the grid-search probably.

…17594)" (scikit-learn#17703) This reverts commit fde9212.

glemaitre added 2 commits June 15, 2020 10:11

ENH add a parameter pos_label in roc_auc_score

db390c2

add documentation

0e2937b

github-actions bot added the module:metrics label Jun 15, 2020

glemaitre added 2 commits June 15, 2020 11:18

TST pass pos_label with str in common test

7f4fa45

add versionadded

d6b4d9e

thomasjpfan mentioned this pull request Jun 17, 2020

EHN Provide a pos_label parameter in plot_precision_recall_curve #17569

Merged

glemaitre mentioned this pull request Jun 22, 2020

[MRG] EHN Provide a pos_label parameter in plot_roc_curve #17651

Merged

ogrisel merged commit fde9212 into scikit-learn:master Jun 24, 2020

ogrisel mentioned this pull request Jun 24, 2020

[MRG] Implement calibration loss metrics #11096

Open

ogrisel added a commit to ogrisel/scikit-learn that referenced this pull request Jun 24, 2020

Revert "ENH add a parameter pos_label in roc_auc_score (scikit-learn#…

6385acd

…17594)" This reverts commit fde9212.

ogrisel added a commit to ogrisel/scikit-learn that referenced this pull request Jun 24, 2020

Revert "ENH add a parameter pos_label in roc_auc_score (scikit-learn#…

f156344

…17594)" This reverts commit fde9212.

ogrisel mentioned this pull request Jun 24, 2020

Revert "ENH add a parameter pos_label in roc_auc_score (#17594)" #17703

Merged

thomasjpfan mentioned this pull request Jun 24, 2020

Issue in roc_auc_score which make wrong assumption of positive class #17572

Closed

glemaitre pushed a commit that referenced this pull request Jun 24, 2020

Revert "ENH add a parameter pos_label in roc_auc_score (#17594)" (#17703

59249d7

) This reverts commit fde9212.

rubywerman pushed a commit to MLH-Fellowship/scikit-learn that referenced this pull request Jun 24, 2020

ENH add a parameter pos_label in roc_auc_score (scikit-learn#17594)

7df0dcd

rubywerman pushed a commit to MLH-Fellowship/scikit-learn that referenced this pull request Jun 24, 2020

Revert "ENH add a parameter pos_label in roc_auc_score (scikit-learn#…

f443c6a

…17594)" (scikit-learn#17703) This reverts commit fde9212.

glemaitre mentioned this pull request Aug 5, 2020

ENH add a parameter pos_label in roc_auc_score #17704

Closed

jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020

ENH add a parameter pos_label in roc_auc_score (scikit-learn#17594)

f1adf7e

jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020

Revert "ENH add a parameter pos_label in roc_auc_score (scikit-learn#…

2c87874

…17594)" (scikit-learn#17703) This reverts commit fde9212.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH add a parameter pos_label in roc_auc_score #17594

ENH add a parameter pos_label in roc_auc_score #17594

glemaitre commented Jun 15, 2020

glemaitre commented Jun 15, 2020

ogrisel commented Jun 24, 2020

ogrisel commented Jun 24, 2020

glemaitre commented Jun 24, 2020

ogrisel commented Jun 24, 2020

thomasjpfan commented Jun 24, 2020

glemaitre commented Jun 24, 2020

ogrisel commented Jun 24, 2020 •

edited

Loading

ogrisel commented Jun 24, 2020

thomasjpfan commented Jun 24, 2020

jnothman commented Jun 24, 2020 via email

jnothman commented Jun 24, 2020 via email

glemaitre commented Jun 25, 2020 •

edited

Loading

ENH add a parameter pos_label in roc_auc_score #17594

ENH add a parameter pos_label in roc_auc_score #17594

Conversation

glemaitre commented Jun 15, 2020

glemaitre commented Jun 15, 2020

ogrisel commented Jun 24, 2020

ogrisel commented Jun 24, 2020

glemaitre commented Jun 24, 2020

ogrisel commented Jun 24, 2020

thomasjpfan commented Jun 24, 2020

glemaitre commented Jun 24, 2020

ogrisel commented Jun 24, 2020 • edited Loading

ogrisel commented Jun 24, 2020

thomasjpfan commented Jun 24, 2020

jnothman commented Jun 24, 2020 via email

jnothman commented Jun 24, 2020 via email

glemaitre commented Jun 25, 2020 • edited Loading

ogrisel commented Jun 24, 2020 •

edited

Loading

glemaitre commented Jun 25, 2020 •

edited

Loading