ENH add a parameter pos_label in roc_auc_score #17704
So here is a proposal that handles it:
sklearn/metrics/_scorer.py (outdated)

```diff
@@ -296,6 +302,13 @@ def _score(self, method_caller, clf, X, y, sample_weight=None):
             y_pred = method_caller(clf, "predict", X)
         else:
             try:
+                if (
+                    y_type == "binary"
```
So here, we could have a `ScorerProperty` defining that the score is symmetric and requires `pos_label`, instead of hard-coding `roc_auc_score`.
Thank you for working on this @glemaitre !
@jnothman I agree with your argument but there is still something to solve here:

```python
import numpy as np

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)

# create a highly imbalanced version of the dataset
idx_positive = np.flatnonzero(y == 1)
idx_negative = np.flatnonzero(y == 0)
idx_selected = np.hstack([idx_negative, idx_positive[:25]])
X, y = X[idx_selected], y[idx_selected]
X, y = shuffle(X, y, random_state=42)

# only use 2 features to make the problem even harder
X = X[:, :2]
y = np.array(
    ["cancer" if c == 1 else "not cancer" for c in y], dtype=object
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0,
)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# sanity check to be sure the positive class is classes_[0] and that we
# are betrayed by the class imbalance
assert classifier.classes_.tolist() == ["cancer", "not cancer"]

y_pred = classifier.predict_proba(X_test)
y_pred_pos = y_pred[:, 0]
roc_auc_score(y_test, y_pred_pos)
```

So here the usage is fine but the result is incorrect. In this case, the issue comes from the wrong assumption made by the underlying `roc_auc_score`. Basically, we pass the probability of the positive class (`y_pred[:, 0]`, the probability of `"cancer"`) while the metric assumes it receives the scores of the class with the greater label. So here, I am not sure how we can solve the problem without introducing a `pos_label` parameter.
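As an illustration only (a sketch reusing the variables above, not part of this PR), the intended semantics can already be obtained from `roc_curve` with its `pos_label` argument, combined with `auc`:

```python
from sklearn.metrics import auc, roc_curve

# y_pred_pos is the probability of "cancer" (classes_[0]); stating which
# class the scores refer to removes the ambiguity that roc_auc_score
# currently resolves with the "greater label" convention.
fpr, tpr, _ = roc_curve(y_test, y_pred_pos, pos_label="cancer")
auc(fpr, tpr)  # same value as roc_auc_score(y_test, y_pred[:, 1])
```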
I am fine with adding `pos_label` to `roc_auc_score` and `roc_curve`. But that doesn't require modifying the scorer.
Basically, this is where it becomes tricky. It will involve a regression as we saw there: #17594. As mentioned, we have 2 solutions:
Pinging @adrinjalali since it would also be nice to have your thoughts.
```diff
+                    self._score_func.__name__ == "roc_auc_score"
+                    and "pos_label" not in self._kwargs
```
> we add the pos_label (with the mutable aspects discussed in the other PR)

I am okay with this, with a `symmetric` property on `_BaseScorer` that defaults to `False`. This way, we can be generic and not depend on the name of the score function.
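A minimal sketch of that idea (hypothetical names, not the actual diff):

```python
# Hypothetical sketch: a class-level flag on the scorer replaces the
# hard-coded check on the metric's __name__ shown in the diff above.
class _BaseScorer:
    # True when the metric's value does not depend on which class is
    # treated as positive (e.g. ROC AUC); defaults to False
    symmetric = False


# A scorer built around roc_auc_score would set `symmetric = True`, and the
# condition from the diff would become:
#     if y_type == "binary" and self.symmetric and "pos_label" not in self._kwargs:
```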
My understanding is that there is currently no way to define the positive class in `roc_auc_score`. We could add an argument that allows you to specify the semantically positive class, but it should definitely not be called `pos_label`. The code in your comment just has a bug: you always need to slice the column that corresponds to the class with the greater label, i.e. pass `y_pred[:, 1]`.
OK, so I got a couple of things wrong then.
So I assume that you mean that I should have done `roc_auc_score(y_test, y_pred[:, 1])`. To be honest, I find this really confusing. It is true that the documentation does not say to pass the probability of the positive class, but it is far from clear what to slice indeed:
I think that I was even more confused by that.
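For what it is worth, a small sketch (reusing the example above) that makes the slicing explicit instead of relying on the column-order convention:

```python
import numpy as np

# predict_proba columns follow the order of classifier.classes_, and the
# documentation asks for the scores of the class with the greater label,
# which is classes_[-1] since classes_ is sorted.
label_greater = classifier.classes_[-1]  # "not cancer"
col = int(np.flatnonzero(classifier.classes_ == label_greater)[0])  # 1
roc_auc_score(y_test, classifier.predict_proba(X_test)[:, col])
```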
In precision, recall, etc., `pos_label` references the semantically positive class, or the class of interest; it must, since there's no probabilistic output to correspond with. Here, I agree, it indicates the correspondence between the categorical and continuous representations.

Since this issue pertains only to "thresholded" classification scorers, we can certainly extract the positive class from `classes_` and pass it to `pos_label` of the metric, as long as we can identify that the metric accepts `pos_label`. I don't think that depends on symmetry, except insofar as for non-symmetric thresholded scores, you might want to allow the user to specify the "semantically positive class".
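A rough sketch of that mechanism (a hypothetical helper, not scikit-learn code): extract the positive class from `classes_` and forward it to the metric only when the metric's signature accepts `pos_label`:

```python
from inspect import signature

import numpy as np


def thresholded_score(clf, X, y, score_func, pos_label=None, **kwargs):
    """Hypothetical helper: score a fitted binary classifier with a metric
    that consumes continuous scores."""
    if pos_label is None:
        # by convention, column 1 of predict_proba refers to classes_[1]
        pos_label = clf.classes_[1]
    col = int(np.flatnonzero(clf.classes_ == pos_label)[0])
    y_score = clf.predict_proba(X)[:, col]
    if "pos_label" in signature(score_func).parameters:
        # only forward pos_label when the metric actually accepts it
        kwargs["pos_label"] = pos_label
    return score_func(y, y_score, **kwargs)
```

With `average_precision_score`, which already accepts `pos_label`, this works for either class; with `roc_auc_score` it is only correct when the positive class happens to be `classes_[1]`, which is exactly the gap discussed here.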
We face a related problem for the calibration error, I believe: #11096.
I am confused here. I think that a concrete example would help. I wrote the following tests: https://github.com/scikit-learn/scikit-learn/pull/18107/files#diff-fcdae0622eeb4bf500b43048996b2af5R774-R828
Oh, now I see that this is really written in the documentation. It should be in bold :)
OK, so it seems that I figured out some of the stuff. I will close all my PRs and open the following:
closes #17572
Add a parameter pos_label to be able to specify the positive class in the case of binary classification.
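For instance, a sketch of the intended usage (reusing the breast-cancer example from the discussion; the parameter is the one proposed here and does not exist yet):

```python
# the caller states explicitly which class the scores refer to, instead of
# relying on the "greater label" convention
y_score_cancer = classifier.predict_proba(X_test)[:, 0]  # probability of "cancer"
roc_auc_score(y_test, y_score_cancer, pos_label="cancer")
```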
We should handle the use case when `GridSearchCV` is used together with `roc_auc`.
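A sketch of that use case (illustrative parameter grid, not taken from the PR):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)
# string labels, so the scorer has to work out which class the
# continuous output refers to
y = np.array(["cancer" if c == 1 else "not cancer" for c in y], dtype=object)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring="roc_auc",
)
grid.fit(X, y)
grid.best_score_
```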