
ENH add a parameter pos_label in roc_auc_score #17704


Closed
wants to merge 18 commits

Conversation

glemaitre
Member

closes #17572

Add a pos_label parameter to make it possible to specify the positive class in the binary classification case.

We should also handle the use case where GridSearchCV is used together with scoring='roc_auc'.
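
A minimal sketch of the intended usage under this proposal (the pos_label argument below is what this PR would add; it is not part of the existing roc_auc_score signature):

    from sklearn.metrics import roc_auc_score

    y_true = ["cancer", "not cancer", "cancer", "not cancer"]
    # probability of the "cancer" class for each sample
    y_score = [0.9, 0.2, 0.6, 0.4]
    roc_auc_score(y_true, y_score, pos_label="cancer")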

@glemaitre
Member Author

So here is a proposal that handles roc_auc in the grid-search as well.

@glemaitre
Member Author

ping @thomasjpfan @ogrisel @jnothman

@@ -296,6 +302,13 @@ def _score(self, method_caller, clf, X, y, sample_weight=None):
            y_pred = method_caller(clf, "predict", X)
        else:
            try:
                if (
                    y_type == "binary"
Member Author

So here, we could have a ScorerProperty defining whether the score is symmetric and requires pos_label, instead of hard-coding roc_auc_score.

@thomasjpfan
Member

Thank you for working on this @glemaitre !

@glemaitre
Member Author

glemaitre commented Aug 5, 2020

Firstly: I remain unconvinced that there is a problem with users getting
incorrect roc auc results when using the standard scorer: our convention is
clearly to encode probabilities to match classes_, which should be sorted.
I think that the need to allow for pos_label in roc_auc_score, where we do
not explicitly require the input to come from a scikit-learn-compatible
classifier is reasonable and separate.

@jnothman I agree with your argument, but there is still something to solve here:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.utils import shuffle
    from sklearn.metrics import roc_auc_score

    X, y = load_breast_cancer(return_X_y=True)
    # create a highly imbalanced version of the dataset
    idx_positive = np.flatnonzero(y == 1)
    idx_negative = np.flatnonzero(y == 0)
    idx_selected = np.hstack([idx_negative, idx_positive[:25]])
    X, y = X[idx_selected], y[idx_selected]
    X, y = shuffle(X, y, random_state=42)
    # only use 2 features to make the problem even harder
    X = X[:, :2]
    y = np.array(
        ["cancer" if c == 1 else "not cancer" for c in y], dtype=object
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0,
    )

    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)

    # sanity check: the positive class is classes_[0], so the usual
    # "positive class == greater label" assumption does not hold here
    assert classifier.classes_.tolist() == ["cancer", "not cancer"]
    
    y_pred = classifier.predict_proba(X_test)
    y_pred_pos = y_pred[:, 0]
    roc_auc_score(y_test, y_pred_pos)

So here the usage is fine but the result is incorrect. The issue comes from a wrong assumption made in the underlying roc_curve call.

Basically, we pass y_pred_pos, which is column #0, and y_test will be encoded in the same order as the columns of y_pred.
The AUC is computed with roc_curve, which is equivalent to calling roc_curve(y_test, y_pred_pos). However, when pos_label is not given, roc_curve assumes that 1 (i.e. the greater label) is the positive class, which is not the case with our encoding.
roc_curve(y_test, y_pred_pos, pos_label=0) on the encoded labels would lead to the right result.
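
For reference, using the variables from the snippet above, the correct value can already be obtained today by going through roc_curve with an explicit pos_label and computing the area with auc:

    from sklearn.metrics import auc, roc_curve

    # y_pred_pos is the probability of "cancer" (classes_[0]); telling
    # roc_curve which label is positive yields the correct AUC
    fpr, tpr, _ = roc_curve(y_test, y_pred_pos, pos_label="cancer")
    auc(fpr, tpr)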

So I am not sure how we can solve the problem without introducing a pos_label parameter: since the user already gave us the positive-class column, we have no way to recover this information if it is not provided.

@jnothman
Member

jnothman commented Aug 5, 2020

So I am not sure how we can solve the problem without introducing a pos_label parameter: since the user already gave us the positive-class column, we have no way to recover this information if it is not provided.

I am fine with adding pos_label to roc_auc_score and roc_curve. But that doesn't require modifying the scorer.

@glemaitre
Member Author

glemaitre commented Aug 5, 2020

I am fine with adding pos_label to roc_auc_score and roc_curve.

Basically, roc_curve already has a pos_label parameter, so it is just a matter of passing pos_label through from roc_auc_score to roc_curve.
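
As a rough sketch (hypothetical, not the actual scikit-learn implementation), the binary branch of roc_auc_score could simply forward the new parameter:

    from sklearn.metrics import auc, roc_curve

    # hypothetical helper: forward pos_label to roc_curve and compute the
    # area under the resulting curve
    def _binary_roc_auc(y_true, y_score, pos_label=None, sample_weight=None):
        fpr, tpr, _ = roc_curve(
            y_true, y_score, pos_label=pos_label, sample_weight=sample_weight
        )
        return auc(fpr, tpr)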

But that doesn't require modifying the scorer.

This is where it becomes tricky. It would involve a regression, as we saw in #17594:
roc_auc_scorer (returned when using scoring='roc_auc') will not have any pos_label.
Therefore, the above example will fail if the classifier is used within a GridSearchCV.

As mentioned, we have 2 solutions:

  • The user passes make_scorer(roc_auc_score, pos_label="cancer") (see the sketch after this list), but we still don't support scoring='roc_auc';
  • Or, since the score is symmetric, we can modify _ThresholdScorer to make the appropriate column selection of y_pred depending on the given y_true:
    • we add the pos_label (with the mutable aspects discussed in the other PR), or
    • we encode y_true in _ThresholdScorer. However, this encoding should only be done for scorers with a symmetric scoring function; introducing it for f1_score would create a bug. So here we would need some kind of scorer property?
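
A sketch of the first option, assuming roc_auc_score gained the pos_label parameter proposed here and reusing the classifier from the example above (make_scorer stores extra keyword arguments and forwards them to the score function at scoring time):

    from sklearn.metrics import make_scorer, roc_auc_score
    from sklearn.model_selection import GridSearchCV

    # hypothetical until roc_auc_score accepts pos_label: the kwarg is stored
    # by make_scorer and forwarded when the scorer is called
    scorer = make_scorer(roc_auc_score, needs_threshold=True, pos_label="cancer")
    grid_search = GridSearchCV(
        classifier, param_grid={"C": [0.1, 1, 10]}, scoring=scorer
    )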

@glemaitre
Member Author

Pinging @adrinjalali since it would also be nice to have your thoughts.

Comment on lines +318 to +319
                    self._score_func.__name__ == "roc_auc_score"
                    and "pos_label" not in self._kwargs
Member

we add the pos_label (with the mutable aspects discussed in the other PR)

I am okay with this if we add a symmetric property to _BaseScorer that defaults to False. This way, we can be generic and not depend on the name of the score function.
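
As a rough illustration (hypothetical, not actual scikit-learn code), the flag could simply be stored on the base scorer and inspected by _ThresholdScorer before deciding how to select or re-encode the inputs:

    # hypothetical sketch of a "symmetric" flag on the base scorer class
    class _BaseScorer:
        def __init__(self, score_func, sign, kwargs, symmetric=False):
            self._score_func = score_func
            self._sign = sign
            self._kwargs = kwargs
            # True when swapping the two classes only flips the score, so a
            # thresholded scorer may safely re-encode y_true
            self._symmetric = symmetric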

@amueller
Member

amueller commented Aug 5, 2020

My understanding is that pos_label does not describe the semantics of the problem, but the semantics of the predict_proba/decision_function output that you passed. If you have two classes, 'a' and 'b', and a decision function, then pos_label='a' means that high values of the decision function are supposed to correspond to the class 'a' being likely.
The pos_label parameter was introduced so we don't need to make assumptions about the order of classes. It has nothing to do with which class you consider semantically positive; it tells you which string label corresponds to high values in the classifier, i.e. the order of entries in classes_. That is particularly required if your test set only has one label, so you cannot tell whether it is the positive or the negative one. This was the original motivation for pos_label.

There is currently no way to define the positive class in roc_auc_score, except by changing y to y == pos_label before training the model, which is actually the easiest fix that will make everything work, in particular with scorers.
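
For completeness, a minimal sketch of that workaround on the example earlier in this thread (recoding y so that the semantically positive class becomes the greater label):

    # recode the labels to booleans so that the positive class ("cancer")
    # becomes True, i.e. the greater label; the default conventions of
    # roc_auc_score and of the 'roc_auc' scorer then apply directly
    y_train_bin = y_train == "cancer"
    y_test_bin = y_test == "cancer"

    classifier = LogisticRegression().fit(X_train, y_train_bin)
    roc_auc_score(y_test_bin, classifier.predict_proba(X_test)[:, 1])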

We could add an argument that allows you to specify the semantically positive class, but it should definitely not be called pos_label which already has this different meaning.

The code in your comment just has a bug: you always need to pass the column of the greater label (y_pred[:, 1]) to roc_auc_score, and sklearn has no way to do what you'd like directly.

@glemaitre
Member Author

OK, so I got a couple of things wrong then.

The code in your comment just has a bug: you always need to pass the column of the greater label (y_pred[:, 1]) to roc_auc_score, and sklearn has no way to do what you'd like directly.

So I assume that you mean that I should have done:

roc_auc_score(y_test, y_pred[:, 1])

To be honest, I find this really confusing. It is true that the documentation does not say to pass the probability of the positive class, but it is far from clear which column to slice:

In the binary and multilabel cases, these can be either probability estimates or non-thresholded decision values (as returned by decision_function on some classifiers).

I think I was even more confused because average_precision_score explicitly requires the probability of the positive class. Would it make sense to be consistent about which column to select, and have the scores always expect the probability of the positive class?
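
To make the contrast concrete, reusing the fitted classifier from the example above (average_precision_score lets you name the positive class, while roc_auc_score expects the scores of the class with the greater label):

    from sklearn.metrics import average_precision_score, roc_auc_score

    proba = classifier.predict_proba(X_test)
    # average precision: pass the positive-class column and name it explicitly
    average_precision_score(y_test, proba[:, 0], pos_label="cancer")
    # ROC AUC: pass the column of the greater label, here "not cancer"
    roc_auc_score(y_test, proba[:, 1])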

@jnothman
Member

jnothman commented Aug 5, 2020 via email

@ogrisel
Member

ogrisel commented Aug 6, 2020

We face a related problem for the calibration error I believe: #11096.

@glemaitre
Member Author

I don't think that depends on symmetry, except insofar as for
non-symmetric thresholded scores, you might want to allow the user to
specify the "semantically positive class".

I am confused here. I think that a concrete example would help.
From what I see, average precision and ROC AUC are the only metrics used to create "thresholded" scorers.
If I understand your comment correctly, the average precision scorer would be a non-symmetric thresholded scorer and you would actually need to pass pos_label.

I wrote the following tests: https://github.com/scikit-learn/scikit-learn/pull/18107/files#diff-fcdae0622eeb4bf500b43048996b2af5R774-R828
to illustrate the behaviour that I would expect. However, for the ROC I am using the positive class while @amueller mentioned earlier that one should use y_pred[:, 1] in all cases.

@glemaitre
Member Author

Oh, now I see that this is actually written in the documentation:

The binary case expects a shape (n_samples,), and the scores must be the scores of the class with the greater label.

It should be in bold :)

@amueller
Member

amueller commented Aug 6, 2020

Of course @jnothman is right, and actually both meanings of pos_label seem to be currently present in the code-base.
And I agree, that is very confusing. I commented in #18101.

@glemaitre
Member Author

OK, so it seems that I have figured some things out. I will close all my PRs and open the following:

  • Improve the documentation of roc_auc_score. There is actually no bug there, but the documentation could be more explicit. (We might rediscuss the semantics of y_score, but that would require much more work and API changes);
  • Solve the issue in the Scorer classes so that pos_label is taken into account when it is passed to make_scorer;
  • Improve the documentation regarding the last point.

glemaitre closed this Aug 6, 2020