Asymmetry of roc_auc_score #10247

Closed
mzoll opened this issue Dec 4, 2017 · 10 comments · Fixed by #14358

@mzoll

mzoll commented Dec 4, 2017

Description

For binary tasks, roc_auc_score behaves unexpectedly when wrapped as a custom scoring function via the metrics.make_scorer scheme. When the probability output of a binary classifier is requested, it comes as a shape (n, 2) array, while the training/testing labels are a shape (n,) input; scoring then fails.
However, the binary task and the handling of differently shaped y inputs are incidentally understood correctly by metrics.log_loss in its internal evaluation, whereas roc_auc_score currently fails at this. This is especially cumbersome when the scoring function is wrapped inside a cross_val_score and a make_scorer much deeper in the code, possibly with nested pipelines etc., where automatic and correct evaluation of this particular metric is required.

Steps/Code to Reproduce

This should illustrate what is failing:

import numpy as np
from sklearn.metrics import roc_auc_score, log_loss
y_true0 = np.array([False, False, True, True])
y_true1 = ~y_true0
y_true = np.matrix([y_true0, y_true1]).T

y_proba0 = np.array([0.1, 0.4, 0.35, 0.8]) #predict_proba component [:,0]
y_proba1 = 1 - y_proba0 #predict_proba component [:,1]
y_proba = np.matrix([y_proba0, y_proba1]).T #as obtained by classifier.predict_proba()

log_loss(y_true1, y_proba1) # compute for positive class component >>> OK
log_loss(y_true, y_proba) # compute for all class >>> OK
log_loss(y_true1, y_proba) # compute for mixed component >>> OK

roc_auc_score(y_true1, y_proba1) # compute for positive class component >>> OK
roc_auc_score(y_true, y_proba) # compute for all class >>> OK
roc_auc_score(y_true1, y_proba) # compute for mixed component >>> FAIL: bad input shape (4, 2)

# The last line above is the source of the error in this snippet of a binary classification and scoring task
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
    max_depth=1, random_state=0).fit(X_train, y_train)

from sklearn.metrics import make_scorer, roc_auc_score
make_scorer(roc_auc_score, needs_proba=True)(clf, X_test, y_test) # >>> FAIL: bad input shape (1000, 2)
#compare
make_scorer(log_loss, greater_is_better=True, needs_proba=True)(clf, X_test, y_test) # >>> OK

Expected Results

roc_auc_score should behave similarly to log_loss, inferring the binary classification task and handling differently shaped inputs correctly.
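
A minimal sketch of a thin wrapper that would give this behaviour (the helper name roc_auc_proba is hypothetical; it simply selects the positive-class column of a (n, 2) probability array before calling roc_auc_score):

import numpy as np
from sklearn.metrics import roc_auc_score

def roc_auc_proba(y_true, y_proba, **kwargs):
    # If y_proba is the full (n, 2) predict_proba output, keep only the
    # positive-class column before delegating to roc_auc_score.
    y_proba = np.asarray(y_proba)
    if y_proba.ndim == 2 and y_proba.shape[1] == 2:
        y_proba = y_proba[:, 1]
    return roc_auc_score(y_true, y_proba, **kwargs)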

Versions

Windows-10-10.0.15063-SP0
Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
NumPy 1.13.3
SciPy 0.19.1
Scikit-Learn 0.19.1

@jnothman
Member

jnothman commented Dec 5, 2017

Yes, roc_auc_score(y_true1, y_proba) fails, but make_scorer(roc_auc_score, needs_proba=True)(clf, X_test, y_test) does not. Both are in accordance with documented and expected behaviour.

@jnothman jnothman closed this as completed Dec 5, 2017
@mzoll
Author

mzoll commented Dec 5, 2017

I do not understand how you find that make_scorer(roc_auc_score, needs_proba=True)(clf, X_test, y_test) behaves correctly. Executing this line, I get the following (full trace):

ValueError                                Traceback (most recent call last)
<ipython-input-4-ae795709508b> in <module>()
     31 
     32 from sklearn.metrics import make_scorer, roc_auc_score
---> 33 make_scorer(roc_auc_score, needs_proba=True)(clf, X_test, y_test) # >>> FAIL: bad input shape (1000, 2)
     34 #compare
     35 make_scorer(log_loss, greater_is_better=True, needs_proba=True)(clf, X_test, y_test) # >>> OK

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\scorer.py in __call__(self, clf, X, y, sample_weight)
    142                                                  **self._kwargs)
    143         else:
--> 144             return self._sign * self._score_func(y, y_pred, **self._kwargs)
    145 
    146     def _factory_args(self):

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\ranking.py in roc_auc_score(y_true, y_score, average, sample_weight)
    275     return _average_binary_score(
    276         _binary_roc_auc_score, y_true, y_score, average,
--> 277         sample_weight=sample_weight)
    278 
    279 

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\base.py in _average_binary_score(binary_metric, y_true, y_score, average, sample_weight)
     73 
     74     if y_type == "binary":
---> 75         return binary_metric(y_true, y_score, sample_weight=sample_weight)
     76 
     77     check_consistent_length(y_true, y_score, sample_weight)

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\ranking.py in _binary_roc_auc_score(y_true, y_score, sample_weight)
    270 
    271         fpr, tpr, tresholds = roc_curve(y_true, y_score,
--> 272                                         sample_weight=sample_weight)
    273         return auc(fpr, tpr, reorder=True)
    274 

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\ranking.py in roc_curve(y_true, y_score, pos_label, sample_weight, drop_intermediate)
    532     """
    533     fps, tps, thresholds = _binary_clf_curve(
--> 534         y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)
    535 
    536     # Attempt to drop thresholds corresponding to points in between and

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\ranking.py in _binary_clf_curve(y_true, y_score, pos_label, sample_weight)
    320     check_consistent_length(y_true, y_score, sample_weight)
    321     y_true = column_or_1d(y_true)
--> 322     y_score = column_or_1d(y_score)
    323     assert_all_finite(y_true)
    324     assert_all_finite(y_score)

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in column_or_1d(y, warn)
    612         return np.ravel(y)
    613 
--> 614     raise ValueError("bad input shape {0}".format(shape))
    615 
    616 

ValueError: bad input shape (10000, 2)

While I understand that this is in accordance with the function documentation, which requires y_true and y_score to have the same shape (i.e. both are per-class predictions), this is explicitly not the format in which a classifier expects the y_true input, namely a single-column array of class labels (compare the docs). This effectively prohibits the use of roc_auc_score in the depicted scheme of cross-validation, pipelines, etc.

@jnothman
Member

jnothman commented Dec 5, 2017

Oh. You're right. Although this issue is fixed in master. That's quite nasty. I'll need to track down what fixed this...

@jnothman
Member

jnothman commented Dec 5, 2017

This behaviour changed in master in ee2025f.

The issue is that the 'roc_auc' scorer is defined internally with needs_threshold=True. needs_proba means it strictly needs probability, not some arbitrary threshold.

We should probably consider some documentation improvements, if not changing the interface.
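
As a side note, users who just want ROC AUC in cross-validation can rely on the built-in scorer rather than wrapping roc_auc_score themselves; a minimal sketch using sklearn.metrics.get_scorer (names clf, X_test, y_test as in the snippet above):

from sklearn.metrics import get_scorer

# The built-in 'roc_auc' scorer is constructed with needs_threshold=True, so it
# feeds a 1d decision score (or positive-class probability) into roc_auc_score.
roc_auc = get_scorer('roc_auc')
score = roc_auc(clf, X_test, y_test)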

@jnothman jnothman reopened this Dec 5, 2017
@qinhanmin2014
Member

@mzoll The problem is fixed in #9521. Now you get the same result using _ProbaScorer (by setting needs_proba=True) and _ThresholdScorer (by setting needs_threshold=True).

make_scorer(roc_auc_score, needs_proba=True)(clf, X_test, y_test)
0.97641161457146342
make_scorer(roc_auc_score, needs_threshold=True)(clf, X_test, y_test)
0.97641161457146342

But according to the doc of make_scorer, it is recommended to set needs_threshold=True. Here is the definition from scikit-learn:

roc_auc_scorer = make_scorer(roc_auc_score, greater_is_better=True,
                             needs_threshold=True)

Regarding the docs, there is already some information in the make_scorer doc and the user guide. Maybe we still need more explanation about _ProbaScorer (needs_proba) and _ThresholdScorer (needs_threshold); at least I don't quite understand why we need _ProbaScorer. According to the code, it seems that it could be replaced by _ThresholdScorer?

@jnothman
Member

jnothman commented Dec 5, 2017

Well, there are certainly metrics that can't take an unnormalised decision function, but they may not necessitate a separate class.
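
To make the distinction concrete, a minimal sketch (not from the thread) using the needs_proba/needs_threshold flags of make_scorer as discussed above:

from sklearn.metrics import make_scorer, log_loss, roc_auc_score

# needs_proba=True: the scorer passes predict_proba output, which log_loss
# requires, since it only makes sense for actual probabilities.
log_loss_scorer = make_scorer(log_loss, greater_is_better=False, needs_proba=True)

# needs_threshold=True: the scorer passes decision_function output (falling back
# to predict_proba); any monotone score suffices for a ranking metric like ROC AUC.
roc_auc_scorer = make_scorer(roc_auc_score, needs_threshold=True)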

@qinhanmin2014
Member

@jnothman Thanks a lot for the clarification :)
Indeed, I've found the following in the docstrings of the three scorers:

_PredictScorer: Evaluate predicted target values for X relative to y_true.
_ProbaScorer: Evaluate predicted probabilities for X relative to y_true.
_ThresholdScorer: Evaluate decision function output for X relative to y_true.

I'm +1 for improving the doc and the user guide.
Also, in the long run, I agree that we might combine the three classes to avoid duplicate code. (e.g., we still use predict to handle regression in _ThresholdScorer even if we have clearly stated that it only works for classification)

@SummerPapaya

@qinhanmin2014 @jnothman Hi, I came across the same problem. Thanks for your updates, I got the AUC score using make_scorer. But when I use roc_curve to plot the ROC curve, the same issue occurred:

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
in ()
4 train_labels=train_label,
5 test_features=test_feature,
----> 6 test_labels=test_label)

in train_predict_evaluate_model(classifier, train_features, train_labels, test_features, test_labels)
10 m=metrics.confusion_matrix(test_labels,predictions)
11 auc = make_scorer(roc_auc_score, needs_threshold=True)(classifier, test_feature, test_label)
---> 12 fpr, tpr, thresholds = roc_curve(test_labels, clf_probs)
13 logloss = log_loss(test_labels, clf_probs)
14 print(report)

~/anaconda3/lib/python3.6/site-packages/sklearn/metrics/ranking.py in roc_curve(y_true, y_score, pos_label, sample_weight, drop_intermediate)
532 """
533 fps, tps, thresholds = _binary_clf_curve(
--> 534 y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)
535
536 # Attempt to drop thresholds corresponding to points in between and

~/anaconda3/lib/python3.6/site-packages/sklearn/metrics/ranking.py in _binary_clf_curve(y_true, y_score, pos_label, sample_weight)
320 check_consistent_length(y_true, y_score, sample_weight)
321 y_true = column_or_1d(y_true)
--> 322 y_score = column_or_1d(y_score)
323 assert_all_finite(y_true)
324 assert_all_finite(y_score)

~/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
612 return np.ravel(y)
613
--> 614 raise ValueError("bad input shape {0}".format(shape))
615
616

ValueError: bad input shape (134, 2)

Do you have any solution to this problem? Please kindly advise. Thank you in advance!

@qinhanmin2014
Member

@qinhanmin2014 @jnothman Hi, I came across the same problem. Thanks for your updates, I got the AUC score using make_scorer. But when I use roc_curve to plot the ROC curve, the same issue occurred:

Please provide self-contained example code, including imports and data (if possible), so that other contributors can just run it and reproduce your issue. Ideally your example code should be minimal.

@SummerPapaya

@qinhanmin2014 @jnothman Hi, I came across the same problem. Thanks for your updates, I got the AUC score using make_scorer. But when I use roc_curve to plot the ROC curve, the same issue occurred:

Please provide self-contained example code, including imports and data (if possible), so that other contributors can just run it and reproduce your issue. Ideally your example code should be minimal.

I just figured out the problem. I didn't notice that there are two columns in the result of predict_proba(), so the full result cannot be passed to roc_curve(). The problem was solved by selecting the second column of predict_proba() as y_score. Thanks!
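
For reference, a minimal sketch of that fix (variable names follow the earlier snippet: clf, X_test, y_test):

from sklearn.metrics import roc_curve

# predict_proba returns one column per class; keep only the positive-class column.
clf_probs = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, clf_probs)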
