Asymmetry of roc_auc_score #10247

Closed
mzoll opened this issue Dec 4, 2017 · 10 comments · Fixed by #14358

@mzoll

mzoll commented Dec 4, 2017

Description

For binary tasks, roc_auc_score behaves unexpectedly when wrapped as a custom scoring function via the metrics.make_scorer scheme. When the probability output of a binary classifier is requested, it comes as a shape (n, 2) array, while the training/testing labels are a shape (n,) input; scoring then fails.
However, the binary task and the handling of differently shaped y inputs are incidentally understood correctly by metrics.log_loss in its internal evaluation, whereas roc_auc_score currently fails at this. This is especially cumbersome when the scoring function is wrapped inside a cross_val_score and a make_scorer much deeper in the code, possibly with nested pipelines etc., where automatic and correct evaluation of this particular metric is required.

Steps/Code to Reproduce

This should illustrate what is failing:

import numpy as np
from sklearn.metrics import roc_auc_score, log_loss
y_true0 = np.array([False, False, True, True])
y_true1 = ~y_true0
y_true = np.matrix([y_true0, y_true1]).T

y_proba0 = np.array([0.1, 0.4, 0.35, 0.8]) #predict_proba component [:,0]
y_proba1 = 1 - y_proba0 #predict_proba component [:,1]
y_proba = np.matrix([y_proba0, y_proba1]).T #as obtained by classifier.predict_proba()

log_loss(y_true1, y_proba1) # compute for positive class component >>> OK
log_loss(y_true, y_proba) # compute for all class >>> OK
log_loss(y_true1, y_proba) # compute for mixed component >>> OK

roc_auc_score(y_true1, y_proba1) # compute for positive class component >>> OK
roc_auc_score(y_true, y_proba) # compute for all class >>> OK
roc_auc_score(y_true1, y_proba) # compute for mixed component >>> FAIL: bad input shape (4, 2)

# The last line above is the source of the error in this snippet of a binary classification and scoring task
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
    max_depth=1, random_state=0).fit(X_train, y_train)

from sklearn.metrics import make_scorer, roc_auc_score
make_scorer(roc_auc_score, needs_proba=True)(clf, X_test, y_test) # >>> FAIL: bad input shape (1000, 2)
#compare
make_scorer(log_loss, greater_is_better=True, needs_proba=True)(clf, X_test, y_test) # >>> OK

Expected Results

roc_auc_score should behave similarly to log_loss, inferring the binary classification task and handling differently shaped inputs correctly.
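
A minimal sketch of a thin wrapper that would give this behaviour (the helper name roc_auc_proba is hypothetical; it simply selects the positive-class column of a (n, 2) probability array before calling roc_auc_score):

import numpy as np
from sklearn.metrics import roc_auc_score

def roc_auc_proba(y_true, y_proba, **kwargs):
    # If y_proba is the full (n, 2) predict_proba output, keep only the
    # positive-class column before delegating to roc_auc_score.
    y_proba = np.asarray(y_proba)
    if y_proba.ndim == 2 and y_proba.shape[1] == 2:
        y_proba = y_proba[:, 1]
    return roc_auc_score(y_true, y_proba, **kwargs)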

Versions

Windows-10-10.0.15063-SP0
Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
NumPy 1.13.3
SciPy 0.19.1
Scikit-Learn 0.19.1

@jnothman
Member

jnothman commented Dec 5, 2017

Yes, roc_auc_score(y_true1, y_proba) fails, but make_scorer(roc_auc_score, needs_proba=True)(clf, X_test, y_test) does not. Both are in accordance with documented and expected behaviour.

@jnothman jnothman closed this as completed Dec 5, 2017
@mzoll
Author

mzoll commented Dec 5, 2017

I do not understand how you find that make_scorer(roc_auc_score, needs_proba=True)(clf, X_test, y_test) behaves correctly. Executing this line, I get the following (full trace):

ValueError                                Traceback (most recent call last)
<ipython-input-4-ae795709508b> in <module>()
     31 
     32 from sklearn.metrics import make_scorer, roc_auc_score
---> 33 make_scorer(roc_auc_score, needs_proba=True)(clf, X_test, y_test) # >>> FAIL: bad input shape (1000, 2)
     34 #compare
     35 make_scorer(log_loss, greater_is_better=True, needs_proba=True)(clf, X_test, y_test) # >>> OK

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\scorer.py in __call__(self, clf, X, y, sample_weight)
    142                                                  **self._kwargs)
    143         else:
--> 144             return self._sign * self._score_func(y, y_pred, **self._kwargs)
    145 
    146     def _factory_args(self):

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\ranking.py in roc_auc_score(y_true, y_score, average, sample_weight)
    275     return _average_binary_score(
    276         _binary_roc_auc_score, y_true, y_score, average,
--> 277         sample_weight=sample_weight)
    278 
    279 

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\base.py in _average_binary_score(binary_metric, y_true, y_score, average, sample_weight)
     73 
     74     if y_type == "binary":
---> 75         return binary_metric(y_true, y_score, sample_weight=sample_weight)
     76 
     77     check_consistent_length(y_true, y_score, sample_weight)

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\ranking.py in _binary_roc_auc_score(y_true, y_score, sample_weight)
    270 
    271         fpr, tpr, tresholds = roc_curve(y_true, y_score,
--> 272                                         sample_weight=sample_weight)
    273         return auc(fpr, tpr, reorder=True)
    274 

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\ranking.py in roc_curve(y_true, y_score, pos_label, sample_weight, drop_intermediate)
    532     """
    533     fps, tps, thresholds = _binary_clf_curve(
--> 534         y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)
    535 
    536     # Attempt to drop thresholds corresponding to points in between and

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\ranking.py in _binary_clf_curve(y_true, y_score, pos_label, sample_weight)
    320     check_consistent_length(y_true, y_score, sample_weight)
    321     y_true = column_or_1d(y_true)
--> 322     y_score = column_or_1d(y_score)
    323     assert_all_finite(y_true)
    324     assert_all_finite(y_score)

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in column_or_1d(y, warn)
    612         return np.ravel(y)
    613 
--> 614     raise ValueError("bad input shape {0}".format(shape))
    615 
    616 

ValueError: bad input shape (10000, 2)

While I understand that this is in accordance with the function documentation, which requires y_true and y_score to have the same shape (i.e. both are per-class predictions), this is explicitly not the format in which a classifier expects the y_true input, namely a single-column array of class labels (compare the docs). This effectively prohibits the use of roc_auc_score in the depicted scheme of cross-validation, pipelines, etc.

@jnothman
Member

jnothman commented Dec 5, 2017

Oh. You're right. Although this issue is fixed in master. That's quite nasty. I'll need to track down what fixed this...

@jnothman
Member

jnothman commented Dec 5, 2017

This behaviour changed in master in ee2025f.

The issue is that the 'roc_auc' scorer is defined internally with needs_threshold=True. needs_proba means it strictly needs probability, not some arbitrary threshold.

We should probably consider some documentation improvements, if not changing the interface.
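
As a side note, users who just want ROC AUC in cross-validation can rely on the built-in scorer rather than wrapping roc_auc_score themselves; a minimal sketch using sklearn.metrics.get_scorer (names clf, X_test, y_test as in the snippet above):

from sklearn.metrics import get_scorer

# The built-in 'roc_auc' scorer is constructed with needs_threshold=True, so it
# feeds a 1d decision score (or positive-class probability) into roc_auc_score.
roc_auc = get_scorer('roc_auc')
score = roc_auc(clf, X_test, y_test)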

@jnothman jnothman reopened this Dec 5, 2017
@qinhanmin2014
Member

@mzoll The problem is fixed in #9521. Now you get the same result using _ProbaScorer (by setting needs_proba=True) and _ThresholdScorer (by setting needs_threshold=True).

make_scorer(roc_auc_score, needs_proba=True)(clf, X_test, y_test)
0.97641161457146342
make_scorer(roc_auc_score, needs_threshold=True)(clf, X_test, y_test)
0.97641161457146342

But according to the doc of make_scorer, it is recommended to set needs_threshold=True. Here is the definition from scikit-learn:

roc_auc_scorer = make_scorer(roc_auc_score, greater_is_better=True,
                             needs_threshold=True)

Regarding the docs, there is already some information in the make_scorer doc and the user guide. Maybe we still need more explanation about _ProbaScorer (needs_proba) and _ThresholdScorer (needs_threshold); at least I don't quite understand why we need _ProbaScorer. According to the code, it seems that it could be replaced by _ThresholdScorer?

@jnothman
Member

jnothman commented Dec 5, 2017

Well, there are certainly metrics that can't take an unnormalised decision function, but they may not necessitate a separate class.
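
To make the distinction concrete, a minimal sketch (not from the thread) using the needs_proba/needs_threshold flags of make_scorer as discussed above:

from sklearn.metrics import make_scorer, log_loss, roc_auc_score

# needs_proba=True: the scorer passes predict_proba output, which log_loss
# requires, since it only makes sense for actual probabilities.
log_loss_scorer = make_scorer(log_loss, greater_is_better=False, needs_proba=True)

# needs_threshold=True: the scorer passes decision_function output (falling back
# to predict_proba); any monotone score suffices for a ranking metric like ROC AUC.
roc_auc_scorer = make_scorer(roc_auc_score, needs_threshold=True)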

@qinhanmin2014
Member

@jnothman Thanks a lot for the clarification :)
Indeed, I've found the following in the docstrings of the three scorers:

_PredictScorer: Evaluate predicted target values for X relative to y_true.
_ProbaScorer: Evaluate predicted probabilities for X relative to y_true.
_ThresholdScorer: Evaluate decision function output for X relative to y_true.

I'm +1 for improving the doc and the user guide.
Also, in the long run, I agree that we might combine the three classes to avoid duplicate code. (e.g., we still use predict to handle regression in _ThresholdScorer even if we have clearly stated that it only works for classification)

@SummerPapaya

@qinhanmin2014 @jnothman Hi, I came across the same problem. Thanks for your updates, I got the AUC score using make_scorer. But when I use roc_curve to plot the ROC curve, the same issue occurred:

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
in ()
4 train_labels=train_label,
5 test_features=test_feature,
----> 6 test_labels=test_label)

in train_predict_evaluate_model(classifier, train_features, train_labels, test_features, test_labels)
10 m=metrics.confusion_matrix(test_labels,predictions)
11 auc = make_scorer(roc_auc_score, needs_threshold=True)(classifier, test_feature, test_label)
---> 12 fpr, tpr, thresholds = roc_curve(test_labels, clf_probs)
13 logloss = log_loss(test_labels, clf_probs)
14 print(report)

~/anaconda3/lib/python3.6/site-packages/sklearn/metrics/ranking.py in roc_curve(y_true, y_score, pos_label, sample_weight, drop_intermediate)
532 """
533 fps, tps, thresholds = _binary_clf_curve(
--> 534 y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)
535
536 # Attempt to drop thresholds corresponding to points in between and

~/anaconda3/lib/python3.6/site-packages/sklearn/metrics/ranking.py in _binary_clf_curve(y_true, y_score, pos_label, sample_weight)
320 check_consistent_length(y_true, y_score, sample_weight)
321 y_true = column_or_1d(y_true)
--> 322 y_score = column_or_1d(y_score)
323 assert_all_finite(y_true)
324 assert_all_finite(y_score)

~/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
612 return np.ravel(y)
613
--> 614 raise ValueError("bad input shape {0}".format(shape))
615
616

ValueError: bad input shape (134, 2)

Do you have any solution to this problem? Please kindly advise. Thank you in advance!

@qinhanmin2014
Member

@qinhanmin2014 @jnothman Hi, I came across the same problem. Thanks for your updates, I got the AUC score using make_scorer. But when I use roc_curve to plot the ROC curve, the same issue occurred:

Please provide self-contained example code, including imports and data (if possible), so that other contributors can just run it and reproduce your issue. Ideally your example code should be minimal.

@SummerPapaya

@qinhanmin2014 @jnothman Hi, I came across the same problem. Thanks for your updates, I got the AUC score using make_scorer. But when I use roc_curve to plot the ROC curve, the same issue occurred:

Please provide self-contained example code, including imports and data (if possible), so that other contributors can just run it and reproduce your issue. Ideally your example code should be minimal.

I just figured out the problem. I didn't notice that there are two columns in the result of predict_proba(), so the full result cannot be passed to roc_curve(). The problem was solved by selecting the second column of predict_proba() as y_score. Thanks!
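
For reference, a minimal sketch of that fix (variable names follow the earlier snippet: clf, X_test, y_test):

from sklearn.metrics import roc_curve

# predict_proba returns one column per class; keep only the positive-class column.
clf_probs = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, clf_probs)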
