BUG Fixes error with multiclass roc auc scorer #15274
Conversation
sklearn/metrics/_scorer.py
Outdated
@@ -323,6 +323,12 @@ def _score(self, method_caller, clf, X, y, sample_weight=None):
                                     self._score_func.__name__))
             elif isinstance(y_pred, list):
                 y_pred = np.vstack([p[:, -1] for p in y_pred]).T
+            else:  # multiclass
+                try:
+                    y_pred = method_caller(clf, "predict_proba", X)
Should we try decision_function first? _ThresholdScorer means that we "Evaluate decision function output for X relative to y_true".
I agree, _ThresholdScorer should be consistent and always try to use decision_function first, whatever the type of output: it should work for any kind of model that has un-normalized but continuous class assignment scores.
If ROC AUC with multiclass averaging needs normalized class assignment probabilities instead, we should encode this requirement in the definitions of the scorer instances instead:
Currently, scorer.py has:
# Score functions that need decision values
roc_auc_scorer = make_scorer(roc_auc_score, greater_is_better=True,
                             needs_threshold=True)
average_precision_scorer = make_scorer(average_precision_score,
                                       needs_threshold=True)
roc_auc_ovo_scorer = make_scorer(roc_auc_score, needs_threshold=True,
                                 multi_class='ovo')
roc_auc_ovo_weighted_scorer = make_scorer(roc_auc_score, needs_threshold=True,
                                          multi_class='ovo',
                                          average='weighted')
roc_auc_ovr_scorer = make_scorer(roc_auc_score, needs_threshold=True,
                                 multi_class='ovr')
roc_auc_ovr_weighted_scorer = make_scorer(roc_auc_score, needs_threshold=True,
                                          multi_class='ovr',
                                          average='weighted')
If needed, we should replace needs_threshold=True by needs_proba=True, but I do not see why this would be the case for ROC AUC.
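For illustration only, here is a minimal sketch of what the multiclass scorer definitions would look like under the alternative being discussed, i.e. switching to needs_proba=True; this is an assumed variant, not the change decided in this PR:

from sklearn.metrics import make_scorer, roc_auc_score

# Hypothetical variants of the built-in scorers above, requiring
# predict_proba output instead of decision_function scores.
roc_auc_ovo_scorer = make_scorer(roc_auc_score, needs_proba=True,
                                 multi_class='ovo')
roc_auc_ovr_weighted_scorer = make_scorer(roc_auc_score, needs_proba=True,
                                          multi_class='ovr',
                                          average='weighted')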
If needed, we should replace needs_threshold=True by needs_proba=True but I do not see why this would be the case for ROC AUC.
I see, so we should use needs_proba=True; we do not consider decision_function in _ProbaScorer.
For roc_auc_scorer, I think we should use _ThresholdScorer because we accept the output from decision_function.
In my previous comment I said:
But are you sure we really need to use proba for multiclass ROC AUC scores? Why couldn't we use unnormalized class assignment scores also in the multiclass case?
Replying to myself: when calling roc_auc_score with a multi_class option, the scores are expected to be normalized probabilities.
So indeed the fix implemented in this PR is correct.
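To make the distinction concrete, here is a small illustration (not code from this PR) of why the two prediction methods differ for a multiclass estimator: predict_proba rows sum to one, while decision_function returns unnormalized per-class scores.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_classes=3, n_informative=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)       # shape (n_samples, 3), rows sum to 1
scores = clf.decision_function(X)  # shape (n_samples, 3), unnormalized
print(np.allclose(proba.sum(axis=1), 1.0))   # True
print(np.allclose(scores.sum(axis=1), 1.0))  # False in general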
Please don't forget to add an entry to the changelog.
Also, a few more nits:
LGTM!
@qinhanmin2014 any further comment?
doc/whats_new/v0.22.rst
Outdated
@@ -493,6 +493,10 @@ Changelog
   ``multioutput`` parameter.
   :pr:`14732` by :user:`Agamemnon Krasoulis <agamemnonc>`.

- |Fix| The scorers: 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted',
Since these were never released, it doesn't really make sense to advertise them as a separate changelog entry, does it?
I had not realized that. Indeed we should remove that changelog entry.
Moved the mention of these scorers to the entry about the introduction of the multiclass roc_auc metric.
I think there's still an annoying issue here: when y_type is binary, this can still be a multiclass problem, but the input validation in _ProbaScorer will then raise a ValueError.
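A quick illustration of that scenario (an assumed example, not code from the PR): type_of_target only sees the target values it is given, so a fold of a three-class problem that happens to contain only two labels is reported as binary.

import numpy as np
from sklearn.utils.multiclass import type_of_target

# y drawn from a 3-class problem, but this particular fold only
# contains labels 0 and 1, so it is typed as "binary".
y_fold = np.array([0, 1, 0, 1])
print(type_of_target(y_fold))  # -> 'binary'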
One possible solution is to remove the input validation in the scorer. We can do the input validation in the function where we calculate the score.
For reference, the weird condition relates to calls like the following:

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 0, 1])
y_score = np.array([[0.1 , 0.8 , 0.1 ],
                    [0.3 , 0.4 , 0.3 ],
                    [0.35, 0.5 , 0.15],
                    [0. , 0.2 , 0.8 ]])
roc_auc_score(y_true, y_score, labels=[0, 1, 2], multi_class='ovo')

Without the labels argument, the three classes could not be inferred from y_true here.
Even if we remove that restriction, consider the following:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, roc_auc_score

scorer = make_scorer(roc_auc_score, multi_class='ovo',
                     labels=[0, 1, 2], needs_proba=True)
X, y = make_classification(n_classes=3, n_informative=3, n_samples=20,
                           random_state=0)
lr = LogisticRegression(multi_class="multinomial").fit(X, y)
scorer(lr, X, y == 0)
@thomasjpfan When defining built-in scorers, we often use the default values of the parameters. If users do not want to use the default values, they'll need to define a scorer themselves through make_scorer. I think that's reasonable. In multiclass roc_auc_score, we infer the labels by default. When y_true does not contain all the labels, it's impossible to infer them, so it's reasonable to require users to define a scorer themselves. The current issue is that when y_true only contains two labels, users can't define a scorer themselves because of the input validation in _ProbaScorer, so I think we need to remove that.
Updated the check in _ProbaScorer. Edit: if it does not look multiclass, then it raises the ValueError as before.
Maybe update _ThresholdScorer at the same time?
@@ -247,7 +247,7 @@ def _score(self, method_caller, clf, X, y, sample_weight=None):
         if y_type == "binary":
             if y_pred.shape[1] == 2:
                 y_pred = y_pred[:, 1]
-            else:
+            elif y_pred.shape[1] == 1:  # not multiclass
Hmm, why is this useful?
@pytest.mark.parametrize('scorer_name', [
    'roc_auc_ovr', 'roc_auc_ovo',
    'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted'])
def test_multiclass_roc_no_proba_scorer_errors(scorer_name):
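The test body is not shown in this excerpt; a plausible sketch of what such a parametrized test could check (the estimator choice, the get_scorer usage, and the expected exception are assumptions, not the PR's actual assertions):

import pytest
from sklearn.datasets import make_classification
from sklearn.metrics import get_scorer
from sklearn.svm import LinearSVC

@pytest.mark.parametrize('scorer_name', [
    'roc_auc_ovr', 'roc_auc_ovo',
    'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted'])
def test_multiclass_roc_no_proba_scorer_errors(scorer_name):
    # LinearSVC exposes decision_function but not predict_proba, so a
    # probability-based multiclass ROC AUC scorer should fail loudly.
    X, y = make_classification(n_classes=3, n_informative=3, random_state=0)
    clf = LinearSVC().fit(X, y)
    scorer = get_scorer(scorer_name)
    with pytest.raises(AttributeError):
        scorer(clf, X, y)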
Could you please tell me why multiclass roc_auc_score does not support the output of decision_function? Thanks.
The paper this was based on used probabilities for ranking: https://link.springer.com/content/pdf/10.1023%2FA%3A1010920819831.pdf
For reference this was discussed in the original issue: #7663 (comment)
Thanks, though I still think that multiclass roc_auc could accept the output of decision_function. Not all estimators in scikit-learn have predict_proba :)
Adds multiclass support for threshold metric for roc_auc_score