[MRG] Support for multi-class roc_auc scores #7663
Conversation
sklearn/metrics/ranking.py (outdated)
@@ -273,6 +288,7 @@ def _binary_clf_curve(y_true, y_score, pos_label=None, sample_weight=None):

    pos_label : int or str, default=None
        The label of the positive class
?
sklearn/metrics/ranking.py (outdated)
        return _average_binary_score(
            _binary_roc_auc_score, y_true, y_score, average,
            sample_weight=sample_weight)
    else:

Why can't you use MultiLabelEncoder for the OVR case? You could change the above to multiclass != "ovo" and add a call to MultiLabelEncoder for the multi-class case.
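A rough sketch of that suggestion (illustrative only, not the PR's code): binarize the multiclass labels and reuse the binary averaging helpers shown in the diff above. "MultiLabelEncoder" presumably refers to a label-binarization utility; LabelBinarizer is used here as the closest existing one, and the helper name is made up.

```python
from sklearn.preprocessing import LabelBinarizer


def _multiclass_ovr_score(y_true, y_score, average, sample_weight=None):
    # Turn the 1d multiclass labels into an indicator matrix, then fall back
    # to the existing one-vs-rest averaging path from the diff above.
    y_true_bin = LabelBinarizer().fit_transform(y_true)
    return _average_binary_score(
        _binary_roc_auc_score, y_true_bin, y_score, average,
        sample_weight=sample_weight)
```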
sklearn/metrics/ranking.py (outdated)
@@ -184,7 +184,7 @@ def _binary_average_precision(y_true, y_score, sample_weight=None):
        average, sample_weight=sample_weight)


-def roc_auc_score(y_true, y_score, average="macro", sample_weight=None):
+def roc_auc_score(y_true, y_score, multiclass="ovr", average="macro", sample_weight=None):
please document.
line too long ;)
It looks like …
sklearn/metrics/base.py (outdated)
    Calculate metrics for each label, taking into account the a priori
    distribution of the classes.

    binary_metric : callable, returns shape [n_classes]
What's the input requirement?
sklearn/metrics/base.py (outdated)
    Target scores corresponding to probability estimates of a sample
    belonging to a particular class

    average : string, [None, 'macro' (default), 'weighted']
In this case None would be of length N * (N - 1) / 2. I guess we can support that, but I feel it might make the code more complicated without a lot of benefit.
sklearn/metrics/base.py (outdated)
        classes.

    """
    average_options = (None, "macro", "weighted")
This is a private function. If these arguments were already checked in the public function (which I think was the case), we don't need to check them here again.
In particular, look at the code coverage of the tests (after you've implemented them ;).
sklearn/metrics/base.py (outdated)
@@ -131,3 +132,94 @@ def _average_binary_score(binary_metric, y_true, y_score, average,
        return np.average(score, weights=average_weight)
    else:
        return score

def _average_multiclass_score(binary_metric, y_true, y_score,
pep8: two blank lines

please run flake8 over base.py. xrange doesn't exist in python3, there is an unused variable, and there are some pep8 errors.
sklearn/metrics/base.py (outdated)
    not_average_axis = 1

    if y_true.ndim == 1:
this should always be the case, right? Why reshape it?
sklearn/metrics/base.py (outdated)
    if y_true.ndim == 1:
        y_true = y_true.reshape((-1, 1))

    if y_score.ndim == 1:
This confuses me. This should never be the case, right?
sklearn/metrics/base.py (outdated)
            (binary_metric(y_true_filtered_10, y_score_filtered[:,pair[0]]) +
             binary_metric(y_true_filtered_01, y_score_filtered[:,pair[1]]))/2.0
        auc_scores_sum += binary_avg_output
    if average == "weighted":
Is there any reason for that? Because we don't have a reference?
Is there a weighted version of Hand & Till and/or another one-vs.-one algorithm I should look into? I was a little confused about the one specified here (search: AU1P), because Hand & Till consider AUC(j, k) to be the average (the binary_avg_output variable I have here), so I wasn't sure if it made sense to define P(j) the same way it is done for the "weighted OvR".
hm, they use a slightly different formula than yours. They double count each pair and weight it once by p(j) and once by p(i), right? At least that's my reading.
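One way to write that reading down (this is an interpretation of the comment above, not a formula quoted from Ferri et al.): summing over ordered pairs counts every unordered pair {j, k} twice, once weighted by the prevalence p(j) and once by p(k),

$$\mathrm{AU1P} \;=\; \frac{1}{c(c-1)} \sum_{j \neq k} p(j)\,\mathrm{AUC}(j \mid k).$$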
sklearn/metrics/base.py (outdated)
    auc_scores_sum = 0
    for pair in pairwise:
        ix = np.in1d(y_true.ravel(), [pair[0], pair[1]])
        y_true_filtered = y_true[0, np.where(ix)]
why do you need a call to where?
sklearn/metrics/base.py (outdated)
        y_true_filtered_10 = np.in1d(y_true_filtered.ravel(), pair[0]).astype(int)
        y_true_filtered_01 = np.in1d(y_true_filtered.ravel(), pair[1]).astype(int)
        binary_avg_output = \
            (binary_metric(y_true_filtered_10, y_score_filtered[:,pair[0]]) +
this is confusing me. Why is there a +?
In the paper, Hand & Till define A(i, j) as [A(i | j) + A(j | i)] / 2 (middle of page 177 here, a few lines above equation (7)).
My implementation was based on this paper, though I may revise it to follow the "AU1U" in the paper I referred to in the other comment about doing weighted OvO.
Ah, never mind. I got it. I missed the factor of 2 in my first reading. The weighting AU1P is more natural if you write it like the AU1U, so yeah, what you said.
sklearn/metrics/base.py (outdated)
        ix = np.in1d(y_true.ravel(), [pair[0], pair[1]])
        y_true_filtered = y_true[0, np.where(ix)]
        y_score_filtered = y_score[np.where(ix)]
        y_true_filtered_10 = np.in1d(y_true_filtered.ravel(), pair[0]).astype(int)
that's the same as y_true_filtered == pair[0], right?
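A quick check of that equivalence (illustrative snippet, not part of the PR):

```python
import numpy as np

y_true_filtered = np.array([0, 2, 0, 2])
pair = (0, 2)

a = np.in1d(y_true_filtered.ravel(), pair[0]).astype(int)
b = (y_true_filtered == pair[0]).astype(int)
assert np.array_equal(a, b)  # both are [1, 0, 1, 0]
```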
sklearn/metrics/base.py (outdated)
    # Hand and Till 2001 (unweighted)
    pairwise = [p for p in itertools.combinations(xrange(n_labels), 2)]
    auc_scores_sum = 0
    for pair in pairwise:
maybe for pos, neg in pairwise?
Still have plenty of test cases to do.
sklearn/metrics/ranking.py (outdated)
    check_consistent_length(y_true, y_score)

    if y_true.ndim == 1:
there should be a call to y_true = check_array(y_true) before this to ensure that it is a numpy array before we check the .ndim attribute. See comment in PR about this.
y_true should always be 1d, right? So we don't need the if?
I initially included this because it is checked in _average_binary_score here. It makes sense in that context because _average_binary_score has to handle the multilabel y_true. Rather than do this check, I can specify in the docs that for the multiclass case, y_true should be 1D.

do
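A minimal sketch of the validation being discussed (not the PR's code; the helper name is made up, but check_array and column_or_1d are existing scikit-learn utilities):

```python
from sklearn.utils import check_array, column_or_1d


def _validate_multiclass_inputs(y_true, y_score):
    # Multiclass labels are always 1d, so coerce/validate instead of
    # branching on .ndim; scores are a 2d (n_samples, n_classes) array.
    y_true = column_or_1d(y_true)
    y_score = check_array(y_score)
    return y_true, y_score
```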
sklearn/metrics/base.py (outdated)
    score : float
        Average the sum of the pairwise binary metric scores
    """
    label_unique, label_counts = np.unique(y_true, return_counts=True)
the "return_counts" keyword is not supported in older versions of numpy.
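A backward-compatible way to get the same counts (sketch only; np.unique(..., return_counts=True) needs numpy >= 1.9):

```python
import numpy as np

y_true = np.array([0, 1, 1, 2, 2, 2])  # toy example
label_unique = np.unique(y_true)
label_counts = np.array([np.sum(y_true == label) for label in label_unique])
# label_unique -> [0, 1, 2], label_counts -> [1, 2, 3]
```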
…put parameter specifications
I'm not sure of a good way to phrase the modifications in the documentation. Let me know if there are suggestions for making the multiclass case clearer.
I just spent several hours trying to figure out what the right thing to do is, and I am not sure what the correct thing to do is. The literature is wrong and contradictory. I wrote a very, very naive implementation in this gist, with the goal of making it as readable and as easy to check as possible. I am mostly basing things on this paper, but the Hand & Till paper, cited in the pROC R package, uses a different formula than the one cited in the paper I link.

From what I can tell, Hand & Till average over distinct pairs of classes, corresponding to mc_auc_ovo(weighted=False, distinct=True) in my linked gist, while the formulations in Ferri & al correspond to distinct=False. According to my gist this makes a difference. Moreover, the results are inconsistent depending on whether you pass probabilities or decision function scores.

I tried the pROC R package but I don't really know how to use it; it seems to take a single column of scores...

Finally, I want to note that the formula for weighted OvO in the documentation of this PR is different from the one given in the cited paper (Ferri & al). However, it seems that the formulation in Ferri & al yields values that are 1/n_classes times smaller than what you would expect.
Why? You sum over all combinations, so it's n ** 2 many terms, right?
Look at the output of my script, which should be a literal implementation of Ferri & al. Ferri & al give, on page 4, … @kathyxchen's expression in the narrative docs of this PR corrects for this (?) by using …

There are just too many small inconsistencies between these definitions, and they lead to subtly different results. I think that is very scary when it comes to something used for evaluation.
Interesting investigation, @vene. I have been similarly sceptical of a universal multiclass approach to balanced accuracy. We need to proceed with caution (and I'm similarly concerned about the sudden changes described as fixes to average precision). At least here we are introducing a new feature and can document clearly which choice we take.

In making a choice, we need to look at the properties that make one option worthwhile against the other, and I admit this is hard without clear published research to build on.

Without having much time to look into your work, I'm curious where the discrepancies between proba and decision function come from.
> (and I'm similarly concerned about the sudden changes described as fixes to average precision).

average precision had a clear bug, and one of my students hit it. We've added a test that failed before:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/tests/test_ranking.py#L624
and the only thing that we changed was the faulty interpolation (we did not merge the 11pt variant, for the reason that you mention).
I agree that we could implement one formulation and document it appropriately. I could not find any good evidence for what the "standard" would be. Maybe looking closer into pROC multiclass.auc can help, but its API seems different from what we want (it seems to take a multi-class vector and a single vector of scores), so I couldn't even figure out how to use it.

I'm very far from grokking this, but I think the intuition is that, for a pair of classes, when one's probas increase the other's must decrease, which is not the case for scores (and this doesn't cancel out when taking all ordered pairs, it would seem). Notably, in Ferri et al, multi-class AUC is defined for probas explicitly. Of course my script might have bugs; it could use another pair of eyes.
@jnothman: the equation implemented in the PR uses the Hand & Till paper, …
I also wanted to find a commit where we had something more similar to @vene's implementation in the gist at line 79.
@vene: wanted to follow up on this. Do my previous comments help to clarify why the equation in the PR docs is the way that it is? Also, to continue your conversation on this point:

> From what I can tell, Hand & Till average over distinct pairs of classes, corresponding to mc_auc_ovo(weighted=False, distinct=True) in my linked gist, while the formulations in Ferri & al correspond to distinct=False. According to my gist this makes a difference.

The Hand & Till average defines AUC(i, j) as the average of AUC(i | j) and AUC(j | i), and so instead just incorporates a coefficient of 2 in equation 7 on p. 177. I will agree that Ferri et al. does not do this, and depending on how you implement auc_two_classes in the gist, that may make a difference. There are inconsistencies between Hand & Till and Ferri et al. which I hadn't thought about too deeply - I'm glad you are bringing it up now.

The way that I've approached this is by prioritizing what is reported in Hand & Till over Ferri et al., but I'm not sure if that is the "best" decision to make in this case.
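For reference, here is how I read the Hand & Till (2001) quantities under discussion (a paraphrase of the formulas around equation (7), with c the number of classes, not a verbatim quote):

$$\hat{A}(i, j) = \frac{\hat{A}(i \mid j) + \hat{A}(j \mid i)}{2}, \qquad M = \frac{2}{c(c-1)} \sum_{i < j} \hat{A}(i, j).$$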
A few thoughts on this. Firstly, we can refuse to implement if there's no clear community consensus. Secondly, we must have a metric that is invariant to the order of class iteration. I think @vene's distinct=True versions will give different results if the classes are iterated in a different order. Thirdly, I think that if scores across classes are compared to one another (and not just to the binary ground truth), there will necessarily be a different aggregate score for probabilities and decision functions. If necessary, we could require that y_score.sum(axis=1) = 1, and perhaps have a parameter to optionally perform softmax.
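A sketch of that normalization idea (illustrative only, not something in this PR; the function name is made up):

```python
import numpy as np


def _normalize_scores(y_score, softmax=False):
    """Require rows of y_score to sum to 1, optionally applying a softmax
    to raw decision_function outputs first."""
    y_score = np.asarray(y_score, dtype=float)
    if softmax:
        # numerically stable softmax per row
        y_score = np.exp(y_score - y_score.max(axis=1, keepdims=True))
        y_score = y_score / y_score.sum(axis=1, keepdims=True)
    elif not np.allclose(y_score.sum(axis=1), 1.0):
        raise ValueError("y_score rows must sum to 1; pass softmax=True "
                         "for decision_function outputs.")
    return y_score
```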
I haven't looked into @vene's script too deeply (though it's on my long todo). I think we should restrict ourselves to the probability case and implement Hand & Till and Provost as two options. We shouldn't do something with …
So the water is muddy here, but I think we should be able to agree on a path. To do so, we should:
I had asked @maskani-moh to look into it; I'm not sure if he's working on it or on something else?
(@maskani-moh, or are you doing #8614?)
@amueller Also, with all the information on both this thread and issue #3298 (e.g. the inconsistencies between Hand & Till and Ferri et al.), it was quite fuzzy at some point. I am wondering if there is any topic in research where researchers do not contradict themselves and agree on common implementations 😆 But I think I can follow the steps suggested by @jnothman. Thanks! 👍
Hey @amueller, @maskani-moh, I am also working on #8614. I didn't have time to commit my progress lately, but I think I am on a good track. Should I drop it? I didn't realise there was someone else working on it :/ At least it wasn't mentioned in the issue itself.
Cool, I will keep on then :) I will push my changes over the weekend, because these days I have been swamped with work. Since you also looked at the issue, if you have any valuable input I will be glad to have your feedback.
@jnothman How different is Provost & Domingo's from the current implementation (I mean in this PR: here https://github.com/scikit-learn/scikit-learn/pull/7663/files#diff-7da4afad051915b763846995395fbda3R302) of the multiclass roc_auc_score where parameter average="weighted"? 🤔

different from what?
@jnothman, my last comment was a bit confusing, but I figured it out myself! Thank you! Due to some rebase issues I encountered while taking over this PR, I decided to start a new one, #10481, from scratch, including all the previous work as well as mine in a few commits. I've taken into account your comment and implemented what was left to do. Thank you, and I wish you a happy new year! 🎉
fixed in #12789
Reference Issue
Issue #3298
What does this implement/fix? Explain your changes.
Incorporates computations for some multiclass ROC AUC algorithms:
Any other comments?
I wanted to just get in an initial PR for this issue because I am sure there will be many modifications/suggestions from everyone.
TODOs
- Fix TypeError: unique() got an unexpected keyword argument 'return_counts' (the return_counts keyword is not available in older numpy versions)