[MRG] Support for multi-class roc_auc scores #7663

Closed
wants to merge 31 commits

Conversation

@kathyxchen (Contributor) commented Oct 13, 2016

Reference Issue

Issue 3298

What does this implement/fix? Explain your changes.

Incorporates ROC AUC computation for two multiclass algorithms:

  • Hand & Till, 2001 (one-vs-one)
  • Provost & Domingo, 2000 (one-vs-rest)

Any other comments?

I wanted to get an initial PR in for this issue, since I am sure there will be many modifications/suggestions from everyone.

TODOs

  • I haven't implemented comprehensive testing for edge cases yet. I've added a few short tests to assess the correctness of the different algorithms, but would also like more feedback on this specific area.
  • Update documentation
  • PEP8!
  • Check compatibility with the different versions of Python & packages supported by sklearn. (e.g. one failure was with older numpy: TypeError: unique() got an unexpected keyword argument 'return_counts')

@@ -273,6 +288,7 @@ def _binary_clf_curve(y_true, y_score, pos_label=None, sample_weight=None):

pos_label : int or str, default=None
The label of the positive class
Member

?

    return _average_binary_score(
        _binary_roc_auc_score, y_true, y_score, average,
        sample_weight=sample_weight)
else:
Member

Why can't you use MultiLabelEncoder for the OVR case?
You could change the above to multiclass != "ovo" and add a call to MultiLabelEncoder for the multi-class case.
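
For illustration only, a rough sketch of the reviewer's suggestion: binarize the labels and reuse the existing binary averaging through scikit-learn's public API instead of the PR's private helpers. It assumes three or more classes and that the columns of y_score follow the order of the binarizer's classes_:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import roc_auc_score


def ovr_roc_auc(y_true, y_score, average="macro"):
    # One-vs-rest: binarize labels into an (n_samples, n_classes) indicator
    # matrix and score each column against the corresponding score column.
    y_true_bin = LabelBinarizer().fit_transform(y_true)
    per_class = [
        roc_auc_score(y_true_bin[:, k], y_score[:, k])
        for k in range(y_true_bin.shape[1])
    ]
    if average == "weighted":
        # Weight each class by its prevalence in y_true.
        weights = y_true_bin.mean(axis=0)
        return np.average(per_class, weights=weights)
    return np.mean(per_class)
```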

@@ -184,7 +184,7 @@ def _binary_average_precision(y_true, y_score, sample_weight=None):
average, sample_weight=sample_weight)


def roc_auc_score(y_true, y_score, average="macro", sample_weight=None):
def roc_auc_score(y_true, y_score, multiclass="ovr", average="macro", sample_weight=None):
Member

please document.

Member

line too long ;)

@amueller (Member)

It looks like test_common doesn't have tests for "multiclass thresholded", which is too bad :( I guess we shouldn't worry about that for now.

Calculate metrics for each label, taking into account the a priori
distribution of the classes.

binary_metric : callable, returns shape [n_classes]
Member

What's the input requirement?

Target scores corresponding to probability estimates of a sample
belonging to a particular class

average : string, [None, 'macro' (default), 'weighted']
Member

In this case None would be of length N * (N - 1) / 2. I guess we can support that, but I feel it might make the code more complicated without much benefit.

classes.

"""
average_options = (None, "macro", "weighted")
Member

This is a private function. If these arguments were already checked in the public function (which I think they were), we don't need to check them here again.
In particular, look at the code coverage of the tests (after you implement them ;).

@@ -131,3 +132,94 @@ def _average_binary_score(binary_metric, y_true, y_score, average,
        return np.average(score, weights=average_weight)
    else:
        return score

def _average_multiclass_score(binary_metric, y_true, y_score,
Member

pep8: two blank lines

@amueller (Member)

Please run flake8 over base.py. xrange doesn't exist in Python 3, there is an unused variable, and there are some PEP8 errors.


not_average_axis = 1

if y_true.ndim == 1:
Member

this should always be the case, right? Why reshape it?

if y_true.ndim == 1:
    y_true = y_true.reshape((-1, 1))

if y_score.ndim == 1:
Member

This confuses me. This should never be the case, right?

(binary_metric(y_true_filtered_10, y_score_filtered[:,pair[0]]) +
binary_metric(y_true_filtered_01, y_score_filtered[:,pair[1]]))/2.0
auc_scores_sum += binary_avg_output
if average == "weighted":
Member

Is there any reason for that? Because we don't have a reference?

Contributor Author

Is there a weighted version of Hand & Till and/or another one-vs-one algorithm I should look into? I was a little confused about the one specified here (search: AU1P), because Hand & Till considers AUC(j, k) to be the average (the binary_avg_output variable I have here), so I wasn't sure if it made sense to define P(j) the same way it is done for the "weighted ovr."

Member

Hm, they use a slightly different formula than you do. They double-count each pair and weight it once by p(j) and once by p(i), right? At least that's my reading.

auc_scores_sum = 0
for pair in pairwise:
    ix = np.in1d(y_true.ravel(), [pair[0], pair[1]])
    y_true_filtered = y_true[0, np.where(ix)]
Member

why do you need a call to where?
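
As an aside, a tiny illustration (not code from the PR) of why the np.where call is redundant, and of the equivalence raised a few comments below:

```python
import numpy as np

y_true = np.array([0, 2, 1, 2, 0, 1])
pair = (0, 2)

# A boolean mask selects the samples belonging to either class of the pair;
# it can be used directly as an index, no np.where needed.
ix = np.in1d(y_true, pair)
y_true_filtered = y_true[ix]

# np.in1d against a single label is just an elementwise comparison.
assert np.array_equal(np.in1d(y_true_filtered, pair[0]),
                      y_true_filtered == pair[0])
```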

y_true_filtered_10 = np.in1d(y_true_filtered.ravel(), pair[0]).astype(int)
y_true_filtered_01 = np.in1d(y_true_filtered.ravel(), pair[1]).astype(int)
binary_avg_output = \
(binary_metric(y_true_filtered_10, y_score_filtered[:,pair[0]]) +
Member

this is confusing me. Why is there a +?

Contributor Author

In the paper, Hand & Till define A(i, j) as [A(i | j) + A(j | i)] / 2 [Middle of page 177 on here, a few lines above equation (7)].
My implementation was based on this paper, though I may revise to follow the "AU1U" in the paper I referred to in the other comment about doing weighted OvO.
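
For reference, the Hand & Till (2001) quantities being discussed, written out here as a sketch in my own notation:

```latex
% Pairwise AUC, symmetrized as described on p. 177 of Hand & Till (2001):
% A(i | j) is the probability that a randomly drawn class-i sample is ranked
% above a randomly drawn class-j sample by the score for class i.
\hat{A}(i, j) = \frac{\hat{A}(i \mid j) + \hat{A}(j \mid i)}{2}

% Overall measure M: mean over the c(c-1)/2 distinct (unordered) class pairs.
M = \frac{2}{c(c-1)} \sum_{i < j} \hat{A}(i, j)
```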

Member

Ah, never mind. I got it. I missed the factor of 2 in my first reading. The weighting AU1P is more natural if you write it like the AU1U, so yeah, what you said.

ix = np.in1d(y_true.ravel(), [pair[0], pair[1]])
y_true_filtered = y_true[0, np.where(ix)]
y_score_filtered = y_score[np.where(ix)]
y_true_filtered_10 = np.in1d(y_true_filtered.ravel(), pair[0]).astype(int)
Member

that's the same as y_true_filtered == pair[0] right?

# Hand and Till 2001 (unweighted)
pairwise = [p for p in itertools.combinations(xrange(n_labels), 2)]
auc_scores_sum = 0
for pair in pairwise:
Member

maybe for pos, neg in pairwise?
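
Pulling these review suggestions together (tuple unpacking, boolean masks, no xrange), an unweighted Hand & Till loop might look roughly like this. It is a sketch, not the PR's code; it assumes class labels are 0 .. n_classes-1 so they index the columns of y_score, and binary_metric stands in for the existing binary ROC AUC helper:

```python
from itertools import combinations

import numpy as np


def hand_till_ovo(y_true, y_score, binary_metric):
    """Unweighted one-vs-one multiclass AUC (Hand & Till, 2001): sketch."""
    classes = np.unique(y_true)
    n_classes = len(classes)
    pair_scores = []
    for pos, neg in combinations(classes, 2):
        # Restrict to samples belonging to either class of this pair.
        mask = np.in1d(y_true, [pos, neg])
        y_true_pair = y_true[mask]
        y_score_pair = y_score[mask]
        # A(pos | neg) and A(neg | pos), averaged as in Hand & Till.
        a_pos = binary_metric(y_true_pair == pos, y_score_pair[:, pos])
        a_neg = binary_metric(y_true_pair == neg, y_score_pair[:, neg])
        pair_scores.append((a_pos + a_neg) / 2.0)
    # M = 2 / (c * (c - 1)) * sum over the distinct pairs.
    return 2.0 * sum(pair_scores) / (n_classes * (n_classes - 1))
```

For example, hand_till_ovo(y, clf.predict_proba(X), roc_auc_score) would average the pairwise AUCs of a fitted probabilistic classifier.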

@kathyxchen (Contributor Author) commented Oct 19, 2016

I still have plenty of test cases to do.
Also, I am assuming I have received a numpy array as y_true input (I'll leave a comment in the code). I ran into trouble using check_array on y_true and wanted to ask about this:
Given a list, check_array gives me a (1, n) numpy array. The code I have for both 'ovo' and 'ovr' expects an (n, 1) numpy array (the code is cleaner because I can filter the y_score matrix with the same indices).
Do you have suggestions about how I should improve this? E.g.:

  • Should I assume y_true is (1, n) and just adjust the code to handle that?
  • Should I always apply a reshape?


check_consistent_length(y_true, y_score)

if y_true.ndim == 1:
Contributor Author

There should be a call to y_true = check_array(y_true) before this to ensure that it is a numpy array before we check the .ndim attribute. See the comment in the PR about this.

Member

y_true should always be 1d, right? So we don't need the if?

Contributor Author (Oct 19, 2016)

I initially included this because it is checked in _average_binary_score here. It makes sense in that context because _average_binary_score has to handle multilabel y_true. Rather than do this check, I can specify in the docs that for the multiclass case, y_true should be 1D.

@amueller (Member)

Do check_array(ensure_2d=False). That should give you (n,) if you input a list. If you need a different shape for convenience in the code, then call reshape(-1, 1), I'd say.
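
A small illustration of that suggestion (sketch only; check_array is scikit-learn's public validation utility, and the reshape is the convenience step mentioned above):

```python
import numpy as np
from sklearn.utils import check_array

y_true = [0, 1, 2, 1, 0]

# ensure_2d=False accepts 1-d input and returns an (n,) numpy array
# instead of forcing a 2-d shape.
y_true = check_array(y_true, ensure_2d=False)
print(y_true.shape)       # (5,)

# If an (n, 1) column is more convenient for the pairwise filtering code,
# reshape explicitly afterwards.
y_true_col = y_true.reshape(-1, 1)
print(y_true_col.shape)   # (5, 1)
```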

score : float
Average the sum of the pairwise binary metric scores
"""
label_unique, label_counts = np.unique(y_true, return_counts=True)
Member

the "return_counts" keyword is not supported in older versions of numpy.

@kathyxchen (Contributor Author)

I'm not sure what a good way to phrase the modifications in the documentation would be. Let me know if there are suggestions for making the multiclass case clearer.
I also considered updating the plot_roc example, where it mentions that multiclass needs to binarize the output, but I was not sure what was appropriate: the example also deals with plotting the ROC curves, whereas the new multiclass functionality (accepting y_true as an array of labels 0 ... n_classes - 1) only applies to the ROC AUC score computation.

@vene (Member) commented Jun 10, 2017

I just spent several hours trying to figure out the right thing to do, and I am still not sure what it is. The literature is wrong and contradictory.

I wrote a very very naive implementation in this gist, with the goal of making it as readable and as easy to check as possible.

I am mostly basing things on this paper, but the Hand & Till paper, cited in the pROC R package, uses a different formula than the one cited in the paper I link.

From what I can tell, Hand & Till average over distinct pairs of classes, corresponding to mc_auc_ovo(weighted=False, distinct=True) in my linked gist, while the formulations in Ferri & al correspond to distinct=False. According to my gist this makes a difference.

Moreover, the results are inconsistent if passed probabilities from predict_proba versus scores from decision_function, again, as illustrated in my gist.

I tried the pROC R package, but I don't really know how to use it; it seems to take a single column of scores...

Finally, I want to note that the formula for weighted OvO in the documentation of this PR is different from the one given in the cited paper (Ferri et al.). However, it seems that the formulation in Ferri et al. yields values that are a factor of n_classes smaller than what you would expect.

@amueller (Member)

> Finally, I want to note that the formula for weighted OvO in the documentation of this PR is different from the one given in the cited paper (Ferri & al). However it seems that the formulation in Ferri & al yields values that are 1/n_classes times smaller than what you would expect.

Why? You sum over all combinations, so it's n ** 2 many terms, right?

@vene (Member) commented Jun 10, 2017

Look at the output of my script, which should be a literal implementation of Ferri & al.

Ferri & al give, on page 4,

AU1U = 1 / (c * (c - 1)) * sum_j sum_{k != j}        AUC(j, k)
AU1P = 1 / (c * (c - 1)) * sum_j sum_{k != j} p(j) * AUC(j, k)

p(j) is defined at the top of section 3.1 (top right of p. 3) as m_j / m, in other words np.mean(y_true == j). Therefore it's always less than 1.

(Ignoring the outer factor of 1 / c(c-1), the weighted version replaces a sum with a weighted mean...) (this sentence was incorrect, sorry!)

@kathyxchen's expression in the narrative docs of this PR corrects for this (?) by using 1/(c-1) instead of 1/c(c-1), but where does that come from, and from what paper? Especially since the weights in the narrative doc are p(j or k) instead of p(j) as in the Ferri et al. paper...

There are just too many small inconsistencies between these definitions, and they lead to subtly different results. I think that is very scary when it comes to something used for evaluation.
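
To make the comparison concrete, a naive literal reading of the two Ferri et al. formulas above could be sketched like this (auc_two_classes is a hypothetical helper, analogous to the one in the gist, and class labels are assumed to be 0 .. c-1 so they index the columns of y_score):

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def auc_two_classes(y_true, y_score, j, k):
    # AUC(j, k): separability of class j from class k, scored with column j.
    mask = np.in1d(y_true, [j, k])
    return roc_auc_score(y_true[mask] == j, y_score[mask, j])


def au1u_au1p(y_true, y_score):
    classes = np.unique(y_true)
    c = len(classes)
    au1u = au1p = 0.0
    for j in classes:
        p_j = np.mean(y_true == j)   # prevalence of class j
        for k in classes:
            if k == j:
                continue
            auc_jk = auc_two_classes(y_true, y_score, j, k)
            au1u += auc_jk
            au1p += p_j * auc_jk
    norm = c * (c - 1)
    return au1u / norm, au1p / norm
```

Because each p(j) is below 1, the AU1P value computed this way is noticeably smaller than AU1U, which is the discrepancy described above.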

@jnothman (Member) commented Jun 11, 2017 via email

@GaelVaroquaux (Member) commented Jun 11, 2017 via email

@vene (Member) commented Jun 11, 2017

I agree that we could implement one formulation and document it appropriately. I could not find any good evidence for what the "standard" would be. Maybe looking closer into pROC's multiclass.auc can help, but its API seems different from what we want (it seems to take a multi-class vector and a single vector of scores), so I couldn't even figure out how to use it.

> discrepancies between proba and decision function

I'm very far from grokking this, but I think the intuition is that, for a pair of classes, when one's probas increase the other's must decrease, which is not the case for scores. (And this doesn't cancel out when taking all ordered pairs, it would seem.) Notably, in Ferri et al., multi-class AUC is defined for probas explicitly.

Of course my script might have bugs; it could use another pair of eyes.

@kathyxchen (Contributor Author) commented Jun 14, 2017

@jnothman: the equation implemented in the PR uses the Hand & Till paper,
see page 177.
If you look 4 lines above eq. (7), A(i, j) is defined as [A(i | j) + A(j | i)] / 2.

@kathyxchen (Contributor Author)

@vene we also had some discussion about Ferri et al.'s equation here that might be helpful.

@kathyxchen (Contributor Author)

I also wanted to find a commit where we had something more similar to @vene's implementation in the gist, at line 79.
I had originally implemented it that way (you can remove the comments at the top of this file for clarity initially, though there's some discussion in the review that is also related to this). I haven't had a chance to look extensively into the differences yet, but I'm hoping that might be helpful. We updated it afterwards based on the review there, and then changed from probability to prevalence after some conversation with @jnothman here in the tests.

@kathyxchen (Contributor Author) commented Jun 25, 2017

@vene: I wanted to follow up on this. Do my previous comments help to clarify why the equation in the PR docs is the way it is?

Also, to continue your conversation on this point:

> From what I can tell, Hand & Till average over distinct pairs of classes, corresponding to mc_auc_ovo(weighted=False, distinct=True) in my linked gist, while the formulations in Ferri & al correspond to distinct=False. According to my gist this makes a difference.

The Hand & Till average defines AUC(i, j) as the average of AUC(i | j) and AUC(j | i), and so it just incorporates a coefficient of 2 in equation 7 on p. 177. I agree that Ferri et al. do not do this, and depending on how you implement auc_two_classes in the gist, that may make a difference. There are inconsistencies between Hand & Till and Ferri et al. which I hadn't thought about too deeply; I'm glad you are bringing it up now.

The way that I've approached this is by prioritizing what is reported in Hand & Till over Ferri et al., but I'm not sure if that is the "best" decision to make in this case.

@jnothman (Member) commented Jun 25, 2017 via email

@amueller (Member) commented Sep 6, 2017

I haven't looked into @vene's script too deeply (though it's on my long todo list). I think we should restrict ourselves to the probability case and implement Hand & Till and Provost as two options. We shouldn't do something with decision_function that's arbitrary and non-standard, but right now we don't really have any good single-number multi-class metrics (arguably apart from macro-average precision and macro-average recall, but they are not documented very well imho), and I think users would really benefit from them.

@jnothman (Member)

So the water is muddy here, but I think we should be able to agree on a path. To do so, we should:

  • begin by validating that in the multiclass case, np.allclose(1, y_score.sum(axis=1)) (see the sketch after this list).
  • assume that Hand & Till is the better choice, but one last time note here the differences between their version of OvO and Ferri et al.'s.
  • also implement Provost and Domingo's OvR.
  • check that all versions implemented are invariant under label and column permutation (if we do not already).
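
A minimal sketch of the first bullet's validation, assuming y_score is an (n_samples, n_classes) array of class probabilities (the function name and error message are illustrative, not from the PR):

```python
import numpy as np


def check_multiclass_proba(y_score):
    # In the multiclass case the scores are expected to be probabilities,
    # so every row should sum to 1 (within floating-point tolerance).
    y_score = np.asarray(y_score)
    if not np.allclose(1, y_score.sum(axis=1)):
        raise ValueError(
            "Target scores should sum to 1.0 over classes for "
            "multiclass roc_auc; got rows that do not sum to 1.")
    return y_score
```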

@amueller (Member)

I had asked @maskani-moh to look into it; I'm not sure if he's working on it or on something else?

@amueller (Member)

(@maskani-moh or are you doing #8614?)

@maskani-moh (Contributor)

@amueller
Yes, I am working on this one. I didn't have much time to make a lot of progress on it this week because of my finals, but I am on it.

Also, with all the information on both this thread and issue #3298 (e.g. the inconsistencies between Hand & Till and Ferri et al.), it got quite fuzzy at some point. I am wondering if there is any research topic where researchers do not contradict each other and agree on common implementations 😆

But I think I can follow the steps suggested by @jnothman. Thanks! 👍

@PGryllos (Contributor) commented Dec 14, 2017

Hey @amueller, @maskani-moh, I am also working on #8614. I didn't have time to commit my progress lately, but I think I am on a good path. Should I drop it? I didn't realise there was someone else working on it :/ At least it wasn't mentioned in the issue itself.

@maskani-moh (Contributor)

Hey @PGryllos,

I had a look at #8614 but didn't start it. I am currently working on this one.
If you're on the right track, then I think you should just commit your changes so that they can be reviewed, and keep up the work. 👍

@PGryllos (Contributor) commented Dec 14, 2017

Cool, I will keep going then :) I will push my changes over the weekend, because these days I have been swamped with work. Since you also looked at the issue, if you have any input I would be glad to have your feedback.

@maskani-moh (Contributor)

@jnothman

How different is Provost & Domingo's approach from the current implementation (I mean in this PR: here) of the multiclass roc_auc_score with parameter average="weighted"? 🤔

@jnothman (Member) commented Dec 17, 2017 via email

@maskani-moh (Contributor)

@jnothman, my last comment was a bit confusing, but I figured it out myself! Thank you!

Due to some rebase issues I've encountered while taking over this PR, I decided to start a new one, #10481, from scratch, including all the previous work as well as mine in a few commits.

I've taken into account your comment and implemented what was left to do.

Thank you, and I wish you a happy new year! 🎉

@amueller (Member)

Fixed in #12789.
