[MRG] Add multilabel support to precision, recall, fscore and classification report #1945
Conversation
Reviews are welcome!!!
finally:
    np.seterr(**old_err_settings)

precision[size_true == 0.] = 1.0
precision[size_true == 0]
Otherwise I think it looks good :) Should one hand-check the reference values for the tests?
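For reference, a minimal sketch of the division pattern under discussion, using np.errstate as a context manager instead of saving and restoring np.seterr in a finally block; the array names follow the diff above and the data is made up:

```python
import numpy as np

# Made-up per-label counts, named after the quantities in the diff above.
true_pos = np.array([3, 0, 2])
size_pred = np.array([4, 0, 2])  # predicted labels per column
size_true = np.array([5, 0, 3])  # true labels per column

# np.errstate restores the previous floating-point error settings on exit,
# which is what the seterr/finally pair in the diff does by hand.
with np.errstate(divide="ignore", invalid="ignore"):
    precision = true_pos / size_pred
    recall = true_pos / size_true

# Mirror the edge case handled in the diff: columns with no true labels
# get 1.0 instead of nan.
precision[size_true == 0] = 1.0
recall[size_true == 0] = 1.0
```

Whether 1.0 (or 0.0) is the right fill value here is part of the discussion at the end of this thread.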
y_pred != pos_label), axis=0)

else:
    for true, pred in zip(y_true, y_pred):
Could you please clarify the datatype of `y_true` and `y_pred`? Your use of `in` seems to suggest they are sequences of iterables (tuples?) over labels, but I got the impression from the use of `LabelBinarizer` and the indicator matrix case that we had a binary array. I don't know whether I've misread some code.
If it's represented as a binary array, can't you do something like:
true_pos = (y_true & y_pred).sum(axis=0)
false_pos = (~y_true & y_pred).sum(axis=0)
false_neg = (y_true & ~y_pred).sum(axis=0)
?
If there's some reason not to use a binary array, would set intersection and difference be better (in the case of many labels) than iterating through every label for each sample?
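For the binary-array case, here is a runnable version of the counting sketched above (toy data and variable names are illustrative, not taken from the PR):

```python
import numpy as np

# Toy label-indicator matrices: rows are samples, columns are labels.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]], dtype=bool)
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0]], dtype=bool)

true_pos = (y_true & y_pred).sum(axis=0)    # per-label true positives
false_pos = (~y_true & y_pred).sum(axis=0)  # per-label false positives
false_neg = (y_true & ~y_pred).sum(axis=0)  # per-label false negatives

print(true_pos, false_pos, false_neg)  # [2 1 0] [0 0 1] [0 1 1]
```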
Argh. I got it, I think. You're handling two different formats for the same multilabel data. Is that necessary?
Yes, I support both a dense binary indicator matrix and a list of lists of labels. Each has its pros and cons.
Supporting both formats avoids memory copies and keeps the advantages of each.
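To make the two formats concrete, here is a hedged sketch converting a list of lists into a dense indicator matrix with MultiLabelBinarizer from present-day scikit-learn (this transformer may not have existed when the PR was written and is not necessarily what the PR uses; it is only for illustration):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Sequence-of-sequences representation: each entry lists the labels present.
y_lists = [[0, 3], [1], [0, 2]]

# Dense indicator-matrix representation of the same targets.
mlb = MultiLabelBinarizer()
y_indicator = mlb.fit_transform(y_lists)
print(mlb.classes_)   # [0 1 2 3]
print(y_indicator)
# [[1 0 0 1]
#  [0 1 0 0]
#  [1 0 1 0]]
```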
So is set intersection not a better idea than doing a label-by-label comparison?
You are right. Done in ce395ac.
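A small sketch of what the set-based counting looks like for the sequence-of-sequences format (illustrative only; the actual change lives in ce395ac):

```python
# Set intersection/difference instead of looping over every label
# for each sample; the toy targets are made up.
y_true = [[0, 3], [1], [0, 2]]
y_pred = [[0], [1, 2], [2]]

for true, pred in zip(y_true, y_pred):
    true_set, pred_set = set(true), set(pred)
    common = true_set & pred_set     # labels both agree on
    missed = true_set - pred_set     # false negatives for this sample
    spurious = pred_set - true_set   # false positives for this sample
    print(len(common), len(missed), len(spurious))
```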
Yes, if possible. When reviewing, you should check whether the code is
Several tools can be useful, such as flake8, nose, coverage, and a line-by-line profiler. @ogrisel previously wrote a paragraph on how to do a good review, but I can't find it.
Really helpful @arjoly, thanks :D
@@ -1480,14 +1748,28 @@ def precision_recall_fscore_support(y_true, y_pred, beta=1.0, labels=None,
         avg_precision = np.mean(precision)
         avg_recall = np.mean(recall)
         avg_fscore = np.mean(fscore)
-    elif average == 'weighted':
+    elif average == 'weighted' and not is_multilabel(y_true):
The second condition here is redundant: you should have already returned from that case above.
Otherwise, LGTM.
Thanks, you raise a good point. I mixed the two up. I will add a new average keyword for multilabel data and correct the narrative doc. For reference, I followed the definitions of precision, recall and F-score from Tsoumakas et al. for multilabel output.
I fixed the confusion between the example-based and weighted precision, recall and F-score measures.
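For readers following along, a hedged sketch of the example-based (per-sample) definitions from Tsoumakas et al. mentioned above; the function name and the 1.0 convention for empty sets are assumptions of this sketch, not necessarily the PR's final behaviour:

```python
import numpy as np

def example_based_prf(y_true, y_pred):
    """Example-based precision, recall and F1 for list-of-lists targets."""
    precisions, recalls, f1s = [], [], []
    for true, pred in zip(y_true, y_pred):
        true_set, pred_set = set(true), set(pred)
        n_common = len(true_set & pred_set)
        # Convention assumed here: empty sets count as a perfect score.
        p = n_common / len(pred_set) if pred_set else 1.0
        r = n_common / len(true_set) if true_set else 1.0
        denom = len(true_set) + len(pred_set)
        f = 2.0 * n_common / denom if denom else 1.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    # Average the per-sample scores over all samples.
    return np.mean(precisions), np.mean(recalls), np.mean(f1s)

print(example_based_prf([[0, 3], [1], [0, 2]], [[0], [1, 2], [2]]))
```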
The model evaluation page in the narrative documentation is getting quite big, and I have the impression that it is becoming a copy of the reference documentation. It lists a bunch of metrics and explains what they are, but not really what they are for. I think it would be valuable to add some insight for users: in which cases should one prefer one metric over another? What is your opinion? (This can be done in another PR; I just wanted to raise the issue.)
Yes, this is missing. More examples are needed to illustrate the
Let's do this in another PR.
I rebased on top of master. If @jnothman and @Jim-Holmstroem agree on the latest version, I would happily merge this into master.
If you rename
I have renamed example to samples and I am waiting for your PR.
It didn't let me put in a PR to your fork for whatever reason. I think I'll just merge in your PR, and then propose my doc changes separately in case they need a moment's discussion.
Merged as d33634d.
|
if is_multilabel(y_true):
    # Handle mixed representations
    if type(y_true) != type(y_pred):
I just realised the implications of this condition. Here you're checking whether one is a sequence of sequences and the other is an indicator matrix. What if one was an array of sequences? I know that at the moment that's not allowed by `is_multilabel`, but I'm not sure we want to write code that would fail if `is_multilabel` is changed to accept it.
Accepting an array of sequences in the future would mean we can perform fast vectorized operations. Note that's the format used by `scipy.sparse.lil_matrix.rows`, which is essentially the same data structure we're working with.
So I think we need a helper function to do this `multilabel_compatible` check.
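A hedged sketch of what such a helper might look like; `multilabel_compatible` is the hypothetical name from this comment, not an existing scikit-learn function, and the heuristics are purely illustrative:

```python
import numpy as np

def multilabel_compatible(y_true, y_pred):
    """Hypothetical check that both targets use the same multilabel form.

    "sequences" covers a list of lists as well as a 1-d object array of
    lists (the lil_matrix.rows style); "indicator" covers 2-d arrays.
    A rectangular list of lists remains ambiguous, just as with the
    type-based comparison above.
    """
    def representation(y):
        y_arr = np.asarray(y, dtype=object)
        if y_arr.ndim == 2:
            return "indicator"
        first = y_arr[0] if y_arr.size else None
        if hasattr(first, "__len__") and not isinstance(first, str):
            return "sequences"
        return "other"

    return representation(y_true) == representation(y_pred)

# One is a sequence of sequences, the other an indicator matrix -> False.
print(multilabel_compatible(
    [[0, 3], [1], [0, 2]],
    np.array([[1, 0, 0, 1], [0, 1, 0, 0], [1, 0, 1, 0]])))
```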
I am not sure what the goal of this check is (the docstring does tell me all the shapes possible in the list of labels). However, it is a very weak test: what if I pass in an ndarray and a memmap? Also, a list is a valid array-like, and we tend to expect code to behave right for lists.
I am not sure if it would solve your use case, but could you simply convert both inputs to ndarrays, using np.asarray, and test for their shape/dimension? Judging by the discussion, I have the impression that you want something more flexible, but I don't understand how it would work or what its purpose would be (examples in the docstring are great for these purposes).
Basically, there are two forms of multilabel targets supported, corresponding to dense and sparse representations of the same thing.
So,
y = [
[0, 3],
[1],
[0, 2]
]
is a sequence-of-sequences representation. If the first element of `y` is a sequence, but not a string or ndarray, we assume `y` is multilabel and sequence-of-sequences. It is akin to `lil_matrix`, and looks like the non-multilabel targets format. (But I should correct my post above: `lil_matrix` uses an array of lists, and this comparison fails with an array of lists; we reject an array of arrays.) It is useful when there are very many sparsely-assigned classes.
The same data could be represented as an indicator matrix (dense-like):
y = np.array([
    [True, False, False, True],
    [False, True, False, False],
    [True, False, True, False]
])
The type comparison above is to make sure `y_pred` and `y_true` are of the same form and, if not, to convert to the latter. In using `type`, it would make an error if the sequence-of-sequences was itself an array being compared to a label indicator matrix (the condition should be True, but would be False).
> but could you simply convert both inputs to ndarrays, using np.asarray, and test for their shape/dimension?

If there are redundant labels in the list-of-lists-of-labels format, it won't work in all cases.

> However, it is a very weak test: what if I pass in an ndarray and a memmap?

I have no experience with memmap, so I hadn't considered this possibility. Is there a test somewhere that checks that scikit-learn works correctly with memmap? I had a quick look in test_common.py and found nothing.
I added several format invariance tests (see test_format_invariance_with_1d_vectors and test_multilabel_representation_invariance in test_metrics) to check that different formats give the same answer. Any suggestions are welcome.

> In using type, it would make an error if the sequence-of-sequences was itself an array, being compared to a label indicator matrix (the condition should be True, but would be False).

Can you give an example?
Take my two `y`s from above and wrap them in `np.asarray`: both would have the same type (but not the same dtype).
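Concretely, the pitfall: both representations end up as np.ndarray, so a type() comparison cannot distinguish them (sketch only; on recent NumPy the ragged array needs dtype=object):

```python
import numpy as np

# Sequence-of-sequences wrapped in an array: a 1-d object array of lists.
y_seq = np.asarray([[0, 3], [1], [0, 2]], dtype=object)

# The same targets as a label indicator matrix: a 2-d boolean array.
y_ind = np.asarray([[True, False, False, True],
                    [False, True, False, False],
                    [True, False, True, False]])

# Both are np.ndarray, so `type(y_true) != type(y_pred)` is False here
# even though the representations differ.
print(type(y_seq) is type(y_ind))  # True
print(y_seq.dtype, y_ind.dtype)    # object bool
print(y_seq.ndim, y_ind.ndim)      # 1 2
```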
This pull request adds multilabel output support for the following functions:
One more pull request to tackle issue #558.
Two questions:
When the result is nan, should the related measure be set to 0.0?
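A minimal sketch of the convention being asked about, i.e. replacing an undefined (nan) result with 0.0; whether 0.0 is the right fill value is exactly the open question:

```python
import numpy as np

true_pos = np.array([2, 0, 1])
size_pred = np.array([3, 0, 2])  # no predicted labels for the second class

with np.errstate(invalid="ignore", divide="ignore"):
    precision = true_pos / size_pred  # 0 / 0 -> nan

# Replace the undefined entry by 0.0 instead of propagating nan.
precision[np.isnan(precision)] = 0.0
print(precision)  # [0.66666667 0.         0.5       ]
```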