[MRG] Add multilabel support to precision, recall, fscore and classification report #1945
Conversation
Reviews are welcome!!!
finally:
    np.seterr(**old_err_settings)

precision[size_true == 0.] = 1.0
precision[size_true == 0]
Otherwise I think it looks good :) Should one hand-check the reference values for the tests?
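For reference, a minimal sketch of the division pattern under discussion, using np.errstate as a context manager instead of saving and restoring np.seterr in a finally block; the array names follow the diff above and the data is made up:

```python
import numpy as np

# Made-up per-label counts, named after the quantities in the diff above.
true_pos = np.array([3, 0, 2])
size_pred = np.array([4, 0, 2])  # predicted labels per column
size_true = np.array([5, 0, 3])  # true labels per column

# np.errstate restores the previous floating-point error settings on exit,
# which is what the seterr/finally pair in the diff does by hand.
with np.errstate(divide="ignore", invalid="ignore"):
    precision = true_pos / size_pred
    recall = true_pos / size_true

# Mirror the edge case handled in the diff: columns with no true labels
# get 1.0 instead of nan.
precision[size_true == 0] = 1.0
recall[size_true == 0] = 1.0
```

Whether 1.0 (or 0.0) is the right fill value here is part of the discussion at the end of this thread.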
y_pred != pos_label), axis=0)

else:
    for true, pred in zip(y_true, y_pred):
Could you please clarify the datatype of `y_true` and `y_pred`? Your use of `in` seems to suggest they are sequences of iterables (tuples?) over labels, but I got the impression from the use of `LabelBinarizer` and the indicator matrix case that we had a binary array. I don't know whether I've misread some code.
If it's represented as a binary array, can't you do something like:
true_pos = (y_true & y_pred).sum(axis=0)
false_pos = (~y_true & y_pred).sum(axis=0)
false_neg = (y_true & ~y_pred).sum(axis=0)
?
If there's some reason not to use a binary array, would set intersection and difference be better (in the case of many labels) than iterating through every label for each sample?
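For the binary-array case, here is a runnable version of the counting sketched above (toy data and variable names are illustrative, not taken from the PR):

```python
import numpy as np

# Toy label-indicator matrices: rows are samples, columns are labels.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]], dtype=bool)
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0]], dtype=bool)

true_pos = (y_true & y_pred).sum(axis=0)    # per-label true positives
false_pos = (~y_true & y_pred).sum(axis=0)  # per-label false positives
false_neg = (y_true & ~y_pred).sum(axis=0)  # per-label false negatives

print(true_pos, false_pos, false_neg)  # [2 1 0] [0 0 1] [0 1 1]
```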
Argh. I got it, I think. You're handling two different formats for the same multilabel data. Is that necessary?
Yes, I support both a dense binary indicator matrix and a list of lists of labels. Each has its pros and cons.
Supporting both formats avoids memory copies and keeps the advantages of each.
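To make the two formats concrete, here is a hedged sketch converting a list of lists into a dense indicator matrix with MultiLabelBinarizer from present-day scikit-learn (this transformer may not have existed when the PR was written and is not necessarily what the PR uses; it is only for illustration):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Sequence-of-sequences representation: each entry lists the labels present.
y_lists = [[0, 3], [1], [0, 2]]

# Dense indicator-matrix representation of the same targets.
mlb = MultiLabelBinarizer()
y_indicator = mlb.fit_transform(y_lists)
print(mlb.classes_)   # [0 1 2 3]
print(y_indicator)
# [[1 0 0 1]
#  [0 1 0 0]
#  [1 0 1 0]]
```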
So is set intersection not a better idea than doing a label-by-label comparison?
You are right. Done in ce395ac.
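A small sketch of what the set-based counting looks like for the sequence-of-sequences format (illustrative only; the actual change lives in ce395ac):

```python
# Set intersection/difference instead of looping over every label
# for each sample; the toy targets are made up.
y_true = [[0, 3], [1], [0, 2]]
y_pred = [[0], [1, 2], [2]]

for true, pred in zip(y_true, y_pred):
    true_set, pred_set = set(true), set(pred)
    common = true_set & pred_set     # labels both agree on
    missed = true_set - pred_set     # false negatives for this sample
    spurious = pred_set - true_set   # false positives for this sample
    print(len(common), len(missed), len(spurious))
```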
Yes, if possible. When reviewing, you should check whether the code is
Several tools can be useful, such as flake8, nose, coverage, and a line-by-line profiler. @ogrisel previously wrote a paragraph on how to do a good review, but I can't find it.
Really helpful @arjoly, thanks :D
@@ -1480,14 +1748,28 @@ def precision_recall_fscore_support(y_true, y_pred, beta=1.0, labels=None,
         avg_precision = np.mean(precision)
         avg_recall = np.mean(recall)
         avg_fscore = np.mean(fscore)
-    elif average == 'weighted':
+    elif average == 'weighted' and not is_multilabel(y_true):
The second condition here is redundant: you should have already returned from that case above.
Otherwise, LGTM.
Thanks, you raise a good point. I mixed the two up. I will add a new average keyword for multilabel data and correct the narrative doc. For reference, I followed the definitions of precision, recall and F-score from Tsoumakas et al. for multilabel output.
I fixed the confusion between the example-based and weighted precision, recall and F-score measures.
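For readers following along, a hedged sketch of the example-based (per-sample) definitions from Tsoumakas et al. mentioned above; the function name and the 1.0 convention for empty sets are assumptions of this sketch, not necessarily the PR's final behaviour:

```python
import numpy as np

def example_based_prf(y_true, y_pred):
    """Example-based precision, recall and F1 for list-of-lists targets."""
    precisions, recalls, f1s = [], [], []
    for true, pred in zip(y_true, y_pred):
        true_set, pred_set = set(true), set(pred)
        n_common = len(true_set & pred_set)
        # Convention assumed here: empty sets count as a perfect score.
        p = n_common / len(pred_set) if pred_set else 1.0
        r = n_common / len(true_set) if true_set else 1.0
        denom = len(true_set) + len(pred_set)
        f = 2.0 * n_common / denom if denom else 1.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    # Average the per-sample scores over all samples.
    return np.mean(precisions), np.mean(recalls), np.mean(f1s)

print(example_based_prf([[0, 3], [1], [0, 2]], [[0], [1, 2], [2]]))
```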
The model evaluation page in the narrative documentation is getting quite big, and I have the impression that it is becoming a copy of the reference documentation. It lists a bunch of metrics and explains what they are, but not really what they are for. I think it would be valuable to add some insight for users: in which cases should one prefer one metric over another? What is your opinion? (This can be done in another PR; I just wanted to raise the issue.)
Yes, this is missing. More examples are needed to illustrate the
Let's do this in another PR.
I rebased on top of master. If @jnothman and @Jim-Holmstroem agree on the latest version, I would happily merge this into master.
If you rename
I have renamed example to samples and I am waiting for your PR.
It didn't let me put in a PR to your fork for whatever reason. I think I'll just merge in your PR, and then propose my doc changes separately in case they need a moment's discussion.
Merged as d33634d.
|
if is_multilabel(y_true):
    # Handle mixed representations
    if type(y_true) != type(y_pred):
I just realised the implications of this condition. Here you're checking whether one is a sequence of sequences and the other is an indicator matrix. What if one was an array of sequences? I know that at the moment that's not allowed by `is_multilabel`, but I'm not sure we want to write code that would fail if `is_multilabel` is changed to accept it.
Accepting an array of sequences in the future would mean we can perform fast vectorized operations. Note that's the format used by `scipy.sparse.lil_matrix.rows`, which is essentially the same data structure we're working with.
So I think we need a helper function to do this `multilabel_compatible` check.
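A hedged sketch of what such a helper might look like; `multilabel_compatible` is the hypothetical name from this comment, not an existing scikit-learn function, and the heuristics are purely illustrative:

```python
import numpy as np

def multilabel_compatible(y_true, y_pred):
    """Hypothetical check that both targets use the same multilabel form.

    "sequences" covers a list of lists as well as a 1-d object array of
    lists (the lil_matrix.rows style); "indicator" covers 2-d arrays.
    A rectangular list of lists remains ambiguous, just as with the
    type-based comparison above.
    """
    def representation(y):
        y_arr = np.asarray(y, dtype=object)
        if y_arr.ndim == 2:
            return "indicator"
        first = y_arr[0] if y_arr.size else None
        if hasattr(first, "__len__") and not isinstance(first, str):
            return "sequences"
        return "other"

    return representation(y_true) == representation(y_pred)

# One is a sequence of sequences, the other an indicator matrix -> False.
print(multilabel_compatible(
    [[0, 3], [1], [0, 2]],
    np.array([[1, 0, 0, 1], [0, 1, 0, 0], [1, 0, 1, 0]])))
```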
I am not sure what the goal of this check is (the docstring does tell me all the shapes possible in the list of labels). However, it is a very weak test: what if I pass in an ndarray and a memmap? Also, a list is a valid array-like, and we tend to expect code to behave right for lists.
I am not sure if it would solve your use case, but could you simply convert both inputs to ndarrays, using np.asarray, and test for their shape/dimension? Judging by the discussion, I have the impression that you want something more flexible, but I don't understand how it would work or what its purpose would be (examples in the docstring are great for these purposes).
Basically, there are two forms of multilabel targets supported, corresponding to dense and sparse representations of the same thing.
So,
y = [
[0, 3],
[1],
[0, 2]
]
is a sequence-of-sequences representation. If the first element of `y` is a sequence, but not a string or ndarray, we assume `y` is multilabel and sequence-of-sequences. It is akin to `lil_matrix`, and looks like the non-multilabel targets format. (But I should correct my post above: `lil_matrix` uses an array of lists, and this comparison fails with an array of lists; we reject an array of arrays.) It is useful when there are very many sparsely-assigned classes.
The same data could be represented as an indicator matrix (dense-like):
y = np.array([
    [True, False, False, True],
    [False, True, False, False],
    [True, False, True, False]
])
The type comparison above is to make sure `y_pred` and `y_true` are of the same form and, if not, to convert to the latter. In using `type`, it would make an error if the sequence-of-sequences was itself an array being compared to a label indicator matrix (the condition should be True, but would be False).
> but could you simply convert both inputs to ndarrays, using np.asarray, and test for their shape/dimension?

If there are redundant labels in the list-of-lists-of-labels format, it won't work in all cases.

> However, it is a very weak test: what if I pass in an ndarray and a memmap?

I have no experience with memmap, so I hadn't considered this possibility. Is there a test somewhere that checks that scikit-learn works correctly with memmap? I had a quick look in test_common.py and found nothing.
I added several format invariance tests (see test_format_invariance_with_1d_vectors and test_multilabel_representation_invariance in test_metrics) to check that different formats give the same answer. Any suggestions are welcome.

> In using type, it would make an error if the sequence-of-sequences was itself an array, being compared to a label indicator matrix (the condition should be True, but would be False).

Can you give an example?
Take my two `y`s from above and wrap them in `np.asarray`: both would have the same type (but not the same dtype).
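Concretely, the pitfall: both representations end up as np.ndarray, so a type() comparison cannot distinguish them (sketch only; on recent NumPy the ragged array needs dtype=object):

```python
import numpy as np

# Sequence-of-sequences wrapped in an array: a 1-d object array of lists.
y_seq = np.asarray([[0, 3], [1], [0, 2]], dtype=object)

# The same targets as a label indicator matrix: a 2-d boolean array.
y_ind = np.asarray([[True, False, False, True],
                    [False, True, False, False],
                    [True, False, True, False]])

# Both are np.ndarray, so `type(y_true) != type(y_pred)` is False here
# even though the representations differ.
print(type(y_seq) is type(y_ind))  # True
print(y_seq.dtype, y_ind.dtype)    # object bool
print(y_seq.ndim, y_ind.ndim)      # 1 2
```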
This pull request adds multilabel output support for the following functions:
One more pull request to tackle issue #558.
Two questions:
When the result is nan, should the related measure be set to 0.0?
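A minimal sketch of the convention being asked about, i.e. replacing an undefined (nan) result with 0.0; whether 0.0 is the right fill value is exactly the open question:

```python
import numpy as np

true_pos = np.array([2, 0, 1])
size_pred = np.array([3, 0, 2])  # no predicted labels for the second class

with np.errstate(invalid="ignore", divide="ignore"):
    precision = true_pos / size_pred  # 0 / 0 -> nan

# Replace the undefined entry by 0.0 instead of propagating nan.
precision[np.isnan(precision)] = 0.0
print(precision)  # [0.66666667 0.         0.5       ]
```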