
[MRG] Add multilabel support to precision, recall, fscore and classification report #1945


Closed

Conversation

arjoly
Member

@arjoly arjoly commented May 7, 2013

This pull request adds multilabel output support for the following functions:

  • classification_report
  • f1_score
  • fbeta_score
  • precision_recall_fscore_support
  • precision_score
  • recall_score

One more pull request to tackle issue #558.

Two questions:

  • What should we do when the computation of precision, recall or fscore leads to a nan? Currently, if the computation leads to a nan, the related measure is set to 0.0; is that the right behaviour? (See the sketch just after this list.)
  • Which tests should I change to use the new precision, recall and fscore functions?
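For concreteness, here is a minimal numpy sketch of the nan case from the first question (made-up counts, not the code in this PR):

import numpy as np

# Per-label counts for three labels; the last label is never predicted,
# so precision = tp / (tp + fp) is 0 / 0 = nan for it.
tp = np.array([3.0, 1.0, 0.0])
fp = np.array([1.0, 0.0, 0.0])

with np.errstate(divide='ignore', invalid='ignore'):
    precision = tp / (tp + fp)

# The current behaviour discussed above: replace nan by 0.0.
precision[np.isnan(precision)] = 0.0
print(precision)  # [0.75 1.   0.  ]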

@arjoly
Member Author

arjoly commented May 7, 2013

Found that test_multiclass.py uses custom multilabel precision and recall metrics => fixed in 60892eced544ffed80dc559b0efad12fc92532e1.

@arjoly
Member Author

arjoly commented May 7, 2013

Reviews are welcome!

finally:
    np.seterr(**old_err_settings)

precision[size_true == 0.] = 1.0
Contributor

precision[size_true == 0]

@Jim-Holmstroem
Contributor

Otherwise I think it looks good :) Should one hand-check the reference values for the tests?

y_pred != pos_label), axis=0)

else:
    for true, pred in zip(y_true, y_pred):
Member

Could you please clarify to me the datatype of y_true and y_pred? Your use of in seems to suggest they are sequences of iterables (tuples?) over labels. But I got the impression from the use of LabelBinarizer and the indicator matrix case that we had a binary array. I don't know whether I've misread some code.

If it's represented as a binary array, can't you do something like:

true_pos = (y_true & y_pred).sum(axis=0)
false_pos = (~y_true & y_pred).sum(axis=0)
false_neg = (y_true & ~y_pred).sum(axis=0)

?
If there's some reason not to use a binary array, would set intersection and difference be better (in the case of many labels) than iterating through every label for each sample?
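For illustration, a minimal sketch of the set-based alternative (the toy data and the global counting over all samples are my own choices here, not code from this PR):

y_true = [[0, 3], [1], [0, 2]]
y_pred = [[0], [1, 2], [2, 3]]

true_pos = false_pos = false_neg = 0
for true, pred in zip(y_true, y_pred):
    true_set, pred_set = set(true), set(pred)
    true_pos += len(true_set & pred_set)   # labels both predicted and true
    false_pos += len(pred_set - true_set)  # predicted but not true
    false_neg += len(true_set - pred_set)  # true but not predicted

print(true_pos, false_pos, false_neg)  # 3 2 2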

Member

Argh. I got it, I think. You're handling two different formats for the same multilabel data. Is that necessary?

Member Author

Yes, I support both a dense binary indicator matrix and a list of lists of labels; each has its pros and cons.
I support both formats to avoid memory copies and to keep the advantages of each.

Member

So is set intersection not a better idea than doing a label-by-label comparison?

Member Author

You are right. Done in ce395ac.

@arjoly
Member Author

arjoly commented May 8, 2013

Otherwise I think it looks good :) Should one hand-check the reference values for the tests?

Yes if possible.

When reviewing, you should check whether the code is:

  • correct;
  • clean: PEP 8, PEP 257, follows the standard API;
  • well documented: narrative doc, good docstrings, examples, references, notes for the end user, notes for maintenance, good function names;
  • well tested: high test coverage, edge cases covered, correct behavior checked;
  • sufficiently fast: no obvious optimisation left to perform.

Several tools can be useful, such as flake8, nose, coverage and a line-by-line profiler.

Previously @ogrisel wrote a paragraph on how to do a good review, but I can't find it.

@Jim-Holmstroem
Contributor

Really helpful @arjoly, thanks :D The good review text would be great

@@ -1480,14 +1748,28 @@ def precision_recall_fscore_support(y_true, y_pred, beta=1.0, labels=None,
         avg_precision = np.mean(precision)
         avg_recall = np.mean(recall)
         avg_fscore = np.mean(fscore)
-    elif average == 'weighted':
+    elif average == 'weighted' and not is_multilabel(y_true):
Member

The second condition here is redundant: you should have already returned from that case above.

@jnothman
Member

jnothman commented May 8, 2013

I think average="weighted" needs a bit more clarification in the docstring. For multi-label it seems to calculate each metric for each sample and return the mean across samples. In single-label, it calculates metrics for each label and averages them weighted by the "true" label distribution. Do they have any correspondence?
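To make the single-label behaviour concrete, here is a small sketch with invented data (using today's scikit-learn API, not the code under review):

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 0, 1, 2, 2]
y_pred = [0, 1, 0, 1, 2, 0]

# average='weighted': per-label metrics averaged with the true label
# distribution (the support) as weights.
p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')

# The same thing written out by hand from the per-label values.
p_l, r_l, f_l, support = precision_recall_fscore_support(y_true, y_pred, average=None)
assert np.isclose(p, np.average(p_l, weights=support))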

Otherwise, LGTM.

@arjoly
Member Author

arjoly commented May 8, 2013

I think average="weighted" needs a bit more clarification in the docstring. For multi-label it seems to calculate each metric for each sample and return the mean across samples. In single-label, it calculates metrics for each label and averages them weighted by the "true" label distribution. Do they have any correspondence?

Thanks, you raise a pretty good point. I mixed the two up. I will add a new average keyword for multilabel data and correct the narrative doc.

For reference, I followed the definitions of precision, recall and f-score from Tsoumakas et al. for multilabel output.
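For readers following along, a small sketch of the example-based definitions as I read them from that reference (toy data; this ignores the empty-set corner cases discussed earlier):

import numpy as np

y_true = [{0, 3}, {1}, {0, 2}]
y_pred = [{0}, {1, 2}, {2, 3}]

# Example-based (per-sample) precision, recall and F1, averaged over samples.
p = np.mean([len(t & z) / len(z) for t, z in zip(y_true, y_pred)])
r = np.mean([len(t & z) / len(t) for t, z in zip(y_true, y_pred)])
f1 = np.mean([2 * len(t & z) / (len(t) + len(z))
              for t, z in zip(y_true, y_pred)])
print(p, r, f1)  # approximately 0.667 0.667 0.611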

@arjoly
Member Author

arjoly commented May 10, 2013

I fixed the confusion between the example-based and the weighted precision, recall and f-score measures.

@glouppe
Contributor

glouppe commented May 10, 2013

The model evaluation page in the narrative documentation is getting quite big, and I have the impression that it is becoming a copy of the reference documentation. It lists a bunch of metrics and explains what they are, but not really what they are for. I think it would be valuable to add some insights for users: in which cases should one prefer one metric over another? What is your opinion?

(This can be done in another PR, I just wanted to raise the issue.)

@arjoly
Member Author

arjoly commented May 10, 2013

Yes, this is something that's missing.

More examples are needed to illustrate the metrics module, and others should be rewritten to give more insight to our users. Moreover, something similar to the clustering metrics documentation would be a great addition.

@arjoly
Member Author

arjoly commented May 17, 2013

The model evaluation page in the narrative documentation is getting quite big and I have the impression that it is becoming a copy of the reference documentation. It lists a bunch of metrics, explains what they are, but not really what they are for. I think it would be valuable to add some insights for the users: in which cases one should prefer a metric over the other? What is your opinion?

(This can be done in another PR, I just wanted to raise the issue.)

Let's do this in another PR.

@arjoly
Member Author

arjoly commented May 17, 2013

I rebased on top of master.

If @jnothman and @Jim-Holmstroem agree on the last version, I would happily merge this into the master trunk.

@jnothman
Member

If you rename average="example" to average="samples", I'm more than happy with it. We can patch in my documentation rewrite separately (as it seems I failed to submit a PR to your scikit-learn fork).

@arjoly
Member Author

arjoly commented May 19, 2013

I have renamed example to samples and I am waiting for your PR.

@jnothman
Member

It didn't let me put in a PR to your fork for whatever reason. I think I'll just merge in your PR, and then propose my doc changes separately in case they need a moment's discussion.

@jnothman
Member

merged as d33634d


if is_multilabel(y_true):
    # Handle mix representation
    if type(y_true) != type(y_pred):
Member

I just realised the implications of this condition. Here you're checking if one is a sequence of sequences and the other is an indicator matrix. What if the one was an array of sequences? I know at the moment that's not allowed by is_multilabel, but I'm not sure we want to write code that would fail if is_multilabel is changed to accept it.

Accepting an array of sequences in the future will mean we can perform fast vectorized operations. Note that's the format used in scipy.sparse.lil_matrix.rows, which is essentially the same data structure that we're working with.

So I think we need a helper function to do this multilabel_compatible check.
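For what it's worth, a rough sketch of what such a content-based helper could look like (the names and the heuristic are mine, not something in this PR):

import numpy as np

def _is_label_indicator_matrix(y):
    # A 2-d, non-object array-like (ndarray, memmap, ...) is treated as a
    # dense indicator matrix; anything else (list of lists, object array of
    # sequences) is treated as a sequence of sequences.
    return (getattr(y, 'ndim', None) == 2
            and getattr(y, 'dtype', None) != np.dtype(object))

def same_multilabel_representation(y_true, y_pred):
    # Compare the representations by content rather than by container type.
    return _is_label_indicator_matrix(y_true) == _is_label_indicator_matrix(y_pred)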

Member

I am not sure what the goal of this check is (the docstring does tell me all the possible shapes in the list of labels). However, it is a very weak test: what if I pass in an ndarray and a memmap? Also, a list is a valid array-like, and we tend to expect code to behave right for lists.

I am not sure if it would solve your use case, but could you simply convert both inputs to ndarrays, using np.asarray, and test for their shape/dimension? Judging by the discussion, I have the impression that you want something more flexible, but I don't understand how it would work or what the purpose would be (examples in the docstring are great for these purposes).

Member

Basically, there are two forms of multilabel targets supported, corresponding to dense and sparse representations of the same thing.

So,

y = [
    [0, 3],
    [1],
    [0, 2]
]

is a sequence-of-sequences representation. If the first element of y is a sequence, but not a string or ndarray, we assume y is multilabel and sequence-of-sequences. It is akin to lil_matrix, and looks like the non-multilabel targets format. (But I should correct my post above: lil_matrix uses an array of lists. But this comparison fails with an array of lists; we reject an array of arrays.) It is useful when there are very many sparsely-assigned classes.

The same data could be represented as an indicator matrix (dense-like):

y = array([
    [True, False, False, True],
    [False, True, False, False],
    [True, False, True, False]
])

The type comparison above is to make sure y_pred and y_true are of the same form, and if not, to convert to the latter. Using type would give the wrong result if the sequence of sequences were itself an array being compared to a label indicator matrix (the condition should be True, but would be False).

Member Author

but could you simply convert both inputs to ndarrays, using np.asarray, and test for their shape/dimension?

If there are redundant labels in the list-of-lists-of-labels format, it won't work in all cases.

However, it is a very weak test: what if I pass in an ndarray and a memmap?

I have no experience with memmap, so I haven't considered this possibility.
Is there a test somewhere that checks that scikit-learn works correctly with memmap? I had a quick look in test_common.py and found nothing.

I added several format invariance tests (see test_format_invariance_with_1d_vectors and test_multilabel_representation_invariance in test_metrics) to check that different formats give the same answer. Any suggestion is welcome.

Using type would give the wrong result if the sequence of sequences were itself an array being compared to a label indicator matrix (the condition should be True, but would be False).

Can you give an example?

Member

Take my two ys from above and wrap them in np.asarray. Both would have the same type (but not the same dtype).
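A quick sketch of that point (the dtype=object is only there so that recent numpy versions accept the ragged list):

import numpy as np

y_seq = np.asarray([[0, 3], [1], [0, 2]], dtype=object)   # sequence of sequences
y_ind = np.asarray([[True, False, False, True],
                    [False, True, False, False],
                    [True, False, True, False]])           # label indicator matrix

print(type(y_seq) is type(y_ind))  # True: a plain type comparison cannot tell them apart
print(y_seq.dtype, y_ind.dtype)    # object bool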

@arjoly arjoly deleted the precision-recall-f-multilabel branch May 28, 2013 07:07