
[MRG] FIX remaining bug in precision, recall and fscore with multilabel data #1988


Closed
wants to merge 14 commits

Conversation

arjoly
Member

@arjoly commented May 22, 2013

Fix for precision, recall and fscore, as pointed out in #1945.

@jnothman
Member

Fine, but those are the wrong values because your precision and recall are swapped. I.e. in the indicator matrix test, we should have:

>>> p = np.array([1. / 3, 2. / 3, 1. / 3])
>>> r = np.array([1., 1., 1.])
>>> (1.25 * p * r / (.25 * p + r)).mean()
0.494...

(not .779 which is (1.25 * p * r / (.25 * r + p)).mean())

@arjoly
Member Author

arjoly commented May 22, 2013

From PR #1945, there is still one remaining question:

  • What should we do when the computation of the precision, recall or fscore leads to a NaN? Currently, if the computation leads to a NaN, the related measure is set to 0.0.

@jnothman
Member

Yes, I didn't see that question there. I think logically that when the denominator for P or R is 0, the metric should be 1, but this is not what previous versions of scikit-learn did. For backwards consistency, you should implement the = 0 version in average='samples'. But I would like to see this changed, with others' approval.

Finally, if the numerator is 0 for F-score (no true positives), the metric should be 0. If the denominator is 0, the user shouldn't be calling the function.

In any case, I've rewritten the entire function, but not fixed the tests in #1990. Thanks for your hard work on the way to this cleaner implementation.

@glouppe
Contributor

glouppe commented May 23, 2013

Yes, I didn't see that question there. I think logically that when the denominator for P or R is 0, the metric should be 1, but this is not what previous versions of scikit-learn did. For backwards consistency, you should implement the = 0 version in average='samples'. But I would like to see this changed, with others' approval.

I would keep the implementation as it is - i.e., return 0 if the denominator is 0. In addition, we could trigger a warning when such a case occurs since the value is actually undefined.

Note that this is what Weka does as well: it returns 0 in such cases.

Just my 2 cents.

@arjoly
Member Author

arjoly commented May 23, 2013

In Mulan, the multilabel extension of Weka, the precision is set to 1 if (tp + fp + fn == 0),
to 0 if (tp + fp == 0), and computed as usual otherwise.

At the moment, I will stick with the current default behaviour.

@jnothman Note that what we call average="samples" corresponds to their example-based metrics.
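For illustration, a minimal numpy sketch of that per-sample rule (the function name and the use of raw tp/fp/fn counts are mine, not Mulan's or scikit-learn's API):

import numpy as np

def mulan_style_precision(tp, fp, fn):
    # Per-sample precision with Mulan-like handling of the degenerate cases:
    # an entirely empty sample (no true and no predicted labels) counts as
    # perfect, predicting nothing when there are true labels counts as 0.
    tp, fp, fn = np.asarray(tp), np.asarray(fp), np.asarray(fn)
    precision = np.zeros(len(tp))
    empty = (tp + fp + fn) == 0
    no_pred = (tp + fp) == 0
    regular = ~no_pred
    precision[empty] = 1.0
    precision[regular] = tp[regular] / (tp + fp)[regular].astype(float)
    return precision

# Samples with (tp, fp, fn) = (0, 0, 0), (0, 0, 2), (1, 1, 0):
# mulan_style_precision([0, 0, 1], [0, 0, 1], [0, 2, 0]) -> array([1. , 0. , 0.5])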

@GaelVaroquaux
Member

I would keep the implementation as it is - i.e., return 0 if the
denominator is 0. In addition, we could trigger a warning when such a
case occurs since the value is actually undefined.

+1 and +1 for the warning.

@arjoly
Member Author

arjoly commented May 23, 2013

+1 for the warning

@jnothman
Member

Would "example" be better than "samples" for consistency with the literature? I suggested "samples" because this seems to be the term in common sklearn use for the first axis, where it might otherwise be "examples" or "instances".

A warning's a good idea. But actually, I think recall with 0 true samples is undefined, whereas, at least in the context of precision-recall curves, precision with 0 predicted samples (i.e. above the threshold at which any samples are recalled) should be 1.

@arjoly
Member Author

arjoly commented May 23, 2013

Warnings are now raised. However, they are not perfect: an undefined recall could raise a warning when precision_score is called. This can't be avoided due to the coupling in the computation of the precision, recall and fscore in the current implementation.

I have also removed the pos_label argument for the binary indicator format.

f_score[(beta2 * size_pred + size_true) == 0] = 1.0
precision[size_pred == 0] = 0.0
recall[size_true == 0] = 0.0
f_score[(beta2 * size_true + size_pred) == 0] = 0.0
Member Author

Warning messages are not raised for this case.

Member

I can understand why you might want that to be so, but to me it's a bigger problem that in a multilabel problem with these true values:

[
  [],
  [1]
]

For the first example, you get 0 recall and 0 precision no matter what you predict, when surely [] is as precise a prediction as you can make! In the second case, if you predict 1 as well as infinitely many wrong values, i.e. y_pred[1] = [1, 2, 3, 4, ...], you still get recall = 1, while precision -> 0!

This also applies to label-based averaging, although there such cases are considered pathological.
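To make the degenerate case concrete, here is a small numpy sketch of sample-averaged precision and recall under the current "0 when undefined" convention (the computation below is illustrative, not scikit-learn's implementation):

import numpy as np

# Indicator matrices over three labels: the first sample has no true labels.
y_true = np.array([[0, 0, 0],
                   [1, 0, 0]])
y_pred = np.array([[0, 0, 0],   # the "perfect" empty prediction
                   [1, 1, 1]])  # the correct label plus two wrong ones

tp = (y_true * y_pred).sum(axis=1).astype(float)
size_pred = y_pred.sum(axis=1)
size_true = y_true.sum(axis=1)

precision = np.where(size_pred > 0, tp / np.maximum(size_pred, 1), 0.0)
recall = np.where(size_true > 0, tp / np.maximum(size_true, 1), 0.0)

print(precision.mean())  # 0.1667: the empty-but-correct first sample drags it down
print(recall.mean())     # 0.5: the second sample gets full recall despite the noise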

Member Author

Yes, I agree.

But this means we have to change the default behaviour...

Member

No, this means sample-based averaging is not a good idea if some samples have y_true == []. Use micro instead.

@arjoly
Member Author

arjoly commented May 24, 2013

A note on the fact that micro-precision, micro-f1 and micro-recall might not be equal is needed.

@jnothman
Member

A note on the fact that micro-precision, micro-f1 and micro-recall might not be equal is needed.

You mean the contrary? A note that micro-PRF is equal in certain cases? I don't actually think this belongs in the docstring, but in the narrative doc, where the advantages and disadvantages of each metric will hopefully be described. Certainly, it's not related to this PR.
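To illustrate the point with a recent scikit-learn (the toy data here is made up for the example):

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Single-label multiclass input: every false positive for one class is a
# false negative for another, so micro-averaged P, R and F1 coincide.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 1, 1, 0]
print(precision_recall_fscore_support(y_true, y_pred, average="micro"))
# -> (0.666..., 0.666..., 0.666..., None)

# Multilabel indicator input: a sample can have extra or missing labels, so
# total FP and FN can differ and micro P, R and F1 need not be equal.
Y_true = np.array([[1, 0, 0], [1, 1, 0]])
Y_pred = np.array([[1, 1, 1], [1, 0, 0]])
print(precision_recall_fscore_support(Y_true, Y_pred, average="micro"))
# -> precision 0.5, recall 0.666..., F1 0.571...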

assert_almost_equal(r, 1)
assert_almost_equal(f, 1)
assert_equal(s, None)
with warnings.catch_warnings(True):
Member

Since we have considered this warning a functional requirement, surely we should assert that it actually happens.

Perhaps we should copy numpy's assert_warns. (And don't just import it: the return value was None in 1.6.)

I'm not entirely satisfied with that either, in that it only allows you to test the first warning, and then only by class. The alternative may be a bit overblown, but can be found at https://gist.github.com/jnothman/5650845
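For context, a minimal backport in the spirit of numpy's assert_warns could look like the sketch below (this is illustrative, not the helper from this PR):

import warnings

def assert_warns(warning_class, func, *args, **kwargs):
    # Call func while recording warnings, check that at least one warning of
    # the expected class was raised, and return func's result so the caller
    # can keep asserting on it (unlike numpy 1.6, which returned None).
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = func(*args, **kwargs)
    if not any(issubclass(w.category, warning_class) for w in caught):
        raise AssertionError("%s was not raised" % warning_class.__name__)
    return result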

Member Author

This would require at least a backport since it is not present in numpy 1.3.

@arjoly
Member Author

arjoly commented May 28, 2013

I don't understand why the warning is not raised at the moment. :-(

Otherwise, I think that I have taken into account all your remarks @jnothman.

@jnothman
Member

I don't understand why the warning is not raised at the moment. :-(

It's only failing for 'samples'... but I also can't understand why.

@jnothman
Member

Note that if you run the test by itself, there's no error:

$ nosetests sklearn/metrics/tests/test_metrics.py:test_precision_recall_f1_no_labels

This suggests the warning not being caught is due to interaction with other places where that warning has already been issued. Perhaps assert_warns (and its warnings.simplefilter(...)) is not working very well. Maybe you're best off reverting the port of assert_warns (sorry! I only wish this were easier to test).

And when tests are passing you might want to change the title of this PR from WIP to MRG... As a bugfix, I think we want it in master soon.

@jnothman
Member

In particular, the test fails only if run after one of two other tests:

$ tests=$(grep '^def test_' sklearn/metrics/tests/test_metrics.py | sed 's/def //;s/(.*)://')
$ for f in $tests
> do
>   nosetests sklearn/metrics/tests/test_metrics.py:$f \
>             sklearn/metrics/tests/test_metrics.py:test_precision_recall_f1_no_labels \
>             2>/dev/null ||
>   echo $f
> done
test_multilabel_representation_invariance
test_multilabel_invariance_with_pos_labels

It may not be about the warning filter failing, as the following runs without a problem:

$ nosetests sklearn/metrics/tests/test_metrics.py:test_precision_recall_f1_no_labels sklearn/metrics/tests/test_metrics.py:test_precision_recall_f1_no_labels

@jnothman
Member

Right. Apparently the solution is to stick warnings.simplefilter('always') inside the catch_warnings context (at least in those two tests). I don't understand why this is necessary, seeing as assert_warns creates its own context in which it calls simplefilter('always'). Surely that should mean it is independent of prior filter settings and the record of previously-seen messages...?
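For readers following along, the pattern being described is roughly this (a generic sketch, not the test code from the PR):

import warnings

def emit():
    warnings.warn("recall is undefined", UserWarning)

with warnings.catch_warnings(record=True) as caught:
    # Without this, a warning already issued earlier in the process can be
    # swallowed by the per-module __warningregistry__ (the Python bug
    # mentioned below), so nothing would be recorded even though warn() ran.
    warnings.simplefilter("always")
    emit()

assert len(caught) == 1
assert issubclass(caught[0].category, UserWarning)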

@jnothman
Member

I should stop dwelling on this, but I think the right way to handle warnings in tests is:

  • have a separate test to check that a warning is raised in certain cases, in which you first assert that no warning is raised in a valid variant, then assert that one is raised in a problematic variant;
  • tests that are known to raise warnings as a side effect, where that is not the purpose of the test, should have a decorator available, like @ignore_warnings([DeprecationWarning, UserWarning]) (see the sketch after this list);
  • deprecation warnings should probably be tested in a separate test module.
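A rough sketch of such a decorator (the name @ignore_warnings and its signature are the suggestion above, not necessarily what ended up in sklearn.utils):

import warnings
from functools import wraps

def ignore_warnings(categories=(Warning,)):
    # Decorator factory: run the wrapped test with the given warning
    # categories filtered out, restoring the filters afterwards.
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            with warnings.catch_warnings():
                for category in categories:
                    warnings.simplefilter("ignore", category)
                return func(*args, **kwargs)
        return wrapper
    return decorator

@ignore_warnings([DeprecationWarning, UserWarning])
def test_something_noisy():
    warnings.warn("side effect", UserWarning)  # silenced during this test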

@jnothman
Member

Testing warnings is messy. I have drafted a few notes on a more ideal solution, but the current problem with this PR is due to a known Python bug.

@jnothman
Member

They might belong to another PR, but #2094 reminded me of a couple of issues that should be tested in relation to binary data (not that they address #2094 directly):

  • ensure that if y_true and y_pred include a single class label, it is treated as binary and pos_label is observed (where average is not None). (Perhaps this is already being tested.)
  • ensure that if labels is specified and has more than two entries, it is treated as multiclass.

@arjoly
Member Author

arjoly commented Jul 8, 2013

The warning bug is fixed!!! Thanks @jnothman.

I switched this PR to the MRG state.

@GaelVaroquaux
Member

About warnings (I am replying to old comments by @jnothman in this thread): we might consider implementing our own warnings, and setting them to be always raised, and not only once. In sklearn.utils.__init__ there is currently a hack that forces DeprecationWarning to be always raised.
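The kind of hack being referred to is essentially the following (an illustrative one-liner, not a quote of the actual sklearn code):

import warnings

# Force DeprecationWarning to be shown every time it is triggered, instead of
# once per call site as the default filter would do.
warnings.simplefilter("always", DeprecationWarning)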

"samples. ")

if warning_msg:
    warnings.warn(warning_msg)
Member

Do we want a "stacklevel" here? I think that stacklevel=2 would be a good idea.
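For reference, stacklevel=2 makes the warning point at the caller of the metric rather than at the warnings.warn line itself; a generic illustration (not the PR's code):

import warnings

def some_metric(values):
    # Hypothetical helper: with stacklevel=2 the reported filename and line
    # number are those of the code that called this function, which is
    # usually what the user needs to see.
    if not values:
        warnings.warn("metric is undefined for empty input, setting it to 0.0",
                      UserWarning, stacklevel=2)
        return 0.0
    return sum(values) / float(len(values))

some_metric([])  # the warning is attributed to this line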

Member Author

See 0f0f4a3

@GaelVaroquaux
Member

The code looks good. With regards to the maths, I will have to trust you on the changes, and to hope that the tests are good, as I haven't done the maths.

@@ -1446,11 +1494,12 @@ def test_precision_recall_f1_score_with_an_empty_prediction():
    y_true_bi = lb.transform(y_true_ll)
    y_pred_bi = lb.transform(y_pred_ll)

    warnings.simplefilter("ignore")
Member Author

Not sure if that is the best way to silence the warning.
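One possible alternative, sketched here with the test's y_true_bi / y_pred_bi and an assumed average="samples" call (a suggestion, not the code that was merged), is to scope the filter to the call:

import warnings

from sklearn.metrics import precision_recall_fscore_support

with warnings.catch_warnings():
    # Ignore warnings only inside this block instead of mutating the
    # process-wide filters for the rest of the test run.
    warnings.simplefilter("ignore")
    p, r, f, s = precision_recall_fscore_support(y_true_bi, y_pred_bi,
                                                 average="samples")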

@arjoly
Member Author

arjoly commented Jul 11, 2013

LGTM, but this is somewhat outside the feature set I use, so before
merging it would be good to have a second point of view.

Awesome! Thanks for the review!

@jnothman
Member

Oh dear. Travis says that Python warning bug strikes again; all catch_warnings uses in tests need an always filter! It's a good reason to use something like assert_warns.

Btw, for whoever else reviews this if another is needed (@glouppe?), the focus should be on the tests, as the implementation is likely to change soon.

@jnothman
Member

It can go within any warning context. If you mean to switch it off for all tests: you can write a setup function for nosetests that sets up a warning context. It's not the most pleasant of code:

import warnings

def setup():
    # Module-level nose fixture: enter a catch_warnings context so the
    # filter installed below is undone again in teardown().
    global warning_ctx
    warning_ctx = warnings.catch_warnings()
    warning_ctx.__enter__()
    warnings.simplefilter('ignore', UndefinedPRFWarning)

def teardown():
    global warning_ctx
    warning_ctx.__exit__()

@arjoly I know the goal of the PR, but among the bugs was some disagreement on how to handle this case, and we seem to have agreed that warnings are the solution.

@amueller
Member

No, I wanted to switch it off for all users ;)

@arjoly
Member Author

arjoly commented Jul 25, 2013

OK, let's live with it. I removed those commits.

@jnothman
Member

@amueller you mean you don't think there should be a warning?

@amueller
Member

@jnothman I'm not sure, but warning too much tends to be just annoying.

@jnothman
Member

I agree that warnings can be annoying, and they're particularly annoying in test output or when in a module you've not called directly. Hence they can be easily disabled, and we could do so by default with a warnings filter in the metrics module.

But sometimes they're also important, such as to suggest the user should have chosen a different setting for their input. In these cases we don't expect the warning to be frequent, and if the user knows what they're doing they can disable it.

I think the default warnings setting shows each warning once per execution; however, this is once per identical (class, message, triggering module/line). Frustratingly, this means warning messages need to be non-specific.

In sum:

  • use warnings where they indicate the user may have passed a bad data/settings combination
  • avoid detail in the message so filtering works
  • if you think users will wish to silence that warning in particular, use a custom subclass of UserWarning (see the sketch after this list)
  • ideally, warnings that are functional requirements should be tested, but in practice this can be painful
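For instance, a custom warning class along these lines lets users silence exactly this category (UndefinedMetricWarning is an illustrative name, not necessarily what the module defines):

import warnings

class UndefinedMetricWarning(UserWarning):
    """Illustrative warning category for ill-defined metric values."""

# Library side: keep the message generic so the default "once per location"
# filtering still collapses repeats.
def _warn_undefined():
    warnings.warn("Metric is ill-defined and is being set to 0.0",
                  UndefinedMetricWarning, stacklevel=2)

# User side: silence just this category, leaving other warnings visible.
warnings.simplefilter("ignore", UndefinedMetricWarning)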

@amueller
Member

should we merge this now?

@amueller
Member

could you please rebase / merge? I guess your example removals caused merge problems.

@jnothman
Member

As far as I'm concerned, it's good to merge.

@arjoly
Member Author

arjoly commented Jul 27, 2013

@amueller I rebased on top of master.

@amueller
Member

merged :) thanks a lot!

@amueller closed this Jul 27, 2013
@arjoly deleted the fix-prf branch July 27, 2013 13:19
@arjoly
Member Author

arjoly commented Jul 27, 2013

Cool!!! :-) Finally merged, thanks to all the reviewers!!!
