
[MRG] FIX remaining bug in precision, recall and fscore with multilabel data #1988


Closed
wants to merge 14 commits

Conversation

arjoly
Member

@arjoly commented May 22, 2013

Fix for precision, recall and fscore, as pointed out in #1945.

@jnothman
Member

Fine, but those are the wrong values because your precision and recall are swapped. I.e. in the indicator matrix test, we should have:

>>> p = np.array([1. / 3, 2. / 3, 1. / 3])
>>> r = np.array([1., 1., 1.])
>>> (1.25 * p * r / (.25 * p + r)).mean()
0.494...

(not .779 which is (1.25 * p * r / (.25 * r + p)).mean())

@arjoly
Member Author

arjoly commented May 22, 2013

From PR #1945, there is still one remaining question:

  • What should we do when the computation of the precision, recall or fscore leads to a NaN? Currently, if the computation leads to a NaN, the related measure is set to 0.0.

@jnothman
Member

Yes, I didn't see that question there. I think logically that when the denominator for P or R is 0, the metric should be 1, but this is not what previous versions of scikit-learn did. For backwards consistency, you should implement the = 0 version in average='samples'. But I would like to see this changed, with others' approval.

Finally, if the numerator is 0 for F-score (no true positives), the metric should be 0. If the denominator is 0, the user shouldn't be calling the function.

In any case, I've rewritten the entire function, but not fixed the tests in #1990. Thanks for your hard work on the way to this cleaner implementation.

@glouppe
Contributor

glouppe commented May 23, 2013

Yes, I didn't see that question there. I think logically that when the denominator for P or R is 0, the metric should be 1, but this is not what previous versions of scikit-learn did. For backwards consistency, you should implement the = 0 version in average='samples'. But I would like to see this changed, with others' approval.

I would keep the implementation as it is - i.e., return 0 if the denominator is 0. In addition, we could trigger a warning when such a case occurs since the value is actually undefined.

Note that this is what Weka does as well: it returns 0 in such cases.

Just my 2 cents.

@arjoly
Member Author

arjoly commented May 23, 2013

In Mulan, the multilabel extension of Weka, the precision is set to 1 if (tp + fp + fn == 0),
to 0 if (tp + fp == 0), and computed as usual otherwise.

At the moment, I will stick with the current default behaviour.

@jnothman Note that what we call average="samples" corresponds to their example-based metrics.
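For illustration, a minimal numpy sketch of that per-sample rule (the function name and the use of raw tp/fp/fn counts are mine, not Mulan's or scikit-learn's API):

import numpy as np

def mulan_style_precision(tp, fp, fn):
    # Per-sample precision with Mulan-like handling of the degenerate cases:
    # an entirely empty sample (no true and no predicted labels) counts as
    # perfect, predicting nothing when there are true labels counts as 0.
    tp, fp, fn = np.asarray(tp), np.asarray(fp), np.asarray(fn)
    precision = np.zeros(len(tp))
    empty = (tp + fp + fn) == 0
    no_pred = (tp + fp) == 0
    regular = ~no_pred
    precision[empty] = 1.0
    precision[regular] = tp[regular] / (tp + fp)[regular].astype(float)
    return precision

# Samples with (tp, fp, fn) = (0, 0, 0), (0, 0, 2), (1, 1, 0):
# mulan_style_precision([0, 0, 1], [0, 0, 1], [0, 2, 0]) -> array([1. , 0. , 0.5])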

@GaelVaroquaux
Member

I would keep the implementation as it is - i.e., return 0 if the
denominator is 0. In addition, we could trigger a warning when such a
case occurs since the value is actually undefined.

+1 and +1 for the warning.

@arjoly
Member Author

arjoly commented May 23, 2013

+1 for the warning

@jnothman
Member

Would "example" be better than "samples" for consistency with the literature? I suggested "samples" because this seems to be the term in common sklearn use for the first axis, where it might otherwise be "examples" or "instances".

A warning's a good idea. But actually, I think recall with 0 true samples is undefined, whereas, at least in the context of precision-recall curves, precision with 0 predicted samples (i.e. above the threshold at which any samples are recalled) should be 1.

@arjoly
Member Author

arjoly commented May 23, 2013

Warnings are now raised. However, they are not perfect: an undefined recall could raise a warning when precision_score is called. This can't be avoided due to the coupling in the computation of the precision, recall and fscore in the current implementation.

I have also removed the pos_label argument for the binary indicator format.

f_score[(beta2 * size_pred + size_true) == 0] = 1.0
precision[size_pred == 0] = 0.0
recall[size_true == 0] = 0.0
f_score[(beta2 * size_true + size_pred) == 0] = 0.0
Member Author

Warning messages are not raised for this case.

Member

I can understand why you might want that to be so, but to me it's a bigger problem that in a multilabel problem with these true values:

[
  [],
  [1]
]

For the first example, you get 0 recall and 0 precision no matter what you predict, when surely [] is as precise a prediction as you can make! In the second case, if you predict 1 as well as infinitely many wrong values, i.e. y_pred[1] = [1, 2, 3, 4, ...], you still get recall = 1, while precision -> 0!

This also applies to label-based averaging, although there such cases are considered pathological.
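To make the degenerate case concrete, here is a small numpy sketch of sample-averaged precision and recall under the current "0 when undefined" convention (the computation below is illustrative, not scikit-learn's implementation):

import numpy as np

# Indicator matrices over three labels: the first sample has no true labels.
y_true = np.array([[0, 0, 0],
                   [1, 0, 0]])
y_pred = np.array([[0, 0, 0],   # the "perfect" empty prediction
                   [1, 1, 1]])  # the correct label plus two wrong ones

tp = (y_true * y_pred).sum(axis=1).astype(float)
size_pred = y_pred.sum(axis=1)
size_true = y_true.sum(axis=1)

precision = np.where(size_pred > 0, tp / np.maximum(size_pred, 1), 0.0)
recall = np.where(size_true > 0, tp / np.maximum(size_true, 1), 0.0)

print(precision.mean())  # 0.1667: the empty-but-correct first sample drags it down
print(recall.mean())     # 0.5: the second sample gets full recall despite the noise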

Member Author

Yes, I agree.

But this means we have to change the default behaviour...

Member

No, this means sample-based averaging is not a good idea if some samples have y_true == []. Use micro instead.

@arjoly
Member Author

arjoly commented May 24, 2013

A note on the fact that micro-precision, micro-f1 and micro-recall might not be equal is needed.

@jnothman
Member

A note on the fact that micro-precision, micro-f1 and micro-recall might not be equal is needed.

You mean the contrary? A note that micro-PRF is equal in certain cases? I don't actually think this belongs in the docstring, but in the narrative doc, where the advantages and disadvantages of each metric will hopefully be described. Certainly, it's not related to this PR.
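To illustrate the point with a recent scikit-learn (the toy data here is made up for the example):

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Single-label multiclass input: every false positive for one class is a
# false negative for another, so micro-averaged P, R and F1 coincide.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 1, 1, 0]
print(precision_recall_fscore_support(y_true, y_pred, average="micro"))
# -> (0.666..., 0.666..., 0.666..., None)

# Multilabel indicator input: a sample can have extra or missing labels, so
# total FP and FN can differ and micro P, R and F1 need not be equal.
Y_true = np.array([[1, 0, 0], [1, 1, 0]])
Y_pred = np.array([[1, 1, 1], [1, 0, 0]])
print(precision_recall_fscore_support(Y_true, Y_pred, average="micro"))
# -> precision 0.5, recall 0.666..., F1 0.571...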

assert_almost_equal(r, 1)
assert_almost_equal(f, 1)
assert_equal(s, None)
with warnings.catch_warnings(True):
Member

Since we have considered this warning a functional requirement, surely we should assert that it actually happens.

Perhaps we should copy numpy's assert_warns. (And don't just import it: the return value was None in 1.6.)

I'm not entirely satisfied with that either, in that it only allows you to test the first warning, and then only by class. The alternative may be a bit overblown, but can be found at https://gist.github.com/jnothman/5650845
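For context, a minimal backport in the spirit of numpy's assert_warns could look like the sketch below (this is illustrative, not the helper from this PR):

import warnings

def assert_warns(warning_class, func, *args, **kwargs):
    # Call func while recording warnings, check that at least one warning of
    # the expected class was raised, and return func's result so the caller
    # can keep asserting on it (unlike numpy 1.6, which returned None).
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = func(*args, **kwargs)
    if not any(issubclass(w.category, warning_class) for w in caught):
        raise AssertionError("%s was not raised" % warning_class.__name__)
    return result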

Member Author

This would require at least a backport since it is not present in numpy 1.3.

@arjoly
Member Author

arjoly commented May 28, 2013

I don't understand why the warning is not raised at the moment. :-(

Otherwise, I think that I have taken into account all your remarks @jnothman.

@jnothman
Member

I don't understand why the warning is not raised at the moment. :-(

It's only failing for 'samples'... but I also can't understand why.

@jnothman
Member

Note that if you run the test by itself, there's no error:

$ nosetests sklearn/metrics/tests/test_metrics.py:test_precision_recall_f1_no_labels

This suggests the warning not being caught is due to interaction with other places where that warning has already been issued. Perhaps assert_warns (and its warnings.simplefilter(...)) is not working very well. Maybe you're best off reverting the port of assert_warns (sorry! I only wish this were easier to test).

And when tests are passing you might want to change the title of this PR from WIP to MRG... As a bugfix, I think we want it in master soon.

@jnothman
Member

In particular, the test fails only if run after one of two other tests:

$ tests=$(grep '^def test_' sklearn/metrics/tests/test_metrics.py | sed 's/def //;s/(.*)://')
$ for f in $tests
> do
>   nosetests sklearn/metrics/tests/test_metrics.py:$f \
>             sklearn/metrics/tests/test_metrics.py:test_precision_recall_f1_no_labels \
>             2>/dev/null ||
>   echo $f
> done
test_multilabel_representation_invariance
test_multilabel_invariance_with_pos_labels

It may not be about the warning filter failing, as the following runs without a problem:

$ nosetests sklearn/metrics/tests/test_metrics.py:test_precision_recall_f1_no_labels sklearn/metrics/tests/test_metrics.py:test_precision_recall_f1_no_labels

@jnothman
Member

Right. Apparently the solution is to stick warnings.simplefilter('always') inside the catch_warnings context (at least in those two tests). I don't understand why this is necessary, seeing as assert_warns creates its own context in which it calls simplefilter('always'). Surely that should mean it is independent of prior filter settings and the record of previously-seen messages...?
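For readers following along, the pattern being described is roughly this (a generic sketch, not the test code from the PR):

import warnings

def emit():
    warnings.warn("recall is undefined", UserWarning)

with warnings.catch_warnings(record=True) as caught:
    # Without this, a warning already issued earlier in the process can be
    # swallowed by the per-module __warningregistry__ (the Python bug
    # mentioned below), so nothing would be recorded even though warn() ran.
    warnings.simplefilter("always")
    emit()

assert len(caught) == 1
assert issubclass(caught[0].category, UserWarning)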

@jnothman
Member

I should stop dwelling on this, but I think the right way to handle warnings in tests is:

  • have a separate test to check that a warning is raised in certain cases, in which you first assert that no warning is raised in a valid variant, then assert that one is raised in a problematic variant;
  • tests that are known to raise warnings as a side effect, where that is not the purpose of the test, should have a decorator available, like @ignore_warnings([DeprecationWarning, UserWarning]) (see the sketch after this list);
  • deprecation warnings should probably be tested in a separate test module.
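A rough sketch of such a decorator (the name @ignore_warnings and its signature are the suggestion above, not necessarily what ended up in sklearn.utils):

import warnings
from functools import wraps

def ignore_warnings(categories=(Warning,)):
    # Decorator factory: run the wrapped test with the given warning
    # categories filtered out, restoring the filters afterwards.
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            with warnings.catch_warnings():
                for category in categories:
                    warnings.simplefilter("ignore", category)
                return func(*args, **kwargs)
        return wrapper
    return decorator

@ignore_warnings([DeprecationWarning, UserWarning])
def test_something_noisy():
    warnings.warn("side effect", UserWarning)  # silenced during this test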

@jnothman
Member

Testing warnings is messy. I have drafted a few notes on a more ideal solution, but the current problem with this PR is due to a known Python bug.

@jnothman
Member

They might belong to another PR, but #2094 reminded me of a couple of issues that should be tested in relation to binary data (not that they address #2094 directly):

  • ensure that if y_true and y_pred include a single class label, it is treated as binary and pos_label is observed (where average is not None). (Perhaps this is already being tested.)
  • ensure that if labels is specified and has more than two entries, it is treated as multiclass.

@arjoly
Member Author

arjoly commented Jul 8, 2013

The warning bug is fixed!!! Thanks @jnothman.

I switched this PR to the MRG state.

@GaelVaroquaux
Member

About warnings (I am replying to old comments by @jnothman in this thread): we might consider implementing our own warnings, and setting them to be always raised, and not only once. In sklearn.utils.__init__ there is currently a hack that forces DeprecationWarning to be always raised.
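The kind of hack being referred to is essentially the following (an illustrative one-liner, not a quote of the actual sklearn code):

import warnings

# Force DeprecationWarning to be shown every time it is triggered, instead of
# once per call site as the default filter would do.
warnings.simplefilter("always", DeprecationWarning)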

"samples. ")

if warning_msg:
    warnings.warn(warning_msg)
Member

Do we want a "stacklevel" here? I think that stacklevel=2 would be a good idea.
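For reference, stacklevel=2 makes the warning point at the caller of the metric rather than at the warnings.warn line itself; a generic illustration (not the PR's code):

import warnings

def some_metric(values):
    # Hypothetical helper: with stacklevel=2 the reported filename and line
    # number are those of the code that called this function, which is
    # usually what the user needs to see.
    if not values:
        warnings.warn("metric is undefined for empty input, setting it to 0.0",
                      UserWarning, stacklevel=2)
        return 0.0
    return sum(values) / float(len(values))

some_metric([])  # the warning is attributed to this line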

Member Author

See 0f0f4a3

@GaelVaroquaux
Member

The code looks good. With regards to the maths, I will have to trust you on the changes, and to hope that the tests are good, as I haven't done the maths.

@@ -1446,11 +1494,12 @@ def test_precision_recall_f1_score_with_an_empty_prediction():
    y_true_bi = lb.transform(y_true_ll)
    y_pred_bi = lb.transform(y_pred_ll)

    warnings.simplefilter("ignore")
Member Author

Not sure if that is the best way to silence the warning.
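One possible alternative, sketched here with the test's y_true_bi / y_pred_bi and an assumed average="samples" call (a suggestion, not the code that was merged), is to scope the filter to the call:

import warnings

from sklearn.metrics import precision_recall_fscore_support

with warnings.catch_warnings():
    # Ignore warnings only inside this block instead of mutating the
    # process-wide filters for the rest of the test run.
    warnings.simplefilter("ignore")
    p, r, f, s = precision_recall_fscore_support(y_true_bi, y_pred_bi,
                                                 average="samples")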

@arjoly
Member Author

arjoly commented Jul 11, 2013

LGTM, but this is somewhat outside the feature set I use, so before
merging it would be good to have a second point of view.

Awesome! Thanks for the review!

@jnothman
Member

Oh dear. Travis says that Python warning bug strikes again; all catch_warnings uses in tests need an always filter! It's a good reason to use something like assert_warns.

Btw, for whoever else reviews this if another is needed (@glouppe?), the focus should be on the tests, as the implementation is likely to change soon.

@jnothman
Member

It can go within any warning context. If you mean to switch it off for all tests: you can write a setup function for nosetests that sets up a warning context. It's not the most pleasant of code:

import warnings

def setup():
    # Module-level nose fixture: enter a catch_warnings context so the
    # filter installed below is undone again in teardown().
    global warning_ctx
    warning_ctx = warnings.catch_warnings()
    warning_ctx.__enter__()
    warnings.simplefilter('ignore', UndefinedPRFWarning)

def teardown():
    global warning_ctx
    warning_ctx.__exit__()

@arjoly I know the goal of the PR, but among the bugs was some disagreement on how to handle this case, and we seem to have agreed that warnings are the solution.

@amueller
Member

No, I wanted to switch it off for all users ;)

@arjoly
Member Author

arjoly commented Jul 25, 2013

OK, let's live with it. I removed those commits.

@jnothman
Member

@amueller you mean you don't think there should be a warning?

@amueller
Member

@jnothman I'm not sure, but warning too much tends to be just annoying.

@jnothman
Member

I agree that warnings can be annoying, and they're particularly annoying in test output or when in a module you've not called directly. Hence they can be easily disabled, and we could do so by default with a warnings filter in the metrics module.

But sometimes they're also important, such as to suggest the user should have chosen a different setting for their input. In these cases we don't expect the warning to be frequent, and if the user knows what they're doing they can disable it.

I think the default warnings setting shows each warning once per execution; however, this is once per identical (class, message, triggering module/line). Frustratingly, this means warning messages need to be non-specific.

In sum:

  • use warnings where they indicate the user may have passed a bad data/settings combination
  • avoid detail in the message so filtering works
  • if you think users will wish to silence that warning in particular, use a custom subclass of UserWarning (see the sketch after this list)
  • ideally, warnings that are functional requirements should be tested, but in practice this can be painful
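For instance, a custom warning class along these lines lets users silence exactly this category (UndefinedMetricWarning is an illustrative name, not necessarily what the module defines):

import warnings

class UndefinedMetricWarning(UserWarning):
    """Illustrative warning category for ill-defined metric values."""

# Library side: keep the message generic so the default "once per location"
# filtering still collapses repeats.
def _warn_undefined():
    warnings.warn("Metric is ill-defined and is being set to 0.0",
                  UndefinedMetricWarning, stacklevel=2)

# User side: silence just this category, leaving other warnings visible.
warnings.simplefilter("ignore", UndefinedMetricWarning)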

@amueller
Member

should we merge this now?

@amueller
Member

could you please rebase / merge? I guess your example removals caused merge problems.

@jnothman
Member

As far as I'm concerned, it's good to merge.

@arjoly
Member Author

arjoly commented Jul 27, 2013

@amueller I rebased on top of master.

@amueller
Member

merged :) thanks a lot!

@amueller closed this Jul 27, 2013
@arjoly deleted the fix-prf branch July 27, 2013 13:19
@arjoly
Member Author

arjoly commented Jul 27, 2013

Cool!!! :-) Finally merged, thanks to all the reviewers!!!
