[MRG] FIX remaining bug in precision, recall and fscore with multilabel data #1988
Conversation
Fine, but those are the wrong values because your precision and recall are swapped. I.e. in the indicator matrix test, we should have:
(not .779 which is
From PR #1945, there is still one remaining question.
Yes, I didn't see that question there. I think logically that when the denominator for P or R is 0, the metric should be 1, but this is not what previous versions of scikit-learn did. For backwards consistency, you should implement the previous behaviour. Finally, if the numerator is 0 for F-score (no true positives), the metric should be 0. If the denominator is 0, the user shouldn't be calling the function. In any case, I've rewritten the entire function, but not fixed the tests, in #1990. Thanks for your hard work on the way to this cleaner implementation.
I would keep the implementation as it is, i.e. return 0 if the denominator is 0. In addition, we could trigger a warning when such a case occurs, since the value is actually undefined. Note that this is what Weka does as well: it returns 0 in such cases. Just my 2 cents.
In Mulan, the multilabel extension of Weka, precision is set to 1 if (tp + fp + fn == 0). At the moment, I will stick with the current default behaviour. @jnothman Note that we have called …
+1 and +1 for the warning.
+1 for the warning.
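For concreteness, a toy sketch (plain Python, not the scikit-learn implementation) of the convention being endorsed here: return 0, and emit a warning, whenever the corresponding denominator is zero.

import warnings


def prf_from_counts(tp, fp, fn, beta=1.0):
    """Toy precision/recall/F-beta from counts, returning 0 (with a warning)
    whenever the corresponding denominator is zero."""
    if tp + fp == 0:
        warnings.warn("Precision is undefined (no predicted positives); "
                      "setting it to 0.")
        precision = 0.0
    else:
        precision = tp / float(tp + fp)

    if tp + fn == 0:
        warnings.warn("Recall is undefined (no true positives in y_true); "
                      "setting it to 0.")
        recall = 0.0
    else:
        recall = tp / float(tp + fn)

    beta2 = beta ** 2
    denom = beta2 * precision + recall
    fbeta = (1 + beta2) * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, fbeta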
Would "example" be better than "samples" for consistency with the literature? I suggested "samples" because this seems to be the term in common sklearn use for the first axis, where it might otherwise be "examples" or "instances". A warning's a good idea. But actually, I think recall with 0 true samples is undefined, while, at least in the context of precision-recall curves, precision with 0 predicted samples (or above the threshold at which any samples are recalled) should be 1.
Now warnings are raised. However, they are not perfect. Undefined recall could raise a warning when … I have also removed the …
f_score[(beta2 * size_pred + size_true) == 0] = 1.0
precision[size_pred == 0] = 0.0
recall[size_true == 0] = 0.0
f_score[(beta2 * size_true + size_pred) == 0] = 0.0
Warning messages are not raised for this case.
I can understand why you might want that to be so, but to me it's a bigger problem that in a multilabel problem with these true values:

[
    [],
    [1]
]

for the first example you get 0 recall and 0 precision no matter your prediction, when surely [] is as precise a prediction as you can make! In the second case, if you predict 1 as well as infinitely many wrong values, i.e. y_pred[1] = [1, 2, 3, 4, ...], you still get recall = 1, while precision -> 0!

This applies to label-based averaging too, although there such cases are considered pathological.
Yes, I agree.
But this means we have to change the default behaviour...
No, this means sample-based averaging is not a good idea if some samples have y_true == []. Use micro instead.
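To make the point concrete, a small sketch, assuming the standard precision_recall_fscore_support API with multilabel indicator input; the exact numbers for 'samples' depend on how the 0/0 case is resolved, which is precisely what is being discussed:

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Two samples, three labels; the first sample has no true labels at all.
y_true = np.array([[0, 0, 0],
                   [1, 0, 0]])
y_pred = np.array([[0, 0, 0],
                   [1, 1, 1]])

# Sample-based averaging: the empty first sample hits the 0/0 case even
# though predicting nothing for it is exactly right.
print(precision_recall_fscore_support(y_true, y_pred, average='samples'))

# Micro averaging pools true/false positives over all samples and labels,
# so the empty sample does not force an arbitrary 0 or 1 into the average.
print(precision_recall_fscore_support(y_true, y_pred, average='micro'))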
A note on the fact that micro-precision, micro-F1 and micro-recall might not be equal is needed.
You mean the contrary? A note that micro-PRF is equal in certain cases? I don't actually think this belongs in the docstring, but in the narrative doc where the advantages and disadvantages of each metric will hopefully be described. Certainly, it's not related to this PR.
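For reference (and agreed, it belongs in the narrative docs rather than here), a quick illustration of when micro-averaged P, R and F1 coincide and when they do not, again assuming the standard precision_recall_fscore_support API:

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Multiclass, one label per sample: every error is simultaneously a false
# positive (for the predicted class) and a false negative (for the true
# class), so micro-precision == micro-recall == micro-F1.
y_true = [0, 1, 2, 2]
y_pred = [0, 2, 2, 1]
print(precision_recall_fscore_support(y_true, y_pred, average='micro'))

# Multilabel indicator data: the total number of predicted labels can differ
# from the total number of true labels, so the three values can differ.
Y_true = np.array([[1, 0, 0],
                   [1, 1, 0]])
Y_pred = np.array([[1, 1, 1],
                   [1, 0, 0]])
print(precision_recall_fscore_support(Y_true, Y_pred, average='micro'))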
assert_almost_equal(r, 1)
assert_almost_equal(f, 1)
assert_equal(s, None)
with warnings.catch_warnings(True):
Since we have considered this warning a functional requirement, surely we should assert that it actually happens.

Perhaps we should copy numpy's assert_warns. (And don't just import it: the return value was None in 1.6.)

I'm not entirely satisfied with that either, in that it only allows you to test the first warning, and then only by class. The alternative may be a bit overblown, but can be found at https://gist.github.com/jnothman/5650845
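Something along these lines (the helper name and exact behaviour are illustrative, not necessarily what was adopted) would let a test assert both the warning class and part of its message using only the standard library:

import warnings


def assert_warns_message(warning_cls, substring, func, *args, **kwargs):
    """Call func and assert that a warning of warning_cls whose message
    contains substring was raised; return func's result."""
    with warnings.catch_warnings(record=True) as records:
        warnings.simplefilter("always")  # do not let the once-per-location filter hide it
        result = func(*args, **kwargs)
    assert any(issubclass(r.category, warning_cls) and substring in str(r.message)
               for r in records), "expected warning was not raised"
    return result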
This would require at least a backport since it is not present in numpy 1.3.
I don't understand why the warning is not raised at the moment. :-( Otherwise, I think that I have taken into account all your remarks, @jnothman.
It's only failing for 'samples'... but I also can't understand why.
Note that if you run the test by itself, there's no error:

This suggests the warning not being caught is due to interaction with other times that warning has been called. Perhaps … And when tests are passing, you might want to change the title of this PR from WIP to MRG... As a bugfix, I think we want it in master soon.
In particular, the test fails only if run after one of two other tests:

$ tests=$(grep '^def test_' sklearn/metrics/tests/test_metrics.py | sed 's/def //;s/(.*)://')
$ for f in $tests
> do
>   nosetests sklearn/metrics/tests/test_metrics.py:$f \
>     sklearn/metrics/tests/test_metrics.py:test_precision_recall_f1_no_labels \
>     2>/dev/null ||
>   echo $f
> done
test_multilabel_representation_invariance
test_multilabel_invariance_with_pos_labels

It may not be about the warning filter failing, as the following runs without a problem:

$ nosetests sklearn/metrics/tests/test_metrics.py:test_precision_recall_f1_no_labels sklearn/metrics/tests/test_metrics.py:test_precision_recall_f1_no_labels
Right. Apparently the solution is to stick …
I should stop dwelling on this, but I think the right way to handle warnings in tests is: …
Testing warnings is messy. I have drafted a few notes on a more ideal solution, but the current problem with this PR is due to a known Python bug.
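For anyone hitting the same thing, a sketch of the usual workaround, under the assumption that the warning is emitted from sklearn.metrics: once a warning has been issued from a module, its __warningregistry__ records it, and the default once-per-location filter then suppresses later occurrences, making warning tests order-dependent. Clearing that registry, and recording with simplefilter('always'), avoids this.

import warnings

import sklearn.metrics as metrics_module  # assumed: the module whose code calls warnings.warn


def clean_warning_registry(module):
    # Forget which warnings were already issued from this module, so that
    # they can be caught again in the next test.
    getattr(module, "__warningregistry__", {}).clear()


clean_warning_registry(metrics_module)
with warnings.catch_warnings(record=True) as records:
    warnings.simplefilter("always")  # record every occurrence regardless of history
    pass  # ... call the metric that is expected to warn here ...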
They might belong to another PR, but #2094 reminded me of a couple of issues that should be tested in relation to binary data (not that they address #2094 directly): …
The warning bug is fixed!!! Thanks @jnothman. I switched this PR to MRG state.
About warnings (I am replying to old comments by @jnothman in this thread): we might consider implementing our own warnings and setting them to be always raised, not only once. In sklearn.utils.__init__ there is currently a hack that forces DeprecationWarning to be always raised.

"samples. ")

if warning_msg:
    warnings.warn(warning_msg)
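A minimal sketch of the custom-warning idea suggested above (the class name is illustrative, not necessarily the one scikit-learn ended up with):

import warnings


class UndefinedMetricWarning(UserWarning):
    """Raised when precision, recall or F-score is undefined for the given input."""


# Show this category on every occurrence, not just once per
# (message, module, line) as the default filter does.
warnings.simplefilter("always", UndefinedMetricWarning)

warnings.warn("Precision is ill-defined: no predicted samples.",
              UndefinedMetricWarning)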
Do we want a "stacklevel" here? I think that stacklevel=2 would be a good idea.
See 0f0f4a3
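For the record, what stacklevel=2 buys, shown with a hypothetical toy metric rather than the real code: the warning is attributed to the user's call site instead of the line inside the metric implementation.

import warnings


def toy_metric(y_true, y_pred):
    # Hypothetical metric used only to illustrate stacklevel.
    if not y_true:
        # With stacklevel=2 the reported filename and line number point at the
        # code that called toy_metric, which is far more useful to the user.
        warnings.warn("Recall is undefined: y_true is empty.", stacklevel=2)
        return 0.0
    return 1.0


toy_metric([], [])  # the emitted warning is attributed to this line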
The code looks good. With regards to the maths, I will have to trust you on the changes, and hope that the tests are good, as I haven't done the maths.
@@ -1446,11 +1494,12 @@ def test_precision_recall_f1_score_with_an_empty_prediction():
    y_true_bi = lb.transform(y_true_ll)
    y_pred_bi = lb.transform(y_pred_ll)

    warnings.simplefilter("ignore")
Not sure if that is the best way to silence the warning.
Awesome! Thanks for the review!
Oh dear. Travis says that Python warning bug strikes again; all … Btw, for whoever else reviews this, if another review is needed (@glouppe?), the focus should be on the tests, as the implementation is likely to change soon.
It can go within any warning context. If you mean to switch it off for all tests, you can write a module-level setup/teardown pair:

def setup():
    global warning_ctx
    warning_ctx = warnings.catch_warnings()
    warning_ctx.__enter__()
    warnings.simplefilter('ignore', UndefinedPRFWarning)

def teardown():
    global warning_ctx
    warning_ctx.__exit__()

@arjoly I know the goal of the PR, but among the bugs was some disagreement on how to handle this case, and we seem to have agreed that warnings are the solution.
No, I wanted to switch it off for all users ;)
OK, let's live with it. I removed those commits.
@amueller you mean you don't think there should be a warning?
@jnothman I'm not sure, but warning too much tends to be just annoying.
I agree that warnings can be annoying, and they're particularly annoying in test output or when they come from a module you've not called directly. Hence they can be easily disabled, and we could do so by default with a warnings filter in the metrics module. But sometimes they're also important, such as to suggest the user should have chosen a different setting for their input. In these cases we don't expect the warning to be frequent, and if the user knows what they're doing they can disable it.

I think the default warnings setting shows each warning once per execution; however, this is once per identical (class, message, triggering module/line). Frustratingly, this means warning messages need to be non-specific. In sum: …
Should we merge this now?
Could you please rebase/merge? I guess your example removals caused merge problems.
As far as I'm concerned, it's good to merge.
@amueller I rebased on top of master.
Merged :) Thanks a lot!
Cool!!! :-) Finally merged, thanks to all reviewers!!!
Fix in precision, recall and fscore, as pointed out in #1945.