
[MRG] Multilabel-indicator roc auc and average precision #2460


Merged
merged 35 commits on Dec 2, 2013

Conversation


@arjoly arjoly commented Sep 19, 2013

The goal of this PR is to add multilabel-indicator support, with various types of averaging,
for roc_auc_score and average_precision_score.

Still to do:

A priori, I won't add ranking-based average_precision_score.
I don't want to add support for the multilabel-sequence format.
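
A minimal usage sketch of what this PR enables; the toy arrays below are made up for illustration and are not taken from the PR itself:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Multilabel-indicator target (n_samples x n_labels) and continuous scores.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.8],
                    [0.1, 0.7, 0.3],
                    [0.8, 0.6, 0.4]])

# The averaging options supported for multilabel-indicator input.
for average in ("micro", "macro", "weighted", "samples"):
    print(average,
          roc_auc_score(y_true, y_score, average=average),
          average_precision_score(y_true, y_score, average=average))
```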

@@ -176,17 +176,17 @@ Classification metrics

The :mod:`sklearn.metrics` implements several losses, scores and utility
functions to measure classification performance.
Some metrics might require probability estimates of the positive class,
confidence values or binary decisions value.
Member

binary decisions value => binary decision values?


arjoly commented Sep 20, 2013

Thanks @ogrisel for reviewing !!!

@@ -2045,9 +2151,6 @@ def r2_score(y_true, y_pred):
"""
y_type, y_true, y_pred = _check_reg_targets(y_true, y_pred)

if len(y_true) == 1:
raise ValueError("r2_score can only be computed given more than one"
" sample.")
Member

The message was wrong, but there is still a zero-division error (the undefined-metric problem) if there is only one sample, or if y_true is constant, or if one element of y_true is exactly equal to its mean, isn't there?

Member Author

This case is already treated in the function. If the denominator is zero and the numerator is zero, then the score is 1. If the denominator is zero and the numerator is non-zero, then the score is 0.

This makes r2_score behave as the explained_variance_score.
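
For illustration, a small sketch of the convention just described (hypothetical values; the expected outputs assume r2_score behaves exactly as stated above):

```python
import numpy as np
from sklearn.metrics import r2_score

# Constant y_true: the denominator sum((y_true - mean(y_true))**2) is zero.
y_true = np.array([2.0, 2.0, 2.0])

# Numerator is also zero (perfect prediction) -> score is 1 per the convention above.
print(r2_score(y_true, np.array([2.0, 2.0, 2.0])))

# Numerator is non-zero -> score is 0 per the convention above.
print(r2_score(y_true, np.array([2.0, 2.0, 3.0])))
```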

Member

Good, I could not see it from the diff view in github.


ogrisel commented Sep 20, 2013

> Thanks @ogrisel for reviewing !!!

I just had a quick look. I don't have time to review it deeper now. Could you please put a png rendering of the new plots in the PR description?

_check_averaging(metric, y_true, y_score, y_true_binarize,
y_score, is_multilabel)
else:
ValueError("Metric is not recorded has having an average option")
Member Author

raise ValueError...


arjoly commented Sep 20, 2013

> Thanks @ogrisel for reviewing !!!
>
> I just had a quick look. I don't have time to review it deeper now. Could you please put a png rendering of the new plots in the PR description?

Which new plot? There is no new plot at the moment.


ogrisel commented Sep 20, 2013

I thought the ROC example was updated to demonstrate averaging. I think it should :)


ogrisel commented Sep 20, 2013

Could you please add a couple of tests for the various averaging cases on very tiny (minimalist) inline-defined multi-label datasets that could be checked by computing the expected output manually?


arjoly commented Sep 20, 2013

> I thought the ROC example was updated to demonstrate averaging. I think it should :)

Good point!


arjoly commented Sep 20, 2013

@ogrisel I have updated the examples on ROC curves and precision-recall curves. Here are the generated plots:

[Plots: roc_binary, roc_multiclass, pr_binary, pr_multiclass]
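
Roughly, the micro-averaged curve shown in the updated examples can be obtained by pooling all label/score pairs before computing the curve; a sketch with placeholder data, not the example's actual code:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Placeholder binarized labels and decision scores, shape (n_samples, n_classes).
y_test_bin = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])
y_score = np.array([[0.8, 0.1, 0.1],
                    [0.2, 0.6, 0.2],
                    [0.1, 0.3, 0.6],
                    [0.4, 0.4, 0.2]])

# Micro-average: ravel everything into one large binary problem.
fpr, tpr, _ = roc_curve(y_test_bin.ravel(), y_score.ravel())
print("micro-average AUC:", auc(fpr, tpr))
```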


arjoly commented Sep 20, 2013

> Could you please add a couple of tests for the various averaging cases on very tiny (minimalist) inline-defined multi-label datasets that could be checked by computing the expected output manually?

I have added some tests on toy multilabel-indicator data.
And I have also added some tests on binary inputs, since there were none.
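
As a sketch of what such a toy test can look like (hypothetical values, with per-label AUCs that can be checked by hand; this is not the PR's actual test code):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Two labels, four samples; each per-label AUC can be verified by hand.
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_score = np.array([[0.9, 0.1],
                    [0.2, 0.8],
                    [0.7, 0.6],
                    [0.3, 0.4]])

# Label 0: positives scored {0.9, 0.7} > negatives {0.2, 0.3} -> AUC = 1.0
# Label 1: positives scored {0.8, 0.6} > negatives {0.1, 0.4} -> AUC = 1.0
print(roc_auc_score(y_true, y_score, average="macro"))  # expected 1.0
```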

@@ -57,27 +57,39 @@
recall. See the corner at recall = .59, precision = .8 for an example of this
phenomenon.

Precision-recall curves are typically use in binary classification to study
Contributor

use => used


arjoly commented Sep 20, 2013

Thanks @glouppe !!!


arjoly commented Sep 24, 2013

Just rebased on top of master!
More reviews are welcome :-)

@jaquesgrobler
Member

I had a quick look through; all seems well to me. Nice work.
I also built the branch's docs; everything renders fine and looks good on my side.
I'll try to have a more in-depth look later, but as far as I can see, everything has been covered.
👍


arjoly commented Sep 24, 2013

@jaquesgrobler Thanks for reviewing !!!


ogrisel commented Sep 25, 2013

Could you please run a test coverage report and paste the relevant lines here? (And also add more tests if this report highlights uncovered options / blocks / exceptions...) :)


arjoly commented Sep 25, 2013

Current code coverage

$ nosetests sklearn/metrics --with-coverage

Name                                                  Stmts   Miss  Cover   Missing
-----------------------------------------------------------------------------------
sklearn.metrics                                           9      0   100%   
sklearn.metrics.metrics                                 409      1    99%   459
sklearn.metrics.scorer                                   85      8    91%   42, 173-181, 184, 187

None of the missing lines in sklearn.metrics.scorer concern this pull request.
I will add a test for line 459 in sklearn.metrics.metrics.


arjoly commented Sep 25, 2013

Now I have 100% coverage for the code related to this PR.

$ nosetests sklearn/metrics --with-coverage

Name                                                  Stmts   Miss  Cover   Missing
-----------------------------------------------------------------------------------
sklearn.metrics                                           9      0   100%   
sklearn.metrics.metrics                                 409      0   100%   
sklearn.metrics.scorer                                   85      8    91%   42, 173-181, 184, 187

# first a specific test for the given metric and then add a general test for
# all metrics that have the same behavior.
#
# Two type of datastructures are used in order to implement this system:
Member Author

types


arjoly commented Dec 2, 2013

Rebased on top of master.

I will update the what's new entry once it's merged.

@coveralls

Coverage Status

Coverage remained the same when pulling 8c2c1c5 on arjoly:auc-multilabel into 391c913 on scikit-learn:master.


ogrisel commented Dec 2, 2013

Merging!

ogrisel added a commit that referenced this pull request Dec 2, 2013
[MRG] Multilabel-indicator roc auc and average precision
@ogrisel ogrisel merged commit a82c8ac into scikit-learn:master Dec 2, 2013

arjoly commented Dec 2, 2013

Thanks, I am working on fixing the Jenkins build.


ogrisel commented Dec 2, 2013

I think I fixed the Python 3 issue. No idea about the numpy 1.3.1 issue.

@arjoly arjoly deleted the auc-multilabel branch December 2, 2013 16:27
8 participants