[MRG] Invariance test for non-scalar metric #10730
Conversation
Who knew it could be this easy! Nice
sklearn/metrics/tests/test_common.py (outdated)

                    "equal (%f) for %s" % (weighted_score, name))
    assert_raises(
        AssertionError,
        assert_array_equal, unweighted_score, weighted_score,
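As a side note for readers of this thread, the pattern in the snippet above can be reproduced in isolation: wrapping an equality assert in a check that it raises is a way of asserting that two array-like results are *not* equal. A minimal standalone sketch with made-up score arrays (using `pytest.raises` in place of scikit-learn's `assert_raises` helper):

```
import numpy as np
import pytest
from numpy.testing import assert_array_equal

# Hypothetical outputs of a non-scalar metric (e.g. a confusion matrix),
# computed once without and once with sample weights.
unweighted_score = np.array([[5.0, 1.0], [2.0, 4.0]])
weighted_score = np.array([[4.2, 1.8], [1.5, 4.5]])

# The check passes only if assert_array_equal itself fails, i.e. the
# sample weights really changed the metric's output.
with pytest.raises(AssertionError):
    assert_array_equal(unweighted_score, weighted_score)
```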
Use approximate equality here, please
You mean `assert_array_almost_equal`, right? If I do that, it makes the test fail for `micro_average_precision_score`:

    def _raiseFailure(self, standardMsg):
        msg = self.test_case._formatMessage(self.msg, standardMsg)
    >   raise self.test_case.failureException(msg)
    E   AssertionError: Unweighted and weighted scores are unexpectedly almost equal (0.118667604216) and (0.118668873707) for micro_average_precision_score

    msg = 'Unweighted and weighted scores are unexpectedly almost equal (0.118667604216) and (0.118668873707) for micro_average_precision_score'
    self = <sklearn.utils._unittest_backport._AssertRaisesContext instance at 0x7ff47bf8f758>
    standardMsg = 'AssertionError not raised'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think micro_average_precision_score should be identical to accuracy_score. Perhaps this is not as robust a test as we thought? Does that still fail if assert_allclose is used? It relies on relative tolerance rather than absolute and is probably a better test here
Interesting. The test passes when using `assert_allclose`. Additionally, the use of `assert_array_almost_equal` is advised against by NumPy. I decided to replace all approximate equality checks with `assert_allclose` to increase the overall robustness of the invariance tests.
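A small standalone illustration of why the switch matters (the numbers are borrowed from the failure above, nothing else is taken from the PR): `assert_array_almost_equal` uses an absolute, decimal-based criterion, while `assert_allclose` uses a relative tolerance by default, so it still distinguishes values that agree to six decimal places.

```
import numpy as np
from numpy.testing import assert_allclose, assert_array_almost_equal

# The two scores differ by roughly 1.3e-6.
a = np.array([0.118667604216])
b = np.array([0.118668873707])

# Absolute check: passes, since |a - b| < 1.5e-6 (default decimal=6).
assert_array_almost_equal(a, b)

# Relative check: fails, since |a - b| > 1e-7 * |b| (default rtol=1e-7).
try:
    assert_allclose(a, b)
except AssertionError:
    print("assert_allclose detects the difference")
```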
- Increase weight range to circumvent approximate equality for some metrics.
I am not 100% sure if we would need wrappers for the array asserts. What do you think? PS: I gave up on trying to add
- The use of `assert_allclose` over `assert_array_almost_equal` is recommended by NumPy.
- Revert initial change of sample weights as obsolete.
I'm not certain I've reviewed this thoroughly, but it LGTM
Looks good, minor comments.
sklearn/metrics/tests/test_common.py (outdated)

    @@ -103,7 +96,13 @@
         "accuracy_score": accuracy_score,
         "balanced_accuracy_score": balanced_accuracy_score,
         "unnormalized_accuracy_score": partial(accuracy_score, normalize=False),
    -    "confusion_matrix": confusion_matrix,
    +    "unnormalized_confusion_matrix": confusion_matrix,
I think it's a bit weird to name them this way, given that `confusion_matrix` is unnormalized for us. I'd rather call this one `confusion_matrix` and the other one `normalized_confusion_matrix`.
Agreed. It felt very counter-intuitive to name `confusion_matrix` that way. However, this was necessary to have the scaling test schedule the correct metric, see https://github.com/dmohns/scikit-learn/blob/94b9bd0d52ee55478ea7607cfd3037c83e4c2f4b/sklearn/metrics/tests/test_common.py#L1009. For the scope of this PR I didn't want to touch any of the actual test code. The other option would be to introduce a new list of "unnormalized" metrics and use that for scheduling rather than `name.startswith`. Thoughts?
What are the other unnormalized metrics? I wouldn't mind too much keeping it this way if you add a comment on why it is done this way.
Basically the ones that can be called with a `normalize` parameter, which defaults to `True`. Currently this affects `balanced_accuracy_score`, `jaccard_similarity_score`, `zero_one_loss` and `log_loss`. Let me add a comment on why I am doing what I am doing there and see if it gets clearer.
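To make the naming convention under discussion concrete, here is a simplified sketch (illustrative only; the real dictionaries and the scaling test in `test_common.py` are larger): metrics with a `normalize` flag are registered twice, and the sample-weight scaling test branches on the `unnormalized` name prefix.

```
from functools import partial
from sklearn.metrics import accuracy_score, zero_one_loss

# Each normalize-aware metric appears twice: once with its default
# (normalized) behaviour, once forced to normalize=False under an
# "unnormalized_" prefix.
METRICS = {
    "accuracy_score": accuracy_score,
    "unnormalized_accuracy_score": partial(accuracy_score, normalize=False),
    "zero_one_loss": zero_one_loss,
    "unnormalized_zero_one_loss": partial(zero_one_loss, normalize=False),
}

for name, metric in METRICS.items():
    if name.startswith("unnormalized"):
        # checked to scale linearly when sample weights are scaled
        pass
    else:
        # checked to be invariant under uniform scaling of sample weights
        pass
```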
sklearn/metrics/tests/test_common.py (outdated)

             NOT_SYMMETRIC_METRICS, THRESHOLDED_METRICS,
             METRIC_UNDEFINED_BINARY_MULTICLASS),
         set(ALL_METRICS))

    -    assert_equal(
    +    assert_array_equal(
I don't understand this one
Good catch. This is an artifact of me blindly removing all occurrences of `assert_equal` wherever technically possible. In this situation it is clearer to keep the original `assert_equal` to avoid confusion. Will undo this change.
sklearn/metrics/tests/test_common.py (outdated)

    # Metrics are tested with numpy's assert_array. Therefore box-shaped outputs
    # are required.
    CURVE_METRICS = {
        "roc_curve": lambda *args, **kwargs: roc_curve(*args, **kwargs)[:2],
You're throwing away the thresholds, right? Which makes sense, but I think the comment should say that. It's also not only about the shape, right? The thresholds probably don't behave like a metric.
Yes, I am throwing away the thresholds here. But no, I am only really doing it because of shape, as `thresholds` will have a different dimension compared to the other returned values. If I recall correctly, the `thresholds` on their own should pass all the tests. I will test if we can keep the thresholds by padding the result into a box-shaped array rather than throwing stuff away.
That seems really weird to me...

> In this situation it is clearer to keep the original assert_equal to avoid confusion.
    @@ -148,6 +152,36 @@
         "cohen_kappa_score": cohen_kappa_score,
     }


     def precision_recall_curve_padded_thresholds(*args, **kwargs):
@amueller Following up on the conversation in the earlier thread: I added an alternative way to make the invariance tests work with `precision_recall_curve`. Does this make things clearer?
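For readers following along, a minimal sketch of what such a padding wrapper can look like (an illustration of the idea rather than the exact code added in the PR): `precision_recall_curve` returns `thresholds` with one element fewer than `precision` and `recall`, so the shorter array is padded before everything is stacked into a box-shaped array.

```
import numpy as np
from sklearn.metrics import precision_recall_curve


def precision_recall_curve_padded_thresholds(*args, **kwargs):
    # thresholds has len(precision) - 1 entries, so the three outputs
    # cannot be stacked into a rectangular array directly.
    precision, recall, thresholds = precision_recall_curve(*args, **kwargs)

    pad_width = len(precision) - len(thresholds)
    # 0 is used as a placeholder padding value here; the thread below
    # suggests something more conspicuous, such as NaN.
    thresholds_padded = np.pad(thresholds, pad_width=(0, pad_width),
                               mode='constant', constant_values=0)

    # Box-shaped output that numpy's array asserts can compare element-wise.
    return np.array([precision, recall, thresholds_padded])
```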
To be sure, we could pad with something more conspicuous, like nan???
Otherwise lgtm
I like the idea of padding with NaNs. It makes things a bit more verbose.
@jnothman Sure, will do.
Resolved merge conflicts:
- sklearn/metrics/tests/test_common.py
- Specifically, NumPy 1.10.4 seems to have a problem with casting
  ```
  ((val1, val2), ) => ((val1, val2), (val1, val2), ...)
  ```
  see https://github.com/numpy/numpy/blob/v1.10.4/numpy/lib/arraypad.py#L989. We can circumvent this by using the simpler trait of `np.pad`, passing a singleton to `constant_values`.
- Specifically, NumPy 1.8.2 seems to have a problem with casting
  ```
  int => ((int, int), (int, int), ...)
  ```
  when the value is NaN. See https://github.com/numpy/numpy/blob/v1.8.2/numpy/lib/arraypad.py#L1042. We can circumvent this by wrapping the NaN value into a list instead of passing a NaN singleton to `constant_values`.
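To make the two compatibility notes concrete, here is a hedged sketch of the call styles being discussed (the array and pad widths are made up; the version-specific behaviour is as described in the notes above, not something re-verified here):

```
import numpy as np

thresholds = np.array([0.1, 0.35, 0.4, 0.8])

# Fully spelled-out form: one (before, after) pair of padding values per
# axis. According to the note above, NumPy 1.10.4 fails to broadcast a
# single pair like this to all axes.
padded = np.pad(thresholds, pad_width=(0, 2), mode='constant',
                constant_values=((np.nan, np.nan),))

# Workaround: wrap the single NaN in a list and let np.pad broadcast it
# itself, which also avoids the NaN-casting issue seen on NumPy 1.8.2.
padded = np.pad(thresholds, pad_width=(0, 2), mode='constant',
                constant_values=[np.nan])

print(padded)  # [0.1  0.35 0.4  0.8  nan  nan]
```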
Thanks! Good work!
Reference Issues/PRs
Following up a discussion from #10591
Automated invariance tests are currently implemented in test_common.py for scalar ranking functions. It is desirable to have these tests also available for ranking metrics with potentially non-scalar outputs, for example roc_curve, classification_report or confusion_matrix.

What does this implement/fix? Explain your changes.
The scope of this PR is to generalize test_common.py to allow testing of metrics with non-scalar output. This shall be achieved by replacing all asserts in such a way that they allow assertions on array-like structures.

Any other comments?
It should first be discussed whether a solution like this is feasible. An alternative route, suggested by @jnothman, is to aggregate non-scalar metrics using an arbitrary measure and apply the scalar invariance tests.
To-do

- assert_array_greater and assert_array_not_equal
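A possible shape for those two helpers (purely a sketch; neither name exists in NumPy or scikit-learn as used here, and the real wrappers may look different): an element-wise strict-greater assertion and a negated closeness check built on NumPy's testing utilities.

```
import numpy as np
from numpy.testing import assert_allclose


def assert_array_greater(x, y, err_msg=''):
    # Element-wise strict comparison; fails if any element of x is <= y.
    if not np.all(np.asarray(x) > np.asarray(y)):
        raise AssertionError(err_msg or "Arrays are not element-wise greater")


def assert_array_not_equal(x, y, err_msg=''):
    # Negated closeness check: fails only if the arrays are (almost) equal.
    try:
        assert_allclose(x, y)
    except AssertionError:
        return
    raise AssertionError(err_msg or "Arrays are unexpectedly (almost) equal")


# Usage with made-up metric outputs:
assert_array_greater([2.0, 3.0], [1.0, 2.5])
assert_array_not_equal([[1, 0], [0, 1]], [[2, 0], [0, 2]])
```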