
[MRG] Invariance test for non-scalar metric #10730


Merged

15 commits merged into scikit-learn:master on Jun 30, 2018

Conversation

Contributor

@dmohns dmohns commented Feb 28, 2018

Reference Issues/PRs

Following up on a discussion in #10591.

Automated invariance tests are currently implemented in test_common.py for metrics with scalar output. It is desirable to have these tests also available for metrics with potentially non-scalar outputs, for example roc_curve, classification_report or confusion_matrix.

What does this implement/fix? Explain your changes.

The scope of this PR is to generalize test_common.py to allow testing of metrics with non-scalar output. This is achieved by replacing the assertions in such a way that they also accept array-like structures.
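As a rough illustration (a minimal sketch, not code from this PR), an array-aware assertion such as numpy's assert_allclose can express invariances for non-scalar metrics like confusion_matrix directly:

```
# Illustrative sketch only: assert_allclose accepts array-like inputs, so the
# same assertion covers scalar and non-scalar metric outputs alike.
import numpy as np
from numpy.testing import assert_allclose
from sklearn.metrics import confusion_matrix

rng = np.random.RandomState(0)
y_true = rng.randint(0, 3, size=50)
y_pred = rng.randint(0, 3, size=50)

# Swapping y_true and y_pred transposes the confusion matrix; an array-aware
# assertion can check that relationship in one line.
assert_allclose(confusion_matrix(y_true, y_pred),
                confusion_matrix(y_pred, y_true).T)
```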

Any other comments?

It should first be discussed whether a solution like this is feasible. An alternative route, suggested by @jnothman, is to aggregate non-scalar metrics using an arbitrary measure and apply the scalar invariance tests.

To-do

  • Add other metrics to the tests
  • Add wrappers for assert_array_greater and assert_array_not_equal
  • Check if certain tests become obsolete (if so remove them)

@dmohns dmohns changed the title Invariance test for non-scalar metric [WIP] Invariance test for non-scalar metric Feb 28, 2018
Member

@jnothman jnothman left a comment

Who knew it could be this easy! Nice

"equal (%f) for %s" % (weighted_score, name))
assert_raises(
AssertionError,
assert_array_equal, unweighted_score, weighted_score,
Member

Use approximate equality here, please

Contributor Author

You mean assert_array_almost_equal, right? If I do that, the test fails for micro_average_precision_score:

    def _raiseFailure(self, standardMsg):
        msg = self.test_case._formatMessage(self.msg, standardMsg)
>       raise self.test_case.failureException(msg)
E       AssertionError: Unweighted and weighted scores are unexpectedly almost equal (0.118667604216) and (0.118668873707) for micro_average_precision_score

msg        = 'Unweighted and weighted scores are unexpectedly almost equal (0.118667604216) and (0.118668873707) for micro_average_precision_score'
self       = <sklearn.utils._unittest_backport._AssertRaisesContext instance at 0x7ff47bf8f758>
standardMsg = 'AssertionError not raised'

Member

I think micro_average_precision_score should be identical to accuracy_score. Perhaps this is not as robust a test as we thought? Does that still fail if assert_allclose is used? It relies on relative tolerance rather than absolute and is probably a better test here.

Contributor Author

Interesting. The test passes when using assert_allclose.

Additionally, the use of assert_array_almost_equal is discouraged by NumPy. I decided to replace all approximate equality checks with assert_allclose to increase the overall robustness of the invariance tests.
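For context, a small sketch (using the two scores from the traceback above) of why the two helpers behave differently: assert_array_almost_equal compares absolute differences against a fixed number of decimals (6 by default), while assert_allclose uses a relative tolerance (rtol=1e-7 by default):

```
# Sketch using the scores from the traceback above.
from numpy.testing import assert_array_almost_equal, assert_allclose

unweighted = 0.118667604216
weighted = 0.118668873707

# Passes: the absolute difference (~1.3e-6) is below the default
# 6-decimal threshold of assert_array_almost_equal.
assert_array_almost_equal(unweighted, weighted)

# Raises: the relative difference (~1e-5) exceeds the default rtol of 1e-7,
# so the "unexpectedly equal" guard in the test still works.
try:
    assert_allclose(unweighted, weighted)
except AssertionError:
    pass
```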

dmohns added 2 commits March 2, 2018 21:10
- Increase weight range to circumvent approximate equality for some metrics.
@dmohns
Contributor Author

dmohns commented Mar 3, 2018

I am not 100% sure whether we need wrappers for the array asserts greater and not_equal. The non-equality test in particular is a bit cumbersome; to the unfamiliar reader it might obscure what the test is actually trying to check (see the sketch after this comment).
On the other hand, these checks are only ever used in test_common.py. Adding abstract versions of the above-mentioned asserts would add unneeded complexity without benefiting other parts of scikit-learn.

What do you think?

PS: I gave up on trying to add classification_report to test_common.py. The fact that it returns a string makes it all but impossible to write a reasonable assertion.
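For reference, a minimal sketch of what such wrappers could look like (hypothetical helpers built on numpy.testing; they were not added in this PR):

```
# Hypothetical helpers, sketched only to illustrate the trade-off above;
# they were not added to scikit-learn in this PR.
import numpy as np
from numpy.testing import assert_array_equal, assert_array_less


def assert_array_greater(x, y, err_msg=""):
    # Element-wise strict greater-than, expressed via assert_array_less.
    assert_array_less(np.asanyarray(y), np.asanyarray(x), err_msg=err_msg)


def assert_array_not_equal(x, y, err_msg="Arrays are unexpectedly equal"):
    # Replaces the cumbersome assert_raises(AssertionError,
    # assert_array_equal, x, y) idiom.
    try:
        assert_array_equal(x, y)
    except AssertionError:
        return
    raise AssertionError(err_msg)
```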

dmohns added 4 commits March 3, 2018 22:59
- The use of `assert_allclose` over `assert_array_almost_equal` is recommended by NumPy.
- Revert initial change of sample weights as obsolete.
Member

@jnothman jnothman left a comment

I'm not certain I've reviewed this thoroughly, but it LGTM.


@dmohns dmohns changed the title [WIP] Invariance test for non-scalar metric [MRG] Invariance test for non-scalar metric Mar 5, 2018
Member

@amueller amueller left a comment

Looks good, minor comments.

@@ -103,7 +96,13 @@
"accuracy_score": accuracy_score,
"balanced_accuracy_score": balanced_accuracy_score,
"unnormalized_accuracy_score": partial(accuracy_score, normalize=False),
"confusion_matrix": confusion_matrix,

"unnormalized_confusion_matrix": confusion_matrix,
Member

I think it's a bit weird to name them this way, given that confusion_matrix is unnormalized for us. I'd rather call this one confusion_matrix and the other one normalized_confusion_matrix.

Contributor Author

Agreed. It felt very counter-intuitive to name confusion_matrix that way.
However, this was necessary to have the scaling test schedule the correct metric.
See https://github.com/dmohns/scikit-learn/blob/94b9bd0d52ee55478ea7607cfd3037c83e4c2f4b/sklearn/metrics/tests/test_common.py#L1009

For the scope of this PR I didn't want to touch any of the actual test code.

The other option would be to introduce a new list of "unnormalized" metrics and
use that for scheduling rather than name.startswith.

Thoughts?

Member

What are the other unnormalized metrics? I wouldn't mind too much keeping it this way if you add a comment on why it is done this way.

Contributor Author

Basically the ones that can be called with a normalize parameter that defaults to True. Currently this affects balanced_accuracy_score, jaccard_similarity_score, zero_one_loss and log_loss. Let me add a comment explaining why I am doing it this way and see if that makes it clearer.
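For illustration, the registration pattern follows the diff above; the dict name below is made up, but the "unnormalized_" prefix is exactly what the scaling test keys on via name.startswith:

```
# Sketch of the naming pattern (dict name is illustrative only).
from functools import partial
from sklearn.metrics import accuracy_score, log_loss, zero_one_loss

UNNORMALIZED_VARIANTS = {
    "unnormalized_accuracy_score": partial(accuracy_score, normalize=False),
    "unnormalized_zero_one_loss": partial(zero_one_loss, normalize=False),
    "unnormalized_log_loss": partial(log_loss, normalize=False),
}
```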

NOT_SYMMETRIC_METRICS, THRESHOLDED_METRICS,
METRIC_UNDEFINED_BINARY_MULTICLASS),
set(ALL_METRICS))

assert_equal(
assert_array_equal(
Member

I don't understand this one

Contributor Author

Good catch. This is an artifact of me blindly removing all occurrences of assert_equal where technically possible. In this situation, it is clearer to keep the original assert_equal to avoid confusion.

Will undo this change.

# Metrics are tested with numpy's assert_array helpers. Therefore box-shaped
# outputs are required.
CURVE_METRICS = {
"roc_curve": lambda *args, **kwargs: roc_curve(*args, **kwargs)[:2],
Member

you're throwing away the thresholds, right? Which makes sense, but I think the comment should say that. That's also not only about the shape, right? The thresholds probably don't behave like a metric.

Contributor Author

Yes, I am throwing away the thresholds here. But no, I am only really doing it because of the shape, as the thresholds have a different dimension than the other returned values.

If I recall correctly, the thresholds on their own should pass all the tests. I will check whether we can keep the thresholds by padding the result into a box-shaped array rather than throwing data away.

Member

That seems really weird to me...

@@ -148,6 +152,36 @@
"cohen_kappa_score": cohen_kappa_score,
}


def precision_recall_curve_padded_thresholds(*args, **kwargs):
Contributor Author

@amueller Following up on the conversation in the earlier thread: I added an alternative way to make the invariance tests work with precision_recall_curve. Does this make things clearer?

Member

To be sure, we could pad with something more conspicuous, like nan???

Otherwise lgtm

Contributor Author

I like the idea of padding with NaNs. It makes things a bit more verbose.
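For reference, a sketch of the NaN-padding approach that was settled on (not necessarily the exact code that was merged):

```
# Sketch of the NaN-padding approach; not necessarily the exact merged code.
import numpy as np
from sklearn.metrics import precision_recall_curve


def precision_recall_curve_padded_thresholds(*args, **kwargs):
    # precision and recall have one more entry than thresholds, so pad the
    # thresholds with NaN to get a box-shaped 2-D array that the array
    # assertions can compare directly.
    precision, recall, thresholds = precision_recall_curve(*args, **kwargs)
    pad = len(precision) - len(thresholds)
    thresholds = np.pad(thresholds, pad_width=(0, pad),
                        mode='constant', constant_values=[np.nan])
    return np.array([precision, recall, thresholds])
```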

@jnothman
Member

@dmohns, could you please resolve conflicts with master. We'll try to merge soon, given that @amueller and I have both more-or-less approved.

@dmohns
Contributor Author

dmohns commented Jun 28, 2018

@jnothman Sure, will do.

dmohns and others added 4 commits June 28, 2018 18:39
Resolved Merge Conflicts:
- sklearn/metrics/tests/test_common.py
- Specifically NumPy 1.10.4 seems to have a problem with casting 
```
((val1, val2), )                  => ((val1, val2), (val1, val2), ...)
```
see https://github.com/numpy/numpy/blob/v1.10.4/numpy/lib/arraypad.py#L989

We can circumvent this by using the simpler form of np.pad, passing a singleton to `constant_values`.
- Specifically NumPy 1.8.2 seems to have a problem with casting 
```
int                               => ((int, int), (int, int), ...)
```
when the value is NaN. See https://github.com/numpy/numpy/blob/v1.8.2/numpy/lib/arraypad.py#L1042

We can circumvent this by wrapping the NaN value in a list instead of passing a bare NaN singleton to `constant_values`.
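As a small illustration of the workaround described in these commit messages (the array below is made up), wrapping the NaN fill value in a list keeps the `constant_values` broadcasting simple enough for both NumPy 1.8.x and 1.10.x:

```
# Illustrative only: pad a 1-D thresholds array with NaN values.
import numpy as np

thresholds = np.array([0.2, 0.5, 0.8])
n_pad = 2  # pad to the length of the precision/recall arrays

padded = np.pad(thresholds, pad_width=(0, n_pad),
                mode='constant', constant_values=[np.nan])
# padded -> array([0.2, 0.5, 0.8, nan, nan])
```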
@jnothman
Member

Thanks! Good work!

@jnothman jnothman merged commit c7e498b into scikit-learn:master Jun 30, 2018