[MRG] Invariance test for non-scalar metric #10730
Conversation
Who knew it could be this easy! Nice
sklearn/metrics/tests/test_common.py (outdated)

                    "equal (%f) for %s" % (weighted_score, name))
    assert_raises(
        AssertionError,
        assert_array_equal, unweighted_score, weighted_score,
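As a side note for readers of this thread, the pattern in the snippet above can be reproduced in isolation: wrapping an equality assert in a check that it raises is a way of asserting that two array-like results are *not* equal. A minimal standalone sketch with made-up score arrays (using `pytest.raises` in place of scikit-learn's `assert_raises` helper):

```
import numpy as np
import pytest
from numpy.testing import assert_array_equal

# Hypothetical outputs of a non-scalar metric (e.g. a confusion matrix),
# computed once without and once with sample weights.
unweighted_score = np.array([[5.0, 1.0], [2.0, 4.0]])
weighted_score = np.array([[4.2, 1.8], [1.5, 4.5]])

# The check passes only if assert_array_equal itself fails, i.e. the
# sample weights really changed the metric's output.
with pytest.raises(AssertionError):
    assert_array_equal(unweighted_score, weighted_score)
```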
Use approximate equality here, please
You mean `assert_array_almost_equal`, right? If I do that, it makes the test fail for `micro_average_precision_score`:

    def _raiseFailure(self, standardMsg):
        msg = self.test_case._formatMessage(self.msg, standardMsg)
    >   raise self.test_case.failureException(msg)
    E   AssertionError: Unweighted and weighted scores are unexpectedly almost equal (0.118667604216) and (0.118668873707) for micro_average_precision_score

    msg = 'Unweighted and weighted scores are unexpectedly almost equal (0.118667604216) and (0.118668873707) for micro_average_precision_score'
    self = <sklearn.utils._unittest_backport._AssertRaisesContext instance at 0x7ff47bf8f758>
    standardMsg = 'AssertionError not raised'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think micro_average_precision_score should be identical to accuracy_score. Perhaps this is not as robust a test as we thought? Does that still fail if assert_allclose is used? It relies on relative tolerance rather than absolute and is probably a better test here
Interesting. The test passes when using `assert_allclose`. Additionally, the use of `assert_array_almost_equal` is advised against by NumPy. I decided to replace all approximate equality checks with `assert_allclose` to increase the overall robustness of the invariance tests.
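A small standalone illustration of why the switch matters (the numbers are borrowed from the failure above, nothing else is taken from the PR): `assert_array_almost_equal` uses an absolute, decimal-based criterion, while `assert_allclose` uses a relative tolerance by default, so it still distinguishes values that agree to six decimal places.

```
import numpy as np
from numpy.testing import assert_allclose, assert_array_almost_equal

# The two scores differ by roughly 1.3e-6.
a = np.array([0.118667604216])
b = np.array([0.118668873707])

# Absolute check: passes, since |a - b| < 1.5e-6 (default decimal=6).
assert_array_almost_equal(a, b)

# Relative check: fails, since |a - b| > 1e-7 * |b| (default rtol=1e-7).
try:
    assert_allclose(a, b)
except AssertionError:
    print("assert_allclose detects the difference")
```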
- Increase weight range to circumvent approximate equality for some metrics.
I am not 100% sure if we would need wrappers for the array asserts. What do you think? PS: I gave up on trying to add
- The use of `assert_allclose` over `assert_array_almost_equal` is recommended by NumPy.
- Revert initial change of sample weights as obsolete.
I'm not certain I've reviewed this thoroughly, but it LGTM
Looks good, minor comments.
sklearn/metrics/tests/test_common.py (outdated)

    @@ -103,7 +96,13 @@
         "accuracy_score": accuracy_score,
         "balanced_accuracy_score": balanced_accuracy_score,
         "unnormalized_accuracy_score": partial(accuracy_score, normalize=False),
    -    "confusion_matrix": confusion_matrix,
    +    "unnormalized_confusion_matrix": confusion_matrix,
I think it's a bit weird to name them this way, given that `confusion_matrix` is unnormalized for us. I'd rather call this one `confusion_matrix` and the other one `normalized_confusion_matrix`.
Agreed. It felt very counter-intuitive to name `confusion_matrix` that way. However, this was necessary to have the scaling test schedule the correct metric, see https://github.com/dmohns/scikit-learn/blob/94b9bd0d52ee55478ea7607cfd3037c83e4c2f4b/sklearn/metrics/tests/test_common.py#L1009. For the scope of this PR I didn't want to touch any of the actual test code. The other option would be to introduce a new list of "unnormalized" metrics and use that for scheduling rather than `name.startswith`. Thoughts?
What are the other unnormalized metrics? I wouldn't mind too much keeping it this way if you add a comment on why it is done this way.
Basically the ones that can be called with a `normalize` parameter, which defaults to `True`. Currently this affects `balanced_accuracy_score`, `jaccard_similarity_score`, `zero_one_loss` and `log_loss`. Let me add a comment on why I am doing what I am doing there and see if it gets clearer.
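To make the naming convention under discussion concrete, here is a simplified sketch (illustrative only; the real dictionaries and the scaling test in `test_common.py` are larger): metrics with a `normalize` flag are registered twice, and the sample-weight scaling test branches on the `unnormalized` name prefix.

```
from functools import partial
from sklearn.metrics import accuracy_score, zero_one_loss

# Each normalize-aware metric appears twice: once with its default
# (normalized) behaviour, once forced to normalize=False under an
# "unnormalized_" prefix.
METRICS = {
    "accuracy_score": accuracy_score,
    "unnormalized_accuracy_score": partial(accuracy_score, normalize=False),
    "zero_one_loss": zero_one_loss,
    "unnormalized_zero_one_loss": partial(zero_one_loss, normalize=False),
}

for name, metric in METRICS.items():
    if name.startswith("unnormalized"):
        # checked to scale linearly when sample weights are scaled
        pass
    else:
        # checked to be invariant under uniform scaling of sample weights
        pass
```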
sklearn/metrics/tests/test_common.py (outdated)

             NOT_SYMMETRIC_METRICS, THRESHOLDED_METRICS,
             METRIC_UNDEFINED_BINARY_MULTICLASS),
         set(ALL_METRICS))

    -    assert_equal(
    +    assert_array_equal(
I don't understand this one
Good catch. This is an artifact of me blindly removing all occurrences of `assert_equal` wherever technically possible. In this situation it is clearer to keep the original `assert_equal` to avoid confusion. Will undo this change.
sklearn/metrics/tests/test_common.py (outdated)

    # Metrics are tested with numpy's assert_array. Therefore box-shaped outputs
    # are required.
    CURVE_METRICS = {
        "roc_curve": lambda *args, **kwargs: roc_curve(*args, **kwargs)[:2],
You're throwing away the thresholds, right? Which makes sense, but I think the comment should say that. It's also not only about the shape, right? The thresholds probably don't behave like a metric.
Yes, I am throwing away the thresholds here. But no, I am only really doing it because of shape, as `thresholds` will have a different dimension compared to the other returned values. If I recall correctly, the `thresholds` on their own should pass all the tests. I will test if we can keep the thresholds by padding the result into a box-shaped array rather than throwing stuff away.
That seems really weird to me...

> In this situation it is clearer to keep the original assert_equal to avoid confusion.
    @@ -148,6 +152,36 @@
         "cohen_kappa_score": cohen_kappa_score,
     }


     def precision_recall_curve_padded_thresholds(*args, **kwargs):
@amueller Following up on the conversation in the earlier thread: I added an alternative way to make the invariance tests work with `precision_recall_curve`. Does this make things clearer?
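For readers following along, a minimal sketch of what such a padding wrapper can look like (an illustration of the idea rather than the exact code added in the PR): `precision_recall_curve` returns `thresholds` with one element fewer than `precision` and `recall`, so the shorter array is padded before everything is stacked into a box-shaped array.

```
import numpy as np
from sklearn.metrics import precision_recall_curve


def precision_recall_curve_padded_thresholds(*args, **kwargs):
    # thresholds has len(precision) - 1 entries, so the three outputs
    # cannot be stacked into a rectangular array directly.
    precision, recall, thresholds = precision_recall_curve(*args, **kwargs)

    pad_width = len(precision) - len(thresholds)
    # 0 is used as a placeholder padding value here; the thread below
    # suggests something more conspicuous, such as NaN.
    thresholds_padded = np.pad(thresholds, pad_width=(0, pad_width),
                               mode='constant', constant_values=0)

    # Box-shaped output that numpy's array asserts can compare element-wise.
    return np.array([precision, recall, thresholds_padded])
```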
To be sure, we could pad with something more conspicuous, like nan???
Otherwise lgtm
I like the idea of padding with NaNs. It makes things a bit more verbose.
@jnothman Sure, will do.
Resolved merge conflicts:
- sklearn/metrics/tests/test_common.py
- Specifically, NumPy 1.10.4 seems to have a problem with casting
  ```
  ((val1, val2), ) => ((val1, val2), (val1, val2), ...)
  ```
  see https://github.com/numpy/numpy/blob/v1.10.4/numpy/lib/arraypad.py#L989. We can circumvent this by using the simpler trait of `np.pad`, passing a singleton to `constant_values`.
- Specifically, NumPy 1.8.2 seems to have a problem with casting
  ```
  int => ((int, int), (int, int), ...)
  ```
  when the value is NaN. See https://github.com/numpy/numpy/blob/v1.8.2/numpy/lib/arraypad.py#L1042. We can circumvent this by wrapping the NaN value into a list instead of passing a NaN singleton to `constant_values`.
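To make the two compatibility notes concrete, here is a hedged sketch of the call styles being discussed (the array and pad widths are made up; the version-specific behaviour is as described in the notes above, not something re-verified here):

```
import numpy as np

thresholds = np.array([0.1, 0.35, 0.4, 0.8])

# Fully spelled-out form: one (before, after) pair of padding values per
# axis. According to the note above, NumPy 1.10.4 fails to broadcast a
# single pair like this to all axes.
padded = np.pad(thresholds, pad_width=(0, 2), mode='constant',
                constant_values=((np.nan, np.nan),))

# Workaround: wrap the single NaN in a list and let np.pad broadcast it
# itself, which also avoids the NaN-casting issue seen on NumPy 1.8.2.
padded = np.pad(thresholds, pad_width=(0, 2), mode='constant',
                constant_values=[np.nan])

print(padded)  # [0.1  0.35 0.4  0.8  nan  nan]
```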
Thanks! Good work!
Reference Issues/PRs
Following up a discussion from #10591
Automated invariance tests are currently implemented in test_common.py for scalar ranking functions. It is desirable to have these tests also available for ranking metrics with potentially non-scalar outputs, for example roc_curve, classification_report or confusion_matrix.

What does this implement/fix? Explain your changes.
The scope of this PR is to generalize test_common.py to allow testing of metrics with non-scalar output. This shall be achieved by replacing all asserts in such a way that they allow assertions on array-like structures.

Any other comments?
It should first be discussed whether a solution like this is feasible. An alternative route, suggested by @jnothman, is to aggregate non-scalar metrics using an arbitrary measure and apply the scalar invariance tests.
To-do

- assert_array_greater and assert_array_not_equal
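A possible shape for those two helpers (purely a sketch; neither name exists in NumPy or scikit-learn as used here, and the real wrappers may look different): an element-wise strict-greater assertion and a negated closeness check built on NumPy's testing utilities.

```
import numpy as np
from numpy.testing import assert_allclose


def assert_array_greater(x, y, err_msg=''):
    # Element-wise strict comparison; fails if any element of x is <= y.
    if not np.all(np.asarray(x) > np.asarray(y)):
        raise AssertionError(err_msg or "Arrays are not element-wise greater")


def assert_array_not_equal(x, y, err_msg=''):
    # Negated closeness check: fails only if the arrays are (almost) equal.
    try:
        assert_allclose(x, y)
    except AssertionError:
        return
    raise AssertionError(err_msg or "Arrays are unexpectedly (almost) equal")


# Usage with made-up metric outputs:
assert_array_greater([2.0, 3.0], [1.0, 2.5])
assert_array_not_equal([[1, 0], [0, 1]], [[2, 0], [0, 2]])
```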