[MRG] Add Detection Error Tradeoff (DET) curve classification metrics #10591
Conversation
All code contributors will be noted in the changelog entry, which goes in doc/whats_new/v0.20.rst. Could you add an example plotting a DET curve and illustrating its usage, perhaps in contrast to ROC?
sklearn/metrics/ranking.py
Outdated

                             sample_weight=None):
        """Compute error rates for different probability thresholds.

        Note: This implementation is restricted to the binary classification task.
Is it commonly applied to non-binary tasks?
Not that I am aware of, but I am anything but an expert on that domain. By their nature of comparing error types, they are best suited to evaluating binary detection tasks.
Maybe you can reword so this does not seem like a fault of the implementation
doc/modules/model_evaluation.rst
Outdated

    `detection error tradeoff curve, or DET curve <https://en.wikipedia.org/wiki/Detection_error_tradeoff>`_.
    Quoting Wikipedia :

    "A detection error tradeoff (DET) graph is a graphical plot of error rates for binary classification systems, plotting false reject rate vs. false accept rate. The x- and y-axes are scaled non-linearly by their standard normal deviates (or just by logarithmic transformation), yielding tradeoff curves that are more linear than ROC curves, and use most of the image area to highlight the differences of importance in the critical operating region."
please limit line lengths to 80 characters. Blockquotes can be indented.
Could you add discussion of how to use this for model selection / evaluation / critique?
I added a small example to compare ROC and DET curves on the same classification task, along with further documentation and usage notes about DET curves. Happy to receive feedback on both.
doc/modules/model_evaluation.rst
Outdated

    * The normal deviate scale transformation spreads out the points such that a
      comparatively larger space of plot is occupied.
      Therefor curves with similar classification performance might be easier to
*therefore
doc/modules/model_evaluation.rst
Outdated

    DET curves are intuitive to read and hence allow quick visual assessment of a
    classifiers performance.
    Additionally DET curves can be consulted for threshold and operating point
    analysis if an comparison error types is required.
Not grammatical
doc/modules/model_evaluation.rst
Outdated

    analysis if an comparison error types is required.

    One the other hand DET curves do not provide their metric as a single number.
    Therefor for either automated evaluation or comparison to other
*therefore
sklearn/metrics/ranking.py
Outdated

    # and reverse the outputs so list of false positives is decreasing
    last_ind = tps.searchsorted(tps[-1]) + 1
    first_ind = fps[::-1].searchsorted(fps[0])
    sl = range(first_ind, last_ind)[::-1]
Use a slice object instead of a range
doc/modules/model_evaluation.rst
Outdated

    .. topic:: References:

      .. [1] `Wikipedia entry for Detection error tradeoff
             <https://en.wikipedia.org/wiki/Detection_error_tradeoff>`_
Use a versioned Wikipedia link
doc/modules/model_evaluation.rst
Outdated

    .. topic:: References:

      .. [1] `Wikipedia entry for Detection error tradeoff
Please use named, rather than numbered, references, to avoid conflicts with other references in the page
Otherwise LGTM
sklearn/metrics/ranking.py
Outdated

    first_ind = fps[::-1].searchsorted(fps[0])
    sl = range(first_ind, last_ind)[::-1]
    return fps[sl] / n_count, fns[sl] / p_count, thresholds[sl]
    sl = slice(first_ind, last_ind)
You can set the step so that the slice is reversed
I initially went that route. Unfortunately that makes the computation of the slice borders very opaque. To deal with the fact that the "upper" slice border is exclusive, we would have to subtract 2 in total, with additional exception handling for the cases 1 and 0. I figured no one would ever be able to follow what is happening and opted for the explicit approach.
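Not the PR's actual code, just a toy illustration of the tradeoff discussed above: an explicit forward slice that is reversed afterwards, versus a reversed slice whose exclusive stop bound needs off-by-one handling.

```python
import numpy as np

a = np.arange(10)            # toy stand-in for the fps/fns arrays
first_ind, last_ind = 2, 7

# Explicit variant (as in the PR): forward slice, then reverse the result.
sl = slice(first_ind, last_ind)
print(a[sl][::-1])           # [6 5 4 3 2]

# Reversed-slice variant: the exclusive stop bound forces a -1 adjustment
# plus a special case (stop must become None when first_ind is 0), which
# is what made the slice borders opaque.
stop = first_ind - 1 if first_ind > 0 else None
print(a[slice(last_ind - 1, stop, -1)])   # [6 5 4 3 2]
```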
Please add an entry to the change log at doc/whats_new/v0.20.rst.
I have two more questions regarding this PR:
Oh no! I let you get away without a unit test? Yes, you should try to test its main properties and its support for edge cases (ties, perfect scores, all same prediction). As to docs, you can see the rendered docs through CircleCI artifacts (see our developer docs for details).
Sorry I've been a bit hasty!
I added some basic tests oriented at other tests in the module. Side question: now that I am touching the file anyway, should I also fix its existing PEP8 issues?
Re pep8: Please do not change unrelated things. It makes your contribution harder to review and may introduce merge conflicts to other pull requests. If it's really only missing line breaks, the second issue may not be severe, so you could submit a quick separate PR.
    def test_detection_error_tradeoff_curve_toydata():
        # Check on a batch of small examples.
        test_cases = [
Please use pytest.mark.parametrize instead
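For illustration, a sketch of what the suggested parametrization could look like, using toy cases quoted later in this thread (the function name is the one proposed in this PR branch; released scikit-learn exposes the metric as det_curve):

```python
import pytest
from numpy.testing import assert_array_almost_equal

from sklearn.metrics import detection_error_tradeoff_curve


@pytest.mark.parametrize(
    "y_true, y_score, expected_fpr, expected_fnr",
    [
        ([1, 0, 1], [0.5, 0.75, 1], [1, 1, 0], [0, 0.5, 0.5]),
        ([1, 0, 1], [0.25, 0.5, 0.75], [1, 1, 0], [0, 0.5, 0.5]),
    ],
)
def test_detection_error_tradeoff_curve_toydata(y_true, y_score,
                                                expected_fpr, expected_fnr):
    # Each toy case becomes its own test instead of a loop in one test body.
    fpr, fnr, _ = detection_error_tradeoff_curve(y_true, y_score)
    assert_array_almost_equal(fpr, expected_fpr)
    assert_array_almost_equal(fnr, expected_fnr)
```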
        [0, 1, 1], [0, 0.5]
    )

    # When the true_y values are all the same a detection error tradeoff cannot
y_true
Ok, makes a lot of sense. Will open a separate PR for those.
Nice work
        (0), (0.25), (0.5), (0.75), (1)
    ])
    def test_detection_error_tradeoff_curve_constant_scores(y_score):
        # Check computation with perfect and constant scores.
you're not testing with perfect scores, are you?
Good catch, will add.
        ([1, 0, 1], [0.5, 0.75, 1], [1, 1, 0], [0, 0.5, 0.5]),
        ([1, 0, 1], [0.25, 0.5, 0.75], [1, 1, 0], [0, 0.5, 0.5]),
    ])
    def test_detection_error_tradeoff_curve_toydata(
We would more conventionally style this as:

    def test_detection_error_tradeoff_curve_toydata(y_true, y_score,
                                                    expected_fpr, expected_fnr):
        ([1, 0, 1], [0.25, 0.5, 0.5], [1], [0]),
        ([1, 1, 0], [0.25, 0.5, 0.5], [1], [0]),
    ])
    def test_detection_error_tradeoff_curve_tie_handling(
as above
        # be computed.
        with np.errstate(all="raise"):
            assert_raises(
                FloatingPointError,
would we be better off with a more explicit error message?
I was considering it, but decided against it. Currently the function behaves like this: when you call detection_error_tradeoff_curve with all y_true values the same, it will return NaN in one of the return values and NumPy will throw a warning. I find this behavior sort-of intuitive, as it suggests that the issue might be in the data and not in the way the function is called.
But I have no strong feeling towards either solution. I can change it, if this is standard behavior in other scikit-learn functions.
My concern is that cases like this do happen, at least when the calling of the metric is automated within some cross-validation or similar routine (and hence data sub-samples are taken). If it is going to be hard for users to debug, we should give them a reasonably informative error message... if it doesn't involve excessive additional computation to handle their edge case.
Good point. Given the computational overhead is indeed very small I will make this a little more explicit.
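The commit history only says "Explicitly sanity check input before computing a DET curve", so here is a hypothetical sketch of such a check (the helper name is invented for illustration):

```python
import numpy as np


def _check_two_classes(y_true):
    # Fail fast with an informative message instead of letting NumPy
    # emit a NaN/division warning deep inside the computation.
    if len(np.unique(y_true)) != 2:
        raise ValueError(
            "Only one class present in y_true. A detection error "
            "tradeoff curve is not defined in that case."
        )


_check_two_classes([0, 1, 1, 0])  # passes silently for a binary target
```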
Should this now be marked [MRG]?
Please try to include the metric in test_common.py like other metrics and remove redundant tests if possible.
Interesting, I didn't realize there was automated invariance checking happening. Unfortunately it doesn't yet cover curves: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/tests/test_common.py#L53. For now, I will add the DET curve to the comment.
Thanks :) That's exactly what I mean.
I suppose, to ensure input checking, putting some arbitrary aggregate of the curves in common tests would be good, actually?
I had a quick look at test_common: some of the tests do make sense for curve metrics in general and the DET curve in particular, e.g. order invariance, pos_label etc. Decent input checking could probably be achieved using existing tests without even the need to mess with arbitrary aggregates. Funny side fact: now that we raise an error when only one class is present in y_true, this line in the invariance check could fail for an unlucky random subsample. Regardless, adding curve support to test_common might (or might not) involve some major changes to the test_common module itself. I am not really sure if this can be in scope of this PR. Thoughts?
that's why we fix the random state!
I would very much enjoy a PR which extended the invariance tests for metrics to curves, classification_report and confusion_matrix. I find those invariance tests very helpful. It would make sense to separate it from this PR.
Agreed. The invariance tests seem like a desirable thing to have for curves. Another thought crossed my mind: maybe we can change all asserts in test_common to array assertions. Assuming those work for scalars, which I think they do, we could simply reuse the already existing tests.
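A quick check of that assumption: numpy.testing's array assertions do accept scalars, so the existing scalar-metric tests could in principle be reused.

```python
from numpy.testing import assert_array_almost_equal

# Both calls pass: scalars are treated like 0-d arrays.
assert_array_almost_equal(0.5, 0.5)
assert_array_almost_equal([0.0, 0.5, 1.0], [0.0, 0.5, 1.0])
```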
What is the best way to proceed with this? I managed to put something together that will make test_common support non-scalar metrics. It will probably need some more beautification, but it seems to work for now. If you want I can share. Should we put the DET curve PR on hold and try to add curves to the invariance tests first, or vice-versa?
It might take some time for this to get a second review in any case. Make a separate PR with the common tests changes, and we'll take a look.
               Wikipedia, The Free Encyclopedia. September 4, 2017, 23:33 UTC.
               Available at: https://en.wikipedia.org/w/index.php?title=Detection_error_tradeoff&oldid=798982054.
               Accessed February 19, 2018.
        .. [Martin1997] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki,
Martin 1997 is not cited in the text: to avoid the sphinx warning you can either cite it where appropriate or remove the square bracket content.
Hope this will help in fixing sphinx warnings and documentation rendering.
"""Compute error rates for different probability thresholds. | ||
Note: This metrics is used for ranking evaluation of a binary | ||
classification task. | ||
Read more in the :ref:`User Guide <det_curve>`. |
    Read more in the :ref:`User Guide <det_curve>`.
    pos_label : int, optional (default=None)
        The label of the positive class
    sample_weight : array-like of shape = [n_samples], optional
        Sample weights.
        Sample weights.
        rate of predictions with score >= thresholds[i]. This is occasionally
        referred to as false rejection or miss rate.
    thresholds : array, shape = [n_thresholds]
        Decreasing score values.
        Decreasing score values.
    See also
    --------
    roc_curve : Compute Receiver operating characteristic (ROC) curve
    precision_recall_curve : Compute precision-recall curve
    precision_recall_curve : Compute precision-recall curve

(The suggested changes in this review restore blank lines that were swallowed by copy/paste.)
examples/model_selection/plot_det.py
Outdated

    )

    # finally add legend
    ax_det = plt.subplot(1, 2, 2)
Apparently this line generates a DeprecationWarning from matplotlib (see the rendered example): it would be nice if the warning could be avoided. Thanks!
Will see what I can do, currently matplotlib is giving me a hard time ...
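The eventual fix is only described as "Use modern matplotlib API" in the commit list; one plausible shape of it is the object-oriented interface, which avoids the implicit pyplot state that triggers such deprecation warnings:

```python
import matplotlib.pyplot as plt

# Create both axes explicitly up front instead of calling
# plt.subplot(1, 2, ...) repeatedly on the implicit current figure.
fig, (ax_roc, ax_det) = plt.subplots(1, 2, figsize=(11, 5))
ax_roc.set_title("Receiver Operating Characteristic (ROC) curves")
ax_det.set_title("Detection Error Tradeoff (DET) curves")
plt.show()
```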
Thanks @dmohns for syncing, I think this can be moved to the [MRG] status... :)
I just looked at example https://113142-843222-gh.circle-artifacts.com/0/doc/auto_examples/model_selection/plot_det.html to get convinced. Maybe you can make the point clearer in the example?
examples/model_selection/plot_det.py
Outdated

    In this example we compare receiver operating characteristic (ROC) and
    detection error tradeoff (DET) curves to demonstrate how DET curves can help
    to asses the performance of different classification algorithms.
Suggested change:
    to asses the performance of different classification algorithms.
    to assess the performance of different classification algorithms.
DET curves are commonly plotted in normal deviate scale. | ||
To achieve this we transform the errors rates as returned by the | ||
``detection_error_tradeoff_curve`` function and the axis scale using | ||
``scipy.stats.norm``. |
please add a few words on what this example demonstrates? what is the take home message? Are you saying that DET makes it easier to highlight that RF is better than ROC?
This example merely tries to demonstrate to the unfamiliar user what an example DET curve can look like. I tried not to repeat myself too much, and the properties, pros and cons are already discussed in the usage guide (which uses the sample plot).
Would you be fine if I added an explicit link to the usage guide here too?
Let me try to clarify what I suggest. When you say "can help to assess ...", please add a sentence on why it helps, as evidenced by the plot below. I am mostly asking "why" users should consider DET, and this example should be a proof that it does help. Maybe a couple more sentences commenting on the plot can help readers look at the right thing. I am sure it's obvious to you but not to everyone. Thanks.
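For context, a minimal sketch of the normal-deviate transformation the example describes, using scipy.stats.norm.ppf (the inverse of the standard normal CDF); the error-rate values here are placeholders, not real classifier output:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

# Placeholder error rates, as they would come out of the DET curve function.
fpr = np.array([0.4, 0.2, 0.1, 0.05])
fnr = np.array([0.05, 0.1, 0.3, 0.5])

# Probit transform: plotting standard normal deviates of the error rates
# tends to straighten DET curves compared to ROC curves.
plt.plot(norm.ppf(fpr), norm.ppf(fnr))
plt.xlabel("False Positive Rate (normal deviate scale)")
plt.ylabel("False Negative Rate (normal deviate scale)")
plt.show()
```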
@scikit-learn/core-devs a two-year-old PR, double approved! :) Someone available for merging?
thx @dmohns
Thank you, @dmohns, for this contribution. Would it make sense to have a plot_det_curve as well?
@lorentzenchr You can comment there: #18169. I have a couple of things that I did not get time to add before the merge happened.
The plot_det_curve was submitted there: #18176
…scikit-learn#10591)
* Initial add DET curve to classification metrics
* Add DET to exports
* Fix DET-curve doctest errors: sample snippet in model_evaluation documentation was outdated
* Clarify wording in DET-curve computation: align to the wording of the ranking module to make it consistent; add correct description of inputs and outputs; update and fix non-existent links
* Beautify DET curve documentation source: limit line length to 80 characters
* Expand DET curve documentation: add an example plot to show the difference between ROC and DET curves; expand the Usage Note section with background information and properties of DET curves
* Update DET-curve documentation: fix typos and some grammar improvements; use named references to avoid potential conflicts with other sections; remove unneeded references and improve existing ones, e.g. by using versioned links
* Select relevant DET points using slice object
* Remove some dubiety from DET curve doc-string
* Add DET curve contributors
* Add tests for DET curves
* Streamline DET test by using parametrization
* Increase verbosity of DET curve error handling: explicitly sanity check input before computing a DET curve; add test for perfect scores; adapt indentation style to match the test module
* Add reference for DET curves in invariance test
* Add automated invariance checks for DET curves
* Resolve merge artifacts
* Make doctest happy
* Fix whitespaces for doctest
* Revert unintended whitespace changes
* Revert unintended white space changes scikit-learn#2
* Fix typos and grammar
* Fix white space in doc
* Streamline test code
* Remove rebase artifacts
* Fix PR link in doc
* Fix test_ranking
* Fix rebase errors
* Fix import
* Bring back newlines swallowed by copy/paste
* Remove uncited ref link
* Remove matplotlib deprecation warning
* Bring back hidden reference
* Add motivation to DET example
* Fix lint
* Add citation
* Use modern matplotlib API

Co-authored-by: Jeremy Karnowski <jeremy.karnowski@gmail.com>
Co-authored-by: Julien Cornebise <julien@cornebise.com>
Co-authored-by: Daniel Mohns <daniel.mohns@zenguard.org>
…ikit-learn#18169)
* follow-up on scikit-learn#10591
* DOC improvements
* TST add additional test for pos_label

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Reference Issues/PRs
Continuation of the awesome work in PR #4980 and PR #8724.
What does this implement/fix? Explain your changes.
Scope of this PR (and the PRs it is based on) is to implement the functionality
to compute Detection Error Tradeoff (DET) curves from within scikit-learn.
DET curves can be used as a classification evaluation metric in scenarios where
an explicit comparison of different error types is required.
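A minimal usage sketch: as merged, the metric is available as sklearn.metrics.det_curve (this PR originally named it detection_error_tradeoff_curve):

```python
import numpy as np
from sklearn.metrics import det_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# For each threshold, the false positive rate trades off against the
# false negative (miss) rate.
fpr, fnr, thresholds = det_curve(y_true, y_score)
print(fpr, fnr, thresholds)
```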
Any other comments?
Credit likewise goes to @jkarnows and @jucor, the authors of the aforementioned PRs.
For now I have essentially fixed some basic inconsistencies which made tests fail
and prevented the docs from being compiled.
DET curves are a niche classification evaluation metric.
Moving forward I would like to know whether they have a place in scikit-learn and whether a contribution like this is appreciated before putting additional effort into it.
If so, I am happy to continue working on this PR as per the ToDo list below.
ToDo list and next steps (without particular ordering)
Open questions
Additionally, I compiled a list of questions about the status quo so far:
I compared to similar modules like ROC and Precision-Recall curves, but the use of references is somewhat inconsistent. I personally would prefer not to put references in multiple places and opted to put them into the documentation.