
[MRG] Add Detection Error Tradeoff (DET) curve classification metrics #10591


Merged
merged 36 commits into scikit-learn:master on Aug 16, 2020

Conversation

@dmohns (Contributor) commented Feb 5, 2018

Reference Issues/PRs

Continuation of the awesome work in PR #4980 and PR #8724.

What does this implement/fix? Explain your changes.

The scope of this PR (and the PRs it is based on) is to implement the functionality
to compute Detection Error Tradeoff (DET) curves from within scikit-learn.
DET curves can be used as a classification evaluation metric in scenarios where
an explicit comparison of different error types is required.

Any other comments?

Credit likewise goes to @jkarnows and @jucor, the authors of the aforementioned PRs.
For now I have essentially fixed some basic inconsistencies which made tests fail
and prevented the docs from being compiled.

DET curves are a niche classification evaluation metric.
Moving forward I would like to know whether they have a place in scikit-learn and whether a contribution like this is appreciated before putting additional effort into it.

If so, I am happy to continue to work on this PR as per ToDo list below.

ToDo list and next steps (without particular ordering)

  • Fix broken/non-existent links in doc
  • Fix failing tests
  • Add tests for DET curves
  • Expand Usage notes with more background information about DET curves. Explain how DET curves can be used and be interpreted.
  • (Optional) Add example plot
  • Add automated invariance tests

Open questions

Additionally, I compiled a list of questions about the status quo so far:

  • Do we need references in the module description also?
    I compared to similar modules like ROC and Precision-Recall curves but the use of references is somewhat inconsistent. I personally would prefer to not put references in multiple places and opted to put them into the documentation.
  • I could not find any documentation whether or not to add contributors to the list of authors. I removed the references for now, but am happy to add them back if desired.
  • Any further comments of the original authors @jkarnows and @jucor ?
  • Any further suggestion for the ToDo list?

@jnothman (Member) commented Feb 5, 2018

All code contributors will be noted in the changelog entry, which goes in doc/whats_new/v0.20.rst.

Could you add an example plotting a det curve and illustrating its usage, perhaps in contrast to roc?

sample_weight=None):
"""Compute error rates for different probability thresholds.

Note: This implementation is restricted to the binary classification task.
Review comment (Member):

Is it commonly applied to non-binary tasks?

Review comment (Contributor Author):

Not that I am aware of, but I am anything but an expert in that domain. By their nature of comparing error types, they are best suited for evaluating binary detection tasks.

Review comment (Member):

Maybe you can reword so this does not seem like a fault of the implementation

`detection error tradeoff curve, or DET curve <https://en.wikipedia.org/wiki/Detection_error_tradeoff>`_.
Quoting Wikipedia :

"A detection error tradeoff (DET) graph is a graphical plot of error rates for binary classification systems, plotting false reject rate vs. false accept rate. The x- and y-axes are scaled non-linearly by their standard normal deviates (or just by logarithmic transformation), yielding tradeoff curves that are more linear than ROC curves, and use most of the image area to highlight the differences of importance in the critical operating region."
Review comment (Member):

please limit line lengths to 80 characters. Blockquotes can be indented.

Could you add discussion of how to use this for model selection / evaluation / critique?
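
For readers unfamiliar with the scaling described in the quote above, here is a minimal numerical sketch (editor's illustration, not part of the PR; the Gaussian score model is an assumption):

import numpy as np
from scipy.stats import norm

thresholds = np.linspace(-3.0, 3.0, 7)
# Negative scores ~ N(0, 1), positive scores ~ N(2, 1), in closed form:
fpr = norm.sf(thresholds, loc=0.0)   # false accept rate at each threshold
fnr = norm.cdf(thresholds, loc=2.0)  # false reject (miss) rate

# In deviate (probit) scale the curve is a straight line:
# here ppf(fpr) + ppf(fnr) is the constant -2 at every threshold.
print(np.round(norm.ppf(fpr) + norm.ppf(fnr), 6))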

@dmohns (Contributor Author) commented Feb 18, 2018

I added a small example comparing ROC and DET curves on the same classification task, along with further documentation and usage notes about DET curves. Happy to receive feedback on both.


* The normal deviate scale transformation spreads out the points such that a
comparatively larger space of plot is occupied.
Therefor curves with similar classification performance might be easier to
Review comment (Member):

*therefore

DET curves are intuitive to read and hence allow quick visual assessment of a
classifiers performance.
Additionally DET curves can be consulted for threshold and operating point
analysis if an comparison error types is required.
Review comment (Member):

Not grammatical

analysis if an comparison error types is required.

One the other hand DET curves do not provide their metric as a single number.
Therefor for either automated evaluation or comparison to other
Review comment (Member):

*therefore

# and reverse the outputs so list of false positives is decreasing
last_ind = tps.searchsorted(tps[-1]) + 1
first_ind = fps[::-1].searchsorted(fps[0])
sl = range(first_ind, last_ind)[::-1]
Review comment (Member):

Use a slice object instead of a range

.. topic:: References:

.. [1] `Wikipedia entry for Detection error tradeoff
<https://en.wikipedia.org/wiki/Detection_error_tradeoff>`_
Review comment (Member):

Use a versioned Wikipedia link


.. topic:: References:

.. [1] `Wikipedia entry for Detection error tradeoff
Review comment (Member):

Please use named, rather than numbered, references, to avoid conflicts with other references in the page

@jnothman (Member) left a comment:

Otherwise LGTM

first_ind = fps[::-1].searchsorted(fps[0])
sl = range(first_ind, last_ind)[::-1]
return fps[sl] / n_count, fns[sl] / p_count, thresholds[sl]
sl = slice(first_ind, last_ind)
Review comment (Member):

You can set the step so that the slice is reversed

@dmohns (Contributor Author) commented Feb 20, 2018:

I initially went that route. Unfortunately that makes the computation of the slice borders very opaque. To deal with the fact that the "upper" slice border is exclusive, we would have to subtract 2 in total, with additional exception handling for the cases 1 and 0.

I figured no one would ever be able to follow what is happening and opted for the explicit approach.
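
A hedged illustration of that trade-off on a toy array (editor's sketch; the names first_ind and last_ind mirror the PR code):

import numpy as np

a = np.arange(5)
first_ind, last_ind = 1, 4

# Explicit approach from the PR: forward slice, then reverse.
print(a[slice(first_ind, last_ind)][::-1])  # [3 2 1]

# Single reversed slice: the exclusive "stop" must be shifted by one, and
# stop == -1 misbehaves when first_ind == 0 (it then means "last element").
stop = first_ind - 1 if first_ind > 0 else None
print(a[last_ind - 1:stop:-1])              # [3 2 1]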


@jnothman (Member) commented:

Please add an entry to the change log at doc/whats_new/v0.20.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:

@jnothman jnothman changed the title [WIP] Add Detection Error Tradeoff (DET) curve classification metrics [MRG+1] Add Detection Error Tradeoff (DET) curve classification metrics Feb 20, 2018
@dmohns (Contributor Author) commented Feb 21, 2018

I have two more questions regarding this PR:

  • When I build the documentation locally the auto generated example will not hyperlink detection_error_tradeoff_curve in the source code. Am I missing something or is this behavior expected?
  • Would we need a dedicated unit test for this metric or is the implicit test through doctest and auto generated example sufficient in this case?

@jnothman (Member) commented:

Oh no! I let you get away without a unit test? Yes, you should try to test its main properties and its support for edge cases (ties, perfect scores, all same prediction).

As to docs, you can see the rendered docs through the CircleCI artifacts (see our developer docs for details).

@dmohns (Contributor Author) commented Feb 22, 2018

  1. Ok, I have to admit I was a little surprised to see this approved before even finishing my to-do list 😉. Will try to add unit tests for the cases mentioned.

  2. After a little hassle to access the CircleCI documentation, I can confirm that the link does work in the documentation built by CircleCI.

@jnothman jnothman changed the title [MRG+1] Add Detection Error Tradeoff (DET) curve classification metrics [WIP] Add Detection Error Tradeoff (DET) curve classification metrics Feb 22, 2018
@jnothman (Member) commented:

Sorry I've been a bit hasty!

@dmohns (Contributor Author) commented Feb 25, 2018

I added some basic tests modeled on the other tests in test_ranking. Any other cases that should be tested?

Side question: now that I am touching ranking.py and test_ranking.py anyway, can I also PEP8 them? (There are just a few line breaks missing.)

@jnothman (Member) commented:

Re pep8: Please do not change unrelated things. It makes your contribution harder to review and may introduce merge conflicts to other pull requests. If it's really only missing line breaks, the second issue may not be severe so you could submit a quick separate PR.


def test_detection_error_tradeoff_curve_toydata():
# Check on a batch of small examples.
test_cases = [
Review comment (Member):

Please use pytest.mark.parametrize instead
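
A minimal sketch of the suggested style, reusing toy data that appears later in this review (editor's illustration; the import path follows this PR's function name, which a follow-up renamed det_curve before release):

import numpy as np
import pytest
from sklearn.metrics import detection_error_tradeoff_curve

@pytest.mark.parametrize("y_true, y_score, expected_fpr, expected_fnr", [
    ([1, 0, 1], [0.5, 0.75, 1], [1, 1, 0], [0, 0.5, 0.5]),
    ([1, 0, 1], [0.25, 0.5, 0.75], [1, 1, 0], [0, 0.5, 0.5]),
])
def test_detection_error_tradeoff_curve_toydata(y_true, y_score,
                                                expected_fpr, expected_fnr):
    # Each parametrized case runs as its own test.
    fpr, fnr, _ = detection_error_tradeoff_curve(y_true, y_score)
    np.testing.assert_allclose(fpr, expected_fpr)
    np.testing.assert_allclose(fnr, expected_fnr)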

[0, 1, 1], [0, 0.5]
)

# When the true_y values are all the same a detection error tradeoff cannot
Review comment (Member):

y_true

@dmohns (Contributor Author) commented Feb 25, 2018

Re pep8: Please do not change unrelated things.

Ok, makes a lot of sense. Will open a separate PR for those.

@jnothman (Member) left a comment:

Nice work

(0), (0.25), (0.5), (0.75), (1)
])
def test_detection_error_tradeoff_curve_constant_scores(y_score):
# Check computation with perfect and constant scores.
Review comment (Member):

you're not testing with perfect scores, are you?

Review comment (Contributor Author):

Good catch, will add.

([1, 0, 1], [0.5, 0.75, 1], [1, 1, 0], [0, 0.5, 0.5]),
([1, 0, 1], [0.25, 0.5, 0.75], [1, 1, 0], [0, 0.5, 0.5]),
])
def test_detection_error_tradeoff_curve_toydata(
Review comment (Member):

We would more conventionally style this as:

def test_detection_error_tradeoff_curve_toydata(y_true, y_score,
                                                expected_fpr, expected_fnr):

([1, 0, 1], [0.25, 0.5, 0.5], [1], [0]),
([1, 1, 0], [0.25, 0.5, 0.5], [1], [0]),
])
def test_detection_error_tradeoff_curve_tie_handling(
Review comment (Member):

as above

# be computed.
with np.errstate(all="raise"):
assert_raises(
FloatingPointError,
Review comment (Member):

would we be better off with a more explicit error message?

Review comment (Contributor Author):

I was considering it, but decided against it.

Currently the function behaves like this: when you call detection_error_tradeoff_curve with all y_true the same, it will return NaN in one of the return values and NumPy will throw a warning. I find this behavior sort of intuitive, as it suggests that the issue might be in the data and not in the way the function is called.

But I have no strong feeling towards either solution. I can change it if this is standard behavior in other scikit-learn functions.

Review comment (Member):

My concern is that cases like this do happen, at least when the calling of the metric is automated within some cross-validation or similar routine (and hence data sub-samples are taken). If it is going to be hard for users to debug, we should give them a reasonably informative error message... if it doesn't involve excessive additional computation to handle their edge case.

Review comment (Contributor Author):

Good point. Given the computational overhead is indeed very small I will make this a little more explicit.
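
A hedged sketch of what a more explicit check could look like (editor's illustration; the helper name and final PR wording are assumptions):

import numpy as np

def _validate_det_inputs(y_true):
    # Raise a descriptive error instead of letting NumPy emit a
    # FloatingPointError/NaN when only one class is present.
    if len(np.unique(y_true)) != 2:
        raise ValueError(
            "Only one class present in y_true. Detection error "
            "tradeoff curve is not defined in that case."
        )

_validate_det_inputs([0, 1, 1])    # passes silently
# _validate_det_inputs([1, 1, 1])  # would raise ValueError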

@jnothman (Member) commented:

Should this now be marked [MRG]?

@dmohns dmohns changed the title [WIP] Add Detection Error Tradeoff (DET) curve classification metrics [MRG] Add Detection Error Tradeoff (DET) curve classification metrics Feb 26, 2018
@qinhanmin2014 (Member) left a comment:

Please try to include the metric in test_common.py like other metrics and remove redundant tests if possible.

@dmohns (Contributor Author) commented Feb 26, 2018

Interesting, I didn't realize there was automated invariance checking happening. Unfortunately it doesn't yet cover curves https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/tests/test_common.py#L53

For now, I will add DET curve to the comment.

@qinhanmin2014 (Member) commented:

For now, I will add DET curve to the comment.

Thanks :) That's exactly what I mean.

@jnothman (Member) commented Feb 26, 2018 via email

@dmohns (Contributor Author) commented Feb 26, 2018

I had a quick look at test_common: some of the tests do make sense for curve metrics in general and the DET curve in particular, e.g. order invariance, pos_label, etc. Decent input checking could probably be achieved using existing tests without even the need to mess with arbitrary aggregates.
EDIT: Actually the last sentence is not true, as all tests require scalar output. We would either need to aggregate as suggested or create a whole new class of test cases.

Funny side fact: now that we raise an error when only one class is present in y_true, this line in the invariance check, y_true = random_state.randint(0, 2, size=(20, )), is going to give us some headaches.

Regardless, adding curve support to test_common might (or might not) involve some major changes to the test_common module itself. I am not really sure if this can be in scope of this PR. Thoughts?

@jnothman (Member) commented Feb 26, 2018 via email

@jnothman (Member) commented Feb 26, 2018 via email

@dmohns (Contributor Author) commented Feb 27, 2018

Agreed. The invariance tests seem like a desirable thing to have for curves.

Another thought crossed my mind: maybe we can change all asserts in test_common to assert_arrays, assuming assert_arrays works for scalars, which I think it does. That way we could simply reuse the already existing tests.
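
A quick check of that assumption (editor's sketch): NumPy's array assertion helpers do accept plain scalars as well as arrays.

import numpy as np

# Both of these pass; a mismatch would raise AssertionError.
np.testing.assert_array_almost_equal(0.5, 0.5)
np.testing.assert_array_almost_equal([0.1, 0.2], [0.1, 0.2])
print("scalar and array assertions both pass")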

@dmohns (Contributor Author) commented Feb 27, 2018

What is the best way to proceed with this?

I managed to put something together that will make test_common support non-scalar metrics. It will probably need some more beautification but it seems to work for now. If you want, I can share it.

Should we put the DET curve PR on hold and try to add curves to invariance test first or vice-versa?

@jnothman (Member) commented Feb 27, 2018 via email

Wikipedia, The Free Encyclopedia. September 4, 2017, 23:33 UTC.
Available at: https://en.wikipedia.org/w/index.php?title=Detection_error_tradeoff&oldid=798982054.
Accessed February 19, 2018.
.. [Martin1997] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki,
Review comment (Contributor):

Martin1997 is not cited in the text: to avoid the Sphinx warning you can either cite it where appropriate or remove the square bracket content.

@cmarmo (Contributor) left a comment:

Hope this will help in fixing sphinx warnings and documentation rendering.

"""Compute error rates for different probability thresholds.
Note: This metrics is used for ranking evaluation of a binary
classification task.
Read more in the :ref:`User Guide <det_curve>`.
Review comment (Contributor):

Suggested change (formatting only; the two versions render identically in plain text):
Read more in the :ref:`User Guide <det_curve>`.

pos_label : int, optional (default=None)
The label of the positive class
sample_weight : array-like of shape = [n_samples], optional
Sample weights.
Review comment (Contributor):

Suggested change (formatting only; the two versions render identically in plain text):
Sample weights.

rate of predictions with score >= thresholds[i]. This is occasionally
referred to as false rejection or miss rate.
thresholds : array, shape = [n_thresholds]
Decreasing score values.
Review comment (Contributor):

Suggested change (formatting only; the two versions render identically in plain text):
Decreasing score values.

See also
--------
roc_curve : Compute Receiver operating characteristic (ROC) curve
precision_recall_curve : Compute precision-recall curve
Review comment (Contributor):

Suggested change (formatting only; the two versions render identically in plain text):
precision_recall_curve : Compute precision-recall curve

)

# finally add legend
ax_det = plt.subplot(1, 2, 2)
Review comment (Contributor):

Apparently this line generates a DeprecationWarning from matplotlib (see the rendered example): it would be nice if the warning could be avoided. Thanks!

Review comment (Contributor Author):

Will see what I can do; currently matplotlib is giving me a hard time ...

@cmarmo (Contributor) commented Jul 19, 2020

Thanks @dmohns for syncing, I think this can be moved to the [MRG] status... :)
@agramfort if you can find some time to review this one... Thanks a lot!

@agramfort (Member) left a comment:

I just looked at the example https://113142-843222-gh.circle-artifacts.com/0/doc/auto_examples/model_selection/plot_det.html to get convinced. Maybe you can make the point clearer in the example?


In this example we compare receiver operating characteristic (ROC) and
detection error tradeoff (DET) curves to demonstrate how DET curves can help
to asses the performance of different classification algorithms.
Review comment (Member):

Suggested change
to asses the performance of different classification algorithms.
to assess the performance of different classification algorithms.

DET curves are commonly plotted in normal deviate scale.
To achieve this we transform the errors rates as returned by the
``detection_error_tradeoff_curve`` function and the axis scale using
``scipy.stats.norm``.
Review comment (Member):

please add a few words on what this example demonstrates? what is the take home message? Are you saying that DET makes it easier to highlight that RF is better than ROC?

Review comment (Contributor Author):

This example merely tries to demonstrate to the unfamiliar user what an example of a DET curve can look like. I tried not to repeat myself too much, and the properties, pros and cons are already discussed in the usage guide (which uses the sample plot).
Would you be fine if I added an explicit link to the usage guide here too?
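
For reference, a hedged sketch of the normal-deviate plotting trick described in the quoted example text above (editor's illustration with placeholder error rates; the example in the PR is more complete):

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

# Placeholder rates standing in for detection_error_tradeoff_curve output.
fpr = np.array([0.01, 0.05, 0.20, 0.50])
fnr = np.array([0.40, 0.15, 0.05, 0.01])

fig, ax = plt.subplots()
ax.plot(norm.ppf(fpr), norm.ppf(fnr))  # plot in deviate (probit) scale

# Relabel the linear deviate axes with the corresponding error rates.
ticks = np.array([0.01, 0.05, 0.20, 0.50])
ax.set_xticks(norm.ppf(ticks))
ax.set_xticklabels(ticks)
ax.set_yticks(norm.ppf(ticks))
ax.set_yticklabels(ticks)
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("False Negative Rate")
plt.show()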

@agramfort (Member) commented Jul 25, 2020 via email

@dmohns dmohns changed the title [WIP] Add Detection Error Tradeoff (DET) curve classification metrics [MRG] Add Detection Error Tradeoff (DET) curve classification metrics Aug 1, 2020
@cmarmo (Contributor) commented Aug 15, 2020

@scikit-learn/core-devs a two-year-old PR, double approved! :) Someone available for merging?

@agramfort agramfort merged commit 41d648e into scikit-learn:master Aug 16, 2020
@agramfort (Member) commented:

thx @dmohns

@lorentzenchr (Member) commented:

Thank you, @dmohns, for this contribution.

Would it make sense to have a plot_detection_error_tradeoff_curve like plot_roc_curve?

@glemaitre (Member) commented:

@lorentzenchr You can comment there: #18169

I have a couple of things that I did not get time to add before the merge happened.

@glemaitre (Member) commented:

The plot_det_curve was submitted there: #18176

lorentzenchr added a commit that referenced this pull request Aug 19, 2020
* follow-up on #10591 

* DOC improvements

* TST add additional test for pos_label

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
…scikit-learn#10591)

* Initial add DET curve to classification metrics

* Add DET to exports

* Fix DET-curve doctest errors

- Sample snippet in  model_evaluation documentation was outdated.

* Clarify wording in DET-curve computation

- Align to the wording of ranking module to make it consistent.
- Add correct describtion of input and outputs.
- Update and fix non-existent links

* Beautify DET curve documentation source

- Limit line length to 80 characters.

* Expand DET curve documentation

- Add an example plot to show difference between ROC and DET curves.
- Expand Usage Note section with background information and properties
of DET curves.

* Update DET-curve documentation

- Fix typos and some grammar improvements.
- Use named references to avoid potential conflicts with other sections.
- Remove unneeded references and improved existing ones by using e.g.
using versioned links.

* Select relevant DET points using slice object

* Remove some dubiety from DET curve doc-string

* Add DET curve contributors

* Add tests for DET curves

* Streamline DET test by using parametrization

* Increase verbosity of DET curve error handling

- Explicitly sanity check input before computing a DET curve.
- Add test for perfect scores.
- Adapt indentation style to match the test module.

* Add reference for DET curves in invariance test

* Add automated invariance checks for DET curves

* Resolve merge artifacts

* Make doctest happy

* Fix whitespaces for doctest

* Revert unintended whitespace changes

* Revert unintended white space changes scikit-learn#2

* Fix typos and grammar

* Fix white space in doc

* Streamline test code

* Remove rebase artifacts

* Fix PR link in doc

* Fix test_ranking

* Fix rebase errors

* Fix import

* Bring back newlines

- Swallowed by copy/paste

* Remove uncited ref link

* Remove matplotlib deprecation warning

* Bring back hidden reference

* Add motivation to DET example

* Fix lint

* Add citation

* Use modern matplotlib API

Co-authored-by: Jeremy Karnowski <jeremy.karnowski@gmail.com>
Co-authored-by: Julien Cornebise <julien@cornebise.com>
Co-authored-by: Daniel Mohns <daniel.mohns@zenguard.org>
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
…ikit-learn#18169)

* follow-up on scikit-learn#10591 

* DOC improvements

* TST add additional test for pos_label

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>