[WIP] ENH Multilabel confusion matrix #10628

Closed · wants to merge 6 commits

Conversation

@jnothman (Member) commented on Feb 12, 2018

This PR proposes a helper for multilabel/set-wise evaluation metrics such as precision, recall, fbeta, jaccard (#10083), fall-out, miss rate and specificity (#5516). It also incorporates suggestions from #8126 on the efficiency of the multilabel true-positives calculation (though, perhaps unfortunately, it does not optimise for micro-averaging). Unlike confusion_matrix, it is optimised for the multilabel case, but it also handles multiclass problems as they are handled in precision_recall_fscore_support: as binarised OvR problems.

It greatly simplifies the precision_recall_fscore_support and future jaccard implementations, and allows further refactoring between them. It also makes the calculation of sufficient statistics explicit (although perhaps more statistics than strictly necessary), from which the standard metrics follow by simple arithmetic: it makes the code less mystifying. In that sense this is mostly a cosmetic change, but it gives users an easy way to generalise the P/R/F/S implementation to related metrics, as in the sketch below.
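
To make the sufficient-statistics point concrete, here is a rough sketch (not part of this PR's code; the per-class [[tn, fp], [fn, tp]] layout is only an assumption for illustration) of how P/R/S would be read off a stacked array of per-label 2x2 confusion matrices:

import numpy as np

# Hypothetical output of multilabel_confusion_matrix: shape (n_labels, 2, 2),
# each block assumed to be laid out as [[tn, fp], [fn, tp]].
MCM = np.array([[[2, 1],
                 [0, 1]],
                [[1, 0],
                 [1, 2]]])

tn, fp = MCM[:, 0, 0], MCM[:, 0, 1]
fn, tp = MCM[:, 1, 0], MCM[:, 1, 1]

precision = tp / (tp + fp)  # per-label precision
recall = tp / (tp + fn)     # per-label recall (sensitivity)
support = tp + fn           # number of true instances per label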

TODO:

  • implement multilabel_confusion_matrix and use it in precision_recall_fscore_support as an indirect form of testing
  • fix up edge cases that fail tests
  • benchmark multiclass implementation against incumbent P/R/F/S
  • benchmark multilabel implementation with benchmarks/bench_multilabel_metrics.py extended to consider non-micro averaging, sample_weight and perhaps other cases
  • directly test multilabel_confusion_matrix
  • document under model_evaluation.rst
  • document how to calculate fall-out, miss-rate, sensitivity, specificity from multilabel_confusion_matrix (see the sketch after this list)
  • refactor jaccard similarity implementation once [MRG] average parameter for jaccard_similarity_score #10083 is merged
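
For the fall-out/miss-rate/specificity item above, a rough sketch of what such documentation could show (toy data, dense-only counting; not the optimised implementation this PR is about):

import numpy as np

# Toy multilabel indicator targets (3 samples, 3 labels), purely illustrative.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [0, 0, 1]])

# Per-label sufficient statistics, counted column-wise.
tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)
fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)
fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)
tn = np.sum((y_true == 0) & (y_pred == 0), axis=0)

sensitivity = tp / (tp + fn)  # true positive rate (recall)
specificity = tn / (tn + fp)  # true negative rate
fall_out = fp / (fp + tn)     # false positive rate = 1 - specificity
miss_rate = fn / (fn + tp)    # false negative rate = 1 - sensitivity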

If another contributor would like to take this on, I would welcome it. I have marked this as Easy because the code and technical knowledge involved are not hard, but it will take a bit of work and a clear understanding of the metrics.

@jnothman added the Easy (Well-defined and straightforward way to resolve), Enhancement and help wanted labels on Feb 12, 2018
@sklearn-lgtm

This pull request fixes 2 alerts - view on lgtm.com

fixed alerts:

  • 2 for Potentially uninitialized local variable

Comment posted by lgtm.com

@jnothman changed the title from "ENH Multilabel confusion matrix" to "[WIP] ENH Multilabel confusion matrix" on Feb 12, 2018

@sklearn-lgtm

This pull request fixes 2 alerts when merging 542ec86 into e78263f - view on lgtm.com

fixed alerts:

  • 2 for Potentially uninitialized local variable

Comment posted by lgtm.com

@ShangwuYao (Contributor) commented on May 31, 2018

Hi @jnothman, I am continuing your work on this. But I am not familiar with the codecov thing; that check seems to be failing. Do I need to fix it?
Also, by benchmark, do you mean comparing the test results? How should I report them? I don't think they will go into the code, right?
Thanks!
edit: I figured out the benchmark thing.

@jnothman (Member, Author) commented on Jun 1, 2018 via email

@ShangwuYao (Contributor) commented on Jun 1, 2018

Thanks a lot for the help.
I think it would be better for me to check with you first before I mess with your code...
The benchmarking results are:

Metric                                                               csc     csr   dense
precision_recall_fscore_support                                    0.007   0.003   0.007
precision_recall_fscore_support_with_multilabel_confusion_matrix   0.008   0.005   0.009

Since the new implementation of precision_recall_fscore_support is slower than the original one, should I just remove the new one? You are just using it for testing, correct?

Also, I think your implementation doesn't support multiclass-multioutput (it supports multilabel-indicator); I should probably raise a ValueError for that case, along the lines of the sketch below.
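
Something like the following is what I have in mind (just a sketch; the helper name is a placeholder, and type_of_target is from sklearn.utils.multiclass):

from sklearn.utils.multiclass import type_of_target

def _check_not_multiclass_multioutput(y_true, y_pred):
    # Placeholder helper: reject multiclass-multioutput targets, which
    # multilabel_confusion_matrix does not handle.
    for name, y in (("y_true", y_true), ("y_pred", y_pred)):
        if type_of_target(y) == "multiclass-multioutput":
            raise ValueError("multiclass-multioutput targets are not "
                             "supported for %s" % name)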

Also, the use of sample_weight in multilabel_confusion_matrix doesn't seem correct:

>>> y_true
array([[1, 0, 1],
       [0, 1, 0],
       [1, 1, 0]])
>>> y_pred
array([[1, 0, 0],
       [0, 1, 1],
       [0, 0, 1]])
>>> sample_weight
array([[3, 2, 1],
       [1, 2, 3],
       [2, 3, 4]])
>>> multilabel_confusion_matrix(y_true, y_pred, sample_weight=sample_weight)
array([[[-2,  0],
        [ 2,  3]],

       [[-2,  0],
        [ 3,  2]],

       [[-5,  7],
        [ 1,  0]]])

edit: this is the multiclass case, not multilabel-indicator.
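
For reference, my understanding of the scikit-learn convention is that sample_weight should be one weight per sample (shape (n_samples,)), not one per cell, so I would expect the call to look more like this (I haven't re-run it, so no output shown):

>>> import numpy as np
>>> sample_weight = np.array([3, 1, 2])  # one weight per sample, not per cell
>>> multilabel_confusion_matrix(y_true, y_pred, sample_weight=sample_weight)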

@jnothman (Member, Author) commented on Jun 2, 2018 via email

Labels: Easy (Well-defined and straightforward way to resolve), Enhancement, help wanted
3 participants