
[WIP] score function computing balanced accuracy #6752


Closed
wants to merge 12 commits into from

Conversation

xyguo
Contributor

@xyguo xyguo commented May 3, 2016

Reference Issue

This PR addresses issue #6747, which suggests implementing a score function that computes the balanced accuracy.

What does this implement/fix? Explain your changes.

The balanced accuracy is actually the unweighted average of the per-class recall scores, and this functionality is already provided by sklearn.metrics.recall_score -- just pass average='macro' (and pos_label=None for versions before 0.18).

So the balanced_accuracy_score in this PR is a simple wrapper around recall_score.
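
For reference, a minimal sketch of that wrapper idea (the name and signature mirror what this PR proposes and are not the final API):

    import numpy as np
    from sklearn.metrics import recall_score

    def balanced_accuracy_score(y_true, y_pred, sample_weight=None):
        # Balanced accuracy as the unweighted (macro) average of per-class recall.
        return recall_score(y_true, y_pred, average='macro',
                            sample_weight=sample_weight)

    # Imbalanced binary example: recall is 6/6 on class 0 and 1/2 on class 1.
    y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
    y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1])
    print(balanced_accuracy_score(y_true, y_pred))  # 0.75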

Any other comments?

I'm not sure whether there should be a test case for this function, since the corresponding scenario is already tested for recall_score.

@xyguo
Contributor Author

xyguo commented May 3, 2016

According to the latest comment under issue #6747, the balanced accuracy should only be computed for binary classification problems as well as multi-label problems.

Fixing the implementation.

@jnothman
Member

jnothman commented May 3, 2016

You don't necessarily need to support multilabel initially. You do need to ensure this has:

  • narrative documentation in model_evaluation.rst
  • metrics common tests applied
  • unit tests, perhaps
  • a scorer

@xyguo
Contributor Author

xyguo commented May 4, 2016

@jnothman I see. By the way, I think it'd be better not to accept multilabel input, because this is essentially not a metric for multilabel problems. Maybe we should leave that to the user.

@jnothman
Member

jnothman commented May 4, 2016

You may be right that it's not often reported for multilabel problems, but any metric applicable to binary problems is applicable to each label of a multilabel problem: a multilabel problem can be seen as multiple binary tasks. But as I said, we can leave multilabel support out for the moment.

@xyguo
Contributor Author

xyguo commented May 5, 2016

Now I've made a preliminary version of the metric function, with corresponding documentation, and all tests pass.
It needs some review; any comments are welcome. Thanks.

The balanced accuracy is used in binary classification problems to deal
with imbalanced datasets. It is defined as the arithmetic mean of sensitivity
(true positive rate) and specificity (true negative rate), or the average
accuracy obtained on either class.
Member

Either use "recall" here or be explicit that it's "on either class's gold standard instances" or something.

@xyguo
Contributor Author

xyguo commented May 6, 2016

@jnothman I have updated the doc as you suggested.

What should I do next? (This is the first time I've contributed code to an open source project >_<)

Currently I'm trying to extend it to multilabel problems. As your comments mentioned, this quantity is equivalent to roc_auc_score with binary inputs, which already handles multilabel inputs via an average parameter. So the most straightforward approach is to simply wrap roc_auc_score, and on my machine the implementation based on roc_auc_score is roughly 2x faster than the one based on recall_score. But this requires importing a function from ranking.py, which may not be desired. Any suggestions?
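
As a quick sanity check of that equivalence (illustrative only, not code from this PR), one can compare the two metrics on hard 0/1 predictions, where the ROC AUC collapses to (sensitivity + specificity) / 2:

    import numpy as np
    from sklearn.metrics import recall_score, roc_auc_score

    rng = np.random.RandomState(0)
    y_true = rng.randint(0, 2, size=100)
    y_pred = rng.randint(0, 2, size=100)

    # With binary-valued "scores", the ROC curve has a single interior point,
    # and its area equals the mean of TPR and TNR, i.e. the macro recall.
    assert np.isclose(recall_score(y_true, y_pred, average='macro'),
                      roc_auc_score(y_true, y_pred))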

@jnothman
Member

jnothman commented May 6, 2016

Yes, I think wrapping roc_auc_score is the way to go... I don't think the module coupling is an issue in this case...

conventional accuracy (i.e., the number of correct predictions divided by the total
number of predictions). In contrast, if the conventional accuracy is above chance only
because the classifier takes advantage of an imbalanced test set, then the balanced
accuracy, as appropriate, will drop to chance.
Member

0.5 or 50% is clearer than 'chance'
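
To make that passage concrete, here is a small illustrative example (hypothetical numbers, not from the PR): a classifier that always predicts the majority class on a 90/10 test set gets 0.9 conventional accuracy but only 0.5 balanced accuracy.

    import numpy as np
    from sklearn.metrics import accuracy_score, recall_score

    y_true = np.array([0] * 90 + [1] * 10)   # imbalanced test set
    y_pred = np.zeros(100, dtype=int)        # always predict the majority class

    print(accuracy_score(y_true, y_pred))                 # 0.9
    print(recall_score(y_true, y_pred, average='macro'))  # 0.5 (chance level)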

@jnothman
Member

jnothman commented May 8, 2016

For the most part, this looks great. I'm not sure if more specific tests for balanced accuracy are needed, or whether the doctests + common tests suffice.

One problem with using roc_auc_score, I've realised, is that it doesn't already support a sparse matrix for y_pred in the multilabel case.

@xyguo
Contributor Author

xyguo commented May 8, 2016

Yes, I've noticed the problem caused by sparse input.
Actually, there are more problems than I expected when extending to the multilabel setting, and I need some suggestions on the following:

  1. if y_true contains only one class, for example

    y_true = np.array([1, 1, 1, 1])
    y_pred = np.array([1, 1, 0, 0])
    

    then recall_score(y_true, y_pred, average='macro', pos_label=None) will issue a warning and return 0.25, i.e., the recall on the absent class defaults to zero; but roc_auc_score would raise an exception. I'm not sure which behavior balanced_accuracy should follow. Personally I prefer the recall_score behavior, since most classification metrics seem to act like that (see the sketch after this list).

  2. To deal with multilabel inputs, I mimic other classification metrics and add an average parameter to the function definition. But in tests/test_common.py, the case test_no_averaging_labels assumes that classification metrics with an average parameter also accept a labels parameter, while roc_auc_score doesn't support that.

Maybe I'll have to reimplement it from scratch. -_- ||
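
For reference, a small sketch contrasting the two behaviours described in item 1 (the exact warning and exception text may vary across scikit-learn versions):

    import numpy as np
    from sklearn.metrics import recall_score, roc_auc_score

    # Degenerate case: y_true contains only one class.
    y_true = np.array([1, 1, 1, 1])
    y_pred = np.array([1, 1, 0, 0])

    # recall_score warns that class 0 has no true samples, treats its recall
    # as 0, and returns (0.0 + 0.5) / 2 = 0.25.
    print(recall_score(y_true, y_pred, average='macro'))

    # roc_auc_score raises instead, since AUC is undefined with a single class.
    try:
        roc_auc_score(y_true, y_pred)
    except ValueError as exc:
        print(exc)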

@jnothman
Member

jnothman commented May 8, 2016

y_true containing only one class is a pretty special case. In terms of giving meaningful numbers, I think it makes sense just to report the recall of that one class, but it's hard to justify directly from the definition of balanced accuracy. Note that your proposal means that if y_pred == y_true == np.ones(n) the score is 0.5.

The labels argument is only actually necessary for the multiclass case (and it's a bit weird that we support it for the multilabel case, but it harks back to when we had a different format for multilabel). In any case, you could certainly argue for skipping any of the METRIC_UNDEFINED_MULTICLASS metrics from that test.

@xyguo
Contributor Author

xyguo commented May 9, 2016

The labels argument may make sense if we are going to support the multilabel setting: say you want the macro-averaged metric on all labels except some undesired ones; then the labels argument could help exclude those unwanted labels (see the docs of recall_score for an example).

Maybe we should also support multi-class problems; the definition of balanced accuracy generalizes to multi-class settings naturally (although it may not be so useful when the number of classes exceeds two).

@jnothman
Member

jnothman commented May 9, 2016

How does it generalise to multiclass naturally? I don't think it's obvious.

I don't think the need to exclude labels is important for multilabel; it is important for multiclass which is why it is supported in recall_score (I would know; I proposed and implemented it).

@jnothman
Member

FYI, #5588 was an existing PR attempting this enhancement. I don't know why we didn't just continue on that one... but between these two PRs we should attempt some convergence...

@xyguo
Contributor Author

xyguo commented May 12, 2016

I have finished the support for multilabel, but several tests fail in test_common.py. For example, if the metric function accepts an average argument, then the case test_averaging_multiclass implicitly assumes the metric can deal with multiclass problems, while my implementation would raise an exception...

And several other cases fail due to similar problems. Maybe we should clarify the interface for the different types of metrics ...

@jnothman
Member

So you need to get those tests to check for METRIC_UNDEFINED_MULTICLASS, no?

@jnothman
Member

@rhiever has convinced me over at #6747 that we should indeed be supporting multiclass as just a macro-average over binarised problems (i.e. the same as calculating the multilabel balanced accuracy after LabelBinarizer).
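
A rough sketch of those semantics (illustrative only, not the merged implementation; it would still hit the single-class edge case discussed above if some class never appears in y_true):

    import numpy as np
    from sklearn.metrics import roc_auc_score
    from sklearn.preprocessing import LabelBinarizer

    def multiclass_balanced_accuracy(y_true, y_pred):
        # Binarise one-vs-rest, score each column as a binary problem, then
        # take the unweighted mean over classes. With hard 0/1 predictions,
        # roc_auc_score per column equals (sensitivity + specificity) / 2.
        lb = LabelBinarizer()
        Y_true = lb.fit_transform(y_true)
        Y_pred = lb.transform(y_pred)
        return np.mean([roc_auc_score(Y_true[:, i], Y_pred[:, i])
                        for i in range(Y_true.shape[1])])

    print(multiclass_balanced_accuracy([0, 1, 2, 2, 1, 0],
                                       [0, 2, 2, 1, 1, 0]))  # 0.75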

@xyguo
Contributor Author

xyguo commented Jun 17, 2016

@jnothman got it, I will work on it soon

@amueller
Member

@xyguo are you still working on this?

@xyguo
Contributor Author

xyguo commented Oct 11, 2016

@amueller Yes. I have been writing my thesis and don't have much time for this project. I plan to resume it later this month.

@dalmia
Contributor

dalmia commented Dec 5, 2016

@xyguo Are you still working on this or can I take this up?

@xyguo
Contributor Author

xyguo commented Dec 5, 2016

@dalmia Please take this up; I've just been too busy to work on it recently. Thanks!

@dalmia
Contributor

dalmia commented Dec 5, 2016

Thanks @xyguo

dalmia added a commit to dalmia/scikit-learn that referenced this pull request Dec 16, 2016
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Oct 9, 2017
amueller pushed a commit that referenced this pull request Oct 17, 2017
* add function computing balanced accuracy

* documentation for the balanced_accuracy_score

* apply common tests to balanced_accuracy_score

* constrained to binary classification problems only

* add balanced_accuracy_score for CLF test

* add scorer for balanced_accuracy

* reorder the place of importing balanced_accuracy_score to be consistent with others

* eliminate an accidentally added non-ascii character

* remove balanced_accuracy_score from METRICS_WITH_LABELS

* eliminate all non-ascii characters in the doc of balanced_accuracy_score

* fix doctest for nonexistent scoring function

* fix documentation, clarify linkages to recall and auc

* FIX: added changes as per last review See #6752, fixes #6747

* FIX: fix typo

* FIX: remove flake8 errors

* DOC: merge fixes

* DOC: remove unwanted files

* DOC update what's new
@lesteve
Member

lesteve commented Oct 18, 2017

Closed by #8066.

@lesteve lesteve closed this Oct 18, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017