
[MRG+1] Adding support for balanced accuracy #8066

Merged: 21 commits into scikit-learn:master on Oct 17, 2017

Conversation

@dalmia (Contributor) commented Dec 16, 2016

Reference Issue

Fixes #6747

What does this implement/fix? Explain your changes.

This is a continuation of #6752. @xyguo isn't available to work on this at the moment, so I have taken over from him; I want to thank him for his contribution, which greatly helped me understand the issue. In this PR, I have made the changes suggested in the last review of the linked PR and resolved the merge conflicts.

@dalmia changed the title from "[WIP] Adding support for balanced accuracy" to "[MRG] Adding support for balanced accuracy" on Dec 21, 2016
@dalmia (Contributor, Author) commented Dec 22, 2016

As discussed in the earlier PR thread, extending this to the multilabel case involves a number of corner cases. So, do we want to implement this for multilabel?

@xyguo (Contributor) commented Dec 22, 2016

That might be difficult to do by simply wrapping recall_score or roc_auc_score. I thought I might have to rewrite it from scratch, which resulted in a function similar to precision_recall_fscore_support. But it is quite ugly and still only a draft...

In addition, the file test_common.py also needs a lot of modification, because the definition of balanced_accuracy is a bit "impure" and it fails several of the common tests: for example, some tests assume that a metric accepting multilabel input also accepts certain parameters, which balanced_accuracy doesn't.

@dalmia (Contributor, Author) commented Dec 22, 2016

Yes, I read the discussion on your thread. Since you have already tried implementing it, do you suggest we should try adding it?

@jnothman (Member) commented Dec 22, 2016 via email

@dalmia changed the title from "[MRG] Adding support for balanced accuracy" to "[WIP] Adding support for balanced accuracy" on Jan 7, 2017
dalmia added 3 commits January 7, 2017 13:25
Conflicts:
	doc/modules/model_evaluation.rst
	sklearn/metrics/scorer.py
@dalmia (Contributor, Author) commented Jan 7, 2017

@jnothman I've gone over the discussion of this enhancement on the issue thread and the previous pull requests, and here is a summary to wrap everything up:

  1. We can extend this to multiclass problems by computing the macro-average over the binarized problems (see the sketch below).
  2. The problem lies in extending it to the multilabel setting: roc_auc_score doesn't support a sparse matrix for y_pred in the multilabel case.

So please let me know whether you feel we can simply support binary problems, or whether it is critical to try something different for the multilabel case.
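
For concreteness, a minimal sketch (illustrative only, not part of this PR) of the macro-averaged-recall formulation from point 1, using only the existing recall_score; for a binary problem it reproduces the balanced accuracy, and the same call generalizes to the multiclass case:

# Minimal sketch: balanced accuracy as macro-averaged recall.
from sklearn.metrics import recall_score

y_true = [0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 0, 0, 0, 1]

# Per-class recall: recall on class 1 is the TPR, recall on class 0 the TNR.
per_class_recall = recall_score(y_true, y_pred, average=None)
print(per_class_recall)                                  # [ 0.75  0.5 ]
print(per_class_recall.mean())                           # 0.625
print(recall_score(y_true, y_pred, average='macro'))     # 0.625, same value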

@xyguo (Contributor) commented Jan 7, 2017

I wrote the balanced accuracy function as follows (based on precision_recall_fscore_support). It does accept sparse input for multilabel cases, but it has to generate a dense matrix internally, because the balanced accuracy needs to compute accuracy on the negative class while the sparse matrix only stores the positive labels. There should be a space-efficient way to do this, since all the information about the negative class can be derived from the sparse matrix, but I don't know whether it can be implemented compactly (see the sketch after the code below for one possibility).

# Imports this draft would need as a standalone snippet; inside
# sklearn/metrics/classification.py (where it is meant to live) these
# names are already available at module level.
import warnings

import numpy as np
from scipy.sparse import csr_matrix

from sklearn.preprocessing import LabelEncoder
from sklearn.utils.multiclass import unique_labels
from sklearn.utils.sparsefuncs import count_nonzero
from sklearn.metrics.classification import _check_targets, _prf_divide

bincount = np.bincount  # stand-in for the module's `bincount` helper


def balanced_accuracy_score(y_true, y_pred, labels=None,
                            average=None, balance=0.5):
    """Compute the balanced accuracy

    The balanced accuracy is used in binary classification problems to deal
    with imbalanced datasets. It can also be extended to multilabel problems.

    It is defined as the weighted arithmetic mean of sensitivity
    (true positive rate, TPR) and specificity (true negative rate, TNR), or
    the weighted average recall obtained on either class:

    balanced accuracy = balance * TPR + (1 - balance) * TNR

    It is also equal to the ROC AUC score for binary inputs when balance is 0.5.

    The best value is 1 and the worst value is 0.

    Note: this implementation is restricted to binary classification tasks
    or multilabel tasks in label indicator format.

    Read more in the :ref:`User Guide <balanced_accuracy_score>`.

    Parameters
    ----------
    y_true : 1d array-like, or label indicator array / sparse matrix
        Ground truth (correct) target values.

    y_pred : 1d array-like, or label indicator array / sparse matrix
        Estimated targets as returned by a classifier.

    labels : list, optional
        The set of labels to include for multilabel problems, and their
        order if ``average is None``. For multilabel targets,
        labels are column indices. By default, all labels in ``y_true`` and
        ``y_pred`` are used in sorted order.

    average : string, [None (default), 'micro', 'macro', 'samples']
        If ``None``, the scores for each class are returned. Otherwise,
        this determines the type of averaging performed on the data:

        ``'micro'``:
            Calculate metrics globally by considering each element of the label
            indicator matrix as a label.
        ``'macro'``:
            Calculate metrics for each label, and find their unweighted
            mean.  This does not take label imbalance into account.
        ``'samples'``:
            Calculate metrics for each instance, and find their average
            (only meaningful for multilabel classification).

    balance : float between 0 and 1, optional (default=0.5)
        Weight given to sensitivity (recall) relative to specificity in the
        final score.

    Returns
    -------
    balanced_accuracy : float, or array of float if ``average is None``
        The weighted average of sensitivity and specificity.

    See also
    --------
    recall_score, roc_auc_score

    References
    ----------
    .. [1] Brodersen, K.H.; Ong, C.S.; Stephan, K.E.; Buhmann, J.M. (2010).
           The balanced accuracy and its posterior distribution.
           Proceedings of the 20th International Conference on Pattern Recognition,
           3121-24.

    Examples
    --------
    >>> from sklearn.metrics import balanced_accuracy_score
    >>> y_true = [0, 1, 0, 0, 1, 0]
    >>> y_pred = [0, 1, 0, 0, 0, 1]
    >>> balanced_accuracy_score(y_true, y_pred)
    0.625
    >>> import numpy as np
    >>> y_true = np.array([[1, 0], [1, 0], [0, 1]])
    >>> y_pred = np.array([[1, 1], [0, 1], [1, 1]])
    >>> balanced_accuracy_score(y_true, y_pred, average=None)
    array([ 0.25,  0.5 ])

    """
    # TODO: handle sparse input in multilabel setting
    # TODO: ensure `sample_weight`'s shape is consistent with `y_true` and `y_pred`
    # TODO: handle situations where only one class is present in `y_true`
    # TODO: fully support a user-supplied `labels` argument
    average_options = (None, 'micro', 'macro', 'samples')
    if average not in average_options:
        raise ValueError('average has to be one of ' +
                         str(average_options))

    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    present_labels = unique_labels(y_true, y_pred)

    if y_type == 'multiclass':
        raise ValueError('Balanced accuracy is only meaningful '
                         'for binary classification or '
                         'multilabel problems.')
    if labels is None:
        labels = present_labels
        n_labels = None
    else:
        n_labels = len(labels)
        labels = np.hstack([labels, np.setdiff1d(present_labels, labels,
                                                 assume_unique=True)])

    # Calculate tp_sum and true_sum for the positive and negative classes ###

    if y_type.startswith('multilabel'):

        sum_axis = 1 if average == 'samples' else 0

        # All labels are index integers for multilabel.
        # Select labels:
        if not np.all(labels == present_labels):
            if np.max(labels) > np.max(present_labels):
                raise ValueError('All labels must be in [0, n labels). '
                                 'Got %d > %d' %
                                 (np.max(labels), np.max(present_labels)))
            if np.min(labels) < 0:
                raise ValueError('All labels must be in [0, n labels). '
                                 'Got %d < 0' % np.min(labels))

            y_true = y_true[:, labels[:n_labels]]
            y_pred = y_pred[:, labels[:n_labels]]

        # indicator matrix for the negative (zero) class
        # TODO: Inefficient due to the generation of dense matrices.
        y_true_z = np.ones(y_true.shape)
        y_true_z[y_true.nonzero()] = 0
        y_true_z = csr_matrix(y_true_z)
        y_pred_z = np.ones(y_true.shape)
        y_pred_z[y_pred.nonzero()] = 0
        y_pred_z = csr_matrix(y_pred_z)

        # calculate counts for the positive class (sample_weight is still TODO)
        true_and_pred_p = y_true.multiply(y_pred)
        tp_sum_p = count_nonzero(true_and_pred_p, axis=sum_axis)
        true_sum_p = count_nonzero(y_true, axis=sum_axis)

        # calculate counts for the negative class
        true_and_pred_n = y_true_z.multiply(y_pred_z)
        tp_sum_n = count_nonzero(true_and_pred_n, axis=sum_axis)
        true_sum_n = count_nonzero(y_true_z, axis=sum_axis)

        # stack the counts: row 0 = positive class, row 1 = negative class
        tp_sum = np.vstack((tp_sum_p, tp_sum_n))
        true_sum = np.vstack((true_sum_p, true_sum_n))

        if average == 'micro':
            tp_sum = np.array([tp_sum.sum(axis=1)])
            true_sum = np.array([true_sum.sum(axis=1)])

    elif average == 'samples':
        raise ValueError("Sample-based balanced accuracy is "
                         "not meaningful outside multilabel "
                         "problems.")
    else:
        # binary classification case ##
        # warn only when the caller explicitly passed `labels`
        if n_labels is not None:
            warnings.warn("The `labels` argument will be ignored "
                          "in binary classification problems.")

        le = LabelEncoder()
        le.fit(labels)
        y_true = le.transform(y_true)
        y_pred = le.transform(y_pred)

        # labels are now either 0 or 1 -> use bincount
        tp = y_true == y_pred
        tp_bins = y_true[tp]
        tp_bins_weights = None

        if len(tp_bins):
            tp_sum = bincount(tp_bins, weights=tp_bins_weights,
                              minlength=2)
        else:
            # Pathological case
            true_sum = tp_sum = np.zeros(2)
        if len(y_true):
            true_sum = bincount(y_true, minlength=2)

    # Finally, we have all our sufficient statistics. Divide! #

    with np.errstate(divide='ignore', invalid='ignore'):
        # Divide, and on zero-division, set scores to 0 and warn:

        # Oddly, we may get an "invalid" rather than a "divide" error
        # here.
        recalls = _prf_divide(tp_sum, true_sum,
                              'recall', 'true', average, ('recall',))
        # NOTE: the `balance` weight is not applied here yet; this takes the
        # unweighted mean of the two per-class recalls (i.e. balance=0.5).
        bacs = np.average(recalls, axis=0)

    # Average the results
    if average is not None:
        bacs = np.average(bacs)

    return bacs
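
As an aside on the "space-efficient way" mentioned above: the negative-class counts can be derived from the positive-class counts and the matrix shape alone, without building dense matrices. A rough sketch of that idea (illustrative only, not this PR's code), reproducing the multilabel docstring example:

# Sketch: negative-class counts from a sparse label indicator matrix,
# without densifying it (entries are assumed to be 0/1).
import numpy as np
from scipy.sparse import csr_matrix

y_true = csr_matrix(np.array([[1, 0], [1, 0], [0, 1]]))
y_pred = csr_matrix(np.array([[1, 1], [0, 1], [1, 1]]))
n_samples = y_true.shape[0]

true_sum_p = np.asarray(y_true.sum(axis=0)).ravel()                 # positives per label
pred_sum_p = np.asarray(y_pred.sum(axis=0)).ravel()
tp_sum_p = np.asarray(y_true.multiply(y_pred).sum(axis=0)).ravel()  # true positives

# Negative-class counts follow from the positive ones and the shape:
true_sum_n = n_samples - true_sum_p                                  # negatives per label
tp_sum_n = n_samples - (true_sum_p + pred_sum_p - tp_sum_p)          # true negatives

recall_p = tp_sum_p / true_sum_p
recall_n = tp_sum_n / true_sum_n
print((recall_p + recall_n) / 2)    # per-label balanced accuracy: [ 0.25  0.5 ]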

@jnothman (Member) commented Jan 7, 2017

I am okay with simply supporting binary problems. If the multiclass formulation is standard (there are many multiclass ROC formulations), then supporting that makes sense too.

@dalmia (Contributor, Author) commented Jan 8, 2017

I can't claim that the multiclass formulation is standard, but I suggested the formulation above based on this. Please let me know what you think.

@dalmia (Contributor, Author) commented Jan 10, 2017

Do we have an opinion on this?


A reviewer (Member) commented on these docstring lines:

    References
    ----------
    .. [1] Brodersen, K.H.; Ong, C.S.; Stephan, K.E.; Buhmann, J.M. (2010).

This paper only treats the binary case and it's not clear to me that it does the same thing as this code. We need more references.

A follow-up comment (Member):

Oh wait, this PR is only for the binary case? hm...

@amueller (Member) commented:

maybe call this metric binary_balanced_accuracy?

@jnothman (Member) commented Jul 22, 2017 via email

@amueller (Member) commented:

Well there are several extensions. I'd say we call this binary_balanced_accuracy and just do that case for now.

@amueller (Member) commented Sep 6, 2017

@jnothman so do you think it should be called balanced_accuracy and just implement the binary case? I'm also fine with that. I thought binary_balanced_accuracy might be more explicit but might also be redundant.

@jnothman (Member) commented Sep 6, 2017 via email

@maskani-moh (Contributor) commented:

@amueller @jnothman

What's actually left to do in this PR?
It seems like you've agreed to leave the multiclass implementation for later.

Shall I then just change the function name from balanced_accuracy to binary_balanced_accuracy?

@jnothman (Member) commented Oct 9, 2017

Yes, I think this PR is good. I don't know why it's labelled WIP. LGTM.

(I know we could support multilabel balanced accuracy; we can't do that just by wrapping recall_score, but we could with roc_auc_score. Still, I think we should just get this into the library; a sketch of the roc_auc_score route follows.)
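
A rough illustration of the roc_auc_score point (a sketch, not what this PR implements): for hard 0/1 predictions the per-label AUC reduces to the per-label balanced accuracy, provided both classes are present in each label column.

# Sketch: multilabel "balanced accuracy" via roc_auc_score, valid only
# for hard 0/1 predictions with both classes present per label.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([[1, 0], [1, 0], [0, 1]])
y_pred = np.array([[1, 1], [0, 1], [1, 1]])

print(roc_auc_score(y_true, y_pred, average=None))       # [ 0.25  0.5 ] per label
print(roc_auc_score(y_true, y_pred, average='macro'))    # 0.375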

@jnothman changed the title from "[WIP] Adding support for balanced accuracy" to "[MRG+1] Adding support for balanced accuracy" on Oct 9, 2017
@jnothman (Member) commented Oct 9, 2017

@amueller, let's do this?

@amueller (Member) commented:

@jnothman yes. Sorry for the slow reply. Not handling the teacher life well.

@amueller merged commit 8daad06 into scikit-learn:master on Oct 17, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
* add function computing balanced accuracy

* documentation for the balanced_accuracy_score

* apply common tests to balanced_accuracy_score

* constrained to binary classification problems only

* add balanced_accuracy_score for CLF test

* add scorer for balanced_accuracy

* reorder the place of importing balanced_accuracy_score to be consistent with others

* eliminate an accidentally added non-ascii character

* remove balanced_accuracy_score from METRICS_WITH_LABELS

* eliminate all non-ascii charaters in the doc of balanced_accuracy_score

* fix doctest for nonexistent scoring function

* fix documentation, clarify linkages to recall and auc

* FIX: added changes as per last review See scikit-learn#6752, fixes scikit-learn#6747

* FIX: fix typo

* FIX: remove flake8 errors

* DOC: merge fixes

* DOC: remove unwanted files

* DOC update what's new
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017