
FIX helper to check multilabel types #1985


Closed
jnothman wants to merge 39 commits

Conversation

jnothman (Member)

Fixes bugs caused by naively checking type(..) equality, and refactors.

@arjoly does this seem reasonable?
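
For context, the generic failure mode of strict type(..) comparison (an illustrative sketch; the actual call sites aren't shown in this thread):

import numpy as np

# two equivalent label-indicator inputs...
a = np.array([[0, 1], [1, 0]])
b = np.asmatrix(a)                 # an ndarray subclass

# ...that a strict type(..) comparison treats as different kinds
print(type(a) == type(b))          # False
print(isinstance(b, np.ndarray))   # True: isinstance respects subclasses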

@jnothman (Member Author)

I'm not sure about all the terminology and codes returned.

@arjoly (Member) commented May 22, 2013

Can you have a look at #1643?

I would use something like _get_label_type and separate the datatype identification from the check.
Your check_multilabel_type doesn't follow the same behaviour as check_arrays, which is confusing.

@jnothman (Member Author)

I don't mind removing the size checks, and I'm happy to change check_ to get_. However, I've now realised that this is in fact just an extension of is_multilabel: anything for which is_multilabel returns True will return non-False here. So I say we rename check_multilabel_type -> is_multilabel -> _is_multilabel. The name then being clearer, I will remove the shape checks...

@jnothman (Member Author)

And after looking at #1643, I'll adopt 'sequences' and 'indicator' instead of 'matrix' and 'sos'.
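
For reference, the two representations in question, shown with the same two-sample, three-label target (a quick sketch):

import numpy as np

# 'sequences' (the old "sos"): each sample lists its own labels
y_sequences = [[0], [0, 2]]

# 'indicator' (the old "matrix"): one column per label, 1 where present
y_indicator = np.array([[1, 0, 0],
                        [1, 0, 1]])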

@jnothman (Member Author)

Please check this again @arjoly, and @amueller given your earlier implementation of something similar.

@GaelVaroquaux (Member)

> Also, @arjoly, can we not as a rule assume that a label indicator matrix consists of 0s and 1s or Falses and Trues? Do we need to worry about pos_label?

I am thinking the same: it's a source of confusion and makes code more complex.

@arjoly (Member) commented May 23, 2013

> Also, @arjoly, can we not as a rule assume that a label indicator matrix consists of 0s and 1s or Falses and Trues? Do we need to worry about pos_label?
> I am thinking the same: it's a source of confusion and makes code more complex.

Yes, this can be removed.

'Received a mix of single-label and multi-label data')


def is_multilabel(*ys):
Member

Can you get rid of the magic *?
This is not necessary here.

This function doesn't return a boolean, so you could rename this function get_something, e.g. get_label_type or get_classification_type?

Member

> Can you get rid of the magic *?
> This is not necessary here.

What I mean is that if you have a get_classification_type function, you can easily compare the format of y_pred and y_true without using the magic *:

>>> ind = np.array([[0, 1], [1, 0]])
>>> seq = [[0], [0, 2]]
>>> get_classification_type(ind)
'multilabel-indicator-matrix'
>>> get_classification_type(seq)
'multilabel-sequence-of-sequence'
>>> get_classification_type(ind) == get_classification_type(seq)
False

Member

> This function doesn't return a boolean, so you could rename this function get_something, e.g. get_label_type or get_classification_type?

With the given name, you sometimes return a string and sometimes a boolean. That's counter-intuitive.

@arjoly (Member) commented May 23, 2013

Could you add tests in sklearn/utils/tests/test_multiclass.py? We do not really use docstrings for testing.

@arjoly (Member) commented May 23, 2013

Since you want to add support for lists of lists of strings, I think you should extend the invariance tests for multilabel data formats in test_metrics.py.

@jnothman (Member Author)

I did support binary and multiclass, just not in the case of multiple arguments. Your comments are fair. I think get_classification_type (or type_of_targets) is a great idea as a general utility.

I think the returned strings should be:

  • 'multilabel-indicator'
  • 'multilabel-sequences'
  • 'multiclass'
  • 'binary'? I'm not sure we want to check for this as it involves looking through all the labels.
  • 'regression'? but if we're going to look through all the labels, we might as well check for non-discrete labels?

> Could you add tests in sklearn/utils/tests/test_multiclass.py? We do not really use docstrings for testing.

I thought this might be okay in utils, but can change it.

> Since you want to add support for lists of lists of strings, I think you should extend the invariance tests for multilabel data formats in test_metrics.py.

List of list of string was already supported, effectively (it passed is_multilabel, and you map labels to integer indices in your _tp_tn_fp_fn). I'm not sure it's necessary to support, but if we support lists of strings for multiclass, then why not?
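
Roughly, a hypothetical illustration of that mapping (none of these names are sklearn helpers):

# lists of lists of strings reduce to the integer case once labels are
# mapped through a label -> index lookup, as _tp_tn_fp_fn effectively does
y = [['cat'], ['cat', 'dog']]
labels = sorted({label for row in y for label in row})
label_to_index = {label: i for i, label in enumerate(labels)}

y_indexed = [[label_to_index[label] for label in row] for row in y]
# y_indexed == [[0], [0, 1]]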

@jnothman (Member Author)

maximally:

import numpy as np

def type_of_target(y):
    # is_sequence_of_sequences / is_label_indicator_matrix are the
    # multilabel helpers discussed in this PR
    if is_sequence_of_sequences(y):
        return 'multilabel-sequences'
    elif is_label_indicator_matrix(y):
        return 'multilabel-indicator'
    y = np.asarray(y)
    if issubclass(y.dtype.type, np.float) and np.any(y != y.astype(int)):
        return 'regression'  # or 'continuous'? 'real'? delete this case?
    if len(np.unique(y)) == 2:
        return 'binary'
    else:
        return 'multiclass'
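
For concreteness, the intended classifications (the last two lines assume the is_sequence_of_sequences / is_label_indicator_matrix helpers are in scope):

import numpy as np

type_of_target(np.array([0, 1, 1, 0]))       # 'binary'
type_of_target(np.array([0, 1, 2, 2]))       # 'multiclass'
type_of_target(np.array([0.5, 1.2]))         # 'regression'
type_of_target([[0], [0, 2]])                # 'multilabel-sequences'
type_of_target(np.array([[0, 1], [1, 0]]))   # 'multilabel-indicator'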

@jnothman (Member Author)

And maybe something like this would be useful to help metrics pass invariance tests (this is untested):

def _check_clf_targets(y_true, y_pred, encode=False, ravel=True):
    # relies on check_arrays, type_of_target, unique_labels and
    # _check_1d_array, plus the LabelBinarizer / LabelEncoder
    # preprocessing classes
    y_true, y_pred = check_arrays(y_true, y_pred, allow_lists=True)
    type_true = type_of_target(y_true)
    type_pred = type_of_target(y_pred)
    labels = unique_labels(y_true, y_pred)

    if type_true.startswith('multilabel'):
        if not type_pred.startswith('multilabel'):
            raise ValueError("Can't handle mix of multilabel and multiclass targets")

        if type_true != type_pred:
            enc = LabelBinarizer()
            enc.fit([labels.tolist()])
            y_true = enc.transform(y_true)
            y_pred = enc.transform(y_pred)
            type_true = type_pred = 'multilabel-indicator'

    elif type_pred.startswith('multilabel'):
        raise ValueError("Can't handle mix of multilabel and multiclass targets")

    elif 'regression' in (type_true, type_pred):
        raise ValueError("Can't handle continuous targets")

    else:
        if 'multiclass' in (type_true, type_pred):
            # 'binary' can be removed
            type_true = type_pred = 'multiclass'

        y_true, y_pred = _check_1d_array(y_true, y_pred, ravel=ravel)

    if encode and type_true != 'multilabel-indicator':
        enc = LabelEncoder()
        enc.fit([labels.tolist()] if type_true.startswith('multilabel')
                else labels.tolist())
        y_true = enc.transform(y_true)
        y_pred = enc.transform(y_pred)

    return labels, type_true, y_true, y_pred
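
The key step is aligning the two multilabel formats. As a standalone sketch of that step (using today's MultiLabelBinarizer in place of the then-current LabelBinarizer sequences API):

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# one target in sequences form, the other already an indicator matrix
y_true = [[0], [0, 2]]                      # multilabel-sequences
y_pred = np.array([[1, 0, 0], [1, 0, 1]])   # multilabel-indicator

# binarize the sequences so both targets share the indicator format
mlb = MultiLabelBinarizer(classes=[0, 1, 2])
y_true_ind = mlb.fit_transform(y_true)
# y_true_ind == [[1, 0, 0],
#                [1, 0, 1]]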

@arjoly (Member) commented May 24, 2013

I like your type_of_target and _check_clf_targets functions.

Will it simplify anything to support the binary case or regression case?

@jnothman (Member Author)

Well, precision_recall_fscore_support and LabelBinarizer both identify the binary case in contrast with multilabel. So it avoids the condition not multilabel and len(labels) == 2, and hence the bug introduced if len(labels) == 2 is used alone.

Regression? (Or "continuous"... or "real-valued".) I can't think of anything, except to produce a ValueError for categorical metrics if the output of decision_function or a regressor is provided.

@jnothman (Member Author)

Okay, have another look.

@@ -149,6 +149,95 @@ def _check_1d_array(y1, y2, ravel=False):
return y1, y2


def _check_clf_targets(y_true, y_pred, encode=False, ravel=False, labels=None):
"""Check that y1 and y2 correspond to the same classification task type.
Member
y1 => y_true ?
y2 => y_pred ?

@arjoly (Member) commented May 26, 2013

You can add your name in the authors of sklearn/utils/multiclass.py.


encode : boolean, optional (default=False),
If ``encode`` is set to ``True``, then non-multilabel ``y_true`` and
``y_pred`` are encoded as integers from 0 to ``len(labels)``.
Member
Is it necessary to have this option at the moment?
Is it used in the codebase?

@jnothman (Member Author) commented Jun 8, 2013

Rebased on master.

@jnothman (Member Author) commented Jun 8, 2013

In terms of performance, I guess the slow checks in type_of_target are np.unique and y != y.astype(int). And by ignoring the continuous case, we can avoid the latter... Not really sure if there's any nice way around this, though.
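
The two checks are easy to time in isolation, e.g. (a rough sketch; numbers vary by machine):

import timeit
import numpy as np

y = np.random.randint(0, 10, size=1000000)

# the two potentially slow checks in type_of_target
print(timeit.timeit(lambda: np.unique(y), number=10))
print(timeit.timeit(lambda: np.any(y != y.astype(int)), number=10))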

PRF benchmark (10 classes; multiclass len(y)=1e6, multilabel list of lists 1e3, multilabel label indicator 1e6), averaged over 50 trials (on a not-very-fast laptop; script @ https://gist.github.com/jnothman/5734967):

  • master: 0.854, 0.573, 0.000979
  • multilabel_type: 1.12, 0.566, 0.00123
  • prf_rewrite: 0.658, 0.721, 0.00059
  • d33634d^ (before multilabel support): 0.947, N/A, N/A

These results make me wonder how long binarizing takes for multiclass data and whether it's worth doing in general, and the answer seems to be "almost certainly":

In [6]: %timeit sklearn.preprocessing.LabelBinarizer().fit_transform(y)
1 loops, best of 3: 421 ms per loop
In [7]: y.shape
Out[7]: (1000000,)

@arjoly might be interested...

@arjoly (Member) commented Jun 8, 2013

Thanks for the benchmark!

@jnothman (Member Author) commented Jun 8, 2013

I take back "almost certainly": of course, LabelBinarizer needs to transform two arrays of that size, so it adds up to a bit more; and it blows out as we increase the number of labels:

  n_classes   direct   binarized
  10          0.564    2.79
  100         0.76     26.7

That's multiclass 1e6 samples.

What about sequences of sequences 1e3 samples?

  n_classes   direct   binarized
  10          0.694    0.0288
  100         0.849    0.293
  1000        2.65     2.85

Well, that's a nice reduction. Perhaps we should always binarize sequences of sequences in PRF.
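
For instance (a sketch using today's MultiLabelBinarizer and metrics API, not the code in this PR):

from sklearn.metrics import precision_recall_fscore_support
from sklearn.preprocessing import MultiLabelBinarizer

y_true = [[0], [0, 2], [1, 2]]
y_pred = [[0, 1], [0], [2]]

# binarize once up front, then score in the (faster, per the numbers
# above) indicator representation; support is None under averaging
mlb = MultiLabelBinarizer(classes=[0, 1, 2])
p, r, f, _ = precision_recall_fscore_support(
    mlb.fit_transform(y_true), mlb.transform(y_pred), average='macro')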

y.shape[0] > 1 and np.size(np.unique(y)) <= 2)
return (hasattr(y, "shape") and y.ndim == 2 and y.shape[1] > 1 and
np.size(np.unique(y)) <= 2 and
issubclass(y.dtype.type, (np.int, np.bool_, np.float)) and
Member Author
apparently this fails for np.int8, etc. Fix and tests coming soon.

Member Author
Fixed in 1bead14.
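
The commit isn't shown here, but a dtype-kind check along these lines handles the narrow integer types (a hypothetical sketch, not necessarily what 1bead14 does):

import numpy as np

# issubclass(y.dtype.type, (np.int, np.bool_, np.float)) misses np.int8
# etc., because np.int8 is not a subclass of np.int (alias of builtin int).
# Checking the dtype kind covers every bool / signed-int / float width:
def _dtype_ok(y):
    return y.dtype.kind in 'bif'  # add 'u' to accept unsigned ints too

_dtype_ok(np.array([[0, 1]], dtype=np.int8))  # True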

@jnothman (Member Author) commented Jun 8, 2013

I should note that much of that binarizing time is taken up by finding the unique labels in fit. Note that unique_labels is being rewritten (#1985), and yet if the scorer could receive a labels parameter, this could be avoided...

@ogrisel (Member) commented Jun 8, 2013

Thanks @jnothman for the rebase and the benchmarks. I think we are good to merge. 👍 on my side.

@jnothman (Member Author) commented Jun 9, 2013

Well, what've we got to lose? Merged as fd3bd2d.

jnothman closed this Jun 9, 2013
@jnothman (Member Author) commented Jun 9, 2013

@arjoly: let the rebasing begin!

@arjoly (Member) commented Jun 22, 2013

thanks a lot @jnothman !!! :-)
