
FIX helper to check multilabel types #1985


Closed
jnothman wants to merge 39 commits

Conversation

jnothman (Member)

Fixes bugs caused by naively checking type(..) equality, and refactors.

@arjoly does this seem reasonable?
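
For context, the generic failure mode of strict type(..) comparison (an illustrative sketch; the actual call sites aren't shown in this thread):

import numpy as np

# two equivalent label-indicator inputs...
a = np.array([[0, 1], [1, 0]])
b = np.asmatrix(a)                 # an ndarray subclass

# ...that a strict type(..) comparison treats as different kinds
print(type(a) == type(b))          # False
print(isinstance(b, np.ndarray))   # True: isinstance respects subclasses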

@jnothman (Member Author)

I'm not sure about all the terminology and codes returned.

@arjoly (Member) commented May 22, 2013

Can you have a look at #1643?

I would use something like _get_label_type and separate the datatype identification from the check.
Your check_multilabel_type doesn't follow the same behaviour as check_arrays, which is confusing.

@jnothman (Member Author)

I don't mind removing the size checks, and I'm happy to change check_ to get_. However, I've now realised that this is in fact just an extension of is_multilabel: anything for which is_multilabel returns True will return non-False here. So I say we rename check_multilabel_type -> is_multilabel -> _is_multilabel. The name then being clearer, I will remove the shape checks...

@jnothman (Member Author)

And after looking at #1643, I'll adopt 'sequences' and 'indicator' instead of 'matrix' and 'sos'.
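
For reference, the two representations in question, shown with the same two-sample, three-label target (a quick sketch):

import numpy as np

# 'sequences' (the old "sos"): each sample lists its own labels
y_sequences = [[0], [0, 2]]

# 'indicator' (the old "matrix"): one column per label, 1 where present
y_indicator = np.array([[1, 0, 0],
                        [1, 0, 1]])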

@jnothman (Member Author)

Please check this again @arjoly, and @amueller given your earlier implementation of something similar.

@GaelVaroquaux (Member)

> Also, @arjoly, can we not as a rule assume that a label indicator matrix consists of 0s and 1s or Falses and Trues? Do we need to worry about pos_label?

I am thinking the same: it's a source of confusion and makes code more complex.

@arjoly (Member) commented May 23, 2013

> Also, @arjoly, can we not as a rule assume that a label indicator matrix consists of 0s and 1s or Falses and Trues? Do we need to worry about pos_label?
> I am thinking the same: it's a source of confusion and makes code more complex.

Yes, this can be removed.

'Received a mix of single-label and multi-label data')


def is_multilabel(*ys):
Member

Can you get rid of the magic *?
This is not necessary here.

This function doesn't return a boolean, so you could rename this function get_something, e.g. get_label_type or get_classification_type?

Member

> Can you get rid of the magic *?
> This is not necessary here.

What I mean is that if you have a get_classification_type function, you can easily compare the format of y_pred and y_true without using the magic *:

>>> ind = np.array([[0, 1], [1, 0]])
>>> seq = [[0], [0, 2]]
>>> get_classification_type(ind)
'multilabel-indicator-matrix'
>>> get_classification_type(seq)
'multilabel-sequence-of-sequence'
>>> get_classification_type(ind) == get_classification_type(seq)
False

Member

> This function doesn't return a boolean, so you could rename this function get_something, e.g. get_label_type or get_classification_type?

With the given name, you sometimes return a string and sometimes a boolean. That's counter-intuitive.

@arjoly (Member) commented May 23, 2013

Could you add tests in sklearn/utils/tests/test_multiclass.py? We do not really use docstrings for testing.

@arjoly (Member) commented May 23, 2013

Since you want to add support for lists of lists of strings, I think you should extend the invariance tests for multilabel data formats in test_metrics.py.

@jnothman (Member Author)

I did support binary and multiclass, just not in the case of multiple arguments. Your comments are fair. I think get_classification_type (or type_of_targets) is a great idea as a general utility.

I think the returned strings should be:

  • 'multilabel-indicator'
  • 'multilabel-sequences'
  • 'multiclass'
  • 'binary'? I'm not sure we want to check for this as it involves looking through all the labels.
  • 'regression'? but if we're going to look through all the labels, we might as well check for non-discrete labels?

> Could you add tests in sklearn/utils/tests/test_multiclass.py? We do not really use docstrings for testing.

I thought this might be okay in utils, but can change it.

> Since you want to add support for lists of lists of strings, I think you should extend the invariance tests for multilabel data formats in test_metrics.py.

List of list of string was already supported, effectively (it passed is_multilabel, and you map labels to integer indices in your _tp_tn_fp_fn). I'm not sure it's necessary to support, but if we support lists of strings for multiclass, then why not?
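
Roughly, a hypothetical illustration of that mapping (none of these names are sklearn helpers):

# lists of lists of strings reduce to the integer case once labels are
# mapped through a label -> index lookup, as _tp_tn_fp_fn effectively does
y = [['cat'], ['cat', 'dog']]
labels = sorted({label for row in y for label in row})
label_to_index = {label: i for i, label in enumerate(labels)}

y_indexed = [[label_to_index[label] for label in row] for row in y]
# y_indexed == [[0], [0, 1]]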

@jnothman (Member Author)

maximally:

import numpy as np

def type_of_target(y):
    # is_sequence_of_sequences / is_label_indicator_matrix are the
    # multilabel helpers discussed in this PR
    if is_sequence_of_sequences(y):
        return 'multilabel-sequences'
    elif is_label_indicator_matrix(y):
        return 'multilabel-indicator'
    y = np.asarray(y)
    if issubclass(y.dtype.type, np.float) and np.any(y != y.astype(int)):
        return 'regression'  # or 'continuous'? 'real'? delete this case?
    if len(np.unique(y)) == 2:
        return 'binary'
    else:
        return 'multiclass'
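
For concreteness, the intended classifications (the last two lines assume the is_sequence_of_sequences / is_label_indicator_matrix helpers are in scope):

import numpy as np

type_of_target(np.array([0, 1, 1, 0]))       # 'binary'
type_of_target(np.array([0, 1, 2, 2]))       # 'multiclass'
type_of_target(np.array([0.5, 1.2]))         # 'regression'
type_of_target([[0], [0, 2]])                # 'multilabel-sequences'
type_of_target(np.array([[0, 1], [1, 0]]))   # 'multilabel-indicator'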

@jnothman (Member Author)

And maybe something like this would be useful to help metrics pass invariance tests (this is untested):

def _check_clf_targets(y_true, y_pred, encode=False, ravel=True):
    # relies on check_arrays, type_of_target, unique_labels and
    # _check_1d_array, plus the LabelBinarizer / LabelEncoder
    # preprocessing classes
    y_true, y_pred = check_arrays(y_true, y_pred, allow_lists=True)
    type_true = type_of_target(y_true)
    type_pred = type_of_target(y_pred)
    labels = unique_labels(y_true, y_pred)

    if type_true.startswith('multilabel'):
        if not type_pred.startswith('multilabel'):
            raise ValueError("Can't handle mix of multilabel and multiclass targets")

        if type_true != type_pred:
            enc = LabelBinarizer()
            enc.fit([labels.tolist()])
            y_true = enc.transform(y_true)
            y_pred = enc.transform(y_pred)
            type_true = type_pred = 'multilabel-indicator'

    elif type_pred.startswith('multilabel'):
        raise ValueError("Can't handle mix of multilabel and multiclass targets")

    elif 'regression' in (type_true, type_pred):
        raise ValueError("Can't handle continuous targets")

    else:
        if 'multiclass' in (type_true, type_pred):
            # 'binary' can be removed
            type_true = type_pred = 'multiclass'

        y_true, y_pred = _check_1d_array(y_true, y_pred, ravel=ravel)

    if encode and type_true != 'multilabel-indicator':
        enc = LabelEncoder()
        enc.fit([labels.tolist()] if type_true.startswith('multilabel')
                else labels.tolist())
        y_true = enc.transform(y_true)
        y_pred = enc.transform(y_pred)

    return labels, type_true, y_true, y_pred
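
The key step is aligning the two multilabel formats. As a standalone sketch of that step (using today's MultiLabelBinarizer in place of the then-current LabelBinarizer sequences API):

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# one target in sequences form, the other already an indicator matrix
y_true = [[0], [0, 2]]                      # multilabel-sequences
y_pred = np.array([[1, 0, 0], [1, 0, 1]])   # multilabel-indicator

# binarize the sequences so both targets share the indicator format
mlb = MultiLabelBinarizer(classes=[0, 1, 2])
y_true_ind = mlb.fit_transform(y_true)
# y_true_ind == [[1, 0, 0],
#                [1, 0, 1]]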

@arjoly (Member) commented May 24, 2013

I like your type_of_target and _check_clf_targets functions.

Will it simplify anything to support the binary case or regression case?

@jnothman (Member Author)

Well, precision_recall_fscore_support and LabelBinarizer both identify the binary case in contrast with multilabel. So it avoids the condition not multilabel and len(labels) == 2, and hence the bug introduced if len(labels) == 2 is used alone.

Regression? (Or "continuous"... or "real-valued".) I can't think of anything, except to produce a ValueError for categorical metrics if the output of decision_function or a regressor is provided.

@jnothman (Member Author)

Okay, have another look.

@@ -149,6 +149,95 @@ def _check_1d_array(y1, y2, ravel=False):
return y1, y2


def _check_clf_targets(y_true, y_pred, encode=False, ravel=False, labels=None):
"""Check that y1 and y2 correspond to the same classification task type.
Member
y1 => y_true ?
y2 => y_pred ?

@arjoly (Member) commented May 26, 2013

You can add your name in the authors of sklearn/utils/multiclass.py.


encode : boolean, optional (default=False),
If ``encode`` is set to ``True``, then non-multilabel ``y_true`` and
``y_pred`` are encoded as integers from 0 to ``len(labels)``.
Member
Is it necessary to have this option at the moment?
Is it used in the codebase?

@jnothman (Member Author) commented Jun 8, 2013

Rebased on master.

@jnothman (Member Author) commented Jun 8, 2013

In terms of performance, I guess the slow checks in type_of_target are np.unique and y != y.astype(int). And by ignoring the continuous case, we can avoid the latter... Not really sure if there's any nice way around this, though.
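
The two checks are easy to time in isolation, e.g. (a rough sketch; numbers vary by machine):

import timeit
import numpy as np

y = np.random.randint(0, 10, size=1000000)

# the two potentially slow checks in type_of_target
print(timeit.timeit(lambda: np.unique(y), number=10))
print(timeit.timeit(lambda: np.any(y != y.astype(int)), number=10))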

PRF benchmark (10 classes; multiclass len(y)=1e6, multilabel list of lists 1e3, multilabel label indicator 1e6), averaged over 50 trials (on a not-very-fast laptop; script @ https://gist.github.com/jnothman/5734967):

  • master: 0.854, 0.573, 0.000979
  • multilabel_type: 1.12, 0.566, 0.00123
  • prf_rewrite: 0.658, 0.721, 0.00059
  • d33634d^ (before multilabel support): 0.947, N/A, N/A

These results make me wonder how long binarizing takes for multiclass data and whether it's worth doing in general, and the answer seems to be "almost certainly":

In [6]: %timeit sklearn.preprocessing.LabelBinarizer().fit_transform(y)
1 loops, best of 3: 421 ms per loop
In [7]: y.shape
Out[7]: (1000000,)

@arjoly might be interested...

@arjoly (Member) commented Jun 8, 2013

Thanks for the benchmark!

@jnothman (Member Author) commented Jun 8, 2013

I take back "almost certainly": of course, LabelBinarizer needs to transform two arrays of that size, so it adds up to a bit more; and it blows out as we increase the number of labels:

  n_classes   direct   binarized
  10          0.564    2.79
  100         0.76     26.7

That's multiclass 1e6 samples.

What about sequences of sequences 1e3 samples?

  n_classes   direct   binarized
  10          0.694    0.0288
  100         0.849    0.293
  1000        2.65     2.85

Well, that's a nice reduction. Perhaps we should always binarize sequences of sequences in PRF.
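
For instance (a sketch using today's MultiLabelBinarizer and metrics API, not the code in this PR):

from sklearn.metrics import precision_recall_fscore_support
from sklearn.preprocessing import MultiLabelBinarizer

y_true = [[0], [0, 2], [1, 2]]
y_pred = [[0, 1], [0], [2]]

# binarize once up front, then score in the (faster, per the numbers
# above) indicator representation; support is None under averaging
mlb = MultiLabelBinarizer(classes=[0, 1, 2])
p, r, f, _ = precision_recall_fscore_support(
    mlb.fit_transform(y_true), mlb.transform(y_pred), average='macro')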

y.shape[0] > 1 and np.size(np.unique(y)) <= 2)
return (hasattr(y, "shape") and y.ndim == 2 and y.shape[1] > 1 and
np.size(np.unique(y)) <= 2 and
issubclass(y.dtype.type, (np.int, np.bool_, np.float)) and
Member Author
apparently this fails for np.int8, etc. Fix and tests coming soon.

Member Author
Fixed in 1bead14.
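
The commit isn't shown here, but a dtype-kind check along these lines handles the narrow integer types (a hypothetical sketch, not necessarily what 1bead14 does):

import numpy as np

# issubclass(y.dtype.type, (np.int, np.bool_, np.float)) misses np.int8
# etc., because np.int8 is not a subclass of np.int (alias of builtin int).
# Checking the dtype kind covers every bool / signed-int / float width:
def _dtype_ok(y):
    return y.dtype.kind in 'bif'  # add 'u' to accept unsigned ints too

_dtype_ok(np.array([[0, 1]], dtype=np.int8))  # True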

@jnothman (Member Author) commented Jun 8, 2013

I should note that much of that binarizing time is taken up by finding the unique labels in fit. Note that unique_labels is being rewritten (#1985), and yet if the scorer could receive a labels parameter, this could be avoided...

@ogrisel (Member) commented Jun 8, 2013

Thanks @jnothman for the rebase and the benchmarks. I think we are good to merge. 👍 on my side.

@jnothman (Member Author) commented Jun 9, 2013

Well, what've we got to lose? Merged as fd3bd2d.

jnothman closed this Jun 9, 2013
@jnothman (Member Author) commented Jun 9, 2013

@arjoly: let the rebasing begin!

@arjoly (Member) commented Jun 22, 2013

thanks a lot @jnothman !!! :-)
