[MRG] FIX corner cases with `unique_labels` #2015

arjoly · 2013-05-29T12:27:27Z

Bugs were discovered in unique_labels in #1985:

A mix of multilabel and multiclass formats doesn't raise any error.
A mix of input data types doesn't raise any error (e.g. [1, 2] and ["a"]).
A mix of indicator matrices with different numbers of labels doesn't raise any error.

~~In my opinion, the implementation could be greatly simplified and improved using type_of_target from #1985 (see https://gist.github.com/arjoly/5665632).~~

~~However, this PR is not merged yet.~~

Remaining bugs that won't be treated in this pr:

mix of multioutput multiclass format and multiclass / multilabel format
mix of unknown / continuous / continuous multi-output format and multiclass / multilabel format

jnothman · 2013-05-29T13:07:04Z

> mix of multioutput multiclass format and multiclass / multilabel format
> mix of unknown / continuous / continuous multi-output format and multiclass / multilabel format

Not sure these are bugs. Non-integral/string labels are maybe an issue, but I don't think these mixes are.

And I see that in testing this does overlap somewhat with #1985. There are some chicken-and-egg quandaries, but I guess we should merge in this order, assuming all are accepted: #2002, #1985, #2015, #1987, #1990...

I'll take a look at the content of this PR some time soon...

arjoly · 2013-05-29T13:14:37Z

> And I see that in testing this does overlap somewhat with #1985

You have done pretty good work on the test cases in #1985.

jnothman · 2013-05-30T07:39:55Z

sklearn/utils/multiclass.py

+    ys_is_multilabels = [is_multilabel(y) for y in ys]
+
+    if len(set(ys_is_multilabels)) != 1:
+        raise ValueError("Mix of binary / mutliclass and multilabel type")


I wonder if you're pre-empting the user of this function a bit too much. They haven't asked you to validate the types of data, they've asked you to find the unique labels... But whatever validation you do needs documenting.

In any case, there's something special about the label indicator matrix format in that its labels are implicit. There seems to me no reason to deny the user the unique labels of multiclass and seq-of-seq targets together. But I'm having another thought on this that I'll post directly on the issue.

Changed my mind.

jnothman · 2013-05-30T08:13:22Z

I think you can simplify it a bit, relax the conditions, and make it clearer at the same time. Roughly

labels = []
is_string = None
label_indicator_size = None
for y in ys:
    if is_lim(y):  # label indicator matrix
        if label_indicator_size is None:
            label_indicator_size = y.shape[1]
        else:
            assert label_indicator_size == y.shape[1]
        new_labels = ...
    elif is_seqs(y):  # sequence of sequences
        new_labels = ...
    else:
        new_labels = ...
    new_is_string = ...
    if is_string is None:
        is_string = new_is_string
    else:
        assert is_string == new_is_string
    labels.extend(new_labels)
return np.unique(labels)
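
For readers who want to run the idea, here is a minimal plain-Python variant of the rough outline above. It is only a sketch: `IndicatorMatrix` and `unique_labels_sketch` are hypothetical names, a small wrapper class stands in for a 2-D NumPy indicator array, and none of this is the code that ended up in the PR.

```python
class IndicatorMatrix:
    """Hypothetical stand-in for a 2-D label indicator array."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]
        self.n_labels = len(self.rows[0]) if self.rows else 0


def unique_labels_sketch(*ys):
    """Collect the labels of all targets in a single pass (illustrative only)."""
    labels = set()
    is_string = None             # set by the first non-empty target
    label_indicator_size = None  # width shared by all indicator matrices

    for y in ys:
        if isinstance(y, IndicatorMatrix):
            if label_indicator_size is None:
                label_indicator_size = y.n_labels
            elif label_indicator_size != y.n_labels:
                raise ValueError("Indicator matrices with different sizes")
            # Labels are implicit in this format: one per column.
            new_labels = set(range(y.n_labels))
        elif y and all(isinstance(row, (list, tuple)) for row in y):
            # Sequence of sequences: every row lists its labels explicitly.
            new_labels = {label for row in y for label in row}
        else:
            # Flat multiclass target.
            new_labels = set(y)

        kinds = {isinstance(label, str) for label in new_labels}
        if len(kinds) > 1:
            raise ValueError("Mix of string and number labels")
        if kinds:
            new_is_string = kinds.pop()
            if is_string is None:
                is_string = new_is_string
            elif is_string != new_is_string:
                raise ValueError("Mix of string and number labels")
        labels |= new_labels

    return sorted(labels)
```

As in the outline, this version still allows an indicator matrix next to explicit labels; it only checks indicator sizes and string/number consistency.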

jnothman · 2013-06-03T07:47:28Z

Hm. It's only marginally related to unique_labels, but do we ever have any use cases where a mix of sequences of sequences and label indicator matrices are likely to appear, e.g. as input to metrics? Or were you just handling them for completeness? I think such mixes should throw an error in all cases, because we can't be sure we're interpreting the labels correctly.

Basically, you could throw out anything that mixes label indicator matrices and labels.

arjoly · 2013-06-03T07:59:19Z

Agree, let's simplify everything. I will open a pr to remove this functionality in the metrics module.

jnothman · 2013-06-13T00:26:49Z

Please rebase on master to fix the failing test and incorporate type_of_target here.

glouppe · 2013-06-13T07:05:11Z

@jnothman I think Arnaud is on vacation for two weeks. So don't expect any immediate reply ;)

jnothman · 2013-06-13T07:09:14Z

Thanks for letting me know. A few issues have cascaded, and it's better to rebase on a familiar master than a stale one!

arjoly · 2013-06-22T22:01:34Z

I have rebased on top of master.

jnothman · 2013-06-22T23:56:22Z

sklearn/utils/tests/test_multiclass.py

+        assert_raises(ValueError, unique_labels, y_multilabel, y_multiclass)
+
+    # Mix input type
+    assert_raises(ValueError, unique_labels, [[1, 2], [3]],


I'd prefer these string labels to be numeric strings ("1", "2"), just to be sure that we're not allowing conversion in the other direction.
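
One way to read this request, sketched in plain Python: a label such as "a" can never be silently turned into a number, so a test that uses it cannot detect string-to-number conversion, whereas "1" can. The helpers `coerce_to_int` and `has_mixed_label_types` are hypothetical and only illustrate the concern; they are not part of scikit-learn.

```python
def coerce_to_int(labels):
    """Hypothetical lenient conversion that a buggy implementation might apply."""
    return [int(label) for label in labels]


def has_mixed_label_types(labels):
    """Return True if the labels contain both strings and non-strings."""
    return len({isinstance(label, str) for label in labels}) > 1


# Numeric strings convert silently, so only an explicit check catches the mix.
silently_converted = coerce_to_int(["1", "2", 3])

# A non-numeric string fails loudly on its own, which would hide the problem
# from a test that only uses labels like "a".
try:
    coerce_to_int(["a", 3])
    conversion_failed = False
except ValueError:
    conversion_failed = True
```

With numeric strings in the test, an implementation that converts "1" to 1 no longer fails on its own, so the expected `ValueError` can only come from a deliberate check.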

jnothman · 2013-06-23T00:09:38Z

So, to summarise the intended functionality...
Return the set of class labels from all the given target arrays, disallowing:

mix of multilabel and multiclass (single label) targets (we might want to remove this constraint in the future)
mix of label indicator matrix and anything else (because there are no explicit labels)
mix of label indicator matrices of different sizes
mix of string and integer labels

You may want to document that first point in the docstring, while the others are necessary for sanity.
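
The four rules above can be sketched as a check over already-classified targets. This is an assumption-laden illustration: `TargetInfo` and `check_compatible` are hypothetical names, and the type strings ("binary", "multiclass", "multilabel-indicator", "multilabel-sequences") are assumed to match what type_of_target reports.

```python
from dataclasses import dataclass
from typing import Optional

MULTILABEL = {"multilabel-indicator", "multilabel-sequences"}
SINGLE_LABEL = {"binary", "multiclass"}


@dataclass
class TargetInfo:
    """Hypothetical summary of one target array."""
    kind: str                                 # assumed type_of_target-style name
    has_string_labels: bool = False
    n_indicator_labels: Optional[int] = None  # only for indicator matrices


def check_compatible(infos):
    """Raise ValueError if the targets form a disallowed mix."""
    kinds = {info.kind for info in infos}

    # Rule 2: an indicator matrix has no explicit labels, so it cannot be
    # combined with any other format.
    if "multilabel-indicator" in kinds and len(kinds) > 1:
        raise ValueError("Mix of label indicator matrix and anything else")

    # Rule 1: multilabel and multiclass (single label) targets.
    if kinds & MULTILABEL and kinds & SINGLE_LABEL:
        raise ValueError("Mix of multilabel and multiclass targets")

    # Rule 3: indicator matrices must agree on the number of labels.
    widths = {info.n_indicator_labels for info in infos
              if info.kind == "multilabel-indicator"}
    if len(widths) > 1:
        raise ValueError("Mix of label indicator matrices of different sizes")

    # Rule 4: string and integer labels.
    if len({info.has_string_labels for info in infos}) > 1:
        raise ValueError("Mix of string and integer labels")
```

Keeping the checks flat, one per rule, makes it easy to drop the first constraint later without touching the others.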

LGTM; needs another reviewer.

jnothman · 2013-06-23T00:11:55Z

I guess disallowing multioutput targets is also something worth noting in the docstring, especially as it's not plainly obvious from the code.

arjoly · 2013-06-24T15:17:22Z

A last review is welcome!

arjoly · 2013-06-25T06:23:17Z

Prior to 11e449f, these cases weren't handled correctly and no error
was raised: everything was converted into strings.

    assert_raises(ValueError, unique_labels, ["1", 2])
    assert_raises(ValueError, unique_labels, [["1", 2], [3]])
    assert_raises(ValueError, unique_labels, [["1", "2"], [3]])

For d6605ab, flat is better than nested.

jnothman · 2013-06-25T06:42:22Z

sklearn/utils/multiclass.py

+            len(set([isinstance(x, basestring)
+                     for y in ys for x in y])) > 1) or
+        (label_type == "multilabel-sequences" and
+            len(set.union(*[set(imap(lambda x: isinstance(x, basestring),


Instead of repeating this work in every case (iteration through sequences of sequences is slow), how about you only check this if the final dtype is indeed a string type? You could also have _unique_sequence_of_sequence not return an array, and then you're checking over a much smaller set of values.

But altogether, I'm not sure the validation that a single set of targets has a consistent datatype is necessary. It seems to me too pathological a case to handle.
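
A minimal sketch of the suggestion above, in plain Python 3 (where `str` plays the role of `basestring`): collect the distinct labels first, then run the type check over that small set rather than over every element. The helper names are hypothetical and do not correspond to the private functions in the PR.

```python
from itertools import chain


def unique_from_sequences(y):
    """Collect the distinct labels of a sequence of label sequences."""
    return set(chain.from_iterable(y))


def has_mixed_string_and_number(unique):
    """Check for a string/number mix over the (small) set of unique labels."""
    return len({isinstance(label, str) for label in unique}) > 1


# 3000 label occurrences, but only three distinct labels to inspect.
y = [["1", 2], [3]] * 1000
unique = unique_from_sequences(y)
mixed = has_mixed_string_and_number(unique)
```

The expensive iteration over the nested sequences happens once, and the `isinstance` calls scale with the number of distinct labels instead of the number of samples.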

I am pretty satisfied with 9a62e8c and it should solve this issue.

jnothman · 2013-06-25T12:45:55Z

Yes, LGTM.

arjoly · 2013-06-25T13:02:23Z

I have rebased on top of master.

Thanks @jnothman for the review.

arjoly · 2013-07-01T16:29:29Z

Can I merge this?

arjoly · 2013-07-04T07:05:29Z

(Ping @glouppe, @amueller, @jakevdp)

GaelVaroquaux · 2013-07-04T18:16:54Z

> Hm. It's only marginally related to unique_labels, but do we ever have
> any use cases where a mix of sequences of sequences and label indicator
> matrices are likely to appear, e.g. as input to metrics? Or were you
> just handling them for completeness? I think such mixes should throw an
> error in all cases, because we can't be sure we're interpreting the
> labels correctly.

Coming late to the discussion, but I fully agree :)

GaelVaroquaux · 2013-07-04T18:20:31Z

LGTM, but this is not code for which I have use cases, so I would prefer another review from someone using these features.

arjoly · 2013-07-05T13:35:42Z

Thanks @GaelVaroquaux.

Would it be ok to wait until Monday before merging?

GaelVaroquaux · 2013-07-05T14:05:40Z

> Would it be ok to wait until Monday before merging?

Sure, no problem.

arjoly · 2013-07-08T11:57:52Z

This pr is merged! Thanks to @jnothman and @GaelVaroquaux for the review.

jnothman · 2013-07-08T11:58:59Z

Thanks for the PR and working through the issues with multilabel data. I'm glad it's finally through.

jnothman reviewed May 30, 2013

arjoly mentioned this pull request Jun 3, 2013

TST Need tests for multilabel format issues #2022

Closed

jnothman reviewed Jun 22, 2013

jnothman reviewed Jun 25, 2013

arjoly added 3 commits July 8, 2013 13:49

FIX unique_labels in corner case

e40a7f5

FIX issue with comparable but different dtype

e3ca5c1

ENH don't allow mix of input multilabel format

adcb0a8

arjoly added 13 commits July 8, 2013 13:49

ENH simpler check for mix of string and number input

825b2c0

COSMIT better name

45847a7

Typo

46d4718

ENH use type_of_target within unique_labels

91e4b9a

ENH improve documentation with allowed label types

61d9f41

ENH check that we don't mix number and strings

aef47a7

Flatten label type checking

7856a72

TST add smoke test for all supported format

33016ed

COSMIT

6c68cac

PY3K use six.string_type

7699100

OPTIM + ENH simplify mix string and number check

109f2fb

FIX bug with indicator format

a9197f3

ENH use a comprehension over imap

a62abe6

arjoly merged commit a62abe6 into scikit-learn:master Jul 8, 2013

arjoly deleted the fix-unique_labels-mix-type branch July 8, 2013 11:57

arjoly mentioned this pull request Jul 29, 2013

[MRG] Missing contributions #2324

Merged
