Skip to content

[MRG] FIX corner cases with unique_labels #2015

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Jul 8, 2013

Conversation

arjoly
Copy link
Member

@arjoly arjoly commented May 29, 2013

Bugs were discovered in unique_labels in #1985:

  • Mix of multilabel and multiclass format doesn't raise any errors
  • Mix of input data type doesn't raise any error (e.g. [1, 2] and ["a"])
  • Mix of indicator matrix with different number of labels doesn't raise
    any error.

In my opinion, the implementation could be greatly simplify and improve using type_of_target in #1985 (see https://gist.github.com/arjoly/5665632).

However this pr is not merged yet .

Remaining bugs that won't be treated in this pr:

  • mix of multioutput multiclass format and multiclass / multilabel format
  • mix of unknown / continuous / continuous multi-output format and multiclass / multilabel format

@jnothman
Copy link
Member

mix of multioutput multiclass format and multiclass / multilabel format
mix of unknown / continuous / continuous multi-output format and multiclass / multilabel format

Not sure these are bugs. Non-integral/string labels are maybe an issue, but I don't think these mixes are.

And I see that in testing this does overlap somewhat with #1985. There are some chicken and egg quandries, but I guess we should merge in this order assuming all are accepted: #2002, #1985, #2015, #1987, #1990...

I'll take a look at the content of this PR some time soon...

@arjoly
Copy link
Member Author

arjoly commented May 29, 2013

And I see that in testing this does overlap somewhat with #1985

You have done a pretty good work with testing cases in #1985.

ys_is_multilabels = [is_multilabel(y) for y in ys]

if len(set(ys_is_multilabels)) != 1:
raise ValueError("Mix of binary / mutliclass and multilabel type")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if you're pre-empting the user of this function a bit too much. They haven't asked you to validate the types of data, they've asked you to find the unique labels... But whatever validation you do needs documenting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, there's something special about the label indicator matrix format in that its labels are implicit. There seems to me no reason to deny the user the unique labels of multiclass and seq-of-seq targets together. But I'm having another thought on this that I'll post directly on the issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed my mind.

@jnothman
Copy link
Member

I think you can simplify it a bit, relax the conditions, and make it clearer at the same time. Roughly

labels = np.array([])
is_string = None
label_indicator_size = None
for y in ys:
    if is_lim(y):
        if label_indicator_size is None:
            label_indicator_size = y.shape[1]
            new_labels = ...
        else:
            assert label_indicator_size == y.shape[1]
    elif is_seqs(y):
        new_labels = ...
    else:
        new_labels = ...
    new_is_string = ...
    if is_string is None:
        is_string = new_is_string
    else:
        assert is_string == new_is_string
    labels.append(new_labels)
return np.unique(labels)

@jnothman
Copy link
Member

jnothman commented Jun 3, 2013

Hm. It's only marginally related to unique_labels, but do we ever have any use cases where a mix of sequences of sequences and label indicator matrices are likely to appear, e.g. as input to metrics? Or were you just handling them for completeness? I think such mixes should throw an error in all cases, because we can't be sure we're interpreting the labels correctly.

Basically, you could throw out anything that mixes label indicator matrices and labels.

@arjoly
Copy link
Member Author

arjoly commented Jun 3, 2013

Agree, let's simplify everything. I will open a pr to remove this functionality in the metrics module.

@jnothman
Copy link
Member

please rebase on master to fix the failing test and incorporate type_of_targets here.

@glouppe
Copy link
Contributor

glouppe commented Jun 13, 2013

@jnothman I think Arnaud is on vacation for two weeks. So don't expect any immediate reply ;)

@jnothman
Copy link
Member

Thanks for letting me know. There're a few issues cascaded and better rebase on a familiar master than a stale one!

@arjoly
Copy link
Member Author

arjoly commented Jun 22, 2013

I have rebased on top of master.

assert_raises(ValueError, unique_labels, y_multilabel, y_multiclass)

# Mix input type
assert_raises(ValueError, unique_labels, [[1, 2], [3]],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer these string labels to be numeric strings ("1", "2"), just to be sure that we're not allowing conversion in the other direction.

@jnothman
Copy link
Member

So, to summarise the intended functionality...
Return the set of class labels from all the given target arrays, disallowing:

  • mix of multilabel and multiclass (single label) targets (we might want to remove this constraint in the future)
  • mix of label indicator matrix and anything else (because there are no explicit labels)
  • mix of label indicator matrices of different sizes
  • mix of string and integer labels

You may want to document that first point in the docstring, while the others are necessary for sanity.

LGTM; needs another reviewer.

@jnothman
Copy link
Member

I guess disallowing multioutput targets is also something worth noting in the docstring, especially as it's not plainly obvious from the code.

@arjoly
Copy link
Member Author

arjoly commented Jun 24, 2013

A last review is welcome !

@arjoly
Copy link
Member Author

arjoly commented Jun 25, 2013

Previoulsy to 11e449f, those cases weren't treated correctly and no error
was raised. Everything was converted into strings.

    assert_raises(ValueError, unique_labels, ["1", 2])
    assert_raises(ValueError, unique_labels, [["1", 2], [3]])
    assert_raises(ValueError, unique_labels, [["1", "2"], [3]])

For d6605ab, flat is better than nested.

len(set([isinstance(x, basestring)
for y in ys for x in y])) > 1) or
(label_type == "multilabel-sequences" and
len(set.union(*[set(imap(lambda x: isinstance(x, basestring),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of repeating this work in every case (iteration through sequences of sequences is slow), how about you only check this if the final dtype is indeed a string type. You could also have _unique_sequence_of_sequence not return an array, and then you're checking over a much smaller set of values.

But altogether, I'm not sure the validation that a single set of targets has a consistent datatype is necessary. It seems to me too pathological a case to handle.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am pretty satisfied with 9a62e8c and it should solve this issue.

@jnothman
Copy link
Member

Yes, LGTM.

@arjoly
Copy link
Member Author

arjoly commented Jun 25, 2013

I have rebased on top of master.

Thanks @jnothman for the review.

@arjoly
Copy link
Member Author

arjoly commented Jul 1, 2013

Can I merge this?

@arjoly
Copy link
Member Author

arjoly commented Jul 4, 2013

(Ping @glouppe, @amueller, @jakevdp)

@GaelVaroquaux
Copy link
Member

Hm. It's only marginally related to unique_labels, but do we ever have
any use cases where a mix of sequences of sequences and label indicator
matrices are likely to appear, e.g. as input to metrics? Or were you
just handling them for completeness? I think such mixes should throw an
error in all cases, because we can't be sure we're interpreting the
labels correctly.

Coming late to the discussion, but I fully agree :)

@GaelVaroquaux
Copy link
Member

LGTM, but this is not code for which I have usecases, so I would prefer another review from someone using these features.

@arjoly
Copy link
Member Author

arjoly commented Jul 5, 2013

Thanks @GaelVaroquaux .

Would it be ok to wait until monday before merging?

@GaelVaroquaux
Copy link
Member

Would it be ok to wait until monday before merging?

Sure, no pb

@arjoly arjoly merged commit a62abe6 into scikit-learn:master Jul 8, 2013
@arjoly arjoly deleted the fix-unique_labels-mix-type branch July 8, 2013 11:57
@arjoly
Copy link
Member Author

arjoly commented Jul 8, 2013

This pr is merged! Thanks to @jnothman and @GaelVaroquaux for the review.

@jnothman
Copy link
Member

jnothman commented Jul 8, 2013

Thanks for the PR and working through the issues with multilabel data. I'm glad it's finally through.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants