-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] FIX corner cases with unique_labels
#2015
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] FIX corner cases with unique_labels
#2015
Conversation
Not sure these are bugs. Non-integral/string labels are maybe an issue, but I don't think these mixes are. And I see that in testing this does overlap somewhat with #1985. There are some chicken and egg quandries, but I guess we should merge in this order assuming all are accepted: #2002, #1985, #2015, #1987, #1990... I'll take a look at the content of this PR some time soon... |
ys_is_multilabels = [is_multilabel(y) for y in ys] | ||
|
||
if len(set(ys_is_multilabels)) != 1: | ||
raise ValueError("Mix of binary / mutliclass and multilabel type") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if you're pre-empting the user of this function a bit too much. They haven't asked you to validate the types of data, they've asked you to find the unique labels... But whatever validation you do needs documenting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case, there's something special about the label indicator matrix format in that its labels are implicit. There seems to me no reason to deny the user the unique labels of multiclass and seq-of-seq targets together. But I'm having another thought on this that I'll post directly on the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed my mind.
I think you can simplify it a bit, relax the conditions, and make it clearer at the same time. Roughly labels = np.array([])
is_string = None
label_indicator_size = None
for y in ys:
if is_lim(y):
if label_indicator_size is None:
label_indicator_size = y.shape[1]
new_labels = ...
else:
assert label_indicator_size == y.shape[1]
elif is_seqs(y):
new_labels = ...
else:
new_labels = ...
new_is_string = ...
if is_string is None:
is_string = new_is_string
else:
assert is_string == new_is_string
labels.append(new_labels)
return np.unique(labels) |
Hm. It's only marginally related to Basically, you could throw out anything that mixes label indicator matrices and labels. |
Agree, let's simplify everything. I will open a pr to remove this functionality in the metrics module. |
please rebase on master to fix the failing test and incorporate |
@jnothman I think Arnaud is on vacation for two weeks. So don't expect any immediate reply ;) |
Thanks for letting me know. There're a few issues cascaded and better rebase on a familiar master than a stale one! |
I have rebased on top of master. |
assert_raises(ValueError, unique_labels, y_multilabel, y_multiclass) | ||
|
||
# Mix input type | ||
assert_raises(ValueError, unique_labels, [[1, 2], [3]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer these string labels to be numeric strings ("1", "2"), just to be sure that we're not allowing conversion in the other direction.
So, to summarise the intended functionality...
You may want to document that first point in the docstring, while the others are necessary for sanity. LGTM; needs another reviewer. |
I guess disallowing multioutput targets is also something worth noting in the docstring, especially as it's not plainly obvious from the code. |
A last review is welcome ! |
Previoulsy to 11e449f, those cases weren't treated correctly and no error
For d6605ab, flat is better than nested. |
len(set([isinstance(x, basestring) | ||
for y in ys for x in y])) > 1) or | ||
(label_type == "multilabel-sequences" and | ||
len(set.union(*[set(imap(lambda x: isinstance(x, basestring), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of repeating this work in every case (iteration through sequences of sequences is slow), how about you only check this if the final dtype is indeed a string type. You could also have _unique_sequence_of_sequence
not return an array, and then you're checking over a much smaller set of values.
But altogether, I'm not sure the validation that a single set of targets has a consistent datatype is necessary. It seems to me too pathological a case to handle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am pretty satisfied with 9a62e8c and it should solve this issue.
Yes, LGTM. |
I have rebased on top of master. Thanks @jnothman for the review. |
Can I merge this? |
Coming late to the discussion, but I fully agree :) |
LGTM, but this is not code for which I have usecases, so I would prefer another review from someone using these features. |
Thanks @GaelVaroquaux . Would it be ok to wait until monday before merging? |
Sure, no pb |
This pr is merged! Thanks to @jnothman and @GaelVaroquaux for the review. |
Thanks for the PR and working through the issues with multilabel data. I'm glad it's finally through. |
Bugs were discovered in
unique_labels
in #1985:any error.
In my opinion, the implementation could be greatly simplify and improve usingtype_of_target
in #1985 (see https://gist.github.com/arjoly/5665632).However this pr is not merged yet .Remaining bugs that won't be treated in this pr: