-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
FIX helper to check multilabel types #1985
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I'm not sure about all the terminology and codes returned. |
Can you have a look at #1643 ? I would use something like |
I don't mind removing the size checks, and I'm happy to change |
And after looking at #1643, I'll adopt |
I am thinking the same: it's a source of confusion and makes code more |
Yes, this can be remove. |
'Received a mix of single-label and multi-label data') | ||
|
||
|
||
def is_multilabel(*ys): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you get rid of the magic *
?
This is not necessary here.
This function doesn't return a boolean, so could rename this function
get_something
, e.g. get_label_type
or get_classification_type
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you get rid of the magic *?
This is not necessary here.
What I mean is that if you have a get_classification_type
function,
you can easily compare the format of y_pred
and y_true
without
using the magic *
:
>>> ind = np.array([[0, 1], [1, 0]])
>>> seq = [[0], [0, 2]]
>>>get_classification_type(ind)
"multilabel-indicator-matrix"
>>> get_classification_type(seq)
"multilabel-sequence-of-sequence"
>>> get_classification_type(ind) == get_classification_type(seq)
False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function doesn't return a boolean, so could rename this function
get_something
, e.g.get_label_type
orget_classification_type
?
With the given name, you return a string output and a boolean output.
That's counter-intuitive.
Could you add tests in |
Since you want to add the support for list of list of string, I think that you should extend the invariance test for multilabel data format in |
I did support binary and multiclass, just not in the case of multiple. Your comments are fair. I think I think the returned strings should be:
I thought this might be okay in utils, but can change it.
List of list of string was already supported, effectively (it passed |
maximally: def type_of_target(y):
if is_sequence_of_sequences(y):
return 'multilabel-sequences'
elif is_label_indicator_matrix(y):
return 'multilabel-indicator'
y = np.asarray(y)
if issubclass(y.dtype.type, np.float) and np.any(y != y.astype(int)):
return 'regression' # or 'continuous'? 'real'? delete this case?
if len(np.unique(y)) == 2:
return 'binary'
else:
return 'multiclass' |
And maybe something like this would be useful to help metrics pass invariance tests (this is untested): def _check_clf_targets(y_true, y_pred, encode=False, ravel=True):
y_true, y_pred = check_arrays(y_true, y_pred, allow_lists=True)
type_true = type_of_target(y_true)
type_pred = type_of_target(y_pred)
labels = unique_labels(y_true, y_pred)
if type_true.startswith('multilabel'):
if not type_pred.startswith('multilabel'):
raise ValueError("Can't handle mix of multilabel and multiclass targets")
if type_true != type_pred:
enc = LabelBinarizer()
enc.fit([labels.tolist()])
y_true = enc.transform(y_true)
y_pred = enc.transform(y_pred)
type_true = type_pred = 'multilabel-indicator'
elif type_pred.startswith('multilabel'):
raise ValueError("Can't handle mix of multilabel and multiclass targets")
elif 'regression' in (type_true, type_pred):
raise ValueError("Can't handle continuous targets")
else:
if 'multiclass' in (type_true, type_pred):
# 'binary' can be removed
type_true = type_pred = 'multiclass'
y_true, y_pred = _check_1d_array(y_true, y_pred, ravel=ravel)
if encode and type_true != 'multilabel-indicator':
enc = LabelEncoder()
enc.fit([labels.tolist()] if type_true.startswith('multilabel')
else labels.tolist())
y_true = enc.transform(y_true)
y_pred = enc.transform(y_pred)
return labels, type_true, y_true, y_pred |
I like your Will it simplify anything to support the binary case or regression case? |
Well, Regression? (Or "continuous"... or "real-valued".) I can't think of anything, except to produce a ValueError for categorical metrics if the output of |
okay, have another look. |
@@ -149,6 +149,95 @@ def _check_1d_array(y1, y2, ravel=False): | |||
return y1, y2 | |||
|
|||
|
|||
def _check_clf_targets(y_true, y_pred, encode=False, ravel=False, labels=None): | |||
"""Check that y1 and y2 correspond to the same classification task type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
y1
=> y_true
?
y2
=> y_pred
?
You can add your name in the authors of |
|
||
encode : boolean, optional (default=False), | ||
If ``encode`` is set to ``True``, then non-multilabel ``y_true`` and | ||
``y_pred`` are encoded as integers from 0 to ``len(labels)``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to have this option at the moment?
Is it used in the codebase?
Rebased on master. |
In terms of performance, I guess the slow checks in PRF benchmark (10 classes; multiclass
These results make me wonder how long binarizing takes for multiclass data and whether it's worth doing in general, and the answer seems to be "almost certainly":
@arjoly might be interested... |
Thanks for the benchmark! |
I take back "almost certainly": of course,
That's multiclass 1e6 samples. What about sequences of sequences 1e3 samples?
Well, that's a nice reduction. Perhaps we should always binarize sequences of sequences in PRF. |
y.shape[0] > 1 and np.size(np.unique(y)) <= 2) | ||
return (hasattr(y, "shape") and y.ndim == 2 and y.shape[1] > 1 and | ||
np.size(np.unique(y)) <= 2 and | ||
issubclass(y.dtype.type, (np.int, np.bool_, np.float)) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apparently this fails for np.int8
, etc. Fix and tests coming soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed 1bead14
I should note that much of that binarizing time is taken up by finding the unique labels in |
Thanks @jnothman for the rebase and the benchmarks. I think we are good to merge. 👍 on my side. |
Well, what've we got to lose? Merged as fd3bd2d. |
@arjoly: let the rebasing begin! |
thanks a lot @jnothman !!! :-) |
Fixes bugs related to simply checking type(..) equality and refactors.
@arjoly does this seem reasonable?