Multi-label and multi-output multi-class decision functions and predict proba aren't consistent #2451
A small example to understand the issue:
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.datasets import make_multilabel_classification
X, Y = make_multilabel_classification(random_state=0, n_samples=5,
return_indicator=True, n_classes=3)
print("rf")
rf = RandomForestClassifier(random_state=0).fit(X, Y)
print(rf.predict_proba(X))
# rf
# [array([[ 0.7, 0.3],
# [ 0.2, 0.8],
# [ 0.9, 0.1],
# [ 0.8, 0.2],
# [ 0.2, 0.8]]), array([[ 0.6, 0.4],
# [ 0.2, 0.8],
# [ 0.2, 0.8],
# [ 0.9, 0.1],
# [ 0.8, 0.2]]), array([[ 0.3, 0.7],
# [ 0.8, 0.2],
# [ 0.1, 0.9],
# [ 1. , 0. ],
# [ 0.9, 0.1]])]
print("ovr rf")
ovr_rf = OneVsRestClassifier(RandomForestClassifier(random_state=0)).fit(X, Y)
print(ovr_rf.predict_proba(X))
# ovr rf
# [[ 0.2 0.4 0.7]
# [ 0.8 0.8 0.1]
# [ 0.1 0.9 0.9]
# [ 0.2 0.1 0. ]
# [ 0.8 0.1 0.2]] |
There are 3 possibilities to solve this issue:
Option 1 means more formats to support; option 2 won't work with a grid search estimator. What is your opinion on this issue? Do you have better ideas? |
Part of the issue that you've not stated is that a multilabel label Now, we already have the quirky case of binary classification resulting in So I more-or-less think your option (1) is agreeable, but you haven't told ~J On Wed, Sep 18, 2013 at 1:18 AM, Arnaud Joly notifications@github.com wrote:
|
At the moment, I am thinking of metrics with a score or a probability. None |
Related to #1781 |
As far as I know / remember, the only "multi-label" (and not multi-output multi-class) aware classifier is the OneVsRestClassifier. The issue could be handled by deprecating the multilabel support of ovr and implementing a separate class or module for a binary relevance / multi-output classifier. |
What is the motivation for deprecation? It seems to me OVR has the right interface. I'd rather remove multi-output multi-class support as it is a rather rare setting and messes with API contracts. |
Working with both formats is a pain, and you have to perform the format normalisation in your own code. I don't think that going the other way around, by deprecating the multi-output multi-class format, is possible. More estimators support this format (e.g. dummy, k-nn, tree, forest). It would also break people's code without offering any replacement. |
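The format normalisation mentioned above can be sketched as follows. This is only an illustration, not scikit-learn code: the helper name stack_positive_class is hypothetical, and it assumes every output is binary with the positive class in column 1. It uses plain nested lists, but the same indexing works on the arrays returned by predict_proba.

```python
def stack_positive_class(proba_per_output, positive_index=1):
    """Turn a list of (n_samples, 2) probability tables into (n_samples, n_labels).

    Hypothetical helper; assumes each output is binary and that column
    `positive_index` holds the probability of the positive class.
    """
    n_samples = len(proba_per_output[0])
    return [
        [float(output[i][positive_index]) for output in proba_per_output]
        for i in range(n_samples)
    ]


# First two samples of the RandomForestClassifier output quoted in this thread.
rf_proba = [
    [[0.7, 0.3], [0.2, 0.8]],  # output 0
    [[0.6, 0.4], [0.2, 0.8]],  # output 1
    [[0.3, 0.7], [0.8, 0.2]],  # output 2
]

label_scores = stack_positive_class(rf_proba)
print(label_scores)  # [[0.3, 0.4, 0.7], [0.8, 0.8, 0.2]]
```

Each row of the result is a sample and each column a label, which is the layout OneVsRestClassifier returns.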
I didn't realize KNeighborsClassifier supported multi-output multi-class.
|
Do you have a full example? |
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_multilabel_classification
X, Y = make_multilabel_classification(random_state=0, n_samples=5,
return_indicator=True, n_classes=3)
# works:
rf = RandomForestClassifier(random_state=0).fit(X, Y[:, 0])
Y_pred = rf.predict_proba(X).argmax(axis=1)
# attribute error:
rf = RandomForestClassifier(random_state=0).fit(X, Y)
Y_pred = rf.predict_proba(X).argmax(axis=1)
I don't like that. |
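The attribute error above arises because the multi-output predict_proba returns a Python list, which has no argmax method. A minimal sketch of the per-output workaround a caller would need, using nested lists and a hypothetical helper name (predict_from_proba_list); it returns class indices, not class labels:

```python
def argmax(row):
    """Index of the largest value in a sequence (first one wins on ties)."""
    return max(range(len(row)), key=lambda k: row[k])


def predict_from_proba_list(proba_per_output):
    """Apply argmax to each output separately.

    Hypothetical helper: takes a list with one (n_samples, n_classes) table
    per output and returns an (n_samples, n_outputs) table of class indices.
    """
    n_samples = len(proba_per_output[0])
    return [
        [argmax(output[i]) for output in proba_per_output]
        for i in range(n_samples)
    ]


# First two samples of the multi-output probabilities quoted earlier.
rf_proba = [
    [[0.7, 0.3], [0.2, 0.8]],  # output 0
    [[0.6, 0.4], [0.2, 0.8]],  # output 1
    [[0.3, 0.7], [0.8, 0.2]],  # output 2
]

Y_pred = predict_from_proba_list(rf_proba)
print(Y_pred)  # [[0, 0, 1], [1, 1, 0]]
```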
I am not sure how useful the multiclass multi-output is in general. Do you have any references? |
The alternative could be to have a 3D numpy array, but then some columns would be meaningless.
There are applications, e.g. with pixel labelling, but I am not familiar with those. I know that some real problems are tackled using the multi-output code. @glouppe might know more about this. |
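To illustrate the 3D-array alternative and why some columns would be meaningless: when outputs have different numbers of classes, the shorter ones must be padded. The sketch below uses nested lists and NaN as the filler; the helper name pad_to_3d, the choice of NaN, and the three-class numbers are assumptions made for illustration only.

```python
import math


def pad_to_3d(proba_per_output, fill=float("nan")):
    """Pad each output to the same number of class columns.

    Hypothetical helper: the result is indexed as
    [output][sample][class]; padded entries carry no meaning.
    """
    max_classes = max(len(output[0]) for output in proba_per_output)
    return [
        [list(row) + [fill] * (max_classes - len(row)) for row in output]
        for output in proba_per_output
    ]


proba = [
    [[0.7, 0.3], [0.2, 0.8]],            # binary output
    [[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]],  # three-class output (made-up numbers)
]

cube = pad_to_3d(proba)
print(len(cube), len(cube[0]), len(cube[0][0]))  # 2 2 3
print(math.isnan(cube[0][0][2]))                 # True
```

Every consumer of such an array would have to know which entries are padding, which is the drawback raised above.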
I know one paper for image patches, but I don't think our implementation is very good for image data, because you want to sample the features on the fly. |
Closing as a duplicate of a more recent / detailed issue: #19880 |
The decision_function and predict_proba of a multi-label classifier (e.g. OneVsRestClassifier) are 2d arrays where each column corresponds to a label and each row corresponds to a sample. (added in 0.14?)
The decision_function and predict_proba of a multi-output multi-class classifier (e.g. RandomForestClassifier) are a list whose length equals the number of outputs; each element is a multi-class decision_function or predict_proba output (a 2d array where each row corresponds to a sample and each column corresponds to a class).
So this means that a multi-output problem with only binary class outputs is a multi-label task, but its output isn't consistent with the multi-label format...
This is problematic if you want to code a roc_auc_score function to support multi-label output.
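As a rough sketch of what a multi-label metric expects, here is a macro-averaged ROC AUC computed over the 2d (n_samples, n_labels) layout. It is not scikit-learn's roc_auc_score: the function names binary_auc and macro_auc are hypothetical, the AUC is computed by pairwise comparison of positive and negative scores, and the data is made up.

```python
def binary_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(Y_true, Y_score):
    """Average the per-label AUC over the columns of two 2d tables."""
    n_labels = len(Y_true[0])
    aucs = [
        binary_auc([row[j] for row in Y_true], [row[j] for row in Y_score])
        for j in range(n_labels)
    ]
    return sum(aucs) / n_labels


# Made-up indicator matrix and scores: 4 samples, 2 labels.
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.3, 0.4]]

print(macro_auc(Y_true, Y_score))  # 0.875
```

A function like this can consume the OneVsRestClassifier output directly, whereas the list-of-arrays output would first have to be converted to one column per label.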