Confusion Matrix Representation / Return Value #19012
Comments
I agree that the current output is unnecessarily difficult, and the confusion matrix is naturally portrayed as a dataframe... not sure if this is something we would consider introducing as a breaking change, for which we would require a hard dep on pandas... |
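As a sketch of what that DataFrame representation could look like today on the caller's side (the variable names and axis labels here are illustrative, not part of any sklearn API):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
labels = ["ant", "bird", "cat"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
# confusion_matrix rows are true labels, columns are predicted labels
df = (pd.DataFrame(cm, index=labels, columns=labels)
        .rename_axis(index="true", columns="pred"))
print(df)
```

The named axes make the true/predicted orientation explicit without changing the class labels themselves.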
Could we return a Python dict or a |
@jnothman @glemaitre Actually, I'm suggesting an alternative output that shouldn't affect dependent code or introduce a breaking change: add a parameter so that, when it is enabled, the alternative output is returned, and by default (parameter disabled) the current behaviour is preserved. I would work on the issue if you think it's a viable feature. |
Hi, I was suggesting the output as shown in the screenshot below. I have added a parameter called `pprint` to the existing function; the example and the modified function follow. |
The rows should be true values, the columns predicted. I don't think we want pprint like that, although I admit that whether we return a dict or a DataFrame or index and columns it remains tricky to communicate which is true and which is predicted. |
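This orientation can be verified with the current API using a single, deliberately misclassified sample:

```python
from sklearn.metrics import confusion_matrix

# one sample whose true label is "bird" but which was predicted "cat"
cm = confusion_matrix(["bird"], ["cat"], labels=["ant", "bird", "cat"])

# the count lands at row 1 (true label "bird"), column 2 (predicted "cat")
assert cm[1, 2] == 1
assert cm.sum() == 1
```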
@jnothman @glemaitre Taking your comments into consideration, I've added another change so that it returns output in a built-in data type, i.e. `dict`, eliminating the need for hard dependencies and making the result usable with any third-party library as well. Please refer to the example code and outputs below. Thanks!

Code:

```python
if pprint:
    labelList = labels.tolist()
    cm_lol = cm.tolist()
    cm_dict = {str(labelList[j]): {str(labelList[i]): cm_lol[i][j]
                                   for i in range(len(labelList))}
               for j in range(len(cm_lol))}
    return cm_dict
```

Output with `pprint=False` (the default):

```python
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]], dtype=int64)
```

Output with `pprint=True`:

```python
{'ant': {'ant': 2, 'bird': 0, 'cat': 1},
 'bird': {'ant': 0, 'bird': 0, 'cat': 0},
 'cat': {'ant': 0, 'bird': 1, 'cat': 2}}
```

Changes required for this solution:

```python
def confusion_matrix(y_true, y_pred, *, labels=None, sample_weight=None,
                     normalize=None, pprint=False):
    ...
    if pprint:           # logic as above
        return cm_dict   # the suggested output
    ...
    return cm
```

Option 2, for a better understanding of the true and predicted values:

```python
if pprint:
    labelList = labels.tolist()
    cm_lol = cm.tolist()
    cm_dict = {"pred_" + str(labelList[j]): {"true_" + str(labelList[i]): cm_lol[i][j]
                                             for i in range(len(labelList))}
               for j in range(len(cm_lol))}
    return cm_dict
```

Output:

```python
{'pred_ant': {'true_ant': 2, 'true_bird': 0, 'true_cat': 1},
 'pred_bird': {'true_ant': 0, 'true_bird': 0, 'true_cat': 0},
 'pred_cat': {'true_ant': 0, 'true_bird': 1, 'true_cat': 2}}
```
|
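The comprehension above can be exercised outside `confusion_matrix` as a small standalone sketch; the function and variable names here are hypothetical, not part of the proposal's API:

```python
import numpy as np

def cm_to_nested_dict(cm, labels, prefix=False):
    """Turn a square confusion matrix (rows = true, columns = predicted)
    into a nested dict keyed by predicted label, then true label.
    With prefix=True, keys are tagged "pred_"/"true_" as in Option 2."""
    cm = np.asarray(cm)
    names = [str(l) for l in labels]
    fmt = (lambda kind, name: f"{kind}_{name}") if prefix else (lambda kind, name: name)
    return {fmt("pred", names[j]): {fmt("true", names[i]): int(cm[i, j])
                                    for i in range(len(names))}
            for j in range(len(names))}

cm = [[2, 0, 0],
      [0, 0, 1],
      [1, 0, 2]]
plain = cm_to_nested_dict(cm, ["ant", "bird", "cat"])
tagged = cm_to_nested_dict(cm, ["ant", "bird", "cat"], prefix=True)
```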
```python
from sklearn.metrics import confusion_matrix
y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
...
# plot the confusion matrix
f, ax = plt.subplots(figsize=(10, 10))
...
```
|
I'm okay with something like your Option 2, but I don't really like the use of underscores and am a little uncomfortable about coercing class names to strings. Does returning tuples as keys, such as `("true", 1)`, work with casting to DataFrame? Another option is to have a flat dict with `{(true_class, pred_class): count}`, which is a collapsed form of the matrix. |
Returning tuples as keys within the nested dict works well with casting to DataFrame; please refer to the dict below:

```python
{('pred', 'ant'): {('true', 'ant'): 2, ('true', 'bird'): 0, ('true', 'cat'): 1},
 ('pred', 'bird'): {('true', 'ant'): 0, ('true', 'bird'): 0, ('true', 'cat'): 0},
 ('pred', 'cat'): {('true', 'ant'): 0, ('true', 'bird'): 1, ('true', 'cat'): 2}}
```

Another option is a flat dict:

```python
{('pred_ant', 'true_ant'): 2, ('pred_ant', 'true_bird'): 0, ('pred_ant', 'true_cat'): 1,
 ('pred_bird', 'true_ant'): 0, ('pred_bird', 'true_bird'): 0, ('pred_bird', 'true_cat'): 0,
 ('pred_cat', 'true_ant'): 0, ('pred_cat', 'true_bird'): 1, ('pred_cat', 'true_cat'): 2}
```

I think the nested dict with tuples as keys represents the data very well, preserving class names and allowing easy conversion to a DataFrame. |
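For instance, pandas accepts the nested tuple-key dict directly in the `DataFrame` constructor; the outer keys become the columns (a sketch against current pandas behaviour):

```python
import pandas as pd

nested = {('pred', 'ant'): {('true', 'ant'): 2, ('true', 'bird'): 0, ('true', 'cat'): 1},
          ('pred', 'bird'): {('true', 'ant'): 0, ('true', 'bird'): 0, ('true', 'cat'): 0},
          ('pred', 'cat'): {('true', 'ant'): 0, ('true', 'bird'): 1, ('true', 'cat'): 2}}

# outer dict keys -> columns, inner dict keys -> index
df = pd.DataFrame(nested)
print(df)
```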
Re the flat dict, I was thinking to put true then pred in the keys, in line with the input args. Because it parallels the input args, I would not need "true" and "pred" explicitly there (although OTOH, your example violated my intuitions!). So more along the lines of:

```python
C = {('ant', 'ant'): 2, ('bird', 'ant'): 0, ('cat', 'ant'): 1,
     ('ant', 'bird'): 0, ('bird', 'bird'): 0, ('cat', 'bird'): 0,
     ('ant', 'cat'): 0, ('bird', 'cat'): 1, ('cat', 'cat'): 2}
```

then

```python
>>> pd.Series(C)
ant   ant     2
bird  ant     0
cat   ant     1
ant   bird    0
bird  bird    0
cat   bird    0
ant   cat     0
bird  cat     1
cat   cat     2
>>> pd.Series(C).unstack()
      ant  bird  cat
ant     2     0    0
bird    0     0    1
cat     1     0    2
```
|
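The same flat dict can be built mechanically from the array output, which shows the round trip between the two representations (variable names here are illustrative):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
labels = ["ant", "bird", "cat"]

cm = confusion_matrix(y_true, y_pred, labels=labels)

# flat {(true_label, pred_label): count} form, keys ordered like the input args
C = {(t, p): int(cm[i, j])
     for i, t in enumerate(labels)
     for j, p in enumerate(labels)}

table = pd.Series(C).unstack()  # back to rows = true, columns = pred
```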
@jnothman What would be the issue with a nested `dict`? |
How would you make it intuitively clear which axis is true and which pred? Or would we rely on documentation? |
@jnothman Won't this reliance on documentation, and the ambiguity about which axis is true and which is pred, persist even with a tuple-key dictionary? |
@jnothman When I thought about the nested dict, I was thinking that "Actual label" and |
I think that a flag to get output with more labels would be a huge win - I guess we'd need to try and get this PR merged? #19190 |
Describe the workflow you want to enable
An enhancement to the output of the confusion matrix function, better representing the true and predicted values for multilevel classes.

```python
from sklearn.metrics import confusion_matrix
y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])
```

Output:

```python
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
```
Describe your proposed solution
When there are multiple levels, the bare ndarray is hard to read because nothing associates the levels with the true and predicted values. The proposed output should look similar to the table below, providing better readability of the confusion matrix.

Possible solution, for example:

```python
cm = confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])
index = ["true:ant", "true:bird", "true:cat"]
columns = ["pred:ant", "pred:bird", "pred:cat"]
return cm, index, columns
```

This can be easily converted into a DataFrame for further use.
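A sketch of what the caller's side would look like; note the three-value return is the proposal, not current sklearn behaviour, so the index and columns are built by hand here:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
labels = ["ant", "bird", "cat"]

# today the index/columns must be constructed alongside the matrix
cm = confusion_matrix(y_true, y_pred, labels=labels)
index = [f"true:{l}" for l in labels]
columns = [f"pred:{l}" for l in labels]

df = pd.DataFrame(cm, index=index, columns=columns)
print(df)
```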