Description
When n_classes > 2, the precision / recall / f1-score need to be averaged in some way.
Currently the code in precision_recall_fscore_support does:
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
Since true_pos, false_pos and false_neg are arrays of size n_classes, precision and recall are also arrays of the same size. Then, to reduce them to a single number, their weighted average is taken.
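To make the discussion concrete, here is a minimal NumPy sketch of what this amounts to; the counts are made-up numbers, and weighting by class support is my assumption about how the weighted average is computed, not code taken from scikit-learn:

    import numpy as np

    # Made-up per-class counts for a 3-class problem (illustrative only).
    true_pos = np.array([50.0, 10.0, 5.0])
    false_pos = np.array([5.0, 2.0, 20.0])
    false_neg = np.array([10.0, 4.0, 1.0])
    support = true_pos + false_neg  # number of true instances per class

    # Per-class scores, each an array of size n_classes.
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)

    # Single number obtained by weighting each class by its support
    # (my reading of what the current weighted average does).
    weighted_precision = np.average(precision, weights=support)
    weighted_recall = np.average(recall, weights=support)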
In the literature, the macro-average and the micro-average are the ones usually reported, but as far as I understand the current code computes neither. The macro-average is the unweighted mean of the precision/recall computed separately for each class, so it is an average over classes. The micro-average, on the contrary, is an average over instances, so classes with many instances are given more importance. However, AFAIK it is still not the same as the weighted average currently taken in the code.
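For example, with the made-up counts from the sketch above, the per-class precisions are roughly 0.91, 0.83 and 0.20 with supports 60, 14 and 6: the support-weighted average is about 0.84, the macro-average is about 0.65, and the micro-average (65 true positives out of 92 positive predictions) is about 0.71, so the three quantities really do differ.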
I think the code should be:
    micro_avg_precision = true_pos.sum() / (true_pos.sum() + false_pos.sum())
    micro_avg_recall = true_pos.sum() / (true_pos.sum() + false_neg.sum())

    macro_avg_precision = np.mean(true_pos / (true_pos + false_pos))
    macro_avg_recall = np.mean(true_pos / (true_pos + false_neg))
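Note that in the plain single-label multiclass case every misclassified instance counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so false_pos.sum() == false_neg.sum() and the micro-averaged precision and recall coincide (and equal the overall accuracy).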
It's easy to fix (add a micro=True|False option) but the tests may be a pain to update :-/
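A rough sketch of what such an option could look like as a standalone helper; the function name and the micro keyword are just illustrative, not the actual precision_recall_fscore_support signature:

    import numpy as np

    def averaged_precision_recall(true_pos, false_pos, false_neg, micro=False):
        """Illustrative helper, not the real scikit-learn API.

        Pools the per-class counts over all instances when micro=True
        (micro-average); otherwise takes the unweighted mean over classes
        (macro-average).
        """
        true_pos = np.asarray(true_pos, dtype=float)
        false_pos = np.asarray(false_pos, dtype=float)
        false_neg = np.asarray(false_neg, dtype=float)
        if micro:
            precision = true_pos.sum() / (true_pos.sum() + false_pos.sum())
            recall = true_pos.sum() / (true_pos.sum() + false_neg.sum())
        else:
            precision = np.mean(true_pos / (true_pos + false_pos))
            recall = np.mean(true_pos / (true_pos + false_neg))
        return precision, recall

    # With the made-up counts from above:
    # averaged_precision_recall([50, 10, 5], [5, 2, 20], [10, 4, 1], micro=True)
    # -> (about 0.71, about 0.81)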