[MRG]Clusters-Class auto-match #10604
Why not use rand score instead? I've seen "clustering accuracy" but I think it's mostly used by people that are not aware of rand score. |
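As a rough illustration of the point about the Rand score: a permutation-invariant metric such as `adjusted_rand_score` does not need any class-to-cluster matching, whereas a naive accuracy depends entirely on how the cluster ids happen to be numbered. The labels below are made up for the example.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Made-up labels: the partition is identical, only the cluster ids differ.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 1])

# Naive accuracy compares ids directly, so it is 0 here despite a perfect clustering.
naive_accuracy = np.mean(y_true == y_pred)

# The adjusted Rand index only looks at which samples are grouped together.
ari = adjusted_rand_score(y_true, y_pred)
```

A "clustering accuracy" therefore only becomes meaningful after the cluster ids have been matched to classes, which is what the rest of this thread is about.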
Cool, thanks for the references, that's probably enough evidence to include it. |
Hmm... So you've got a linear programme in there. Is this different, fundamentally, from what I have for a while considered the fact that this should really be available in sklearn.metrics.cluster, not just for biclustering (which I gather very few people use). It is the essence of the CEAF metric used for coreference resolution evaluation in NLP, which is also a generic clustering metric (albeit extensible to the case where the set of items being clustered differs between the ground truth and the system), and which allows configuration of the metric used to compare clusters for then calculating maximal assignment. I think we should be moving consensus_score, renaming it, and providing an interface where an arbitrary set or binary vector comparison can take place. A binary vector comparison is more compatible with the existing classification metrics; a set comparison is likely to be more efficient. A binary vector with See also my implementation of CEAF in neleval which has extra optimised for the case where the contingency matrix is sparse and not completely connected, which also potentially apply here (although more traditional in coreference resolution, particularly cross-document coreference resolution). In any case, this needs tests. |
Note that we also exported that |
I'd propose calling it |
@jnothman, I am relatively new to the AI field and, to be sincere, did not really understand the implementation behind the max_main_diagonal function. As I said, the credit should go to the Stack Overflow user linked. |
If you do so, do you consistently get the same results? |
No, it does not. And I do not have enough knowledge in linear programming to check any other way :(. |
Well, the two functions have different interfaces. This returns a sorted matrix. linear_assignment returns matched pair indices. Did you account for that in comparing them?
Also, I would be glad if someone could help me fix the Python 2 CircleCI build. |
@jnothman Could you give an input/output example for linear_assignment? I didn't really get how "matched pair indices" should work... |
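A small input/output sketch of what "matched pair indices" means. It uses `scipy.optimize.linear_sum_assignment` as a stand-in for the `linear_assignment` helper discussed here (an assumption made for this example); SciPy returns the row and column indices as two separate arrays, which are stacked into pairs below. The matrix is made up.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Made-up contingency matrix: rows are true classes, columns are predicted clusters.
C = np.array([[0, 5, 1],
              [4, 0, 0],
              [1, 0, 6]])

# The solver minimises cost, so negate the counts to maximise the matched total.
row_idx, col_idx = linear_sum_assignment(-C)

# Each row of `pairs` is one (true class, matched cluster) pair.
pairs = np.column_stack([row_idx, col_idx])

# Sum of the counts on the matched cells, i.e. the best achievable "diagonal".
matched_total = C[row_idx, col_idx].sum()
```

Here class 0 is matched with cluster 1, class 1 with cluster 0, and class 2 with cluster 2, so the output is a list of index pairs and not a reordered matrix.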
Also, did not get the error from Travis CI:
|
I haven't understood what the output of your This (untested) should return the predicted cluster number for each true cluster number, assuming all are integers 0...n_values-1.

def get_corresponding_clusters(y_true, y_pred):
    mapping_true, mapping_pred = linear_assignment(-contingency_matrix(y_true, y_pred)).T
    order = np.argsort(mapping_true)
    return mapping_pred[order]
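For reference, a runnable variant of the same idea, written as a sketch only. It assumes `scipy.optimize.linear_sum_assignment` in place of `linear_assignment` (so no `.T` is needed) and uses made-up labels.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix


def get_corresponding_clusters(y_true, y_pred):
    """Return, for each true class index, the index of its matched cluster."""
    C = contingency_matrix(y_true, y_pred)
    # Negate the counts so the minimising solver maximises the overlap.
    true_idx, pred_idx = linear_sum_assignment(-C)
    order = np.argsort(true_idx)
    return pred_idx[order]


# Made-up example: class 0 mostly lands in cluster 1, class 1 in cluster 2,
# and class 2 in cluster 0.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 1, 2, 2, 0, 0, 1])
mapping = get_corresponding_clusters(y_true, y_pred)
```

`mapping[i]` is the position of the matched cluster in the sorted list of cluster ids, which coincides with the cluster number when labels are integers 0...n_values-1.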
>>> from sklearn.metrics.cluster import max_main_diagonal
>>> import numpy as np
>>> A = np.matrix([[2, 1, 0],
>>> [1, 0, 0],
You need ... instead of >>> on continuation lines to not upset doctest.
The name

def max_assignment_score(labels_true, labels_pred, binary_metric=None):
    C = contingency_matrix(labels_true, labels_pred)
    # TODO: transform C according to the binary metric
    if binary_metric is None:  # accuracy
        C = C / len(labels_true)  # not in-place: C has an integer dtype
    true_idx, pred_idx = linear_assignment(-C).T
    return C[true_idx, pred_idx].sum()

I can see uses for just getting the mapping as you proposed. But I'd call it |
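A minimal runnable sketch of the accuracy case only, assuming `scipy.optimize.linear_sum_assignment` as the solver. The function name follows the proposal in this comment; it is not an existing scikit-learn function, and the labels are made up.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix


def max_assignment_score(labels_true, labels_pred):
    """Accuracy under the best one-to-one matching of clusters to classes."""
    C = contingency_matrix(labels_true, labels_pred)
    # Maximise the matched counts by minimising their negation.
    true_idx, pred_idx = linear_sum_assignment(-C)
    # Fraction of samples that fall on a matched (class, cluster) cell.
    return C[true_idx, pred_idx].sum() / len(labels_true)


labels_true = np.array([0, 0, 0, 1, 1, 1, 2, 2])
labels_pred = np.array([1, 1, 0, 0, 0, 0, 2, 2])
score = max_assignment_score(labels_true, labels_pred)
```

In this example seven of the eight samples sit on matched cells, so the score is 7/8.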
And this is perhaps a bit hacky, but it should be a way to adjust the contingency matrix for an arbitrary metric (untested):

bin_y_true = [0, 0, 1, 1]
bin_y_pred = [0, 1, 0, 1]
C = C.astype(float)  # the metric values stored below are floats
idx_true, idx_pred = C.nonzero()
tp = C[idx_true, idx_pred]
fp = C.sum(axis=0)[idx_pred] - tp  # column total minus true positives
fn = C.sum(axis=1)[idx_true] - tp  # row total minus true positives
tn = len(labels_true) - tp - fp - fn
sample_weights = np.transpose([tn, fp, fn, tp])
values = [binary_metric(bin_y_true, bin_y_pred, sample_weight=sample_weight)
          for sample_weight in sample_weights]
C[idx_true, idx_pred] = values |
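One way this trick could look as a self-contained function, offered as a sketch under assumptions: `metric_weighted_contingency` is a hypothetical helper name, `f1_score` is just one possible choice of `binary_metric`, and false positives and false negatives are taken as the column and row totals minus the true positives of each cell.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.metrics.cluster import contingency_matrix


def metric_weighted_contingency(labels_true, labels_pred, binary_metric):
    """Score every non-empty (class, cluster) cell with a binary metric."""
    C = contingency_matrix(labels_true, labels_pred)
    n_samples = len(labels_true)
    scores = np.zeros(C.shape, dtype=float)
    # The four possible (true, predicted) outcomes, in the order tn, fp, fn, tp.
    bin_y_true = [0, 0, 1, 1]
    bin_y_pred = [0, 1, 0, 1]
    idx_true, idx_pred = C.nonzero()
    tp = C[idx_true, idx_pred]
    fp = C.sum(axis=0)[idx_pred] - tp
    fn = C.sum(axis=1)[idx_true] - tp
    tn = n_samples - tp - fp - fn
    weights = np.transpose([tn, fp, fn, tp])
    for i, j, w in zip(idx_true, idx_pred, weights):
        # The counts act as sample weights on the four outcomes.
        scores[i, j] = binary_metric(bin_y_true, bin_y_pred, sample_weight=w)
    return scores


y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 1, 2, 2, 0, 0, 1])
scores = metric_weighted_contingency(y_true, y_pred, f1_score)
```

The resulting matrix can then be passed (negated) to an assignment solver in the same way as the raw contingency matrix.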
Hmm... Are you feeling out of your depth? |
Actually, yes. For example, I have no idea what a contingency matrix is.... |
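For context, a contingency matrix simply counts how many samples fall into each (true class, predicted cluster) pair. A short sketch with made-up labels, using `sklearn.metrics.cluster.contingency_matrix`:

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

# Made-up labels: three true classes and three predicted clusters.
labels_true = np.array(["a", "a", "a", "b", "b", "c"])
labels_pred = np.array([1, 1, 0, 0, 0, 2])

# Rows follow the sorted classes (a, b, c); columns follow the sorted clusters (0, 1, 2).
C = contingency_matrix(labels_true, labels_pred)
```

Row "a" is `[1, 2, 0]` because one "a" sample was placed in cluster 0 and two in cluster 1; the whole matrix sums to the number of samples.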
Okay. Never mind the binary_metric stuff. The idea is that I've seen this matching used to then calculate an overall score. We should provide a function to do this all in one go. |
How's that weekend coming along, @LPugens ? |
Wow! Sorry about that. I had some deadlines after the last commit and ended up forgetting about it. |
…g_match # Conflicts: # sklearn/metrics/cluster/__init__.py
I am just not sure about a better way to remove the if/elif
Also, when you mentioned the |
This is related to #27259 I think. @lorentzenchr @ogrisel do you think this is worth adding? There hasn't been much activity in the past 6 years here. Closing, but happy to have it reopened if we have renewed interest. |
Since I needed to calculate accuracy from a clustering result in my research, I think it could be really useful to have this in the library, depending on the task.
It is especially useful when the task involves a different set of classes for each new dataset.
Reference Issues/PRs
What does this implement/fix? Explain your changes.
class_cluster_match(y_true, y_pred) translates the cluster labels produced by a clustering method, generally integers, into the best-matching true class labels, generally strings.
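A rough sketch of the behaviour described above, not the implementation in this PR. The helper name `match_clusters_to_classes` is hypothetical, the matching uses `scipy.optimize.linear_sum_assignment`, and the labels are made up.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_clusters_to_classes(y_true, y_pred):
    """Translate each predicted cluster id into its best-matching class label."""
    classes, true_codes = np.unique(y_true, return_inverse=True)
    clusters, pred_codes = np.unique(y_pred, return_inverse=True)
    # Contingency matrix: counts per (class, cluster) pair.
    C = np.zeros((classes.size, clusters.size), dtype=int)
    np.add.at(C, (true_codes, pred_codes), 1)
    # One-to-one matching that maximises the number of agreeing samples.
    class_idx, cluster_idx = linear_sum_assignment(-C)
    # Clusters left unmatched (more clusters than classes) map to None.
    class_for_cluster = np.full(clusters.size, None, dtype=object)
    class_for_cluster[cluster_idx] = classes[class_idx]
    return class_for_cluster[pred_codes]


y_true = np.array(["cat", "cat", "dog", "dog", "dog", "bird"])
y_pred = np.array([2, 2, 0, 0, 1, 1])
translated = match_clusters_to_classes(y_true, y_pred)

# Accuracy after translating cluster ids into class labels.
accuracy = sum(t == p for t, p in zip(y_true, translated)) / len(y_true)
```

With the translated labels in hand, any of the usual classification metrics can be applied to the clustering result.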
Any other comments?
max_main_diagonal was implemented by Stack Overflow user Paul Panzer at https://stackoverflow.com/questions/48511584/how-to-efficiently-make-class-to-cluster-matching-in-order-to-calculate-resultin.