[MRG]Clusters-Class auto-match #10604

Closed · wants to merge 43 commits

Conversation

@LPugens commented Feb 7, 2018

Since I needed to compute accuracy from a clustering result in my research, I think it could be really useful to have this in the library, depending on the task.
It is especially useful when the task involves a different set of classes for each new dataset.

Reference Issues/PRs

What does this implement/fix? Explain your changes.

The class_cluster_match(y_true, y_pred) function translates clusters found by clustering methods, generally given as numbers, to the best possible match among the true class labels, generally strings.

Any other comments?

The max_main_diagonal function was implemented by user Paul Panzer at https://stackoverflow.com/questions/48511584/how-to-efficiently-make-class-to-cluster-matching-in-order-to-calculate-resultin.
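The PR's actual code is not shown in this thread, so here is a minimal sketch of the described behavior: build a class-by-cluster contingency matrix, solve the maximum-overlap assignment, and relabel the predictions. It uses `scipy.optimize.linear_sum_assignment` in place of the PR's `max_main_diagonal`; the function body is illustrative, not the PR's implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def class_cluster_match(y_true, y_pred):
    """Relabel numeric cluster ids with the best-matching true class labels."""
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    # contingency[i, j] = number of samples with class i and cluster j
    contingency = np.zeros((len(classes), len(clusters)), dtype=int)
    for t, p in zip(y_true, y_pred):
        contingency[classes.index(t), clusters.index(p)] += 1
    # Maximize total overlap; negate because linear_sum_assignment minimizes.
    row_ind, col_ind = linear_sum_assignment(-contingency)
    mapping = {clusters[j]: classes[i] for i, j in zip(row_ind, col_ind)}
    return [mapping.get(p, p) for p in y_pred]

# Cluster 1 overlaps class 'a' most, cluster 0 overlaps class 'b' most:
print(class_cluster_match(['a', 'a', 'a', 'b', 'b', 'b'], [1, 1, 0, 0, 0, 0]))
# -> ['a', 'a', 'b', 'b', 'b', 'b']
```

Once clusters carry class labels, standard classification metrics such as accuracy can be applied directly to the result.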

@amueller (Member) commented Feb 7, 2018

Why not use the Rand score instead? I've seen "clustering accuracy", but I think it's mostly used by people who are unaware of the Rand score.

@LPugens (Author) commented Feb 7, 2018

This, this and this are examples of peer-reviewed papers using accuracy and related metrics for clustering tasks. As I said in the merge request, it is useful for certain tasks and is in no way a replacement for other clustering metrics. It is just a new tool available to the user.

@amueller (Member) commented Feb 7, 2018

Cool, thanks for the references, that's probably enough evidence to include it.

@jnothman (Member) commented Feb 7, 2018

Hmm... So you've got a linear programme in there. Is this different, fundamentally, from what sklearn.metrics.cluster.bicluster.consensus_score is doing? It calculates the assignment between two clusterings that maximises, in this case, jaccard similarity among the clusters (although any similarity function should work there). The linear_assignment function used there is highly optimised to this kind of problem.

I have for a while considered the fact that this should really be available in sklearn.metrics.cluster, not just for biclustering (which I gather very few people use). It is the essence of the CEAF metric used for coreference resolution evaluation in NLP, which is also a generic clustering metric (albeit extensible to the case where the set of items being clustered differs between the ground truth and the system), and which allows configuration of the metric used to compare clusters for then calculating maximal assignment.

I think we should be moving consensus_score, renaming it, and providing an interface where an arbitrary set or binary vector comparison can take place. A binary vector comparison is more compatible with the existing classification metrics; a set comparison is likely to be more efficient. A binary vector with sample_weight passed would be sufficient to meet both needs.

See also my implementation of CEAF in neleval, which has extra optimisations for the case where the contingency matrix is sparse and not completely connected; these could potentially apply here too (although they are more traditional in coreference resolution, particularly cross-document coreference resolution).

In any case, this needs tests.

@jnothman (Member) commented Feb 7, 2018

Note that we also exported that linear_assignment implementation to SciPy as linear_sum_assignment.
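For readers unfamiliar with the "matched pair indices" output discussed below, this is how `scipy.optimize.linear_sum_assignment` behaves on a small cost matrix:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j] is the cost of assigning row i to column j.
cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])
# Returns one row index and one column index per matched pair,
# minimizing the total cost of the assignment.
row_ind, col_ind = linear_sum_assignment(cost)
print(row_ind, col_ind)               # -> [0 1 2] [1 0 2]
print(cost[row_ind, col_ind].sum())   # -> 5
```

Negating the matrix turns the minimization into a maximization, which is how it is used for contingency matrices later in this thread.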

@jnothman (Member) commented Feb 7, 2018

I'd propose calling it max_assignment_score.

@LPugens (Author) commented Feb 8, 2018

@jnothman, I am relatively new to the AI field and, to be honest, did not really understand the implementation behind the max_main_diagonal function. As I said, the credit should go to the Stack Overflow user linked above.
That said, it seems that linear_assignment is related to my implementation. Do you think it would make sense to simply replace my max_main_diagonal function with linear_assignment?

@jnothman (Member) commented Feb 8, 2018

If you do so, do you consistently get the same results?

@LPugens (Author) commented Feb 8, 2018

No, I do not consistently get the same results. And I do not have enough knowledge of linear programming to check it any other way :(.
For my purpose (allowing generic calculation of metrics such as accuracy and F-measure), it works well and is relatively fast, while also being simple to use.
However, you would be very welcome to commit to my code if you think you can improve its efficiency.

@jnothman (Member) left a comment


Well, the two functions have different interfaces. This returns a sorted matrix. linear_assignment returns matched pair indices. Did you account for that in comparing them?

@LPugens (Author) commented Feb 8, 2018

Also, I would be glad if someone could help me fix the Python 2 CircleCI failure.
It seems the SciPy version being tested there is 0.14.0, while linprog was introduced in version 0.15.0.
Is there a config file where I can fix this?

@LPugens (Author) commented Feb 8, 2018

@jnothman Could you give an input/output example for linear_assignment? I didn't really get how "matched pair indices" should work...

@LPugens (Author) commented Feb 8, 2018

Also, I did not understand this error from Travis CI:


```
/home/travis/build/scikit-learn/scikit-learn/sklearn/metrics/cluster/supervised.py:943: DocTestFailure
________ [doctest] sklearn.metrics.cluster.supervised.max_main_diagonal ________
889     -------
890     B : array, shape = [n,n]
891         Pivot matrix that sorts A for maximum main diagonal sum
892     References
893     ----------
894     Examples
895     --------
896     >>> from sklearn.metrics.cluster import max_main_diagonal
897     >>> import numpy as np
898     >>> A = np.matrix([[2, 1, 0],
UNEXPECTED EXCEPTION: SyntaxError('unexpected EOF while parsing', ('<doctest sklearn.metrics.cluster.supervised.max_main_diagonal[2]>', 1, 26, 'A = np.matrix([[2, 1, 0],\n'))
Traceback (most recent call last):
  File "/home/travis/miniconda/envs/testenv/lib/python3.4/doctest.py", line 1318, in __run
    compileflags, 1), test.globs)
  File "<doctest sklearn.metrics.cluster.supervised.max_main_diagonal[2]>", line 1
    A = np.matrix([[2, 1, 0],
                            ^
SyntaxError: unexpected EOF while parsing
```

@jnothman (Member) commented Feb 8, 2018

I haven't understood what the output of your class_cluster_match is.

This (untested) should return the predicted cluster number for each true cluster number, assuming all labels are integers 0...n_values-1.

```python
def get_corresponding_clusters(y_true, y_pred):
    mapping_true, mapping_pred = linear_assignment(-contingency_matrix(y_true, y_pred)).T
    order = np.argsort(mapping_true)
    return mapping_pred[order]
```
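For context, sklearn's `linear_assignment` helper was later deprecated in favor of `scipy.optimize.linear_sum_assignment`, which returns the row and column index arrays directly. Here is a self-contained version of the same idea, with the contingency matrix built in plain numpy so the snippet runs on its own (it assumes, as above, integer labels 0...n_values-1 on both sides):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def get_corresponding_clusters(y_true, y_pred):
    """Predicted cluster id matched to each true cluster id, in true-id order."""
    n = max(max(y_true), max(y_pred)) + 1
    # contingency[i, j] = number of samples with true label i and predicted label j
    C = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    # Negate to maximize overlap; the solver minimizes.
    mapping_true, mapping_pred = linear_sum_assignment(-C)
    order = np.argsort(mapping_true)
    return mapping_pred[order]

# True cluster 0 mostly appears as predicted cluster 1 and vice versa:
print(get_corresponding_clusters([0, 0, 1, 1], [1, 1, 0, 0]))  # -> [1 0]
```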

```
>>> from sklearn.metrics.cluster import max_main_diagonal
>>> import numpy as np
>>> A = np.matrix([[2, 1, 0],
>>> [1, 0, 0],
```

@jnothman (Member) left a review comment on this snippet:

you need `...` instead of `>>>` to not upset doctest
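Concretely, multiline statements in a doctest continue with `...` rather than repeating `>>>`; the corrected snippet can be verified with the standard library's `doctest` module (the array values here are illustrative, matching the docstring fragment quoted in the Travis log):

```python
import doctest

# Continuation lines of a multiline statement use "..." in doctests.
example = """
>>> import numpy as np
>>> A = np.array([[2, 1, 0],
...               [1, 0, 0],
...               [0, 2, 1]])
>>> int(A.trace())
3
"""
parser = doctest.DocTestParser()
test = parser.get_doctest(example, {}, "continuation_demo", None, 0)
results = doctest.DocTestRunner().run(test)
print(results.failed)  # -> 0
```

Had the continuation lines started with `>>>`, each would be parsed as a separate (incomplete) statement, producing the "unexpected EOF while parsing" SyntaxError seen in the CI log above.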

@jnothman (Member) commented Feb 8, 2018

The point of the name max_assignment_score is that it doesn't just get the mapping, but the total score:

```python
def max_assignment_score(labels_true, labels_pred, binary_metric=None):
    C = contingency_matrix(labels_true, labels_pred)
    # TODO: transform C according to the binary metric
    if binary_metric is None:  # accuracy
        C /= len(labels_true)
    true_idx, pred_idx = linear_assignment(-C).T
    return C[true_idx, pred_idx].sum()
```

I can see uses for just getting the mapping as you proposed. But I'd call it max_assignment.
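A runnable take on that sketch, restricted to the accuracy case, with the contingency matrix built in plain numpy and the deprecated `linear_assignment` swapped for `scipy.optimize.linear_sum_assignment` (a sketch, not sklearn API):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def max_assignment_score(labels_true, labels_pred):
    """Accuracy under the best one-to-one cluster-to-class assignment."""
    classes, y_true = np.unique(labels_true, return_inverse=True)
    clusters, y_pred = np.unique(labels_pred, return_inverse=True)
    # Float contingency matrix so the normalization below stays exact.
    C = np.zeros((len(classes), len(clusters)))
    np.add.at(C, (y_true, y_pred), 1)
    C /= len(labels_true)  # no binary_metric: plain accuracy
    true_idx, pred_idx = linear_sum_assignment(-C)
    return C[true_idx, pred_idx].sum()

# Best assignment: 'a'->0 (2 hits), 'b'->1 (1 hit); 3 of 4 samples matched.
print(max_assignment_score(['a', 'a', 'b', 'b'], [0, 0, 0, 1]))  # -> 0.75
```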

@jnothman (Member) commented Feb 8, 2018

And this is perhaps a bit hacky, but it shows a way to adjust the contingency matrix for an arbitrary metric (untested):

```python
bin_y_true = [0, 0, 1, 1]
bin_y_pred = [0, 1, 0, 1]
idx_true, idx_pred = C.nonzero()
tp = C[idx_true, idx_pred]
fp = C.sum(axis=0)[idx_pred]
fn = C.sum(axis=1)[idx_true]
tn = len(labels_true) - tp - fp - fn
sample_weights = np.transpose([tn, fp, fn, tp])
values = [binary_metric(bin_y_true, bin_y_pred, sample_weight=sample_weight)
          for sample_weight in sample_weights]
C[idx_true, idx_pred] = values
```

@jnothman (Member) commented Feb 8, 2018

Hmm... Are you feeling out of your depth?

@LPugens (Author) commented Feb 8, 2018

Actually, yes. For example, I have no idea what a contingency matrix is...
I am just a bit apprehensive that, albeit more efficient, your ideas will not be as intuitive as mine.
But also, I am a bit tired. Tomorrow night I will look further into your tips.
Anyway, thanks for the help so far :)

@jnothman (Member) commented Feb 8, 2018

Okay. Never mind the binary_metric stuff. The idea is that I've seen this matching used to then calculate an overall score. We should provide a function to do this all in one go.

@jnothman (Member) commented
How's that weekend coming along, @LPugens ?

@LPugens (Author) commented Jun 28, 2018

Wow! Sorry about that. I had some deadlines after the last commit and ended up forgetting about it.
I will try to wrap it up and commit this weekend. 😅

@LPugens (Author) commented Jul 1, 2018

I am just not sure about a better way to remove this if/elif:

```python
    if n_clusters > n_classes:
        classes += ['DEFAULT_LABEL_'+str(i) for i in range(n_clusters-n_classes)]
    elif n_classes > n_clusters:
        clusters += ['DEF_CLUSTER_'+str(i) for i in range(n_classes-n_clusters)]
```

Also, when you mentioned the default_label, do you want me to add a keyword argument to the function so the DEFAULT strings can be replaced with it?
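One way to collapse the if/elif: pad both lists unconditionally to the common length, relying on `range` over a zero or negative count being empty (the function name here is illustrative; the placeholder label prefixes are the ones from the snippet above):

```python
def pad_to_square(classes, clusters):
    # Pad whichever list is shorter with placeholder labels so the
    # contingency matrix is square. range(n - len(...)) is empty when
    # the list is already the longer one, so no branching is needed.
    n = max(len(classes), len(clusters))
    classes = classes + ['DEFAULT_LABEL_%d' % i for i in range(n - len(classes))]
    clusters = clusters + ['DEF_CLUSTER_%d' % i for i in range(n - len(clusters))]
    return classes, clusters

print(pad_to_square(['a', 'b', 'c'], [0, 1]))
# -> (['a', 'b', 'c'], [0, 1, 'DEF_CLUSTER_0'])
```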

Base automatically changed from master to main January 22, 2021 10:50
@adrinjalali (Member) commented

This is related to #27259 I think. @lorentzenchr @ogrisel do you think this is worth adding? There hasn't been much activity in the past 6 years here. Closing, but happy to have it reopened if we have renewed interest.

@adrinjalali closed this Mar 6, 2024