[MRG]Clusters-Class auto-match #10604

Closed · wants to merge 43 commits

Conversation

@LPugens commented Feb 7, 2018

Since I needed to compute accuracy from a clustering result in my research, I think it could be really useful to have this in the library, depending on the task.
It is especially useful when the task involves a different set of classes for each new dataset.

Reference Issues/PRs

What does this implement/fix? Explain your changes.

The class_cluster_match(y_true, y_pred) function translates clusters found by clustering methods, generally given as numbers, to the best possible match among the true class labels, generally strings.

Any other comments?

The max_main_diagonal function was implemented by user Paul Panzer at https://stackoverflow.com/questions/48511584/how-to-efficiently-make-class-to-cluster-matching-in-order-to-calculate-resultin.
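The PR's actual code is not shown in this thread, so here is a minimal sketch of the described behavior: build a class-by-cluster contingency matrix, solve the maximum-overlap assignment, and relabel the predictions. It uses `scipy.optimize.linear_sum_assignment` in place of the PR's `max_main_diagonal`; the function body is illustrative, not the PR's implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def class_cluster_match(y_true, y_pred):
    """Relabel numeric cluster ids with the best-matching true class labels."""
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    # contingency[i, j] = number of samples with class i and cluster j
    contingency = np.zeros((len(classes), len(clusters)), dtype=int)
    for t, p in zip(y_true, y_pred):
        contingency[classes.index(t), clusters.index(p)] += 1
    # Maximize total overlap; negate because linear_sum_assignment minimizes.
    row_ind, col_ind = linear_sum_assignment(-contingency)
    mapping = {clusters[j]: classes[i] for i, j in zip(row_ind, col_ind)}
    return [mapping.get(p, p) for p in y_pred]

# Cluster 1 overlaps class 'a' most, cluster 0 overlaps class 'b' most:
print(class_cluster_match(['a', 'a', 'a', 'b', 'b', 'b'], [1, 1, 0, 0, 0, 0]))
# -> ['a', 'a', 'b', 'b', 'b', 'b']
```

Once clusters carry class labels, standard classification metrics such as accuracy can be applied directly to the result.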

@amueller (Member) commented Feb 7, 2018

Why not use the Rand score instead? I've seen "clustering accuracy", but I think it's mostly used by people who are unaware of the Rand score.

@LPugens (Author) commented Feb 7, 2018

This, this and this are examples of peer-reviewed papers using accuracy and related metrics for clustering tasks. As I said in the merge request, it is useful for certain tasks and is in no way a replacement for other clustering metrics. It is just a new tool available to the user.

@amueller (Member) commented Feb 7, 2018

Cool, thanks for the references, that's probably enough evidence to include it.

@jnothman (Member) commented Feb 7, 2018

Hmm... So you've got a linear programme in there. Is this different, fundamentally, from what sklearn.metrics.cluster.bicluster.consensus_score is doing? It calculates the assignment between two clusterings that maximises, in this case, jaccard similarity among the clusters (although any similarity function should work there). The linear_assignment function used there is highly optimised to this kind of problem.

I have for a while considered the fact that this should really be available in sklearn.metrics.cluster, not just for biclustering (which I gather very few people use). It is the essence of the CEAF metric used for coreference resolution evaluation in NLP, which is also a generic clustering metric (albeit extensible to the case where the set of items being clustered differs between the ground truth and the system), and which allows configuration of the metric used to compare clusters for then calculating maximal assignment.

I think we should be moving consensus_score, renaming it, and providing an interface where an arbitrary set or binary vector comparison can take place. A binary vector comparison is more compatible with the existing classification metrics; a set comparison is likely to be more efficient. A binary vector with sample_weight passed would be sufficient to meet both needs.

See also my implementation of CEAF in neleval, which has extra optimisations for the case where the contingency matrix is sparse and not completely connected; these could potentially apply here too (although they are more traditional in coreference resolution, particularly cross-document coreference resolution).

In any case, this needs tests.

@jnothman (Member) commented Feb 7, 2018

Note that we also exported that linear_assignment implementation to SciPy as linear_sum_assignment.
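For readers unfamiliar with the "matched pair indices" output discussed below, this is how `scipy.optimize.linear_sum_assignment` behaves on a small cost matrix:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j] is the cost of assigning row i to column j.
cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])
# Returns one row index and one column index per matched pair,
# minimizing the total cost of the assignment.
row_ind, col_ind = linear_sum_assignment(cost)
print(row_ind, col_ind)               # -> [0 1 2] [1 0 2]
print(cost[row_ind, col_ind].sum())   # -> 5
```

Negating the matrix turns the minimization into a maximization, which is how it is used for contingency matrices later in this thread.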

@jnothman (Member) commented Feb 7, 2018

I'd propose calling it max_assignment_score.

@LPugens (Author) commented Feb 8, 2018

@jnothman, I am relatively new to the AI field and, to be honest, did not really understand the implementation behind the max_main_diagonal function. As I said, the credit should go to the Stack Overflow user linked above.
That said, it seems that linear_assignment is related to my implementation. Do you think it would make sense to simply replace my max_main_diagonal function with linear_assignment?

@jnothman (Member) commented Feb 8, 2018

If you do so, do you consistently get the same results?

@LPugens (Author) commented Feb 8, 2018

No, I do not consistently get the same results. And I do not have enough knowledge of linear programming to check it any other way :(.
For my purpose (allowing generic calculation of metrics such as accuracy and F-measure), it works well and is relatively fast, while also being simple to use.
However, you would be very welcome to commit to my code if you think you can improve its efficiency.

@jnothman (Member) left a comment


Well, the two functions have different interfaces. This returns a sorted matrix. linear_assignment returns matched pair indices. Did you account for that in comparing them?

@LPugens (Author) commented Feb 8, 2018

Also, I would be glad if someone could help me fix the Python 2 CircleCI failure.
It seems the SciPy version being tested there is 0.14.0, while linprog was introduced in version 0.15.0.
Is there a config file where I can fix this?

@LPugens (Author) commented Feb 8, 2018

@jnothman Could you give an input/output example for linear_assignment? I didn't really get how "matched pair indices" should work...

@LPugens (Author) commented Feb 8, 2018

Also, I did not understand this error from Travis CI:


```
/home/travis/build/scikit-learn/scikit-learn/sklearn/metrics/cluster/supervised.py:943: DocTestFailure
________ [doctest] sklearn.metrics.cluster.supervised.max_main_diagonal ________
889     -------
890     B : array, shape = [n,n]
891         Pivot matrix that sorts A for maximum main diagonal sum
892     References
893     ----------
894     Examples
895     --------
896     >>> from sklearn.metrics.cluster import max_main_diagonal
897     >>> import numpy as np
898     >>> A = np.matrix([[2, 1, 0],
UNEXPECTED EXCEPTION: SyntaxError('unexpected EOF while parsing', ('<doctest sklearn.metrics.cluster.supervised.max_main_diagonal[2]>', 1, 26, 'A = np.matrix([[2, 1, 0],\n'))
Traceback (most recent call last):
  File "/home/travis/miniconda/envs/testenv/lib/python3.4/doctest.py", line 1318, in __run
    compileflags, 1), test.globs)
  File "<doctest sklearn.metrics.cluster.supervised.max_main_diagonal[2]>", line 1
    A = np.matrix([[2, 1, 0],
                            ^
SyntaxError: unexpected EOF while parsing
```

@jnothman (Member) commented Feb 8, 2018

I haven't understood what the output of your class_cluster_match is.

This (untested) should return the predicted cluster number for each true cluster number, assuming all labels are integers 0...n_values-1.

```python
def get_corresponding_clusters(y_true, y_pred):
    mapping_true, mapping_pred = linear_assignment(-contingency_matrix(y_true, y_pred)).T
    order = np.argsort(mapping_true)
    return mapping_pred[order]
```
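For context, sklearn's `linear_assignment` helper was later deprecated in favor of `scipy.optimize.linear_sum_assignment`, which returns the row and column index arrays directly. Here is a self-contained version of the same idea, with the contingency matrix built in plain numpy so the snippet runs on its own (it assumes, as above, integer labels 0...n_values-1 on both sides):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def get_corresponding_clusters(y_true, y_pred):
    """Predicted cluster id matched to each true cluster id, in true-id order."""
    n = max(max(y_true), max(y_pred)) + 1
    # contingency[i, j] = number of samples with true label i and predicted label j
    C = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    # Negate to maximize overlap; the solver minimizes.
    mapping_true, mapping_pred = linear_sum_assignment(-C)
    order = np.argsort(mapping_true)
    return mapping_pred[order]

# True cluster 0 mostly appears as predicted cluster 1 and vice versa:
print(get_corresponding_clusters([0, 0, 1, 1], [1, 1, 0, 0]))  # -> [1 0]
```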

```
>>> from sklearn.metrics.cluster import max_main_diagonal
>>> import numpy as np
>>> A = np.matrix([[2, 1, 0],
>>> [1, 0, 0],
```

@jnothman (Member) left a review comment on this snippet:

you need `...` instead of `>>>` to not upset doctest
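Concretely, multiline statements in a doctest continue with `...` rather than repeating `>>>`; the corrected snippet can be verified with the standard library's `doctest` module (the array values here are illustrative, matching the docstring fragment quoted in the Travis log):

```python
import doctest

# Continuation lines of a multiline statement use "..." in doctests.
example = """
>>> import numpy as np
>>> A = np.array([[2, 1, 0],
...               [1, 0, 0],
...               [0, 2, 1]])
>>> int(A.trace())
3
"""
parser = doctest.DocTestParser()
test = parser.get_doctest(example, {}, "continuation_demo", None, 0)
results = doctest.DocTestRunner().run(test)
print(results.failed)  # -> 0
```

Had the continuation lines started with `>>>`, each would be parsed as a separate (incomplete) statement, producing the "unexpected EOF while parsing" SyntaxError seen in the CI log above.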

@jnothman (Member) commented Feb 8, 2018

The point of the name max_assignment_score is that it doesn't just get the mapping, but the total score:

```python
def max_assignment_score(labels_true, labels_pred, binary_metric=None):
    C = contingency_matrix(labels_true, labels_pred)
    # TODO: transform C according to the binary metric
    if binary_metric is None:  # accuracy
        C /= len(labels_true)
    true_idx, pred_idx = linear_assignment(-C).T
    return C[true_idx, pred_idx].sum()
```

I can see uses for just getting the mapping as you proposed. But I'd call it max_assignment.
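A runnable take on that sketch, restricted to the accuracy case, with the contingency matrix built in plain numpy and the deprecated `linear_assignment` swapped for `scipy.optimize.linear_sum_assignment` (a sketch, not sklearn API):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def max_assignment_score(labels_true, labels_pred):
    """Accuracy under the best one-to-one cluster-to-class assignment."""
    classes, y_true = np.unique(labels_true, return_inverse=True)
    clusters, y_pred = np.unique(labels_pred, return_inverse=True)
    # Float contingency matrix so the normalization below stays exact.
    C = np.zeros((len(classes), len(clusters)))
    np.add.at(C, (y_true, y_pred), 1)
    C /= len(labels_true)  # no binary_metric: plain accuracy
    true_idx, pred_idx = linear_sum_assignment(-C)
    return C[true_idx, pred_idx].sum()

# Best assignment: 'a'->0 (2 hits), 'b'->1 (1 hit); 3 of 4 samples matched.
print(max_assignment_score(['a', 'a', 'b', 'b'], [0, 0, 0, 1]))  # -> 0.75
```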

@jnothman (Member) commented Feb 8, 2018

And this is perhaps a bit hacky, but it shows a way to adjust the contingency matrix for an arbitrary metric (untested):

```python
bin_y_true = [0, 0, 1, 1]
bin_y_pred = [0, 1, 0, 1]
idx_true, idx_pred = C.nonzero()
tp = C[idx_true, idx_pred]
fp = C.sum(axis=0)[idx_pred]
fn = C.sum(axis=1)[idx_true]
tn = len(labels_true) - tp - fp - fn
sample_weights = np.transpose([tn, fp, fn, tp])
values = [binary_metric(bin_y_true, bin_y_pred, sample_weight=sample_weight)
          for sample_weight in sample_weights]
C[idx_true, idx_pred] = values
```

@jnothman (Member) commented Feb 8, 2018

Hmm... Are you feeling out of your depth?

@LPugens (Author) commented Feb 8, 2018

Actually, yes. For example, I have no idea what a contingency matrix is...
I am just a bit apprehensive that, albeit more efficient, your ideas will not be as intuitive as mine.
But also, I am a bit tired. Tomorrow night I will look further into your tips.
Anyway, thanks for the help so far :)

@jnothman (Member) commented Feb 8, 2018

Okay. Never mind the binary_metric stuff. The idea is that I've seen this matching used to then calculate an overall score. We should provide a function to do this all in one go.

@jnothman (Member) commented
How's that weekend coming along, @LPugens ?

@LPugens (Author) commented Jun 28, 2018

Wow! Sorry about that. I had some deadlines after the last commit and ended up forgetting about it.
I will try to wrap it up and commit this weekend. 😅

@LPugens (Author) commented Jul 1, 2018

I am just not sure about a better way to remove this if/elif:

```python
    if n_clusters > n_classes:
        classes += ['DEFAULT_LABEL_'+str(i) for i in range(n_clusters-n_classes)]
    elif n_classes > n_clusters:
        clusters += ['DEF_CLUSTER_'+str(i) for i in range(n_classes-n_clusters)]
```

Also, when you mentioned the default_label, do you want me to add a keyword argument to the function so the DEFAULT strings can be replaced with it?
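One way to collapse the if/elif: pad both lists unconditionally to the common length, relying on `range` over a zero or negative count being empty (the function name here is illustrative; the placeholder label prefixes are the ones from the snippet above):

```python
def pad_to_square(classes, clusters):
    # Pad whichever list is shorter with placeholder labels so the
    # contingency matrix is square. range(n - len(...)) is empty when
    # the list is already the longer one, so no branching is needed.
    n = max(len(classes), len(clusters))
    classes = classes + ['DEFAULT_LABEL_%d' % i for i in range(n - len(classes))]
    clusters = clusters + ['DEF_CLUSTER_%d' % i for i in range(n - len(clusters))]
    return classes, clusters

print(pad_to_square(['a', 'b', 'c'], [0, 1]))
# -> (['a', 'b', 'c'], [0, 1, 'DEF_CLUSTER_0'])
```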

Base automatically changed from master to main January 22, 2021 10:50
@adrinjalali (Member) commented

This is related to #27259 I think. @lorentzenchr @ogrisel do you think this is worth adding? There hasn't been much activity in the past 6 years here. Closing, but happy to have it reopened if we have renewed interest.

@adrinjalali closed this Mar 6, 2024