
[MRG+1] Use sparse cluster contingency matrix by default #7419


Closed

Conversation

jnothman
Member

@jnothman jnothman commented Sep 14, 2016

Supersedes #7203. Obviates the need for max_n_classes to some extent, and removes it while it is still unreleased API (it was introduced in #5445 to fix #4976).

A sparse contingency matrix should handle the common case where there is a large number of clusters and the overlap among them is sparse; where there are few clusters we pay a penalty for using a sparse contingency matrix, but that penalty is small in absolute terms.
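
For illustration, a minimal sketch of the intended usage (the sparse keyword of contingency_matrix is the one added on this branch; the labelings below are random stand-ins, not the benchmark data):

# Sketch: contrasting dense and sparse contingency matrices when there are
# many clusters. Not code from the PR; labels are random for illustration.
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

rng = np.random.RandomState(0)
labels_true = rng.randint(0, 250, size=31000)   # many "true" clusters
labels_pred = rng.randint(0, 250, size=31000)   # many predicted clusters

dense = contingency_matrix(labels_true, labels_pred)                 # 250 x 250 ndarray
sparse = contingency_matrix(labels_true, labels_pred, sparse=True)   # scipy.sparse matrix

# The sparse form stores only the non-empty (true, pred) cluster pairs,
# so memory and iteration cost scale with the overlap rather than with R * C.
print(dense.size, sparse.nnz)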

The change is being benchmarked on my sparse_cluster_contingency_option branch: an image is segmented into different numbers of clusters, then the evaluation of the metrics is timed for the different contingency matrix shapes. Benchmarks will be posted shortly.

@jnothman
Member Author

jnothman commented Sep 14, 2016

Benchmarks at https://github.com/jnothman/scikit-learn/blob/sparse_cluster_contingency_option/stuppie-metrics_sparse_comparison.ipynb (see the end) evaluate timings for a 31000-point dataset (an image), with contingency matrix shapes drawn from {5, 10, 20, 50, 100, 250} × {5, 10, 20, 50, 100, 250}. We see the following (a rough reproduction sketch is given after the list):

  • small absolute differences (<4ms) for homogeneity_completeness_v_measure
  • negligible or negative differences (substantially faster at low density) for adjusted_rand_score
  • small absolute (<10ms) and relative (<0.3x, proportional to density) differences for adjusted_mutual_info_score
  • small absolute differences (<3ms) for normalized_mutual_info_score
  • small absolute differences (<3ms) for fowlkes_mallows_score
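
A rough sketch of how such a timing sweep could be reproduced (illustrative only: it uses random labelings rather than the image segmentations from the linked notebook, and only a subset of the cluster counts above):

import time
import numpy as np
from sklearn.metrics import (adjusted_rand_score, adjusted_mutual_info_score,
                             normalized_mutual_info_score, fowlkes_mallows_score)

rng = np.random.RandomState(42)
n_samples = 31000
for n_true in (5, 50, 250):
    for n_pred in (5, 50, 250):
        # Random labelings stand in for the image segmentations used in the notebook.
        y_true = rng.randint(0, n_true, size=n_samples)
        y_pred = rng.randint(0, n_pred, size=n_samples)
        for metric in (adjusted_rand_score, adjusted_mutual_info_score,
                       normalized_mutual_info_score, fowlkes_mallows_score):
            tic = time.time()
            metric(y_true, y_pred)
            print("%3d x %3d  %-30s %.1f ms"
                  % (n_true, n_pred, metric.__name__, 1e3 * (time.time() - tic)))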

@jnothman jnothman force-pushed the sparse_cluster_contingency branch from 78e0464 to d91f939 on September 14, 2016 13:53

# Compute the ARI using the contingency data
sum_comb_c = sum(comb2(n_c) for n_c in contingency.sum(axis=1))
sum_comb_k = sum(comb2(n_k) for n_k in contingency.sum(axis=0))
if isinstance(contingency, np.ndarray):
Member

Why keep the code for the dense case if sparse=True in the previous line?

Member Author

forgot :)

Member

Given the previous line with sparse=True there is no need for supporting array-based contingency matrix here. We can delete this branch of the if statement as well as the final else clause.
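
For context, a sketch of what the sparse-only computation might look like once the dense branch is removed (illustrative; comb2 is assumed to be the pairs-count helper used in the diff above, and contingency is assumed to be a scipy.sparse matrix):

import numpy as np
from scipy.special import comb

def comb2(n):
    # Number of unordered pairs among n items: n * (n - 1) / 2.
    return comb(n, 2, exact=False)

def ari_pair_counts(contingency):
    # Marginal sums of a sparse matrix come back as 2D matrices, hence the ravel.
    sum_comb_c = sum(comb2(n_c) for n_c in np.ravel(contingency.sum(axis=1)))
    sum_comb_k = sum(comb2(n_k) for n_k in np.ravel(contingency.sum(axis=0)))
    # Only the explicitly stored (non-zero) cells contribute co-clustered pairs.
    sum_comb = sum(comb2(n_ij) for n_ij in contingency.data)
    return sum_comb, sum_comb_c, sum_comb_k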

@ogrisel
Member

ogrisel commented Sep 14, 2016

Thanks very much, I am in favor of always using sparse=True in those metrics and removing max_n_classes.

@jnothman jnothman force-pushed the sparse_cluster_contingency branch from 84f1758 to 5b76cd5 on September 14, 2016 14:13
@jnothman jnothman force-pushed the sparse_cluster_contingency branch from 4469114 to 0cbe30d on September 14, 2016 14:17
@jnothman
Member Author

After giving myself a review and making numerous cosmetic tweaks, I'm going to bed. Would be a nice surprise to wake up to see this merged :p

@jnothman
Member Author

(Unfortunately, my numerous cosmetic tweaks have overwhelmed appveyor, and I do not have perms to cancel queued tasks there.)

@ogrisel
Member

ogrisel commented Sep 14, 2016

(Unfortunately, my numerous cosmetic tweaks have overwhelmed appveyor, and I do not have perms to cancel queued tasks there.)

I will do it and send you the credentials by private email.

@ogrisel
Member

ogrisel commented Sep 14, 2016

BTW I confirm that this PR solves the original issue (#4976) with large floating-point-valued vectors passed to clustering metrics:

>>> from sklearn.metrics import mutual_info_score
>>> import numpy as np
>>> y1 = np.random.randn(int(1e5))
>>> y2 = np.random.randn(int(1e5))
>>> %time mutual_info_score(y1, y2)
CPU times: user 96 ms, sys: 8 ms, total: 104 ms
Wall time: 99.1 ms
11.512925464970223
>>> %load_ext memory_profiler
>>> %memit mutual_info_score(y1, y2)
peak memory: 75.73 MiB, increment: 8.95 MiB

There is definitely no need for the ugly max_n_classes in our public API.

max_n_classes=max_n_classes)
contingency = np.array(contingency, dtype='float')
contingency = contingency_matrix(labels_true, labels_pred, sparse=True)
contingency = contingency.astype(float)
Member

.astype(float) is not guaranteed to use the same level of precision on all platforms. To get platform agnostic results it's better to be specific:

contingency = contingency.astype(np.float64)

@ogrisel
Member

ogrisel commented Sep 14, 2016

Hmm, it seems that I made a bunch of comments on an old version of the diff. Sorry for the noise.

Anyway, besides my comments on astype(float), +1 for merge.

@ogrisel ogrisel changed the title from [MRG] Use sparse cluster contingency matrix by default to [MRG+1] Use sparse cluster contingency matrix by default on Sep 14, 2016
@@ -28,8 +28,8 @@ def expected_mutual_information(contingency, int n_samples):
#cdef np.ndarray[int, ndim=2] start, end
R, C = contingency.shape
N = <DOUBLE>n_samples
a = np.sum(contingency, axis=1).astype(np.int32)
b = np.sum(contingency, axis=0).astype(np.int32)
a = np.ravel(contingency.sum(axis=1).astype(np.int32))
Member

why is the ravel needed?

Member

@ogrisel ogrisel Sep 14, 2016

Because the contingency matrix can now be a sparse matrix, and summing a sparse matrix over its rows returns a 2D matrix with a single column instead of a 1D array.
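
A quick standalone illustration of that point (a toy example, not code from the diff):

import numpy as np
import scipy.sparse as sp

contingency = sp.csr_matrix(np.array([[2, 0], [1, 3]]))
row_sums = contingency.sum(axis=1)
print(row_sums.shape)        # (2, 1): a 2D matrix with a single column
flat = np.ravel(row_sums)
print(flat, flat.shape)      # [2 4] (2,): the 1D array the Cython code expects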

@ogrisel
Member

ogrisel commented Sep 14, 2016

@amueller if you agree I can address the comments, merge and backport while @jnothman is AFK :)

@amueller
Member

@ogrisel sounds good :)

@ogrisel
Member

ogrisel commented Sep 14, 2016

polished, squashed, rebased and backported.

@ogrisel ogrisel closed this Sep 14, 2016
@jnothman
Member Author

ooh :) you two are so kindly obliging!


Successfully merging this pull request may close these issues.

metrics.mutual_info_score hangs when given real vectors