ENH Improve performance for expected mutual information #24794
Conversation
I will add a changelog entry once this solution is deemed acceptable.
```diff
@@ -50,7 +50,8 @@ def expected_mutual_information(contingency, int n_samples):
     gln_N = gammaln(N + 1)
     gln_nij = gammaln(nijs + 1)
     # start and end values for nij terms for each summation.
-    start = np.array([[v - N + w for w in b] for v in a], dtype='int')
+    start = np.array(np.meshgrid(a, b), dtype='int').T
+    start = np.array([start[i].sum(axis=1) - N for i in range(len(start))])
```
I'd be open to suggestions on how to do these operations with NumPy in a more elegant way.
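One possibly more elegant option (an editorial sketch, not code from this PR): `np.add.outer` builds the same `start` matrix in a single step, without materialising the `(2, M, N)` meshgrid intermediate discussed below.

```python
import numpy as np

# Dummy marginals standing in for the contingency row/column sums in the
# PR's code; N plays the role of the total sample count.
a = np.array([3, 5])
b = np.array([2, 4, 6])
N = 10

# start[i, j] == a[i] + b[j] - N, computed with a single (M, N) allocation.
start = np.add.outer(a, b) - N
print(start)
```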
I've been trying to understand the algorithm, but only with average success :-/
When I compare the `start` that this PR produces with what is in `main`, they look the same. So if this is faster and all the tests still pass, we should use it?!
The issue that I can spot here is that we are creating a huge matrix in memory.
The size of the matrix will be double that of the previous method. Precisely, if `a` has `M` elements and `b` has `N` elements, then:
- with the old method you have `M*N` elements
- with the new method you have `2*M*N` elements
```python
>>> a = np.array([1, 2])
>>> b = np.array([3, 4, 5])
>>> np.meshgrid(a, b)
[array([[1, 2],
        [1, 2],
        [1, 2]]),
 array([[3, 3],
        [4, 4],
        [5, 5]])]
```
So although the difference exists, I don't think it should be the deciding factor.
Could you post some before/after timing numbers for the three different cases mentioned in the issue?
Sure,
Force-pushed 1b8a2cc to 8dcfcac.
I think that we need to be extremely careful with the benchmark here. I came up with a Cython implementation that does not rely on Python interaction:

```cython
from libc.math cimport exp, lgamma, log

import cython
cimport numpy as cnp
import numpy as np

cnp.import_array()


cpdef expected_mutual_information(contingency, cnp.int64_t n_samples):
    """Compute the expected mutual information.

    Parameters
    ----------
    contingency : sparse matrix of shape (n_labels, n_labels), dtype=np.int64
        Contingency matrix storing the label counts. This is a sparse
        matrix in the CSR format.

    n_samples : int
        The number of samples, which is the sum of the counts in the
        contingency matrix.

    Returns
    -------
    emi : float
        The expected mutual information between the two vectors of labels.
    """
    cdef:
        unsigned long long n_rows, n_cols
        cnp.ndarray[cnp.int64_t] a, b
        cnp.int64_t a_i, b_j
        cnp.intp_t i, j, nij, nij_nz
        cnp.intp_t start, end
        cnp.float64_t emi = 0.0
        double term1, term2, term3
        double log_n_samples = log(n_samples)

    n_rows, n_cols = contingency.shape
    a = np.ravel(contingency.sum(axis=1))
    b = np.ravel(contingency.sum(axis=0))

    if a.size == 1 or b.size == 1:
        return 0.0

    for i in range(n_rows):
        for j in range(n_cols):
            a_i, b_j = a[i], b[j]
            start = max(1, a_i - n_samples + b_j)
            end = min(a_i, b_j) + 1
            for nij in range(start, end):
                nij_nz = 1 if nij == 0 else nij
                term1 = nij_nz / <double>n_samples
                term2 = log_n_samples + log(nij_nz) - log(a_i) - log(b_j)
                term3 = exp(
                    lgamma(a_i + 1) + lgamma(b_j + 1)
                    + lgamma(n_samples - a_i + 1) + lgamma(n_samples - b_j + 1)
                    - lgamma(n_samples + 1) - lgamma(nij_nz + 1)
                    - lgamma(a_i - nij + 1)
                    - lgamma(b_j - nij + 1)
                    - lgamma(n_samples - a_i - b_j + nij + 1)
                )
                emi += (term1 * term2 * term3)
    return emi
```

It will require changing a line of code:

```python
contingency = contingency_matrix(
    labels_true, labels_pred, sparse=True, dtype=np.int64
)
```

which avoids making 3 copies/conversions of the data (i.e. int64 -> float64 -> int32) in the current implementation. However, I can see a trade-off, illustrated by the two benchmarks below.

**Benchmark 1**

This is the benchmark reported in the original issue:

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

x = np.array([x % 8000 for x in range(10_000)])
y = np.array([x % 7000 for x in range(10_000)])
mi = adjusted_mutual_info_score(x, y)
```

This implementation will be much faster, even faster than the implementation of this PR. However, this benchmark is unique in that we have a large number of labels.

With the current PR:

With the cython version above:

**Benchmark 2**

I ran another benchmark where you have a large number of samples but few labels:

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)
n_samples = 1_000_000
x = rng.integers(low=0, high=5, size=n_samples)
y = rng.integers(low=0, high=5, size=n_samples)
mi = adjusted_mutual_info_score(x, y)
```

and in this case, the fully cythonized version will be slower than the vectorized version using NumPy.

For the current PR:

For the fully cythonized version:

**Conclusions**

Since the gain is important in one case, it might overcome the case where the other version is slower. @betatim, what do you think?
It would be great if we could run an extensive benchmark varying `n_samples` and `n_labels`.
After some digging by @glemaitre we found that the formula that the code implements (probably) is https://en.wikipedia.org/wiki/Adjusted_mutual_information#Adjustment_for_chance. I think no matter what we do as optimisation, it would be worth leaving a note for people from the future explaining which formula the code implements.
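For future readers, here is that formula transcribed (an editorial transcription of the expected mutual information from the linked Wikipedia section; $a_i$ and $b_j$ are the contingency table's row and column sums and $N$ the sample count, i.e. `a`, `b`, and `n_samples` in the code above):

$$
E[\mathrm{MI}] = \sum_{i=1}^{R}\sum_{j=1}^{C}\sum_{n_{ij}=\max(1,\,a_i+b_j-N)}^{\min(a_i,\,b_j)} \frac{n_{ij}}{N}\,\log\!\left(\frac{N\,n_{ij}}{a_i\,b_j}\right) \frac{a_i!\,b_j!\,(N-a_i)!\,(N-b_j)!}{N!\,n_{ij}!\,(a_i-n_{ij})!\,(b_j-n_{ij})!\,(N-a_i-b_j+n_{ij})!}
$$

The three factors appear to map onto `term1`, `term2`, and `term3` in the Cython sketch above.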
From the fact that I am consulting wikipedia you can tell I've not made a lot of progress understanding what exactly the code does/why it is written the way it is. So I don't have anything to say about what to do. I have a slight preference for a version that doesn't use …

I think 450ms vs 600ms for one case is a small difference compared to 14s vs 5s. So it might make sense to make the slow case (much) faster and pay the price that the fast case gets slower. This is assuming that in all cases this function only gets called a handful of times, so that for a human the difference between 450ms and 600ms is basically impossible to notice, but the difference between 14s and 5s is noticeable.
Since the cython function is only called for calculating …
So to summarize, we have multiple things to look at when deciding between the current PR and @glemaitre's approach: …
I would disagree with the statement here. In real life, most use cases tend to have many more samples than labels.
I was wrong. The function does implement the expected mutual information, but the formula used is on the "adjusted mutual info" wikipedia page (they took the formula before it gets adjusted ...). Sorry for the confusion. My thinking for "let's take the speed up even with a small hit to the common case" was that even though the common case is common, the slow down might be so small that in practice no one will notice. I don't know if that is true, so I wanted to put it out there for discussion.
Although, the full Cython version does not allocate temporary vectors and avoids multiple data conversions.
@glemaitre Should the plots reside in the asv benchmarks folder or the benchmarks folder?
The plot can be posted in this PR, together with the benchmark code.
I was just implementing a quick bench:

```python
from collections import defaultdict
from itertools import product
from time import time

import numpy as np
import pandas as pd

from sklearn.metrics import adjusted_mutual_info_score

repeat = 10
n_samples = [1_000, 10_000, 100_000, 1_000_000]
n_labels = [10, 100, 1_000]
rng = np.random.default_rng(0)
results = defaultdict(list)
for ns, nl in product(n_samples, n_labels):
    local_result = []
    for i in range(repeat):
        print(f"Repetition {i} for n_samples={ns} and n_labels={nl}")
        x = rng.integers(low=0, high=nl, size=ns)
        y = rng.integers(low=0, high=nl, size=ns)
        start = time()
        adjusted_mutual_info_score(x, y)
        end = time()
        local_result.append(end - start)
    results["n_samples"].append(ns)
    results["n_labels"].append(nl)
    results["mean_time"].append(np.mean(local_result))
results = pd.DataFrame(results)
```
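If the plots are wanted directly in the thread, a minimal sketch for turning that `results` DataFrame into a figure (editorial, assuming standard pandas/matplotlib; not part of the PR):

```python
import matplotlib.pyplot as plt

# One curve per label count: mean runtime as a function of n_samples.
pivot = results.pivot(index="n_samples", columns="n_labels", values="mean_time")
pivot.plot(logx=True, logy=True, marker="o")
plt.xlabel("n_samples")
plt.ylabel("mean time (s)")
plt.title("adjusted_mutual_info_score timing")
plt.show()
```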
Force-pushed 1b3307b to baa7761, then baa7761 to 39dd0bf.
Thanks for the review. I have made the changes; the performance plots are still pending.
Reference Issues/PRs
This fixes #24254
What does this implement/fix? Explain your changes.
Used numpy arrays instead of for loops to leverage vectorization.
Any other comments?
I noticed a performance improvement of 4x with this. It would be great if some people could also try it on their machines.
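For anyone who wants to try it, a quick timing script (an editorial sketch; it reuses the many-labels benchmark from the linked issue quoted earlier in this thread, which is the regime this PR targets):

```python
from time import time

import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

# Many labels relative to samples: the slow case this PR speeds up.
x = np.array([i % 8000 for i in range(10_000)])
y = np.array([i % 7000 for i in range(10_000)])

tic = time()
adjusted_mutual_info_score(x, y)
print(f"elapsed: {time() - tic:.2f} s")
```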
Update 31/10/2022
For 6000 records,
before -> 145 seconds
after -> 30 seconds