ENH Improve performance for expected mutual information #24794


Conversation

Contributor

@Kshitij68 Kshitij68 commented Oct 31, 2022

Reference Issues/PRs

This fixes #24254

What does this implement/fix? Explain your changes.

Used numpy arrays instead of for loops to leverage vectorization.

Any other comments?

I noticed a performance improvement of 4x with this. It would be great if some people could also try it on their machines.

Update 31/10/2022

For 6000 records,
before -> 145 seconds
after -> 30 seconds
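The exact timing script is not shown here; a setup along the lines of the reproducer in the linked issue, scaled to 6000 records, might look like the following (the moduli below are illustrative assumptions, not the exact values behind the numbers above):

import time

import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

n = 6_000
# Many unique labels per vector, mirroring the style of the issue's reproducer.
x = np.array([i % 5_000 for i in range(n)])
y = np.array([i % 4_000 for i in range(n)])

tic = time.perf_counter()
adjusted_mutual_info_score(x, y)
print(f"elapsed: {time.perf_counter() - tic:.1f} s")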

Contributor Author

@Kshitij68 Kshitij68 left a comment


I will add a changelog entry once this solution is deemed acceptable.

@@ -50,7 +50,8 @@ def expected_mutual_information(contingency, int n_samples):
     gln_N = gammaln(N + 1)
     gln_nij = gammaln(nijs + 1)
     # start and end values for nij terms for each summation.
-    start = np.array([[v - N + w for w in b] for v in a], dtype='int')
+    start = np.array(np.meshgrid(a, b), dtype='int').T
+    start = np.array([start[i].sum(axis=1) - N for i in range(len(start))])
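As a sanity check, the old and new constructions can be compared on small inputs (a sketch; a, b, and N stand in for the marginals and sample count computed earlier in the function):

import numpy as np

N = 10
a = np.array([3, 7])     # row sums of the contingency matrix
b = np.array([2, 4, 4])  # column sums of the contingency matrix

old = np.array([[v - N + w for w in b] for v in a], dtype='int')
new = np.array(np.meshgrid(a, b), dtype='int').T
new = np.array([new[i].sum(axis=1) - N for i in range(len(new))])
assert np.array_equal(old, new)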
Contributor Author


I'd be open to suggestions on how to do these operations in a more elegant way using NumPy.

Member


I've been trying to understand the algorithm, but only with average success :-/

When I compare the start that this PR produces with the one in main, they look the same.

So if this is faster and all the tests still pass, we should use it?!

Member


The issue that I can spot here is that we are creating a huge matrix in memory.

Contributor Author


The size of the matrix will be double that of the previous method.

Precisely, if a has M elements and b has N elements, then

  • with the old method you allocate M*N elements
  • with the new method you allocate 2*M*N elements
>>> a = np.array([1,2])
>>> b = np.array([3,4,5])
>>> np.meshgrid(a,b)
[array([[1, 2],
       [1, 2],
       [1, 2]]), 
array([[3, 3],
       [4, 4],
       [5, 5]])]

So although the difference exists, I don't think it should be the deciding factor.
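For what it's worth, the same start matrix can also be built with broadcasting, which materializes only the final M*N array and reads closer to the original formula (a sketch, not what this PR currently does):

import numpy as np

N = 5
a = np.array([1, 2])
b = np.array([3, 4, 5])

# start[i, j] = a[i] + b[j] - N, with no intermediate meshgrid outputs
start = a[:, None] + b[None, :] - N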

@Kshitij68 Kshitij68 changed the title [Fix] Improve performnace for expected mutual information [Fix] Improve performance for expected mutual information Oct 31, 2022
@betatim
Member

betatim commented Oct 31, 2022

Could you post some before/after timing numbers for the three different cases mentioned in the issue?

@Kshitij68
Contributor Author

Kshitij68 commented Oct 31, 2022

> Could you post some before/after timing numbers for the three different cases mentioned in the issue?

Sure,
Before -> 145 seconds
After -> 30 seconds

@Kshitij68 Kshitij68 force-pushed the Fix-regression-for-expected-mutual-info branch from 1b8a2cc to 8dcfcac Compare October 31, 2022 15:19
@glemaitre
Member

I think that we need to be extremely careful with the benchmark here.

I came up with a Cython implementation that does not rely on Python interaction:

from libc.math cimport exp, lgamma, log
import cython
cimport numpy as cnp

import numpy as np


cnp.import_array()

cpdef expected_mutual_information(contingency, cnp.int64_t n_samples):
    """Compute the expected mutual information.

    Parameters
    ----------
    contingency : sparse matrix of shape (n_labels, n_labels), dtype=np.int64
        Contingency matrix storing the label counts. This is a sparse
        matrix in CSR format.

    n_samples : int
        The number of samples, which is the sum of the counts in the
        contingency matrix.

    Returns
    -------
    emi : float
        The expected mutual information between the two vectors of labels.
    """
    cdef:
        unsigned long long n_rows, n_cols
        cnp.ndarray[cnp.int64_t] a, b
        cnp.int64_t a_i, b_j
        cnp.intp_t i, j, nij, nij_nz
        cnp.intp_t start, end
        cnp.float64_t emi = 0.0
        double term1, term2, term3
        double log_n_samples = log(n_samples)

    n_rows, n_cols = contingency.shape
    a = np.ravel(contingency.sum(axis=1))
    b = np.ravel(contingency.sum(axis=0))

    if a.size == 1 or b.size == 1:
        return 0.0

    for i in range(n_rows):
        for j in range(n_cols):
            a_i, b_j = a[i], b[j]
            start = max(1, a_i - n_samples + b_j)
            end = min(a_i, b_j) + 1
            for nij in range(start, end):
                nij_nz = 1 if nij == 0 else nij
                term1 = nij_nz / <double>n_samples
                term2 = log_n_samples + log(nij_nz) - log(a_i) - log(b_j)
                term3 = exp(
                    lgamma(a_i + 1) + lgamma(b_j + 1)
                    + lgamma(n_samples - a_i + 1) + lgamma(n_samples - b_j + 1)
                    - lgamma(n_samples + 1) - lgamma(nij_nz + 1)
                    - lgamma(a_i - nij + 1)
                    - lgamma(b_j - nij + 1)
                    - lgamma(n_samples - a_i - b_j + nij + 1)
                )
                emi += (term1 * term2 * term3)
    return emi
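To try this snippet locally it needs to be compiled against the NumPy headers; a minimal build file might look like this (the module name _emi.pyx is a placeholder, not an actual file in the repository):

# setup.py -- build in place with: python setup.py build_ext --inplace
import numpy
from Cython.Build import cythonize
from setuptools import Extension, setup

setup(
    ext_modules=cythonize(
        Extension("_emi", ["_emi.pyx"], include_dirs=[numpy.get_include()])
    ),
)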

It will require changing a line of code in _supervised.py:

contingency = contingency_matrix(
    labels_true, labels_pred, sparse=True, dtype=np.int64
)

which avoids making 3 copies/conversions of the data (i.e. int64 -> float64 -> int32) in the current implementation.

However, I suppose there is a trade-off depending on n_samples and n_labels.

Benchmark 1

This is the benchmark reported in the original issue

import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

x = np.array([x % 8000 for x in range(10_000)])
y = np.array([x % 7000 for x in range(10_000)])

mi = adjusted_mutual_info_score(x, y)

This implementation is much faster, even faster than the implementation in this PR. However, this benchmark is unusual in that it has a very large number of labels.

With the current PR:

In [1]: %timeit %run bench.py
14.2 s ± 125 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With the Cython version above:

In [1]: %timeit %run bench.py
4.91 s ± 7.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Benchmark 2

I ran another benchmark where you have a large number of samples but few labels:

import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)
n_samples = 1_000_000
x = rng.integers(low=0, high=5, size=n_samples)
y = rng.integers(low=0, high=5, size=n_samples)

mi = adjusted_mutual_info_score(x, y)

and in this case, the fully cythonized version is slower than the vectorized version using NumPy.

For the current PR:

In [1]: %timeit %run bench.py
447 ms ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For the fully cythonized version:

In [1]: %timeit %run bench.py
625 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Conclusions

Since the gain is substantial in the case with many labels, it might outweigh the slowdown in the case where n_samples is large and n_labels is small.

@betatim what do you think?

@glemaitre
Member

It would be great if we could run an extensive benchmark varying n_labels and n_samples and storing all those in a DataFrame to make proper plots.

@glemaitre glemaitre changed the title [Fix] Improve performance for expected mutual information ENH Improve performance for expected mutual information Nov 2, 2022
@betatim
Member

betatim commented Nov 3, 2022

After some digging by @glemaitre we found that the formula that the code implements (probably) is https://en.wikipedia.org/wiki/Adjusted_mutual_information#Adjustment_for_chance. I think no matter what we do as optimisation it would be worth leaving a note for people from the future that what expected_mutual_information computes is (actually) the adjusted mutual information.
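For reference, the formula on that page (which, as far as I can tell, is exactly what the triple loop in the Cython snippet above computes) is:

$$
E[\mathrm{MI}] = \sum_{i=1}^{R} \sum_{j=1}^{C} \sum_{n_{ij}=\max(1,\, a_i+b_j-N)}^{\min(a_i,\, b_j)} \frac{n_{ij}}{N} \log\!\left(\frac{N\, n_{ij}}{a_i b_j}\right) \frac{a_i!\, b_j!\, (N-a_i)!\, (N-b_j)!}{N!\, n_{ij}!\, (a_i-n_{ij})!\, (b_j-n_{ij})!\, (N-a_i-b_j+n_{ij})!}
$$

where $a_i$ and $b_j$ are the row and column sums of the contingency matrix, $R$ and $C$ its numbers of rows and columns, and $N$ the number of samples.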

@betatim
Member

betatim commented Nov 3, 2022

From the fact that I am consulting wikipedia you can tell I've not made a lot of progress understanding what exactly the code does/why it is written the way it is. So I don't have anything to say about what to do.

I have a slight preference for a version that doesn't use meshgrid() because meshgrid needs more memory (according to @glemaitre) and a bit of brain power to understand what it does/why it works in this case. This code already contains quite a few things that require a bit of thinking to understand (using log identities, gamma function for factorials, etc). Each of these isn't hard to understand, but if you combine several of these tricks together I find it makes the code quite a hard read. So if we can avoid that, that would be nice.

I think 450ms vs 600ms for one case is a small difference compared to 14s vs 5s. So it might make sense to make the slow case (much) faster and pay the price that the fast case gets slower. This is assuming that in all cases this function only gets called a handful of times, so that for a human the difference between 400 and 600ms is basically impossible to notice, but the difference between 14 and 5s is noticeable.

@Kshitij68
Contributor Author

> After some digging by @glemaitre we found that the formula that the code implements (probably) is https://en.wikipedia.org/wiki/Adjusted_mutual_information#Adjustment_for_chance. I think no matter what we do as optimisation it would be worth leaving a note for people from the future that what expected_mutual_information computes is (actually) the adjusted mutual information.

Since the cython function is only called for calculating adjusted_mutual_info_score, why not just rename the cython function itself?

@Kshitij68
Contributor Author

Kshitij68 commented Nov 4, 2022

So to summarize, we have multiple things to look at when deciding between the current PR and @glemaitre's approach:

  • Code Readability and Maintainability
  • Memory consumption
  • Time taken with n_samples >>> n_labels
  • Time taken with n_samples on similar scale as n_labels

@Kshitij68
Contributor Author

> I think 450ms vs 600ms for one case is a small difference compared to 14s vs 5s. So it might make sense to make the slow case (much) faster and pay the price that the fast case gets slower.

I would disagree with that statement. In real life, most use cases have many more samples than labels.
But this is obviously based on personal opinion.

@betatim
Member

betatim commented Nov 4, 2022

I was wrong. The function does implement the expected mutual information, but the formula used is on the "adjusted mutual info" Wikipedia page (it is the formula before it gets adjusted ...). Sorry for the confusion.

My thinking for "let's take the speed-up even with a small hit to the common case" was that even though the common case is common, the slowdown might be so small that in practice no one will notice. I don't know if that is true, so I wanted to put it out there for discussion.

@glemaitre
Member

On the other hand, the full Cython version does not allocate temporary vectors and avoids multiple data conversions.

@Kshitij68
Contributor Author

> It would be great if we could run an extensive benchmark varying n_labels and n_samples and storing all those in a DataFrame to make proper plots.

@glemaitre Should the plots reside in the asv benchmarks folder or the benchmarks folder?

@glemaitre
Member

The plot can be posted in this PR, together with the benchmark code.

@glemaitre
Member

glemaitre commented Nov 4, 2022

I was just implementing a quick benchmark:

from collections import defaultdict
from itertools import product
from time import time

import numpy as np
import pandas as pd

from sklearn.metrics import adjusted_mutual_info_score

repeat = 10
n_samples = [1_000, 10_000, 100_000, 1_000_000]
n_labels = [10, 100, 1_000]

rng = np.random.default_rng(0)

results = defaultdict(list)
for ns, nl in product(n_samples, n_labels):
    local_result = []
    for i in range(repeat):
        print(f"Repetition {i} for n_samples={ns} and n_labels={nl}")
        x = rng.integers(low=0, high=nl, size=ns)
        y = rng.integers(low=0, high=nl, size=ns)

        start = time()
        adjusted_mutual_info_score(x, y)
        end = time()
        local_result.append(end - start)

    results["n_samples"].append(ns)
    results["n_labels"].append(nl)
    results["mean_time"].append(np.mean(local_result))

results = pd.DataFrame(results)
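Appended to the script above, a possible way to turn the collected DataFrame into the requested plots, one curve per n_labels (a sketch, not part of the benchmark itself):

import matplotlib.pyplot as plt

pivot = results.pivot(index="n_samples", columns="n_labels", values="mean_time")
ax = pivot.plot(logx=True, logy=True, marker="o")
ax.set_xlabel("n_samples")
ax.set_ylabel("mean time (s)")
ax.set_title("adjusted_mutual_info_score runtime")
plt.show()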

@Kshitij68 Kshitij68 force-pushed the Fix-regression-for-expected-mutual-info branch 2 times, most recently from 1b3307b to baa7761 Compare November 7, 2022 05:53
@Kshitij68 Kshitij68 force-pushed the Fix-regression-for-expected-mutual-info branch from baa7761 to 39dd0bf Compare November 7, 2022 05:57
@Kshitij68
Contributor Author

Thanks for the review. I have made the changes; the performance plots are still pending.
I will try to find time this coming weekend.

Successfully merging this pull request may close these issues.

adjusted_mutual_info_score takes a long time with lists containing many unique values