ENH Improve performance for expected mutual information #24794
Conversation
I will add a changelog entry once this solution is deemed acceptable.
```diff
@@ -50,7 +50,8 @@ def expected_mutual_information(contingency, int n_samples):
     gln_N = gammaln(N + 1)
     gln_nij = gammaln(nijs + 1)
     # start and end values for nij terms for each summation.
-    start = np.array([[v - N + w for w in b] for v in a], dtype='int')
+    start = np.array(np.meshgrid(a, b), dtype='int').T
+    start = np.array([start[i].sum(axis=1) - N for i in range(len(start))])
```
I'd be open to suggestions on how to do these operations with NumPy in a more elegant way.
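One possibly more elegant option (an editorial sketch, not code from this PR): `np.add.outer` builds the same `start` matrix in a single step, without materialising the `(2, M, N)` meshgrid intermediate discussed below.

```python
import numpy as np

# Dummy marginals standing in for the contingency row/column sums in the
# PR's code; N plays the role of the total sample count.
a = np.array([3, 5])
b = np.array([2, 4, 6])
N = 10

# start[i, j] == a[i] + b[j] - N, computed with a single (M, N) allocation.
start = np.add.outer(a, b) - N
print(start)
```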
I've been trying to understand the algorithm, but only with average success :-/
When I compare the `start` that this PR produces with what is in `main`, they look the same. So if this is faster and all the tests still pass, we should use it?!
The issue that I can spot here is that we are creating a huge matrix in memory.
The size of the matrix will be double that of the previous method. Precisely, if `a` has `M` elements and `b` has `N` elements, then:
- with the old method you have `M*N` elements
- with the new method you have `2*M*N` elements
```python
>>> a = np.array([1, 2])
>>> b = np.array([3, 4, 5])
>>> np.meshgrid(a, b)
[array([[1, 2],
        [1, 2],
        [1, 2]]),
 array([[3, 3],
        [4, 4],
        [5, 5]])]
```
So although the difference exists, I don't think it should be the deciding factor.
Could you post some before/after timing numbers for the three different cases mentioned in the issue?
Sure,
Force-pushed 1b8a2cc to 8dcfcac.
I think that we need to be extremely careful with the benchmark here. I came up with a Cython implementation that does not rely on Python interaction:

```cython
from libc.math cimport exp, lgamma, log

import cython
cimport numpy as cnp
import numpy as np

cnp.import_array()


cpdef expected_mutual_information(contingency, cnp.int64_t n_samples):
    """Compute the expected mutual information.

    Parameters
    ----------
    contingency : sparse matrix of shape (n_labels, n_labels), dtype=np.int64
        Contingency matrix storing the label counts. This is a sparse
        matrix in the CSR format.

    n_samples : int
        The number of samples, which is the sum of the counts in the
        contingency matrix.

    Returns
    -------
    emi : float
        The expected mutual information between the two vectors of labels.
    """
    cdef:
        unsigned long long n_rows, n_cols
        cnp.ndarray[cnp.int64_t] a, b
        cnp.int64_t a_i, b_j
        cnp.intp_t i, j, nij, nij_nz
        cnp.intp_t start, end
        cnp.float64_t emi = 0.0
        double term1, term2, term3
        double log_n_samples = log(n_samples)

    n_rows, n_cols = contingency.shape
    a = np.ravel(contingency.sum(axis=1))
    b = np.ravel(contingency.sum(axis=0))

    if a.size == 1 or b.size == 1:
        return 0.0

    for i in range(n_rows):
        for j in range(n_cols):
            a_i, b_j = a[i], b[j]
            start = max(1, a_i - n_samples + b_j)
            end = min(a_i, b_j) + 1
            for nij in range(start, end):
                nij_nz = 1 if nij == 0 else nij
                term1 = nij_nz / <double>n_samples
                term2 = log_n_samples + log(nij_nz) - log(a_i) - log(b_j)
                term3 = exp(
                    lgamma(a_i + 1) + lgamma(b_j + 1)
                    + lgamma(n_samples - a_i + 1) + lgamma(n_samples - b_j + 1)
                    - lgamma(n_samples + 1) - lgamma(nij_nz + 1)
                    - lgamma(a_i - nij + 1)
                    - lgamma(b_j - nij + 1)
                    - lgamma(n_samples - a_i - b_j + nij + 1)
                )
                emi += (term1 * term2 * term3)
    return emi
```

It will require changing a line of code:

```python
contingency = contingency_matrix(
    labels_true, labels_pred, sparse=True, dtype=np.int64
)
```

which avoids making 3 copies/conversions of the data (i.e. int64 -> float64 -> int32) in the current implementation. However, I can see a trade-off, illustrated by the two benchmarks below.

**Benchmark 1**

This is the benchmark reported in the original issue:

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

x = np.array([x % 8000 for x in range(10_000)])
y = np.array([x % 7000 for x in range(10_000)])
mi = adjusted_mutual_info_score(x, y)
```

This implementation will be much faster, even faster than the implementation of this PR. However, this benchmark is unique in that we have a large number of labels.

With the current PR:

With the cython version above:

**Benchmark 2**

I ran another benchmark where you have a large number of samples but few labels:

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)
n_samples = 1_000_000
x = rng.integers(low=0, high=5, size=n_samples)
y = rng.integers(low=0, high=5, size=n_samples)
mi = adjusted_mutual_info_score(x, y)
```

and in this case, the fully cythonized version will be slower than the vectorized version using NumPy.

For the current PR:

For the fully cythonized version:

**Conclusions**

Since the gain is important in one case, it might overcome the case where the other version is slower. @betatim, what do you think?
It would be great if we could run an extensive benchmark varying `n_samples` and `n_labels`.
After some digging by @glemaitre we found that the formula that the code implements (probably) is https://en.wikipedia.org/wiki/Adjusted_mutual_information#Adjustment_for_chance. I think no matter what we do as optimisation, it would be worth leaving a note for people from the future explaining which formula the code implements.
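For future readers, here is that formula transcribed (an editorial transcription of the expected mutual information from the linked Wikipedia section; $a_i$ and $b_j$ are the contingency table's row and column sums and $N$ the sample count, i.e. `a`, `b`, and `n_samples` in the code above):

$$
E[\mathrm{MI}] = \sum_{i=1}^{R}\sum_{j=1}^{C}\sum_{n_{ij}=\max(1,\,a_i+b_j-N)}^{\min(a_i,\,b_j)} \frac{n_{ij}}{N}\,\log\!\left(\frac{N\,n_{ij}}{a_i\,b_j}\right) \frac{a_i!\,b_j!\,(N-a_i)!\,(N-b_j)!}{N!\,n_{ij}!\,(a_i-n_{ij})!\,(b_j-n_{ij})!\,(N-a_i-b_j+n_{ij})!}
$$

The three factors appear to map onto `term1`, `term2`, and `term3` in the Cython sketch above.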
From the fact that I am consulting wikipedia you can tell I've not made a lot of progress understanding what exactly the code does/why it is written the way it is. So I don't have anything to say about what to do. I have a slight preference for a version that doesn't use …

I think 450ms vs 600ms for one case is a small difference compared to 14s vs 5s. So it might make sense to make the slow case (much) faster and pay the price that the fast case gets slower. This is assuming that in all cases this function only gets called a handful of times, so that for a human the difference between 450ms and 600ms is basically impossible to notice, but the difference between 14s and 5s is noticeable.
Since the cython function is only called for calculating …
So to summarize, we have multiple things to look at when deciding between the current PR and @glemaitre's approach: …
I would disagree with the statement here. In real life, most use cases tend to have many more samples than labels.
I was wrong. The function does implement the expected mutual information, but the formula used is on the "adjusted mutual info" wikipedia page (they took the formula before it gets adjusted ...). Sorry for the confusion. My thinking for "let's take the speed up even with a small hit to the common case" was that even though the common case is common, the slow down might be so small that in practice no one will notice. I don't know if that is true, so I wanted to put it out there for discussion.
Although, the full Cython version does not allocate temporary vectors and avoids multiple data conversions.
@glemaitre Should the plots reside in the asv benchmarks folder or the benchmarks folder?
The plot can be posted in this PR, together with the benchmark code.
I was just implementing a quick bench:

```python
from collections import defaultdict
from itertools import product
from time import time

import numpy as np
import pandas as pd

from sklearn.metrics import adjusted_mutual_info_score

repeat = 10
n_samples = [1_000, 10_000, 100_000, 1_000_000]
n_labels = [10, 100, 1_000]
rng = np.random.default_rng(0)
results = defaultdict(list)
for ns, nl in product(n_samples, n_labels):
    local_result = []
    for i in range(repeat):
        print(f"Repetition {i} for n_samples={ns} and n_labels={nl}")
        x = rng.integers(low=0, high=nl, size=ns)
        y = rng.integers(low=0, high=nl, size=ns)
        start = time()
        adjusted_mutual_info_score(x, y)
        end = time()
        local_result.append(end - start)
    results["n_samples"].append(ns)
    results["n_labels"].append(nl)
    results["mean_time"].append(np.mean(local_result))
results = pd.DataFrame(results)
```
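If the plots are wanted directly in the thread, a minimal sketch for turning that `results` DataFrame into a figure (editorial, assuming standard pandas/matplotlib; not part of the PR):

```python
import matplotlib.pyplot as plt

# One curve per label count: mean runtime as a function of n_samples.
pivot = results.pivot(index="n_samples", columns="n_labels", values="mean_time")
pivot.plot(logx=True, logy=True, marker="o")
plt.xlabel("n_samples")
plt.ylabel("mean time (s)")
plt.title("adjusted_mutual_info_score timing")
plt.show()
```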
Force-pushed 1b3307b to baa7761, then baa7761 to 39dd0bf.
Thanks for the review. I have made the changes; the performance plots are still pending.
Reference Issues/PRs
This fixes #24254
What does this implement/fix? Explain your changes.
Used numpy arrays instead of for loops to leverage vectorization.
Any other comments?
I noticed a performance improvement of 4x with this. It would be great if some people could also try it on their machines.
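For anyone who wants to try it, a quick timing script (an editorial sketch; it reuses the many-labels benchmark from the linked issue quoted earlier in this thread, which is the regime this PR targets):

```python
from time import time

import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

# Many labels relative to samples: the slow case this PR speeds up.
x = np.array([i % 8000 for i in range(10_000)])
y = np.array([i % 7000 for i in range(10_000)])

tic = time()
adjusted_mutual_info_score(x, y)
print(f"elapsed: {time() - tic:.2f} s")
```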
Update 31/10/2022
For 6000 records,
before -> 145 seconds
after -> 30 seconds