
adjusted_mutual_info_score takes a long time with lists containing many unique values #24254


Closed
bchen1116 opened this issue Aug 24, 2022 · 4 comments · Fixed by #25713
Labels
module:metrics Needs Investigation Issue requires investigation

Comments

@bchen1116
Contributor

Describe the bug

The runtime of adjusted_mutual_info_score jumps significantly when the two label lists contain many unique values. With around 6k total unique values (i.e. two columns of 3k unique values each) the runtime stays around a minute, but as the number of unique values increases, the runtime shoots up.

Steps/Code to Reproduce

import pandas as pd
from sklearn.metrics import adjusted_mutual_info_score as ams
import time

df = pd.DataFrame({'a': [x % 8000 for x in range(1000000)],
                   'b': [x % 7000 for x in range(1000000)]})

start = time.time()
mi = ams(df['a'], df['b'])
end = time.time()
print(end - start)  # elapsed time in seconds

Expected Results

A small or at most linear increase in runtime as the number of unique values grows.

Actual Results

Large increase in runtime:

Two columns of 6k unique values each: 598 s
Two columns of 8k unique values each: 889 s
Two columns of 10k unique values each: 1118 s
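For what it's worth, a quick scaling check on these numbers (a rough sketch using the timings above; `k` is the number of unique values per column):

```python
# (unique values per column, measured seconds) copied from the numbers above
timings = [(6000, 598), (8000, 889), (10000, 1118)]

# If the runtime grew linearly in k, t/k would stay roughly constant;
# here it increases with k, so the growth is clearly super-linear.
for k, t in timings:
    print(f"k={k}: t/k = {t / k:.4f} s per unique value")
```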

Versions

System:
    python: 3.8.0 (default, Aug 17 2020, 18:01:34)  [Clang 11.0.3 (clang-1103.0.32.62)]
executable: /Users/bryan.chen/.pyenv/versions/woodwork/bin/python
   machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
          pip: 22.2.2
   setuptools: 41.2.0
      sklearn: 1.0.2
        numpy: 1.21.2
        scipy: 1.7.1
       Cython: 0.29.28
       pandas: 1.4.3
   matplotlib: 3.5.1
       joblib: 1.0.1
threadpoolctl: 2.2.0

Built with OpenMP: True
@thomasjpfan
Member

Thank you for opening the issue. One would need to profile the implementation to see where the slow parts are in order to move forward on this issue.
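For anyone who wants to reproduce the profiling, a minimal sketch using the standard library's cProfile (the label sizes here are deliberately smaller than in the original report so it finishes quickly):

```python
import cProfile
import pstats

from sklearn.metrics import adjusted_mutual_info_score

# Smaller stand-ins for the DataFrame columns in the reproduction above
labels_a = [x % 500 for x in range(100_000)]
labels_b = [x % 400 for x in range(100_000)]

profiler = cProfile.Profile()
profiler.enable()
score = adjusted_mutual_info_score(labels_a, labels_b)
profiler.disable()

# The top entries by cumulative time should point at the hot spot
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```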

@thomasjpfan thomasjpfan added module:metrics Needs Investigation Issue requires investigation and removed Needs Triage Issue requires triage labels Sep 9, 2022
@Kshitij68
Contributor

I performed a high-level analysis and found that, for a sample of 6000 records, 99% of the time is spent in sklearn.metrics.cluster._expected_mutual_info_fast.expected_mutual_information.

System:
    python: 3.9.7 (default, Sep 16 2021, 08:50:36)  [Clang 10.0.0 ]
executable: /Users/kshitijmathur/opt/anaconda3/bin/python
   machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.1.3
          pip: 22.2.2
   setuptools: 58.0.4
        numpy: 1.22.4
        scipy: 1.7.1
       Cython: 0.29.24
       pandas: 1.4.3
   matplotlib: 3.4.3
       joblib: 1.1.0
threadpoolctl: 2.2.0

Built with OpenMP: True

@Kshitij68
Contributor

Kshitij68 commented Oct 31, 2022

Almost the entire performance degradation comes from this line of code:

start = np.array([[v - N + w for w in b] for v in a], dtype='int')

https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/metrics/cluster/_expected_mutual_info_fast.pyx#L53

For 6000 records, the line above takes 125 seconds, and the nested for loop at the end of the Cython function takes ~25 seconds.
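That nested comprehension builds an (n_a × n_b) grid with Python-level loops; the same grid can be produced with NumPy broadcasting in a single vectorized operation. A minimal sketch, using small made-up marginals `a`, `b`, and sample count `N` in place of the real contingency-table sums:

```python
import numpy as np

# Hypothetical contingency-table marginals; in the real function, a and b
# hold the row and column sums of the contingency table and N is the
# total number of samples.
a = np.array([3, 5, 2], dtype=np.int64)
b = np.array([4, 6], dtype=np.int64)
N = int(a.sum())

# Original pure-Python nested comprehension: O(len(a) * len(b)) Python ops
start_slow = np.array([[v - N + w for w in b] for v in a], dtype='int')

# Broadcast equivalent: same values, computed in C without a Python loop
start_fast = a[:, np.newaxis] - N + b[np.newaxis, :]

assert np.array_equal(start_slow, start_fast)
```

This doesn't change the asymptotic cost (the grid still has n_a × n_b entries), but it removes the Python-interpreter overhead per entry.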

@Kshitij68
Contributor

So to me, since the time complexity of the above line of code is O(N^2), we can't actually expect a linear increase in runtime.
I'm not an expert in Cython, but I will try to see whether something can be done about this.
