
adjusted_mutual_info_score takes a long time with lists containing many unique values #24254


Closed
bchen1116 opened this issue Aug 24, 2022 · 4 comments · Fixed by #25713
Labels
module:metrics Needs Investigation Issue requires investigation

Comments

@bchen1116
Contributor

Describe the bug

The runtime of adjusted_mutual_info_score jumps significantly when the two label lists contain many unique values. With around 6k total unique values (i.e. two columns of 3k unique values each) the runtime stays around a minute, but as the number of unique values increases, the runtime shoots up.

Steps/Code to Reproduce

import pandas as pd
from sklearn.metrics import adjusted_mutual_info_score as ams
import time

df = pd.DataFrame({'a': [x % 8000 for x in range(1000000)],
                   'b': [x % 7000 for x in range(1000000)]})

start = time.time()
mi = ams(df['a'], df['b'])
end = time.time()
print(end - start)  # elapsed time in seconds

Expected Results

A small or at most linear increase in runtime as the number of unique values grows.

Actual Results

Large increase in runtime:

Two columns of 6k unique values each: 598 s
Two columns of 8k unique values each: 889 s
Two columns of 10k unique values each: 1118 s
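For what it's worth, a quick scaling check on these numbers (a rough sketch using the timings above; `k` is the number of unique values per column):

```python
# (unique values per column, measured seconds) copied from the numbers above
timings = [(6000, 598), (8000, 889), (10000, 1118)]

# If the runtime grew linearly in k, t/k would stay roughly constant;
# here it increases with k, so the growth is clearly super-linear.
for k, t in timings:
    print(f"k={k}: t/k = {t / k:.4f} s per unique value")
```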

Versions

System:
    python: 3.8.0 (default, Aug 17 2020, 18:01:34)  [Clang 11.0.3 (clang-1103.0.32.62)]
executable: /Users/bryan.chen/.pyenv/versions/woodwork/bin/python
   machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
          pip: 22.2.2
   setuptools: 41.2.0
      sklearn: 1.0.2
        numpy: 1.21.2
        scipy: 1.7.1
       Cython: 0.29.28
       pandas: 1.4.3
   matplotlib: 3.5.1
       joblib: 1.0.1
threadpoolctl: 2.2.0

Built with OpenMP: True
@thomasjpfan
Member

Thank you for opening the issue. One would need to profile the implementation to see where the slow parts are in order to move forward on this issue.
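For anyone who wants to reproduce the profiling, a minimal sketch using the standard library's cProfile (the label sizes here are deliberately smaller than in the original report so it finishes quickly):

```python
import cProfile
import pstats

from sklearn.metrics import adjusted_mutual_info_score

# Smaller stand-ins for the DataFrame columns in the reproduction above
labels_a = [x % 500 for x in range(100_000)]
labels_b = [x % 400 for x in range(100_000)]

profiler = cProfile.Profile()
profiler.enable()
score = adjusted_mutual_info_score(labels_a, labels_b)
profiler.disable()

# The top entries by cumulative time should point at the hot spot
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```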

@thomasjpfan thomasjpfan added module:metrics Needs Investigation Issue requires investigation and removed Needs Triage Issue requires triage labels Sep 9, 2022
@Kshitij68
Contributor

I performed a high-level analysis and found that, for a sample of 6000 records, 99% of the time is spent in sklearn.metrics.cluster._expected_mutual_info_fast.expected_mutual_information.

System:
    python: 3.9.7 (default, Sep 16 2021, 08:50:36)  [Clang 10.0.0 ]
executable: /Users/kshitijmathur/opt/anaconda3/bin/python
   machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.1.3
          pip: 22.2.2
   setuptools: 58.0.4
        numpy: 1.22.4
        scipy: 1.7.1
       Cython: 0.29.24
       pandas: 1.4.3
   matplotlib: 3.4.3
       joblib: 1.1.0
threadpoolctl: 2.2.0

Built with OpenMP: True

@Kshitij68
Contributor

Kshitij68 commented Oct 31, 2022

Almost the entire performance degradation comes from this line of code:

start = np.array([[v - N + w for w in b] for v in a], dtype='int')

https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/metrics/cluster/_expected_mutual_info_fast.pyx#L53

For 6000 records, the line above takes 125 seconds, and the nested for loop at the end of the Cython function takes ~25 seconds.
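That nested comprehension builds an (n_a × n_b) grid with Python-level loops; the same grid can be produced with NumPy broadcasting in a single vectorized operation. A minimal sketch, using small made-up marginals `a`, `b`, and sample count `N` in place of the real contingency-table sums:

```python
import numpy as np

# Hypothetical contingency-table marginals; in the real function, a and b
# hold the row and column sums of the contingency table and N is the
# total number of samples.
a = np.array([3, 5, 2], dtype=np.int64)
b = np.array([4, 6], dtype=np.int64)
N = int(a.sum())

# Original pure-Python nested comprehension: O(len(a) * len(b)) Python ops
start_slow = np.array([[v - N + w for w in b] for v in a], dtype='int')

# Broadcast equivalent: same values, computed in C without a Python loop
start_fast = a[:, np.newaxis] - N + b[np.newaxis, :]

assert np.array_equal(start_slow, start_fast)
```

This doesn't change the asymptotic cost (the grid still has n_a × n_b entries), but it removes the Python-interpreter overhead per entry.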

@Kshitij68
Contributor

So to me, since the time complexity of the above line of code is O(N^2), we can't actually expect a linear increase in runtime.
I'm not an expert in Cython, but I will try to see whether something can be done about this.
