Poor performance of sklearn.cluster.KMeans for numpy >= 1.19.0 #20642

@gittar

Describe the bug

While doing experiments with the KMeans class I observed hugely varying running times for the same problem when using different versions of numpy. A systematic check revealed the following:

  • with numpy 1.19.0 and later, KMeans took roughly 1.7 to 5 times as long as with numpy 1.18.5 and earlier

Statistics (figures are repeatable with little variance):

time:  7.67, Err: 0.01360200 python= 3.6.13 sklearn= 0.23.2 numpy 1.16.4
time:  8.20, Err: 0.01360200 python= 3.6.13 sklearn= 0.23.2 numpy 1.18.5
time: 40.85, Err: 0.01360200 python= 3.6.13 sklearn= 0.23.2 numpy 1.19.0
time:  9.69, Err: 0.01361583 python= 3.6.13 sklearn= 0.24.2 numpy 1.16.4
time: 10.21, Err: 0.01361583 python= 3.6.13 sklearn= 0.24.2 numpy 1.18.5
time: 17.68, Err: 0.01361583 python= 3.6.13 sklearn= 0.24.2 numpy 1.19.0

time:  7.74, Err: 0.01360200 python= 3.7.10 sklearn= 0.23.2 numpy 1.18.5
time: 40.00, Err: 0.01360200 python= 3.7.10 sklearn= 0.23.2 numpy 1.19.0
time: 40.47, Err: 0.01360200 python= 3.7.10 sklearn= 0.23.2 numpy 1.21.1
time: 10.37, Err: 0.01361583 python= 3.7.10 sklearn= 0.24.2 numpy 1.18.5
time: 17.80, Err: 0.01361583 python= 3.7.10 sklearn= 0.24.2 numpy 1.19.0
time: 17.54, Err: 0.01361583 python= 3.7.10 sklearn= 0.24.2 numpy 1.21.1

time:  7.82, Err: 0.01360200 python= 3.8.10 sklearn= 0.23.2 numpy 1.18.5
time: 40.54, Err: 0.01360200 python= 3.8.10 sklearn= 0.23.2 numpy 1.19.0
time: 40.29, Err: 0.01360200 python= 3.8.10 sklearn= 0.23.2 numpy 1.21.1
time: 10.25, Err: 0.01361583 python= 3.8.10 sklearn= 0.24.2 numpy 1.18.5
time: 17.06, Err: 0.01361583 python= 3.8.10 sklearn= 0.24.2 numpy 1.19.0
time: 17.20, Err: 0.01361583 python= 3.8.10 sklearn= 0.24.2 numpy 1.21.1

time: 14.45, Err: 0.01360200 python= 3.9.6  sklearn= 0.23.2 numpy 1.21.1
time: 17.44, Err: 0.01361583 python= 3.9.6  sklearn= 0.24.2 numpy 1.21.1
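
To put a number on the slowdown, the lines above can be parsed and compared directly. The following is a minimal sketch and not part of the original benchmark: the helper names `parse_results` and `slowdown` are made up for illustration, and only the Python 3.6.13 lines are included.

```python
import re

# Result lines copied verbatim from the Python 3.6.13 block above.
RESULTS = """\
time:  7.67, Err: 0.01360200 python= 3.6.13 sklearn= 0.23.2 numpy 1.16.4
time:  8.20, Err: 0.01360200 python= 3.6.13 sklearn= 0.23.2 numpy 1.18.5
time: 40.85, Err: 0.01360200 python= 3.6.13 sklearn= 0.23.2 numpy 1.19.0
time:  9.69, Err: 0.01361583 python= 3.6.13 sklearn= 0.24.2 numpy 1.16.4
time: 10.21, Err: 0.01361583 python= 3.6.13 sklearn= 0.24.2 numpy 1.18.5
time: 17.68, Err: 0.01361583 python= 3.6.13 sklearn= 0.24.2 numpy 1.19.0
"""

# Matches the output format printed by bench.py.
LINE_RE = re.compile(
    r"time:\s*(?P<time>[\d.]+), Err: (?P<err>[\d.]+) "
    r"python=\s*(?P<python>\S+)\s+sklearn=\s*(?P<sklearn>\S+)\s+numpy\s+(?P<numpy>\S+)"
)


def parse_results(text):
    """Turn benchmark output lines into a list of dicts (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            row = match.groupdict()
            row["time"] = float(row["time"])
            row["err"] = float(row["err"])
            rows.append(row)
    return rows


def slowdown(rows, sklearn_version, base="1.18.5", new="1.19.0"):
    """Ratio of running times (new numpy / base numpy) for one sklearn version."""
    times = {r["numpy"]: r["time"] for r in rows if r["sklearn"] == sklearn_version}
    return times[new] / times[base]


rows = parse_results(RESULTS)
for version in ("0.23.2", "0.24.2"):
    print(f"sklearn {version}: {slowdown(rows, version):.2f}x")
```

For these six lines the ratio is about 4.98 for scikit-learn 0.23.2 and about 1.73 for scikit-learn 0.24.2.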

Steps/Code to Reproduce

Here is the Python benchmark I used to produce each of the lines above. It generates three random data sets and runs k-means++ with k in {50, 100} on each of them. The time spent in the fit() method is summed over all six runs, as is the normalized error (inertia_ divided by the number of data points).

# file 'bench.py'
from sklearn.cluster import KMeans
import numpy as np
from time import time
import sys
import sklearn

def bench():
    np.random.seed(2) # fix random generator
    # generate datasets of size 1000,5000,10000
    # run k-means++ with k= 50,100
    Ds=[]
    for n in [1000,5000,10000]:
        Ds.append(np.random.random(size=(n,2)))
    ks=[50,100]
    t_total=0
    err=0
    for D in Ds:
        for k in ks:
            km=KMeans(n_clusters=k)
            t0=time()
            km.fit(D)
            t=time()-t0
            t_total+=t
            err+=km.inertia_/len(D)
        
    return f"time: {t_total:5.2f}, Err: {err:.8f}"
    
print(bench(),"python=",sys.version.split()[0],
"sklearn=",sklearn.__version__, "numpy",np.__version__)

The environments were produced with conda, e.g. the one for Python 3.9 and scikit-learn 0.24.2 as follows:

conda create --name py3.9-sk0.24.2 python=3.9
conda activate py3.9-sk0.24.2
pip install scikit-learn==0.24.2
pip install numpy==1.18.5
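
Typing these commands by hand for every combination is tedious, so they can also be generated. The sketch below only prints the commands and does not run conda. The function name `conda_commands` and the `-np<version>` suffix on the environment name are assumptions added for illustration, and the matrix shown covers just the Python 3.8 rows of the table.

```python
from itertools import product


def conda_commands(python, sklearn_version, numpy_version):
    """Return the shell commands for one environment (hypothetical helper).

    The environment name extends the "py3.9-sk0.24.2" pattern used above with
    a numpy suffix, which is an assumption and not the naming from the issue.
    """
    env = f"py{python}-sk{sklearn_version}-np{numpy_version}"
    return [
        f"conda create --name {env} python={python}",
        f"conda activate {env}",
        f"pip install scikit-learn=={sklearn_version}",
        f"pip install numpy=={numpy_version}",
    ]


# Combinations taken from the Python 3.8 rows of the statistics table.
PYTHONS = ["3.8"]
SKLEARNS = ["0.23.2", "0.24.2"]
NUMPYS = ["1.18.5", "1.19.0", "1.21.1"]

for python, sk, npv in product(PYTHONS, SKLEARNS, NUMPYS):
    print("\n".join(conda_commands(python, sk, npv)))
    print()
```

Giving each numpy version its own environment keeps the runs independent, at the cost of more disk space than reinstalling numpy inside one environment.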

Then for each environment the above benchmark was run:

python src/bench.py

Expected Results

I was expecting similar execution times for the different numpy versions or even better performance for higher versions.

Actual Results

The performance dropped sharply when switching from numpy 1.18.5 to 1.19.0. This was the case for Python 3.6, 3.7, and 3.8 and for both scikit-learn 0.23.2 and 0.24.2. The performance hit was larger for scikit-learn 0.23.2 than for scikit-learn 0.24.2.

With numpy 1.18.5 the performance of scikit-learn 0.23.2 was better than that of scikit-learn 0.24.2.

The computed error values were identical for a given version of scikit-learn and differed only slightly between scikit-learn 0.23.2 and 0.24.2.

Question: What could be the reason for the huge performance deterioration for all numpy versions starting with numpy 1.19.0?

Versions

(see table above)
