KMeans(init='k-means++') performance issue with OpenBLAS #17334

Open
ogrisel opened this issue May 25, 2020 · 11 comments

@ogrisel
Member

ogrisel commented May 25, 2020

I open this issue to investigate a performance problem that might be related to #17230.

I adapted the reproducer of #17230 to display more info and make it work on a medium-sized random dataset.

from sklearn import cluster
from time import time
from pprint import pprint
from threadpoolctl import threadpool_info
import numpy as np


pprint(threadpool_info())
rng = np.random.RandomState(0)
data = rng.randn(5000, 50)
t0_global = time()
for k in range(1, 15):
    t0 = time()
    # print(f"Running k-means with k={k}: ", end="", flush=True)
    cluster.KMeans(
        n_clusters=k,
        random_state=42,
        n_init=10,
        max_iter=2000,
        algorithm='full',
        init='k-means++').fit(data)
    # print(f"{time() - t0:.3f} s")

print(f"Total duration: {time() - t0_global:.3f} s")

I tried to run this on Linux with scikit-learn master (therefore including the #16499 fix), with 2 different builds of scipy (OpenBLAS from PyPI and MKL from Anaconda) and various values of OMP_NUM_THREADS (unset, OMP_NUM_THREADS=1, OMP_NUM_THREADS=2, OMP_NUM_THREADS=4), on a laptop with 2 physical CPU cores (4 logical CPUs).

In both cases I use the same scikit-learn binaries (built with GCC in editable mode); only the environment changes.
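As a side note, the same kind of limit can also be applied at runtime with threadpoolctl instead of changing the environment. This is only a sketch for illustration (it reuses cluster and data from the reproducer above, with an arbitrary n_clusters), not one of the benchmarked configurations:

from threadpoolctl import threadpool_limits

# Limit every supported threadpool (OpenMP and BLAS) to a single thread,
# roughly matching the OMP_NUM_THREADS=1 configurations benchmarked below.
with threadpool_limits(limits=1):
    cluster.KMeans(n_clusters=8, random_state=42, n_init=10,
                   algorithm='full', init='k-means++').fit(data)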

The summary is:

  • with MKL there is no problem: large or unset values of OMP_NUM_THREADS are faster than OMP_NUM_THREADS=1;
  • with OpenBLAS, leaving OMP_NUM_THREADS unset or setting it to a large value is significantly slower than a forced sequential run with OMP_NUM_THREADS=1.

I will include my runs in the first comment.

/cc @jeremiedbb

@ogrisel
Member Author

ogrisel commented May 25, 2020

  • mkl env with OMP_NUM_THREADS unset:
[{'filepath': '/home/ogrisel/miniconda3/envs/pylatest/lib/libgomp.so.1.0.0',
  'internal_api': 'openmp',
  'num_threads': 4,
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'version': None},
 {'filepath': '/home/ogrisel/miniconda3/envs/sklearn-mkl/lib/libmkl_rt.so',
  'internal_api': 'mkl',
  'num_threads': 2,
  'prefix': 'libmkl_rt',
  'threading_layer': 'intel',
  'user_api': 'blas',
  'version': '2020.0.1'},
 {'filepath': '/home/ogrisel/miniconda3/envs/sklearn-mkl/lib/libiomp5.so',
  'internal_api': 'openmp',
  'num_threads': 4,
  'prefix': 'libiomp',
  'user_api': 'openmp',
  'version': None}]
Total duration: 4.032 s
  • mkl env + OMP_NUM_THREADS=1
[{'filepath': '/home/ogrisel/miniconda3/envs/pylatest/lib/libgomp.so.1.0.0',
  'internal_api': 'openmp',
  'num_threads': 1,
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'version': None},
 {'filepath': '/home/ogrisel/miniconda3/envs/sklearn-mkl/lib/libmkl_rt.so',
  'internal_api': 'mkl',
  'num_threads': 1,
  'prefix': 'libmkl_rt',
  'threading_layer': 'intel',
  'user_api': 'blas',
  'version': '2020.0.1'},
 {'filepath': '/home/ogrisel/miniconda3/envs/sklearn-mkl/lib/libiomp5.so',
  'internal_api': 'openmp',
  'num_threads': 1,
  'prefix': 'libiomp',
  'user_api': 'openmp',
  'version': None}]
Total duration: 5.951 s
  • openblas env with OMP_NUM_THREADS unset:
[{'filepath': '/home/ogrisel/miniconda3/envs/pylatest/lib/libgomp.so.1.0.0',
  'internal_api': 'openmp',
  'num_threads': 4,
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'version': None},
 {'filepath': '/home/ogrisel/miniconda3/envs/pylatest/lib/python3.8/site-packages/numpy/.libs/libopenblasp-r0-34a18dc3.3.7.so',
  'internal_api': 'openblas',
  'num_threads': 4,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.7'}]
Total duration: 23.408 s
  • openblas env with OMP_NUM_THREADS=1:
[{'filepath': '/home/ogrisel/miniconda3/envs/pylatest/lib/libgomp.so.1.0.0',
  'internal_api': 'openmp',
  'num_threads': 1,
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'version': None},
 {'filepath': '/home/ogrisel/miniconda3/envs/pylatest/lib/python3.8/site-packages/numpy/.libs/libopenblasp-r0-34a18dc3.3.7.so',
  'internal_api': 'openblas',
  'num_threads': 1,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.7'}]
Total duration: 7.295 s
  • openblas env with OMP_NUM_THREADS=2
[{'filepath': '/home/ogrisel/miniconda3/envs/pylatest/lib/libgomp.so.1.0.0',
  'internal_api': 'openmp',
  'num_threads': 2,
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'version': None},
 {'filepath': '/home/ogrisel/miniconda3/envs/pylatest/lib/python3.8/site-packages/numpy/.libs/libopenblasp-r0-34a18dc3.3.7.so',
  'internal_api': 'openblas',
  'num_threads': 2,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.7'}]
Total duration: 7.567 s

@ogrisel
Member Author

ogrisel commented May 25, 2020

As @jeremiedbb said during the core dev meeting, it might be caused by OpenBLAS trying to use 4 threads on 2 physical cores / 4 logical cores, which is not an optimal strategy.

This might be related to OpenMathLib/OpenBLAS#1881, although the above benchmarks would need to be rerun while controlling the sizes of the outer OpenMP and inner OpenBLAS loops to understand exactly what's going on.
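For reference, a quick way to check how many physical vs logical cores the machine exposes (a hypothetical check, assuming psutil is installed; not part of the original benchmarks):

import os
import psutil  # assumption: psutil is available in the environment

print("logical CPUs: ", os.cpu_count())                   # 4 on this laptop
print("physical CPUs:", psutil.cpu_count(logical=False))  # 2 on this laptop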

@jeremiedbb
Member

Not exactly. The BLAS is always limited to 1 thread. Setting OMP_NUM_THREADS will impact the OpenMP loop (over data chunks). On my laptop I observe, like you, that using logical cores is detrimental with OpenBLAS (also with MKL, but much less). I don't understand why changing the BLAS makes a difference, since it should be single-threaded (I checked that it effectively is).

@ogrisel
Member Author

ogrisel commented May 25, 2020

I agree. Maybe the CPU cache usage pattern of OpenBLAS is different. Or maybe the inner BLAS is not limited to 1 thread as it's supposed to be when we use OpenBLAS.

@jeremiedbb
Member

jeremiedbb commented May 25, 2020

Hum, I just tried your snippet while setting OPENBLAS_NUM_THREADS, and there is a big difference between not setting it and setting it to 1, which seems to indicate that OpenBLAS is actually not properly limited to 1 thread.
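For what it's worth, here is a sketch (not part of the original experiment, reusing pprint, cluster and data from the reproducer) of how the BLAS pool alone can be limited at runtime, which should behave like exporting OPENBLAS_NUM_THREADS=1:

from threadpoolctl import threadpool_limits, threadpool_info

# Limit only the BLAS threadpool; the OpenMP pool keeps its default size.
with threadpool_limits(limits=1, user_api="blas"):
    pprint(threadpool_info())  # OpenBLAS should now report num_threads: 1
    cluster.KMeans(n_clusters=10, random_state=42, n_init=10,
                   algorithm='full', init='k-means++').fit(data)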

@jeremiedbb
Member

Actually it's the k-means++ init's fault! It scales poorly, especially when using hyperthreads.
The timings are back to expected when you set the init to 'random', meaning that the OpenMP loop and the inner BLAS are properly controlled.
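A minimal sketch of that comparison (reusing cluster and data from the reproducer above; the exact n_clusters is arbitrary):

from time import time

for init in ("k-means++", "random"):
    t0 = time()
    cluster.KMeans(n_clusters=10, random_state=42, n_init=10,
                   algorithm='full', init=init).fit(data)
    print(f"init={init!r}: {time() - t0:.3f} s")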

Maybe you can turn this issue into a k-means++ issue?

@ogrisel
Member Author

ogrisel commented May 25, 2020

meaning that the OpenMP loop and the inner BLAS are properly controlled.

That's very good news. I completely forgot about the init. I should have started by profiling before investigating the behavior of the main loop.

Maybe you can turn this issue into a k-means++ issue?

Yes but it's still an issue for the KMeans estimator.

@jeremiedbb
Member

Yes but it's still an issue for the KMeans estimator.

True :)

@ogrisel ogrisel changed the title K-Means performance issue with OpenBLAS KMeans performance issue with OpenBLAS May 25, 2020
@ogrisel ogrisel changed the title KMeans performance issue with OpenBLAS KMeans(init='k-means++') performance issue with OpenBLAS May 25, 2020
@ogrisel
Member Author

ogrisel commented May 25, 2020

So in summary we can say that the k-means++ code is indeed suffering from OpenMathLib/OpenBLAS#1881, right?

@jeremiedbb
Member

jeremiedbb commented May 25, 2020

I think so, but it might be more specific. In k-means++ the matrix multiplication is (n_candidates, n_features) x (n_samples, n_features) and the number of candidates is small (~log(n_clusters)). It's possible that MKL is better at optimizing a matrix multiplication where one dimension is much smaller than the other.
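To illustrate the shapes involved, here is a hypothetical example with the dimensions of the reproducer (not code taken from scikit-learn):

import numpy as np

n_samples, n_features, n_clusters = 5000, 50, 10
n_candidates = max(1, int(np.log(n_clusters)))  # ~log(n_clusters) candidates, here 2

candidates = np.random.randn(n_candidates, n_features)
X = np.random.randn(n_samples, n_features)

# The dominant BLAS call multiplies a very "thin" matrix with the data:
# (n_candidates, n_features) @ (n_features, n_samples) -> (n_candidates, n_samples)
cross_products = candidates @ X.T
print(cross_products.shape)  # (2, 5000)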

@ogrisel
Member Author

ogrisel commented May 29, 2020

A supposedly more scalable alternative to k-means++ is k-means|| (#4357). However, proper benchmarking of the DAAL implementation would be needed to tell what the actual benefit of implementing this alternative in scikit-learn would be.
