KMeans(init='k-means++') performance issue with OpenBLAS #17334

Open
ogrisel opened this issue May 25, 2020 · 11 comments

@ogrisel
Member

ogrisel commented May 25, 2020

I open this issue to investigate a performance problem that might be related to #17230.

I adapted the reproducer of #17230 to display more info and make it work on a medium-sized random dataset.

from sklearn import cluster
from time import time
from pprint import pprint
from threadpoolctl import threadpool_info
import numpy as np


pprint(threadpool_info())
rng = np.random.RandomState(0)
data = rng.randn(5000, 50)
t0_global = time()
for k in range(1, 15):
    t0 = time()
    # print(f"Running k-means with k={k}: ", end="", flush=True)
    cluster.KMeans(
        n_clusters=k,
        random_state=42,
        n_init=10,
        max_iter=2000,
        algorithm='full',
        init='k-means++').fit(data)
    # print(f"{time() - t0:.3f} s")

print(f"Total duration: {time() - t0_global:.3f} s")

I tried to run this on Linux with scikit-learn master (therefore including the #16499 fix), with 2 different builds of scipy (OpenBLAS from PyPI and MKL from Anaconda) and various values of OMP_NUM_THREADS (unset, OMP_NUM_THREADS=1, OMP_NUM_THREADS=2, OMP_NUM_THREADS=4), on a laptop with 2 physical CPU cores (4 logical CPUs).

In both cases I use the same scikit-learn binaries (built with GCC in editable mode); only the environment changes.
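As a side note, the same kind of limit can also be applied at runtime with threadpoolctl instead of changing the environment. This is only a sketch for illustration (it reuses cluster and data from the reproducer above, with an arbitrary n_clusters), not one of the benchmarked configurations:

from threadpoolctl import threadpool_limits

# Limit every supported threadpool (OpenMP and BLAS) to a single thread,
# roughly matching the OMP_NUM_THREADS=1 configurations benchmarked below.
with threadpool_limits(limits=1):
    cluster.KMeans(n_clusters=8, random_state=42, n_init=10,
                   algorithm='full', init='k-means++').fit(data)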

The summary is:

  • with MKL there is no problem: large or unset values of OMP_NUM_THREADS are faster than OMP_NUM_THREADS=1;
  • with OpenBLAS, leaving OMP_NUM_THREADS unset or setting it to a large value is significantly slower than a forced sequential run with OMP_NUM_THREADS=1.

I will include my runs in the first comment.

/cc @jeremiedbb

@ogrisel
Member Author

ogrisel commented May 25, 2020

  • mkl env with OMP_NUM_THREADS unset:
[{'filepath': '/home/ogrisel/miniconda3/envs/pylatest/lib/libgomp.so.1.0.0',
  'internal_api': 'openmp',
  'num_threads': 4,
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'version': None},
 {'filepath': '/home/ogrisel/miniconda3/envs/sklearn-mkl/lib/libmkl_rt.so',
  'internal_api': 'mkl',
  'num_threads': 2,
  'prefix': 'libmkl_rt',
  'threading_layer': 'intel',
  'user_api': 'blas',
  'version': '2020.0.1'},
 {'filepath': '/home/ogrisel/miniconda3/envs/sklearn-mkl/lib/libiomp5.so',
  'internal_api': 'openmp',
  'num_threads': 4,
  'prefix': 'libiomp',
  'user_api': 'openmp',
  'version': None}]
Total duration: 4.032 s
  • mkl env + OMP_NUM_THREADS=1
[{'filepath': '/home/ogrisel/miniconda3/envs/pylatest/lib/libgomp.so.1.0.0',
  'internal_api': 'openmp',
  'num_threads': 1,
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'version': None},
 {'filepath': '/home/ogrisel/miniconda3/envs/sklearn-mkl/lib/libmkl_rt.so',
  'internal_api': 'mkl',
  'num_threads': 1,
  'prefix': 'libmkl_rt',
  'threading_layer': 'intel',
  'user_api': 'blas',
  'version': '2020.0.1'},
 {'filepath': '/home/ogrisel/miniconda3/envs/sklearn-mkl/lib/libiomp5.so',
  'internal_api': 'openmp',
  'num_threads': 1,
  'prefix': 'libiomp',
  'user_api': 'openmp',
  'version': None}]
Total duration: 5.951 s
  • openblas env with OMP_NUM_THREADS unset:
[{'filepath': '/home/ogrisel/miniconda3/envs/pylatest/lib/libgomp.so.1.0.0',
  'internal_api': 'openmp',
  'num_threads': 4,
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'version': None},
 {'filepath': '/home/ogrisel/miniconda3/envs/pylatest/lib/python3.8/site-packages/numpy/.libs/libopenblasp-r0-34a18dc3.3.7.so',
  'internal_api': 'openblas',
  'num_threads': 4,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.7'}]
Total duration: 23.408 s
  • openblas env with OMP_NUM_THREADS=1:
[{'filepath': '/home/ogrisel/miniconda3/envs/pylatest/lib/libgomp.so.1.0.0',
  'internal_api': 'openmp',
  'num_threads': 1,
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'version': None},
 {'filepath': '/home/ogrisel/miniconda3/envs/pylatest/lib/python3.8/site-packages/numpy/.libs/libopenblasp-r0-34a18dc3.3.7.so',
  'internal_api': 'openblas',
  'num_threads': 1,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.7'}]
Total duration: 7.295 s
  • openblas env with OMP_NUM_THREADS=2
[{'filepath': '/home/ogrisel/miniconda3/envs/pylatest/lib/libgomp.so.1.0.0',
  'internal_api': 'openmp',
  'num_threads': 2,
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'version': None},
 {'filepath': '/home/ogrisel/miniconda3/envs/pylatest/lib/python3.8/site-packages/numpy/.libs/libopenblasp-r0-34a18dc3.3.7.so',
  'internal_api': 'openblas',
  'num_threads': 2,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.7'}]
Total duration: 7.567 s

@ogrisel
Member Author

ogrisel commented May 25, 2020

As @jeremiedbb said during the core dev meeting, it might be caused by OpenBLAS trying to use 4 threads on 2 physical cores / 4 logical cores, which is not an optimal strategy.

This might be related to OpenMathLib/OpenBLAS#1881, although the above benchmarks would need to be rerun while controlling the sizes of the outer OpenMP and inner OpenBLAS loops to understand exactly what's going on.
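For reference, a quick way to check how many physical vs logical cores the machine exposes (a hypothetical check, assuming psutil is installed; not part of the original benchmarks):

import os
import psutil  # assumption: psutil is available in the environment

print("logical CPUs: ", os.cpu_count())                   # 4 on this laptop
print("physical CPUs:", psutil.cpu_count(logical=False))  # 2 on this laptop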

@jeremiedbb
Member

Not exactly. The BLAS is always limited to 1 thread. Setting OMP_NUM_THREADS will impact the OpenMP loop (over data chunks). On my laptop I observe, like you, that using logical cores is detrimental with OpenBLAS (also with MKL, but much less). I don't understand why changing the BLAS makes a difference, since it should be single-threaded (I checked that it effectively is).

@ogrisel
Member Author

ogrisel commented May 25, 2020

I agree. Maybe the CPU cache usage pattern of OpenBLAS is different. Or maybe the inner BLAS is not limited to 1 thread as it's supposed to be when we use OpenBLAS.

@jeremiedbb
Member

jeremiedbb commented May 25, 2020

Hum, I just tried your snippet while setting OPENBLAS_NUM_THREADS, and there is a big difference between not setting it and setting it to 1, which seems to indicate that OpenBLAS is actually not properly limited to 1 thread.
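For what it's worth, here is a sketch (not part of the original experiment, reusing pprint, cluster and data from the reproducer) of how the BLAS pool alone can be limited at runtime, which should behave like exporting OPENBLAS_NUM_THREADS=1:

from threadpoolctl import threadpool_limits, threadpool_info

# Limit only the BLAS threadpool; the OpenMP pool keeps its default size.
with threadpool_limits(limits=1, user_api="blas"):
    pprint(threadpool_info())  # OpenBLAS should now report num_threads: 1
    cluster.KMeans(n_clusters=10, random_state=42, n_init=10,
                   algorithm='full', init='k-means++').fit(data)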

@jeremiedbb
Member

Actually it's the k-means++ init's fault! It scales poorly, especially when using hyperthreads.
The timings are back to expected when you set the init to 'random', meaning that the OpenMP loop and the inner BLAS are properly controlled.
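A minimal sketch of that comparison (reusing cluster and data from the reproducer above; the exact n_clusters is arbitrary):

from time import time

for init in ("k-means++", "random"):
    t0 = time()
    cluster.KMeans(n_clusters=10, random_state=42, n_init=10,
                   algorithm='full', init=init).fit(data)
    print(f"init={init!r}: {time() - t0:.3f} s")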

Maybe you can turn this issue into a k-means++ issue?

@ogrisel
Member Author

ogrisel commented May 25, 2020

meaning that the OpenMP loop and the inner BLAS are properly controlled.

That's very good news. I completely forgot about the init. I should have started by profiling before investigating the behavior of the main loop.

Maybe you can turn this issue into a k-means++ issue?

Yes but it's still an issue for the KMeans estimator.

@jeremiedbb
Member

Yes but it's still an issue for the KMeans estimator.

True :)

@ogrisel ogrisel changed the title K-Means performance issue with OpenBLAS KMeans performance issue with OpenBLAS May 25, 2020
@ogrisel ogrisel changed the title KMeans performance issue with OpenBLAS KMeans(init='k-means++') performance issue with OpenBLAS May 25, 2020
@ogrisel
Member Author

ogrisel commented May 25, 2020

So in summary we can say that the k-means++ code is indeed suffering from OpenMathLib/OpenBLAS#1881, right?

@jeremiedbb
Member

jeremiedbb commented May 25, 2020

I think so, but it might be more specific. In k-means++ the matrix multiplication is (n_candidates, n_features) x (n_samples, n_features) and the number of candidates is small (~log(n_clusters)). It's possible that MKL is better at optimizing a matrix multiplication where one dimension is much smaller than the other.
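To illustrate the shapes involved, here is a hypothetical example with the dimensions of the reproducer (not code taken from scikit-learn):

import numpy as np

n_samples, n_features, n_clusters = 5000, 50, 10
n_candidates = max(1, int(np.log(n_clusters)))  # ~log(n_clusters) candidates, here 2

candidates = np.random.randn(n_candidates, n_features)
X = np.random.randn(n_samples, n_features)

# The dominant BLAS call multiplies a very "thin" matrix with the data:
# (n_candidates, n_features) @ (n_features, n_samples) -> (n_candidates, n_samples)
cross_products = candidates @ X.T
print(cross_products.shape)  # (2, 5000)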

@ogrisel
Member Author

ogrisel commented May 29, 2020

A supposedly more scalable alternative to k-means++ is k-means|| (#4357). However, proper benchmarking of the DAAL implementation would be needed to tell what the actual benefit of implementing this alternative in scikit-learn would be.
