KMeans(init='k-means++') performance issue with OpenBLAS #17334
As @jeremiedbb said during the core dev meeting, it might be caused by OpenBLAS trying to use 4 threads on 2 physical cores / 4 logical cores, which is not an optimal strategy. This might be related to OpenMathLib/OpenBLAS#1881, although the above benchmarks would need to be rerun, controlling the sizes of the outer OpenMP and inner OpenBLAS loops, to understand exactly what's going on.
Not exactly. The BLAS is always limited to 1 thread. Setting OMP_NUM_THREADS will impact the OpenMP loop (over data chunks). On my laptop I observe, like you, that using logical cores is detrimental with OpenBLAS (also with MKL, but much less so). I don't understand why changing the BLAS makes a difference, since it should be single-threaded (I checked that it is effectively single-threaded).
I agree. Maybe the CPU cache usage pattern of OpenBLAS is different. Or maybe the inner BLAS is not limited to 1 thread as it's supposed to be when we use OpenBLAS.
Hmm, I just tried your snippet while setting OPENBLAS_NUM_THREADS, and there's a big difference between not setting it and setting it to 1, which seems to indicate that OpenBLAS is actually not properly limited to 1 thread.
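One way to check this directly (a sketch using threadpoolctl, the library scikit-learn itself relies on, not code from the original thread) is to inspect the loaded BLAS thread pools and cap them at runtime, independently of OMP_NUM_THREADS:

```python
# Inspect the thread pools of any loaded BLAS (OpenBLAS, MKL, ...) and
# force them to a single thread, regardless of environment variables.
from threadpoolctl import threadpool_info, threadpool_limits

for pool in threadpool_info():
    print(pool["user_api"], pool.get("internal_api"), pool["num_threads"])

with threadpool_limits(limits=1, user_api="blas"):
    # run the k-means snippet here; BLAS calls are now single-threaded
    for pool in threadpool_info():
        if pool["user_api"] == "blas":
            assert pool["num_threads"] == 1
```

Printing `threadpool_info()` before and after setting `OPENBLAS_NUM_THREADS` would show whether the limit is actually being honored.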
Actually it's the k-means++ init's fault! It scales poorly, especially when using hyperthreads. Maybe you can turn this issue into a k-means++ issue?
That's very good news. I completely forgot about the init. I should have started by profiling before investigating the behavior of the main loop.
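A quick way to confirm that the init dominates is to time the same fit with `init='random'` versus `init='k-means++'`; this sketch is not from the thread, and the dataset shape and parameters are illustrative assumptions:

```python
# Compare wall-clock time of KMeans under the two init strategies;
# dataset and parameter sizes are illustrative assumptions.
import time

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.randn(5_000, 30)

timings = {}
for init in ("random", "k-means++"):
    tic = time.perf_counter()
    KMeans(n_clusters=50, init=init, n_init=1, max_iter=10,
           random_state=0).fit(X)
    timings[init] = time.perf_counter() - tic
    print(f"init={init!r}: {timings[init]:.3f}s")
```

If the slowdown lives in the init, the gap between the two timings should grow with the thread count while the `init='random'` time stays roughly flat.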
Yes, but it's still an issue for the
True :)
So in summary we can say that the k-means++ code is indeed suffering from OpenMathLib/OpenBLAS#1881, right?
I think so, but it might be more specific. In k-means++ the matrix multiplication is
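For context (my paraphrase, not from the original comment): the candidate-to-sample distances in k-means++ are computed via the squared-norm expansion, whose dominant cost is a very tall-and-skinny GEMM handed to the BLAS, exactly the shape where threading heuristics can backfire. A sketch of that expansion (sizes are illustrative):

```python
# ||x - c||^2 = ||x||^2 - 2*x.c + ||c||^2: the heavy term is the
# -2 * X @ C.T product, a tall-and-skinny GEMM (10000x50 by 50x4).
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(10_000, 50)  # n_samples x n_features (illustrative)
C = rng.randn(4, 50)       # a handful of candidate centers per step

d2 = (
    (X ** 2).sum(axis=1)[:, None]
    - 2.0 * (X @ C.T)
    + (C ** 2).sum(axis=1)[None, :]
)
```

With so few columns on one side, there is little work for a BLAS to parallelize, so oversubscribing threads on such a product can easily cost more than it saves.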
A supposedly scalable alternative to k-means++ is k-means|| (#4357). However, proper benchmarking of the DAAL implementation would be needed to tell what the actual benefit of implementing this alternative in scikit-learn would be.
I open this issue to investigate a performance problem that might be related to #17230.
I adapted the reproducer of #17230 to display more info and make it work on a medium-size random dataset.
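The adapted reproducer itself is not included in this excerpt; a minimal sketch along these lines (dataset shape, `n_clusters`, and other parameters are my own assumptions) shows the kind of timing being measured:

```python
# Hypothetical minimal reproducer: time KMeans(init='k-means++') on a
# random dataset; all sizes below are illustrative assumptions.
import time

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
X = rng.randn(5_000, 30)  # medium-size random data (assumed shape)

tic = time.perf_counter()
km = KMeans(n_clusters=50, init="k-means++", n_init=1,
            random_state=0).fit(X)
print(f"fit time: {time.perf_counter() - tic:.3f}s")
```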
I tried to run this on Linux with scikit-learn master (therefore including the #16499 fix) with 2 different builds of scipy (with OpenBLAS from PyPI and MKL from Anaconda) and various values for OMP_NUM_THREADS (unset, OMP_NUM_THREADS=1, OMP_NUM_THREADS=2, OMP_NUM_THREADS=4) on a laptop with 2 physical CPU cores (4 logical CPUs).
In both cases, I use the same scikit-learn binaries (built with GCC in editable mode); I just change the environment.
The summary is:
- with MKL, higher values of OMP_NUM_THREADS (or leaving it unset) are faster than OMP_NUM_THREADS=1;
- with OpenBLAS, unsetting OMP_NUM_THREADS or setting a large value for it is significantly slower than the forced sequential run with OMP_NUM_THREADS=1.
I will include my runs in the first comment.
/cc @jeremiedbb