Truly parallel execution of pairwise_kernels and pairwise_distances #29587
Some CPU-intensive operations in numpy or pandas do release the GIL, so the threading backend might be a good choice. Furthermore, CPython 3.13 will come with an optional build flag that should allow running without any GIL (also known as the "free threading" mode of Python). It's not yet operational (it can cause segfaults in some cases), but @ngoldbaum, @lesteve and others are making progress at identifying and fixing the remaining problems. See #28978 for a tracking issue on the scikit-learn side. It could be interesting to see if that can help with your specific workload before deciding to embark on a specific refactoring strategy for these functions.
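As a quick illustration of why the backend choice matters here: a pure-Python loop holds the GIL on a standard CPython build, so threads give it no speed-up, while a free-threaded build (or GIL-releasing numpy calls) can scale. A minimal stdlib-only sketch (the timings are printed, not asserted, since they depend on the machine and build):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def py_work(n):
    # Pure-Python loop: holds the GIL on a standard CPython build.
    total = 0
    for i in range(n):
        total += i * i
    return total

N, WORKERS = 200_000, 4

t0 = time.perf_counter()
serial = [py_work(N) for _ in range(WORKERS)]
t_serial = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=WORKERS) as ex:
    threaded = list(ex.map(py_work, [N] * WORKERS))
t_threads = time.perf_counter() - t0

# On a GIL build, t_threads is roughly t_serial (no parallel speed-up);
# on a free-threaded build it should approach t_serial / WORKERS.
print(f"serial: {t_serial:.3f}s  threaded: {t_threads:.3f}s")
```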
I am worried about relying on this too much. In retrospect, I have the feeling that it's a bit of brittle black magic that is complex to debug and maintain, and that it is hard to reason about its performance implications (both in terms of memory usage and computational overhead). More generally, process-based parallelism can lead to hard-to-debug situations (see the recently opened #29579 for instance), so I would rather rely more on thread-based parallelism than process-based parallelism in scikit-learn in the future.
Hi @ogrisel,
True, that's why I wrote that it is especially severe for user-defined CPU-bound metric functions. I will try to put together some benchmarks for a common sklearn kernel to see the effect there.
That would be super nice to have. The question is how long it might take to make it practically usable in sklearn.
I observed a huge impact for my user-defined CPU-intensive metric function but for most people, predefined sklearn metrics are more relevant so I will benchmark that first.
Totally agree, automatic memmapping was just a side note
Sure, threading is better if it provides the same speed-up; it is just very often not the case with Python. That's why many sklearn functions use the default joblib backend based on multiprocessing. If free threading proves to work well and becomes available in sklearn, it might help a lot. But as of now, multiprocessing is the only way to get full parallelism. I will soon provide some benchmark data.
This is already usable in the sense that we have a CI for CPython 3.13 free-threaded and all the scikit-learn tests pass. If you have time to try CPython 3.13 free-threaded on your particular use case, feedback would be super useful and welcome 🙏. As of today, you have a few different ways to install CPython 3.13 free-threaded locally (through conda, your Linux distribution package manager, python.org downloads, etc.); see the py-free-threading.github.io doc. numpy, scipy, scikit-learn, and other projects have development wheels for CPython 3.13 free-threaded; see this doc. With this kind of ongoing work there may be some caveats along the road; for example, my understanding is that there is currently a hit on single-threaded performance in CPython 3.13 free-threaded, but this will be improved in the future. Not sure about the details; this may be quickly mentioned at one point in Anthony Shaw's PyCon US 2024 talk video.
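When trying a free-threaded build, it can help to verify at runtime that the GIL is actually disabled, since a free-threaded interpreter can still re-enable the GIL (e.g. when an incompatible extension module is imported). A small sketch; `sys._is_gil_enabled()` only exists on CPython 3.13+, so we fall back gracefully on older versions:

```python
import sys
import sysconfig

def gil_status():
    """Return (free_threaded_build, gil_currently_enabled)."""
    # Py_GIL_DISABLED is set at build time for free-threaded CPython.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() reports the runtime state on CPython 3.13+;
    # on older interpreters the GIL is always enabled.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check is not None else True
    return free_threaded_build, gil_enabled

print(gil_status())
```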
@ogrisel Multithreading works better there in general. However, for the CPU-bound case, multiprocessing is way better: in fact, threading has no effect at all for a CPU-bound task. Also, slicing both X and Y decreases the communication and memory overhead, and maybe there is an even better slicing scheme than the one I used. You can find the code below. The problem I see is that the user does not have the option to choose between threading and multiprocessing. As the joblib manual states, one should not hardcode the backend, so the backend choice should be left to the user (cc @lesteve).

```python
import math
import time
from functools import partial

import matplotlib.pyplot as plt
import numpy as np
from joblib import Parallel, delayed, effective_n_jobs, parallel_config
from sklearn.metrics.pairwise import (
    _pairwise_callable,
    _parallel_pairwise,
    _return_float_dtype,
    rbf_kernel,
)
from sklearn.utils import gen_even_slices
from sklearn.utils.validation import _num_samples


def rbf_kernel2(x, y, gamma=None):
    # Pure-Python RBF kernel between two vectors (CPU-bound, GIL-holding).
    if gamma is None:
        gamma = 1.0 / len(x)
    k = 0.0
    for i in range(len(x)):
        k += (x[i] - y[i]) ** 2
    return math.exp(-gamma * k)


def _parallel_pairwise2(X, Y, func, n_jobs, **kwds):
    """Break the pairwise matrix into n_jobs even slices of Y
    and compute them in parallel."""
    if Y is None:
        Y = X
    X, Y, dtype = _return_float_dtype(X, Y)
    if effective_n_jobs(n_jobs) == 1:
        return func(X, Y, **kwds)
    ret = np.empty((X.shape[0], Y.shape[0]), dtype=dtype, order="F")
    slices = list(gen_even_slices(_num_samples(Y), effective_n_jobs(n_jobs)))
    out = Parallel(n_jobs=n_jobs)(
        delayed(func)(X, Y[s], **kwds) for s in slices
    )
    for i, s in enumerate(slices):
        ret[:, s] = out[i]
    return ret


def _parallel_pairwise3(X, Y, func, n_jobs, **kwds):
    """Break the pairwise matrix into an even grid of X and Y slices
    and compute them in parallel."""
    if Y is None:
        Y = X
    X, Y, dtype = _return_float_dtype(X, Y)
    if effective_n_jobs(n_jobs) == 1:
        return func(X, Y, **kwds)
    ret = np.empty((X.shape[0], Y.shape[0]), dtype=dtype, order="F")
    eff_jobs = effective_n_jobs(n_jobs)
    n_jobs1 = int(np.sqrt(eff_jobs))
    n_jobs2 = eff_jobs // n_jobs1
    slices1 = list(gen_even_slices(_num_samples(X), n_jobs1))
    slices2 = list(gen_even_slices(_num_samples(Y), n_jobs2))
    out = Parallel(n_jobs=n_jobs)(
        delayed(func)(X[s1], Y[s2], **kwds)
        for s1 in slices1 for s2 in slices2
    )
    i = 0
    for s1 in slices1:
        for s2 in slices2:
            ret[s1, s2] = out[i]
            i += 1
    return ret


for n_samples, metric, title in [
    (5000, rbf_kernel, 'IO-bound'),
    (100, partial(_pairwise_callable, metric=rbf_kernel2), 'CPU-bound'),
]:
    n_jobs = np.arange(1, 11)
    X, Y = np.random.rand(n_samples, 500), np.random.rand(n_samples, 500)
    results = [[] for _ in range(5)]
    for n in n_jobs:
        with parallel_config(backend='threading'):
            result = %timeit -o -n 1 -r 10 _parallel_pairwise(X, Y, metric, n_jobs=n)
            results[0].append(result)
            result = %timeit -o -n 1 -r 10 _parallel_pairwise2(X, Y, metric, n_jobs=n)
            results[1].append(result)
            result = %timeit -o -n 1 -r 10 _parallel_pairwise3(X, Y, metric, n_jobs=n)
            results[2].append(result)
        with parallel_config(backend='loky'):
            result = %timeit -o -n 1 -r 10 _parallel_pairwise2(X, Y, metric, n_jobs=n)
            results[3].append(result)
            result = %timeit -o -n 1 -r 10 _parallel_pairwise3(X, Y, metric, n_jobs=n)
            results[4].append(result)
        print()
    times = [[np.mean(result.timings) for result in res] for res in results]
    for scale in ['linear', 'log']:
        plt.figure()
        plt.title('_parallel_pairwise, ' + title + ', ' + scale + ' scale, '
                  + str(n_samples) + ' samples')
        plt.xlim(np.min(n_jobs), np.max(n_jobs))
        if scale == 'linear':
            plt.ylim(0, np.max(times))
        plt.xlabel('# of CPUs')
        plt.ylabel('times [s]')
        plt.xscale(scale)
        plt.yscale(scale)
        plt.plot(n_jobs, times[0], label='threading: original')
        plt.plot(n_jobs, times[1], label='threading: mod. Y slicing')
        plt.plot(n_jobs, times[2], label='threading: mod. X+Y slicing')
        plt.plot(n_jobs, times[3], label='multiprocessing: mod. Y slicing')
        plt.plot(n_jobs, times[4], label='multiprocessing: mod. X+Y slicing')
        plt.legend()
        plt.show()
```
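The slicing logic in the snippet above relies on `sklearn.utils.gen_even_slices`. To make its behavior concrete without requiring scikit-learn, here is a stand-in `even_slices` generator that follows the same splitting rule (contiguous, near-even chunks, with the remainder spread over the first slices), plus the near-square grid factorization used by `_parallel_pairwise3`:

```python
import math

def even_slices(n, n_packs):
    """Yield contiguous slices splitting range(n) into n_packs near-even
    chunks (same splitting rule as sklearn.utils.gen_even_slices)."""
    start = 0
    for pack_num in range(n_packs):
        this_n = n // n_packs
        if pack_num < n % n_packs:
            this_n += 1  # distribute the remainder over the first packs
        if this_n > 0:
            yield slice(start, start + this_n)
            start += this_n

print(list(even_slices(10, 3)))  # [slice(0, 4), slice(4, 7), slice(7, 10)]

# Near-square grid of workers for X+Y slicing, as in _parallel_pairwise3:
eff_jobs = 6
g1 = int(math.sqrt(eff_jobs))  # row slices of X
g2 = eff_jobs // g1            # column slices of Y
print(g1, g2)                  # 2 3
```

Note that `g1 * g2` can be smaller than `eff_jobs` when the worker count is not a product of two near-equal factors (e.g. a prime), which is one reason the grid slicing is suboptimal for some CPU counts.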
Thank you very much for your analysis. Looking forward to seeing the results with free-threading Python if someone has time to set it up.
@stepan-srsen could you add the output of your dependency versions? I am trying to run your snippet and, full disclosure, for now I am getting weird results on vanilla Python, so I must be doing something wrong; I need to have a closer look ...
@lesteve @ogrisel Everything, even multiprocessing, was tested with the GIL turned off. None of the approaches actually accelerates the calculation in this case/setup when allowing parallelism inside the called functions. These results correspond to the results from the regular Python setup above. Interestingly, BLAS threading can speed up single-core execution, but its interplay with the parallelism used within the pairwise functions is unclear. For the CPU-bound tasks, the picture is clear: free threading performs the same as multiprocessing or even slightly better (for numbers of CPUs for which the slicing is suboptimal), and the best times are basically the same as in my previous setup. I set the BLAS libraries to single-threaded mode. Dependency versions in the new setup:
@stepan-srsen just to make sure: do you set the BLAS libraries to single-threaded mode? I think not setting this could distort the comparison. I was also wondering how the outer-loop parallelism performance (i.e. what you are showing) with the BLAS library in single-threaded mode compares to no parallelism in the outer loop while letting the BLAS library use all the cores, but I did not have time to look at this ...
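One common way to pin the BLAS/OpenMP thread pools to a single thread for a benchmark like this is via environment variables. A sketch; the variable names below cover the usual OpenMP, OpenBLAS, and MKL builds, and they must be set before numpy/scipy first load their BLAS (otherwise the threadpoolctl package can limit the pools at runtime instead):

```python
import os

# Pin BLAS/OpenMP thread pools to one thread so that only the outer
# joblib loop parallelizes. Must run before numpy/scipy are imported.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

print({var: os.environ[var]
       for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")})
```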
That's already great news! Thanks very much for taking the time to run this. For the "IO-bound" case, it would be worth investigating the interaction with the different implementations of multithreaded BLAS used internally by numpy and scikit-learn (via scipy).
Thanks for updating the benchmark results. Given those results, I think we can keep scikit-learn's current behavior.
@lesteve @ogrisel
+1 for using the
Hi @ogrisel @lesteve, another thought: will Python 3.13 support free threading by default, or will it need some special compilation? If the latter, then sklearn shouldn't just assume that the user compiles Python with an experimental feature. BTW, you can remove the Needs Benchmarks and Needs Reproducible Code tags.
The official Python 3.13 binaries provided by python.org will not be free-threaded by default, but some conda channels (and maybe other distributions) will ship both free-threaded and GIL-based Python versions. scikit-learn will probably ship wheels for both ABIs, and users will be free to install what they want. If conda-forge is ready in time, we will also ship free-threaded scikit-learn packages on conda-forge. I don't think the full Python ecosystem will be 100% free-threading ready at the time of the CPython 3.13 release (in October), but I hope that most of the core scientific packages will be ready. My point is that I would rather invest dev effort in making free threading work as well as possible, as early as possible (in 2024 or 2025), than invest effort in working around GIL and process-based parallelism limitations in the medium term. I am fine with short-term pragmatic improvements for process-based parallelism if they do not add too much complexity to our code bases (scikit-learn, joblib, loky), though.
@ogrisel Results using the function below and NOT using free threading:
Describe the workflow you want to enable

Both `pairwise_kernels` and `pairwise_distances` call the `_parallel_pairwise` function, which is (contrary to its name) not parallel as it enforces the threading backend. Therefore, these functions are terribly slow, especially for computationally expensive user-defined metrics. I understand that the reasons for the threading backend are possibly large memory demands and data communication overhead, but I suggest a different approach. Also, the documentation for these functions talks about parallel execution and processes, which is currently simply not true.

Describe your proposed solution

The memory and data communication issues can be reduced by a smarter distribution of the input data to individual processes. Right now, only `Y` is sliced in the `_parallel_pairwise` function, which is suboptimal for parallel processing. Both `X` and `Y` should be sliced to lower the demands of multiprocessing. For example, for 100x100 `X` and `Y` distributed to 100 processes, we have to copy 100+1 inputs to every process when slicing only `Y`, but only 10+10 when slicing both `X` and `Y`. As a result, multiprocessing can be allowed. Also, joblib does automatic memmapping in some cases.

Alternatively, at least the documentation for `pairwise_kernels` and `pairwise_distances` should be corrected.

Describe alternatives you've considered, if relevant
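The 100+1 vs 10+10 figures above can be reproduced with a little arithmetic. A hedged sketch (the function name and signature are made up for illustration; the grid factorization mirrors the near-square split proposed above):

```python
import math

def rows_copied_per_worker(n_x, n_y, n_workers, slice_x_too):
    """Rows of input data shipped to each worker under the two schemes."""
    if slice_x_too:
        # Near-square grid of slices: g1 row-slices of X, g2 column-slices of Y.
        g1 = int(math.sqrt(n_workers))
        g2 = n_workers // g1
        return math.ceil(n_x / g1) + math.ceil(n_y / g2)
    # Current scheme: full X plus one even slice of Y per worker.
    return n_x + math.ceil(n_y / n_workers)

print(rows_copied_per_worker(100, 100, 100, slice_x_too=False))  # 101
print(rows_copied_per_worker(100, 100, 100, slice_x_too=True))   # 20
```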
No response
Additional context
No response