Description
Describe the workflow you want to enable
Both the `pairwise_kernels` and `pairwise_distances` functions call the `_parallel_pairwise` function, which (contrary to its name) is not truly parallel, as it enforces the threading backend. These functions are therefore terribly slow, especially for computationally expensive user-defined metrics. I understand that the threading backend was likely chosen to avoid large memory demands and data-communication overhead, but I suggest a different approach. Moreover, the documentation for these functions talks about parallel execution and processes, which is currently simply not true.
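For illustration, the current scheme can be sketched roughly as follows: every task sees the full `X` and only a slice of `Y`, and the results are stacked column-wise. This is a simplified stand-in for `_parallel_pairwise` (the function names and the hard-coded Euclidean metric are my own, not scikit-learn's):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def euclidean_block(X, Y_block):
    # Distances between all rows of X and one slice of Y.
    return np.sqrt(((X[:, None, :] - Y_block[None, :, :]) ** 2).sum(-1))

def pairwise_sliced_y(X, Y, n_jobs=4):
    # Mirrors the current approach: only Y is sliced, so each of the
    # n_jobs tasks must be given the entire X as well.
    slices = np.array_split(np.arange(Y.shape[0]), n_jobs)
    with ThreadPoolExecutor(max_workers=n_jobs) as ex:
        blocks = list(ex.map(lambda s: euclidean_block(X, Y[s]), slices))
    return np.hstack(blocks)

X = np.random.rand(50, 3)
Y = np.random.rand(40, 3)
D = pairwise_sliced_y(X, Y)
print(D.shape)  # (50, 40)
```

With a threading backend this costs nothing extra, but switching it to processes would mean shipping all of `X` to every worker, which is exactly the overhead discussed below.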
Describe your proposed solution
The memory and data-communication issues can be reduced by a smarter distribution of the input data to individual processes. Right now, only `Y` is sliced in the `_parallel_pairwise` function, which is suboptimal for parallel processing. Both `X` and `Y` should be sliced to lower the cost of multiprocessing. For example, with 100x100 `X` and `Y` distributed to 100 processes, we have to copy 100 + 1 rows of input to every process when slicing only `Y`, but only 10 + 10 when slicing both `X` and `Y`. As a result, the multiprocessing backend could be allowed. Also, joblib performs automatic memmapping in some cases.
Alternatively, at least the documentation for `pairwise_kernels` and `pairwise_distances` should be corrected.
Describe alternatives you've considered, if relevant
No response
Additional context
No response