
Use case: add distributed and GPU support to SciPy #2


Closed
rgommers opened this issue May 17, 2020 · 4 comments

@rgommers
Member

When surveying a representative set of advanced users and research software engineers in 2019 (for this NSF proposal), the single most common pain point brought up about SciPy was performance.

SciPy heavily relies on NumPy (its only non-optional runtime dependency). NumPy provides an array implementation that's in-memory, CPU-only and single-threaded. Common things users ask for are:

  • parallel algorithms (multi-threaded or multiprocessing based)
  • support for distributed arrays (with Dask in particular)
  • support for GPUs

Some parallelism, in particular via multiprocessing, can be supported. However, SciPy itself will not start depending directly on a GPU or distributed array implementation, or contain (e.g.) CUDA code; that is not maintainable given the resources for development. There is a way to provide distributed or GPU support nonetheless. Part of the solution is provided by NumPy's "array protocols" (see gh-1), which allow dispatching to other array implementations. The main problem then becomes how to know whether this will work with a particular distributed or GPU array implementation, given that no other array implementation is even close to providing full NumPy compatibility, without adding it as a hard dependency.
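To make the dispatch mechanism concrete, here is a minimal sketch of NumPy's `__array_function__` protocol (NEP 18). The `LoggingArray` class is a made-up illustration, not a real array implementation; Dask and CuPy arrays intercept calls like `np.sum` through the same hook:

```python
import numpy as np

# Hypothetical minimal wrapper demonstrating NEP 18 dispatch: calling
# np.sum(x) on a LoggingArray routes through __array_function__ instead
# of NumPy's own implementation. This is the mechanism that lets pure
# Python SciPy code work on Dask/CuPy arrays without importing them.
class LoggingArray:
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Unwrap any LoggingArray arguments, then defer to NumPy.
        args = [a.data if isinstance(a, LoggingArray) else a for a in args]
        print(f"dispatched: {func.__name__}")
        return func(*args, **kwargs)

x = LoggingArray([1, 2, 3, 4])
print(np.sum(x))  # dispatched: sum, then 10
```

A real distributed or GPU array would, of course, implement the operation itself rather than unwrap to NumPy; the point is only that `np.sum` never needs to know which implementation it is talking to.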

It's clear that SciPy functionality that relies directly on compiled extensions (C, C++, Cython, Fortran) won't work. Pure Python code can work though. There are two main possibilities:

  1. Test with another package, manually or in CI, and simply provide a list of the functionality found to work. Then make ad hoc fixes to expand that set.
  2. Start relying on a well-defined subset of the NumPy API (or a new NumPy-like API), for which compatibility is guaranteed.

Option (2) seems strongly preferable, and that "well-defined subset" is what an API standard should provide. Testing will still be needed to ensure there are no critical corner cases or bugs between array implementations, but that then becomes a very tractable task.
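The testing that remains under option (2) could look roughly like the sketch below. The `softmax` function and the backend registry are hypothetical illustrations: a pure-Python function written against a small subset of the NumPy API is run against whichever optional array backends happen to be importable, and the results are recorded rather than hard-coded as dependencies:

```python
import numpy as np

# Hypothetical pure-Python function that uses only a small, widely
# supported subset of the NumPy API (ufuncs plus max/sum reductions).
def softmax(x):
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

# Only NumPy is a hard dependency; other backends are probed optionally.
backends = {"numpy": np.asarray}
try:
    import cupy  # optional GPU backend, absent on most CI machines
    backends["cupy"] = cupy.asarray
except ImportError:
    pass

results = {}
for name, asarray in backends.items():
    try:
        softmax(asarray([1.0, 2.0, 3.0]))
        results[name] = "works"
    except Exception as exc:
        results[name] = f"fails: {exc!r}"
print(results)
```

With a standardized API subset, a matrix like `results` should be almost uniformly "works", and any entry that isn't points at a concrete, fixable incompatibility.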

@tdimitri

Certainly parallel algorithms are achievable today, and support can grow over time. Every modern machine has a multicore CPU, so everyone can benefit from this improvement.
Once the parallel algorithm is written, the distributed version is very similar. Take sum, for instance, which might be partitioned or distributed over multiple computers. Assume we have 3 computers A, B, and C, and the entire array is 300 million rows, with each computer getting 100 million rows.
If computer A uses threading to sum up its 100 million rows, then 'gathering' the local sums of its threads is very similar to gathering the local sums the computers produced from their respective arrays.
The main difference is that a network protocol is used between computers, versus a shared-memory communication model between threads.
Thus, writing the routine to scatter and gather a sum locally is very similar to doing it in a distributed setting.
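The scatter/gather pattern described above can be sketched in a few lines (the `parallel_sum` helper is illustrative, not an existing API). A distributed version would keep the same shape and swap the thread pool for a network transport:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Sketch of the scatter/gather reduction described above: split the array
# into chunks (scatter), sum each chunk on a worker thread (local reduce),
# then combine the partial sums (gather). Swapping the thread pool for a
# network transport yields the distributed version of the same routine.
def parallel_sum(a, n_workers=4):
    chunks = np.array_split(a, n_workers)         # scatter
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(np.sum, chunks))  # local sums per worker
    return sum(partials)                           # gather

x = np.arange(1_000_000)
assert parallel_sum(x) == x.sum()
```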

@rgommers
Member Author

Thanks @tdimitri. Yes, I know parallel algorithms are feasible, and SciPy has some (mostly trivial parallelism, though). There are issues that need careful thought, however: maintainability, portability, nested parallelism, distribution of binaries when you don't control the whole stack, and more. I intend to do a thorough write-up in a separate issue.

@tdimitri

I look forward to the write-up, as I believe together we can address most issues.

We need a few OS abstractions: thread creation/destruction, wakeup, and a few lock-free primitives such as an interlocked add. These exist on every major OS, so they do not require linking to Boost or other OS-abstraction packages. We can support just a few OSes for this first pass. That gives us full control of the stack and portability for the first version.

Again, threading can always be off, or default to off, with no side effects or harm to existing code.

Then we can parallelize the easiest routines first, such as unary functions. We do not have to parallelize everything; that will take time. What is important is that we begin instead of waiting for a perfect solution. It is also worthwhile to vectorize while threading.
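Parallelizing a unary function is indeed a natural first target. A hedged sketch (the `parallel_unary` helper is invented for illustration, not SciPy or NumPy API): each worker thread applies the ufunc to its own slice of a preallocated output array. NumPy ufuncs release the GIL on large inputs, so plain threads give real speedups here:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Illustrative helper: apply a unary ufunc chunk-wise from worker threads,
# each thread writing into its own disjoint slice of a preallocated output.
def parallel_unary(func, a, n_workers=4):
    out = np.empty_like(a, dtype=np.result_type(a, np.float64))
    bounds = np.linspace(0, len(a), n_workers + 1, dtype=int)

    def work(i):
        lo, hi = bounds[i], bounds[i + 1]
        func(a[lo:hi], out=out[lo:hi])  # slices are views into out

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(work, range(n_workers)))
    return out

x = np.linspace(0.0, 1.0, 10_000)
np.testing.assert_allclose(parallel_unary(np.exp, x), np.exp(x))
```

Because every thread owns a disjoint slice, no locking is needed; the interlocked-add style primitives mentioned above only become necessary for reductions where threads combine into shared state.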

@rgommers
Member Author

Added in the "Use cases" section in master, closing.
