
Use case: add distributed and GPU support to SciPy #2


Closed
rgommers opened this issue May 17, 2020 · 4 comments

@rgommers
Member

When surveying a representative set of advanced users and research software engineers in 2019 (for this NSF proposal), the single most common pain point brought up about SciPy was performance.

SciPy heavily relies on NumPy (its only non-optional runtime dependency). NumPy provides an array implementation that's in-memory, CPU-only and single-threaded. Common things users ask for are:

  • parallel algorithms (multi-threaded or multiprocessing based)
  • support for distributed arrays (with Dask in particular)
  • support for GPUs

Some parallelism, in particular via multiprocessing, can be supported. However, SciPy itself will not start depending directly on a GPU or distributed array implementation, or contain (e.g.) CUDA code; that is not maintainable given the resources for development. There is a way to provide distributed or GPU support nonetheless. Part of the solution is provided by NumPy's "array protocols" (see gh-1), which allow dispatching to other array implementations. The main problem then becomes how to know whether this will work with a particular distributed or GPU array implementation, given that no other array implementation is even close to providing full NumPy compatibility, without adding it as a hard dependency.
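To make the dispatch mechanism concrete, here is a minimal sketch of NumPy's `__array_function__` protocol (NEP 18). The `LoggingArray` class is a made-up illustration, not a real array implementation; Dask and CuPy arrays intercept calls like `np.sum` through the same hook:

```python
import numpy as np

# Hypothetical minimal wrapper demonstrating NEP 18 dispatch: calling
# np.sum(x) on a LoggingArray routes through __array_function__ instead
# of NumPy's own implementation. This is the mechanism that lets pure
# Python SciPy code work on Dask/CuPy arrays without importing them.
class LoggingArray:
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Unwrap any LoggingArray arguments, then defer to NumPy.
        args = [a.data if isinstance(a, LoggingArray) else a for a in args]
        print(f"dispatched: {func.__name__}")
        return func(*args, **kwargs)

x = LoggingArray([1, 2, 3, 4])
print(np.sum(x))  # dispatched: sum, then 10
```

A real distributed or GPU array would, of course, implement the operation itself rather than unwrap to NumPy; the point is only that `np.sum` never needs to know which implementation it is talking to.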

It's clear that SciPy functionality that relies directly on compiled extensions (C, C++, Cython, Fortran) won't work. Pure Python code can work though. There are two main possibilities:

  1. Test with another package, manually or in CI, and simply provide a list of the functionality found to work. Then make ad hoc fixes to expand that set.
  2. Start relying on a well-defined subset of the NumPy API (or a new NumPy-like API), for which compatibility is guaranteed.

Option (2) seems strongly preferable, and that "well-defined subset" is what an API standard should provide. Testing will still be needed to ensure there are no critical corner cases or bugs between array implementations, but that then becomes a very tractable task.
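The testing that remains under option (2) could look roughly like the sketch below. The `softmax` function and the backend registry are hypothetical illustrations: a pure-Python function written against a small subset of the NumPy API is run against whichever optional array backends happen to be importable, and the results are recorded rather than hard-coded as dependencies:

```python
import numpy as np

# Hypothetical pure-Python function that uses only a small, widely
# supported subset of the NumPy API (ufuncs plus max/sum reductions).
def softmax(x):
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

# Only NumPy is a hard dependency; other backends are probed optionally.
backends = {"numpy": np.asarray}
try:
    import cupy  # optional GPU backend, absent on most CI machines
    backends["cupy"] = cupy.asarray
except ImportError:
    pass

results = {}
for name, asarray in backends.items():
    try:
        softmax(asarray([1.0, 2.0, 3.0]))
        results[name] = "works"
    except Exception as exc:
        results[name] = f"fails: {exc!r}"
print(results)
```

With a standardized API subset, a matrix like `results` should be almost uniformly "works", and any entry that isn't points at a concrete, fixable incompatibility.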

@tdimitri

Certainly parallel algorithms are achievable today, and support can grow over time. Every modern machine has a multicore CPU, so everyone can benefit from this improvement.
Once the parallel algorithm is written, the distributed version is very similar. Take sum, for instance, which might be partitioned or distributed over multiple computers. Assume we have 3 computers A, B, and C, and the entire array is 300 million rows, with each computer getting 100 million rows.
If computer A uses threading to sum up its 100 million rows, then 'gathering' the local sums of its threads is very similar to gathering the local sums the computers produced from their respective arrays.
The main difference is that a network protocol is used between computers, versus a shared-memory communication model between threads.
Thus, writing the routine to scatter and gather a sum locally is very similar to doing it in a distributed setting.
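The scatter/gather pattern described above can be sketched in a few lines (the `parallel_sum` helper is illustrative, not an existing API). A distributed version would keep the same shape and swap the thread pool for a network transport:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Sketch of the scatter/gather reduction described above: split the array
# into chunks (scatter), sum each chunk on a worker thread (local reduce),
# then combine the partial sums (gather). Swapping the thread pool for a
# network transport yields the distributed version of the same routine.
def parallel_sum(a, n_workers=4):
    chunks = np.array_split(a, n_workers)         # scatter
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(np.sum, chunks))  # local sums per worker
    return sum(partials)                           # gather

x = np.arange(1_000_000)
assert parallel_sum(x) == x.sum()
```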

@rgommers
Member Author

Thanks @tdimitri. Yes, I know parallel algorithms are feasible, and SciPy has some (mostly trivial parallelism, though). There are issues that need careful thought, however: maintainability, portability, nested parallelism, distribution of binaries when you don't control the whole stack, and more. I intend to do a thorough write-up in a separate issue.

@tdimitri

I look forward to the write-up, as I believe together we can address most issues.

We need a few OS abstractions: thread creation/destruction, wakeup, and a few lock-free primitives such as an interlocked add. These exist on every major OS, so they do not require linking to Boost or other OS-abstraction packages. We can support just a few OSes for this first pass. That gives us full control of the stack and portability for the first version.

Again, threading can always be off, or default to off, with no side effects or harm to existing code.

Then we can parallelize the easiest routines first, such as unary functions. We do not have to parallelize everything; that will take time. What is important is that we begin instead of waiting for a perfect solution. It is also worthwhile to vectorize while threading.
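Parallelizing a unary function is indeed a natural first target. A hedged sketch (the `parallel_unary` helper is invented for illustration, not SciPy or NumPy API): each worker thread applies the ufunc to its own slice of a preallocated output array. NumPy ufuncs release the GIL on large inputs, so plain threads give real speedups here:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Illustrative helper: apply a unary ufunc chunk-wise from worker threads,
# each thread writing into its own disjoint slice of a preallocated output.
def parallel_unary(func, a, n_workers=4):
    out = np.empty_like(a, dtype=np.result_type(a, np.float64))
    bounds = np.linspace(0, len(a), n_workers + 1, dtype=int)

    def work(i):
        lo, hi = bounds[i], bounds[i + 1]
        func(a[lo:hi], out=out[lo:hi])  # slices are views into out

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(work, range(n_workers)))
    return out

x = np.linspace(0.0, 1.0, 10_000)
np.testing.assert_allclose(parallel_unary(np.exp, x), np.exp(x))
```

Because every thread owns a disjoint slice, no locking is needed; the interlocked-add style primitives mentioned above only become necessary for reductions where threads combine into shared state.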

@rgommers
Member Author

Added in the "Use cases" section in master, closing.
