Replies: 2 comments 11 replies
-
I may be missing some complexities here, but this sounds like it could be done a lot simpler with …
-
Thanks for the help! As you said, for my case …
-
Hello,
First of all, I don't have much experience with MPI, so I would like to apologize in advance for my lack of jargon and knowledge.
I am working on an optimization code written in JAX. We want to scale up the optimization problems we solve, but the current state of the code only allows us to use a single GPU/CPU, so we are limited by GPU memory during the Jacobian calculation. I managed to distribute objectives to different GPUs, and we are now able to run bigger problems. However, my approach was to run different functions that have pre-distributed data on different devices via `jax.jit` with the `device` argument, inside a for loop. I want to make that for loop parallel using `mpi4jax`.

Here is a very simplified minimal working example of what I am trying to do. Sorry for the lengthy code, but I would like to keep the main structure of the code because of a bunch of other non-MPI-related design choices. The purpose of my question is specifically the `jac_error` method of the `ObjectiveFunctionParallel` class.

mpi-parallel-test-case.pdf
This is the Jupyter notebook version:
mpi-parallel-test-case.zip
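To make the current (sequential) approach concrete, here is a minimal sketch of the pattern described above, not the attached notebook: each objective holds data pre-placed on its own device, and its jitted compute function runs on that device. The `Objective` class, its toy residual, and the data shapes here are all made up for illustration; since newer JAX versions deprecate `jax.jit`'s `device` argument, this sketch places data with `jax.device_put` and lets the jitted call follow its committed inputs instead.

```python
import jax
import jax.numpy as jnp

devices = jax.devices()  # one entry per available device (CPU-only is fine)

class Objective:
    """Hypothetical stand-in for one of the real Objective classes."""

    def __init__(self, data, device):
        self.device = device
        # pre-place this objective's data on its assigned device
        self.data = jax.device_put(data, device)
        # jit the compute function; it runs where its committed inputs live
        self._compute = jax.jit(self._raw_compute)

    def _raw_compute(self, x):
        # toy residual: difference between model x and stored data
        return x - self.data

    def compute(self, x):
        # commit x to this objective's device so the jitted call runs there
        return self._compute(jax.device_put(x, self.device))

objectives = [
    Objective(jnp.full(3, float(i)), devices[i % len(devices)])
    for i in range(4)
]

x = jnp.ones(3)
# the sequential for loop the question wants to parallelize with MPI
results = [obj.compute(x) for obj in objectives]
```

With a single device every objective lands on the same hardware, but the bookkeeping is identical to the multi-GPU case.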
To give a short overview of what the code is doing, we have:

- `Objective` classes (in the actual code we have over 60 of them, with different output dimensions)
- an `ObjectiveFunction` class that has wrappers for the Jacobian and compute functions of each objective

The problem is that, for large-scale problems, we get an out-of-memory error even if we form the Jacobian with a bunch of single `jvp`s (without `vmap`ping over them).

What I have done so far:
- an `ObjectiveFunctionParallel` class that calls each `Objective`'s compute and Jacobian methods, placing the data on the required GPU
- `jax.jit`ting the methods depending on the `Objective`'s device id

I would like to open an MPI communication only when I need to take the Jacobian (otherwise I would need to put many `if rank == 0:` conditions, which is not feasible at the size of our code), execute the part where I currently have a for loop in parallel, and then close the MPI communication. All the data needed by each GPU is already stored on that GPU, so the only data transfer needed is at the end, to form the full Jacobian. We typically need to take the Jacobian around 200 times per optimization, so I need to be able to open and close the communication to prevent everything in the code from being computed multiple times on multiple processes.

I think this is possible with `mpi4py` using Dynamic Process Management, but I wanted to ask whether it is doable in `mpi4jax`? Also, I would appreciate any feedback on the implementation I have in mind! I am sorry for not being able to give a shorter explanation and MWE.
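For reference, the two pieces described above, forming a per-objective Jacobian block from single `jvp`s and stacking the blocks into the full Jacobian, can be sketched as follows. Everything here is hypothetical (`make_objective`, the sizes, the residual functions); the loop over objectives runs sequentially, and the comments mark where, in the MPI version, each rank would compute only its own blocks and the stacking would happen after a gather to one rank (e.g. `mpi4py`'s `comm.gather`).

```python
import jax
import jax.numpy as jnp

def make_objective(scale):
    # toy per-objective residual function (hypothetical)
    def f(x):
        return scale * jnp.sin(x)
    return f

x = jnp.arange(3.0)
n = x.shape[0]
objectives = [make_objective(s) for s in (1.0, 2.0)]

def jac_columns(f, x):
    # one jvp per basis vector gives one Jacobian column at a time,
    # so only a single column's intermediates are live at once
    cols = []
    for i in range(n):
        e = jnp.zeros(n).at[i].set(1.0)
        _, col = jax.jvp(f, (x,), (e,))
        cols.append(col)
    return jnp.stack(cols, axis=1)  # shape (out_dim, n)

# Sequential here; under MPI each rank would handle a subset of
# objectives, and the blocks would be gathered before stacking.
blocks = [jac_columns(f, x) for f in objectives]
full_jac = jnp.vstack(blocks)  # shape (sum of out_dims, n)
```

Each block agrees with `jax.jacfwd` of the same function; the column-at-a-time loop just trades compute for the memory that a `vmap`ped forward pass would hold.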
Best regards,