Hybrid MPI program with JAX #262
-
Hi, I'm using […]. However, I'm not sure how to launch the […].
-
If you are using CPUs, my traditional suggestion is to assume that jax effectively uses only 1 or 2 CPUs, and therefore to launch with […]. You should also use taskset to ensure that the kernel does not move the processes around, unless that is already handled by your HPC center.
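A minimal sketch of the launch pattern suggested above, assuming Open MPI (the `OMPI_COMM_WORLD_LOCAL_RANK` variable is Open MPI-specific; Slurm and MPICH expose different ones) and a placeholder script name:

```shell
# Launch 4 MPI ranks, each pinned with taskset to its own core so the
# kernel cannot migrate the process. script.py is a placeholder for your
# actual program; adjust the core mapping for multi-socket nodes.
mpirun -np 4 bash -c 'taskset -c $OMPI_COMM_WORLD_LOCAL_RANK python script.py'
```

This pins local rank i to core i. If your scheduler (e.g. Slurm with `--cpu-bind`) already pins tasks, the taskset wrapper is redundant.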
-
Thanks for the reply! I still have a few questions.
Do you think this is true even if the function applied to the data chunks is very expensive and uses […]?
I tried to do this using MPI's `--map-by` and `--bind-to` options, but it didn't work. Do you have any examples of how to do this? Finally, for the single-program-multiple-data problem, do you think that […]? Thanks again!
-
In my experience, yes. The multithreading inside jax's jit performs less well than the MPI model. Of course, this is only true if your algorithm scales well.
I never do it myself, because on the clusters I work on it's handled automatically by slurm.
On GPUs, in my codes, sharding outperforms mpi4jax's direct GPU communication by about 10%. I have no idea why; it's probably because NCCL is better tuned than MPI. Still, it's what we see.

On CPUs, jax may use MPI or GLOO as a communication backend for sharding. MPI as a sharding backend works, but it's largely undocumented and hard to set up. I am partly responsible for getting the support there, so we have some brief docs here: https://netket.readthedocs.io/en/latest/docs/parallelization.html#mpitrampoline-backend-very-experimental . You need to compile MPITrampoline and launch jax through it. It's a mess, but performance is identical to using mpi4jax, and you get the niceties of sharding (modulo some unsupported operations).

So in essence mpi4jax performs identically to the MPI sharding backend, but is easier to set up and write code for. Sometimes sharding breaks down and starts to replicate the calculations unless you use shard_map; mpi4jax is equivalent to putting a shard_map around the whole of your code, so it's a bit more 'reliable'.
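To make the shard_map point concrete, here is a minimal sketch of the MPI-like programming model it gives you. This is an illustration, not code from the thread: the XLA flag that splits one CPU into several fake "devices" is only for demonstration, and the function names are placeholders.

```python
import os
# Demonstration only: make XLA expose 4 "devices" on a single CPU.
# Must be set before jax is imported.
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=4"

import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P

try:
    from jax import shard_map          # newer JAX releases
except ImportError:
    from jax.experimental.shard_map import shard_map

mesh = Mesh(jax.devices(), axis_names=("i",))

@jax.jit
def normalized(x):
    # The body runs once per device shard, much like one MPI rank each;
    # psum over axis "i" is the collective (an allreduce).
    def body(a):
        total = jax.lax.psum(jnp.sum(a), axis_name="i")
        return a / total
    return shard_map(body, mesh=mesh, in_specs=P("i"), out_specs=P("i"))(x)

x = jnp.arange(1.0, 9.0)   # sums to 36
print(normalized(x))       # each element divided by the global sum
```

Wrapping the whole computation in shard_map like this is, as noted above, essentially what mpi4jax gives you implicitly; without it, the partitioner is free to replicate work across devices.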
-
Thanks so much!