Skip to content

Extending mpi4jax with XPU support #226

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 110 commits into from
May 4, 2024

Conversation

jczaja
Copy link
Contributor

@jczaja jczaja commented Feb 1, 2024

Hi,

This PR is a draft of changes enabling Intel XPU (via SYCL) . Changes are modeled after GPU (Cuda) integration so hopefully it should be easy to understand.

Unit tests pass rate:

Currently on our setup with XPU We got:

mpirun -np 2 pytest .
============ 1 failed, 122 passed, 3 skipped, 2 warnings in 59.98s =============

which match result when We run on purely CPU (test_common.py I got failing for some reason on both XPU and CPU execution)

Usage:

New env var was added MPI4JAX_USE_SYCL_MPI by analogy to MPI4JAX_USE_CUDA_MPI. MPI4JAX_USE_SYCL_MPI=1 makes mpi4jax to assume that MPI implementation used can work with XPU buffers (GPU/XPU aware) and MPI4JAX_USE_SYCL_MPI=0 makes mpi4jax to assume that MPI implementation cannot work XPU buffers and there is a copy of XPU buffer into allocated CPU buffer (analogically like it is done for CUDA).

TODO:

  • Extend github action to have test for building XPU support.
  • Test against different implementation of MPI (OpenMPI, IntelMPI)
  • Write a doc
  • Polish code

Concerns:

  1. Collective ops (python code) of CUDA are almost the same code as SYCL (just different import of cython module) , so It would be good to make a common code for that for the ease of maintaining.
  2. Cython part of XPU is using C++ (see setup.py) so perhaps It wold be good to discuss what C++ features (which standard) can be used within code e.g. Currently very minimal C++ features are used .

Any suggestions are welcomed!

@wozna, @sfraczek, @bartekkuncer, @shssf

@jczaja jczaja marked this pull request as ready for review February 23, 2024 16:01
@jczaja
Copy link
Contributor Author

jczaja commented Feb 23, 2024

@dionhaefner , @PhilipVinc
This PR is ready to be reviewed now. One concern is that codecov CI is failing and I'm not sure if we can do something about as unit tests are not running on XPU neither on GPU in this github CI. So we added more code that is not covered in github CI.

Anyway , We are looking forward to your feedback.

Copy link
Collaborator

@dionhaefner dionhaefner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This looks high-quality, let's fix some minor issues.

In case this breaks in the future: Who is going to maintain this code?

@jczaja
Copy link
Contributor Author

jczaja commented Feb 29, 2024

Thanks! This looks high-quality, let's fix some minor issues.

In case this breaks in the future: Who is going to maintain this code?

@dionhaefner Thanks for reviewing this PR. This functionality will be maintained by Intel JAX GPU teams . I will send some more info via email

We started to work on suggestions you made and will update PR as soon as we have something.

@jczaja
Copy link
Contributor Author

jczaja commented Mar 5, 2024

@dionhaefner , @PhilipVinc We adressed (hopefully) all the questions and comments and updated PR with relevant changes. So we are ready for second round of review. Please review!

@dionhaefner
Copy link
Collaborator

Thanks, I like the reduced code duplication. Let's get some final changes done then merge this.

@dionhaefner
Copy link
Collaborator

@jczaja Could you please address the last 3 comments so we can get this merged?

@jczaja
Copy link
Contributor Author

jczaja commented Apr 11, 2024

@dionhaefner I apologize for lack of activity from our side. The reason were explained in an email I sent some time ago. Let my introduce my colleague @Zantares who will take over maintaining this work.

@Zantares
Copy link

@jczaja Could you please address the last 3 comments so we can get this merged?

Hi @dionhaefner , we have taken related work and will give the feedback in 1 week, thanks for your patience!

@Dboyqiao
Copy link

@jczaja Could you please address the last 3 comments so we can get this merged?

@dionhaefner we have addressed the left 3 comments. could you help to review? Thank you!

Copy link
Collaborator

@dionhaefner dionhaefner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks all!

@PhilipVinc You want to give it a final pass?

@dionhaefner dionhaefner merged commit ffd431b into mpi4jax:master May 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants