Skip to content

[BE]: Fix NVSHMEM builds, add missing 12.9 dependency and update to latest for 2.8RC #157453

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

Skylion007
Copy link
Collaborator

@Skylion007 Skylion007 commented Jul 2, 2025

Fixed our bad builds of nvshmem, (we were not building or testing before) and also updates to the latest version. Newest versions has critical support for things that would actually make it useful, like bfloat16 and float16 support.

This is a proper fix for: #157411

@Skylion007 Skylion007 requested review from eqy, atalman and nWEIdia July 2, 2025 14:12
@Skylion007 Skylion007 requested review from a team and jeffdaily as code owners July 2, 2025 14:12
Copy link

pytorch-bot bot commented Jul 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157453

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c2d6d7b with merge base 82eefae (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jul 2, 2025
@Skylion007 Skylion007 force-pushed the skylion007/update-nvshmem-3-3-9 branch from 8119930 to 99dabab Compare July 2, 2025 14:17
@Skylion007 Skylion007 changed the title [BE]: Fix NVSHMEM builds and update to latest [BE]: Fix NVSHMEM builds, add missing 12.9 dependency and update to latest Jul 2, 2025
@Skylion007 Skylion007 force-pushed the skylion007/update-nvshmem-3-3-9 branch 3 times, most recently from c76ed71 to 1b12f69 Compare July 2, 2025 14:32
@Skylion007 Skylion007 requested review from malfet and tinglvv July 2, 2025 14:39
@Skylion007 Skylion007 added the better-engineering Relatively self-contained tasks for better engineering contributors label Jul 2, 2025
@Skylion007
Copy link
Collaborator Author

Great Ubuntu is having an outage

@Skylion007 Skylion007 force-pushed the skylion007/update-nvshmem-3-3-9 branch from 1b12f69 to f8899ce Compare July 2, 2025 14:46
@Skylion007 Skylion007 changed the title [BE]: Fix NVSHMEM builds, add missing 12.9 dependency and update to latest [BE]: Fix NVSHMEM builds, add missing 12.9 dependency and update to latest for 2.8RC Jul 2, 2025
@Skylion007 Skylion007 added this to the 2.8.0 milestone Jul 2, 2025
@Skylion007 Skylion007 force-pushed the skylion007/update-nvshmem-3-3-9 branch from f8899ce to c84bbb0 Compare July 2, 2025 14:56
@atalman atalman added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Jul 2, 2025
@atalman
Copy link
Contributor

atalman commented Jul 2, 2025

@Skylion007 Skylion007 force-pushed the skylion007/update-nvshmem-3-3-9 branch from c84bbb0 to c2d6d7b Compare July 2, 2025 16:30
Copy link
Contributor

@kwen2501 kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much! LGTM!

Comment on lines +71 to +72
cp -a "libnvshmem/include/"* /usr/local/include/
cp -a "libnvshmem/lib/"* /usr/local/lib/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I was previously detecting NVSHMEM lib from python/site-packages/nvidia.
Maybe I would need to change how we set NVSHMEM_HOME based on this change?

Copy link
Collaborator Author

@Skylion007 Skylion007 Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be detected either way here? We do use NVSHMEM from the python package anyway at the end for non-static builds.

Copy link
Contributor

@kwen2501 kwen2501 Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how we currently detect it for building:

# Detect build dependencies from python lib path (in order to set *_HOME variables)
# NVSHMEM
nvshmem_home = py_lib_path + "/nvidia/nvshmem"
if os.path.exists(nvshmem_home):
build_options["NVSHMEM_HOME"] = nvshmem_home

And there is a gate here that stops building torch with NVSHMEM if NVSHMEM_HOME is not set:

pytorch/caffe2/CMakeLists.txt

Lines 998 to 1002 in 0105cd8

if(NOT DEFINED NVSHMEM_HOME)
message(WARNING "NVSHMEM_HOME not found. Please run `pip install nvidia-nvshmem-<version>`, or set NVSHMEM_HOME to its location.")
# Disable nvshmem if NVSHMEM_HOME is not found
set(USE_NVSHMEM FALSE)
endif()

I see two solutions:
(1) We add export NVSHMEM_HOME=/usr/local here, but it looks a bit weird;
(2) We add a findNVSHMEM.cmake as a single place to detect the library and set NVSHMEM_HOME there.

What do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say a findNVSHMEM.cmake

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pip version is fine too obviously, but it didn't seem to work for you other build?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So libnvshmem comes with libnvshmem/lib/cmake/nvshmem/NVSHMEMConfig.cmake, but it doesn't set the NVSHMEM home.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kwen2501 This more complicated as our current CMake is broken when we build with NVSHMEM because it tries to build for unsupported architectures.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kwen2501 Also libtorch will be missing it now, right?

Correct. So we'd need to find a way to work for both. (pip install or system install)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kwen2501 Proper way is just do CMake. It will work pip install thanks the to the rpath changes we made.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I added CMake search here: #157513.
Do you mind taking a look? Thank you!

@atalman
Copy link
Contributor

atalman commented Jul 3, 2025

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 3, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@Skylion007
Copy link
Collaborator Author

@atalman This is needed first as it's not actually able to find the installed NVSHMEM currently: #157513

kwen2501 added a commit that referenced this pull request Jul 4, 2025
…ation"


Previously we only search for NVSHMEM from pip install location. 
This PR adds search in system locations deemed default by CMake.
Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k Skylion007 

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Jul 4, 2025
Previously we only search for NVSHMEM from pip install location. 
This PR adds search in system locations deemed default by CMake.
Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k Skylion007 

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Jul 4, 2025
Previously we only search for NVSHMEM from pip install location.
This PR adds search in system locations deemed default by CMake.
Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines.

Pull Request resolved: #157513
Approved by: https://github.com/atalman, https://github.com/Skylion007
kwen2501 added a commit that referenced this pull request Jul 4, 2025
…ation"


Previously we only search for NVSHMEM from pip install location. 
This PR adds search in system locations deemed default by CMake.
Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k Skylion007 

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Jul 4, 2025
Previously we only search for NVSHMEM from pip install location. 
This PR adds search in system locations deemed default by CMake.
Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k Skylion007 

[ghstack-poisoned]
pytorchbot pushed a commit that referenced this pull request Jul 8, 2025
Previously we only search for NVSHMEM from pip install location.
This PR adds search in system locations deemed default by CMake.
Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines.

Pull Request resolved: #157513
Approved by: https://github.com/atalman, https://github.com/Skylion007

(cherry picked from commit 99c1a6b)
@kwen2501
Copy link
Contributor

kwen2501 commented Jul 8, 2025

@pytorchbot cherry-pick --onto release/2.8 -c critical

pytorchbot pushed a commit that referenced this pull request Jul 8, 2025
…atest for 2.8RC (#157453)

Fixed our bad builds of nvshmem, (we were not building or testing before) and also updates to the latest version. Newest versions has critical support for things that would actually make it useful, like bfloat16 and float16 support.

This is a proper fix for: #157411
Pull Request resolved: #157453
Approved by: https://github.com/kwen2501, https://github.com/atalman

(cherry picked from commit a6fab82)
@pytorchbot
Copy link
Collaborator

Cherry picking #157453

The cherry pick PR is at #157774 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated:

Details for Dev Infra team Raised by workflow job

@atalman atalman removed this from the 2.8.0 milestone Jul 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
better-engineering Relatively self-contained tasks for better engineering contributors ciflow/binaries Trigger all binary build and upload jobs on the PR ciflow/trunk Trigger trunk jobs on your pull request Merged open source topic: not user facing topic category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants