-
Notifications
You must be signed in to change notification settings - Fork 24.9k
[BE]: Fix NVSHMEM builds, add missing 12.9 dependency and update to latest for 2.8RC #157453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BE]: Fix NVSHMEM builds, add missing 12.9 dependency and update to latest for 2.8RC #157453
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157453
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit c2d6d7b with merge base 82eefae ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
8119930
to
99dabab
Compare
c76ed71
to
1b12f69
Compare
Great Ubuntu is having an outage |
1b12f69
to
f8899ce
Compare
f8899ce
to
c84bbb0
Compare
c84bbb0
to
c2d6d7b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much! LGTM!
cp -a "libnvshmem/include/"* /usr/local/include/ | ||
cp -a "libnvshmem/lib/"* /usr/local/lib/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I was previously detecting NVSHMEM lib from python/site-packages/nvidia
.
Maybe I would need to change how we set NVSHMEM_HOME
based on this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be detected either way here? We do use NVSHMEM from the python package anyway at the end for non-static builds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how we currently detect it for building:
pytorch/tools/setup_helpers/cmake.py
Lines 316 to 320 in d5d14ee
# Detect build dependencies from python lib path (in order to set *_HOME variables) | |
# NVSHMEM | |
nvshmem_home = py_lib_path + "/nvidia/nvshmem" | |
if os.path.exists(nvshmem_home): | |
build_options["NVSHMEM_HOME"] = nvshmem_home |
And there is a gate here that stops building torch with NVSHMEM if NVSHMEM_HOME
is not set:
Lines 998 to 1002 in 0105cd8
if(NOT DEFINED NVSHMEM_HOME) | |
message(WARNING "NVSHMEM_HOME not found. Please run `pip install nvidia-nvshmem-<version>`, or set NVSHMEM_HOME to its location.") | |
# Disable nvshmem if NVSHMEM_HOME is not found | |
set(USE_NVSHMEM FALSE) | |
endif() |
I see two solutions:
(1) We add export NVSHMEM_HOME=/usr/local
here, but it looks a bit weird;
(2) We add a findNVSHMEM.cmake
as a single place to detect the library and set NVSHMEM_HOME
there.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say a findNVSHMEM.cmake
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pip version is fine too obviously, but it didn't seem to work for you other build?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So libnvshmem comes with libnvshmem/lib/cmake/nvshmem/NVSHMEMConfig.cmake
, but it doesn't set the NVSHMEM home.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kwen2501 This more complicated as our current CMake is broken when we build with NVSHMEM because it tries to build for unsupported architectures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kwen2501 Also libtorch will be missing it now, right?
Correct. So we'd need to find a way to work for both. (pip install or system install)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kwen2501 Proper way is just do CMake. It will work pip install thanks the to the rpath changes we made.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I added CMake search here: #157513.
Do you mind taking a look? Thank you!
@pytorchmergebot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…ation" Previously we only search for NVSHMEM from pip install location. This PR adds search in system locations deemed default by CMake. Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines. cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k Skylion007 [ghstack-poisoned]
Previously we only search for NVSHMEM from pip install location. This PR adds search in system locations deemed default by CMake. Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines. cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k Skylion007 [ghstack-poisoned]
Previously we only search for NVSHMEM from pip install location. This PR adds search in system locations deemed default by CMake. Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines. Pull Request resolved: #157513 Approved by: https://github.com/atalman, https://github.com/Skylion007
…ation" Previously we only search for NVSHMEM from pip install location. This PR adds search in system locations deemed default by CMake. Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines. cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k Skylion007 [ghstack-poisoned]
Previously we only search for NVSHMEM from pip install location. This PR adds search in system locations deemed default by CMake. Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines. cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k Skylion007 [ghstack-poisoned]
Previously we only search for NVSHMEM from pip install location. This PR adds search in system locations deemed default by CMake. Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines. Pull Request resolved: #157513 Approved by: https://github.com/atalman, https://github.com/Skylion007 (cherry picked from commit 99c1a6b)
@pytorchbot cherry-pick --onto release/2.8 -c critical |
…atest for 2.8RC (#157453) Fixed our bad builds of nvshmem, (we were not building or testing before) and also updates to the latest version. Newest versions has critical support for things that would actually make it useful, like bfloat16 and float16 support. This is a proper fix for: #157411 Pull Request resolved: #157453 Approved by: https://github.com/kwen2501, https://github.com/atalman (cherry picked from commit a6fab82)
Cherry picking #157453The cherry pick PR is at #157774 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated: Details for Dev Infra teamRaised by workflow job |
Fixed our bad builds of nvshmem, (we were not building or testing before) and also updates to the latest version. Newest versions has critical support for things that would actually make it useful, like bfloat16 and float16 support.
This is a proper fix for: #157411