-
Notifications
You must be signed in to change notification settings - Fork 24.9k
Guard rocm_smi.h include with a header check #158771
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158771
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ✅ No FailuresAs of commit 07b5dbe with merge base 7a08755 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
@pytorchbot label "topic: not user facing" |
Isn't this header file inclusion necessary for the source file to compile successful? If not, then just remove the #include statement entirely. The rocm_smi.h header comes from the rocm-smi-lib apt package. |
Hi @jeffdaily I saw the conversation between you and @trixirt. Based on that discussion intra_node__comm.cpp doesn't compile with USE_ROCM=ON unless both rocm_smi.h and librocm_smi64 are available. Since rocm-smi-lib isn't always isnstalled by default I kept the include but wrapped it in a check so it's only added when the header is present. Also in my latest commit I updated the block that uses rsmi_init() and RSMI_STATUS_SUCCESS so that it's guarded with the same HAS_ROCM_SMI macro. The original version was causing a test failure when the header wasn't included. This way the code's only compiled if the header is available. |
Shouldn't we instead change the cmake procedure to hard fail if rocm-smi isn't installed? The compile might succeed with this PR when rocm-smi header is missing, but you're also unsafely turning a build issue into a runtime issue if someone tries to use the feature that needed rsmi_init() to work properly. |
Hi @jeffdaily , thanks for the earlier feedback. I went ahead and updated the PR to follow your recommendation. Instead of conditionally compiling the #include <rocm_smi/rocm_smi.h> based on a CMake check, I've now moved that check up into the top-level CMakeLists.txt and changed it to fail hard during configuration if the header isn't present. This avoids the issue you pointed out. I also switched from checking the USE_ROCM CMake variable to checking ENV{USE_ROCM} since the CMake variable wasn't defined early enough in the top-level CMakeLists.txt. Using the environment variables ensures the check runs reliably when I build with USE_ROCM=1. Please let me know if this looks good or if you'd prefer the check be somewhere else. Thank you for your recommendations. |
Fixes #158725
This PR fixes a build issue when building with USE_ROCM enabled on systems that don't have the rocm_smi.h header installed.
The include of <rocm_smi/rocm_smi.h> in intrad_node_comm.cpp is now conditionally compiled based on a new CMake check. If the header's found, CMake defines HAS_ROCM_SMI which enables the include. Otherwise the file builds without it.
This makes the build more robust and compatible with different ROCm setups.
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd