Skip to content

Guard rocm_smi.h include with a header check #158771

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

dsashidh
Copy link
Contributor

@dsashidh dsashidh commented Jul 21, 2025

Fixes #158725

This PR fixes a build issue when building with USE_ROCM enabled on systems that don't have the rocm_smi.h header installed.

The include of <rocm_smi/rocm_smi.h> in intrad_node_comm.cpp is now conditionally compiled based on a new CMake check. If the header's found, CMake defines HAS_ROCM_SMI which enables the include. Otherwise the file builds without it.

This makes the build more robust and compatible with different ROCm setups.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

Copy link

pytorch-bot bot commented Jul 21, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158771

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

✅ No Failures

As of commit 07b5dbe with merge base 7a08755 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: rocm AMD GPU support for Pytorch oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Jul 21, 2025
@dsashidh
Copy link
Contributor Author

@pytorchbot label "topic: not user facing"

@jeffdaily
Copy link
Collaborator

Isn't this header file inclusion necessary for the source file to compile successful? If not, then just remove the #include statement entirely. The rocm_smi.h header comes from the rocm-smi-lib apt package.

@dsashidh
Copy link
Contributor Author

Isn't this header file inclusion necessary for the source file to compile successful? If not, then just remove the #include statement entirely. The rocm_smi.h header comes from the rocm-smi-lib apt package.

Hi @jeffdaily I saw the conversation between you and @trixirt. Based on that discussion intra_node__comm.cpp doesn't compile with USE_ROCM=ON unless both rocm_smi.h and librocm_smi64 are available. Since rocm-smi-lib isn't always isnstalled by default I kept the include but wrapped it in a check so it's only added when the header is present.

Also in my latest commit I updated the block that uses rsmi_init() and RSMI_STATUS_SUCCESS so that it's guarded with the same HAS_ROCM_SMI macro. The original version was causing a test failure when the header wasn't included. This way the code's only compiled if the header is available.

@jeffdaily
Copy link
Collaborator

Shouldn't we instead change the cmake procedure to hard fail if rocm-smi isn't installed? The compile might succeed with this PR when rocm-smi header is missing, but you're also unsafely turning a build issue into a runtime issue if someone tries to use the feature that needed rsmi_init() to work properly.

@soulitzer soulitzer requested review from jeffdaily and malfet July 23, 2025 18:54
@soulitzer soulitzer added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jul 23, 2025
@dsashidh
Copy link
Contributor Author

dsashidh commented Aug 5, 2025

Shouldn't we instead change the cmake procedure to hard fail if rocm-smi isn't installed? The compile might succeed with this PR when rocm-smi header is missing, but you're also unsafely turning a build issue into a runtime issue if someone tries to use the feature that needed rsmi_init() to work properly.

Hi @jeffdaily , thanks for the earlier feedback. I went ahead and updated the PR to follow your recommendation.

Instead of conditionally compiling the #include <rocm_smi/rocm_smi.h> based on a CMake check, I've now moved that check up into the top-level CMakeLists.txt and changed it to fail hard during configuration if the header isn't present. This avoids the issue you pointed out.

I also switched from checking the USE_ROCM CMake variable to checking ENV{USE_ROCM} since the CMake variable wasn't defined early enough in the top-level CMakeLists.txt. Using the environment variables ensures the check runs reliably when I build with USE_ROCM=1.

Please let me know if this looks good or if you'd prefer the check be somewhere else. Thank you for your recommendations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module: rocm AMD GPU support for Pytorch oncall: distributed Add this issue/PR to distributed oncall triage queue open source release notes: distributed (c10d) release notes category topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

including rocm_smi.h without checking
4 participants