[SymmMem] Install NVSHMEM wheel in CI docker #157411

Open
wants to merge 4 commits into
base: gh/kwen2501/188/base

kwen2501
Contributor

@kwen2501 kwen2501 commented Jul 2, 2025

Stack from ghstack (oldest at bottom):

The 2.8 RC1 and nightly builds were not compiled with NVSHMEM because the build environment on the CI machines does not have NVSHMEM installed.

This PR pip-installs the NVSHMEM wheel in the CI Docker image.

It also adds nvidia-nvshmem-cu12 to PYTORCH_EXTRA_INSTALL_REQUIREMENTS for CUDA 12.9.
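
One way to sanity-check a build environment for the wheel named above is to ask Python's packaging metadata whether the distribution is present. The snippet below is only a minimal sketch and is not part of this PR: the helper name `installed_version` is hypothetical, and it only reports whether pip-installed metadata for `nvidia-nvshmem-cu12` exists, not whether the PyTorch build actually picks NVSHMEM up.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; it only inspects package metadata.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution name taken from the PR description.
    name = "nvidia-nvshmem-cu12"
    version = installed_version(name)
    if version is None:
        print(f"{name} is not installed in this environment")
    else:
        print(f"{name} {version} is installed")
```

Running this inside the CI image before the build step would show whether the `pip install` line took effect in that environment.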

[ghstack-poisoned]
@kwen2501 kwen2501 requested review from a team and jeffdaily as code owners July 2, 2025 00:15
pytorch-bot bot commented Jul 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157411

Note: Links to docs will display an error until the docs builds have been completed.

❌ 47 New Failures, 11 Cancelled Jobs, 1 Pending

As of commit b34db25 with merge base 64f2ec7 (image):

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 added a commit that referenced this pull request Jul 2, 2025
ghstack-source-id: 47a2159
Pull-Request-resolved: #157411
@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jul 2, 2025
@kwen2501 kwen2501 requested review from malfet, clee2000 and atalman July 2, 2025 00:15
[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Jul 2, 2025
ghstack-source-id: 47cc574
Pull-Request-resolved: #157411
[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Jul 2, 2025
ghstack-source-id: 3983707
Pull-Request-resolved: #157411
Skylion007
Skylion007 previously approved these changes Jul 2, 2025
@Skylion007 Skylion007 dismissed their stale review July 2, 2025 13:53

Whoops

@@ -135,3 +135,6 @@ do
esac
shift
done

# Install NVSHMEM wheel which is a build-time dependency for torch since 2.8
python3 -mpip install nvidia-nvshmem-cu12
Collaborator


Interesting. Why don't we build it like the other NVIDIA libraries instead of using the pip bundle?

@kwen2501
Contributor Author

kwen2501 commented Jul 2, 2025

Closing in favor of #157453

@kwen2501 kwen2501 closed this Jul 2, 2025
@Skylion007
Collaborator

@kwen2501 Actually, it looks like we need more CMake changes, as our current build is broken on the older hardware. RIP

pytorchmergebot pushed a commit that referenced this pull request Jul 3, 2025
…atest for 2.8RC (#157453)

Fixed our bad builds of NVSHMEM (we were not building or testing before) and also updated to the latest version. The newest versions have critical support for things that would actually make it useful, like bfloat16 and float16 support.

This is a proper fix for: #157411
Pull Request resolved: #157453
Approved by: https://github.com/kwen2501, https://github.com/atalman
@Skylion007 Skylion007 reopened this Jul 4, 2025
@Skylion007
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]
@pytorchmergebot
Collaborator

Successfully rebased gh/kwen2501/188/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/157411)

pytorchmergebot pushed a commit that referenced this pull request Jul 4, 2025
ghstack-source-id: b8adf0b
Pull-Request-resolved: #157411
pytorchbot pushed a commit that referenced this pull request Jul 8, 2025
…atest for 2.8RC (#157453)

Fixed our bad builds of NVSHMEM (we were not building or testing before) and also updated to the latest version. The newest versions have critical support for things that would actually make it useful, like bfloat16 and float16 support.

This is a proper fix for: #157411
Pull Request resolved: #157453
Approved by: https://github.com/kwen2501, https://github.com/atalman

(cherry picked from commit a6fab82)