[SymmMem] Install NVSHMEM wheel in CI docker #157411

kwen2501 · 2025-07-02T00:15:14Z

Stack from ghstack (oldest at bottom):

-> [SymmMem] Install NVSHMEM wheel in CI docker #157411

The 2.8 RC1 and nightly builds were not compiled with NVSHMEM because the build environment on the CI machines does not have NVSHMEM installed.

This PR pip-installs the NVSHMEM wheel in the CI Docker image.

It also adds nvidia-nvshmem-cu12 to the PYTORCH_EXTRA_INSTALL_REQUIREMENTS for CUDA 12.9.
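
For illustration only, here is a minimal sketch of what adding an entry to PYTORCH_EXTRA_INSTALL_REQUIREMENTS could look like. It assumes the variable holds requirement specifiers separated by " | ". The helper name append_requirement, the starting value, and the environment marker are hypothetical placeholders, not the actual CI configuration touched by this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append one requirement to a " | "-separated list.
# Assumption: PYTORCH_EXTRA_INSTALL_REQUIREMENTS uses " | " as its separator.
append_requirement() {
    local current="$1"
    local extra="$2"
    if [ -z "$current" ]; then
        # Empty list: the new entry becomes the whole list.
        printf '%s' "$extra"
    else
        printf '%s | %s' "$current" "$extra"
    fi
}

# Placeholder starting value; the real CUDA 12.9 list is longer and pinned.
PYTORCH_EXTRA_INSTALL_REQUIREMENTS="nvidia-nccl-cu12; platform_system == 'Linux'"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS="$(append_requirement \
    "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" \
    "nvidia-nvshmem-cu12; platform_system == 'Linux'")"
echo "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS"
```

Keeping the append logic in one small function avoids a stray leading separator when the list starts out empty.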

[ghstack-poisoned]

pytorch-bot · 2025-07-02T00:15:18Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157411

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 47 New Failures, 11 Cancelled Jobs, 1 Pending

As of commit b34db25 with merge base 64f2ec7:

NEW FAILURES - The following jobs have failed:

Build almalinux docker images / build-docker (cuda12.6) (gh)
Final attempt failed. Child_process exited with error code 1
Build manywheel docker images / manylinux2_28-builder:cuda12.6 (gh)
Final attempt failed. Child_process exited with error code 1
Build manywheel docker images / manylinux2_28-builder:cuda12.8 (gh)
Final attempt failed. Child_process exited with error code 1
Build manywheel docker images / manylinux2_28-builder:cuda12.9 (gh)
Final attempt failed. Child_process exited with error code 1
Build manywheel docker images / manylinuxaarch64-builder:cuda12.8 (gh)
Final attempt failed. Child_process exited with error code 1
Build manywheel docker images / manylinuxaarch64-builder:cuda12.9 (gh)
Final attempt failed. Child_process exited with error code 1
pull / cuda12.8-py3.10-gcc9-sm75 / build (gh)
pull / linux-docs / build-docs-cpp-false (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-docs / build-docs-functorch-false (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-docs / build-docs-python-false (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-cuda12.8-cudnn9-py3.9-clang12 / build (gh)
pull / linux-jammy-cuda12.8-py3.10-gcc11 / build (gh)
pull / linux-jammy-cuda12.8-py3.10-gcc11-build-distributed / build (gh)
pull / linux-jammy-cuda12.8-py3.10-gcc11-sm89 / build (gh)
pull / linux-jammy-py3-clang12-mobile-build / build (gh)
Process completed with exit code 1.
pull / linux-jammy-py3.10-clang18-asan / test (default, 1, 6, linux.4xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang18-asan / test (default, 2, 6, linux.4xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang18-asan / test (default, 3, 6, linux.4xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang18-asan / test (default, 4, 6, linux.4xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang18-asan / test (default, 5, 6, linux.4xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.10-clang18-asan / test (default, 6, 6, linux.4xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.13-clang12 / test (crossref, 1, 2, linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.13-clang12 / test (crossref, 2, 2, linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.13-clang12 / test (default, 1, 5, linux.4xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.13-clang12 / test (default, 2, 5, linux.4xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.13-clang12 / test (default, 3, 5, linux.4xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.13-clang12 / test (default, 4, 5, linux.4xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.13-clang12 / test (default, 5, 5, linux.4xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.13-clang12 / test (dynamo_wrapped, 1, 3, linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.13-clang12 / test (dynamo_wrapped, 2, 3, linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.13-clang12 / test (dynamo_wrapped, 3, 3, linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.13-clang12 / test (einops, 1, 1, linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.9-clang12 / build (gh)
pull / linux-jammy-py3.9-clang12-onnx / test (default, 1, 2, linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.9-clang12-onnx / test (default, 2, 2, linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.9-gcc11 / test (backwards_compat, 1, 1, linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.9-gcc11 / test (default, 1, 5, linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.9-gcc11 / test (default, 2, 5, linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.9-gcc11 / test (default, 3, 5, linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.9-gcc11 / test (default, 4, 5, linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.9-gcc11 / test (default, 5, 5, linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.9-gcc11 / test (distributed, 1, 2, linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.9-gcc11 / test (distributed, 2, 2, linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.9-gcc11 / test (docs_test, 1, 1, linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.9-gcc11 / test (jit_legacy, 1, 1, linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-py3.9-gcc11 / test (numpy_2_x, 1, 1, linux.2xlarge) (gh)
Final attempt failed. Child_process exited with error code 1
pull / linux-jammy-xpu-2025.1-py3.9 / build (gh)

CANCELLED JOBS - The following jobs were cancelled. Please retry:

Build almalinux docker images / build-docker (cuda12.8) (gh)
##[error]The operation was canceled.
Build almalinux docker images / build-docker (cuda12.9) (gh)
##[error]The operation was canceled.
Lint / Link checks / lint-urls / linux-job (gh)
##[error]The operation was canceled.
Lint / Link checks / lint-xrefs / linux-job (gh)
##[error]The operation was canceled.
Lint / lintrunner-clang / linux-job (gh)
##[error]The operation was canceled.
Lint / lintrunner-noclang / linux-job (gh)
##[error]The operation was canceled.
Lint / quick-checks / linux-job (gh)
##[error]The operation was canceled.
Lint / Test tools / linux-job (gh)
##[error]The operation was canceled.
Lint / toc / linux-job (gh)
##[error]The operation was canceled.
Lint / workflow-checks / linux-job (gh)
##[error]The operation was canceled.
pull / linux-jammy-py3.9-gcc11-pch / build (gh)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 47a2159 Pull-Request-resolved: #157411

[ghstack-poisoned]

ghstack-source-id: 47cc574 Pull-Request-resolved: #157411

[ghstack-poisoned]

ghstack-source-id: 3983707 Pull-Request-resolved: #157411

Whoops

Skylion007 · 2025-07-02T13:54:41Z

.ci/docker/common/install_cuda.sh

@@ -135,3 +135,6 @@ do
    esac
    shift
 done
+
+# Install NVSHMEM wheel which is a build-time dependency for torch since 2.8
+python3 -mpip install nvidia-nvshmem-cu12
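
The added line installs the cu12 wheel unconditionally. As a hedged sketch of one alternative, the wheel name could be derived from the CUDA version being installed. The helper nvshmem_wheel_for_cuda below is hypothetical and not part of install_cuda.sh, and only the cu12 wheel named in this PR is assumed to exist.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of install_cuda.sh): derive the NVSHMEM wheel
# name from a CUDA version string such as "12.9".
nvshmem_wheel_for_cuda() {
    local cuda_version="$1"
    local major="${cuda_version%%.*}"   # "12.9" -> "12"
    case "$major" in
        12)
            echo "nvidia-nvshmem-cu12"
            ;;
        *)
            # Only the cu12 wheel is mentioned in this PR; anything else is an error.
            echo "no known NVSHMEM wheel for CUDA ${cuda_version}" >&2
            return 1
            ;;
    esac
}

# Assumed usage: print the install command instead of running it.
wheel="$(nvshmem_wheel_for_cuda "12.9")" && echo "python3 -m pip install ${wheel}"
```

Returning a non-zero status for unknown versions would make a Docker build fail loudly instead of installing a mismatched wheel.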


Interesting. Why do we use the pip bundle instead of building it like the other NVIDIA libraries?

kwen2501 · 2025-07-02T18:18:02Z

Closing in favor of #157453

Skylion007 · 2025-07-02T18:27:45Z

@kwen2501 Actually, it looks like we need more CMake changes, as our current build is broken; RIP on the older hardware.

…atest for 2.8RC (#157453)

Fixes our bad builds of nvshmem (we were not building or testing it before) and also updates to the latest version. The newest version has critical support for things that would actually make it useful, like bfloat16 and float16 support.

This is a proper fix for: #157411

Pull Request resolved: #157453
Approved by: https://github.com/kwen2501, https://github.com/atalman

Skylion007 · 2025-07-04T14:12:00Z

@pytorchbot rebase

pytorchmergebot · 2025-07-04T14:13:39Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]

pytorchmergebot · 2025-07-04T14:13:52Z

Successfully rebased gh/kwen2501/188/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/157411)

ghstack-source-id: b8adf0b Pull-Request-resolved: #157411

…atest for 2.8RC (#157453)

Fixes our bad builds of nvshmem (we were not building or testing it before) and also updates to the latest version. The newest version has critical support for things that would actually make it useful, like bfloat16 and float16 support.

This is a proper fix for: #157411

Pull Request resolved: #157453
Approved by: https://github.com/kwen2501, https://github.com/atalman

(cherry picked from commit a6fab82)

Update

189ac77

[ghstack-poisoned]

kwen2501 requested review from a team and jeffdaily as code owners July 2, 2025 00:15

kwen2501 added a commit that referenced this pull request Jul 2, 2025

[SymmMem] Install NVSHMEM wheel in CI docker

7536483

ghstack-source-id: 47a2159 Pull-Request-resolved: #157411

pytorch-bot bot added the topic: not user facing topic category label Jul 2, 2025

kwen2501 requested review from malfet, clee2000 and atalman July 2, 2025 00:15

Update

8e37772

[ghstack-poisoned]

kwen2501 added a commit that referenced this pull request Jul 2, 2025

[SymmMem] Install NVSHMEM wheel in CI docker

9dbe1fe

ghstack-source-id: 47cc574 Pull-Request-resolved: #157411

Update

d5afa2d

[ghstack-poisoned]

kwen2501 added a commit that referenced this pull request Jul 2, 2025

[SymmMem] Install NVSHMEM wheel in CI docker

d2767b5

ghstack-source-id: 3983707 Pull-Request-resolved: #157411

Skylion007 previously approved these changes Jul 2, 2025

Skylion007 reviewed Jul 2, 2025

Skylion007 mentioned this pull request Jul 2, 2025

[BE]: Fix NVSHMEM builds, add missing 12.9 dependency and update to latest for 2.8RC #157453

Closed

kwen2501 closed this Jul 2, 2025

Skylion007 reopened this Jul 4, 2025

Update

b34db25

[ghstack-poisoned]

pytorchmergebot pushed a commit that referenced this pull request Jul 4, 2025

[SymmMem] Install NVSHMEM wheel in CI docker

79b16e0

ghstack-source-id: b8adf0b Pull-Request-resolved: #157411

pytorchbot mentioned this pull request Jul 8, 2025

[BE]: Fix NVSHMEM builds, add missing 12.9 dependency and update to latest for 2.8RC #157774

Open
