Use official CUDAToolkit module in CMake #154595

Open

wants to merge 3 commits into main from cuda_language

Conversation

cyyever
Collaborator

@cyyever cyyever commented May 29, 2025

Use the CUDA language in CMake and remove the forked FindCUDAToolkit.cmake.
Some CUDA targets are also renamed with a torch:: prefix.
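
For illustration, one plausible way to express the rename (a sketch, not necessarily how this PR implements it; aliasing imported targets needs CMake >= 3.18):

    find_package(CUDAToolkit REQUIRED)
    # Re-export the imported toolkit targets under a torch:: namespace.
    add_library(torch::cublas ALIAS CUDA::cublas)
    add_library(torch::curand ALIAS CUDA::curand)
    add_library(torch::cufft ALIAS CUDA::cufft)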

cc @albanD

pytorch-bot bot commented May 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154595

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 1 Unrelated Failure

As of commit ae40a3d with merge base cf4964b:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@cyyever cyyever marked this pull request as draft May 29, 2025 05:30
@cyyever
Collaborator Author

cyyever commented May 29, 2025

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing label (topic category) May 29, 2025
@cyyever cyyever force-pushed the cuda_language branch 2 times, most recently from e67982f to c3f0359 Compare May 29, 2025 05:55
@cyyever cyyever force-pushed the cuda_language branch 7 times, most recently from 4037656 to 9d5da10 Compare May 29, 2025 09:34
@cyyever cyyever marked this pull request as ready for review June 14, 2025 23:35
@cyyever cyyever changed the title from Cuda language to Use CUDA language in CMake Jun 14, 2025
@cyyever cyyever changed the title from Use CUDA language in CMake to Use official CUDAToolkit module in CMake Jun 14, 2025
@cyyever cyyever force-pushed the cuda_language branch 2 times, most recently from ed247ae to 22bb4d5 Compare June 15, 2025 08:04
albanD
albanD previously approved these changes Jun 16, 2025
Collaborator

@albanD albanD left a comment

Sorry for the delay on reviewing this, my review queue has been pretty backed up.
This is AMAZING!!!

The change sounds good to me (even though I'm in no way a CMake expert).
But if CI/CD is happy (including the cpp extensions tests), I think we're good to go.

Let's try and land this as is!

@cyyever cyyever marked this pull request as draft June 16, 2025 23:59
@cyyever
Collaborator Author

cyyever commented Jun 16, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased cuda_language onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cuda_language && git pull --rebase)

@cyyever
Collaborator Author

cyyever commented Jun 23, 2025

@ngimel Detection of the native GPU architecture could be changed to set(CUDA_ARCHITECTURES "native"), which essentially passes native to nvcc. The old behavior of printing the detected GPUs requires CUDA_DETECT_INSTALLED_GPUS, which is unfortunately deprecated; see https://gitlab.kitware.com/cmake/cmake/-/issues/19199.

One fix is using CUDA_ARCHITECTURES, as shown in https://cmake.org/cmake/help/latest/prop_tgt/CUDA_ARCHITECTURES.html .
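
For illustration, a minimal sketch of the target-property form in a standalone project (demo_kernels and kernels.cu are hypothetical names, not from this PR; the "native" value requires CMake >= 3.24):

    cmake_minimum_required(VERSION 3.24)
    project(demo LANGUAGES CXX CUDA)
    add_library(demo_kernels STATIC kernels.cu)
    # Equivalent to nvcc -arch=native: build only for the GPUs visible at build time.
    set_target_properties(demo_kernels PROPERTIES CUDA_ARCHITECTURES native)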

@cyyever
Collaborator Author

cyyever commented Jun 23, 2025

Concretely, this is wrong in CMake:

      if(CUDA_LIMIT_GPU_ARCHITECTURE AND ITEM VERSION_GREATER_EQUAL CUDA_LIMIT_GPU_ARCHITECTURE)
        list(GET CUDA_COMMON_GPU_ARCHITECTURES -1 NEWITEM)
        string(APPEND CUDA_GPU_DETECT_OUTPUT_FILTERED " ${NEWITEM}")
      else()
        string(APPEND CUDA_GPU_DETECT_OUTPUT_FILTERED " ${ITEM}")
      endif()

as it either incorrectly sets CUDA_LIMIT_GPU_ARCHITECTURE or does an incorrect comparison here, and thus sets the architecture to CUDA_COMMON_GPU_ARCHITECTURES. I'm on CMake 3.27, but I've seen the same behavior on 4.0.

Fixed, see commit 3f789e9. Also note that this bug existed before this PR; it was only revealed by these changes.

@ngimel
Collaborator

ngimel commented Jun 24, 2025

Do you know how this "native" option would work later, when we check whether the build is OK for the current GPU in order to give a clear error message on mismatch?

@cyyever
Collaborator Author

cyyever commented Jun 24, 2025

@ngimel From the nvcc documentation:

When -arch=native is specified, nvcc detects the visible GPUs on the system and generates codes for them, no PTX program will be generated for this option. It is a warning if there are no visible supported GPU on the system, and the default architecture will be used.

CMake does little work here; we rely on nvcc. (IMO they don't want to maintain these flags...)

Contributor

@malfet malfet left a comment

This PR tries to do too many things in one go (including renames).
Can it be split into 2-3 PRs, one of which would use the new CUDAToolkit package but just define all the aliases the system used to, say set(CUDA_VERSION ${CUDAToolkit_VERSION}), etc.?

Or alternatively, have a baseline PR that changes those in the existing FindCUDA in preparation for the new package version.

Looks like there are some changes to how the nvrtc package is defined before/after this change. In my opinion, it would be good to keep the old definitions in place rather than pushing them into custom copy scripts, which will not be executed for users running outside of CI.
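
For illustration, a sketch of the kind of compatibility shim suggested above, mapping the CUDAToolkit module's result variables back to the legacy FindCUDA names (the exact alias set PyTorch would need is an assumption here):

    find_package(CUDAToolkit REQUIRED)
    # Legacy FindCUDA variable names, backed by the new module's results.
    set(CUDA_VERSION ${CUDAToolkit_VERSION})
    set(CUDA_INCLUDE_DIRS ${CUDAToolkit_INCLUDE_DIRS})
    set(CUDA_TOOLKIT_ROOT_DIR ${CUDAToolkit_LIBRARY_ROOT})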

@@ -79,6 +79,7 @@ def package_cuda_wheel(wheel_path, desired_cuda) -> None:
os.system(f"unzip {wheel_path} -d {folder}/tmp")
libs_to_copy = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",
"/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so",
Contributor

Why is this change necessary if the goal is just to remove FindCUDA?

Collaborator Author

Some CI jobs broke because nvperf_host.so could not be found, and nvperf_host.so is indeed required by cupti.so. If we install cupti.so, we should also install nvperf_host.so.

Collaborator

Some CI jobs broke because nvperf_host.so could not be found, and nvperf_host.so is indeed required by cupti.so

Could you link the failing jobs? I don't understand why we would need the nvperf_* libs now without changing profiling usage in PyTorch or CUPTI itself. Why and how was profiling working before?
The nvperf_* libs are used for PC sampling, PM sampling, SASS metrics, or range profiling, and I don't see any related change in this PR, so are we using these?

Collaborator Author

@cyyever cyyever Jun 25, 2025

Because Kineto explicitly links to nvperf_host when building PyTorch, we have to make sure nvperf_host can be found afterwards. See https://github.com/pytorch/kineto/blob/main/libkineto/CMakeLists.txt
Whether its functions are really used is another story.
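
For illustration, a minimal sketch of that dependency chain using the official module's imported targets (profiler_demo is a hypothetical target, not Kineto's actual build; assumes a CMake version that defines CUDA::nvperf_host):

    find_package(CUDAToolkit REQUIRED)
    add_library(profiler_demo SHARED profiler.cpp)
    # Linking CUPTI brings a runtime dependency on libnvperf_host,
    # so packaging must ship both libraries together.
    target_link_libraries(profiler_demo PRIVATE CUDA::cupti CUDA::nvperf_host)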

@@ -8,6 +8,7 @@ copy "%CUDA_PATH%\bin\cusolver*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\cudnn*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\nvrtc*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\extras\CUPTI\lib64\cupti64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\extras\CUPTI\lib64\nvperf_host*.dll*" pytorch\torch\lib
Contributor

Why are those changes necessary?

Collaborator Author

cupti requires nvperf_host, as I said before.

Comment on lines 1080 to 1083
target_link_libraries(torch_cuda_linalg PRIVATE
CUDA::cusolver_static
${CUDAToolkit_LIBRARY_DIR}/liblapack_static.a # needed for libcusolver_static
)
Contributor

Side note: we don't really test static CUDA anymore, nor do we support CUDA 11 anymore; perhaps it's time to delete this logic completely...

Collaborator Author

I'm not sure whether static CUDA is still in use somewhere in Meta. Could you propose removing it?

@@ -50,7 +50,7 @@ if(USE_CUDA)
if(NOT CAFFE2_USE_NVRTC)
caffe2_update_option(USE_NVRTC OFF)
endif()
-list(APPEND Caffe2_CUDA_DEPENDENCY_LIBS caffe2::curand caffe2::cufft caffe2::cublas)
+list(APPEND Caffe2_CUDA_DEPENDENCY_LIBS torch::curand torch::cufft torch::cublas)
Contributor

Again, why are we renaming those here?

Collaborator Author

Caffe2 is gone, so I took the chance to rename.

@cyyever
Collaborator Author

cyyever commented Jun 24, 2025

@malfet Renaming the targets could be split out, but the other changes are not easy to separate, given the concern of mixing new and old modules and their strange interactions.

@ngimel
Collaborator

ngimel commented Jun 24, 2025

@ngimel From the nvcc documentation:

When -arch=native is specified, nvcc detects the visible GPUs on the system and generates codes for them, no PTX program will be generated for this option. It is a warning if there are no visible supported GPU on the system, and the default architecture will be used.

CMake does little work here; we rely on nvcc. (IMO they don't want to maintain these flags...)

So after this change _cuda_getArchFlags() won't work as expected?

@cyyever
Collaborator Author

cyyever commented Jun 24, 2025

@ngimel From the nvcc documentation:

When -arch=native is specified, nvcc detects the visible GPUs on the system and generates codes for them, no PTX program will be generated for this option. It is a warning if there are no visible supported GPU on the system, and the default architecture will be used.

CMake does little work here; we rely on nvcc. (IMO they don't want to maintain these flags...)

So after this change _cuda_getArchFlags() won't work as expected?

Its value is obtained from TORCH_CUDA_ARCH_LIST.
For a TORCH_CUDA_ARCH_LIST of explicit SMs, the behavior is the same. Virtual names such as "All", "Common" or "Auto" will no longer be translated into a detailed list.
IMO it's a design decision of CUDA to simplify all this arch specification, and the CMake deprecation merely reflects that trend.
If we want to return the supported list rather than the virtual specification, we have to maintain it ourselves and append to it for each new CUDA generation. That is also possible; I could revert some changes here.
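
For illustration, a hedged sketch (not PyTorch's actual logic) of how an explicit list could keep the old behavior while a virtual name defers to nvcc; _archs is illustrative, and real arch lists may carry suffixes such as +PTX that need extra handling:

    if(TORCH_CUDA_ARCH_LIST STREQUAL "Auto")
      # Defer detection to nvcc, as with -arch=native.
      set(CMAKE_CUDA_ARCHITECTURES native)
    else()
      # "8.0;9.0" -> "80;90", CMake's numeric architecture form.
      string(REPLACE "." "" _archs "${TORCH_CUDA_ARCH_LIST}")
      set(CMAKE_CUDA_ARCHITECTURES ${_archs})
    endif()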

pytorchmergebot pushed a commit that referenced this pull request Jun 28, 2025
Because cupti depends on nvperf_host, as discussed in #154595

Pull Request resolved: #156668
Approved by: https://github.com/Skylion007
@cyyever cyyever marked this pull request as draft June 30, 2025 01:35
@cyyever cyyever force-pushed the cuda_language branch 7 times, most recently from a50793a to 1016376 Compare June 30, 2025 01:58
@cyyever cyyever marked this pull request as ready for review June 30, 2025 14:55
@cyyever
Collaborator Author

cyyever commented Jun 30, 2025

@malfet The unrelated changes have been moved elsewhere.

@cyyever cyyever requested a review from malfet June 30, 2025 14:56
@thotakeerth

CUDA Performance Insight:
Reproduced the slow-gradcheck failures on A100-SXM4-80GB (CUDA 12.4). The latency spikes align with kernel launch overhead patterns I've seen in transformer workloads.

Recommendations:

  1. Kernel Fusion: Profile with Nsight Systems to identify dispatch bottlenecks - likely in autograd ops
  2. Gradcheck Tuning: Try reducing eps values or limiting input sizes for numerically unstable ops
  3. Memory Analysis: Check torch.cuda.memory_stats() for fragmentation during backward passes

s390x Failures:
These appear unrelated to CUDA changes. The consistent 18-38m failure window suggests possible:

  • Resource contention in emulated environment
  • Architecture-specific numerics divergence

Happy to help triage the CUDA-specific failures if useful. The core change LGTM!

@soulitzer soulitzer added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jul 7, 2025
Signed-off-by: cyy <cyyever@outlook.com>
@cyyever
Copy link
Collaborator Author

cyyever commented Aug 11, 2025

@albanD @ngimel Now the controversial changes have been restored.

Labels
ci-no-td: Do not run TD on this PR
ciflow/binaries: Trigger all binary build and upload jobs on the PR
ciflow/periodic: Trigger jobs run periodically on master (periodic.yml) on the PR
ciflow/slow
ciflow/trunk: Trigger trunk jobs on your pull request
Merged
open source
Reverted
skip-pr-sanity-checks
topic: not user facing
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module