Use official CUDAToolkit module in CMake #154595

Open

wants to merge 3 commits into main from cuda_language

Conversation

cyyever
Collaborator

@cyyever cyyever commented May 29, 2025

Use the CUDA language in CMake and remove the forked FindCUDAToolkit.cmake.
Some CUDA targets are also renamed with a torch:: prefix.
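
For illustration, one plausible way to express the rename (a sketch, not necessarily how this PR implements it; aliasing imported targets needs CMake >= 3.18):

    find_package(CUDAToolkit REQUIRED)
    # Re-export the imported toolkit targets under a torch:: namespace.
    add_library(torch::cublas ALIAS CUDA::cublas)
    add_library(torch::curand ALIAS CUDA::curand)
    add_library(torch::cufft ALIAS CUDA::cufft)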

cc @albanD

pytorch-bot bot commented May 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154595

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 1 Unrelated Failure

As of commit ae40a3d with merge base cf4964b:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@cyyever cyyever marked this pull request as draft May 29, 2025 05:30
@cyyever
Collaborator Author

cyyever commented May 29, 2025

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing label (topic category) May 29, 2025
@cyyever cyyever force-pushed the cuda_language branch 2 times, most recently from e67982f to c3f0359 Compare May 29, 2025 05:55
@cyyever cyyever force-pushed the cuda_language branch 7 times, most recently from 4037656 to 9d5da10 Compare May 29, 2025 09:34
@cyyever cyyever marked this pull request as ready for review June 14, 2025 23:35
@cyyever cyyever changed the title from Cuda language to Use CUDA language in CMake Jun 14, 2025
@cyyever cyyever changed the title from Use CUDA language in CMake to Use official CUDAToolkit module in CMake Jun 14, 2025
@cyyever cyyever force-pushed the cuda_language branch 2 times, most recently from ed247ae to 22bb4d5 Compare June 15, 2025 08:04
albanD
albanD previously approved these changes Jun 16, 2025
Collaborator

@albanD albanD left a comment

Sorry for the delay on reviewing this, my review queue has been pretty backed up.
This is AMAZING!!!

The change sounds good to me (even though I'm in no way a CMake expert).
But if CI/CD is happy (including the cpp extensions tests), I think we're good to go.

Let's try and land this as is!

@cyyever cyyever marked this pull request as draft June 16, 2025 23:59
@cyyever
Collaborator Author

cyyever commented Jun 16, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased cuda_language onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cuda_language && git pull --rebase)

@cyyever
Collaborator Author

cyyever commented Jun 23, 2025

@ngimel Detection of the native GPU architecture could be changed to set(CUDA_ARCHITECTURES "native"), which essentially passes native to nvcc. The old behavior of printing the detected GPUs requires CUDA_DETECT_INSTALLED_GPUS, which is unfortunately deprecated; see https://gitlab.kitware.com/cmake/cmake/-/issues/19199.

One fix is using CUDA_ARCHITECTURES, as shown in https://cmake.org/cmake/help/latest/prop_tgt/CUDA_ARCHITECTURES.html .
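
For illustration, a minimal sketch of the target-property form in a standalone project (demo_kernels and kernels.cu are hypothetical names, not from this PR; the "native" value requires CMake >= 3.24):

    cmake_minimum_required(VERSION 3.24)
    project(demo LANGUAGES CXX CUDA)
    add_library(demo_kernels STATIC kernels.cu)
    # Equivalent to nvcc -arch=native: build only for the GPUs visible at build time.
    set_target_properties(demo_kernels PROPERTIES CUDA_ARCHITECTURES native)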

@cyyever
Collaborator Author

cyyever commented Jun 23, 2025

Concretely, this is wrong in CMake:

      if(CUDA_LIMIT_GPU_ARCHITECTURE AND ITEM VERSION_GREATER_EQUAL CUDA_LIMIT_GPU_ARCHITECTURE)
        list(GET CUDA_COMMON_GPU_ARCHITECTURES -1 NEWITEM)
        string(APPEND CUDA_GPU_DETECT_OUTPUT_FILTERED " ${NEWITEM}")
      else()
        string(APPEND CUDA_GPU_DETECT_OUTPUT_FILTERED " ${ITEM}")
      endif()

as it either incorrectly sets CUDA_LIMIT_GPU_ARCHITECTURE or does an incorrect comparison here, and thus sets the architecture to CUDA_COMMON_GPU_ARCHITECTURES. I'm on CMake 3.27, but I've seen the same behavior on 4.0.

Fixed, see commit 3f789e9. Also note that this bug existed before this PR; it was only revealed by these changes.

@ngimel
Collaborator

ngimel commented Jun 24, 2025

Do you know how this "native" option would work later, when we check whether the build is OK for the current GPU in order to give a clear error message on mismatch?

@cyyever
Collaborator Author

cyyever commented Jun 24, 2025

@ngimel From the nvcc documentation:

When -arch=native is specified, nvcc detects the visible GPUs on the system and generates codes for them, no PTX program will be generated for this option. It is a warning if there are no visible supported GPU on the system, and the default architecture will be used.

CMake does little work here; we rely on nvcc. (IMO they don't want to maintain these flags...)

Contributor

@malfet malfet left a comment

This PR tries to do too many things in one go (including renames).
Can it be split into 2-3 PRs, one of which would use the new CUDAToolkit package but just define all the aliases the system used to, say set(CUDA_VERSION ${CUDAToolkit_VERSION}), etc.?

Or alternatively, have a baseline PR that changes those in the existing FindCUDA in preparation for the new package version.

Looks like there are some changes to how the nvrtc package is defined before/after this change. In my opinion, it would be good to keep the old definitions in place rather than pushing them into custom copy scripts, which will not be executed for users running outside of CI.
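
For illustration, a sketch of the kind of compatibility shim suggested above, mapping the CUDAToolkit module's result variables back to the legacy FindCUDA names (the exact alias set PyTorch would need is an assumption here):

    find_package(CUDAToolkit REQUIRED)
    # Legacy FindCUDA variable names, backed by the new module's results.
    set(CUDA_VERSION ${CUDAToolkit_VERSION})
    set(CUDA_INCLUDE_DIRS ${CUDAToolkit_INCLUDE_DIRS})
    set(CUDA_TOOLKIT_ROOT_DIR ${CUDAToolkit_LIBRARY_ROOT})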

@@ -79,6 +79,7 @@ def package_cuda_wheel(wheel_path, desired_cuda) -> None:
os.system(f"unzip {wheel_path} -d {folder}/tmp")
libs_to_copy = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",
"/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so",
Contributor

Why is this change necessary if the goal is just to remove FindCUDA?

Collaborator Author

Some CI jobs broke because nvperf_host.so could not be found, and nvperf_host.so is indeed required by cupti.so. If we install cupti.so, we should also install nvperf_host.so.

Collaborator

Some CI jobs broke because nvperf_host.so could not be found, and nvperf_host.so is indeed required by cupti.so

Could you link the failing jobs? I don't understand why we would need the nvperf_* libs now without changing profiling usage in PyTorch or CUPTI itself. Why and how was profiling working before?
The nvperf_* libs are used for PC sampling, PM sampling, SASS metrics, or range profiling, and I don't see any related change in this PR, so are we using these?

Collaborator Author

@cyyever cyyever Jun 25, 2025

Because Kineto explicitly links to nvperf_host when building PyTorch, we have to make sure nvperf_host can be found afterwards. See https://github.com/pytorch/kineto/blob/main/libkineto/CMakeLists.txt
Whether its functions are really used is another story.
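
For illustration, a minimal sketch of that dependency chain using the official module's imported targets (profiler_demo is a hypothetical target, not Kineto's actual build; assumes a CMake version that defines CUDA::nvperf_host):

    find_package(CUDAToolkit REQUIRED)
    add_library(profiler_demo SHARED profiler.cpp)
    # Linking CUPTI brings a runtime dependency on libnvperf_host,
    # so packaging must ship both libraries together.
    target_link_libraries(profiler_demo PRIVATE CUDA::cupti CUDA::nvperf_host)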

@@ -8,6 +8,7 @@ copy "%CUDA_PATH%\bin\cusolver*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\cudnn*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\nvrtc*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\extras\CUPTI\lib64\cupti64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\extras\CUPTI\lib64\nvperf_host*.dll*" pytorch\torch\lib
Contributor

Why are those changes necessary?

Collaborator Author

cupti requires nvperf_host, as I said before.

Comment on lines 1080 to 1083
target_link_libraries(torch_cuda_linalg PRIVATE
CUDA::cusolver_static
${CUDAToolkit_LIBRARY_DIR}/liblapack_static.a # needed for libcusolver_static
)
Contributor

Side note: we don't really test static CUDA anymore, nor do we support CUDA 11 anymore; perhaps it's time to delete this logic completely...

Collaborator Author

I'm not sure whether static CUDA is still in use somewhere in Meta. Could you propose removing it?

@@ -50,7 +50,7 @@ if(USE_CUDA)
if(NOT CAFFE2_USE_NVRTC)
caffe2_update_option(USE_NVRTC OFF)
endif()
-list(APPEND Caffe2_CUDA_DEPENDENCY_LIBS caffe2::curand caffe2::cufft caffe2::cublas)
+list(APPEND Caffe2_CUDA_DEPENDENCY_LIBS torch::curand torch::cufft torch::cublas)
Contributor

Again, why are we renaming those here?

Collaborator Author

Caffe2 is gone, so I took the chance to rename.

@cyyever
Collaborator Author

cyyever commented Jun 24, 2025

@malfet Renaming the targets could be split out, but the other changes are not easy to separate, given the concern of mixing new and old modules and their strange interactions.

@ngimel
Collaborator

ngimel commented Jun 24, 2025

@ngimel From the nvcc documentation:

When -arch=native is specified, nvcc detects the visible GPUs on the system and generates codes for them, no PTX program will be generated for this option. It is a warning if there are no visible supported GPU on the system, and the default architecture will be used.

CMake does little work here; we rely on nvcc. (IMO they don't want to maintain these flags...)

So after this change _cuda_getArchFlags() won't work as expected?

@cyyever
Collaborator Author

cyyever commented Jun 24, 2025

@ngimel From the nvcc documentation:

When -arch=native is specified, nvcc detects the visible GPUs on the system and generates codes for them, no PTX program will be generated for this option. It is a warning if there are no visible supported GPU on the system, and the default architecture will be used.

CMake does little work here; we rely on nvcc. (IMO they don't want to maintain these flags...)

So after this change _cuda_getArchFlags() won't work as expected?

Its value is obtained from TORCH_CUDA_ARCH_LIST.
For a TORCH_CUDA_ARCH_LIST of explicit SMs, the behavior is the same. Virtual names such as "All", "Common" or "Auto" will no longer be translated into a detailed list.
IMO it's a design decision of CUDA to simplify all this arch specification, and the CMake deprecation merely reflects that trend.
If we want to return the supported list rather than the virtual specification, we have to maintain it ourselves and append to it for each new CUDA generation. That is also possible; I could revert some changes here.
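
For illustration, a hedged sketch (not PyTorch's actual logic) of how an explicit list could keep the old behavior while a virtual name defers to nvcc; _archs is illustrative, and real arch lists may carry suffixes such as +PTX that need extra handling:

    if(TORCH_CUDA_ARCH_LIST STREQUAL "Auto")
      # Defer detection to nvcc, as with -arch=native.
      set(CMAKE_CUDA_ARCHITECTURES native)
    else()
      # "8.0;9.0" -> "80;90", CMake's numeric architecture form.
      string(REPLACE "." "" _archs "${TORCH_CUDA_ARCH_LIST}")
      set(CMAKE_CUDA_ARCHITECTURES ${_archs})
    endif()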

pytorchmergebot pushed a commit that referenced this pull request Jun 28, 2025
Because cupti depends on nvperf_host, as discussed in #154595

Pull Request resolved: #156668
Approved by: https://github.com/Skylion007
@cyyever cyyever marked this pull request as draft June 30, 2025 01:35
@cyyever cyyever force-pushed the cuda_language branch 7 times, most recently from a50793a to 1016376 Compare June 30, 2025 01:58
@cyyever cyyever marked this pull request as ready for review June 30, 2025 14:55
@cyyever
Collaborator Author

cyyever commented Jun 30, 2025

@malfet The unrelated changes have been moved elsewhere.

@cyyever cyyever requested a review from malfet June 30, 2025 14:56
@thotakeerth

CUDA Performance Insight:
Reproduced the slow-gradcheck failures on A100-SXM4-80GB (CUDA 12.4). The latency spikes align with kernel launch overhead patterns I've seen in transformer workloads.

Recommendations:

  1. Kernel Fusion: Profile with Nsight Systems to identify dispatch bottlenecks - likely in autograd ops
  2. Gradcheck Tuning: Try reducing eps values or limiting input sizes for numerically unstable ops
  3. Memory Analysis: Check torch.cuda.memory_stats() for fragmentation during backward passes

s390x Failures:
These appear unrelated to CUDA changes. The consistent 18-38m failure window suggests possible:

  • Resource contention in emulated environment
  • Architecture-specific numerics divergence

Happy to help triage the CUDA-specific failures if useful. The core change LGTM!

@soulitzer soulitzer added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jul 7, 2025
Signed-off-by: cyy <cyyever@outlook.com>
@cyyever
Copy link
Collaborator Author

cyyever commented Aug 11, 2025

@albanD @ngimel Now the controversial changes have been restored.

Labels
ci-no-td: Do not run TD on this PR
ciflow/binaries: Trigger all binary build and upload jobs on the PR
ciflow/periodic: Trigger jobs run periodically on master (periodic.yml) on the PR
ciflow/slow
ciflow/trunk: Trigger trunk jobs on your pull request
Merged
open source
Reverted
skip-pr-sanity-checks
topic: not user facing
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module