Use official CUDAToolkit module in CMake #154595
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154595
Note: Links to docs will display an error until the docs builds have been completed.
❌ 5 New Failures, 1 Unrelated Failure as of commit ae40a3d with merge base cf4964b.
NEW FAILURES - The following jobs have failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
Force-pushed e67982f to c3f0359
Force-pushed 4037656 to 9d5da10
Force-pushed ed247ae to 22bb4d5
Sorry for the delay on reviewing this, my review queue has been pretty backed up.
This is AMAZING!!!
The change sounds good to me (even though I'm in no way a CMake expert).
But if CI/CD is happy (including cpp extensions tests), I think we're good to go.
Let's try and land this as is!
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased.
Force-pushed 22bb4d5 to 2594e27
@ngimel Detection of the native CPU architecture could be changed to … One fix is using …
Fixed, see commit 3f789e9. Also note that this issue existed before this PR but was only revealed by these changes.
Do you know how this "native" option would work later when we are checking whether the build is OK for the current GPU, to give a clear error message on mismatch?
@ngimel From the nvcc documentation:
CMake does little work here; we rely on nvcc. (IMO they don't want to maintain these flags...)
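For context, a minimal sketch of how the "native" option is wired up on the CMake side. This assumes CMake ≥ 3.24, which accepts `native` as a value of `CMAKE_CUDA_ARCHITECTURES` and forwards it to nvcc (which then probes the locally installed GPUs itself); the project and file names are illustrative:

```cmake
# Sketch: let nvcc detect the GPU(s) present on the build machine.
# CMake >= 3.24 accepts "native" and passes it through to nvcc, which
# queries the local devices and compiles for exactly those architectures.
# CMake itself does no detection here.
cmake_minimum_required(VERSION 3.24)
project(native_arch_demo LANGUAGES CXX CUDA)

set(CMAKE_CUDA_ARCHITECTURES native)

add_executable(demo demo.cu)
```

This matches the comment above: the architecture detection lives in nvcc, not in CMake.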
This PR tries to do too many things in one go (including renames).
Can it be split into 2-3 PRs, one of which would use the new CUDAToolkit package but define all the aliases the system used to, say set(CUDA_VERSION ${CUDAToolkit_VERSION}), etc.?
Or alternatively, have a baseline PR that changes those in the existing FindCUDA in preparation for the new package version.
It looks like there are some changes to how the nvrtc package is defined before/after this change. In my opinion, it would be good to keep the old definitions in place rather than pushing them to custom copy scripts, which will not be executed for users running outside of CI.
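A sketch of the kind of compatibility shim suggested above, mapping legacy FindCUDA spellings onto the official FindCUDAToolkit results. Which variables and targets actually need covering depends on what PyTorch's cmake still reads; the set below is illustrative:

```cmake
# Sketch: keep legacy FindCUDA variables/targets alive on top of the
# official FindCUDAToolkit module so downstream cmake keeps working.
find_package(CUDAToolkit REQUIRED)

set(CUDA_VERSION          ${CUDAToolkit_VERSION})
set(CUDA_INCLUDE_DIRS     ${CUDAToolkit_INCLUDE_DIRS})
set(CUDA_TOOLKIT_ROOT_DIR ${CUDAToolkit_LIBRARY_ROOT})

# Legacy imported-target names aliased onto the new CUDA:: namespace
# (ALIAS of imported targets needs CMake >= 3.18).
if(NOT TARGET caffe2::cublas)
  add_library(caffe2::cublas ALIAS CUDA::cublas)
endif()
```

Such a shim would let the CUDAToolkit migration land first, with the renames following in a separate PR.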
@@ -79,6 +79,7 @@ def package_cuda_wheel(wheel_path, desired_cuda) -> None:
     os.system(f"unzip {wheel_path} -d {folder}/tmp")
     libs_to_copy = [
         "/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",
+        "/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so",
Why is this change necessary if the goal is just to remove FindCUDA?
Some CI jobs broke because nvperf_host.so was not found, and nvperf_host.so is indeed required by cupti.so. If we install cupti.so, we should also install nvperf_host.so.
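The packaging rule being argued for here can be stated as a simple invariant. The sketch below is illustrative, not the actual builder script; the helper name is hypothetical:

```python
# Sketch of the invariant discussed above: if a wheel bundles libcupti,
# it must also bundle libnvperf_host, since libcupti links against it.
libs_to_copy = [
    "/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",
    "/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so",
]


def bundles_cupti_deps(libs):
    """Return True if libnvperf_host accompanies libcupti in the list
    (vacuously True when libcupti is not bundled at all)."""
    has_cupti = any("libcupti" in p for p in libs)
    has_nvperf = any("libnvperf_host" in p for p in libs)
    return (not has_cupti) or has_nvperf


print(bundles_cupti_deps(libs_to_copy))  # True
```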
> Some CI jobs broke because nvperf_host.so was not found, and nvperf_host.so is indeed required by cupti.so

Could you link the failing jobs? I don't understand why we would need the nvperf_* libs now without changing profiling usage in PyTorch or CUPTI itself. Why and how was profiling working before? The nvperf_* libs are used for PC sampling, PM sampling, SASS metrics, or range profiling, and I don't see any related change in this PR, so are we using these?
Because kineto explicitly links to nvperf_host when building PyTorch, we have to make sure nvperf_host can be found. See https://github.com/pytorch/kineto/blob/main/libkineto/CMakeLists.txt
Whether its functions are actually used is another story.
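A sketch of how such a link dependency typically looks on the consumer side. This assumes a CMake version whose FindCUDAToolkit module defines the CUDA::cupti and CUDA::nvperf_host imported targets; the target and file names are illustrative, not kineto's actual build:

```cmake
# Sketch: a profiler backend that links CUPTI. Because libcupti itself
# calls into nvperf_host, linking cupti without shipping
# libnvperf_host.so fails at load time even if the application never
# uses the nvperf APIs directly.
find_package(CUDAToolkit REQUIRED)

add_library(profiler_backend profiler.cpp)
target_link_libraries(profiler_backend PRIVATE CUDA::cupti CUDA::nvperf_host)
```

This is why the wheel-packaging scripts above copy libnvperf_host.so alongside libcupti.so.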
@@ -8,6 +8,7 @@ copy "%CUDA_PATH%\bin\cusolver*64_*.dll*" pytorch\torch\lib
 copy "%CUDA_PATH%\bin\cudnn*64_*.dll*" pytorch\torch\lib
 copy "%CUDA_PATH%\bin\nvrtc*64_*.dll*" pytorch\torch\lib
 copy "%CUDA_PATH%\extras\CUPTI\lib64\cupti64_*.dll*" pytorch\torch\lib
+copy "%CUDA_PATH%\extras\CUPTI\lib64\nvperf_host*.dll*" pytorch\torch\lib
Why are those changes necessary?
cupti requires nvperf_host, as I said before.
caffe2/CMakeLists.txt (Outdated)
target_link_libraries(torch_cuda_linalg PRIVATE
  CUDA::cusolver_static
  ${CUDAToolkit_LIBRARY_DIR}/liblapack_static.a # needed for libcusolver_static
)
Side note: we don't really test static CUDA anymore, nor do we support CUDA-11 anymore; perhaps it's time to delete this logic completely...
I'm not sure whether static CUDA is in use somewhere at Meta. Could you propose removing it?
cmake/Dependencies.cmake (Outdated)
@@ -50,7 +50,7 @@ if(USE_CUDA)
   if(NOT CAFFE2_USE_NVRTC)
     caffe2_update_option(USE_NVRTC OFF)
   endif()
-  list(APPEND Caffe2_CUDA_DEPENDENCY_LIBS caffe2::curand caffe2::cufft caffe2::cublas)
+  list(APPEND Caffe2_CUDA_DEPENDENCY_LIBS torch::curand torch::cufft torch::cublas)
Again, why are we renaming those here?
Caffe2 is gone, take the chance to rename.
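If the rename were split out as malfet suggests, the old spellings could be kept alive during a transition. A sketch of that pattern (the torch:: target names are from this PR; the alias arrangement itself is an illustration, not what the PR does):

```cmake
# Sketch: define the new torch:: interface targets on top of the
# official CUDAToolkit imported targets, then keep the old caffe2::
# spellings as aliases so out-of-tree cmake does not break while the
# rename propagates.
find_package(CUDAToolkit REQUIRED)

add_library(torch_cublas INTERFACE)
target_link_libraries(torch_cublas INTERFACE CUDA::cublas)

add_library(torch::cublas  ALIAS torch_cublas)
add_library(caffe2::cublas ALIAS torch_cublas)
```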
@malfet Renaming the targets should be split out, but the other changes are not easy to separate, out of concern about mixing the new and old modules and their strange interactions.
So after this change … Its value is obtained from …
Because cupti depends on nvperf_host, as discussed in #154595. Pull Request resolved: #156668. Approved by: https://github.com/Skylion007
Force-pushed a50793a to 1016376
@malfet Unrelated changes have been moved elsewhere.
CUDA Performance Insight: …
Recommendations: …
s390x Failures: …
Happy to help triage the CUDA-specific failures if useful. The core change LGTM!
Signed-off-by: cyy <cyyever@outlook.com>
Use the CUDA language in CMake and remove the forked FindCUDAToolkit.cmake. Some CUDA targets are also renamed with the torch:: prefix.

cc @albanD