Use static CUDA upstream libs on Unix #2785
Conversation
This is a big change, and it looks like there is a way to split it into (at least) two PRs:
1- link with static nvidia libs
2- other changes
It is risky not to split big PRs, and some of the issues I've suffered from in the last 3 years in AF were due to big PRs.
Would you mind splitting this one in two?
Except the following commits
the rest of them are actually fixes to enable linking with static CUDA libs - they also fix CUDA separable compilation in ArrayFire (broken right now), which is needed for the former to work successfully. The reason I kept them separate is that those changes are independent enough to be commits by themselves - they don't break non-separable compilation or the tests. Each of the commits compiles and passes tests on my local Linux and Windows boxes. Also, each commit is separated by the type of change so that reviewing is easy. Moving them into different branches, and thereby other PRs, doesn't add any value in this case, I think. Note: except the nvrtc-related commit, the rest of them are largely CMake-related changes, barring the LUT commit (a minor change compared to nvrtc).
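For context, a minimal sketch of what "link with static nvidia libs" can look like with modern CMake's `FindCUDAToolkit` imported targets. This is illustrative only, not the exact change in this PR; the `AF_WITH_STATIC_CUDA_LIBS` option name is made up for the example, and it assumes an existing `afcuda` target.

```cmake
# Hedged sketch: switch between static and shared CUDA toolkit libraries.
find_package(CUDAToolkit REQUIRED)

option(AF_WITH_STATIC_CUDA_LIBS "Link the static CUDA toolkit libraries" ON)  # hypothetical option

if(UNIX AND AF_WITH_STATIC_CUDA_LIBS)
  target_link_libraries(afcuda
    PRIVATE
      CUDA::cufft_static
      CUDA::cublas_static
      CUDA::cusparse_static
      CUDA::cusolver_static
      CUDA::culibos)          # common dependency of the *_static variants
else()
  target_link_libraries(afcuda
    PRIVATE
      CUDA::cufft
      CUDA::cublas
      CUDA::cusparse
      CUDA::cusolver)
endif()
```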
I made a few comments, but this is a really large PR. I would really appreciate it if you created separate PRs so we can discuss your changes individually.
Force-pushed from 153504e to 8411f63
Force-pushed from 8411f63 to 2512d56
Instead of creating a static library out of all the separate instantiations of the thrust_sort_by_key sources, we now directly embed the generated sources (produced with CMake's configure_file command) into the afcuda target. This also fixed separable compilation. Prior to this change, separable compilation failed (related to CUDA device linking - undefined references). I tried to fix that problem but couldn't get a breakthrough. However, I realized that directly adding the generated sources to the afcuda target does the job without any additional static library.
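A minimal sketch of that pattern, assuming a hypothetical template `thrust_sort_by_key.cu.in` that is instantiated once per type (the file, variable, and type names here are illustrative, not the actual ArrayFire sources):

```cmake
# Hedged sketch: generate one .cu per type with configure_file and add the
# generated files directly to the afcuda target, with no intermediate library.
set(SORT_BY_KEY_SOURCES "")
foreach(key_type float double int uint)
  set(SBK_TYPE ${key_type})   # consumed by @SBK_TYPE@ inside the template
  configure_file(kernel/thrust_sort_by_key.cu.in
                 ${CMAKE_CURRENT_BINARY_DIR}/thrust_sort_by_key_${key_type}.cu
                 @ONLY)
  list(APPEND SORT_BY_KEY_SOURCES
       ${CMAKE_CURRENT_BINARY_DIR}/thrust_sort_by_key_${key_type}.cu)
endforeach()

# Embed the generated sources straight into afcuda.
target_sources(afcuda PRIVATE ${SORT_BY_KEY_SOURCES})
```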
Force-pushed from 055130e to 700e420
Force-pushed from 700e420 to a578e64
thrust::stable_sort_by_key has a known issue with device linking: the code crashes with a cudaErrorInvalidValue error. Otherwise it works as expected, with or without separable compilation. https://github.com/thrust/thrust/wiki/Debugging#known-issues https://github.com/thrust/thrust/blob/master/doc/changelog.md#known-issues-2 The above documents mention a known issue with device linking and thrust. Although they say it happens in debug mode (with the -G flag), I noticed similar crashes in the release configuration of ArrayFire too. Due to this issue, I have separated out the relevant source files (fft, blas, sparse, and solver) which require device linking into a separate static library. Once that was done, sort_by_key and all the other unit tests that use it ran as expected without any crashes.
Force-pushed from a578e64 to fbe662f
Is there still a way to dynamically link to CUDA? The uncompressed libafcuda.so.3.7.2 is almost a gigabyte in size!
@daniel-dsouza The 1 GB binary size is not due to static linking of the CUDA libs. It is rather due to compiling the library for multiple CUDA compute capabilities, from 3.0 through 7.5. For our website binaries to work optimally on all CUDA cards, we build with all compute versions, hence the increased binary size. For a single compute version, for example, the CUDA backend ends up being approximately 450 MB. Having said that, we are aware of the issue with such a huge size and we are constantly working on reducing it. As part of that effort, we moved most of the kernels to runtime compilation instead of offline compilation using nvcc. This brought the CUDA backend binary down a good bit. We have more work in the pipeline towards the same goal for future releases.
@9prady9 Thank you for explaining this to me. Currently I am building from compute version 5 and higher, so I can look into narrowing that down to my specific hardware.
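For readers wanting to do the same, a minimal sketch of pinning the build to a single compute capability (7.5 here is just an example value); this uses CMake's own `CUDA_ARCHITECTURES` mechanism (CMake 3.18+) rather than whatever option the ArrayFire build exposed at the time:

```cmake
# Hedged sketch: build device code for only one compute capability to shrink
# the fat binary, instead of every architecture from 3.0 through 7.5.
set(CMAKE_CUDA_ARCHITECTURES 75)

# or per target, assuming an afcuda target exists:
set_target_properties(afcuda PROPERTIES CUDA_ARCHITECTURES 75)
```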