Use static CUDA upstream libs on Unix #2785


Merged — 2 commits merged into arrayfire:master on Apr 1, 2020

Conversation

@9prady9 9prady9 (Member) commented Mar 9, 2020

  • Use static cufft, cublas, cusparse, and cusolver on Unix; dynamic libraries are still used on Windows. (Check commit message for details.)
  • Removed the cuda_thrust_sort_by_key static dependency. (Check commit message for details.)
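As a rough illustration (not the PR's actual CMake, which lives in the diff), switching between the static and dynamic CUDA math libraries per platform can look like the sketch below. It uses the `CUDA::*_static` imported targets from CMake's stock FindCUDAToolkit module (added in CMake 3.17); ArrayFire at the time may have used its own find modules instead.

```cmake
# Sketch: prefer the static CUDA math libraries on Unix, the shared ones
# on Windows. The CUDA::*_static imported targets are provided by CMake's
# FindCUDAToolkit module.
find_package(CUDAToolkit REQUIRED)

if(WIN32)
  target_link_libraries(afcuda PRIVATE
    CUDA::cufft CUDA::cublas CUDA::cusparse CUDA::cusolver)
else()
  target_link_libraries(afcuda PRIVATE
    CUDA::cufft_static CUDA::cublas_static
    CUDA::cusparse_static CUDA::cusolver_static)
endif()
```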

@WilliamTambellini WilliamTambellini (Contributor) left a comment


This is a big change, and it looks like there is a way to split it into 2 PRs (at least):
1- link with static nvidia libs
2- other changes
It is risky not to split a big PR; some of the issues I've suffered from in the last 3 years in AF were due to big PRs.
Would you mind splitting this one in 2?

@9prady9 9prady9 (Member, Author) commented Mar 9, 2020

Except for the following commits:

  • Renamed .cu to .cpp where possible and cleaned up the CUDA fast/orb code
  • Moved CUDA kernels to runtime (NVRTC) compilation wherever possible

the rest are fixes to enable linking with the static CUDA libs. They also fix CUDA separable compilation in ArrayFire (broken right now), which is needed for the former to work successfully.

The reason I kept them together is that the changes are independent enough to be commits by themselves - none of them breaks non-separable compilation or the tests. Each commit compiles and passes tests on my local Linux and Windows boxes. Also, each commit is separated by the type of change so that reviewing is easy. Moving them into different branches, and thereby other PRs, doesn't add any value in this case, I think.

Note: Except for the NVRTC-related commit, the rest are largely CMake changes, barring the LUT commit (a minor change compared to NVRTC).

@umar456 umar456 (Member) left a comment


I made a few comments, but this is a really large PR. I would really appreciate it if you created separate PRs so we can discuss your changes individually.

@9prady9 9prady9 force-pushed the use_static_culibs branch from 153504e to 8411f63 on March 14, 2020 09:29
@9prady9 9prady9 requested a review from umar456 March 14, 2020 09:30
@9prady9 9prady9 changed the title Use static CUDA upstream libs and other changes to reduce global constant memory usage Use static CUDA upstream libs on Unix Mar 14, 2020
@9prady9 9prady9 dismissed umar456’s stale review March 14, 2020 09:34

Addressed feedback

@9prady9 9prady9 force-pushed the use_static_culibs branch from 8411f63 to 2512d56 on March 24, 2020 06:02
@umar456 umar456 previously approved these changes Mar 24, 2020
Instead of creating a static library out of all the separate instantiations
of the thrust_sort_by_key sources, we now embed the sources generated
(using CMake's configure_file command) directly into the afcuda target.

This also fixed separable compilation.
Prior to this change, separable compilation failed (related to CUDA device
linking - undefined references). I tried to fix that problem but
couldn't get a breakthrough. However, I realized that directly compiling
the generated sources into the afcuda target does the job without
any additional static library.
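A minimal sketch of the approach the commit message describes, with hypothetical template and type names (the real file names are in the PR diff):

```cmake
# Sketch: instantiate thrust_sort_by_key once per type via configure_file
# and compile the generated sources directly into afcuda, instead of
# collecting them in an intermediate static library.
foreach(T float double int uint)
  set(TYPE ${T})  # substituted for @TYPE@ in the hypothetical template
  configure_file(thrust_sort_by_key.cu.in
                 ${CMAKE_CURRENT_BINARY_DIR}/thrust_sort_by_key_${T}.cu @ONLY)
  target_sources(afcuda PRIVATE
                 ${CMAKE_CURRENT_BINARY_DIR}/thrust_sort_by_key_${T}.cu)
endforeach()
```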
@9prady9 9prady9 force-pushed the use_static_culibs branch 3 times, most recently from 055130e to 700e420 on March 25, 2020 12:54
@9prady9 9prady9 force-pushed the use_static_culibs branch from 700e420 to a578e64 on March 26, 2020 06:15
@9prady9 9prady9 modified the milestones: 3.7.1, 3.7.2 Mar 26, 2020
thrust::stable_sort_by_key has a known issue with device linking: the code
crashes with cudaInvalidValueError. Otherwise it works as expected without
any changes, with or without separable compilation.

https://github.com/thrust/thrust/wiki/Debugging#known-issues
https://github.com/thrust/thrust/blob/master/doc/changelog.md#known-issues-2

The above documents mention a known issue with device linking and Thrust.
Although they say it happens in debug mode (with the -G flag), I noticed
similar crashes in ArrayFire in the release configuration too.

Due to the above issue, I have separated out the relevant source files
(fft, blas, sparse and solver), which require device linking, into a
separate static library. Once they were separated, sort_by_key and all
the other unit tests that use it ran as expected without any crashes.
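The workaround can be sketched in CMake roughly as follows (the library and file names here are illustrative, not the PR's actual ones):

```cmake
# Sketch: only the sources that need CUDA device linking go into a
# separate static library with separable compilation enabled, so the
# thrust sort code compiled into afcuda is never device-linked.
add_library(afcuda_static_cuda_library STATIC
            fft.cu blas.cu sparse.cu solver.cu)
set_target_properties(afcuda_static_cuda_library PROPERTIES
                      CUDA_SEPARABLE_COMPILATION ON
                      POSITION_INDEPENDENT_CODE ON)
target_link_libraries(afcuda PRIVATE afcuda_static_cuda_library)
```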
@9prady9 9prady9 force-pushed the use_static_culibs branch from a578e64 to fbe662f on March 30, 2020 05:42
@9prady9 9prady9 requested a review from umar456 March 30, 2020 09:01
@9prady9 9prady9 merged commit 08296d6 into arrayfire:master Apr 1, 2020
@daniel-dsouza

Is there still a way to dynamically link to CUDA? The uncompressed libafcuda.so.3.7.2 is almost a gigabyte in size!

@9prady9 9prady9 (Member, Author) commented Jul 23, 2020

@daniel-dsouza The ~1GB binary size is not due to statically linking the CUDA libraries. It is due to compiling the library for multiple compute capabilities of CUDA cards, from 3.0 up to 7.5. For our website binaries to work optimally on all CUDA cards, we build for all compute versions, which increases the binary size; for a single compute version, the CUDA backend ends up at approximately 450 MB. Having said that, we are aware of the issue with such a huge size and are constantly working on reducing it. As part of that effort, we moved most of the kernels to runtime compilation (NVRTC) instead of offline compilation with nvcc, which brought the CUDA backend binary down a good bit. We have more work in the pipeline towards the same goal for future releases.
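For readers who only need their own GPU's architecture, restricting the build to a single compute capability is the usual way to shrink the binary. A sketch of a configure step follows; the ArrayFire CMake option name shown is an assumption and may differ between versions, so check the project's CMake files for the authoritative variable (CMake 3.18+ also offers the generic CMAKE_CUDA_ARCHITECTURES mechanism).

```shell
# Sketch: configure ArrayFire for a single compute capability (7.5 here).
# The option name below is an assumption; verify it against the project's
# CMake files before relying on it.
cmake .. -DCUDA_architecture_build_targets=7.5
cmake --build . --parallel
```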

@daniel-dsouza

@9prady9 Thank you for explaining this to me. Currently I am building for compute capability 5.0 and higher, so I can look into narrowing that down to my specific hardware.
