Automatically cache compiled CUDA kernels on disk to speed up kernel compilation #2848
Conversation
@cschreib-ibex It would be great if you could separate "Removed constexpr not supported by VS2015" into its own PR. Thanks for working on this! Nice job. I have left some feedback; we can continue the discussion about caching in this PR, while the VS2015 fix doesn't need to wait for this change to go in.
We can actually move all directory related functions to |
@9prady9 Thank you for your review! I'll implement your suggestions and split the VS2015 stuff in a separate branch / PR. |
I've renamed the function. I have also created a separate PR for the VS2015 compilation fix, #2850. I can revert the relevant commit from the current PR if you wish, although that will make it more painful for me to make further adjustments.
@cschreib-ibex You can drop |
@9prady9 Done! |
We can have the VS PR merged soon, it is a minor change. I am waiting for the Windows jobs to finish. |
@cschreib-ibex #2850 is merged, you can rebase this branch for future changes/testing. |
Branch updated: a6b061f to 2bfe895.
This looks fantastic. I have made a couple of requests. There is still a question about multiple threads. I noticed you guys had a solution, but I am not sure it will be sufficient to prevent conflicts. I was thinking we create a separate thread that handles the writing to disk: it would wait for things to be added to a queue and slowly write them to disk. We could use the async_queue class for this. Perhaps this is out of the scope of this PR. What do you think, @9prady9?
@umar Thanks for the comments, I'll implement your suggested changes. As for the threaded caching, using a separate thread for writing would still not safeguard us against two separate ArrayFire executables writing the same kernel file at the same time. The solution of writing to a temporary file with a thread-unique name (which guarantees no concurrent writes) and trying to move the file at the end is really easy to implement and provides full safety. I've implemented and tested it locally, but haven't pushed the changes here yet, waiting on your decision. Another solution, which has been suggested on SO, is to use OS-specific functions to lock the file prior to reading/writing it, which would prevent two threads (even from different processes) from accessing it simultaneously. I like this solution less because it requires platform-dependent code (which can get harder to maintain in the long run), while the other solution above is standard C/C++.
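The temp-file-then-rename scheme described above can be sketched as follows. This is a hypothetical illustration, not the actual ArrayFire code; the function and variable names are assumptions:

```cpp
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <thread>

// Hypothetical sketch: write the kernel binary to a thread-unique
// temporary file, then rename it onto the final path. Concurrent
// writers (threads or processes) never touch the same temporary file.
bool writeKernelAtomically(const std::string& finalPath,
                           const std::string& data) {
    std::ostringstream tmp;
    tmp << finalPath << ".tmp." << std::this_thread::get_id();
    const std::string tmpPath = tmp.str();

    {
        std::ofstream out(tmpPath, std::ios::binary);
        if (!out) return false;
        out.write(data.data(), static_cast<std::streamsize>(data.size()));
        if (!out) return false;
    }  // file closed and flushed here

    // std::rename is atomic on POSIX when source and destination live on
    // the same filesystem; note that on Windows it fails if the
    // destination already exists, which is harmless here (another writer
    // won the race). On failure, just drop the temporary file.
    if (std::rename(tmpPath.c_str(), finalPath.c_str()) != 0) {
        std::remove(tmpPath.c_str());
        return false;
    }
    return true;
}
```

A thread id alone does not distinguish two processes; a real implementation would likely mix in the process id or a random token as well.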
Forgot to convert the result of std::hash (a std::size_t) into a string before appending to "KER".
This prevents data races where two threads or two processes might write to the same file.
FNV-1a is a fine algorithm for now. If we run into any issues, we can look into another algorithm, although that would be a hell of a debug session.
Good point about the multiple-process issue. I agree that could be a problem with multiple applications running at the same time. Let's go with the move approach you suggested. I think as long as we don't hit this situation often, we will not run into performance issues from multiple threads writing to disk.
@cschreib-ibex Cool changes since I last checked it out. Just a couple of minor changes:
- Missing return statement; this can cause issues in the scenario where neither XDG_CACHE_HOME nor HOME is defined.
- Internal documentation for a couple of the functions that were added for hashing.
@cschreib-ibex Final one, formatting seems to be off in |
@9prady9 Man, clang-tidy is very particular :) It doesn't like the default behavior of Visual Studio. Anyways, I've fixed the formatting. |
@cschreib-ibex I understand. Sorry, there seems to be some other thing off.

-/// Return a string suitable for naming a temporary file.
+/// Return a string suitable for naming a temporary file.
 ///
 /// Every call to this function will generate a new string with a very low
 /// probability of colliding with past or future outputs of this function,
Finally! Thanks @cschreib-ibex. FYI: "Finally" was for the CI job :)
No worries :P Thanks for the tip on |
@cschreib-ibex this is really cool. Out of curiosity, do you have any benchmarks with respect to the performance improvement?
On our systems, and with our particular application which makes heavy use of ArrayFire, the runtime when "cold" (i.e., the first run of our processing after the application has started) went down from 28s to 9s. |
…ayfire#2848)

* Adds CMake variable AF_CACHE_KERNELS_TO_DISK to enable kernel caching. It is turned ON by default.
* cuda::buildKernel() now dumps the cubin to disk for reuse.
* Adds cuda::loadKernel() for loading cached cubin files.
* cuda::loadKernel() returns an empty kernel on failure.
* Uses XDG_CACHE_HOME as the cache directory on Linux.
* Adds common::deterministicHash(). This uses the FNV-1a hashing algorithm for fast and reproducible hashing of string or binary data. It is meant to replace the use of std::hash in some places, since std::hash does not guarantee its return value will be the same in subsequent executions of the program.
* Writes each cached kernel to a temporary file before moving it into the final file. This prevents data races where two threads or two processes might write to the same file.
* Uses deterministicHash() for hashing kernel names and kernel binary data.
* Adds a kernel binary data file integrity check upon loading from disk.
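The FNV-1a algorithm mentioned in the commit message works by XORing each byte into the hash and then multiplying by a fixed prime. A minimal sketch of the 64-bit variant (the function name is an assumption; the actual deterministicHash() may differ in signature and width):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Illustrative FNV-1a sketch, not the actual ArrayFire implementation.
// The constants are the standard 64-bit FNV offset basis and prime.
std::uint64_t fnv1a(const void* data, std::size_t n) {
    const auto* bytes = static_cast<const unsigned char*>(data);
    std::uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
    for (std::size_t i = 0; i < n; ++i) {
        hash ^= bytes[i];          // XOR in the next byte...
        hash *= 1099511628211ULL;  // ...then multiply by the FNV prime
    }
    return hash;
}

std::uint64_t fnv1a(const std::string& s) {
    return fnv1a(s.data(), s.size());
}
```

Unlike std::hash, this produces the same value in every run of the program, which is what makes it usable for on-disk cache file names.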
This PR addresses #2503 and #2845 (the latter addressed in a separate PR: #2850).

Kernels are saved to disk in CUBIN format straight after being compiled in buildKernel(). The folder they are saved into is platform-dependent:

* Linux: /var/lib/arrayfire, or ~/.arrayfire, or /tmp/arrayfire (in order of decreasing priority)
* Windows: %APPDATA%\Temp\ArrayFire
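The environment-variable fallback discussed in the review (XDG_CACHE_HOME, then HOME, then a last-resort temporary directory) could be sketched like this. The function name and exact path layout are illustrative assumptions, not the PR's actual code:

```cpp
#include <cstdlib>
#include <string>

// Hypothetical sketch of the Linux cache-directory lookup order:
// XDG_CACHE_HOME first, then HOME, then /tmp as a last resort.
std::string getCacheDirectory() {
    if (const char* xdg = std::getenv("XDG_CACHE_HOME"))
        return std::string(xdg) + "/arrayfire";
    if (const char* home = std::getenv("HOME"))
        return std::string(home) + "/.arrayfire";
    return "/tmp/arrayfire";  // fallback when neither variable is set
}
```

Note the review comment above about a missing return statement: every branch here must return something, or the "neither variable defined" case invokes undefined behavior.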
Each kernel is saved in a separate file, and the file name is built from a hash of the kernel function name (that name is already used internally for caching, so hopefully it won't collide). The device compute capability and the AF API version are also encoded in the file name, so if either ever changes, kernels will get recompiled automatically.
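A file name built from these pieces could look roughly like the following sketch. Apart from the "KER" prefix mentioned earlier in the thread, the function name, field order, and separators are assumptions, not the actual implementation:

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical cache file name: hashed kernel name, plus the device
// compute capability and AF API version so that changing either one
// makes the lookup miss and forces a recompile.
std::string cacheFileName(std::size_t nameHash, int ccMajor, int ccMinor,
                          int apiVersion) {
    std::ostringstream name;
    name << "KER" << nameHash             // hash of the kernel function name
         << "_CU_" << ccMajor << ccMinor  // compute capability, e.g. 7.5 -> 75
         << "_AF_" << apiVersion          // API version bump invalidates cache
         << ".bin";
    return name.str();
}
```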
Disk usage looks very reasonable; on average 5 kB per kernel on the tests I ran.
Then, whenever a kernel is requested, we first look for it in the memory cache (as was done before); if it isn't there, we try the disk cache (that's new); and if that fails too, we build the kernel from scratch (as was done before).
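That three-tier lookup can be sketched as follows. Stub functions stand in for the real disk load and compilation, and all names are illustrative placeholders, not the actual ArrayFire internals:

```cpp
#include <map>
#include <string>

// Placeholder kernel handle for the sketch.
struct Kernel { void* handle = nullptr; };

static std::map<std::string, Kernel> memoryCache;  // per-process cache
static int compileCount = 0;  // instrumentation for this sketch only

// Stub: pretend the disk cache always misses (returns an empty kernel).
Kernel loadKernelFromDisk(const std::string&) { return {}; }

// Stub: "compile" by returning a non-null handle and counting the call.
Kernel buildKernelFromSource(const std::string&) {
    ++compileCount;
    return {&compileCount};
}

Kernel getKernel(const std::string& key) {
    auto it = memoryCache.find(key);
    if (it != memoryCache.end()) return it->second;  // 1. memory cache hit

    Kernel k = loadKernelFromDisk(key);              // 2. try the disk cache
    if (k.handle == nullptr)
        k = buildKernelFromSource(key);              // 3. compile from scratch

    memoryCache[key] = k;  // warm the memory cache for next time
    return k;
}
```

With this shape, a disk hit or a fresh compile both end up in the memory cache, so repeated requests within one process never touch the disk again.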
I have tested the implementation on Windows by running our own software with the ArrayFire DLLs built from this branch, and everything worked fine. The Linux implementation, however, is untested.