Skip to content

Automatically cache compiled CUDA kernels on disk to speed up kernel compilation #2848

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 32 commits into from
Apr 24, 2020

Conversation

cschreib-ibex
Copy link
Contributor

@cschreib-ibex cschreib-ibex commented Apr 17, 2020

This PR addresses #2503 and #2845 (addressed in separate PR: #2850).

Kernels are saved to disk in CUBIN format straight after being compiled in buildKernel(). The folder they are saved into is platform-dependent:

  • linux: /var/lib/arrayfire, or ~/.arrayfire, or /tmp/arrayfire (in order of decreasing priority)
  • windows: %APPDATA%\Temp\ArrayFire

Each kernel is saved in a separate file, and the file name is built from a hash of the kernel function name (that name is already used internally for caching, so hopefully won't collide). The device compute capabilities and the AF API version are also encoded in the file name, so if these ever change, kernels will get recompiled automatically.

Disk usage looks very reasonable; on average 5 kB per kernel on the tests I ran.

Then, whenever a kernel is requested, we first look if the kernel is in the memory cache (as was done before), and if not, try the disk cache (that's new), and if that fails too, build the kernel from scratch (as was done before).

I have tested the implementation on Windows by running our own software with the ArrayFire DLLs built from that branch, and all worked fine. The linux implementation, however, is untested.

@9prady9
Copy link
Member

9prady9 commented Apr 17, 2020

@cschreib-ibex If would be great if you can separate Removed constexpr not supported by VS2015 into it's own PR.

Thanks for working on this! Nice job. I have left some feedback, we can continue discussion about the caching in this PR while the VS2015 doesn't need to wait for this change to go into.

@9prady9
Copy link
Member

9prady9 commented Apr 18, 2020

We can actually move all directory related functions to src/backend/common so that they can be used by other backends if and when necessary.

@cschreib-ibex
Copy link
Contributor Author

@9prady9 Thank you for your review! I'll implement your suggestions and split the VS2015 stuff in a separate branch / PR.

@cschreib-ibex
Copy link
Contributor Author

cschreib-ibex commented Apr 20, 2020

I've renamed the function getKernelCacheDirectory() into getCacheDirectory() when moving it to util.cpp. I figure this might get used later for other purposes than caching kernels.

I have also created a separate PR for the VS2015 compilation fix, #2850. I can revert the relevant commit from the current PR if you wish, although that will make it more painful for me to make further adjustments.

@9prady9
Copy link
Member

9prady9 commented Apr 20, 2020

@cschreib-ibex You can drop Removed constexpr not supported by VS2015 commit from this PR now. You will be able to rebase this from latest master once I merge #2850 (waiting for the ci jobs to go green).

@cschreib-ibex
Copy link
Contributor Author

@9prady9 Done!

@9prady9
Copy link
Member

9prady9 commented Apr 20, 2020

I've renamed the function getKernelCacheDirectory() into getCacheDirectory() when moving it to util.cpp. I figure this might get used later for other purposes than caching kernels.

I have also created a separate PR for the VS2015 compilation fix, #2850. I can revert the relevant commit from the current PR if you wish, although that will make it more painful for me to make further adjustments.

We can have the VS PR merged soon, it is a minor change. I am waiting for the Windows jobs to finish.

@9prady9
Copy link
Member

9prady9 commented Apr 20, 2020

@cschreib-ibex #2850 is merged, you can rebase this branch for future changes/testing.

Copy link
Member

@umar456 umar456 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fantastic. I have made a couple of requests. There is still a question about multiple threads. I noticed you guys had a solution but I am not sure if that will be sufficient to prevent conflicts. I was thinking we create a separate thread that handled the writing to disk. This thread will wait for things to be added to a queue and slowly write to disk. We can use the async_queue class to do this.... Perhaps this is out of the scope of this PR. What do you think @9prady9.

@cschreib-ibex
Copy link
Contributor Author

@umar Thanks for the comments, I'll implement your suggested changes.

As for the threaded caching, using a separate thread for writing would still not safeguard us against two separate ArrayFire executables writing the same kernel file at the same time. The solution of writing to a temporary file with a thread-unique name (which guarantees no concurrent writes) and trying to move the file at the end is really easy to implement and provides full safety. I've implemented and tested it locally, but haven't pushed the changes here yet, waiting on your decision.

Another solution, which has been suggested on SO, is to use OS-specific functions to lock the file prior to reading/writing into it, which will then prevent two threads (even from different processes) from accessing it simultaneously. I like this solution less because it requires platform-dependent code (which can get harder to maintain in the long run) while the other solution above is standard C/C++.

@umar456
Copy link
Member

umar456 commented Apr 23, 2020

FNV1a is a fine algorithm for now. If we run into any issues we can look into another algorithm although that would be a hell of a debug session.

@umar Thanks for the comments, I'll implement your suggested changes.

As for the threaded caching, using a separate thread for writing would still not safeguard us against two separate ArrayFire executables writing the same kernel file at the same time. The solution of writing to a temporary file with a thread-unique name (which guarantees no concurrent writes) and trying to move the file at the end is really easy to implement and provides full safety. I've implemented and tested it locally, but haven't pushed the changes here yet, waiting on your decision.

Another solution, which has been suggested on SO, is to use OS-specific functions to lock the file prior to reading/writing into it, which will then prevent two threads (even from different processes) from accessing it simultaneously. I like this solution less because it requires platform-dependent code (which can get harder to maintain in the long run) while the other solution above is standard C/C++.

Good point about the multiple process issue. I agree that could be a problem with multiple application running at the same time. Lets go with the move approach you suggested. I think as long as we don't hit this situation often we will not run into performance issues where you have multiple threads writing to disk.

umar456
umar456 previously approved these changes Apr 23, 2020
Copy link
Member

@9prady9 9prady9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cschreib-ibex Cool changes since I last checked out. Just a couple of minor changes.

  1. Missing return statement - can cause issues for the scenario where there is no XDG_CACHE_HOME or HOME defined.
  2. Internal documentation for couple of functions that were added for hashing.

9prady9
9prady9 previously approved these changes Apr 24, 2020
@9prady9
Copy link
Member

9prady9 commented Apr 24, 2020

@cschreib-ibex Final one, formatting seems to be off in /src/backend/common/util.hpp Please correct it.

@cschreib-ibex
Copy link
Contributor Author

@9prady9 Man, clang-tidy is very particular :) It doesn't like the default behavior of Visual Studio. Anyways, I've fixed the formatting.

@9prady9
Copy link
Member

9prady9 commented Apr 24, 2020

@cschreib-ibex I understand. Sorry, there seems to be some other thing off. clang-format -i src/backend/common/util.hpp should take care of the entire file as per config we have. Try that.

-/// Return a string suitable for naming a temporary file. 
+/// Return a string suitable for naming a temporary file.
 ///
 /// Every call to this function will generate a new string with a very low
 /// probability of colliding with past or future outputs of this function,

@9prady9
Copy link
Member

9prady9 commented Apr 24, 2020

Finally! Thanks @cschreib-ibex

fyi: Finally was for ci job :)

@cschreib-ibex
Copy link
Contributor Author

No worries :P Thanks for the tip on clang-format, I actually had it installed, never ran it before...

@9prady9 9prady9 merged commit e2ad39d into arrayfire:master Apr 24, 2020
@cschreib-ibex cschreib-ibex deleted the cacheKernels branch April 24, 2020 10:47
@jacobkahn
Copy link
Contributor

@cschreib-ibex this is really cool. Out of curiosity, do you have any benchmarks with respect to a performance improvements?

@cschreib-ibex
Copy link
Contributor Author

On our systems, and with our particular application which makes heavy use of ArrayFire, the runtime when "cold" (i.e., the first run of our processing after the application has started) went down from 28s to 9s.

@umar456 umar456 added this to the 3.7.2 milestone Jun 27, 2020
umar456 pushed a commit to umar456/arrayfire that referenced this pull request Jun 27, 2020
…ayfire#2848)

* Adds CMake variable AF_CACHE_KERNELS_TO_DISK to enable kernel caching. It is turned ON by default.
* cuda::buildKernel() now dumps cubin to disk for reuse
* Adds cuda::loadKernel() for loading cached cubin files
* cuda::loadKernel() returns empty kernel on failure
* Uses XDG_CACHE_HOME as cache directory for Linux
* Adds common::deterministicHash() - This uses the FNV-1a hashing algorithm for fast and reproducible hashing of string or binary data. This is meant to replace the use of std::hash in some place, since std::hash does not guarantee its return value will be the same in subsequence executions of the program.
* Write cached kernel to temporary file before moving into final file. This prevents data races where two threads or two processes might write to the same file.
* Uses deterministicHash() for hashing kernel names and kernel binary data.
* Adds kernel binary data file integrity check upon loading from disk
@umar456 umar456 mentioned this pull request Jun 27, 2020
2 tasks
9prady9 pushed a commit that referenced this pull request Jun 27, 2020
* Adds CMake variable AF_CACHE_KERNELS_TO_DISK to enable kernel caching. It is turned ON by default.
* cuda::buildKernel() now dumps cubin to disk for reuse
* Adds cuda::loadKernel() for loading cached cubin files
* cuda::loadKernel() returns empty kernel on failure
* Uses XDG_CACHE_HOME as cache directory for Linux
* Adds common::deterministicHash() - This uses the FNV-1a hashing algorithm for fast and reproducible hashing of string or binary data. This is meant to replace the use of std::hash in some place, since std::hash does not guarantee its return value will be the same in subsequence executions of the program.
* Write cached kernel to temporary file before moving into final file. This prevents data races where two threads or two processes might write to the same file.
* Uses deterministicHash() for hashing kernel names and kernel binary data.
* Adds kernel binary data file integrity check upon loading from disk
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants