Automatically cache compiled CUDA kernels on disk to speed up kernel compilation #2848
Conversation
@cschreib-ibex It would be great if you could separate "Removed constexpr not supported by VS2015" into its own PR. Thanks for working on this! Nice job. I have left some feedback; we can continue the discussion about caching in this PR, while the VS2015 fix doesn't need to wait for this change to go in.
We can actually move all directory related functions to |
@9prady9 Thank you for your review! I'll implement your suggestions and split the VS2015 stuff in a separate branch / PR. |
I've renamed the function. I have also created a separate PR for the VS2015 compilation fix, #2850. I can revert the relevant commit from the current PR if you wish, although that will make it more painful for me to make further adjustments.
@cschreib-ibex You can drop |
@9prady9 Done! |
We can have the VS PR merged soon, it is a minor change. I am waiting for the Windows jobs to finish. |
@cschreib-ibex #2850 is merged, you can rebase this branch for future changes/testing. |
Branch updated: a6b061f to 2bfe895.
This looks fantastic. I have made a couple of requests. There is still a question about multiple threads. I noticed you guys had a solution, but I am not sure it will be sufficient to prevent conflicts. I was thinking we create a separate thread that handles the writing to disk: it would wait for things to be added to a queue and slowly write them to disk. We could use the async_queue class for this. Perhaps this is out of the scope of this PR. What do you think, @9prady9?
@umar Thanks for the comments, I'll implement your suggested changes. As for the threaded caching, using a separate thread for writing would still not safeguard us against two separate ArrayFire executables writing the same kernel file at the same time. The solution of writing to a temporary file with a thread-unique name (which guarantees no concurrent writes) and trying to move the file at the end is really easy to implement and provides full safety. I've implemented and tested it locally, but haven't pushed the changes here yet, waiting on your decision. Another solution, which has been suggested on SO, is to use OS-specific functions to lock the file prior to reading/writing it, which would prevent two threads (even from different processes) from accessing it simultaneously. I like this solution less because it requires platform-dependent code (which can get harder to maintain in the long run), while the other solution above is standard C/C++.
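The temp-file-then-rename scheme described above can be sketched as follows. This is a hypothetical illustration, not the actual ArrayFire code; the function and variable names are assumptions:

```cpp
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <thread>

// Hypothetical sketch: write the kernel binary to a thread-unique
// temporary file, then rename it onto the final path. Concurrent
// writers (threads or processes) never touch the same temporary file.
bool writeKernelAtomically(const std::string& finalPath,
                           const std::string& data) {
    std::ostringstream tmp;
    tmp << finalPath << ".tmp." << std::this_thread::get_id();
    const std::string tmpPath = tmp.str();

    {
        std::ofstream out(tmpPath, std::ios::binary);
        if (!out) return false;
        out.write(data.data(), static_cast<std::streamsize>(data.size()));
        if (!out) return false;
    }  // file closed and flushed here

    // std::rename is atomic on POSIX when source and destination live on
    // the same filesystem; note that on Windows it fails if the
    // destination already exists, which is harmless here (another writer
    // won the race). On failure, just drop the temporary file.
    if (std::rename(tmpPath.c_str(), finalPath.c_str()) != 0) {
        std::remove(tmpPath.c_str());
        return false;
    }
    return true;
}
```

A thread id alone does not distinguish two processes; a real implementation would likely mix in the process id or a random token as well.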
Forgot to convert the result of std::hash (a std::size_t) into a string before appending to "KER".
This prevents data races where two threads or two processes might write to the same file.
FNV-1a is a fine algorithm for now. If we run into any issues, we can look into another algorithm, although that would be a hell of a debug session.
Good point about the multiple-process issue. I agree that could be a problem with multiple applications running at the same time. Let's go with the move approach you suggested. I think as long as we don't hit this situation often, we will not run into performance issues from multiple threads writing to disk.
@cschreib-ibex Cool changes since I last checked it out. Just a couple of minor changes:
- Missing return statement; this can cause issues in the scenario where neither XDG_CACHE_HOME nor HOME is defined.
- Internal documentation for a couple of the functions that were added for hashing.
@cschreib-ibex Final one, formatting seems to be off in |
@9prady9 Man, clang-tidy is very particular :) It doesn't like the default behavior of Visual Studio. Anyways, I've fixed the formatting. |
@cschreib-ibex I understand. Sorry, there seems to be some other thing off.

-/// Return a string suitable for naming a temporary file.
+/// Return a string suitable for naming a temporary file.
 ///
 /// Every call to this function will generate a new string with a very low
 /// probability of colliding with past or future outputs of this function,
Finally! Thanks @cschreib-ibex. FYI: "Finally" was for the CI job :)
No worries :P Thanks for the tip on |
@cschreib-ibex this is really cool. Out of curiosity, do you have any benchmarks with respect to the performance improvement?
On our systems, and with our particular application which makes heavy use of ArrayFire, the runtime when "cold" (i.e., the first run of our processing after the application has started) went down from 28s to 9s. |
…ayfire#2848)

* Adds CMake variable AF_CACHE_KERNELS_TO_DISK to enable kernel caching. It is turned ON by default.
* cuda::buildKernel() now dumps the cubin to disk for reuse.
* Adds cuda::loadKernel() for loading cached cubin files.
* cuda::loadKernel() returns an empty kernel on failure.
* Uses XDG_CACHE_HOME as the cache directory on Linux.
* Adds common::deterministicHash(). This uses the FNV-1a hashing algorithm for fast and reproducible hashing of string or binary data. It is meant to replace the use of std::hash in some places, since std::hash does not guarantee its return value will be the same in subsequent executions of the program.
* Writes each cached kernel to a temporary file before moving it into the final file. This prevents data races where two threads or two processes might write to the same file.
* Uses deterministicHash() for hashing kernel names and kernel binary data.
* Adds a kernel binary data file integrity check upon loading from disk.
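The FNV-1a algorithm mentioned in the commit message works by XORing each byte into the hash and then multiplying by a fixed prime. A minimal sketch of the 64-bit variant (the function name is an assumption; the actual deterministicHash() may differ in signature and width):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Illustrative FNV-1a sketch, not the actual ArrayFire implementation.
// The constants are the standard 64-bit FNV offset basis and prime.
std::uint64_t fnv1a(const void* data, std::size_t n) {
    const auto* bytes = static_cast<const unsigned char*>(data);
    std::uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
    for (std::size_t i = 0; i < n; ++i) {
        hash ^= bytes[i];          // XOR in the next byte...
        hash *= 1099511628211ULL;  // ...then multiply by the FNV prime
    }
    return hash;
}

std::uint64_t fnv1a(const std::string& s) {
    return fnv1a(s.data(), s.size());
}
```

Unlike std::hash, this produces the same value in every run of the program, which is what makes it usable for on-disk cache file names.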
This PR addresses #2503 and #2845 (the latter addressed in a separate PR: #2850).

Kernels are saved to disk in CUBIN format straight after being compiled in buildKernel(). The folder they are saved into is platform-dependent:

* Linux: /var/lib/arrayfire, or ~/.arrayfire, or /tmp/arrayfire (in order of decreasing priority)
* Windows: %APPDATA%\Temp\ArrayFire
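The environment-variable fallback discussed in the review (XDG_CACHE_HOME, then HOME, then a last-resort temporary directory) could be sketched like this. The function name and exact path layout are illustrative assumptions, not the PR's actual code:

```cpp
#include <cstdlib>
#include <string>

// Hypothetical sketch of the Linux cache-directory lookup order:
// XDG_CACHE_HOME first, then HOME, then /tmp as a last resort.
std::string getCacheDirectory() {
    if (const char* xdg = std::getenv("XDG_CACHE_HOME"))
        return std::string(xdg) + "/arrayfire";
    if (const char* home = std::getenv("HOME"))
        return std::string(home) + "/.arrayfire";
    return "/tmp/arrayfire";  // fallback when neither variable is set
}
```

Note the review comment above about a missing return statement: every branch here must return something, or the "neither variable defined" case invokes undefined behavior.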
Each kernel is saved in a separate file, and the file name is built from a hash of the kernel function name (that name is already used internally for caching, so hopefully it won't collide). The device compute capability and the AF API version are also encoded in the file name, so if either ever changes, kernels will get recompiled automatically.
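A file name built from these pieces could look roughly like the following sketch. Apart from the "KER" prefix mentioned earlier in the thread, the function name, field order, and separators are assumptions, not the actual implementation:

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical cache file name: hashed kernel name, plus the device
// compute capability and AF API version so that changing either one
// makes the lookup miss and forces a recompile.
std::string cacheFileName(std::size_t nameHash, int ccMajor, int ccMinor,
                          int apiVersion) {
    std::ostringstream name;
    name << "KER" << nameHash             // hash of the kernel function name
         << "_CU_" << ccMajor << ccMinor  // compute capability, e.g. 7.5 -> 75
         << "_AF_" << apiVersion          // API version bump invalidates cache
         << ".bin";
    return name.str();
}
```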
Disk usage looks very reasonable; on average 5 kB per kernel on the tests I ran.
Then, whenever a kernel is requested, we first look for it in the memory cache (as was done before); if it isn't there, we try the disk cache (that's new); and if that fails too, we build the kernel from scratch (as was done before).
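That three-tier lookup can be sketched as follows. Stub functions stand in for the real disk load and compilation, and all names are illustrative placeholders, not the actual ArrayFire internals:

```cpp
#include <map>
#include <string>

// Placeholder kernel handle for the sketch.
struct Kernel { void* handle = nullptr; };

static std::map<std::string, Kernel> memoryCache;  // per-process cache
static int compileCount = 0;  // instrumentation for this sketch only

// Stub: pretend the disk cache always misses (returns an empty kernel).
Kernel loadKernelFromDisk(const std::string&) { return {}; }

// Stub: "compile" by returning a non-null handle and counting the call.
Kernel buildKernelFromSource(const std::string&) {
    ++compileCount;
    return {&compileCount};
}

Kernel getKernel(const std::string& key) {
    auto it = memoryCache.find(key);
    if (it != memoryCache.end()) return it->second;  // 1. memory cache hit

    Kernel k = loadKernelFromDisk(key);              // 2. try the disk cache
    if (k.handle == nullptr)
        k = buildKernelFromSource(key);              // 3. compile from scratch

    memoryCache[key] = k;  // warm the memory cache for next time
    return k;
}
```

With this shape, a disk hit or a fresh compile both end up in the memory cache, so repeated requests within one process never touch the disk again.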
I have tested the implementation on Windows by running our own software with the ArrayFire DLLs built from this branch, and everything worked fine. The Linux implementation, however, is untested.