Speedup of kernel caching mechanism by hashing sources at compile time #3043

Conversation
@willyborn nice improvement, thank you! I have a couple of comments on the caching API. Ideally, it would be great if the extra hash or any other parameters didn't need to be passed down by the caller, but I can see how this may not be possible. The following can actually help with that requirement a bit: I believe the source hash computation can be moved further, to the compilation stage, to avoid even a single runtime calculation. This can be done by modifying bin2cpp (a sketch of the idea follows below).
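A minimal sketch of the idea, assuming bin2cpp keeps its existing `_cl`/`_cl_len` naming convention and additionally emits a `_cl_hash` constant (the source string and hash value below are placeholders):

```cpp
// Hypothetical bin2cpp output for anisotropic_diffusion.cl: the hash of the
// source text is computed while generating this header, so no hashing of the
// kernel source is needed at runtime.
#include <cstddef>

static const char anisotropic_diffusion_cl[] = "/* ...kernel source... */";
static const std::size_t anisotropic_diffusion_cl_len =
    sizeof(anisotropic_diffusion_cl) - 1;
static const std::size_t anisotropic_diffusion_cl_hash = 0x811c9dc5u;  // placeholder value
```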
Good idea. PS: All tests on CUDA failed. Any idea what happened, and what I can do to fix it? I do not have an NVIDIA card, so this part of the code is not configured on my system.
The program itself is a standalone source file. If possible, try to reuse the util.hpp/cpp files with this program rather than making a copy again, so that any future changes to the hashing function are immediately available in both locations. I can take a look at the CUDA backend.
The changes are now in final testing, and this PR will be updated soon.
@willyborn bin2cpp changes look good. I think we can improve this much further. I did some trials on godbolt, and if we do the following, we can keep the same number of arguments as earlier and get away with changing only a single line in all the kernel header files. If we define something like the following in `common`:

```cpp
namespace common {
struct Source {
    const char* ptr;
    size_t len;
    size_t f1nhash;  // 32-bit hash, typically Fowler-Noll-Vo "1a"
};
}
```

and change bin2cpp to add the following tuple-style struct:

```cpp
static const common::Source anisotropic_diffusion_src = {
    anisotropic_diffusion_cl,
    anisotropic_diffusion_cl_len,
    anisotropic_diffusion_cl_hash
};
```

then getKernel changes as follows (only the vector element type changes):

```cpp
Kernel getKernel(const string& kernelName,
                 const vector<common::Source>& sources,  // earlier it was a string
                 const vector<TemplateArg>& targs,
                 const vector<string>& options,
                 const bool sourceIsJIT);
```

By doing the above, we can retain the earlier getKernel API in all OpenCL kernel header files by only changing the second parameter as follows:

```cpp
auto approx1 = common::getKernel("approx1", {interp_src, approx1_src}, tmpltArgs, compileOpts);
```

By doing so, we can: …

The full trial snippet can be found at https://godbolt.org/z/KbzaP9
@9prady9 I understand that you want to bring the number of parameters down to the original one, and I agree. I do however have a problem: I don't have a f1nhash for the vector, since there is no way to combine individual hashes except through incremental calculation. The whole improvement came from the fact that only a minimum of (incremental) hashes had to be calculated in the getKernel function. As a result I would like to enlarge your proposed structure as follows:

- in the generated kernel file: …
- in kernel_cache.cpp: …
- in kernel header files with 1 source: …
- in kernel header files with multiple sources combined: …

Perhaps this last one is even the best, since … (an illustrative sketch of the last alternative follows below).
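To make that concrete, a sketch of what such a generated header could look like once bin2cpp resolves the `#include` at generation time; the file contents and hash value are placeholders, and it reuses the `common::Source` struct proposed above:

```cpp
// Hypothetical generated header for approx1.cl with its #include "interp.cl"
// already resolved: one concatenated source, one compile-time hash.
#include <cstddef>

static const char approx1_cl[] =
    "/* ...contents of interp.cl... */"
    "/* ...contents of approx1.cl... */";
static const std::size_t approx1_cl_len  = sizeof(approx1_cl) - 1;
static const std::size_t approx1_cl_hash = 0x12345678u;  // placeholder, emitted by bin2cpp

// Uses the common::Source struct proposed earlier in this thread.
static const common::Source approx1_src = {approx1_cl, approx1_cl_len,
                                           approx1_cl_hash};
```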
Attached you will find a test of the bin2cpp program, corresponding with the last alternative. The `#include` (part of OpenCL syntax) is now interpreted, resulting in a concatenated file. For the example: interp.cl (10,223 B) + approx1.cl (2,284 B).

In case you agree, I will rework the PR as described above. PS: All current code also works with the new bin2cpp (since no `#include` is used). Willy
@willyborn Sorry about the delay. I was checking if hashes can be combined, and it turns out we can combine a list of hashes using another hash function. I haven't explored it further though. I have looked at the new bin2cpp program output and it looks good. I have a couple of questions: …

If we can easily implement combining hashes, then it would still have all the advantages of the above approach, and the following call

`getKernel("approx1", {approx1_src, interp_src}, ....);`

gives a more explicit indication of what sources are used for that OpenCL function right away, without looking up the code of … Can you please give me some time to explore more about incremental hashing and hash lists? We are on the right track and I want to ensure we have an overall good solution before we make code changes.
Pradeep, I have no paper explaining the incremental behavior of the FNV-1a algorithm. You can however easily derive it yourself.

FNV-1a split into separate logical operations (see the sketch below): …

The hashes of multiple Sources are now combined by calculating a new hash (FNV-1a) of the individual hashes. There is no limit (it is recursive), which could only give a problem if we ever arrive at a circular kernel construction.

PS: What do you use to debug OpenCL kernels? I only have comments & printf & a lot of trying ...
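For reference, a minimal sketch of that idea; the function names, the 32-bit constants, and the seeding convention are illustrative assumptions, not ArrayFire's actual `deterministicHash`:

```cpp
#include <cstdint>
#include <vector>

constexpr std::uint32_t FNV1A_OFFSET = 2166136261u;  // FNV-1a offset basis
constexpr std::uint32_t FNV1A_PRIME  = 16777619u;    // FNV-1a prime

// Hash one buffer, continuing from `seed`; pass FNV1A_OFFSET for a fresh hash.
// Because the state is just the running 32-bit value, the hash can be extended
// incrementally with more data (options, tInstance, ...) at any later point.
inline std::uint32_t fnv1a(const void* data, std::size_t len,
                           std::uint32_t seed = FNV1A_OFFSET) {
    const auto* bytes  = static_cast<const unsigned char*>(data);
    std::uint32_t hash = seed;
    for (std::size_t i = 0; i < len; ++i) {
        hash ^= bytes[i];     // logical operation 1: mix in the next byte
        hash *= FNV1A_PRIME;  // logical operation 2: multiply by the prime
    }
    return hash;
}

// Combining the per-source hashes of multiple Sources: hash the list of hashes
// itself with the same function (recursive, so there is no depth limit unless
// kernels were ever constructed circularly).
inline std::uint32_t combineHashes(const std::vector<std::uint32_t>& hashes) {
    return fnv1a(hashes.data(), hashes.size() * sizeof(std::uint32_t));
}
```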
I see that you have fixed the default value of … I am concerned about the corner cases of … I am going to experiment with the code a bit today and tomorrow, and will try fixing the code for Unix OSes. I have fixed bin2cpp so far. Thank you!

@willyborn We can remove …
I'm setting up a virtual machine with Ubuntu 18, trying to find out the problems with Unix. In a few days you will have an update. Regarding the include: I will start by removing the include code. You can then decide later to remove the comments as well, if you consider that the added complexity is too high compared to the memory footprint gain. PS: Can you share how you fixed bin2cpp for Unix?
@willyborn I have the commits locally. Please wait for my push so that your changes are on top of my changes in your branch. I am going to push to your PR branch. I had to fix a bunch of things in the CUDA backend to get it compiling, but it is still not running due to a runtime compilation failure. It compiles now on Linux, so it should on macOS too. Note: the OpenCL backend works as expected with all tests passing.
I have fixed the compilation so far; I have just begun looking into the runtime compilation. You are welcome to take a stab at it. I am surprised what changed in the CUDA backend that started causing the below: …

Are you talking about removing comments before converting them into byte code? If that is the case, I think that is alright. But we should be able to have comments in the …
Yes, I'm talking about removing the comments in the generated .hpp files. Is bin2cpp also used to convert .cu source code?
I guess that is alright then.
Yes :) It is not for OpenCL only; it is a generic and very basic code obfuscation tool used all over the repository where regular headers are converted to strings.
@willyborn I just realized, I think I know why the "already defined" error happened with half in the CUDA backend. It is due to the include processing; once you remove it, it should work as before.
Interesting, the CUDA K5000 job run seems to be from 4 days back; for some reason it didn't run. We are looking into it, and the CUDA compilation error on … Edit: I may push the fixes today; a lot of CUDA tests are still failing even though I am able to compile on my local Linux setup. Update: Found the issue.
```cpp
        moduleKey = (sources.size() == 1 && sources[0].hash)
                        ? sources[0].hash
                        : deterministicHash(sources);
        moduleKey = deterministicHash(options, moduleKey);
    } else {
        moduleKey = deterministicHash(tInstance);
        moduleKey = deterministicHash(tInstance, moduleKey);
```
I have the impression that this is only for CUDA. Sorry that I did not realize this before and as a result broke the code.

For OpenCL, it has a negative effect on speed, since programs with multiple kernels are now compiled multiple times (once for each kernel) and avoidable time is spent in the hash.

I suggest the following changes:
- put the tInstance generation (loop) inside `#if defined(AF_CUDA)` and only keep the initialization with kernelName;
- put the deterministicHash(tInstance, moduleKey) also behind `#if defined(AF_CUDA)`;
- the call to getKernel at the end can then become common for both; no #define is needed anymore.

During my previous tests, I could measure a performance difference for OpenCL. By doing so, performance will improve even further, since this is on the critical path.
1. Multiple compilations are not possible; let me explain why. For the OpenCL backend, the template arguments are sort of dummies, since they are also supplied as part of the compilation options because of the way OpenCL kernels work: they can only be written in C. Compilation options are different for identically named kernels depending on other OpenCL compile-time constants, which can be types or constants depending on the need.
2. Earlier, the bug wasn't just CUDA-related. For example, `tInstance` is `approx<float>` in CUDA and just `approx` in OpenCL; a kernel file with identical content but a different kernel/function name would incorrectly fetch the wrong binary if we don't hash the name of the kernel. Although this situation doesn't happen now, I don't think we should add this unnecessary limitation to the caching API.
3. It is for the reason mentioned in (1) that we need to call `getKernel` in the end with different parameters. If we always pass `tInstance` in the end, we would end up moving the condition to the beginning to make sure the OpenCL backend doesn't add `< ... >` to `tInstance`. I don't have any objection to this though, since we end up doing less work by not adding those extra template parameters to `tInstance`, which are not needed for OpenCL, in turn reducing the bytes to hash from `tInstance`.

How much time difference are you noticing?

On another note, I think we should get rid of the template args in the OpenCL backend, i.e. pass `{}` instead of any values, which saves some time by not creating that variable at all (see the sketch below).
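As a rough illustration of that last point (the option string and variable names are placeholders, not the actual call sites):

```cpp
// OpenCL backend: the template-argument vector can simply be empty, because the
// type information already reaches the kernel through the -D compile options,
// so nothing extra needs to be prepared or hashed for it.
auto approx1 = common::getKernel("approx1", {interp_src, approx1_src},
                                 /* targs   */ {},
                                 /* options */ {"-D Ty=float"});  // placeholder option

// CUDA backend: the kernel is identified by its template instantiation,
// e.g. tInstance == "approx1<float>", so the targs vector stays non-empty there.
```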
I have pushed the change from point (3) in the earlier comment; can you please test the branch now for timing?
I finally have some testing results.

Scenarios: I have 4 scenarios: disk caching ON/OFF; deterministicHash(tInstance) added for OpenCL ON/OFF (code diff attached).
For ORB, I measure the first run (incl. compiling, storing & executing) and a loop (only searching & executing).
| Disk caching | hash | 1st run | %time | +speed | Loop (ORB) | %time | +speed | # kernels saved |
|---|---|---|---|---|---|---|---|---|
| OFF | ON | 1.8465 s | 100% | 1.0x | 0.0120 s | 100% | 1.0x | 9 programs |
| OFF | OFF | 0.743 s | 40% | 2.49x | 0.0120 s | 100% | 1.0x | 4 programs <== Updated |
| ON | ON | 0.2036 s | 100% | 1.0x | 0.0119 s | 100% | 1.0x | 9 programs |
| ON | OFF | 0.1887 s | 93% | 1.1x | 0.0119 s | 100% | 1.0x | 4 programs |
These differences will be observed for: fast, flood_fill, harris, homography, regions and orb.
Nice gains for the above functions during initialization, although it only happens once for a whole program run.
For all the others the difference is within the error margin.

An update on the neural_network test:
The final speed is very similar to before; there is however a 1% difference between hash(kernName) ON/OFF. This is the general situation, without multiple kernels in one program source file.

I get an error while attaching files, so I include them here in plain text.
Epochs: 250; Precision: f32

| Batch size | Training time (s) | Accuracy training | Accuracy test | Speed vs master |
|---|---|---|---|---|
| 32 | 37.1264 | 98.8924 | 92.4829 | 1.21 |
| 40 | 29 | 86.5763 | 82.7892 | 1.21 |
| 48 | 23.9359 | 98.8263 | 93.3435 | 1.21 |
| 52 | 23.3515 | 98.9254 | 93.065 | 1.15 |
| 64 | 18.7354 | 99.0908 | 93.6219 | 1.19 |
| 96 | 14.9823 | 98.9089 | 93.3688 | 1.02 |
| 128 | 12.9594 | 98.3138 | 92.9132 | 1.00 |
| 256 | 10.4322 | 95.5034 | 92.407 | 1.01 |
| 512 | 8.55012 | 33.4435 | 33.2068 | 1.02 |
| 1024 | 7.72961 | 13.7709 | 13.389 | 1.01 |
```diff
$ git diff src/backend/common/kernel_cache.cpp
diff --git a/src/backend/common/kernel_cache.cpp b/src/backend/common/kernel_cache.cpp
index 43ca6893..5031d6b7 100644
--- a/src/backend/common/kernel_cache.cpp
+++ b/src/backend/common/kernel_cache.cpp
@@ -81,7 +81,9 @@ Kernel getKernel(const string& kernelName,
                          ? sources[0].hash
                          : deterministicHash(sources);
          moduleKey = deterministicHash(options, moduleKey);
+#if defined(AF_CUDA)
          moduleKey = deterministicHash(tInstance, moduleKey);
+#endif
      }
      const int device = detail::getActiveDeviceId();
      Module currModule = findModule(device, moduleKey);
```
Test program for ORB speed:

```cpp
#include <arrayfire.h>
#include <iostream>

int main() {
    const int loop = 500;
    std::cout << "\nPlease delete all the cached kernels before starting this test."
              << "\nWhen not performed, the effect of permanent caching will be visible";
    af::array in = af::loadImage(
        "D:/github/willyborn/arrayfire/test/data/meanshift/coins.png", false);
    af::features feat;
    af::array desc;

    af::sync();
    af::timer::start();
    af::orb(feat, desc, in);
    af::sync();
    double t = af::timer::stop();
    std::cout << "\nInit: " << t;

    af::timer::start();
    for (int i = 0; i < loop; ++i) { af::orb(feat, desc, in); }
    af::sync();
    t = af::timer::stop();
    std::cout << "\nLoop: " << t / loop;
    return 0;
}
```
I am intrigued as to how hashing a short name is causing more kernels to be saved to disk in the OpenCL backend.

I assume there were no cached binaries on disk when the ArrayFire tests ran in the above two scenarios. Just want to make sure ArrayFire is not picking up some items from the cache from old compilations.

The above two cases shouldn't make much difference unless there are no binaries on disk at all. Based on the 1st run times, it looks like binaries cached from earlier compilations are used in these. I want to check the timings with a simpler kernel like transpose and check how and why hashing the additional name can result in the OpenCL backend generating more kernels, because ideally it shouldn't. Will get back to you shortly. Are you on the Indian time zone by any chance? If so, you can reach me on Slack; the discussion can be more instantaneous that way.
Below is the output with: Caching(OFF), PreviousCaches(NONE), NameHash(YES): …

Below is the output with: Caching(OFF), PreviousCaches(NONE), NameHash(NO): …

I am building with CACHING turned on now; it's going to take a few minutes, but I doubt the times will differ much from CACHING OFF, since that is the only code difference between the two cases. I didn't notice anything odd with a simple kernel such as transpose: the compilation was invoked only once each for transpose and the random generator, as expected. I am going to try with ORB after the current run and see; I suspect it has something to do with the function rather than with hashing the name. Let's see.
Pradeep, regarding the time, you are right, an error slipped in.
In a given application run/session, if the hashing logic includes the name, all requests with the name included vs. without the name: both are different in content for sure and will generate different hashes, but each still uniquely represents one kernel compilation instance and will always result in only one kernel. Note that modified content will result in a new binary with a different filename. Subsequent runs, after the new binary is stored, will pick up this new binary rather than the old one. And my trial with the transpose function (instead of orb) kind of proves that too. Also, do note that my most recent commit in your branch basically removed the logic that adds … I don't think the problem here is the extra name hashing, which is at most 10 to 20 characters in the OpenCL backend, but the extra kernels getting generated in your trials with ORB. I am looking into it.
I think I now know why that is happening. ORB is written in such a way that multiple kernel functions are present in the same file. So when we were not including the name of the function being asked for, the same program binary was returned for all kernels present in a single .cl file. With the name included, separate files are created (see the schematic sketch below). Okay, mystery solved. Now I have to figure out if such an arrangement is okay to have. Will post shortly.
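A schematic sketch of the effect; `std::hash` stands in for `deterministicHash`, and the kernel names are placeholders, not the real contents of orb.cl:

```cpp
#include <functional>
#include <iostream>
#include <string>

int main() {
    std::hash<std::string> h;  // stand-in for deterministicHash
    const std::string orbSource = "/* contents of orb.cl: several kernel functions */";
    const std::string options   = "-D T=float";

    // Without the kernel name in the key: every kernel of orb.cl maps to the
    // same module key, so a single cached binary is shared (the "4 programs" case).
    auto keyA = h(orbSource + options);
    auto keyB = h(orbSource + options);

    // With the kernel name folded in: one entry per kernel function of the
    // same file (the "9 programs" case).
    auto keyA2 = h(orbSource + options + "kernelA");
    auto keyB2 = h(orbSource + options + "kernelB");

    std::cout << (keyA == keyB) << " " << (keyA2 == keyB2) << "\n";  // expected: "1 0"
}
```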
@willyborn Squash undone, probably due to me force-pushing to trigger CI jobs. Sorry about that. Is the buffer overflow issue fixed now? Updates: …
No problem about the push. I thought I had done something wrong, since this project is the first time I'm participating in an open-source project and using its typical tools. The single remark I received from the static analyzer in MSVC (Microsoft Native Recommended Rules) is no longer there. There are however so many more profiles to check against. For the whole ArrayFire solution I get over 600 warnings without the analyzer, so it is rather difficult to determine which ones to react to without a full refactoring of the whole code base (which I suppose is not the purpose). I will experiment with the LLVM toolset, although this is still a mystery to me. Is there a way to download (have a peek at) those generated errors/logfiles? I see full (local) path file references in the displayed log, although no specific errors.
@willyborn That is alright. Our CI jobs upload build reports to ci.arrayfire.org; you can look up reports based on the PR number on that website. You can find the error message here: https://ci.arrayfire.org/viewBuildError.php?buildid=2713 It originates at arrayfire/CMakeModules/bin2cpp.cpp line 234 in cff80f8.

I am busy debugging another problem right now. Will try to look into it later today or tomorrow. As far as I can tell, there is some kind of write/read happening beyond the buffer size that is allocated in …
Found it.
@willyborn When you push your next fix, please take care of the merge commit in the history too. Another one sneaked in recently. Thank you.
@9prady9 I am afraid that for the above errors/warnings I will not be of help, and I will have to leave them for you. The PR is squashed and in sync, so you can add commits. PS: My GTX card arrived, so next time I will have to bother you less about CUDA.
- Not sure if that is a problem, because the regular OpenCL build works fine for me. The Linux build with address sanitizer failed earlier because of a buffer overflow.
- We already enable the newer CUSPARSE API based on a CUDA toolkit version check. If any newer changes are introduced in CUDA 11, we will take care of it once the API has stabilized.
- I have seen these warnings; I have that on my TODO list already.

Glad you have a CUDA card too :) Enjoy GPU computing. I am taking a look at the laset error; it is odd how these errors happen in this PR although it doesn't change CMakeLists.txt in those aspects.
@willyborn One minor suggestion: the commit message of the first commit is formatted incorrectly, without a new-line break between the brief and elaborate descriptions.
@willyborn As I suspected, it is not about laset.hpp under … Unfortunately, our CI log isn't showing the full picture; we have a separate, known and ongoing issue with our CDash due to a recent version upgrade, which is why this is happening. Sorry about that. I will take a look into the memory error. The following is the precise error now: …
I know what this can be.

> On Fri, 4 Dec 2020 at 13:24, pradeep wrote:
> strncpy-param-overlap: memory ranges, that's the problem
Strange, it is only in one .cu file.
It is probably only getting reported on one of the files; earlier it was the cusparse header. Is there any chance removeComments can be simplified further, perhaps avoiding C function calls?
@9prady9 The problem with strncpy is that the copy order is apparently not defined in the C++ standard, resulting in undefined behavior when the ranges overlap. We no longer have this problem since it is no longer used. Everything is squashed into one commit, so that I could update the comments and insert that blank line.
The program source files' memory footprint is reduced (-30%) by eliminating comments in the generated kernel headers. Hash calculation of each source file is performed at compile time and incrementally extended at runtime with the options & tInstance vectors. Overall performance increased by up to 21%, up to the point where the GPU becomes the bottleneck, and the overhead of launching the same (small) kernel was improved by 63%.
This is amazing work. The benchmarks look promising for smaller sizes. Thank you for your contribution!
#3043) * Reduced overhead of kernel caching for OpenCL & CUDA. The program source files' memory footprint is reduced (-30%) by eliminating comments in the generated kernel headers. Hash calculation of each source file is performed at compile time and incrementally extended at runtime with the options & tInstance vectors. Overall performance increased by up to 21%, up to the point where the GPU becomes the bottleneck, and the overhead of launching the same (small) kernel was improved by 63%. * Fix a couple of minor cmake changes * Move spdlog fetch to use it in the bin2cpp link command. Co-authored-by: pradeep <pradeep@arrayfire.com> (cherry picked from commit 3cde757)
Measured results:

a) 1 million times join(1, af::dim4(10,10), af::dim4(10,10))
--> 63% faster vs master 3.8.0

b) example/neural_network.cpp with varying batch sizes
on an AMD A10-7870K (AMD Radeon R7 Graphics, 8 CUs); on faster GPUs the improvement will persist up to higher batch sizes.
--> up to 18% faster vs master 3.8.0
cacheHashing.xlsx

c) memory footprint reduced by 37%, and on top of that no internal copies are made any longer.
All the OpenCL .cl kernel source code files occupy 451 KB, vs only 283 KB for the remaining code strings in the generated obfuscated .hpp files. I assume that a similar effect is visible with the CUDA kernel code.
Description
Changes in backend/common/kernel_cache.cpp & backend/common/util.cpp
Changes in all OpenCL kernels in backend/opencl/kernel/*.hpp
Changes in interfaces:
Current flow of data:
- New kernel:
  0. Compile time: kernel.cl becomes `static const common::Source{*code, length, hash}`
  <-- incremental hashes of options & tInstance (fast)
- Search kernel: …

Previous (3.8.0 master) flow of data:
- New kernel: …
- Search kernel: …

A sketch of the resulting module-key computation is shown below.
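A minimal sketch of that computation; it is schematic, mirrors the kernel_cache.cpp diff discussed earlier in this thread, and may not match the merged code exactly (surrounding declarations are assumed):

```cpp
// New flow: the per-source hash was already computed at compile time by
// bin2cpp; only the short options/tInstance strings are hashed at runtime,
// each call extending the previous value incrementally.
size_t moduleKey = (sources.size() == 1 && sources[0].hash)
                       ? sources[0].hash              // single source: reuse compile-time hash
                       : deterministicHash(sources);  // multiple sources: combine their hashes
moduleKey = deterministicHash(options, moduleKey);    // fold in the compile options
#if defined(AF_CUDA)
moduleKey = deterministicHash(tInstance, moduleKey);  // CUDA only: fold in "name<T,...>"
#endif
```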
Changes to Users
None
Checklist