
Speedup of kernel caching mechanism by hashing sources at compile time #3043


Merged: 3 commits merged into arrayfire:master on Feb 18, 2021

Conversation

@willyborn (Contributor) commented Nov 10, 2020

Measured results:
a) 1 million times join(1, af::dim4(10,10), af::dim4(10,10))
--> 63% faster vs master 3.8.0

b) example/neural_network.cpp with varying batch sizes

  • varied to shift the saturation point between CPU and GPU
  • the best test accuracy is obtained with a batch size around 48 (the reason to go so small)
    on an AMD A10-7870K (AMD Radeon R7 Graphics, 8 CU); on faster GPUs the improvement will persist at higher batch sizes.
    --> up to 18% faster vs master 3.8.0
    Attachment: Timings neural_network (cacheHashing.xlsx)

c) memory footprint reduced by 37%, and on top of that no more internal copies.
All the OpenCL .cl kernel source files occupy 451KB, whereas the remaining code strings in the generated obfuscated hpp files occupy only 283KB. I assume that a similar effect is visible with the CUDA kernel code.

Description

Changes in backend/common/kernel_cache.cpp & backend/common/util.cpp

  1. Hashing is now incremental, so that only the dynamic parts are recalculated
  2. The hash key changed from string to size_t, to speed up the find functions on the map
  3. The hash key for each kernel is now calculated at compile time by bin2cpp.exe
  4. The hash key of multiple sources is obtained by re-hashing the individual hashes (see the sketch below)
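
For points 1 and 4, a minimal sketch of seeded FNV-1a hashing and of combining per-source hashes. This is illustrative only: it assumes 32-bit FNV-1a constants, and the actual deterministicHash in backend/common/util.cpp may differ in width and signature.

#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t FNV1A_BASIS = 2166136261u;  // 32-bit FNV offset basis
constexpr std::uint32_t FNV1A_PRIME = 16777619u;    // 32-bit FNV prime

// Seeded FNV-1a: passing the hash of a prefix as `seed` gives the same
// result as hashing the concatenated data in one pass, which is what
// makes the incremental calculation of point 1 possible.
std::uint32_t fnv1a(const char* data, std::size_t len,
                    std::uint32_t seed = FNV1A_BASIS) {
    std::uint32_t hash = seed;
    for (std::size_t i = 0; i < len; ++i) {
        hash = (hash ^ static_cast<unsigned char>(data[i])) * FNV1A_PRIME;
    }
    return hash;
}

// Point 4: the key for multiple sources re-hashes the per-source hashes
// (already computed at compile time) instead of re-reading the sources.
std::uint32_t combineHashes(const std::vector<std::uint32_t>& sourceHashes) {
    std::uint32_t hash = FNV1A_BASIS;
    for (std::uint32_t h : sourceHashes)
        hash = fnv1a(reinterpret_cast<const char*>(&h), sizeof(h), hash);
    return hash;
}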

Changes in all OpenCL kernels in backend/opencl/kernel/*.hpp

  1. The struct common::Source now contains: Source code, Source length & Source hash
  2. The static struct, generated at compile time, is now used directly

Changes in interfaces:

  1. deterministicHash overloaded for incremental hashing of string, vector<string> and vector<Source>
  2. getKernel now accepts a common::Sources object instead of a vector<string>

Current flow of data:
New kernel:
0. Compile time, kernel.cl: static const common::Source{*code, length, hash}

  1. Rep, kernel.hpp: vector<Source>
  2. Rep, kernel_cache.cpp: string tInstance <-- built directly from input data (fast)
  3. Rep, util.cpp: size_t moduleKey <-- combined hashes for multiple sources
    <-- incremental hashes of options & tInstance (fast)

Search kernel:

  1. Rep, kernel_cache.cpp: search map with moduleKey (one 64-bit cmp instruction per kernel)

Previous (3.8.0 master) flow of data:
New kernel:

  1. Once, kernel.hpp: static const string <-- main kernel code file cached
  2. Rep, kernel.hpp: vector<string> <-- build combined kernel code files
  3. Rep, kernel_cache.cpp: vector<string> args <-- transform targs vector into args (replace)
  4. Rep, kernel_cache.cpp: string tInstance <-- built from args vector
  5. Rep, kernel_cache.cpp: vector<string> hashingVals <-- copy tInstance + kernel code files + options
  6. Rep, util.cpp: string accumStr <-- copy vector into 1 string
  7. Rep, util.cpp: size_t hashkey <-- hash of the full string (slow)
  8. Rep, kernel_cache.cpp: string moduleKey <-- convert size_t to string

Search kernel:

  1. Rep, kernel_cache.cpp: search map with moduleKey (1 cmp per char, ~23 8-bit cmp instructions per kernel)
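
To illustrate point 2 of the description, a sketch of the lookup change; the actual cache container in kernel_cache.cpp may differ, and Module is a stand-in here.

#include <cstddef>
#include <map>
#include <string>

struct Module {};  // stand-in for the backend module handle

std::map<std::string, Module> oldCache;  // key compare walks the string char by char
std::map<std::size_t, Module> newCache;  // key compare is a single 64-bit instruction

Module* findModule(std::size_t moduleKey) {
    auto iter = newCache.find(moduleKey);
    return iter == newCache.end() ? nullptr : &iter->second;
}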

Changes to Users

None

Checklist

  • Rebased on latest master (Nov 18, 2020)
  • Code compiles
  • Tests pass
  • [-] Functions added to unified API
  • [-] Functions documented

@willyborn changed the title from "Speedup of kernel caching mechanism" to "Speedup of OpenCL kernel caching mechanism" on Nov 10, 2020
@9prady9 (Member) commented Nov 11, 2020

@willyborn nice improvement, thank you! I have a couple of comments on the caching API.

Ideally, it would be great if the extra hash or any other parameters didn't need to be passed down by the caller, but I can see how this may not be possible.

The following can actually help with the above requirement a bit: I believe the source hash computation can be moved further, to the compilation stage, to avoid even a single runtime calculation. This can be done by modifying the bin2cpp program we use to generate the byte code of CL headers to add another variable, function_name_hash. This approach however has one corner case that needs to be handled: more than one source file can form the full source of a given function.

@willyborn commented:

Good idea.
I will have a look at bin2cpp.

PS: All tests on CUDA failed. Any idea what happened, and what I can do to fix it? I do not have an NVIDIA card, so this part of the code is not configured on my system.

@9prady9 commented Nov 11, 2020


The program itself is a standalone source file under the CMakeModules folder in the root of the repository. It is built in the top-level CMakeLists.txt file here - https://github.com/arrayfire/arrayfire/blob/master/CMakeLists.txt#L152

If possible, try to reuse the util.hpp/cpp files with this program rather than making a copy of them, so that any future changes to the hashing function are immediately available in both locations.

I can take a look at the CUDA backend.

@willyborn commented:

  • I figured out the changes in CMakeLists.txt, and bin2cpp.exe now also includes the hashing from util.h/util.cpp.
  • For the OpenCL kernels with multiple source files, I take the compile-time hash key of the first file and incrementally add the hash of the second source file. The result is cached, so it only runs once.
  • I changed the interfaces for the hashing, and now use overloading instead of defaults. This means that the old interface remains present, with the same behavior as before, e.g. no need for incremental hashing.
  • I also overloaded the getKernel interface, so that no change in the jit.cpp file is needed any longer. I think this is the reason the CUDA backend failed.

The changes are now in final testing, and this PR will be updated soon.
I hope this corresponds with your remarks.

@9prady9 commented Nov 13, 2020

@willyborn bin2cpp changes look good. I think we can improve this much further. I did some trials on godbolt, and if we do the following, we can keep the same number of arguments as earlier and get away with changing only a single line in all kernel/*.hpp files.

If we define something like the following in src/backend/common/util.hpp

namespace common {
    struct Source {
        const char* ptr;
        size_t len;
        size_t f1nhash; //32 bit hash, typically Fowler-Noll-Vo "1a" 
    };
}

and change bin2cpp to add the following tuple-style struct

static const common::Source anisotropic_diffusion_src = {
    anisotropic_diffusion_cl,
    anisotropic_diffusion_cl_len,
    anisotropic_diffusion_cl_hash
};

then getKernel changes as follows - only the vector element type will change

Kernel getKernel(const string& kernelName,
                 const vector<common::Source>& sources, // earlier it was a string
                 const vector<TemplateArg>& targs,
                 const vector<string>& options,
                 const bool sourceIsJIT)

By doing the above, we can retain the earlier getKernel API in all OpenCL kernel header files by changing only the second parameter, as follows

auto approx1 = common::getKernel("approx1", {interp_src, approx1_src}, tmpltArgs, compileOpts);

By doing so, we can:

  1. Avoid the extra static string creation that happens with a copy of the byte array - decreased memory footprint of the arrayfire library. As common::Source holds only a pointer and a couple of size_ts, it should be more efficient than handling a vector of strings.
  2. Still access the source hashes inside getKernel, just hidden from the caller. Since the caller of getKernel doesn't need to create any objects of type common::Source or generate hashes, getKernel is a lot cleaner with fewer implementation details.

The full trial snippet can be found at https://godbolt.org/z/KbzaP9

@9prady9 9prady9 added the perf label Nov 13, 2020
@willyborn commented Nov 13, 2020

@9prady9 I understand that you want to bring the number of parameters down to the original count, and I agree.

However, I have a problem: I don't have a finhash for the vector, since there is no way to combine individual hashes except through incremental calculation. The whole improvement came from the fact that only a minimum of (incremental) hashes had to be calculated in the getKernel function.
If I were to combine all sources and recalculate the hash in the Source structure, I would have to copy all the sources again to concatenate them behind one ptr, again occupying more memory.

As a result, I would like to extend your proposed structure as follows:
in util.hpp:

namespace common {
    struct Source {
        const char* ptr;
        const size_t len;
        const size_t hash;   //32 bit hash, typically Fowler-Noll-Vo "1a" 
    };
    struct Sources {
        vector<Source> code;
        size_t finhash;  //32 bit hash, typically Fowler-Noll-Vo "1a" 
    };
}

in generated kernel file:

static const common::Source anisotropic_diffusion_src = {
    anisotropic_diffusion_cl,
    anisotropic_diffusion_cl_len,
    anisotropic_diffusion_cl_hash
};

in kernel_cache.cpp:

Kernel getKernel(const string& kernelName,
                 const common::Sources& sources,                       // earlier it was a string
                 const vector<TemplateArg>& targs,
                 const vector<string>& options,
                 const bool sourceIsJIT)

in kernel header files with 1 source:

auto approx1 = common::getKernel("approx1", source, tmpltArgs, compileOpts);

in kernel header files with multiple sources combined:

// Memory occupied by sources is minimal: 32B + 24B/Source
// getKernel is overloaded with Source & Sources as possible parameter types.
static common::Sources sources{{interp_src, approx1_src}, 
                               deterministicHash(approx1_src, interp_src.hash)};    //<-- This is slow
auto approx1 = common::getKernel("approx1", sources, tmpltArgs, compileOpts);

====================
In case you do not like the overload of getKernel, we can also populate the Sources structure immediately in the generated kernel. The creation of a combined sources var will then be longer:

static common::Sources sources{{interp_src.code[0], approx1_src.code[0]},
                               deterministicHash(approx1_src.code[0], interp_src.finhash)};

====================
A 3rd alternative is implementing the #include statement of OpenCL (all files in the same directory) in bin2cpp, so that the files are combined at compile time (a minimal sketch follows the list below).
This would produce a single source file for each kernel, since the combination is static.

Perhaps this last one is even the best, since

  • we only have struct Source for all, and no vector any more
  • even more work is done at compile time
  • only 1 interface for getKernel
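
A minimal sketch of this 3rd alternative, under the simplifying assumptions that every include sits on its own line and that comment/string handling and cycle detection are left out (a real bin2cpp needs both, as discussed later in the thread):

#include <fstream>
#include <regex>
#include <sstream>
#include <string>

// Recursively inline #include "file" directives; all files are assumed
// to live in the same directory, so the name can be used directly.
std::string expandIncludes(const std::string& fileName) {
    static const std::regex incl("^\\s*#include\\s+\"([^\"]+)\"");
    std::ifstream in(fileName);
    std::ostringstream out;
    std::string line;
    std::smatch match;
    while (std::getline(in, line)) {
        if (std::regex_search(line, match, incl))
            out << expandIncludes(match[1].str());  // splice the included file in place
        else
            out << line << '\n';
    }
    return out.str();
}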

@willyborn commented Nov 16, 2020

@9prady9

Attached you will find a test of the bin2cpp program, corresponding to the last alternative. The #include (part of OpenCL syntax) is now interpreted, resulting in a concatenated file.
Since I had to implement a state machine recognizing comments anyway (to avoid executing a commented-out #include, or an #include inside a string), I also decided to eliminate comments, leading spaces and trailing spaces in the generated .hpp file, resulting in an even smaller memory footprint.

For the example: interp.cl (10,223B) + approx1.cl (2,284B)
--> approx1_combined.hpp (12,507B) + 2 hashes
--> approx1+include.hpp (9,709B => -23%) + 1 combined hash
Advantages:

  • the hash code of the combined kernel code is available
  • we always have 1 Source, independent of how the kernel is coded
  • reduced memory footprint, due to the elimination of comments
  • further improved compile speed (smaller code)
  • speedup at runtime (no concatenation (& hash) needed at runtime any longer)
  • improved readability of the needed cpp code as kernel code
  • extra comments in the OpenCL kernel no longer have a negative impact on the runtime

In case you agree, I will rework the PR as described above.

PS: All current code also works with the new bin2cpp (since no #include is used)

Willy
Attached: code of new bin2cpp.cpp, examples of cl kernels and resulting .hpp files
bin2cpp.zip

@9prady9 commented Nov 17, 2020

@willyborn Sorry about the delay. I was checking whether hashes can be combined, and it turns out we can combine a list of hashes using another hash function. I haven't explored it further though.

I have looked at the new bin2cpp program output and it looks good. I have a couple of questions:

  • About the depth of include expansions: How deep does the #include expansion go? Is it limited in any fashion? If yes, is there an easy way to increase that limit?
  • Earlier I noticed that you are replacing the seed or FNV prime with the previous hash value, which seemed odd. I don't think this modified FNV-1a algorithm would have the same characteristics as the original one using the FNV prime. If you are sure it is not going to be affected, can you please share any references, preferably papers, that throw some light on how to use the FNV-1a hash function for incremental hashing.

If we can easily implement combining hashes, then it would still have all the advantages of the above approach, and the following

getKernel("approx1", {approx1_src, interp_src}, ....);

call gives a more explicit indication of what sources are used for that OpenCL function right away without looking up the code of interp.cl. It is not a major blocker, but it helps while debugging OpenCL kernels.

Can you please give me some time to explore incremental hashing and hash lists a bit more? We are on the right track and I want to ensure we have an overall good solution before we make code changes.

@willyborn commented Nov 17, 2020

@9prady9

Pradeep, I have no paper explaining the incremental behavior of the FNV-1a algorithm. You can however easily derive it yourself:

  1. FNV-1a incremental behavior

FNV-1a split into separate logical operations:

  • Assume the input is an array of chars: char[0], ..., char[3]
  • 1st operation: hash = (SEED ^ char[0]) * PRIME;
  • 2nd operation: hash = (hash ^ char[1]) * PRIME;
  • 3rd operation: hash = (hash ^ char[2]) * PRIME;
  • 4th operation: hash = (hash ^ char[3]) * PRIME;

Now split the array in 2 parts: array0 = char[0]+char[1] and array1 = char[2]+char[3].
FNV-1a on array0:
hash0 = (((SEED ^ char[0]) * PRIME) ^ char[1]) * PRIME;
FNV-1a on array1:
hash1 = (((SEED ^ char[2]) * PRIME) ^ char[3]) * PRIME;
FNV-1a on array0+array1:
hash0 = (((SEED ^ char[0]) * PRIME) ^ char[1]) * PRIME;
hash0plus1 = (((hash0 ^ char[2]) * PRIME) ^ char[3]) * PRIME;
You notice that the SEED is replaced by the hash of the previous part (array0) when we want to include extra text in the FNV-1a hash. (A runnable check of this property follows after point 3 below.)
  2. Combining hashes

The hashes of multiple Sources are now combined by calculating a new hash (FNV-1a) over the individual hashes.

  3. Depth of #include in bin2cpp

There is no limit (the expansion is recursive), which could give a problem when we arrive at a circular kernel construction:
fileA: #include fileB
fileB: #include fileA
The bin2cpp program will then fail with an out-of-memory (stack overflow) error.
Since OpenCL kernels are not user specified, I did not build in protection here.
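
To make point 1 concrete, a small self-contained check of the split-hash property. This assumes 32-bit FNV-1a constants; the widths used by the actual deterministicHash may differ.

#include <cassert>
#include <cstdint>
#include <string>

std::uint32_t fnv1a(const std::string& s, std::uint32_t seed = 2166136261u) {
    std::uint32_t h = seed;
    for (unsigned char c : s) h = (h ^ c) * 16777619u;
    return h;
}

int main() {
    std::string array0 = "__kernel void ", array1 = "foo() {}";
    // Hashing array1 seeded with the hash of array0 ...
    std::uint32_t incremental = fnv1a(array1, fnv1a(array0));
    // ... equals hashing the concatenation in one pass.
    assert(incremental == fnv1a(array0 + array1));
    return 0;
}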

===================
The newly posted PR gives you an idea of how it could look.
I obtained a marginal speed gain vs version 1, and certainly no regression.
The memory footprint is now minimal, I expect, since code is only copied for compiling and no longer at runtime.

===================

PS: What do you use to debug OpenCL kernels? I only have comments & printf & a lot of trying ...
Is there a way to extract the log files of the compilers? I cannot test on Mac or Linux at home.

@9prady9 commented Nov 20, 2020

I see that you have fixed the default value of prevHash now; that is all I was confused about earlier. Even though your explanation took the seed into account, i.e. the prime for the first byte's calculation, the code change was not using the prime.

I am concerned about the corner cases of include processing. Even if you accounted for most cases, that approach is waiting for some bug we might not have thought about, given the complexity of recursive inclusion and #pragma once. I did notice that you have rewritten quite a bit since my last review - merging source hashes similar to a byte stream.

I am going to experiment with the code a bit today and tomorrow. I will try fixing the code for Unix OSes. I have fixed bin2cpp so far.

Thank you!

@9prady9 commented Nov 20, 2020

@willyborn We can remove the include processing addition in bin2cpp now, is that correct? Or is it still being used for something else?

@willyborn commented:

I'm setting up a virtual machine with Ubuntu 18, trying to find out the problems with Unix. You will have an update in a few days.

Regarding the include:

  • We can remove the include functionality without problem.
  • In order to process the includes, I also needed the comment state. Later I noticed a positive side effect:
    --> the kernel source code is reduced by 30% (approx1.cl) when bin2cpp eliminates these comments. Besides the smaller memory footprint, I have no idea (not yet measured) whether this also improves the compilation time of the OpenCL kernel.
    --> I assume that this has no negative impact on debugging.
    --> It does add extra complexity. Is it worthwhile enough?

I will start by removing the include code. You can then decide later to remove the comments as well, if you consider the added complexity too high compared to the memory footprint gain.

PS: Can you share how you fixed the bin2cpp for unix?

@9prady9 commented Nov 21, 2020

@willyborn I have the commits locally. Please wait for my push, so that your changes land on top of my changes in your branch. I am going to push to your PR branch. I had to fix a bunch of things in the CUDA backend to get it compiling, but it is still not running due to a runtime compilation failure. It compiles now on Linux, so it should on OSX too.

Note: OpenCL backend works as expected with all tests passing.

@9prady9 commented Nov 21, 2020

I have fixed the compilation so far and have just begun looking into the runtime compilation. You are welcome to take a stab at it; I am surprised what changed in the CUDA backend that started causing the below.

[jit][1605961614][173859] [ ../src/backend/cuda/compile_module.cpp:405 ] {12954760905715010591 : Unable to open /home/pradeep/.arrayfire/KER12954760905715010591_CU_61_AF_38.bin for GeForce GTX 1060 3GB}
In function cuda::Module common::compileModule(const string&, const std::vector<std::__cxx11::basic_string<char> >&, const std::vector<std::__cxx11::basic_string<char> >&, const std::vector<std::__cxx11::basic_string<char> >&, bool)
In file src/backend/cuda/compile_module.cpp:276
NVRTC Error(6): NVRTC_ERROR_COMPILATION
Log: 
math.hpp(143): error: class "common::kernel_type<common::half>" has already been defined

1 error detected in the compilation of "12954760905715010591".

Are you talking about removing comments before converting the sources into byte code? If that is the case, I think that is alright. But we should still be able to have comments in the cl or cu source code.

@willyborn commented:

Yes, I'm talking about removing the comments in the generated hpp files.
The original cl source files keep their comments.
==> I remove only the include part

Is bin2cpp also used to convert cu source code?

@9prady9 commented Nov 21, 2020

> Yes, I'm talking about removing the comments in the generated hpp files.
> The original cl source files keep their comments.
> ==> I remove only the include part

I guess that is alright then.

> Is bin2cpp also used to convert cu source code?

Yes :) It is not only for OpenCL; it is a generic and very basic code obfuscation tool used all over the repository where regular headers are converted to strings.

@9prady9 commented Nov 22, 2020

@willyborn I just realized, I think I know why the "already defined" error happened with half in the CUDA backend. It is due to include processing; once you remove it, it should work as before.

@9prady9 commented Nov 23, 2020

Interesting, the CUDA k5000 job run seems to be from 4 days back; for some reason it didn't run. We are looking into it, and into the CUDA compilation error on linux-all - the identifier name is correct, so I wonder why it is complaining about it.

Edit: I may push the fixes today; a lot of CUDA tests are still failing even though I am able to compile on my local Linux setup.

Update: Found the issue.

moduleKey = (sources.size() == 1 && sources[0].hash)
? sources[0].hash
: deterministicHash(sources);
moduleKey = deterministicHash(options, moduleKey);
} else {
moduleKey = deterministicHash(tInstance);
moduleKey = deterministicHash(tInstance, moduleKey);
@willyborn commented Nov 23, 2020 (review comment on the diff above)

I have the impression that this is only needed for CUDA. Sorry that I did not realize this before and as a result broke the code.
For OpenCL, it has a negative effect on speed, since programs with multiple kernels are now compiled multiple times (once for each kernel) and avoidable time is spent in the hash.

I suggest the following changes:

  • put the tInstance generation (loop) inside #if defined(AF_CUDA) and only keep the initialization with kernelName.
  • put the deterministicHash(tInstance, moduleKey) call also behind #if defined(AF_CUDA)
  • the call to getKernel at the end can then become common for both; no guard needed anymore.

During my previous tests, I could measure a performance difference for OpenCL. By doing so, performance will improve even further, since this is on the critical path.

@9prady9 commented Nov 23, 2020

  1. Multiple compilations are not possible; let me explain why. For the OpenCL backend, the template arguments are sort of dummy, since they are also supplied as part of the compilation options because of the way OpenCL kernels work - they can only be written in C. Compilation options differ for identically named kernels depending on other opencl-compile-time constants, which can be types or constants depending on the need.

  2. The earlier bug wasn't just CUDA related. For example, tInstance is approx<float> in CUDA and just approx in OpenCL; a kernel file with identical content but a different kernel/function name would incorrectly fetch the wrong binary if we don't hash the name of the kernel. Although this situation doesn't happen now, I don't think we should add this unnecessary limitation to the caching API.

  3. It is for the reason mentioned in (1) that we need to call getKernel at the end with different parameters. If we always pass tInstance at the end, we would end up moving the condition to the beginning to make sure the OpenCL backend doesn't add < ... > to tInstance - I don't have any objection to this though, since we end up doing less work by not adding those extra template parameters to tInstance, which are not needed for OpenCL. In turn this reduces the bytes to hash from tInstance.

How much time difference are you noticing?

On another note, I think we should get rid of template args in the OpenCL backend, as in pass {} instead of any values, which saves some time by not creating that variable altogether.
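
For illustration, a hypothetical call shape for that idea (assuming the getKernel overloads discussed in this thread; transpose_src is a stand-in name):

auto transpose = common::getKernel("transpose", {transpose_src}, {}, compileOpts);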

@9prady9 commented:

I have pushed the change from point (3) in the earlier comment; can you please test the branch now for timing.

@willyborn commented Nov 25, 2020

@9prady9

I finally have some testing results.

Scenarios: I have 4 scenarios: disk caching ON/OFF; deterministicHash(tInstance) added for OpenCL ON/OFF (code diff attached).

For ORB, I measure the first run (incl. compiling, storing & executing) and a loop (only searching & executing).

Disk caching | hash | 1st run | %time | speedup | Loop (ORB) | %time | speedup | # kernels saved
OFF          | ON   | 1.8465s | 100%  | 1.0x    | 0.0120s    | 100%  | 1.0x    | 9 programs
OFF          | OFF  | 0.743s  | 40%   | 2.49x   | 0.0120s    | 100%  | 1.0x    | 4 programs <== Updated
ON           | ON   | 0.2036s | 100%  | 1.0x    | 0.0119s    | 100%  | 1.0x    | 9 programs
ON           | OFF  | 0.1887s | 93%   | 1.1x    | 0.0119s    | 100%  | 1.0x    | 4 programs

These differences will be observed for: fast, flood_fill, harris, homography, regions and orb.
Nice gains for these functions during initialization, although that happens only once per program run.
For all the others the difference is within the error margin.

=======
An update on the neural_network test:
The final speed is very similar to before, with however a 1% difference between hash(kernName) ON/OFF. This is the general situation, without multiple kernels in 1 program source file.

=========
I get an error while attaching files, so I include them here in the clear.

Epochs: 250; Precision: f32 (last column: speed vs master)
Batch size: 32 | Training time: 37.1264 s | Accuracy training: 98.8924 | Accuracy test: 92.4829 1.21
Batch size: 40 | Training time: 29 s | Accuracy training: 86.5763 | Accuracy test: 82.7892 1.21
Batch size: 48 | Training time: 23.9359 s | Accuracy training: 98.8263 | Accuracy test: 93.3435 1.21
Batch size: 52 | Training time: 23.3515 s | Accuracy training: 98.9254 | Accuracy test: 93.065 1.15
Batch size: 64 | Training time: 18.7354 s | Accuracy training: 99.0908 | Accuracy test: 93.6219 1.19
Batch size: 96 | Training time: 14.9823 s | Accuracy training: 98.9089 | Accuracy test: 93.3688 1.02
Batch size: 128 | Training time: 12.9594 s | Accuracy training: 98.3138 | Accuracy test: 92.9132 1.00
Batch size: 256 | Training time: 10.4322 s | Accuracy training: 95.5034 | Accuracy test: 92.407 1.01
Batch size: 512 | Training time: 8.55012 s | Accuracy training: 33.4435 | Accuracy test: 33.2068 1.02
Batch size: 1024 | Training time: 7.72961 s | Accuracy training: 13.7709 | Accuracy test: 13.389 1.01

=================
$ git diff src/backend/common/kernel_cache.cpp
diff --git a/src/backend/common/kernel_cache.cpp b/src/backend/common/kernel_cache.cpp
index 43ca6893..5031d6b7 100644
--- a/src/backend/common/kernel_cache.cpp
+++ b/src/backend/common/kernel_cache.cpp
@@ -81,7 +81,9 @@ Kernel getKernel(const string& kernelName,
? sources[0].hash
: deterministicHash(sources);
moduleKey = deterministicHash(options, moduleKey);
+#if defined(AF_CUDA)
moduleKey = deterministicHash(tInstance, moduleKey);
+#endif
}
const int device = detail::getActiveDeviceId();
Module currModule = findModule(device, moduleKey);

==================
Test program for orb speed:

  const int loop = 500;

  std::cout << "\nPlease delete all the cached kernels before starting this test."
            << "\nWhen not performed, the effect of permanent caching will be visible";

  af::array in = af::loadImage(
      "D:/github/willyborn/arrayfire/test/data/meanshift/coins.png", false);
  af::features feat;
  af::array desc;

  af::sync();
  af::timer::start();
  af::orb(feat, desc, in);
  af::sync();
  double t = af::timer::stop();
  std::cout << "\nInit: " << t;


  af::timer::start();
  for (int i = 0; i < loop; ++i) { af::orb(feat, desc, in); }
  af::sync();
  t = af::timer::stop();

  std::cout << "\nLoop: " << t/loop;

@9prady9 9prady9 force-pushed the hashCaching branch 2 times, most recently from c133246 to 71890c6 Compare November 25, 2020 07:28
@9prady9 commented Nov 25, 2020

I am intrigued as to how hashing a short name is causing more kernels to be saved to disk in the OpenCL backend.

Disk caching | hash | 1st run | %time | speedup | Loop (ORB) | %time | speedup | # kernels saved
OFF          | ON   | 1.8465s | 100%  | 1.0x    | 0.0120s    | 100%  | 1.0x    | 9 programs
OFF          | OFF  | 0.1885s | 10%   | 10.0x   | 0.0120s    | 100%  | 1.0x    | 4 programs

I assume there were no cached binaries on disk when the arrayfire tests ran in the above two scenarios. Just want to make sure arrayfire is not picking up some items cached from old compilations.

Disk caching | hash | 1st run | %time | speedup | Loop (ORB) | %time | speedup | # kernels saved
ON           | ON   | 0.2036s | 100%  | 1.0x    | 0.0119s    | 100%  | 1.0x    | 9 programs
ON           | OFF  | 0.1887s | 93%   | 1.1x    | 0.0119s    | 100%  | 1.0x    | 4 programs

The above two cases shouldn't make much difference unless there are no binaries on disk at all. Based on the 1st-run times, it looks like binaries cached from earlier compilations are used in these.

I want to check the timings with a simpler kernel like transpose and check how and why hashing the additional name can result in the OpenCL backend generating more kernels, because ideally it shouldn't. Will get back to you shortly. Are you in the Indian time zone by any chance? If so, you can reach me on Slack; the discussion can be more instantaneous that way.

@9prady9 commented Nov 25, 2020

Below is the output with: Caching(OFF), PreviousCaches(NONE), NameHash(YES)

 ./examples/helloworld/helloworld_opencl 
ArrayFire v3.8.0 (OpenCL, 64-bit Linux, build 71890c6dc)
[0] NVIDIA: GeForce GTX 1060 3GB, 3011 MB -- OpenCL 1.2 CUDA -- Device driver 455.45.01 -- FP64 Support: True -- Unified Memory (False)
-1- AMD: Spectre, 2543 MB -- OpenCL 1.2 AMD-APP (3188.4) -- Device driver 3188.4 -- FP64 Support: True -- Unified Memory (True)

Init: 0.001279
Loop: 0.000160158

Below is the output with: Caching(OFF), PreviousCaches(NONE), NameHash(NO)

Init: 0.000915
Loop: 0.00015524

I am building with CACHING turned on now; it's going to take a few minutes, but I doubt the times will be much different from CACHING OFF, since that is the only code difference between the two cases.

I didn't notice anything odd with a simple kernel such as transpose. The compilation was invoked only once each for transpose and the random generator, as expected. I am going to try with ORB after the current run and see; I suspect it has something to do with that function rather than hashing the name. Let's see.

@willyborn commented Nov 25, 2020

Pradeep,
When the kernel name is included in the hashing, we generate a new hash dependent on the kernel name (same program).
When searching in our map, we do not find the new hash, and as a result we compile the program.
Since we now have multiple hashes for the same program (dependent on the kernel name), we end up with multiple compilations of the same program.

Regarding the time, you are right, an error slipped in.
The new table:

Disk caching | hash | 1st run | %time | speedup | Loop (ORB) | %time | speedup | # kernels saved
OFF          | ON   | 1.8465s | 100%  | 1.0x    | 0.0120s    | 100%  | 1.0x    | 9 programs
OFF          | OFF  | 0.743s  | 40%   | 2.49x   | 0.0120s    | 100%  | 1.0x    | 4 programs <== Updated
ON           | ON   | 0.2036s | 100%  | 1.0x    | 0.0119s    | 100%  | 1.0x    | 9 programs
ON           | OFF  | 0.1887s | 93%   | 1.1x    | 0.0119s    | 100%  | 1.0x    | 4 programs

@9prady9 commented Nov 25, 2020

> When the kernel name is included in the hashing, we generate a new hash dependent on the kernel name (same program).
> When searching in our map, we do not find the new hash, and as a result we compile the program.

In a given application run/session, if the hashing logic includes the name, all requests to getKernel hash the name. I understand that adding the name changes the overall content being hashed, but you are missing the point that the name is not the unique criterion for the OpenCL backend. The information that comes from name<...> is passed in a different format via the compilation options too. Here is an example (note that the options are just to illustrate what happens in the code):

with name included

name: transpose<float, true>
compile_options: float_type, conjugate_true
source: <...>

vs no name

compile_options: float_type, conjugate_false
source: <...>

They are both different in content for sure and will generate different hashes, but each still uniquely represents one kernel compilation instance and will always result in only one kernel. Note that modified content will result in a new binary with a different filename. Subsequent runs, after the new binary is stored, will pick up this new binary rather than the old one.

And my trial with transpose (instead of the orb function) kind of proves that too. Also, do note that my most recent commit in your branch basically removed the logic that adds the <...> template instantiation text; it runs only on the CUDA backend now.

I don't think the problem here is the extra name hashing, which is at most 10 to 20 characters in the OpenCL backend, but the extra kernels getting generated in your ORB trials. I am looking into it.

@9prady9 commented Nov 25, 2020

I think I now know why that is happening. ORB is written in such a way that multiple kernel functions are present in the same file. So when we were not including the name of the function being asked for, the same program binary was being returned for all kernels present in a single cl file. With the name included, it creates separate files. Okay, mystery solved. Now I have to figure out whether such an arrangement is okay to have. Will post shortly.

@9prady9 commented Dec 3, 2020

> Rebased on the new master (cuSparse update), and squashed into 1 commit. I thought I had done this already before??
> The recently reported issue remains separate so that you can see what changed. I hope this is what you were referring to, since I do not have the same tools as you have (MSVS and Windows).

@willyborn The squash was undone, probably due to me force-pushing to trigger CI jobs; sorry about that. Is the buffer overflow issue fixed now?

Updates:

  1. No, it hasn't been yet; I just checked the build failure - the address sanitizer builds are still failing.
  2. I have fixed the history by pulling your branch. We are good now in case I push some fix.

@willyborn commented Dec 3, 2020

No problem about the push. I thought I had done something wrong, since this is the first open-source project I'm participating in, with its typical tool usage.

The single remark I received from the static analyzer in MSVC (Microsoft Native Recommended Rules) is no longer there. There are however so many more profiles to check against. For the whole arrayfire solution I get over 600 warnings even without the analyzer, so it is rather difficult to determine which ones to react to without a full refactoring of the whole code set (which I suppose is not the purpose). I will experiment with the LLVM toolset, although this is still a mystery to me.

Is there a way to download (have a peek at) those generated errors/logfiles? I see full (local) path file references in the displayed log, although no specific errors.

@9prady9 commented Dec 3, 2020

@willyborn That is alright. Our CI jobs upload build reports to ci.arrayfire.org. You can look for reports based on the PR number on that website. You can find the error message here: https://ci.arrayfire.org/viewBuildError.php?buildid=2713 It originates in the line

for (size_t i = strlen(line) - 1; i > 0 && line[i] == ' ';

I am busy debugging another problem right now. I will try to look into it later today or tomorrow. As far as I can tell, there is some kind of write/read happening beyond the buffer size allocated in the removeComments function.

@willyborn commented:

Found it.
After the full cycle of tests, I will upload.

@9prady9 commented Dec 4, 2020

@willyborn When you push your next fix, please take care of the merge commit in the history too. Another one sneaked in recently. Thank you.

@willyborn commented Dec 4, 2020

@9prady9
Remaining work:

  • magma/laset.hpp: does not recognize getKernel?? (I already added util.h & changed the type of kernelName to string, although the problem remains. This is not even part of my changes.)
  • cuda/sparse_blas.cu: CUSPARSE_CSRMM_ALG1 deprecated (I suppose you are already looking at that)
  • DefaultMemoryManager.hpp: move constructor implicitly deleted (MemoryManagerBase) (I hesitate to change) ==> also in master

I am afraid that I will not be helpful with the above errors/warnings and will have to leave them to you. The PR is squashed and in sync, so that you can add commits.
We are almost there.

PS: My GTX card arrived, so next time I will bother you less with CUDA.

@9prady9 commented Dec 4, 2020

> magma/laset.hpp: does not recognize getKernel?? (I already added util.h & changed the type of kernelName to string, although the problem remains. This is not even part of my changes.)

Not sure that is a problem, because the regular OpenCL build works fine for me. The Linux build with address sanitizer failed earlier because of the buffer overflow.

> cuda/sparse_blas.cu: CUSPARSE_CSRMM_ALG1 deprecated (I suppose you are already looking at that)

We already enable the newer CUSPARSE API based on a CUDA toolkit version check. If any newer changes are introduced in CUDA 11, we will take care of it once the API has stabilized.

> DefaultMemoryManager.hpp: move constructor implicitly deleted (MemoryManagerBase) (I hesitate to change) ==> also in master

I have seen these warnings; I have them on my TODO list already.

> PS: My GTX card arrived, so next time I will bother you less with CUDA.

Glad you have a CUDA card too :) Enjoy GPU computing.

I am taking a look at the laset error; it is odd how these errors happen in this PR although this PR doesn't change CMakeLists.txt in those aspects.

@9prady9 commented Dec 4, 2020

@willyborn One minor suggestion: the commit message of the first commit is formatted incorrectly, without a new-line break between the brief and elaborate descriptions.

@9prady9 commented Dec 4, 2020

@willyborn As I suspected, it is not about laset.hpp under the src/backend/opencl/kernel/ folder. There is still some sort of memory issue in the code; thus laset.hpp is not generated and hence not found.

Unfortunately, our CI log isn't showing the full picture - we have a separate, known and ongoing issue with our cdash due to a recent version upgrade, which is why this is happening. Sorry about it.

I will take a look into the memory error.

The following is the precise error now:

[52/749] Compiling /home/pradeep/gitroot/ArrayFireWorkspace/worktrees/hashCaching/src/backend/opencl/kernel/laset.cl to C++ source
FAILED: src/backend/opencl/kernel_headers/laset.hpp 
cd /home/pradeep/gitroot/ArrayFireWorkspace/worktrees/hashCaching/src/backend/opencl/kernel && /usr/bin/cmake -E make_directory /home/pradeep/gitroot/ArrayFireWorkspace/worktrees/hashCaching/build/src/backend/opencl/kernel_headers && /usr/bin/cmake -E echo \#include\ \</home/pradeep/gitroot/ArrayFireWorkspace/worktrees/hashCaching/src/backend/opencl/kernel/laset.hpp\> >>"/home/pradeep/gitroot/ArrayFireWorkspace/worktrees/hashCaching/build/src/backend/opencl/kernel_headers/laset.hpp" && /home/pradeep/gitroot/ArrayFireWorkspace/worktrees/hashCaching/build/bin2cpp --file laset.cl --namespace opencl --output /home/pradeep/gitroot/ArrayFireWorkspace/worktrees/hashCaching/build/src/backend/opencl/kernel_headers/laset.hpp --name laset_cl
=================================================================
==59445==ERROR: AddressSanitizer: strncpy-param-overlap: memory ranges [0x7fff791844b0,0x7fff791844b1) and [0x7fff791844b0, 0x7fff791844b1) overlap
    #0 0x7fb0f46d2d97 in __interceptor_strncpy /build/gcc/src/gcc/libsanitizer/asan/asan_interceptors.cpp:481
    #1 0x55c9b3aff9d9 in removeComments(std::basic_ifstream<char, std::char_traits<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) ../CMakeModules/bin2cpp.cpp:231
    #2 0x55c9b3b00fed in main ../CMakeModules/bin2cpp.cpp:298
    #3 0x7fb0f4188151 in __libc_start_main (/usr/lib/libc.so.6+0x28151)
    #4 0x55c9b3afda4d in _start (/home/pradeep/gitroot/ArrayFireWorkspace/worktrees/hashCaching/build/bin2cpp+0xba4d)

Address 0x7fff791844b0 is located in stack of thread T0 at offset 736 in frame
    #0 0x55c9b3afecca in removeComments(std::basic_ifstream<char, std::char_traits<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) ../CMakeModules/bin2cpp.cpp:136

  This frame has 9 object(s):
    [48, 56) 'context' (line 156)
    [80, 88) 'd' (line 230)
    [112, 120) '<unknown>'
    [144, 152) '<unknown>'
    [176, 192) 'del' (line 152)
    [208, 232) 'dels' (line 153)
    [272, 664) 'ss' (line 137)
    [736, 992) 'line' (line 138) <== Memory access at offset 736 is inside this variable
    [1056, 1312) 'local' (line 148)
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
      (longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: strncpy-param-overlap /build/gcc/src/gcc/libsanitizer/asan/asan_interceptors.cpp:481 in __interceptor_strncpy
==59445==ABORTING

@9prady9 commented Dec 4, 2020

strncpy-param-overlap: memory ranges - that's the problem.

@willyborn commented Dec 4, 2020 via email: "I know what this can be."

@willyborn commented Dec 4, 2020 via email: "strange, it is only in 1 cu file."

@9prady9 commented Dec 4, 2020

> strange, it is only in 1 cu file.

It is probably only getting reported on one of the files. Earlier it was the cusparse header.

Is there any chance removeComments can be simplified further? Perhaps by avoiding C function calls.

@willyborn willyborn force-pushed the hashCaching branch 2 times, most recently from 7791b9f to 825e461 Compare December 4, 2020 16:55
@willyborn commented:

@9prady9
I eliminated the strncpy function calls, since they were also throwing up warnings.
Instead I used memcpy and my own loop.

The problem with strncpy is that the copy order is apparently not defined by the C++ standard, resulting in undefined behavior when the ranges overlap. We no longer have this problem since strncpy is no longer used.
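
For reference, a minimal sketch of an overlap-safe in-place shift using memmove, whose behavior is defined for overlapping ranges; the actual loop in bin2cpp.cpp differs.

#include <cstddef>
#include <cstring>

// Drop leading spaces by shifting the line left, in place.
// strncpy(line, line + start, n) would be undefined here because the
// source and destination ranges overlap; memmove handles the overlap.
void trimLeading(char* line) {
    std::size_t start = 0;
    while (line[start] == ' ') ++start;
    if (start)
        std::memmove(line, line + start, std::strlen(line + start) + 1);
}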

Everything is squashed into 1 commit, so that I could update the comments and insert that blank line.
All changes on laset are removed, since you explained that the problem was not there.

9prady9 previously approved these changes Dec 7, 2020
@9prady9 requested a review from umar456 on December 17, 2020 05:00
willyborn and others added 2 commits February 18, 2021 04:46
The program source files' memory footprint is reduced (-30%) by eliminating
comments in the generated kernel headers. Hash calculation of each source
file is performed at compile time and incrementally extended at runtime
with the options & tInstance vectors. Overall performance increased up to
21%, up to the point where the GPU becomes the bottleneck, and the overhead
to launch the same (small) kernel improved by 63%.
@umar456 (Member) left a comment:

This is amazing work. The benchmarks look promising for smaller sizes. Thank you for your contribution!

@umar456 umar456 merged commit 3cde757 into arrayfire:master Feb 18, 2021
9prady9 added a commit to 9prady9/arrayfire that referenced this pull request Aug 2, 2021
Speedup of kernel caching mechanism by hashing sources at compile time (arrayfire#3043)

* Reduced overhead of kernel caching for OpenCL & CUDA. (Commit message as quoted above.)

* Fix couple of minor cmake changes

* Move spdlog fetch to use it in bin2cpp link command

Co-authored-by: pradeep <pradeep@arrayfire.com>
(cherry picked from commit 3cde757)
syurkevi pushed a commit that referenced this pull request Dec 28, 2021
Speedup of kernel caching mechanism by hashing sources at compile time (#3043)
(same commit message as above; cherry picked from commit 3cde757)
@willyborn willyborn deleted the hashCaching branch September 29, 2022 13:01