-
Notifications
You must be signed in to change notification settings - Fork 548
Reduce synchronized initialization of OpenCL Buffers #3039
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I think we had some issues using |
No problem.
It did however solve the INTERMITTENT ERRORS I have, with the ORB TEST.
In case the FillBuffer is not suited, I propose to change the WriteBuffer
to blocking since no synchronisation point is present before leaving the
function.
When your test indicates incompatibilities with some vendors, I can also
update the PR accordingly. It would be a petty however.
I wait on your instructions.
…On Thu, 5 Nov 2020 at 16:30, pradeep ***@***.***> wrote:
I think we had some issues using clEnqueueFillBuffer in the past with one
of the vendors - AMD, Intel & NVIDIA - don't remember which one it is. We
need to test this thoroughly to ensure we don't do a rollback again later.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#3039 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AQ2WGPGWYGYNF7QDREDAITLSOLAJVANCNFSM4THQJAKA>
.
|
Yes, it is lame some times w.r.t OpenCL. Sure, we will keep you posted here. |
test_orb_opencl continue failing when I run the standard tests. With the latest speed increases, it is now each time. (On top, I have still 3 others continuous failing (also with master) pInverse_cpu, pInverse_opencl & parse_threading. I noticed that they succeed on your test system however. I think it has something to do with the synchronization. As soon as I find a cause, I will let you know.) What would you think, if I write a new non-blocking kernel with the same parameters as the clEnqueuFillBuffer call of OpenCL. I think this is not that difficult, and on top of fixing unexpected behaviors it will increase the speed again for specific cases thanks to the asynchronous behavior. |
I think we have kernels that can handle generic copying, not sure about filling a constant - may be they expect input buffer. In any case, the tests worked fine with the systems/vendors we have checked so far. We didn't face an issue although we haven't got our hands on Intel stuff I think. So, I would say don't try implementing such kernel until we can reproduce the issue on some kind of setup. We wouldn't want you to spend time on something that may have been fixed by vendors in their newer opencl implementations. With regards to the test failures, are you referring to speed increase from other PR changes w.r.t caching API ? pinverse how is it failing? Accuracy error? Segfault ? Or anything else ? I believe our CI systems use Intel MKL, what blas/lapacke libraries are you using ? |
Regarding the PR of JIT overhead & hashCaching. I had both changes already for 6 months, but did not introduce them because of the underlying errors. That is the reason I first introduced:
My HW & SW config:
====================
test_pinverse_cpu.txt |
I have a similar APU, A10 A7850K. I was able to reproduce the only two of your cases.
A quick update/note: orb_opencl doesn't fail for me on APU on master branch i.e. I didn't need the fillbuffer change. |
Below you can find the full list of all SW installed through vcpkg.
Regarding the CPU. It is overclocked to 4.3GHz (base) and 4.6GHz (boost). I did not succeed in overclocking the GPU. Even with a faster CPU, I notice that in most applications I have a problem in keeping the GPU busy. This is the main cause, that I focus so much on the overhead of the CPU. Off-course the bugger the arrays the lesser the impact. Please use the bat file (windows) attached, which repeats the threading_opencl until failure. It will output you the failure rate, which is very variable. |
We need to verify the different vendor implementations versions to ensure we aren't excluding any accidentally. If this can be done without the need to add custom fill implementation, that would be ideal. I shall get back here with updates once we figure out the next step. |
@9prady9 If you decide to go for this last commit, I still can merge everything into 1 commit so that the 1.2 updates disappears. The last commit overwrites all the previous changes. The stability problems were only detected when using the OpenCL library of AMD (NVIDIA OpenCL apparently performs more operations on the main Host thread so it takes longer before the variables go out of scope). The stability issues could however appear on both depending on the load (of other kernels) on the GPU. |
@willyborn We are re-organizing our CI infrastructure slightly and hence quite a few GPU jobs haven't been run on this or some other PRs yet. Hence, there haven't been any merges that changed GPU associated code. your changes look good, please give me some time, I shall let you, as soon as I can, on whether to use OpenCL 1.2 fill buffer API or your most recent commit. Thank you! |
@willyborn Sorry about the extended delay. We are good with using |
@pradeep.
I will try to have this done by next week.
I am currently in the final tests with the join/copy improvements and
prefer to have a stable system during the different performance tests.
…On Wed, May 5, 2021, 16:03 pradeep ***@***.***> wrote:
@willyborn <https://github.com/willyborn> We are re-organizing our CI
infrastructure slightly and hence quite a few GPU jobs haven't been run on
this or some other PRs yet. Hence, there haven't been any merges that
changed GPU associated code. your changes look good, please give me some
time, I shall let you, as soon as I can, on whether to use OpenCL 1.2 fill
buffer API or your most recent commit.
Thank you!
@willyborn <https://github.com/willyborn> Sorry about the extended delay.
We are good with using clEnqueueFillBuffer. Can you please drop the last
commit and rebase your branch from latest master. Once the jobs run, we can
get this in immediately.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3039 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AQ2WGPGATDIZJ3DPJEV56ULTMFF4HANCNFSM4THQJAKA>
.
|
Sure, I haven't gotten to the join benchmarks you shared earlier. I am in the middle some other tasks, but I will get to them once I am done with what I am doing now. Sorry about the delay. |
@willyborn Did you get the chance to look into this ? |
The join optimization takes longer than expected (found a bug in my vida
code, discarding a big part of my tests).
I expect however, that tomorrow I can start syncing with latest master. So
updates and new PRs (this one first) from Thursday on.
Willy
…On Tue, May 18, 2021, 05:57 pradeep ***@***.***> wrote:
@willyborn <https://github.com/willyborn> Did you get the chance to look
into this ?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3039 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AQ2WGPHVIBXAOOMIBPZUXZDTOHQQ3ANCNFSM4THQJAKA>
.
|
enqueueWriteBuffer is replaced by enqueueFillBuffer calls, which always operates asynchronisly because the pattern is copied during the call and not during the execution as for the enqueueWriteBuffer. Optimizes: susan, sparse, regions, orb, harris and fast.
thank you, will take a look once the jobs are done. |
I verified the errors in CDash, although I can not find any relationship to the changes applied in PR 3039. I confirm, on my local system all tests pass except
|
I checked the logs, they are failing on master too. I don't think they are related to your changes. I am just waiting for linux jobs - there seems to be ongoing issue w.r.t linux CI, we are looking into it already. |
Thanks once again @willyborn for sending in these changes. |
All OpenCL Buffers are now initialized in an synchronize way.
Short description of change
All blocking EnqueueWriteBuffers are replaced by non-blocking ones, resulting in performance improvement and/or extra stability.
Description
EnqueueWriteBuffer copies a buffer from Host memory into the device Buffer. We have 2 issues with the Host memory, being:
When the values are fixed (ex: always 0), a static variable can be used who remains always available, independent when the kernel executes. Since it is a constant, the static does not pose problems with thread-safety.
When the value is a calculated value. The scope of the variable has to be extended as long as possible, meaning that the variable declaration is performed in the highest function (no sub section) and an event is checked before destruction the variable. When extra enqueue commands are issued between the writeBuffer and the event check, the GPU will continue having commands.
This construction takes much time on the Host thread so that the no new commands are available in the queue, resulting in low utilization on the GPU.
An extra kernel is added to the corresponding OpenCL program, which initializes the Buffer with the provided value.
Impact on the updated files:
Stability: orb.hpp (Line 366: vector h_desc_lvl could be destructed before execution) --> intermittent failures
Performance: Kernel.cpp, csrmm.hpp, csrmv.hpp, fast.hpp, harris.hpp, regions.hpp, sparse.hpp, sparse_arith.hpp:, susan.hpp
Additional information about the PR answering following questions:
-->
Fixes: [Perf] OpenCL performance issue with sparse matmul #2937 partially, since this is not the main bottleneck.
Changes to Users
None
Checklist