Reduce synchronized initialization of OpenCL Buffers #3039

willyborn · 2020-11-02T14:06:29Z

All OpenCL Buffers are now initialized in an synchronize way.

Short description of change

All blocking EnqueueWriteBuffers are replaced by non-blocking ones, resulting in performance improvement and/or extra stability.

Description

EnqueueWriteBuffer copies a buffer from Host memory into the device Buffer. We have 2 issues with the Host memory, being:

memory gets out of scope when created on the stack.
When the values are fixed (ex: always 0), a static variable can be used who remains always available, independent when the kernel executes. Since it is a constant, the static does not pose problems with thread-safety.
When the value is a calculated value. The scope of the variable has to be extended as long as possible, meaning that the variable declaration is performed in the highest function (no sub section) and an event is checked before destruction the variable. When extra enqueue commands are issued between the writeBuffer and the event check, the GPU will continue having commands.
large vectors are constructed and initialized to 0, before the copying.
This construction takes much time on the Host thread so that the no new commands are available in the queue, resulting in low utilization on the GPU.
An extra kernel is added to the corresponding OpenCL program, which initializes the Buffer with the provided value.

Impact on the updated files:
Stability: orb.hpp (Line 366: vector h_desc_lvl could be destructed before execution) --> intermittent failures
Performance: Kernel.cpp, csrmm.hpp, csrmv.hpp, fast.hpp, harris.hpp, regions.hpp, sparse.hpp, sparse_arith.hpp:, susan.hpp

Additional information about the PR answering following questions:

Is this a new feature or a bug fix? No
Why these changes are necessary: Stability & Performance
Potential impact on specific hardware, software or backends: None
New functions and their functionality: None
-->
Fixes: [Perf] OpenCL performance issue with sparse matmul #2937 partially, since this is not the main bottleneck.

Changes to Users

None

Checklist

Rebased on latest master
Code compiles
Tests pass

9prady9 · 2020-11-05T15:30:17Z

I think we had some issues using clEnqueueFillBuffer in the past with one of the vendors - AMD, Intel & NVIDIA - don't remember which one it is. We need to test this thoroughly to ensure we don't do a rollback again later.

willyborn · 2020-11-05T15:38:14Z

No problem. It did however solve the INTERMITTENT ERRORS I have, with the ORB TEST. In case the FillBuffer is not suited, I propose to change the WriteBuffer to blocking since no synchronisation point is present before leaving the function. When your test indicates incompatibilities with some vendors, I can also update the PR accordingly. It would be a petty however. I wait on your instructions.

…

On Thu, 5 Nov 2020 at 16:30, pradeep ***@***.***> wrote: I think we had some issues using clEnqueueFillBuffer in the past with one of the vendors - AMD, Intel & NVIDIA - don't remember which one it is. We need to test this thoroughly to ensure we don't do a rollback again later. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#3039 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQ2WGPGWYGYNF7QDREDAITLSOLAJVANCNFSM4THQJAKA> .

9prady9 · 2020-11-05T15:58:21Z

No problem. It did however solve the INTERMITTENT ERRORS I have, with the ORB TEST. In case the FillBuffer is not suited, I propose to change the WriteBuffer to blocking since no synchronisation point is present before leaving the function. When your test indicates incompatibilities with some vendors, I can also update the PR accordingly. It would be a petty however. I wait on your instructions.
…
On Thu, 5 Nov 2020 at 16:30, pradeep @.***> wrote: I think we had some issues using clEnqueueFillBuffer in the past with one of the vendors - AMD, Intel & NVIDIA - don't remember which one it is. We need to test this thoroughly to ensure we don't do a rollback again later. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#3039 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQ2WGPGWYGYNF7QDREDAITLSOLAJVANCNFSM4THQJAKA .

Yes, it is lame some times w.r.t OpenCL. Sure, we will keep you posted here.

willyborn · 2020-11-25T19:21:34Z

@9prady9

test_orb_opencl continue failing when I run the standard tests. With the latest speed increases, it is now each time. (On top, I have still 3 others continuous failing (also with master) pInverse_cpu, pInverse_opencl & parse_threading. I noticed that they succeed on your test system however. I think it has something to do with the synchronization. As soon as I find a cause, I will let you know.)

What would you think, if I write a new non-blocking kernel with the same parameters as the clEnqueuFillBuffer call of OpenCL. I think this is not that difficult, and on top of fixing unexpected behaviors it will increase the speed again for specific cases thanks to the asynchronous behavior.

9prady9 · 2020-11-25T19:41:58Z

I think we have kernels that can handle generic copying, not sure about filling a constant - may be they expect input buffer.

In any case, the tests worked fine with the systems/vendors we have checked so far. We didn't face an issue although we haven't got our hands on Intel stuff I think. So, I would say don't try implementing such kernel until we can reproduce the issue on some kind of setup. We wouldn't want you to spend time on something that may have been fixed by vendors in their newer opencl implementations.

With regards to the test failures, are you referring to speed increase from other PR changes w.r.t caching API ?

pinverse how is it failing? Accuracy error? Segfault ? Or anything else ?

I believe our CI systems use Intel MKL, what blas/lapacke libraries are you using ?

willyborn · 2020-11-25T21:37:38Z

Regarding the PR of JIT overhead & hashCaching. I had both changes already for 6 months, but did not introduce them because of the underlying errors. That is the reason I first introduced:

Parameter length --> resolved errors in JIT
FillBuffer --> resolved errors in orb
at test_threading_opencl i got stuck and I did not longer want to wait to enter the rest.
There is still 1 PR waiting (Performance af::join for OpenCL)

My HW & SW config:

AMD A10 7870K + iGPU R7
Windows 10
openblas & fftw3
AMDAPPSDK 3.0
Microsoft Visual Studio Community 2017

====================

test_orb_opencl:
-- PR FillBuffer fixes
test_pinverse_cpu:
-- C++ exception (unkown file). I did not yet check further.
test_pinverse_opencl:
-- C++ exception (unknown file). I did not yet check further.
test_threading_opencl: (test_sparse_opencl OK)
-- RelWithDebInfo: intermediate (1 in 3). Wrong values are returned. Spent already 6months, since I have similar problems in own code. Some blocks inside the result array are overwritten. It happens when sparseTransposeTester(625,1331,1,2) runs in parallel with sparseTransposeTester(453,751,397,1). Running multiples of the same type is OK. Presence of the other types have no impact, except in reducing the chance of error.
-- Debug: very rare (1 in 100)

test_pinverse_cpu.txt
test_pinverse_opencl.tx.txt
test_threading_opencl.txt

9prady9 · 2020-11-26T05:22:53Z

I have a similar APU, A10 A7850K. I was able to reproduce the only two of your cases.

pinverse_cpu passes on 7850K, what is the openblas, lapack versions you have installed ?
pinverse_opencl fails with value errors on 7850K's APU
orb_opencl passes with fill buffer change.
threading_opencl passes fine with 7850k's APU.

A quick update/note: orb_opencl doesn't fail for me on APU on master branch i.e. I didn't need the fillbuffer change.

willyborn · 2020-11-26T10:12:26Z

Below you can find the full list of all SW installed through vcpkg.

Should I regularly run vcpkg update ? (When I do now, it says "No packages need updating")
vcpkg_list.txt

Regarding the CPU. It is overclocked to 4.3GHz (base) and 4.6GHz (boost). I did not succeed in overclocking the GPU. Even with a faster CPU, I notice that in most applications I have a problem in keeping the GPU busy. This is the main cause, that I focus so much on the overhead of the CPU. Off-course the bugger the arrays the lesser the impact.
I assume that the latest video card owners should have even bigger issues here. There cards can be up to 30X (80CU vs 8CU, 2500MHz vs 866MHz, IPC?) faster, while the fastest CPU (single core) can only be up to 2.5x (PassMark 3563 [R9-5950X] vs 1545 [A10 7870K]) faster than mine.
-> Compared to the latter, my GPU is still very busy when the CPU already leaves the function for the same size of array's. It should be even worse on your 7850K.

Please use the bat file (windows) attached, which repeats the threading_opencl until failure. It will output you the failure rate, which is very variable.
repeat.bat.txt

9prady9 · 2020-12-07T05:22:08Z

I have tracked down the last revert we did, away from enqueueFillBuffer. It is done to avoid incrementing the required OpenCL version from 1.1 to 1.2.

We need to verify the different vendor implementations versions to ensure we aren't excluding any accidentally. If this can be done without the need to add custom fill implementation, that would be ideal. I shall get back here with updates once we figure out the next step.

willyborn · 2021-01-15T08:22:50Z

@9prady9
Hi pradeep, I added an extra commit performing the same actions without the fillBuffer. The last commit is fully OpenCL 1.1 compatible although has the same effect (stability and speed).

If you decide to go for this last commit, I still can merge everything into 1 commit so that the 1.2 updates disappears. The last commit overwrites all the previous changes.

The stability problems were only detected when using the OpenCL library of AMD (NVIDIA OpenCL apparently performs more operations on the main Host thread so it takes longer before the variables go out of scope). The stability issues could however appear on both depending on the load (of other kernels) on the GPU.

9prady9 · 2021-01-18T04:23:54Z

@willyborn We are re-organizing our CI infrastructure slightly and hence quite a few GPU jobs haven't been run on this or some other PRs yet. Hence, there haven't been any merges that changed GPU associated code. your changes look good, please give me some time, I shall let you, as soon as I can, on whether to use OpenCL 1.2 fill buffer API or your most recent commit.

Thank you!

9prady9 · 2021-05-05T14:03:23Z

@willyborn We are re-organizing our CI infrastructure slightly and hence quite a few GPU jobs haven't been run on this or some other PRs yet. Hence, there haven't been any merges that changed GPU associated code. your changes look good, please give me some time, I shall let you, as soon as I can, on whether to use OpenCL 1.2 fill buffer API or your most recent commit.

Thank you!

@willyborn Sorry about the extended delay. We are good with using clEnqueueFillBuffer. Can you please drop the last commit and rebase your branch from latest master. Once the jobs run, we can get this in immediately.

willyborn · 2021-05-08T08:51:09Z

@pradeep. I will try to have this done by next week. I am currently in the final tests with the join/copy improvements and prefer to have a stable system during the different performance tests.

…

On Wed, May 5, 2021, 16:03 pradeep ***@***.***> wrote: @willyborn <https://github.com/willyborn> We are re-organizing our CI infrastructure slightly and hence quite a few GPU jobs haven't been run on this or some other PRs yet. Hence, there haven't been any merges that changed GPU associated code. your changes look good, please give me some time, I shall let you, as soon as I can, on whether to use OpenCL 1.2 fill buffer API or your most recent commit. Thank you! @willyborn <https://github.com/willyborn> Sorry about the extended delay. We are good with using clEnqueueFillBuffer. Can you please drop the last commit and rebase your branch from latest master. Once the jobs run, we can get this in immediately. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3039 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQ2WGPGATDIZJ3DPJEV56ULTMFF4HANCNFSM4THQJAKA> .

9prady9 · 2021-05-08T08:55:20Z

@pradeep. I will try to have this done by next week. I am currently in the final tests with the join/copy improvements and prefer to have a stable system during the different performance tests.
…
On Wed, May 5, 2021, 16:03 pradeep @.***> wrote: @willyborn https://github.com/willyborn We are re-organizing our CI infrastructure slightly and hence quite a few GPU jobs haven't been run on this or some other PRs yet. Hence, there haven't been any merges that changed GPU associated code. your changes look good, please give me some time, I shall let you, as soon as I can, on whether to use OpenCL 1.2 fill buffer API or your most recent commit. Thank you! @willyborn https://github.com/willyborn Sorry about the extended delay. We are good with using clEnqueueFillBuffer. Can you please drop the last commit and rebase your branch from latest master. Once the jobs run, we can get this in immediately. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3039 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQ2WGPGATDIZJ3DPJEV56ULTMFF4HANCNFSM4THQJAKA .

Sure, I haven't gotten to the join benchmarks you shared earlier. I am in the middle some other tasks, but I will get to them once I am done with what I am doing now. Sorry about the delay.

9prady9 · 2021-05-18T03:56:46Z

@willyborn Did you get the chance to look into this ?

willyborn · 2021-05-18T11:52:38Z

The join optimization takes longer than expected (found a bug in my vida code, discarding a big part of my tests). I expect however, that tomorrow I can start syncing with latest master. So updates and new PRs (this one first) from Thursday on. Willy

…

On Tue, May 18, 2021, 05:57 pradeep ***@***.***> wrote: @willyborn <https://github.com/willyborn> Did you get the chance to look into this ? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3039 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQ2WGPHVIBXAOOMIBPZUXZDTOHQQ3ANCNFSM4THQJAKA> .

enqueueWriteBuffer is replaced by enqueueFillBuffer calls, which always operates asynchronisly because the pattern is copied during the call and not during the execution as for the enqueueWriteBuffer. Optimizes: susan, sparse, regions, orb, harris and fast.

9prady9 · 2021-05-20T06:47:28Z

thank you, will take a look once the jobs are done.

willyborn · 2021-05-20T12:19:07Z

I verified the errors in CDash, although I can not find any relationship to the changes applied in PR 3039.

I confirm, on my local system all tests pass except

pinverse_cpu+opencl: Linear Algebra disabled on CPU/OPENCL (LAPACK not installed)
rng_quality_cuda+opencl: out of memory for double (max 2GB available, and float passes)
All local errors are related to the available hardware.

9prady9 · 2021-05-20T12:21:55Z

I verified the errors in CDash, although I can not find any relationship to the changes applied in PR 3039.

I confirm, on my local system all tests pass except
* pinverse_cpu+opencl: Linear Algebra disabled on CPU/OPENCL  (LAPACK not installed)

* rng_quality_cuda+opencl: out of memory for double (max 2GB available, and float passes)
  All local errors are related to the available hardware.

I checked the logs, they are failing on master too. I don't think they are related to your changes. I am just waiting for linux jobs - there seems to be ongoing issue w.r.t linux CI, we are looking into it already.

9prady9 · 2021-05-21T01:37:13Z

Thanks once again @willyborn for sending in these changes.

9prady9 added fix perf labels Nov 30, 2020

9prady9 requested review from umar456 and 9prady9 and removed request for umar456 December 17, 2020 05:01

willyborn force-pushed the FillBuffer branch from bcc506f to 14a5237 Compare January 15, 2021 08:00

willyborn force-pushed the FillBuffer branch from 14a5237 to 8635bdd Compare May 20, 2021 06:36

9prady9 approved these changes May 20, 2021

View reviewed changes

9prady9 merged commit 26604b7 into arrayfire:master May 21, 2021

9prady9 mentioned this pull request May 21, 2021

[Perf] OpenCL performance issue with sparse matmul #2937

Closed

willyborn deleted the FillBuffer branch September 29, 2022 12:55

Reduce synchronized initialization of OpenCL Buffers #3039

Reduce synchronized initialization of OpenCL Buffers #3039

Uh oh!

Conversation

willyborn commented Nov 2, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Short description of change

Description

Changes to Users

Checklist

Uh oh!

9prady9 commented Nov 5, 2020

Uh oh!

willyborn commented Nov 5, 2020 via email

Uh oh!

9prady9 commented Nov 5, 2020

Uh oh!

willyborn commented Nov 25, 2020

Uh oh!

9prady9 commented Nov 25, 2020

Uh oh!

willyborn commented Nov 25, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

9prady9 commented Nov 26, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

willyborn commented Nov 26, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

9prady9 commented Dec 7, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

willyborn commented Jan 15, 2021

Uh oh!

9prady9 commented Jan 18, 2021

Uh oh!

9prady9 commented May 5, 2021

Uh oh!

willyborn commented May 8, 2021 via email

Uh oh!

9prady9 commented May 8, 2021

Uh oh!

9prady9 commented May 18, 2021

Uh oh!

willyborn commented May 18, 2021 via email

Uh oh!

9prady9 commented May 20, 2021

Uh oh!

willyborn commented May 20, 2021

Uh oh!

9prady9 commented May 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

9prady9 commented May 21, 2021

Uh oh!

Uh oh!

willyborn commented Nov 2, 2020 •

edited

Loading

willyborn commented Nov 25, 2020 •

edited

Loading

9prady9 commented Nov 26, 2020 •

edited

Loading

willyborn commented Nov 26, 2020 •

edited

Loading

9prady9 commented Dec 7, 2020 •

edited

Loading

9prady9 commented May 20, 2021 •

edited

Loading