Skip to content

Reduce synchronized initialization of OpenCL Buffers #3039

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 21, 2021

Conversation

willyborn
Copy link
Contributor

@willyborn willyborn commented Nov 2, 2020

All OpenCL Buffers are now initialized in an synchronize way.

Short description of change

All blocking EnqueueWriteBuffers are replaced by non-blocking ones, resulting in performance improvement and/or extra stability.

Description

EnqueueWriteBuffer copies a buffer from Host memory into the device Buffer. We have 2 issues with the Host memory, being:

  • memory gets out of scope when created on the stack.
    When the values are fixed (ex: always 0), a static variable can be used who remains always available, independent when the kernel executes. Since it is a constant, the static does not pose problems with thread-safety.
    When the value is a calculated value. The scope of the variable has to be extended as long as possible, meaning that the variable declaration is performed in the highest function (no sub section) and an event is checked before destruction the variable. When extra enqueue commands are issued between the writeBuffer and the event check, the GPU will continue having commands.
  • large vectors are constructed and initialized to 0, before the copying.
    This construction takes much time on the Host thread so that the no new commands are available in the queue, resulting in low utilization on the GPU.
    An extra kernel is added to the corresponding OpenCL program, which initializes the Buffer with the provided value.

Impact on the updated files:
Stability: orb.hpp (Line 366: vector h_desc_lvl could be destructed before execution) --> intermittent failures
Performance: Kernel.cpp, csrmm.hpp, csrmv.hpp, fast.hpp, harris.hpp, regions.hpp, sparse.hpp, sparse_arith.hpp:, susan.hpp

Additional information about the PR answering following questions:

  • Is this a new feature or a bug fix? No
  • Why these changes are necessary: Stability & Performance
  • Potential impact on specific hardware, software or backends: None
  • New functions and their functionality: None
    -->
    Fixes: [Perf] OpenCL performance issue with sparse matmul #2937 partially, since this is not the main bottleneck.

Changes to Users

None

Checklist

  • Rebased on latest master
  • Code compiles
  • Tests pass

@9prady9
Copy link
Member

9prady9 commented Nov 5, 2020

I think we had some issues using clEnqueueFillBuffer in the past with one of the vendors - AMD, Intel & NVIDIA - don't remember which one it is. We need to test this thoroughly to ensure we don't do a rollback again later.

@willyborn
Copy link
Contributor Author

willyborn commented Nov 5, 2020 via email

@9prady9
Copy link
Member

9prady9 commented Nov 5, 2020

No problem. It did however solve the INTERMITTENT ERRORS I have, with the ORB TEST. In case the FillBuffer is not suited, I propose to change the WriteBuffer to blocking since no synchronisation point is present before leaving the function. When your test indicates incompatibilities with some vendors, I can also update the PR accordingly. It would be a petty however. I wait on your instructions.

On Thu, 5 Nov 2020 at 16:30, pradeep @.***> wrote: I think we had some issues using clEnqueueFillBuffer in the past with one of the vendors - AMD, Intel & NVIDIA - don't remember which one it is. We need to test this thoroughly to ensure we don't do a rollback again later. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#3039 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQ2WGPGWYGYNF7QDREDAITLSOLAJVANCNFSM4THQJAKA .

Yes, it is lame some times w.r.t OpenCL. Sure, we will keep you posted here.

@willyborn
Copy link
Contributor Author

@9prady9

test_orb_opencl continue failing when I run the standard tests. With the latest speed increases, it is now each time. (On top, I have still 3 others continuous failing (also with master) pInverse_cpu, pInverse_opencl & parse_threading. I noticed that they succeed on your test system however. I think it has something to do with the synchronization. As soon as I find a cause, I will let you know.)

What would you think, if I write a new non-blocking kernel with the same parameters as the clEnqueuFillBuffer call of OpenCL. I think this is not that difficult, and on top of fixing unexpected behaviors it will increase the speed again for specific cases thanks to the asynchronous behavior.

@9prady9
Copy link
Member

9prady9 commented Nov 25, 2020

I think we have kernels that can handle generic copying, not sure about filling a constant - may be they expect input buffer.

In any case, the tests worked fine with the systems/vendors we have checked so far. We didn't face an issue although we haven't got our hands on Intel stuff I think. So, I would say don't try implementing such kernel until we can reproduce the issue on some kind of setup. We wouldn't want you to spend time on something that may have been fixed by vendors in their newer opencl implementations.

With regards to the test failures, are you referring to speed increase from other PR changes w.r.t caching API ?

pinverse how is it failing? Accuracy error? Segfault ? Or anything else ?

I believe our CI systems use Intel MKL, what blas/lapacke libraries are you using ?

@willyborn
Copy link
Contributor Author

willyborn commented Nov 25, 2020

Regarding the PR of JIT overhead & hashCaching. I had both changes already for 6 months, but did not introduce them because of the underlying errors. That is the reason I first introduced:

  • Parameter length --> resolved errors in JIT
  • FillBuffer --> resolved errors in orb
  • at test_threading_opencl i got stuck and I did not longer want to wait to enter the rest.
  • There is still 1 PR waiting (Performance af::join for OpenCL)

My HW & SW config:

  • AMD A10 7870K + iGPU R7
  • Windows 10
  • openblas & fftw3
  • AMDAPPSDK 3.0
  • Microsoft Visual Studio Community 2017

====================

  • test_orb_opencl:
    -- PR FillBuffer fixes
  • test_pinverse_cpu:
    -- C++ exception (unkown file). I did not yet check further.
  • test_pinverse_opencl:
    -- C++ exception (unknown file). I did not yet check further.
  • test_threading_opencl: (test_sparse_opencl OK)
    -- RelWithDebInfo: intermediate (1 in 3). Wrong values are returned. Spent already 6months, since I have similar problems in own code. Some blocks inside the result array are overwritten. It happens when sparseTransposeTester(625,1331,1,2) runs in parallel with sparseTransposeTester(453,751,397,1). Running multiples of the same type is OK. Presence of the other types have no impact, except in reducing the chance of error.
    -- Debug: very rare (1 in 100)

test_pinverse_cpu.txt
test_pinverse_opencl.tx.txt
test_threading_opencl.txt

@9prady9
Copy link
Member

9prady9 commented Nov 26, 2020

I have a similar APU, A10 A7850K. I was able to reproduce the only two of your cases.

  • pinverse_cpu passes on 7850K, what is the openblas, lapack versions you have installed ?
  • pinverse_opencl fails with value errors on 7850K's APU
  • orb_opencl passes with fill buffer change.
  • threading_opencl passes fine with 7850k's APU.

A quick update/note: orb_opencl doesn't fail for me on APU on master branch i.e. I didn't need the fillbuffer change.

@willyborn
Copy link
Contributor Author

willyborn commented Nov 26, 2020

Below you can find the full list of all SW installed through vcpkg.

  • Should I regularly run vcpkg update ? (When I do now, it says "No packages need updating")
    vcpkg_list.txt

Regarding the CPU. It is overclocked to 4.3GHz (base) and 4.6GHz (boost). I did not succeed in overclocking the GPU. Even with a faster CPU, I notice that in most applications I have a problem in keeping the GPU busy. This is the main cause, that I focus so much on the overhead of the CPU. Off-course the bugger the arrays the lesser the impact.
I assume that the latest video card owners should have even bigger issues here. There cards can be up to 30X (80CU vs 8CU, 2500MHz vs 866MHz, IPC?) faster, while the fastest CPU (single core) can only be up to 2.5x (PassMark 3563 [R9-5950X] vs 1545 [A10 7870K]) faster than mine.
-> Compared to the latter, my GPU is still very busy when the CPU already leaves the function for the same size of array's. It should be even worse on your 7850K.

Please use the bat file (windows) attached, which repeats the threading_opencl until failure. It will output you the failure rate, which is very variable.
repeat.bat.txt

@9prady9
Copy link
Member

9prady9 commented Dec 7, 2020

I have tracked down the last revert we did, away from enqueueFillBuffer. It is done to avoid incrementing the required OpenCL version from 1.1 to 1.2.

We need to verify the different vendor implementations versions to ensure we aren't excluding any accidentally. If this can be done without the need to add custom fill implementation, that would be ideal. I shall get back here with updates once we figure out the next step.

@9prady9 9prady9 requested review from umar456 and 9prady9 and removed request for umar456 December 17, 2020 05:01
@willyborn
Copy link
Contributor Author

@9prady9
Hi pradeep, I added an extra commit performing the same actions without the fillBuffer. The last commit is fully OpenCL 1.1 compatible although has the same effect (stability and speed).

If you decide to go for this last commit, I still can merge everything into 1 commit so that the 1.2 updates disappears. The last commit overwrites all the previous changes.

The stability problems were only detected when using the OpenCL library of AMD (NVIDIA OpenCL apparently performs more operations on the main Host thread so it takes longer before the variables go out of scope). The stability issues could however appear on both depending on the load (of other kernels) on the GPU.

@9prady9
Copy link
Member

9prady9 commented Jan 18, 2021

@willyborn We are re-organizing our CI infrastructure slightly and hence quite a few GPU jobs haven't been run on this or some other PRs yet. Hence, there haven't been any merges that changed GPU associated code. your changes look good, please give me some time, I shall let you, as soon as I can, on whether to use OpenCL 1.2 fill buffer API or your most recent commit.

Thank you!

@9prady9
Copy link
Member

9prady9 commented May 5, 2021

@willyborn We are re-organizing our CI infrastructure slightly and hence quite a few GPU jobs haven't been run on this or some other PRs yet. Hence, there haven't been any merges that changed GPU associated code. your changes look good, please give me some time, I shall let you, as soon as I can, on whether to use OpenCL 1.2 fill buffer API or your most recent commit.

Thank you!

@willyborn Sorry about the extended delay. We are good with using clEnqueueFillBuffer. Can you please drop the last commit and rebase your branch from latest master. Once the jobs run, we can get this in immediately.

@willyborn
Copy link
Contributor Author

willyborn commented May 8, 2021 via email

@9prady9
Copy link
Member

9prady9 commented May 8, 2021

@pradeep. I will try to have this done by next week. I am currently in the final tests with the join/copy improvements and prefer to have a stable system during the different performance tests.

On Wed, May 5, 2021, 16:03 pradeep @.***> wrote: @willyborn https://github.com/willyborn We are re-organizing our CI infrastructure slightly and hence quite a few GPU jobs haven't been run on this or some other PRs yet. Hence, there haven't been any merges that changed GPU associated code. your changes look good, please give me some time, I shall let you, as soon as I can, on whether to use OpenCL 1.2 fill buffer API or your most recent commit. Thank you! @willyborn https://github.com/willyborn Sorry about the extended delay. We are good with using clEnqueueFillBuffer. Can you please drop the last commit and rebase your branch from latest master. Once the jobs run, we can get this in immediately. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3039 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQ2WGPGATDIZJ3DPJEV56ULTMFF4HANCNFSM4THQJAKA .

Sure, I haven't gotten to the join benchmarks you shared earlier. I am in the middle some other tasks, but I will get to them once I am done with what I am doing now. Sorry about the delay.

@9prady9
Copy link
Member

9prady9 commented May 18, 2021

@willyborn Did you get the chance to look into this ?

@willyborn
Copy link
Contributor Author

willyborn commented May 18, 2021 via email

enqueueWriteBuffer is replaced by enqueueFillBuffer calls, which always operates asynchronisly because the pattern is copied during the call and not during the execution as for the enqueueWriteBuffer.
Optimizes: susan, sparse, regions, orb, harris and fast.
@9prady9
Copy link
Member

9prady9 commented May 20, 2021

thank you, will take a look once the jobs are done.

@willyborn
Copy link
Contributor Author

I verified the errors in CDash, although I can not find any relationship to the changes applied in PR 3039.

I confirm, on my local system all tests pass except

  • pinverse_cpu+opencl: Linear Algebra disabled on CPU/OPENCL (LAPACK not installed)
  • rng_quality_cuda+opencl: out of memory for double (max 2GB available, and float passes)
    All local errors are related to the available hardware.

@9prady9
Copy link
Member

9prady9 commented May 20, 2021

I verified the errors in CDash, although I can not find any relationship to the changes applied in PR 3039.

I confirm, on my local system all tests pass except

* pinverse_cpu+opencl: Linear Algebra disabled on CPU/OPENCL  (LAPACK not installed)

* rng_quality_cuda+opencl: out of memory for double (max 2GB available, and float passes)
  All local errors are related to the available hardware.

I checked the logs, they are failing on master too. I don't think they are related to your changes. I am just waiting for linux jobs - there seems to be ongoing issue w.r.t linux CI, we are looking into it already.

@9prady9
Copy link
Member

9prady9 commented May 21, 2021

Thanks once again @willyborn for sending in these changes.

@9prady9 9prady9 merged commit 26604b7 into arrayfire:master May 21, 2021
@willyborn willyborn deleted the FillBuffer branch September 29, 2022 12:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Perf] OpenCL performance issue with sparse matmul
2 participants