Minimal implementation of GpuMatND #19259

Merged: 5 commits into opencv:master on Feb 5, 2021

Conversation

@nglee (Contributor) commented on Jan 5, 2021

As a first step toward resolving #15897 and #16433, this PR implements a minimal set of functions and entities that a GpuMatND class should have.

This PR includes (a usage sketch follows the list):

  • Constructor with internally managed memory with reference counting
  • Constructor with external memory
  • Default destructor, default copy, and move operations
  • clone() for deep copying; the resulting array is always continuous
  • N-dim submatrix extraction (shallow copy)
  • Uploading from and downloading to Mat
  • Conversion operator to GpuMat
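
A minimal usage sketch of the pieces listed above. The include path, the brace-initialized size list, and the exact upload/download signatures are my assumptions, not guaranteed by this PR:

#include <opencv2/core/cuda.hpp>
#include <vector>

void demo()
{
    // Constructor with internally managed, reference-counted memory.
    cv::cuda::GpuMatND a({2, 3, 4, 5}, CV_32FC1);

    // Upload from / download to an N-dimensional Mat.
    cv::Mat host(std::vector<int>{2, 3, 4, 5}, CV_32FC1);
    a.upload(host);

    cv::cuda::GpuMatND b = a;          // shallow copy, shares the reference-counted buffer
    cv::cuda::GpuMatND c = a.clone();  // deep copy, always continuous

    cv::Mat back;
    c.download(back);
}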

Tests are in opencv/opencv_contrib#2805.
The following leak check was performed with those tests:

cuda-memcheck --leak-check full opencv_test_cudev.exe --gtest_filter=*GpuMatND*

I hope this PR can be accepted as a minimal implementation.

Further things that should be added include:

  • Add GpuMatND to the list of classes handled by the proxy classes (InputArray and OutputArray)
  • Add asynchronous APIs with cuda::Stream
  • Add copyTo, convertTo, setTo, as in GpuMat

I expect these can be addressed in separate PRs.

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or other license that is incompatible with OpenCV
  • The PR is proposed to proper branch
  • There is reference to original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake
force_builders=Custom
buildworker:Custom=linux-4
build_image:Custom=ubuntu-cuda:18.04

@nglee force-pushed the dev_gpumatnd1 branch 3 times, most recently from 54ad47d to d4862b6 on January 5, 2021 09:35
@nglee force-pushed the dev_gpumatnd1 branch 3 times, most recently from 675aece to 0e2b741 on January 7, 2021 18:36
@nglee (Contributor Author) commented on Jan 12, 2021

Changed the behavior of:

  • GpuMat operator()(IndexArray idx, Range rowRange, Range colRange) const;
  • operator GpuMat() const;

These functions now return a clone()-ed GpuMat, so the result manages its own memory.

Previously, these created a GpuMat header without reference counting.
However, other APIs follow a createXXXHeader naming convention for this, for example:

  • HostMem::createMatHeader() : returns a Mat header without reference counting
  • HostMem::createGpuMatHeader() : returns a GpuMat header without reference counting

Therefore, it seems reasonable to add these two methods for creating a GpuMat header without reference counting (see the sketch after the list):

  • GpuMat createGpuMatHeader(IndexArray idx, Range rowRange, Range colRange) const;
  • GpuMat createGpuMatHeader() const;
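
A hedged sketch contrasting the two access paths described above. The index semantics (idx selecting the leading dimensions, the last two dimensions mapping to rows and columns) and the include path are my assumptions:

#include <opencv2/core/cuda.hpp>

void access_sketch()
{
    cv::cuda::GpuMatND nd({5, 480, 640}, CV_8UC1);  // 5 planes of 480x640

    // operator(): returns a clone()-ed GpuMat that manages its own memory,
    // so it stays valid even after nd is released.
    cv::cuda::GpuMat owned = nd({2}, cv::Range::all(), cv::Range::all());

    // createGpuMatHeader(): returns a header over nd's device memory without
    // reference counting, so nd must outlive the header.
    cv::cuda::GpuMat header = nd.createGpuMatHeader({2}, cv::Range::all(), cv::Range::all());
}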

@alalek (Member) left a comment

Thank you for working on this!

Please take a look at the comments below.

element of the submatrix, and it can be different from data_->data.
If this is not a submatrix, then data is always equal to data_->data.
*/
uchar* data;
Member

Perhaps a size_t offset would be more suitable.

Contributor Author

uchar* data is now private, along with a new member size_t offset. I have also added a public member function getDevicePtr() that returns a pointer to the first byte.
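
A hedged, simplified sketch (not the PR's actual code) of the layout described above: the public raw pointer is replaced by a private byte offset into the shared buffer plus an accessor.

#include <cstddef>

class GpuMatNDSketch
{
public:
    // First byte of this (sub)matrix inside the shared device allocation.
    unsigned char* getDevicePtr() const { return base_ + offset_; }

private:
    unsigned char* base_ = nullptr;  // start of the underlying device allocation
    std::size_t offset_ = 0;         // byte offset of the first element of this (sub)matrix
};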

DevicePtr(DevicePtr&&) = delete;
DevicePtr& operator=(DevicePtr&&) = delete;

uchar* data;
Member

DevicePtr

It makes sense to move this class up a level, make it more generic, and change its name:

  • GpuData (like UMatData)
  • or GpuBuffer
  • or GpuDataContainer

In the future, we can reuse/share this "container" with the existing GpuMat.


Please add a size_t size field (the size of the allocated buffer in bytes) to allow accurate validation checks.

Contributor Author

GpuData is now a global struct with a new member size_t size.
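
A hedged sketch of what such a shared container with a size field could look like; the allocation details and the shared_ptr are illustrative assumptions, not this PR's code:

#include <cuda_runtime.h>
#include <cstddef>
#include <memory>

struct GpuDataSketch
{
    unsigned char* data = nullptr;  // device pointer returned by cudaMalloc
    std::size_t size = 0;           // size of the allocated buffer in bytes, used for validation

    explicit GpuDataSketch(std::size_t bytes) : size(bytes)
    {
        cudaMalloc(reinterpret_cast<void**>(&data), bytes);
    }
    ~GpuDataSketch() { cudaFree(data); }

    GpuDataSketch(const GpuDataSketch&) = delete;             // shared via a smart pointer instead
    GpuDataSketch& operator=(const GpuDataSketch&) = delete;
};

// Several (sub)matrices can share one buffer; validation can then check
// that offset + footprint <= data_->size.
using GpuDataPtr = std::shared_ptr<GpuDataSketch>;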

@asmorkalov (Contributor) left a comment

Looks good to me in general. I added some comments on the test code in contrib. The manual test passed on Ubuntu 18.04 with an NVIDIA GeForce 1080 Ti.

@alalek (Member) left a comment

Well done!

@param _step Array of _size.size()-1 steps in case of a multi-dimensional array (the last step is always
set to the element size). If not specified, the matrix is assumed to be continuous.
*/
GpuMatND(SizeArray _size, int _type, void* _data, StepArray _step = StepArray());
Member

@asmorkalov Does this ctor have some intersection with the similar cv::Mat ctor?
If so, then it makes sense to "wrap" (or move) a cv::Mat instead of a void* (perhaps out of the scope of this PR).

Contributor

_data is GPU memory, isn't it?

Contributor

The upload call handles the case with a regular cv::Mat.

Member

The main point here is to use a named static function instead of a constructor (to avoid confusion):

static GpuMatND wrapMemoryPtrGPU(...);
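
A hedged sketch of that suggestion. Only the name wrapMemoryPtrGPU comes from the comment above; writing it as a free function, the parameter list, the delegation to the existing wrapping constructor, and the assumption that SizeArray/StepArray are public aliases of the class are mine (a member version would live inside the class):

#include <opencv2/core/cuda.hpp>

// The name makes it explicit that devData points to GPU memory and that no
// ownership is taken.
inline cv::cuda::GpuMatND wrapMemoryPtrGPU(cv::cuda::GpuMatND::SizeArray size, int type,
                                           void* devData,
                                           cv::cuda::GpuMatND::StepArray step = {})
{
    // Delegates to the existing constructor that wraps external device memory.
    return cv::cuda::GpuMatND(size, type, devData, step);
}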

@param _type Array type. Use CV_8UC1, ..., CV_16FC4 to create 1-4 channel matrices, or
CV_8UC(n), ..., CV_64FC(n) to create multi-channel (up to CV_CN_MAX channels) matrices.
*/
GpuMatND(SizeArray _size, int _type);
Member

_size
_type

No need to use underscores in declarations without implementation code. Let's keep the docs and bindings clear.

Contributor Author

I'll do that in the next commit.

Contributor Author

I have fixed it.

using StepArray = std::vector<size_t>;
using IndexArray = std::vector<int>;

~GpuMatND() = default;
Member

= default

It makes sense to have the (empty) implementation in a .cpp file for that.
There are several non-trivial fields.

Contributor Author

I'll move the definition to the .cpp file to hide the implementation.
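
A hedged, simplified sketch of that split; class and member names here are placeholders:

#include <memory>

// --- header ---
class NDSketch
{
public:
    ~NDSketch();                  // declared only; no '= default' in the header
private:
    std::shared_ptr<void> data_;  // stands in for the non-trivial fields
};

// --- .cpp ---
NDSketch::~NDSketch() = default;  // defaulted where the member types are complete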

Contributor Author

I have fixed it.

Comment on lines +75 to +77
GpuMatND GpuMatND::clone() const
{
    CV_DbgAssert(!empty());
Member

Why not just return clone(Stream::Null());?

(reduce duplicated code, DRY principle)

Contributor Author

The overload with the empty parameter list calls synchronous CUDA APIs (cudaMemcpy and cudaMemcpy2D), whereas the overload with one argument calls asynchronous CUDA APIs (cudaMemcpyAsync and cudaMemcpy2DAsync).

GpuMat::upload() and GpuMat::download() also have overloads for asynchronous CUDA API calls.

However, I agree that we should reduce duplicated code, so I suggest keeping the two overloads while reducing duplication as much as possible.
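
A hedged sketch, in plain CUDA terms rather than the PR's actual code, of one way to keep both overloads while sharing the copy logic: route both through a stream-taking helper and let the synchronous overload wait on the stream.

#include <cuda_runtime.h>
#include <cstddef>

namespace sketch {

// Shared helper: always enqueues the copy asynchronously on the given stream.
inline void copyImpl(void* dst, const void* src, std::size_t bytes, cudaStream_t stream)
{
    cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, stream);
}

// Asynchronous overload: the caller synchronizes the stream later.
inline void copyAsync(void* dst, const void* src, std::size_t bytes, cudaStream_t stream)
{
    copyImpl(dst, src, bytes, stream);
}

// Synchronous overload: uses the default stream and waits for completion.
inline void copySync(void* dst, const void* src, std::size_t bytes)
{
    copyImpl(dst, src, bytes, /*stream=*/0);
    cudaStreamSynchronize(0);
}

} // namespace sketch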

@nglee (Contributor Author) commented on Feb 3, 2021

I have fixed it, but please look at this comment below.

/////////////////////////////////////////////////////
/// clone

static bool next(uchar*& d, const uchar*& s, std::vector<int>& idx, const int dims, const GpuMatND& dst, const GpuMatND& src)
Contributor Author

This function signature does not look so good, but I'd like to keep it as is for now. Making an internal iterator class that iterates over each 2D plane seems a better design. I'll do that in the next PR, when I implement the convertTo, copyTo, and setTo member functions of GpuMatND: a straightforward implementation would be to iterate over each 2D plane of GpuMatND and apply the GpuMat counterparts, and the iterator class would fit well there too.
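
A hedged sketch of such a plane iterator, written as an odometer over the leading dims-2 indices; the names and the byte-offset layout are my assumptions, not a preview of the future PR:

#include <cstddef>
#include <vector>

class PlaneIterator
{
public:
    PlaneIterator(const std::vector<int>& size, const std::vector<std::size_t>& step)
        : size_(size), step_(step), idx_(size.size() > 2 ? size.size() - 2 : 0, 0) {}

    // Byte offset of the current 2D plane from the start of the array.
    std::size_t planeOffset() const
    {
        std::size_t off = 0;
        for (std::size_t i = 0; i < idx_.size(); ++i)
            off += idx_[i] * step_[i];
        return off;
    }

    // Advance to the next plane; returns false after the last one.
    bool next()
    {
        for (int i = static_cast<int>(idx_.size()) - 1; i >= 0; --i)
        {
            if (++idx_[i] < size_[i])
                return true;
            idx_[i] = 0;  // carry into the next higher dimension
        }
        return false;
    }

private:
    std::vector<int> size_;          // sizes of all dimensions
    std::vector<std::size_t> step_;  // byte steps of all dimensions
    std::vector<int> idx_;           // current index over the leading dims-2 dimensions
};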

@alalek (Member) left a comment

Thank you for the update!

Comment on lines 134 to 142
do
{
    CV_CUDEV_SAFE_CALL(
        cudaMemcpy2D(
            d, ret.step[dims-2], s, step[dims-2],
            size[dims-1]*step[dims-1], size[dims-2], cudaMemcpyDeviceToDevice)
    );
}
while (next(d, s, idx, dims, ret, *this));
Member

Did you check the performance of this loop?
If this is a synchronous call, then a sequence of several synchronous calls may show bad performance.

I believe this scheme may perform better:

  • schedule async tasks
  • wait for completion

(the code block as a whole still behaves synchronously)

Contributor Author

I have changed cudaMemcpy2D to cudaMemcpy2DAsync and added cudaStreamSynchronize(0) to wait for completion.
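
A hedged sketch of that pattern, with contiguous planes assumed and placeholder shapes rather than the PR's exact geometry:

#include <cuda_runtime.h>
#include <cstddef>

// Enqueue every 2D copy asynchronously on the default stream, then wait once
// at the end so the whole block still behaves synchronously.
inline void copyPlanes(unsigned char* dst, std::size_t dstStep,
                       const unsigned char* src, std::size_t srcStep,
                       std::size_t widthBytes, std::size_t height, int planes)
{
    for (int p = 0; p < planes; ++p)
    {
        // Each call returns as soon as the copy is queued.
        cudaMemcpy2DAsync(dst + p * dstStep * height, dstStep,
                          src + p * srcStep * height, srcStep,
                          widthBytes, height, cudaMemcpyDeviceToDevice, 0);
    }
    cudaStreamSynchronize(0);  // single wait for all queued copies
}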

@alalek alalek merged commit 7ea21c4 into opencv:master Feb 5, 2021
@alalek alalek mentioned this pull request Apr 9, 2021
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
Minimal implementation of GpuMatND

* GpuMatND - minimal implementation

* GpuMatND - createGpuMatHeader

* GpuMatND - GpuData, offset, getDevicePtr(), license

* reviews

* reviews
Labels: category: gpu/cuda (contrib), feature