Minimal implementation of GpuMatND #19259

Merged: 5 commits into opencv:master on Feb 5, 2021

Conversation

@nglee (Contributor) commented on Jan 5, 2021

As a first step toward resolving #15897 and #16433, this PR implements a minimal set of functions and entities that a GpuMatND class should have.

This PR includes (a usage sketch follows the list):

  • Constructor with internally managed memory with reference counting
  • Constructor with external memory
  • Default destructor, default copy, and move operations
  • clone() for deep copying; the resulting array is always continuous
  • N-dim submatrix extraction (shallow copy)
  • Uploading from and downloading to Mat
  • Conversion operator to GpuMat
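
A minimal usage sketch of the pieces listed above. The include path, the brace-initialized size list, and the exact upload/download signatures are my assumptions, not guaranteed by this PR:

#include <opencv2/core/cuda.hpp>
#include <vector>

void demo()
{
    // Constructor with internally managed, reference-counted memory.
    cv::cuda::GpuMatND a({2, 3, 4, 5}, CV_32FC1);

    // Upload from / download to an N-dimensional Mat.
    cv::Mat host(std::vector<int>{2, 3, 4, 5}, CV_32FC1);
    a.upload(host);

    cv::cuda::GpuMatND b = a;          // shallow copy, shares the reference-counted buffer
    cv::cuda::GpuMatND c = a.clone();  // deep copy, always continuous

    cv::Mat back;
    c.download(back);
}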

Tests are in opencv/opencv_contrib#2805.
The following leak check was performed with those tests:

cuda-memcheck --leak-check full opencv_test_cudev.exe --gtest_filter=*GpuMatND*

I hope this PR can be accepted as a minimal implementation.

Further things that should be added include:

  • Add GpuMatND to the list of classes handled by the proxy classes (InputArray and OutputArray)
  • Add asynchronous APIs with cuda::Stream
  • Add copyTo, convertTo, setTo, as in GpuMat

I expect these can be addressed in separate PRs.

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or other license that is incompatible with OpenCV
  • The PR is proposed to proper branch
  • There is reference to original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake
force_builders=Custom
buildworker:Custom=linux-4
build_image:Custom=ubuntu-cuda:18.04

@nglee force-pushed the dev_gpumatnd1 branch 3 times, most recently from 54ad47d to d4862b6 on January 5, 2021 09:35
@nglee force-pushed the dev_gpumatnd1 branch 3 times, most recently from 675aece to 0e2b741 on January 7, 2021 18:36
@nglee (Contributor Author) commented on Jan 12, 2021

Changed the behavior of:

  • GpuMat operator()(IndexArray idx, Range rowRange, Range colRange) const;
  • operator GpuMat() const;

These functions now return a clone()-ed GpuMat, so the result manages its own memory.

Previously, these created a GpuMat header without reference counting.
However, other APIs follow a createXXXHeader naming convention for this, for example:

  • HostMem::createMatHeader() : returns a Mat header without reference counting
  • HostMem::createGpuMatHeader() : returns a GpuMat header without reference counting

Therefore, it seems reasonable to add these two methods for creating a GpuMat header without reference counting (see the sketch after the list):

  • GpuMat createGpuMatHeader(IndexArray idx, Range rowRange, Range colRange) const;
  • GpuMat createGpuMatHeader() const;
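
A hedged sketch contrasting the two access paths described above. The index semantics (idx selecting the leading dimensions, the last two dimensions mapping to rows and columns) and the include path are my assumptions:

#include <opencv2/core/cuda.hpp>

void access_sketch()
{
    cv::cuda::GpuMatND nd({5, 480, 640}, CV_8UC1);  // 5 planes of 480x640

    // operator(): returns a clone()-ed GpuMat that manages its own memory,
    // so it stays valid even after nd is released.
    cv::cuda::GpuMat owned = nd({2}, cv::Range::all(), cv::Range::all());

    // createGpuMatHeader(): returns a header over nd's device memory without
    // reference counting, so nd must outlive the header.
    cv::cuda::GpuMat header = nd.createGpuMatHeader({2}, cv::Range::all(), cv::Range::all());
}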

@alalek (Member) left a comment

Thank you for working on this!

Please take a look at the comments below.

element of the submatrix, and it can be different from data_->data.
If this is not a submatrix, then data is always equal to data_->data.
*/
uchar* data;
Member

Perhaps a size_t offset would be more suitable.

Contributor Author

uchar* data is now private, along with a new member size_t offset. I have also added a public member function getDevicePtr() that returns a pointer to the first byte.
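
A hedged, simplified sketch (not the PR's actual code) of the layout described above: the public raw pointer is replaced by a private byte offset into the shared buffer plus an accessor.

#include <cstddef>

class GpuMatNDSketch
{
public:
    // First byte of this (sub)matrix inside the shared device allocation.
    unsigned char* getDevicePtr() const { return base_ + offset_; }

private:
    unsigned char* base_ = nullptr;  // start of the underlying device allocation
    std::size_t offset_ = 0;         // byte offset of the first element of this (sub)matrix
};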

DevicePtr(DevicePtr&&) = delete;
DevicePtr& operator=(DevicePtr&&) = delete;

uchar* data;
Member

DevicePtr

It makes sense to move this class up a level, make it more generic, and change its name:

  • GpuData (like UMatData)
  • or GpuBuffer
  • or GpuDataContainer

In the future, we can reuse/share this "container" with the existing GpuMat.


Please add a size_t size field (the size of the allocated buffer in bytes) to allow accurate validation checks.

Contributor Author

GpuData is now a global struct with a new member size_t size.
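
A hedged sketch of what such a shared container with a size field could look like; the allocation details and the shared_ptr are illustrative assumptions, not this PR's code:

#include <cuda_runtime.h>
#include <cstddef>
#include <memory>

struct GpuDataSketch
{
    unsigned char* data = nullptr;  // device pointer returned by cudaMalloc
    std::size_t size = 0;           // size of the allocated buffer in bytes, used for validation

    explicit GpuDataSketch(std::size_t bytes) : size(bytes)
    {
        cudaMalloc(reinterpret_cast<void**>(&data), bytes);
    }
    ~GpuDataSketch() { cudaFree(data); }

    GpuDataSketch(const GpuDataSketch&) = delete;             // shared via a smart pointer instead
    GpuDataSketch& operator=(const GpuDataSketch&) = delete;
};

// Several (sub)matrices can share one buffer; validation can then check
// that offset + footprint <= data_->size.
using GpuDataPtr = std::shared_ptr<GpuDataSketch>;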

@asmorkalov (Contributor) left a comment

Looks good to me in general. I added some comments on the test code in contrib. The manual test passed on Ubuntu 18.04 with an NVIDIA GeForce 1080 Ti.

@alalek (Member) left a comment

Well done!

@param _step Array of _size.size()-1 steps in case of a multi-dimensional array (the last step is always
set to the element size). If not specified, the matrix is assumed to be continuous.
*/
GpuMatND(SizeArray _size, int _type, void* _data, StepArray _step = StepArray());
Member

@asmorkalov Does this ctor have some intersection with the similar cv::Mat ctor?
If so, then it makes sense to "wrap" (or move) a cv::Mat instead of a void* (perhaps out of the scope of this PR).

Contributor

_data is GPU memory, isn't it?

Contributor

The upload call handles the case with a regular cv::Mat.

Member

The main point here is to use a named static function instead of a constructor (to avoid confusion):

static GpuMatND wrapMemoryPtrGPU(...);
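
A hedged sketch of that suggestion. Only the name wrapMemoryPtrGPU comes from the comment above; writing it as a free function, the parameter list, the delegation to the existing wrapping constructor, and the assumption that SizeArray/StepArray are public aliases of the class are mine (a member version would live inside the class):

#include <opencv2/core/cuda.hpp>

// The name makes it explicit that devData points to GPU memory and that no
// ownership is taken.
inline cv::cuda::GpuMatND wrapMemoryPtrGPU(cv::cuda::GpuMatND::SizeArray size, int type,
                                           void* devData,
                                           cv::cuda::GpuMatND::StepArray step = {})
{
    // Delegates to the existing constructor that wraps external device memory.
    return cv::cuda::GpuMatND(size, type, devData, step);
}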

@param _type Array type. Use CV_8UC1, ..., CV_16FC4 to create 1-4 channel matrices, or
CV_8UC(n), ..., CV_64FC(n) to create multi-channel (up to CV_CN_MAX channels) matrices.
*/
GpuMatND(SizeArray _size, int _type);
Member

_size
_type

No need to use underscores in declarations without implementation code. Let's keep the docs and bindings clear.

Contributor Author

I'll do that in the next commit.

Contributor Author

I have fixed it.

using StepArray = std::vector<size_t>;
using IndexArray = std::vector<int>;

~GpuMatND() = default;
Member

= default

It makes sense to have the (empty) implementation in a .cpp file for that.
There are several non-trivial fields.

Contributor Author

I'll move the definition to the .cpp file to hide the implementation.
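
A hedged, simplified sketch of that split; class and member names here are placeholders:

#include <memory>

// --- header ---
class NDSketch
{
public:
    ~NDSketch();                  // declared only; no '= default' in the header
private:
    std::shared_ptr<void> data_;  // stands in for the non-trivial fields
};

// --- .cpp ---
NDSketch::~NDSketch() = default;  // defaulted where the member types are complete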

Contributor Author

I have fixed it.

Comment on lines +75 to +77
GpuMatND GpuMatND::clone() const
{
    CV_DbgAssert(!empty());
Member

Why not just return clone(Stream::Null());?

(reduce duplicated code, DRY principle)

Contributor Author

The overload with the empty parameter list calls synchronous CUDA APIs (cudaMemcpy and cudaMemcpy2D), whereas the overload with one argument calls asynchronous CUDA APIs (cudaMemcpyAsync and cudaMemcpy2DAsync).

GpuMat::upload() and GpuMat::download() also have overloads for asynchronous CUDA API calls.

However, I agree that we should reduce duplicated code, so I suggest keeping the two overloads while reducing duplication as much as possible.
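
A hedged sketch, in plain CUDA terms rather than the PR's actual code, of one way to keep both overloads while sharing the copy logic: route both through a stream-taking helper and let the synchronous overload wait on the stream.

#include <cuda_runtime.h>
#include <cstddef>

namespace sketch {

// Shared helper: always enqueues the copy asynchronously on the given stream.
inline void copyImpl(void* dst, const void* src, std::size_t bytes, cudaStream_t stream)
{
    cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, stream);
}

// Asynchronous overload: the caller synchronizes the stream later.
inline void copyAsync(void* dst, const void* src, std::size_t bytes, cudaStream_t stream)
{
    copyImpl(dst, src, bytes, stream);
}

// Synchronous overload: uses the default stream and waits for completion.
inline void copySync(void* dst, const void* src, std::size_t bytes)
{
    copyImpl(dst, src, bytes, /*stream=*/0);
    cudaStreamSynchronize(0);
}

} // namespace sketch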

@nglee (Contributor Author) commented on Feb 3, 2021

I have fixed it, but please look at this comment below.

/////////////////////////////////////////////////////
/// clone

static bool next(uchar*& d, const uchar*& s, std::vector<int>& idx, const int dims, const GpuMatND& dst, const GpuMatND& src)
Contributor Author

This function signature does not look so good, but I'd like to keep it as is for now. Making an internal iterator class that iterates over each 2D plane seems a better design. I'll do that in the next PR, when I implement the convertTo, copyTo, and setTo member functions of GpuMatND: a straightforward implementation would be to iterate over each 2D plane of GpuMatND and apply the GpuMat counterparts, and the iterator class would fit well there too.
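
A hedged sketch of such a plane iterator, written as an odometer over the leading dims-2 indices; the names and the byte-offset layout are my assumptions, not a preview of the future PR:

#include <cstddef>
#include <vector>

class PlaneIterator
{
public:
    PlaneIterator(const std::vector<int>& size, const std::vector<std::size_t>& step)
        : size_(size), step_(step), idx_(size.size() > 2 ? size.size() - 2 : 0, 0) {}

    // Byte offset of the current 2D plane from the start of the array.
    std::size_t planeOffset() const
    {
        std::size_t off = 0;
        for (std::size_t i = 0; i < idx_.size(); ++i)
            off += idx_[i] * step_[i];
        return off;
    }

    // Advance to the next plane; returns false after the last one.
    bool next()
    {
        for (int i = static_cast<int>(idx_.size()) - 1; i >= 0; --i)
        {
            if (++idx_[i] < size_[i])
                return true;
            idx_[i] = 0;  // carry into the next higher dimension
        }
        return false;
    }

private:
    std::vector<int> size_;          // sizes of all dimensions
    std::vector<std::size_t> step_;  // byte steps of all dimensions
    std::vector<int> idx_;           // current index over the leading dims-2 dimensions
};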

@alalek (Member) left a comment

Thank you for the update!

Comment on lines 134 to 142
do
{
    CV_CUDEV_SAFE_CALL(
        cudaMemcpy2D(
            d, ret.step[dims-2], s, step[dims-2],
            size[dims-1]*step[dims-1], size[dims-2], cudaMemcpyDeviceToDevice)
    );
}
while (next(d, s, idx, dims, ret, *this));
Member

Did you check the performance of this loop?
If this is a synchronous call, then a sequence of several synchronous calls may show bad performance.

I believe this scheme may perform better:

  • schedule async tasks
  • wait for completion

(the code block as a whole still behaves synchronously)

Contributor Author

I have changed cudaMemcpy2D to cudaMemcpy2DAsync and added cudaStreamSynchronize(0) to wait for completion.
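
A hedged sketch of that pattern, with contiguous planes assumed and placeholder shapes rather than the PR's exact geometry:

#include <cuda_runtime.h>
#include <cstddef>

// Enqueue every 2D copy asynchronously on the default stream, then wait once
// at the end so the whole block still behaves synchronously.
inline void copyPlanes(unsigned char* dst, std::size_t dstStep,
                       const unsigned char* src, std::size_t srcStep,
                       std::size_t widthBytes, std::size_t height, int planes)
{
    for (int p = 0; p < planes; ++p)
    {
        // Each call returns as soon as the copy is queued.
        cudaMemcpy2DAsync(dst + p * dstStep * height, dstStep,
                          src + p * srcStep * height, srcStep,
                          widthBytes, height, cudaMemcpyDeviceToDevice, 0);
    }
    cudaStreamSynchronize(0);  // single wait for all queued copies
}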

@alalek alalek merged commit 7ea21c4 into opencv:master Feb 5, 2021
@alalek alalek mentioned this pull request Apr 9, 2021
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
Minimal implementation of GpuMatND

* GpuMatND - minimal implementation

* GpuMatND - createGpuMatHeader

* GpuMatND - GpuData, offset, getDevicePtr(), license

* reviews

* reviews
Labels: category: gpu/cuda (contrib), feature