
cuda4dnn(conv): autotuning for convolution #16900


Closed

Conversation

YashasSamaga
Contributor

@YashasSamaga commented Mar 24, 2020

Statistics as of May 14th 2020:

Devices:

  • GTX 1050 Mobile
  • GTX 1080 Ti
  • RTX 2080 Ti

benchmarks without this PR

benchmarks with this PR

autotuning algorithm selections

convolution configurations


NOTE: Not up-to-date.

Device: GTX 1050

Model                          without this patch    with this patch
MobileNet SSD Coco v1          55ms                  7ms
MobileNet SSD Coco v2          84ms                  11ms
OpenPose pose MPI              156ms                 111ms
EfficientNet B0 YOLOv3         121ms                 15ms
FastNeuralStyle Starry Night   26ms                  22ms
Inception v2 Mask RCNN         180ms                 152ms

Every model that uses depthwise convolutions has improved.

Most of the Mask RCNN improvement comes from a single convolution layer where the algorithm chosen by heuristics took 40ms, whereas the best algorithm took just 20ms.


Pending:

  • insanely long initialization times

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under OpenCV (BSD) License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or other license that is incompatible with OpenCV
  • The PR is proposed to proper branch
  • There is reference to original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake
force_builders=Custom
buildworker:Custom=linux-4
build_image:Custom=ubuntu-cuda:18.04

@tompollok
Contributor

Great Job @YashasSamaga

Member

@alalek left a comment


Thank you for contribution!

runTensorFlowNet("fp16_eltwise_add_mul", false, l1, lInf);
runTensorFlowNet("fp16_pad_and_concat", false, l1, lInf);
runTensorFlowNet("fp16_padding_valid", false, l1, lInf);
runTensorFlowNet("fp16_padding_valid", false, l1, lInf);std::cout << "2";

Perhaps we should split this test into smaller parts (for accurate detection of failed parts). Maybe in a separate PR to keep changes atomic.

@YashasSamaga
Contributor Author

YashasSamaga commented Mar 26, 2020

The CUDA backend now performs autotuning on its own, which appears to have resolved the performance degradation problems. Moreover, it also includes fused convolutions while autotuning, which has improved performance further.

Problem:

The initialization time has gone up significantly. It is several seconds for many models, and a few of them take tens of seconds. Mask RCNN takes 26s to initialize!

There are eight algorithms available for convolution. These are the combinations which are tried while autotuning:

  1. cuDNN default math fused:
    • all eight algorithms are tried
    • cuDNN does convolution, bias addition and activation together
  2. cuDNN default math unfused:
    • all eight algorithms are tried
    • cuDNN does the convolution
    • bias addition and activation are carried out by cuda4dnn kernels (bias and activation are fused together)
  3. cuDNN tensor core fused:
    • two algorithms are tried
    • cuDNN does convolution, bias addition and activation together
  4. cuDNN tensor core unfused:
    • two algorithms are tried
    • cuDNN does the convolution
    • bias addition and activation are carried out by cuda4dnn kernels (bias and activation are fused together)

Not all algorithms are always tried: some fail due to insufficient memory, or they are not supported in the required configuration. There is a detailed summary of autotuning results for many models. In the worst case, 20 combinations will be tried for every convolution layer.

There are a few algorithms, particularly the FFT-based ones, which request insane amounts of workspace. For example, the FFT tiling algorithm requests a workspace of 8GB for convolving a 200KB image. These memory allocations take seconds! The workspace is cached during profiling to avoid repeated allocations, but even a single allocation of several GBs takes a second or two.

The FFT-based algorithms are rarely selected (only once across my testbed of 26 models), as they are very inefficient for small filters and small batches.

Why try unfused cuDNN combinations when fused cuDNN combinations are available?
Because fused cuDNN can sometimes be slower than an unfused cuDNN convolution with a separate bias_activation step. For example, if fused cuDNN operations are forced whenever they are available, MobileNet takes 37ms instead of just 7ms!
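
For illustration, here is a minimal sketch (not the code in this PR) of how the unfused candidates for a single layer could be profiled with the plain cuDNN 7 API. The descriptors and device buffers are assumed to be set up by the caller, the data type is assumed to be float, the cuDNN handle is assumed to use the default stream, and tuneConvolution is a hypothetical helper; the fused combinations would be timed the same way using cudnnConvolutionBiasActivationForward. Calling it once with CUDNN_DEFAULT_MATH and once with CUDNN_TENSOR_OP_MATH covers the two unfused combinations listed above.

#include <cudnn.h>
#include <cuda_runtime.h>
#include <cfloat>
#include <cstddef>

// Result of profiling one (algorithm, math type) candidate.
struct ConvCandidate {
    cudnnConvolutionFwdAlgo_t algo;
    cudnnMathType_t mathType;
    float timeMs;
    size_t workspaceBytes;
};

// Time every cuDNN forward algorithm for the given math type and return the fastest.
// Descriptors and device buffers (x, w, y) must be created by the caller.
ConvCandidate tuneConvolution(cudnnHandle_t handle,
                              cudnnTensorDescriptor_t xDesc, const void* x,
                              cudnnFilterDescriptor_t wDesc, const void* w,
                              cudnnConvolutionDescriptor_t convDesc,
                              cudnnTensorDescriptor_t yDesc, void* y,
                              cudnnMathType_t mathType)
{
    cudnnSetConvolutionMathType(convDesc, mathType);

    ConvCandidate best{CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM, mathType, FLT_MAX, 0};
    const float alpha = 1.f, beta = 0.f; // assumes float data

    for (int i = 0; i < CUDNN_CONVOLUTION_FWD_ALGO_COUNT; i++)
    {
        auto algo = static_cast<cudnnConvolutionFwdAlgo_t>(i);

        size_t wsSize = 0;
        if (cudnnGetConvolutionForwardWorkspaceSize(handle, xDesc, wDesc, convDesc,
                                                    yDesc, algo, &wsSize) != CUDNN_STATUS_SUCCESS)
            continue; // algorithm not supported for this configuration

        void* workspace = nullptr;
        if (wsSize && cudaMalloc(&workspace, wsSize) != cudaSuccess)
            continue; // e.g. FFT algorithms asking for gigabytes of workspace

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // time one forward pass with this algorithm on the default stream
        cudaEventRecord(start);
        cudnnStatus_t status = cudnnConvolutionForward(handle, &alpha, xDesc, x, wDesc, w,
                                                       convDesc, algo, workspace, wsSize,
                                                       &beta, yDesc, y);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = FLT_MAX;
        cudaEventElapsedTime(&ms, start, stop);
        if (status == CUDNN_STATUS_SUCCESS && ms < best.timeMs)
            best = ConvCandidate{algo, mathType, ms, wsSize};

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(workspace);
    }
    return best;
}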

@tompollok
Contributor

Would it make sense to add a file next to the model file that persists the tuned configuration for the hardware it was tuned on? That would save the expensive autotuning when just processing a single image on request, because for some tasks several models have to be applied to a single image.

@YashasSamaga
Contributor Author

YashasSamaga commented Mar 26, 2020

For reference, initialization times are around a few hundred milliseconds to a second without this PR. This depends on the device. The numbers that I report are for GTX 1050.

That would require discussions on the storage format, API and a lot of other things (and quite a bit of work).

An easier temporary solution might be to make autotuning optional (an opt-in feature, i.e. disabled by default). This again would require a new API (or we could use a layer param of the conv layer).

@alalek
Member

alalek commented Mar 26, 2020

A similar approach exists in the OpenCL (ocl4dnn) backend. It includes:

  • rules to define the convolution configuration.
  • default "pre-tuned" values for some cases (the target hardware is selected through the number of execution units (EUs); autotuning is available for Intel iGPUs).
  • multi-level storage of tuned configurations: "default", "in-memory", "on disk" (a rough lookup sketch follows this list).
  • tuning configuration flags: auto-tuning is disabled by default, but can be enabled on request or loaded from disk.
  • some magic for how candidates are generated for the tuning process.
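
To illustrate how that layering could look (a rough sketch under assumed names, not the actual ocl4dnn code), a tuned configuration would be looked up in memory first, then on disk, and finally fall back to a built-in pre-tuned default:

#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical multi-level tuning cache: "in-memory" -> "on disk" -> "default".
// All names are illustrative and do not match the real ocl4dnn code.
class TuningCache
{
public:
    std::string lookup(const std::string& convConfigKey)
    {
        if (auto it = inMemory.find(convConfigKey); it != inMemory.end())
            return it->second;                   // already tuned during this run
        if (auto fromDisk = loadFromDisk(convConfigKey))
        {
            inMemory[convConfigKey] = *fromDisk; // promote to the in-memory level
            return *fromDisk;
        }
        return pretunedDefault(convConfigKey);   // built-in fallback, no tuning required
    }

private:
    // Stubs standing in for real on-disk persistence and pre-tuned tables.
    std::optional<std::string> loadFromDisk(const std::string&) { return std::nullopt; }
    std::string pretunedDefault(const std::string&) { return "default-config"; }

    std::unordered_map<std::string, std::string> inMemory;
};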

@YashasSamaga changed the title from "cuda4dnn(conv): runtime algorithm selection for cudnn convolution" to "cuda4dnn(conv): autotuning for convolution" on Apr 23, 2020
@YashasSamaga
Contributor Author

YashasSamaga commented May 11, 2020

What about having an API for loading, saving and optimizing model configurations that is exposed through cv::dnn::Net? This seems more user-friendly than using environment variables. It also allows unoptimized and optimized models to be used simultaneously (without having to modify environment variables at runtime). It's also easier to extend in the future.

Both OpenCL and CUDA could use the same API instead of having individual backend-specific APIs.

I think only three functions are required (and maybe user-friendly overloads):

void Net::loadOptimizerConfiguration(const std::string& config);
void Net::saveOptimizerConfiguration(std::string& config);
void Net::runOptimizer();

The configuration could be stored in protobuf format. The format can be kept internal. It can also carry an internal version number to track the format across releases, which would allow printing user-friendly messages when the configuration is incompatible or when there is an update and the optimizer must be run again for additional benefits.
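
For example, a user-facing flow with this proposed API might look roughly like the following (a sketch only: none of these Net methods exist yet, and "model.onnx" / "model.tuning" are placeholder file names):

#include <fstream>
#include <sstream>
#include <string>
#include <opencv2/dnn.hpp>

int main()
{
    cv::dnn::Net net = cv::dnn::readNet("model.onnx");
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);

    std::ifstream in("model.tuning");
    if (in)
    {
        // reuse the configuration saved by a previous run on the same hardware
        std::stringstream buffer;
        buffer << in.rdbuf();
        net.loadOptimizerConfiguration(buffer.str());
    }
    else
    {
        net.runOptimizer();  // expensive: profiles candidate algorithms per layer
        std::string config;
        net.saveOptimizerConfiguration(config);
        std::ofstream("model.tuning") << config;
    }
    // ... net.setInput(...); net.forward(); ...
    return 0;
}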

@YashasSamaga force-pushed the cuda4dnn-conv-runtime-tuning branch from 11f2edb to 4a58a22 on May 11, 2020 16:30
@tompollok
Contributor

The API proposal sounds reasonable to me, as it makes the use of optimized vs. non-optimized versions dynamically selectable at runtime and transparent to users. I would also vote against the use of environment variables.

@YashasSamaga
Contributor Author

YashasSamaga commented Jul 28, 2020

cuDNN 8 has a brand new API which offers better autotuning capabilities. I will make a PR supporting cuDNN 8's new backend API soon, and then a PR for autotuning.

@lorenzolightsgdwarf

@YashasSamaga thanks for the work! On my net (a custom version of SSD MobileNet v2) I haven't seen any improvement. I'm using Cuda 10.2 and Cudnn 7.6.2. Is there any flag/environment variable that enables the autotuning? Or is there something that disables it? Bests!

@YashasSamaga
Contributor Author

@YashasSamaga thanks for the work! On my net (a custom version of SSD MobileNet v2) I haven't seen any improvement. I'm using Cuda 10.2 and Cudnn 7.6.2. Is there any flag/environment variable that enables the autotuning? Or is there something that disables it? Bests!

It's always enabled in this PR. This PR hasn't been merged into master, so you need to build this PR to use the autotuning facility.

Related: #20966
