
cuda4dnn(conv): autotuning for convolution #16900


Closed

Conversation

YashasSamaga
Contributor

@YashasSamaga commented Mar 24, 2020

Statistics as of May 14th 2020:

Devices:

  • GTX 1050 Mobile
  • GTX 1080 Ti
  • RTX 2080 Ti

benchmarks without this PR

benchmarks with this PR

autotuning algorithm selections

convolution configurations


NOTE: Not up-to-date.

Device: GTX 1050

Model                          without this patch    with this patch
MobileNet SSD Coco v1          55ms                  7ms
MobileNet SSD Coco v2          84ms                  11ms
OpenPose pose MPI              156ms                 111ms
EfficientNet B0 YOLOv3         121ms                 15ms
FastNeuralStyle Starry Night   26ms                  22ms
Inception v2 Mask RCNN         180ms                 152ms

Every model that uses depthwise convolutions has improved.

Most of the Mask RCNN improvement comes from a single convolution layer where the algorithm chosen by heuristics took 40ms, whereas the best algorithm took just 20ms.


Pending:

  • insanely long initialization times

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under OpenCV (BSD) License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or other license that is incompatible with OpenCV
  • The PR is proposed to proper branch
  • There is reference to original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake
force_builders=Custom
buildworker:Custom=linux-4
build_image:Custom=ubuntu-cuda:18.04

@tompollok
Contributor

Great Job @YashasSamaga

Member

@alalek left a comment


Thank you for contribution!

runTensorFlowNet("fp16_eltwise_add_mul", false, l1, lInf);
runTensorFlowNet("fp16_pad_and_concat", false, l1, lInf);
runTensorFlowNet("fp16_padding_valid", false, l1, lInf);
runTensorFlowNet("fp16_padding_valid", false, l1, lInf);std::cout << "2";

Perhaps we should split this test into smaller parts (for accurate detection of failed parts). Maybe in a separate PR to keep changes atomic.

@YashasSamaga
Contributor Author

YashasSamaga commented Mar 26, 2020

The CUDA backend now performs autotuning on its own, which appears to have resolved the performance degradation problems. Moreover, it also includes fused convolutions while autotuning, which has improved performance further.

Problem:

The initialization time has gone up significantly. It is several seconds for many models, and a few of them take tens of seconds. Mask RCNN takes 26s to initialize!

There are eight algorithms available for convolution. These are the combinations which are tried while autotuning:

  1. cuDNN default math fused:
    • all eight algorithms are tried
    • cuDNN does convolution, bias addition and activation together
  2. cuDNN default math unfused:
    • all eight algorithms are tried
    • cuDNN does the convolution
    • bias addition and activation are carried out by cuda4dnn kernels (bias and activation are fused together)
  3. cuDNN tensor core fused:
    • two algorithms are tried
    • cuDNN does convolution, bias addition and activation together
  4. cuDNN tensor core unfused:
    • two algorithms are tried
    • cuDNN does the convolution
    • bias addition and activation are carried out by cuda4dnn kernels (bias and activation are fused together)

Not all algorithms are always tried: some fail due to insufficient memory, or they are not supported in the required configuration. There is a detailed summary of autotuning results for many models. In the worst case, 20 combinations will be tried for every convolution layer.

There are a few algorithms, particularly the FFT-based ones, which request insane amounts of workspace. For example, the FFT tiling algorithm requests a workspace of 8GB for convolving a 200KB image. These memory allocations take seconds! The workspace is cached during profiling to avoid repeated allocations, but even a single allocation of several GBs takes a second or two.

The FFT-based algorithms are rarely selected (only once across my testbed of 26 models), as they are very inefficient for small filters and small batches.

Why try unfused cuDNN combinations when fused cuDNN combinations are available?
Because fused cuDNN can sometimes be slower than an unfused cuDNN convolution with a separate bias_activation step. For example, if fused cuDNN operations are forced whenever they are available, MobileNet takes 37ms instead of just 7ms!
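
For illustration, here is a minimal sketch (not the code in this PR) of how the unfused candidates for a single layer could be profiled with the plain cuDNN 7 API. The descriptors and device buffers are assumed to be set up by the caller, the data type is assumed to be float, the cuDNN handle is assumed to use the default stream, and tuneConvolution is a hypothetical helper; the fused combinations would be timed the same way using cudnnConvolutionBiasActivationForward. Calling it once with CUDNN_DEFAULT_MATH and once with CUDNN_TENSOR_OP_MATH covers the two unfused combinations listed above.

#include <cudnn.h>
#include <cuda_runtime.h>
#include <cfloat>
#include <cstddef>

// Result of profiling one (algorithm, math type) candidate.
struct ConvCandidate {
    cudnnConvolutionFwdAlgo_t algo;
    cudnnMathType_t mathType;
    float timeMs;
    size_t workspaceBytes;
};

// Time every cuDNN forward algorithm for the given math type and return the fastest.
// Descriptors and device buffers (x, w, y) must be created by the caller.
ConvCandidate tuneConvolution(cudnnHandle_t handle,
                              cudnnTensorDescriptor_t xDesc, const void* x,
                              cudnnFilterDescriptor_t wDesc, const void* w,
                              cudnnConvolutionDescriptor_t convDesc,
                              cudnnTensorDescriptor_t yDesc, void* y,
                              cudnnMathType_t mathType)
{
    cudnnSetConvolutionMathType(convDesc, mathType);

    ConvCandidate best{CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM, mathType, FLT_MAX, 0};
    const float alpha = 1.f, beta = 0.f; // assumes float data

    for (int i = 0; i < CUDNN_CONVOLUTION_FWD_ALGO_COUNT; i++)
    {
        auto algo = static_cast<cudnnConvolutionFwdAlgo_t>(i);

        size_t wsSize = 0;
        if (cudnnGetConvolutionForwardWorkspaceSize(handle, xDesc, wDesc, convDesc,
                                                    yDesc, algo, &wsSize) != CUDNN_STATUS_SUCCESS)
            continue; // algorithm not supported for this configuration

        void* workspace = nullptr;
        if (wsSize && cudaMalloc(&workspace, wsSize) != cudaSuccess)
            continue; // e.g. FFT algorithms asking for gigabytes of workspace

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // time one forward pass with this algorithm on the default stream
        cudaEventRecord(start);
        cudnnStatus_t status = cudnnConvolutionForward(handle, &alpha, xDesc, x, wDesc, w,
                                                       convDesc, algo, workspace, wsSize,
                                                       &beta, yDesc, y);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = FLT_MAX;
        cudaEventElapsedTime(&ms, start, stop);
        if (status == CUDNN_STATUS_SUCCESS && ms < best.timeMs)
            best = ConvCandidate{algo, mathType, ms, wsSize};

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(workspace);
    }
    return best;
}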

@tompollok
Contributor

Would it make sense to add a file next to the model file that persists the tuned configuration for the hardware it was tuned on? That would save the expensive autotuning when just processing a single image on request, because for some tasks several models have to be applied to a single image.

@YashasSamaga
Contributor Author

YashasSamaga commented Mar 26, 2020

For reference, initialization times are around a few hundred milliseconds to a second without this PR. This depends on the device. The numbers that I report are for GTX 1050.

That would require discussions on the storage format, API and a lot of other things (and quite a bit of work).

An easier temporary solution might be to make autotuning optional (an opt-in feature, i.e. disabled by default). This again would require a new API (or we could use a layer param of the conv layer).

@alalek
Member

alalek commented Mar 26, 2020

A similar approach exists in the OpenCL (ocl4dnn) backend. It includes:

  • rules to define the convolution configuration.
  • default "pre-tuned" values for some cases (the target hardware is selected through the number of execution units (EUs); autotuning is available for Intel iGPUs).
  • multi-level storage of tuned configurations: "default", "in-memory", "on disk" (a rough lookup sketch follows this list).
  • tuning configuration flags: auto-tuning is disabled by default, but can be enabled on request or loaded from disk.
  • some magic for how candidates are generated for the tuning process.
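
To illustrate how that layering could look (a rough sketch under assumed names, not the actual ocl4dnn code), a tuned configuration would be looked up in memory first, then on disk, and finally fall back to a built-in pre-tuned default:

#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical multi-level tuning cache: "in-memory" -> "on disk" -> "default".
// All names are illustrative and do not match the real ocl4dnn code.
class TuningCache
{
public:
    std::string lookup(const std::string& convConfigKey)
    {
        if (auto it = inMemory.find(convConfigKey); it != inMemory.end())
            return it->second;                   // already tuned during this run
        if (auto fromDisk = loadFromDisk(convConfigKey))
        {
            inMemory[convConfigKey] = *fromDisk; // promote to the in-memory level
            return *fromDisk;
        }
        return pretunedDefault(convConfigKey);   // built-in fallback, no tuning required
    }

private:
    // Stubs standing in for real on-disk persistence and pre-tuned tables.
    std::optional<std::string> loadFromDisk(const std::string&) { return std::nullopt; }
    std::string pretunedDefault(const std::string&) { return "default-config"; }

    std::unordered_map<std::string, std::string> inMemory;
};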

@YashasSamaga changed the title from "cuda4dnn(conv): runtime algorithm selection for cudnn convolution" to "cuda4dnn(conv): autotuning for convolution" on Apr 23, 2020
@YashasSamaga
Contributor Author

YashasSamaga commented May 11, 2020

What about having an API for loading, saving and optimizing model configurations that is exposed through cv::dnn::Net? This seems more user-friendly than using environment variables. It also allows unoptimized and optimized models to be used simultaneously (without having to modify environment variables at runtime). It's also easier to extend in the future.

Both OpenCL and CUDA could use the same API instead of having individual backend-specific APIs.

I think only three functions are required (and maybe user-friendly overloads):

void Net::loadOptimizerConfiguration(const std::string& config);
void Net::saveOptimizerConfiguration(std::string& config);
void Net::runOptimizer();

The configuration could be stored in protobuf format. The format can be kept internal. It can also carry an internal version number to track the format across releases, which would allow printing user-friendly messages when the configuration is incompatible or when there is an update and the optimizer must be run again for additional benefits.
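
For example, a user-facing flow with this proposed API might look roughly like the following (a sketch only: none of these Net methods exist yet, and "model.onnx" / "model.tuning" are placeholder file names):

#include <fstream>
#include <sstream>
#include <string>
#include <opencv2/dnn.hpp>

int main()
{
    cv::dnn::Net net = cv::dnn::readNet("model.onnx");
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);

    std::ifstream in("model.tuning");
    if (in)
    {
        // reuse the configuration saved by a previous run on the same hardware
        std::stringstream buffer;
        buffer << in.rdbuf();
        net.loadOptimizerConfiguration(buffer.str());
    }
    else
    {
        net.runOptimizer();  // expensive: profiles candidate algorithms per layer
        std::string config;
        net.saveOptimizerConfiguration(config);
        std::ofstream("model.tuning") << config;
    }
    // ... net.setInput(...); net.forward(); ...
    return 0;
}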

@YashasSamaga force-pushed the cuda4dnn-conv-runtime-tuning branch from 11f2edb to 4a58a22 on May 11, 2020 16:30
@tompollok
Contributor

The API proposal sounds reasonable to me, as it makes the use of optimized vs. non-optimized versions dynamically selectable at runtime and transparent to users. I would also vote against the use of environment variables.

@YashasSamaga
Contributor Author

YashasSamaga commented Jul 28, 2020

cuDNN 8 has a brand new API which offers better autotuning capabilities. I will make a PR supporting cuDNN 8's new backend API soon, and then a PR for autotuning.

@lorenzolightsgdwarf

@YashasSamaga thanks for the work! On my net (a custom version of SSD MobileNet v2) I haven't seen any improvement. I'm using Cuda 10.2 and Cudnn 7.6.2. Is there any flag/environment variable that enables the autotuning? Or is there something that disables it? Bests!

@YashasSamaga
Contributor Author

@YashasSamaga thanks for the work! On my net (a custom version of SSD MobileNet v2) I haven't seen any improvement. I'm using Cuda 10.2 and Cudnn 7.6.2. Is there any flag/environment variable that enables the autotuning? Or is there something that disables it? Bests!

It's always enabled in this PR. This PR hasn't been merged into master, so you need to build this PR to use the autotuning facility.

Related: #20966
