cuda4dnn(conv): autotuning for convolution #16900
Conversation
Great job, @YashasSamaga!
Thank you for the contribution!
runTensorFlowNet("fp16_eltwise_add_mul", false, l1, lInf); | ||
runTensorFlowNet("fp16_pad_and_concat", false, l1, lInf); | ||
runTensorFlowNet("fp16_padding_valid", false, l1, lInf); | ||
runTensorFlowNet("fp16_padding_valid", false, l1, lInf);std::cout << "2"; |
Perhaps we should split this test into smaller parts (for accurate detection of which part failed), as sketched below. Maybe in a separate PR, to keep the changes atomic.
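For illustration, such a split could look roughly like this. This is a sketch only: the fixture name `Test_TensorFlow_fp16` and the tolerance values are placeholders, not OpenCV's actual ones, and the usual `INSTANTIATE_TEST_CASE_P` boilerplate is assumed to exist.

```cpp
// Hypothetical split: one small TEST_P per network, so a failure immediately
// identifies the failing subgraph instead of aborting one monolithic test.
TEST_P(Test_TensorFlow_fp16, eltwise_add_mul)
{
    double l1 = 0.05, lInf = 0.3;  // fp16 tolerances; placeholder values
    runTensorFlowNet("fp16_eltwise_add_mul", false, l1, lInf);
}

TEST_P(Test_TensorFlow_fp16, pad_and_concat)
{
    double l1 = 0.05, lInf = 0.3;
    runTensorFlowNet("fp16_pad_and_concat", false, l1, lInf);
}

TEST_P(Test_TensorFlow_fp16, padding_valid)
{
    double l1 = 0.05, lInf = 0.3;
    runTensorFlowNet("fp16_padding_valid", false, l1, lInf);
}
```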
The CUDA backend now performs autotuning on its own, which appears to have resolved the performance degradation problems. It also includes fused convolutions while autotuning, which has improved performance further.

Problem: the initialization time has gone up significantly. It is several seconds for many models, and a few take double-digit seconds; Mask RCNN takes 26s to initialize! There are eight algorithms available for convolution. These are the combinations that are tried while autotuning:
Not all algorithms are always tried. Some fail due to insufficient memory, or are not supported in the required configuration.

Detailed summary of autotuning results for many models.

In the worst case, 20 combinations will be tried for every convolution layer. A few algorithms, particularly the FFT-based ones, request insane amounts of workspace. For example, the FFT tiling algorithm requests an 8GB workspace for convolving a 200KB image. These memory allocations take seconds! The workspace is cached during profiling to avoid repeated allocations, but even a single multi-GB allocation takes a second or two. The FFT-based algorithms are rarely selected (only once in my testbed of 26 models) as they are very inefficient for small filters and small batches.

Why try unfused cuDNN combinations when fused cuDNN combinations are available?
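For context, find-based algorithm selection with the cuDNN 7 API looks roughly like the sketch below. This is illustrative, not the PR's implementation: it omits the fused combinations and all error handling, and assumes the handle and descriptors are configured elsewhere.

```cpp
#include <cudnn.h>
#include <vector>

// Sketch: time every available forward-convolution algorithm and pick the
// fastest one that succeeded and fits within a workspace budget. All
// descriptors are assumed to be configured by the caller.
cudnnConvolutionFwdAlgo_t selectBestAlgo(cudnnHandle_t handle,
                                         cudnnTensorDescriptor_t xDesc,
                                         cudnnFilterDescriptor_t wDesc,
                                         cudnnConvolutionDescriptor_t convDesc,
                                         cudnnTensorDescriptor_t yDesc,
                                         size_t workspaceLimit)
{
    int returned = 0;
    std::vector<cudnnConvolutionFwdAlgoPerf_t> perf(CUDNN_CONVOLUTION_FWD_ALGO_COUNT);

    // runs and times the candidate algorithms; results come back sorted by time
    cudnnFindConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc, yDesc,
                                         CUDNN_CONVOLUTION_FWD_ALGO_COUNT,
                                         &returned, perf.data());

    for (int i = 0; i < returned; i++)
        if (perf[i].status == CUDNN_STATUS_SUCCESS && perf[i].memory <= workspaceLimit)
            return perf[i].algo;

    return CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM;  // conservative fallback
}
```

Note that `cudnnFindConvolutionForwardAlgorithm` actually executes each candidate kernel, which is why every eligible algorithm, including the workspace-hungry FFT ones, contributes to initialization time.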
Would it make sense to add a file next to the model file that persists the tuned configuration for the hardware it was tuned on? That would save the expensive autotuning when just processing a single image on request; sometimes several models have to be applied to a single image for some tasks.
For reference, initialization times without this PR are around a few hundred milliseconds to a second, depending on the device. The numbers I report are for a GTX 1050.

That would require discussions on the storage format, the API and a lot of other things (and quite a bit of work). An easier temporary solution might be to make autotuning optional (opt-in, i.e. disabled by default). This again would require a new API (or we could use a layer param of the conv layer).
A similar approach exists in the OpenCL (ocl4dnn) backend. It includes:
What about having an API for loading, saving and optimizing model configurations? Both OpenCL and CUDA could use the same API instead of having individual backend-specific APIs. I think only three functions are required (and maybe user-friendly overloads):
The configuration could be stored in protobuf format, kept as an internal format. It could also carry an internal version number to track the format across releases. This would allow printing user-friendly messages when the stored configuration is incompatible, or when there is an update and the optimizer should be run again for additional benefits.
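A hypothetical shape for that interface (none of these names exist in OpenCV; the functions and the `OptimizationConfig` type are invented purely to illustrate the three-function proposal):

```cpp
#include <opencv2/dnn.hpp>
#include <string>

// Hypothetical interface; nothing below exists in OpenCV.
struct OptimizationConfig;  // opaque handle; serialized as versioned protobuf

// Run backend-specific autotuning on every layer and collect the results.
OptimizationConfig optimizeModel(cv::dnn::Net& net);

// Persist the tuned configuration, e.g. next to the model file.
void saveOptimizationConfig(const OptimizationConfig& config, const std::string& path);

// Restore a saved configuration so inference can skip autotuning entirely.
OptimizationConfig loadOptimizationConfig(const std::string& path);
```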
Force-pushed from 11f2edb to 4a58a22
The API proposal sounds reasonable to me, as it makes the choice between optimized and non-optimized versions dynamically selectable at runtime and transparent to users. I would also vote against the use of environment variables.
Force-pushed from 4a58a22 to 993ba52
Force-pushed from 993ba52 to ef3d497
cuDNN 8 has a brand-new API which offers better autotuning capabilities. I will make a PR adding support for cuDNN 8's new backend API soon, and then a PR for autotuning.
@YashasSamaga thanks for the work! On my net (a custom version of SSD MobileNet v2) I haven't seen any improvement. I'm using CUDA 10.2 and cuDNN 7.6.2. Is there any flag or environment variable that enables the autotuning? Or is there something that disables it? Best!
Autotuning is always enabled in this PR. The PR hasn't been merged into master, so you need to build OpenCV from this branch to use the autotuning facility. Related: #20966
Statistics as of May 14th, 2020 (note: not up-to-date):
- Devices:
- benchmarks without this PR
- benchmarks with this PR
- autotuning algorithm selections
- convolution configurations
Device: GTX 1050

Every model that uses depthwise convolutions has improved. Most of the Mask RCNN improvement comes from a single convolution layer where the algorithm chosen by heuristics took 40ms, whereas the best algorithm took just 20ms.
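For context, a gap like that between the heuristic pick and the measured-fastest pick can be observed with the cuDNN 7 API roughly as in this sketch (not code from the PR; descriptors are assumed to be configured elsewhere, error handling omitted):

```cpp
#include <cudnn.h>
#include <cstdio>

// Sketch: contrast cuDNN's heuristic choice with the measured-fastest one.
void compareSelections(cudnnHandle_t handle,
                       cudnnTensorDescriptor_t xDesc, cudnnFilterDescriptor_t wDesc,
                       cudnnConvolutionDescriptor_t convDesc, cudnnTensorDescriptor_t yDesc)
{
    cudnnConvolutionFwdAlgoPerf_t heur, best;
    int returned = 0;

    // heuristic ranking: no kernels are executed
    cudnnGetConvolutionForwardAlgorithm_v7(handle, xDesc, wDesc, convDesc, yDesc,
                                           1, &returned, &heur);

    // exhaustive search: candidate algorithms are run and timed
    cudnnFindConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc, yDesc,
                                         1, &returned, &best);

    if (heur.algo != best.algo)
        std::printf("heuristic chose algo %d; algo %d measured fastest (%.2f ms)\n",
                    (int)heur.algo, (int)best.algo, best.time);
}
```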
Pending:
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.