Further optimize DNN for RISC-V Vector. #21086
Merged
This patch further optimizes DNN for RVV, based on my GSoC work. The previous version is #20521.
There are 3 changes in this patch.
1. Use `vsetvl` instead of a branch to handle the vector tail (the last few elements of each row, which cannot fill an entire vector register). I wrote an example on Godbolt about the difference between using `vsetvl` and using `if`, to show that `vsetvl` eliminates the conditional jump and introduces only one extra instruction (`sub`).

2. Unify the names of the variables related to `vl`. Previously, the variable naming in each function was independent and unfriendly to readers, so I renamed the `vl`-related variables following one rule. Now, in all 4 functions, all of the following variables are used as the `vl` parameter of intrinsics, but different names have different meanings:
   - `vlm<LMUL>`: the maximum value that `vl` can be set to for a given LMUL. It is a constant.
   - `vl`: the number of elements processed in each inner-loop iteration; it is used to process the tail in the final iteration.
   - `unroll_tail`: the number of elements processed in each outer-loop iteration; it is also used to process a tail in the final iteration, but this tail is caused by loop unrolling.

   CHANGE 1 also introduces a new variable, `avl`, which represents the number of unprocessed elements and is passed as the parameter of `vsetvl`.

3. Update the way the function `fastConv` handles the matrix tail (the last few rows of the matrix, usually caused by loop unrolling; the `vl` for the matrix tail is called `unroll_tail` in CHANGE 2). In the previous version, I used both `vl` and a mask for the matrix tail to handle the different block sizes, and here is the discussion at the time. However, a mask is usually costly, and I found a new way that uses only `vl`. With that, no mask and no additional branch is needed.
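To make the tail handling in CHANGE 1 and the `avl` / `vl` / `vlm<LMUL>` naming concrete, here is a minimal scalar sketch. It is not the code from this patch and does not use real RVV intrinsics: `vsetvl_sim` and `vfmacc_sim` are hypothetical stand-ins, and `vsetvl_sim` assumes the simplest legal behaviour, `vl = min(avl, VLMAX)` (real hardware may choose other values when `VLMAX < avl < 2 * VLMAX`). Both functions compute the same dot product; the first adjusts `vl` with a branch, the second lets the `vsetvl` stand-in derive `vl` from `avl`.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical stand-in for the vsetvl intrinsic: returns how many elements
// the next "vector" operation will process. Assumption: vl = min(avl, vlmax).
static std::size_t vsetvl_sim(std::size_t avl, std::size_t vlmax) {
    return std::min(avl, vlmax);
}

// Hypothetical stand-in for a vector multiply-accumulate over vl elements.
static float vfmacc_sim(float acc, const float* a, const float* b, std::size_t vl) {
    for (std::size_t i = 0; i < vl; ++i) acc += a[i] * b[i];
    return acc;
}

// Tail handled with a branch: vl is shortened by an explicit `if`.
static float dot_branch(const float* a, const float* b, std::size_t n, std::size_t vlm) {
    float acc = 0.0f;
    std::size_t vl = vlm;                 // vlm: maximum vl for the chosen LMUL
    for (std::size_t k = 0; k < n; k += vl) {
        if (k + vl > n) vl = n - k;       // conditional check in every iteration
        acc = vfmacc_sim(acc, a + k, b + k, vl);
    }
    return acc;
}

// Tail handled via vsetvl: avl counts the unprocessed elements, and the only
// extra work per iteration is the subtraction avl -= vl.
static float dot_vsetvl(const float* a, const float* b, std::size_t n, std::size_t vlm) {
    float acc = 0.0f;
    std::size_t avl = n;                  // avl: elements still to be processed
    std::size_t vl = 0;
    for (std::size_t k = 0; k < n; k += vl, avl -= vl) {
        vl = vsetvl_sim(avl, vlm);        // the final iteration gets the short tail
        acc = vfmacc_sim(acc, a + k, b + k, vl);
    }
    return acc;
}
```

With 10 elements and `vlm = 4`, both versions process chunks of 4, 4 and 2 elements. In the second version the tail length comes out of the `vsetvl` stand-in itself rather than from a separate comparison.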
I have tested this patch on QEMU; the minimal DNN test data set shows the same results on this patch and on the master branch.
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.