GAPI Fluid: SIMD for MulC kernel. #21177

Merged 2 commits on Dec 3, 2021

anna-khakimova
Member

SIMD for GAPI Fluid MulC kernel.

Performance report:

FluidMulCSIMD.xlsx

force_builders=Linux AVX2,Custom,Custom Win,Custom Mac
build_gapi_standalone:Linux x64=ade-0.1.1f
build_gapi_standalone:Win64=ade-0.1.1f
Xbuild_gapi_standalone:Mac=ade-0.1.1f
build_gapi_standalone:Linux x64 Debug=ade-0.1.1f

build_image:Custom=centos:7
buildworker:Custom=linux-1
build_gapi_standalone:Custom=ade-0.1.1f

Xbuild_image:Custom=ubuntu-openvino-2021.3.0:20.04
build_image:Custom Win=openvino-2021.4.1
build_image:Custom Mac=openvino-2021.2.0

buildworker:Custom Win=windows-3

test_modules:Custom=gapi,python2,python3,java
test_modules:Custom Win=gapi,python2,python3,java
test_modules:Custom Mac=gapi,python2,python3,java

buildworker:Custom=linux-1
# disabled due high memory usage: test_opencl:Custom=ON
Xtest_opencl:Custom=OFF
Xtest_bigdata:Custom=1
Xtest_filter:Custom=*

CPU_BASELINE:Custom Win=AVX512_SKX
CPU_BASELINE:Custom=SSE4_2

Contributor

sivanov-work left a comment

Looks similar to the previous PR, LGTM.

cv::Size sz;
MatType type = -1;
int dtype = -1;
double scale = 1.0;
Contributor

Is scale not a configurable parameter?

Member Author

anna-khakimova Dec 3, 2021

Until the bug "Fluid: Add scaling support to MulC kernel." is fixed, the MulC kernel doesn't support scaling, so it makes no sense to configure it via test parameters.

float* sc = scratch.OutLine<float>();

for (int i = 0; i < scratch.length(); ++i)
    sc[i] = static_cast<float>(_scalar[i % chan]);
Contributor

This looks similar to other places... Can all of these similar places be aggregated into a single inline function, such as load_scalar_from_scratch or something more meaningful?
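A minimal sketch of the kind of helper this comment asks for, written against plain arrays rather than the Fluid scratch buffer. The name `load_scalar_to_scratch` and its signature are hypothetical and are not part of OpenCV; the loop body mirrors the snippet quoted above.

```cpp
#include <cassert>

// Hypothetical helper (not an OpenCV API): repeats the per-channel scalar
// values across a scratch line so that sc[i] == scalar[i % chan],
// converting each value to float along the way.
inline void load_scalar_to_scratch(const double scalar[], int chan,
                                   float sc[], int length)
{
    assert(chan > 0 && length >= 0);
    for (int i = 0; i < length; ++i)
        sc[i] = static_cast<float>(scalar[i % chan]);
}
```

With `chan = 3` and `length = 7`, the scratch line would hold the three scalar values repeated in channel order, ending with the first one again.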


initMatsRandU(type, sz, dtype, false);

// OpenCV code ///////////////////////////////////////////////////////////
- cv::multiply(in_mat1, sc, out_mat_ocv, 1, dtype);
+ cv::multiply(in_mat1, sc, out_mat_ocv, scale, dtype);
Contributor

scale is double here, but in the implementation functions it is float: is there any compile warning about it?

Member Author

If we keep scale as double, we will have to convert all the data to double. Only 2 elements fit in a 128-bit vector, so we will need twice as many iterations, each including load/store and many other high-latency operations. This will significantly reduce performance.
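The lane arithmetic behind this argument can be sketched as follows. The helper names `lanes` and `full_iterations` are illustrative only (not OpenCV functions), and the numbers assume the usual 4-byte `float` and 8-byte `double`.

```cpp
#include <cstddef>

// Number of elements of type T that fit in one SIMD register
// of the given width in bits (128 by default).
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits = 128)
{
    return register_bits / (8 * sizeof(T));
}

// Number of full-vector iterations needed to process `length` elements
// (any tail shorter than one vector is not counted).
template <typename T>
constexpr std::size_t full_iterations(std::size_t length,
                                      std::size_t register_bits = 128)
{
    return length / lanes<T>(register_bits);
}
```

Under those assumptions a 128-bit register holds 4 floats but only 2 doubles, so a 1920-element row takes 480 vector iterations in float and 960 in double.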

Member Author

The SIMD versions of the Mul and Div kernels were also implemented with this approach (by me, and earlier by Evgeny Latkin).

Contributor

Sorry if I was confusing: I meant the interface part, not the implementation. Does cv::multiply accept a double (otherwise a warning must happen), and how and where is it converted to float in the G-API kernels?
If yes, then there is some mismatch between the interface and its implementation. On the other hand, this situation doesn't affect scale much, because the expected values are probably x2, x4, x8, etc.

@@ -138,6 +138,33 @@ SUBC_SIMD(float, float)

#undef SUBC_SIMD

#define MULC_SIMD(SRC, DST) \
int mulc_simd(const SRC in[], const float scalar[], DST out[], \
const int length, const int chan, const float scale) \
Contributor

It is float here, but the unit test uses double. Maybe that is a minor question, though...

Member Author

If we keep scale as double, we will have to convert all the data to double. Only 2 elements fit in a 128-bit vector, so we will need twice as many iterations, each including load/store and many other high-latency operations. This will significantly reduce performance.

case 2: \
case 4: \
{ \
if (std::fabs(scale - 1.0f) <= FLT_EPSILON) \
Contributor

Could you please add a comment explaining what happens here?
If scale ~ 1, do we then use the scalar version?

Member Author

anna-khakimova Dec 3, 2021

If scale = 1.0, we go to the branch without scaling, i.e. a*scalar only.
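A scalar (non-SIMD) sketch of the branch described here. This is not the kernel code from the PR: `mulc_reference` is a hypothetical name, and only the `std::fabs(scale - 1.0f) <= FLT_EPSILON` check is taken from the diff above.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>

// Hypothetical reference routine (not the PR's SIMD kernel):
// out[i] = in[i] * scalar[i % chan] * scale, with the extra multiply
// by scale skipped when scale is equal to 1 within FLT_EPSILON.
inline void mulc_reference(const float in[], const float scalar[], float out[],
                           int length, int chan, float scale)
{
    assert(chan > 0);
    if (std::fabs(scale - 1.0f) <= FLT_EPSILON)
    {
        // Branch without scaling: a * scalar only.
        for (int i = 0; i < length; ++i)
            out[i] = in[i] * scalar[i % chan];
    }
    else
    {
        // Branch with scaling: scale * a * scalar.
        for (int i = 0; i < length; ++i)
            out[i] = scale * in[i] * scalar[i % chan];
    }
}
```

Both branches produce the same result when scale is exactly 1; the split only saves one multiplication per element in the common unscaled case.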

@dmatveev dmatveev added this to the 4.5.5 milestone Dec 3, 2021
@dmatveev dmatveev self-assigned this Dec 3, 2021
@alalek alalek merged commit c391080 into opencv:4.x Dec 3, 2021
Member

alalek commented Dec 3, 2021

Several GPU tests failed:

[  PASSED  ] 9369 tests.
[  FAILED  ] 27 tests, listed below:
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/3, where GetParam() = (compare_f, 128x128, 8UC1, 5, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/7, where GetParam() = (compare_f, 128x128, 8UC3, 5, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/11, where GetParam() = (compare_f, 128x128, 16UC1, 5, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/15, where GetParam() = (compare_f, 128x128, 16SC1, 5, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/16, where GetParam() = (compare_f, 128x128, 32FC1, -1, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/19, where GetParam() = (compare_f, 128x128, 32FC1, 5, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/23, where GetParam() = (compare_f, 640x480, 8UC1, 5, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/27, where GetParam() = (compare_f, 640x480, 8UC3, 5, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/31, where GetParam() = (compare_f, 640x480, 16UC1, 5, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/35, where GetParam() = (compare_f, 640x480, 16SC1, 5, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/36, where GetParam() = (compare_f, 640x480, 32FC1, -1, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/38, where GetParam() = (compare_f, 640x480, 32FC1, 2, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/39, where GetParam() = (compare_f, 640x480, 32FC1, 5, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/43, where GetParam() = (compare_f, 1280x720, 8UC1, 5, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/47, where GetParam() = (compare_f, 1280x720, 8UC3, 5, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/51, where GetParam() = (compare_f, 1280x720, 16UC1, 5, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/55, where GetParam() = (compare_f, 1280x720, 16SC1, 5, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/56, where GetParam() = (compare_f, 1280x720, 32FC1, -1, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/58, where GetParam() = (compare_f, 1280x720, 32FC1, 2, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/59, where GetParam() = (compare_f, 1280x720, 32FC1, 5, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/63, where GetParam() = (compare_f, 1920x1080, 8UC1, 5, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/67, where GetParam() = (compare_f, 1920x1080, 8UC3, 5, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/71, where GetParam() = (compare_f, 1920x1080, 16UC1, 5, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/75, where GetParam() = (compare_f, 1920x1080, 16SC1, 5, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/76, where GetParam() = (compare_f, 1920x1080, 32FC1, -1, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/78, where GetParam() = (compare_f, 1920x1080, 32FC1, 2, { gapi.kernel_package })
[  FAILED  ] MulDoublePerfTestGPU/MulDoublePerfTest.TestPerformance/79, where GetParam() = (compare_f, 1920x1080, 32FC1, 5, { gapi.kernel_package })

"Custom Win" by default doesn't run OpenCL testing.
BTW, windows-3 has Rocket Lake CPU with AVX512 support (i7-11700K).

anna-khakimova
Member Author

There are several GPU tests failed:

[the same 27 failed MulDoublePerfTestGPU tests as listed in the comment above]

"Custom Win" by default doesn't run OpenCL testing. BTW, windows-3 has Rocket Lake CPU with AVX512 support (i7-11700K).

I've configured my Windows machine the same way as the mentioned "Custom Win" CI check and run the test, but unfortunately I couldn't reproduce these failures. I can deliver a fix for them in a separate PR. Could you please run this test on the new PR?

@alalek alalek mentioned this pull request Dec 30, 2021
@alalek alalek mentioned this pull request Feb 22, 2022
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
* GAPI Fluid: SIMD for MulC kernel.

* Changes for MulDouble kernel.