Skip to content

imgproc: supports CV_SIMD_SCALABLE in pointSetBoundingRect #27479

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: 4.x
Choose a base branch
from

Conversation

fengyuentau
Copy link
Member

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@fengyuentau
Copy link
Member Author

fengyuentau commented Jun 25, 2025

Performance stats:

SpacemiT K1 (RISC-V):

                  Name of Test                   base-gcc patch-gcc patch-gcc 
                                                                        vs    
                                                                     base-gcc 
                                                                    (x-factor)
BoundingRect::TestBoundingRect::(CV_32F, 400)     0.001     0.001      1.95   
BoundingRect::TestBoundingRect::(CV_32F, 511)     0.002     0.001      2.12   
BoundingRect::TestBoundingRect::(CV_32F, 1000)    0.003     0.001      2.53   
BoundingRect::TestBoundingRect::(CV_32F, 10000)   0.025     0.011      2.15   
BoundingRect::TestBoundingRect::(CV_32F, 100000)  0.245     0.108      2.27   
BoundingRect::TestBoundingRect::(CV_32S, 400)     0.001     0.001      1.35   
BoundingRect::TestBoundingRect::(CV_32S, 511)     0.001     0.001      1.38   
BoundingRect::TestBoundingRect::(CV_32S, 1000)    0.002     0.001      1.54   
BoundingRect::TestBoundingRect::(CV_32S, 10000)   0.014     0.011      1.25   
BoundingRect::TestBoundingRect::(CV_32S, 100000)  0.139     0.109      1.27

                  Name of Test                   base-clang patch-clang patch-clang
                                                                            vs     
                                                                        base-clang 
                                                                        (x-factor) 
BoundingRect::TestBoundingRect::(CV_32F, 400)      0.002       0.001       2.68    
BoundingRect::TestBoundingRect::(CV_32F, 511)      0.002       0.001       2.85    
BoundingRect::TestBoundingRect::(CV_32F, 1000)     0.004       0.001       3.38    
BoundingRect::TestBoundingRect::(CV_32F, 10000)    0.039       0.013       3.03    
BoundingRect::TestBoundingRect::(CV_32F, 100000)   0.380       0.120       3.17    
BoundingRect::TestBoundingRect::(CV_32S, 400)      0.001       0.001       1.89    
BoundingRect::TestBoundingRect::(CV_32S, 511)      0.002       0.001       2.01    
BoundingRect::TestBoundingRect::(CV_32S, 1000)     0.003       0.001       2.35    
BoundingRect::TestBoundingRect::(CV_32S, 10000)    0.026       0.013       2.02    
BoundingRect::TestBoundingRect::(CV_32S, 100000)   0.248       0.120       2.06

i7-12700K:

                  Name of Test                   i7-base i7-patch  i7-patch 
                                                                      vs    
                                                                   i7-base  
                                                                  (x-factor)
BoundingRect::TestBoundingRect::(CV_32F, 400)     0.000   0.000      1.05   
BoundingRect::TestBoundingRect::(CV_32F, 511)     0.000   0.000      1.03   
BoundingRect::TestBoundingRect::(CV_32F, 1000)    0.000   0.000      1.04   
BoundingRect::TestBoundingRect::(CV_32F, 10000)   0.004   0.004      1.04   
BoundingRect::TestBoundingRect::(CV_32F, 100000)  0.042   0.041      1.02   
BoundingRect::TestBoundingRect::(CV_32S, 400)     0.000   0.000      1.09   
BoundingRect::TestBoundingRect::(CV_32S, 511)     0.000   0.000      1.09   
BoundingRect::TestBoundingRect::(CV_32S, 1000)    0.001   0.001      1.01   
BoundingRect::TestBoundingRect::(CV_32S, 10000)   0.005   0.004      1.11   
BoundingRect::TestBoundingRect::(CV_32S, 100000)  0.044   0.042      1.03 

m2:

                  Name of Test                   m2-base m2-patch  m2-patch
                                                                      vs
                                                                   m2-base
                                                                  (x-factor)
BoundingRect::TestBoundingRect::(CV_32F, 400)     0.000   0.000      1.10
BoundingRect::TestBoundingRect::(CV_32F, 511)     0.000   0.000      0.97
BoundingRect::TestBoundingRect::(CV_32F, 1000)    0.000   0.000      0.98
BoundingRect::TestBoundingRect::(CV_32F, 10000)   0.003   0.003      0.99
BoundingRect::TestBoundingRect::(CV_32F, 100000)  0.031   0.031      1.00
BoundingRect::TestBoundingRect::(CV_32S, 400)     0.000   0.000      0.95
BoundingRect::TestBoundingRect::(CV_32S, 511)     0.000   0.000      0.92
BoundingRect::TestBoundingRect::(CV_32S, 1000)    0.000   0.000      1.04
BoundingRect::TestBoundingRect::(CV_32S, 10000)   0.003   0.003      1.00
BoundingRect::TestBoundingRect::(CV_32S, 100000)  0.029   0.031      0.92

}
for(int j = 16; j < VTraits<v_uint8>::vlanes(); j*=2)
constexpr int max_nlanes = VTraits<v_int32>::max_nlanes;
int arr_minval[max_nlanes], arr_maxval[max_nlanes];
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do better than this if we have something like vtrn1q and vtrn2q in universal intrinsics, which does something like this:

v: [x1 y1 x2 y2]

vtrn1q(v, v): [x1 x1 x2 x2]
vtrn2q(v, v): [y1 y1 y2 y2]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO you can add extra branch with #ifdef CV_NEON. It gives you a chance to fix regressions on Mac M chips.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure we can use CV_NEON for conditional compilation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested with vtrn and actually the performance did not go better on m2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you push it as separate branch? It may be useful with other ARM configs. I want to test it with Jetson.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me give you a patch instead.

diff --git a/modules/imgproc/src/geometry.cpp b/modules/imgproc/src/geometry.cpp
index c98684f6d4..412460e0e2 100644
--- a/modules/imgproc/src/geometry.cpp
+++ b/modules/imgproc/src/geometry.cpp
@@ -739,6 +739,16 @@ static Rect pointSetBoundingRect( const Mat& points )
             minval = v_min(ptXY2, minval);
             maxval = v_max(ptXY2, maxval);
         }
+    #if CV_NEON_AARCH64
+        int32x4_t x_minval = vtrn1q_s32(minval.val, minval.val);
+        int32x4_t y_minval = vtrn2q_s32(minval.val, minval.val);
+        int32x4_t x_maxval = vtrn1q_s32(maxval.val, maxval.val);
+        int32x4_t y_maxval = vtrn2q_s32(maxval.val, maxval.val);
+        xmin = vminvq_s32(x_minval);
+        ymin = vminvq_s32(y_minval);
+        xmax = vmaxvq_s32(x_maxval);
+        ymax = vmaxvq_s32(y_maxval);
+    #else
         constexpr int max_nlanes = VTraits<v_int32>::max_nlanes;
         int arr_minval[max_nlanes], arr_maxval[max_nlanes];
         vx_store(arr_minval, minval);
@@ -754,6 +764,7 @@ static Rect pointSetBoundingRect( const Mat& points )
             if (xmax < arr_maxval[2*j])   xmax = arr_maxval[2*j];
             if (ymax < arr_maxval[2*j+1]) ymax = arr_maxval[2*j+1];
         }
+    #endif
 #endif
         for( ; i < npoints; i++ )
         {
@@ -794,6 +805,17 @@ static Rect pointSetBoundingRect( const Mat& points )
             minval = v_min(ptXY2, minval);
             maxval = v_max(ptXY2, maxval);
         }
+    #if CV_NEON_AARCH64
+        v_int32 i_minval = v_floor(minval), i_maxval = v_floor(maxval);
+        int32x4_t x_minval = vtrn1q_s32(i_minval.val, i_minval.val);
+        int32x4_t y_minval = vtrn2q_s32(i_minval.val, i_minval.val);
+        int32x4_t x_maxval = vtrn1q_s32(i_maxval.val, i_maxval.val);
+        int32x4_t y_maxval = vtrn2q_s32(i_maxval.val, i_maxval.val);
+        xmin = vminvq_s32(x_minval);
+        ymin = vminvq_s32(y_minval);
+        xmax = vmaxvq_s32(x_maxval);
+        ymax = vmaxvq_s32(y_maxval);
+    #else
         constexpr int max_nlanes = VTraits<v_int32>::max_nlanes;
         int arr_minval[max_nlanes], arr_maxval[max_nlanes];
         vx_store(arr_minval, v_floor(minval));
@@ -809,6 +831,7 @@ static Rect pointSetBoundingRect( const Mat& points )
             if (xmax < arr_maxval[2*j])   xmax = arr_maxval[2*j];
             if (ymax < arr_maxval[2*j+1]) ymax = arr_maxval[2*j+1];
         }
+    #endif
 #endif
         for( ; i < npoints; i++ )
         {

Copy link
Member Author

@fengyuentau fengyuentau Jun 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I observed that with older version of gcc (8.3), the halide trick which deals with the tail of simd processing, can lead to some performance degradation. Newer compiler seems fine.

However, if the halide trick is dropped, the performance on i7-12700K also degrades.

@vpisarev vpisarev self-requested a review July 9, 2025 19:45
@asmorkalov
Copy link
Contributor

Hm, there is strange fault on armv7:

[----------] 4 tests from Imgproc_BoundingRect_Types
[ RUN      ] Imgproc_BoundingRect_Types.accuracy/0, where GetParam() = 4
[       OK ] Imgproc_BoundingRect_Types.accuracy/0 (81 ms)
[ RUN      ] Imgproc_BoundingRect_Types.accuracy/1, where GetParam() = 5
[       OK ] Imgproc_BoundingRect_Types.accuracy/1 (82 ms)
[ RUN      ] Imgproc_BoundingRect_Types.alignment/0, where GetParam() = 4
***  ***: ./bin/opencv_test_imgproc terminated
***  ***: ./bin/opencv_test_imgproc terminated

I'm debugging it right now.

@asmorkalov
Copy link
Contributor

asmorkalov commented Jul 14, 2025

The test check the function behaviour with unaligned input. It was introduced by Maxim in c4ce942. armv7 is more sensitive to alignment and triggers sigbus:

Note: Google Test filter = Imgproc_BoundingRect_Types.alignment/0
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Imgproc_BoundingRect_Types
[ RUN      ] Imgproc_BoundingRect_Types.alignment/0, where GetParam() = 4

Program received signal SIGBUS, Bus error.
0xb6be3ab0 in pointSetBoundingRect(cv::Mat const&) () from /mnt/disk/opencv-build/lib/libopencv_imgproc.so.413
(gdb) bt
#0  0xb6be3ab0 in pointSetBoundingRect(cv::Mat const&) () from /mnt/disk/opencv-build/lib/libopencv_imgproc.so.413
Backtrace stopped: Cannot access memory at address 0x100fff8

Newer generation just charges some performance penalties here.

@asmorkalov asmorkalov self-requested a review July 14, 2025 12:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants