-
-
Notifications
You must be signed in to change notification settings - Fork 56.2k
imgproc: supports CV_SIMD_SCALABLE in pointSetBoundingRect #27479
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 4.x
Are you sure you want to change the base?
Conversation
Performance stats: SpacemiT K1 (RISC-V):
i7-12700K:
m2:
|
} | ||
for(int j = 16; j < VTraits<v_uint8>::vlanes(); j*=2) | ||
constexpr int max_nlanes = VTraits<v_int32>::max_nlanes; | ||
int arr_minval[max_nlanes], arr_maxval[max_nlanes]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do better than this if we have something like vtrn1q and vtrn2q in universal intrinsics, which does something like this:
v: [x1 y1 x2 y2]
vtrn1q(v, v): [x1 x1 x2 x2]
vtrn2q(v, v): [y1 y1 y2 y2]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO you can add extra branch with #ifdef CV_NEON
. It gives you a chance to fix regressions on Mac M chips.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure we can use CV_NEON for conditional compilation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested with vtrn and actually the performance did not go better on m2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you push it as separate branch? It may be useful with other ARM configs. I want to test it with Jetson.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me give you a patch instead.
diff --git a/modules/imgproc/src/geometry.cpp b/modules/imgproc/src/geometry.cpp
index c98684f6d4..412460e0e2 100644
--- a/modules/imgproc/src/geometry.cpp
+++ b/modules/imgproc/src/geometry.cpp
@@ -739,6 +739,16 @@ static Rect pointSetBoundingRect( const Mat& points )
minval = v_min(ptXY2, minval);
maxval = v_max(ptXY2, maxval);
}
+ #if CV_NEON_AARCH64
+ int32x4_t x_minval = vtrn1q_s32(minval.val, minval.val);
+ int32x4_t y_minval = vtrn2q_s32(minval.val, minval.val);
+ int32x4_t x_maxval = vtrn1q_s32(maxval.val, maxval.val);
+ int32x4_t y_maxval = vtrn2q_s32(maxval.val, maxval.val);
+ xmin = vminvq_s32(x_minval);
+ ymin = vminvq_s32(y_minval);
+ xmax = vmaxvq_s32(x_maxval);
+ ymax = vmaxvq_s32(y_maxval);
+ #else
constexpr int max_nlanes = VTraits<v_int32>::max_nlanes;
int arr_minval[max_nlanes], arr_maxval[max_nlanes];
vx_store(arr_minval, minval);
@@ -754,6 +764,7 @@ static Rect pointSetBoundingRect( const Mat& points )
if (xmax < arr_maxval[2*j]) xmax = arr_maxval[2*j];
if (ymax < arr_maxval[2*j+1]) ymax = arr_maxval[2*j+1];
}
+ #endif
#endif
for( ; i < npoints; i++ )
{
@@ -794,6 +805,17 @@ static Rect pointSetBoundingRect( const Mat& points )
minval = v_min(ptXY2, minval);
maxval = v_max(ptXY2, maxval);
}
+ #if CV_NEON_AARCH64
+ v_int32 i_minval = v_floor(minval), i_maxval = v_floor(maxval);
+ int32x4_t x_minval = vtrn1q_s32(i_minval.val, i_minval.val);
+ int32x4_t y_minval = vtrn2q_s32(i_minval.val, i_minval.val);
+ int32x4_t x_maxval = vtrn1q_s32(i_maxval.val, i_maxval.val);
+ int32x4_t y_maxval = vtrn2q_s32(i_maxval.val, i_maxval.val);
+ xmin = vminvq_s32(x_minval);
+ ymin = vminvq_s32(y_minval);
+ xmax = vmaxvq_s32(x_maxval);
+ ymax = vmaxvq_s32(y_maxval);
+ #else
constexpr int max_nlanes = VTraits<v_int32>::max_nlanes;
int arr_minval[max_nlanes], arr_maxval[max_nlanes];
vx_store(arr_minval, v_floor(minval));
@@ -809,6 +831,7 @@ static Rect pointSetBoundingRect( const Mat& points )
if (xmax < arr_maxval[2*j]) xmax = arr_maxval[2*j];
if (ymax < arr_maxval[2*j+1]) ymax = arr_maxval[2*j+1];
}
+ #endif
#endif
for( ; i < npoints; i++ )
{
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I observed that with older version of gcc (8.3), the halide trick which deals with the tail of simd processing, can lead to some performance degradation. Newer compiler seems fine.
However, if the halide trick is dropped, the performance on i7-12700K also degrades.
Hm, there is strange fault on armv7:
I'm debugging it right now. |
The test check the function behaviour with unaligned input. It was introduced by Maxim in c4ce942. armv7 is more sensitive to alignment and triggers sigbus:
Newer generation just charges some performance penalties here. |
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.