Skip to content

core: ARM64 loop unrolling in kmeans to improve Weighted Filter performance #27596

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: 4.x
Choose a base branch
from

Conversation

pratham-mcw
Copy link
Contributor

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.

  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV

  • The PR is proposed to the proper branch

  • This PR improves the performance of the Weighted Filter function from the ximgproc module on Windows on ARM64.

  • The optimization is achieved by unrolling two performance-critical loops in the generateCentersPP function in modules/core/src/kmeans.cpp, which is internally used by the Weighted Filter function.

  • The unrolling is enabled only for ARM64 builds using #if defined(_M_ARM64) guards to preserve compatibility and maintain performance on other architectures.

Performance Improvements:

  • Improves execution time for Weighted Filter performance tests on ARM64 without affecting other platforms.
image

@@ -120,6 +120,20 @@ static void generateCentersPP(const Mat& data, Mat& _out_centers,
{
double p = (double)rng*sum0;
int ci = 0;
#if defined(_M_ARM64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#if CV_ENABLE_UNROLLED should be checked too.

for (; ci + 1 < N - 1; ci += 2)
{
p -= dist[ci];
if (p <= 0) break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

break doesn't provide the same result due to the next loop.
At least for p value.


Don't squash if body - separate lines are necessary for accurate reports of code coverage analyses.

Copy link
Contributor Author

@pratham-mcw pratham-mcw Aug 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @opencv-alalek,

  • Reverted the earlier unrolling of the p loop to maintain correctness, as you mentioned.
  • Increased the unrolling factor of the existing tdist2 accumulation loop to ×8, preserving original behavior while improving benchmark performance.
  • I have updated the changes in this commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants