Make CAReduce more SIMD and memory friendly #1385

ricardoV94 · 2025-04-30T17:31:15Z

This PR makes two major changes:

Adds a SIMD-friendly branch with intermediate allocators for reductions over a contiguous dimension.
Allocates the output buffer aligned with the input dimensions, instead of always allocating it in C order.
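To make the two ideas above concrete, here is a minimal pure-Python sketch. It is not the C code that CAReduce generates. The function names (`lane_sum`, `output_strides_like_input`), the lane count of 4, and the 8-byte item size are all assumptions chosen for illustration. The first function shows the loop shape that tends to vectorize well (several independent partial accumulators over a contiguous run). The second shows one way to pick output strides that follow the input's axis ordering rather than defaulting to C order.

```python
def lane_sum(values, lanes=4):
    """Sum a contiguous run using `lanes` independent partial accumulators.

    Hypothetical illustration: `lanes=4` is an arbitrary choice, not a value
    taken from the PR.
    """
    acc = [0.0] * lanes
    n_main = len(values) - len(values) % lanes
    # Main loop: each lane depends only on its own accumulator, which is the
    # dependency pattern a C compiler can map onto SIMD registers.
    for start in range(0, n_main, lanes):
        for lane in range(lanes):
            acc[lane] += values[start + lane]
    # Combine the partial accumulators, then add the leftover tail elements
    # that did not fill a whole group of lanes.
    total = sum(acc)
    for v in values[n_main:]:
        total += v
    return total


def output_strides_like_input(shape, strides, axis, itemsize=8):
    """Byte strides for the reduced output that keep the input's axis order.

    `shape` and `strides` describe the input; `axis` is the reduced axis.
    Hypothetical helper: `itemsize=8` assumes float64 elements.
    """
    kept = [i for i in range(len(shape)) if i != axis]
    # Visit kept axes from fastest-varying (smallest input stride) to slowest,
    # so the output's fastest axis matches the input's fastest kept axis.
    by_speed = sorted(kept, key=lambda i: abs(strides[i]))
    out = {}
    step = itemsize
    for i in by_speed:
        out[i] = step
        step *= shape[i]
    return tuple(out[i] for i in kept)
```

For an F-ordered float64 input of shape `(3, 4, 5)` reduced over axis 0, the helper keeps axis 1 as the fastest-varying output axis, so the output is itself F-ordered and the inner loop can walk input and output with matching access patterns.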

Performance is now comparable to or better than NumPy.

Make CAReduce more SIMD friendly and do better allocation of output memory

3ef6fb8

ricardoV94 force-pushed the faster_careduce branch from 293af91 to 3ef6fb8 Compare May 1, 2025 19:44
