
[ARM] Integrate INT4→BF16 via KleidiAI, with fallback #158250


Draft

usamahz wants to merge 4 commits into base: main from integrate/int4-bf16-kleidiai

Conversation


@usamahz usamahz commented Jul 14, 2025

Co-authored-by: Nikhil Gupta nikhil.gupta2@arm.com

This PR enables the use of KleidiAI INT4 kernels that directly produce BF16 outputs within PyTorch.

✅ Key Results
• Integration of KleidiAI direct support for INT4→BF16 kernel execution.
• Kernels exposed in PyTorch: INT4 channelwise kernels now support BF16 output when available.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

Co-authored-by: Nikhil Gupta <nikhil.gupta2@arm.com>
@pytorch-bot pytorch-bot bot added the module: cpu (CPU-specific problem, e.g. perf, algorithm) and release notes: linalg_frontend (release notes category) labels on Jul 14, 2025

pytorch-bot bot commented Jul 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158250

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 49979eb with merge base ffaed8c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


usamahz commented Jul 14, 2025

@pytorchbot label "module: arm"

@pytorch-bot pytorch-bot bot added the module: arm (related to ARM architecture builds of PyTorch; includes Apple M1) label on Jul 14, 2025
if (cpuinfo_has_arm_bf16()) {
kleidiai::kai_quant_pack_lhs_int4_mm_bf16_channelwise(
output, input, weight, m, n, k);
} else {
Collaborator

There is no need to fall back to fp32. If the platform does not support the Arm BF16 vector extension, or the platform is not Arm, fall back to the BF16 scalar reference implementation.
If the BF16 scalar path is not supported, error out.
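A minimal sketch of the dispatch order suggested here, reusing the KleidiAI call from the diff above; the capability predicate and the reference kernel's argument list are assumptions for illustration, not the PR's actual code:

if (cpuinfo_has_arm_bf16()) {
  // Fast path: KleidiAI INT4 x BF16 channelwise kernel using the BF16 vector extension.
  kleidiai::kai_quant_pack_lhs_int4_mm_bf16_channelwise(
      output, input, weight, m, n, k);
} else if (can_run_bf16_scalar_reference) {  // assumed capability check, not a real API
  // Portable path: scalar reference kernel that still produces BF16 output.
  ref_dyn_quant_matmul_4bit_channelwise_kernel_bf16(
      output, input, weight, m, n, k);
} else {
  TORCH_CHECK(false, "INT4 -> BF16 matmul is not supported on this platform");
}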

Author

Done - 48e16e1

@@ -217,6 +237,17 @@ static void kai_quant_pack_lhs_int4_mm_channelwise(
const int64_t m,
const int64_t n,
const int64_t k) {

at::Tensor input_fp32;
Collaborator

This code is not needed, as we are not falling back to the fp32 path. We can keep the existing fp32 path untouched.

Author

Done - Removed

@@ -13,6 +15,19 @@

namespace at::native::kleidiai {

at::Tensor kleidi_bf16_to_fp32(const at::Tensor& src) {
Collaborator

Not required; there is no fallback to fp32.

Author

Done - Removed

@@ -5,6 +5,8 @@

namespace at::native::kleidiai {

at::Tensor kleidi_bf16_to_fp32(const at::Tensor& src);
Collaborator

Remove this

Author

Done - Removed

3 // Channelwise 4 bit GEMM
3, // Channelwise 4 bit GEMM
matmul_clamp_bf16_qai8dxp1x8_qsi4cxp8x8_1x8_neon_dotprod =
4, // Channelwise 4 bit GEMV
Collaborator

Please fix the comments and explain that this is a BF16 INT4 GEMM and how it differs from the other kernels.
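For illustration, the clarified comment could read something like this (the enum name and value are copied from the diff above; the exact wording landed in the PR may differ):

matmul_clamp_bf16_qai8dxp1x8_qsi4cxp8x8_1x8_neon_dotprod =
    4, // Channelwise 4-bit kernel with BF16 output: same dynamically quantized INT8 LHS and
       // channelwise INT4 RHS as the FP32 entries above, but the result is written as BF16.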

Author

Done

}
} // namespace at::native::kleidiai
#endif
#endif
Collaborator

Add a newline at the end of the file.

Author

Done

uint8_t* dst_act_mtx_f32 = reinterpret_cast<uint8_t*>(output.data_ptr());
const uint8_t* lhs_native_mtx_f32 =
reinterpret_cast<const uint8_t*>(input.data_ptr());
TORCH_CHECK(input_fp32.scalar_type() == at::kFloat, "Input tensor must be float.");
Collaborator

Remove the checks and warnings from the performance-critical code.

Author

Done

@@ -238,13 +269,17 @@ static void kai_quant_pack_lhs_int4_mm_channelwise(

const size_t lhs_packed_size =
kernel_packet.kai_get_lhs_packed_size(m, k, mr, kr, sr);
auto lhs_packed = std::make_unique<uint8_t[]>(lhs_packed_size);
const size_t padding = 128; // extra bytes
auto lhs_packed_tensor = at::empty({(int64_t)(lhs_packed_size + padding)}, at::kByte);
Collaborator

Why are we using at::empty instead of the original unique array?

Author

at::empty handles allocation, alignment, padding safety, and cleanup, and it can reuse memory internally. That gives memory safety and proper integration with PyTorch's allocator, preventing memory corruption.
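A small standalone sketch of the two allocation strategies being compared, assuming only the lhs_packed_size and 128-byte padding shown in the diff above:

#include <ATen/ATen.h>
#include <cstddef>
#include <cstdint>
#include <memory>

void allocate_lhs_packed(size_t lhs_packed_size) {
  const size_t padding = 128;  // extra bytes, as in the diff above

  // Raw owning array: the caller must reason about alignment and lifetime on its own.
  auto lhs_packed_raw = std::make_unique<uint8_t[]>(lhs_packed_size + padding);

  // Tensor-backed buffer: ATen's allocator provides aligned storage and automatic cleanup,
  // and the storage participates in PyTorch's internal memory reuse.
  at::Tensor lhs_packed_tensor =
      at::empty({static_cast<int64_t>(lhs_packed_size + padding)}, at::kByte);
  uint8_t* lhs_packed = lhs_packed_tensor.data_ptr<uint8_t>();

  (void)lhs_packed_raw;  // silence unused-variable warnings in this standalone sketch
  (void)lhs_packed;
}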

Co-author: Nikhil Gupta <nikhil.gupta2@arm.com>
float scalar_min,
float scalar_max) {

std::unique_ptr<int8_t[]> lhs_quantized(new int8_t[m * (k + sizeof(float) + sizeof(int32_t))]);
Contributor

are you adding elements to bytes here?

Author

Yeah, it's just adding extra bytes per row: 4 bytes for the scale (float) and 4 for the zero point (int32).

Since int8_t is 1 byte, we're still just allocating the total number of bytes needed. We're not mixing types, just accounting for all the data in raw byte space.
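For concreteness, with assumed example sizes the allocation works out like this (m and k here are made up for the sketch; in the kernel they come from the caller):

#include <cstddef>
#include <cstdint>
#include <memory>

const size_t m = 4, k = 64;  // example dimensions, not taken from the PR
const size_t bytes_per_row =
    k * sizeof(int8_t) + sizeof(float) + sizeof(int32_t);  // 64 + 4 + 4 = 72 bytes
const size_t total_bytes = m * bytes_per_row;              // 4 * 72 = 288 bytes
// int8_t is one byte, so allocating "elements" here is the same as allocating bytes.
std::unique_ptr<int8_t[]> lhs_quantized(new int8_t[total_bytes]);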

scalar_max);
}

static void ref_dyn_quant_matmul_4bit_groupwise_kernel_bf16(
Collaborator

This is not needed, as we are only adding the BF16 channelwise kernel?

Author

Cool, I'll remove it.

return result;
}

static inline uint16_t kai_cast_bf16_f32(float val) {
Collaborator

Remove kai_* prefix from functions in the reference code

Author

done

@@ -793,6 +793,215 @@ bool can_use_kleidiai(
}
#endif

static inline size_t roundup(size_t a, size_t b) {
Collaborator

Can we remove these helper functions from here and perform these operations within the reference kernel function?

Author

Done - integrated internally
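As a tiny illustration of folding the helper in, the round-up just becomes a local expression inside the reference kernel (the stride and alignment values below are assumptions for the sketch, not the PR's actual constants):

// roundup(a, b) == ((a + b - 1) / b) * b, written inline instead of as a file-level helper.
const size_t rhs_stride = (k + 1) / 2;                           // e.g. two INT4 values per byte
const size_t rhs_stride_padded = ((rhs_stride + 31) / 32) * 32;  // roundup(rhs_stride, 32)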

}
}

static void ref_quant_qa8dx_bf16(size_t m, size_t k, const uint16_t* lhs_native_mtx_bf16, int8_t* lhs_ref_mtx_qa8dx) {
Collaborator

Can you please move this inside the reference kernel function, like this:

auto input_quant_pack_8bit_channelwise =

Author

Done - added as a lambda
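A rough sketch of that refactor, with the quantization routine becoming a lambda local to the reference kernel; the parameter list is taken from the ref_quant_qa8dx_bf16 declaration above, and the body is only summarized:

auto input_quant_pack_8bit_channelwise =
    [&](size_t m, size_t k, const uint16_t* lhs_native_mtx_bf16, int8_t* lhs_ref_mtx_qa8dx) {
      // Per-row min/max scan, scale and zero-point computation, then each row's int8 values
      // are stored alongside its float scale and int32 zero point (as in the diff above).
    };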

}
}

static void ref_dyn_quant_matmul_4bit_channelwise_kernel_bf16(
Collaborator

@nikhil-arm commented Jul 21, 2025

Can we limit the whole reference kernel to one function instead of dividing it into smaller ones?
The kernel can effectively become ref_quant_qa8dx_bf16 + ref_matmul_mxn_mxk_nxk_bf16_qa8dx_qs4cx.

We will move the LHS quant logic out later to avoid code duplication when we add the BF16-INT4 groupwise kernel, as well as for the F32-INT4 kernels.

Author

Done - Refactored
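The consolidated kernel then has roughly this shape; the parameter list is an assumption pieced together from the declarations visible in the diff, and the two numbered steps summarize the bodies rather than reproducing them:

static void ref_dyn_quant_matmul_4bit_channelwise_kernel_bf16(
    size_t m, size_t n, size_t k,
    const uint16_t* lhs_bf16, const uint8_t* rhs_qs4cx, const float* rhs_scales,
    const float* bias, uint16_t* dst_bf16, float scalar_min, float scalar_max) {
  // 1) Dynamically quantize the BF16 LHS row by row to int8 with a per-row scale and
  //    zero point (the logic formerly in ref_quant_qa8dx_bf16, now a local lambda).
  // 2) For each output element, accumulate the int8 x int4 channelwise dot product,
  //    dequantize with the per-row and per-column scales, add the bias, clamp to
  //    [scalar_min, scalar_max], and store the result as BF16
  //    (the logic formerly in ref_matmul_mxn_mxk_nxk_bf16_qa8dx_qs4cx).
}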

@robert-hardwick
Collaborator

@pytorchbot label "ciflow/linux-aarch64"

@pytorch-bot pytorch-bot bot added the ciflow/linux-aarch64 (linux aarch64 CI workflow) label on Jul 21, 2025

pytorch-bot bot commented Jul 21, 2025

To add the ciflow label ciflow/linux-aarch64 please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/linux-aarch64 (linux aarch64 CI workflow) label on Jul 21, 2025

usamahz commented Jul 21, 2025

@pytorchbot label "ciflow/linux-aarch64"

@pytorch-bot pytorch-bot bot added the ciflow/linux-aarch64 (linux aarch64 CI workflow) label on Jul 21, 2025
@usamahz usamahz changed the title from "Integrate INT4→BF16 via KleidiAI, with fallback to FP32" to "Integrate INT4→BF16 via KleidiAI, with fallback" on Jul 21, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/linux-aarch64 (linux aarch64 CI workflow) label on Jul 23, 2025
Comment on lines +811 to +824
// Cast bfloat16 to float32 inline
auto cast_bf16_to_f32 = [](uint16_t bf16_val) {
uint32_t tmp = static_cast<uint32_t>(bf16_val) << 16;
float f;
std::memcpy(&f, &tmp, sizeof(f));
return f;
};

// Cast float32 to bfloat16 inline
auto cast_f32_to_bf16 = [](float f) {
uint32_t bits;
std::memcpy(&bits, &f, sizeof(bits));
return static_cast<uint16_t>(bits >> 16);
};
Collaborator

I wonder if we can make use of the vectorized class convert for the cast here. It might turn out that autovectorization does a good enough job, but it's a possible perf improvement.

Author

Since this is the scalar reference implementation, meant to run on CPUs without SVE/NEON, I avoided vectorization intentionally. The focus here is correctness and portability, not performance. This inline cast is simple and safe for platforms without a vector ISA.
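As a quick standalone check of the two casts in the diff above (input value chosen purely for illustration), the f32→bf16 direction truncates the low 16 mantissa bits, so the round trip is slightly lossy:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  auto cast_f32_to_bf16 = [](float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return static_cast<uint16_t>(bits >> 16);  // keep sign, exponent, top 7 mantissa bits
  };
  auto cast_bf16_to_f32 = [](uint16_t bf16_val) {
    uint32_t tmp = static_cast<uint32_t>(bf16_val) << 16;
    float f;
    std::memcpy(&f, &tmp, sizeof(f));
    return f;
  };
  const float x = 1.2345678f;
  std::printf("%.7f -> %.7f\n", x, cast_bf16_to_f32(cast_f32_to_bf16(x)));  // 1.2345678 -> 1.2343750
  return 0;
}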

float rmax = std::max(0.0f, mx);
const float qmin = static_cast<float>(INT8_MIN);
const float qmax = static_cast<float>(INT8_MAX);
float scale = (rmin == rmax) ? 1.f : (qmax - qmin) / (rmax - rmin);
Collaborator

What happens when rmax == rmin? (We should catch this.)

Collaborator

It sets scale to 1. If rmin and rmax are equal, there is no need for scaling in quantization, as the tensor has the same value throughout.

Collaborator

Just to double-check: can we guarantee this check (rmin == rmax) is safe when dealing with float values that are really close to each other?

float err_min = qmin + des_min;
float err_max = qmax + des_max;
float zp_f = (err_min + err_max) > 0
? qmin - des_min
Collaborator

Is this a standard way of calculating zero points?

Collaborator

Yes, we are using this logic in the KleidiAI reference kernels.
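For context, the complete per-row scale and zero-point computation that this hunk belongs to looks roughly like the following in KleidiAI-style reference code (a sketch only: rmin/rmax are the row extrema from the surrounding code, <algorithm>, <cmath>, and <cstdint> are assumed to be included, and the PR's exact code may differ):

const float qmin = static_cast<float>(INT8_MIN);  // -128
const float qmax = static_cast<float>(INT8_MAX);  //  127
const float scale = (rmin == rmax) ? 1.f : (qmax - qmin) / (rmax - rmin);

// Row extrema expressed in quantized units.
const float des_min = rmin * scale;
const float des_max = rmax * scale;

// Pick the zero point from whichever endpoint error dominates, clamp it into the
// representable int8 range, and round to the nearest integer.
const float err_min = qmin + des_min;
const float err_max = qmax + des_max;
float zp_f = (err_min + err_max) > 0 ? qmin - des_min : qmax - des_max;
zp_f = std::max(zp_f, qmin);
zp_f = std::min(zp_f, qmax);
const int32_t zero_point = static_cast<int32_t>(std::lrintf(zp_f));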

@usamahz usamahz changed the title from "Integrate INT4→BF16 via KleidiAI, with fallback" to "[ARM] Integrate INT4→BF16 via KleidiAI, with fallback" on Jul 25, 2025
@nikhil-arm
Collaborator

nikhil-arm commented Aug 5, 2025

kai_pack_int4_rhs in kai_kernels.cpp is not specialised for the bf16 case. Even if the rhs kernel is the same, we need to use the pack size and packing from the correct struct.
The same applies to torch/_meta_registrations.py: getting the KleidiAI packed size in Python is not specialized for the bf16 case.

kai_pack_rhs_int4_size is also not specialized for the bf16 case.
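For illustration only, "specialising" the packed-size query could look roughly like this; the packet accessors and struct members below are assumptions for the sketch, not the actual kai_kernels.cpp API:

size_t kai_pack_rhs_int4_size(int64_t n, int64_t k, at::ScalarType out_dtype) {
  // Pick the kernel packet that matches the requested output dtype so both the packed
  // size and the packing routine come from the same (correct) struct.
  const auto& packet = (out_dtype == at::kBFloat16)
      ? channelwise_packet_bf16()   // hypothetical accessor for the BF16 kernel packet
      : channelwise_packet_f32();   // hypothetical accessor for the existing FP32 packet
  return packet.kai_get_rhs_packed_size(n, k, packet.nr, packet.kr, packet.sr);
}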

Add Bias in BF16 Reference Kernels

Refactoring

Update Comments
@usamahz usamahz force-pushed the integrate/int4-bf16-kleidiai branch from dcb6d4a to 49979eb on August 12, 2025 at 13:17
Labels
module: arm, module: cpu, module: inductor, open source, release notes: linalg_frontend

7 participants