metal : use F32 attention accumulators in FA kernels #13975

ggerganov · 2025-06-02T16:06:40Z

It seems that the attention output accumulator `lo` overflows F16 at large contexts (more than 32k tokens). Using F32 accumulators fixes Gemma 3 27B at large contexts with Metal.
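For illustration only, here is a minimal CPU-side sketch of how an unnormalized attention output accumulator can overflow F16 while staying finite in F32. It is not the Metal kernel code: the helper names (`round_to`, `attention_output`), the constant value `64.0`, the equal attention scores, and the softmax denominator kept in full precision are all assumptions chosen to make the effect easy to see.

```python
import math
import struct

F16_MAX = 65504.0  # largest finite IEEE 754 binary16 value


def round_to(fmt: str, x: float) -> float:
    """Round x to the float format given by a struct code ("<e" = F16, "<f" = F32).

    struct raises OverflowError for finite values that do not fit;
    hardware would produce +/-inf, so that is what we return here.
    """
    try:
        return struct.unpack(fmt, struct.pack(fmt, x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def attention_output(n_kv: int, v: float, acc_fmt: str) -> float:
    """Toy single-element attention output with a reduced-precision accumulator.

    Hypothetical setup: every key has the same score, so each softmax
    weight exp(score - max) is 1.0, and every value element equals v.
    """
    acc = 0.0  # unnormalized output accumulator, rounded to acc_fmt each step
    s = 0.0    # softmax denominator, kept in full precision (an assumption)
    for _ in range(n_kv):
        w = 1.0
        acc = round_to(acc_fmt, acc + w * v)
        s += w
    return acc / s  # final normalization


# 40,000 keys is past the 32k context mentioned above.
out_f16 = attention_output(40_000, 64.0, "<e")
out_f32 = attention_output(40_000, 64.0, "<f")
print("F16 accumulator:", out_f16)
print("F32 accumulator:", out_f32)
```

In this toy setup the running sum passes `F16_MAX` after roughly a thousand keys, so the F16 result is no longer finite, whereas the F32 accumulator still normalizes back to the expected value.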

ggml-ci

metal : use F32 accumulators in FA kernels

21be70e

ggml-ci

github-actions bot added the ggml and Apple Metal labels Jun 2, 2025

ggerganov mentioned this pull request Jun 2, 2025

Eval bug: Gemma3 <unused32> spam #12433

Closed

ggerganov merged commit ea394d7 into master Jun 2, 2025
53 checks passed

ggerganov deleted the gg/metal-fa-acc-f32 branch June 2, 2025 18:33
