sync : ggml #3171

ggerganov · 2025-05-19T10:40:09Z

No description provided.

This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more difficult for various reasons so I haven't done it. Performance for this shader is around 2.5x better than for the scalar shader when doing prompt processing. Some of the benefit may be from other optimizations like staging through shared memory, or splitting by rows.

This PR improves q6_k_q8_k gemm kernel with arm64 i8mm instruction.

Tested on neoverse-n2 with llama3 8b q6_k quantization model.
- 40% ~ 54% S_PP uplift for all batch sizes
- 16% ~ 47% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q6_K.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64
---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |    78.52 |   109.18 |    18.63 |    18.88 |
|   128 |    128 |    2 |    84.62 |   123.94 |    34.54 |    36.92 |
|   128 |    128 |    4 |    84.36 |   122.49 |    52.65 |    61.32 |
|   128 |    128 |    8 |    90.52 |   138.87 |    63.46 |    84.41 |
|   128 |    128 |   16 |    90.11 |   138.56 |    71.04 |   101.33 |
|   128 |    128 |   32 |    89.81 |   137.79 |    75.14 |   110.47 |
---------------------------------------------------------------------
```
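The quoted uplift ranges can be re-derived from the benchmark rows. The snippet below is only a small arithmetic check, not part of the PR: the row values are copied from the table, while the names `ROWS`, `uplift_percent` and `uplift_range` are made up for this illustration. The results land close to the quoted ranges, which appear to be rounded.

```python
# Rows copied from the llama-batched-bench table:
# (B, S_PP original, S_PP this PR, S_TG original, S_TG this PR), all in tokens/s.
ROWS = [
    (1, 78.52, 109.18, 18.63, 18.88),
    (2, 84.62, 123.94, 34.54, 36.92),
    (4, 84.36, 122.49, 52.65, 61.32),
    (8, 90.52, 138.87, 63.46, 84.41),
    (16, 90.11, 138.56, 71.04, 101.33),
    (32, 89.81, 137.79, 75.14, 110.47),
]


def uplift_percent(original, new):
    """Relative throughput gain of `new` over `original`, in percent."""
    return (new / original - 1.0) * 100.0


def uplift_range(rows, orig_idx, new_idx, min_batch=1):
    """Smallest and largest uplift over all rows with batch size >= min_batch."""
    values = [
        uplift_percent(row[orig_idx], row[new_idx])
        for row in rows
        if row[0] >= min_batch
    ]
    return min(values), max(values)


if __name__ == "__main__":
    pp_lo, pp_hi = uplift_range(ROWS, 1, 2)
    tg_lo, tg_hi = uplift_range(ROWS, 3, 4, min_batch=4)
    print(f"S_PP uplift, all batch sizes: {pp_lo:.1f}% to {pp_hi:.1f}%")
    print(f"S_TG uplift, B >= 4:          {tg_lo:.1f}% to {tg_hi:.1f}%")
```

Restricting the S_TG range to `min_batch=4` mirrors the wording "batch size 4 and above"; at B = 1 and B = 2 the table shows only small token-generation gains.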

danielzgtg and others added 24 commits May 19, 2025 13:38

ggml : Fix missing backtrace on Linux (ggml/1228)

d8bebae

* Modern Linux defaults /proc/sys/kernel/yama/ptrace_scope to 1
* Fixed lldb attach
* Simplify by having the child do ggml_print_backtrace_symbols
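For context on the first bullet: with Yama's `ptrace_scope` set to 1, a process may only attach to its own descendants unless the target opts in via `prctl(PR_SET_PTRACER, ...)`, so a debugger launched from a forked child cannot attach to its parent by default. The sketch below only inspects that setting; it is not the ggml fix, and the helper names (`parse_ptrace_scope`, `child_can_attach_by_default`, `read_ptrace_scope`) are hypothetical.

```python
from pathlib import Path

# Meanings of the Yama ptrace_scope values, following the Linux kernel Yama docs.
PTRACE_SCOPE_MEANING = {
    0: "classic ptrace permissions (same-uid processes may attach)",
    1: "restricted: attach only to descendants, or where the target "
       "declared a ptracer with prctl(PR_SET_PTRACER)",
    2: "admin-only: attaching requires CAP_SYS_PTRACE",
    3: "no attach: ptrace attach is disabled entirely",
}


def parse_ptrace_scope(text):
    """Parse the contents of the ptrace_scope file into an int."""
    return int(text.strip())


def child_can_attach_by_default(scope):
    """Assumption for illustration: a forked child can attach to its parent
    without any opt-in from the parent only under the classic mode (0)."""
    return scope == 0


def read_ptrace_scope(path="/proc/sys/kernel/yama/ptrace_scope"):
    """Return the current setting, or None when Yama is not available."""
    p = Path(path)
    if not p.exists():
        return None
    return parse_ptrace_scope(p.read_text())


if __name__ == "__main__":
    scope = read_ptrace_scope()
    if scope is None:
        print("Yama ptrace_scope not present on this system")
    else:
        print(f"ptrace_scope = {scope}: {PTRACE_SCOPE_MEANING.get(scope, 'unknown')}")
        print("child may attach to parent by default:", child_can_attach_by_default(scope))
```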

ggml : fix apple OS check in ggml_print_backtrace (ggml/1229)

6c2e5c7

mnist: fix segmentation fault (ggml/1227)

f6e2799

ggml-cpu: Update KleidiAI to v1.6 and fix include directives (llama/13509)

3970f74

Signed-off-by: Dan Johansson <dan.johansson@arm.com>

metal : optimize multi-sequence FA vec kernel (llama/13493)

5b11174

* batched-bench : fix pp batch contents
* metal : optimize multi-sequence FA vec kernel

ggml-ci

metal : use FA-vec kernel up to batch size 20 (llama/13496)

7cdf758

* batched-bench : fix pp batch contents
* metal : optimize multi-sequence FA vec kernel

ggml-ci

* metal : use FA-vec kernel up to batch size 20

ggml-ci

vulkan: workaround FA compile failures on macos (llama/13517)

014d860

cmake: simplify vulkan shader test logic (llama/13263)

4611d55

CUDA: faster Deepseek FA, add Turing support (llama/13435)

cc84fce

CUDA: fix crash on large batch size for quant. MoE (llama/13537)

bd65c3b

sycl: use oneDNN for matrices multiplication (llama/12972)

0afc371

sycl: reordered Q4_K MMVQ (llama/13109)

511ae56

sycl: simplify bin_bcast_kernel (llama/13383)

4d8d4f8

gguf : use ggml log system (llama/13571)

577402d

* gguf : use ggml log system
* llama : remove unnecessary new lines in exception messages

sycl : fixed compilation warnings (llama/13582)

8bfb65c

metal : add FA-vec kernel for head size 64 (llama/13583)

82f8e5f

ggml-ci

vulkan: use scalar FA rather than coopmat2 when N==1 (llama/13554)

798a95c

vulkan: move common FA code to flash_attn_base.comp (llama/13556)

5378cbb

* vulkan: move common FA code to flash_attn_base.comp
* vulkan: move common FA index/stride setup code to flash_attn_base.comp
* build fix

cmake: use the current build config for vulkan-shaders-gen (llama/13595)

2d07c56

* fix: use the current build config for `vulkan-shaders-gen`
* fix: only pass a valid build type to `--config`

CANN: Support MOE Model MUL_MAT_ID (llama/13042)

6dedddb

Signed-off-by: noemotiovon <757486878@qq.com>

sync : ggml

abcf4b2

ggml-ci

talk-llama : sync llama.cpp

e5ededd

ggml-ci

danbev approved these changes May 19, 2025

ggerganov merged commit 6b6cf19 into master May 19, 2025
58 of 60 checks passed

ggerganov deleted the sync-whisper.cpp-25-05-19 branch May 19, 2025 11:58
