ggerganov (Member) commented Aug 27, 2025

The logic for creating the KQ mask tensor was overestimating n_kv. Instead of considering only the lengths of the sequences in the current ubatch, it considered the lengths of all sequences in the KV cache, so the allocated mask was much larger than necessary. This added performance overhead when computing a short sequence while a long sequence was present in the cache. The difference is quite noticeable with llama-server when running parallel sequences.
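To illustrate the difference, here is a minimal sketch of the two estimates on a toy model. It is not the llama.cpp implementation: `toy_kv_cache`, `used_len`, `n_pad`, `n_cells`, `pad_up`, `n_kv_all_seqs` and `n_kv_ubatch_seqs` are all hypothetical names, and the model simply assumes each sequence occupies its own run of cells and that the mask width is the longest relevant sequence length rounded up to a padding multiple.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy model (not the real llama.cpp data structures):
// used_len[s] is the number of cache cells currently occupied by sequence s.
struct toy_kv_cache {
    std::vector<uint32_t> used_len; // indexed by sequence id
    uint32_t n_pad;                 // mask width is rounded up to a multiple of this
    uint32_t n_cells;               // capacity available to a single sequence
};

// Round x up to the next multiple of n (n > 0).
static uint32_t pad_up(uint32_t x, uint32_t n) {
    return ((x + n - 1) / n) * n;
}

// Clamp the padded length to [n_pad, n_cells].
static uint32_t clamp_n_kv(const toy_kv_cache & kv, uint32_t max_len) {
    return std::min(kv.n_cells, std::max(kv.n_pad, pad_up(max_len, kv.n_pad)));
}

// Over-estimate: uses the longest sequence anywhere in the cache,
// even if that sequence is not part of the current ubatch.
uint32_t n_kv_all_seqs(const toy_kv_cache & kv) {
    uint32_t max_len = 0;
    for (uint32_t len : kv.used_len) {
        max_len = std::max(max_len, len);
    }
    return clamp_n_kv(kv, max_len);
}

// Tighter estimate: uses only the sequences that appear in the current ubatch.
uint32_t n_kv_ubatch_seqs(const toy_kv_cache & kv, const std::vector<int32_t> & ubatch_seqs) {
    uint32_t max_len = 0;
    for (int32_t s : ubatch_seqs) {
        assert(s >= 0 && (size_t) s < kv.used_len.size());
        max_len = std::max(max_len, kv.used_len[(size_t) s]);
    }
    return clamp_n_kv(kv, max_len);
}
```

With two sequences of 8000 and 40 cells, `n_pad = 256` and `n_cells = 8192`, the first function returns 8192 regardless of the ubatch, while the second returns 256 for a ubatch that contains only the short sequence. In this toy model that is a 32x narrower mask for the short sequence.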

ggerganov merged commit 1bded5a into master Aug 27, 2025
55 of 56 checks passed
ggerganov deleted the gg/kv-cache-fix-n_kv branch August 27, 2025 10:55
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Aug 28, 2025
…nemotron-nano-15409

* origin/master: (59 commits)
scripts: add sqlite3 check for compare-commits.sh (ggml-org#15633)
kv-cache : remove LLAMA_SET_ROWS checks (ggml-org#15505)
gguf-py: byteswapping improvements (ggml-org#12851)
cli : change log to warning to explain reason for stopping (ggml-org#15604)
model-conversion : add mmproj conversion target (ggml-org#15628)
cuda: Add cublasLt_static linking when GGML_STATIC is enabled (ggml-org#15622)
server: higher timeout for tests (ggml-org#15621)
presets : add qwen3-30B-a3b FIM (ggml-org#15616)
HIP: Enable support for ggml_backend_cuda_register_host_buffer (ggml-org#15615)
kv-cache : better estimate of n_kv for multi-sequence batches (ggml-org#15610)
CANN: refactor mask handling and improve performance in FA (ggml-org#15561)
ggml-cpu : add basic RVV support for vector f32 ops (ggml-org#15057)
common : add -m to bash completion for --model [no ci] (ggml-org#15591)
OpenCL: add fused group_norm/norm, mul, add (ggml-org#15314)
tests : fix test-opt with GGML_BACKEND_DL (ggml-org#15599)
SYCL: fix rms_norm_mul_add for tensor dim not a multiple of sg_size (ggml-org#15592)
mtmd : fix mtmd ios build (ggml-org#15579)
tests: add performance test for mul mat id (ggml-org#15543)
llamafile: PowerPC Sgemm Optimization (ggml-org#15558)
graph : fix assert in memory-less build_attn (ggml-org#15590)
...
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Aug 28, 2025
…upport

* origin/master: (61 commits)
...
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Aug 28, 2025
…g-model-disabled-agent-prefill

* origin/master: (76 commits)
...
Minh141120 pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 29, 2025