kv-cache : remove LLAMA_SET_ROWS checks #15505

ggerganov · 2025-08-22T12:21:01Z

With the adoption of ggml_set_rows() across all backends (#14661) we can now remove the old fallback path and simplify the implementation.
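To make the difference between the two write paths concrete, here is a minimal sketch in plain C++. It does not call ggml. `RowBuffer`, `write_contiguous` and `write_rows` are hypothetical names invented for this illustration, and the description of the old fallback as a contiguous write at a head offset is an assumption, not something stated in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy row-major 2D buffer standing in for one layer of a KV cache.
// Hypothetical type for illustration; not a ggml or llama.cpp structure.
struct RowBuffer {
    std::size_t n_rows;
    std::size_t n_cols;
    std::vector<float> data;

    RowBuffer(std::size_t rows, std::size_t cols)
        : n_rows(rows), n_cols(cols), data(rows * cols, 0.0f) {}

    float *       row(std::size_t i)       { return data.data() + i * n_cols; }
    const float * row(std::size_t i) const { return data.data() + i * n_cols; }
};

// Contiguous write (assumed shape of the old fallback):
// source row i lands at destination row head + i.
void write_contiguous(RowBuffer & dst, const RowBuffer & src, std::size_t head) {
    assert(src.n_cols == dst.n_cols);
    assert(head + src.n_rows <= dst.n_rows);
    for (std::size_t i = 0; i < src.n_rows; ++i) {
        std::copy(src.row(i), src.row(i) + src.n_cols, dst.row(head + i));
    }
}

// Scatter write (the idea behind a set-rows operation):
// source row i lands at destination row idxs[i], so the
// destination rows do not need to be adjacent.
void write_rows(RowBuffer & dst, const RowBuffer & src, const std::vector<int64_t> & idxs) {
    assert(src.n_cols == dst.n_cols);
    assert(idxs.size() == src.n_rows);
    for (std::size_t i = 0; i < src.n_rows; ++i) {
        assert(idxs[i] >= 0 && static_cast<std::size_t>(idxs[i]) < dst.n_rows);
        std::copy(src.row(i), src.row(i) + src.n_cols, dst.row(static_cast<std::size_t>(idxs[i])));
    }
}
```

With a scatter write available on every backend, a single code path can cover both the adjacent and the fragmented case, which is roughly why a separate fallback becomes redundant.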

ggerganov · 2025-08-22T12:38:47Z

The only backends currently affected by this change are CANN and OpenCL, and only when using a quantized KV cache. With the standard F16 KV cache there should be no problems.

@hipudding @max-krasnyansky @lhez FYI

ggml-ci


ggerganov · 2025-08-28T13:58:36Z

src/llama-kv-cache.cpp

        // only find a suitable slot for the ubatch. don't modify the cells yet
-        const auto sinfo_new = find_slot(ubatch, cont);
+        const auto sinfo_new = find_slot(ubatch, true);


This should have been false (the search should not require a contiguous slot). Fixing in #15638.
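As a rough illustration of why the flag matters, the sketch below models a slot search over a toy occupancy map. `find_free_cells` and its parameters are hypothetical and much simpler than the real `find_slot`, which is not reproduced here. The assumption is only that the second argument (`cont`) asks for a contiguous run of cells.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy slot search. used[i] == true means cell i is occupied.
// Returns the indices of n free cells, or an empty vector if none are found.
// Hypothetical helper for illustration; not the llama.cpp find_slot().
std::vector<uint32_t> find_free_cells(const std::vector<bool> & used, std::size_t n, bool cont) {
    std::vector<uint32_t> idxs;
    if (n == 0) {
        return idxs;
    }
    for (std::size_t i = 0; i < used.size(); ++i) {
        if (used[i]) {
            if (cont) {
                idxs.clear(); // an occupied cell breaks the current run
            }
            continue;
        }
        idxs.push_back(static_cast<uint32_t>(i));
        if (idxs.size() == n) {
            return idxs;
        }
    }
    idxs.clear(); // not enough free cells (or no long enough run)
    return idxs;
}
```

On a fragmented map such as free, used, free, used, free, used, a request for two cells fails with `cont == true` but succeeds with `cont == false`, returning cells 0 and 2. Passing `true` unconditionally would therefore reject slots that a row-indexed write could have used.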


github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and Ascend NPU (issues specific to Ascend NPUs) labels Aug 22, 2025

danbev approved these changes Aug 22, 2025

Base automatically changed from gg/kv-cache-reuse-layers to master August 24, 2025 10:07

ggerganov force-pushed the gg/kv-cache-remove-set-rows-checks branch from 44933f7 to 642d79e Compare August 24, 2025 10:07

kv-cache : remove LLAMA_SET_ROWS checks

61b7fe2

ggml-ci

ggerganov force-pushed the gg/kv-cache-remove-set-rows-checks branch from 642d79e to 61b7fe2 Compare August 27, 2025 10:57

ggerganov mentioned this pull request Aug 27, 2025

Feature Request: Repeated Unecessary Activation Quantization Ops #15602

Open


ggerganov merged commit 8a4280c into master Aug 28, 2025
59 of 60 checks passed

ggerganov commented Aug 28, 2025

ggerganov mentioned this pull request Aug 28, 2025

kv-cache : fix find_slot to not search for continuous slot #15638

Merged

ggerganov deleted the gg/kv-cache-remove-set-rows-checks branch August 28, 2025 14:07

Minh141120 pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 29, 2025

kv-cache : remove LLAMA_SET_ROWS checks (ggml-org#15505)

246385a

ggml-ci
