Misc. bug: llama-kv-cache-unified.cpp:222: GGML_ASSERT(seq_id >= 0 && (size_t) seq_id < seq_to_stream.size()) failed when loading processed prompt again

### Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: Quadro M2000, compute capability 5.2, VMM: yes
version: 5913 (225e7a14)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

### Operating systems

Linux

### Which llama.cpp modules do you know to be affected?

llama-cli

### Command line

```shell
model_folder="/your_model_folder/Qwen3-4B-Thinking-2507-UD-Q8_K_XL/" && \
model_basename="Qwen3-4B-Thinking-2507-UD-Q8_K_XL" && \
model=$model_folder$model_basename'.gguf' && \
CUDA_VISIBLE_DEVICES="0," \
./llama-cli \
--model "$model" \
--n-gpu-layers 0 \
--prompt-cache "cached.prompt" \
--file "text.prompt"
```

### Problem description & steps to reproduce

llama-cli crashes when it tries to reuse a prompt cache that a previous run saved to a binary file.

Create a text file "text.prompt" with some content and save it in the same folder as the llama-cli binary. Make sure that folder contains no other `*.prompt` files; if it does, delete them. Run the command once and, after the prompt has been processed and the program has started generating a response, exit (press Ctrl+C twice).

Confirm that a file "cached.prompt" was created in the folder. Then run the command again. The expected behavior is that the program loads and uses the cached prompt. Instead, it crashes with this assertion failure:

`.../llama.cpp/src/llama-kv-cache-unified.cpp:222: GGML_ASSERT(seq_id >= 0 && (size_t) seq_id < seq_to_stream.size()) failed`

The behavior is the same whether layers are offloaded to the GPU or the model runs entirely on the CPU.
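The steps above can be sketched as a single script. This is only a convenience wrapper around the command from this report, not an official tool: the `MODEL` environment variable, the placeholder prompt text, and the `run_cli` helper are assumptions made for illustration. The script skips the runs if `./llama-cli` or the model file is missing.

```shell
#!/usr/bin/env bash
# Hypothetical reproduction helper. MODEL and the prompt text are placeholders.
model="${MODEL:-/your_model_folder/Qwen3-4B-Thinking-2507-UD-Q8_K_XL/Qwen3-4B-Thinking-2507-UD-Q8_K_XL.gguf}"

# Start from a clean state: a fresh prompt file and no stale cache.
printf 'Some prompt content.\n' > text.prompt
rm -f cached.prompt

# Same invocation as in the "Command line" section.
run_cli() {
    CUDA_VISIBLE_DEVICES="0," ./llama-cli \
        --model "$model" \
        --n-gpu-layers 0 \
        --prompt-cache "cached.prompt" \
        --file "text.prompt"
}

# Keep this script alive when Ctrl+C is used to stop llama-cli.
trap ':' INT

if [ -x ./llama-cli ] && [ -f "$model" ]; then
    # First run: press Ctrl+C twice once generation has started.
    run_cli
    if [ -f cached.prompt ]; then
        # Second run: loads cached.prompt; this is where the assertion is reported.
        run_cli
    else
        echo "cached.prompt was not created by the first run" >&2
    fi
else
    echo "llama-cli or model file not found; only text.prompt was prepared" >&2
fi
```

Run it from the folder that contains the llama-cli binary, for example with `MODEL=/path/to/model.gguf bash repro.sh` (the script name is arbitrary).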

### First Bad Commit

The first bad commit is 225e7a1438f4ea85eaa7b5ef3ab3b266ee4d9c06.

The last working commit is ab140198211385b85eeeb0abd549a4bbe259e10d.

Confirmed with the models Qwen3-4B-Thinking-2507-UD-Q8_K_XL and DeepSeek-R1-0528-Q2_K_L.
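To see what changed between the two commits, the range can be listed with plain git. This is a sketch that assumes it is run inside a llama.cpp checkout that contains both hashes. The `range_status` variable is only there for illustration. If the count printed is 1, the bad commit directly follows the working one.

```shell
# Commit hashes copied from this report.
bad=225e7a1438f4ea85eaa7b5ef3ab3b266ee4d9c06
good=ab140198211385b85eeeb0abd549a4bbe259e10d

range_status="missing"
if git cat-file -e "${bad}^{commit}" 2>/dev/null &&
   git cat-file -e "${good}^{commit}" 2>/dev/null; then
    # Count, then list, the commits reachable from the bad commit
    # but not from the last working one.
    git rev-list --count "${good}..${bad}"
    git log --oneline "${good}..${bad}"
    range_status="listed"
else
    echo "commits not found; run this inside a llama.cpp checkout" >&2
fi
```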

### Relevant log output

```shell
model_folder="/your_model_folder/Qwen3-4B-Thinking-2507-UD-Q8_K_XL/" && model_basename="Qwen3-4B-Thinking-2507-UD-Q8_K_XL" && model=$model_folder$model_basename'.gguf' && CUDA_VISIBLE_DEVICES="0," ./llama-cli --model "$model" --n-gpu-layers 0 --prompt-cache "cached.prompt" --file "text.prompt"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 5913 (225e7a14) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23858 MiB free
llama_model_loader: loaded meta data with 42 key-value pairs and 398 tensors from /mnt/AI/LLM/Qwen3-4B-Thinking-2507-UD-Q8_K_XL/Qwen3-4B-Thinking-2507-UD-Q8_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3-4B-Thinking-2507
llama_model_loader: - kv   3:                            general.version str              = 2507
llama_model_loader: - kv   4:                           general.finetune str              = Thinking
llama_model_loader: - kv   5:                           general.basename str              = Qwen3-4B-Thinking-2507
llama_model_loader: - kv   6:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   7:                         general.size_label str              = 4B
llama_model_loader: - kv   8:                            general.license str              = apache-2.0
llama_model_loader: - kv   9:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-4B-...
llama_model_loader: - kv  10:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  11:                   general.base_model.count u32              = 1
llama_model_loader: - kv  12:                  general.base_model.0.name str              = Qwen3 4B Thinking 2507
llama_model_loader: - kv  13:               general.base_model.0.version str              = 2507
llama_model_loader: - kv  14:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  15:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-4B-...
llama_model_loader: - kv  16:                               general.tags arr[str,2]       = ["unsloth", "text-generation"]
llama_model_loader: - kv  17:                          qwen3.block_count u32              = 36
llama_model_loader: - kv  18:                       qwen3.context_length u32              = 262144
llama_model_loader: - kv  19:                     qwen3.embedding_length u32              = 2560
llama_model_loader: - kv  20:                  qwen3.feed_forward_length u32              = 9728
llama_model_loader: - kv  21:                 qwen3.attention.head_count u32              = 32
llama_model_loader: - kv  22:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  23:                       qwen3.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  24:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  25:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  26:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  30:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  31:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 151654
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  35:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  36:               general.quantization_version u32              = 2
llama_model_loader: - kv  37:                          general.file_type u32              = 7
llama_model_loader: - kv  38:                      quantize.imatrix.file str              = Qwen3-4B-Thinking-2507-GGUF/imatrix_u...
llama_model_loader: - kv  39:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3-4B-Thinking...
llama_model_loader: - kv  40:             quantize.imatrix.entries_count u32              = 252
llama_model_loader: - kv  41:              quantize.imatrix.chunks_count u32              = 79
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type  f16:   26 tensors
llama_model_loader: - type q8_0:  227 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 4.70 GiB (10.05 BPW) 
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 262144
print_info: n_embd           = 2560
print_info: n_layer          = 36
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 9728
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 262144
print_info: rope_finetuned   = unknown
print_info: model type       = 4B
print_info: model params     = 4.02 B
print_info: general.name     = Qwen3-4B-Thinking-2507
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 11 ','
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151654 '<|vision_pad|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/37 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  4816.76 MiB
......................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: kv_unified    = true
llama_context: freq_base     = 5000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.58 MiB
llama_kv_cache_unified:        CPU KV buffer size =   576.00 MiB
llama_kv_cache_unified: size =  576.00 MiB (  4096 cells,  36 layers,  1/ 1 seqs), K (f16):  288.00 MiB, V (f16):  288.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context:      CUDA0 compute buffer size =  1115.12 MiB
llama_context:  CUDA_Host compute buffer size =    13.01 MiB
llama_context: graph nodes  = 1482
llama_context: graph splits = 472 (with bs=512), 109 (with bs=1)
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant


system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | REPACK = 1 | 

main: attempting to load saved session from 'cached.prompt'
main: loaded a session with prompt size of 270 tokens
/home/ai/LLAMA_CPP/2025-07-16T13:35:42Z/llama.cpp/src/llama-kv-cache-unified.cpp:222: GGML_ASSERT(seq_id >= 0 && (size_t) seq_id < seq_to_stream.size()) failed
main: session file has exact match for prompt!
./llama-cli(+0x70034b)[0x5c67b65db34b]
./llama-cli(+0x70090f)[0x5c67b65db90f]
./llama-cli(+0x700ade)[0x5c67b65dbade]
./llama-cli(+0x2b2ff2)[0x5c67b618dff2]
./llama-cli(+0x52bdb)[0x5c67b5f2dbdb]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7af585429d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7af585429e40]
./llama-cli(+0x89e15)[0x5c67b5f64e15]
Aborted (core dumped)
```

Misc. bug: llama-kv-cache-unified.cpp:222: GGML_ASSERT(seq_id >= 0 && (size_t) seq_id < seq_to_stream.size()) failed when loading processed prompt again #15215
