Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 6365 (5eae9348)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
./llama-server -m gpt-oss-20b-mxfp4.gguf -fa on -ctk q8_0 -ctv q8_0
Problem description & steps to reproduce
Using KV cache quantization seems to have a huge negative impact on performance with GPT-OSS models. Below are the results from testing the 20B model (a rough client-side reproduction sketch follows the logs):
Without quantized KV cache
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 884
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 884, n_tokens = 884, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 884, n_tokens = 884
slot update_slots: id 0 | task 0 | SWA checkpoint create, pos_min = 116, pos_max = 883, size = 18.009 MiB, total = 1/3 (18.009 MiB)
slot release: id 0 | task 0 | stop processing: n_past = 1094, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 291.60 ms / 884 tokens ( 0.33 ms per token, 3031.60 tokens per second)
eval time = 1313.75 ms / 211 tokens ( 6.23 ms per token, 160.61 tokens per second)
total time = 1605.34 ms / 1095 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
With quantized KV cache
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 884
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 884, n_tokens = 884, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 884, n_tokens = 884
slot update_slots: id 0 | task 0 | SWA checkpoint create, pos_min = 116, pos_max = 883, size = 9.572 MiB, total = 1/3 (9.572 MiB)
slot release: id 0 | task 0 | stop processing: n_past = 1163, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 1802.50 ms / 884 tokens ( 2.04 ms per token, 490.43 tokens per second)
eval time = 4165.97 ms / 280 tokens ( 14.88 ms per token, 67.21 tokens per second)
total time = 5968.47 ms / 1164 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
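For reference, a minimal client-side sketch of how the two runs can be compared, assuming the server is already running on the default 127.0.0.1:8080. It estimates generation speed from wall-clock time and the usage field of the OpenAI-compatible /v1/chat/completions endpoint; the prompt text, model name, and request parameters are placeholders, not the exact ones behind the logs above.

import time
import requests  # assumes the requests package is installed

# Hypothetical benchmark: send one chat completion to the llama-server
# OpenAI-compatible endpoint and estimate throughput from wall-clock time
# and the reported token usage.
URL = "http://127.0.0.1:8080/v1/chat/completions"

payload = {
    "model": "gpt-oss-20b-mxfp4",  # placeholder; llama-server serves the loaded model
    "messages": [{"role": "user", "content": "Explain KV cache quantization."}],
    "max_tokens": 256,
    "temperature": 0.0,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=600)
elapsed = time.time() - start
resp.raise_for_status()

usage = resp.json().get("usage", {})
completion_tokens = usage.get("completion_tokens", 0)

print(f"wall-clock time : {elapsed:.2f} s")
print(f"generated tokens: {completion_tokens}")
if completion_tokens:
    # Note: this rate includes prompt processing, unlike the server's eval-only timing.
    print(f"~{completion_tokens / elapsed:.1f} tokens/s")

Running this once against a server started without -ctk/-ctv and once against a server started with -ctk q8_0 -ctv q8_0 should show the same gap as the server-side timings above.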
On the other hand, VRAM usage seems to be the same or very close.
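As a sanity check that the quantized cache is actually in effect, the SWA checkpoint sizes in the logs (18.009 MiB vs 9.572 MiB) match the expected per-element storage of the two cache types. A rough back-of-the-envelope sketch, assuming 2 bytes per element for f16 and ggml's q8_0 block layout (32 int8 values plus one f16 scale, 34 bytes per 32-element block):

# Expected KV cache size ratio: q8_0 vs f16 (assumed block layouts, see above).
f16_bytes_per_elem = 2.0
q8_0_bytes_per_elem = (32 * 1 + 2) / 32  # 34 bytes per 32 elements = 1.0625

print(f"expected q8_0 / f16 ratio: {q8_0_bytes_per_elem / f16_bytes_per_elem:.3f}")  # ~0.531

# Measured SWA checkpoint sizes from the logs above:
print(f"measured ratio: {9.572 / 18.009:.3f}")  # ~0.531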
First Bad Commit
No response