Misc. bug: Performance degradation with -ctk / -ctv q8_0 using GPT OSS 20B #15766

@daniel-dona

Description

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 6365 (5eae9348)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

./llama-server -m gpt-oss-20b-mxfp4.gguf -fa on -ctk q8_0 -ctv q8_0
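
For the "Without quantized KV cache" baseline below, presumably the same command minus the cache-type flags:

./llama-server -m gpt-oss-20b-mxfp4.gguf -fa on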

Problem description & steps to reproduce

Using KV cache quantization seems to have a severe impact on performance with GPT OSS models. These are the results of testing the 20B model:

Without quantized KV cache

main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 884
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 884, n_tokens = 884, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 884, n_tokens = 884
slot update_slots: id  0 | task 0 | SWA checkpoint create, pos_min = 116, pos_max = 883, size = 18.009 MiB, total = 1/3 (18.009 MiB)
slot      release: id  0 | task 0 | stop processing: n_past = 1094, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =     291.60 ms /   884 tokens (    0.33 ms per token,  3031.60 tokens per second)
       eval time =    1313.75 ms /   211 tokens (    6.23 ms per token,   160.61 tokens per second)
      total time =    1605.34 ms /  1095 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200

With quantized KV cache

main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 884
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 884, n_tokens = 884, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 884, n_tokens = 884
slot update_slots: id  0 | task 0 | SWA checkpoint create, pos_min = 116, pos_max = 883, size = 9.572 MiB, total = 1/3 (9.572 MiB)
slot      release: id  0 | task 0 | stop processing: n_past = 1163, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =    1802.50 ms /   884 tokens (    2.04 ms per token,   490.43 tokens per second)
       eval time =    4165.97 ms /   280 tokens (   14.88 ms per token,    67.21 tokens per second)
      total time =    5968.47 ms /  1164 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200

In these runs, prompt processing drops from about 3032 to 490 tokens per second (roughly 6x slower) and generation drops from about 161 to 67 tokens per second (roughly 2.4x slower). On the other hand, VRAM usage seems to be the same or very close in both cases.
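
For a more controlled comparison, something like the following llama-bench runs should reproduce the gap (a sketch, not verified on this build: the prompt and generation lengths are chosen to mirror the server run above, and the exact -fa syntax may differ between llama-bench and llama-server):

./llama-bench -m gpt-oss-20b-mxfp4.gguf -fa 1 -p 884 -n 256 -ctk f16 -ctv f16
./llama-bench -m gpt-oss-20b-mxfp4.gguf -fa 1 -p 884 -n 256 -ctk q8_0 -ctv q8_0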

First Bad Commit

No response
