Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 6365 (5eae9348)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
./llama-server -m gpt-oss-20b-mxfp4.gguf -fa on -ctk q8_0 -ctv q8_0
Problem description & steps to reproduce
Using KV cache quantization seems to have a huge negative impact on performance with GPT-OSS models. Below are the results from testing the 20B model (a rough client-side reproduction sketch follows the logs):
Without quantized KV cache
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 884
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 884, n_tokens = 884, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 884, n_tokens = 884
slot update_slots: id 0 | task 0 | SWA checkpoint create, pos_min = 116, pos_max = 883, size = 18.009 MiB, total = 1/3 (18.009 MiB)
slot release: id 0 | task 0 | stop processing: n_past = 1094, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 291.60 ms / 884 tokens ( 0.33 ms per token, 3031.60 tokens per second)
eval time = 1313.75 ms / 211 tokens ( 6.23 ms per token, 160.61 tokens per second)
total time = 1605.34 ms / 1095 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
With quantized KV cache
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 884
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 884, n_tokens = 884, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 884, n_tokens = 884
slot update_slots: id 0 | task 0 | SWA checkpoint create, pos_min = 116, pos_max = 883, size = 9.572 MiB, total = 1/3 (9.572 MiB)
slot release: id 0 | task 0 | stop processing: n_past = 1163, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 1802.50 ms / 884 tokens ( 2.04 ms per token, 490.43 tokens per second)
eval time = 4165.97 ms / 280 tokens ( 14.88 ms per token, 67.21 tokens per second)
total time = 5968.47 ms / 1164 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
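For reference, a minimal client-side sketch of how the two runs can be compared, assuming the server is already running on the default 127.0.0.1:8080. It estimates generation speed from wall-clock time and the usage field of the OpenAI-compatible /v1/chat/completions endpoint; the prompt text, model name, and request parameters are placeholders, not the exact ones behind the logs above.

import time
import requests  # assumes the requests package is installed

# Hypothetical benchmark: send one chat completion to the llama-server
# OpenAI-compatible endpoint and estimate throughput from wall-clock time
# and the reported token usage.
URL = "http://127.0.0.1:8080/v1/chat/completions"

payload = {
    "model": "gpt-oss-20b-mxfp4",  # placeholder; llama-server serves the loaded model
    "messages": [{"role": "user", "content": "Explain KV cache quantization."}],
    "max_tokens": 256,
    "temperature": 0.0,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=600)
elapsed = time.time() - start
resp.raise_for_status()

usage = resp.json().get("usage", {})
completion_tokens = usage.get("completion_tokens", 0)

print(f"wall-clock time : {elapsed:.2f} s")
print(f"generated tokens: {completion_tokens}")
if completion_tokens:
    # Note: this rate includes prompt processing, unlike the server's eval-only timing.
    print(f"~{completion_tokens / elapsed:.1f} tokens/s")

Running this once against a server started without -ctk/-ctv and once against a server started with -ctk q8_0 -ctv q8_0 should show the same gap as the server-side timings above.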
On the other hand, VRAM usage seems to be the same or very close.
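As a sanity check that the quantized cache is actually in effect, the SWA checkpoint sizes in the logs (18.009 MiB vs 9.572 MiB) match the expected per-element storage of the two cache types. A rough back-of-the-envelope sketch, assuming 2 bytes per element for f16 and ggml's q8_0 block layout (32 int8 values plus one f16 scale, 34 bytes per 32-element block):

# Expected KV cache size ratio: q8_0 vs f16 (assumed block layouts, see above).
f16_bytes_per_elem = 2.0
q8_0_bytes_per_elem = (32 * 1 + 2) / 32  # 34 bytes per 32 elements = 1.0625

print(f"expected q8_0 / f16 ratio: {q8_0_bytes_per_elem / f16_bytes_per_elem:.3f}")  # ~0.531

# Measured SWA checkpoint sizes from the logs above:
print(f"measured ratio: {9.572 / 18.009:.3f}")  # ~0.531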
First Bad Commit
No response