
Misc. bug: llama-server, speed penalty from d9d398f since b5222 #15672

@cmhamiche


Name and Version

version: 6319 (792b44f)
built with MSVC 19.44.35215.0 for x64
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86" -DLLAMA_CURL=OFF

Intel i3-12100F, 2×32 GB 3200 MHz, Nvidia RTX 3060 12 GB

Operating systems

Windows

Problem description & steps to reproduce

With llama-server, there is a noticeable speed penalty from commit d9d398f, first included in b5222:
"sampling : when top-k <= 0 -> noop (#13173)"
The change is a single line in src/llama-sampling.cpp; llama-cli does not seem to be affected.
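For reference, the commit title suggests the top-k sampler became a no-op for k <= 0 instead of falling back to sorting the whole candidate list. The sketch below only illustrates that before/after shape; token_data, top_k_old and top_k_new are simplified stand-ins for illustration, not the actual identifiers in src/llama-sampling.cpp.

#include <algorithm>
#include <cstdio>
#include <vector>

// Simplified stand-in for llama.cpp's candidate-token data (illustrative only).
struct token_data {
    int   id;
    float logit;
};

static void sort_top_k(std::vector<token_data> & cands, int k) {
    k = std::min(k, (int) cands.size());
    std::partial_sort(cands.begin(), cands.begin() + k, cands.end(),
                      [](const token_data & a, const token_data & b) { return a.logit > b.logit; });
    cands.resize(k);
}

// Before #13173 (hypothetical reconstruction): k <= 0 meant "keep all tokens",
// which still sorted the entire candidate list by logit.
static void top_k_old(std::vector<token_data> & cands, int k) {
    if (k <= 0) {
        k = (int) cands.size();
    }
    sort_top_k(cands, k);
}

// After #13173 ("when top-k <= 0 -> noop"): k <= 0 returns early, leaving the
// candidates unsorted and untruncated for whatever sampler runs next.
static void top_k_new(std::vector<token_data> & cands, int k) {
    if (k <= 0) {
        return;
    }
    sort_top_k(cands, k);
}

int main() {
    std::vector<token_data> a = {{0, 1.5f}, {1, 3.0f}, {2, 0.5f}, {3, 2.0f}};
    std::vector<token_data> b = a;
    top_k_old(a, 0);   // sorted by logit: ids 1, 3, 0, 2
    top_k_new(b, 0);   // unchanged: ids 0, 1, 2, 3
    printf("old first id: %d, new first id: %d\n", a[0].id, b[0].id);
    return 0;
}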

First Bad Commit

There are comparable differences between the llama-b5221-bin-win-cuda-cu12.4-x64 and llama-b5222-bin-win-cuda-cu12.4-x64 release builds.

Relevant log output


llama-server without CPU offloading: top-k 0.1, min-p 0.1

gemma-2-2b-it.F16.gguf

llama-server.exe --no-mmap -t 7 -fa -ngl 99 -c 8000 -n 1000 --temp 1 --top-p 1 --top-k 0.1 --min-p 0.1 --jinja -m gemma-2-2b-it.F16.gguf
[...]
load_tensors: offloaded 27/27 layers to GPU
load_tensors: CUDA_Host model buffer size = 1125.00 MiB
load_tensors: CUDA0 model buffer size = 4986.92 MiB

b6319 with reverted d9d398f, "sampling : when top-k <= 0 -> noop (#13173)"
prompt eval time = 470.11 ms / 453 tokens ( 1.04 ms per token, 963.61 tokens per second)
eval time = 16651.99 ms / 476 tokens ( 34.98 ms per token, 28.59 tokens per second)
total time = 17122.10 ms / 929 tokens

b6319 default
prompt eval time = 388.79 ms / 453 tokens ( 0.86 ms per token, 1165.15 tokens per second)
eval time = 18129.15 ms / 458 tokens ( 39.58 ms per token, 25.26 tokens per second)
total time = 18517.94 ms / 911 tokens


llama-server with CPU offloading: ncmoe 32, top-k 0, min-p 0

gpt-oss-120b-F16.gguf

llama-server.exe --no-mmap -t 7 -fa -ngl 99 -ncmoe 32 -c 30000 -n 1000 --temp 1 --top-p 1 --top-k 0 --min-p 0 --jinja -m gpt-oss-120b-F16.gguf
[...]
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CPU model buffer size = 52877.11 MiB
load_tensors: CUDA0 model buffer size = 9451.23 MiB

b6319 with reverted d9d398f, "sampling : when top-k <= 0 -> noop (#13173)"
prompt eval time = 5145.30 ms / 495 tokens ( 10.39 ms per token, 96.20 tokens per second)
eval time = 101352.60 ms / 1000 tokens ( 101.35 ms per token, 9.87 tokens per second)
total time = 106497.90 ms / 1495 tokens

b6319 default
prompt eval time = 4994.36 ms / 495 tokens ( 10.09 ms per token, 99.11 tokens per second)
eval time = 105848.31 ms / 1000 tokens ( 105.85 ms per token, 9.45 tokens per second)
total time = 110842.66 ms / 1495 tokens


llama-server with some shared RAM: top-k 0.1, min-p 0.1

Qwen3-14B-Q4_K_M.gguf

llama-server.exe --no-mmap -t 7 -fa -ngl 99 -c 8000 -n 1000 --temp 1 --top-p 1 --top-k 0.1 --min-p 0.1 --jinja -m Qwen3-14B-Q4_K_M.gguf
[...]
load_tensors: offloaded 41/41 layers to GPU
load_tensors: CUDA0 model buffer size = 8161.75 MiB
load_tensors: CPU model buffer size = 417.30 MiB

b6319 with reverted d9d398f, "sampling : when top-k <= 0 -> noop (#13173)"
prompt eval time = 864.52 ms / 518 tokens ( 1.67 ms per token, 599.18 tokens per second)
eval time = 38920.46 ms / 1000 tokens ( 38.92 ms per token, 25.69 tokens per second)
total time = 39784.98 ms / 1518 tokens

b6319 default
prompt eval time = 803.84 ms / 518 tokens ( 1.55 ms per token, 644.40 tokens per second)
eval time = 42078.33 ms / 1000 tokens ( 42.08 ms per token, 23.77 tokens per second)
total time = 42882.18 ms / 1518 tokens


For comparison, llama-cli with some shared RAM: top-k 0, min-p 0

Qwen3-14B-Q4_K_M.gguf

llama-cli.exe --no-mmap -t 7 -fa -ngl 99 -c 8000 -n 1000 --temp 1 --top-p 1 --top-k 0 --min-p 0 --jinja -m Qwen3-14B-Q4_K_M.gguf -p "On a typical 24-hour digital alarm clock, what time will the sum of the four digits be the highest?"
[...]
load_tensors: offloaded 41/41 layers to GPU
load_tensors: CUDA0 model buffer size = 8161.75 MiB
load_tensors: CPU model buffer size = 417.30 MiB

b6319 with reverted d9d398f, "sampling : when top-k <= 0 -> noop (#13173)"
llama_perf_sampler_print: sampling time = 10327.16 ms / 1032 runs ( 10.01 ms per token, 99.93 tokens per second)
llama_perf_context_print: load time = 2430.43 ms
llama_perf_context_print: prompt eval time = 61.37 ms / 32 tokens ( 1.92 ms per token, 521.44 tokens per second)
llama_perf_context_print: eval time = 27465.81 ms / 999 runs ( 27.49 ms per token, 36.37 tokens per second)
llama_perf_context_print: total time = 40954.06 ms / 1031 tokens
llama_perf_context_print: graphs reused = 995

b6319 default
llama_perf_sampler_print: sampling time = 13363.90 ms / 1032 runs ( 12.95 ms per token, 77.22 tokens per second)
llama_perf_context_print: load time = 2433.27 ms
llama_perf_context_print: prompt eval time = 60.15 ms / 32 tokens ( 1.88 ms per token, 532.05 tokens per second)
llama_perf_context_print: eval time = 27404.76 ms / 999 runs ( 27.43 ms per token, 36.45 tokens per second)
llama_perf_context_print: total time = 41827.05 ms / 1031 tokens
llama_perf_context_print: graphs reused = 995
