Dramatic overnight performance changes (not all good) #15595
bitbottrap started this conversation in General
Replies: 2 comments 2 replies
-
If you want the supposed performance regression to be addressed, actually reproduce it and nail down the exact commit that causes it.
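Nailing down the exact commit usually means a git bisect between a known-fast and the current build. A minimal sketch (the good-commit hash, model path, and test prompt size are placeholders, and the build flags assume a CUDA setup like the one described below):

```shell
# Hypothetical bisect session; <last-known-fast-commit> is a placeholder
# for the newest commit where performance was still good.
git bisect start
git bisect bad HEAD
git bisect good <last-known-fast-commit>

# At each step git checks out a midpoint commit; rebuild and benchmark:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# llama-bench gives repeatable pp/tg numbers for comparison:
./build/bin/llama-bench -m model.gguf -ngl 999 -fa 1 -p 2048 -n 128

# Then mark the result and repeat until git names the culprit commit:
git bisect good   # or: git bisect bad
```

With ~2 builds per hour, bisecting a day of commits typically takes only a handful of steps, since the search is logarithmic in the number of commits.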
-
1 m-million?
-
Qwen3-Coder-480B x 4 GPU:
llama-server -a qwen-480b --host 0.0.0.0 --port 8081 -b 2048 -ub 2048 --threads 128 -ngl 999 -c 262144 --flash-attn --no-mmap -m Qwen3-Coder-480B-A35B-Instruct-Q4_K_M-00001-of-00006.gguf --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 --jinja --parallel 4 --swa-full --keep -1 -kvu -ts 9,10,10,10
Before:
prompt eval time = 942.08 ms / 636 tokens ( 1.48 ms per token, 675.10 tokens per second)
eval time = 7180.95 ms / 339 tokens ( 21.18 ms per token, 47.21 tokens per second)
total time = 8123.03 ms / 975 tokens
After:
prompt eval time = 2317.30 ms / 3254 tokens ( 0.71 ms per token, 1404.22 tokens per second)
eval time = 30309.46 ms / 1357 tokens ( 22.34 ms per token, 44.77 tokens per second)
total time = 32626.76 ms / 4611 tokens
Dramatic result: PP performance more than doubled here, across 4 GPUs. There is a small but consistent drop in generation speed, though (47.21 to 44.77 tokens per second).
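The PP speedup can be read straight off the two "prompt eval time" lines above; a quick shell check (the tokens-per-second values are copied from those logs):

```shell
# Tokens/second from the "prompt eval time" lines, before and after the update.
before_tps=675.10
after_tps=1404.22

# awk does the floating-point division (plain shell arithmetic is integer-only).
speedup=$(awk -v a="$after_tps" -v b="$before_tps" 'BEGIN { printf "%.2f", a / b }')
echo "PP speedup: ${speedup}x"
# prints: PP speedup: 2.08x
```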
gpt-oss-120b x 4 GPU:
Before
llama-server -a gpt-oss-120b --host 0.0.0.0 --port 8081 -b 2048 -ub 2048 --threads 128 -ngl 999 -c 1048576 --flash-attn --no-mmap -m gpt-oss-120b-mxfp4-00001-of-00003.gguf --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 --jinja --chat-template-kwargs '{"reasoning_effort":"high"}' --swa-full --parallel 8
Don't have a record of this aside from memory:
pp was ~1200? (could have been as high as 3000)
generation was ~100
After:
prompt eval time = 12068.61 ms / 3275 tokens ( 3.69 ms per token, 271.37 tokens per second)
eval time = 44585.55 ms / 5585 tokens ( 7.98 ms per token, 125.26 tokens per second)
total time = 56654.16 ms / 8860 tokens
Dramatic LOSS of prompt processing performance here.
Generation performance improved by ~25 t/s.
gpt-oss-120b x 1 GPU:
Bringing this back to a single GPU: a dramatic improvement all around (though I cut the cache to 1/8 of the size):
CUDA_VISIBLE_DEVICES=0 llama-server -a gpt-oss-120b --host 0.0.0.0 --port 8081 -b 2048 -ub 2048 --threads 128 -ngl 999 -c 0 --flash-attn --no-mmap -m gpt-oss-120b-mxfp4-00001-of-00003.gguf --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 --jinja --chat-template-kwargs '{"reasoning_effort":"high"}' --swa-full
prompt eval time = 666.16 ms / 3275 tokens ( 0.20 ms per token, 4916.21 tokens per second)
eval time = 52855.16 ms / 6822 tokens ( 7.75 ms per token, 129.07 tokens per second)
total time = 53521.33 ms / 10097 tokens
Huge improvement to PP and generation performance here. This case was a winner all around.
Some notes: this is the change on 'master' over the last 24 hours. GPUs are 4 x Nvidia 6000 Pro Max-Q.
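For anyone comparing across builds like this, llama-bench gives more repeatable numbers than timing individual server requests, since it fixes the prompt and generation lengths and averages over repetitions. A sketch using the 480B model from above (the -p/-n/-r values are arbitrary choices, not from the original runs):

```shell
# Hypothetical benchmark invocation; run the same command on both builds.
./build/bin/llama-bench \
  -m Qwen3-Coder-480B-A35B-Instruct-Q4_K_M-00001-of-00006.gguf \
  -ngl 999 -fa 1 \
  -p 2048 \
  -n 256 \
  -r 5
```

Running the identical command before and after the suspect commit removes prompt-length and sampling variation from the comparison, which matters here since the before/after server logs above used different prompt sizes.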