Dramatic overnight performance changes (not all good) #15595
bitbottrap started this conversation in General
Replies: 2 comments 2 replies
-
If you want the supposed performance regression to be addressed, actually reproduce it and nail down the exact commit that causes it.
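Nailing down the exact commit usually means a git bisect between a known-fast and the current build. A minimal sketch (the good-commit hash, model path, and test prompt size are placeholders, and the build flags assume a CUDA setup like the one described below):

```shell
# Hypothetical bisect session; <last-known-fast-commit> is a placeholder
# for the newest commit where performance was still good.
git bisect start
git bisect bad HEAD
git bisect good <last-known-fast-commit>

# At each step git checks out a midpoint commit; rebuild and benchmark:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# llama-bench gives repeatable pp/tg numbers for comparison:
./build/bin/llama-bench -m model.gguf -ngl 999 -fa 1 -p 2048 -n 128

# Then mark the result and repeat until git names the culprit commit:
git bisect good   # or: git bisect bad
```

With ~2 builds per hour, bisecting a day of commits typically takes only a handful of steps, since the search is logarithmic in the number of commits.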
-
1 m-million?
-
Qwen3-Coder-480B x 4 GPU:
llama-server -a qwen-480b --host 0.0.0.0 --port 8081 -b 2048 -ub 2048 --threads 128 -ngl 999 -c 262144 --flash-attn --no-mmap -m Qwen3-Coder-480B-A35B-Instruct-Q4_K_M-00001-of-00006.gguf --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 --jinja --parallel 4 --swa-full --keep -1 -kvu -ts 9,10,10,10
Before:
prompt eval time = 942.08 ms / 636 tokens ( 1.48 ms per token, 675.10 tokens per second)
eval time = 7180.95 ms / 339 tokens ( 21.18 ms per token, 47.21 tokens per second)
total time = 8123.03 ms / 975 tokens
After:
prompt eval time = 2317.30 ms / 3254 tokens ( 0.71 ms per token, 1404.22 tokens per second)
eval time = 30309.46 ms / 1357 tokens ( 22.34 ms per token, 44.77 tokens per second)
total time = 32626.76 ms / 4611 tokens
Dramatic result: PP performance more than doubled here, across 4 GPUs. There is a small but consistent drop in generation speed, though (47.21 to 44.77 tokens per second).
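The PP speedup can be read straight off the two "prompt eval time" lines above; a quick shell check (the tokens-per-second values are copied from those logs):

```shell
# Tokens/second from the "prompt eval time" lines, before and after the update.
before_tps=675.10
after_tps=1404.22

# awk does the floating-point division (plain shell arithmetic is integer-only).
speedup=$(awk -v a="$after_tps" -v b="$before_tps" 'BEGIN { printf "%.2f", a / b }')
echo "PP speedup: ${speedup}x"
# prints: PP speedup: 2.08x
```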
gpt-oss-120b x 4 GPU:
Before
llama-server -a gpt-oss-120b --host 0.0.0.0 --port 8081 -b 2048 -ub 2048 --threads 128 -ngl 999 -c 1048576 --flash-attn --no-mmap -m gpt-oss-120b-mxfp4-00001-of-00003.gguf --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 --jinja --chat-template-kwargs '{"reasoning_effort":"high"}' --swa-full --parallel 8
Don't have a record of this aside from memory:
pp was ~1200? (could have been as high as 3000)
generation was ~100
After:
prompt eval time = 12068.61 ms / 3275 tokens ( 3.69 ms per token, 271.37 tokens per second)
eval time = 44585.55 ms / 5585 tokens ( 7.98 ms per token, 125.26 tokens per second)
total time = 56654.16 ms / 8860 tokens
Dramatic LOSS of prompt processing performance here.
Generation performance improved by ~25 t/s.
gpt-oss-120b x 1 GPU:
Bringing this back to a single GPU: a dramatic improvement all around (though I cut the cache to 1/8 of the size):
CUDA_VISIBLE_DEVICES=0 llama-server -a gpt-oss-120b --host 0.0.0.0 --port 8081 -b 2048 -ub 2048 --threads 128 -ngl 999 -c 0 --flash-attn --no-mmap -m gpt-oss-120b-mxfp4-00001-of-00003.gguf --temp 0.7 --top-p 0.8 --top-k 20 --repeat-penalty 1.05 --jinja --chat-template-kwargs '{"reasoning_effort":"high"}' --swa-full
prompt eval time = 666.16 ms / 3275 tokens ( 0.20 ms per token, 4916.21 tokens per second)
eval time = 52855.16 ms / 6822 tokens ( 7.75 ms per token, 129.07 tokens per second)
total time = 53521.33 ms / 10097 tokens
Huge improvement to PP and generation performance here. This case was a winner all around.
Some notes: this is the change on 'master' over the last 24 hours. GPUs are 4 x Nvidia 6000 Pro Max-Q.
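For anyone comparing across builds like this, llama-bench gives more repeatable numbers than timing individual server requests, since it fixes the prompt and generation lengths and averages over repetitions. A sketch using the 480B model from above (the -p/-n/-r values are arbitrary choices, not from the original runs):

```shell
# Hypothetical benchmark invocation; run the same command on both builds.
./build/bin/llama-bench \
  -m Qwen3-Coder-480B-A35B-Instruct-Q4_K_M-00001-of-00006.gguf \
  -ngl 999 -fa 1 \
  -p 2048 \
  -n 256 \
  -r 5
```

Running the identical command before and after the suspect commit removes prompt-length and sampling variation from the comparison, which matters here since the before/after server logs above used different prompt sizes.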