## Name and Version
./llama-cli --version
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce GTX 1080 Ti (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
load_backend: loaded Vulkan backend from H:\llama.cpp_vulkan\ggml-vulkan.dll
load_backend: loaded CPU backend from H:\llama.cpp_vulkan\ggml-cpu-haswell.dll
version: 6278 (34bdbbd)
## Operating systems
Windows
## Which llama.cpp modules do you know to be affected?
llama-bench, llama-server
## Command line
### Command Used for llama-bench
@echo off
H:\llama.cpp_vulkan\llama-bench.exe --model .\gpt-oss-20b-UD-Q4_K_XL.gguf ^
--threads 24 --main-gpu 1 ^
--n-gpu-layers 99 ^
--n-prompt 512,2048 --n-gen 128,256,512 ^
--mmap 0 --flash-attn 1 --split-mode layer
pause
### Command Used for llama-server
@echo off
set GGML_VK_VISIBLE_DEVICES=0,2
H:\llama.cpp_vulkan\llama-server.exe --model .\gpt-oss-20b-mxfp4.gguf ^
--alias gtp-4.1 ^
--threads -1 --main-gpu 1 ^
--n-gpu-layers 99 --tensor-split 25,75 ^
--temp 1.0 --min-p 0.0 --top-p 1.0 --top-k 40 --repeat-penalty 1.0 ^
--port 8181 --host 127.0.0.1 ^
--ctx-size 0 -fa --metrics --seed 42 ^
--reasoning-format none --chat-template-kwargs "{\"reasoning_effort\":\"high\"}"
pause
## Problem description & steps to reproduce
In build 6278 and newer, llama-bench exits unexpectedly after printing the table header, without running any benchmarks; the same command works correctly in build 6277.
In build 6278 and newer, llama-server exits abruptly as soon as it receives a simple "hello" message from Open WebUI; the same setup works correctly in build 6277.
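The Open WebUI step can probably be replaced by a direct request to the server's OpenAI-compatible endpoint. A minimal sketch, assuming the crash is triggered by any chat completion rather than something Open WebUI-specific (port taken from the command below):

```bat
curl http://127.0.0.1:8181/v1/chat/completions -H "Content-Type: application/json" ^
  -d "{\"messages\":[{\"role\":\"user\",\"content\":\"hello\"}]}"
```

If the server dies on this request as well, Open WebUI can be ruled out as a factor.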
### Environment
- OS: Windows
- Build: 6278 (34bdbbd) - broken, 6277 (74f52f7) - working
- Compiler: clang version 19.1.5 for x86_64-pc-windows-msvc
- GPUs:
- NVIDIA GeForce GTX 1080 Ti
- NVIDIA GeForce RTX 4070 (main GPU)
Running the llama-bench command shown above, in build 6278 the program:
- Successfully initializes backends (RPC, Vulkan, CPU)
- Detects GPUs correctly
- Displays the table header
- Immediately exits with "Press any key to continue . . ." without running any benchmarks
### Working Output (Build 6277)
| model | size | params | backend | ngl | threads | main_gpu | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ---------: | -: | ---: | --------------: | -------------------: |
| gpt-oss 20B Q4_K - Medium | 11.04 GiB | 20.91 B | RPC,Vulkan | 99 | 24 | 1 | 1 | 0 | pp512 | 606.65 ± 11.03 |
| gpt-oss 20B Q4_K - Medium | 11.04 GiB | 20.91 B | RPC,Vulkan | 99 | 24 | 1 | 1 | 0 | pp2048 | 562.76 ± 17.86 |
| gpt-oss 20B Q4_K - Medium | 11.04 GiB | 20.91 B | RPC,Vulkan | 99 | 24 | 1 | 1 | 0 | tg128 | 82.04 ± 1.57 |
| gpt-oss 20B Q4_K - Medium | 11.04 GiB | 20.91 B | RPC,Vulkan | 99 | 24 | 1 | 1 | 0 | tg256 | 82.53 ± 1.53 |
| gpt-oss 20B Q4_K - Medium | 11.04 GiB | 20.91 B | RPC,Vulkan | 99 | 24 | 1 | 1 | 0 | tg512 | 80.57 ± 0.94 |
### Broken Output (Build 6278)
| model | size | params | backend | ngl | threads | main_gpu | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ---------: | -: | ---: | --------------: | -------------------: |
Press any key to continue . . .
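To distinguish a crash from a clean early exit, checking the process exit code right after the run may help. A small sketch; the echo line is a diagnostic addition, not part of the original script:

```bat
@echo off
H:\llama.cpp_vulkan\llama-bench.exe --model .\gpt-oss-20b-UD-Q4_K_XL.gguf ^
  --threads 24 --main-gpu 1 --n-gpu-layers 99 ^
  --n-prompt 512 --n-gen 128 --mmap 0 --flash-attn 1 --split-mode layer
echo exit code: %ERRORLEVEL%
pause
```

A large negative value such as -1073741819 (0xC0000005) would point to an access violation rather than an intentional early return.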
## Additional Information
- The model file (gpt-oss-20b-UD-Q4_K_XL.gguf) is the same in both tests
- Backend initialization appears to work correctly in both versions
- GPU detection is identical in both builds
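One more data point that could help triage, though it has not been tested here: restricting llama-bench to a single Vulkan device via the same GGML_VK_VISIBLE_DEVICES variable used in the server script, to see whether the regression is specific to the multi-device path. A sketch:

```bat
@echo off
set GGML_VK_VISIBLE_DEVICES=1
H:\llama.cpp_vulkan\llama-bench.exe --model .\gpt-oss-20b-UD-Q4_K_XL.gguf ^
  --n-gpu-layers 99 --n-prompt 512 --n-gen 128
pause
```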
## First Bad Commit
No response
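If builds 6277 and 6278 correspond to adjacent commits, 34bdbbd itself would be the first bad commit; otherwise a bisect between the two revisions could narrow it down. A sketch, assuming a local source build and a hypothetical check.bat wrapper that rebuilds llama-bench, runs the failing command, and returns nonzero on failure:

```bat
git bisect start 34bdbbd 74f52f7
git bisect run ./check.bat
```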