Misc. bug: Build 6278 Vulkan crashes: llama-bench and llama-server both affected #15678

@kidVTP

Description

Name and Version

./llama-cli --version
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce GTX 1080 Ti (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
load_backend: loaded Vulkan backend from H:\llama.cpp_vulkan\ggml-vulkan.dll
load_backend: loaded CPU backend from H:\llama.cpp_vulkan\ggml-cpu-haswell.dll
version: 6278 (34bdbbd)

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-bench, llama-server

Command line

----------------------------------------------------
## Command Used for llama-bench
----------------------------------------------------
@echo off
H:\llama.cpp_vulkan\llama-bench.exe --model .\gpt-oss-20b-UD-Q4_K_XL.gguf ^
  --threads 24 --main-gpu 1 ^
  --n-gpu-layers 99 ^
  --n-prompt 512,2048 --n-gen 128,256,512 ^
  --mmap 0 --flash-attn 1 --split-mode layer

pause

----------------------------------------------------
## Command Used for llama-server
----------------------------------------------------
@echo off
set GGML_VK_VISIBLE_DEVICES=0,2
H:\llama.cpp_vulkan\llama-server.exe --model .\gpt-oss-20b-mxfp4.gguf ^
 --alias gtp-4.1 ^
 --threads -1 --main-gpu 1 ^
 --n-gpu-layers 99 --tensor-split 25,75 ^
 --temp 1.0 --min-p 0.0 --top-p 1.0 --top-k 40 --repeat-penalty 1.0 ^
 --port 8181 --host 127.0.0.1 ^
 --ctx-size 0 -fa --metrics --seed 42 ^
 --reasoning-format none --chat-template-kwargs "{\"reasoning_effort\":\"high\"}"
 
pause
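Since both scripts end with `pause`, any error message printed just before the process dies can be lost once the console window closes. A sketch of the same llama-bench invocation with console output redirected to a file so the crash message survives for the report (`vk_crash.log` is an illustrative name, not from the original scripts):

```shell
@echo off
REM Same llama-bench invocation as above, with stdout and stderr
REM redirected so any crash message is preserved in a file.
REM "vk_crash.log" is an illustrative file name.
H:\llama.cpp_vulkan\llama-bench.exe --model .\gpt-oss-20b-UD-Q4_K_XL.gguf ^
  --threads 24 --main-gpu 1 ^
  --n-gpu-layers 99 ^
  --n-prompt 512,2048 --n-gen 128,256,512 ^
  --mmap 0 --flash-attn 1 --split-mode layer > vk_crash.log 2>&1
pause
```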

Problem description & steps to reproduce

In build 6278 and newer, llama-bench exits unexpectedly after printing the table header, without running any benchmarks; the same command works correctly in build 6277.

In build 6278 and newer, llama-server exits abruptly on receiving a simple "hello" message from Open WebUI; the same setup works correctly in build 6277.

Environment

  • OS: Windows
  • Build: 6278 (34bdbbd) - broken, 6277 (74f52f7) - working
  • Compiler: clang version 19.1.5 for x86_64-pc-windows-msvc
  • GPUs:
    • NVIDIA GeForce GTX 1080 Ti
    • NVIDIA GeForce RTX 4070 (main GPU)

Command Used for llama-bench

H:\llama.cpp_vulkan\llama-bench.exe --model .\gpt-oss-20b-UD-Q4_K_XL.gguf ^
  --threads 24 --main-gpu 1 ^
  --n-gpu-layers 99 ^
  --n-prompt 512,2048 --n-gen 128,256,512 ^
  --mmap 0 --flash-attn 1 --split-mode layer

In build 6278, the program:

  1. Successfully initializes backends (RPC, Vulkan, CPU)
  2. Detects GPUs correctly
  3. Displays the table header
  4. Immediately exits with "Press any key to continue . . ." without running any benchmarks

Working Output (Build 6277)

| model                          |       size |     params | backend    | ngl | threads |   main_gpu | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ---------: | -: | ---: | --------------: | -------------------: |
| gpt-oss 20B Q4_K - Medium      |  11.04 GiB |    20.91 B | RPC,Vulkan |  99 |      24 |          1 |  1 |    0 |           pp512 |       606.65 ± 11.03 |
| gpt-oss 20B Q4_K - Medium      |  11.04 GiB |    20.91 B | RPC,Vulkan |  99 |      24 |          1 |  1 |    0 |          pp2048 |       562.76 ± 17.86 |
| gpt-oss 20B Q4_K - Medium      |  11.04 GiB |    20.91 B | RPC,Vulkan |  99 |      24 |          1 |  1 |    0 |           tg128 |         82.04 ± 1.57 |
| gpt-oss 20B Q4_K - Medium      |  11.04 GiB |    20.91 B | RPC,Vulkan |  99 |      24 |          1 |  1 |    0 |           tg256 |         82.53 ± 1.53 |
| gpt-oss 20B Q4_K - Medium      |  11.04 GiB |    20.91 B | RPC,Vulkan |  99 |      24 |          1 |  1 |    0 |           tg512 |         80.57 ± 0.94 |

Broken Output (Build 6278)

| model                          |       size |     params | backend    | ngl | threads |   main_gpu | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ---------: | -: | ---: | --------------: | -------------------: |
Press any key to continue . . .

Additional Information

  • The model file (gpt-oss-20b-UD-Q4_K_XL.gguf) is the same in both tests
  • Backend initialization appears to work correctly in both versions
  • GPU detection is identical in both builds

First Bad Commit

No response
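One way to find it: since build 6277 (74f52f7, working) and build 6278 (34bdbbd, broken) are already known, a `git bisect` between them narrows the regression to a single commit. (If the two build numbers correspond to consecutive commits, 34bdbbd itself would be the first bad commit.) A sketch; each step requires rebuilding and re-running the failing llama-bench command:

```shell
# Bisect llama.cpp between the known-good and known-bad builds.
# 74f52f7 = build 6277 (working), 34bdbbd = build 6278 (broken), per this report.
git bisect start
git bisect bad 34bdbbd
git bisect good 74f52f7
# git now checks out a candidate commit; rebuild, re-run the failing
# llama-bench command, then mark the result:
#   git bisect good   # benchmark ran to completion
#   git bisect bad    # exited after the table header
# Repeat until git prints the first bad commit, then clean up:
git bisect reset
```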

Relevant log output

Labels

Vulkan: Issues specific to the Vulkan backend
bug: Something isn't working
