Misc. bug: CUDA error: device kernel image is invalid (Quadro RTX 8000) #12717

Open
@evgenyigumnov

Description

Name and Version

root@C.19155793:/workspace/llama.cpp$ ./build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Quadro RTX 8000, compute capability 7.5, VMM: yes
version: 5026 (83a88bd)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

Operating systems

No response

Which llama.cpp modules do you know to be affected?

No response

Command line

Problem description & steps to reproduce

Hello!

How to reproduce:

apt clean && apt update && apt install cmake -y

wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8
sudo apt install cuda-compiler-12-8
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
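
To rule out a driver/runtime mismatch (one common cause of invalid-kernel-image errors), the toolkit and driver versions can be cross-checked with standard CUDA tooling:

nvcc --version   # toolkit used to compile the kernels
nvidia-smi       # driver version and the highest CUDA version it supports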

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=75
cmake --build build --config Release
wget https://huggingface.co/mlabonne/gemma-3-27b-it-abliterated-GGUF/resolve/main/gemma-3-27b-it-abliterated.q8_0.gguf
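
As a sanity check on the build (the library path below is a guess; with a static build the CUDA objects end up inside the llama-server binary instead), cuobjdump can confirm that sm_75 code was actually embedded:

cuobjdump -lelf ./build/bin/libggml-cuda.so | grep sm_75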

./build/bin/llama-server -m ./gemma-3-27b-it-abliterated.q8_0.gguf --host 0.0.0.0 --port 1234

print_info: n_vocab          = 262208
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 1 '<eos>'
print_info: EOT token        = 106 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 248 '<0x0A>'
print_info: EOG token        = 1 '<eos>'
print_info: EOG token        = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/63 layers to GPU
load_tensors:   CPU_Mapped model buffer size = 27371.80 MiB
................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     1.00 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 62, can_shift = 1
init:        CPU KV buffer size =  1984.00 MiB
llama_context: KV self size  = 1984.00 MiB, K (f16):  992.00 MiB, V (f16):  992.00 MiB
llama_context:      CUDA0 compute buffer size =  1950.97 MiB
llama_context:  CUDA_Host compute buffer size =    26.51 MiB
llama_context: graph nodes  = 2611
llama_context: graph splits = 934 (with bs=512), 125 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/workspace/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
ggml_cuda_compute_forward: MUL failed
CUDA error: device kernel image is invalid
  current device: 0, in function ggml_cuda_compute_forward at /workspace/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2338
  err
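
For context: "device kernel image is invalid" generally means the runtime could not load a cubin/PTX compatible with the device, either because the binary lacks code for the GPU's architecture or because the driver is older than the runtime the binary was built against. As a hedged workaround, with CMake 3.24 or newer the architecture can be auto-detected at configure time instead of hard-coded:

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native
cmake --build build --config Release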

First Bad Commit

No response

Relevant log output
