Name and Version
llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Instinct MI300X VF, gfx942:sramecc+:xnack- (0x942), VMM: no, Wave Size: 64
version: 5201 (85f36e5e)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Problem description & steps to reproduce
llama-server -m UD-IQ1_S/MAI-DS-R1-UD-IQ1_S-00001-of-00004.gguf -c 32768 -b 8192 -ub 4096 -ngl 999 -to 3600 -a MAI-DS-R1-UD-IQ1_S --no-mmap -t 1 -nkvo -fa
With a small context window and small prompts it seems to work, but if I use a large context it behaves as if it were running on the CPU: I get about 20-30 tokens/s on small prompts, but otherwise it just hangs for hours.
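To show the difference, here is roughly how I check whether the slowdown scales with prompt size (a sketch; it assumes the server is listening on the default 127.0.0.1:8080 and uses the native /completion endpoint):

# Hypothetical timing check (assumes the default 127.0.0.1:8080 listen address).
# Build a long filler prompt, then time a single request against the /completion
# endpoint; the filler is plain ASCII, so it is safe to embed in the JSON body.
LONG_PROMPT=$(python3 -c "print('lorem ipsum ' * 2000)")
time curl -s http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": \"$LONG_PROMPT\", \"n_predict\": 64}" > /dev/null

The per-request prompt-eval and eval timings in the server log make it easy to compare the small-prompt and large-prompt cases.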
Without -fa I have to use much smaller -c, -b, and -ub values to avoid maxing out VRAM; it then handles larger prompts, but only at 7-10 tokens/second.
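For comparison, the non-FA run looks roughly like this (the -c/-b/-ub values here are illustrative examples of "much smaller", not the exact ones I used):

# Hypothetical non-FA invocation; context/batch sizes shown are examples only,
# small enough to avoid exhausting VRAM on this setup.
llama-server -m UD-IQ1_S/MAI-DS-R1-UD-IQ1_S-00001-of-00004.gguf \
  -c 8192 -b 2048 -ub 512 -ngl 999 -to 3600 \
  -a MAI-DS-R1-UD-IQ1_S --no-mmap -t 1 -nkvo

For reference, the script below is how I build llama.cpp with the HIP backend: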
#!/bin/bash
cd ~/llama.cpp
git pull origin master
#rm -rf build
mkdir -p build
cd build
HIPCXX="$(hipconfig -l)/clang" \
HIP_PATH="$(hipconfig -R)" \
cmake -S .. -B . \
-DGGML_HIP=ON \
-DGGML_HIP_ROCWMMA_FATTN=ON \
-DAMDGPU_TARGETS=gfx942 \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX=/usr/local \
-DBUILD_SHARED_LIBS=ON \
-DLLAMA_CURL=ON \
&& cmake --build . --config Release -j8 \
&& sudo cmake --install .
sudo ldconfig
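After installing, I sanity-check that the HIP backend was actually picked up; the device line in the version output above comes from this:

# Should report the ROCm device (AMD Instinct MI300X), as in the output
# pasted under "Name and Version" above.
llama-server --version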