
Misc. bug: Flash Attention not working on CDNA3 ROCm 6.4 MI300 #13145

Open
@unclemusclez

Description


Name and Version

llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Instinct MI300X VF, gfx942:sramecc+:xnack- (0x942), VMM: no, Wave Size: 64
version: 5201 (85f36e5e)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Problem description & steps to reproduce

llama-server -m UD-IQ1_S/MAI-DS-R1-UD-IQ1_S-00001-of-00004.gguf -c 32768  -b 8192 -ub 4096  -ngl 999  -to 3600  -a MAI-DS-R1-UD-IQ1_S --no-mmap -t 1 -nkvo -fa

With small context windows and small prompts it seems to work, but with a large context it looks like it falls back to the CPU. I get about 20-30 tokens/s on small prompts; otherwise it just hangs for hours.

Without -fa I have to use much smaller -c, -b, and -ub values to avoid maxing out VRAM; larger prompts then run, but only at 7-10 tokens/second.
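To isolate whether the FlashAttention path itself is the bottleneck, a side-by-side run with llama-bench is one option (a minimal sketch; -fa and -p accept comma-separated values, so one invocation covers flash attention off/on at a short and a long prompt; the model path is the one from the report above):

# Compare prompt processing and generation speed with flash attention off and on.
# -p 4096 stands in for a "large prompt"; adjust as needed.
llama-bench -m UD-IQ1_S/MAI-DS-R1-UD-IQ1_S-00001-of-00004.gguf \
    -ngl 999 -fa 0,1 -p 512,4096 -n 128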

#!/bin/bash 
cd ~/llama.cpp
git pull origin master

#rm -rf build
mkdir -p build   # -p: do not fail if the directory already exists (rm above is commented out)
cd build

HIPCXX="$(hipconfig -l)/clang" \
HIP_PATH="$(hipconfig -R)" \
cmake -S .. -B . \
    -DGGML_HIP=ON \
    -DGGML_HIP_ROCWMMA_FATTN=ON \
    -DAMDGPU_TARGETS=gfx942 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX=/usr/local \
    -DBUILD_SHARED_LIBS=ON \
    -DLLAMA_CURL=ON \
&& cmake --build . --config Release -j8 \
&& sudo cmake --install .
sudo ldconfig
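To confirm the rocWMMA FlashAttention option actually made it into the configuration, inspecting the CMake cache after the configure step is a simple check (a sketch assuming the build directory above; CMakeCache.txt records the cached option values):

# Show the HIP/rocWMMA-related options as CMake recorded them.
grep -E 'GGML_HIP|ROCWMMA|AMDGPU_TARGETS' ~/llama.cpp/build/CMakeCache.txt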
