Description
Name and Version
version 4992 (af6ae1e)
built with MSVC 19.33.31629.0 for x64
Operating systems
Other? (Please let us know in description)
Which llama.cpp modules do you know to be affected?
Other (Please specify in the next section)
Command line
Problem description & steps to reproduce
Flash Attention Failure in Metal/CUDA RPC Mixed Environment
Issue Description
When using the DeepSeekV3 model with Flash Attention enabled:
- Running llama-server directly (without RPC) works fine
- When running llama-server on CUDA with rpc-server on Metal, everything works correctly
- However, when running llama-server on Metal with rpc-server on CUDA, errors occur on the CUDA side
Specific scenario where the issue occurs:
- Host machine running llama-server with Metal backend
- Remote server running rpc-server with CUDA backend
- DeepSeekV3 model
- The error is raised in fattn.cu and appears to be related to how tensors are transferred over RPC
Steps to Reproduce
- Working configuration (CUDA → Metal):
# Host with CUDA
./llama-server -m /path/to/deepseekv3-model.gguf [...other parameters] -fa --rpc metal-machine:50052
# Remote server with Metal
./rpc-server -H 0.0.0.0 -p 50052
✅ (?) Works, but is very slow (the rpc path logs a "null buffer for tensor passed into init_tensor function" error, yet inference still runs)
- Failing configuration (Metal → CUDA):
# Host with Metal
./llama-server -m /path/to/deepseekv3-model.gguf -fa [...other parameters] --rpc cuda-machine:50052
# Remote server with CUDA
./rpc-server -H 0.0.0.0 -p 50052
❌ Fails with an error in fattn.cu: unable to process the tensor
Error Information
On the CUDA rpc-server side, the error is triggered in fattn.cu, around line 33, inside the `template <int ncols2>` code section.
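For reference, the failure sits in the head-size dispatch near the top of fattn.cu. The snippet below is an illustrative sketch only, not code copied from llama.cpp: the function name, the set of supported head sizes, and the example head size are all assumptions. It only shows the general pattern of a `template <int ncols2>` dispatch that has to abort when an op with an unsupported shape reaches the CUDA flash-attention path, which is what seems to happen here when the graph arrives over RPC.

```cpp
// Illustrative sketch only -- names, supported head sizes, and the example
// head size are assumptions, not code taken from fattn.cu.
#include <cstdint>
#include <cstdio>
#include <cstdlib>

struct toy_tensor {
    int64_t ne[4]; // ne[0] holds the head size, as in ggml_tensor
};

template <int ncols2>
static void flash_attn_dispatch(const toy_tensor * K) {
    // Only head sizes with compiled kernel instantiations can be launched;
    // anything else has to be rejected, which on the rpc-server side shows
    // up as a hard failure inside the CUDA backend.
    switch (K->ne[0]) {
        case  64:
        case 128:
        case 256:
            std::printf("launch kernel: head size %lld, ncols2=%d\n",
                        (long long) K->ne[0], ncols2);
            break;
        default:
            std::fprintf(stderr, "fatal: unsupported head size %lld in flash-attn dispatch\n",
                         (long long) K->ne[0]);
            std::abort();
    }
}

int main() {
    toy_tensor K = {{576, 1, 1, 1}}; // assumed unusual head size, for illustration only
    flash_attn_dispatch<8>(&K);      // aborts in this toy dispatch
    return 0;
}
```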
Environment
- Configuration 1: CUDA host with Metal remote server (works correctly)
- Configuration 2: Metal host with CUDA remote server (fails)
- DeepSeekV3 UD quant
- Latest version of llama.cpp
Additional Observations
- Disabling Flash Attention allows the Metal → CUDA configuration to work normally
- The issue is direction-specific: CUDA → Metal works, but Metal → CUDA fails
- The error location is in the Flash Attention-specific head size processing code
This appears to be a one-way compatibility issue when tensors are transferred from Metal to CUDA during RPC communication, particularly for Flash Attention-related tensors.
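One way to narrow this down (a hedged sketch, not something taken from the llama.cpp tree) would be to ask the CUDA backend directly, on the CUDA machine, whether it claims support for a flash_attn_ext op with the shapes DeepSeekV3 produces. The shapes and types below (192 for Q/K, 128 for V, F32 Q with F16 K/V) are assumptions for illustration; the ggml calls are the public ggml / ggml-backend API, though exact signatures may differ between versions.

```cpp
// Hedged diagnostic sketch (not from the llama.cpp sources): build a
// flash_attn_ext op with assumed DeepSeek-like shapes and ask the local
// CUDA backend whether it supports it. Link against ggml built with CUDA.
#include "ggml.h"
#include "ggml-backend.h"
#include "ggml-cuda.h"
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // no_alloc: only tensor metadata is needed, no data buffers
    std::vector<uint8_t> buf(16 * 1024 * 1024);
    struct ggml_init_params params = { buf.size(), buf.data(), /*no_alloc=*/true };
    struct ggml_context * ctx = ggml_init(params);

    const int64_t n_tokens = 32, n_kv = 256, n_head = 8; // assumed shapes for illustration
    struct ggml_tensor * q = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 192, n_tokens, n_head, 1);
    struct ggml_tensor * k = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, 192, n_kv,     n_head, 1);
    struct ggml_tensor * v = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, 128, n_kv,     n_head, 1);

    struct ggml_tensor * fa = ggml_flash_attn_ext(ctx, q, k, v, /*mask=*/NULL,
                                                  /*scale=*/1.0f, /*max_bias=*/0.0f,
                                                  /*logit_softcap=*/0.0f);

    ggml_backend_t cuda = ggml_backend_cuda_init(0);
    if (cuda == NULL) {
        std::fprintf(stderr, "failed to init CUDA backend\n");
        return 1;
    }
    std::printf("CUDA supports this flash_attn_ext op: %s\n",
                ggml_backend_supports_op(cuda, fa) ? "yes" : "no");

    ggml_backend_free(cuda);
    ggml_free(ctx);
    return 0;
}
```

If this reports "no" for the real shapes the model uses, the problem is more likely missing kernel support on that GPU; if it reports "yes", the corruption more likely happens during the Metal → CUDA tensor transfer over RPC.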
First Bad Commit
No response