Misc. bug: rpc - Flash Attention Failure in Metal/CUDA RPC Mixed Environment #12655

Closed

@Dango233

Description

Name and Version

version 4992 (af6ae1e)
built with MSVC 19.33.31629.0 for x64

Operating systems

Other? (Please let us know in description)

Which llama.cpp modules do you know to be affected?

Other (Please specify in the next section)

Command line

Problem description & steps to reproduce

Flash Attention Failure in Metal/CUDA RPC Mixed Environment

Issue Description

When using the DeepSeekV3 model with Flash Attention enabled:

  • Running llama-server directly works fine
  • When running llama-server on CUDA with rpc-server on Metal, everything works correctly
  • However, when running llama-server on Metal with rpc-server on CUDA, errors occur on the CUDA side

Specific scenario where the issue occurs:

  • Host machine running llama-server with Metal backend
  • Remote server running rpc-server with CUDA backend
  • DeepSeekV3 model
  • Error occurs in fattn.cu and appears to be related to tensor transfer

Steps to Reproduce

  1. Working configuration (CUDA → Metal):
# Host with CUDA
./llama-server -m /path/to/deepseekv3-model.gguf [...other parameters] -fa --rpc metal-machine:50052

# Remote server with Metal
./rpc-server -H 0.0.0.0 -p 50052

✅(?) Works, but very slow (the RPC layer reports a "null buffer for tensor passed into init_tensor function" error, yet inference still runs)

  2. Failing configuration (Metal → CUDA):
# Host with Metal
./llama-server -m /path/to/deepseekv3-model.gguf -fa [...other parameters] --rpc cuda-machine:50052

# Remote server with CUDA
./rpc-server -H 0.0.0.0 -p 50052

❌ Fails: error raised in fattn.cu, unable to process the tensor

Error Information

On the CUDA rpc-server side, the error is triggered in fattn.cu, around line 33, inside the template <int ncols2> function.
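
For illustration only, the sketch below shows the general pattern of templated head-size dispatch that such code typically follows. It is an assumption about the pattern, not the actual fattn.cu source; the supported sizes and the 192/128 split used for the failing case are placeholders.

// Illustrative C++ sketch (assumption), not the real fattn.cu code.
#include <cstdio>
#include <cstdlib>

// Dispatch over head sizes, analogous to the "template <int ncols2>" function
// named above: supported sizes map to a compiled kernel instantiation,
// anything else aborts.
template <int ncols2>
static void dispatch_flash_attn(int head_size_k, int head_size_v) {
    if (head_size_k == head_size_v) {
        switch (head_size_k) {
            case 64: case 80: case 96: case 112: case 128: case 256:
                std::printf("launch kernel: head size %d, ncols2 %d\n", head_size_k, ncols2);
                return;
            default:
                break;
        }
    }
    // Shapes outside the compiled list end up here -- comparable to an abort
    // on the CUDA side when it receives an unexpected attention tensor shape.
    std::fprintf(stderr, "fatal error: unsupported head sizes K=%d V=%d\n",
                 head_size_k, head_size_v);
    std::abort();
}

int main() {
    dispatch_flash_attn<8>(128, 128); // covered -> launches
    dispatch_flash_attn<8>(192, 128); // asymmetric K/V head sizes (placeholder) -> aborts
}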

Environment

  • Configuration 1: CUDA host with Metal remote server (works correctly)
  • Configuration 2: Metal host with CUDA remote server (fails)
  • DeepSeekV3 UD quant
  • Latest version of llama.cpp

Additional Observations

  • Disabling Flash Attention allows the Metal → CUDA configuration to work normally
  • The issue is direction-specific: CUDA → Metal works, but Metal → CUDA fails
  • The error location is in the Flash Attention-specific head size processing code

This appears to be a one-way compatibility issue when tensors are transferred from Metal to CUDA during RPC communication, particularly for Flash Attention-related tensors.
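
To make the direction-specific behavior observable outside llama-server, the sketch below is a minimal C++ probe using ggml's public backend API; the endpoint string, tensor shapes, and scale are placeholders and not the reporter's actual setup. It builds a ggml_flash_attn_ext node and asks the RPC-connected backend whether it reports support for it.

// Minimal probe sketch (placeholder shapes and endpoint; not the llama.cpp
// scheduler itself). Builds a flash-attention node and queries the
// RPC-connected backend for op support.
#include "ggml.h"
#include "ggml-backend.h"
#include "ggml-rpc.h"

#include <cstdio>

int main() {
    ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ true,   // only graph metadata is needed for the probe
    };
    ggml_context * ctx = ggml_init(params);

    // Placeholder head size, KV length, and head count -- not DeepSeekV3's real shapes.
    const int64_t hs = 128, n_kv = 256, n_head = 8;
    ggml_tensor * q = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, hs, 1,    n_head, 1);
    ggml_tensor * k = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, hs, n_kv, n_head, 1);
    ggml_tensor * v = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, hs, n_kv, n_head, 1);

    // No mask, no ALiBi bias, no logit softcap for this probe.
    ggml_tensor * fa = ggml_flash_attn_ext(ctx, q, k, v, nullptr, 1.0f, 0.0f, 0.0f);

    // Connect to the remote machine the same way --rpc does (placeholder endpoint).
    ggml_backend_t rpc = ggml_backend_rpc_init("cuda-machine:50052");
    if (rpc != nullptr) {
        std::printf("flash_attn_ext supported by RPC backend: %s\n",
                    ggml_backend_supports_op(rpc, fa) ? "yes" : "no");
        ggml_backend_free(rpc);
    } else {
        std::fprintf(stderr, "failed to connect to RPC endpoint\n");
    }

    ggml_free(ctx);
    return 0;
}

Comparing the answer from a local CUDA backend with the answer obtained through RPC might help separate a kernel limitation from an RPC scheduling or transfer problem.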

First Bad Commit

No response

Relevant log output
