Description
Name and Version
version 4992 (af6ae1e)
built with MSVC 19.33.31629.0 for x64
Operating systems
Other? (Please let us know in description)
Which llama.cpp modules do you know to be affected?
Other (Please specify in the next section)
Command line
Problem description & steps to reproduce
Flash Attention Failure in Metal/CUDA RPC Mixed Environment
Issue Description
When using the DeepSeekV3 model with Flash Attention enabled:
- Running llama-server directly (without RPC) works fine
- When running llama-server on CUDA with rpc-server on Metal, everything works correctly
- However, when running llama-server on Metal with rpc-server on CUDA, errors occur on the CUDA side
Specific scenario where the issue occurs:
- Host machine running llama-server with Metal backend
- Remote server running rpc-server with CUDA backend
- DeepSeekV3 model
- The error is raised in fattn.cu and appears to be related to how tensors are transferred over RPC
Steps to Reproduce
- Working configuration (CUDA → Metal):
# Host with CUDA
./llama-server -m /path/to/deepseekv3-model.gguf [...other parameters] -fa --rpc metal-machine:50052
# Remote server with Metal
./rpc-server -H 0.0.0.0 -p 50052
✅ (?) Works, but is very slow (the rpc path logs a "null buffer for tensor passed into init_tensor function" error, yet inference still runs)
- Failing configuration (Metal → CUDA):
# Host with Metal
./llama-server -m /path/to/deepseekv3-model.gguf -fa [...other parameters] --rpc cuda-machine:50052
# Remote server with CUDA
./rpc-server -H 0.0.0.0 -p 50052
❌ Fails with an error in fattn.cu: unable to process the tensor
Error Information
On the CUDA rpc-server side, the error is triggered in fattn.cu, around line 33, inside the `template <int ncols2>` code section.
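For reference, the failure sits in the head-size dispatch near the top of fattn.cu. The snippet below is an illustrative sketch only, not code copied from llama.cpp: the function name, the set of supported head sizes, and the example head size are all assumptions. It only shows the general pattern of a `template <int ncols2>` dispatch that has to abort when an op with an unsupported shape reaches the CUDA flash-attention path, which is what seems to happen here when the graph arrives over RPC.

```cpp
// Illustrative sketch only -- names, supported head sizes, and the example
// head size are assumptions, not code taken from fattn.cu.
#include <cstdint>
#include <cstdio>
#include <cstdlib>

struct toy_tensor {
    int64_t ne[4]; // ne[0] holds the head size, as in ggml_tensor
};

template <int ncols2>
static void flash_attn_dispatch(const toy_tensor * K) {
    // Only head sizes with compiled kernel instantiations can be launched;
    // anything else has to be rejected, which on the rpc-server side shows
    // up as a hard failure inside the CUDA backend.
    switch (K->ne[0]) {
        case  64:
        case 128:
        case 256:
            std::printf("launch kernel: head size %lld, ncols2=%d\n",
                        (long long) K->ne[0], ncols2);
            break;
        default:
            std::fprintf(stderr, "fatal: unsupported head size %lld in flash-attn dispatch\n",
                         (long long) K->ne[0]);
            std::abort();
    }
}

int main() {
    toy_tensor K = {{576, 1, 1, 1}}; // assumed unusual head size, for illustration only
    flash_attn_dispatch<8>(&K);      // aborts in this toy dispatch
    return 0;
}
```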
Environment
- Configuration 1: CUDA host with Metal remote server (works correctly)
- Configuration 2: Metal host with CUDA remote server (fails)
- DeepSeekV3 UD quant
- Latest version of llama.cpp
Additional Observations
- Disabling Flash Attention allows the Metal → CUDA configuration to work normally
- The issue is direction-specific: CUDA → Metal works, but Metal → CUDA fails
- The error location is in the Flash Attention-specific head size processing code
This appears to be a one-way compatibility issue when tensors are transferred from Metal to CUDA during RPC communication, particularly for Flash Attention-related tensors.
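One way to narrow this down (a hedged sketch, not something taken from the llama.cpp tree) would be to ask the CUDA backend directly, on the CUDA machine, whether it claims support for a flash_attn_ext op with the shapes DeepSeekV3 produces. The shapes and types below (192 for Q/K, 128 for V, F32 Q with F16 K/V) are assumptions for illustration; the ggml calls are the public ggml / ggml-backend API, though exact signatures may differ between versions.

```cpp
// Hedged diagnostic sketch (not from the llama.cpp sources): build a
// flash_attn_ext op with assumed DeepSeek-like shapes and ask the local
// CUDA backend whether it supports it. Link against ggml built with CUDA.
#include "ggml.h"
#include "ggml-backend.h"
#include "ggml-cuda.h"
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // no_alloc: only tensor metadata is needed, no data buffers
    std::vector<uint8_t> buf(16 * 1024 * 1024);
    struct ggml_init_params params = { buf.size(), buf.data(), /*no_alloc=*/true };
    struct ggml_context * ctx = ggml_init(params);

    const int64_t n_tokens = 32, n_kv = 256, n_head = 8; // assumed shapes for illustration
    struct ggml_tensor * q = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 192, n_tokens, n_head, 1);
    struct ggml_tensor * k = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, 192, n_kv,     n_head, 1);
    struct ggml_tensor * v = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, 128, n_kv,     n_head, 1);

    struct ggml_tensor * fa = ggml_flash_attn_ext(ctx, q, k, v, /*mask=*/NULL,
                                                  /*scale=*/1.0f, /*max_bias=*/0.0f,
                                                  /*logit_softcap=*/0.0f);

    ggml_backend_t cuda = ggml_backend_cuda_init(0);
    if (cuda == NULL) {
        std::fprintf(stderr, "failed to init CUDA backend\n");
        return 1;
    }
    std::printf("CUDA supports this flash_attn_ext op: %s\n",
                ggml_backend_supports_op(cuda, fa) ? "yes" : "no");

    ggml_backend_free(cuda);
    ggml_free(ctx);
    return 0;
}
```

If this reports "no" for the real shapes the model uses, the problem is more likely missing kernel support on that GPU; if it reports "yes", the corruption more likely happens during the Metal → CUDA tensor transfer over RPC.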
First Bad Commit
No response