Feature Request: Multimodal: llama-server support for Qwen2.5-VL chat template type: list of image paths (type: "video") #13905

### Prerequisites

- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the [README.md](https://github.com/ggml-org/llama.cpp/blob/master/README.md).
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the [Discussions](https://github.com/ggml-org/llama.cpp/discussions), and have a new and useful enhancement to share.

### Feature Description

Hello, I hope this message finds everyone well!

For the Qwen2.5-VL-7B-Instruct-GGUF model, I see on this HuggingFace page (https://huggingface.co/unsloth/Qwen2.5-VL-7B-Instruct-GGUF) that the chat template for Qwen2.5-VL supports a prompt element of type "video" (in practice, a sequence of images) in this format:

```
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
```

The example above passes the video as a list of frames, each given as a path to an image file. llama-server already accepts multiple elements of type "image_url" with JPEG images encoded as base64, as in the chat template shown below. However, Qwen2.5-VL tends to interpret these as a set of distinct images.

I would like to set the element type to "video" so that the model interprets the list as frames belonging to the same video. Currently, llama-server responds with Error Code 500: unknown request type "video" when I try the chat template above.

Below is the chat template that llama-server currently supports:

```
{
    "role": "user",
    "content": [
        {"type": "text", "text": "What type of video is this and what's happening in it? Be specific about the content type and general activities you observe."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgAAB//9k="} },
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgAAC//9k="} },
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgAAC//9k="} }
    ]
}
```
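For anyone who needs something working in the meantime, here is a minimal client-side sketch that rewrites a "video" element into the "image_url" form shown above. The helper names `frame_to_data_uri` and `expand_video_entries` are hypothetical (they are not part of llama.cpp or the Qwen tooling), and the `file://` handling assumes POSIX-style paths. Note that this only gets the request past the "unknown request type" error; it does not make the model treat the frames as one video, which is the point of this request.

```python
import base64
import mimetypes
from pathlib import Path
from urllib.parse import unquote, urlparse


def frame_to_data_uri(frame: str) -> str:
    """Read one frame (plain path or file:// URI) and return a base64 data URI."""
    # Accept both "file:///path/to/frame.jpg" and "/path/to/frame.jpg".
    if frame.startswith("file://"):
        path = Path(unquote(urlparse(frame).path))
    else:
        path = Path(frame)
    # Fall back to JPEG when the extension is not recognised (an assumption).
    mime = mimetypes.guess_type(path.name)[0] or "image/jpeg"
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def expand_video_entries(messages: list[dict]) -> list[dict]:
    """Return a copy of messages with each "video" element replaced by
    one "image_url" element per frame. The input list is not modified."""
    converted = []
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            converted.append(message)  # plain-string content needs no rewriting
            continue
        new_content = []
        for part in content:
            if part.get("type") == "video":
                new_content.extend(
                    {"type": "image_url", "image_url": {"url": frame_to_data_uri(f)}}
                    for f in part["video"]
                )
            else:
                new_content.append(part)
        converted.append({**message, "content": new_content})
    return converted
```

Calling `expand_video_entries(messages)` on the `messages` list from the first example should yield a list that can be sent as the `messages` field of a chat completion request to llama-server, provided the frame files exist on the client machine.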

Thank you very much for your time and consideration!


### Motivation

Support for the "video" type would let Qwen2.5-VL interpret a list of frames as a single video rather than as unrelated images, which would significantly improve video inference with this model on llama-server.

### Possible Implementation

_No response_
