Feature Request: Multimodal: llama-server support for Qwen2.5-VL chat template type: list of image paths (type: "video") #13905

Open
@omarwahby-telestream

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Hello, I hope this message finds everyone well!

For the Qwen2.5-VL-7B-Instruct-GGUF model, the HuggingFace page (https://huggingface.co/unsloth/Qwen2.5-VL-7B-Instruct-GGUF) shows that Qwen2.5-VL accepts a content element of type "video" (in reality, a sequence of images) in this message format:

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

The message above passes the video as a list of paths to image files, one per frame. llama-server already supports a list of "image_url" elements with JPEG images encoded as base64, shown below. However, Qwen2.5-VL tends to interpret these as a set of distinct images. I would like to set the element type to "video" so that the model interprets the list as frames belonging to the same video. Currently, llama-server responds with Error Code 500: unknown request type "video" when I send the message above.

Below is the message format that llama-server currently supports:

{
    "role": "user",
    "content": [
        {"type": "text", "text": "What type of video is this and what's happening in it? Be specific about the content type and general activities you observe."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgAAB//9k="} },
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgAAC//9k="} },
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgAAC//9k="} }
    ]
}
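
Until a "video" type is supported, a client can rewrite the first message into the second form before sending it. The sketch below is only an illustration of that workaround, not part of llama-server: the helper names `frame_to_data_uri` and `expand_video_parts` are hypothetical, frames are assumed to be JPEG files on a POSIX filesystem, and the model will still see the result as separate images rather than as one video.

```python
import base64
from pathlib import Path
from urllib.parse import unquote, urlparse


def frame_to_data_uri(frame: str) -> str:
    """Read one JPEG frame (file:// URI or plain path) and return a base64 data URI.

    Assumption: POSIX-style paths; Windows file URIs would need extra handling.
    """
    parsed = urlparse(frame)
    path = Path(unquote(parsed.path)) if parsed.scheme == "file" else Path(frame)
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:image/jpeg;base64,{encoded}"


def expand_video_parts(message: dict) -> dict:
    """Return a copy of `message` with each "video" part replaced by "image_url" parts."""
    content = []
    for part in message["content"]:
        if part.get("type") == "video":
            # One "image_url" entry per frame, in the original frame order.
            for frame in part["video"]:
                content.append(
                    {"type": "image_url", "image_url": {"url": frame_to_data_uri(frame)}}
                )
        else:
            # Text and any other parts pass through unchanged.
            content.append(part)
    return {**message, "content": content}
```

Calling `expand_video_parts(messages[0])` on the first example would yield a message shaped like the second one, with the text part kept in its original position after the frames.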

Thank you very much for your time and consideration!

Motivation

This feature would significantly improve video inference with Qwen2.5-VL on llama-server, because the model could treat a list of frames as a single video rather than as a set of distinct images.

Possible Implementation

No response

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
