Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Hello, I hope this message finds everyone well!
For the Qwen2.5-VL-7B-Instruct-GGUF model, I see on this Hugging Face page (https://huggingface.co/unsloth/Qwen2.5-VL-7B-Instruct-GGUF) that the Qwen2.5-VL chat template currently supports a content element of type "video" (in reality, a sequence of image frames) as prompt input in this format:
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
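To make the request concrete, here is a small sketch of how a client might assemble that message list from a set of frame paths. The helper name build_video_message is hypothetical, purely for illustration; the structure it produces matches the template above.

```python
def build_video_message(frame_paths, prompt):
    """Build a chat message pairing a "video" content part (a list of
    file:// frame URIs) with a text prompt, matching the Qwen2.5-VL
    chat template shown above. Hypothetical helper for illustration."""
    return [
        {
            "role": "user",
            "content": [
                {
                    # The list of frames is tagged as one video, not
                    # as independent images.
                    "type": "video",
                    "video": [f"file://{p}" for p in frame_paths],
                },
                {"type": "text", "text": prompt},
            ],
        }
    ]

messages = build_video_message(
    ["/path/to/frame1.jpg", "/path/to/frame2.jpg"],
    "Describe this video.",
)
```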
The chat template above passes the frames as a list of paths to image files. llama-server already supports a list of content parts of type "image_url" with JPEG images encoded as base64, as shown in the chat template below. However, Qwen2.5-VL tends to interpret these as a set of distinct, unrelated images. I would like to be able to mark the list as type "video" so the model interprets it as a set of frames belonging to the same video. Currently, llama-server responds with HTTP 500 (unknown request type "video") when I try using the chat template above.
Below is the chat template that llama-server currently supports:
{
    "role": "user",
    "content": [
        {"type": "text", "text": "What type of video is this and what's happening in it? Be specific about the content type and general activities you observe."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgAAB//9k="} },
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgAAC//9k="} },
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgAAC//9k="} }
    ]
}
Thank you very much for your time and consideration!
Motivation
This feature would significantly improve video inference with Qwen2.5-VL models on llama-server, since the model could treat a list of frames as a single video rather than as unrelated images.
Possible Implementation
No response