Description
I have fine-tuned Qwen2.5-VL-7B with Unsloth and merged the LoRA adapter back into the base model. Now I want to use llama.cpp to quantize it to Q4. As a first step, I converted the merged model to GGUF with convert_hf_to_gguf.py (the rough conversion command is shown below, after the server command). Before quantizing, I wanted to test the unquantized model, so I deployed it with the following command:
./llama-server -m /root/autodl-tmp/qwen2.5-vl/qwen-gguf/qwen2.5.gguf -c 2048
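For reference, the earlier GGUF conversion step was along these lines (the merged-model directory path is a placeholder for my actual merge output, and the exact flags are from memory):

# the merged HF model directory below is a placeholder; --outfile matches the GGUF used above
python convert_hf_to_gguf.py /root/autodl-tmp/qwen2.5-vl/merged \
  --outfile /root/autodl-tmp/qwen2.5-vl/qwen-gguf/qwen2.5.gguf \
  --outtype f16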
The server started without any errors. However, when I tested it with the following request:
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/root/autodl-tmp/qwen2.5-vl/qwen-gguf/qwen2.5.gguf",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://oss-pai-emcfh1jjcesunsrf7g-cn-guangzhou.oss-cn-guangzhou.aliyuncs.com/031920645691.jpg?Expires=1740968587&OSSAccessKeyId=TMP.3KoFNaN1sZAKuMb8zSRv5Ct65nWvYgsQfACyR9DRFXPzTVTVh4Ym6uQUp8nXcoANAP7MatHJB5Gux1iz2iwRgQEfPM4zpc&Signature=2%2FkCE6f5QkjQhY7t9zsCYSacmiA%3D"}},
        {"type": "text", "text": "This is a picture of an electricity meter. Extract the meter reading: it has 6 digits in total, and the last digit is a decimal place that does not need to be extracted. Return only the final meter reading, nothing else."}
      ]}
    ]
  }'
I encountered an error stating that the model does not support image input.
After some research, I found that deploying a multimodal model with llama.cpp generally uses a command like this:
build/bin/llama-server -m ../models/BroadBit/Qwen2.5-VL-7B-Instruct-Q8_0.gguf --mmproj ../models/BroadBit/mmproj-Qwen2.5-VL-7B-Instruct-f16.gguf -c 32768 -ngl 50 --temp 0.01 -np 1 --host 0.0.0.0 --port 18080 --mlock --no-warmup -t 4
Here, the --mmproj option is used. I would like to know how to generate the corresponding mmproj file when converting a multimodal model to GGUF with llama.cpp. I am not very familiar with llama.cpp and would appreciate guidance from someone more experienced.
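The closest lead I have found so far is that newer versions of convert_hf_to_gguf.py appear to accept a --mmproj flag that exports the vision projector as a separate GGUF, which I imagine would be invoked roughly like the sketch below (the flag usage and output filename are my guess from what I read, not something I have verified):

# unverified: --mmproj usage and the output filename here are my guess, not confirmed
python convert_hf_to_gguf.py /root/autodl-tmp/qwen2.5-vl/merged --mmproj \
  --outfile /root/autodl-tmp/qwen2.5-vl/qwen-gguf/mmproj-qwen2.5-f16.gguf

If this is not the right approach for Qwen2.5-VL, a pointer to the correct script or flag would be very helpful.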