Replace use of STB with something that can handle more image formats? #15542

markg85 · 2025-08-24T16:00:37Z

Hi,

I was just playing with a VL model to describe an image. The image I tried was a WebP, which currently cannot be decoded:
mtmd_helper_bitmap_init_from_buf: failed to decode image bytes

It turns out that llama.cpp uses stb_image, which doesn't support WebP, and the stb project doesn't want to add additional formats either.

In other words, image format support in llama.cpp is limited to what stb supports.

I would like to be able to use WebP and JXL, and perhaps even QOI.
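
To make the failure above easier to diagnose, the decoder could first report which format it was handed. The following is a minimal sketch, assuming a hypothetical helper named `sniff_image_format` (it is not part of llama.cpp or stb) that only inspects the leading magic bytes of the buffer; it does not validate the rest of the file.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper, not part of llama.cpp: true if `magic` appears at `offset`.
static bool has_magic(const unsigned char * buf, std::size_t len, std::size_t offset,
                      const void * magic, std::size_t magic_len) {
    return len >= offset + magic_len && std::memcmp(buf + offset, magic, magic_len) == 0;
}

// Hypothetical helper: guess the image format from its signature bytes only.
std::string sniff_image_format(const unsigned char * buf, std::size_t len) {
    static const unsigned char png[]  = {0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    static const unsigned char jpeg[] = {0xFF, 0xD8, 0xFF};
    // JPEG XL has two signatures: a bare codestream and an ISO BMFF style container.
    static const unsigned char jxl_codestream[] = {0xFF, 0x0A};
    static const unsigned char jxl_container[]  = {0x00, 0x00, 0x00, 0x0C, 'J', 'X', 'L', ' ',
                                                   0x0D, 0x0A, 0x87, 0x0A};

    if (has_magic(buf, len, 0, png, sizeof(png)))   return "png";
    if (has_magic(buf, len, 0, jpeg, sizeof(jpeg))) return "jpeg";
    // WebP is a RIFF file whose form type at byte 8 is "WEBP".
    if (has_magic(buf, len, 0, "RIFF", 4) && has_magic(buf, len, 8, "WEBP", 4)) return "webp";
    if (has_magic(buf, len, 0, "qoif", 4)) return "qoi";
    if (has_magic(buf, len, 0, jxl_container, sizeof(jxl_container)) ||
        has_magic(buf, len, 0, jxl_codestream, sizeof(jxl_codestream))) return "jxl";
    return "unknown";
}
```

With something like this, the error could say "unsupported image format: webp" instead of only "failed to decode image bytes".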

Now I'm curious whether llama.cpp would be open to replacing stb with another image decoding library. I can see only two serious libraries that could do this:

ImageMagick
FFmpeg

But I'm definitely open to other alternatives.
For any option I'd suggest using system libraries, as that gives more flexibility in terms of image formats: new formats arrive with a system library update, whereas a built-in library would require a recompile of llama.cpp.

I have looked at example code for both FFmpeg and ImageMagick that decodes an image and returns raw pixel data. Suffice it to say, the FFmpeg side is substantially more code for the same functionality. Then again, FFmpeg might give more flexibility when the content to be loaded is, for example, a video. Tradeoffs, I suppose ;)

What I want to know is whether a patch implementing this would be welcome.
And yes, I would be willing to work on this.

Looking forward to your thoughts!

markg85 · 2025-08-24T17:24:36Z

In fact, the implementation is very simple. Here it is (ImageMagick):

diff --git a/tools/mtmd/CMakeLists.txt b/tools/mtmd/CMakeLists.txt
index 4baa15b9..0af5ae4c 100644
--- a/tools/mtmd/CMakeLists.txt
+++ b/tools/mtmd/CMakeLists.txt
@@ -1,6 +1,7 @@
 # mtmd
 
 find_package(Threads REQUIRED)
+find_package(ImageMagick REQUIRED COMPONENTS Magick++)
 
 add_library(mtmd
             mtmd.cpp
@@ -14,10 +15,11 @@ add_library(mtmd
             )
 
 target_link_libraries     (mtmd PUBLIC ggml llama)
-target_link_libraries     (mtmd PRIVATE Threads::Threads)
+target_link_libraries     (mtmd PRIVATE Threads::Threads ImageMagick::Magick++)
 target_include_directories(mtmd PUBLIC  .)
 target_include_directories(mtmd PRIVATE ../..)
 target_include_directories(mtmd PRIVATE ../../vendor)
+target_include_directories(mtmd PRIVATE ${ImageMagick_INCLUDE_DIRS})
 target_compile_features   (mtmd PRIVATE cxx_std_17)
 
 if (BUILD_SHARED_LIBS)
@@ -38,7 +40,7 @@ set_target_properties(mtmd
 install(TARGETS mtmd LIBRARY PUBLIC_HEADER)
 
 if (NOT MSVC)
-    # for stb_image.h and miniaudio.h
+    # for miniaudio.h
     target_compile_options(mtmd PRIVATE -Wno-cast-qual)
 endif()
 
diff --git a/tools/mtmd/mtmd-helper.cpp b/tools/mtmd/mtmd-helper.cpp
index 686f42f3..cf303504 100644
--- a/tools/mtmd/mtmd-helper.cpp
+++ b/tools/mtmd/mtmd-helper.cpp
@@ -29,8 +29,7 @@
 #define MA_API static
 #include "miniaudio/miniaudio.h"
 
-#define STB_IMAGE_IMPLEMENTATION
-#include "stb/stb_image.h"
+#include <Magick++.h>
 
 #define LOG_INF(...) fprintf(stdout, __VA_ARGS__)
 #define LOG_ERR(...) fprintf(stderr, __VA_ARGS__)
@@ -423,15 +422,40 @@ mtmd_bitmap * mtmd_helper_bitmap_init_from_buf(mtmd_context * ctx, const unsigne
 
     // otherwise, we assume it's an image
     mtmd_bitmap * result = nullptr;
-    {
-        int nx, ny, nc;
-        auto * data = stbi_load_from_memory(buf, len, &nx, &ny, &nc, 3);
-        if (!data) {
-            LOG_ERR("%s: failed to decode image bytes\n", __func__);
-            return nullptr;
-        }
+    try {
+        // Create a Blob object from the in-memory data.
+        Magick::Blob blob(buf, len);
+        Magick::Image image;
+
+        // Read the image from the blob.
+        image.read(blob);
+
+        // Prepare a new blob to hold the raw RGB pixel data.
+        Magick::Blob rgb_blob;
+        // Write the image data to the new blob in the desired format (RGB, 8-bit per channel).
+        // This ensures the pixel data is in a simple, contiguous array format.
+        image.write(&rgb_blob, "RGB", 8);
+
+        // Get image dimensions.
+        int nx = image.columns();
+        int ny = image.rows();
+
+        // mtmd_bitmap_init copies the pixels: the stb path freed its buffer right after
+        // the call. The blob's storage can therefore be passed directly, without a
+        // separate new[] copy that would never be freed.
+        const auto * data = static_cast<const unsigned char *>(rgb_blob.data());
+
+        // Initialize the bitmap; rgb_blob stays alive until the end of this scope.
         result = mtmd_bitmap_init(nx, ny, data);
-        stbi_image_free(data);
+    } catch (const Magick::Exception &e) {
+        LOG_ERR("%s: failed to decode image bytes with ImageMagick: %s\n", __func__, e.what());
+        return nullptr;
     }
     return result;
 }
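
The patch hands the bytes written by ImageMagick to mtmd_bitmap_init together with only nx and ny, so it relies on the buffer being exactly 3 bytes per pixel. A defensive check before that call might look like the sketch below. The helper names are hypothetical, and the 3-bytes-per-pixel layout is an assumption based on the stb call requesting 3 channels.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical helper: byte count of a tightly packed 8-bit RGB image,
// or 0 if a dimension is zero or the product would overflow size_t.
std::size_t packed_rgb8_size(std::size_t nx, std::size_t ny) {
    const std::size_t max = std::numeric_limits<std::size_t>::max();
    if (nx == 0 || ny == 0) return 0;
    if (nx > max / 3)       return 0;
    const std::size_t row = nx * 3;   // bytes per row
    if (ny > max / row)     return 0;
    return row * ny;
}

// Hypothetical helper: true only if the decoded buffer has exactly the expected length.
// A 16-bit-per-channel or RGBA buffer fails this check instead of being misread.
bool is_packed_rgb8(std::size_t decoded_len, std::size_t nx, std::size_t ny) {
    const std::size_t expected = packed_rgb8_size(nx, ny);
    return expected != 0 && decoded_len == expected;
}
```

In the patch, this would be called with the length of the RGB blob and the image dimensions, returning nullptr with a logged error when the check fails.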

I tried it on WebP (which does not work on current master) and it works just fine. I tested with an image of horizontal stripes in different colors, asking the model to list the colors in order or to name the color of the n-th stripe. It all works wonderfully!
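
For anyone who wants to repeat the stripe test, a test image can be generated rather than drawn by hand. This is a rough sketch, not the image used above: `make_stripe_ppm` is a hypothetical helper that builds a binary PPM (P6) in memory, which can then be converted to WebP, JXL, or QOI with any image tool.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using Rgb = std::array<std::uint8_t, 3>;

// Hypothetical helper: build a binary PPM (P6) with one horizontal stripe per color,
// top to bottom, each stripe `stripe_height` pixels tall and `width` pixels wide.
std::vector<std::uint8_t> make_stripe_ppm(const std::vector<Rgb> & colors,
                                          std::size_t width, std::size_t stripe_height) {
    const std::size_t height = colors.size() * stripe_height;
    const std::string header =
        "P6\n" + std::to_string(width) + " " + std::to_string(height) + "\n255\n";

    std::vector<std::uint8_t> out(header.begin(), header.end());
    out.reserve(header.size() + width * height * 3);

    for (const Rgb & color : colors) {
        for (std::size_t y = 0; y < stripe_height; ++y) {
            for (std::size_t x = 0; x < width; ++x) {
                out.insert(out.end(), color.begin(), color.end());  // one RGB pixel
            }
        }
    }
    return out;
}
```

Writing the returned bytes to a file in binary mode gives a reference image whose stripe order is known exactly, so the model's answers can be checked against the color list.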

I also tried it on QOI. It's a bit annoying, as the browser (using Open WebUI) uploads the file but can't display it itself (the browser doesn't understand QOI). But it does in fact work: the VL model gets the image and is perfectly able to describe it.
