Replies: 1 comment
-
In fact, the implementation is very simple. Here it is (using ImageMagick):

diff --git a/tools/mtmd/CMakeLists.txt b/tools/mtmd/CMakeLists.txt
index 4baa15b9..0af5ae4c 100644
--- a/tools/mtmd/CMakeLists.txt
+++ b/tools/mtmd/CMakeLists.txt
@@ -1,6 +1,7 @@
# mtmd
find_package(Threads REQUIRED)
+find_package(ImageMagick REQUIRED COMPONENTS Magick++)
add_library(mtmd
mtmd.cpp
@@ -14,10 +15,11 @@ add_library(mtmd
)
target_link_libraries (mtmd PUBLIC ggml llama)
-target_link_libraries (mtmd PRIVATE Threads::Threads)
+target_link_libraries (mtmd PRIVATE Threads::Threads ImageMagick::Magick++)
target_include_directories(mtmd PUBLIC .)
target_include_directories(mtmd PRIVATE ../..)
target_include_directories(mtmd PRIVATE ../../vendor)
+target_include_directories(mtmd PRIVATE ${ImageMagick_INCLUDE_DIRS})
target_compile_features (mtmd PRIVATE cxx_std_17)
if (BUILD_SHARED_LIBS)
@@ -38,7 +40,7 @@ set_target_properties(mtmd
install(TARGETS mtmd LIBRARY PUBLIC_HEADER)
if (NOT MSVC)
- # for stb_image.h and miniaudio.h
+ # for miniaudio.h
target_compile_options(mtmd PRIVATE -Wno-cast-qual)
endif()
diff --git a/tools/mtmd/mtmd-helper.cpp b/tools/mtmd/mtmd-helper.cpp
index 686f42f3..cf303504 100644
--- a/tools/mtmd/mtmd-helper.cpp
+++ b/tools/mtmd/mtmd-helper.cpp
@@ -29,8 +29,7 @@
#define MA_API static
#include "miniaudio/miniaudio.h"
-#define STB_IMAGE_IMPLEMENTATION
-#include "stb/stb_image.h"
+#include <Magick++.h>
#define LOG_INF(...) fprintf(stdout, __VA_ARGS__)
#define LOG_ERR(...) fprintf(stderr, __VA_ARGS__)
@@ -423,15 +422,40 @@ mtmd_bitmap * mtmd_helper_bitmap_init_from_buf(mtmd_context * ctx, const unsigne
// otherwise, we assume it's an image
mtmd_bitmap * result = nullptr;
- {
- int nx, ny, nc;
- auto * data = stbi_load_from_memory(buf, len, &nx, &ny, &nc, 3);
- if (!data) {
- LOG_ERR("%s: failed to decode image bytes\n", __func__);
- return nullptr;
- }
+ try {
+ // Create a Blob object from the in-memory data.
+ Magick::Blob blob(buf, len);
+ Magick::Image image;
+
+ // Read the image from the blob.
+ image.read(blob);
+
+ // Prepare a new blob to hold the raw RGB pixel data.
+ Magick::Blob rgb_blob;
+ // Write the image data to the new blob in the desired format (RGB, 8-bit per channel).
+ // This ensures the pixel data is in a simple, contiguous array format.
+ image.write(&rgb_blob, "RGB");
+
+ // Get image dimensions.
+ int nx = image.columns();
+ int ny = image.rows();
+
+ // Create a copy of the pixel data, as the blob's lifetime is tied to this scope.
+ size_t data_size = rgb_blob.length();
+ unsigned char * data = new unsigned char[data_size];
+ memcpy(data, rgb_blob.data(), data_size);
+
+ // Initialize the bitmap with the copied data.
result = mtmd_bitmap_init(nx, ny, data);
+ // mtmd_bitmap_init copies the pixels (the old code freed its buffer right after this call), so free our copy too.
+ delete[] data;
- stbi_image_free(data);
+ } catch (const Magick::Exception &e) {
+ LOG_ERR("%s: failed to decode image bytes with ImageMagick: %s\n", __func__, e.what());
+ return nullptr;
}
return result;
}

I tried it on WebP (which does not work on current master) and it works just fine. I tested with an image of horizontal stripes in different colors and asked the model to list the colors in order, or to name the color of the n-th stripe. Both work well. I also tried QOI. That is a bit annoying because the browser (I'm using Open WebUI) uploads the file but can't display it itself, since browsers don't understand QOI. But it does in fact work: the VL model receives the image and is perfectly able to describe it.
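One thing the patch takes for granted is that the raw "RGB" blob is exactly `nx * ny * 3` bytes. As far as I understand, ImageMagick's raw writer follows the image's bit depth, so a 16-bit source could yield 6 bytes per pixel unless the depth is forced to 8 first. A minimal sketch of a size check before handing the pixels over, assuming tightly packed 8-bit RGB is what the bitmap expects; `copy_rgb8` is a hypothetical helper, not part of llama.cpp or Magick++:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper (not llama.cpp or Magick++ API): copy a decoder's raw
// output into an owned buffer only if it is exactly nx * ny * 3 bytes,
// i.e. tightly packed 8-bit RGB.
static std::optional<std::vector<unsigned char>> copy_rgb8(
        const void * pixels, size_t len, size_t nx, size_t ny) {
    if (pixels == nullptr || nx == 0 || ny == 0) {
        return std::nullopt;
    }
    // Reject dimensions whose byte count would overflow size_t.
    if (nx > SIZE_MAX / 3 / ny) {
        return std::nullopt;
    }
    // A mismatch means the decoder produced something other than RGB8
    // (for example 16-bit samples or an extra alpha channel).
    if (len != nx * ny * 3) {
        return std::nullopt;
    }
    const unsigned char * p = static_cast<const unsigned char *>(pixels);
    return std::vector<unsigned char>(p, p + len);
}
```

In the patch this would be called with `rgb_blob.data()`, `rgb_blob.length()` and the image dimensions. The returned vector owns its memory, so nothing has to be freed by hand after the bitmap is created.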
-
Hi,
I was just playing with a VL model to describe an image. The image I tried was a WebP, which currently cannot be decoded:
mtmd_helper_bitmap_init_from_buf: failed to decode image bytes
It turns out that llama.cpp uses stb, which doesn't support WebP, and stb doesn't want to add additional formats either.
In other words, image format support in llama.cpp is limited to what stb supports.
I would like to be able to use WebP and JXL, and perhaps even QOI.
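To make the gap concrete, here is a minimal sketch (not llama.cpp code) that guesses an image's format from its leading magic bytes. `sniff_image_format` is a hypothetical name, and the list of signatures is deliberately short. Of the formats below, stb_image decodes PNG and JPEG, while WebP, JPEG XL and QOI are the ones that would need another decoder:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper (not part of llama.cpp or stb): guess the format of an
// in-memory image from its leading magic bytes.
static std::string sniff_image_format(const unsigned char * buf, size_t len) {
    // True if the n bytes at offset off match sig; checks bounds first.
    auto starts_with = [&](const void * sig, size_t n, size_t off = 0) {
        return len >= off + n && std::memcmp(buf + off, sig, n) == 0;
    };
    if (starts_with("\x89PNG\r\n\x1a\n", 8)) return "png";
    if (starts_with("\xFF\xD8\xFF", 3)) return "jpeg";
    // WebP is a RIFF container: "RIFF", 4 size bytes, then "WEBP".
    if (starts_with("RIFF", 4) && starts_with("WEBP", 4, 8)) return "webp";
    if (starts_with("qoif", 4)) return "qoi";
    // JPEG XL: either the 12-byte container signature or a bare codestream.
    if (starts_with("\x00\x00\x00\x0cJXL \r\n\x87\n", 12)) return "jxl";
    if (starts_with("\xFF\x0A", 2)) return "jxl";
    return "unknown";
}
```

A check like this could also be used to print a clearer error than "failed to decode image bytes" when the format is recognized but no decoder is available.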
Now I'm curious whether llama.cpp would be open to the idea of replacing stb with another image decoding framework. I can only see two serious libraries that could do this: ffmpeg and ImageMagick.
But I'm definitely open to other alternatives.
For either option I'd suggest using the system libraries, as that gives more flexibility in terms of image formats. With a built-in library, that flexibility would require a recompile of llama.cpp.
I have looked at example code for both ffmpeg and ImageMagick that decodes an image and returns raw pixel data. Suffice it to say, the ffmpeg side is substantially more code for the same functionality. Then again, ffmpeg might give more flexibility when the content to be loaded is, for example, a video. Tradeoffs, I suppose ;)
What I want to know is whether a patch implementing this would be welcome.
And yes, I would be willing to work on this.
Looking forward to your thoughts!