Apple NPU acceleration integrated into llama.cpp, using MiniCPM-V 4.0 as an example. #15262

Open · wants to merge 18 commits into base: master
Conversation

@tc-mb (Contributor) commented Aug 12, 2025

As stated in #14983, I have integrated Apple NPU (ANE) acceleration into llama.cpp.

Using MiniCPM-V 4.0 as an example, I will walk through a simple way to use the ANE, and I hope we can discuss a better approach.

  1. Build llama.cpp locally. I added an ENABLE_ANE option to control whether ANE is used:

cmake -B build -DENABLE_ANE=ON
cmake --build build --config Release -j 8

  2. Download the ANE model from Hugging Face or ModelScope. If you downloaded the zip file, please unzip it.

  3. Run it the same way as with mmproj; I added an "--ane" flag whose argument is the path to the downloaded ane_minicpmv4_vit_f16.mlmodelc directory:

./build/bin/llama-mtmd-cli -m {dir_path}/ggml-model-Q4_0.gguf --mmproj {dir_path}/mmproj-model-f16.gguf --ane {dir_path}/ane_minicpmv4_vit_f16.mlmodelc -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image {dir_path}/xx.png -p "Describe the content of the image in detail."
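The PR's actual CMake changes are not quoted in this thread; a build option like ENABLE_ANE is typically wired up along these lines. This is an illustrative sketch only — the file name clip-coreml.mm and the target name are assumptions, not the real diff:

```cmake
# Hypothetical sketch: gate the CoreML/ANE bridge behind an ENABLE_ANE option.
option(ENABLE_ANE "Enable Apple Neural Engine (CoreML) acceleration" OFF)

if (ENABLE_ANE)
    # Assumed name for the Objective-C++ CoreML bridge source.
    target_sources(mtmd PRIVATE clip-coreml.mm)
    # Lets C++ code conditionally compile the CoreML call sites.
    target_compile_definitions(mtmd PRIVATE ENABLE_ANE)
    # CoreML and Foundation frameworks are required on Apple platforms.
    target_link_libraries(mtmd PRIVATE "-framework Foundation" "-framework CoreML")
endif()
```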

I tested ANE acceleration on several devices. The benchmark results are as follows:

Mac M2, q4_K_M — prefill time (ms):

| # | image size | MiniCPM-V 4.0 (ANE) | MiniCPM-V 4.0 |
|---|------------|---------------------|---------------|
| 1 | 448×448    | 790.26              | 5716.77       |
| 2 | 600×600    | 1894.24             | 17961.35      |
| 3 | 700×700    | 2954.34             | 27866.59      |
| 4 | 800×800    | 2964.44             | 27946.48      |
| 5 | 1024×625   | 2977.56             | 30111.43      |
| 6 | 1024×768   | 2975.98             | 30415.11      |
| 7 | 1280×960   | 4065.79             | 41889.12      |

Mac M4, q4_K_M — prefill time (ms):

| # | image size | MiniCPM-V 4.0 (ANE) | MiniCPM-V 4.0 |
|---|------------|---------------------|---------------|
| 1 | 448×448    | 412.57              | 736.57        |
| 2 | 600×600    | 989.44              | 3365.09       |
| 3 | 700×700    | 1564.61             | 4031.90       |
| 4 | 800×800    | 1555.85             | 4124.81       |
| 5 | 1024×625   | 1563.65             | 5405.13       |
| 6 | 1024×768   | 1567.45             | 5169.05       |
| 7 | 1280×960   | 2141.54             | 7544.96       |

A point worth noting: the first time the ANE model is used there is a one-off loading cost, so the first run is slightly slower. After that, as long as the model is not updated, it stays resident and ready in the system.

github-actions bot added the examples and python (python script changes) labels on Aug 12, 2025
@ggerganov (Member) left a comment:

Generally looks OK. Need to improve encapsulation of the CoreML code (see comments). Would need a review from @ngxson.

Also:

  • Use "CoreML" instead of "ANE"
  • Would eventually need instructions for generating the CoreML inference code - can add those after the PR is approved

Comment on lines +98 to +100
bool ane_embedding(struct clip_ctx * ctx, int n_threads, const struct clip_image_f32_batch * imgs, float * vec);
bool ane_resampler(struct clip_ctx * ctx, int n_threads, const struct clip_image_f32_batch * imgs, const float * vit_embedding, float * vec);

No need to expose this in the public interface

Comment on lines +115 to +117

// ANE support functions
void clip_set_ane_model_path(struct clip_ctx * ctx, const char * ane_model_path);

We should find a way to avoid this. Maybe we can do something similar to whisper.cpp:

https://github.com/ggml-org/whisper.cpp/blob/f7502dca872866a310fe69d30b163fa87d256319/src/whisper.cpp#L3351-L3373

@@ -82,6 +82,7 @@ struct mtmd_context_params {
enum ggml_log_level verbosity;
const char * image_marker; // deprecated, use media_marker instead
const char * media_marker;
const char * ane_model_path; // path to ANE model for iOS

Instead of the term "ane", use the term "coreml", as it is more correct: CoreML models can run not only on the Apple Neural Engine, but also on the GPU and CPU.

Comment on lines +3845 to +3852

static int flag = 0;
static const void* coremlEncoder = NULL;
static std::string cached_model_path = "";

// Check if we need to load a new model
if (flag == 0 || (ane_model_path && cached_model_path != ane_model_path)) {
if (coremlEncoder) {

Avoid this global state. Figure out a way to move this to the clip context.

Labels: examples, python (python script changes) · 3 participants