Apple NPU acceleration integrated into llama.cpp, using MiniCPM-V 4.0 as an example. #15262
Conversation
Feat ios
feat ios: add clean kv cache
Generally looks OK. Needs better encapsulation of the CoreML code (see comments). Would need a review from @ngxson.
Also:
- Use "CoreML" instead of "ANE"
- Would eventually need instructions for generating the CoreML inference code - can add those after the PR is approved
bool ane_embedding(struct clip_ctx * ctx, int n_threads, const struct clip_image_f32_batch * imgs, float * vec);
bool ane_resampler(struct clip_ctx * ctx, int n_threads, const struct clip_image_f32_batch * imgs, const float * vit_embedding, float * vec);
No need to expose this in the public interface
// ANE support functions
void clip_set_ane_model_path(struct clip_ctx * ctx, const char * ane_model_path);
We should find a way to avoid this. Maybe we can do something similar to whisper.cpp:
@@ -82,6 +82,7 @@ struct mtmd_context_params {
     enum ggml_log_level verbosity;
     const char * image_marker; // deprecated, use media_marker instead
     const char * media_marker;
+    const char * ane_model_path; // path to ANE model for iOS
Instead of the term "ane", use the term "coreml", as it is more correct: CoreML models can run not only on the Apple Neural Engine, but also on the GPU and CPU.
static int flag = 0;
static const void* coremlEncoder = NULL;
static std::string cached_model_path = "";

// Check if we need to load a new model
if (flag == 0 || (ane_model_path && cached_model_path != ane_model_path)) {
    if (coremlEncoder) {
Avoid this global state. Figure out a way to move this to the clip context.
As stated in #14983, I have integrated Apple NPU (ANE) acceleration into llama.cpp.
Using MiniCPM-V 4.0 as an example, I will introduce a simple way to use ANE and hope we can discuss a better approach.
Download the ANE model from Hugging Face or ModelScope. If you downloaded the zip file, please unzip it first.
Used like --mmproj, the new "--ane" flag takes the path to the downloaded ane_minicpmv4_vit_f16.mlmodelc bundle.
I tested ANE acceleration on several devices. The benchmark results are as follows:
A point worth noting: the first time the ANE is used there is a one-off model-loading delay, so the first run is slightly slower. After that, as long as the ANE model is not updated, it stays loaded and ready in the system.