Flutter binding of llama.cpp, which uses platform channels.
llama.cpp: Inference of the LLaMA model in pure C/C++
```sh
flutter pub add fcllama
```
Please run `pod install` or `pod update` in your iOS project.
For Android, you need CMake 3.31.0, Android SDK 35, and NDK 28.0.12674087 installed. No additional steps are required.
This is the fastest and recommended way to add HLlama (the HarmonyOS binding) to your project:
```sh
ohpm install hllama
```
Or, you can add it to your project manually.
- Add the following lines to `oh-package.json5` in your app module:
"dependencies": {
"hllama": "^0.0.2",
}
- Then run `ohpm install`.
- Initializing Llama
```dart
import 'package:fcllama/fllama.dart';

FCllama.instance()
    ?.initContext("model path", emitLoadProgress: true)
    .then((context) {
  modelContextId = context?["contextId"].toString() ?? "";
  if (modelContextId.isNotEmpty) {
    // A non-empty modelContextId means initialization succeeded;
    // keep it for the calls below.
  }
});
```
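If you prefer `async`/`await` over `then` chains, the same call can be wrapped in a small helper. This is a minimal sketch using only the API shown above; `loadModel` and `modelPath` are illustrative names, not part of the plugin:

```dart
import 'package:fcllama/fllama.dart';

/// Returns the context id on success, or null if initialization failed.
Future<String?> loadModel(String modelPath) async {
  final context = await FCllama.instance()
      ?.initContext(modelPath, emitLoadProgress: true);
  final contextId = context?["contextId"]?.toString() ?? "";
  // An empty id means the native context could not be created.
  return contextId.isNotEmpty ? contextId : null;
}
```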
- Bench model on device
```dart
import 'package:fcllama/fllama.dart';

FCllama.instance()
    ?.bench(double.parse(modelContextId), pp: 8, tg: 4, pl: 2, nr: 1)
    .then((res) {
  Get.log("[FCllama] Bench Res $res");
});
```
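The parameter names follow llama.cpp's bench conventions; as an assumption carried over from that upstream tool (this README does not define them), `pp` is the prompt-processing token count, `tg` the number of tokens to generate, `pl` the number of parallel sequences, and `nr` the number of repetitions.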
- Tokenize and Detokenize
```dart
import 'package:fcllama/fllama.dart';

FCllama.instance()
    ?.tokenize(double.parse(modelContextId), text: "What can you do?")
    .then((tokenizeRes) {
  Get.log("[FCllama] Tokenize Res $tokenizeRes");
  FCllama.instance()
      ?.detokenize(double.parse(modelContextId),
          tokens: tokenizeRes?['tokens'])
      .then((detokenizeRes) {
    Get.log("[FCllama] Detokenize Res $detokenizeRes");
  });
});
```
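A quick way to sanity-check the tokenizer is to round-trip a string. This sketch uses only the two calls above; `roundTrip` is an illustrative name:

```dart
import 'package:fcllama/fllama.dart';

/// Tokenizes [text], then detokenizes the result; for a lossless
/// tokenizer the output should match the input.
Future<void> roundTrip(String modelContextId, String text) async {
  final tokenized = await FCllama.instance()
      ?.tokenize(double.parse(modelContextId), text: text);
  final detokenized = await FCllama.instance()?.detokenize(
      double.parse(modelContextId),
      tokens: tokenized?['tokens']);
  print('"$text" -> ${tokenized?['tokens']} -> $detokenized');
}
```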
- Monitoring the token stream
```dart
import 'package:fcllama/fllama.dart';

FCllama.instance()?.onTokenStream?.listen((data) {
  if (data['function'] == "loadProgress") {
    Get.log("[FCllama] loadProgress=${data['result']}");
  } else if (data['function'] == "completion") {
    Get.log("[FCllama] completion=${data['result']}");
    final tempRes = data["result"]["token"];
    // tempRes holds the token text just emitted by the model.
  }
});
```
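To assemble a full response from the stream, you can append each `completion` event's token to a buffer. A sketch based on the event shape shown above; `startListening` and `stopListening` are illustrative names:

```dart
import 'dart:async';

import 'package:fcllama/fllama.dart';

final buffer = StringBuffer();
StreamSubscription? sub;

void startListening() {
  sub = FCllama.instance()?.onTokenStream?.listen((data) {
    if (data['function'] == "completion") {
      // Each event carries one generated token; append it to the answer.
      buffer.write(data["result"]["token"]);
    }
  });
}

void stopListening() {
  sub?.cancel(); // Stop receiving events when no longer needed.
}
```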
- Stop a completion or release contexts
```dart
import 'package:fcllama/fllama.dart';

// Stop an in-flight completion for one context.
FCllama.instance()?.stopCompletion(contextId: double.parse(modelContextId));
// Release a single context.
FCllama.instance()?.releaseContext(double.parse(modelContextId));
// Release all contexts.
FCllama.instance()?.releaseAllContexts();
```
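In a Flutter app it is natural to tie the context's lifetime to a widget. A minimal sketch (the widget names are illustrative) that releases the native context in `dispose()`:

```dart
import 'package:fcllama/fllama.dart';
import 'package:flutter/widgets.dart';

class ChatScreen extends StatefulWidget {
  const ChatScreen({super.key});

  @override
  State<ChatScreen> createState() => _ChatScreenState();
}

class _ChatScreenState extends State<ChatScreen> {
  String modelContextId = ""; // set after initContext succeeds

  @override
  void dispose() {
    // Free the native llama.cpp context together with the widget.
    if (modelContextId.isNotEmpty) {
      FCllama.instance()?.releaseContext(double.parse(modelContextId));
    }
    super.dispose();
  }

  @override
  Widget build(BuildContext context) => const SizedBox.shrink();
}
```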
You can see this file for a complete example.
System | Min SDK | Arch | Other |
---|---|---|---|
Android | 23 | arm64-v8a, x86_64, armeabi-v7a | Supports additional optimizations for certain CPUs |
iOS | 14 | arm64 | Supports Metal |
OpenHarmonyOS/HarmonyOS | 12 | arm64-v8a, x86_64 | No additional optimizations for certain CPUs |
You can search HuggingFace for available models (keyword: GGUF). To get a GGUF model or quantize one manually, see the Prepare and Quantize section in llama.cpp.
iOS:
- Enabling the Extended Virtual Addressing capability is recommended for your iOS project.
- Metal:
  - Testing shows that some devices cannot use Metal (`params.n_gpu_layers > 0`) because llama.cpp uses SIMD-scoped operations; you can check whether your device is supported in the Metal feature set tables. An Apple7-family GPU is the minimum requirement.
  - Metal is also not supported in the iOS simulator due to this limitation: more than 14 constant buffers are used.
Android:
- Currently only the arm64-v8a / x86_64 / armeabi-v7a platforms are supported, which means you can't initialize a context on other platforms. The 64-bit platforms are recommended because they can allocate more memory for the model.
- No GPU backend has been integrated yet.
MIT