[FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗 #15313
Replies: 32 comments 38 replies
-
From this, IMO the only thing missing is a Linux+CUDA bundle to make it download-and-use. If we want better packaging on Linux, we can also work on a snap package or a bash installer for the pre-built packages.
-
It’s high time for Hugging Face to copy Ollama’s packaging and GTM strategy, but this time give credit to llama.cpp. Ideally, llama.cpp should remain the core component.
-
Is the barrier the installation process, or the need to use a complex command line to launch llama.cpp?
-
For me the biggest thing is I'd love to see more emphasis placed on My ideal would be for the Maybe include systray integration and a simple UI for selecting and downloading models too. At that point
-
It would be cool if llama-server had an auto-configuration option for the machine/model, like ollama does.
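Such auto-configuration could, for instance, estimate how many layers to offload from free VRAM. A rough sketch (the uniform per-layer size estimate and the 0.9 safety margin are made-up heuristics for illustration, not anything llama-server or Ollama actually does):

```python
# Hypothetical sketch: pick -ngl (GPU layers) from free VRAM.
# A real implementation would read per-tensor sizes from the GGUF itself
# and account for the KV cache; this just splits the model size evenly.

def pick_gpu_layers(model_bytes: int, n_layers: int, free_vram_bytes: int,
                    margin: float = 0.9) -> int:
    """Return how many of n_layers fit into free VRAM, capped at n_layers."""
    if n_layers <= 0 or model_bytes <= 0:
        return 0
    per_layer = model_bytes / n_layers      # rough uniform estimate
    budget = free_vram_bytes * margin       # keep headroom for KV cache etc.
    return min(n_layers, int(budget // per_layer))

# Example: 8 GiB model, 32 layers, 6 GiB of free VRAM
ngl = pick_gpu_layers(8 << 30, 32, 6 << 30)
print(f"llama-server -m model.gguf -ngl {ngl}")
```

The interesting part is not the arithmetic but where the inputs come from; free VRAM would have to be probed per backend, which is exactly what Ollama hides from the user.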
-
For Windows, maybe Chocolatey and the Microsoft Store would be a good idea? 🤔
-
I created an RPM spec to manage installation, though I think Flatpaks might be more user-friendly and distribution-agnostic.
-
The released Windows builds are available via Scoop. Updates happen automatically; old installed versions are kept, and the current one is symlinked into a "current" folder that provides the executables on the PATH.
-
Is it feasible to have a single release per OS that includes all the backends?
-
For Linux I just install the Vulkan binaries and run the server from there. Maybe we can have an install script like Ollama's that detects the system and launches the server, which can be controlled from an app as well as the CLI? The user would then get basic command-line utilities like run, start, stop, load, list, etc.
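The detection step of such an install script might look like this sketch (the vendor tools probed and the backend names it maps to are assumptions, not an official scheme):

```shell
#!/bin/sh
# Hypothetical backend detection for an install script.
# Probes for the usual vendor utilities and falls back to CPU.
detect_backend() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo cuda
  elif command -v rocminfo >/dev/null 2>&1; then
    echo rocm
  elif command -v vulkaninfo >/dev/null 2>&1; then
    echo vulkan
  else
    echo cpu
  fi
}

BACKEND=$(detect_backend)
echo "selected backend: $BACKEND"
```

A real script would also need to check driver versions (not just tool presence) before picking a download, which is where most of the complexity lives.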
-
On Mac, the easiest way (also arguably the safest way) from a user's perspective is to find it in the App Store and install it from there. Apps from the App Store run in a sandbox, so from a user's point of view, installing or uninstalling is simple and clean. Creating a build and passing the App Store review might take some effort (due to the sandbox constraints), but it should be a one-time thing.
-
It's my understanding that none of the automated installs support GPU acceleration. I might be wrong, but it's definitely the case for Windows, which makes installing via winget useless.
-
To me, the biggest advantage Ollama currently has is that the optimal settings for a model are bundled. The GGUF spec would allow for this too, since it's versatile enough to carry this as a metadata field inside the model. It would allow people to load the settings from a GGUF, and frontends could extract and adapt them as they see fit. I think that part is going to be more valuable than obtaining the binary, since downloading the binary from GitHub is not that hard.
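For reference, GGUF already stores arbitrary metadata key/value pairs after a small fixed header, which is what would make such a settings field possible. A minimal sketch of parsing that header in pure Python (the layout follows the GGUF spec; any settings key name would be hypothetical, since none is standardized today):

```python
import struct

def parse_gguf_header(buf: bytes):
    """Parse the fixed GGUF header: magic, version, tensor count, KV count.

    The metadata key/value pairs (where per-model defaults could live)
    follow immediately after these fields.
    """
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", buf, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return version, n_tensors, n_kv

# Build a synthetic header in memory to demonstrate the layout:
# version 3, no tensors, two metadata KV pairs.
header = struct.pack("<4sIQQ", b"GGUF", 3, 0, 2)
print(parse_gguf_header(header))
```

In practice one would use the existing `gguf` Python package rather than hand-parsing, but the point is that the format already has room for a "recommended settings" convention; what is missing is agreement on the key names.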
-
My personal wishlist
-
PyTorch 2.8 was announced today, along with the possibility of using a single command (uv pip install torch) to install PyTorch with optimal compatibility for your hardware and OS (Windows and Linux): https://astral.sh/blog/wheel-variants It is the result of a joint open-source project called WheelNext. In particular, for NVIDIA hardware it uses this to detect the best compatibility: https://github.com/wheelnext/nvidia-variant-provider This might be useful to adopt, or to partially reuse just the user-hardware detection.
-
I believe a package for llama-vulkan would be the killer app.
-
brew install is great, but we also need a single curl script like https://bun.sh/ and https://docs.astral.sh/uv/getting-started/installation/ have. Also, per https://x.com/zereraz/status/1956081559695745528, the DX here is not the best.
And even though the logs are cool, verbose mode could be separate. A better TUI inspired by https://github.com/vadimdemedes/ink or https://github.com/sst/opentui would help.
-
Docker works great and is simpler than this.
-
TL;DR: Within the next few weeks, you'll be able to just
Longer version: With regards to installation usability, within the next few weeks, llama.cpp will be made available in Debian via the official There are already packages in the
In the releases we prepared, we ship the CPU, BLAS, HIP, CUDA and Vulkan backends. More backends will be added on request. However, this is also a matter of test infrastructure, as I do test all of the backends on actual hardware before a release. I'm currently setting up a CI to automate this.
The CPU backend is built with I would consider it a bug of the Debian package if its performance is not on par with upstream, hence the CI I'm setting up.
This is just some initial info; I'm going to submit more information once I've got the CI finished. Most importantly, I want upstream here to see this as a benefit, not a burden, so I need to work out some user documentation and also see if upstream can benefit from our CI in some way.
Edit: Forgot to say, upstream has been very accommodating in accepting changes that make shipping universally usable packages easier for us downstreams. Also, @mbaudier's input on and testing of the Debian packages mentioned above was really helpful in finalizing this.
-
I was about to open a similar issue and ask when CUDA Linux builds, like Another thing is to address the
-
Using the Docker image, I have tested with different UIs and also Ollama, and as mentioned by other users,
-
Create a separate installer/launcher per platform that checks CPU/GPU/iGPU against a database and downloads the right executable. The same executable could be used to update. Add advanced settings for configuring server options and model parameters. Have a curated list of quantized models to download and launch for that hardware. Have a "custom" option that prompts the user to save a well-commented batch file with several examples.
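The database lookup such a launcher would do is essentially a table keyed by (OS, accelerator). A sketch with made-up asset names (these are placeholders, not actual llama.cpp release files):

```python
# Hypothetical hardware-to-download table for an installer/launcher.
# All file names below are invented for illustration.
ASSETS = {
    ("windows", "cuda"):   "llama-bin-win-cuda-x64.zip",
    ("windows", "vulkan"): "llama-bin-win-vulkan-x64.zip",
    ("windows", "cpu"):    "llama-bin-win-cpu-x64.zip",
    ("linux",   "cuda"):   "llama-bin-linux-cuda-x64.tar.gz",
    ("linux",   "vulkan"): "llama-bin-linux-vulkan-x64.tar.gz",
    ("linux",   "cpu"):    "llama-bin-linux-cpu-x64.tar.gz",
    ("macos",   "metal"):  "llama-bin-macos-arm64.tar.gz",
    ("macos",   "cpu"):    "llama-bin-macos-x64.tar.gz",
}

def pick_asset(os_name: str, accel: str) -> str:
    """Pick the right download, falling back to CPU for unknown accelerators."""
    return ASSETS.get((os_name, accel)) or ASSETS[(os_name, "cpu")]
```

The same table could carry a minimum driver version per entry, so the launcher can refuse a CUDA build on a machine whose driver is too old instead of failing at runtime.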
-
You might want to look into my Mmojo-Server project: one binary built with Cosmopolitan that runs on ARM/x86 and macOS/Linux/Windows. I have a Mmojo-Server prebuilt on HuggingFace. I use generic CPU inference, which isn't awful for small models: 12B on a good i7/i9, 4B on a Raspberry Pi. While GPU support would be great, it mostly confuses users who are a little challenged by the terminal. I even sell a Pi-based appliance, if you want something plug-and-play. https://Mmojo.net
-
llama.cpp on NixOS is very easy in my experience. Simply adding the following to your config gets you the package:

```nix
environment.systemPackages = [
  pkgs.llama-cpp
];
```

Depending on your config it may already have CUDA enabled. If not, you can do:

```nix
environment.systemPackages = [
  (pkgs.llama-cpp.override { cudaSupport = true; })
];
```

For ROCm:

```nix
environment.systemPackages = [
  (pkgs.llama-cpp.override { rocmSupport = true; })
];
```

For Vulkan:

```nix
environment.systemPackages = [
  (pkgs.llama-cpp.override { vulkanSupport = true; })
];
```

Etc. You can also try combining multiple, but your mileage may vary.
-
Note that CUDA builds exist for Windows and Linux, and macOS builds are optimized for Accelerate and Metal on This distribution can easily be updated by anyone to newer versions and refined for specific targets if needed.
-
I have a good idea for a self-configuring front end; I've just been a little too consumed with my new code-companion project to start on it. For Windows users, my Llama.Cpp-Toolbox is probably a good place to get up and running fast. I'm going to be updating it soon with more functionality for power users. The new idea would basically map out the functions after each build and generate a GUI from them, so it should never go out of date. I haven't thought hard about how it would look, but pulling the options and the executables seems easy enough. The scripts and examples could be annoying, but if I can determine how to pinpoint them and their instructions or options, it could work. I'll get around to it when my new Ai-Code-Companion can scan it all and I'll see what's possible. eye rolls
-
I had to install extra packages to get compiling with curl support to work when building the CUDA version on Ubuntu 25.04 today; it was pretty obscure:

```shell
sudo apt-get install curl libssl-dev libcurl4-openssl-dev
```
-
@slaren if you build with GGML_BACKEND_DL and GGML_CPU_ALL_VARIANTS on, and have all backends installed, is there a way at runtime to switch between, say, Vulkan and ROCm, or Vulkan and CUDA?
-
I just went through this, so it's fresh in my mind. I had not used Linux since about 2012, when I changed our in-house servers to Windows Server. For running local LLMs, I have used Text-Generation-WebUI, Ollama, and LM Studio. With the release of GPT-OSS-120b, most of the wrappers did not keep up with the new changes as quickly as I wanted. I run a 5090, and I also wanted to move to the newer torch release. I was doing some small training stuff that needed Triton anyway, so, Linux.
The OP lists using winget on Windows, but the precompiled llama.cpp is Vulkan. The binary distributions stop at 12.4, so if you need the newer copies of torch >12.4, it's source code. Having been away from Linux a long time, setting up WSL and creating everything for a build was a bit of a struggle. I actually had GPT-OSS write me a step-by-step walkthrough with copy-and-paste commands so I wouldn't screw it up. Once everything was running, it was great, and a significant speedup. I've moved my dev work to Linux now.
Then I wrote a script (really, me and GPT-OSS wrote a script) to fire up llama.cpp so I could just pick models from an old-fashioned script menu. It's crude, but it's never out of date. I posted the script-creation script that GPT wrote below because I thought the idea of a self-creating script was nice.
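A menu like the one described could be sketched as below; the model directory, port, and server flags are assumptions for illustration, not the poster's actual script:

```shell
#!/bin/sh
# Sketch of a "pick a model, launch llama-server" menu.
# MODEL_DIR and the flags in build_cmd are assumed defaults.
MODEL_DIR="${MODEL_DIR:-$HOME/models}"

build_cmd() {
  # Build the launch command for a given .gguf file.
  echo "llama-server -m $1 --port 8080"
}

# List whatever models are present; never goes out of date because
# it just globs the directory.
i=1
for f in "$MODEL_DIR"/*.gguf; do
  [ -e "$f" ] || continue
  echo "$i) $(basename "$f")"
  i=$((i + 1))
done
```

Because the menu is generated from the directory contents at run time, adding a model is just dropping a .gguf file into the folder.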
-
llama.cpp as a project has made LLMs accessible to countless developers and consumers including me. The project has also consistently become faster over time as has the coverage beyond LLMs to VLMs, AudioLMs and more.
One piece of feedback we keep getting from the community is how difficult it is to use llama.cpp directly. Oftentimes users end up using Ollama or GUIs like LM Studio or Jan (there are many more that I'm missing). However, it'd be great to offer end consumers a friendlier, easier path to using llama.cpp too.
Currently if someone was to use llama.cpp directly:
brew install llama.cpp works
This adds a barrier for non-technically-inclined people, especially since in all the above methods users would have to reinstall llama.cpp to get upgrades (and llama.cpp makes releases per commit; not a bad thing, but it becomes an issue since you need to upgrade more frequently).
Opening this issue to discuss what could be done to package llama.cpp better and allow users to maybe download an executable and be on their way.
More so, are there people in the community interested in taking this up?