[FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗 #15313
Replies: 32 comments 38 replies
-
From this, IMO the only thing missing is a Linux+CUDA bundle to make it download-and-use. If we want better packaging on Linux, we can also work on a snap package or a bash installer for the pre-built packages.
-
It’s high time for Hugging Face to copy Ollama’s packaging and GTM strategy, but this time give credit to llama.cpp. Ideally, llama.cpp should remain the core component.
-
Is the barrier the installation process, or the need to use a complex command line to launch llama.cpp?
-
For me the biggest thing is I'd love to see more emphasis placed on My ideal would be for the Maybe include systray integration and a simple UI for selecting and downloading models too. At that point
-
It would be cool if llama-server had an auto-configuration option for the machine/model, like ollama does.
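Such auto-configuration could, for instance, estimate how many layers to offload from free VRAM. A rough sketch (the uniform per-layer size estimate and the 0.9 safety margin are made-up heuristics for illustration, not anything llama-server or Ollama actually does):

```python
# Hypothetical sketch: pick -ngl (GPU layers) from free VRAM.
# A real implementation would read per-tensor sizes from the GGUF itself
# and account for the KV cache; this just splits the model size evenly.

def pick_gpu_layers(model_bytes: int, n_layers: int, free_vram_bytes: int,
                    margin: float = 0.9) -> int:
    """Return how many of n_layers fit into free VRAM, capped at n_layers."""
    if n_layers <= 0 or model_bytes <= 0:
        return 0
    per_layer = model_bytes / n_layers      # rough uniform estimate
    budget = free_vram_bytes * margin       # keep headroom for KV cache etc.
    return min(n_layers, int(budget // per_layer))

# Example: 8 GiB model, 32 layers, 6 GiB of free VRAM
ngl = pick_gpu_layers(8 << 30, 32, 6 << 30)
print(f"llama-server -m model.gguf -ngl {ngl}")
```

The interesting part is not the arithmetic but where the inputs come from; free VRAM would have to be probed per backend, which is exactly what Ollama hides from the user.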
-
For Windows, maybe Chocolatey and the Microsoft Store would be a good idea? 🤔
-
I created an RPM spec to manage installation, though I think Flatpaks might be more user-friendly and distribution-agnostic.
-
The released Windows builds are available via Scoop. Updates happen automatically; old installed versions are kept, and the current one is symlinked into a "current" folder that provides the executables on the PATH.
-
Is it feasible to have a single release per OS that includes all the backends?
-
For Linux I just install the Vulkan binaries and run the server from there. Maybe we can have an install script like Ollama's that detects the system and launches the server, which can be controlled from an app as well as the CLI? The user would then get basic command-line utilities like run, start, stop, load, list, etc.
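The detection step of such an install script might look like this sketch (the vendor tools probed and the backend names it maps to are assumptions, not an official scheme):

```shell
#!/bin/sh
# Hypothetical backend detection for an install script.
# Probes for the usual vendor utilities and falls back to CPU.
detect_backend() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo cuda
  elif command -v rocminfo >/dev/null 2>&1; then
    echo rocm
  elif command -v vulkaninfo >/dev/null 2>&1; then
    echo vulkan
  else
    echo cpu
  fi
}

BACKEND=$(detect_backend)
echo "selected backend: $BACKEND"
```

A real script would also need to check driver versions (not just tool presence) before picking a download, which is where most of the complexity lives.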
-
On Mac, the easiest way (also arguably the safest way) from a user's perspective is to find it in the App Store and install it from there. Apps from the App Store run in a sandbox, so from a user's point of view, installing or uninstalling is simple and clean. Creating a build and passing the App Store review might take some effort (due to the sandbox constraints), but it should be a one-time thing.
-
It's my understanding that none of the automated installs support GPU acceleration. I might be wrong, but it's definitely the case for Windows, which makes installing via winget useless.
-
To me, the biggest advantage Ollama currently has is that the optimal settings for a model are bundled. The GGUF spec would allow for this too, since it's versatile enough to carry this as a metadata field inside the model. It would allow people to load the settings from a GGUF, and frontends could extract and adapt them as they see fit. I think that part is going to be more valuable than obtaining the binary, since downloading the binary from GitHub is not that hard.
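For reference, GGUF already stores arbitrary metadata key/value pairs after a small fixed header, which is what would make such a settings field possible. A minimal sketch of parsing that header in pure Python (the layout follows the GGUF spec; any settings key name would be hypothetical, since none is standardized today):

```python
import struct

def parse_gguf_header(buf: bytes):
    """Parse the fixed GGUF header: magic, version, tensor count, KV count.

    The metadata key/value pairs (where per-model defaults could live)
    follow immediately after these fields.
    """
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", buf, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return version, n_tensors, n_kv

# Build a synthetic header in memory to demonstrate the layout:
# version 3, no tensors, two metadata KV pairs.
header = struct.pack("<4sIQQ", b"GGUF", 3, 0, 2)
print(parse_gguf_header(header))
```

In practice one would use the existing `gguf` Python package rather than hand-parsing, but the point is that the format already has room for a "recommended settings" convention; what is missing is agreement on the key names.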
-
My personal wishlist
-
PyTorch 2.8 was announced today, along with the possibility of using a single command (uv pip install torch) to install PyTorch with optimal compatibility for your hardware and OS (Windows and Linux): https://astral.sh/blog/wheel-variants It is the result of a joint open-source project called WheelNext. In particular, for NVIDIA hardware it uses this to detect the best compatibility: https://github.com/wheelnext/nvidia-variant-provider This might be useful to adopt, or to partially reuse just the user-hardware detection.
-
I believe a package for llama-vulkan would be the killer app.
-
brew install is great, but we also need a single curl script like https://bun.sh/ and https://docs.astral.sh/uv/getting-started/installation/ have. Also, per https://x.com/zereraz/status/1956081559695745528, the DX here is not the best.
And even though the logs are cool, verbose mode could be separate. A better TUI inspired by https://github.com/vadimdemedes/ink or https://github.com/sst/opentui would help.
-
Docker works great and is simpler than this.
-
TL;DR: Within the next few weeks, you'll be able to just
Longer version: With regards to installation usability, within the next few weeks, llama.cpp will be made available in Debian via the official There are already packages in the
In the releases we prepared, we ship the CPU, BLAS, HIP, CUDA and Vulkan backends. More backends will be added on request. However, this is also a matter of test infrastructure, as I do test all of the backends on actual hardware before a release. I'm currently setting up a CI to automate this.
The CPU backend is built with I would consider it a bug of the Debian package if its performance is not on par with upstream, hence the CI I'm setting up.
This is just some initial info; I'm going to submit more information once I've got the CI finished. Most importantly, I want upstream here to see this as a benefit, not a burden, so I need to work out some user documentation and also see if upstream can benefit from our CI in some way.
Edit: Forgot to say, upstream has been very accommodating in accepting changes that make shipping universally usable packages easier for us downstreams. Also, @mbaudier's input on and testing of the Debian packages mentioned above was really helpful in finalizing this.
-
I was about to open a similar issue and ask when CUDA Linux builds, like Another thing is to address the
-
Using the Docker image, I have tested with different UIs and also Ollama, and as mentioned by other users,
-
Create a separate installer/launcher per platform that checks CPU/GPU/iGPU against a database and downloads the right executable. The same executable could be used to update. Add advanced settings for configuring server options and model parameters. Have a curated list of quantized models to download and launch for that hardware. Have a "custom" option that prompts the user to save a well-commented batch file with several examples.
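The database lookup such a launcher would do is essentially a table keyed by (OS, accelerator). A sketch with made-up asset names (these are placeholders, not actual llama.cpp release files):

```python
# Hypothetical hardware-to-download table for an installer/launcher.
# All file names below are invented for illustration.
ASSETS = {
    ("windows", "cuda"):   "llama-bin-win-cuda-x64.zip",
    ("windows", "vulkan"): "llama-bin-win-vulkan-x64.zip",
    ("windows", "cpu"):    "llama-bin-win-cpu-x64.zip",
    ("linux",   "cuda"):   "llama-bin-linux-cuda-x64.tar.gz",
    ("linux",   "vulkan"): "llama-bin-linux-vulkan-x64.tar.gz",
    ("linux",   "cpu"):    "llama-bin-linux-cpu-x64.tar.gz",
    ("macos",   "metal"):  "llama-bin-macos-arm64.tar.gz",
    ("macos",   "cpu"):    "llama-bin-macos-x64.tar.gz",
}

def pick_asset(os_name: str, accel: str) -> str:
    """Pick the right download, falling back to CPU for unknown accelerators."""
    return ASSETS.get((os_name, accel)) or ASSETS[(os_name, "cpu")]
```

The same table could carry a minimum driver version per entry, so the launcher can refuse a CUDA build on a machine whose driver is too old instead of failing at runtime.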
-
You might want to look into my Mmojo-Server project: one binary built with Cosmopolitan that runs on ARM/x86 and macOS/Linux/Windows. I have a Mmojo-Server prebuilt on HuggingFace. I use generic CPU inference, which isn't awful for small models: 12B on a good i7/i9, 4B on a Raspberry Pi. While GPU support would be great, it mostly confuses users who are a little challenged by the terminal. I even sell a Pi-based appliance, if you want something plug-and-play. https://Mmojo.net
-
llama.cpp on NixOS is very easy in my experience. Simply adding the following to your config gets you the package:

```nix
environment.systemPackages = [
  pkgs.llama-cpp
];
```

Depending on your config it may already have CUDA enabled. If not, you can do:

```nix
environment.systemPackages = [
  (pkgs.llama-cpp.override { cudaSupport = true; })
];
```

For ROCm:

```nix
environment.systemPackages = [
  (pkgs.llama-cpp.override { rocmSupport = true; })
];
```

For Vulkan:

```nix
environment.systemPackages = [
  (pkgs.llama-cpp.override { vulkanSupport = true; })
];
```

Etc. You can also try combining multiple, but your mileage may vary.
-
Note that CUDA builds exist for Windows and Linux, and macOS builds are optimized for Accelerate and Metal on This distribution can easily be updated by anyone to newer versions and refined for specific targets if needed.
-
I have a good idea for a self-configuring front end; I've just been a little too consumed with my new code-companion project to start on it. For Windows users, my Llama.Cpp-Toolbox is probably a good place to get up and running fast. I'm going to be updating it soon with more functionality for power users. The new idea would basically map out the functions after each build and generate a GUI from them, so it should never go out of date. I haven't thought hard about how it would look, but pulling the options and the executables seems easy enough. The scripts and examples could be annoying, but if I can determine how to pinpoint them and their instructions or options, it could work. I'll get around to it when my new Ai-Code-Companion can scan it all and I'll see what's possible. eye rolls
-
I had to install extra packages to get compiling with curl support to work when building the CUDA version on Ubuntu 25.04 today; it was pretty obscure:

```shell
sudo apt-get install curl libssl-dev libcurl4-openssl-dev
```
-
@slaren if you build with GGML_BACKEND_DL and GGML_CPU_ALL_VARIANTS on, and have all backends installed, is there a way at runtime to switch between, say, Vulkan and ROCm, or Vulkan and CUDA?
-
I just went through this, so it's fresh in my mind. I had not used Linux since about 2012, when I changed our in-house servers to Windows Server. For running local LLMs, I have used Text-Generation-WebUI, Ollama, and LM Studio. With the release of GPT-OSS-120b, most of the wrappers did not keep up with the new changes as quickly as I wanted. I run a 5090, and I also wanted to move to the newer torch release. I was doing some small training stuff that needed Triton anyway, so, Linux.
The OP lists using winget on Windows, but the precompiled llama.cpp is Vulkan. The binary distributions stop at 12.4, so if you need the newer copies of torch >12.4, it's source code. Having been away from Linux a long time, setting up WSL and creating everything for a build was a bit of a struggle. I actually had GPT-OSS write me a step-by-step walkthrough with copy-and-paste commands so I wouldn't screw it up. Once everything was running, it was great, and a significant speedup. I've moved my dev work to Linux now.
Then I wrote a script (really, me and GPT-OSS wrote a script) to fire up llama.cpp so I could just pick models from an old-fashioned script menu. It's crude, but it's never out of date. I posted the script-creation script that GPT wrote below because I thought the idea of a self-creating script was nice.
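A menu like the one described could be sketched as below; the model directory, port, and server flags are assumptions for illustration, not the poster's actual script:

```shell
#!/bin/sh
# Sketch of a "pick a model, launch llama-server" menu.
# MODEL_DIR and the flags in build_cmd are assumed defaults.
MODEL_DIR="${MODEL_DIR:-$HOME/models}"

build_cmd() {
  # Build the launch command for a given .gguf file.
  echo "llama-server -m $1 --port 8080"
}

# List whatever models are present; never goes out of date because
# it just globs the directory.
i=1
for f in "$MODEL_DIR"/*.gguf; do
  [ -e "$f" ] || continue
  echo "$i) $(basename "$f")"
  i=$((i + 1))
done
```

Because the menu is generated from the directory contents at run time, adding a model is just dropping a .gguf file into the folder.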
-
llama.cpp as a project has made LLMs accessible to countless developers and consumers including me. The project has also consistently become faster over time as has the coverage beyond LLMs to VLMs, AudioLMs and more.
One piece of feedback we keep getting from the community is how difficult it is to use llama.cpp directly. Oftentimes users end up using Ollama or GUIs like LM Studio or Jan (there are many more that I'm missing). However, it'd be great to offer end consumers a friendlier, easier path to using llama.cpp too.
Currently if someone was to use llama.cpp directly:
brew install llama.cpp works
This adds a barrier for non-technically-inclined people, especially since in all the above methods users would have to reinstall llama.cpp to get upgrades (and llama.cpp makes releases per commit; not a bad thing, but it becomes an issue since you need to upgrade more frequently).
Opening this issue to discuss what could be done to package llama.cpp better and allow users to maybe download an executable and be on their way.
More so, are there people in the community interested in taking this up?