Releases · dumpmemory/llama.cpp
b5916
b5913
llama : add high-throughput mode (#14363)
* kv-cache : prepare K/V buffers for separation
* batched-bench : fix oob write
* llama : add "virtual sequences"
* llama : use "stream" vs "virtual sequence"
* graph : fix stream splitting when KV cache is not used
* kv-cache : add multi-stream save/load support
* llama : add "--attn-streams" flag
* kv-cache : fix handling when find_slot fails
* kv-cache : restore find_slot impl
* kv-cache : add comments
* kv-cache : add bounds checks for sequence id
* cont : add n_seq_max to batch allocr
* kv-cache : perform stream copies lazily after llama_synchronize
* kv-cache : avoid throwing exceptions across the C boundary
* CUDA: 4D FlashAttention support (#14628)
* CUDA: fix WMMA FA kernel
* llama : rename attn_streams -> kv_unified
* common : rename kv_split -> kv_unified

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
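The net effect of this change is a context-level switch: with a non-unified KV cache, each sequence gets its own KV stream, which is what enables the high-throughput path. Below is a minimal C++ sketch of how a client might opt in through the public `llama.h` API; the `kv_unified` field name follows the rename at the end of this PR and should be treated as an assumption if your headers differ.

```cpp
// Minimal sketch: requesting per-sequence KV-cache streams
// (high-throughput mode) via llama.h. The kv_unified field is
// assumed from this PR's rename; verify against your header.
#include "llama.h"

int main() {
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file("model.gguf", mparams);
    if (!model) return 1;

    llama_context_params cparams = llama_context_default_params();
    cparams.n_seq_max  = 8;     // serve up to 8 parallel sequences
    cparams.kv_unified = false; // false: one KV stream per sequence
                                // true : single KV buffer shared by all sequences
    llama_context * ctx = llama_init_from_model(model, cparams);
    if (!ctx) { llama_model_free(model); return 1; }

    // ... decode batches that span multiple sequence ids ...

    llama_free(ctx);
    llama_model_free(model);
    return 0;
}
```

On the command line, the common flag likely ended up as `--kv-unified` after the `--attn-streams` rename; check `--help` output in your build for the exact spelling.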
b5907
scripts: synthetic prompt mode for server-bench.py (#14695)
b5903
gguf-py : dump bpw per layer and model in markdown mode (#14703)
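For context, bits-per-weight (bpw) is the tensor's byte size scaled to bits and divided by its element count; the markdown dump aggregates this per layer and for the whole model. A small illustrative computation follows (not the gguf-py code itself):

```cpp
// Illustrative only: how bits-per-weight is typically derived for a
// tensor: 8 * size_in_bytes / number_of_elements. The actual reporting
// lives in gguf-py's dump script, not in this snippet.
#include <cstdint>
#include <cstdio>

static double bits_per_weight(uint64_t n_bytes, uint64_t n_elements) {
    return 8.0 * (double) n_bytes / (double) n_elements;
}

int main() {
    // e.g. a Q4_K-quantized tensor: 144 bytes per 256-element block
    printf("bpw = %.2f\n", bits_per_weight(144, 256)); // -> 4.50
    return 0;
}
```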
b5900
vulkan: add RTE variants for glu/add/sub/mul/div (#14653)
b5898
cuda: fix build warnings in set-rows.cu (unused variable) (#14687)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
b5585
CUDA: fix FTZ in FA for Gemma 3 (#13991)
b5581
opencl: add `backend_synchronize` (#13939)
* This is not needed for normal use, where the result is read back via `tensor_get`, but it allows the perf mode of `test-backend-ops` to measure performance properly.
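The reason a synchronize hook matters for perf measurement: asynchronous graph computes return before the device finishes, so wall-clock timing must block on the backend. A hedged sketch against the generic `ggml-backend.h` API (not the OpenCL backend internals):

```cpp
// Sketch of timing an async graph compute. Without the backend's
// synchronize hook, the timer would stop before the device finishes.
// Assumes 'backend' and 'graph' are built elsewhere.
#include "ggml-backend.h"
#include <chrono>
#include <cstdio>

void time_graph(ggml_backend_t backend, ggml_cgraph * graph) {
    const auto t0 = std::chrono::high_resolution_clock::now();

    ggml_backend_graph_compute_async(backend, graph);
    ggml_backend_synchronize(backend); // wait for the device to finish

    const auto t1 = std::chrono::high_resolution_clock::now();
    const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("graph compute: %.3f ms\n", ms);
}
```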
b5579
server : disable speculative decoding for SWA models (#13970)
* server : use swa-full for draft context
* server : disable speculative decoding for SWA models
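A hypothetical illustration of the guard this entry describes; the struct and field names below are invented for clarity and are not llama-server's actual internals:

```cpp
// Hypothetical sketch: if the model uses sliding-window attention (SWA)
// and the full-window cache (--swa-full) is not enabled, speculative
// decoding is turned off, since draft verification needs the full KV
// window. Names are illustrative only.
struct server_params {
    bool model_uses_swa = false;
    bool swa_full       = false;
    bool speculative    = false;
};

void apply_swa_guard(server_params & p) {
    if (p.speculative && p.model_uses_swa && !p.swa_full) {
        p.speculative = false; // drafting is unsafe with a windowed KV cache
    }
}
```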
b5574
cmake : Handle mixed-case 'Power' strings in POWER CPU detection (#13…