Releases · ggml-org/llama.cpp

29 Aug 14:06

8101786

b6318 Latest

CUDA: fix bug in rms_norm fusion (#15660)

* CUDA: fix bug in rms_norm fusion

* Fix bug for OP_REPEAT

* Fix index for add

Assets 15

cudart-llama-bin-win-cuda-12.4-x64.zip
sha256:8c79a9b226de4b3cacfd1f83d24f962d0773be79f1e7b75c6af4ded7e32ae1d6
373 MB 2025-08-29T14:06:27Z

llama-b6318-bin-macos-arm64.zip
sha256:f272e59f547c6ffe8449978bf78dce387030c2a9ade46ca9304e06fdc025b7ea
11 MB 2025-08-29T14:06:37Z

llama-b6318-bin-macos-x64.zip
sha256:3cd87f6bd678d5263bc7bc7ec829197ad85812020065785530349403600275c2
28.3 MB 2025-08-29T14:06:38Z

llama-b6318-bin-ubuntu-vulkan-x64.zip
sha256:502a247c0e90274be0fda8f4863a44630c721c617d404ffee424e194d27454a1
25.1 MB 2025-08-29T14:06:39Z

llama-b6318-bin-ubuntu-x64.zip
sha256:a7b2d8f8a2b05d38a375c73d162c827e1f5229e691e3926e611ab481641f228f
13 MB 2025-08-29T14:06:40Z

llama-b6318-bin-win-cpu-arm64.zip
sha256:5e08d1c6d400296ea3e4c4b9c8863ac6aa61abb38bd1f45d332a7f24431eadf2
11.2 MB 2025-08-29T14:06:41Z

llama-b6318-bin-win-cpu-x64.zip
sha256:fd2823c414faa8af2545a1fb1ed8cb75d9bc70240ace2b512f1abf8df3ed1515
14.1 MB 2025-08-29T14:06:42Z

llama-b6318-bin-win-cuda-12.4-x64.zip
sha256:fb8f3b459829b5b644cb1255c9bb221da6a17d4d767f7bd04f5e5273793f11ff
138 MB 2025-08-29T14:06:43Z

llama-b6318-bin-win-hip-radeon-x64.zip
sha256:66357e00938a4cdb00985e155d62e01b3cfc68016dfc59e33fa1c410ca79fe17
287 MB 2025-08-29T14:06:48Z

llama-b6318-bin-win-opencl-adreno-arm64.zip
sha256:e477ab115ec24099b525e58b88ac05bba3e1b558e4b5a8647b8348031b3aba98
11.6 MB 2025-08-29T14:06:56Z

Source code (zip)
2025-08-29T13:30:06Z

Source code (tar.gz)
2025-08-29T13:30:06Z
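
Each binary asset above is listed with a sha256 digest. As a minimal sketch of how a downloaded archive could be checked against that listing, the helper below hashes a local file and compares it with a `sha256:<hex>` string. The function names are hypothetical and not part of llama.cpp or its release tooling.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_release_checksum(path, listed):
    """Compare a local file against a 'sha256:<hex>' string from the asset list."""
    algo, _, expected = listed.partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported checksum format: {listed!r}")
    return sha256_of_file(path) == expected.strip().lower()
```

For example, after downloading the Ubuntu x64 build, `matches_release_checksum("llama-b6318-bin-ubuntu-x64.zip", "sha256:a7b2d8f8a2b05d38a375c73d162c827e1f5229e691e3926e611ab481641f228f")` should return `True` if the file is intact.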

29 Aug 13:21

github-actions

b6317

60e5eee

chat : Seed OSS thinking + tool call support (#15552)

* Reasoning and tool-calling support for Seed OSS

* Fix grammar and partial parsing

* Whitespace

* New chat template

* Update common/chat.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update common/chat.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Remove unused 'purge_healing_marker' helper

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

Assets 15

29 Aug 03:59

github-actions

b6316

009b709

CUDA: fuse adds, fuse add with rms norm (#15631)

* CUDA: fused add with rms_norm_mul

* Non-broadcast fuse works

* Add fused adds

* format

* Remove n_fuse from template params

* Address review comments

* Move template inside binbcast
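
To illustrate what fusing an add with `rms_norm_mul` means numerically, here is a plain-Python reference sketch. It is not the CUDA kernel from this change. It assumes the fused operation computes `rms_norm(a + b) * w` over a single row, and the epsilon value and all function names are illustrative assumptions.

```python
import math

EPS = 1e-6  # assumed epsilon; the real value comes from the model's parameters


def rms_norm(x, eps=EPS):
    """Scale a row by the reciprocal of its root mean square."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale for v in x]


def unfused(a, b, w, eps=EPS):
    """Three separate steps: add, then rms_norm, then elementwise mul."""
    summed = [x + y for x, y in zip(a, b)]
    normed = rms_norm(summed, eps)
    return [n * wi for n, wi in zip(normed, w)]


def fused_add_rms_norm_mul(a, b, w, eps=EPS):
    """Same result, accumulating the sum of squares while adding."""
    summed = []
    total = 0.0
    for x, y in zip(a, b):
        s = x + y
        summed.append(s)
        total += s * s
    scale = 1.0 / math.sqrt(total / len(summed) + eps)
    # Normalization and the weight multiply happen in one pass over the row.
    return [s * scale * wi for s, wi in zip(summed, w)]
```

The two functions should agree up to floating-point rounding. The point of fusion is to avoid writing the intermediate `a + b` and normalized tensors out between separate kernel launches.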

Assets 15

29 Aug 01:04

github-actions

b6315

e8d99dd

nvidia nemotron nano v2 (nemotronh) (#15507)

* feat: Add NEMOTRONH to python arch enum

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add NEMOTRONH to c++ arch enum

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add NEMOTRONH to llama-arch layer map

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: First pass at conversion for nemotronh

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add a verbose log for each tensor loaded

This is really helpful for diagnosing mismatches between the expected and
received tensors

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: First (broken) pass at nemotronh model architecture

It generates tokens, just not valid ones!

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Explicitly enable add_bos_token during conversion

The `tokenizer.json`/`tokenizer_config.json` in the model are a bit
contradictory. In the config, add_bos_token is set to False, but the
tokenizer model itself has a post_processor that adds the BOS token via
type: TemplateProcessing

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use relu2 (LLM_FFN_RELU_SQR) for activation in FFN layers

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Only allocate attention cache for attention layers (not non-recurrent)

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Move residual add to after every block

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use the correct norm tensor for the MLP blocks

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Nemotron-H: MLP gate cleanup (pass NULL for unused gate)

This model does not use a gate in MLP blocks; pass NULLs for gate tensors to make intent clear and avoid unused-pointer noise.

* SSM: respect ssm_dt_rank for dt_dim when provided

Use GGUF-provided time_step_rank (ssm_dt_rank) to set dt_dim when > 0; fallback to max(64, n_embd/16).

* fix: plamo2 - revert dt_dim to default (remove ssm_dt_rank usage)

* Rename nemotronh to nemotron_h for consistency

- Update architecture name from NEMOTRONH to NEMOTRON_H in constants.py
- Change architecture string from 'nemotronh' to 'nemotron_h' in all files
- Update enum LLM_ARCH_NEMOTRONH to LLM_ARCH_NEMOTRON_H
- Update class name llm_build_nemotronh to llm_build_nemotron_h
- Consistent naming with underscore convention (nemotron_h vs nemotronh)

* feat: Support conversion for older NemotronH models

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Maicon Domingues <dominguesm@outlook.com>
Co-authored-by: weatherman <fxdstudios@gmail.com>
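
The `dt_dim` selection rule described in the "SSM: respect ssm_dt_rank" item above can be sketched as follows. The function name is hypothetical, and integer division for `n_embd/16` is an assumption. Note that a later item in the same list reverts this behavior for plamo2.

```python
def resolve_dt_dim(ssm_dt_rank: int, n_embd: int) -> int:
    """Pick dt_dim from the GGUF-provided rank, or fall back to a default."""
    if ssm_dt_rank > 0:
        # A positive rank in the GGUF metadata takes precedence.
        return ssm_dt_rank
    # Fallback stated in the commit message: max(64, n_embd/16).
    return max(64, n_embd // 16)
```

With `n_embd = 4096` and no rank provided, the fallback gives 256. For small embeddings such as `n_embd = 512` it is clamped to 64.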

Assets 15

28 Aug 21:36

github-actions

b6314

a8bca68

fix: Compute the full sum in llama-eval-callback, not just the sum of…

Assets 15

28 Aug 19:28

github-actions

b6313

c97dc09

CUDA: add conv2d (#15635)

* CUDA: add conv2d

* CUDA: conv2d - correct formatting and added const

Assets 15

28 Aug 17:12

github-actions

b6312

6c442f4

ggml-cpu: fix invalid hsum build in debug s390x (#15634)

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

Assets 15

28 Aug 17:11

github-actions

b6311

7380414

ggml : fix SSM_SCAN for n_groups > 1 (#15625)

Assets 15

28 Aug 16:34

github-actions

b6310

c8d0d14

kv-cache : fix find_slot to not search for continuous slot (#15638)

ggml-ci

Assets 15

28 Aug 16:32

github-actions

b6309

84ab83c

model : jina-embeddings-v3 support (#13693)

* initial jina-embeddings-v3 support

* initial jina-embeddings-v3 support

* initial jina-embeddings-v3 support

* fix vocab parsing with only tokenizer.json

* set mask token lstrip attribute

* additional unk_token_id fallback just in case [no ci]

* revert vocab_size() change [no ci]

* merge tensor loading into general bert

* rope

* add lora embedding and loading (non-functional)

* export separate lora ggufs instead

* add adapter metadata api

* use std::string

* convert_hf_to_lora compatibility

* fix assert

* apply suggestions from review

* apply suggestion from review

Assets 15
