Releases · ggml-org/llama.cpp

29 Aug 03:59

009b709

b6316 Latest

CUDA: fuse adds, fuse add with rms norm (#15631)

* CUDA: fused add with rms_norm_mul

* Non-broadcast fuse works

* Add fused adds

* format

* Remove n_fuse from template params

* Address review comments

* Move template inside binbcast
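As a rough illustration of what these fusions compute, here is a minimal CPU reference sketch in plain Python. It is not the CUDA kernel from this PR. The function names, the `eps` default, and the order of operations (normalize, scale by the weight, then add) are assumptions made for illustration; the notes above do not spell out the exact operator order.

```python
import math

def fused_adds(first, *others):
    """Sum several equally sized rows in a single pass (one loop instead of one per add)."""
    return [sum(vals) for vals in zip(first, *others)]

def rms_norm_mul_add(x, weight, addend, eps=1e-6):
    """Reference for one row: RMS-normalize x, scale by weight, then add addend.

    The op order and the eps default are assumptions, not taken from the PR.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w + a for v, w, a in zip(x, weight, addend)]

if __name__ == "__main__":
    print(fused_adds([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]))
    print(rms_norm_mul_add([3.0, 4.0], [1.0, 0.5], [0.5, -0.5]))
```

The point of fusing is that the intermediate results never need to be written back to memory between the individual operations; the arithmetic itself is unchanged.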

Assets 15

cudart-llama-bin-win-cuda-12.4-x64.zip
sha256:8c79a9b226de4b3cacfd1f83d24f962d0773be79f1e7b75c6af4ded7e32ae1d6
373 MB 2025-08-29T03:59:40Z

llama-b6316-bin-macos-arm64.zip
sha256:e6aaebf0e290c237d099b9d6467e2d56d28c80818c28c2f41b75ba98e27d45ea
10.9 MB 2025-08-29T03:59:51Z

llama-b6316-bin-macos-x64.zip
sha256:19ad14c264e95607080698d152615b03036e90926a3b0f986bb9cb17fabd2ba9
28.1 MB 2025-08-29T03:59:52Z

llama-b6316-bin-ubuntu-vulkan-x64.zip
sha256:ae34ec13d2518d2054925894c4a5e9b9cf7912ff188b51826927e6f3747e0a73
25 MB 2025-08-29T03:59:53Z

llama-b6316-bin-ubuntu-x64.zip
sha256:072221327c38eb86af844641ed1c8e3736ad5d0e019b954746a8ae7acce47234
12.9 MB 2025-08-29T03:59:54Z

llama-b6316-bin-win-cpu-arm64.zip
sha256:79786477fad9e8656e29a6f9f7f7edfd5b074c7dfc258b8efcb781d1a643315d
11.1 MB 2025-08-29T03:59:55Z

llama-b6316-bin-win-cpu-x64.zip
sha256:7279c8ea14e1c8f7ed829a6e816a523a926a689f29d0b6776bb78c710ace0af6
14.1 MB 2025-08-29T03:59:56Z

llama-b6316-bin-win-cuda-12.4-x64.zip
sha256:6d144405be63a80fe97ad35fab698a1cce10b035f33bef1d131a7dcb9e49d54b
138 MB 2025-08-29T03:59:58Z

llama-b6316-bin-win-hip-radeon-x64.zip
sha256:8814ddeca4d50dd0f6a69fb5f6e213307c69db08d70156f65e20d237354b4507
287 MB 2025-08-29T04:00:02Z

llama-b6316-bin-win-opencl-adreno-arm64.zip
sha256:d06ac25980e764be266acdc6f0e0146a38330cb1bde784ad54fd51c8a21ff532
11.5 MB 2025-08-29T04:00:12Z

Source code (zip)

2025-08-29T03:35:58Z
Source code (tar.gz)

2025-08-29T03:35:58Z
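
Each binary asset above is listed with a `sha256:` digest, so a download can be checked before it is unpacked. The following is a minimal sketch using only Python's standard `hashlib`; the helper names `sha256_of_file` and `matches_listed_digest` are made up for this example, and the local file path in the usage comment is an assumption.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_listed_digest(path, listed):
    """Compare a file against a digest written as 'sha256:<hex>' or bare hex."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return sha256_of_file(path) == expected

# Usage, assuming the archive was saved to the current directory:
# matches_listed_digest(
#     "llama-b6316-bin-ubuntu-x64.zip",
#     "sha256:072221327c38eb86af844641ed1c8e3736ad5d0e019b954746a8ae7acce47234",
# )
```

A mismatch usually means a truncated or corrupted download; in that case the file should be fetched again rather than used.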

29 Aug 01:04

github-actions

b6315

e8d99dd

nvidia nemotron nano v2 (nemotronh) (#15507)

* feat: Add NEMOTRONH to python arch enum

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add NEMOTRONH to c++ arch enum

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add NEMOTRONH to llama-arch layer map

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: First pass at conversion for nemotronh

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add a verbose log for each tensor loaded

This is really helpful for diagnosing mismatches between the expected and
received tensors

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: First (broken) pass at nemotronh model architecture

It generates tokens, just not valid ones!

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Explicitly enable add_bos_token during conversion

The `tokenizer.json`/`tokenizer_config.json` in the model are a bit
contradictory. In the config, add_bos_token is set to False, but the
tokenizer model itself has a post_processor that adds the BOS token via
type: TemplateProcessing

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use relu2 (LLM_FFN_RELU_SQR) for activation in FFN layers

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Only allocate attention cache for attention layers (not non-recurrent)

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Move residual add to after every block

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use the correct norm tensor for the MLP blocks

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* Nemotron-H: MLP gate cleanup (pass NULL for unused gate)

This model does not use a gate in MLP blocks; pass NULLs for gate tensors to make intent clear and avoid unused-pointer noise.

* SSM: respect ssm_dt_rank for dt_dim when provided

Use GGUF-provided time_step_rank (ssm_dt_rank) to set dt_dim when > 0; fallback to max(64, n_embd/16).

* fix: plamo2 - revert dt_dim to default (remove ssm_dt_rank usage)

* Rename nemotronh to nemotron_h for consistency

- Update architecture name from NEMOTRONH to NEMOTRON_H in constants.py
- Change architecture string from 'nemotronh' to 'nemotron_h' in all files
- Update enum LLM_ARCH_NEMOTRONH to LLM_ARCH_NEMOTRON_H
- Update class name llm_build_nemotronh to llm_build_nemotron_h
- Consistent naming with underscore convention (nemotron_h vs nemotronh)

* feat: Support conversion for older NemotronH models

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Maicon Domingues <dominguesm@outlook.com>
Co-authored-by: weatherman <fxdstudios@gmail.com>
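
Two of the commits above describe the shape of the Nemotron-H MLP block: it uses a squared ReLU activation ("relu2") and has no gate projection. A minimal sketch of that computation in plain Python follows. It is not the llama.cpp graph code; the function names are made up for illustration and bias terms are left out as a simplifying assumption.

```python
def relu_sqr(v):
    """Squared ReLU: max(v, 0) ** 2."""
    r = max(v, 0.0)
    return r * r

def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def mlp_relu_sqr(x, w_up, w_down):
    """Ungated MLP: down(relu(up(x)) ** 2). There is no separate gate branch."""
    hidden = [relu_sqr(h) for h in matvec(w_up, x)]
    return matvec(w_down, hidden)

if __name__ == "__main__":
    w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    w_down = [[2.0, 3.0, 4.0]]
    print(mlp_relu_sqr([1.0, -2.0], w_up, w_down))
```

In a gated MLP the activated branch would be multiplied elementwise by a second projection of `x`; here that branch simply does not exist, which is why the gate tensors are passed as NULL.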

Assets 15
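
The `ssm_dt_rank` rule from the b6315 notes above can be written down in a few lines. This is a sketch of the stated rule only, not the loader code: the function name `select_dt_dim` is hypothetical, and integer division for `n_embd/16` is an assumption. Note that a later commit in the same PR reverts this behaviour for plamo2.

```python
def select_dt_dim(n_embd, ssm_dt_rank=0):
    """Use the GGUF-provided rank when it is positive, else max(64, n_embd / 16)."""
    if ssm_dt_rank > 0:
        return ssm_dt_rank
    # Integer division is an assumption about how n_embd / 16 is evaluated.
    return max(64, n_embd // 16)

if __name__ == "__main__":
    print(select_dt_dim(4096))        # no rank in metadata: falls back to 256
    print(select_dt_dim(512))         # small model: the floor of 64 applies
    print(select_dt_dim(4096, 128))   # explicit rank wins: 128
```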

28 Aug 21:36

github-actions

b6314

a8bca68

fix: Compute the full sum in llama-eval-callback, not just the sum of…

Assets 15

28 Aug 19:28

github-actions

b6313

c97dc09

CUDA: add conv2d (#15635)

* CUDA: add conv2d

* CUDA: conv2d - correct formatting and added const
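
For readers unfamiliar with the operation, the following is a minimal single-channel reference of what a 2D convolution computes, in the cross-correlation form commonly used by ML frameworks. It is a plain Python sketch and says nothing about how the CUDA kernel in this PR is organized; the function signature, the zero padding, and the single-channel restriction are assumptions for illustration.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Direct 2D convolution (cross-correlation) of one channel with zero padding."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh = (ih + 2 * padding - kh) // stride + 1
    ow = (iw + 2 * padding - kw) // stride + 1
    out = [[0.0] * ow for _ in range(oh)]
    for oy in range(oh):
        for ox in range(ow):
            acc = 0.0
            for ky in range(kh):
                for kx in range(kw):
                    y = oy * stride + ky - padding
                    x = ox * stride + kx - padding
                    # Positions outside the image count as zeros.
                    if 0 <= y < ih and 0 <= x < iw:
                        acc += image[y][x] * kernel[ky][kx]
            out[oy][ox] = acc
    return out

if __name__ == "__main__":
    img = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    print(conv2d(img, [[1.0, 0.0], [0.0, 1.0]]))
```

A reference like this is mainly useful for checking a faster backend implementation against known outputs on small inputs.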

Assets 15

28 Aug 17:12

github-actions

b6312

6c442f4

ggml-cpu: fix invalid hsum build in debug s390x (#15634)

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

Assets 15

28 Aug 17:11

github-actions

b6311

7380414

ggml : fix SSM_SCAN for n_groups > 1 (#15625)

Assets 15

28 Aug 16:34

github-actions

b6310

c8d0d14

kv-cache : fix find_slot to not search for continuous slot (#15638)

ggml-ci
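
The difference between searching for a contiguous slot and accepting any free cells can be shown with a toy model. This is only an illustration of the general idea named in the commit title; it is not the llama.cpp `find_slot` implementation, and both function names and the boolean `used` list are hypothetical.

```python
def find_contiguous_slot(used, n_tokens):
    """Return the start of the first run of n_tokens free cells, or None."""
    run = 0
    for i, taken in enumerate(used):
        run = 0 if taken else run + 1
        if run == n_tokens:
            return i - n_tokens + 1
    return None

def find_any_slot(used, n_tokens):
    """Return indices of the first n_tokens free cells, contiguous or not, or None."""
    free = [i for i, taken in enumerate(used) if not taken]
    return free[:n_tokens] if len(free) >= n_tokens else None

if __name__ == "__main__":
    cells = [True, False, True, False, False]
    print(find_contiguous_slot(cells, 3))  # no run of three free cells exists
    print(find_any_slot(cells, 3))         # scattered free cells are enough
```

With a fragmented cache the contiguous search can fail even though enough free cells exist, which is the situation the non-contiguous search avoids.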

Assets 15

28 Aug 16:32

github-actions

b6309

84ab83c

model : jina-embeddings-v3 support (#13693)

* initial jina-embeddings-v3 support

* fix vocab parsing with only tokenizer.json

* set mask token lstrip attribute

* additional unk_token_id fallback just in case [no ci]

* revert vocab_size() change [no ci]

* merge tensor loading into general bert

* rope

* add lora embedding and loading (non-functional)

* export separate lora ggufs instead

* add adapter metadata api

* use std::string

* convert_hf_to_lora compatibility

* fix assert

* apply suggestions from review

* apply suggestion from review

Assets 15

28 Aug 10:43

github-actions

b6307

8a4280c

kv-cache : remove LLAMA_SET_ROWS checks (#15505)

ggml-ci

Assets 15

28 Aug 09:29

github-actions

b6305

d35a1e8

cli : change log to warning to explain reason for stopping (#15604)

* Change to warn instead of debug, to explain reason for stopping.

* Update tools/main/main.cpp

Fix printing --2

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

Assets 15
