Update to upstream eacdeb5b #2

olyasir · 2025-08-13T14:26:45Z

Make sure to read the contributing guidelines before submitting a PR

* ggml-cpu: add nnpa compile flag Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit 4a9f60c) * ggml-cpu: add fp16->fp32 nnpa first Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit 8d4a798) * ggml-cpu: add fp32->fp16 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit 0ff0d65) * ggml-cpu: better variable names Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit 2f58bbc) * docs: update s390x docs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit 01b9294) * ggml-cpu: add debugging prints to see if dlf16 is correct Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix print vs printf Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix float placeholder Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: ensure fp16 and fp32 load and stores are called Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fp16 load ensured to hit Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: remove sigint from fp16 store for some reason, the function is not getting a hit when debugged with gdb. we will need to investigate further Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: activate nnpa for ggml_cpu_fp16_to_fp32 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: nnpa activate ggml_cpu_fp16_to_fp32 for 8 elements Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: nnpa switch to vec_xst test Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: switch to vec_xst for 4 element loops also Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: rework noop Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: remove noop, general code cleanup Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: clarify variable naming Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: activate nnpa for ggml_cpu_fp32_to_fp16 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add breakpoint for debugging Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: test fix for conversion failure Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: disable fp32->fp16 nnpa conversions for now there are some conversion failures in nnpa that requires the eyes of an ibm stsm. will create a separate pr to introduce the fp32->fp16 change. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: switch to elif macro Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: reattempt fp32->fp16 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix typo Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: reattempt fp32->fp16 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix compiler types Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: change to typedef vector types Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add 4 element loops for fp32->fp16 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: clarified vector naming Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: bring back fp32->fp16 store nnpa Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: activate nnpa fp32->fp16 or fp16->fp32 compute Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add nnpa macro check in ggml-impl Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add missing __func__ Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: diagnose why __NNPA__ macro is not being defined Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: import vecintrin.h to fix compiler errors Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: update macro tests Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: move s390x typedef to own header file Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml-cpu: move s390x typedef to own header file" This reverts commit 157f856. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: switch to importing ggml-cpu-impl instead Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix macro declaration Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: test more macros Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add debug prints Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: bruteforce macro definitions Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: move macro definitions Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add ggml-impl.h to cmakelists Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: switch to private macros Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: move s390x typedef to own header file Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit 157f856) * ggml-cpu: move things around Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: bring back compile macros Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: switch to quotes for import Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add compiler error macro Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add s390x detection in ggml-src Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: bring back compile definitions Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: undo cmakelists work Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml-cpu: move s390x typedef to own header file" This reverts commit 18d79e1. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: remove typedefs.h Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: remove typedef from cmakelists Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add ggml-impl.h future notes Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add todo comment for future reference Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: clarify naming of dlf16 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: remove unnecessary target compile definitions Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: move nnpa fp16->fp32 and fp32->fp16 to simd-mappings Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * docs: update broken huggingface link for s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix duplicate func names during compile Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml-cpu: fix duplicate func names during compile" This reverts commit fbb7334. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu" This reverts commit bd288e8. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: refactor fp16<->fp32 simd to ggml-cpu Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix missing simd-mappings.h import in quants.c Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix missing simd-mappings.h within repack Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix amx mmq missing simd-mappings.h Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: attempt at fixing loongarch failing build Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: move nnpa together with other fp16<->fp32 simd Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix wrong refactor of ggml-base ref: ggml-org#14317 (comment) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: remove dependency on ggml-cpu from ggml-base Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: rename all fp16<->fp32 macros to prefix with ggml_cpu ref: ggml-org#14317 (comment) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: remove mistaken fallback macro fallback logic was already implemented but i was too sleepy to realise Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: move ggml_table_f32_f16 to ggml-cpu ref: ggml-org#14317 (comment) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures" This reverts commit 32a3533. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml: move ggml_table_f32_f16 to ggml-cpu" This reverts commit 9e40d98. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml: move ggml_table_f32_f16 to ggml-cpu ref: ggml-org#14317 (comment) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit 9e40d98) * ggml: move ggml_table_f32_f16 to ggml-cpu.c Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: extern c ggml_table_f32_f16 + chore docs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h we rely on the variable declaration in ggml-cpu.c instead Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h" This reverts commit f71b21d. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: bring back ggml_table_f32_f16 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml-cpu: bring back ggml_table_f32_f16" This reverts commit 2dce119. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * fix ggml time initialization * fix f32_f16 table init * remove extra line --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> Co-authored-by: slaren <slarengh@gmail.com>

* musa: enable fp16 mma (all) and cublas on qy2 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Update ggml/src/ggml-cuda/ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Address review comments Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: disable MUL_MAT_ID (q2_k × f32) due to precision issues Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* docs: update s390x documentation + add faq Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * docs: add s390x z17 build q&a Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* metal : batch rows copy in a single threadgroup ggml-ci * metal : handle some edge cases when threadgroup size is not a power of 2 ggml-ci

ggml-ci

…4390)

…#14398) * Add shaders-gen sources as target deps

* gemma3n * add llm_graph_input_one

* ggml : add ggml_set_rows Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'. ref: ggml-org#8366 * use I64 for indices * ggml : add repeat impl for i64 * ggml : add ggml_is_contiguous_rows * ggml : ggml_set_rows support broadcast * ggml : ggml_set_rows support quantized dst ggml-ci * ggml : support GGML_TYPE_F32 ".from_float" trait * ggml : ggml_set_rows update comment + better index name * tests : add ggml_set_rows * metal : add ggml_set_rows implementation ggml-ci * ggml : simplify forward_dup_f32 * ggml : fix supports_op * tests : add comment to set_rows * ggml : leave the repeat_i64 for a separate PR ggml-ci * ggml : set_rows use std::min instead of MIN * ggml : better error message for set_rows unsupported type * metal : perform op->type check only once * tests : more consistent implementation + more tests ggml-ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

ggml-ci

This setting needs to be passed through to vulkan-shaders-gen

Add Day-0 support for Baidu ERNIE 4.5 0.3B model. Signed-off-by: Weizhao Ouyang <weizhao.ouyang@arm.com>

ggml-org#14378)

) * CUDA: add bf16 and f32 support to cublas_mul_mat_batched * Review: add type traits and make function more generic * Review: make check more explicit, add back comments, and fix formatting * Review: fix formatting, remove useless type conversion, fix naming for bools

* vulkan: Add fusion support for RMS_NORM+MUL - Add a use_count to ggml_tensor, so we can detect if an output is used more than once. - Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor. - Add detection logic and basic fusion logic in ggml-vulkan. - Add some testing support for fusion. Rather than computing one node at a time, allow for computing the whole graph and just testing one node's results. Add rms_norm_mul tests and enable a llama test. * extract some common fusion logic * fix -Winconsistent-missing-override * move ggml_can_fuse to a common function * build fix * C and C++ versions of can_fuse * move use count to the graph to avoid data races and double increments when used in multiple threads * use hash table lookup to find node index * change use_counts to be indexed by hash table slot * minimize hash lookups style fixes * last node doesn't need single use. fix type. handle mul operands being swapped. * remove redundant parameter --------- Co-authored-by: slaren <slarengh@gmail.com>

* implement unary REGLU/GEGLU/SWIGLU cpu ops * relax constraints * duplicate shape of source * fix ggml_vec_geglu_f16 * special case gated ops * implement unary REGLU/GEGLU/SWIGLU cuda ops * tighten constraints again * refactor into GGML_GLU_OP * metal : add glu kernels ggml-ci * add CUDA_GLU_BLOCK_SIZE [no ci] * more constraints and use 64bit ints ggml-ci * 64bit multiplication [no ci] * implement swapped variants (cpu/cuda) * update comment [no ci] ggml-ci * Vulkan: Add GLU ops and shaders * SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate * ggml : implement GLU for split up/gate (ggml-org#14181) * implement GLU for split up/gate * add tests for ggml_glu_split * Vulkan: Implement glu_split logic and shader support * add split to logging [no ci] * SYCL: refactor element_size ops and add split up and gate support to gated kernels * SYCL: switch GEGLU to use tanh approximation --------- Co-authored-by: 0cc4m <picard12@live.de> Co-authored-by: Akarshan <akarshan@menlo.ai> * GGML: increase OP count in assertion * Refactor: Optimize SYCL element-wise operations with unary function inlining This commit refactors the SYCL element-wise operations to improve performance by: - Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead. - Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic. - Replacing direct kernel calls with calls to these inlined functions. - Using `__dpct_inline__` to encourage compiler inlining. - Minor code cleanup and consistency improvements. The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices. * vulkan: Increase workgroup size for GLU, for performance (ggml-org#14345) * vulkan: Increase workgroup size for GLU, for performance * vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup * merge fix * metal : add support for split and swap ggml-ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: 0cc4m <picard12@live.de> Co-authored-by: Akarshan <akarshan@menlo.ai> Co-authored-by: Jeff Bolz <jbolz@nvidia.com>

* SYCL: disable faulty fp16 CPU exponent for now * Revert "SYCL: disable faulty fp16 CPU exponent for now" This reverts commit ed0aab1. * SYCL: disable faulty fp16 CPU exponent for now * Fix logic of disabling exponent kernel

…ml-org#14322)

…eature), from command line and from client (ggml-org#13196) * initial commit for handling extra template kwargs * enable_thinking and assistant prefill cannot be enabled at the same time * can set chat_template_kwargs in command line * added doc * fixed formatting * add support for extra context in generic template init * coding standard: common/chat.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * coding standard: common/chat.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Apply suggestions from code review coding standard: cosmetic changes Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix merge conflict * chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context) * normalize environment variable name * simplify code * prefill cannot be used with thinking models * compatibility with the new reasoning-budget parameter * fix prefill for non thinking models --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Olivier Chafik <olivier.chafik@gmail.com>

* Update docker.yml 修改docker.yml文件中的内容使其停止周期性的运行该workflow，如果想要运行该workflow可以手动启动 * Remove redundant include path in CMakeLists.txt The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths. * Enable scheduled Docker image builds Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.

ggml-ci

* ggml : add asserts ggml-ci * cont : fix constant type Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Support diffusion models: Add Dream 7B * Move diffusion to examples * Move stuff to examples. Add patch to not use kv-cache * Address review comments * Make sampling fast * llama: remove diffusion functions * Add basic timings + cleanup * More cleanup * Review comments: better formating, use LOG instead std::cerr, re-use batch, use ubatch instead of max_length * fixup! * Review: move everything to diffusion-cli for now

* kv-cache : prepare K/V buffers for separation ggml-ci * batched-bench : fix oob write ggml-ci * llama : add "virtual sequences" ggml-ci * llama : use "stream" vs "virtual sequence" ggml-ci * graph : fix stream splitting when KV cache is not used ggml-ci * kv-cache : add multi-stream save/load support ggml-ci * llama : add "--attn-streams" flag ggml-ci * kv-cache : fix handling when find_slot fails ggml-ci * kv-cache : restore find_slot impl ggml-ci * kv-cache : add comments * kv-cache : add bounds checks for sequence id ggml-ci * cont : add n_seq_max to batch allocr ggml-ci * kv-cache : perform stream copies lazily after llama_synchronize ggml-ci * kv-cache : avoid throwing exceptions across the C boundary ggml-ci * CUDA: 4D FlashAttention support (ggml-org#14628) * CUDA: 4D FlashAttention support * CUDA: fix WMMA FA kernel * llama : rename attn_streams -> kv_unified ggml-ci * common : rename kv_split -> kv_unified ggml-ci --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

Co-authored-by: qwaqrm <qwaqrm@126.com>

* Minimal setup of webgpu backend with dawn. Just prints out the adapter and segfaults * Initialize webgpu device * Making progress on setting up the backend * Finish more boilerplate/utility functions * Organize file and work on alloc buffer * Add webgpu_context to prepare for actually running some shaders * Work on memset and add shader loading * Work on memset polyfill * Implement set_tensor as webgpu WriteBuffer, remove host_buffer stubs since webgpu doesn't support it * Implement get_tensor and buffer_clear * Finish rest of setup * Start work on compute graph * Basic mat mul working * Work on emscripten build * Basic WebGPU backend instructions * Use EMSCRIPTEN flag * Work on passing ci, implement 4d tensor multiplication * Pass thread safety test * Implement permuting for mul_mat and cpy * minor cleanups * Address feedback * Remove division by type size in cpy op * Fix formatting and add github action workflows for vulkan and metal (m-series) webgpu backends * Fix name * Fix macos dawn prefix path

…g#14725)

* make hf token optional * fail if we can't get necessary tokenizer config

ggml-ci

* llama : reuse compute graphs ggml-ci * llama-bench : add graph reuse parameter ggml-ci * cont : remove the parameter and the sched resets ggml-ci * graph : rename update() to can_reuse() ggml-ci * params : remove is_same() ggml-ci * graph : set res->params in llm_graph_context constructor ggml-ci * graph : avoid set_max_nodes in llm_graph_result ggml-ci * kv-cache : reuse llama_context's graph result instance ggml-ci * context : reset the previous graph result upon memory updates ggml-ci * batch : llama_ubatch now carries its data instead of pointing to balloc ggml-ci * merge : fix build ggml-ci * graph : fix can_reuse() checks when flash-attention is disabled * graph : move llm_graph_result impl in source file + debug env ggml-ci

ggml-ci

* Add Ernie4.5 MoE * Fix Flake errors. * Properly encode/decode MoE layer step * Correct tensor mappings (.weight) * Pass and read n_ff_exp * n_ff_shexp calculation and further minor changes * Rope fixes. * .gitignore fix * Add unit32 cast for Linux builds * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Further fixes from code review * Fix trailing whitespace * Reenable missing experts error * Code style from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Fix non-MoE regression Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

…rg#14726)

…org#14732)

) Without that condition, this debug log clutters the screen every batch treated in the prompt processing, or every token generated in Kobold.cpp.

ggml-ci

ShanoToni and others added 30 commits June 25, 2025 18:09

sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices (ggml…

2bf9d53

…-org#13973)

ggml : do not output unprintable characters on GGUF load failure (ggm…

b193d53

…l-org#14381)

docs: update s390x documentation + add faq (ggml-org#14389)

bf5bcd0

* docs: update s390x documentation + add faq Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * docs: add s390x z17 build q&a Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

metal : batch rows copy in a single threadgroup (ggml-org#14384)

5783ae4

* metal : batch rows copy in a single threadgroup ggml-ci * metal : handle some edge cases when threadgroup size is not a power of 2 ggml-ci

metal : add special-case mat-vec mul for ne00 == 4 (ggml-org#14385)

e8215db

ggml-ci

llama : return mistral-v7-tekken as default template only (ggml-org#1…

b253462

…4390)

cmake: regen vulkan shaders when shaders-gen sources change (ggml-org…

a01047b

…#14398) * Add shaders-gen sources as target deps

model : gemma3n text-only (ggml-org#14400)

8846aac

* gemma3n * add llm_graph_input_one

convert : fix broken sentencepiece vocab (ggml-org#14416)

f667f1e

recurrent : call balloc split_reset() in init_batch() (ggml-org#14414)

4367806

ggml-ci

graph : make llm_graph_context destructor virtual (ggml-org#14410)

72babea

ggml-ci

vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (ggml-org#14427)

ceb1bf5

This setting needs to be passed through to vulkan-shaders-gen

ci : fix windows build and release (ggml-org#14431)

6609507

fix async_mode bug (ggml-org#14432)

b25e927

model : add support for ERNIE 4.5 0.3B model (ggml-org#14408)

566c16f

Add Day-0 support for Baidu ERNIE 4.5 0.3B model. Signed-off-by: Weizhao Ouyang <weizhao.ouyang@arm.com>

vulkan: lock accesses of pinned_memory vector (ggml-org#14333)

00d5282

vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (

63a7bb3

ggml-org#14378)

ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (ggml-org#14443)

a5d1fb6

SYCL: disable faulty fp16 exp kernel (ggml-org#14395)

f47c1d7

* SYCL: disable faulty fp16 CPU exponent for now * Revert "SYCL: disable faulty fp16 CPU exponent for now" This reverts commit ed0aab1. * SYCL: disable faulty fp16 CPU exponent for now * Fix logic of disabling exponent kernel

server : fix appearance of the chats list context menu for Safari (gg…

83790b0

…ml-org#14322)

scripts : make the shell scripts cross-platform (ggml-org#14341)

e9b6350

test-backend-ops : disable llama test (ggml-org#14461)

eb3fa29

JohannesGaessler and others added 26 commits July 16, 2025 09:33

scripts: synthetic prompt mode for server-bench.py (ggml-org#14695)

5cae766

server : fix handling of the ignore_eos flag (ggml-org#14710)

538cc77

ggml-ci

llama : fix parallel processing for plamo2 (ggml-org#14716)

e4841d2

server : pre-calculate EOG logit biases (ggml-org#14721)

6ffd4e9

ggml-ci

ggml : add asserts (ggml-org#14720)

6497834

* ggml : add asserts ggml-ci * cont : fix constant type Co-authored-by: Diego Devesa <slarengh@gmail.com> --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>

model : support output bias for qwen2 (ggml-org#14711)

b0f0ecc

Co-authored-by: qwaqrm <qwaqrm@126.com>

llama : fix parameter order for hybrid memory initialization (ggml-or…

496957e

…g#14725)

convert : make hf token optional (ggml-org#14717)

19e5943

* make hf token optional * fail if we can't get necessary tokenizer config

ci : disable failing vulkan crossbuilds (ggml-org#14723)

1ba45d4

batch : fix uninitialized has_cpl flag (ggml-org#14733)

ad57d3e

ggml-ci

kv-cache : opt mask set input (ggml-org#14600)

d9b6910

ggml-ci

llama : fix parallel processing for lfm2 (ggml-org#14705)

086cf81

kv-cache : fix k-shift for multiple streams (ggml-org#14742)

d6fb3f6

ggml-ci

nix : use optionalAttrs for env mkDerivation attrset argument (ggml-o…

760b448

…rg#14726)

convert : fix Ernie4.5 MoE without shared experts (ggml-org#14746)

670e136

use max work group size for device to replace the magic number (ggml-…

349ea79

…org#14732)

graph : Pass the graph placeholder message in debug mode (ggml-org#14748

09651d0

) Without that condition, this debug log clutters the screen every batch treated in the prompt processing, or every token generated in Kobold.cpp.

graph : refactor context to not pass gf explicitly (ggml-org#14629)

8f974bc

ggml-ci

CUDA: set_rows + cpy.cu refactor (ggml-org#14712)

f9a31ee

model : add EXAONE 4.0 support (ggml-org#14630)

e0cb5c5

model : fix build after merge conflict (ggml-org#14754)

eacdeb5

chetasr approved these changes Aug 13, 2025

View reviewed changes

jpgaribotti approved these changes Aug 14, 2025

View reviewed changes

olek-tether approved these changes Aug 15, 2025

View reviewed changes

olek-tether merged commit 4fb2556 into tetherto:master Aug 15, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Update to upstream eacdeb5b #2

Update to upstream eacdeb5b #2

Uh oh!

olyasir commented Aug 13, 2025

Uh oh!

Uh oh!

Update to upstream eacdeb5b #2

Update to upstream eacdeb5b #2

Uh oh!

Conversation

olyasir commented Aug 13, 2025

Uh oh!

Uh oh!