sync : ggml #3215

ggerganov · 2025-06-01T11:04:18Z

No description provided.

This PR improves q4_k_q8_k gemm kernel with arm64 i8mm instruction.

Tested on neoverse-n2 with llama3 8b q4_k_m quantization model.
- 34% ~ 50% S_PP uplift for all batch sizes
- 12% ~ 37% S_TG uplift for batch size 4 and above

Perplexity doesn't change with this PR.

```
// tested on neoverse-n2
$ llama-batched-bench \
      -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
      --no-mmap -fa \
      -c 8192 -b 4096 -ub 512 -npp 128 -ntg 128 \
      -npl 1,2,4,8,16,32 \
      -t 64

---------------------------------------------------------------------
|    PP |     TG |    B |       S_PP t/s      |       S_TG t/s      |
|       |        |      | original |  this pr | original |  this pr |
|-------|--------|------|----------|----------|----------|----------|
|   128 |    128 |    1 |   110.12 |   147.83 |    24.36 |    24.28 |
|   128 |    128 |    2 |   121.16 |   172.42 |    46.36 |    47.93 |
|   128 |    128 |    4 |   120.15 |   169.75 |    74.68 |    84.00 |
|   128 |    128 |    8 |   130.97 |   196.81 |    91.04 |   114.74 |
|   128 |    128 |   16 |   131.01 |   196.88 |   101.43 |   135.79 |
|   128 |    128 |   32 |   130.85 |   196.51 |   106.97 |   147.29 |
---------------------------------------------------------------------
```
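The uplift ranges quoted above can be checked from the benchmark figures. The helper below is only an illustrative sketch: `uplift_percent` and `print_uplift_examples` are hypothetical names, not part of ggml or `llama-batched-bench`.

```cpp
#include <cstdio>

// Relative uplift in percent: how much faster `after` is than `before`.
double uplift_percent(double before, double after) {
    return (after / before - 1.0) * 100.0;
}

// Prints the uplift at the ends of the quoted ranges, using the
// "original" and "this pr" columns of the benchmark figures above.
void print_uplift_examples() {
    std::printf("S_PP, B=1:  %.1f%%\n", uplift_percent(110.12, 147.83));
    std::printf("S_PP, B=32: %.1f%%\n", uplift_percent(130.85, 196.51));
    std::printf("S_TG, B=4:  %.1f%%\n", uplift_percent(74.68, 84.00));
    std::printf("S_TG, B=32: %.1f%%\n", uplift_percent(106.97, 147.29));
}
```

At batch size 1 and 2 the S_TG figures are essentially unchanged, which matches the statement that the token-generation gain applies to batch size 4 and above.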

* SYCL: Add mrope kernel
* feat: Optimize rope operations with vectorization

  Uses `sycl::vec` to load and store two elements at a time, significantly improving performance in `rope_norm`, `rope_neox`, and `rope_multi`. This reduces the number of memory accesses and leverages SIMD instructions for faster execution.
* Use ceil_div
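A minimal sketch of the two ideas named in this commit message, written in plain C++ rather than SYCL so it runs anywhere. The `float2` struct stands in for a two-lane `sycl::vec`, and `rope_pairs` and this `ceil_div` are hypothetical helpers for illustration; the real kernels also apply frequency scaling and handle the different element layouts of `rope_neox` and `rope_multi`, which are not shown here.

```cpp
#include <cmath>
#include <cstddef>

// Round-up integer division, e.g. to compute how many work-groups cover n items.
constexpr int ceil_div(int n, int d) {
    return (n + d - 1) / d;
}

// Stand-in for a two-lane vector type (assumption: plays the role of sycl::vec).
struct float2 {
    float x;
    float y;
};

// Rotate each adjacent pair (data[2i], data[2i+1]) by its own angle theta[i].
// Each pair is read once and written once as a unit.
void rope_pairs(float * data, const float * theta, std::size_t n_pairs) {
    for (std::size_t i = 0; i < n_pairs; ++i) {
        const float2 v = { data[2*i], data[2*i + 1] };      // one two-element load
        const float c = std::cos(theta[i]);
        const float s = std::sin(theta[i]);
        const float2 r = { v.x*c - v.y*s, v.x*s + v.y*c };  // 2D rotation
        data[2*i]     = r.x;                                // one two-element store
        data[2*i + 1] = r.y;
    }
}
```

Treating the pair as one unit is what lets a vectorized backend issue a single wider memory access instead of two scalar ones.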

…PU in cuda (#13856) (llama/13895)

* 1. add "integrated" in ggml_cuda_device_info to distinguish whether the device is an integrated GPU or a discrete GPU
  2. Adjust the func:"ggml_backend_cuda_device_supports_buft" for this new feature
* Update ggml/src/ggml-cuda/ggml-cuda.cu
  Adjusted code indentation
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
  Fixed incorrect setting of variable types
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
  Adjusted the judgment logic
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* add a host_buft assert in case of integrated_cuda_device with func:'evaluate_and_capture_cuda_graph()'
* Update ggml/src/ggml-cuda/ggml-cuda.cu
  Add a defensive security assert
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
  Adjusted the support judgment logic.
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* revert the suggested commit changes because they are not applicable to jetson_device
* Update ggml/src/ggml-cuda/ggml-cuda.cu
  Add parentheses to enforce operator precedence
  Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
  Fix ci bug: add a space
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: yangxiao <yang_xl@tju.edu.cn>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: yangxiao <yangxl_zz@qq.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>

…dows to avoid throttling (llama/12995)

* threading: support for GGML_SCHED_PRIO_LOW, update thread info on Windows to avoid throttling

  We talked about adding LOW priority for GGML threads in the original threadpool PR. It might be useful for some cases to avoid contention.

  Latest Windows ARM64 releases started parking (offlining) the CPU cores more aggressively, which results in suboptimal performance with n_threads > 4. To deal with that we now disable Power Throttling for our threads for the NORMAL and higher priorities.

  Co-authored-by: Diego Devesa <slarengh@gmail.com>
* threading: disable SetThreadInfo() calls for older Windows versions
* Update tools/llama-bench/llama-bench.cpp
  Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
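For readers unfamiliar with the mechanism, here is a hedged sketch of how a thread can opt out of Windows power throttling through the documented `SetThreadInformation` API. The function name `disable_power_throttling_for_current_thread` is hypothetical and this is not the code from the commit; it assumes a Windows SDK recent enough to define `THREAD_POWER_THROTTLING_STATE`, and it compiles to a no-op elsewhere.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Ask Windows not to power-throttle the calling thread.
// Returns true on success, false on failure or on platforms/SDKs
// where the power-throttling API is unavailable.
bool disable_power_throttling_for_current_thread() {
#if defined(_WIN32) && defined(THREAD_POWER_THROTTLING_CURRENT_VERSION)
    THREAD_POWER_THROTTLING_STATE state = {};
    state.Version     = THREAD_POWER_THROTTLING_CURRENT_VERSION;
    // ControlMask selects which policy we take control of ...
    state.ControlMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED;
    // ... and a cleared bit in StateMask turns that throttling off.
    state.StateMask   = 0;
    return SetThreadInformation(GetCurrentThread(), ThreadPowerThrottling,
                                &state, sizeof(state)) != 0;
#else
    return false;
#endif
}
```

A worker thread would call this once at startup. The compile-time guard on the SDK macro is one way to skip the call on older toolchains, in the spirit of the follow-up commit that disables the calls for older Windows versions.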

rgerganov and others added 23 commits June 1, 2025 14:03

ggml : remove ggml_graph_import and ggml_graph_export declarations (g…

57549f6

…gml/1247) The implementation is already deleted with commit 9d0762e. closes: #1235

cmake : Fix broken CMake error messages (ggml/1252)

2e445b2

vulkan : Remove unexpected ; (ggml/1253)

59c6afa

ggml : add ggml_repeat_4d (llama/13824)

ee61e88

SYCL: add gelu_erf kernel (llama/13749)

da9b2d3

* SYCL: add gelu_erf kernel * refactor code Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com> * Use scope_op_debug_print --------- Co-authored-by: Atharva Dubey <atharva.dubey@codeplay.com>

vulkan: use timestamp queries for GGML_VULKAN_PERF (llama/13817)

52520d8

Also change it to be controlled by an env var rather than cmake flag
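A minimal sketch of what runtime control through an environment variable can look like, as opposed to a compile-time cmake flag. The helper names are hypothetical, and treating "set to any value" as enabled is an assumption; the commit message does not say how the variable is interpreted.

```cpp
#include <cstdlib>

// True when the named environment variable is set to any value.
bool env_flag_enabled(const char * name) {
    return std::getenv(name) != nullptr;
}

// Hypothetical call site: decide once at startup whether to collect timings.
bool vulkan_perf_logging_requested() {
    return env_flag_enabled("GGML_VULKAN_PERF");
}
```

With this pattern the same binary can be profiled by launching it with `GGML_VULKAN_PERF=1` in the environment, with no rebuild required.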

opencl: mark mul_mat f32f32 as supporting non-contiguous tensors …

355202e

…(llama/13790)

opencl: add new ops - argsort, div, sub, addrows, sigmoid, …

f22ba6e

…`group_norm` (llama/13787) * opencl: add `argsort` * opencl: add `div` * opencl: add `add_rows` * opencl: add `sub` * opencl: add `sigmoid`, both `f16` and `f32` * opencl: add `group_norm`

CANN: Add SOC TYPE printing in cmake configuration (llama/13837)

194a5a1

CUDA: fix FA tg at long context for CC >= 8.9 (llama/13852)

395e809

ggml: aarch64: Implement SVE F32 kernels for vector functions (llama/…

2995878

…13843) * F32-Mamba-SVE * F32-Mamba-SVE * Resolve test errors-1 * Resolve test errors-2 * F32-vec-SVE * F32-vec-SVE * F32-vec-SVE

ggml: aarch64: Implement SVE F32 kernels for Mamba Sequential Scan Al…

c0f50c4

…gorithm (llama/13882) * F32-Mamba-Seq_Scan-SVE * Fix formatting * ggml : missing space --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

cmake: Factor out CPU architecture detection (llama/13883)

9241a94

* cmake: Define function for querying architecture The tests and results match exactly those of src/CMakeLists.txt * Switch arch detection over to new function

cmake: Guard GGML_CPU_ALL_VARIANTS by architecture (llama/13890)

5138044

cuda : prevent using split buffers with 3d/4d matrices (llama/13919)

1c216c0

sched : avoid changing cur_copy when a graph is already allocated (ll…

085f43f

…ama/13922)

CUDA: fix typo in FlashAttention code (llama/13926)

6a2f2c4

sync : ggml

85ca186

ggml-ci

talk-llama : sync llama.cpp

95001c7

ggml-ci

ggerganov merged commit 7fd6fa8 into master Jun 1, 2025
56 of 60 checks passed

ggerganov deleted the sync-ggml-25-06-01 branch June 1, 2025 12:14
