sync : ggml #3125

ggerganov · 2025-05-07T10:23:49Z

No description provided.



* vulkan: Add bfloat16 support

  This adds bfloat16 matrix multiply support based on VK_KHR_shader_bfloat16. The extension is required for coopmat multiply support, but matrix-vector multiply trivially promotes bf16 to fp32 and doesn't require the extension. The copy/get_rows shaders also don't require the extension.

  It's probably possible to fall back to non-coopmat and promote to fp32 when the extension isn't supported, but this change doesn't do that.

  The coopmat support also requires a glslc that supports the extension, which currently requires a custom build.

* vulkan: Support bf16 tensors without the bf16 extension or coopmat support

  Compile a variant of the scalar mul_mm shader that will promote the bf16 values to float, and use that when either the bf16 extension or the coopmat extensions aren't available.

* vulkan: bfloat16 fixes (really works without bfloat16 support now)

* vulkan: fix spirv-val failure and reenable -O
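The "trivially promotes bf16 to fp32" remark follows from the bfloat16 format: a bf16 value is the upper 16 bits of an IEEE-754 binary32 value, so promotion is a 16-bit left shift. The following is a minimal CPU-side sketch in C of that idea, not the Vulkan shader code from this change; the names `bf16_bits`, `bf16_to_f32` and `dot_bf16_f32` are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Raw 16-bit storage for a bfloat16 value (hypothetical type name). */
typedef uint16_t bf16_bits;

/* Promote bf16 to fp32: move the 16 stored bits into the high half of a
 * 32-bit word and leave the low 16 mantissa bits at zero. */
static float bf16_to_f32(bf16_bits h) {
    uint32_t u = (uint32_t)h << 16;
    float f;
    memcpy(&f, &u, sizeof f); /* bit-cast without aliasing problems */
    return f;
}

/* Dot product of a bf16 row with an fp32 vector, accumulating in fp32.
 * This mirrors the "promote, then multiply" approach in scalar form. */
static float dot_bf16_f32(const bf16_bits *a, const float *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc += bf16_to_f32(a[i]) * b[i];
    }
    return acc;
}
```

Because the promotion needs only integer shifts and an fp32 multiply, a path like this does not depend on native bf16 arithmetic, which is consistent with the mat-vec and scalar mul_mm fallbacks described above.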


The following scenario will cause an assertion failure in the graph allocator:

- Build and allocate a graph containing a tensor with a non-NULL data pointer
- Build and allocate a new graph where that data is NULL

Result: ggml-alloc.c:819: GGML_ASSERT(talloc->buffer_id >= 0) failed

This happens during revalidation because we think that memory should have been previously allocated based on the current graph but in reality the previous graph was different. In this situation, we should do a full reallocation pass.
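The decision described above can be sketched as a small check. This is a toy model in C under stated assumptions, not the ggml-alloc.c code: `toy_tensor`, `toy_alloc` and `toy_can_reuse` are hypothetical names, and a `buffer_id` of -1 is assumed to mean that the previous graph recorded no allocation for that tensor.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for a graph tensor and its recorded allocation (hypothetical). */
struct toy_tensor { void *data; };
struct toy_alloc  { int buffer_id; }; /* -1: nothing was allocated for this slot */

/* Return true when the allocations recorded for the previous graph can be
 * reused for the current one. A tensor that now needs memory (data == NULL)
 * but has no recorded buffer means the previous graph was different, so the
 * caller should run a full reallocation pass instead of asserting. */
static bool toy_can_reuse(const struct toy_tensor *tensors,
                          const struct toy_alloc *allocs, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        bool needs_memory = tensors[i].data == NULL;
        if (needs_memory && allocs[i].buffer_id < 0) {
            return false; /* stale record: fall back to full reallocation */
        }
    }
    return true;
}
```

In the failing scenario, the first graph's tensor had a non-NULL data pointer, so nothing was recorded for it; when the second graph presents the same slot with NULL data, the check returns false rather than tripping an assertion.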


This patch upstreams llamafile's CPU matrix multiplication kernels for ppc64le, using MMA builtins for the BF16 data type.

This change results in 9x - 40x gains in total speed S t/s (i.e. all tokens / total time) across the batch sizes tested with the llama-batched-bench benchmark.

The patch was tested with Meta-Llama-3-8B and Mistral-7B models (BF16 models generated with llama-quantize from the corresponding FP32 models) on an IBM POWER10 machine.

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>


Acly and others added 10 commits May 7, 2025 13:17

vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) (ggml/1204)

87b88ed

* vulkan : add kernels for depthwise 2d convolution (OP_CONV_2D_DW)
* review: remove src_x/y < 0 checks; add performance tests
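For readers unfamiliar with the operation, a depthwise 2D convolution applies one kernel per channel and never mixes channels. The following is a minimal scalar reference in C, assuming a channels-first layout, a single image, zero padding and no dilation; it is not the Vulkan kernel from this commit, and `conv_2d_dw_ref` and its parameters are illustrative names.

```c
/* Reference depthwise 2D convolution for one image (illustrative only).
 * Layouts, all row-major and channels-first:
 *   src: [channels][src_h][src_w]
 *   knl: [channels][k_h][k_w]     one kernel per channel
 *   dst: [channels][dst_h][dst_w]
 * Out-of-range source positions are treated as zero padding. */
static void conv_2d_dw_ref(const float *src, const float *knl, float *dst,
                           int channels, int src_w, int src_h,
                           int k_w, int k_h, int stride, int pad) {
    int dst_w = (src_w + 2 * pad - k_w) / stride + 1;
    int dst_h = (src_h + 2 * pad - k_h) / stride + 1;
    for (int c = 0; c < channels; ++c) {
        for (int y = 0; y < dst_h; ++y) {
            for (int x = 0; x < dst_w; ++x) {
                float sum = 0.0f;
                for (int ky = 0; ky < k_h; ++ky) {
                    for (int kx = 0; kx < k_w; ++kx) {
                        int sx = x * stride + kx - pad;
                        int sy = y * stride + ky - pad;
                        if (sx < 0 || sx >= src_w || sy < 0 || sy >= src_h) {
                            continue; /* padding contributes nothing */
                        }
                        sum += src[(c * src_h + sy) * src_w + sx] *
                               knl[(c * k_h + ky) * k_w + kx];
                    }
                }
                dst[(c * dst_h + y) * dst_w + x] = sum;
            }
        }
    }
}
```

A reference like this is mainly useful for checking a GPU kernel's output on small inputs; each output element reads only its own channel, which is what makes the operation cheap compared with a full convolution.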

vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul sha…

df45838

…der (llama/13191)

* vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader

build : fix build info on windows (llama/13239)

5a9ccde

* build : fix build info on windows
* fix cuda host compiler msg

rpc : avoid uninitialized memory in serialize_tensor (llama/13210)

a7988d7

Zero out the name and padding buffers.
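The general pattern behind this fix can be sketched as follows: clear a wire-format struct before filling it, so that the unused tail of a fixed-size name field and any explicit padding field never carry stale memory over the network. This is a hedged illustration in C, not the actual `serialize_tensor` code; `toy_wire_tensor`, `TOY_NAME_LEN` and `toy_serialize` are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define TOY_NAME_LEN 16 /* hypothetical fixed name size */

/* Hypothetical wire-format record with a fixed-size name and explicit padding. */
struct toy_wire_tensor {
    uint64_t id;
    uint32_t type;
    char     name[TOY_NAME_LEN];
    char     padding[4];
};

/* Fill the record. The memset runs first so that bytes not written below
 * (the rest of name and the padding field) are zero rather than
 * whatever happened to be in memory. */
static void toy_serialize(struct toy_wire_tensor *out, uint64_t id,
                          uint32_t type, const char *name) {
    memset(out, 0, sizeof *out);
    out->id = id;
    out->type = type;
    /* Copy at most TOY_NAME_LEN - 1 bytes; the final byte stays zero. */
    strncpy(out->name, name, TOY_NAME_LEN - 1);
}
```

Without the initial clear, a struct living on the stack or in a reused buffer would send uninitialized bytes to the peer, which is both a potential information leak and a source of noise for tools such as memory sanitizers.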

vulkan : fix lint (llama/0)

a652c8b

sync : ggml

eeaa1cd

ggml-ci

cli : avoid std::exchange

0055356

ggml-ci

danbev approved these changes May 7, 2025


ggerganov merged commit 4a512cb into master May 7, 2025
58 of 59 checks passed
