vulkan: support SET_ROWS #14587

jeffbolznv · 2025-07-09T02:58:48Z

Add variants of the copy_to_quant shader that do the SET_ROWS operation. Change these shaders to spread the work across the workgroup. The memory access pattern is probably not great (one thread per quant block), but should be fine for now.
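As a rough model of the operation, SET_ROWS writes source row i into the destination row selected by an index tensor and converts to the destination type along the way. The C sketch below imitates the "one thread per quant block" layout: a flat loop in which each iteration stands in for one shader invocation. It rests on several assumptions. It uses a Q8_0-style block of 32 values plus a scale, stored as a float rather than fp16. It treats the tensors as contiguous and 2-D. The names `block_q8`, `quantize_block` and `set_rows_q8` are hypothetical. It is not the copy_to_quant shader itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QK 32 /* elements per quant block (Q8_0-style, assumed) */

/* Simplified Q8_0-like block: ggml stores the scale as fp16, float is used here. */
typedef struct {
    float  d;      /* per-block scale  */
    int8_t qs[QK]; /* quantized values */
} block_q8;

/* Round half away from zero (the same rule as C's roundf). */
static int round_away(float v) {
    return v >= 0.0f ? (int)(v + 0.5f) : -(int)(-v + 0.5f);
}

/* Quantize one QK-element block: the work a single invocation would do. */
static void quantize_block(const float *x, block_q8 *y) {
    float amax = 0.0f;
    for (int j = 0; j < QK; j++) {
        const float a = x[j] < 0.0f ? -x[j] : x[j];
        if (a > amax) amax = a;
    }
    const float d  = amax / 127.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    y->d = d;
    for (int j = 0; j < QK; j++) {
        y->qs[j] = (int8_t)round_away(x[j] * id);
    }
}

/* SET_ROWS into a quantized tensor: dst row idx[i] <- quantize(src row i).
 * ne0 is the row length. Each iteration of the flat loop models one
 * invocation that owns exactly one destination block. */
static void set_rows_q8(const float *src, const int64_t *idx, block_q8 *dst,
                        size_t ne0, size_t n_rows) {
    assert(ne0 % QK == 0);
    const size_t blocks_per_row = ne0 / QK;
    const size_t n_invocations  = n_rows * blocks_per_row;
    for (size_t inv = 0; inv < n_invocations; inv++) {
        const size_t i  = inv / blocks_per_row; /* source row       */
        const size_t ib = inv % blocks_per_row; /* block within row */
        const float *x  = src + i * ne0 + ib * QK;
        block_q8    *y  = dst + (size_t)idx[i] * blocks_per_row + ib;
        quantize_block(x, y);
    }
}
```

Adjacent invocations in this layout start reading QK elements apart rather than at consecutive addresses. That is presumably why the access pattern is described as "probably not great".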

jeffbolznv · 2025-07-09T13:28:10Z

Prompt processing performance is a few percent slower with LLAMA_SET_ROWS=1. I'll look into it soon. But I don't think it needs to block merging this.

Larger workgroups for non-quant types. Set "norepeat" (there is manual repeat logic). Use fastmod.
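GPUs execute integer division and modulo comparatively slowly. When the divisor is a tensor dimension that stays fixed for a dispatch, the host can precompute a multiplier and a shift once. The shader can then replace `n % d` with a multiply-high, an add and a shift. Below is a minimal C sketch of that idea. It assumes the round-up multiply-shift scheme of Granlund and Montgomery. The names `fastdiv_t`, `fastdiv_init`, `fastdiv` and `fastmod` are illustrative, and the actual Vulkan shaders may differ in details such as operand width.

```c
#include <assert.h>
#include <stdint.h>

/* Precomputed constants for dividing by a fixed d (illustrative names). */
typedef struct {
    uint32_t d;  /* divisor                 */
    uint32_t mp; /* magic multiplier        */
    uint32_t L;  /* shift = ceil(log2(d))   */
} fastdiv_t;

/* Host side: compute the constants once per divisor. */
static fastdiv_t fastdiv_init(uint32_t d) {
    assert(d != 0);
    fastdiv_t f = { d, 0, 0 };
    while (f.L < 32 && ((uint64_t)1 << f.L) < d) {
        f.L++;
    }
    /* mp = floor(2^32 * (2^L - d) / d) + 1, which always fits in 32 bits */
    f.mp = (uint32_t)(((uint64_t)1 << 32) * (((uint64_t)1 << f.L) - d) / d + 1);
    return f;
}

/* n / d with one 32x32->64 multiply, an add and a shift (no divide). */
static uint32_t fastdiv(uint32_t n, fastdiv_t f) {
    const uint64_t hi = ((uint64_t)n * f.mp) >> 32; /* mulhi(n, mp)             */
    return (uint32_t)((hi + n) >> f.L);             /* 64-bit sum: no overflow  */
}

/* n % d derived from the quotient. */
static uint32_t fastmod(uint32_t n, fastdiv_t f) {
    return n - fastdiv(n, f) * f.d;
}
```

In a shader the 64-bit product would typically come from a multiply-extended instruction. The `hi + n` sum then needs either a wider type or a restricted input range. A plausible use here is the manual repeat (broadcast) logic, where an index is wrapped into a smaller dimension with a modulo.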

jeffbolznv · 2025-07-09T15:52:40Z

I did some optimizations for set_rows. It's maybe 1% slower than the default, but pretty close now.

0cc4m

LGTM

ggerganov · 2025-07-12T11:17:32Z

@jeffbolznv @0cc4m Do you know why some of the F16 tests slightly exceed the NMSE limit on my RTX 2060:

cmake .. -DGGML_VULKAN=ON
-- The C compiler identification is GNU 13.3.0
-- The CXX compiler identification is GNU 13.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.43.0") 
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- GGML_SYSTEM_ARCH: x86
-- Including CPU backend
-- Found OpenMP_C: -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5")  
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native 
-- Found Vulkan: /usr/lib/x86_64-linux-gnu/libvulkan.so (found version "1.4.313") found components: glslc glslangValidator 
-- Vulkan found
-- GL_KHR_cooperative_matrix supported by glslc
-- GL_NV_cooperative_matrix2 supported by glslc
-- GL_EXT_integer_dot_product supported by glslc
-- GL_EXT_bfloat16 supported by glslc
-- Including Vulkan backend
-- ggml version: 0.0.5876
-- ggml commit:  0c1df14b5
-- Found CURL: /usr/lib/x86_64-linux-gnu/libcurl.so (found version "8.5.0")  
-- Configuring done (1.2s)
-- Generating done (0.1s)
-- Build files have been written to: /home/ggerganov/development/github/llama.cpp/build-vulkan-new

make -j && ./bin/test-backend-ops

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 2060 SUPER (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
Testing 2 devices
Backend 1/2: Vulkan0
  Device description: NVIDIA GeForce RTX 2060 SUPER
  Device memory: 8192 MB (8192 MB free)

[REGLU] NMSE = 0.000000313 > 0.000000100   REGLU(type=f16,ne_a=[128,2,2,2],v=0,swapped=0): FAIL
[REGLU] NMSE = 0.000000257 > 0.000000100   REGLU(type=f16,ne_a=[5,7,11,13],v=0,swapped=0): FAIL
[REGLU] NMSE = 0.000000281 > 0.000000100   REGLU(type=f16,ne_a=[128,2,2,2],v=0,swapped=1): FAIL
[REGLU] NMSE = 0.000000233 > 0.000000100   REGLU(type=f16,ne_a=[5,7,11,13],v=0,swapped=1): FAIL
[REGLU] NMSE = 0.000000284 > 0.000000100   REGLU(type=f16,ne_a=[128,2,2,2],v=0,split): FAIL
[REGLU] NMSE = 0.000000264 > 0.000000100   REGLU(type=f16,ne_a=[5,7,11,13],v=0,split): FAIL
[GEGLU] NMSE = 0.000000235 > 0.000000100   GEGLU(type=f16,ne_a=[128,2,2,2],v=0,swapped=0): FAIL
[GEGLU] NMSE = 0.000000257 > 0.000000100   GEGLU(type=f16,ne_a=[5,7,11,13],v=0,swapped=0): FAIL
[GEGLU] NMSE = 0.000000204 > 0.000000100   GEGLU(type=f16,ne_a=[128,2,2,2],v=0,swapped=1): FAIL
[GEGLU] NMSE = 0.000000241 > 0.000000100   GEGLU(type=f16,ne_a=[5,7,11,13],v=0,swapped=1): FAIL
[GEGLU] NMSE = 0.000000256 > 0.000000100   GEGLU(type=f16,ne_a=[128,2,2,2],v=0,split): FAIL
[GEGLU] NMSE = 0.000000245 > 0.000000100   GEGLU(type=f16,ne_a=[5,7,11,13],v=0,split): FAIL
[SWIGLU] NMSE = 0.000000234 > 0.000000100   SWIGLU(type=f16,ne_a=[128,2,2,2],v=0,swapped=0): FAIL
[SWIGLU] NMSE = 0.000000285 > 0.000000100   SWIGLU(type=f16,ne_a=[5,7,11,13],v=0,swapped=0): FAIL
[SWIGLU] NMSE = 0.000000259 > 0.000000100   SWIGLU(type=f16,ne_a=[128,2,2,2],v=0,swapped=1): FAIL
[SWIGLU] NMSE = 0.000000276 > 0.000000100   SWIGLU(type=f16,ne_a=[5,7,11,13],v=0,swapped=1): FAIL
[SWIGLU] NMSE = 0.000000264 > 0.000000100   SWIGLU(type=f16,ne_a=[128,2,2,2],v=0,split): FAIL
[SWIGLU] NMSE = 0.000000267 > 0.000000100   SWIGLU(type=f16,ne_a=[5,7,11,13],v=0,split): FAIL
[GEGLU_ERF] NMSE = 0.000000381 > 0.000000100   GEGLU_ERF(type=f16,ne_a=[128,2,2,2],v=0,swapped=0): FAIL
[GEGLU_ERF] NMSE = 0.000000257 > 0.000000100   GEGLU_ERF(type=f16,ne_a=[5,7,11,13],v=0,swapped=0): FAIL
[GEGLU_ERF] NMSE = 0.000000232 > 0.000000100   GEGLU_ERF(type=f16,ne_a=[128,2,2,2],v=0,swapped=1): FAIL
[GEGLU_ERF] NMSE = 0.000000279 > 0.000000100   GEGLU_ERF(type=f16,ne_a=[5,7,11,13],v=0,swapped=1): FAIL
[GEGLU_ERF] NMSE = 0.000000203 > 0.000000100   GEGLU_ERF(type=f16,ne_a=[128,2,2,2],v=0,split): FAIL
[GEGLU_ERF] NMSE = 0.000000261 > 0.000000100   GEGLU_ERF(type=f16,ne_a=[5,7,11,13],v=0,split): FAIL
[GEGLU_QUICK] NMSE = 0.000000287 > 0.000000100   GEGLU_QUICK(type=f16,ne_a=[128,2,2,2],v=0,swapped=0): FAIL
[GEGLU_QUICK] NMSE = 0.000000253 > 0.000000100   GEGLU_QUICK(type=f16,ne_a=[5,7,11,13],v=0,swapped=0): FAIL
[GEGLU_QUICK] NMSE = 0.000000224 > 0.000000100   GEGLU_QUICK(type=f16,ne_a=[128,2,2,2],v=0,swapped=1): FAIL
[GEGLU_QUICK] NMSE = 0.000000246 > 0.000000100   GEGLU_QUICK(type=f16,ne_a=[5,7,11,13],v=0,swapped=1): FAIL
[GEGLU_QUICK] NMSE = 0.000000308 > 0.000000100   GEGLU_QUICK(type=f16,ne_a=[128,2,2,2],v=0,split): FAIL
[GEGLU_QUICK] NMSE = 0.000000255 > 0.000000100   GEGLU_QUICK(type=f16,ne_a=[5,7,11,13],v=0,split): FAIL
[MUL] NMSE = 0.000000105 > 0.000000100   MUL(type=f16,ne=[1,1,8,1],nr=[1,1,1,1]): FAIL
[DIV] NMSE = 0.000000127 > 0.000000100   DIV(type=f16,ne=[1,1,8,1],nr=[1,1,1,1]): FAIL
[ADD] NMSE = 0.000000298 > 0.000000100   ADD(type=f16,ne=[1,1,1,1],nr=[32,1,1,1]): FAIL
[SUB] NMSE = 0.000000439 > 0.000000100   SUB(type=f16,ne=[1,1,1,1],nr=[32,1,1,1]): FAIL
[DIV] NMSE = 0.000000374 > 0.000000100   DIV(type=f16,ne=[1,1,1,1],nr=[32,1,1,1]): FAIL
[ADD] NMSE = 0.000000157 > 0.000000100   ADD(type=f16,ne=[1,1,320,320],nr=[1,1,1,1]): FAIL
[SUB] NMSE = 0.000000162 > 0.000000100   SUB(type=f16,ne=[1,1,320,320],nr=[1,1,1,1]): FAIL
[MUL] NMSE = 0.000000275 > 0.000000100   MUL(type=f16,ne=[1,1,320,320],nr=[1,1,1,1]): FAIL
[DIV] NMSE = 0.000000307 > 0.000000100   DIV(type=f16,ne=[1,1,320,320],nr=[1,1,1,1]): FAIL
[ADD] NMSE = 0.000000269 > 0.000000100   ADD(type=f16,ne=[10,5,1,1],nr=[1,1,1,1]): FAIL
[SUB] NMSE = 0.000000181 > 0.000000100   SUB(type=f16,ne=[10,5,1,1],nr=[1,1,1,1]): FAIL
[MUL] NMSE = 0.000000236 > 0.000000100   MUL(type=f16,ne=[10,5,1,1],nr=[1,1,1,1]): FAIL
[DIV] NMSE = 0.000000373 > 0.000000100   DIV(type=f16,ne=[10,5,1,1],nr=[1,1,1,1]): FAIL
[ADD] NMSE = 0.000000157 > 0.000000100   ADD(type=f16,ne=[10,5,4,1],nr=[1,1,1,1]): FAIL
[SUB] NMSE = 0.000000157 > 0.000000100   SUB(type=f16,ne=[10,5,4,1],nr=[1,1,1,1]): FAIL
[MUL] NMSE = 0.000000319 > 0.000000100   MUL(type=f16,ne=[10,5,4,1],nr=[1,1,1,1]): FAIL
[DIV] NMSE = 0.000000340 > 0.000000100   DIV(type=f16,ne=[10,5,4,1],nr=[1,1,1,1]): FAIL
[ADD] NMSE = 0.000000134 > 0.000000100   ADD(type=f16,ne=[10,5,4,3],nr=[1,1,1,1]): FAIL
[SUB] NMSE = 0.000000159 > 0.000000100   SUB(type=f16,ne=[10,5,4,3],nr=[1,1,1,1]): FAIL
[MUL] NMSE = 0.000000265 > 0.000000100   MUL(type=f16,ne=[10,5,4,3],nr=[1,1,1,1]): FAIL
[DIV] NMSE = 0.000000299 > 0.000000100   DIV(type=f16,ne=[10,5,4,3],nr=[1,1,1,1]): FAIL
[ADD] NMSE = 0.000000139 > 0.000000100   ADD(type=f16,ne=[10,5,4,3],nr=[2,1,1,1]): FAIL
[SUB] NMSE = 0.000000166 > 0.000000100   SUB(type=f16,ne=[10,5,4,3],nr=[2,1,1,1]): FAIL
[MUL] NMSE = 0.000000268 > 0.000000100   MUL(type=f16,ne=[10,5,4,3],nr=[2,1,1,1]): FAIL
[DIV] NMSE = 0.000000323 > 0.000000100   DIV(type=f16,ne=[10,5,4,3],nr=[2,1,1,1]): FAIL
[ADD] NMSE = 0.000000161 > 0.000000100   ADD(type=f16,ne=[10,5,4,3],nr=[1,2,1,1]): FAIL
[SUB] NMSE = 0.000000162 > 0.000000100   SUB(type=f16,ne=[10,5,4,3],nr=[1,2,1,1]): FAIL
[MUL] NMSE = 0.000000266 > 0.000000100   MUL(type=f16,ne=[10,5,4,3],nr=[1,2,1,1]): FAIL
[DIV] NMSE = 0.000000291 > 0.000000100   DIV(type=f16,ne=[10,5,4,3],nr=[1,2,1,1]): FAIL
[ADD] NMSE = 0.000000129 > 0.000000100   ADD(type=f16,ne=[10,5,4,3],nr=[1,1,2,1]): FAIL
[SUB] NMSE = 0.000000170 > 0.000000100   SUB(type=f16,ne=[10,5,4,3],nr=[1,1,2,1]): FAIL
[MUL] NMSE = 0.000000268 > 0.000000100   MUL(type=f16,ne=[10,5,4,3],nr=[1,1,2,1]): FAIL
[DIV] NMSE = 0.000000326 > 0.000000100   DIV(type=f16,ne=[10,5,4,3],nr=[1,1,2,1]): FAIL
[ADD] NMSE = 0.000000157 > 0.000000100   ADD(type=f16,ne=[10,5,4,3],nr=[1,1,1,2]): FAIL
[SUB] NMSE = 0.000000148 > 0.000000100   SUB(type=f16,ne=[10,5,4,3],nr=[1,1,1,2]): FAIL
[MUL] NMSE = 0.000000275 > 0.000000100   MUL(type=f16,ne=[10,5,4,3],nr=[1,1,1,2]): FAIL
[DIV] NMSE = 0.000000292 > 0.000000100   DIV(type=f16,ne=[10,5,4,3],nr=[1,1,1,2]): FAIL
[ADD] NMSE = 0.000000163 > 0.000000100   ADD(type=f16,ne=[10,5,4,3],nr=[1,1,2,2]): FAIL
[SUB] NMSE = 0.000000165 > 0.000000100   SUB(type=f16,ne=[10,5,4,3],nr=[1,1,2,2]): FAIL
[MUL] NMSE = 0.000000282 > 0.000000100   MUL(type=f16,ne=[10,5,4,3],nr=[1,1,2,2]): FAIL
[DIV] NMSE = 0.000000320 > 0.000000100   DIV(type=f16,ne=[10,5,4,3],nr=[1,1,2,2]): FAIL
[ADD] NMSE = 0.000000163 > 0.000000100   ADD(type=f16,ne=[10,5,4,3],nr=[1,2,2,2]): FAIL
[SUB] NMSE = 0.000000169 > 0.000000100   SUB(type=f16,ne=[10,5,4,3],nr=[1,2,2,2]): FAIL
[MUL] NMSE = 0.000000277 > 0.000000100   MUL(type=f16,ne=[10,5,4,3],nr=[1,2,2,2]): FAIL
[DIV] NMSE = 0.000000305 > 0.000000100   DIV(type=f16,ne=[10,5,4,3],nr=[1,2,2,2]): FAIL
[ADD] NMSE = 0.000000159 > 0.000000100   ADD(type=f16,ne=[10,5,4,3],nr=[2,2,2,2]): FAIL
[SUB] NMSE = 0.000000169 > 0.000000100   SUB(type=f16,ne=[10,5,4,3],nr=[2,2,2,2]): FAIL
[MUL] NMSE = 0.000000282 > 0.000000100   MUL(type=f16,ne=[10,5,4,3],nr=[2,2,2,2]): FAIL
[DIV] NMSE = 0.000000307 > 0.000000100   DIV(type=f16,ne=[10,5,4,3],nr=[2,2,2,2]): FAIL
[ADD] NMSE = 0.000000159 > 0.000000100   ADD(type=f16,ne=[1280,1,1,1],nr=[1,1,1,1]): FAIL
[SUB] NMSE = 0.000000174 > 0.000000100   SUB(type=f16,ne=[1280,1,1,1],nr=[1,1,1,1]): FAIL
[MUL] NMSE = 0.000000262 > 0.000000100   MUL(type=f16,ne=[1280,1,1,1],nr=[1,1,1,1]): FAIL
[DIV] NMSE = 0.000000312 > 0.000000100   DIV(type=f16,ne=[1280,1,1,1],nr=[1,1,1,1]): FAIL
[ADD] NMSE = 0.000000161 > 0.000000100   ADD(type=f16,ne=[1280,1,1,1],nr=[1,16,16,1]): FAIL
[SUB] NMSE = 0.000000160 > 0.000000100   SUB(type=f16,ne=[1280,1,1,1],nr=[1,16,16,1]): FAIL
[MUL] NMSE = 0.000000272 > 0.000000100   MUL(type=f16,ne=[1280,1,1,1],nr=[1,16,16,1]): FAIL
[DIV] NMSE = 0.000000308 > 0.000000100   DIV(type=f16,ne=[1280,1,1,1],nr=[1,16,16,1]): FAIL
[ADD] NMSE = 0.000000159 > 0.000000100   ADD(type=f16,ne=[1280,16,16,1],nr=[1,1,1,1]): FAIL
[SUB] NMSE = 0.000000160 > 0.000000100   SUB(type=f16,ne=[1280,16,16,1],nr=[1,1,1,1]): FAIL
[MUL] NMSE = 0.000000273 > 0.000000100   MUL(type=f16,ne=[1280,16,16,1],nr=[1,1,1,1]): FAIL
[DIV] NMSE = 0.000000308 > 0.000000100   DIV(type=f16,ne=[1280,16,16,1],nr=[1,1,1,1]): FAIL
[ADD] NMSE = 0.000000159 > 0.000000100   ADD(type=f16,ne=[1280,1,1,1],nr=[1,256,1,1]): FAIL
[SUB] NMSE = 0.000000160 > 0.000000100   SUB(type=f16,ne=[1280,1,1,1],nr=[1,256,1,1]): FAIL
[MUL] NMSE = 0.000000277 > 0.000000100   MUL(type=f16,ne=[1280,1,1,1],nr=[1,256,1,1]): FAIL
[DIV] NMSE = 0.000000308 > 0.000000100   DIV(type=f16,ne=[1280,1,1,1],nr=[1,256,1,1]): FAIL
[ADD] NMSE = 0.000000161 > 0.000000100   ADD(type=f16,ne=[1,1,1280,1],nr=[16,16,1,1]): FAIL
[SUB] NMSE = 0.000000159 > 0.000000100   SUB(type=f16,ne=[1,1,1280,1],nr=[16,16,1,1]): FAIL
[MUL] NMSE = 0.000000272 > 0.000000100   MUL(type=f16,ne=[1,1,1280,1],nr=[16,16,1,1]): FAIL
[DIV] NMSE = 0.000000306 > 0.000000100   DIV(type=f16,ne=[1,1,1280,1],nr=[16,16,1,1]): FAIL
[ADD] NMSE = 0.000000161 > 0.000000100   ADD(type=f16,ne=[16,16,1280,1],nr=[1,1,1,1]): FAIL
[SUB] NMSE = 0.000000160 > 0.000000100   SUB(type=f16,ne=[16,16,1280,1],nr=[1,1,1,1]): FAIL
[MUL] NMSE = 0.000000273 > 0.000000100   MUL(type=f16,ne=[16,16,1280,1],nr=[1,1,1,1]): FAIL
[DIV] NMSE = 0.000000306 > 0.000000100   DIV(type=f16,ne=[16,16,1280,1],nr=[1,1,1,1]): FAIL
[ADD] NMSE = 0.000000159 > 0.000000100   ADD(type=f16,ne=[1,1,1920,1],nr=[16,16,1,1]): FAIL
[SUB] NMSE = 0.000000161 > 0.000000100   SUB(type=f16,ne=[1,1,1920,1],nr=[16,16,1,1]): FAIL
[MUL] NMSE = 0.000000275 > 0.000000100   MUL(type=f16,ne=[1,1,1920,1],nr=[16,16,1,1]): FAIL
[DIV] NMSE = 0.000000305 > 0.000000100   DIV(type=f16,ne=[1,1,1920,1],nr=[16,16,1,1]): FAIL
[ADD] NMSE = 0.000000158 > 0.000000100   ADD(type=f16,ne=[1,1,2560,1],nr=[16,16,1,1]): FAIL
[SUB] NMSE = 0.000000160 > 0.000000100   SUB(type=f16,ne=[1,1,2560,1],nr=[16,16,1,1]): FAIL
[MUL] NMSE = 0.000000273 > 0.000000100   MUL(type=f16,ne=[1,1,2560,1],nr=[16,16,1,1]): FAIL
[DIV] NMSE = 0.000000308 > 0.000000100   DIV(type=f16,ne=[1,1,2560,1],nr=[16,16,1,1]): FAIL
[ADD] NMSE = 0.000000159 > 0.000000100   ADD(type=f16,ne=[1,1,1280,1],nr=[32,32,1,1]): FAIL
[SUB] NMSE = 0.000000158 > 0.000000100   SUB(type=f16,ne=[1,1,1280,1],nr=[32,32,1,1]): FAIL
[MUL] NMSE = 0.000000273 > 0.000000100   MUL(type=f16,ne=[1,1,1280,1],nr=[32,32,1,1]): FAIL
[DIV] NMSE = 0.000000307 > 0.000000100   DIV(type=f16,ne=[1,1,1280,1],nr=[32,32,1,1]): FAIL
[ADD] NMSE = 0.000000158 > 0.000000100   ADD(type=f16,ne=[1,1,1920,1],nr=[32,32,1,1]): FAIL
[SUB] NMSE = 0.000000160 > 0.000000100   SUB(type=f16,ne=[1,1,1920,1],nr=[32,32,1,1]): FAIL
[MUL] NMSE = 0.000000274 > 0.000000100   MUL(type=f16,ne=[1,1,1920,1],nr=[32,32,1,1]): FAIL
[DIV] NMSE = 0.000000306 > 0.000000100   DIV(type=f16,ne=[1,1,1920,1],nr=[32,32,1,1]): FAIL
[ADD] NMSE = 0.000000159 > 0.000000100   ADD(type=f16,ne=[1,1,640,1],nr=[32,32,1,1]): FAIL
[SUB] NMSE = 0.000000164 > 0.000000100   SUB(type=f16,ne=[1,1,640,1],nr=[32,32,1,1]): FAIL
[MUL] NMSE = 0.000000272 > 0.000000100   MUL(type=f16,ne=[1,1,640,1],nr=[32,32,1,1]): FAIL
[DIV] NMSE = 0.000000304 > 0.000000100   DIV(type=f16,ne=[1,1,640,1],nr=[32,32,1,1]): FAIL
[ADD] NMSE = 0.000000159 > 0.000000100   ADD(type=f16,ne=[5120,1,1,1],nr=[1,256,1,1]): FAIL
[SUB] NMSE = 0.000000159 > 0.000000100   SUB(type=f16,ne=[5120,1,1,1],nr=[1,256,1,1]): FAIL
[MUL] NMSE = 0.000000274 > 0.000000100   MUL(type=f16,ne=[5120,1,1,1],nr=[1,256,1,1]): FAIL
[DIV] NMSE = 0.000000307 > 0.000000100   DIV(type=f16,ne=[5120,1,1,1],nr=[1,256,1,1]): FAIL
[ADD] NMSE = 0.000000163 > 0.000000100   ADD(type=f16,ne=[640,1,1,1],nr=[1,1,1,1]): FAIL
[SUB] NMSE = 0.000000134 > 0.000000100   SUB(type=f16,ne=[640,1,1,1],nr=[1,1,1,1]): FAIL
[MUL] NMSE = 0.000000285 > 0.000000100   MUL(type=f16,ne=[640,1,1,1],nr=[1,1,1,1]): FAIL
[DIV] NMSE = 0.000000284 > 0.000000100   DIV(type=f16,ne=[640,1,1,1],nr=[1,1,1,1]): FAIL
  Backend Vulkan0: FAIL
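For reading the log: each line compares the backend's result against a reference and fails once the normalized mean squared error exceeds the threshold, which is 0.0000001 (1e-7) for these ops. The C sketch below assumes the usual definition of NMSE, the squared error summed and divided by the squared reference summed. The function names are hypothetical and not taken from test-backend-ops.

```c
#include <assert.h>
#include <stddef.h>

/* Normalized mean squared error of `out` against the reference `ref`:
 *   NMSE = sum_i (ref[i] - out[i])^2 / sum_i ref[i]^2
 * Accumulated in double so the metric itself adds little rounding error. */
static double nmse(const float *ref, const float *out, size_t n) {
    double err = 0.0, norm = 0.0;
    for (size_t i = 0; i < n; i++) {
        const double diff = (double)ref[i] - (double)out[i];
        err  += diff * diff;
        norm += (double)ref[i] * (double)ref[i];
    }
    assert(norm > 0.0); /* undefined for an all-zero reference */
    return err / norm;
}

/* Pass when NMSE <= max_err; the log reports FAIL for NMSE > 1e-7. */
static int nmse_ok(const float *ref, const float *out, size_t n, double max_err) {
    return nmse(ref, out, n) <= max_err;
}
```

The metric is normalized, so a systematic error of a fraction of an fp16 ulp on every element produces a roughly constant NMSE regardless of tensor shape. That pattern fits the log, where values cluster around 1.6e-7 for ADD/SUB and 2.7e-7 to 3e-7 for MUL/DIV.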

jeffbolznv · 2025-07-12T16:04:21Z

> Do you know why some of the F16 tests slightly exceed the NMSE limit on my RTX 2060:

The last time we saw something like this, the cause was that the device didn't round to nearest even by default. We have shader variants that force round-to-nearest-even (RTNE), and this device/driver should support it. I'll see if I can reproduce it.
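A missing RTNE mode shows up as a small, systematic NMSE increase rather than a hard failure. The C sketch below illustrates this by converting an f32 result to f16 in two ways. One is RTNE, the IEEE default that the CPU reference presumably follows. The other truncates (rounds toward zero), as a device may effectively do when the shader does not request RTE rounding for fp16. On the Vulkan side, support for the latter is reported through the float-controls properties (e.g. `shaderRoundingModeRTEFloat16`). This is a simplified model covering only the normal range, with hypothetical function names, and it is not the shader code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Convert a float in the normal fp16 range to fp16 bits.
 * rtne != 0: round to nearest, ties to even.
 * rtne == 0: truncate (round toward zero) by dropping 13 mantissa bits. */
static uint16_t f32_to_f16(float f, int rtne) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    const uint32_t sign = (x >> 16) & 0x8000u;
    const int32_t  e16  = (int32_t)((x >> 23) & 0xFFu) - 127 + 15;
    const uint32_t mant = x & 0x7FFFFFu;
    assert(e16 >= 1 && e16 <= 30); /* sketch: normal fp16 range only */

    uint32_t h = ((uint32_t)e16 << 10) | (mant >> 13); /* truncated result */
    const uint32_t rem = mant & 0x1FFFu;               /* 13 dropped bits  */
    if (rtne && (rem > 0x1000u || (rem == 0x1000u && (h & 1u)))) {
        h += 1; /* a mantissa carry correctly bumps the exponent */
    }
    return (uint16_t)(sign | h);
}

/* Decode a normal fp16 value back to float, to measure the error. */
static float f16_to_f32(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    const uint32_t e    = (h >> 10) & 0x1Fu;
    const uint32_t m    = h & 0x3FFu;
    const uint32_t x    = sign | ((e - 15 + 127) << 23) | (m << 13);
    float f;
    memcpy(&f, &x, sizeof f);
    return f;
}
```

Per element, the truncation error is never smaller than the RTNE error, and it always pulls magnitudes toward zero. Its squared error therefore accumulates faster than the unbiased RTNE error.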

jeffbolznv · 2025-07-12T16:13:55Z

It looks like a different set of shaders was affected last time, so we don't force RTNE in these. I'll add variants that do.

jeffbolznv requested a review from 0cc4m July 9, 2025 02:58

github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Jul 9, 2025

vulkan: support SET_ROWS

0cf756e

jeffbolznv force-pushed the set_rows branch from 56c5e8a to 0cf756e July 9, 2025 03:07

vulkan: optimize set_rows

b4cccd9

0cc4m approved these changes Jul 12, 2025

0cc4m merged commit b3ad3a0 into ggml-org:master Jul 12, 2025
48 checks passed

jeffbolznv mentioned this pull request Jul 12, 2025

vulkan: add RTE variants for glu/add/sub/mul/div #14653

Merged
