vulkan: support SET_ROWS #14587
Add variants of the copy_to_quant shader that do the SET_ROWS operation. Change these shaders to spread the work across the workgroup. The memory access pattern is probably not great (one thread per quant block), but should be fine for now.
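For readers unfamiliar with the operation, here is a minimal CPU sketch of what SET_ROWS does as I understand it: row i of the source is written to the destination row named by the i-th index. The function name `set_rows_ref` and the flat row-major float layout are assumptions for illustration; the shaders in this PR also quantize on write, which is left out here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference, not the Vulkan shader:
// copy row i of src into row idx[i] of dst.
// Both buffers are row-major with n_cols floats per row.
void set_rows_ref(std::vector<float> &dst, const std::vector<float> &src,
                  const std::vector<int64_t> &idx, size_t n_cols) {
    for (size_t i = 0; i < idx.size(); ++i) {
        const size_t d = static_cast<size_t>(idx[i]);  // destination row
        for (size_t c = 0; c < n_cols; ++c) {
            dst[d * n_cols + c] = src[i * n_cols + c];
        }
    }
}
```

On the GPU the outer loop is what gets spread across the workgroup; for quantized destinations the unit of work would be a quant block rather than a single float.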
Prompt processing performance is a few percent slower with LLAMA_SET_ROWS=1. I'll look into it soon, but I don't think it needs to block merging this.
Larger workgroups for non-quant types. Set "norepeat" (there is manual repeat logic). Use fastmod.
I did some optimizations for set_rows. It's maybe 1% slower than the default, but pretty close now.
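As background on the "fastmod" mentioned above: the general idea is to replace an integer division by a runtime-constant divisor with a multiply and a shift, using values precomputed on the host. The sketch below is one well-known way to do that (a Granlund-Montgomery style magic multiplier); the names `FastDiv`, `make_fastdiv`, `fastdiv` and `fastmod` are hypothetical, and this is not necessarily how the shaders in this PR implement it. It assumes a nonzero divisor.

```cpp
#include <cstdint>

// Precomputed constants for dividing by d (d must be nonzero).
struct FastDiv {
    uint32_t mp;  // magic multiplier
    uint32_t L;   // shift amount, ceil(log2(d))
    uint32_t d;   // the divisor itself, kept for the modulo step
};

FastDiv make_fastdiv(uint32_t d) {
    uint32_t L = 0;
    while (L < 32 && (uint64_t{1} << L) < d) ++L;
    const uint64_t num = (uint64_t{1} << 32) * ((uint64_t{1} << L) - d);
    return {static_cast<uint32_t>(num / d + 1), L, d};
}

// n / d without a hardware divide: high half of n * mp, plus n, shifted by L.
// The sum is done in 64 bits so it cannot overflow.
uint32_t fastdiv(uint32_t n, const FastDiv &fd) {
    const uint64_t hi = (uint64_t{n} * fd.mp) >> 32;
    return static_cast<uint32_t>((hi + n) >> fd.L);
}

// n % d derived from the fast quotient.
uint32_t fastmod(uint32_t n, const FastDiv &fd) {
    return n - fastdiv(n, fd) * fd.d;
}
```

The constants are computed once per dispatch, so each thread pays only a multiply, an add and a shift when turning a flat index into per-dimension coordinates.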
LGTM
@jeffbolznv @0cc4m Do you know why some of the F16 tests slightly exceed the NMSE limit on my RTX 2060:
cmake .. -DGGML_VULKAN=ON
-- The C compiler identification is GNU 13.3.0
-- The CXX compiler identification is GNU 13.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.43.0")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- GGML_SYSTEM_ARCH: x86
-- Including CPU backend
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native
-- Found Vulkan: /usr/lib/x86_64-linux-gnu/libvulkan.so (found version "1.4.313") found components: glslc glslangValidator
-- Vulkan found
-- GL_KHR_cooperative_matrix supported by glslc
-- GL_NV_cooperative_matrix2 supported by glslc
-- GL_EXT_integer_dot_product supported by glslc
-- GL_EXT_bfloat16 supported by glslc
-- Including Vulkan backend
-- ggml version: 0.0.5876
-- ggml commit: 0c1df14b5
-- Found CURL: /usr/lib/x86_64-linux-gnu/libcurl.so (found version "8.5.0")
-- Configuring done (1.2s)
-- Generating done (0.1s)
-- Build files have been written to: /home/ggerganov/development/github/llama.cpp/build-vulkan-new
make -j && ./bin/test-backend-ops
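For context on the NMSE limit mentioned above: a normalized mean squared error compares a backend's output against a reference, scaled by the energy of the reference. The sketch below uses a common definition, sum((ref - out)^2) / sum(ref^2); the function name `nmse` here is illustrative, and the exact formula and threshold used by test-backend-ops are not shown in this thread, so treat both as assumptions.

```cpp
#include <cstddef>
#include <vector>

// Normalized mean squared error of `out` against `ref`.
// Assumes both vectors have the same length and ref is not all zeros.
double nmse(const std::vector<float> &ref, const std::vector<float> &out) {
    double err = 0.0;   // accumulated squared difference
    double norm = 0.0;  // accumulated squared reference
    for (size_t i = 0; i < ref.size(); ++i) {
        const double r = ref[i];
        const double d = r - static_cast<double>(out[i]);
        err += d * d;
        norm += r * r;
    }
    return err / norm;
}
```

Because the error is relative, a systematic bias of less than one F16 ulp per element can be enough to push a result just over a tight limit.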
The last time we saw something like this, the cause was that the device didn't round to nearest even by default. But we have shader variants that force that, and this device/driver should support it. I'll see if I can reproduce it.
Looks like it affected different shaders last time, so we don't force RTNE in these. I'll add variants that do.
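To illustrate why the rounding mode matters for F16 results, here is a small CPU sketch that converts a float to half-precision bits either with round-to-nearest-even (RTNE) or by simply dropping the low mantissa bits. The function name `f32_to_f16_bits` is hypothetical, truncation is used only as an illustrative alternative (the thread does not say which mode the device actually used), and the code assumes a finite input whose magnitude lies in the normal F16 range.

```cpp
#include <cstdint>
#include <cstring>

// Convert a float to IEEE half-precision bits.
// Assumption: f is finite and within the normal F16 range (about 6.1e-5 to 65504);
// zeros, subnormals, infinities and NaNs are not handled in this sketch.
uint16_t f32_to_f16_bits(float f, bool rtne) {
    uint32_t x;
    std::memcpy(&x, &f, sizeof x);
    const uint32_t sign = (x >> 16) & 0x8000u;
    const uint32_t exp  = (x >> 23) & 0xFFu;
    const uint32_t mant = x & 0x7FFFFFu;
    // Rebias the exponent (127 -> 15) and keep the top 10 mantissa bits.
    uint32_t h = ((exp - 112u) << 10) | (mant >> 13);
    if (rtne) {
        const uint32_t rem = mant & 0x1FFFu;  // the 13 bits being dropped
        // Round up when above the halfway point, or exactly halfway and odd.
        // A mantissa carry correctly increments the exponent field.
        if (rem > 0x1000u || (rem == 0x1000u && (h & 1u))) {
            ++h;
        }
    }
    return static_cast<uint16_t>(sign | h);
}
```

Truncation always errs toward zero, so its errors accumulate in one direction, while RTNE errors tend to cancel. That kind of bias is one plausible way for a result to land slightly over an error threshold.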