SYCL: Implement a few same-quantized-type copy kernels #13739


Open: qnixsynapse wants to merge 6 commits into master from sycl/same_q_cpy

Conversation

qnixsynapse (Collaborator)

This is for supporting KV cache defragmentation when a quantized KV cache is used. test-backend-ops seems to pass with this change.
Needs further testing before we can merge.

qnixsynapse marked this pull request as a draft on May 24, 2025 06:28
github-actions bot added the ggml and SYCL labels on May 24, 2025
qnixsynapse marked this pull request as ready for review on May 25, 2025 07:13
Rbiessy (Collaborator) left a comment


This may be a good opportunity to revisit the copy kernels for the case where the source and destination are of the same type and don't require any casting. These should all use a single function that copies a number of bytes without depending on the type itself. It would reduce the number of kernels and simplify the code. This can be done with sycl::queue::memcpy. What do you think?
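A minimal sketch of the kind of type-agnostic byte copy being suggested here; the wrapper name and the blocking wait are illustrative assumptions, not code from this PR:

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>

// Sketch: one byte-wise copy serves every src/dst pair of identical type,
// replacing the per-type copy kernels. sycl::queue::memcpy returns an event;
// waiting synchronously here keeps the example simple.
static void cpy_same_type(sycl::queue & q, const void * src, void * dst,
                          std::size_t nbytes) {
    q.memcpy(dst, src, nbytes).wait();
}
```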

qnixsynapse (Collaborator, Author)

@Rbiessy I think memcpy only works if the src and dst tensors are not permuted. Please feel free to correct me if I am wrong.

Rbiessy (Collaborator) commented May 26, 2025

You're right, that wouldn't work if there are permutations. I hadn't thought much about it: what would a permuted tensor with a quantized type look like? Is it by design only permuting the blocks and never the values inside a block?

I suppose memcpy could be introduced in a separate PR if we find these cpy ops are slow and permutations are not used.

qnixsynapse (Collaborator, Author)

Yeah, generally permuted tensors are non-contiguous. If I find time, I will see whether I can use memcpy to copy contiguous quantized tensors (I think it is possible). I am marking this PR as a draft for now.
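A hedged sketch of what such a contiguous fast path could look like, leaning on ggml's real ggml_is_contiguous and ggml_nbytes helpers; the wrapper function and its wiring are assumptions:

```cpp
#include <sycl/sycl.hpp>
#include "ggml.h"  // ggml_tensor, ggml_is_contiguous, ggml_nbytes

// Sketch only: take the memcpy path when both tensors are contiguous and of
// the same type; return false so the caller falls back to the copy kernels.
static bool try_contiguous_memcpy(sycl::queue & q,
                                  const ggml_tensor * src, ggml_tensor * dst) {
    if (src->type != dst->type ||
        !ggml_is_contiguous(src) || !ggml_is_contiguous(dst)) {
        return false;  // permuted / non-contiguous: needs a copy kernel
    }
    // Same type + contiguous layout: the raw bytes (quantized blocks
    // included) can be copied verbatim.
    q.memcpy(dst->data, src->data, ggml_nbytes(src));
    return true;
}
```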

qnixsynapse marked this pull request as a draft on May 27, 2025 15:41
qnixsynapse force-pushed the sycl/same_q_cpy branch 2 times, most recently from 3cdc64b to c8c2278 on May 31, 2025 07:08
qnixsynapse (Collaborator, Author)

Here is a comparison I ran.
Master:

CPY(type_src=f32,type_dst=f32,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                 6150 runs -   164.90 us/run -    32768 kB/run -  189.60 GB/s
CPY(type_src=f32,type_dst=f32,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2736 runs -   411.40 us/run -    98272 kB/run -  228.14 GB/s
CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                 8196 runs -   128.48 us/run -    24576 kB/run -  182.49 GB/s
CPY(type_src=f32,type_dst=f32,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                 7175 runs -   159.06 us/run -    32768 kB/run -  196.56 GB/s
CPY(type_src=f32,type_dst=f32,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2736 runs -   407.97 us/run -    98272 kB/run -  230.06 GB/s
CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                 8196 runs -   128.33 us/run -    24576 kB/run -  182.70 GB/s
CPY(type_src=f32,type_dst=f32,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                 7175 runs -   159.37 us/run -    32768 kB/run -  196.18 GB/s
CPY(type_src=f32,type_dst=f32,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2736 runs -   409.34 us/run -    98272 kB/run -  229.29 GB/s
CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                 8196 runs -   128.09 us/run -    24576 kB/run -  183.04 GB/s
CPY(type_src=f16,type_dst=f16,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                 8196 runs -   154.12 us/run -    16384 kB/run -  101.41 GB/s
CPY(type_src=f16,type_dst=f16,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2732 runs -   399.94 us/run -    49136 kB/run -  117.25 GB/s
CPY(type_src=f16,type_dst=f16,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                10924 runs -   119.95 us/run -    12288 kB/run -   97.71 GB/s
CPY(type_src=f16,type_dst=f16,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                 8196 runs -   153.45 us/run -    16384 kB/run -  101.85 GB/s
CPY(type_src=f16,type_dst=f16,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2732 runs -   402.82 us/run -    49136 kB/run -  116.41 GB/s
CPY(type_src=f16,type_dst=f16,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                10924 runs -   120.06 us/run -    12288 kB/run -   97.63 GB/s
CPY(type_src=f16,type_dst=f16,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                 8196 runs -   153.56 us/run -    16384 kB/run -  101.78 GB/s
CPY(type_src=f16,type_dst=f16,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2732 runs -   404.33 us/run -    49136 kB/run -  115.98 GB/s
CPY(type_src=f16,type_dst=f16,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                10924 runs -   120.60 us/run -    12288 kB/run -   97.19 GB/s

This PR:

CPY(type_src=f32,type_dst=f32,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                10250 runs -    97.75 us/run -    32768 kB/run -  319.85 GB/s
CPY(type_src=f32,type_dst=f32,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2736 runs -   411.31 us/run -    98272 kB/run -  228.19 GB/s
CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                 8196 runs -   128.75 us/run -    24576 kB/run -  182.10 GB/s
CPY(type_src=f32,type_dst=f32,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                11275 runs -    95.40 us/run -    32768 kB/run -  327.71 GB/s
CPY(type_src=f32,type_dst=f32,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2736 runs -   407.66 us/run -    98272 kB/run -  230.23 GB/s
CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                 8196 runs -   128.75 us/run -    24576 kB/run -  182.11 GB/s
CPY(type_src=f32,type_dst=f32,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                11275 runs -    95.41 us/run -    32768 kB/run -  327.71 GB/s
CPY(type_src=f32,type_dst=f32,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2736 runs -   407.15 us/run -    98272 kB/run -  230.52 GB/s
CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                 8196 runs -   128.20 us/run -    24576 kB/run -  182.89 GB/s
CPY(type_src=f16,type_dst=f16,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                18441 runs -    59.33 us/run -    16384 kB/run -  263.41 GB/s
CPY(type_src=f16,type_dst=f16,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2732 runs -   400.63 us/run -    49136 kB/run -  117.05 GB/s
CPY(type_src=f16,type_dst=f16,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                10924 runs -   118.94 us/run -    12288 kB/run -   98.54 GB/s
CPY(type_src=f16,type_dst=f16,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                18441 runs -    59.34 us/run -    16384 kB/run -  263.38 GB/s
CPY(type_src=f16,type_dst=f16,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2732 runs -   401.69 us/run -    49136 kB/run -  116.74 GB/s
CPY(type_src=f16,type_dst=f16,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                10924 runs -   118.38 us/run -    12288 kB/run -   99.01 GB/s
CPY(type_src=f16,type_dst=f16,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                18441 runs -    59.46 us/run -    16384 kB/run -  262.85 GB/s
CPY(type_src=f16,type_dst=f16,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2732 runs -   400.61 us/run -    49136 kB/run -  117.06 GB/s
CPY(type_src=f16,type_dst=f16,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                10924 runs -   118.58 us/run -    12288 kB/run -   98.84 GB/s
CPY(type_src=bf16,type_dst=bf16,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):              18441 runs -    59.46 us/run -    16384 kB/run -  262.83 GB/s

qnixsynapse marked this pull request as ready for review on June 3, 2025 01:38
qnixsynapse requested a review from Alcpz on June 3, 2025 12:24
Adds a memcpy path for contiguous tensors of the same type to optimize data transfer. Updates device support checks to recognize contiguous tensor operations, improving compatibility and performance.
The changes replace multiple redundant block copy functions (e.g., cpy_block_q8_0_q8_0, cpy_block_q5_0_q5_0) with a single templated function cpy_blck_q_q. This reduces code duplication by using a generic template that works for any block type, improving maintainability while preserving the same functionality. The template is instantiated with specific block types (e.g., block_q8_0) where needed.
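A sketch of the shape such a template might take; the exact signature in the PR may differ. For identical quantized types a block copy reduces to a struct copy (the qs[] payload plus the scale fields), so one template covers them all:

```cpp
// Illustrative sketch, not the PR's exact code: a single template replaces
// the per-type functions such as cpy_block_q8_0_q8_0 and cpy_block_q5_0_q5_0.
template <typename block_t>
static void cpy_blck_q_q(const char * cxi, char * cdsti) {
    const block_t * xi   = reinterpret_cast<const block_t *>(cxi);
    block_t       * dsti = reinterpret_cast<block_t *>(cdsti);
    *dsti = *xi;  // one struct copy moves the whole quantized block
}
// Instantiated per type where needed, e.g. cpy_blck_q_q<block_q8_0>.
```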
Use ceil_div to ensure full element coverage and update nd_range parameters to better align with SYCL block sizes, improving parallelism and device utilization in copy operations.
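A minimal sketch of that launch pattern, with illustrative names: the ceil_div helper and work-group constant shown here are stand-ins for the backend's own definitions, not the PR's code.

```cpp
#include <sycl/sycl.hpp>

constexpr int WG_SIZE = 32;  // stand-in for the backend's copy block size

static int ceil_div(int m, int n) { return (m + n - 1) / n; }

// Launch a 1D copy over `ne` elements: round the global range up to a
// multiple of the work-group size so every element is covered, then guard
// the tail inside the kernel.
static void launch_cpy(sycl::queue & q, const float * src, float * dst,
                       int ne) {
    const int num_groups = ceil_div(ne, WG_SIZE);
    q.parallel_for(
        sycl::nd_range<1>(sycl::range<1>(num_groups * WG_SIZE),
                          sycl::range<1>(WG_SIZE)),
        [=](sycl::nd_item<1> it) {
            const int i = static_cast<int>(it.get_global_linear_id());
            if (i >= ne) {
                return;  // ids padded by the round-up do nothing
            }
            dst[i] = src[i];
        });
}
```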