SYCL: Implement a few same-quantized-type copy kernels #13739


Open: qnixsynapse wants to merge 6 commits into master from sycl/same_q_cpy

Conversation

qnixsynapse (Collaborator)

This is for supporting KV cache defragmentation when a quantized KV cache is used. test-backend-ops seems to pass with this change.
Needs further testing before we can merge.

qnixsynapse marked this pull request as a draft on May 24, 2025 06:28
github-actions bot added the ggml and SYCL labels on May 24, 2025
qnixsynapse marked this pull request as ready for review on May 25, 2025 07:13
Rbiessy (Collaborator) left a comment


This may be a good opportunity to revisit the copy kernels for the case where the source and destination are of the same type and don't require any casting. These should all use a single function that copies a number of bytes without depending on the type itself. It would reduce the number of kernels and simplify the code. This can be done with sycl::queue::memcpy. What do you think?
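A minimal sketch of the kind of type-agnostic byte copy being suggested here; the wrapper name and the blocking wait are illustrative assumptions, not code from this PR:

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>

// Sketch: one byte-wise copy serves every src/dst pair of identical type,
// replacing the per-type copy kernels. sycl::queue::memcpy returns an event;
// waiting synchronously here keeps the example simple.
static void cpy_same_type(sycl::queue & q, const void * src, void * dst,
                          std::size_t nbytes) {
    q.memcpy(dst, src, nbytes).wait();
}
```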

qnixsynapse (Collaborator, Author)

@Rbiessy I think memcpy only works if the src and dst tensors are not permuted. Please feel free to correct me if I am wrong.

Rbiessy (Collaborator) commented May 26, 2025

You're right, that wouldn't work if there are permutations. I hadn't thought much about it: what would a permuted tensor with a quantized type look like? Is it by design only permuting the blocks and never the values inside a block?

I suppose memcpy could be introduced in a separate PR if we find these cpy ops are slow and permutations are not used.

qnixsynapse (Collaborator, Author)

Yeah, generally permuted tensors are non-contiguous. If I find time, I will see whether I can use memcpy to copy contiguous quantized tensors (I think it is possible). I am marking this PR as a draft for now.
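A hedged sketch of what such a contiguous fast path could look like, leaning on ggml's real ggml_is_contiguous and ggml_nbytes helpers; the wrapper function and its wiring are assumptions:

```cpp
#include <sycl/sycl.hpp>
#include "ggml.h"  // ggml_tensor, ggml_is_contiguous, ggml_nbytes

// Sketch only: take the memcpy path when both tensors are contiguous and of
// the same type; return false so the caller falls back to the copy kernels.
static bool try_contiguous_memcpy(sycl::queue & q,
                                  const ggml_tensor * src, ggml_tensor * dst) {
    if (src->type != dst->type ||
        !ggml_is_contiguous(src) || !ggml_is_contiguous(dst)) {
        return false;  // permuted / non-contiguous: needs a copy kernel
    }
    // Same type + contiguous layout: the raw bytes (quantized blocks
    // included) can be copied verbatim.
    q.memcpy(dst->data, src->data, ggml_nbytes(src));
    return true;
}
```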

qnixsynapse marked this pull request as a draft on May 27, 2025 15:41
qnixsynapse force-pushed the sycl/same_q_cpy branch 2 times, most recently from 3cdc64b to c8c2278 on May 31, 2025 07:08
qnixsynapse (Collaborator, Author)

Here is a comparison I ran.
Master:

CPY(type_src=f32,type_dst=f32,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                 6150 runs -   164.90 us/run -    32768 kB/run -  189.60 GB/s
CPY(type_src=f32,type_dst=f32,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2736 runs -   411.40 us/run -    98272 kB/run -  228.14 GB/s
CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                 8196 runs -   128.48 us/run -    24576 kB/run -  182.49 GB/s
CPY(type_src=f32,type_dst=f32,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                 7175 runs -   159.06 us/run -    32768 kB/run -  196.56 GB/s
CPY(type_src=f32,type_dst=f32,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2736 runs -   407.97 us/run -    98272 kB/run -  230.06 GB/s
CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                 8196 runs -   128.33 us/run -    24576 kB/run -  182.70 GB/s
CPY(type_src=f32,type_dst=f32,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                 7175 runs -   159.37 us/run -    32768 kB/run -  196.18 GB/s
CPY(type_src=f32,type_dst=f32,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2736 runs -   409.34 us/run -    98272 kB/run -  229.29 GB/s
CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                 8196 runs -   128.09 us/run -    24576 kB/run -  183.04 GB/s
CPY(type_src=f16,type_dst=f16,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                 8196 runs -   154.12 us/run -    16384 kB/run -  101.41 GB/s
CPY(type_src=f16,type_dst=f16,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2732 runs -   399.94 us/run -    49136 kB/run -  117.25 GB/s
CPY(type_src=f16,type_dst=f16,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                10924 runs -   119.95 us/run -    12288 kB/run -   97.71 GB/s
CPY(type_src=f16,type_dst=f16,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                 8196 runs -   153.45 us/run -    16384 kB/run -  101.85 GB/s
CPY(type_src=f16,type_dst=f16,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2732 runs -   402.82 us/run -    49136 kB/run -  116.41 GB/s
CPY(type_src=f16,type_dst=f16,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                10924 runs -   120.06 us/run -    12288 kB/run -   97.63 GB/s
CPY(type_src=f16,type_dst=f16,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                 8196 runs -   153.56 us/run -    16384 kB/run -  101.78 GB/s
CPY(type_src=f16,type_dst=f16,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2732 runs -   404.33 us/run -    49136 kB/run -  115.98 GB/s
CPY(type_src=f16,type_dst=f16,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                10924 runs -   120.60 us/run -    12288 kB/run -   97.19 GB/s

This PR:

CPY(type_src=f32,type_dst=f32,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                10250 runs -    97.75 us/run -    32768 kB/run -  319.85 GB/s
CPY(type_src=f32,type_dst=f32,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2736 runs -   411.31 us/run -    98272 kB/run -  228.19 GB/s
CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                 8196 runs -   128.75 us/run -    24576 kB/run -  182.10 GB/s
CPY(type_src=f32,type_dst=f32,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                11275 runs -    95.40 us/run -    32768 kB/run -  327.71 GB/s
CPY(type_src=f32,type_dst=f32,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2736 runs -   407.66 us/run -    98272 kB/run -  230.23 GB/s
CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                 8196 runs -   128.75 us/run -    24576 kB/run -  182.11 GB/s
CPY(type_src=f32,type_dst=f32,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                11275 runs -    95.41 us/run -    32768 kB/run -  327.71 GB/s
CPY(type_src=f32,type_dst=f32,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2736 runs -   407.15 us/run -    98272 kB/run -  230.52 GB/s
CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                 8196 runs -   128.20 us/run -    24576 kB/run -  182.89 GB/s
CPY(type_src=f16,type_dst=f16,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                18441 runs -    59.33 us/run -    16384 kB/run -  263.41 GB/s
CPY(type_src=f16,type_dst=f16,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2732 runs -   400.63 us/run -    49136 kB/run -  117.05 GB/s
CPY(type_src=f16,type_dst=f16,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                10924 runs -   118.94 us/run -    12288 kB/run -   98.54 GB/s
CPY(type_src=f16,type_dst=f16,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                18441 runs -    59.34 us/run -    16384 kB/run -  263.38 GB/s
CPY(type_src=f16,type_dst=f16,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2732 runs -   401.69 us/run -    49136 kB/run -  116.74 GB/s
CPY(type_src=f16,type_dst=f16,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                10924 runs -   118.38 us/run -    12288 kB/run -   99.01 GB/s
CPY(type_src=f16,type_dst=f16,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                18441 runs -    59.46 us/run -    16384 kB/run -  262.85 GB/s
CPY(type_src=f16,type_dst=f16,ne=[512,3071,2,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]):                 2732 runs -   400.61 us/run -    49136 kB/run -  117.06 GB/s
CPY(type_src=f16,type_dst=f16,ne=[3072,512,2,1],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]):                10924 runs -   118.58 us/run -    12288 kB/run -   98.84 GB/s
CPY(type_src=bf16,type_dst=bf16,ne=[8192,512,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):              18441 runs -    59.46 us/run -    16384 kB/run -  262.83 GB/s

qnixsynapse marked this pull request as ready for review on June 3, 2025 01:38
qnixsynapse requested a review from Alcpz on June 3, 2025 12:24
Adds a memcpy path for contiguous tensors of the same type to optimize data transfer. Updates device support checks to recognize contiguous tensor operations, improving compatibility and performance.
The changes replace multiple redundant block copy functions (e.g., cpy_block_q8_0_q8_0, cpy_block_q5_0_q5_0) with a single templated function cpy_blck_q_q. This reduces code duplication by using a generic template that works for any block type, improving maintainability while preserving the same functionality. The template is instantiated with specific block types (e.g., block_q8_0) where needed.
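A sketch of the shape such a template might take; the exact signature in the PR may differ. For identical quantized types a block copy reduces to a struct copy (the qs[] payload plus the scale fields), so one template covers them all:

```cpp
// Illustrative sketch, not the PR's exact code: a single template replaces
// the per-type functions such as cpy_block_q8_0_q8_0 and cpy_block_q5_0_q5_0.
template <typename block_t>
static void cpy_blck_q_q(const char * cxi, char * cdsti) {
    const block_t * xi   = reinterpret_cast<const block_t *>(cxi);
    block_t       * dsti = reinterpret_cast<block_t *>(cdsti);
    *dsti = *xi;  // one struct copy moves the whole quantized block
}
// Instantiated per type where needed, e.g. cpy_blck_q_q<block_q8_0>.
```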
Use ceil_div to ensure full element coverage and update nd_range parameters to better align with SYCL block sizes, improving parallelism and device utilization in copy operations.
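A minimal sketch of that launch pattern, with illustrative names: the ceil_div helper and work-group constant shown here are stand-ins for the backend's own definitions, not the PR's code.

```cpp
#include <sycl/sycl.hpp>

constexpr int WG_SIZE = 32;  // stand-in for the backend's copy block size

static int ceil_div(int m, int n) { return (m + n - 1) / n; }

// Launch a 1D copy over `ne` elements: round the global range up to a
// multiple of the work-group size so every element is covered, then guard
// the tail inside the kernel.
static void launch_cpy(sycl::queue & q, const float * src, float * dst,
                       int ne) {
    const int num_groups = ceil_div(ne, WG_SIZE);
    q.parallel_for(
        sycl::nd_range<1>(sycl::range<1>(num_groups * WG_SIZE),
                          sycl::range<1>(WG_SIZE)),
        [=](sycl::nd_item<1> it) {
            const int i = static_cast<int>(it.get_global_linear_id());
            if (i >= ne) {
                return;  // ids padded by the round-up do nothing
            }
            dst[i] = src[i];
        });
}
```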