SYCL: Implement few same quantized type copy kernels #13739
base: master
This may be a good opportunity to revisit the copy kernel for the case where the source and destination have the same type and no casting is required. These cases should share a single function that copies a number of bytes without depending on the type itself, which would reduce the number of kernels and simplify the code. This can be done with sycl::queue::memcpy. What do you think?
@Rbiessy I think |
You're right, that wouldn't work if there are permutations. I hadn't thought much about what a permuted tensor with a quantized type would look like. Is it by design that only the blocks are permuted, never the values inside a block?
Yeah. Generally, permuted tensors are non-contiguous. If I find time, I will check whether memcpy can be used to copy contiguous quantized tensors (I think it is possible). I am marking this PR as a draft for now.
3cdc64b to c8c2278
Some comparison I did.
This PR:
Adds a memcpy path for contiguous tensors of the same type to optimize data transfer. Updates device support checks to recognize contiguous tensor operations, improving compatibility and performance.
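To illustrate the idea behind the memcpy path, here is a minimal host-side sketch, not the code from this PR. The `demo_*` names, the two-dimensional tensor view, and the block layout are all hypothetical, and `std::memcpy` stands in for `sycl::queue::memcpy` so the sketch compiles without a SYCL toolchain. It takes the single-memcpy path only when both tensors are contiguous, and otherwise falls back to copying one block at a time through the strides, so a permuted tensor moves whole blocks and never splits the values inside a block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for an 8-bit quantized block: one scale plus 32 values.
// uint16_t replaces the half-precision scale to keep the sketch portable.
struct demo_block_q8 {
    uint16_t d;
    int8_t   qs[32];
};

// Hypothetical 2D tensor view measured in blocks:
// ne[i] = number of blocks along dim i, nb[i] = stride in bytes along dim i.
struct demo_tensor {
    void * data;
    size_t ne[2];
    size_t nb[2];
    size_t block_size; // bytes per quantized block
};

// Contiguous means the blocks are packed back to back in row-major order.
bool demo_is_contiguous(const demo_tensor & t) {
    return t.nb[0] == t.block_size && t.nb[1] == t.ne[0] * t.block_size;
}

size_t demo_nbytes(const demo_tensor & t) {
    return t.ne[0] * t.ne[1] * t.block_size;
}

// Copies src into dst (same type, same shape).
// Returns true if the single-memcpy fast path was taken.
bool demo_copy_same_type(const demo_tensor & src, demo_tensor & dst) {
    assert(src.block_size == dst.block_size);
    assert(src.ne[0] == dst.ne[0] && src.ne[1] == dst.ne[1]);

    if (demo_is_contiguous(src) && demo_is_contiguous(dst)) {
        // Fast path: one flat copy, independent of the block type.
        std::memcpy(dst.data, src.data, demo_nbytes(src));
        return true;
    }

    // Fallback for permuted (non-contiguous) tensors: copy whole blocks,
    // following each tensor's own byte strides.
    const char * s = static_cast<const char *>(src.data);
    char *       d = static_cast<char *>(dst.data);
    for (size_t i1 = 0; i1 < src.ne[1]; ++i1) {
        for (size_t i0 = 0; i0 < src.ne[0]; ++i0) {
            std::memcpy(d + i0 * dst.nb[0] + i1 * dst.nb[1],
                        s + i0 * src.nb[0] + i1 * src.nb[1],
                        src.block_size);
        }
    }
    return false;
}
```

On a device, the fast path would presumably become a single `memcpy` call on the queue, and the fallback would remain a strided kernel.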
The changes replace multiple redundant block copy functions (e.g., cpy_block_q8_0_q8_0, cpy_block_q5_0_q5_0) with a single templated function cpy_blck_q_q. This reduces code duplication by using a generic template that works for any block type, improving maintainability while preserving the same functionality. The template is instantiated with specific block types (e.g., block_q8_0) where needed.
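As a rough picture of what a single templated block copy can look like, here is a sketch under stated assumptions. The `demo_*` block structs are simplified stand-ins rather than the real ggml definitions, and `demo_cpy_blck_q_q` only mirrors the shape suggested by the commit message; the actual signature in the PR may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Simplified, hypothetical block layouts (the real ones use a half-precision scale).
struct demo_block_q8_0 {
    uint16_t d;
    int8_t   qs[32];
};

struct demo_block_q5_0 {
    uint16_t d;
    uint8_t  qh[4];
    uint8_t  qs[16];
};

// One generic routine instead of one hand-written copy per quantized type.
// It copies exactly one block from cxi to cdsti; the type only supplies the size.
template <typename block_t>
void demo_cpy_blck_q_q(const char * cxi, char * cdsti) {
    static_assert(std::is_trivially_copyable<block_t>::value,
                  "quantized blocks must be plain data to be copied bytewise");
    assert(cxi != nullptr && cdsti != nullptr);
    std::memcpy(cdsti, cxi, sizeof(block_t));
}
```

Each call site then names the block type it needs, for example `demo_cpy_blck_q_q<demo_block_q8_0>(src, dst)`, so supporting another quantized type needs no new copy function.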
2fb5967 to b6db005
Use ceil_div to ensure full element coverage and update nd_range parameters to better align with SYCL block sizes, improving parallelism and device utilization in copy operations.
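For reference, the rounding involved can be sketched as below. `demo_ceil_div` and `demo_launch_dims` are hypothetical helpers, not the ones in the PR, and the work-group size is an arbitrary example value. The point is that the global size is rounded up to a whole number of work-groups so the last partial group of elements is still covered.

```cpp
#include <cassert>
#include <cstddef>

// Integer division rounded up: the smallest q such that q * d >= n.
constexpr size_t demo_ceil_div(size_t n, size_t d) {
    return (n + d - 1) / d;
}

// Hypothetical launch description: number of work-groups and total work-items.
struct demo_launch {
    size_t num_groups;
    size_t global_size;
};

// Rounds the global size up to a multiple of the work-group size.
demo_launch demo_launch_dims(size_t n_elements, size_t work_group_size) {
    assert(work_group_size > 0);
    const size_t groups = demo_ceil_div(n_elements, work_group_size);
    return {groups, groups * work_group_size};
}
```

Because the rounded-up global size can exceed the element count (1000 elements with groups of 256 gives 1024 work-items), a kernel launched this way needs a bounds check such as `if (i >= n_elements) return;` before touching memory.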
This is to support KV cache defragmentation when a quantized KV cache is used.
test-backend-ops seems to pass with this change. Further testing is needed before we can merge.