SYCL: Initial set_rows kernel implementation #14562
Conversation
ggml/src/ggml-sycl/set_rows.cpp
Outdated
const int max_threads_per_row = 128; // KEEPING 128 for now
const int threads_per_row = std::min((int)ne00, max_threads_per_row);

const int max_threads_per_block = 128;
Shouldn't this be 256?
I tried with 256 initially since CUDA used it. Reverted!
Did you check performance with both? If both of them give you correct results, select the one that gives you better performance.
Yes. So far, a value of 64 (rather than 256) gives the best performance on my A750.
Ideally, we would also test this on a B580 or on a Lunar Lake iGPU, but I can't do that right now. It's always possible to tweak it later. Thank you for testing and selecting the best value.
I think it would be helpful if you could show us the results on a B580 or a Lunar Lake iGPU.
I tested on B580 and 64 gives the best result.
Thanks!
const sycl::range<3> block_size(1, rows_per_block, threads_per_row);
const sycl::range<3> grid_size(ne03, ne02, (ne01 + rows_per_block - 1) / rows_per_block);
As this is, in essence, an element-wise operation, we can use a 1-dimensional kernel here, and the helper function I mentioned will take care of determining the position along these dimensions.
Thanks. I'll try testing with a 1D kernel today.
Edit: It seems possible, but do we really need a 1D kernel for a scatter row shuffle/reorder kernel?
So the argument is that nearly all of our element-wise kernels (activations, conversions, etc.) can be condensed into a single function, since element-wise operations are just one-to-one mappings:
// kernel body (sketch)
template <typename T, typename Functor>
void elementwise(const char * src, char * dst, Functor op, sycl::nd_item<1> it) {
    const auto element_index = it.get_global_linear_id();  // id of the element in the flattened array
    const auto src_offset = calculate_offset(...);  // from the global element id, plus dims and strides, via that helper function
    const T val = reinterpret_cast<const T *>(src)[src_offset];
    const T out = op(val);
    const auto dst_offset = calculate_offset(...);  // same helper, using the dst dims and strides
    reinterpret_cast<T *>(dst)[dst_offset] = out;   // and similarly store in dst
}
The PR was merged before I could attempt this. I will open a new PR with a 1D kernel. (Performance is almost the same as the cpy kernel.)
* origin/master:
  cmake : do not search for curl libraries by ourselves (ggml-org#14613)
  SYCL: Initial set_rows kernel implementation (ggml-org#14562)
  llama : minor coding style fix for smollm3 (ggml-org#14605)
  cmake : bump llguidance version to v1.0.1 (ggml-org#14609)
  cmake : llguidance build parser library only (ggml-org#14608)
  cuda : support Falcon-H1 state size for SSM_SCAN (ggml-org#14602)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This is an early implementation covering only fp32->fp32 and fp32->fp16.
test-backend-ops seems to be passing. However, I noticed that when enabling this kernel during inference (Llama 3.2 3B) via LLAMA_SET_ROWS=1, we lose about 3 t/s of decoding speed. We need to properly discuss and decide on threads/block and threads/row.
I am all ears for further improvement.