SYCL: Initial set_rows kernel implementation #14562
Conversation
ggml/src/ggml-sycl/set_rows.cpp
Outdated
const int max_threads_per_row = 128; // KEEPING 128 for now
const int threads_per_row = std::min((int)ne00, max_threads_per_row);

const int max_threads_per_block = 128;
Shouldn't this be 256?
I tried with 256 initially since CUDA used it. Reverted!
Did you check performance with both? If both of them give you correct results, select the one that gives you better performance.
Yes. So far, a value of 64 (rather than 256) gives the best performance on my A750.
Ideally, we would also test this on a B580 or on a Lunar Lake iGPU, but I can't do that right now. It's always possible to tweak it later. Thank you for testing and selecting the best value.
I think it would be helpful if you could show us the results on a B580 or a Lunar Lake iGPU.
I tested on B580 and 64 gives the best result.
Thanks!
const sycl::range<3> block_size(1, rows_per_block, threads_per_row);
const sycl::range<3> grid_size(ne03, ne02, (ne01 + rows_per_block - 1) / rows_per_block);
As this is, in essence, an element-wise operation, we can use a 1-dimensional kernel here, and the helper function I mentioned will take care of determining the position along these dimensions.
Thanks. I'll try testing with a 1D kernel today.
Edit: It seems possible, but do we really need a 1D kernel for a scatter row shuffle/reorder kernel?
So the argument is that nearly all of our element-wise kernels (activations, conversions, etc.) can be condensed into a single function, since element-wise operations are just one-to-one mappings:
// kernel body (sketch)
template <typename T, typename Functor>
void elementwise(const char * src, char * dst, Functor op, sycl::nd_item<1> it) {
    const auto element_index = it.get_global_linear_id();  // id of the element in the flattened array
    const auto src_offset = calculate_offset(...);  // from the global element id, plus dims and strides, via that helper function
    const T val = reinterpret_cast<const T *>(src)[src_offset];
    const T out = op(val);
    const auto dst_offset = calculate_offset(...);  // same helper, using the dst dims and strides
    reinterpret_cast<T *>(dst)[dst_offset] = out;   // and similarly store in dst
}
The PR was merged before I could attempt this. I will open a new PR with a 1D kernel. (Performance is almost the same as the cpy kernel.)
* origin/master:
  cmake : do not search for curl libraries by ourselves (ggml-org#14613)
  SYCL: Initial set_rows kernel implementation (ggml-org#14562)
  llama : minor coding style fix for smollm3 (ggml-org#14605)
  cmake : bump llguidance version to v1.0.1 (ggml-org#14609)
  cmake : llguidance build parser library only (ggml-org#14608)
  cuda : support Falcon-H1 state size for SSM_SCAN (ggml-org#14602)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This is an early implementation covering only fp32->fp32 and fp32->fp16.
test-backend-ops seems to be passing. However, I noticed that when enabling this kernel during inference (Llama 3.2 3B) via LLAMA_SET_ROWS=1, we lose about 3 t/s of decoding speed. We need to properly discuss and decide on threads/block and threads/row.
I am all ears for further improvement.