
SYCL: Initial set_rows kernel implementation #14562


Merged

merged 6 commits into master from sycl/set_rows on Jul 10, 2025

Conversation

qnixsynapse (Collaborator)

This is an early implementation covering only fp32->fp32 and fp32->fp16.

test-backend-ops seems to be passing; however, I noticed that when enabling this kernel during inference (llama 3.2 3B) with LLAMA_SET_ROWS=1, we lose about 3 t/s of decoding speed.

We still need to properly discuss and decide on threads per block and threads per row.

I am all ears for further improvements.
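For context, set_rows scatters rows of a source tensor into destination rows selected by an index tensor, converting the element type along the way (hence the fp32->fp32 and fp32->fp16 paths). Below is a minimal scalar sketch of those semantics; the names are illustrative and contiguous rows are assumed, so this is not the actual kernel:

#include <cstdint>
#include <sycl/sycl.hpp> // only for sycl::half in the fp32 -> fp16 case

// reference semantics sketch: for each source row i, copy it into
// destination row idx[i], converting the element type on the way
template <typename dst_t>
void set_rows_ref(const float * src, const int64_t * idx, dst_t * dst,
                  int64_t n_rows, int64_t row_size) {
    for (int64_t i = 0; i < n_rows; ++i) {
        const float * src_row = src + i * row_size;
        dst_t       * dst_row = dst + idx[i] * row_size;
        for (int64_t j = 0; j < row_size; ++j) {
            dst_row[j] = static_cast<dst_t>(src_row[j]); // fp32 -> fp32 or fp32 -> fp16
        }
    }
}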

@qnixsynapse qnixsynapse requested a review from Rbiessy July 7, 2025 08:36
@github-actions github-actions bot added labels Jul 7, 2025: ggml (changes relating to the ggml tensor library for machine learning), SYCL (https://en.wikipedia.org/wiki/SYCL, GPU programming language)
const int max_threads_per_row = 128; // KEEPING 128 for now
const int threads_per_row = std::min((int)ne00, max_threads_per_row);

const int max_threads_per_block = 128;
Member

Shouldn't this be 256?

Collaborator Author

I tried with 256 initially since CUDA used it. Reverted!

Collaborator

Did you check performance with both? If both of them give you correct results, select the one that gives you better performance.

Collaborator Author

Yes. A value of 64 instead of 256 so far gives the best performance on my A750.

Collaborator

Ideally we would test this on a B580 or on a Lunar Lake iGPU, but I can't do that right now. It's always possible to tweak it later. Thank you for testing and selecting the best value.

Collaborator Author

I think it would be helpful if you could share the results on a B580 or Lunar Lake iGPU.

Collaborator

I tested on a B580 and 64 gives the best result.

Collaborator Author

Thanks!

Comment on lines +77 to +62
const sycl::range<3> block_size(1, rows_per_block, threads_per_row);
const sycl::range<3> grid_size(ne03, ne02, (ne01 + rows_per_block - 1) / rows_per_block);

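For context, with this layout each work-group covers rows_per_block rows of one (i02, i03) slice. Inside the kernel, a work-item would recover its coordinates with the usual SYCL index math, roughly as in this sketch (an illustration, not code quoted from the PR):

// sketch: recovering tensor coordinates from the 3D nd_item configured above
const int i03 = it.get_group(0);                                        // outer batch index
const int i02 = it.get_group(1);                                        // inner batch index
const int i01 = it.get_group(2) * rows_per_block + it.get_local_id(1);  // row; needs an i01 < ne01 bounds check
const int i00 = it.get_local_id(2);  // starting column; loop with stride threads_per_row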
Contributor

Since this is, in essence, an element-wise operation, we can use a one-dimensional kernel here, and the helper function I mentioned will take care of ascertaining the position along these dimensions.
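The helper function referred to here is not shown in the thread; a minimal sketch of what it might look like, assuming ggml-style dims ne[] and byte strides nb[] (with nb[0] the element size), is:

#include <cstdint>

// sketch: map a flat element id to a byte offset via dims and byte strides
static inline int64_t calculate_offset(const int64_t * ne, const int64_t * nb, int64_t i) {
    const int64_t i0 =  i % ne[0];                    // innermost dimension
    const int64_t i1 = (i / ne[0]) % ne[1];
    const int64_t i2 = (i / (ne[0] * ne[1])) % ne[2];
    const int64_t i3 =  i / (ne[0] * ne[1] * ne[2]);  // outermost dimension
    return i0 * nb[0] + i1 * nb[1] + i2 * nb[2] + i3 * nb[3];
}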

@qnixsynapse (Collaborator Author), Jul 8, 2025

Thanks. I'll try testing with a 1D kernel today.

Edit: It seems possible, but do we really need a 1D kernel for a scatter row shuffle/reorder kernel?

Contributor

The argument is that for the most part, all of our element-wise kernels (activations, conversions, etc.) can be condensed into a single function like the one below, because element-wise operations are just one-to-one mappings, whatever the operation:

// kernel body: one generic element-wise kernel for all one-to-one mappings
template <typename src_t, typename dst_t, typename functor_t>
void elementwise(const char * src, char * dst, const int64_t * ne,
                 const int64_t * src_nb, const int64_t * dst_nb,
                 functor_t op, sycl::nd_item<1> it) {
    const int64_t element_index = it.get_global_linear_id(); // id of the element in the flattened array
    // calculate_offset is the helper mentioned above: it maps the global
    // element id to a byte offset using the dims and strides
    const int64_t src_offset = calculate_offset(ne, src_nb, element_index);
    const src_t val = *reinterpret_cast<const src_t *>(src + src_offset);
    const dst_t output = op(val);
    const int64_t dst_offset = calculate_offset(ne, dst_nb, element_index);
    *reinterpret_cast<dst_t *>(dst + dst_offset) = output; // and similarly store in dst
}
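For illustration, the 1D launch under discussion could look roughly like the following in the contiguous case, where the offset is simply the global element id; all names here are illustrative and USM device pointers are assumed:

#include <sycl/sycl.hpp>

// sketch: 1D element-wise launch for the contiguous case (offset == element id)
template <typename src_t, typename dst_t, typename functor_t>
void launch_elementwise_contiguous(sycl::queue & q, const src_t * src, dst_t * dst,
                                   size_t n_elements, functor_t op) {
    constexpr size_t block = 256;
    const size_t grid = ((n_elements + block - 1) / block) * block; // round up to a multiple of block
    q.parallel_for(sycl::nd_range<1>(sycl::range<1>(grid), sycl::range<1>(block)),
                   [=](sycl::nd_item<1> it) {
        const size_t i = it.get_global_linear_id();
        if (i < n_elements) {    // guard the rounded-up tail
            dst[i] = op(src[i]); // one-to-one mapping
        }
    });
}

// e.g. an fp32 -> fp16 conversion is then just one functor:
//   launch_elementwise_contiguous(q, src_f32, dst_f16, n,
//                                 [](float x) { return sycl::half(x); });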

Collaborator Author

The PR was merged before I could attempt this. I will open a new PR with the 1D kernel. (Performance is almost the same as the cpy kernel.)

@Alcpz Alcpz merged commit 704bb7a into master Jul 10, 2025
48 checks passed
@qnixsynapse qnixsynapse deleted the sycl/set_rows branch July 10, 2025 12:41
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jul 10, 2025
* origin/master:
cmake : do not search for curl libraries by ourselves (ggml-org#14613)
SYCL: Initial set_rows kernel implementation (ggml-org#14562)
llama : minor coding style fix for smollm3 (ggml-org#14605)
cmake : bump llguidance version to v1.0.1 (ggml-org#14609)
cmake : llguidance build parser library only (ggml-org#14608)
cuda : support Falcon-H1 state size for SSM_SCAN (ggml-org#14602)

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>