SYCL: use 1D kernel for set_rows #14618

Merged: 3 commits merged into master from sycl/set_rows_1d_kernel on Jul 14, 2025
Conversation

qnixsynapse (Collaborator)

Continuation of #14562.

Got it to the same level of performance as the cpy kernel.

| Model | Batch size | Test | t/s master | t/s sycl/set_rows_1d_kernel | Speedup |
|---|---|---|---|---|---|
| llama 3B F16 | 64 | pp1024 | 770.33 | 849.79 | 1.10 |
| llama 3B F16 | 128 | pp1024 | 1196.16 | 1412.73 | 1.18 |
| llama 3B F16 | 256 | pp1024 | 1823.87 | 2341.93 | 1.28 |
| llama 3B F16 | 512 | pp1024 | 2586.44 | 3224.97 | 1.25 |
| llama 3B F16 | 1024 | pp1024 | 2602.06 | 3244.33 | 1.25 |
| llama 3B F16 | 2048 | pp512 | 2648.21 | 3346.26 | 1.26 |
| llama 3B F16 | 2048 | tg128 | 23.99 | 25.31 | 1.06 |

For comparison, with LLAMA_SET_ROWS=0:

| model | size | params | backend | ngl | n_batch | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 3B F16 | 5.98 GiB | 3.21 B | SYCL | 99 | 64 | pp1024 | 846.00 ± 1.63 |
| llama 3B F16 | 5.98 GiB | 3.21 B | SYCL | 99 | 128 | pp1024 | 1409.40 ± 3.23 |
| llama 3B F16 | 5.98 GiB | 3.21 B | SYCL | 99 | 256 | pp1024 | 2346.72 ± 1.56 |
| llama 3B F16 | 5.98 GiB | 3.21 B | SYCL | 99 | 512 | pp1024 | 3226.67 ± 3.88 |
| llama 3B F16 | 5.98 GiB | 3.21 B | SYCL | 99 | 1024 | pp1024 | 3253.42 ± 5.65 |
| llama 3B F16 | 5.98 GiB | 3.21 B | SYCL | 99 |  | tg128 | 25.49 ± 0.07 |

Thanks @AD2605
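
For readers skimming the diff, here is a minimal sketch of the 1D approach (hypothetical, simplified signature; the actual kernel in ggml-sycl handles strides, type conversions, and non-contiguous tensors): one work-item per destination element, with the target row looked up through the row-index tensor.

```cpp
// Minimal sketch of a 1D set_rows kernel, assuming contiguous tensors.
// Hypothetical names; the real ggml-sycl kernel is more general.
#include <sycl/sycl.hpp>
#include <cstdint>

template <typename TIn, typename TOut>
void set_rows_1d(sycl::queue & q, const TIn * src, const int64_t * row_idx,
                 TOut * dst, size_t n_cols, size_t n_rows) {
    const size_t n = n_cols * n_rows; // one work-item per element
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> id) {
        const size_t i   = id[0];
        const size_t row = i / n_cols;   // source row
        const size_t col = i % n_cols;   // column within the row
        // scatter: copy element i into the destination row named by row_idx
        dst[row_idx[row] * n_cols + col] = static_cast<TOut>(src[i]);
    });
}
```

Flattening the launch to a single dimension keeps the geometry essentially the same as the cpy kernel's, which is presumably why the two now benchmark at the same level.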

github-actions bot added the ggml and SYCL labels on Jul 10, 2025
Alcpz merged commit 0f4c6ec into master on Jul 14, 2025. 48 checks passed.
Alcpz (Collaborator) commented Jul 14, 2025

@qnixsynapse

After the merge I've seen these tests failing in our internal CI:

```
[MUL_MAT] NMSE = 1.345918287 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2],v=0): FAIL
[MUL_MAT] NMSE = 1.645138755 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1],v=0): FAIL
[MUL_MAT] NMSE = 1.764227244 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3],v=0): FAIL
[MUL_MAT] NMSE = 1.287292869 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2],v=0): FAIL
[MUL_MAT] NMSE = 1.725105342 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1],v=0): FAIL
[MUL_MAT] NMSE = 1.790304169 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3],v=0): FAIL
[MUL_MAT] NMSE = 1.308091346 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2],v=0): FAIL
[MUL_MAT] NMSE = 2.001969071 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1],v=0): FAIL
[MUL_MAT] NMSE = 1.768436437 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=4,bs=[2,3],nr=[1,1],per=[0,1,3,2],v=0): FAIL
[MUL_MAT] NMSE = 2.206630871 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=4,bs=[2,3],nr=[1,1],per=[0,3,2,1],v=0): FAIL
[MUL_MAT] NMSE = 1.694702958 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=8,k=4,bs=[2,3],nr=[1,1],per=[0,2,1,3],v=0): FAIL
[MUL_MAT] NMSE = 1.317521933 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=8,k=4,bs=[2,3],nr=[1,1],per=[0,1,3,2],v=0): FAIL
[MUL_MAT] NMSE = 1.580890764 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=8,k=4,bs=[2,3],nr=[1,1],per=[0,3,2,1],v=0): FAIL
[MUL_MAT] NMSE = 1.940136456 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=4,bs=[2,3],nr=[1,1],per=[0,2,1,3],v=0): FAIL
[MUL_MAT] NMSE = 1.341873484 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=4,bs=[2,3],nr=[1,1],per=[0,1,3,2],v=0): FAIL
[MUL_MAT] NMSE = 2.145200242 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=4,bs=[2,3],nr=[1,1],per=[0,3,2,1],v=0): FAIL
```

Do we know if we are missing something else or should I revert?

Edit: I'm not 100% sure this is the patch that broke things; our CI skipped #14617.
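
For context on the numbers above, NMSE here is (assuming the usual definition, which I believe matches what test-backend-ops reports) the squared error normalized by the energy of the reference output, so values in the 1-2 range mean the error is as large as the signal itself, not a marginal tolerance miss. A sketch:

```cpp
// NMSE as commonly defined (assumed to match test-backend-ops):
// sum of squared errors divided by the reference energy.
#include <cstddef>

double nmse(const float * ref, const float * out, size_t n) {
    double err = 0.0, energy = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double) out[i] - (double) ref[i];
        err    += d * d;
        energy += (double) ref[i] * (double) ref[i];
    }
    return err / energy; // pass threshold in these tests: 5e-4
}
```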

qnixsynapse (Collaborator, Author) commented Jul 14, 2025

> After the merge I've seen these tests failing in our internal CI:
>
> [MUL_MAT failure log quoted above]
>
> Do we know if we are missing something else or should I revert the patch until a fix is found?

This seems to be a problem with MUL_MAT, so I think this PR is unrelated. Might need to check 65a3ebb.

I'll test it in a moment, although our CI seems to be passing.

Alcpz (Collaborator) commented Jul 14, 2025

Yeah, sorry to disrupt you. I'm currently checking the other commit as well; it's most likely the cause.

Alcpz (Collaborator) commented Jul 14, 2025

Confirmed to be 65a3ebb. Again, sorry for the false alarm; I got misled by the CI pointing at your PR. The failure seems to be specific to the Nvidia platform, so I think we can consider leaving it, since Intel is the priority here. What do you think? Though the author of that PR probably won't want something merged that broke things.

qnixsynapse deleted the sycl/set_rows_1d_kernel branch on July 14, 2025 at 11:54
qnixsynapse (Collaborator, Author) commented Jul 14, 2025

@Alcpz IMO, our priority should be Intel. If it doesn't affect Intel GPUs, then it's up to you whether you want to fix it or not.

We need to work on the softmax kernel's broadcasting and batching support next, and I'd prefer if you guys did it (I have a working implementation, but I somehow had to remove all the templated block sizes to get it working, so I haven't opened a PR here yet). A sketch of that pattern follows.
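
For readers unfamiliar with the pattern mentioned above, "templated block sizes" usually means something like the following (hypothetical code, not the actual ggml-sycl softmax): the work-group size is a compile-time constant so reduction loops can be unrolled, and a runtime switch selects the instantiation.

```cpp
// Hypothetical sketch of compile-time block sizes with runtime dispatch.
#include <sycl/sycl.hpp>

template <int BLOCK_SIZE>
void soft_max_kernel(sycl::queue & q, const float * x, float * dst, int ncols) {
    q.parallel_for(
        sycl::nd_range<1>(sycl::range<1>(BLOCK_SIZE), sycl::range<1>(BLOCK_SIZE)),
        [=](sycl::nd_item<1> it) {
            // Placeholder body: the real kernel would do BLOCK_SIZE-wide
            // max/sum reductions here; BLOCK_SIZE being a compile-time
            // constant lets the compiler unroll these loops.
            for (size_t c = it.get_local_id(0); c < (size_t) ncols; c += BLOCK_SIZE) {
                dst[c] = x[c];
            }
        });
}

void soft_max_dispatch(sycl::queue & q, const float * x, float * dst, int ncols) {
    // Runtime choice between compile-time instantiations.
    if (ncols <= 128) soft_max_kernel<128>(q, x, dst, ncols);
    else              soft_max_kernel<256>(q, x, dst, ncols);
}
```

Dropping the templates trades those unrolled, size-specialized reductions for one general kernel, which is presumably the simplification referred to above.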
