SYCL: use 1D kernel for set_rows #14618

Merged: 3 commits merged into master from sycl/set_rows_1d_kernel on Jul 14, 2025
Conversation

qnixsynapse (Collaborator)

Continuation of #14562.

Got it to the same level of performance as the cpy kernel.

| Model | Batch size | Test | t/s master | t/s sycl/set_rows_1d_kernel | Speedup |
|---|---|---|---|---|---|
| llama 3B F16 | 64 | pp1024 | 770.33 | 849.79 | 1.10 |
| llama 3B F16 | 128 | pp1024 | 1196.16 | 1412.73 | 1.18 |
| llama 3B F16 | 256 | pp1024 | 1823.87 | 2341.93 | 1.28 |
| llama 3B F16 | 512 | pp1024 | 2586.44 | 3224.97 | 1.25 |
| llama 3B F16 | 1024 | pp1024 | 2602.06 | 3244.33 | 1.25 |
| llama 3B F16 | 2048 | pp512 | 2648.21 | 3346.26 | 1.26 |
| llama 3B F16 | 2048 | tg128 | 23.99 | 25.31 | 1.06 |

For comparison, with LLAMA_SET_ROWS=0:

| model | size | params | backend | ngl | n_batch | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 3B F16 | 5.98 GiB | 3.21 B | SYCL | 99 | 64 | pp1024 | 846.00 ± 1.63 |
| llama 3B F16 | 5.98 GiB | 3.21 B | SYCL | 99 | 128 | pp1024 | 1409.40 ± 3.23 |
| llama 3B F16 | 5.98 GiB | 3.21 B | SYCL | 99 | 256 | pp1024 | 2346.72 ± 1.56 |
| llama 3B F16 | 5.98 GiB | 3.21 B | SYCL | 99 | 512 | pp1024 | 3226.67 ± 3.88 |
| llama 3B F16 | 5.98 GiB | 3.21 B | SYCL | 99 | 1024 | pp1024 | 3253.42 ± 5.65 |
| llama 3B F16 | 5.98 GiB | 3.21 B | SYCL | 99 |  | tg128 | 25.49 ± 0.07 |

Thanks @AD2605
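
For readers skimming the diff, here is a minimal sketch of the 1D approach (hypothetical, simplified signature; the actual kernel in ggml-sycl handles strides, type conversions, and non-contiguous tensors): one work-item per destination element, with the target row looked up through the row-index tensor.

```cpp
// Minimal sketch of a 1D set_rows kernel, assuming contiguous tensors.
// Hypothetical names; the real ggml-sycl kernel is more general.
#include <sycl/sycl.hpp>
#include <cstdint>

template <typename TIn, typename TOut>
void set_rows_1d(sycl::queue & q, const TIn * src, const int64_t * row_idx,
                 TOut * dst, size_t n_cols, size_t n_rows) {
    const size_t n = n_cols * n_rows; // one work-item per element
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> id) {
        const size_t i   = id[0];
        const size_t row = i / n_cols;   // source row
        const size_t col = i % n_cols;   // column within the row
        // scatter: copy element i into the destination row named by row_idx
        dst[row_idx[row] * n_cols + col] = static_cast<TOut>(src[i]);
    });
}
```

Flattening the launch to a single dimension keeps the geometry essentially the same as the cpy kernel's, which is presumably why the two now benchmark at the same level.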

github-actions bot added the ggml and SYCL labels on Jul 10, 2025
Alcpz merged commit 0f4c6ec into master on Jul 14, 2025. 48 checks passed.
Alcpz (Collaborator) commented Jul 14, 2025

@qnixsynapse

After the merge I've seen these tests failing in our internal CI:

```
[MUL_MAT] NMSE = 1.345918287 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2],v=0): FAIL
[MUL_MAT] NMSE = 1.645138755 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1],v=0): FAIL
[MUL_MAT] NMSE = 1.764227244 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3],v=0): FAIL
[MUL_MAT] NMSE = 1.287292869 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2],v=0): FAIL
[MUL_MAT] NMSE = 1.725105342 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1],v=0): FAIL
[MUL_MAT] NMSE = 1.790304169 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3],v=0): FAIL
[MUL_MAT] NMSE = 1.308091346 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2],v=0): FAIL
[MUL_MAT] NMSE = 2.001969071 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1],v=0): FAIL
[MUL_MAT] NMSE = 1.768436437 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=4,bs=[2,3],nr=[1,1],per=[0,1,3,2],v=0): FAIL
[MUL_MAT] NMSE = 2.206630871 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=4,bs=[2,3],nr=[1,1],per=[0,3,2,1],v=0): FAIL
[MUL_MAT] NMSE = 1.694702958 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=8,k=4,bs=[2,3],nr=[1,1],per=[0,2,1,3],v=0): FAIL
[MUL_MAT] NMSE = 1.317521933 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=8,k=4,bs=[2,3],nr=[1,1],per=[0,1,3,2],v=0): FAIL
[MUL_MAT] NMSE = 1.580890764 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=8,k=4,bs=[2,3],nr=[1,1],per=[0,3,2,1],v=0): FAIL
[MUL_MAT] NMSE = 1.940136456 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=4,bs=[2,3],nr=[1,1],per=[0,2,1,3],v=0): FAIL
[MUL_MAT] NMSE = 1.341873484 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=4,bs=[2,3],nr=[1,1],per=[0,1,3,2],v=0): FAIL
[MUL_MAT] NMSE = 2.145200242 > 0.000500000   MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=4,bs=[2,3],nr=[1,1],per=[0,3,2,1],v=0): FAIL
```

Do we know if we are missing something else or should I revert?

Edit: I'm not 100% sure this is the patch that broke things; our CI skipped #14617.
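
For context on the numbers above, NMSE here is (assuming the usual definition, which I believe matches what test-backend-ops reports) the squared error normalized by the energy of the reference output, so values in the 1-2 range mean the error is as large as the signal itself, not a marginal tolerance miss. A sketch:

```cpp
// NMSE as commonly defined (assumed to match test-backend-ops):
// sum of squared errors divided by the reference energy.
#include <cstddef>

double nmse(const float * ref, const float * out, size_t n) {
    double err = 0.0, energy = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double) out[i] - (double) ref[i];
        err    += d * d;
        energy += (double) ref[i] * (double) ref[i];
    }
    return err / energy; // pass threshold in these tests: 5e-4
}
```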

qnixsynapse (Collaborator, Author) commented Jul 14, 2025

> After the merge I've seen these tests failing in our internal CI:
>
> [MUL_MAT failure log quoted above]
>
> Do we know if we are missing something else or should I revert the patch until a fix is found?

This seems to be a problem with MUL_MAT, so I think this PR is unrelated. Might need to check 65a3ebb.

I'll test it in a moment, although our CI seems to be passing.

Alcpz (Collaborator) commented Jul 14, 2025

Yeah, sorry to disrupt you. I'm currently checking the other commit as well; it's most likely the cause.

Alcpz (Collaborator) commented Jul 14, 2025

Confirmed to be 65a3ebb. Again, sorry for the false alarm; I got misled by the CI pointing at your PR. The failure seems to be specific to the Nvidia platform, so I think we can consider leaving it, since Intel is the priority here. What do you think? Though the author of that PR probably won't want something merged that broke things.

qnixsynapse deleted the sycl/set_rows_1d_kernel branch on July 14, 2025 at 11:54
qnixsynapse (Collaborator, Author) commented Jul 14, 2025

@Alcpz IMO, our priority should be Intel. If it doesn't affect Intel GPUs, then it's up to you whether you want to fix it or not.

We need to work on the softmax kernel's broadcasting and batching support next, and I'd prefer if you guys did it (I have a working implementation, but I somehow had to remove all the templated block sizes to get it working, so I haven't opened a PR here yet). A sketch of that pattern follows.
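
For readers unfamiliar with the pattern mentioned above, "templated block sizes" usually means something like the following (hypothetical code, not the actual ggml-sycl softmax): the work-group size is a compile-time constant so reduction loops can be unrolled, and a runtime switch selects the instantiation.

```cpp
// Hypothetical sketch of compile-time block sizes with runtime dispatch.
#include <sycl/sycl.hpp>

template <int BLOCK_SIZE>
void soft_max_kernel(sycl::queue & q, const float * x, float * dst, int ncols) {
    q.parallel_for(
        sycl::nd_range<1>(sycl::range<1>(BLOCK_SIZE), sycl::range<1>(BLOCK_SIZE)),
        [=](sycl::nd_item<1> it) {
            // Placeholder body: the real kernel would do BLOCK_SIZE-wide
            // max/sum reductions here; BLOCK_SIZE being a compile-time
            // constant lets the compiler unroll these loops.
            for (size_t c = it.get_local_id(0); c < (size_t) ncols; c += BLOCK_SIZE) {
                dst[c] = x[c];
            }
        });
}

void soft_max_dispatch(sycl::queue & q, const float * x, float * dst, int ncols) {
    // Runtime choice between compile-time instantiations.
    if (ncols <= 128) soft_max_kernel<128>(q, x, dst, ncols);
    else              soft_max_kernel<256>(q, x, dst, ncols);
}
```

Dropping the templates trades those unrolled, size-specialized reductions for one general kernel, which is presumably the simplification referred to above.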
