
llama : remove ggml_cont where possible #14568


Merged

merged 1 commit into master from cisc/remove-unnecessary-conts on Jul 7, 2025
Conversation

@CISC
Collaborator

CISC commented Jul 7, 2025

AFAIK all backends support non-contiguous norm/rope, therefore we can remove a lot of unnecessary ggml_cont()s by directly creating a 3D view before these ops.

Gives a quite hefty PP boost for the affected models.
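For illustration, a minimal sketch of the pattern in question (hypothetical tensor names and strides, not the exact llama.cpp graph-building code): instead of forcing a contiguous copy and then reshaping before rope/norm, a non-contiguous 3D view is taken directly.

```cpp
// Sketch only: illustrates the ggml_cont removal, not the actual llama.cpp code.
// Assumes `ctx0` is a ggml_context* and `qcur` is a 2D view (possibly
// non-contiguous, e.g. a slice of a fused QKV tensor) of shape
// [n_embd_head * n_head, n_tokens] (hypothetical names).
#include "ggml.h"

static struct ggml_tensor * build_q_for_rope(struct ggml_context * ctx0,
                                             struct ggml_tensor  * qcur,
                                             int64_t n_embd_head,
                                             int64_t n_head,
                                             int64_t n_tokens) {
    // old pattern: copy to contiguous memory, then reshape to 3D for rope/norm
    // struct ggml_tensor * q = ggml_cont(ctx0, qcur);
    // q = ggml_reshape_3d(ctx0, q, n_embd_head, n_head, n_tokens);

    // new pattern: take a (possibly non-contiguous) 3D view directly;
    // rope/norm handle non-contiguous inputs on all backends
    struct ggml_tensor * q = ggml_view_3d(ctx0, qcur,
            n_embd_head, n_head, n_tokens,
            ggml_element_size(qcur) * n_embd_head,   // nb1: byte stride between heads
            qcur->nb[1],                             // nb2: byte stride between tokens
            0);                                      // byte offset into qcur
    return q;
}
```

The rope/norm op is then applied to `q` as before; only the extra copy disappears from the graph, the downstream math is unchanged.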

CISC requested a review from ggerganov July 7, 2025 15:21

@ggerganov
Member

ggerganov left a comment

Sample M2 Ultra numbers:

| Model          | FlashAttention | Test  | t/s master | t/s cisc/remove-unnecessary-conts | Speedup |
| -------------- | -------------- | ----- | ---------: | --------------------------------: | ------: |
| falcon 7B Q4_0 | No             | pp512 |    1367.69 |                           1382.38 |    1.01 |
| falcon 7B Q4_0 | No             | tg32  |      99.63 |                            104.74 |    1.05 |
| falcon 7B Q4_0 | Yes            | pp512 |    1384.67 |                           1396.75 |    1.01 |
| falcon 7B Q4_0 | Yes            | tg32  |     102.58 |                            108.62 |    1.06 |

@jeffbolznv
Collaborator

Nice boost for tg, not seeing a gain for pp:

before:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           pp512 |    15172.83 ± 221.32 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        282.68 ± 1.78 |

after:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           pp512 |    15177.85 ± 494.64 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        293.20 ± 1.74 |

@CISC
Collaborator Author

CISC commented Jul 7, 2025

Some CUDA numbers before:

| model                     | size     | params | backend | ngl | threads | test  | t/s             |
| ------------------------- | -------- | ------ | ------- | --: | ------: | ----- | --------------: |
| phi3 3B IQ4_XS - 4.25 bpw | 1.92 GiB | 3.82 B | CUDA    |  99 |       7 | pp512 | 7276.01 ± 54.48 |
| phi3 3B IQ4_XS - 4.25 bpw | 1.92 GiB | 3.82 B | CUDA    |  99 |       7 | tg128 |   193.72 ± 1.09 |

after:

| model                     | size     | params | backend | ngl | threads | test  | t/s             |
| ------------------------- | -------- | ------ | ------- | --: | ------: | ----- | --------------: |
| phi3 3B IQ4_XS - 4.25 bpw | 1.92 GiB | 3.82 B | CUDA    |  99 |       7 | pp512 | 7506.65 ± 22.00 |
| phi3 3B IQ4_XS - 4.25 bpw | 1.92 GiB | 3.82 B | CUDA    |  99 |       7 | tg128 |   196.35 ± 1.01 |

@CISC
Collaborator Author

CISC commented Jul 7, 2025

Looks like MiniCPM broke, investigating...

@CISC
Collaborator Author

CISC commented Jul 7, 2025

Ah, hm, nope, it was already broken.

@CISC
Collaborator Author

CISC commented Jul 7, 2025

@ggerganov `ggml_reshape_2d` in `llama_kv_cache_unified::cpy_v` fails with `GGML_ASSERT(ggml_nelements(a) == ne0*ne1)`, wonder how long this has been broken for MiniCPM3?

@ggerganov
Member

Likely since #12449

@CISC
Collaborator Author

CISC commented Jul 7, 2025

> Likely since #12449

The problem is simply that `v_states` is turned into a 2D view, removing that fixes it. :P

@ggerganov
Member

Is the ggml_cont right before that still needed?

@CISC
Collaborator Author

CISC commented Jul 7, 2025

> Is the ggml_cont right before that still needed?

Yes, due to the `ggml_reshape_2d`.
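
For context, a minimal sketch of why that `ggml_cont` has to stay (hypothetical names and shapes, not the actual MiniCPM3 / `cpy_v` code): in current ggml, `ggml_reshape_2d` asserts both that its input is contiguous and that the element count matches the target shape, which is the assert quoted above.

```cpp
// Sketch only: illustrates the ggml_reshape_2d constraints discussed above,
// not the actual MiniCPM3 / cpy_v code (names and shapes are hypothetical).
#include "ggml.h"

static struct ggml_tensor * flatten_v_states(struct ggml_context * ctx,
                                             struct ggml_tensor  * v_states, // 3D, possibly a non-contiguous view
                                             int64_t n_embd_v,
                                             int64_t n_tokens) {
    // ggml_reshape_2d asserts ggml_is_contiguous(a) and
    // ggml_nelements(a) == ne0*ne1, so the data must be made contiguous first,
    // and the view must still hold all n_embd_v * n_tokens elements
    // (turning v_states into a trimmed 2D view beforehand trips the assert).
    struct ggml_tensor * v = ggml_cont(ctx, v_states);
    return ggml_reshape_2d(ctx, v, n_embd_v, n_tokens);
}
```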

@CISC
Collaborator Author

CISC commented Jul 7, 2025

I'll make a separate PR for MiniCPM3 for visibility.

@CISC CISC merged commit 12f55c3 into master Jul 7, 2025
48 checks passed
@CISC CISC deleted the cisc/remove-unnecessary-conts branch July 7, 2025 19:35
@CISC
Collaborator Author

CISC commented Jul 7, 2025

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jul 8, 2025
* origin/master:
model : fix hunyuan moe chat template (ggml-org#14584)
model : add SmolLM3 (ggml-org#14581)
memory : fix broken batch splits for recurrent cache (ggml-org#14575)
vulkan : fix rope with partial rotation and non-cont src (ggml-org#14582)
server: Add ability to mount server at prefix (ggml-org#14544)
model : add hunyuan moe (ggml-org#14425)
vulkan: increase timeout for CI (ggml-org#14574)
cuda : fix rope with partial rotation and non-cont src (ggml-org#14580)
CUDA: add bilinear interpolation for upscale (ggml-org#14563)
musa: fix build warnings (unused variable) (ggml-org#14561)
llama : fix incorrect minicpm3 v_states shape (ggml-org#14571)
llama : remove ggml_cont where possible (ggml-org#14568)
qnixsynapse pushed a commit to menloresearch/llama.cpp that referenced this pull request Jul 10, 2025