@compilade compilade commented Aug 27, 2025

This fixes a problem noticed by @gabe-l-hart in #15507 (comment).

The upstream implementation of the SSM scan repeats the grouped parts of B and C the way repeat_interleave does (each group is shared by a contiguous block of heads), not the way repeat does (groups cycling across heads).
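
To make the difference concrete, here is a minimal sketch of the two head-to-group mappings. The helper names and the toy sizes are assumptions made for illustration; this is not the ggml or upstream code.

```python
# Illustrative only: hypothetical helper names, not ggml or upstream code.

def group_like_repeat(head: int, n_groups: int) -> int:
    # repeat-style tiling: the group index cycles 0, 1, ..., n_groups - 1, 0, 1, ...
    return head % n_groups

def group_like_repeat_interleave(head: int, n_heads: int, n_groups: int) -> int:
    # repeat_interleave-style: each group serves n_heads // n_groups consecutive heads
    return head // (n_heads // n_groups)

n_heads, n_groups = 8, 2  # toy sizes chosen for readability
tiled = [group_like_repeat(h, n_groups) for h in range(n_heads)]
interleaved = [group_like_repeat_interleave(h, n_heads, n_groups) for h in range(n_heads)]

print(tiled)        # [0, 1, 0, 1, 0, 1, 0, 1]
print(interleaved)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

With more than one group, the two mappings hand most heads a different slice of B and C, which is the kind of mismatch described above.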

Since most Mamba2 models use n_groups == 1, and since Mamba-Codestral-7B-v0.1 (which uses n_groups == 8) had a non-extreme perplexity, this was not really noticed until #15507.

On CPU, this substantially reduces the perplexity of a Q8_0 Mamba-Codestral-7B-v0.1 on the first 10 chunks of wiki.test.raw (from 16.1435 to 9.1851):

Before:

[1]8.2788,[2]10.8075,[3]12.6548,[4]14.6535,[5]14.1671,[6]14.0561,[7]14.6714,[8]14.8880,[9]15.2977,[10]16.1435,
Final estimate: PPL = 16.1435 +/- 0.88152

After:

[1]5.2122,[2]6.4318,[3]7.0763,[4]8.0914,[5]8.1234,[6]8.1319,[7]8.4960,[8]8.6492,[9]8.7738,[10]9.1851,
Final estimate: PPL = 9.1851 +/- 0.46159
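
For a sense of scale, the relative change between the two final estimates can be worked out directly from the numbers reported above. This is only arithmetic on those two values, and the variable names are chosen for illustration.

```python
# Final perplexity estimates quoted above (first 10 chunks of wiki.test.raw, Q8_0, CPU).
ppl_before = 16.1435
ppl_after = 9.1851

relative_reduction = (ppl_before - ppl_after) / ppl_before
print(f"{relative_reduction:.1%}")  # 43.1%
```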

To be clear, there's no need for reconversion of the affected models because this is purely a problem in the SSM_SCAN operation.

However, if an imatrix was used, it would make sense to recompute it.

TODO:

  • Test with Metal
  • Test with CUDA

@compilade compilade requested a review from gabe-l-hart August 27, 2025 21:56
@compilade compilade added the "generation quality" (Quality of model output) and "bugfix" (fixes an issue or bug) labels Aug 27, 2025
@github-actions github-actions bot added the "Nvidia GPU" (Issues specific to Nvidia GPUs), "ggml" (changes relating to the ggml tensor library for machine learning), and "Apple Metal" (https://en.wikipedia.org/wiki/Metal_(API)) labels Aug 27, 2025

@gabe-l-hart gabe-l-hart left a comment

The changes look good to me, but I don't have them fully verified yet. When merged into my NemotronH branch, the results are much better than previously, but they still don't match the transformers CUDA output. The remaining mismatch is likely somewhere else in the model implementation and unrelated to this bug, though. I have run Mamba-Codestral-7B with the fix and it shows good quality output.

@gabe-l-hart gabe-l-hart left a comment

I now have this working for NemotronH. Without this fix, the results are still garbage, but with this fix, I get coherent results that match (with --temp 0) across CPU, Metal, and CUDA. I've also verified that results on Mamba-Codestral-7B match across CPU and Metal and show good quality, and I did the same for granite-4.0-tiny-preview (to sanity check an n_groups == 1 model).

It's good to ship in my book!

@gabe-l-hart

@compilade Assuming you don't have any further testing you want to do, let's merge this one!

@compilade compilade merged commit 7380414 into master Aug 28, 2025
48 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Aug 28, 2025
…nemotron-nano-15409

* origin/master:
ggml : fix SSM_SCAN for n_groups > 1 (ggml-org#15625)
kv-cache : fix find_slot to not search for continuous slot (ggml-org#15638)
model : jina-embeddings-v3 support (ggml-org#13693)
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Aug 28, 2025
…upport

* origin/master:
ggml : fix SSM_SCAN for n_groups > 1 (ggml-org#15625)
kv-cache : fix find_slot to not search for continuous slot (ggml-org#15638)
model : jina-embeddings-v3 support (ggml-org#13693)

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Minh141120 pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 29, 2025