-
Notifications
You must be signed in to change notification settings - Fork 12.9k
ggml : fix SSM_SCAN for n_groups > 1 #15625
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look good to me, but I don't have them fully verified yet. When merged into my NemotronH branch, the results are much better than previously, but they still don't match the transformers
CUDA output. This is likely somewhere else in the model implementation unrelated to this bug, though. I have run Mamba-Codestral-7B
with the fix and it shows good quality output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now have this working for NemotronH. Without this fix, the results are still garbage, but with this fix, I get coherent results that match (with --temp 0
) across CPU, Metal, and CUDA. I've also verified that results on Mamba-Codestral-7B
match across CPU and Metal and show good quality, and for granite-4.0-tiny-preview
(to sanity check a n_groups == 1
model).
It's good to ship in my book!
@compilade Assuming you don't have any further testing you want to do, let's merge this one! |
…nemotron-nano-15409 * origin/master: ggml : fix SSM_SCAN for n_groups > 1 (ggml-org#15625) kv-cache : fix find_slot to not search for continuous slot (ggml-org#15638) model : jina-embeddings-v3 support (ggml-org#13693)
…nemotron-nano-15409 * origin/master: ggml : fix SSM_SCAN for n_groups > 1 (ggml-org#15625) kv-cache : fix find_slot to not search for continuous slot (ggml-org#15638) model : jina-embeddings-v3 support (ggml-org#13693)
…upport * origin/master: ggml : fix SSM_SCAN for n_groups > 1 (ggml-org#15625) kv-cache : fix find_slot to not search for continuous slot (ggml-org#15638) model : jina-embeddings-v3 support (ggml-org#13693) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This fixes a problem noticed by @gabe-l-hart in #15507 (comment).
The upstream implementation of the SSM scan repeats the grouped parts of B and C like
repeat_interleave
behaves instead of likerepeat
.Since most Mamba2 models use
n_groups == 1
, and that Mamba-Codestral-7B-v0.1 (which usesn_groups == 8
) had a non-extreme perplexity, this was not really noticed until #15507.On CPU, this reduces the perplexity of a
Q8_0
Mamba-Codestral-7B-v0.1 on the first 10 chunks ofwiki.test.raw
quite a lot:Before:
After:
To be clear, there's no need for reconversion of the affected models because this is purely a problem in the
SSM_SCAN
operation.However, if
imatrix
was used, it would make sense to recompute that.TODO:
Make sure to read the contributing guidelines before submitting a PR