Skip to content

Misc. bug: The model's reasoning performance has significantly decreased despite using different versions of the same model architecture, identical parameters, and the same set of questions. #12816

Open
@zts9989

Description

@zts9989

Name and Version

built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

llama.cpp-b4702

llama.cpp-b4751

llama.cpp-b4756 **************

llama.cpp-b4759 **************

llama.cpp-b4761

llama.cpp-b4762

llama.cpp-b4769

llama.cpp-b4775

llama.cpp-b4800

llama.cpp-b4900

llama.cpp-b4940

llama.cpp-b4990

llama.cpp-b5026

llama.cpp-b5030

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server, llama-cli

Command line

./build/bin/llama-server -m /data/qwq-32b-q8_0-00001-of-00009.gguf -fa -s 3047 --temp 0.6 --top-p 0.95  -ngl 100 --host 0.0.0.0 -c 131072

Problem description & steps to reproduce

Testing with Different Versions of the llama.cpp Server for the Same Inference Task

Using two versions of the llama.cpp server to address the same problem:

llama.cpp-b4756
llama.cpp-b4759
Both versions employ identical parameters and models, yet exhibit significant performance differences.

Key observations:

Performance degradation:

b4759 is noticeably less capable than b4756 (performing worse than twice as poorly in some cases).
Token consumption for the same task:
b4756: ~3,000 tokens
b4759: ~6,000 tokens
Version comparison:

b4702 (an older version) shows superior performance compared to b4756.
The test problem used:

Can you help me decrypt this cipher I received?
"K nkmg rncakpi hqqvdcnn."

This behavior is reproducible through multiple tests. After extensive testing, version b4759 was identified as the one with drastically degraded performance.

If you can reproduce similar findings, please share your test cases!

First Bad Commit

No response

Relevant log output

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions