
Eval bug: Persistent <think> Tags in Qwen3-32B Output Despite enable_thinking: False and --reasoning-format none in llama.cpp #13189

Open
@shyn01

Description

Name and Version

llama.cpp Version: b5218 (latest as of April 29, 2025)
Model: Qwen3-32B (4-bit quantized, GGUF format, Q4_K_M)
Hardware: Dual NVIDIA A100 (40GB VRAM each), using single GPU with -ngl 99
OS: Ubuntu 22.04
CUDA: Enabled, detected (compute capability 8.0)
Server Command:
```bash
./build/bin/llama-server -m /home/models/qwen3-32b-q4_k_m.gguf --host 0.0.0.0 --port 7901 -c 40960 -ngl 99 -t 24 --reasoning-format none
```
API Client: Python with requests library, calling /v1/chat/completions

Operating systems

Linux

GGML backends

CUDA

Hardware

A100x2

Models

Qwen3-32B (4-bit GGUF, Q4_K_M)

Problem description & steps to reproduce

When running Qwen3-32B (4-bit GGUF, Q4_K_M) with llama.cpp, the model output consistently includes `<think>` tags, despite setting `"extra_body": {"enable_thinking": False}` in the API payload and using `--reasoning-format none` in the server command. According to the ModelScope documentation for Qwen3-32B (link), setting `enable_thinking: False` should disable `<think>` tags, aligning the behavior with Qwen2.5-Instruct models, but this does not work in llama.cpp.
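
For reference, the switch described in the Qwen3 documentation is applied when rendering the chat template. The sketch below shows the documented Transformers usage (not llama.cpp) and is included only to illustrate what `enable_thinking: False` is expected to do:

```python
from transformers import AutoTokenizer

# Sketch of the documented Qwen3 usage (Transformers, not llama.cpp):
# enable_thinking=False is a chat-template switch that suppresses the
# thinking phase, so no <think> reasoning block should be generated.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Give me a short introduction to large language model."}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
print(prompt)
```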

Steps to Reproduce
1. Compile llama.cpp (b5215) with CUDA support:
```bash
cmake -B build -DGGML_CUDA=ON
cmake --build build -j $(nproc)
```
2. Convert Qwen3-32B to GGUF and quantize to 4-bit (Q4_K_M):
```bash
python convert_hf_to_gguf.py /home/models/Qwen3-32B --outfile /home/models/qwen3-32b-f16.gguf
./build/bin/quantize /home/models/qwen3-32b-f16.gguf /home/models/qwen3-32b-q4_k_m.gguf Q4_K_M
```
3. Start the llama.cpp server:
```bash
./build/bin/llama-server -m /home/models/qwen3-32b-q4_k_m.gguf --host 0.0.0.0 --port 7901 -c 40960 -ngl 99 -t 24 --reasoning-format none
```
4. Call the API using Python:
```python
import json
import logging
import re

import requests

logger = logging.getLogger(__name__)
LLAMA_API_URL = "http://localhost:7901/v1/chat/completions"

messages = [
    {"role": "system", "content": "Answer directly without thinking process or tags like <think>."},
    {"role": "user", "content": "Give me a short introduction to large language model."}
]
payload = {
    "messages": messages,
    "max_tokens": 100,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0,
    "stream": True,
    "repeat_penalty": 1.5,
    "extra_body": {"enable_thinking": False}
}
response = requests.post(LLAMA_API_URL, json=payload, stream=True)
role_response = ""
for line in response.iter_lines():
    if line and line.startswith(b"data: "):
        data = line[6:].decode('utf-8')
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content", "")
        logger.debug(f"Raw delta: {delta}")
        delta_cleaned = re.sub(r'<[^>]*>', '', delta)
        role_response += delta_cleaned
print(role_response.strip())
```
5. Observe the raw output containing `<think>` tags (e.g., `<think>Analyzing...</think>Answer...`).
Expected Behavior
With "extra_body": {"enable_thinking": False} and --reasoning-format none, the model output should not include tags, as specified in the Qwen3-32B documentation.
The model should behave similarly to Qwen2.5-Instruct, producing direct responses without thinking blocks.
Actual Behavior
The model output includes `<think>` tags in the raw response (e.g., `<think>Reasoning...</think>Answer...`), even with `enable_thinking: False` and `--reasoning-format none`.
Client-side regex (`re.sub(r'<[^>]*>', '', delta)`) successfully removes the tag markers, but the goal is to prevent their generation on the server side (a variant that also drops the reasoning text between the tags is sketched below).
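
As a client-side stopgap, a regex that removes the whole block (tags plus the reasoning text between them) could look like this minimal sketch; it assumes the block is complete in the accumulated text, since streamed deltas may split the opening and closing tags across chunks:

```python
import re

# Sketch: strip complete <think>...</think> blocks from the accumulated response.
# Run on the joined text rather than on individual streamed deltas, which can
# split the tags across chunks.
def strip_think_blocks(text: str) -> str:
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

print(strip_think_blocks("<think>Analyzing request...</think>A large language model is ..."))
```
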
Additional Notes
The `--jinja` flag was also tested (and later removed), but it had no impact on `<think>` tag generation.
A system prompt ("Answer directly without thinking process or tags like `<think>`.") was added, but it did not prevent the tags.
The GGUF model was converted using convert_hf_to_gguf.py and quantized to Q4_K_M, with no issues during conversion.
The same issue persists when testing with llama-cli:
```bash
./build/bin/llama-cli -m /home/models/qwen3-32b-q4_k_m.gguf -p "Give me a short introduction to large language model." -ngl 99
```
Questions
1. Does llama.cpp fully support Qwen3-32B's `enable_thinking` parameter in the API payload?
2. Is `--reasoning-format none` sufficient to disable `<think>` tags for Qwen3-32B, or are additional parameters required?
3. Could the GGUF conversion process or Qwen3-32B's training data cause persistent `<think>` tag generation?
4. Are there known workarounds to prevent Qwen3-32B from generating `<think>` tags in llama.cpp? (See the sketch after this list for one candidate.)
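
Regarding question 4, the Qwen3 model card also describes a per-turn soft switch (`/no_think` appended to the user message). Whether this is honored when serving the GGUF through llama.cpp has not been verified here; a minimal sketch against the same server would be:

```python
import requests

LLAMA_API_URL = "http://localhost:7901/v1/chat/completions"

# Sketch (assumption: the /no_think soft switch documented for Qwen3 is
# respected when the model is served via llama.cpp): append it to the user turn.
payload = {
    "messages": [
        {"role": "user",
         "content": "Give me a short introduction to large language model. /no_think"},
    ],
    "max_tokens": 100,
    "stream": False,
}
resp = requests.post(LLAMA_API_URL, json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```
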
Logs

Client and server debug logs from an example run are included in the "Relevant log output" section below.

Please provide guidance on how to disable `<think>` tag generation server-side for Qwen3-32B in llama.cpp. Thank you!

First Bad Commit

No response

Relevant log output

Client debug log (example from successful run):
```text
Raw delta: '<think>Analyzing request...</think>'
Cleaned delta: 'Analyzing request...'
Final response: 'A large language model is a neural network trained on vast text data to generate human-like text.'
```
Server log (example from successful run):
```text
[INFO] Server listening on http://0.0.0.0:7901
[INFO] Loading model '/home/models/qwen3-32b-q4_k_m.gguf'
```
