
Eval bug: Persistent <think> Tags in Qwen3-32B Output Despite enable_thinking: False and --reasoning-format none in llama.cpp #13189

Open
@shyn01

Description

Name and Version

llama.cpp Version: b5218 (latest as of April 29, 2025)
Model: Qwen3-32B (4-bit quantized, GGUF format, Q4_K_M)
Hardware: Dual NVIDIA A100 (40GB VRAM each), using single GPU with -ngl 99
OS: Ubuntu 22.04
CUDA: Enabled, detected (compute capability 8.0)
Server Command:
```bash
./build/bin/llama-server -m /home/models/qwen3-32b-q4_k_m.gguf --host 0.0.0.0 --port 7901 -c 40960 -ngl 99 -t 24 --reasoning-format none
```
API Client: Python with requests library, calling /v1/chat/completions

Operating systems

Linux

GGML backends

CUDA

Hardware

A100x2

Models

Qwen3-32B (4-bit GGUF, Q4_K_M)

Problem description & steps to reproduce

When running Qwen3-32B (4-bit GGUF, Q4_K_M) with llama.cpp, the model output consistently includes `<think>` tags, despite setting `"extra_body": {"enable_thinking": False}` in the API payload and using `--reasoning-format none` in the server command. According to the ModelScope documentation for Qwen3-32B (link), setting `enable_thinking: False` should disable `<think>` tags, aligning the behavior with Qwen2.5-Instruct models, but this does not work in llama.cpp.
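
For reference, the switch described in the Qwen3 documentation is applied when rendering the chat template. The sketch below shows the documented Transformers usage (not llama.cpp) and is included only to illustrate what `enable_thinking: False` is expected to do:

```python
from transformers import AutoTokenizer

# Sketch of the documented Qwen3 usage (Transformers, not llama.cpp):
# enable_thinking=False is a chat-template switch that suppresses the
# thinking phase, so no <think> reasoning block should be generated.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Give me a short introduction to large language model."}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
print(prompt)
```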

Steps to Reproduce
1. Compile llama.cpp (b5215) with CUDA support:
```bash
cmake -B build -DGGML_CUDA=ON
cmake --build build -j $(nproc)
```
2. Convert Qwen3-32B to GGUF and quantize to 4-bit (Q4_K_M):
```bash
python convert_hf_to_gguf.py /home/models/Qwen3-32B --outfile /home/models/qwen3-32b-f16.gguf
./build/bin/quantize /home/models/qwen3-32b-f16.gguf /home/models/qwen3-32b-q4_k_m.gguf Q4_K_M
```
3. Start the llama.cpp server:
```bash
./build/bin/llama-server -m /home/models/qwen3-32b-q4_k_m.gguf --host 0.0.0.0 --port 7901 -c 40960 -ngl 99 -t 24 --reasoning-format none
```
4. Call the API using Python:
```python
import json
import logging
import re

import requests

logger = logging.getLogger(__name__)
LLAMA_API_URL = "http://localhost:7901/v1/chat/completions"

messages = [
    {"role": "system", "content": "Answer directly without thinking process or tags like <think>."},
    {"role": "user", "content": "Give me a short introduction to large language model."}
]
payload = {
    "messages": messages,
    "max_tokens": 100,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0,
    "stream": True,
    "repeat_penalty": 1.5,
    "extra_body": {"enable_thinking": False}
}
response = requests.post(LLAMA_API_URL, json=payload, stream=True)
role_response = ""
for line in response.iter_lines():
    if line and line.startswith(b"data: "):
        data = line[6:].decode('utf-8')
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content", "")
        logger.debug(f"Raw delta: {delta}")
        delta_cleaned = re.sub(r'<[^>]*>', '', delta)
        role_response += delta_cleaned
print(role_response.strip())
```
5. Observe the raw output containing `<think>` tags (e.g., `<think>Analyzing...</think>Answer...`).
Expected Behavior
With "extra_body": {"enable_thinking": False} and --reasoning-format none, the model output should not include tags, as specified in the Qwen3-32B documentation.
The model should behave similarly to Qwen2.5-Instruct, producing direct responses without thinking blocks.
Actual Behavior
The model output includes `<think>` tags in the raw response (e.g., `<think>Reasoning...</think>Answer...`), even with `enable_thinking: False` and `--reasoning-format none`.
Client-side regex (`re.sub(r'<[^>]*>', '', delta)`) successfully removes the tag markers, but the goal is to prevent their generation on the server side (a variant that also drops the reasoning text between the tags is sketched below).
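
As a client-side stopgap, a regex that removes the whole block (tags plus the reasoning text between them) could look like this minimal sketch; it assumes the block is complete in the accumulated text, since streamed deltas may split the opening and closing tags across chunks:

```python
import re

# Sketch: strip complete <think>...</think> blocks from the accumulated response.
# Run on the joined text rather than on individual streamed deltas, which can
# split the tags across chunks.
def strip_think_blocks(text: str) -> str:
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

print(strip_think_blocks("<think>Analyzing request...</think>A large language model is ..."))
```
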
Additional Notes
The `--jinja` flag was also tested (and later removed), but it had no impact on `<think>` tag generation.
A system prompt ("Answer directly without thinking process or tags like `<think>`.") was added, but it did not prevent the tags.
The GGUF model was converted using convert_hf_to_gguf.py and quantized to Q4_K_M, with no issues during conversion.
The same issue persists when testing with llama-cli:
```bash
./build/bin/llama-cli -m /home/models/qwen3-32b-q4_k_m.gguf -p "Give me a short introduction to large language model." -ngl 99
```
Questions
1. Does llama.cpp fully support Qwen3-32B's `enable_thinking` parameter in the API payload?
2. Is `--reasoning-format none` sufficient to disable `<think>` tags for Qwen3-32B, or are additional parameters required?
3. Could the GGUF conversion process or Qwen3-32B's training data cause persistent `<think>` tag generation?
4. Are there known workarounds to prevent Qwen3-32B from generating `<think>` tags in llama.cpp? (See the sketch after this list for one candidate.)
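
Regarding question 4, the Qwen3 model card also describes a per-turn soft switch (`/no_think` appended to the user message). Whether this is honored when serving the GGUF through llama.cpp has not been verified here; a minimal sketch against the same server would be:

```python
import requests

LLAMA_API_URL = "http://localhost:7901/v1/chat/completions"

# Sketch (assumption: the /no_think soft switch documented for Qwen3 is
# respected when the model is served via llama.cpp): append it to the user turn.
payload = {
    "messages": [
        {"role": "user",
         "content": "Give me a short introduction to large language model. /no_think"},
    ],
    "max_tokens": 100,
    "stream": False,
}
resp = requests.post(LLAMA_API_URL, json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```
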
Logs

Client and server debug logs from an example run are included in the "Relevant log output" section below.

Please provide guidance on how to disable `<think>` tag generation server-side for Qwen3-32B in llama.cpp. Thank you!

First Bad Commit

No response

Relevant log output

Client debug log (example from successful run):
```text
Raw delta: '<think>Analyzing request...</think>'
Cleaned delta: 'Analyzing request...'
Final response: 'A large language model is a neural network trained on vast text data to generate human-like text.'
```
Server log (example from successful run):
```text
[INFO] Server listening on http://0.0.0.0:7901
[INFO] Loading model '/home/models/qwen3-32b-q4_k_m.gguf'
```
