llama : add gpt-oss #15091

ggerganov · 2025-08-05T15:19:49Z

gpt-oss model support in native MXFP4 format:

Compute graph implementation in llama.cpp
Attention sinks support in ggml
New MXFP4 data type in ggml
New ggml_add_id operator in ggml

Usage:

Model collection: https://huggingface.co/collections/ggml-org/gpt-oss-68923b60bee37414546c70bf

Example command:

llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 -fa --jinja --reasoning-format none

# Then, access http://localhost:8080

Model card

References:

Note to maintainers:

This an initial implementation with pretty much complete support for the CUDA, Vulkan, Metal and CPU backends. The idea is to merge this quicker than usual, in time for the official release today, and later we can work on polishing any potential problems and missing features.

Next PRs:

CUDA fattn-mma-f16 sinks (currently only the vec FA kernels are implemented)
Vulkan fattn sinks
Attention sinks and MXFP4 support for remaining backends
Improve chat template handling / tool calling / reasoning effort control / CoT / etc.

* llama : add attn sinks * ggml : add attn sinks * cuda : add attn sinks * vulkan : add support for sinks in softmax remove unnecessary return * ggml : add fused swiglu_oai op (#11) * ggml : add fused swiglu_oai op * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update CUDA impl * cont : metal impl * add vulkan impl * test-backend-ops : more test cases, clean up * llama : remove unfused impl * remove extra lines --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>

kiuckhuang · 2025-08-05T21:25:49Z

The chat template by ggml-org, the tool calling is much better than from others:
https://huggingface.co/ggml-org/gpt-oss-20b-GGUF?chat_template=default

QQ: were there any customization on the template? My conversion is hitting this exception https://huggingface.co/openai/gpt-oss-20b/blob/main/chat_template.jinja#L344-L345 and getting 500 after some followup question in the chat. (reconverted after the PR merged to master)
I also experience the same with @bartowski1182 's bartowski/openai_gpt-oss-20b-GGUF-MXFP4-Experimental
Anyone else experiencing this?

@csabakecskemeti: Same here. Weird problems with the Template. The same discussion is going on on HF

The user kiuckhuang said there that he was able to improve things by using the template from https://huggingface.co/openai/gpt-oss-20b/blob/main/chat_template.jinja.

~~But i don't yet know how to do that?~~

~~Anyone with a hint on how to use the other template?~~

nachoal · 2025-08-05T21:28:53Z

has anyone made it work with tool_calling?

thad0ctor · 2025-08-05T23:59:01Z

Is anyone else getting gibberish with long context prompts?

I get the same results with either of the ggufs I downloaded, I drop in a 40K context prompt and get a response of "GGGGGGGGGGGGGG.....". Smaller prompts work fine.

llama-server -m /media//T9/Models/gpt-oss-120b-MXFP4-00001-of-00002.gguf --threads 24 --threads-batch 48 --batch-size 4096 --ubatch-size 4096 --ctx-size 131072 --temp 0.6 --min-p 0.01 --tensor-split 20,20,20,20 --n-gpu-layers 80 --main-gpu 1 --flash-attn --no-mmap --mlock --host 0.0.0.0 --port 5001

llama-server -m /media//T9/Models/gpt-oss-120b-F16.gguf --threads 24 --threads-batch 48 --batch-size 4096 --ubatch-size 4096 --ctx-size 131072 --temp 0.6 --min-p 0.01 --tensor-split 20,20,20,20 --n-gpu-layers 80 --main-gpu 1 --flash-attn --no-mmap --mlock --host 0.0.0.0 --port 5001

joseph777111 · 2025-08-06T01:55:25Z

@ggerganov Thank you, ggerganov and everyone else for your expedient and awesome work! Will attention sinks be made available to all GGUF'd models? 🤔

CHNtentes · 2025-08-06T02:22:36Z

Could anyone help explain why use -c 0?

fat-tire · 2025-08-06T02:54:06Z

@nachoal it appears there's a lot of new stuff here. At least to me-- but I have not used openai's API with openai before, only local models.

Like, there are two kinds of system prompts-- a "system message" and a "developer message". Also there are two types of tools-- "builtin_tools" (python or browser tools) referenced in the system message and function tools (described in the developer message). There is a special format for describing the function tools but I'm guessing MCP would work too.

The function tools are called in a separate "commentary" channel from normal reply content (and distinct from the "reasoning_content") per the harmony response format.

So different types of output appear in different places in the chat completion. As an example, instead of parsing <think></think> tags directly in the response as with some other models, you would find the reasoning content (in python) w:

reasoning_text = response.choices[0].message.reasoning_content

where response is a ChatCompletion object that came from the usual OpenAI API call:

response = client.chat.completions.create( ... )

It looks like right now in llama.cpp by default when an assistant tries to use a tool, the content string is left empty and it's the reasoning_content actually that contains the call stuff at the end of the reasoning text in the following format (this is just a reference MCP fetch tool call):

<|start|>assistant<|channel|>commentary to=fetch json<|message|>{\"url\":\"https://www.github.com\",\"max_length\":5000}

The expected format is is supposed to come after the reasoning like this:

<|start|>assistant<|channel|>commentary to=functions.get_weather <|constrain|>json<|message|>{"location":"San Francisco"}<|call|>

So the output looks very close but not exactly right from what I am seeing. It's missing the <|call|>

I'm sure that in the near future a tool call will be fully extracted by llama.cpp and put in response.choices[0].message.tool_calls or response.choices[0].message.function_call or wherever it's supposed to go, but as of right now it isn't recognizing the commentary channel at all.

The Harmony dox also discusses "Output by the model which can either be a tool call or a message output. " -- so apparently you can get a message OR a tool call, but apparently not both, which is why the content is blank when it tries to use a tool.

The hacky temporary workaround to this bug to maintain compatibility with other models would be to come up with a regex expression you could use to pull the json toolname and arguments/{output} from the reasoning_content text and substitute the resulting json as the reply text.

There's a note in this PR that the tool template stuff is a WIP and tool use is still to come, so I guess it may make the most sense to just wait for this to get fixed unless you're really itching to get tools working.

Anyone that knows more please correct me as I'm just figuring this out myself!

nachoal · 2025-08-06T03:23:33Z

@fat-tire Appreciate the complete explanation 🙏, I ended up just parsing the strings to try tool calling for now, a bit broken but it works. Thanks!

nai-kon · 2025-08-06T05:33:09Z

Has the reasoning_effort not been implemented yet?

I'm hosting gpt-oss-20b on llama-server and calling it from the OpenAI API chat.completions.create().
I'm changing reasoning_effort of create(..., reasoning_effort="low"), but the actual system prompt remains "reasoning: medium" and not changed.
It seems the reasoning_effort option is not working.

Here is quick sample.

llama-server -m gpt-oss-20b-mxfp4.gguf -c 0 -fa --jinja --reasoning-format none -ngl 1000 (I'm using b6096 build)

from openai import OpenAI

model = OpenAI(api_key="dummy", base_url="http://127.0.0.1:8080")
completion = model.chat.completions.create(
    model="dummy",
    messages=[{"role": "user", "content": "Write fizzbuzz in Python"}],
    reasoning_effort="high",
)

print(completion)

fat-tire · 2025-08-06T06:05:43Z

@nachoal Yup a simple regex pattern on that reasoning_content string like:

pattern = r".*\<\|start\|\>assistant\<\|channel\|\>commentary to=(?P<toolname>\w+) json\<\|message\|\>(?P<output>.*)"

gets you two match groups, toolname & output which you can use to reassemble back to a MCP tool call (or whatever). It's not the best way to do it, but it works most of the time while we wait for the "right way". I've noticed that the toolname can also vary. Sometimes it is just "fetch" (and then the output are the fetch tool arguments) but sometimes it's just "functions" and then the output is the entire tool json including the tool name AND the arguments. I think it may be "function.fetch" at times as well. I guess the system prompt will need to more closely follow the suggested prompt examples for # tool and would ideally be in the "developer message" rather than the "system message" but I'll play around a bit more. Just adding this in case anyone else is doing the same.

uazure · 2025-08-06T09:02:05Z

Using build: llama-b6098-bin-win-cuda-12.4-x64
./llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 0 -fa --reasoning-format none -dev none
it starts, but to any request it responds with GGGGGGGGGGGGGGGGGGGGGGG... sequence. I see similar issue mentioned few comments above. Perhaps they have common root

slaren · 2025-08-06T09:33:34Z

@uazure what CPU backend is it loading?

createthis · 2025-08-06T12:50:14Z

Has the reasoning_effort not been implemented yet?

@nai-kon I noticed the same behavior. I think we should open a defect issue rather than clog this further.

Added in GPT-OSS PR ggml-org/llama.cpp#15091 --------- Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

CISC · 2025-08-08T07:08:12Z

convert_hf_to_gguf.py

+                new_name_gate = self.map_tensor_name(name.replace("gate_up_proj_scales", "gate_proj.weight"))
+                new_name_up = self.map_tensor_name(name.replace("gate_up_proj_scales", "up_proj.weight"))


Too late, but why was this split? Only adds extra ops on the graph...

The gate_up tensor is organized in a way that a row of gate is followed by a row of up, aka interleaving. While we can rearrange it to the expected layout for fused op, I think it's easier to just split it into gate and up independently

Ahhh, didn't catch that.

CISC · 2025-08-08T07:09:51Z

ggml/src/ggml.c

+struct ggml_tensor * ggml_swiglu_oai(
+        struct ggml_context * ctx,
+        struct ggml_tensor  * a,
+        struct ggml_tensor  * b,
+        float                 alpha,
+        float                 limit) {
+    struct ggml_tensor * result = ggml_glu_impl(ctx, a, b, GGML_GLU_OP_SWIGLU_OAI, false);
+    ggml_set_op_params_f32(result, 2, alpha);
+    ggml_set_op_params_f32(result, 3, limit);
+
+    return result;
+}


Technically this is ggml_swiglu_oai_split.

ericcurtin · 2025-08-10T13:55:54Z

Does anybody see value in adding a simple chat client to upstream llama.cpp in C++ or python3 that we can consolidate on like this:

https://github.com/ericcurtin/lm-chat/blob/main/lm-chat.py

?

For formats like this new harmony one it can be hard to find simple reference implementations, that are not "from openai_harmony"

I guess there sort of is the html client implementation, but I'm not sure how many people are ready to crack that open as it's more that just a simple cli chat client.

JohannesGaessler · 2025-08-10T14:24:59Z

My opinion is that efforts should be focused on the existing web interface of the HTTP server.

ericcurtin · 2025-08-10T16:36:26Z

My opinion is that efforts should be focused on the existing web interface of the HTTP server.

A couple off issues with the web interface. It's a UI, so added complexity for a simple reference implementation. It's compressed, which kills all the version control, can't easily see changes, it's like a mystery blob that's committed from time to time.

I think it would be better if we committed both:

./tools/server/public/index.html.gz

and

./tools/server/public/index.html

on changes, at least we could track the changes then.

ngxson · 2025-08-10T21:12:16Z

index.html... hmm, good luck decoding the diff of the transpiled JS code

ericcurtin · 2025-08-10T21:38:41Z

index.html... hmm, good luck decoding the diff of the transpiled JS code

My bad, the true sources are there.

* oai moe * compat with new checkpoint * add attn sink impl * add rope scaling yarn * logits match with latest transformers code * wip chat template * rm trailing space * use ggml_scale_bias * rm redundant is_swa_all * convert interleaved gate_up * graph : fix activation function to match reference (ggml-org#7) * vocab : handle o200k_harmony special tokens * ggml : add attention sinks support (ggml-org#1) * llama : add attn sinks * ggml : add attn sinks * cuda : add attn sinks * vulkan : add support for sinks in softmax remove unnecessary return * ggml : add fused swiglu_oai op (ggml-org#11) * ggml : add fused swiglu_oai op * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update CUDA impl * cont : metal impl * add vulkan impl * test-backend-ops : more test cases, clean up * llama : remove unfused impl * remove extra lines --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com> * repack mxfp4 upon conversion * clean up a bit * enable thinking * add quick hack to render only some special tokens * fix bf16 conversion * remove vocab hack * webui ok * support chat parsing for gpt-oss * fix webui * direct mapping mxfp4, FINALLY * force using mxfp4 * properly use lazy tensor * ggml : add mxfp4 ggml : use e8m0 conversion instead of powf Co-authored-by: Diego Devesa <slarengh@gmail.com> change kvalues_mxfp4 table to match e2m1 (ggml-org#6) metal : remove quantization for now (not used) cuda : fix disabled CUDA graphs due to ffn moe bias vulkan : add support for mxfp4 cont : add cm2 dequant * ggml : add ggml_add_id (ggml-org#13) * ggml : add ggml_add_id * add cuda impl * llama : add weight support check for add_id * perf opt * add vulkan impl * rename cuda files * add metal impl * allow in-place ggml_add_id * llama : keep biases on CPU with --cpu-moe * llama : fix compile error ggml-ci * cuda : add fallback for __nv_cvt_e8m0_to_bf16raw ggml-ci * cleanup ggml-ci * sycl : fix supports_op for MXFP4 ggml-ci * fix Unknown reasoning format * ggml-cpu : fix AVX build ggml-ci * fix hip build ggml-ci * cuda : add mxfp4 dequantization support for cuBLAS ggml-ci * ggml-cpu : fix mxfp4 fallback definitions for some architectures ggml-ci * cuda : fix version required for __nv_cvt_e8m0_to_bf16raw --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: slaren <slarengh@gmail.com>

ngxson and others added 30 commits July 7, 2025 15:12

oai moe

81991fc

compat with new checkpoint

917f923

add attn sink impl

a4ab869

add rope scaling yarn

3801c36

logits match with latest transformers code

13f39f6

wip chat template

b3594b3

Merge branch 'master' into xsn/oai_moe

bd57158

rm trailing space

089a7ab

use ggml_scale_bias

4d01b36

Merge branch 'master' into xsn/oai_moe

f271cc8

rm redundant is_swa_all

106b17e

convert interleaved gate_up

e2c1beb

Merge remote-tracking branch 'gg-public/master' into xsn/oai_moe-gg

4431c82

Merge remote-tracking branch 'gg-public/master' into xsn/oai_moe-gg

fe9b818

Merge remote-tracking branch 'gg-public/master' into xsn/oai_moe-gg

539c2b6

graph : fix activation function to match reference (#7)

039a6f1

Merge branch 'master' into xsn/oai_moe-gg

aa240b9

Merge branch 'master' into xsn/oai_moe-gg

32a654c

vocab : handle o200k_harmony special tokens

13f3568

repack mxfp4 upon conversion

832dc26

clean up a bit

c68069d

enable thinking

423b191

add quick hack to render only some special tokens

4dd479b

fix bf16 conversion

ebc7da5

remove vocab hack

a543ddf

webui ok

6b30372

support chat parsing for gpt-oss

44bdb75

Merge branch 'master' into xsn/oai_moe

65b536f

fix webui

6197917

xunjieliu mentioned this pull request Aug 6, 2025

Reddit News Daily 2025-08-06 xunjieliu/reddit-daily-news#143

Open

caer mentioned this pull request Aug 6, 2025

Update llama.cpp to latest version supporting gpt-oss utilityai/llama-cpp-rs#797

Merged

compilade mentioned this pull request Aug 6, 2025

gguf-py : add Numpy MXFP4 de/quantization support #15111

Merged

VictorWangwz mentioned this pull request Aug 6, 2025

gpt-oss 20b gguf model fail to run ollama/ollama#11714

Open

Lyrcaxis mentioned this pull request Aug 6, 2025

Update july 2025 SciSharp/LLamaSharp#1225

Merged

9 tasks

mudler mentioned this pull request Aug 6, 2025

chore: ⬆️ Update ggml-org/llama.cpp to fd1234cb468935ea087d6929b2487926c3afff4b mudler/LocalAI#5972

Merged

CISC mentioned this pull request Aug 7, 2025

Add MXFP4 GGUF QuantizationType huggingface/huggingface.js#1677

Merged

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Aug 7, 2025

Revert "llama : add gpt-oss (ggml-org#15091)"

ddbba17

lhez mentioned this pull request Aug 7, 2025

opencl: support sinks in soft_max (attn sinks) #15152

Merged

ngxson added a commit to huggingface/huggingface.js that referenced this pull request Aug 7, 2025

Add MXFP4 GGUF QuantizationType (#1677)

e841a53

Added in GPT-OSS PR ggml-org/llama.cpp#15091 --------- Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

CISC reviewed Aug 8, 2025

View reviewed changes

lhez mentioned this pull request Aug 12, 2025

opencl: add initial mxfp4 support via mv #15270

Open

		new_name_gate = self.map_tensor_name(name.replace("gate_up_proj_scales", "gate_proj.weight"))
		new_name_up = self.map_tensor_name(name.replace("gate_up_proj_scales", "up_proj.weight"))

llama : add gpt-oss #15091

llama : add gpt-oss #15091

Conversation

ggerganov commented Aug 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kiuckhuang commented Aug 5, 2025

Uh oh!

nachoal commented Aug 5, 2025

Uh oh!

thad0ctor commented Aug 5, 2025

Uh oh!

joseph777111 commented Aug 6, 2025

Uh oh!

CHNtentes commented Aug 6, 2025

Uh oh!

fat-tire commented Aug 6, 2025

Uh oh!

nachoal commented Aug 6, 2025

Uh oh!

nai-kon commented Aug 6, 2025

Uh oh!

fat-tire commented Aug 6, 2025

Uh oh!

uazure commented Aug 6, 2025

Uh oh!

slaren commented Aug 6, 2025

Uh oh!

createthis commented Aug 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CISC Aug 8, 2025

Choose a reason for hiding this comment

Uh oh!

ngxson Aug 8, 2025

Choose a reason for hiding this comment

Uh oh!

CISC Aug 8, 2025

Choose a reason for hiding this comment

Uh oh!

CISC Aug 8, 2025

Choose a reason for hiding this comment

Uh oh!

ericcurtin commented Aug 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JohannesGaessler commented Aug 10, 2025

Uh oh!

ericcurtin commented Aug 10, 2025

Uh oh!

ngxson commented Aug 10, 2025

Uh oh!

ericcurtin commented Aug 10, 2025

Uh oh!

Uh oh!

ggerganov commented Aug 5, 2025 •

edited

Loading

createthis commented Aug 6, 2025 •

edited

Loading

ericcurtin commented Aug 10, 2025 •

edited

Loading