
Misc. bug: Failure to run Qwen3-Coder-480B-A35B-Instruct (Q4_0) model on Ascend 910B NPU #15759

@MaoJianwei

Description


Name and Version

register_backend: registered backend CANN (8 devices)
register_device: registered device CANN0 (Ascend910B3)
register_device: registered device CANN1 (Ascend910B3)
register_device: registered device CANN2 (Ascend910B3)
register_device: registered device CANN3 (Ascend910B3)
register_device: registered device CANN4 (Ascend910B3)
register_device: registered device CANN5 (Ascend910B3)
register_device: registered device CANN6 (Ascend910B3)
register_device: registered device CANN7 (Ascend910B3)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (CPU)
load_backend: failed to find ggml_backend_init in /home/m00522703/llama.cpp/build/bin/libggml-cann.so
load_backend: failed to find ggml_backend_init in /home/m00522703/llama.cpp/build/bin/libggml-cpu.so
version: 6362 (f6da8cb8)
built with cc (GCC) 12.3.1 (openEuler 12.3.1-38.oe2403) for aarch64-openEuler-linux

Operating systems

No response

Which llama.cpp modules do you know to be affected?

No response

Command line

./build/bin/llama-server --model /data1/llm/gguf/Qwen3-Coder-480B-A35B-Instruct-Q4_0.gguf -ngl 320 --host 0.0.0.0

Problem description & steps to reproduce

The model loads and all 63 layers are offloaded across the 8 CANN devices, but llama-server aborts with the CANN error shown below during the warmup decode:

llama_model_loader: - type  f32:  311 tensors
llama_model_loader: - type q4_0:  435 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 251.96 GiB (4.51 BPW)
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 262144
print_info: n_embd           = 6144
print_info: n_layer          = 62
print_info: n_head           = 96
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 12
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 8192
print_info: n_expert         = 160
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 262144
print_info: rope_finetuned   = unknown
print_info: model type       = ?B
print_info: model params     = 480.15 B
print_info: general.name     = Qwen3 Coder 480B A35B Instruct
print_info: n_ff_exp         = 2560
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 11 ','
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 62 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 63/63 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  1231.07 MiB
load_tensors:        CANN0 model buffer size = 33132.38 MiB
load_tensors:        CANN1 model buffer size = 33132.38 MiB
load_tensors:        CANN2 model buffer size = 33132.38 MiB
load_tensors:        CANN3 model buffer size = 33132.38 MiB
load_tensors:        CANN4 model buffer size = 33132.38 MiB
load_tensors:        CANN5 model buffer size = 33132.38 MiB
load_tensors:        CANN6 model buffer size = 33132.38 MiB
load_tensors:        CANN7 model buffer size = 24849.31 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
ggml_backend_cann_context: device 0 async operator submission is OFF
ggml_backend_cann_context: device 0 execution mode is GRAPH (acl graph enabled)
ggml_backend_cann_context: device 1 async operator submission is OFF
ggml_backend_cann_context: device 1 execution mode is GRAPH (acl graph enabled)
ggml_backend_cann_context: device 2 async operator submission is OFF
ggml_backend_cann_context: device 2 execution mode is GRAPH (acl graph enabled)
ggml_backend_cann_context: device 3 async operator submission is OFF
ggml_backend_cann_context: device 3 execution mode is GRAPH (acl graph enabled)
ggml_backend_cann_context: device 4 async operator submission is OFF
ggml_backend_cann_context: device 4 execution mode is GRAPH (acl graph enabled)
ggml_backend_cann_context: device 5 async operator submission is OFF
ggml_backend_cann_context: device 5 execution mode is GRAPH (acl graph enabled)
ggml_backend_cann_context: device 6 async operator submission is OFF
ggml_backend_cann_context: device 6 execution mode is GRAPH (acl graph enabled)
ggml_backend_cann_context: device 7 async operator submission is OFF
ggml_backend_cann_context: device 7 execution mode is GRAPH (acl graph enabled)
llama_context:  CANN_Host  output buffer size =     0.58 MiB
llama_kv_cache:      CANN0 KV buffer size =   128.00 MiB
llama_kv_cache:      CANN1 KV buffer size =   128.00 MiB
llama_kv_cache:      CANN2 KV buffer size =   128.00 MiB
llama_kv_cache:      CANN3 KV buffer size =   128.00 MiB
llama_kv_cache:      CANN4 KV buffer size =   128.00 MiB
llama_kv_cache:      CANN5 KV buffer size =   128.00 MiB
llama_kv_cache:      CANN6 KV buffer size =   128.00 MiB
llama_kv_cache:      CANN7 KV buffer size =    96.00 MiB
llama_kv_cache: size =  992.00 MiB (  4096 cells,  62 layers,  1/1 seqs), K (f16):  496.00 MiB, V (f16):  496.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:      CANN0 compute buffer size =   276.01 MiB
llama_context:      CANN1 compute buffer size =   244.63 MiB
llama_context:      CANN2 compute buffer size =   244.63 MiB
llama_context:      CANN3 compute buffer size =   244.63 MiB
llama_context:      CANN4 compute buffer size =   244.63 MiB
llama_context:      CANN5 compute buffer size =   244.63 MiB
llama_context:      CANN6 compute buffer size =   244.63 MiB
llama_context:      CANN7 compute buffer size =   256.01 MiB
llama_context:  CANN_Host compute buffer size =   308.75 MiB
llama_context: graph nodes  = 3851
llama_context: graph splits = 10
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
new_pool_for_device: device 0 use vmm pool
/home/xxx/llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:69: CANN error
CANN error: EE9999: Inner Error!
EE9999: [PID: 182660] 2025-09-03-14:14:41.418.758 Not allow to synchronize captured-stream, stream_id=2.[FUNC:StreamSynchronize][FILE:api_error.cc][LINE:960]
        TraceBack (most recent call last):
       rtStreamSynchronize execute failed, reason=[stream is captured][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
       synchronize stream failed, runtime result = 107027[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

  current device: 0, in function ggml_cann_mul_mat_id_quant at /home/xxx/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:3008
  aclrtSynchronizeStream(ctx.stream())
[New LWP 182661]
[New LWP 182662]
[New LWP 182663]
[New LWP 182779]
[New LWP 182780]

... many outputs ...

[New LWP 195125]
[New LWP 195126]
[New LWP 195127]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
0x0000ffffb9721254 in wait4 () from /usr/lib64/libc.so.6
#0  0x0000ffffb9721254 in wait4 () from /usr/lib64/libc.so.6
#1  0x0000ffffb9b66314 in ggml_print_backtrace () at /home/xxx/llama.cpp/ggml/src/ggml.c:196
196             waitpid(child_pid, NULL, 0);
#2  0x0000ffffb9b664b8 in ggml_abort (file=0xffffb9cbd810 "/home/xxx/llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp", line=69, fmt=0xffffb9cbd800 "CANN error") at /home/xxx/llama.cpp/ggml/src/ggml.c:230
230             ggml_print_backtrace();
#3  0x0000ffffb9c9fb9c in ggml_cann_error (stmt=0xffffb9cbc338 "aclrtSynchronizeStream(ctx.stream())", func=0xffffb9cbc318 "ggml_cann_mul_mat_id_quant", file=0xffffb9cb8a58 "/home/xxx/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp", line=3006, msg=0x307195d8 "EE9999: Inner Error!\nEE9999: [PID: 313721] 2025-09-03-15:42:35.548.491 Not allow to synchronize captured-stream, stream_id=2.[FUNC:StreamSynchronize][FILE:api_error.cc][LINE:960]\n        TraceBack (mo"...) at /home/xxx/llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:69
69          GGML_ABORT("CANN error");
#4  0x0000ffffb9c8fe34 in ggml_cann_mul_mat_id_quant (ctx=..., dst=0x2c4f55d0) at /home/xxx/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:3006
3006        ACL_CHECK(aclrtSynchronizeStream(ctx.stream()));
#5  0x0000ffffb9c90318 in ggml_cann_mul_mat_id (ctx=..., dst=0x2c4f55d0) at /home/xxx/llama.cpp/ggml/src/ggml-cann/aclnn_ops.cpp:3103
3103                ggml_cann_mul_mat_id_quant(ctx, dst);
#6  0x0000ffffb9ca47b0 in ggml_cann_compute_forward (ctx=..., dst=0x2c4f55d0) at /home/xxx/llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:1805
1805                ggml_cann_mul_mat_id(ctx, dst);
#7  0x0000ffffb9ca59a0 in evaluate_and_capture_cann_graph (cann_ctx=0x2b55f0f0, cgraph=0x2b583f48, use_cann_graph=@0xffffef81dc57: true, cann_graph_update_required=@0xffffef81dc56: true) at /home/xxx/llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2211
2211                bool ok = ggml_cann_compute_forward(*cann_ctx, node);
#8  0x0000ffffb9ca5c34 in ggml_backend_cann_graph_compute (backend=0x2bf7cf00, cgraph=0x2b583f48) at /home/xxx/llama.cpp/ggml/src/ggml-cann/ggml-cann.cpp:2273
2273        evaluate_and_capture_cann_graph(
#9  0x0000ffffb9b7e838 in ggml_backend_graph_compute_async (backend=0x2bf7cf00, cgraph=0x2b583f48) at /home/xxx/llama.cpp/ggml/src/ggml-backend.cpp:359
359         return backend->iface.graph_compute(backend, cgraph);
#10 0x0000ffffb9b82de8 in ggml_backend_sched_compute_splits (sched=0x2bf802d0) at /home/xxx/llama.cpp/ggml/src/ggml-backend.cpp:1542
1542                enum ggml_status ec = ggml_backend_graph_compute_async(split_backend, &split->graph);
#11 0x0000ffffb9b83a50 in ggml_backend_sched_graph_compute_async (sched=0x2bf802d0, graph=0x2c4a8ea0) at /home/xxx/llama.cpp/ggml/src/ggml-backend.cpp:1742
1742        return ggml_backend_sched_compute_splits(sched);
#12 0x0000ffffba2fd378 in llama_context::graph_compute (this=0x27087dd0, gf=0x2c4a8ea0, batched=true) at /home/xxx/llama.cpp/src/llama-context.cpp:1454
1454        auto status = ggml_backend_sched_graph_compute_async(sched.get(), gf);
#13 0x0000ffffba2fad64 in llama_context::process_ubatch (this=0x27087dd0, ubatch=..., gtype=LLM_GRAPH_TYPE_DECODER, mctx=0x2e216e90, ret=@0xffffef81feb4: GGML_STATUS_SUCCESS) at /home/xxx/llama.cpp/src/llama-context.cpp:781
781         const auto status = graph_compute(res->get_gf(), ubatch.n_tokens > 1);
#14 0x0000ffffba2fbd44 in llama_context::decode (this=0x27087dd0, batch_inp=...) at /home/xxx/llama.cpp/src/llama-context.cpp:1085
1085            const auto * res = process_ubatch(ubatch, LLM_GRAPH_TYPE_DECODER, mctx.get(), status);
#15 0x0000ffffba300ef8 in llama_decode (ctx=0x27087dd0, batch=...) at /home/xxx/llama.cpp/src/llama-context.cpp:2720
2720        const int ret = ctx->decode(batch);
#16 0x0000000000702824 in common_init_from_params (params=...) at /home/xxx/llama.cpp/common/common.cpp:1066
1066                llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min(tmp.size(), (size_t) params.n_batch)));
#17 0x00000000005123b8 in server_context::load_model (this=0xffffef822150, params=...) at /home/xxx/llama.cpp/tools/server/server.cpp:2087
2087            llama_init = common_init_from_params(params_base);
#18 0x00000000004cac54 in main (argc=7, argv=0xffffef825b98) at /home/xxx/llama.cpp/tools/server/server.cpp:5128
5128        if (!ctx_server.load_model(params)) {
[Inferior 1 (process 313721) detached]
Aborted (core dumped)
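
For readers of the trace: the CANN backend is running the compute graph in ACL graph (capture) mode ("execution mode is GRAPH (acl graph enabled)"), while ggml_cann_mul_mat_id_quant() still calls aclrtSynchronizeStream() on the stream that is being captured; the runtime rejects this with "Not allow to synchronize captured-stream" (rtStreamSynchronize result 107027) and ggml aborts. Below is a minimal illustrative sketch of the conflicting pattern, not the actual backend code: the graph-capture step is only paraphrased in comments, and aclrtSynchronizeStream() is the one call taken verbatim from the backtrace.

#include <acl/acl.h>
#include <cstdio>

// Print (rather than abort on) any ACL runtime error.
static void check(aclError err, const char * what) {
    if (err != ACL_SUCCESS) {
        std::fprintf(stderr, "%s failed with error %d\n", what, (int) err);
    }
}

int main() {
    // Basic ACL runtime setup: device 0, one stream.
    check(aclInit(nullptr),  "aclInit");
    check(aclrtSetDevice(0), "aclrtSetDevice");

    aclrtStream stream = nullptr;
    check(aclrtCreateStream(&stream), "aclrtCreateStream");

    // ... assume ACL graph capture has been started on `stream` at this point ...
    // (the backend enters this state when it logs "acl graph enabled")

    // ggml_cann_mul_mat_id_quant() enqueues the quantized expert matmuls and then
    // synchronizes the stream. On a stream that is currently being captured, the
    // runtime refuses with "Not allow to synchronize captured-stream", which is
    // exactly the abort seen in the backtrace above.
    check(aclrtSynchronizeStream(stream), "aclrtSynchronizeStream");

    // Teardown.
    check(aclrtDestroyStream(stream), "aclrtDestroyStream");
    check(aclrtResetDevice(0),        "aclrtResetDevice");
    check(aclFinalize(),              "aclFinalize");
    return 0;
}

If this reading is right, a fix would need to either avoid the mid-capture stream synchronization in ggml_cann_mul_mat_id_quant() or fall back to eager (non-graph) execution for this operator.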

First Bad Commit

No response

Relevant log output

(included in the problem description above)
