
musa: extract ggml_cuda_mul_mat_batched_cublas_gemm_batched_ex #13887


Closed
wants to merge 5 commits

Conversation

yeahdongcn
Collaborator

This PR extracts ggml_cuda_mul_mat_batched_cublas_gemm_batched_ex and implements a MUSA-only version that allocates memory for pointer arrays using cudaMalloc, in order to avoid segmentation faults in muBLAS.

Since #13842 is still open, I will rebase this PR once it is merged into master.

Hopefully, we can revert this change in the next MUSA SDK release.
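
For context, the gist of the workaround is sketched below. This is an illustrative fragment rather than the actual patch: the variable names and the placement of the free calls are assumptions, and the intended difference from the existing CUDA path is only that the pointer arrays come from cudaMalloc rather than ggml's usual CUDA pool allocator.

    // Sketch of the MUSA-only pointer-array handling (illustrative, not the real code).
    // Per the description above, muBLAS currently segfaults unless the pointer arrays
    // passed to mublasGemmBatchedEx come from cudaMalloc, so allocate them directly:
    const void ** ptrs_src = nullptr; // 2*ne23 entries: batch pointers for src0, then src1
          void ** ptrs_dst = nullptr; //   ne23 entries: batch pointers for dst
    CUDA_CHECK(cudaMalloc((void **) &ptrs_src, 2*ne23*sizeof(void *)));
    CUDA_CHECK(cudaMalloc((void **) &ptrs_dst,   ne23*sizeof(void *)));

    // ... fill the arrays on the device and issue the batched GEMM
    //     (see the reviewed hunk further down) ...

    // Release the arrays once the GEMM no longer needs them; when that is safe
    // is exactly what the review discussion below is about.
    CUDA_CHECK(cudaFree(ptrs_src));
    CUDA_CHECK(cudaFree(ptrs_dst));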

Testing Done

All tests below were performed on the MTT S4000.

  • Build completed successfully
  • ./build/bin/test-backend-ops passed
    root@10e3931cadb6:/ws# ./build/bin/test-backend-ops 
    ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 1 MUSA devices:
      Device 0: MTT S4000, compute capability 2.2, VMM: yes
    Testing 2 devices
    
    Backend 1/2: MUSA0
      Device description: MTT S4000
      Device memory: 49069 MB (48988 MB free)
    
      ABS(type=f16,ne_a=[128,2,2,2],v=0): OK
      ABS(type=f16,ne_a=[5,7,11,13],v=0): OK
      ...
      CROSS_ENTROPY_LOSS(type=f32,ne=[10,5,4,3]): OK
      CROSS_ENTROPY_LOSS(type=f32,ne=[30000,1,1,1]): OK
      CROSS_ENTROPY_LOSS_BACK(type=f32,ne=[10,5,4,3]): OK
      CROSS_ENTROPY_LOSS_BACK(type=f32,ne=[30000,1,1,1]): OK
      OPT_STEP_ADAMW(type=f32,ne=[10,5,4,3]): OK
      5527/5527 tests passed
      Backend MUSA0: OK
    
    Backend 2/2: CPU
      Skipping CPU backend
    2/2 backends passed
    OK
  • Tested DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf, qwen3_8b_q4_k_m.gguf, and nvidia-llama-3_1-nemotron-nano-8b-v1-q4_k_m.gguf, both with and without the -fa flag

yeahdongcn and others added 5 commits May 28, 2025 13:58
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
github-actions bot added labels on May 29, 2025: "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning).
Comment on lines +67 to +78
CUBLAS_CHECK(
cublasGemmBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N,
ne01, ne11, ne10,
alpha, (const void **) (ptrs_src + 0*ne23), CUDA_R_16F, nb01/nb00,
(const void **) (ptrs_src + 1*ne23), CUDA_R_16F, s11,
beta, ( void **) (ptrs_dst + 0*ne23), cu_data_type, ne0,
ne23,
cu_compute_type,
CUBLAS_GEMM_DEFAULT_TENSOR_OP));

CUDA_CHECK(cudaFree(ptrs_src));
CUDA_CHECK(cudaFree(ptrs_dst));

Member

This wouldn't be ok in CUDA, since cublasGemmBatchedEx is normally asynchronous and freeing the memory immediately would likely lead to a use after free.
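
To make the hazard concrete, here is a minimal sketch of the pattern plain CUDA would need. It is not part of this PR; stream stands for the stream the GEMM was enqueued on.

    // cublasGemmBatchedEx only enqueues work on the stream and returns to the host
    // immediately, so the device may still be reading ptrs_src/ptrs_dst afterwards.
    CUBLAS_CHECK(cublasGemmBatchedEx(/* ... as in the hunk above ... */));

    // Unsafe in CUDA: frees the arrays while the GEMM may still be using them.
    // CUDA_CHECK(cudaFree(ptrs_src));
    // CUDA_CHECK(cudaFree(ptrs_dst));

    // One safe pattern: wait for the stream to drain before freeing.
    CUDA_CHECK(cudaStreamSynchronize(stream));
    CUDA_CHECK(cudaFree(ptrs_src));
    CUDA_CHECK(cudaFree(ptrs_dst));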

Collaborator Author

Thanks for taking a look at this!
Yes, that's due to the current incompatibility between cublasGemmBatchedEx and mublasGemmBatchedEx. Also, mublas.cu is only compiled when using the MUSA backend.

@JohannesGaessler
Collaborator

This is not an acceptable solution to me. For this level of change I would only be willing to accept vendor-specific code if

  1. the entrypoint is in ggml_cuda_mul_mat so that it is in effect completely decoupled from the CUDA code (see the sketch after this comment), and
  2. someone is pledging to take over the maintenance of the vendor-specific code.

> Hopefully, we can revert this change in the next MUSA SDK release.

How about we just wait for the next release then?
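
For reference, condition 1 could look roughly like the sketch below. Apart from ggml_cuda_mul_mat itself and the GGML_USE_MUSA define, every name here is hypothetical; the point is only to show where such a vendor entrypoint would sit, not to describe an existing API.

    // Hypothetical dispatch: the vendor-specific batched path is handled entirely
    // behind one branch at the top of ggml_cuda_mul_mat, leaving the CUDA code
    // untouched. ggml_musa_supports_mul_mat / ggml_musa_mul_mat do not exist today.
    static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx,
                                  const ggml_tensor * src0, const ggml_tensor * src1,
                                  ggml_tensor * dst) {
    #ifdef GGML_USE_MUSA
        if (ggml_musa_supports_mul_mat(src0, src1, dst)) {
            ggml_musa_mul_mat(ctx, src0, src1, dst); // maintained on the MUSA side
            return;
        }
    #endif // GGML_USE_MUSA
        // ... existing CUDA dispatch (MMQ, MMV, cuBLAS paths) ...
    }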

@yeahdongcn
Collaborator Author

yeahdongcn commented May 30, 2025

Thanks for the review!

Here are a few considerations and experiments I explored prior to the last PR:

  1. I initially attempted to create a separate ggml_cuda_mul_mat, but ran into significant code duplication due to the many statically linked functions in ggml-cuda.cu, which I believe would make it difficult to maintain in the long term.
  2. The incompatibility between cuBLAS and muBLAS is primarily limited to memory handling in cublasGemmBatchedEx / mublasGemmBatchedEx—which, admittedly, is on our side. Given that, the current approach introduces the minimal necessary changes.
  3. This modification provides an ~20% improvement in end-to-end TGS on the MTT S4000, which is the main reason I was hoping to integrate it sooner.

> How about we just wait for the next release then?

That’s certainly an option, but based on previous experience, MUSA SDK releases often take several months. So it may delay this improvement for quite a while.

@yeahdongcn
Collaborator Author

Just wanted to share some good news — in our internal build, mublasGemmBatchedEx now behaves the same as cublasGemmBatchedEx. I will drop this PR and update to the new MUSA SDK once it is officially released.

yeahdongcn closed this on Jun 4, 2025.