
musa: extract ggml_cuda_mul_mat_batched_cublas_gemm_batched_ex #13887


Closed
wants to merge 5 commits

Conversation

yeahdongcn
Collaborator

This PR extracts ggml_cuda_mul_mat_batched_cublas_gemm_batched_ex and implements a MUSA-only version that allocates memory for pointer arrays using cudaMalloc, in order to avoid segmentation faults in muBLAS.

Since #13842 is still open, I will rebase this PR once it is merged into master.

Hopefully, we can revert this change in the next MUSA SDK release.
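
For context, the gist of the workaround is sketched below. This is an illustrative fragment rather than the actual patch: the variable names and the placement of the free calls are assumptions, and the intended difference from the existing CUDA path is only that the pointer arrays come from cudaMalloc rather than ggml's usual CUDA pool allocator.

    // Sketch of the MUSA-only pointer-array handling (illustrative, not the real code).
    // Per the description above, muBLAS currently segfaults unless the pointer arrays
    // passed to mublasGemmBatchedEx come from cudaMalloc, so allocate them directly:
    const void ** ptrs_src = nullptr; // 2*ne23 entries: batch pointers for src0, then src1
          void ** ptrs_dst = nullptr; //   ne23 entries: batch pointers for dst
    CUDA_CHECK(cudaMalloc((void **) &ptrs_src, 2*ne23*sizeof(void *)));
    CUDA_CHECK(cudaMalloc((void **) &ptrs_dst,   ne23*sizeof(void *)));

    // ... fill the arrays on the device and issue the batched GEMM
    //     (see the reviewed hunk further down) ...

    // Release the arrays once the GEMM no longer needs them; when that is safe
    // is exactly what the review discussion below is about.
    CUDA_CHECK(cudaFree(ptrs_src));
    CUDA_CHECK(cudaFree(ptrs_dst));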

Testing Done

All tests below were performed on the MTT S4000.

  • Build completed successfully
  • ./build/bin/test-backend-ops passed
    root@10e3931cadb6:/ws# ./build/bin/test-backend-ops 
    ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 1 MUSA devices:
      Device 0: MTT S4000, compute capability 2.2, VMM: yes
    Testing 2 devices
    
    Backend 1/2: MUSA0
      Device description: MTT S4000
      Device memory: 49069 MB (48988 MB free)
    
      ABS(type=f16,ne_a=[128,2,2,2],v=0): OK
      ABS(type=f16,ne_a=[5,7,11,13],v=0): OK
      ...
      CROSS_ENTROPY_LOSS(type=f32,ne=[10,5,4,3]): OK
      CROSS_ENTROPY_LOSS(type=f32,ne=[30000,1,1,1]): OK
      CROSS_ENTROPY_LOSS_BACK(type=f32,ne=[10,5,4,3]): OK
      CROSS_ENTROPY_LOSS_BACK(type=f32,ne=[30000,1,1,1]): OK
      OPT_STEP_ADAMW(type=f32,ne=[10,5,4,3]): OK
      5527/5527 tests passed
      Backend MUSA0: OK
    
    Backend 2/2: CPU
      Skipping CPU backend
    2/2 backends passed
    OK
  • Tested DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf, qwen3_8b_q4_k_m.gguf, and nvidia-llama-3_1-nemotron-nano-8b-v1-q4_k_m.gguf, both with and without the -fa flag

yeahdongcn and others added 5 commits May 28, 2025 13:58
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
github-actions bot added labels on May 29, 2025: "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning).
Comment on lines +67 to +78
CUBLAS_CHECK(
cublasGemmBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N,
ne01, ne11, ne10,
alpha, (const void **) (ptrs_src + 0*ne23), CUDA_R_16F, nb01/nb00,
(const void **) (ptrs_src + 1*ne23), CUDA_R_16F, s11,
beta, ( void **) (ptrs_dst + 0*ne23), cu_data_type, ne0,
ne23,
cu_compute_type,
CUBLAS_GEMM_DEFAULT_TENSOR_OP));

CUDA_CHECK(cudaFree(ptrs_src));
CUDA_CHECK(cudaFree(ptrs_dst));

Member

This wouldn't be ok in CUDA, since cublasGemmBatchedEx is normally asynchronous and freeing the memory immediately would likely lead to a use after free.
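
To make the hazard concrete, here is a minimal sketch of the pattern plain CUDA would need. It is not part of this PR; stream stands for the stream the GEMM was enqueued on.

    // cublasGemmBatchedEx only enqueues work on the stream and returns to the host
    // immediately, so the device may still be reading ptrs_src/ptrs_dst afterwards.
    CUBLAS_CHECK(cublasGemmBatchedEx(/* ... as in the hunk above ... */));

    // Unsafe in CUDA: frees the arrays while the GEMM may still be using them.
    // CUDA_CHECK(cudaFree(ptrs_src));
    // CUDA_CHECK(cudaFree(ptrs_dst));

    // One safe pattern: wait for the stream to drain before freeing.
    CUDA_CHECK(cudaStreamSynchronize(stream));
    CUDA_CHECK(cudaFree(ptrs_src));
    CUDA_CHECK(cudaFree(ptrs_dst));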

Collaborator Author

Thanks for taking a look at this!
Yes, that's due to the current incompatibility between cublasGemmBatchedEx and mublasGemmBatchedEx. Also, mublas.cu is only compiled when using the MUSA backend.

@JohannesGaessler
Collaborator

This is not an acceptable solution to me. For this level of change I would only be willing to accept vendor-specific code if

  1. the entrypoint is in ggml_cuda_mul_mat so that it is in effect completely decoupled from the CUDA code (see the sketch after this comment), and
  2. someone is pledging to take over the maintenance of the vendor-specific code.

> Hopefully, we can revert this change in the next MUSA SDK release.

How about we just wait for the next release then?
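
For reference, condition 1 could look roughly like the sketch below. Apart from ggml_cuda_mul_mat itself and the GGML_USE_MUSA define, every name here is hypothetical; the point is only to show where such a vendor entrypoint would sit, not to describe an existing API.

    // Hypothetical dispatch: the vendor-specific batched path is handled entirely
    // behind one branch at the top of ggml_cuda_mul_mat, leaving the CUDA code
    // untouched. ggml_musa_supports_mul_mat / ggml_musa_mul_mat do not exist today.
    static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx,
                                  const ggml_tensor * src0, const ggml_tensor * src1,
                                  ggml_tensor * dst) {
    #ifdef GGML_USE_MUSA
        if (ggml_musa_supports_mul_mat(src0, src1, dst)) {
            ggml_musa_mul_mat(ctx, src0, src1, dst); // maintained on the MUSA side
            return;
        }
    #endif // GGML_USE_MUSA
        // ... existing CUDA dispatch (MMQ, MMV, cuBLAS paths) ...
    }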

@yeahdongcn
Collaborator Author

yeahdongcn commented May 30, 2025

Thanks for the review!

Here are a few considerations and experiments I explored prior to the last PR:

  1. I initially attempted to create a separate ggml_cuda_mul_mat, but ran into significant code duplication due to the many statically linked functions in ggml-cuda.cu, which I believe would make it difficult to maintain in the long term.
  2. The incompatibility between cuBLAS and muBLAS is primarily limited to memory handling in cublasGemmBatchedEx / mublasGemmBatchedEx—which, admittedly, is on our side. Given that, the current approach introduces the minimal necessary changes.
  3. This modification provides an ~20% improvement in end-to-end TGS on the MTT S4000, which is the main reason I was hoping to integrate it sooner.

> How about we just wait for the next release then?

That’s certainly an option, but based on previous experience, MUSA SDK releases often take several months. So it may delay this improvement for quite a while.

@yeahdongcn
Collaborator Author

Just wanted to share some good news — in our internal build, mublasGemmBatchedEx now behaves the same as cublasGemmBatchedEx. I will drop this PR and update to the new MUSA SDK once it is officially released.

yeahdongcn closed this on Jun 4, 2025.