[WIP] Initial implementation of Grouped Gemm API #148531
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148531
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 37a054c with merge base e0d4c43.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Attention! native_functions.yaml was changed
If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs: one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.
@@ -95,6 +95,26 @@ if(INTERN_BUILD_ATEN_OPS)
  endif()
  list(JOIN ROWWISE_SCALED_MM_FILE_COMPILE_FLAGS " " ROWWISE_SCALED_MM_FILE_COMPILE_FLAGS)
  set_source_files_properties(${ROWWISE_SCALED_MM_FILE} PROPERTIES COMPILE_FLAGS "${ROWWISE_SCALED_MM_FILE_COMPILE_FLAGS}")

  set(ROWWISE_SCALED_MM_FILE "${CMAKE_CURRENT_LIST_DIR}/../aten/src/ATen/native/cuda/ScaledGroupMM.cu")
Unrelated: it seems these non-portable arches are becoming more and more important; we might want to figure out a more generalizable approach.
bool use_fast_accum) {
#ifndef USE_ROCM
  bool allowed_device = _scaled_mm_allowed_device();
  TORCH_CHECK(allowed_device, "torch._scaled_mm is only supported on CUDA devices with compute capability >= 9.0 or 8.9, or ROCm MI300+");
Nit: maybe remove the ROCm part here.
#include <c10/util/irange.h>

// Two warnings in Cutlass included header files
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wset-but-not-used")
Nit/TODO: we should double-check whether these pragmas are still needed.
Sprinkled in some comments; overall this looks good.
Merge started
Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m "Sorry but this broke ROCm jobs on trunk" -c nosignal
@pytorchbot successfully started a revert job. Check the current status here.
@ngimel your PR has been successfully reverted.
This reverts commit ff29791. Reverted #148531 on behalf of https://github.com/janeyx99 due to "Sorry but this broke ROCm jobs on trunk" ([comment](#148531 (comment)))
@ngimel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This PR provides an initial CUTLASS implementation of the grouped GEMM API as described in this document. Any combination of 2D and 3D inputs is supported, with 2D inputs being jagged and the offsets of the jagged dimension given by the device tensor `offs`. Only H100 is supported, and only fp8_e4m3 inputs with bf16 output and rowwise scaling. All dimensions of each individual GEMM have to be a multiple of 16; that's a CUTLASS limitation. I'll need to add those checks; for dynamic dimensions the checks will unfortunately have to be a device assert.
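To make the intended call shape concrete, here is a minimal usage sketch for the jagged-2D-A / 3D-B case. The op name `torch._scaled_grouped_mm`, the argument order, and the exact scale shapes and layout requirements are assumptions for illustration and may not match the final signature.

```python
# Sketch only: op name, argument order, scale shapes, and layouts are assumptions,
# not the finalized API.
import torch

device = "cuda"                  # H100 (SM90) only, per the PR description
G, K, N = 4, 64, 128             # every GEMM dimension must be a multiple of 16
m_sizes = [16, 32, 48, 16]       # per-group M, each a multiple of 16
M_total = sum(m_sizes)

# Jagged 2D operand A: groups stacked along dim 0, in fp8_e4m3.
a = torch.randn(M_total, K, device=device).to(torch.float8_e4m3fn)
# 3D operand B: one K x N matrix per group (built column-major here; the real
# layout requirement may differ).
b = torch.randn(G, N, K, device=device).to(torch.float8_e4m3fn).transpose(-2, -1)

# Rowwise scales (shapes are a guess: one per row of A, one per output column of B per group).
scale_a = torch.rand(M_total, device=device, dtype=torch.float32)
scale_b = torch.rand(G, N, device=device, dtype=torch.float32)

# Device tensor of cumulative group boundaries along A's jagged dimension.
offs = torch.cumsum(torch.tensor(m_sizes, device=device), dim=0, dtype=torch.int32)

out = torch._scaled_grouped_mm(      # assumed op name
    a, b, scale_a, scale_b, offs=offs,
    out_dtype=torch.bfloat16, use_fast_accum=True,
)
print(out.shape)  # expected: (M_total, N), in bf16
```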
I had to copy-paste CUTLASS's `Sm90RowBroadcast` and `Sm90ColBroadcast` structs with minor changes to enable scales given as pointer arrays; ideally those should be part of CUTLASS itself. I copied the schedules from the similar grouped GEMM in FBGEMM, but there's a lot of room to improve perf, especially for `fast_accum=False`.

Next steps would be perf tuning and extending coverage to B100; I don't know how the CUTLASS grouped GEMM example handles blockwise scaling on B100.
cc @vkuzo @drisspg @lw