
[Draft][CUDA] Upgrade torch._scaled_grouped_mm to SM100+ #156806


Open · wants to merge 7 commits into main from fralin/scaled_grouped_mm_sm100

Conversation

@eee4017 (Collaborator) commented Jun 25, 2025

This PR adds support for the Blackwell-specific scaling factor layouts, following the Scaling Factor Types outlined in issues #157950 and #158037. The operator now supports the following configurations:

| Operation | API | Input A | Input B | Scaling Factor A | Scaling Factor B | Hardware | Limitations |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Hopper Grouped GEMM (FP8) | _scaled_grouped_mm | 2D/3D, Layout: RowMajor | 2D/3D, Layout: ColMajor | shape: [M, 1], Layout: Per-row only | shape: [1, N], Layout: Per-row only | SM90 | FP8_E4M3 only, no bias/scale_result |
| Blackwell Grouped GEMM (FP8) | _scaled_grouped_mm | 2D/3D, Layout: RowMajor | 2D/3D, Layout: ColMajor | shape: [M, K//128], Layout: BlockWise1x128, outer-dim-major | shape: [K//128, N//128], Layout: BlockWise128x128, near-inner-dim-major | SM100+ | FP8_E4M3 only, no bias/scale_result |
  • GroupMMInputMatrixType and a helper function get_group_info have been introduced to generalize the logic for handling various input tensor combinations (2D/3D, representing batched or ragged inputs). This abstracts away the complexity of determining matrix dimensions (M, N, K) and group counts for different scenarios.

    • Note: When inputs are 2D (ragged), the returned M, N, and K dimensions are derived from the total tensor size and group count. These values are used for heuristics and may not represent the actual dimensions of each individual matrix in the group. See dispatch_fp8_grouped_gemm_on_tile_size.
  • The refactoring supports ragged inputs (as introduced in PR [WIP] Initial implementation of Grouped Gemm API #148531). For the initial Blackwell implementation with ragged tensors, the scaling factors are expected to be 3D tensors with shapes like [n_groups, M, K//128] for scale_A and [n_groups, K//128, N//128] for scale_B (see the usage sketch after this list).

  • As discussed in issue Upgrade torch._scaled_grouped_mm to SM100+ #156238, we are going to support more flexible input layouts in the future.

  • Since _scaled_mm is not yet available for SM100, the tests use a block-wise emulated reference implementation for comparison.
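For illustration, a minimal usage sketch of the Blackwell path with the shapes from the table above. The call signature is assumed to follow the grouped-GEMM API from #148531, and the dimensions and dtypes here are made up for the example:

```python
import torch

# Hedged sketch, not the PR's test code: 3D (batched) inputs on SM100+ with
# block-wise scales. Shapes follow the table above; K must be a multiple of 128.
n_groups, M, N, K = 4, 256, 512, 1024
a = torch.randn(n_groups, M, K, device="cuda").to(torch.float8_e4m3fn)                    # RowMajor A
b = torch.randn(n_groups, N, K, device="cuda").to(torch.float8_e4m3fn).transpose(-2, -1)  # ColMajor B
scale_a = torch.rand(n_groups, M, K // 128, device="cuda", dtype=torch.float32)
scale_b = torch.rand(n_groups, K // 128, N // 128, device="cuda", dtype=torch.float32)
# out_dtype is assumed to be an accepted keyword, as in the sm90 path.
out = torch._scaled_grouped_mm(a, b, scale_a, scale_b, out_dtype=torch.bfloat16)
```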

@eee4017 eee4017 requested review from eqy and syed-ahmed as code owners June 25, 2025 09:25

pytorch-bot bot commented Jun 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156806

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 1 Unrelated Failure

As of commit b9529e1 with merge base e9d27aa:

NEW FAILURES - The following jobs have failed:

  • pull / cuda12.8-py3.10-gcc9-sm75 / build (gh)
    /var/lib/jenkins/workspace/aten/src/ATen/native/cuda/ScaledGroupMM.cu:621:7: error: typedef ‘using DtypeProblemShape = using UnderlyingProblemShape = struct cute::tuple<int, int, int>’ locally defined but not used [-Werror=unused-local-typedefs]
  • pull / linux-jammy-cuda12.8-py3.10-gcc11 / build (gh)
    /var/lib/jenkins/workspace/aten/src/ATen/native/cuda/ScaledGroupMM.cu:621:7: error: typedef ‘using DtypeProblemShape = using UnderlyingProblemShape = struct cute::tuple<int, int, int>’ locally defined but not used [-Werror=unused-local-typedefs]
  • pull / linux-jammy-cuda12.8-py3.10-gcc11-build-distributed / build (gh)
    /var/lib/jenkins/workspace/aten/src/ATen/native/cuda/ScaledGroupMM.cu:621:7: error: typedef ‘using DtypeProblemShape = using UnderlyingProblemShape = struct cute::tuple<int, int, int>’ locally defined but not used [-Werror=unused-local-typedefs]

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

  • pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge, unstable) (gh) (#158876)
    /var/lib/jenkins/workspace/xla/torch_xla/csrc/runtime/BUILD:476:14: Compiling torch_xla/csrc/runtime/xla_util_test.cpp failed: (Exit 1): gcc failed: error executing CppCompile command (from target //torch_xla/csrc/runtime:xla_util_test) /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 229 arguments skipped)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@eee4017 (Collaborator, Author) commented Jun 25, 2025

@pytorchbot label "topic: not user facing" "module: cuda"

@pytorch-bot bot added the module: cuda and topic: not user facing labels Jun 25, 2025
@eee4017 eee4017 marked this pull request as draft June 25, 2025 09:30
@eee4017 eee4017 force-pushed the fralin/scaled_grouped_mm_sm100 branch from 43b4d74 to 21d3935 on July 9, 2025 13:15
@eee4017 eee4017 marked this pull request as ready for review July 9, 2025 13:16
C10_DIAGNOSTIC_POP()
C10_DIAGNOSTIC_POP()

namespace at::cuda::detail {

GroupCountInfo get_group_count(
eee4017 (Collaborator, Author):
I noticed that the 2D/3D logic is used across multiple functions, so I consolidated it into a struct for better reuse and readability.
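For readers of this thread, a rough Python rendering of what that struct captures. The PR's GroupCountInfo/get_group_info are C++; the mixed 2D/3D cases are omitted here, and the use of offs to define the group count is an assumption:

```python
# Rough illustration only; see the C++ GroupCountInfo/get_group_info in the diff.
def get_group_info(mat_a, mat_b, offs=None):
    if mat_a.dim() == 3 and mat_b.dim() == 3:
        # Plain batched case: one group per leading dimension.
        group_count = mat_a.size(0)
        M, K, N = mat_a.size(1), mat_a.size(2), mat_b.size(-1)
    elif mat_a.dim() == 2 and mat_b.dim() == 2:
        # Ragged case with groups stacked on the K dimension: M and N are exact,
        # while K is a nominal per-group value used only for tile-size heuristics.
        group_count = offs.numel()  # assumption: one offset per group
        M, N = mat_a.size(0), mat_b.size(-1)
        K = mat_a.size(1) // group_count
    else:
        raise NotImplementedError("mixed 2D/3D cases omitted from this sketch")
    return group_count, M, N, K
```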

@mikaylagawarecki mikaylagawarecki requested a review from ngimel July 14, 2025 15:02
@mikaylagawarecki mikaylagawarecki added the triaged label Jul 14, 2025
@drisspg drisspg self-requested a review July 17, 2025 19:59
@ngimel (Collaborator) left a comment:

Can you give a description of what scaling formats you intend to support?

typename LayoutSFA,
typename LayoutSFB,
typename ScaleConfig>
__global__ void prepare_grouped_gemm_data_sm100(
Collaborator:
why does this need to be separate? Are there enough changes to justify that? prepare_grouped_gemm_data has been changed recently to relax restrictions (e.g. CUTLASS 4 no longer requires K > 0).

scale.dim(),
"D, arg ",
arg_idx);
mat_a.size(-1) % 128 == 0,
Collaborator:

oh wow 128 is a big multiplier, is there a reference for why it's required?

Collaborator:

Tile shapes are large on sm90 too, that doesn't translate into input shape requirements

GroupMMInputMatrixType input_matrix_type;
};

GroupCountInfo get_group_count(
Collaborator:
this function doesn't do what it says

type = GroupMMInputMatrixType::GroupMMInputMatrixType_MatrixA_2D_MatrixB_2D;

// stack on the K dimension
K = K / group_count;
Collaborator:
this is super confusing, why would you do that? What does this K even mean, an average K of the grouped matrix multiply? Same for the other average values.

TORCH_CHECK(
scale.is_contiguous(), "scale must be contiguous for arg ", arg_idx);
scale_a.dim() == 3,
Collaborator:
without a description of what you expect scale to be, expecting a 3d scale for a matrix that can be either 2d or 3d doesn't seem correct

@drisspg (Contributor) left a comment:

Can you make sure you are rebased on top of #158037? There might be conflicts.

TORCH_CHECK(
scale.is_contiguous(), "scale must be contiguous for arg ", arg_idx);
TORCH_CHECK(
scale.size(0) == mat.size(dim) * ( (info.input_matrix_type == at::cuda::detail::GroupMMInputMatrixType::GroupMMInputMatrixType_MatrixA_2D_MatrixB_2D) ? info.group_count : 1), "scale must have the same length as mat for arg ", arg_idx);
Contributor:

nit: maybe a using expression so that this line is easier to parse


if (!transpose) {
*layout_sfa_ptr =
ScaleConfig::tile_atom_to_shape_SFA(cute::make_shape(M, N, K, 1));
Contributor:

This PR is adding row-wise scaling support, not MX, right?


@eee4017 (Collaborator, Author) commented Jul 17, 2025

sm90 utilizes a 1D tensor for scale factors. It performs row-wise and column-wise broadcasting across the A and B matrices; see the code here. However, sm100 requires a more explicit scale factor layout, as documented here.

I've tried to achieve compatibility with the existing sm90 scale factor tensor shape; I think Sm100BlockwiseScaleConfig<1,1,128,MN,MN> most closely mimics the row-wise broadcasting used on sm90. However, my performance benchmarks revealed that this ScaleConfig leads to a performance penalty.

Thus, I basically use the ScaleConfig from sgl-kernel for now, but this requires a size multiple of 128, as you can see in the code. Other layouts may need to be tuned carefully to ensure maximum performance.
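To make the shape difference concrete, a small helper (hypothetical, for illustration only) that spells out the per-group scale shapes implied by the two schemes discussed above:

```python
# Hypothetical helper, not part of the PR: per-group scale-factor shapes.
def rowwise_scale_shapes(M, N, K):
    # sm90-style per-row scaling: one scale per row of A and per column of B.
    return (M, 1), (1, N)

def blockwise_scale_shapes(M, N, K):
    # sm100 path in this PR: 1x128 blocks for A, 128x128 blocks for B.
    # The kernel requires K % 128 == 0 (hence the check in the diff); N is
    # rounded up here only as a defensive assumption.
    assert K % 128 == 0
    return (M, K // 128), (K // 128, (N + 127) // 128)
```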

@drisspg (Contributor) commented Jul 18, 2025

#157950 might provide more clarity. I am curious as to what specific scaling strategy this adds support for on Blackwell.

@eee4017 (Collaborator, Author) commented Jul 18, 2025

I think the scale config for the current implementation is scale_a: [M, K//128], scale_b: [K//128, N//128].

@eee4017 eee4017 force-pushed the fralin/scaled_grouped_mm_sm100 branch from 567e707 to ec3835a on August 6, 2025 15:06

@unittest.skipIf(TEST_WITH_ROCM, "ROCm doesn't support CUTLASS")
@unittest.skipIf(not SM100OrLater, "Grouped gemm supported on SM100")
def test_scaled_grouped_gemm_3d_3d_sm100(self):
eee4017 (Collaborator, Author):
Since _scaled_mm is not yet available for SM100, the tests use a block-wise emulated reference implementation for comparison.
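For context, a hedged sketch of what such a block-wise emulated reference could look like for a single group, assuming row-major [M, K] / [K, N] inputs with K and N divisible by 128; the actual test helper in the PR may differ:

```python
import torch

def emulated_blockwise_mm(a_fp8, b_fp8, scale_a, scale_b, out_dtype=torch.bfloat16):
    # Dequantize in float32 using the block-wise scales, then run a plain matmul.
    a = a_fp8.to(torch.float32)  # [M, K]
    b = b_fp8.to(torch.float32)  # [K, N]
    # scale_a: [M, K // 128], each scale covers a 1x128 block of A.
    a = a * scale_a.repeat_interleave(128, dim=1)
    # scale_b: [K // 128, N // 128], each scale covers a 128x128 block of B.
    b = b * scale_b.repeat_interleave(128, dim=0).repeat_interleave(128, dim=1)
    return (a @ b).to(out_dtype)
```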

Contributor:

Can you elaborate on which version (scaling strategy + input types) is not supported on sm100? We should have pretty good coverage now.

@ngimel (Collaborator) commented Aug 7, 2025:

Then I question the wisdom of enabling DeepSeek-like scaling for _scaled_grouped_mm on Blackwell if just _scaled_mm is not supported. Should we start by supporting mx scales in grouped mm?

Labels
module: cuda · open source · topic: not user facing · triaged