[ROCm] Limit number of values per thread for reductions on three dimensions #159652

Open
wants to merge 1 commit into base: main
Conversation

Contributor
@doru1004 commented Aug 1, 2025

In the current implementation of reductions over three dimensions on AMD GPUs, the number of values per thread is unbounded and can reach the hundreds of thousands for certain tensors, which is bad for performance. This patch fixes the issue by increasing parallelism, which lowers the number of values per thread to a reasonable limit, i.e. fewer than 2048 values per thread. The performance gains can be between 10x and 17x for examples where the number of values per thread was originally very high.
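The idea can be illustrated with a minimal sketch: keep doubling the number of CTAs (thread blocks) assigned to each output until every thread reduces fewer than 2048 values. The names below (`MAX_VALUES_PER_THREAD`, `limit_values_per_thread`, `ctas_per_output`) are illustrative assumptions, not the actual identifiers in PyTorch's C++ reduction-configuration code.

```python
# Illustrative sketch of the capping heuristic described above.
# All names here are hypothetical; the real change lives in PyTorch's
# C++ reduction-config logic for ROCm.

MAX_VALUES_PER_THREAD = 2048  # empirically chosen threshold from the PR


def limit_values_per_thread(reduction_size, threads_per_block, ctas_per_output=1):
    """Double ctas_per_output until each thread reduces < 2048 values."""

    def values_per_thread():
        # Ceiling division: values each thread must reduce given the
        # current split of the reduction across blocks.
        return -(-reduction_size // (threads_per_block * ctas_per_output))

    while values_per_thread() >= MAX_VALUES_PER_THREAD:
        ctas_per_output *= 2
    return ctas_per_output


# Example: a ~5M-element reduction with 512 threads per block starts at
# ~9900 values per thread; three doublings bring it under the cap.
print(limit_values_per_thread(5_079_670, 512))
```

Since the values per thread roughly halve with each doubling, only a handful of doublings are needed even for very large reductions.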

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

@doru1004 doru1004 requested review from eqy and syed-ahmed as code owners August 1, 2025 16:35

pytorch-bot bot commented Aug 1, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159652

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 70792b5 with merge base 1465757:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: cuda release notes category label Aug 1, 2025
Contributor

@petrex left a comment

Question: was the choice of 2048 as the threshold for "values per thread" purely heuristic? It would be helpful to add a comment or reference explaining why this value was chosen and whether it is empirically optimal.

Contributor

@petrex left a comment

Another question: is there an upper bound on `config.ctas_per_output *= 2`?

@jerrymannil
Contributor

jerrymannil commented Aug 1, 2025

Reproducer:

import time
import torch

shapes = [
    (1, 2, 3, 420, 648, 128),
    (1, 2, 3, 420, 648, 128),
    (5079670, 128),
]

dims = [
    (3, 4),
    (-3, -2),
    (1,),  # one-element tuple; plain (1) would be the int 1
]

for i, shape in enumerate(shapes):
    x = torch.randn(shape, device='cuda', dtype=torch.float)
    # Warm-up iterations so the timed loop excludes one-time costs.
    for _ in range(20):
        _ = torch.sum(x, dims[i])
    torch.cuda.synchronize()

    start_time = time.time()
    for _ in range(100):
        _ = torch.sum(x, dims[i])
    torch.cuda.synchronize()
    end_time = time.time()
    mean_time = (end_time - start_time) / 100
    print(f"Avg time for shape {shape}: {mean_time * 1e6:.2f} us")

Before:
Avg time for shape (1, 2, 3, 420, 648, 128): 4408.10 us
Avg time for shape (1, 2, 3, 420, 648, 128): 4428.89 us
Avg time for shape (5079670, 128): 1458.86 us

After:
Avg time for shape (1, 2, 3, 420, 648, 128): 223.73 us
Avg time for shape (1, 2, 3, 420, 648, 128): 218.85 us
Avg time for shape (5079670, 128): 1461.55 us

@pruthvistony pruthvistony added topic: not user facing topic category rocm This tag is for PRs from ROCm team ciflow/rocm Trigger "default" config CI on ROCm ciflow/inductor-rocm Trigger "inductor" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 ciflow/periodic-rocm-mi300 Trigger "distributed" config CI on ROCm MI300 and removed release notes: cuda release notes category labels Aug 1, 2025

pytorch-bot bot commented Aug 1, 2025

To add the ciflow label ciflow/periodic-rocm-mi300 please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/periodic-rocm-mi300 Trigger "distributed" config CI on ROCm MI300 label Aug 1, 2025
@pruthvistony pruthvistony added the ciflow/periodic-rocm-mi300 Trigger "distributed" config CI on ROCm MI300 label Aug 1, 2025
@doru1004
Contributor Author

doru1004 commented Aug 4, 2025

Question : Was the choice of 2048 as the threshold for "values per thread" purely heuristic? It would be helpful to add a comment or reference explaining why this value was chosen and whether it is empirically optimal.

It was indeed empirically determined. I'll add a comment.

@doru1004
Contributor Author

doru1004 commented Aug 4, 2025

Another question: is there an upper bound on `config.ctas_per_output *= 2`?

Based on the previous semantics, that does not seem to be the case.
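A rough back-of-the-envelope check (hypothetical numbers and names, not code from the patch): because each doubling of the CTAs per output roughly halves the values per thread, the number of doublings is logarithmically bounded by the initial values per thread, so the loop terminates quickly even without an explicit cap.

```python
import math

# Hypothetical illustration of why the doubling loop is implicitly
# bounded: each doubling roughly halves the values per thread, so the
# number of doublings is about ceil(log2(initial / cap)).


def doublings_needed(values_per_thread, cap=2048):
    """Count doublings until the per-thread workload falls below cap."""
    n = 0
    while values_per_thread >= cap:
        values_per_thread = math.ceil(values_per_thread / 2)
        n += 1
    return n


print(doublings_needed(300_000))
```

For example, even 300,000 values per thread needs only 8 doublings (a 256x increase in CTAs per output) to fall below 2048.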

@pytorch-bot pytorch-bot bot removed ciflow/rocm Trigger "default" config CI on ROCm ciflow/inductor-rocm Trigger "inductor" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 ciflow/periodic-rocm-mi300 Trigger "distributed" config CI on ROCm MI300 labels Aug 4, 2025
@janeyx99 janeyx99 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Aug 4, 2025
@doru1004 doru1004 changed the title [AMDGPU] Limit number of values per thread for reductions on three dimensions [ROCm] Limit number of values per thread for reductions on three dimensions Aug 5, 2025
@pytorch-bot pytorch-bot bot added the module: rocm AMD GPU support for Pytorch label Aug 5, 2025
jerrymannil added a commit to ROCm/pytorch that referenced this pull request Aug 6, 2025
…nsions (#2460)

In the current implementation of reductions in three dimensions for AMD
GPUs the number of values per thread is unbounded and can end up being
in the hundreds of thousands for certain tensors. This of course is bad
for performance. This patch fixes this issue by increasing the
parallelism and thus lowering the number of value per thread to
reasonable limits i.e. less than 2048 values per thread. The performance
gains can be between 10x-17x for certain examples where the number of
values per thread was originally very high.

cherry-pick of pytorch#159652
okakarpa pushed a commit to ROCm/pytorch that referenced this pull request Aug 6, 2025
…nsions (#2460)

cherry-pick of pytorch#159652
Labels
module: rocm AMD GPU support for Pytorch open source rocm This tag is for PRs from ROCm team topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
6 participants