[ROCm] Use opportunistic fastatomics based on heuristics #159430

Open

jerrymannil wants to merge 1 commit into pytorch:main from jerrymannil:patch-1

Contributor

jerrymannil commented Jul 29, 2025

Opportunistic fast atomics work better with small sizes, since lanes are then more likely to perform atomics on the same address.
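
To make the "same address" argument concrete, here is a minimal CPU-side sketch, not the ROCm kernel from this PR. It assumes a wavefront of 64 lanes and that each lane targets a uniformly random address; the function name `shared_address_fraction` and the sampled sizes are hypothetical and chosen only for illustration. It estimates how often a lane shares its target address with another lane in the same wavefront as the number of distinct addresses grows.

```python
import random

# Assumption for this sketch: 64 lanes per wavefront.
WAVEFRONT_SIZE = 64


def shared_address_fraction(num_addresses, num_wavefronts=2000, seed=0):
    """Estimate the fraction of lanes whose target address is also
    targeted by at least one other lane in the same wavefront."""
    rng = random.Random(seed)
    shared = 0
    for _ in range(num_wavefronts):
        # Each lane picks a uniformly random target address (assumption).
        targets = [rng.randrange(num_addresses) for _ in range(WAVEFRONT_SIZE)]
        counts = {}
        for t in targets:
            counts[t] = counts.get(t, 0) + 1
        # Count lanes that collide with at least one other lane.
        shared += sum(c for c in counts.values() if c > 1)
    return shared / (num_wavefronts * WAVEFRONT_SIZE)


if __name__ == "__main__":
    for n in (16, 1_024, 1_632_960):
        frac = shared_address_fraction(n)
        print(f"{n:>9} addresses: {frac:.4f} of lanes share an address")
```

Under these assumptions the shared fraction is close to 1 for a handful of addresses and falls toward 0 as the address range grows, which is the intuition behind choosing the atomic path based on size.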

Co-author: @amd-hhashemi

Reproducer:

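import time
import torch

# On ROCm builds of PyTorch, the 'cuda' device maps to the AMD GPU.
x = torch.randn((1_632_960, 128), device='cuda', dtype=torch.float)
ind = torch.randint(0, x.size(0), size=(5_079_670,), device='cuda')
src = torch.randn((5_079_670, 128), device='cuda', dtype=torch.float)

# Warm-up iterations, excluded from the measurement.
for _ in range(20):
    x.index_add_(0, ind, src)
# Wait for the warm-up kernels to finish before starting the timer.
torch.cuda.synchronize()

start_time = time.perf_counter()
for _ in range(100):
    x.index_add_(0, ind, src)
# Kernel launches are asynchronous; synchronize before reading the clock.
torch.cuda.synchronize()
end_time = time.perf_counter()
mean_time = (end_time - start_time) / 100
print(f"Avg time for index_add_: {mean_time * 1e6:.2f} us")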

Perf numbers:

Before:
Avg time for index_add_: 25652.16 us

After:
Avg time for index_add_: 2675.15 us
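
For a quick read of these two numbers, the short calculation below derives the speedup and the relative time reduction. It only uses the timings reported above; the variable names are illustrative.

```python
# Timings reported in the PR description, in microseconds.
before_us = 25652.16
after_us = 2675.15

speedup = before_us / after_us
reduction = 1 - after_us / before_us

print(f"Speedup: {speedup:.2f}x, time reduced by {reduction:.1%}")
```

That works out to roughly a 9.6x speedup, or about a 90% reduction in average time per `index_add_` call for this reproducer.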

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

jerrymannil requested review from eqy and syed-ahmed as code owners

July 29, 2025 23:29

pytorch-bot bot commented Jul 29, 2025

❌ 1 Cancelled Job, 1 Unrelated Failure

As of commit 44e97dd with merge base 1ebcba4:

CANCELLED JOB - The following job was cancelled. Please retry:

Apply lint suggestions (gh)

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, lf.linux.12xlarge, unstable) (gh) (#158876)
sccache: error: couldn't connect to server

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added module: rocm release notes: cuda labels

jerrymannil marked this pull request as draft

July 29, 2025 23:29

pytorchbot added the open source label

pruthvistony added topic: not user facing ciflow/periodic ciflow/rocm ciflow/inductor-rocm ciflow/rocm-mi300 ciflow/periodic-rocm-mi300 and removed release notes: cuda labels

pytorch-bot bot commented Jul 30, 2025

To add the ciflow label ciflow/rocm please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

pytorch-bot bot commented Jul 30, 2025

To add the ciflow label ciflow/periodic please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

pytorch-bot bot commented Jul 30, 2025

To add the ciflow label ciflow/inductor-rocm please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

pytorch-bot bot commented Jul 30, 2025

To add the ciflow label ciflow/rocm-mi300 please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

pytorch-bot bot commented Jul 30, 2025

To add the ciflow label ciflow/periodic-rocm-mi300 please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

pytorch-bot bot removed ciflow/rocm ciflow/periodic ciflow/inductor-rocm ciflow/rocm-mi300 ciflow/periodic-rocm-mi300 labels

pruthvistony added ciflow/periodic rocm ciflow/rocm ciflow/inductor-rocm ciflow/rocm-mi300 ciflow/periodic-rocm-mi300 labels

pytorch-bot bot removed ciflow/periodic ciflow/rocm ciflow/inductor-rocm ciflow/periodic-rocm-mi300 ciflow/rocm-mi300 labels

pruthvistony requested review from jeffdaily and jithunnair-amd

July 30, 2025 14:52

pytorch-bot bot added the ciflow/rocm label

Collaborator

jithunnair-amd commented Jul 31, 2025

@jerrymannil Can you please mention if there's a unit test or workload you used to qualify and quantify the improvement?

jerrymannil mentioned this pull request

[release/2.7] [ROCm] Use opportunistic fastatomics based on heuristics ROCm/pytorch#2438

Merged

Collaborator

pruthvistony commented Jul 31, 2025

The ticket comments show the results: https://ontrack-internal.amd.com/browse/SWDEV-546136?focusedId=19870475&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-19870475

@jerrymannil
Please update the PR description with all the perf results and testing done.

pruthvistony approved these changes

pruthvistony marked this pull request as ready for review

July 31, 2025 18:54

Contributor Author

jerrymannil commented Jul 31, 2025

@pruthvistony @jithunnair-amd
Updated PR description with reproducer and numbers

pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request


[release/2.7] [ROCm] Use opportunistic fastatomics based on heuristics (#2438)

faae1f3

* Merge of pytorch#159430
* Opportunistic fast atomics work better with small sizes, since there
is more chance of lanes doing atomics on the same address

Reproducer:
```
import time
import torch

x = torch.randn((1_632_960, 128), device='cuda', dtype=torch.float)
ind = torch.randint(0, x.size(0), size=(5_079_670,), device='cuda')
src = torch.randn((5_079_670, 128), device='cuda', dtype=torch.float)

for _ in range(20):
    x.index_add_(0, ind, src)

start_time = time.time()
for i in range(100):
    x.index_add_(0, ind, src)
torch.cuda.synchronize()
end_time = time.time()
mean_time = (end_time - start_time)/100
print(f"Avg time for index_add_: {mean_time * 1e6:.2f} us")
```

Perf numbers:
```
Before:
Avg time for index_add_: 25652.16 us

After:
Avg time for index_add_: 2675.15 us
```

Co-author: @amd-hhashemi

pruthvistony requested review from malfet and atalman

July 31, 2025 21:35

Contributor Author

jerrymannil commented Aug 1, 2025

@pytorchbot rebase

Collaborator

pytorchmergebot commented Aug 1, 2025

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here


[ROCm] Use opportunistic fastatomics based on heuristics

44e97dd

* Opportunistic fast atomics work better with small sizes, since there is more chance of lanes doing atomics on the same address

Collaborator

pytorchmergebot commented Aug 1, 2025

Successfully rebased patch-1 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout patch-1 && git pull --rebase)

pytorchmergebot force-pushed the patch-1 branch from 5cf1e0f to 44e97dd Compare

August 1, 2025 17:00

pytorch-bot bot removed the ciflow/rocm label

Contributor Author

jerrymannil commented Aug 1, 2025

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label

pytorchmergebot added the merging label

Collaborator

pytorchmergebot commented Aug 1, 2025

Merge failed

Reason: Approvers from one of the following sets are needed:

superuser (pytorch/metamates)
Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10, ...)
Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet, ...)

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

pytorchmergebot removed the merging label

Reviewers

pruthvistony pruthvistony approved these changes

eqy Awaiting requested review from eqy eqy is a code owner

syed-ahmed Awaiting requested review from syed-ahmed syed-ahmed is a code owner

jeffdaily Awaiting requested review from jeffdaily

jithunnair-amd Awaiting requested review from jithunnair-amd

malfet Awaiting requested review from malfet

atalman Awaiting requested review from atalman

Labels

ciflow/trunk module: rocm open source rocm topic: not user facing