
[CI] Switch ROCm MI300 GitHub Actions workflows from 2-GPU to 1-GPU runners #158882


Open · wants to merge 13 commits into main

Conversation

@deedongala (Contributor) commented on Jul 22, 2025

Updated .github/actionlint.yaml to replace linux.rocm.gpu.mi300.2 with linux.rocm.gpu.mi300.1 in the supported runner list

Modified all affected workflows (inductor-perf-test-nightly-rocm.yml, inductor-periodic.yml, inductor-rocm-mi300.yml, and rocm-mi300.yml) to run jobs on 1-GPU MI300 runners instead of 2-GPU runners
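For illustration, the runner swap amounts to changing the label in each workflow's test matrix; a hypothetical excerpt (the config name and shard counts are placeholders, not the exact diff):

test-matrix: |
  { include: [
    { config: "default", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.mi300.1" },  # was: linux.rocm.gpu.mi300.2
    { config: "default", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.mi300.1" },  # was: linux.rocm.gpu.mi300.2
  ]}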

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd


pytorch-bot bot commented Jul 22, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158882

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9022e8e with merge base 8d3d1c8:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


linux-foundation-easycla bot commented Jul 22, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@pytorch-bot added labels: module: rocm, topic: not user facing on Jul 22, 2025
@jeffdaily added labels: ciflow/rocm-mi300, ciflow/periodic-rocm-mi300 on Jul 24, 2025
@deedongala marked this pull request as ready for review on Jul 24, 2025 21:16
@deedongala requested a review from a team as a code owner on Jul 24, 2025 21:16
@pytorch-bot removed labels: ciflow/rocm-mi300, ciflow/periodic-rocm-mi300 on Jul 24, 2025
@jeffdaily added labels: ciflow/rocm-mi300, ciflow/periodic-rocm-mi300 on Jul 24, 2025
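The review comment below attaches to this excerpt of the existing GPU-count check in the test workflow; the enclosing if-condition and the $msg assignment sit just above the excerpted lines: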
echo "Error: only 1 GPU detected, at least 2 GPUs are needed for distributed jobs"
echo "$msg"
exit 1
fi
@jithunnair-amd (Collaborator) commented on Jul 24, 2025

@saienduri @jeffdaily How about we add an equivalent check in https://github.com/pytorch/pytorch/blob/main/.github/workflows/_rocm-test.yml which checks the matrix.config value to ensure that distributed jobs have 4 GPUs visible? That was the main reason for introducing this check.
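A minimal sketch of what such a step in _rocm-test.yml could look like, assuming rocm-smi is on the runner's PATH and that distributed configs are identifiable from matrix.config; the step name, detection command, and GPU threshold are illustrative assumptions, not the committed change:

- name: Check GPU count for distributed configs  # hypothetical step
  if: ${{ contains(matrix.config, 'distributed') }}
  shell: bash
  run: |
    # Count GPUs reported by rocm-smi; grep -c prints 0 on no match,
    # and '|| true' keeps its nonzero exit status from failing the step
    ngpu=$(rocm-smi --showid | grep -c '^GPU\[' || true)
    if [[ "$ngpu" -lt 4 ]]; then
      echo "Error: only $ngpu GPU(s) visible; distributed jobs need at least 4 GPUs"
      exit 1
    fi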

@jithunnair-amd changed the title from "Switch ROCm MI300 GitHub Actions workflows from 2-GPU to 1-GPU runners" to "[CI] Switch ROCm MI300 GitHub Actions workflows from 2-GPU to 1-GPU runners" on Jul 24, 2025
@pytorch-bot added the ciflow/rocm label on Jul 24, 2025
@saienduri (Collaborator) commented
@deedongala Can you please rebase, add a commit to _rocm-test.yml that checks for multiple GPUs when the matrix config is "distributed", and still keep both labels (linux.rocm.gpu.gfx942.1 and linux.rocm.gpu.gfx942.2) in actionlint?
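Keeping both labels in .github/actionlint.yaml would look roughly like this, an excerpt under actionlint's self-hosted-runner config schema with surrounding label entries elided:

self-hosted-runner:
  # both runner variants stay listed so actionlint accepts either label in runs-on
  labels:
    - linux.rocm.gpu.gfx942.1
    - linux.rocm.gpu.gfx942.2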

@pytorch-bot removed labels: ciflow/rocm, ciflow/rocm-mi300, ciflow/periodic-rocm-mi300 on Aug 12, 2025
Labels: module: rocm, open source, topic: not user facing
6 participants