[CI] Switch ROCm MI300 GitHub Actions workflows from 2-GPU to 1-GPU runners #158882
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158882
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 9022e8e with merge base 8d3d1c8.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
echo "Error: only 1 GPU detected, at least 2 GPUs are needed for distributed jobs"
echo "$msg"
exit 1
fi
@saienduri @jeffdaily How about we add an equivalent check in https://github.com/pytorch/pytorch/blob/main/.github/workflows/_rocm-test.yml that looks at the matrix.config value and ensures that distributed jobs have 4 GPUs visible? That was the main reason for introducing this check.
@deedongala can you please rebase and add a commit that checks for multiple GPUs if the matrix config is "distributed" in
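For illustration, the check requested above could look roughly like the sketch below. This is not the code that landed in the PR: the function name, its arguments, and the way the test config and GPU count reach it are all assumptions, and the minimum GPU count is left as a parameter because the thread mentions both 2 and 4.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): fail only when a distributed
# test config is running on a machine with too few visible GPUs.
# Usage: require_gpus_for_config <test_config> <detected_gpus> <min_gpus>
require_gpus_for_config() {
  local test_config="$1" detected="$2" min_gpus="$3"

  # Non-distributed configs can run on a single GPU, so nothing to check.
  if [[ "$test_config" != *distributed* ]]; then
    return 0
  fi

  # Distributed configs need at least <min_gpus> devices.
  if (( detected < min_gpus )); then
    echo "Error: only ${detected} GPU(s) detected, at least ${min_gpus} GPUs are needed for distributed jobs" >&2
    return 1
  fi
  return 0
}
```

In a workflow step this might be called as `require_gpus_for_config "$TEST_CONFIG" "$NGPU" 2 || exit 1`, where `TEST_CONFIG` would be populated from `matrix.config` and `NGPU` from whatever device count the runner reports. Both variable names are placeholders.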
Updated .github/actionlint.yaml to replace linux.rocm.gpu.mi300.2 with linux.rocm.gpu.mi300.1 in the supported runner list
Modified all affected workflows (inductor-perf-test-nightly-rocm.yml, inductor-periodic.yml, inductor-rocm-mi300.yml, and rocm-mi300.yml) to run jobs on 1-GPU MI300 runners instead of 2-GPU runners
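One way to confirm that no workflow still points at the old runner label is a fixed-string search over the workflow directory. The helper below is a sketch for that purpose only; the function name is hypothetical and it is not part of the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): list files under a directory that
# still mention a given runner label.
# Usage: find_runner_label <label> <directory>
find_runner_label() {
  local label="$1" dir="$2"
  # -F treats the label as a fixed string, so the dots are not regex wildcards;
  # -l prints only the names of matching files; -r recurses into the directory.
  # grep exits non-zero when nothing matches, which is not an error here.
  grep -rlF -- "$label" "$dir" || true
}
```

Running `find_runner_label linux.rocm.gpu.mi300.2 .github` from the repository root would then print any files that still reference the 2-GPU label, and print nothing once the switch is complete.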
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd