
[CI] Migrate focal (ubuntu 20.04) images to jammy (ubuntu 22.04) #154437


Closed
atalman wants to merge 4 commits into main from focal_jammy

Conversation

atalman
Contributor

atalman commented May 27, 2025

Fixes #154157

Inductor workflows were moved from focal to jammy here: #154153

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k


pytorch-bot bot commented May 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154437

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 84 Pending

As of commit 569ad47 with merge base 523b637:

NEW FAILURE - The following job has failed:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot added the oncall: distributed and topic: not user facing labels on May 27, 2025
fi

if [ -n "${UBUNTU_VERSION}" ]; then
OS="ubuntu"
elif [ -n "${CENTOS_VERSION}" ]; then
OS="centos"
Contributor Author

There should be no CentOS in CI/CD anymore
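
For context, a minimal sketch of what the OS-detection block in .ci/docker/build.sh could look like once the CentOS branch is dropped (illustrative simplification, not the exact diff in this PR):

```bash
# Illustrative sketch: with CentOS gone, only the Ubuntu path remains.
if [ -n "${UBUNTU_VERSION}" ]; then
  OS="ubuntu"
else
  echo "Unable to derive operating system: UBUNTU_VERSION is not set" >&2
  exit 1
fi
```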

@@ -370,14 +361,6 @@ esac

tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')

#when using cudnn version 8 install it separately from cuda
if [[ "$image" == *cuda* && ${OS} == "ubuntu" ]]; then
IMAGE_NAME="nvidia/cuda:${CUDA_VERSION}-cudnn${CUDNN_VERSION}-devel-ubuntu${UBUNTU_VERSION}"
Contributor Author

Nvidia base images were used for the focal builds; the jammy builds no longer use them.
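
A rough sketch of the base-image selection being discussed here, assuming (purely for illustration) that jammy builds start from a plain Ubuntu base and install CUDA/cuDNN through the image's own install scripts instead of the nvidia/cuda devel image:

```bash
# Illustrative only: keep the nvidia/cuda devel base for focal, plain Ubuntu for jammy.
if [[ "$image" == *cuda* && "${UBUNTU_VERSION}" == "20.04" ]]; then
  IMAGE_NAME="nvidia/cuda:${CUDA_VERSION}-cudnn${CUDNN_VERSION}-devel-ubuntu${UBUNTU_VERSION}"
else
  # Assumption: the CUDA toolchain is installed later by the Dockerfile's own install scripts.
  IMAGE_NAME="ubuntu:${UBUNTU_VERSION}"
fi
```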

@huydhn
Contributor

huydhn commented May 31, 2025

@pytorchbot rebase -b main

Contributor

ZainRizvi left a comment


What do you think about splitting this into multiple separate PRs to de-risk the switch:

PR 1: Builds the jammy docker images for all desired configs
PR 2: Switches over all workflows to use the jammy images
PR 3: Stops building the focal docker images

It'll cut down on the blast radius in case something goes wrong. For example, it'll let us make sure that all jammy docker builds succeed before we start taking CI dependencies on them.

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased focal_jammy onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout focal_jammy && git pull --rebase)

@atalman
Contributor Author

atalman commented Jun 2, 2025

@ZainRizvi I believe that if the docker build + CI is successful, it's a lot easier to do in one shot, as was done here: #154153

Building the Docker images by themselves is not really useful, since we will most likely need to change them once we start migrating jobs over. To minimize the blast radius, we can try to migrate in smaller chunks.

atalman force-pushed the focal_jammy branch 2 times, most recently from 56c6445 to 2ad7c7d on June 2, 2025 23:17
atalman added the ciflow/trunk, ciflow/periodic, and ciflow/slow labels on Jun 2, 2025
@atalman
Contributor Author

atalman commented Jun 3, 2025

@pytorchmergebot rebase -b main

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased focal_jammy onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout focal_jammy && git pull --rebase)

Contributor

huydhn left a comment


LGTM! Do you think we could keep just one version in CI going forward? For example, everything uses 22.04 (jammy) now, and later we upgrade it all to 24.04 (the next LTS).

huydhn added the keep-going label on Jun 4, 2025
@atalman
Contributor Author

atalman commented Jun 5, 2025

The following errors already exist on trunk:
test_linalg.py::TestLinalgCUDA::test_svd_memory_allocation_cuda_complex128 (GH job link, HUD commit link)

and

test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_addmm_reduced_precision_size_10000_backend_cublaslt_cuda_float16 (GH job link, HUD commit link)

@atalman
Contributor Author

atalman commented Jun 5, 2025

@pytorchmergebot merge -f "lint is green, other jobs were already tested"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.
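
For reference, the less forceful alternative the bot recommends would be invoked like this (usage sketch based on the flags named above):

```
@pytorchbot merge -i
```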

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@malfet
Contributor

malfet commented Jun 6, 2025

@pytorchbot revert -m "I could be wrong, but looks like it broke slow jobs, see https://hud.pytorch.org/hud/pytorch/pytorch/b0fbbef1361ccaab8a5aec8e7cd62150e7b361de/1?per_page=50&name_filter=slow&mergeEphemeralLF=true" -c nosignal

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

Reverting PR 154437 failed

Reason: Command git -C /home/runner/work/pytorch/pytorch revert --no-edit cd361fc247a9abdbe9851867e31ac3cefcff299e returned non-zero exit code 1

Auto-merging .ci/docker/build.sh
CONFLICT (content): Merge conflict in .ci/docker/build.sh
Auto-merging .github/workflows/docker-builds.yml
CONFLICT (content): Merge conflict in .github/workflows/docker-builds.yml
error: could not revert cd361fc247a... [CI] Migrate focal (ubuntu 20.04) images to jammy (ubuntu 22.04) (#154437)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git revert --continue".
hint: You can instead skip this commit with "git revert --skip".
hint: To abort and get back to the state before "git revert",
hint: run "git revert --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Details for Dev Infra team. Raised by workflow job.
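
For anyone who ends up landing the revert by hand, the git hints above amount to roughly the following (sketch only; the two conflicting files are the ones reported above):

```bash
# Re-run the revert locally and resolve the two reported conflicts by hand.
git revert --no-edit cd361fc247a9abdbe9851867e31ac3cefcff299e
# After editing the conflicted files, mark them resolved and finish the revert.
git add .ci/docker/build.sh .github/workflows/docker-builds.yml
git revert --continue
```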

vijayabhaskar-ev pushed a commit to vijayabhaskar-ev/pytorch that referenced this pull request Jun 22, 2025
vijayabhaskar-ev pushed a commit to vijayabhaskar-ev/pytorch that referenced this pull request Jul 14, 2025
Labels
ciflow/periodic - Trigger jobs ran periodically on master (periodic.yml) on the PR
ciflow/slow
ciflow/trunk - Trigger trunk jobs on your pull request
keep-going - Don't stop on first failure, keep running tests until the end
Merged
oncall: distributed - Add this issue/PR to distributed oncall triage queue
topic: not user facing - topic category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Move all CI/CD workflows from focal to jammy
8 participants