
[vllm in torch ci ][step 1/3] add build logics #159815


Open · wants to merge 40 commits into main

Conversation

@yangw-dev (Contributor) commented Aug 4, 2025

Description

Add the vllm build to the PyTorch CI pipeline.

Details: how we do this

This PR sets up the vllm build logic, which passes around the wheels generated by docker run and docker build. We set up a CLI tool with customized logic for external builds such as vllm. In brief, the vllm build process (a rough sketch follows the list):

  • the torch build generates an sm80/sm90 torch CI whl based on the PR
  • the vllm-build job uses the torch Docker image as the base image and installs the torch whls from step 1
  • the vllm-build job runs docker build and generates the xformers, vllm, and flashinfer whls
  • the artifacts are stored in S3, waiting to be used by the test step [next step]
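
As an illustration only, a minimal sketch of this hand-off; the image name, wheel directory, S3 bucket, and BASE_IMAGE build-arg are all assumptions, not the actual CI values (the real logic lives in the torch_cli build code):

# Hypothetical sketch of the wheel hand-off; image names, paths, and the
# S3 bucket are illustrative, not the actual CI values.
import subprocess

TORCH_CI_IMAGE = "example.registry/pytorch-ci:sm80-sm90"  # assumed torch CI image
WHEEL_DIR = "shared/wheels"                               # assumed output directory

# vllm-build: build on top of the torch CI image, installing the torch whl
# produced by the torch build job (the BASE_IMAGE build-arg is assumed).
subprocess.run([
    "docker", "build",
    "--build-arg", f"BASE_IMAGE={TORCH_CI_IMAGE}",
    "-f", ".github/docker/external/vllm/Dockerfile.torch_nightly",
    "-t", "vllm-build:ci",
    ".",
], check=True)

# Store the generated xformers/vllm/flashinfer whls in S3 for the test step.
subprocess.run([
    "aws", "s3", "cp", WHEEL_DIR,
    "s3://example-artifact-bucket/vllm/", "--recursive",
], check=True)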

Code implementation

This code covers three sections:

  • the dockerfile.base from vllm
  • torch_cli, a CLI tool for builds, plus its configuration yml file
  • the vllm yaml files that run the vllm x torch CI tests

Dockerfile.torch_nightly in .github/docker/external/vllm/

  • Modified vllm's docker_file.torch_nightly to work both for the default image and the torch CI one
  • Added an extra stage that outputs the built whls from the preceding stages (see the sketch after this list)
  • Updated the flashinfer version to align with vllm stable
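
For illustration, a hedged sketch of consuming such an output stage; the stage name export-wheels is hypothetical, while --target and --output are standard BuildKit flags (--output writes the target stage's filesystem to a local directory):

# Hypothetical: extract the built whls by targeting the extra output stage.
# The stage name "export-wheels" is illustrative; --output is standard BuildKit.
import subprocess

subprocess.run([
    "docker", "build",
    "-f", ".github/docker/external/vllm/Dockerfile.torch_nightly",
    "--target", "export-wheels",
    "--output", "type=local,dest=shared/wheels",
    ".",
], check=True)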

We originally made a copy of vllm's Dockerfile.torch_nightly to accelerate development; then an RCE happened last week involving dependencies from external repos such as vllm. So the file stays in torch for now, pending further discussion on installing external repos' dependencies in torch.

Set up the CLI tool scripts/torch_cli

Sets up a CLI tool for torch CI builds; this also establishes a pattern for doing our builds and tests for new features.
It normally comes with a config yml file that stores the specific parameters affecting the build/test results.

pip install -e scripts/torch_cli
python3 -m cli.build --config ".github/configs/vllm.yml" external vllm 

The CLI tool must be run at the root of the pytorch repo.
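
As a sketch of the config-driven flow, here is one way the CLI might load such a yml file; the field names mirror the cfg.* attributes visible in the build-command snippet reviewed below, but the exact schema and keys are an assumption:

# Hypothetical config loading; the field names mirror the cfg.* attributes
# seen in the reviewed docker build snippet, but the real schema may differ.
from dataclasses import dataclass
import yaml

@dataclass
class BuildConfig:
    target: str
    tag_name: str
    vllm_fa_cmake_gpu_arches: str

def load_config(path: str) -> BuildConfig:
    with open(path) as f:
        return BuildConfig(**yaml.safe_load(f))

cfg = load_config(".github/configs/vllm.yml")
print(f"building target {cfg.target} as image {cfg.tag_name}")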

vllm yaml files

vllm.yaml: the main yaml file that sets up the vllm test workflow
_linux-external-build.yml: the build yml file that handles external lib builds such as vllm; these builds depend on the torch CI build from the PR.

Next step

Add the vllm test step.
See the experimental run [not final, just a workable one]: https://github.com/pytorch/pytorch/actions/runs/16759050517/job/47453916017


pytorch-bot bot commented Aug 4, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159815

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 1 Unrelated Failure

As of commit 1184e6e with merge base e16c48a:

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

  • pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge, unstable) (gh) (#158876)
    /var/lib/jenkins/workspace/xla/torch_xla/csrc/runtime/BUILD:476:14: Compiling torch_xla/csrc/runtime/xla_util_test.cpp failed: (Exit 1): gcc failed: error executing CppCompile command (from target //torch_xla/csrc/runtime:xla_util_test) /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 229 arguments skipped)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the "topic: not user facing" topic category label Aug 4, 2025
@yangw-dev changed the title from "setup build logics for vllm x torch ci" to "[vllm in torch ci ][step 1/3] build logics for vllm x torch ci" Aug 4, 2025
@yangw-dev changed the title from "[vllm in torch ci ][step 1/3] build logics for vllm x torch ci" to "[vllm in torch ci ][step 1/3] add build logics" Aug 4, 2025
@yangw-dev requested review from seemethere and huydhn August 5, 2025 20:58
@yangw-dev marked this pull request as ready for review August 5, 2025 21:00
@yangw-dev requested a review from a team as a code owner August 5, 2025 21:00

pytorch-bot bot commented Aug 5, 2025

Warning: Unknown label ciflow/vllm.
Currently recognized labels are:

  • ciflow/binaries
  • ciflow/binaries_libtorch
  • ciflow/binaries_wheel
  • ciflow/triton_binaries
  • ciflow/inductor
  • ciflow/inductor-periodic
  • ciflow/inductor-rocm
  • ciflow/inductor-perf-test-nightly-rocm
  • ciflow/inductor-perf-compare
  • ciflow/inductor-micro-benchmark
  • ciflow/inductor-micro-benchmark-cpu-x86
  • ciflow/inductor-perf-test-nightly-x86-zen
  • ciflow/inductor-cu126
  • ciflow/linux-aarch64
  • ciflow/mps
  • ciflow/nightly
  • ciflow/periodic
  • ciflow/periodic-rocm-mi300
  • ciflow/rocm
  • ciflow/rocm-mi300
  • ciflow/s390
  • ciflow/slow
  • ciflow/trunk
  • ciflow/unstable
  • ciflow/xpu
  • ciflow/torchbench
  • ciflow/op-benchmark
  • ciflow/pull
  • ciflow/h100
  • ciflow/h100-distributed
  • ciflow/win-arm64
  • ciflow/h100-symm-mem
  • ciflow/h100-cutlass-backend

Please add the new label to .github/pytorch-probot.yml

--build-arg vllm_fa_cmake_gpu_arches={cfg.vllm_fa_cmake_gpu_arches} \
--target {cfg.target} \
-t {cfg.tag_name} \
--progress=plain .
@yangw-dev (Contributor, Author) commented Aug 6, 2025


@seemethere

Wondering why this is the statement; is there a reason build.sh says:

# Do not use cache and progress=plain when in CI

Member replied:

For caching: When we had non-ephemeral runners we'd typically run into situations where we'd utilize cached docker images where we didn't want them. This isn't really an issue today.

For progress=plain: If you don't use this you'll have a bunch of garbage output since buildkit does a lot of output sugar for its main output.
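
For illustration, that guidance could look like this in a build wrapper; the wrapper itself is hypothetical, while --no-cache and --progress=plain are standard docker build flags and CI is the conventional environment variable:

# Hypothetical wrapper: in CI, skip the layer cache and emit plain progress
# output so the logs stay readable; outside CI, keep BuildKit's default UI.
import os
import subprocess

cmd = ["docker", "build", "-t", "vllm-build:ci", "."]
if os.environ.get("CI"):
    cmd += ["--no-cache", "--progress=plain"]
subprocess.run(cmd, check=True)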

@seemethere (Member) left a comment

Okay so I appreciate the ambition in this PR, but I think we're going to need to split it up a bit to make it easier for people (like myself) to review:

  • PR1: Introduce the new torch_cli
  • PR2: Implement the vllm specific parts of the torch_cli
  • PR3: Add the workflows on top of this that will run

I'd recommend using ghstack to do this.

@yangw-dev (Contributor, Author) commented Aug 6, 2025

(quoting @seemethere's review comment above)

Sounds good! Thought it was small enough without the test CLI, but will split it even smaller!
