
[vllm in torch ci ][step 1/3] add build logics #159815


Open · wants to merge 40 commits into main

Conversation

@yangw-dev (Contributor) commented Aug 4, 2025

Description

Add the vllm build to the PyTorch CI pipeline.

Details: how we do this

This PR sets up the vllm build logic, which passes around the wheels generated by docker run and docker build. We set up a CLI tool with customized logic for external builds such as vllm. In brief, the vllm build process (a rough sketch follows the list):

  • the torch build generates an sm80/sm90 torch CI whl based on the PR
  • the vllm-build job uses the torch Docker image as the base image and installs the torch whls from step 1
  • the vllm-build job runs docker build and generates the xformers, vllm, and flashinfer whls
  • the artifacts are stored in S3, waiting to be used by the test step [next step]
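
As an illustration only, a minimal sketch of this hand-off; the image name, wheel directory, S3 bucket, and BASE_IMAGE build-arg are all assumptions, not the actual CI values (the real logic lives in the torch_cli build code):

# Hypothetical sketch of the wheel hand-off; image names, paths, and the
# S3 bucket are illustrative, not the actual CI values.
import subprocess

TORCH_CI_IMAGE = "example.registry/pytorch-ci:sm80-sm90"  # assumed torch CI image
WHEEL_DIR = "shared/wheels"                               # assumed output directory

# vllm-build: build on top of the torch CI image, installing the torch whl
# produced by the torch build job (the BASE_IMAGE build-arg is assumed).
subprocess.run([
    "docker", "build",
    "--build-arg", f"BASE_IMAGE={TORCH_CI_IMAGE}",
    "-f", ".github/docker/external/vllm/Dockerfile.torch_nightly",
    "-t", "vllm-build:ci",
    ".",
], check=True)

# Store the generated xformers/vllm/flashinfer whls in S3 for the test step.
subprocess.run([
    "aws", "s3", "cp", WHEEL_DIR,
    "s3://example-artifact-bucket/vllm/", "--recursive",
], check=True)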

Code implementation

This code covers three sections:

  • the dockerfile.base from vllm
  • torch_cli, a CLI tool for builds, plus its configuration yml file
  • the vllm yaml files that run the vllm x torch CI tests

Dockerfile.torch_nightly in .github/docker/external/vllm/

  • Modified vllm's docker_file.torch_nightly to work both for the default image and the torch CI one
  • Added an extra stage that outputs the built whls from the preceding stages (see the sketch after this list)
  • Updated the flashinfer version to align with vllm stable
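
For illustration, a hedged sketch of consuming such an output stage; the stage name export-wheels is hypothetical, while --target and --output are standard BuildKit flags (--output writes the target stage's filesystem to a local directory):

# Hypothetical: extract the built whls by targeting the extra output stage.
# The stage name "export-wheels" is illustrative; --output is standard BuildKit.
import subprocess

subprocess.run([
    "docker", "build",
    "-f", ".github/docker/external/vllm/Dockerfile.torch_nightly",
    "--target", "export-wheels",
    "--output", "type=local,dest=shared/wheels",
    ".",
], check=True)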

We originally made a copy of vllm's Dockerfile.torch_nightly to accelerate development; then an RCE happened last week involving dependencies from external repos such as vllm. So the file stays in torch for now, pending further discussion on installing external repos' dependencies in torch.

Set up the CLI tool scripts/torch_cli

Sets up a CLI tool for torch CI builds; this also establishes a pattern for doing our builds and tests for new features.
It normally comes with a config yml file that stores the specific parameters affecting the build/test results.

pip install -e scripts/torch_cli
python3 -m cli.build --config ".github/configs/vllm.yml" external vllm 

The CLI tool must be run at the root of the pytorch repo.
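
As a sketch of the config-driven flow, here is one way the CLI might load such a yml file; the field names mirror the cfg.* attributes visible in the build-command snippet reviewed below, but the exact schema and keys are an assumption:

# Hypothetical config loading; the field names mirror the cfg.* attributes
# seen in the reviewed docker build snippet, but the real schema may differ.
from dataclasses import dataclass
import yaml

@dataclass
class BuildConfig:
    target: str
    tag_name: str
    vllm_fa_cmake_gpu_arches: str

def load_config(path: str) -> BuildConfig:
    with open(path) as f:
        return BuildConfig(**yaml.safe_load(f))

cfg = load_config(".github/configs/vllm.yml")
print(f"building target {cfg.target} as image {cfg.tag_name}")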

vllm yaml files

vllm.yaml: the main yaml file that sets up the vllm test workflow
_linux-external-build.yml: the build yml file that handles external lib builds such as vllm; these builds depend on the torch CI build from the PR.

Next step

Add the vllm test step.
See the experimental run [not final, just a workable one]: https://github.com/pytorch/pytorch/actions/runs/16759050517/job/47453916017


pytorch-bot bot commented Aug 4, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159815

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 1 Unrelated Failure

As of commit 1184e6e with merge base e16c48a:

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

  • pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge, unstable) (gh) (#158876)
    /var/lib/jenkins/workspace/xla/torch_xla/csrc/runtime/BUILD:476:14: Compiling torch_xla/csrc/runtime/xla_util_test.cpp failed: (Exit 1): gcc failed: error executing CppCompile command (from target //torch_xla/csrc/runtime:xla_util_test) /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 229 arguments skipped)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the "topic: not user facing" topic category label Aug 4, 2025
@yangw-dev changed the title from "setup build logics for vllm x torch ci" to "[vllm in torch ci ][step 1/3] build logics for vllm x torch ci" Aug 4, 2025
@yangw-dev changed the title from "[vllm in torch ci ][step 1/3] build logics for vllm x torch ci" to "[vllm in torch ci ][step 1/3] add build logics" Aug 4, 2025
@yangw-dev requested review from seemethere and huydhn August 5, 2025 20:58
@yangw-dev marked this pull request as ready for review August 5, 2025 21:00
@yangw-dev requested a review from a team as a code owner August 5, 2025 21:00

pytorch-bot bot commented Aug 5, 2025

Warning: Unknown label ciflow/vllm.
Currently recognized labels are:

  • ciflow/binaries
  • ciflow/binaries_libtorch
  • ciflow/binaries_wheel
  • ciflow/triton_binaries
  • ciflow/inductor
  • ciflow/inductor-periodic
  • ciflow/inductor-rocm
  • ciflow/inductor-perf-test-nightly-rocm
  • ciflow/inductor-perf-compare
  • ciflow/inductor-micro-benchmark
  • ciflow/inductor-micro-benchmark-cpu-x86
  • ciflow/inductor-perf-test-nightly-x86-zen
  • ciflow/inductor-cu126
  • ciflow/linux-aarch64
  • ciflow/mps
  • ciflow/nightly
  • ciflow/periodic
  • ciflow/periodic-rocm-mi300
  • ciflow/rocm
  • ciflow/rocm-mi300
  • ciflow/s390
  • ciflow/slow
  • ciflow/trunk
  • ciflow/unstable
  • ciflow/xpu
  • ciflow/torchbench
  • ciflow/op-benchmark
  • ciflow/pull
  • ciflow/h100
  • ciflow/h100-distributed
  • ciflow/win-arm64
  • ciflow/h100-symm-mem
  • ciflow/h100-cutlass-backend

Please add the new label to .github/pytorch-probot.yml

--build-arg vllm_fa_cmake_gpu_arches={cfg.vllm_fa_cmake_gpu_arches} \
--target {cfg.target} \
-t {cfg.tag_name} \
--progress=plain .
@yangw-dev (Contributor, Author) commented Aug 6, 2025


@seemethere

Wondering why this is the statement; is there a reason build.sh says:

# Do not use cache and progress=plain when in CI

Member replied:

For caching: When we had non-ephemeral runners we'd typically run into situations where we'd utilize cached docker images where we didn't want them. This isn't really an issue today.

For progress=plain: If you don't use this you'll have a bunch of garbage output since buildkit does a lot of output sugar for its main output.
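
For illustration, that guidance could look like this in a build wrapper; the wrapper itself is hypothetical, while --no-cache and --progress=plain are standard docker build flags and CI is the conventional environment variable:

# Hypothetical wrapper: in CI, skip the layer cache and emit plain progress
# output so the logs stay readable; outside CI, keep BuildKit's default UI.
import os
import subprocess

cmd = ["docker", "build", "-t", "vllm-build:ci", "."]
if os.environ.get("CI"):
    cmd += ["--no-cache", "--progress=plain"]
subprocess.run(cmd, check=True)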

@seemethere (Member) left a comment

Okay so I appreciate the ambition in this PR, but I think we're going to need to split it up a bit to make it easier for people (like myself) to review:

  • PR1: Introduce the new torch_cli
  • PR2: Implement the vllm specific parts of the torch_cli
  • PR3: Add the workflows on top of this that will run

I'd recommend using ghstack to do this.

@yangw-dev (Contributor, Author) commented Aug 6, 2025

(quoting @seemethere's review comment above)

Sounds good! Thought it was small enough without the test CLI, but will split it even smaller!
