Recursively sync fbgemm submodules before build #159477

Draft
wants to merge 2 commits into
base: main
from

Conversation

jataylo
Collaborator

@jataylo jataylo commented Jul 30, 2025

ROCm inductor benchmark builds are failing at the fbgemm build stage: https://ossci-raw-job-status.s3.amazonaws.com/log/46800456622

2025-07-27T08:00:32.3443858Z /var/lib/jenkins/pytorch/fbgemm/src/RowWiseSparseAdagradFused.cc:389:18: error: no matching function for call to ‘asmjit::v1_17::x86::Vec::Vec(uint32_t)’
2025-07-27T08:00:32.3444080Z   389 |         x86::Xmm partial_sum_xmm(partial_sum_vreg.id());

It looks like asmjit fails to build. This seems to be because fbgemm's submodules are not updated after checking out the new commit.
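
One way to check that diagnosis locally is to ask git which submodules no longer match the commits recorded by the superproject. The sketch below is illustrative and not part of this PR; both function names are hypothetical. It relies on `git submodule status`, which prefixes a line with `+` when the checked-out submodule commit differs from the recorded one and with `-` when the submodule is not initialized.

```shell
# Keep only the status lines that indicate a problem:
#   "+" = submodule is checked out at a different commit than the one recorded
#   "-" = submodule is not initialized
# Reads `git submodule status` output on stdin (hypothetical helper).
filter_stale_submodules() {
    grep -E '^[+-]' || true   # no matches is not an error here
}

# Print the out-of-sync submodules of the repository at $1, including nested
# ones (hypothetical helper).
list_stale_submodules() {
    git -C "$1" submodule status --recursive | filter_stale_submodules
}
```

Assuming the clone lives in `fbgemm`, running `list_stale_submodules fbgemm` right after the `git checkout` step would list any submodule still at its old revision. Empty output would mean everything is in sync.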

@jataylo jataylo added the ciflow/rocm (Trigger "default" config CI on ROCm) and ciflow/inductor-perf-test-nightly-rocm (Trigger inductor perf tests on ROCm) labels Jul 30, 2025

pytorch-bot bot commented Jul 30, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159477

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

❌ 2 New Failures, 1 Unrelated Failure

As of commit 12a7bbf with merge base 20b5f69 (image):

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jataylo jataylo requested a review from huydhn July 31, 2025 11:32
@jataylo jataylo changed the title from "Recursively sync fbgemm packages" to "Recursively sync fbgemm submodules before build" Jul 31, 2025
@jataylo jataylo marked this pull request as ready for review July 31, 2025 11:32
@jataylo jataylo requested a review from a team as a code owner July 31, 2025 11:32
@jataylo jataylo added the "topic: not user facing" (topic category) label Jul 31, 2025
@jataylo jataylo requested a review from desertfire July 31, 2025 11:33
@jataylo jataylo added the ciflow/inductor-perf-test-nightly (Trigger nightly inductor perf tests) label Jul 31, 2025

pytorch-bot bot commented Jul 31, 2025

Warning: Unknown label ciflow/inductor-perf-test-nightly.
Currently recognized labels are

  • ciflow/binaries
  • ciflow/binaries_libtorch
  • ciflow/binaries_wheel
  • ciflow/triton_binaries
  • ciflow/inductor
  • ciflow/inductor-periodic
  • ciflow/inductor-rocm
  • ciflow/inductor-perf-test-nightly-rocm
  • ciflow/inductor-perf-compare
  • ciflow/inductor-micro-benchmark
  • ciflow/inductor-micro-benchmark-cpu-x86
  • ciflow/inductor-perf-test-nightly-x86-zen
  • ciflow/inductor-cu126
  • ciflow/linux-aarch64
  • ciflow/mps
  • ciflow/nightly
  • ciflow/periodic
  • ciflow/periodic-rocm-mi300
  • ciflow/rocm
  • ciflow/rocm-mi300
  • ciflow/s390
  • ciflow/slow
  • ciflow/trunk
  • ciflow/unstable
  • ciflow/xpu
  • ciflow/torchbench
  • ciflow/op-benchmark
  • ciflow/pull
  • ciflow/h100
  • ciflow/h100-distributed
  • ciflow/win-arm64
  • ciflow/h100-symm-mem
  • ciflow/h100-cutlass-backend

Please add the new label to .github/pytorch-probot.yml

@jataylo
Collaborator Author

jataylo commented Jul 31, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased fbgemm_test onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fbgemm_test && git pull --rebase)

@@ -246,6 +246,7 @@ function install_torchrec_and_fbgemm() {
git clone --recursive https://github.com/pytorch/fbgemm
pushd fbgemm/fbgemm_gpu
git checkout "${fbgemm_commit}"
Collaborator

Git checkout has a --recurse-submodules arg, just use that.

Collaborator Author

You learn something new every day... updated, thanks!
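
For reference, the reviewer's suggestion can be sketched as follows. This is a minimal illustration, not the exact line changed in this PR; `checkout_recursive` and its arguments are hypothetical names. `git checkout --recurse-submodules` updates the working trees of active submodules to the commits recorded by the target revision, which a plain `git checkout` leaves undone.

```shell
# Check out a commit and move every active submodule to the revision that
# commit records. Function name and parameters are illustrative only.
#   $1 = path to the repository
#   $2 = commit, tag, or branch to check out
checkout_recursive() {
    repo_dir=$1
    commit=$2
    git -C "$repo_dir" checkout --recurse-submodules "$commit"
}
```

In a script like the one in the diff above, this would correspond to something like `checkout_recursive fbgemm "${fbgemm_commit}"`. A common alternative is a plain checkout followed by `git submodule update --init --recursive`.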

@jataylo
Collaborator Author

jataylo commented Jul 31, 2025

Failures seem unrelated: flaky GPU hangs on ROCm inductor jobs 🤔 @Skylion007 @huydhn

@jataylo jataylo requested a review from Skylion007 August 1, 2025 16:08
@pruthvistony
Collaborator

@malfet @atalman,
Could you please help review this PR? It is blocking a few inductor perf dashboard test cases.

@pruthvistony pruthvistony requested review from malfet and atalman August 4, 2025 16:17
@naromero77amd
Collaborator

@eqy in case you have the bandwidth to review/approve.

@naromero77amd
Collaborator

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Aug 4, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: inductor-perf-nightly-rocm / rocm-py3_10-inductor-benchmark-test / test (inductor_huggingface_perf_rocm, 3, 4, linux.rocm.gpu.gfx942.2)

Details for Dev Infra team Raised by workflow job

@naromero77amd
Collaborator

naromero77amd commented Aug 4, 2025

@eqy We previously hit one of the GPU hangs on this workflow: ciflow/inductor-perf-test-nightly-rocm

Can we rerun the workflow? I think it's also safe to ignore the failure during merge, since the other shards made it through OK.

@naromero77amd
Collaborator

@pytorchmergebot merge -i "failure was due to GPU hang."

pytorch-bot bot commented Aug 5, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: failure was due to GPU hang.

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

Try @pytorchbot --help for more info.

@naromero77amd
Collaborator

@pytorchbot merge -i "failure was due to GPU hang."

pytorch-bot bot commented Aug 5, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: failure was due to GPU hang.

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

Try @pytorchbot --help for more info.

@naromero77amd
Collaborator

@pytorchbot merge -f 'failure was due to GPU hang'

pytorch-bot bot commented Aug 5, 2025

You are not authorized to force merges to this repository. Please use the regular @pytorchmergebot merge command instead

@jataylo
Collaborator Author

jataylo commented Aug 5, 2025

@pytorchbot merge -f 'failure was due to GPU hang'

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

@AmdSampsa
Collaborator

Hey, let's run that HUD compiler benchmark.

@AmdSampsa AmdSampsa reopened this Aug 5, 2025
@jataylo jataylo marked this pull request as draft August 5, 2025 09:46
@jataylo
Collaborator Author

jataylo commented Aug 5, 2025

Just piggybacking on this commit to retest hud

Labels

  • ciflow/inductor-perf-test-nightly (Trigger nightly inductor perf tests)
  • ciflow/inductor-perf-test-nightly-rocm (Trigger inductor perf tests on ROCm)
  • ciflow/rocm (Trigger "default" config CI on ROCm)
  • ciflow/trunk (Trigger trunk jobs on your pull request)
  • Merged
  • open source
  • topic: not user facing (topic category)

8 participants