
Add B200 smoke test #159494


Open: drisspg wants to merge 5 commits into gh/drisspg/177/base


drisspg (Contributor) commented Jul 30, 2025

Stack from ghstack (oldest at bottom):

[ghstack-poisoned]
drisspg requested a review from a team as a code owner July 30, 2025 17:59

pytorch-bot bot commented Jul 30, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159494

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 3 New Failures, 1 Unrelated Failure

As of commit 876f42d with merge base e5a81aa (image):

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

  • pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge, unstable) (gh) (#158876)
    /var/lib/jenkins/workspace/xla/torch_xla/csrc/runtime/BUILD:476:14: Compiling torch_xla/csrc/runtime/xla_util_test.cpp failed: (Exit 1): gcc failed: error executing CppCompile command (from target //torch_xla/csrc/runtime:xla_util_test) /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 229 arguments skipped)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the "topic: not user facing" label Jul 30, 2025
drisspg added a commit that referenced this pull request Jul 30, 2025
ghstack-source-id: 1f0ab3e
Pull-Request: #159494
[ghstack-poisoned]
drisspg added a commit that referenced this pull request Jul 30, 2025
ghstack-source-id: c8041d8
Pull-Request: #159494
nWEIdia (Collaborator) commented Jul 30, 2025

Using the "linux.dgx.b200" label requires cherry-picking the infra change from commit cb91f09.
This PR therefore depends on @huydhn's PR, which is blocked on the torchbench runtime.

drisspg (Contributor, Author) commented Aug 4, 2025

@pytorchbot rebase

pytorchmergebot (Collaborator) commented

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]
pytorchmergebot (Collaborator) commented

Successfully rebased gh/drisspg/177/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/159494)

pytorchmergebot pushed a commit that referenced this pull request Aug 4, 2025
ghstack-source-id: b9be81b
Pull-Request: #159494
nWEIdia (Collaborator) commented Aug 4, 2025

Please also use aws-role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only in your .yaml. This is a constraint we have to deal with for this B200 runner. Please see #158011 or #159323 for reference.
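For illustration, the request above might look roughly like the fragment below. This is a sketch under assumptions, not the PR's actual diff: the job name is a hypothetical placeholder, and it assumes the test job calls the reusable workflow ./.github/workflows/_linux-test.yml and that this workflow accepts an aws-role-to-assume input. #158011 and #159323 remain the references for real usage.

```yaml
jobs:
  # Hypothetical job name, used only for this sketch.
  b200-smoke-test:
    # Assumption: the job delegates to PyTorch's reusable Linux test workflow.
    uses: ./.github/workflows/_linux-test.yml
    with:
      # Role quoted in the comment above, described there as a constraint of the B200 runner.
      aws-role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
      # Other inputs the workflow may require (build environment, docker image,
      # test matrix) are intentionally omitted from this sketch.
    secrets: inherit
```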

[ghstack-poisoned]
drisspg added a commit that referenced this pull request Aug 4, 2025
ghstack-source-id: 0e6fded
Pull-Request: #159494
huydhn (Contributor) left a comment

I think the lint failure is real: you need to add ciflow/b200 to https://github.com/pytorch/pytorch/blob/main/.github/pytorch-probot.yml#L3
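A minimal sketch of the change being asked for, assuming the tags are listed under a ciflow_push_tags key starting at the linked line of .github/pytorch-probot.yml. The neighboring entry is only an example and does not reproduce the file's actual contents.

```yaml
ciflow_push_tags:
# New entry; presumably the lint check wants every ciflow tag used by a workflow listed here.
- ciflow/b200
# Existing entries stay as they are; the one below is shown only as an example.
- ciflow/trunk
```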

[ghstack-poisoned]
drisspg added a commit that referenced this pull request Aug 4, 2025
ghstack-source-id: 6f3bd85
Pull-Request: #159494
nWEIdia (Collaborator) left a comment

LGTM. Looks like we have some annoying threshold issue with the b200 signal.

nWEIdia (Collaborator) commented Aug 5, 2025

cc @eqy about a potential tolerance bump.

eqy (Collaborator) commented Aug 5, 2025

#159915 is the tolerance bump
#159914 seems like it warrants further investigation

5 participants