[inductor] Add TLParse artifact for logging runtime of collective and compute ops #159730

Closed
wants to merge 5 commits into from

Conversation

skarjala
Contributor

@skarjala skarjala commented Aug 3, 2025

Summary:

  • debug.py: Added log_runtime_estimates(), which dumps runtime estimation data as a structured tlparse artifact in JSON format
  • test_structured_trace.py: Added tests covering both compute and collective ops
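
As a rough illustration of the first bullet, the sketch below collects per-op runtime estimates into a JSON payload of the kind a structured artifact could carry. The `OpEstimate` record, the `build_runtime_payload` helper, the field names, and the sample values are all hypothetical; none of them are taken from the PR's diff, and the real schema in debug.py may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class OpEstimate:
    # Hypothetical record; the PR's actual schema is not shown in this thread.
    name: str
    type: str  # "compute" or "collective"
    estimated_runtime_ns: float


def build_runtime_payload(ops):
    """Serialize a list of OpEstimate records to a JSON string."""
    return json.dumps({"ops": [asdict(op) for op in ops]})


# Illustrative values only, one compute op and one collective op.
payload = build_runtime_payload(
    [
        OpEstimate("example_matmul", "compute", 1200.0),
        OpEstimate("example_all_reduce", "collective", 5400.0),
    ]
)
print(payload)
```

Keeping the op type as an explicit field is one way a test could check that both compute and collective ops appear in the dumped artifact.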

Stack from ghstack (oldest at bottom):

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Lucaskabela @mlazos

[ghstack-poisoned]

pytorch-bot bot commented Aug 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159730

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

✅ You can merge normally! (3 Unrelated Failures)

As of commit 68a0c0c with merge base 2249284:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

skarjala added a commit that referenced this pull request Aug 3, 2025
@skarjala skarjala changed the title Create TLParse artifact for logging runtime [inductor] Create TLParse artifact for logging runtime Aug 3, 2025
@skarjala skarjala added the topic: not user facing topic category label Aug 3, 2025
@skarjala skarjala changed the title [inductor] Create TLParse artifact for logging runtime [inductor] Add Logging runtime TLParse artifact for collective and compute ops Aug 3, 2025
@skarjala skarjala changed the title [inductor] Add Logging runtime TLParse artifact for collective and compute ops [inductor] Add TLParse artifact for logging runtime of collective and compute ops Aug 3, 2025
},
payload_fn=lambda: data,
)
except Exception:
Contributor

@yushangdi yushangdi Aug 4, 2025

You don't need a try-except here, you can just directly use trace_structured in log_runtime_estimates without adding a separate _dump_tlparse_runtime function.

Contributor Author

Fixed!
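
The suggestion above amounts to calling the tracing helper directly from log_runtime_estimates and handing it the payload as a callback. The sketch below uses a stand-in, `trace_structured_stub`, so it runs without PyTorch; it only mimics the `metadata_fn` / `payload_fn` keyword shape visible in the diff fragment. The artifact name, the `TRACE_ENABLED` switch, the `emitted` list, and the function bodies are assumptions made for illustration.

```python
import json

TRACE_ENABLED = True  # hypothetical switch standing in for the real logging config
emitted = []  # collects emitted records so the sketch is observable


def trace_structured_stub(name, metadata_fn, *, payload_fn):
    """Stand-in for a structured-trace helper: callbacks run only when enabled."""
    if not TRACE_ENABLED:
        return
    emitted.append((name, metadata_fn(), payload_fn()))


def log_runtime_estimates(estimates):
    """Emit one JSON artifact directly, with no wrapper function or try/except."""
    trace_structured_stub(
        "artifact",
        metadata_fn=lambda: {"name": "runtime_estimates_example", "encoding": "json"},
        payload_fn=lambda: json.dumps(estimates),
    )


log_runtime_estimates([{"name": "example_all_reduce", "estimated_runtime_ns": 5400.0}])
```

Because the payload is passed as a callback, the JSON string is only built when the stand-in decides to record it.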

@@ -818,7 +817,7 @@ def get_estimated_runtime(self) -> float:
             return 0

         # Collective kernels
-        if is_collective(self.node):
+        if isinstance(self.node, ir._CollectiveKernel):
Contributor

why do we need to change this here?

Contributor Author

Earlier it wasn't picking up scheduled collectives, but after further review I found that running the code a different way does pick them up, so the original implementation works fine. Fixed.

[ghstack-poisoned]
skarjala added a commit that referenced this pull request Aug 4, 2025
ghstack-source-id: c9eb67f
Pull-Request: #159730

fix pr feedback
@skarjala
Contributor Author

skarjala commented Aug 4, 2025

@pytorchbot merge -i

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 4, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 3 checks: Check mergeability of ghstack PR / ghstack-mergeability-check, pull / cuda12.8-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu), pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, lf.linux.12xlarge, unstable)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: Check Labels / Check labels

Details for Dev Infra team Raised by workflow job

@skarjala
Contributor Author

skarjala commented Aug 4, 2025

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 3 checks: Check mergeability of ghstack PR / ghstack-mergeability-check, pull / cuda12.8-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu), pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, lf.linux.12xlarge, unstable)

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: Check Labels / Check labels

Details for Dev Infra team Raised by workflow job

@skarjala skarjala added the ci: sev critical failure affecting PyTorch CI label Aug 5, 2025
@skarjala
Contributor Author

skarjala commented Aug 5, 2025

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 3 checks: Check mergeability of ghstack PR / ghstack-mergeability-check, pull / cuda12.8-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu), pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, lf.linux.12xlarge, unstable)

@skarjala skarjala removed the ci: sev critical failure affecting PyTorch CI label Aug 5, 2025
@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: Check Labels / Check labels

Details for Dev Infra team Raised by workflow job

[ghstack-poisoned]
skarjala added a commit that referenced this pull request Aug 5, 2025
ghstack-source-id: ddb7f4f
Pull-Request: #159730

fix pr feedback

update to graph pass once
[ghstack-poisoned]
skarjala added a commit that referenced this pull request Aug 5, 2025
ghstack-source-id: c1e4a3a
Pull-Request: #159730

fix pr feedback

update to graph pass once

new flag
[ghstack-poisoned]
skarjala added a commit that referenced this pull request Aug 5, 2025
ghstack-source-id: 3b5771f
Pull-Request: #159730

fix pr feedback

update to graph pass once

new flag

update test_structured_trace
@skarjala
Contributor Author

skarjala commented Aug 5, 2025

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 3 checks: Check Labels / Check labels, Check mergeability of ghstack PR / ghstack-mergeability-check, pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, lf.linux.12xlarge, unstable)

3 participants