Skip to content

[fr] [xpu] Add FlightRecorder support for ProcessGroupXCCL #158568

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

frost-intel
Copy link
Collaborator

@frost-intel frost-intel commented Jul 17, 2025

Adds support for FlightRecorder in ProcessGroupXCCL.

See intel/torch-xpu-ops#1867 for XCCL implementation and more details.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @gujinghui @EikanWang @fengyuan14 @guangyey

Copy link

pytorch-bot bot commented Jul 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158568

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5e920cb with merge base aeb5321 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Jul 17, 2025
@guangyey guangyey added the module: xpu Intel XPU related issues label Jul 24, 2025
@@ -389,7 +389,9 @@ def __init__(
):
self.profiling_name = event["profiling_name"]
nccl, name = self.profiling_name.split(":")
assert nccl == "nccl", f"name formatting error? {nccl} != 'nccl'"
assert nccl in ["nccl", "xccl"], (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nccl is a bit ambiguous,please consider using a more clearer term.

@guangyey
Copy link
Collaborator

guangyey commented Aug 5, 2025

@pytorchbot rebase

@guangyey
Copy link
Collaborator

guangyey commented Aug 5, 2025

Sorry @frost-intel, I do mistake when I help resolve the conflicts in xpu.txt. Could you help me update xpu.txt again within this PR.

@guangyey
Copy link
Collaborator

guangyey commented Aug 5, 2025

Sorry @frost-intel, I do mistake when I help resolve the conflicts in xpu.txt. Could you help me update xpu.txt again within this PR.

Github is back. Done.

@@ -200,7 +201,9 @@ struct FlightRecorder {
const std::tuple<std::string, std::string>& pg_name,
std::vector<uint64_t> ranks);

void record_accelerator_version(const std::string nccl_version);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we only keep one parameter here is enough, such as void record_accelerator_version(const std::string ccl_version);
And define a ccl_version_key_str in

DEFINE_CONSTANT(nccl_version_key, "nccl_version")
and assign ccl_version to both nccl_version_str and ccl_version_str. (keep nccl_version_str for BC only)
What do you think.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good idea, I've fixed the code as you said.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not directly call it accelerator? why ccl?

Copy link
Collaborator

@guangyey guangyey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nit, otherwise LGTM. Leave @zhangxiaoli73 to make the decision.

@guangyey guangyey marked this pull request as ready for review August 7, 2025 02:10
@guangyey guangyey requested a review from EikanWang as a code owner August 7, 2025 02:10
@guangyey guangyey requested a review from gujinghui as a code owner August 7, 2025 02:10
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Aug 7, 2025
@guangyey guangyey moved this to Review Required in PyTorch Intel Aug 7, 2025
@guangyey guangyey requested review from d4l3k and zhangxiaoli73 and removed request for zhangxiaoli73 August 7, 2025 02:43
@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Aug 7, 2025
@frost-intel frost-intel requested a review from kwen2501 as a code owner August 8, 2025 20:07
@d4l3k d4l3k requested review from fduwjj and removed request for zhangxiaoli73 August 11, 2025 16:47
Copy link
Contributor

@fduwjj fduwjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing I have concern about is the CCL part, everything else looks good to me.

assert nccl == "nccl", f"name formatting error? {nccl} != 'nccl'"
ccl_backend, name = self.profiling_name.split(":")
assert ccl_backend in ["nccl", "xccl"], (
f"name formatting error? {ccl_backend} != 'nccl' or 'xccl'"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we call it accelerator instead of ccl?

@@ -24,6 +24,7 @@ DEFINE_CONSTANT(version_val, "2.9")
DEFINE_CONSTANT(entries_key, "entries")
DEFINE_CONSTANT(nccl_comm_key, "nccl_comm_state")
DEFINE_CONSTANT(nccl_version_key, "nccl_version")
DEFINE_CONSTANT(ccl_version_key, "ccl_version")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really think we should name it accelerator version. The nccl_version is something we should change tbh. And I am ok with a bc change if we change it to accelerator_version so the code is generic enough. ccl is not ideal tbh.

@@ -200,7 +201,9 @@ struct FlightRecorder {
const std::tuple<std::string, std::string>& pg_name,
std::vector<uint64_t> ranks);

void record_accelerator_version(const std::string nccl_version);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not directly call it accelerator? why ccl?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module: xpu Intel XPU related issues oncall: distributed Add this issue/PR to distributed oncall triage queue open source release notes: distributed (c10d) release notes category
Projects
Status: Review Required
Development

Successfully merging this pull request may close these issues.

5 participants