-
Notifications
You must be signed in to change notification settings - Fork 24.9k
[fr] [xpu] Add FlightRecorder support for ProcessGroupXCCL #158568
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158568
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 5e920cb with merge base aeb5321 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
@@ -389,7 +389,9 @@ def __init__( | |||
): | |||
self.profiling_name = event["profiling_name"] | |||
nccl, name = self.profiling_name.split(":") | |||
assert nccl == "nccl", f"name formatting error? {nccl} != 'nccl'" | |||
assert nccl in ["nccl", "xccl"], ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nccl
is a bit ambiguous,please consider using a more clearer term.
@pytorchbot rebase |
Sorry @frost-intel, I do mistake when I help resolve the conflicts in xpu.txt. Could you help me update xpu.txt again within this PR. |
Github is back. Done. |
@@ -200,7 +201,9 @@ struct FlightRecorder { | |||
const std::tuple<std::string, std::string>& pg_name, | |||
std::vector<uint64_t> ranks); | |||
|
|||
void record_accelerator_version(const std::string nccl_version); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we only keep one parameter here is enough, such as void record_accelerator_version(const std::string ccl_version);
And define a ccl_version_key_str
in
DEFINE_CONSTANT(nccl_version_key, "nccl_version") |
ccl_version
to both nccl_version_str
and ccl_version_str
. (keep nccl_version_str
for BC only)What do you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good idea, I've fixed the code as you said.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not directly call it accelerator? why ccl?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a nit, otherwise LGTM. Leave @zhangxiaoli73 to make the decision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing I have concern about is the CCL part, everything else looks good to me.
assert nccl == "nccl", f"name formatting error? {nccl} != 'nccl'" | ||
ccl_backend, name = self.profiling_name.split(":") | ||
assert ccl_backend in ["nccl", "xccl"], ( | ||
f"name formatting error? {ccl_backend} != 'nccl' or 'xccl'" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we call it accelerator instead of ccl?
@@ -24,6 +24,7 @@ DEFINE_CONSTANT(version_val, "2.9") | |||
DEFINE_CONSTANT(entries_key, "entries") | |||
DEFINE_CONSTANT(nccl_comm_key, "nccl_comm_state") | |||
DEFINE_CONSTANT(nccl_version_key, "nccl_version") | |||
DEFINE_CONSTANT(ccl_version_key, "ccl_version") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really think we should name it accelerator version. The nccl_version is something we should change tbh. And I am ok with a bc change if we change it to accelerator_version so the code is generic enough. ccl is not ideal tbh.
@@ -200,7 +201,9 @@ struct FlightRecorder { | |||
const std::tuple<std::string, std::string>& pg_name, | |||
std::vector<uint64_t> ranks); | |||
|
|||
void record_accelerator_version(const std::string nccl_version); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not directly call it accelerator? why ccl?
Adds support for FlightRecorder in ProcessGroupXCCL.
See intel/torch-xpu-ops#1867 for XCCL implementation and more details.
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @gujinghui @EikanWang @fengyuan14 @guangyey