[C10D] fix slow init due to repeated dns resolution failure #159596
Repeatedly hitting DNS resolution failures can be very slow, but it's very helpful to have DNS names in logs by default. So we try to use DNS, but if we hit a transient failure we disable it for the remainder of the job and log IP addresses instead. Fixes #159007
LGTM
// hit a transient failure we just disable it for the remainder of the job,
// logging IP addresses instead.
// See https://github.com/pytorch/pytorch/issues/159007
bool DISABLE_getnameinfo = false;
nit: since this is only used from formatSockAddr, we could make it a function-local static (Meyers singleton style) and move it into formatSockAddr (i.e. static bool ...).
I was also debating whether we should make this a std::atomic, but I don't think it matters in this case. Writing a single bool is fine from multiple threads even if we don't evict the cache.
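A minimal sketch of the suggestion above, assuming a hypothetical function `formatPeer` standing in for `formatSockAddr`. The `lookupOk` parameter simulates whether the reverse DNS lookup succeeded, so the example runs without a network; none of these names come from the PR. It uses `std::atomic<bool>` only to show the alternative the reviewer mentions. Strictly speaking, the C++ standard treats unsynchronized concurrent writes and reads of a plain `bool` as a data race, so the atomic is the conservative choice even if a plain `bool` works in practice.

```cpp
#include <atomic>
#include <cassert>
#include <string>

// Hypothetical stand-in for formatSockAddr (illustration only).
// `lookupOk` simulates the outcome of the reverse DNS lookup.
std::string formatPeer(const std::string& ip,
                       const std::string& dnsName,
                       bool lookupOk) {
  // Function-local static: initialized once, on first call, and visible
  // only inside this function.
  static std::atomic<bool> disableLookup{false};

  if (!disableLookup.load(std::memory_order_relaxed)) {
    if (lookupOk) {
      return dnsName;
    }
    // First failure: stop attempting lookups for the rest of the process.
    disableLookup.store(true, std::memory_order_relaxed);
  }
  return ip;
}
```

Once one call fails, later calls return the IP address even if the lookup would have succeeded, which matches the "disable for the remainder of the job" behavior described in the PR.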
addr, len, host, NI_MAXHOST, port, NI_MAXSERV, NI_NUMERICSERV)) {
C10D_WARNING(
"The hostname of the client socket cannot be retrieved. err={}", err);
char host[NI_MAXHOST], port[NI_MAXSERV]; // NOLINT
Just use std::array while we are refactoring it.
I went ahead and made the change but pushed it as a separate PR, please TAL
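For illustration, here is roughly what the `std::array` suggestion could look like for the `getnameinfo` buffers. This is a sketch under assumptions, not the code from the follow-up PR: `numericHostPort` and `makeLoopback` are hypothetical helpers, and the call passes `NI_NUMERICHOST | NI_NUMERICSERV` so that it never touches DNS and stays deterministic.

```cpp
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>

#include <array>
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not a PyTorch function): formats addr as "host:port"
// using numeric output only, with std::array instead of raw char arrays.
std::string numericHostPort(const sockaddr* addr, socklen_t len) {
  std::array<char, NI_MAXHOST> host{};
  std::array<char, NI_MAXSERV> port{};
  int err = ::getnameinfo(
      addr,
      len,
      host.data(),
      static_cast<socklen_t>(host.size()),
      port.data(),
      static_cast<socklen_t>(port.size()),
      NI_NUMERICHOST | NI_NUMERICSERV);
  if (err != 0) {
    return std::string("?UNKNOWN (") + ::gai_strerror(err) + ")";
  }
  return std::string(host.data()) + ":" + port.data();
}

// Hypothetical helper: builds an IPv4 loopback address for the given port.
sockaddr_in makeLoopback(std::uint16_t port) {
  sockaddr_in sa{};
  sa.sin_family = AF_INET;
  sa.sin_port = htons(port);
  sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  return sa;
}
```

The buffers keep their size with them (`host.size()`), so the `NI_MAXHOST` and `NI_MAXSERV` constants appear only once, in the declarations.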
return fmt::format("{}:{}", host, port);
struct sockaddr_in* psai = (struct sockaddr_in*)&addr;
// NOLINTNEXTLINE(*array*)
char ip[INET_ADDRSTRLEN];
Same here, you can pass the raw array with .data()
} else if (addr->sa_family == AF_INET6) {
struct sockaddr_in6* psai = (struct sockaddr_in6*)&addr;
// NOLINTNEXTLINE(*array*)
char ip[INET6_ADDRSTRLEN];
Likewise here
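The three comments above could be applied to the IP fallback path along these lines. This is a hedged sketch, not the merged code: `formatNumeric` is a hypothetical name, and the bracketed IPv6 output format is an assumption of this example. Note that the sketch casts `addr` itself, since the snippets use `addr->sa_family` and so `addr` is already a pointer; casting `&addr` would reinterpret the pointer variable rather than the address structure.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#include <array>
#include <cassert>
#include <string>

// Hypothetical helper (not the PyTorch function): formats an IPv4 or IPv6
// socket address numerically, using std::array buffers passed via .data().
std::string formatNumeric(const sockaddr* addr) {
  if (addr->sa_family == AF_INET) {
    const auto* psai = reinterpret_cast<const sockaddr_in*>(addr);
    std::array<char, INET_ADDRSTRLEN> ip{};
    if (::inet_ntop(AF_INET, &psai->sin_addr, ip.data(),
                    static_cast<socklen_t>(ip.size())) == nullptr) {
      return "?UNKNOWN";
    }
    return std::string(ip.data()) + ":" + std::to_string(ntohs(psai->sin_port));
  }
  if (addr->sa_family == AF_INET6) {
    const auto* psai = reinterpret_cast<const sockaddr_in6*>(addr);
    std::array<char, INET6_ADDRSTRLEN> ip{};
    if (::inet_ntop(AF_INET6, &psai->sin6_addr, ip.data(),
                    static_cast<socklen_t>(ip.size())) == nullptr) {
      return "?UNKNOWN";
    }
    // Brackets around the IPv6 address are a choice made for this sketch.
    return "[" + std::string(ip.data()) + "]:" +
        std::to_string(ntohs(psai->sin6_port));
  }
  return "?UNKNOWN";
}
```

With `std::array`, the `NOLINT` suppressions for raw C arrays should no longer be needed, and `.data()` yields the `char*` that `inet_ntop` expects.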
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours).
Stack from ghstack (oldest at bottom):
Repeatedly hitting DNS resolution failures can be very slow, but
it's very helpful to have DNS names in logs by default. So we try to use DNS,
but if we hit a transient failure we disable it for the remainder of the
job and log IP addresses instead.
Fixes #159007
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @d4l3k @pragupta