[C10D] fix slow init due to repeated dns resolution failure #159596

Closed
wants to merge 2 commits into from

Conversation

wconstab
Contributor

@wconstab wconstab commented Jul 31, 2025

Stack from ghstack (oldest at bottom):

It can be very slow to repeatedly hit DNS resolution failures, but
it's very helpful to have DNS names in logs by default. So we try to use DNS,
but if we hit a transient failure we just disable it for the remainder of the
job and log IP addresses instead.

Fixes #159007

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @d4l3k @pragupta
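For readers skimming the thread, here is a minimal sketch of the "try DNS once, then fall back for the rest of the job" idea described above. It is not the PR's code: `EndpointFormatter`, `Resolver`, and `dnsDisabled` are hypothetical names, and the resolver is injected as a callable so the behavior can be exercised without touching real DNS.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical resolver type: returns a hostname, or std::nullopt on failure.
using Resolver =
    std::function<std::optional<std::string>(const std::string& ip)>;

// Hypothetical wrapper illustrating the pattern; not the c10d implementation.
class EndpointFormatter {
 public:
  explicit EndpointFormatter(Resolver resolver)
      : resolver_(std::move(resolver)) {
    assert(static_cast<bool>(resolver_));
  }

  // Prefers "host:port"; after the first failed lookup, always "ip:port".
  std::string format(const std::string& ip, int port) {
    if (!dnsDisabled_) {
      if (auto host = resolver_(ip)) {
        return *host + ":" + std::to_string(port);
      }
      // First failure: stop paying the lookup cost from now on.
      dnsDisabled_ = true;
    }
    return ip + ":" + std::to_string(port);
  }

  bool dnsDisabled() const { return dnsDisabled_; }

 private:
  Resolver resolver_;
  bool dnsDisabled_ = false;
};
```

With a resolver that always fails, the second call to `format` never invokes the resolver, which is the source of the init-time saving the description refers to.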

pytorch-bot bot commented Jul 31, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159596

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

✅ You can merge normally! (2 Unrelated Failures)

As of commit ce64159 with merge base 31b3b38 (image):

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Jul 31, 2025
wconstab added a commit that referenced this pull request Jul 31, 2025
ghstack-source-id: f03d078
Pull Request resolved: #159596
Member

@d4l3k d4l3k left a comment

LGTM

// hit a transient failure we just disable it for the remainder of the job,
// logging IP addresses instead.
// See https://github.com/pytorch/pytorch/issues/159007
bool DISABLE_getnameinfo = false;
Member

nit: since this is only used from formatSockAddr, we could make it a function-local static (a Meyers singleton) and move it into formatSockAddr

(i.e. static bool ...)

I was also debating whether we should make this a std::atomic, but I don't think it matters in this case. Writing a single bool from multiple threads is fine even if we don't evict the cache
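A rough sketch of what that suggestion could look like, combining the function-local static with `std::atomic<bool>`. The helper name `lookupsEnabled` and its `failed` parameter are made up for illustration; the PR itself keeps a plain `bool`.

```cpp
#include <atomic>

// Hypothetical helper (name not from the PR). Returns true while lookups are
// still allowed; pass failed=true to switch them off for the rest of the process.
bool lookupsEnabled(bool failed = false) {
  // Function-local static: initialized once, thread-safely, on first call
  // (guaranteed since C++11), and invisible outside this function.
  static std::atomic<bool> disabled{false};
  if (failed) {
    // Relaxed ordering suffices here: the flag guards no other data.
    disabled.store(true, std::memory_order_relaxed);
  }
  return !disabled.load(std::memory_order_relaxed);
}
```

The atomic makes concurrent writes well-defined at negligible cost; whether that is worth it for a flag that only ever flips from false to true is the judgment call raised in the comment above.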

wconstab added a commit that referenced this pull request Aug 1, 2025
ghstack-source-id: b3574b5
Pull Request resolved: #159596
addr, len, host, NI_MAXHOST, port, NI_MAXSERV, NI_NUMERICSERV)) {
C10D_WARNING(
"The hostname of the client socket cannot be retrieved. err={}", err);
char host[NI_MAXHOST], port[NI_MAXSERV]; // NOLINT
Collaborator

Just use std::array while we are refactoring it.

Contributor Author

I went ahead and made the change, but pushed it as a separate PR; please take a look.
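To illustrate the `std::array` suggestion, here is a sketch of calling `getnameinfo` with `std::array` buffers instead of raw `char[]` arrays. It assumes a POSIX platform; `nameInfo` and `loopbackV4` are hypothetical helpers and not taken from the follow-up PR.

```cpp
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>

#include <array>
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical helper: "host:port" on success, std::nullopt on failure.
std::optional<std::string> nameInfo(
    const sockaddr* addr, socklen_t len, int flags) {
  assert(addr != nullptr);
  std::array<char, NI_MAXHOST> host{};
  std::array<char, NI_MAXSERV> port{};
  // .data() yields the char* the C API expects; .size() replaces the macros.
  const int err = ::getnameinfo(
      addr, len,
      host.data(), static_cast<socklen_t>(host.size()),
      port.data(), static_cast<socklen_t>(port.size()),
      flags);
  if (err != 0) {
    return std::nullopt;
  }
  return std::string(host.data()) + ":" + port.data();
}

// Hypothetical helper: builds 127.0.0.1:<port> for trying the function out.
sockaddr_in loopbackV4(std::uint16_t port) {
  sockaddr_in sa{};
  sa.sin_family = AF_INET;
  sa.sin_port = htons(port);
  sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  return sa;
}
```

Passing `NI_NUMERICHOST | NI_NUMERICSERV` skips name resolution entirely, which is convenient for checking the buffer handling; the code under review passes only `NI_NUMERICSERV` so that hostnames are still resolved.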

return fmt::format("{}:{}", host, port);
struct sockaddr_in* psai = (struct sockaddr_in*)&addr;
// NOLINTNEXTLINE(*array*)
char ip[INET_ADDRSTRLEN];
Collaborator

Same here; you can pass the raw array with .data()

} else if (addr->sa_family == AF_INET6) {
struct sockaddr_in6* psai = (struct sockaddr_in6*)&addr;
// NOLINTNEXTLINE(*array*)
char ip[INET6_ADDRSTRLEN];
Collaborator

Likewise here
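The same idea applied to the numeric fallback for both address families might look like the sketch below. It is an assumption-laden illustration, not the merged code: `formatNumeric` is a hypothetical name, the bracketed IPv6 form and the `"?UNKNOWN?"` sentinel are my own choices, and the sketch casts `addr` itself (the pointer passed in) and converts the port with `ntohs`, which may differ from what the diff does.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#include <array>
#include <cassert>
#include <string>

// Hypothetical helper: formats a socket address numerically, no DNS involved.
std::string formatNumeric(const sockaddr* addr) {
  assert(addr != nullptr);
  if (addr->sa_family == AF_INET) {
    const auto* v4 = reinterpret_cast<const sockaddr_in*>(addr);
    std::array<char, INET_ADDRSTRLEN> ip{};
    if (::inet_ntop(AF_INET, &v4->sin_addr, ip.data(),
                    static_cast<socklen_t>(ip.size())) != nullptr) {
      // Ports are stored in network byte order; convert before printing.
      return std::string(ip.data()) + ":" +
          std::to_string(ntohs(v4->sin_port));
    }
  } else if (addr->sa_family == AF_INET6) {
    const auto* v6 = reinterpret_cast<const sockaddr_in6*>(addr);
    std::array<char, INET6_ADDRSTRLEN> ip{};
    if (::inet_ntop(AF_INET6, &v6->sin6_addr, ip.data(),
                    static_cast<socklen_t>(ip.size())) != nullptr) {
      // Brackets keep the address separable from the port.
      return "[" + std::string(ip.data()) + "]:" +
          std::to_string(ntohs(v6->sin6_port));
    }
  }
  return "?UNKNOWN?";  // Sentinel chosen for this sketch only.
}
```

Sizing the buffers with `INET_ADDRSTRLEN` and `INET6_ADDRSTRLEN` and passing `ip.size()` keeps the length argument tied to the buffer, so the two cannot drift apart.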

Contributor Author

wconstab commented Aug 4, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 4, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category
4 participants