Skip to content

fix: reduce excessive logging when database is unreachable #17363

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Apr 15, 2025
Merged

Conversation

dannykopping
Copy link
Contributor

Fixes #17045

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
@dannykopping dannykopping changed the title Reduce excessive logging when database is unreachable fix: reduce excessive logging when database is unreachable Apr 11, 2025
p.acquireAndRunOne(client)
err := p.acquireAndRunOne(client)
if err != nil && ctx.Err() == nil { // Only log if context is not done.
p.opts.Logger.Debug(ctx, "retrying to acquire job", slog.F("retry_in_ms", retrier.Delay.Milliseconds()), slog.Error(err))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-review: acquireAndRunOne already logs its own warning - specifically the provisionerd was unable to acquire job one is logged when the db is unreachable - so Debug is what felt most appropriate to me.

…ailnet control protocol dialer

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
@dannykopping dannykopping marked this pull request as ready for review April 11, 2025 14:50
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
// This needs to be done *after* the server "starts" otherwise it'll fail straight away when trying to initialize.
pdb.MarkUnhealthy()

// Then: the tailnet controller will continually try to dial the coordination endpoint, exceeding its context timeout.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is wrong, we don't continually retry because DialAgent only waits until we hit a dial error. Once the first error is returned the test is complete and we tear down the context.

Furthermore, I don't think the SDK DialAgent is really the thing that you care about testing here. It doesn't handle the retries anyways, tailnet does. Maybe simplify this and just use the WebsocketDialer and ensure it returns an error.

This has a downside of losing the details of the received error, but in this case it seems justified since we need to conditionalize responses based on codersdk.ErrDatabaseNotReachable

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
@dannykopping dannykopping merged commit 0b18e45 into main Apr 15, 2025
37 checks passed
@dannykopping dannykopping deleted the dk/17045 branch April 15, 2025 08:55
@github-actions github-actions bot locked and limited conversation to collaborators Apr 15, 2025
@dannykopping
Copy link
Contributor Author

/cherry-pick release/2.21

@dannykopping
Copy link
Contributor Author

/cherry-pick release/2.20

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: Excessive log generation when PostgreSQL database is offline
3 participants