bug: Excessive log generation when PostgreSQL database is offline #17045

Closed
@bjornrobertsson

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When the Coder PostgreSQL database becomes unavailable, the Coder instance/pod generates an excessive amount of logs - approximately 500,000 lines per minute. The logs repeatedly show connection attempts and failures without any throttling or backoff mechanism.

The example log pattern under "Relevant Log Output" repeats continuously, at a rate of at least 40 lines every 1/100th of a second.

Connection attempts continue at full speed with no reduction in frequency, generating approximately 500,000 log lines per minute. This could:

  • Fill up disk space rapidly, up to the point of exhaustion

  • Grow log files to unmanageable sizes

  • Make log analysis difficult, since other issues are buried in the flood

  • Potentially impact system performance

Relevant Log Output

2025-03-21 17:16:24.504 [info]  coderd.pgcoord: closed incoming coordinate call while unhealthy  coordinator_id=62a74eaa-8477-451e-a81b-1bf4247b23bc  peer_id=e0fe7b54-8f5b-4568-be73-f43586e8b3f3
2025-03-21 17:16:24.504 [info]  coderd.servertailnet: obtained tailnet API v2+ client
2025-03-21 17:16:24.504 [info]  coderd.servertailnet: tailnet API v2+ connection lost

Expected Behavior

When the database is offline, the tailnet process should be able to go silent, or at least reduce its logging, to avoid running out of disk space or memory, depending on where the logs are stored.

The system should implement an exponential backoff or throttling mechanism to reduce log verbosity when the database is unavailable. Connection attempts should decrease in frequency over time.

Possible Solution
Implement a tapering retry mechanism in the database connection logic:

  • Add exponential backoff for connection retries
  • Reduce logging verbosity after initial connection failures
  • Log only state changes (e.g., "database still unavailable after X attempts")
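The "log only state changes" idea from the list above could be sketched roughly as follows. The type `outageLogger`, its fields, and the message texts are hypothetical names invented for this example; the sketch is independent of Coder's actual logging code.

```go
package main

import "fmt"

// outageLogger decides when a failure is worth a log line: on the first
// failure, on every `every`-th consecutive failure, and on recovery.
// Hypothetical helper for illustration, not part of Coder.
type outageLogger struct {
	every    int // emit a reminder every N consecutive failures (0 disables reminders)
	failures int // consecutive failures seen so far
}

// Failure records a failed attempt and returns the message to log,
// or "" if this failure should stay silent.
func (l *outageLogger) Failure() string {
	l.failures++
	switch {
	case l.failures == 1:
		return "database unavailable"
	case l.every > 0 && l.failures%l.every == 0:
		return fmt.Sprintf("database still unavailable after %d attempts", l.failures)
	}
	return ""
}

// Success records a successful attempt and returns a recovery message
// only if it ends an outage.
func (l *outageLogger) Success() string {
	if l.failures == 0 {
		return ""
	}
	n := l.failures
	l.failures = 0
	return fmt.Sprintf("database available again after %d failed attempts", n)
}

func main() {
	l := &outageLogger{every: 100}
	for i := 0; i < 250; i++ {
		if msg := l.Failure(); msg != "" {
			fmt.Println(msg)
		}
	}
	fmt.Println(l.Success())
}
```

In this run, 250 consecutive failures followed by a recovery produce four log lines instead of 251. Combined with a retry backoff, this would bound both the number of attempts and the number of lines written per attempt.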

Related Issues
Similar to Issue #11799

Steps to Reproduce

  1. Have a running Coder deployment
  2. Tail the log with kubectl logs -f
  3. Stop your PostgreSQL Service

Environment

  • Host OS: Irrelevant
  • Coder version: 2.19.1+

Additional Context

No response

Labels

  • must-do: Issues that must be completed by the end of the Sprint. Or else. Only humans may set this.
  • s3: Bugs that confuse, annoy, or are purely cosmetic
