Description
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
When the Coder PostgreSQL database becomes unavailable, the Coder instance/pod generates an excessive amount of logs - approximately 500,000 lines per minute. The logs repeatedly show connection attempts and failures without any throttling or backoff mechanism.
The example log pattern shown under Relevant Log Output repeats continuously, producing at least 40 lines every hundredth of a second.
Connection attempts continue at full speed with no reduction in frequency, which could:
- Fill up disk space rapidly, potentially exhausting it
- Grow log files to unmanageable sizes
- Make log analysis difficult and hide other issues in the flood
- Potentially impact system performance
Relevant Log Output
2025-03-21 17:16:24.504 [info] coderd.pgcoord: closed incoming coordinate call while unhealthy coordinator_id=62a74eaa-8477-451e-a81b-1bf4247b23bc peer_id=e0fe7b54-8f5b-4568-be73-f43586e8b3f3
2025-03-21 17:16:24.504 [info] coderd.servertailnet: obtained tailnet API v2+ client
2025-03-21 17:16:24.504 [info] coderd.servertailnet: tailnet API v2+ connection lost
Expected Behavior
When the database is offline, the tailnet component should go silent, or at least be muted, so that logging does not exhaust disk space or memory, depending on where the logs are stored.
The system should implement an exponential backoff or throttling mechanism to reduce log verbosity when the database is unavailable. Connection attempts should decrease in frequency over time.
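To illustrate the kind of schedule meant here, below is a minimal sketch of a capped exponential backoff in Go. The names `baseDelay`, `maxDelay`, and `backoffDelay` are hypothetical and the values are placeholders; none of them come from the Coder codebase.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical tuning values, not taken from Coder's configuration.
const (
	baseDelay = 1 * time.Second
	maxDelay  = 60 * time.Second
)

// backoffDelay returns the wait before retry number attempt (0-based):
// baseDelay doubled once per attempt, capped at maxDelay.
func backoffDelay(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

func main() {
	// Print the first few delays to show how the retry rate tapers off.
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Printf("attempt %d: wait %s\n", attempt, backoffDelay(attempt))
	}
}
```

With these placeholder values, retries settle at one per minute after the sixth failure instead of running at full speed. A real implementation would probably also add jitter so that multiple replicas do not reconnect in lockstep.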
Possible Solution
Implement a tapering retry mechanism in the database connection logic:
- Add exponential backoff for connection retries
- Reduce logging verbosity after initial connection failures
- Log only state changes (e.g., "database still unavailable after X attempts")
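The "log only state changes" idea could look roughly like the following Go sketch. `stateLogger`, `observe`, and `remindEvery` are hypothetical names made up for illustration and do not exist in Coder; the sketch only decides which attempts produce a log line and leaves the actual connection logic out.

```go
package main

import (
	"errors"
	"fmt"
)

// errDown stands in for whatever error a failed connection attempt returns.
var errDown = errors.New("connection refused")

// stateLogger is a hypothetical helper, not part of Coder. It emits a line
// only when the database state changes, plus a periodic reminder while down.
type stateLogger struct {
	down        bool
	failures    int
	remindEvery int // emit a "still unavailable" line every N failures
}

// observe records one connection attempt and returns the line to log,
// or "" if this attempt should stay silent.
func (s *stateLogger) observe(err error) string {
	if err == nil {
		if !s.down {
			return "" // healthy and was healthy: nothing to report
		}
		msg := fmt.Sprintf("database reachable again after %d failed attempts", s.failures)
		s.down, s.failures = false, 0
		return msg
	}
	s.failures++
	if !s.down {
		s.down = true // first failure: report the transition once
		return fmt.Sprintf("database unavailable: %v", err)
	}
	if s.remindEvery > 0 && s.failures%s.remindEvery == 0 {
		return fmt.Sprintf("database still unavailable after %d attempts", s.failures)
	}
	return ""
}

func main() {
	s := &stateLogger{remindEvery: 100}
	for i := 0; i < 250; i++ {
		if line := s.observe(errDown); line != "" {
			fmt.Println(line)
		}
	}
	fmt.Println(s.observe(nil))
}
```

In this sketch, 250 consecutive failures followed by a recovery produce four log lines in total rather than one or more lines per attempt.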
Related Issues
Similar to Issue #11799
Steps to Reproduce
- Have a running Coder deployment
- Tail the logs, e.g. with kubectl logs -f
- Stop the PostgreSQL service
Environment
- Host OS: Irrelevant
- Coder version: 2.19.1+
Additional Context
No response