Commit 47f5697

CLOUDP-319783: Added one entry to AKO SRE run book (#2540)
Added one entry to AKO SRE runbook
1 parent f49544d commit 47f5697

2 files changed: 53 additions, 0 deletions

docs/sre-runbook/README.md

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
# SRE Runbook Index

Welcome to the Atlas Kubernetes Operator (AKO) SRE Runbook collection. This directory contains documentation for common incidents and anomalies that can occur in the operation of AKO. Each runbook includes:

- Description of the problem and symptoms
- Diagnostic metrics
- Step-by-step actions to resolve
- Links to additional documentation

Use this index to quickly navigate to the desired runbook.

## Available Runbooks
- [Resources are not reconciling](resources_are_not_reconciling.md)

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
# SRE Runbook: Resource Stuck in Reconciliation

## Problem: Resource stuck in reconciliation
This runbook uses an `AtlasProject` resource that is not ready as its example. The same steps apply to any other AKO resource.

### Symptoms:
- The resource is not ready.
- The reconciliation error rate metric is high.

To monitor the error rate, query the percentage of `AtlasProject` reconciliations that ended in an error over the last minute. This metric indicates the health and stability of the `AtlasProject` controller: a high or rising error percentage points to issues in the reconciliation process.

#### Example Query:
To calculate the error rate, use the following Prometheus query:
```prometheus
100 * rate(controller_runtime_reconcile_errors_total{controller="AtlasProject"}[1m]) / rate(controller_runtime_reconcile_total{controller="AtlasProject"}[1m])
```

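The result of this query can also be checked from a script. The following is a minimal sketch, assuming the standard JSON response shape of a Prometheus instant query (`/api/v1/query`). The helper name `controllers_over_threshold` and the 5% threshold are illustrative assumptions, not values recommended by AKO.

```python
import math

# Hypothetical alerting threshold in percent; tune it for your environment.
ERROR_RATE_THRESHOLD = 5.0


def controllers_over_threshold(response, threshold=ERROR_RATE_THRESHOLD):
    """Return {controller: error_percent} for series above the threshold.

    `response` is the decoded JSON body of a Prometheus instant query
    for the error-rate expression.
    """
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")

    offenders = {}
    for series in response["data"]["result"]:
        controller = series["metric"].get("controller", "unknown")
        # Each instant-vector sample is [timestamp, "value-as-string"].
        percent = float(series["value"][1])
        # A 0/0 division (no reconciliations in the window) yields NaN; skip it.
        if math.isnan(percent):
            continue
        if percent > threshold:
            offenders[controller] = percent
    return offenders
```

An empty result means no controller in the response exceeded the threshold during the query window.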
### Status:
Check the resource status condition for further details:
```yaml
status:
  conditions:
    - type: Ready
      status: "False"
      reason: ....
```

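When many resources need checking, the same inspection can be scripted. This is a minimal sketch, assuming the resource has been exported as JSON (for example with `kubectl get atlasproject <name> -o json`) and decoded into a dictionary. The function names are hypothetical, and the `message` field is read only if the condition carries one.

```python
def not_ready_conditions(resource):
    """Return the status conditions whose status is not "True"."""
    conditions = resource.get("status", {}).get("conditions", [])
    return [c for c in conditions if c.get("status") != "True"]


def summarize(resource):
    """Build one line per failing condition with its type, reason, and message."""
    lines = []
    for cond in not_ready_conditions(resource):
        lines.append(
            f"{cond.get('type')}: reason={cond.get('reason', '')!r} "
            f"message={cond.get('message', '')!r}"
        )
    return lines
```

Calling `summarize(json.load(sys.stdin))` on the exported resource gives one line per condition that needs attention. A resource with no `status` block at all yields an empty list, so that case is worth checking separately.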
### Action Items:
1. **Verify Resource Status:**
   - Check the status condition message for more detailed information.
   - If the `AtlasProject` is not ready, proceed with the next troubleshooting steps.

2. **Check Connection Secret:**
   - Ensure the connection secret referenced by `spec.connectionSecretRef.name` is correctly labeled with `atlas.mongodb.com/type=credentials`.

3. **Investigate Logs:**
   - Review logs for the `AtlasProject` controller for any potential errors or failed reconciliation attempts.

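The label check in step 2 can be sketched as follows, assuming the secret has been exported as JSON (for example with `kubectl get secret <name> -o json`) and decoded into a dictionary. The function name `secret_label_problem` is hypothetical. The label key and value are the ones named in step 2.

```python
REQUIRED_LABEL = "atlas.mongodb.com/type"
REQUIRED_VALUE = "credentials"


def secret_label_problem(secret):
    """Describe what is wrong with the secret's label, or return None if it is correct."""
    # `labels` may be absent or null on an unlabeled secret.
    labels = secret.get("metadata", {}).get("labels") or {}
    actual = labels.get(REQUIRED_LABEL)
    if actual is None:
        return f"missing label {REQUIRED_LABEL}={REQUIRED_VALUE}"
    if actual != REQUIRED_VALUE:
        return f"label {REQUIRED_LABEL} is {actual!r}, expected {REQUIRED_VALUE!r}"
    return None
```

If the function reports a missing label, one way to add it is `kubectl label secret <name> atlas.mongodb.com/type=credentials` in the secret's namespace.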
### Additional Resources:
- [AtlasProject resource](https://www.mongodb.com/docs/atlas/operator/upcoming/atlasproject-custom-resource/)
