2019-05-21 Kubernetes Failure Stories - KubeCon Europe
Failure Stories
HENNING JACOBS
@try_except_
ZALANDO AT A GLANCE
SCALE
380 Accounts
118 Clusters
DEVELOPERS USING KUBERNETES
47+ cluster components
INCIDENTS ARE FINE
INCIDENT #1
INCIDENT #1: CUSTOMER IMPACT
INCIDENT #1: INGRESS ERRORS
INCIDENT #1: AWS ALB 502
Server: awselb/2.0
...
github.com/zalando/riptide
INCIDENT #1: ALB HEALTHY HOST COUNT
3 healthy hosts
2xx requests
HTTP
Memory Usage
Memory Limit
INCIDENT #1: SKIPPER OOM
TLS
ALB
HTTP
INCIDENT #2
INCIDENT #2: CUSTOMER IMPACT
INCIDENT #2: IAM RETURNING 404
INCIDENT #2: NUMBER OF PODS
LIFE OF A REQUEST (INGRESS)
TLS
EC2 network
ALB K8s network
HTTP
OOMKill
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
spec:
  schedule: "*/15 9-19 * * Mon-Fri"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          concurrencyPolicy: Forbid
          successfulJobsHistoryLimit: 1
          failedJobsHistoryLimit: 1
          containers:
          ...
INCIDENT #2: FIXED CRON JOB
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
spec:
  schedule: "7 8-18 * * Mon-Fri"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      activeDeadlineSeconds: 120
      template:
        spec:
          restartPolicy: Never
          containers:
INCIDENT #2: LESSONS LEARNED
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "1500"

NOTE: we dropped quotas recently
github.com/zalando-incubator/kubernetes-on-aws/pull/2059
INCIDENT #3
INCIDENT #3: INGRESS ERRORS
INCIDENT #3: COREDNS OOMKILL
STOP THE BLEEDING: INCREASE MEMORY LIMIT
4Gi
2Gi
200Mi
SPIKE IN HTTP REQUESTS
SPIKE IN DNS QUERIES
INCREASE IN MEMORY USAGE
INCIDENT #3: CONTRIBUTING FACTORS
• HTTP retries
• No DNS caching
• Kubernetes ndots:5 problem
• Short maximum lifetime of HTTP connections
• Fixed memory limit for CoreDNS
• Monitoring affected by DNS outage
github.com/zalando-incubator/kubernetes-on-aws/blob/dev/docs/postmortems/jan-2019-dns-outage.md
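The "ndots:5" factor can be illustrated with a small sketch. With the ndots:5 option that Kubernetes writes into a pod's /etc/resolv.conf, a name with fewer than five dots is first tried against every search domain and only then as-is, so one lookup of an external host turns into several DNS queries. The function name expand_queries, the search list and the hostname below are illustrative assumptions, not taken from the postmortem.

```bash
#!/bin/bash
# Sketch of glibc-style search-list expansion.
# expand_queries NAME NDOTS SEARCH_DOMAIN...
# Prints, one per line and in order, the names a resolver would try.
expand_queries() {
  local name="$1"
  local ndots="$2"
  shift 2
  # A trailing dot marks a fully qualified name: no search list is applied.
  case "$name" in
    *.) echo "$name"; return ;;
  esac
  # Count the dots in the name.
  local dots
  dots=$(( $(printf '%s' "$name" | tr -cd '.' | wc -c) ))
  # Enough dots: the name is tried as-is before the search list.
  if [ "$dots" -ge "$ndots" ]; then
    echo "$name."
  fi
  local domain
  for domain in "$@"; do
    echo "$name.$domain."
  done
  # Too few dots: the name as-is is tried last.
  if [ "$dots" -lt "$ndots" ]; then
    echo "$name."
  fi
}

# Hypothetical pod in namespace "default" resolving an external host.
expand_queries example.org 5 default.svc.cluster.local svc.cluster.local cluster.local
```

In this sketch the external name is only reached on the fourth attempt, and resolvers typically ask for both A and AAAA records per attempt. Combined with HTTP retries and no DNS caching, this multiplies the query load on CoreDNS.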
INCIDENT #4
INCIDENT #4: CLUSTER DOWN
INCIDENT #4: MANUAL OPERATION
INCIDENT #4: RTFM
Junior Engineers are Features, not Bugs
https://www.youtube.com/watch?v=cQta4G3ge44
https://www.outcome-eng.com/human-error-never-root-cause/
INCIDENT #4: LESSONS LEARNED
INCIDENT #5
INCIDENT #5: API LATENCY SPIKES
INCIDENT #5: CONNECTION ISSUES
Master Node
API Server
etcd
etcd-member
...
Kubernetes worker and master nodes sporadically fail to connect to etcd
causing timeouts in the APIserver and disconnects in the pod network.
...
INCIDENT #5: STOP THE BLEEDING
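#!/bin/bash
# Stop-gap watchdog: restart etcd-member whenever the local API server
# does not answer within 5 seconds.
while true; do
    echo "sleep for 60 seconds"
    sleep 60
    # Probe the API server; timeout exits non-zero if curl hangs or fails.
    if timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null; then
        echo "all fine, no need to restart etcd member"
    else
        echo "restarting etcd-member"
        systemctl restart etcd-member
    fi
done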
INCIDENT #5: CONFIRMATION FROM AWS
[...]
We can’t go into the details [...] that resulted the networking problems during
the “non-intrusive maintenance”, as it relates to internal workings of EC2.
We can confirm this only affected the T2 instance types, ...
[...]
We don’t explicitly recommend against running production services on T2
[...]
INCIDENT #5: LESSONS LEARNED
INCIDENT #6
INCIDENT #6: IMPACT
Ingress
5XXs
INCIDENT #6: CLUSTER DOWN?
INCIDENT #6: THE TRIGGER
https://www.outcome-eng.com/human-error-never-root-cause/
CLUSTER UPGRADE FLOW
CLUSTER LIFECYCLE MANAGER (CLM)
github.com/zalando-incubator/cluster-lifecycle-manager
CLUSTER CHANNELS
github.com/zalando-incubator/kubernetes-on-aws
E2E TESTS ON EVERY PR
github.com/zalando-incubator/kubernetes-on-aws
RUNNING E2E TESTS (BEFORE)
Testing dev to alpha upgrade
Control plane
node node
RUNNING E2E TESTS (NOW)
Testing dev to alpha upgrade
branch: alpha (base) branch: dev (head) Control plane Control plane
INCIDENT #6: LESSONS LEARNED
github.com/zalando-incubator/kubernetes-on-aws/tree/dev/test/e2e
INCIDENT #7
INCIDENT #7: KERNEL OOM KILLER
⇒ all containers on this node down
INCIDENT #7: KUBELET MEMORY
UPSTREAM ISSUE REPORTED
https://github.com/kubernetes/kubernetes/issues/73587
INCIDENT #7: THE PATCH
https://github.com/kubernetes/kubernetes/issues/73587
INCIDENT #8
INCIDENT #8: IMPACT
INCIDENT #8: CREDENTIALS QUEUE
17:30:07 | [pool-6-thread-1 ] | Current queue size: 7115, current number of active workers: 20
17:31:07 | [pool-6-thread-1 ] | Current queue size: 7505, current number of active workers: 20
17:32:07 | [pool-6-thread-1 ] | Current queue size: 7886, current number of active workers: 20
..
17:37:07 | [pool-6-thread-1 ] | Current queue size: 9686, current number of active workers: 20
..
17:44:07 | [pool-6-thread-1 ] | Current queue size: 11976, current number of active workers: 20
..
19:16:07 | [pool-6-thread-1 ] | Current queue size: 58381, current number of active workers: 20
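A rough reading of the log above: the queue keeps growing while the number of active workers stays pinned at 20. The sketch below computes the average growth per minute between two log samples. The helper name growth_per_minute is hypothetical, and the elapsed minutes are worked out by hand from the timestamps.

```bash
#!/bin/bash
# growth_per_minute START_SIZE END_SIZE MINUTES
# Integer average of queue growth per minute between two samples.
growth_per_minute() {
  local start="$1"
  local end="$2"
  local minutes="$3"
  echo $(( (end - start) / minutes ))
}

# 17:30:07 to 17:31:07 (1 minute): 7115 to 7505 entries
growth_per_minute 7115 7505 1      # prints 390
# 17:30:07 to 19:16:07 (106 minutes): 7115 to 58381 entries
growth_per_minute 7115 58381 106   # prints 483
```

The average over the whole window is no lower than the first minute's rate, so the 20 workers never caught up with incoming work during the logged period.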
INCIDENT #8: CPU THROTTLING
INCIDENT #8: WHAT HAPPENED
SLACK
Node
CPU
Memory
"Slack"
DISABLING CPU THROTTLING
kubelet … --cpu-cfs-quota=false
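For context on what this flag turns off: with CFS quota enforcement enabled, a container's CPU limit becomes a run-time budget per scheduler period, and a container that spends its budget is throttled until the next period starts. A minimal sketch of that arithmetic, assuming the default CFS period of 100 ms; the function name cfs_quota_us is hypothetical.

```bash
#!/bin/bash
# cfs_quota_us LIMIT_MILLICORES [PERIOD_US]
# CPU time in microseconds that a CPU limit grants per CFS period.
cfs_quota_us() {
  local millicores="$1"
  local period_us="${2:-100000}"   # assumed default period: 100 ms
  echo $(( millicores * period_us / 1000 ))
}

cfs_quota_us 500    # limit of 500m: 50000 us of CPU time per period
cfs_quota_us 2000   # limit of 2 CPUs: 200000 us, shared by all threads
```

A multi-threaded process can burn through such a budget early in the period and then sit idle for the remainder, which is one way throttling shows up as latency even when average CPU usage looks low.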
A MILLION WAYS TO CRASH YOUR CLUSTER?
• Quick fix for timeouts using etcd-proxy: client-go still seems to have
issues with timeouts
MORE TOPICS
• Docker
RACE CONDITIONS..
• Switch to the latest Docker version available to fix the issues with Docker daemon freezing
• Redesign of DNS setup due to high DNS latencies (5s), switch from kube-dns to CoreDNS
• Quick fix for timeouts using etcd-proxy, since client-go still seems to have issues with timeouts
github.com/zalando-incubator/kubernetes-on-aws
TIMEOUTS TO API SERVER..
github.com/zalando-incubator/kubernetes-on-aws
MANAGED KUBERNETES?
WILL MANAGED K8S SAVE US?
NO (not really)
e.g. AWS EKS uptime SLA is only for API server
PRODUCTION PROOFING AWS EKS
https://medium.com/glia-tech/productionproofing-eks-ed52951ffd6c
AWS EKS IN PRODUCTION
https://kubedex.com/90-days-of-aws-eks-in-production/
DOCKER.. (ON GKE)
https://github.com/kubernetes/kubernetes/blob/8fd414537b5143ab039cb910590237cabf4af783/cluster/gce/gci/health-monitor.sh#L29
WELCOME TO CLOUD NATIVE!
KUBERNETES FAILURE STORIES
k8s.af
We need more failure talks!
Istio? Anyone?
OPEN SOURCE
Kubernetes on AWS
github.com/zalando-incubator/kubernetes-on-aws
External DNS
github.com/kubernetes-incubator/external-dns
Postgres Operator
github.com/zalando-incubator/postgres-operator
Kubernetes Downscaler
github.com/hjacobs/kube-downscaler
QUESTIONS?
HENNING JACOBS
HEAD OF DEVELOPER PRODUCTIVITY
henning@zalando.de
@try_except_
Illustrations by @01k