2019-05-21 Kubernetes Failure Stories - KubeCon Europe


Kubernetes

Failure Stories

HENNING JACOBS
@try_except_
ZALANDO AT A GLANCE

> 250 million visits per month
~ 5.4 billion EUR revenue 2018
> 300,000 product choices
> 26 million active customers
~ 2,000 brands
17 countries
> 15,000 employees in Europe
> 79% of visits via mobile devices

4
SCALE

380 Accounts

118 Clusters

5
DEVELOPERS USING KUBERNETES

6
47+ cluster
components

7
INCIDENTS ARE FINE
INCIDENT

#1
INCIDENT #1: CUSTOMER IMPACT

10
INCIDENT #1: CUSTOMER IMPACT

11
INCIDENT #1: INGRESS ERRORS

12
INCIDENT #1: AWS ALB 502

13 github.com/zalando/riptide
INCIDENT #1: AWS ALB 502

502 Bad Gateway

Server: awselb/2.0
...

14 github.com/zalando/riptide
INCIDENT #1: ALB HEALTHY HOST COUNT

3 healthy hosts

2xx requests

zero healthy hosts


15
LIFE OF A REQUEST (INGRESS)
TLS
EC2 network
ALB K8s network

HTTP

Node Skipper Node Skipper

MyApp MyApp MyApp


16
INCIDENT #1: SKIPPER MEMORY USAGE

Memory Usage

Memory Limit

17
INCIDENT #1: SKIPPER OOM
TLS

ALB

HTTP

Node Skipper Node Skipper OOMKill

MyApp MyApp MyApp


18
INCIDENT #1: CONTRIBUTING FACTORS

• Shared Ingress (per cluster)
• High latency of an unrelated app (Solr) caused a high number of in-flight requests
• Skipper creates a goroutine per HTTP request; a goroutine costs 2 kB of memory plus the http.Request
• Memory limit was fixed at 500Mi (4x regular usage)

Fix for the memory issue in Skipper:
https://opensource.zalando.com/skipper/operation/operation/#scheduler
19
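A back-of-envelope sketch of the failure mode above: a goroutine-per-request proxy's memory footprint scales with in-flight requests, and in-flight requests scale with backend latency (Little's law). The per-goroutine cost comes from the slide; the request size, request rate, and latency figures are assumptions for illustration only.

```python
# Rough model of a proxy that spawns one goroutine per in-flight request.
GOROUTINE_KB = 2         # per-goroutine cost from the slide
REQUEST_KB = 10          # assumed average buffered http.Request size
MEMORY_LIMIT_MB = 500    # fixed limit from the incident

def in_flight(requests_per_sec, latency_sec):
    """Little's law: concurrency = arrival rate x latency."""
    return requests_per_sec * latency_sec

def proxy_memory_mb(concurrent_requests):
    return concurrent_requests * (GOROUTINE_KB + REQUEST_KB) / 1024

# Healthy backend: 10k rps at 50 ms latency -> small footprint.
healthy = proxy_memory_mb(in_flight(10000, 0.05))
# Slow backend (e.g. Solr at 5 s): same rate, 100x the in-flight requests.
slow = proxy_memory_mb(in_flight(10000, 5.0))
print(round(healthy, 1), round(slow, 1))
```

With these numbers the healthy case stays a few MB while the slow-backend case blows past the 500Mi limit, which is why a latency problem in one app surfaced as an OOMKill in the shared Ingress proxy.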
INCIDENT

#2
INCIDENT #2: CUSTOMER IMPACT

21
INCIDENT #2: IAM RETURNING 404

22
INCIDENT #2: NUMBER OF PODS

23
LIFE OF A REQUEST (INGRESS)
TLS
EC2 network
ALB K8s network

HTTP

Node Skipper Node Skipper

MyApp MyApp MyApp


24
ROUTES FROM API SERVER

API Server ALB

Node Skipper Node Skipper

MyApp MyApp MyApp


25
API SERVER DOWN

API Server ALB

OOMKill

Node Skipper Node Skipper

MyApp MyApp MyApp


26
INCIDENT #2: INNOCENT MANIFEST

apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
spec:
  schedule: "*/15 9-19 * * Mon-Fri"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          # these three fields belong on the CronJob spec,
          # not the Pod spec, so here they are silently ignored
          concurrencyPolicy: Forbid
          successfulJobsHistoryLimit: 1
          failedJobsHistoryLimit: 1
          containers:
          ...
INCIDENT #2: FIXED CRON JOB

apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
spec:
  schedule: "7 8-18 * * Mon-Fri"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      activeDeadlineSeconds: 120
      template:
        spec:
          restartPolicy: Never
          containers:
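A rough accumulation model of why the innocent manifest hurt: with the history-limit fields misplaced (ignored) and no activeDeadlineSeconds, every failed run leaves its Pods behind while the */15 schedule keeps adding more. The per-run pod count and the unnoticed window are assumed numbers, not from the slides.

```python
# Pods leaked by a failing CronJob on "*/15 9-19 * * Mon-Fri"
# when nothing cleans up failed runs.
runs_per_hour = 4            # */15 -> every 15 minutes
active_hours_per_day = 11    # hours 9 through 19 inclusive
pods_per_failed_run = 1      # assumed: one Pod per Job run
days_unnoticed = 5           # assumed: one work week before anyone looks

leaked_pods = (runs_per_hour * active_hours_per_day
               * pods_per_failed_run * days_unnoticed)
print(leaked_pods)  # hundreds of dead Pods bloating the API server
```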
INCIDENT #2: LESSONS LEARNED

• Fix Ingress to stay "healthy" during API server problems
• Fix Ingress to retain the last known set of routes
• Use a quota for the number of pods

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "1500"

NOTE: we dropped quotas recently:
github.com/zalando-incubator/kubernetes-on-aws/pull/2059

29
INCIDENT

#3
INCIDENT #3: INGRESS ERRORS

31
INCIDENT #3: COREDNS OOMKILL

coredns invoked oom-killer:
gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=994

Memory cgroup out of memory: Kill process 6428 (coredns) score 2050 or sacrifice child

oom_reaper: reaped process 6428 (coredns), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

⇒ CoreDNS restarts

32
STOP THE BLEEDING: INCREASE MEMORY LIMIT

(memory limit raised from 200Mi to 2Gi, then to 4Gi)

33
SPIKE IN HTTP REQUESTS

34
SPIKE IN DNS QUERIES

35
INCREASE IN MEMORY USAGE

36
INCIDENT #3: CONTRIBUTING FACTORS

• HTTP retries
• No DNS caching
• Kubernetes ndots:5 problem
• Short maximum lifetime of HTTP connections
• Fixed memory limit for CoreDNS
• Monitoring affected by DNS outage
37 github.com/zalando-incubator/kubernetes-on-aws/blob/dev/docs/postmortems/jan-2019-dns-outage.md
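For context on the "ndots:5 problem" bullet: a Pod's resolv.conf typically carries `options ndots:5` plus three search domains, so any external name with fewer than five dots is tried against every search domain before being queried as-is, multiplying DNS traffic. A small Python sketch of the resolver's lookup order; the search domains below are the usual defaults for a Pod in the `default` namespace, assumed here rather than taken from the slides:

```python
# How a Pod's resolv.conf with "options ndots:5" expands a lookup.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
NDOTS = 5

def queries_for(name):
    """Return the DNS names actually queried, in order."""
    if name.endswith("."):          # fully qualified: one query, no expansion
        return [name]
    if name.count(".") >= NDOTS:    # "enough" dots: try the name as-is first
        return [name + "."] + [f"{name}.{d}." for d in SEARCH]
    # fewer than ndots dots: try every search domain before the bare name
    return [f"{name}.{d}." for d in SEARCH] + [name + "."]

# An external lookup like this triggers 4 queries instead of 1,
# and the first 3 are guaranteed NXDOMAIN:
print(queries_for("example.com"))
```

That 4x amplification, combined with the HTTP retries and missing DNS cache above, is what turned a request spike into a DNS query spike.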
INCIDENT

#4
INCIDENT #4: CLUSTER DOWN

39
INCIDENT #4: MANUAL OPERATION

% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix

40
INCIDENT #4: RTFM

% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix


help: etcdctl del [options] <key> [range_end]

41
42
Junior Engineers are Features, not Bugs
https://www.youtube.com/watch?v=cQta4G3ge44
https://www.outcome-eng.com/human-error-never-root-cause/
INCIDENT #4: LESSONS LEARNED

• Disaster Recovery Plan?


• Backup etcd to S3
• Monitor the snapshots

44
INCIDENT

#5
INCIDENT #5: API LATENCY SPIKES

46
INCIDENT #5: CONNECTION ISSUES

Master Node

API Server
etcd
etcd-member

...
Kubernetes worker and master nodes sporadically fail to connect to etcd
causing timeouts in the APIserver and disconnects in the pod network.
...
47
INCIDENT #5: STOP THE BLEEDING

#!/bin/bash

while true; do
    echo "sleep for 60 seconds"
    sleep 60
    timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null
    if [ $? -eq 0 ]; then
        echo "all fine, no need to restart etcd member"
        continue
    else
        echo "restarting etcd-member"
        systemctl restart etcd-member
    fi
done

48
INCIDENT #5: CONFIRMATION FROM AWS

[...]
We can’t go into the details [...] that resulted the networking problems during
the “non-intrusive maintenance”, as it relates to internal workings of EC2.
We can confirm this only affected the T2 instance types, ...
[...]
We don’t explicitly recommend against running production services on T2
[...]

49
INCIDENT #5: LESSONS LEARNED

• It's never the AWS infrastructure until it is

• Treat t2 instances with care

• Kubernetes components are not necessarily "cloud native"

Cloud Native? Declarative, dynamic, resilient, and scalable

50
INCIDENT

#6
INCIDENT #6: IMPACT

Ingress
5XXs

52
INCIDENT #6: CLUSTER DOWN?

53
INCIDENT #6: THE TRIGGER

54
https://www.outcome-eng.com/human-error-never-root-cause/
CLUSTER UPGRADE
FLOW

56
CLUSTER LIFECYCLE MANAGER (CLM)

github.com/zalando-incubator/cluster-lifecycle-manager
57
CLUSTER CHANNELS

Channel   Description                                                    Clusters
dev       Development and playground clusters.                           3
alpha     Main infrastructure clusters (important to us).                2
beta      Product clusters for the rest of the organization (non-prod).  57+
stable    Product clusters for the rest of the organization (prod).     57+

58 github.com/zalando-incubator/kubernetes-on-aws
E2E TESTS ON EVERY PR

github.com/zalando-incubator/kubernetes-on-aws
59
RUNNING E2E TESTS (BEFORE)
Testing dev to alpha upgrade

branch: dev
(diagram: cluster with control plane and nodes)

Create Cluster → Run e2e tests → Delete Cluster

60
RUNNING E2E TESTS (NOW)
Testing dev to alpha upgrade

branch: alpha (base) → branch: dev (head)
(diagram: cluster bootstrapped from the base configuration, then updated to the head configuration)

Create Cluster → Update Cluster → Run e2e tests → Delete Cluster

61
INCIDENT #6: LESSONS LEARNED

• Automated e2e tests are pretty good, but not enough

• Test the diff/migration automatically

• Bootstrap new cluster with previous configuration

• Apply new configuration

• Run end-to-end & conformance tests

62 github.com/zalando-incubator/kubernetes-on-aws/tree/dev/test/e2e
INCIDENT

#7
#7: KERNEL OOM KILLER

⇒ all containers
on this node down

64
INCIDENT #7: KUBELET MEMORY

65
UPSTREAM ISSUE REPORTED

https://github.com/kubernetes/kubernetes/issues/73587

66
INCIDENT #7: THE PATCH

67 https://github.com/kubernetes/kubernetes/issues/73587
INCIDENT

#8
INCIDENT #8: IMPACT

Error during Pod creation:

MountVolume.SetUp failed for volume
"outfit-delivery-api-credentials":
secrets "outfit-delivery-api-credentials" not found

⇒ All new Kubernetes deployments fail

69
INCIDENT #8: CREDENTIALS QUEUE

17:30:07 | [pool-6-thread-1 ] | Current queue size: 7115, current number of active workers: 20
17:31:07 | [pool-6-thread-1 ] | Current queue size: 7505, current number of active workers: 20
17:32:07 | [pool-6-thread-1 ] | Current queue size: 7886, current number of active workers: 20
..
17:37:07 | [pool-6-thread-1 ] | Current queue size: 9686, current number of active workers: 20
..
17:44:07 | [pool-6-thread-1 ] | Current queue size: 11976, current number of active workers: 20
..
19:16:07 | [pool-6-thread-1 ] | Current queue size: 58381, current number of active workers: 20

70
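The log above makes the imbalance easy to quantify: arrivals outpace what the 20 throttled workers can drain, so the backlog grows roughly linearly. A quick check in Python using the first and last samples:

```python
# Two samples from the credential provider's queue log
# (timestamps converted to minutes since midnight).
t0, q0 = 17 * 60 + 30, 7115    # 17:30:07
t1, q1 = 19 * 60 + 16, 58381   # 19:16:07

growth_per_min = (q1 - q0) / (t1 - t0)
print(round(growth_per_min))   # net queue growth per minute
```

A backlog growing by hundreds of items per minute while worker count stays flat is the signature of a service rate below the arrival rate; the CPU-throttling graph on the next slide shows why the workers were so slow.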
INCIDENT #8: CPU THROTTLING

71
INCIDENT #8: WHAT HAPPENED

Scaled down the IAM provider to reduce Slack
+ number of deployments increased

⇒ the provider could not process credentials fast enough

72
SLACK

CPU/memory requests "block" resources on nodes.
The difference between actual usage and requests → "Slack"

(diagram: node with CPU and memory bars; the gap between requests and actual usage is the "Slack")

73
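A minimal sketch of the slack computation the slide describes; the node and container numbers are invented for illustration:

```python
# Slack = resources requested (and thus blocked on the node)
# minus what is actually used. All numbers are assumed examples.
node_allocatable_cpu = 4.0   # cores the scheduler can hand out
requested_cpu = 3.2          # sum of container CPU requests on the node
used_cpu = 1.1               # actual measured usage

slack_cpu = requested_cpu - used_cpu             # blocked but idle
free_cpu = node_allocatable_cpu - requested_cpu  # schedulable headroom

print(round(slack_cpu, 1), round(free_cpu, 1))
```

In this example over half the node's CPU is slack: paid for and unschedulable, but never used, which is why reducing slack (as in incident #8) is tempting and risky at the same time.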
DISABLING CPU THROTTLING
kubelet … --cpu-cfs-quota=false

[Announcement] CPU limits will be disabled

⇒ Ingress Latency Improvements

74
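For context on what `--cpu-cfs-quota=false` turns off: with the CFS quota enabled, a container's CPU limit is enforced as a runtime budget per scheduling period, and a multi-threaded service can burn through the budget early in each period and then sit throttled for the rest. A sketch with the kernel's default 100 ms period; the limit and thread count are assumed examples:

```python
# CFS bandwidth control model: per period (default cfs_period_us = 100ms),
# a cgroup may run for quota = cpu_limit * period across all its threads,
# after which every thread is throttled until the next period.
PERIOD_MS = 100
cpu_limit = 0.5   # assumed container limit of 500m CPU
threads = 4       # assumed worker threads running in parallel

quota_ms = cpu_limit * PERIOD_MS         # CPU-time budget per period
burn_time_ms = quota_ms / threads        # wall time until the budget is gone
throttled_ms = PERIOD_MS - burn_time_ms  # stalled for the rest of the period

print(quota_ms, burn_time_ms, throttled_ms)
```

Under this model the container is runnable for only 12.5 ms out of every 100 ms of wall time, so even lightly loaded request handlers see tail latencies of tens of milliseconds, which is the latency effect the slide's announcement refers to.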
A MILLION WAYS TO CRASH YOUR CLUSTER?

• Switch to latest Docker to fix issues with the Docker daemon freezing
• Redesign of DNS setup due to high DNS latencies (5s): switch from kube-dns to node-local dnsmasq + CoreDNS
• Disabling CPU throttling (CFS quota) to avoid latency issues
• Quick fix for timeouts using etcd-proxy: client-go still seems to have issues with timeouts
• 502's during cluster updates: race condition during network setup

75
MORE TOPICS

• Graceful Pod shutdown and race conditions (endpoints, Ingress)
• Incompatible Kubernetes changes
• CoreOS ContainerLinux "stable" won't boot
• Kubernetes EBS volume handling
• Docker

76
RACE CONDITIONS..

• Switch to the latest Docker version available to fix the issues with Docker daemon freezing

• Redesign of DNS setup due to high DNS latencies (5s), switch from kube-dns to CoreDNS

• Disabling CPU throttling (CFS quota) to avoid latency issues

• Quick fix for timeouts using etcd-proxy, since client-go still seems to have issues with timeouts

• 502's during cluster updates: race condition

77
github.com/zalando-incubator/kubernetes-on-aws
TIMEOUTS TO API SERVER..

github.com/zalando-incubator/kubernetes-on-aws

78
MANAGED
KUBERNETES?

79
WILL MANAGED K8S SAVE US?

GKE: monthly uptime percentage at 99.95% for regional clusters

80
WILL MANAGED K8S SAVE US?

NO (not really)

e.g. the AWS EKS uptime SLA covers only the API server

81
PRODUCTION PROOFING AWS EKS

List of things you might want to look at for EKS in production:

https://medium.com/glia-tech/productionproofing-eks-ed52951ffd6c

82
AWS EKS IN PRODUCTION

https://kubedex.com/90-days-of-aws-eks-in-production/

83
DOCKER.. (ON GKE)

https://github.com/kubernetes/kubernetes/blob/8fd414537b5143ab039cb910590237cabf4af783/cluster/gce/gci/health-monitor.sh#L29

84
WELCOME TO
CLOUD NATIVE!
86
KUBERNETES FAILURE STORIES

A compiled list of links to public failure stories related to Kubernetes.

k8s.af
We need more failure talks!

87
Istio? Anyone?
OPEN SOURCE
Kubernetes on AWS
github.com/zalando-incubator/kubernetes-on-aws

AWS ALB Ingress controller


github.com/zalando-incubator/kube-ingress-aws-controller

Skipper HTTP Router & Ingress controller


github.com/zalando/skipper

External DNS
github.com/kubernetes-incubator/external-dns

Postgres Operator
github.com/zalando-incubator/postgres-operator

Kubernetes Resource Report


github.com/hjacobs/kube-resource-report

Kubernetes Downscaler
github.com/hjacobs/kube-downscaler

88
QUESTIONS?

HENNING JACOBS
HEAD OF
DEVELOPER PRODUCTIVITY

henning@zalando.de
@try_except_

Illustrations by @01k
