2019-05-21 Kubernetes Failure Stories - KubeCon Europe


Kubernetes

Failure Stories

HENNING JACOBS
@try_except_
ZALANDO AT A GLANCE

> 250 million visits per month
~ 5.4 billion EUR revenue 2018
> 300,000 product choices
> 26 million active customers
~ 2,000 brands
17 countries
> 15,000 employees in Europe
> 79% of visits via mobile devices

4
SCALE

380 Accounts

118 Clusters

5
DEVELOPERS USING KUBERNETES

6
47+ cluster
components

7
INCIDENTS ARE FINE
INCIDENT

#1
INCIDENT #1: CUSTOMER IMPACT

10
INCIDENT #1: CUSTOMER IMPACT

11
INCIDENT #1: INGRESS ERRORS

12
INCIDENT #1: AWS ALB 502

13 github.com/zalando/riptide
INCIDENT #1: AWS ALB 502

502 Bad Gateway

Server: awselb/2.0
...

14 github.com/zalando/riptide
INCIDENT #1: ALB HEALTHY HOST COUNT

3 healthy hosts

2xx requests

zero healthy hosts


15
LIFE OF A REQUEST (INGRESS)
TLS
EC2 network
ALB K8s network

HTTP

Node Skipper Node Skipper

MyApp MyApp MyApp


16
INCIDENT #1: SKIPPER MEMORY USAGE

Memory Usage

Memory Limit

17
INCIDENT #1: SKIPPER OOM
TLS

ALB

HTTP

Node Skipper Node Skipper OOMKill

MyApp MyApp MyApp


18
INCIDENT #1: CONTRIBUTING FACTORS

• Shared Ingress (per cluster)
• High latency of an unrelated app (Solr) caused a high number of in-flight requests
• Skipper creates a goroutine per HTTP request; a goroutine costs 2 kB of memory plus the http.Request
• Memory limit was fixed at 500Mi (4x regular usage)

Fix for the memory issue in Skipper:
https://opensource.zalando.com/skipper/operation/operation/#scheduler
19
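A back-of-envelope sketch of the failure mode above: a goroutine-per-request proxy's memory footprint scales with in-flight requests, and in-flight requests scale with backend latency (Little's law). The per-goroutine cost comes from the slide; the request size, request rate, and latency figures are assumptions for illustration only.

```python
# Rough model of a proxy that spawns one goroutine per in-flight request.
GOROUTINE_KB = 2         # per-goroutine cost from the slide
REQUEST_KB = 10          # assumed average buffered http.Request size
MEMORY_LIMIT_MB = 500    # fixed limit from the incident

def in_flight(requests_per_sec, latency_sec):
    """Little's law: concurrency = arrival rate x latency."""
    return requests_per_sec * latency_sec

def proxy_memory_mb(concurrent_requests):
    return concurrent_requests * (GOROUTINE_KB + REQUEST_KB) / 1024

# Healthy backend: 10k rps at 50 ms latency -> small footprint.
healthy = proxy_memory_mb(in_flight(10000, 0.05))
# Slow backend (e.g. Solr at 5 s): same rate, 100x the in-flight requests.
slow = proxy_memory_mb(in_flight(10000, 5.0))
print(round(healthy, 1), round(slow, 1))
```

With these numbers the healthy case stays a few MB while the slow-backend case blows past the 500Mi limit, which is why a latency problem in one app surfaced as an OOMKill in the shared Ingress proxy.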
INCIDENT

#2
INCIDENT #2: CUSTOMER IMPACT

21
INCIDENT #2: IAM RETURNING 404

22
INCIDENT #2: NUMBER OF PODS

23
LIFE OF A REQUEST (INGRESS)
TLS
EC2 network
ALB K8s network

HTTP

Node Skipper Node Skipper

MyApp MyApp MyApp


24
ROUTES FROM API SERVER

API Server ALB

Node Skipper Node Skipper

MyApp MyApp MyApp


25
API SERVER DOWN

API Server ALB

OOMKill

Node Skipper Node Skipper

MyApp MyApp MyApp


26
INCIDENT #2: INNOCENT MANIFEST

apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
spec:
  schedule: "*/15 9-19 * * Mon-Fri"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          # these three fields belong on the CronJob spec,
          # not the Pod spec, so here they are silently ignored
          concurrencyPolicy: Forbid
          successfulJobsHistoryLimit: 1
          failedJobsHistoryLimit: 1
          containers:
          ...
INCIDENT #2: FIXED CRON JOB

apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
spec:
  schedule: "7 8-18 * * Mon-Fri"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      activeDeadlineSeconds: 120
      template:
        spec:
          restartPolicy: Never
          containers:
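A rough accumulation model of why the innocent manifest hurt: with the history-limit fields misplaced (ignored) and no activeDeadlineSeconds, every failed run leaves its Pods behind while the */15 schedule keeps adding more. The per-run pod count and the unnoticed window are assumed numbers, not from the slides.

```python
# Pods leaked by a failing CronJob on "*/15 9-19 * * Mon-Fri"
# when nothing cleans up failed runs.
runs_per_hour = 4            # */15 -> every 15 minutes
active_hours_per_day = 11    # hours 9 through 19 inclusive
pods_per_failed_run = 1      # assumed: one Pod per Job run
days_unnoticed = 5           # assumed: one work week before anyone looks

leaked_pods = (runs_per_hour * active_hours_per_day
               * pods_per_failed_run * days_unnoticed)
print(leaked_pods)  # hundreds of dead Pods bloating the API server
```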
INCIDENT #2: LESSONS LEARNED

• Fix Ingress to stay "healthy" during API server problems
• Fix Ingress to retain the last known set of routes
• Use a quota for the number of pods

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "1500"

NOTE: we dropped quotas recently:
github.com/zalando-incubator/kubernetes-on-aws/pull/2059

29
INCIDENT

#3
INCIDENT #3: INGRESS ERRORS

31
INCIDENT #3: COREDNS OOMKILL

coredns invoked oom-killer:
gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=994

Memory cgroup out of memory: Kill process 6428 (coredns) score 2050 or sacrifice child

oom_reaper: reaped process 6428 (coredns), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

⇒ CoreDNS restarts

32
STOP THE BLEEDING: INCREASE MEMORY LIMIT

(memory limit raised from 200Mi to 2Gi, then to 4Gi)

33
SPIKE IN HTTP REQUESTS

34
SPIKE IN DNS QUERIES

35
INCREASE IN MEMORY USAGE

36
INCIDENT #3: CONTRIBUTING FACTORS

• HTTP retries
• No DNS caching
• Kubernetes ndots:5 problem
• Short maximum lifetime of HTTP connections
• Fixed memory limit for CoreDNS
• Monitoring affected by DNS outage
37 github.com/zalando-incubator/kubernetes-on-aws/blob/dev/docs/postmortems/jan-2019-dns-outage.md
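For context on the "ndots:5 problem" bullet: a Pod's resolv.conf typically carries `options ndots:5` plus three search domains, so any external name with fewer than five dots is tried against every search domain before being queried as-is, multiplying DNS traffic. A small Python sketch of the resolver's lookup order; the search domains below are the usual defaults for a Pod in the `default` namespace, assumed here rather than taken from the slides:

```python
# How a Pod's resolv.conf with "options ndots:5" expands a lookup.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
NDOTS = 5

def queries_for(name):
    """Return the DNS names actually queried, in order."""
    if name.endswith("."):          # fully qualified: one query, no expansion
        return [name]
    if name.count(".") >= NDOTS:    # "enough" dots: try the name as-is first
        return [name + "."] + [f"{name}.{d}." for d in SEARCH]
    # fewer than ndots dots: try every search domain before the bare name
    return [f"{name}.{d}." for d in SEARCH] + [name + "."]

# An external lookup like this triggers 4 queries instead of 1,
# and the first 3 are guaranteed NXDOMAIN:
print(queries_for("example.com"))
```

That 4x amplification, combined with the HTTP retries and missing DNS cache above, is what turned a request spike into a DNS query spike.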
INCIDENT

#4
INCIDENT #4: CLUSTER DOWN

39
INCIDENT #4: MANUAL OPERATION

% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix

40
INCIDENT #4: RTFM

% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix


help: etcdctl del [options] <key> [range_end]

41
42
Junior Engineers are Features, not Bugs
https://www.youtube.com/watch?v=cQta4G3ge44
https://www.outcome-eng.com/human-error-never-root-cause/
INCIDENT #4: LESSONS LEARNED

• Disaster Recovery Plan?


• Backup etcd to S3
• Monitor the snapshots

44
INCIDENT

#5
INCIDENT #5: API LATENCY SPIKES

46
INCIDENT #5: CONNECTION ISSUES

Master Node

API Server
etcd
etcd-member

...
Kubernetes worker and master nodes sporadically fail to connect to etcd
causing timeouts in the APIserver and disconnects in the pod network.
...
47
INCIDENT #5: STOP THE BLEEDING

#!/bin/bash

while true; do
    echo "sleep for 60 seconds"
    sleep 60
    timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null
    if [ $? -eq 0 ]; then
        echo "all fine, no need to restart etcd member"
        continue
    else
        echo "restarting etcd-member"
        systemctl restart etcd-member
    fi
done

48
INCIDENT #5: CONFIRMATION FROM AWS

[...]
We can’t go into the details [...] that resulted the networking problems during
the “non-intrusive maintenance”, as it relates to internal workings of EC2.
We can confirm this only affected the T2 instance types, ...
[...]
We don’t explicitly recommend against running production services on T2
[...]

49
INCIDENT #5: LESSONS LEARNED

• It's never the AWS infrastructure until it is

• Treat t2 instances with care

• Kubernetes components are not necessarily "cloud native"

Cloud Native? Declarative, dynamic, resilient, and scalable

50
INCIDENT

#6
INCIDENT #6: IMPACT

Ingress
5XXs

52
INCIDENT #6: CLUSTER DOWN?

53
INCIDENT #6: THE TRIGGER

54
https://www.outcome-eng.com/human-error-never-root-cause/
CLUSTER UPGRADE
FLOW

56
CLUSTER LIFECYCLE MANAGER (CLM)

github.com/zalando-incubator/cluster-lifecycle-manager
57
CLUSTER CHANNELS

Channel   Description                                                    Clusters
dev       Development and playground clusters.                           3
alpha     Main infrastructure clusters (important to us).                2
beta      Product clusters for the rest of the organization (non-prod).  57+
stable    Product clusters for the rest of the organization (prod).     57+

58 github.com/zalando-incubator/kubernetes-on-aws
E2E TESTS ON EVERY PR

github.com/zalando-incubator/kubernetes-on-aws
59
RUNNING E2E TESTS (BEFORE)
Testing dev to alpha upgrade

branch: dev
(diagram: cluster with control plane and nodes)

Create Cluster → Run e2e tests → Delete Cluster

60
RUNNING E2E TESTS (NOW)
Testing dev to alpha upgrade

branch: alpha (base) → branch: dev (head)
(diagram: cluster bootstrapped from the base configuration, then updated to the head configuration)

Create Cluster → Update Cluster → Run e2e tests → Delete Cluster

61
INCIDENT #6: LESSONS LEARNED

• Automated e2e tests are pretty good, but not enough

• Test the diff/migration automatically

• Bootstrap new cluster with previous configuration

• Apply new configuration

• Run end-to-end & conformance tests

62 github.com/zalando-incubator/kubernetes-on-aws/tree/dev/test/e2e
INCIDENT

#7
#7: KERNEL OOM KILLER

⇒ all containers
on this node down

64
INCIDENT #7: KUBELET MEMORY

65
UPSTREAM ISSUE REPORTED

https://github.com/kubernetes/kubernetes/issues/73587

66
INCIDENT #7: THE PATCH

67 https://github.com/kubernetes/kubernetes/issues/73587
INCIDENT

#8
INCIDENT #8: IMPACT

Error during Pod creation:

MountVolume.SetUp failed for volume
"outfit-delivery-api-credentials":
secrets "outfit-delivery-api-credentials" not found

⇒ All new Kubernetes deployments fail

69
INCIDENT #8: CREDENTIALS QUEUE

17:30:07 | [pool-6-thread-1 ] | Current queue size: 7115, current number of active workers: 20
17:31:07 | [pool-6-thread-1 ] | Current queue size: 7505, current number of active workers: 20
17:32:07 | [pool-6-thread-1 ] | Current queue size: 7886, current number of active workers: 20
..
17:37:07 | [pool-6-thread-1 ] | Current queue size: 9686, current number of active workers: 20
..
17:44:07 | [pool-6-thread-1 ] | Current queue size: 11976, current number of active workers: 20
..
19:16:07 | [pool-6-thread-1 ] | Current queue size: 58381, current number of active workers: 20

70
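The log above makes the imbalance easy to quantify: arrivals outpace what the 20 throttled workers can drain, so the backlog grows roughly linearly. A quick check in Python using the first and last samples:

```python
# Two samples from the credential provider's queue log
# (timestamps converted to minutes since midnight).
t0, q0 = 17 * 60 + 30, 7115    # 17:30:07
t1, q1 = 19 * 60 + 16, 58381   # 19:16:07

growth_per_min = (q1 - q0) / (t1 - t0)
print(round(growth_per_min))   # net queue growth per minute
```

A backlog growing by hundreds of items per minute while worker count stays flat is the signature of a service rate below the arrival rate; the CPU-throttling graph on the next slide shows why the workers were so slow.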
INCIDENT #8: CPU THROTTLING

71
INCIDENT #8: WHAT HAPPENED

Scaled down the IAM provider to reduce Slack
+ number of deployments increased

⇒ the provider could not process credentials fast enough

72
SLACK

CPU/memory requests "block" resources on nodes.
The difference between actual usage and requests → "Slack"

(diagram: node with CPU and memory bars; the gap between requests and actual usage is the "Slack")

73
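A minimal sketch of the slack computation the slide describes; the node and container numbers are invented for illustration:

```python
# Slack = resources requested (and thus blocked on the node)
# minus what is actually used. All numbers are assumed examples.
node_allocatable_cpu = 4.0   # cores the scheduler can hand out
requested_cpu = 3.2          # sum of container CPU requests on the node
used_cpu = 1.1               # actual measured usage

slack_cpu = requested_cpu - used_cpu             # blocked but idle
free_cpu = node_allocatable_cpu - requested_cpu  # schedulable headroom

print(round(slack_cpu, 1), round(free_cpu, 1))
```

In this example over half the node's CPU is slack: paid for and unschedulable, but never used, which is why reducing slack (as in incident #8) is tempting and risky at the same time.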
DISABLING CPU THROTTLING
kubelet … --cpu-cfs-quota=false

[Announcement] CPU limits will be disabled

⇒ Ingress Latency Improvements

74
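For context on what `--cpu-cfs-quota=false` turns off: with the CFS quota enabled, a container's CPU limit is enforced as a runtime budget per scheduling period, and a multi-threaded service can burn through the budget early in each period and then sit throttled for the rest. A sketch with the kernel's default 100 ms period; the limit and thread count are assumed examples:

```python
# CFS bandwidth control model: per period (default cfs_period_us = 100ms),
# a cgroup may run for quota = cpu_limit * period across all its threads,
# after which every thread is throttled until the next period.
PERIOD_MS = 100
cpu_limit = 0.5   # assumed container limit of 500m CPU
threads = 4       # assumed worker threads running in parallel

quota_ms = cpu_limit * PERIOD_MS         # CPU-time budget per period
burn_time_ms = quota_ms / threads        # wall time until the budget is gone
throttled_ms = PERIOD_MS - burn_time_ms  # stalled for the rest of the period

print(quota_ms, burn_time_ms, throttled_ms)
```

Under this model the container is runnable for only 12.5 ms out of every 100 ms of wall time, so even lightly loaded request handlers see tail latencies of tens of milliseconds, which is the latency effect the slide's announcement refers to.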
A MILLION WAYS TO CRASH YOUR CLUSTER?

• Switch to latest Docker to fix issues with the Docker daemon freezing
• Redesign of DNS setup due to high DNS latencies (5s): switch from kube-dns to node-local dnsmasq + CoreDNS
• Disabling CPU throttling (CFS quota) to avoid latency issues
• Quick fix for timeouts using etcd-proxy: client-go still seems to have issues with timeouts
• 502's during cluster updates: race condition during network setup

75
MORE TOPICS

• Graceful Pod shutdown and race conditions (endpoints, Ingress)
• Incompatible Kubernetes changes
• CoreOS ContainerLinux "stable" won't boot
• Kubernetes EBS volume handling
• Docker

76
RACE CONDITIONS..

• Switch to the latest Docker version available to fix the issues with Docker daemon freezing

• Redesign of DNS setup due to high DNS latencies (5s), switch from kube-dns to CoreDNS

• Disabling CPU throttling (CFS quota) to avoid latency issues

• Quick fix for timeouts using etcd-proxy, since client-go still seems to have issues with timeouts

• 502's during cluster updates: race condition

77
github.com/zalando-incubator/kubernetes-on-aws
TIMEOUTS TO API SERVER..

github.com/zalando-incubator/kubernetes-on-aws

78
MANAGED
KUBERNETES?

79
WILL MANAGED K8S SAVE US?

GKE: monthly uptime percentage at 99.95% for regional clusters

80
WILL MANAGED K8S SAVE US?

NO (not really)

e.g. the AWS EKS uptime SLA covers only the API server

81
PRODUCTION PROOFING AWS EKS

List of things you might want to look at for EKS in production:

https://medium.com/glia-tech/productionproofing-eks-ed52951ffd6c

82
AWS EKS IN PRODUCTION

https://kubedex.com/90-days-of-aws-eks-in-production/

83
DOCKER.. (ON GKE)

https://github.com/kubernetes/kubernetes/blob/8fd414537b5143ab039cb910590237cabf4af783/cluster/gce/gci/health-monitor.sh#L29

84
WELCOME TO
CLOUD NATIVE!
86
KUBERNETES FAILURE STORIES

A compiled list of links to public failure stories related to Kubernetes.

k8s.af
We need more failure talks!

87
Istio? Anyone?
OPEN SOURCE
Kubernetes on AWS
github.com/zalando-incubator/kubernetes-on-aws

AWS ALB Ingress controller


github.com/zalando-incubator/kube-ingress-aws-controller

Skipper HTTP Router & Ingress controller


github.com/zalando/skipper

External DNS
github.com/kubernetes-incubator/external-dns

Postgres Operator
github.com/zalando-incubator/postgres-operator

Kubernetes Resource Report


github.com/hjacobs/kube-resource-report

Kubernetes Downscaler
github.com/hjacobs/kube-downscaler

88
QUESTIONS?

HENNING JACOBS
HEAD OF
DEVELOPER PRODUCTIVITY

henning@zalando.de
@try_except_

Illustrations by @01k
