SRE and Incident Management

MODULE 7
Workshop Series Overview
Helen Beal
Herder of Humans
@bealhelen
Helen Beal is a DevOps and Ways of Working
coach, Chief Ambassador at DevOps Institute
and an ambassador for the Continuous Delivery
Foundation. She is the Chair of the Value
Stream Management Consortium and provides
strategic advisory services to DevOps industry
leaders such as Plutora and Moogsoft. She is
also an analyst at Accelerated Strategies
Group. She hosts the Day-to-Day DevOps
webinar series for BrightTalk, speaks regularly
on DevOps topics, is a DevOps editor for InfoQ
and also writes for a number of other online
platforms. She regularly appears in
TechBeacon’s DevOps Top100 lists and was
recognized as the Top DevOps Evangelist 2020
in the DevOps Dozen awards.
Flow: Talk map
● What is SRE?
● Its origins
● Implementation drivers
● SRE Practices: Toil; SLAs, SLOs, SLIs; Error budgets
● SRE and observability
● AIOps and IM
● Chaos engineering
What is SRE?
● The goal is to create ultra-scalable and highly reliable distributed software systems
● SREs spend 50% of their time doing "ops" related work such as issue resolution, on-call, and manual interventions
● SREs spend 50% of their time on development tasks such as new features, scaling or automation
● Observability, monitoring, alerting and automation are a large part of SRE
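The 50/50 split above can be checked against tracked time. The following is a minimal sketch, assuming time entries are labelled "ops" or "dev" (hypothetical labels) and treating the 50% ops share as a ceiling, which is an assumption rather than something the slide states.

```python
from dataclasses import dataclass

# Assumption: the 50% ops share from the slide is treated as a ceiling.
OPS_SHARE_TARGET = 0.50


@dataclass
class TimeEntry:
    hours: float
    kind: str  # "ops" or "dev" (hypothetical labels)


def ops_share(entries):
    """Fraction of all tracked hours that went to ops work."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    ops = sum(e.hours for e in entries if e.kind == "ops")
    return ops / total


def over_ops_target(entries, target=OPS_SHARE_TARGET):
    """True when ops work exceeds the target share."""
    return ops_share(entries) > target


# Illustrative week: 22 ops hours out of 40 tracked hours.
week = [TimeEntry(12, "ops"), TimeEntry(10, "ops"), TimeEntry(18, "dev")]
print(f"ops share: {ops_share(week):.0%}, over target: {over_ops_target(week)}")
```

A team that is over the target for several weeks in a row has a concrete signal that toil is crowding out development work.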
Where has the concept come from?
● Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems
● Created at Google around 2003 and publicized via SRE books
“What happens when a software engineer is tasked with what used to be called operations.”
Ben Treynor, Google
Let’s ask Mark
How do you connect SRE and DevOps?
Results in higher performing organizations and sustainable, scalable business
Higher customer engagement leads to increased revenues and profitability
Sublime customer experience leads to more usage, more positive reviews and referrals
Toil and the wisdom of production
● Any manual, mandated operational task is bad
● If a task can be automated, then it should be automated
● Tasks can provide the "wisdom of production" that will inform better system design and behavior
SREs must have time to make tomorrow better than today
Let’s ask Mark
What do you see as the most common causes of toil?
Moving forward to SRE at Slack
● Slack moved from 100 AWS instances to 15,000 instances over 4 years
● Excessive toil was caused by low-quality, noisy alerting
● Ops teams were so consumed by interrupt-driven toil that they were unable to make progress on improving reliability
● Slack explicitly committed to the importance of reliability over feature velocity
● Operational ownership of services was pushed back into the dev teams, resulting in the teams making the code fixes necessary to stop the incident alerts
Reducing unplanned work and technical debt
How investing in SRE increases innovation
[Figure: what the team spends their time doing, without SRE vs. with SRE. Categories: value creation, unplanned work, learning, technical debt]
SRE principles and practices
Source: The SRE Blueprint @ DevOps Institute

SLAs, SLOs and SLIs
● SLA (Service Level Agreement): A business contract that comes into effect when your users are so unhappy you have to compensate them in some fashion
● SLO (Service Level Objective): Specify a target level for the reliability of your service e.g., what the success rate should be 98% (it’s never 100%)
● SLI (Service Level Indicator): An indicator of the level of service that you are providing e.g., http request success rate 99%
In SRE services are managed to the SLO
SLOs need consequences if they are violated
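The relationship between an SLI and an SLO can be shown in a few lines. This is a minimal sketch using the slide's 98% success-rate objective; the request counts and function names are illustrative assumptions.

```python
def success_rate_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that succeeded in the measurement window."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """The SLO is met when the measured SLI is at or above the target."""
    return sli >= slo_target


SLO_TARGET = 0.98  # the example objective from the slide

# Illustrative counts: 9,900 successful requests out of 10,000.
sli = success_rate_sli(good_requests=9_900, total_requests=10_000)
print(f"SLI = {sli:.2%}, SLO met: {meets_slo(sli, SLO_TARGET)}")
```

The SLI is what is measured, the SLO is the target it is compared with, and the consequence for a miss is a policy decision that sits outside this code.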

The VALET dimensions of SLO
V: Volume/traffic
   SLO: Does the service handle the right volumes of data or traffic?
   Budget: 99.99% of HTTP requests per month succeed with 200 OK
   Policy: Address scalability issues
A: Availability
   SLO: Is the service available to users when they need it?
   Budget: 99.9% availability/uptime
   Policy: Address downtime issues/outages, zero downtime deployments
L: Latency
   SLO: Does the service deliver in a user-acceptable period of time?
   Budget: Payload of 90% of HTTP responses returned in under 300ms
   Policy: Address performance issues
E: Errors
   SLO: Is the service delivering the capabilities being requested?
   Budget: 0.01% of HTTP requests return 4xx or 5xx status codes
   Policy: Analyze and respond to main status codes, new functionality or infrastructure may be required
T: Tickets
   SLO: Are our support services efficient?
   Budget: 75% of service tickets are automatically resolved
   Policy: Automate more manual processes
Error budgets
The SLO is a proxy for customer happiness
[Figure: gauge with scale labels 50%, 70%, 75% and 90%, marking the SLA, the SLO and the Error Budget]
Error budgets by SLI and SLO
Format
   SLI: [Metric identifier] [Operator] [Metric]
   SLO: [Objective] [SLI] [Period]
   Error budget: [Error Budget] [SLI]
Example 1
   SLI: Home page request served in < 100 ms
   SLO: 95% of home page requests served in < 100 ms over past 24 hours
   Error budget: Allow 5% failure of home page requests served in < 100 ms over past 24 hours
Example 2
   SLI: 95th percentile of home page latency over 5 mins < 200 ms
   SLO: 99% of 95th percentile of home page latency over 5 mins < 200 ms for the past month
   Error budget: Allow 1% failure of 95th percentile home page latency over 5 mins < 200 ms for the past month
Example 3
   SLI: Requests should be completed within 250 ms
   SLO: 95% of requests should be completed within 250 ms over 24 hours
   Error budget: Allow 5% failure of requests completed within 250 ms over 24 hours
Example 4
   SLI: Services should be available for 99.99% of time (based on heartbeat events from bounded system)
   SLO: 95% of services should be available for 99.99% of time over 30 days
   Error budget: Allow 5% failure of services availability over 30 days
Example 5
   SLI: Book page request response code != 5xx
   SLO: 99% of book page request response code != 5xx over the past 7 days
   Error budget: Allow 1% failure of book page request response code != 5xx over the past 7 days
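The error budget is the complement of the SLO over the same period. The sketch below works through the first example (95% of home page requests served in under 100 ms over 24 hours); the request volume and the count of slow requests are made-up numbers for illustration.

```python
def error_budget(slo_target: float, total_events: int) -> int:
    """Number of events allowed to miss the SLI in the period."""
    return round((1 - slo_target) * total_events)


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> int:
    """Events still allowed to fail; negative means the budget is overspent."""
    return error_budget(slo_target, total_events) - bad_events


def budget_consumed(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget already spent."""
    budget = error_budget(slo_target, total_events)
    if budget == 0:
        return 0.0 if bad_events == 0 else float("inf")
    return bad_events / budget


# Hypothetical traffic: 200,000 home page requests in the past 24 hours,
# of which 7,500 took 100 ms or longer.
SLO = 0.95
requests_24h = 200_000
slow_requests = 7_500

print("budget:", error_budget(SLO, requests_24h))
print("remaining:", budget_remaining(SLO, requests_24h, slow_requests))
print(f"consumed: {budget_consumed(SLO, requests_24h, slow_requests):.0%}")
```

A negative remaining budget is the point at which the agreed consequence applies, for example pausing feature releases until reliability work restores the margin.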
Should you automate everything?
Let’s ask Mark
How do you recommend prioritization of automation opportunities?
SRE and Observability/Monitoring
Observability is a characteristic of
systems; that they can be observed. It’s
closely related to a DevOps tenet:
‘telemetry everywhere’, meaning that
anything we implement is emitting data
about its activities. It requires intentional
behavior during digital product and
platform design and a conducive
architecture. It’s not monitoring.
Monitoring is what we do when we
observe our observable systems and the
tools category that largely makes this
possible.
SRE persona
How Observability Supports SRE’s Goals
● Reducing the toil associated with incident management – particularly around cause
analysis – improving uptime and MTTR
● Providing a platform for inspecting and adapting according to SLOs and ultimately
improving teams’ ability to meet them
● Offering a potential solution to improve when SLOs are not met, and error budgets are
over-spent
● Relieving team cognitive load when dealing with vast amounts of data – reducing burnout
● Releasing humans and teams from toil, improving productivity, innovation and the flow
and delivery of value
● Supporting multifunctional, autonomous teams and the “we build it, we own it” DevOps
mantra
● Completing the value stream cycle by providing insights around value outcomes that can
be fed back into the innovation phase
Icons by Freepik and Phatplus from FlatIcon
Good practices for Incident Management
One of the key responsibilities of SRE is to manage incidents of the production system(s) that they are responsible for. Within an incident, SREs contribute to debugging the system, choosing the right immediate mitigation, and organizing the incident response if it requires broader coordination.
● Defect prevention
● Strategies for deploy/roll back/roll forward (Feature Flags, Blue-Green, Canary)
● Auto remediation
● Reverting system
Sources: Damon Edwards, Jon Stevens-Hall
Let’s ask Mark
What’s your view on intelligent swarming?
IM, Observability, AIOps and the SRE
Step 1: Reduce MTTR through noise reduction (deduplication, correlation)
Step 2: Automate toil using AI insights
Step 3: Pay down technical debt for increased stability
Step 4: Use chaos engineering
Step 5: Self-healing systems
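Noise reduction in Step 1 can be illustrated with plain rules. The sketch below deduplicates repeated alerts and correlates what is left by service. The Alert fields, the five-minute window and the grouping key are assumptions for illustration, and an AIOps product would typically infer such groupings instead of hard-coding them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    timestamp: float  # seconds since some reference point


def deduplicate(alerts, window_seconds=300):
    """Keep an alert only if the same (service, check) has been quiet
    for longer than the window; repeats inside the window are dropped."""
    last_seen = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        previous = last_seen.get(key)
        if previous is None or alert.timestamp - previous > window_seconds:
            kept.append(alert)
        last_seen[key] = alert.timestamp  # repeats extend the quiet period
    return kept


def correlate(alerts):
    """Group alerts by service so responders see one item per service."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert.service].append(alert)
    return dict(groups)


raw = [
    Alert("checkout", "latency", 0),
    Alert("checkout", "errors", 30),
    Alert("checkout", "latency", 60),
    Alert("checkout", "latency", 120),
    Alert("search", "latency", 1000),
]
unique = deduplicate(raw)
incidents = correlate(unique)
print(len(raw), "alerts ->", len(unique), "unique ->", len(incidents), "groups")
```

Five raw alerts collapse into two groups to investigate, which is the mechanism by which noise reduction shortens time to respond.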
Platform SRE
Platform SRE ushers in self-service
● The Platform provides “self-service” provision of infrastructure, functionalities, configurations and environments that can be consumed by development teams and third parties, e.g. distributed teams and partners
● Embedded governance, controls and standards are built-in
● End-to-end deployment automation, infrastructure playbooks of a service or application
● Abstraction of infrastructure specific implementations for multi/hybrid cloud through runbooks and playbooks
● In-Source code, products built by platform teams can be extended or enhanced by SRE/Dev/Ops or any other
Chaos Engineering
The discipline of experimenting on a distributed system in order to build confidence in the system’s ability to withstand turbulent conditions.
Properties of a Chaos Experiment
● Define steady state
● Formulate hypothesis
● Outline methodology
● Identify blast radius
● Observability is key
● Readily abortable
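The properties above map onto a small experiment harness. This is a minimal sketch against a toy in-memory "cluster". The class, its fields and the replica counts are hypothetical and do not correspond to any particular chaos tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChaosExperiment:
    hypothesis: str                    # what we expect to stay true
    blast_radius: str                  # what the fault is allowed to touch
    steady_state: Callable[[], bool]   # observable health check
    inject_fault: Callable[[], None]   # the methodology: how we disturb the system
    rollback: Callable[[], None]       # makes the experiment readily abortable


def run_experiment(experiment: ChaosExperiment, checks: int = 3) -> str:
    """Inject the fault, watch the steady state, always roll back."""
    if not experiment.steady_state():
        return "skipped"  # never start from an unhealthy system
    try:
        experiment.inject_fault()
        for _ in range(checks):
            if not experiment.steady_state():
                return "rejected"  # abort as soon as steady state is lost
        return "held"
    finally:
        experiment.rollback()


# Toy system: a service is healthy while at least two replicas are up.
cluster = {"replicas": 3}


def lose_replica():
    cluster["replicas"] -= 1


def restore():
    cluster["replicas"] = 3


experiment = ChaosExperiment(
    hypothesis="The service stays healthy when one replica is lost",
    blast_radius="one replica in a test environment",
    steady_state=lambda: cluster["replicas"] >= 2,
    inject_fault=lose_replica,
    rollback=restore,
)
result = run_experiment(experiment)
print(result, cluster)
```

The rollback sits in a `finally` block so the fault is removed whether the hypothesis holds, is rejected, or the injection itself fails.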
Getting started with Chaos Engineering
From a technical point of view, chaos experiments are easy to set up and do not have to be sophisticated in terms of implementation
• Get relevant people in a room who are responsible for a system or set of systems
• Shut off a component that the system should be robust enough to tolerate losing
• Record the learning outcomes and bring the component back online
Apply It: Build a Game Day event
What / Who / When / Where / How
Note: You don’t need to run your GameDay in production! Insights can come from conducting experiments first in a staging or test environment
Implementation challenges
And how to overcome them
● Your difficult and dense process is slowing down incident response
● You don’t have enough cross-team usage or buy-in
● You stop at incident management without SLOs
● Postmortems are underutilized and don’t encompass in-depth learnings
● You wait for incidents to happen
Where to start
● Start where you are
● Keep it simple and practical
● Think and work holistically
● Optimize and automate
● Progress iteratively with feedback
● Collaborate and promote visibility
● Focus on value
THANK YOU!
And now over to Mark
Mark Kriaf
Partner Solutions Architect at AWS
markkriaf
Agenda
• The uptime flywheel - prepare
• Prepare for downtime
• AWS Services for incident management
• The uptime flywheel – learn
• Chaos engineering
• AWS Fault Injection Simulator
• Summary and Marketplace next steps
© 2021, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
The uptime flywheel @ AWS
[Figure: flywheel cycle of Prepare, Detect, Respond, Recover, Learn & improve. Other labels: Decouple & empower, “2-pizza” teams, Shorter incidents, Weekly OpsMetrics meeting]
Prepare for downtime
1. Detect
• What could go wrong?
• How would I know?
2. Respond and recover
• Who needs to be engaged?
• What do they need to do to
diagnose?
• Procedures and scripts
• Where do we collaborate?
3. Learn and improve
• How did we respond?
• What actions will we take?
AWS Systems Manager
Centralize operational data from
multiple AWS services and automate
tasks across your AWS resources
AWS Systems Manager
Benefits
• Simplify resources and application
management
• Easy to operate and securely manage multi-cloud infrastructures at scale
• Resolve critical application availability and performance issues
• Prepare for and manage incidents efficiently
with automated response
AWS Systems Manager Incident Manager
Resolve application issues faster with automated response plans
• Specify a response plan for critical application alarms, including who to engage, what runbook to follow, and where to collaborate
• Notify the right people immediately with SMS, voice, and escalations (additional partner integrations coming soon)
• Single console to track incidents from detection to mitigation and post-incident analysis, including timeline, runbooks, metrics, etc.
• Collaborate in Slack via AWS Chatbot to resolve incidents
• Identify post-incident action items, such as improving alarms or automating runbook steps, using Amazon’s post-incident analysis template and track them in OpsCenter
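The idea of a response plan (who to engage, what runbook to follow, where to collaborate) can be modelled generically. The sketch below is a plain-Python illustration and not the Incident Manager API; every class, field and value in it is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ResponsePlan:
    alarm_name: str
    contacts: list      # who to engage, in escalation order
    runbook: str        # what runbook to follow
    chat_channel: str   # where to collaborate


@dataclass
class Incident:
    plan: ResponsePlan
    timeline: list = field(default_factory=list)
    action_items: list = field(default_factory=list)


def open_incident(alarm_name: str, plans: dict) -> Incident:
    """Look up the plan for an alarm and record the first response steps."""
    plan = plans[alarm_name]
    incident = Incident(plan=plan)
    incident.timeline.append(f"alarm fired: {alarm_name}")
    incident.timeline.append(f"engaged: {plan.contacts[0]}")
    incident.timeline.append(f"runbook started: {plan.runbook}")
    incident.timeline.append(f"collaborating in: {plan.chat_channel}")
    return incident


# Hypothetical plan for a hypothetical alarm.
plans = {
    "checkout-5xx-rate": ResponsePlan(
        alarm_name="checkout-5xx-rate",
        contacts=["on-call engineer", "team lead"],
        runbook="restart-checkout-workers",
        chat_channel="#inc-checkout",
    )
}

incident = open_incident("checkout-5xx-rate", plans)
incident.action_items.append("tune the alarm threshold")
for entry in incident.timeline:
    print(entry)
```

Keeping the timeline and action items on the incident record mirrors the single-console idea: detection, response and post-incident follow-up stay in one place.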
The uptime flywheel @ AWS
[Figure: flywheel cycle of Prepare, Detect, Respond, Recover, Learn & improve. Other labels: Decouple & empower, “2-pizza” teams, Shorter incidents, Weekly OpsMetrics meeting]
Chaos engineering
Experiment to ensure that the impact of failures is mitigated
Chaos experiment
• Inject events that simulate
   • Hardware failures, like servers dying
   • Software failures, like malformed responses
   • Nonfailure events, like spikes in traffic or scaling events
   • Any event capable of disrupting steady state
[Figure: experiment cycle with the labels Steady state, Hypothesis, Run experiment, Verify, Improve]
AWS Fault Injection Simulator
Improve resiliency and performance with controlled experiments
• A fast and easy way to get started with fault injection experiments
• Validate how your application performs
on AWS
• Safeguard fault injection experiments
• Improve application performance,
resiliency, and observability
• Get comprehensive insights by
generating real-world failure conditions
AWS Fault Injection Simulator use case: periodic game days
Why conduct a game day?
• Simulate a failure or event to test systems, processes, and team responses
• Should cover the areas of operations, security, reliability, performance, and cost
• Can be carried out with replicas of your production environment using AWS CloudFormation
• Should involve all personnel who normally operate a workload
Game day process
1. Define the scenario you want to practice
2. Execute your simulation
3. Analyze the game day
4. Use a Correction of Error (COE) process to analyze issues
AWS Marketplace: Destination for third-party solutions to use with AWS
DevOps Core practices
• Collaboration & communication
• Continuous integration
• Continuous delivery
• Monitoring & observability
• Microservices and everything-as-code
• Testing & quality management
• Security & compliance
• Incident management
Lifecycle: Ideas, Plan, Build, Test, Secure, Release, Operate
Sample AWS Marketplace solution providers
1,600+ vendors | 8,000+ products
8,000+ listings | 1,600+ ISVs | 24 regions | 290,000+ customers | 1.5M+ subscriptions
AWS Marketplace DevOps Workshop Series participating partner hands-on labs
And more coming soon!
How can you get started?
Find: A breadth of DevOps solutions
Buy: Through flexible pricing options
• Free trial
• Pay-as-you-go
• Budget alignment
• Bring Your Own License (BYOL)
• Private Offers
• Billing consolidation
• Enterprise Discount Program
• Private Marketplace
Deploy: With multiple deployment options
• SaaS
• Amazon Machine Image (AMI)
• CloudFormation Template
• Containers
• Amazon EKS/Amazon ECS
• AI/ML models
• AWS Data Exchange
• AWS Control Tower
Next steps
• Bookmark the Workshop Series landing page, check back for new content or subscribe to email updates
• Start your lab of choice: Module 7 Hands-on Labs
• Move on to Module 8: DevSecOps
• Visit the AWS Marketplace website to experiment with DevOps tooling
https://pages.awscloud.com/awsmp-h2-dev-aws-marketplace-devops-workshop-series.html
THANK YOU!
