SRE and Incident Management
MODULE 7
Workshop Series Overview
Helen Beal
Herder of Humans
@bealhelen
Flow: Talk map
● Its origins
● Toil
● SLAs, SLOs, SLIs
● Error budgets
● AIOps and IM
What is SRE?
Where has the concept come from?
● Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems
● Created at Google around 2003 and publicized via SRE books
“What happens when a software engineer is tasked with what used to be called operations.” Ben Treynor, Google
Let’s ask Mark
Results in higher performing organizations
and sustainable, scalable business
Toil and the wisdom of production
Let’s ask Mark
Moving forward to SRE at Slack
Reducing unplanned work and technical debt
[Figure: what the team spends their time doing (value creation, unplanned work, learning, technical debt) and how investing in SRE increases innovation]
Tickets: Are our support services efficient? 75% of service tickets are automatically resolved. Automate more manual processes.
Error budgets
[Figure: the error budget in relation to the SLA and SLO; labels: SLA, SLO, 90%, 50%, Error Budget]
Error budgets by SLI and SLO
Columns: SLI | SLO | ERROR BUDGETS
Template: [Metric identifier] [Operator] [Metric] | [Objective] [SLI] [Period] | [Error Budget] [SLI]

1. SLI: Home page request served in < 100 ms
   SLO: 95% of home page requests served in < 100 ms over past 24 hours
   Error budget: Allow 5% failure of home page requests served in < 100 ms over past 24 hours

2. SLI: 95th percentile of home page latency over 5 mins < 200 ms
   SLO: 99% of 95th percentile of home page latency over 5 mins < 200 ms for the past month
   Error budget: Allow 1% failure of 95th percentile home page latency over 5 minutes < 200 ms for the past month

3. SLI: Requests should be completed within 250 ms
   SLO: 95% of requests should be completed within 250 ms over 24 hours
   Error budget: Allow 5% failure of requests completed within 250 ms over 24 hours

4. SLI: Services should be available for 99.99% of time (based on heartbeat events from bounded system)
   SLO: 95% of services should be available for 99.99% of time over 30 days
   Error budget: Allow 5% failure of services availability over 30 days

5. SLI: Book page request response code != 5xx
   SLO: 99% of book page request response code != 5xx over the past 7 days
   Error budget: Allow 1% failure of book page request response code != 5xx over the last 7 days
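The relationship between the three columns can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `BudgetReport`, `error_budget_report` and `allowed_downtime_minutes` are hypothetical, and only the 100 ms threshold, the 95% objective and the 99.99% over 30 days figure are taken from the table.

```python
from dataclasses import dataclass


@dataclass
class BudgetReport:
    total: int               # events observed in the period
    bad: int                 # events that failed the SLI
    allowed_bad: float       # failures the SLO tolerates for this many events
    budget_remaining: float  # fraction of the error budget left (negative if over-spent)
    slo_met: bool


def error_budget_report(latencies_ms, threshold_ms=100.0, objective=0.95):
    """Evaluate 'objective of requests served in < threshold_ms' for one period."""
    total = len(latencies_ms)
    if total == 0:
        raise ValueError("no events in the period")
    if not 0.0 < objective < 1.0:
        raise ValueError("objective must be strictly between 0 and 1")
    # SLI: a request is good when it is served in under the threshold.
    bad = sum(1 for ms in latencies_ms if ms >= threshold_ms)
    # Error budget: the share of events the SLO allows to fail.
    allowed_bad = (1.0 - objective) * total
    return BudgetReport(
        total=total,
        bad=bad,
        allowed_bad=allowed_bad,
        budget_remaining=1.0 - bad / allowed_bad,
        slo_met=(total - bad) / total >= objective,
    )


def allowed_downtime_minutes(availability, days):
    """Downtime permitted by an availability target over a period of whole days."""
    return (1.0 - availability) * days * 24 * 60
```

With 97 fast and 3 slow requests, 3 of the 5 allowed failures are used, so about 40% of the budget remains. An availability target of 99.99% over 30 days allows roughly 4.32 minutes of downtime.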
Should you
automate
everything?
Let’s ask Mark
SRE and Observability/Monitoring
Observability is a characteristic of systems: that they can be observed. It’s closely related to a DevOps tenet, ‘telemetry everywhere’, meaning that anything we implement is emitting data about its activities. It requires intentional behavior during digital product and platform design, and a conducive architecture. It’s not monitoring. Monitoring is what we do when we observe our observable systems, and it is also the tools category that largely makes this possible.
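One way to read ‘telemetry everywhere’ is that every unit of work emits a structured event about itself. The sketch below uses only the Python standard library; the helper name `emit_telemetry`, the event field names and the in-memory `sink` are assumptions made for illustration, not part of any particular observability product.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("telemetry")


@contextmanager
def emit_telemetry(operation, sink=None, **fields):
    """Time a block of work and emit one structured event describing it."""
    event = {"operation": operation, **fields}
    start = time.perf_counter()
    try:
        yield event                      # callers may attach extra fields
        event["outcome"] = "ok"
    except Exception as exc:
        event["outcome"] = "error"
        event["error_type"] = type(exc).__name__
        raise                            # telemetry never swallows the failure
    finally:
        event["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        line = json.dumps(event, sort_keys=True)
        logger.info(line)
        if sink is not None:
            sink.append(line)            # hypothetical in-memory sink for inspection
```

Wrapping a piece of work in `with emit_telemetry("render_home_page", sink=events) as ev:` records its duration and outcome whether it succeeds or raises, so the data exists before anyone needs it for monitoring.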
SRE persona
How Observability Supports SRE’s Goals
● Reducing the toil associated with incident management – particularly around cause
analysis – improving uptime and MTTR
● Providing a platform for inspecting and adapting according to SLOs and ultimately
improving teams’ ability to meet them
● Offering a potential solution to improve when SLOs are not met, and error budgets are
over-spent
● Relieving team cognitive load when dealing with vast amounts of data – reducing burnout
● Releasing humans and teams from toil, improving productivity, innovation and the flow
and delivery of value
● Supporting multifunctional, autonomous teams and the “we build it, we own it” DevOps
mantra
● Completing the value stream cycle by providing insights around value outcomes that can
be fed back into the innovation phase
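The first bullet mentions MTTR. As a rough sketch, assuming MTTR here means the mean time from detection to resolution, it can be computed from incident timestamps as follows. The function name and the shape of the incident records (pairs of detected and resolved times) are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents):
    """Average of (resolved - detected) over incidents, returned as a timedelta.

    `incidents` is an iterable of (detected, resolved) datetime pairs.
    """
    durations = []
    for detected, resolved in incidents:
        if resolved < detected:
            raise ValueError("incident resolved before it was detected")
        durations.append(resolved - detected)
    if not durations:
        raise ValueError("no incidents to average")
    return sum(durations, timedelta()) / len(durations)
```

Tracking this value per period gives a simple signal of whether investments in observability are shortening incidents.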
Source: Damon Edwards; Source: Jon Stevens-Hall
Let’s ask Mark
IM, Observability, AIOps and the SRE
Step 3: Pay down technical debt for increased stability
Step 4: Use chaos engineering
Step 5: Self-healing systems
Platform SRE
Platform SRE ushers in self-service
Chaos Engineering
conditions.
Getting started with Chaos Engineering
From a technical point of view, GameDays are easy to set up and do not have to be sophisticated in terms of implementation
• Get the relevant people who are responsible for a system or set of systems in a room
• Shut off a component that the system should be robust enough to tolerate losing
• Record the learning outcomes and bring the component back online
Note: You don’t need to run your GameDay in production! Insights can come from conducting experiments first in a staging or test environment
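The shut off, record and restore steps can be rehearsed against a toy model before touching any real environment. The sketch below is purely illustrative: `ToySystem`, its component names and `run_game_day` are hypothetical stand-ins for a staging system and its health check.

```python
from dataclasses import dataclass, field


@dataclass
class ToySystem:
    """Hypothetical system that is healthy while at least `min_healthy` components run."""
    components: dict = field(
        default_factory=lambda: {"web-1": True, "web-2": True, "cache": True}
    )
    min_healthy: int = 2

    def is_healthy(self):
        return sum(self.components.values()) >= self.min_healthy


def run_game_day(system, component, hypothesis):
    """Shut off one component, observe the system, record the outcome, restore it."""
    if component not in system.components:
        raise KeyError(component)
    record = {
        "component": component,
        "hypothesis": hypothesis,
        "healthy_before": system.is_healthy(),
    }
    if not record["healthy_before"]:
        record["aborted"] = True           # do not inject failure into an unhealthy system
        return record
    system.components[component] = False   # shut off the component
    try:
        record["healthy_during"] = system.is_healthy()
    finally:
        system.components[component] = True  # always bring it back online
    record["aborted"] = False
    record["hypothesis_held"] = record["healthy_during"]
    return record
```

The returned record is the learning outcome: it states what was switched off, what the team expected, and whether the system actually tolerated the loss.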
Implementation challenges
And how to overcome them
Where to start
Start where you are
THANK YOU!
And now over to Mark
Mark Kriaf
Partner Solutions Architect at AWS
markkriaf
Agenda
• The uptime flywheel - prepare
• Chaos engineering
© 2021, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
The uptime flywheel @ AWS
[Figure: the uptime flywheel cycle: Prepare, Detect, Respond, Recover, Learn & improve. Surrounding labels: Decouple & empower (“2-pizza” teams), Shorter incidents, Weekly OpsMetrics meeting]
Prepare for downtime
1. Detect
• What could go wrong?
• How would I know?
2. Respond and recover
• Who needs to be engaged?
• What do they need to do to
diagnose?
• Procedures and scripts
• Where do we collaborate?
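The preparation questions above can be captured as a small checklist per service so that unanswered ones are visible before an incident. This is a hypothetical sketch in plain Python; `ResponsePlan` and its field names are invented for illustration and are not the schema of any AWS service.

```python
from dataclasses import dataclass, field


@dataclass
class ResponsePlan:
    """Hypothetical record of the preparation answers for one service."""
    service: str
    failure_modes: list = field(default_factory=list)      # what could go wrong?
    detection_signals: list = field(default_factory=list)  # how would I know?
    responders: list = field(default_factory=list)         # who needs to be engaged?
    diagnostics: list = field(default_factory=list)        # what do they need to diagnose?
    runbooks: list = field(default_factory=list)           # procedures and scripts
    collaboration_channel: str = ""                        # where do we collaborate?

    def gaps(self):
        """Return the names of the questions that are still unanswered."""
        answers = {
            "failure_modes": self.failure_modes,
            "detection_signals": self.detection_signals,
            "responders": self.responders,
            "diagnostics": self.diagnostics,
            "runbooks": self.runbooks,
            "collaboration_channel": self.collaboration_channel,
        }
        return [name for name, value in answers.items() if not value]
```

A plan whose `gaps()` list is empty has an answer for every question on the slide; anything left in the list is preparation work still to do.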
AWS Systems Manager
Centralize operational data from
multiple AWS services and automate
tasks across your AWS resources
Benefits
• Simplify resources and application management
• Easy to operate and securely manage multi-cloud infrastructures at scale
• Resolve critical application availability and performance issues
• Prepare for and manage incidents efficiently with automated response
AWS Systems Manager Incident Manager
Incident Manager
Resolve application issues faster with automated response plans
The uptime flywheel @ AWS
[Figure repeated: the uptime flywheel (Prepare, Detect, Respond, Recover, Learn & improve)]
Chaos engineering
Experiment to ensure that the impact of failures is mitigated
AWS Fault Injection Simulator
Improve resiliency and performance with controlled experiments
AWS Fault Injection Simulator
AWS Fault Injection Simulator use case: periodic game days
1. Define the scenario you want to practice
2. Execute your simulation
3. Analyze the game day
4. Use a Correction of Error (COE) process to analyze issues
AWS Marketplace: Destination for third-party solutions to use with AWS
DevOps Core practices
• Collaboration & communication
• Continuous integration
• Continuous delivery
• Monitoring & observability
• Microservices and everything-as-code
• Testing & quality management
• Security & compliance
• Incident management
Ideas
8,000+ listings, 1,600+ ISVs, 24 regions, 290,000+ customers, 1.5M+ subscriptions
How can you get started?
Find, Buy, Deploy
Next steps
Bookmark the Workshop Series landing page, check back for new content or subscribe to
email updates
Module 7 Hands-on Labs
Move on to Module 8: DevSecOps
https://pages.awscloud.com/awsmp-h2-dev-aws-marketplace-devops-workshop-series.html
THANK YOU!