SRE and Incident Management

MODULE 7
Workshop Series Overview

Helen Beal
Herder of Humans
@bealhelen

Helen Beal is a DevOps and Ways of Working coach, Chief Ambassador at DevOps Institute and an ambassador for the Continuous Delivery Foundation. She is the Chair of the Value Stream Management Consortium and provides strategic advisory services to DevOps industry leaders such as Plutora and Moogsoft. She is also an analyst at Accelerated Strategies Group. She hosts the Day-to-Day DevOps webinar series for BrightTalk, speaks regularly on DevOps topics, is a DevOps editor for InfoQ and also writes for a number of other online platforms. She regularly appears in TechBeacon’s DevOps Top100 lists and was recognized as the Top DevOps Evangelist 2020 in the DevOps Dozen awards.

Flow: Talk map
● What is SRE?
● Its origins
● Implementation drivers
● SRE Practices: toil; SLAs, SLOs, SLIs; error budgets
● SRE and observability
● AIOps and IM
● Chaos engineering

What is SRE?

● The goal is to create ultra-scalable and highly reliable distributed software systems
● SREs spend 50% of their time doing "ops" related work such as issue resolution, on-call, and manual interventions
● SREs spend 50% of their time on development tasks such as new features, scaling or automation
● Observability, monitoring, alerting and automation are a large part of SRE
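
The 50/50 split above can be checked against a simple time log. The following is a minimal sketch, assuming a hypothetical weekly log keyed by work category; the category names, the hours and the `ops_share` helper are illustrative and do not come from any SRE tool.

```python
from typing import Dict, Set


def ops_share(hours_by_category: Dict[str, float], ops_categories: Set[str]) -> float:
    """Return the fraction of logged hours spent on ops-related work."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0  # nothing logged, so nothing to report
    ops_hours = sum(h for c, h in hours_by_category.items() if c in ops_categories)
    return ops_hours / total


# Hypothetical week for one SRE (hours per category).
week = {
    "on-call": 10,
    "issue resolution": 8,
    "manual interventions": 6,
    "automation": 10,
    "new features": 6,
}
OPS_CATEGORIES = {"on-call", "issue resolution", "manual interventions"}

share = ops_share(week, OPS_CATEGORIES)
over_cap = share > 0.5  # the 50% guideline from the slide
print(f"ops share: {share:.0%}, over the 50% cap: {over_cap}")
```

A share above the cap is a signal to move work into automation rather than a hard rule.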

Where has the concept come from?

● Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems
● Created at Google around 2003 and publicized via SRE books

“What happens when a software engineer is tasked with what used to be called operations.”
Ben Treynor, Google
Let’s ask Mark

How do you connect SRE and DevOps?

Results in higher performing organizations and sustainable, scalable business

Higher customer engagement leads to increased revenues and profitability

Sublime customer experience leads to more usage, more positive reviews and referrals

Toil and the wisdom of production

● Any manual, mandated operational task is bad
● If a task can be automated, then it should be automated
● Tasks can provide the "wisdom of production" that will inform better system design and behavior

SREs must have time to make tomorrow better than today

Let’s ask Mark

What do you see as the most common causes of toil?

Moving forward to SRE at Slack

● Slack moved from 100 AWS instances to 15,000 instances over 4 years
● Excessive toil caused by low-quality, noisy alerting
● Ops teams were so consumed by interrupt-driven toil that they were unable to make progress on improving reliability
● Slack explicitly committed to the importance of reliability over feature velocity
● Operational ownership of services pushed back into the dev teams, resulting in the teams making the code fixes necessary to stop the incident alerts

Reducing unplanned work and technical debt

[Figure: How investing in SRE increases innovation. Comparison of what the team spends their time doing without SRE and with SRE, across four categories: value creation, unplanned work, learning, technical debt.]


SRE principles and practices

Source: The SRE Blueprint @ DevOps Institute


SLAs, SLOs and SLIs
Service Level...
In SRE, services are managed to the SLO

● SLA (Service Level Agreement): A business contract that comes into effect when your users are so unhappy you have to compensate them in some fashion
● SLO (Service Level Objective): Specify a target level for the reliability of your service, e.g., what the success rate should be: 98% (it’s never 100%)
● SLI (Service Level Indicator): An indicator of the level of service that you are providing, e.g., HTTP request success rate: 99%

SLOs need consequences if they are violated
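
The relationship between an SLI and an SLO can be sketched in a few lines. This is a minimal illustration using the slide's numbers (a measured success rate of 99% against a 98% objective); the `SliWindow` class and `meets_slo` function are hypothetical names, not part of any monitoring product.

```python
from dataclasses import dataclass


@dataclass
class SliWindow:
    """Request counts observed over one measurement window (hypothetical shape)."""
    good_requests: int
    total_requests: int

    def sli(self) -> float:
        # Treat an empty window as fully successful: no requests, no failures.
        if self.total_requests == 0:
            return 1.0
        return self.good_requests / self.total_requests


def meets_slo(window: SliWindow, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return window.sli() >= slo_target


# 9,900 successful HTTP requests out of 10,000 gives an SLI of 99%.
window = SliWindow(good_requests=9_900, total_requests=10_000)
print(f"SLI = {window.sli():.2%}, meets 98% SLO: {meets_slo(window, 0.98)}")
```

The SLI is what is measured, the SLO is the target it is compared against, and the SLA sits outside this check as the contractual backstop.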


The VALET dimensions of SLO
Each dimension is listed with its SLO, budget and policy.

V: Volume/traffic
  SLO: Does the service handle the right volumes of data or traffic?
  Budget: 99.99% of HTTP requests per month succeed with 200 OK
  Policy: Address scalability issues

A: Availability
  SLO: Is the service available to users when they need it?
  Budget: 99.9% availability/uptime
  Policy: Address downtime issues/outages, zero downtime deployments

L: Latency
  SLO: Does the service deliver in a user-acceptable period of time?
  Budget: Payload of 90% of HTTP responses returned in under 300ms
  Policy: Address performance issues

E: Errors
  SLO: Is the service delivering the capabilities being requested?
  Budget: 0.01% of HTTP requests return 4xx or 5xx status codes
  Policy: Analyze and respond to main status codes, new functionality or infrastructure may be required

T: Tickets
  SLO: Are our support services efficient?
  Budget: 75% of service tickets are automatically resolved
  Policy: Automate more manual processes
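
Two of the VALET budgets lend themselves to a direct check against raw samples. The sketch below evaluates the Latency budget (90% of responses under 300 ms) and the Errors budget (at most 0.01% of requests returning 4xx or 5xx) over made-up data; the sample values and function names are assumptions for illustration only.

```python
from typing import List


def fraction_under(latencies_ms: List[float], threshold_ms: float) -> float:
    """Fraction of responses that completed faster than the threshold."""
    if not latencies_ms:
        return 1.0  # no traffic means no slow responses
    fast = sum(1 for value in latencies_ms if value < threshold_ms)
    return fast / len(latencies_ms)


def error_rate(status_codes: List[int]) -> float:
    """Fraction of responses with a 4xx or 5xx status code."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 400 <= code <= 599)
    return errors / len(status_codes)


# Hypothetical samples: ten latencies in milliseconds and 20,000 status codes.
latencies = [120, 80, 310, 150, 95, 290, 400, 60, 110, 130]
codes = [200] * 19_999 + [503]

latency_ok = fraction_under(latencies, 300) >= 0.90  # L: 90% under 300 ms
errors_ok = error_rate(codes) <= 0.0001              # E: at most 0.01% errors
print(f"latency budget met: {latency_ok}, error budget met: {errors_ok}")
```

In this sample only 8 of 10 responses are under 300 ms, so the latency budget is missed while the error budget is met, which would point the team at the performance policy.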

Error budgets
The SLO is a proxy for customer happiness

[Figure: diagram relating the SLA, the SLO and the error budget, with scale markers at 50%, 70%, 75% and 90%]

Error budgets by SLI and SLO
Each example gives the SLI ([Metric identifier] [Operator] [Metric]), the SLO ([Objective] [SLI] [Period]) and the error budget ([Error Budget] [SLI]).

1. Home page requests
  SLI: Home page request served in < 100 ms
  SLO: 95% of home page requests served in < 100 ms over past 24 hours
  Error budget: Allow 5% failure of home page requests served in < 100 ms over past 24 hours

2. Home page latency
  SLI: 95th percentile of home page latency over 5 mins < 200 ms
  SLO: 99% of 95th percentile of home page latency over 5 mins < 200 ms for the past month
  Error budget: Allow 1% failure of 95th percentile home page latency over 5 mins < 200 ms for the past month

3. Request completion
  SLI: Requests should be completed within 250 ms
  SLO: 95% of requests should be completed within 250 ms over 24 hours
  Error budget: Allow 5% failure of requests completed within 250 ms over 24 hours

4. Service availability
  SLI: Services should be available for 99.99% of time (based on heartbeat events from bounded system)
  SLO: 95% of services should be available for 99.99% of time over 30 days
  Error budget: Allow 5% failure of services availability over 30 days

5. Book page requests
  SLI: Book page request response code != 5xx
  SLO: 99% of book page request response code != 5xx over the past 7 days
  Error budget: Allow 1% failure of book page request response code != 5xx over the past 7 days
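
The error budget is the complement of the SLO applied to the event count in the period, so budget consumption reduces to simple arithmetic. The sketch below works through the first example (95% of home page requests served in under 100 ms over 24 hours) with an assumed traffic volume of 20,000 requests; the `error_budget` and `releases_allowed` helpers and the release-freeze policy are illustrative assumptions, since the slides only state that SLOs need consequences.

```python
from dataclasses import dataclass


@dataclass
class BudgetReport:
    allowed_bad: float  # bad events the SLO tolerates in the period
    remaining: float    # unspent fraction of the budget, negative if over-spent


def error_budget(slo_target: float, total_events: int, bad_events: int) -> BudgetReport:
    """Compute the error budget for one period and how much of it is left."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo_target) * total_events
    remaining = 1.0 - bad_events / allowed_bad
    return BudgetReport(allowed_bad=allowed_bad, remaining=remaining)


def releases_allowed(report: BudgetReport) -> bool:
    """Hypothetical consequence: pause feature releases once the budget is spent."""
    return report.remaining > 0.0


# Assumed day: 20,000 home page requests, 400 of them slower than 100 ms.
report = error_budget(slo_target=0.95, total_events=20_000, bad_events=400)
print(f"allowed slow requests: {report.allowed_bad:.0f}")
print(f"budget remaining: {report.remaining:.0%}")
print(f"releases allowed: {releases_allowed(report)}")
```

With a 5% budget on 20,000 requests, about 1,000 slow requests are tolerated, so 400 slow requests leave roughly 60% of the budget unspent.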

Should you automate everything?

Let’s ask Mark

How do you recommend prioritization of automation opportunities?

SRE and Observability/Monitoring
Observability is a characteristic of
systems; that they can be observed. It’s
closely related to a DevOps tenet:
‘telemetry everywhere’, meaning that
anything we implement is emitting data
about its activities. It requires intentional
behavior during digital product and
platform design and a conducive
architecture. It’s not monitoring.
Monitoring is what we do when we
observe our observable systems and the
tools category that largely makes this
possible.

SRE persona
How Observability Supports SRE’s Goals
● Reducing the toil associated with incident management – particularly around cause
analysis – improving uptime and MTTR
● Providing a platform for inspecting and adapting according to SLOs and ultimately
improving teams’ ability to meet them
● Offering a potential solution to improve when SLOs are not met, and error budgets are
over-spent
● Relieving team cognitive load when dealing with vast amounts of data – reducing burnout
● Releasing humans and teams from toil, improving productivity, innovation and the flow
and delivery of value
● Supporting multifunctional, autonomous teams and the “we build it, we own it” DevOps
mantra
● Completing the value stream cycle by providing insights around value outcomes that can
be fed back into the innovation phase

Icons by Freepik and Phatplus from FlatIcon


Good practices for Incident Management

One of the key responsibilities of SRE is to manage incidents of the production system(s) that they are responsible for. Within an incident, SREs contribute to debugging the system, choosing the right immediate mitigation, and organizing the incident response if it requires broader coordination

● Defect prevention
● Strategies for deploy/roll back/roll forward (Feature Flags, Blue-Green, Canary)
● Auto remediation
● Reverting system
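
Feature flags and canary releases both rest on the same idea: expose a change to a controlled share of traffic and widen or revert it without redeploying. The following is a minimal sketch of a deterministic percentage rollout, assuming a hypothetical flag name and user identifiers; real flag services add targeting rules, persistence and audit trails that are not shown here.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Decide whether a user falls inside a percentage rollout for a flag.

    The decision is derived from a hash of the flag and user, so the same
    user always gets the same answer and raising the percentage only adds users.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < percent


# Hypothetical canary: start at 5% of users, then widen to 50%.
users = [f"user-{n}" for n in range(1_000)]
canary = [u for u in users if in_rollout("new-checkout", u, 5)]
wider = [u for u in users if in_rollout("new-checkout", u, 50)]
print(f"{len(canary)} users at 5%, {len(wider)} users at 50%")
```

Setting the percentage back to 0 acts as the roll back, and 100 as the full roll forward.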

Source: Damon Edwards; Source: Jon Stevens-Hall
Let’s ask Mark

What’s your view on intelligent swarming?

IM, Observability, AIOps and the SRE

Step 1: Reduce MTTR through noise reduction (deduplication, correlation)
Step 2: Automate toil using AI insights
Step 3: Pay down technical debt for increased stability
Step 4: Use chaos engineering
Step 5: Self-healing systems
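
Noise reduction in Step 1 can be illustrated with a simple rule-based pass. The sketch below drops repeated alerts for the same service and check inside a time window (deduplication) and then groups what is left by service (correlation). It is a deliberately simple stand-in, not the algorithm of any AIOps product; the `Alert` shape, the 300-second window and the sample alerts are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    timestamp: float  # seconds since the start of the observation period


def deduplicate(alerts: List[Alert], window_seconds: float = 300.0) -> List[Alert]:
    """Keep one alert per (service, check) within each time window."""
    last_kept: Dict[Tuple[str, str], float] = {}
    kept: List[Alert] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        previous = last_kept.get(key)
        if previous is None or alert.timestamp - previous >= window_seconds:
            kept.append(alert)
            last_kept[key] = alert.timestamp
    return kept


def correlate(alerts: List[Alert]) -> Dict[str, List[Alert]]:
    """Group alerts by service so related symptoms land in one incident."""
    groups: Dict[str, List[Alert]] = {}
    for alert in alerts:
        groups.setdefault(alert.service, []).append(alert)
    return groups


raw = [
    Alert("api", "latency", 0),
    Alert("api", "latency", 60),   # repeat inside the window, dropped
    Alert("api", "errors", 90),
    Alert("db", "disk", 100),
    Alert("api", "latency", 400),  # outside the window, kept
]
kept = deduplicate(raw)
incidents = correlate(kept)
print(f"{len(raw)} raw alerts, {len(kept)} after dedup, {len(incidents)} incidents")
```

Fewer, grouped alerts mean fewer pages per incident, which is where the MTTR reduction comes from.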

Platform SRE

Platform SRE ushers in self-service

● The Platform provides “self-service” provision of infrastructure, functionalities, configurations and environments that can be consumed by development teams and third parties, e.g. distributed teams and partners
● Embedded governance, controls and standards are built-in
● End-to-end deployment automation, infrastructure playbooks of a service or application
● Abstraction of infrastructure-specific implementations for multi/hybrid cloud through runbooks and playbooks
● In-Source code, products built by platform teams can be extended or enhanced by SRE/Dev/Ops or any other

Chaos Engineering

The discipline of experimenting on a distributed system in order to build confidence in the system’s ability to withstand turbulent conditions.

Properties of a Chaos Experiment
● Define steady state
● Formulate hypothesis
● Outline methodology
● Identify blast radius
● Observability is key
● Readily abortable
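
The properties of a chaos experiment map naturally onto a small harness. The sketch below is a toy, in-memory illustration: the steady state is a probe function, the hypothesis is that the probe still passes after the fault, the blast radius is a single replica, and the rollback always runs so the experiment stays abortable. The `ChaosExperiment` class and the replica set are hypothetical and not tied to any chaos tool.

```python
from dataclasses import dataclass
from typing import Callable, Set


@dataclass
class ChaosExperiment:
    hypothesis: str
    steady_state: Callable[[], bool]  # probe that defines "healthy"
    inject_fault: Callable[[], None]  # the turbulent condition
    rollback: Callable[[], None]      # restores the system, always executed

    def run(self) -> str:
        # Never start an experiment on a system that is already unhealthy.
        if not self.steady_state():
            return "skipped: system not in steady state"
        self.inject_fault()
        try:
            held = self.steady_state()
        finally:
            self.rollback()  # readily abortable: undo the fault no matter what
        return "hypothesis held" if held else "hypothesis refuted"


# Toy system: a service is healthy while at least two replicas are up.
replicas: Set[str] = {"replica-a", "replica-b", "replica-c"}

experiment = ChaosExperiment(
    hypothesis="The service stays healthy when one replica is lost",
    steady_state=lambda: len(replicas) >= 2,
    inject_fault=lambda: replicas.discard("replica-c"),  # blast radius: one replica
    rollback=lambda: replicas.add("replica-c"),
)
result = experiment.run()
print(f"{experiment.hypothesis}: {result}")
```

A refuted hypothesis is the useful outcome: it identifies a weakness before a real incident does.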
Getting started with Chaos Engineering

From a technical point of view, chaos experiments are easy to set up and do not have to be sophisticated in terms of implementation

• Get the people who are responsible for a system or set of systems in a room
• Shut off a component whose loss the system should be robust enough to tolerate
• Record the learning outcomes and bring the component back online

Apply It: Build a Game Day event


What / Who / When / Where / How

Note: You don’t need to run your GameDay in production! Insights can come from conducting
experiments first in a staging or test environment
Implementation challenges
And how to overcome them

● Your difficult and dense process is slowing down incident response
● Postmortems are underutilized and don’t encompass in-depth learnings
● You don’t have enough cross-team usage or buy-in
● You stop at incident management without SLOs
● You wait for incidents to happen

Where to start
● Start where you are
● Keep it simple and practical
● Think and work holistically
● Optimize and automate
● Progress iteratively with feedback
● Collaborate and promote visibility
● Focus on value

THANK YOU!
And now over to Mark
Mark Kriaf
Partner Solutions Architect at AWS
markkriaf
Agenda
• The uptime flywheel - prepare

• Prepare for downtime

• AWS Services for incident management

• The uptime flywheel – learn

• Chaos engineering

• AWS Fault Injection Simulator

• Summary and Marketplace next steps

© 2021, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
The uptime flywheel @ AWS

[Figure: the uptime flywheel. Labels: Prepare, Detect, Respond, Recover, Learn & improve, Decouple & empower, “2-pizza” teams, Weekly OpsMetrics meeting, Shorter incidents]
Prepare for downtime
1. Detect
• What could go wrong?
• How would I know?
2. Respond and recover
• Who needs to be engaged?
• What do they need to do to diagnose?
• Procedures and scripts
• Where do we collaborate?
3. Learn and improve
• How did we respond?
• What actions will we take?
AWS Systems Manager
Centralize operational data from multiple AWS services and automate tasks across your AWS resources

Benefits
• Simplify resources and application management
• Easy to operate and securely manage multi-cloud infrastructures at scale
• Resolve critical application availability and performance issues
• Prepare for and manage incidents efficiently with automated response

Incident Manager
Resolve application issues faster with automated response plans

• Specify a response plan for critical application alarms, including who to engage, what runbook to follow, and where to collaborate
• Notify the right people immediately with SMS, voice, and escalations (additional partner integrations coming soon)
• Single console to track incidents from detection to mitigation and post-incident analysis, including timeline, runbooks, metrics, etc.
• Collaborate in Slack via AWS Chatbot to resolve incidents
• Identify post-incident action items, such as improving alarms or automating runbook steps, using Amazon’s post-incident analysis template, and track them in OpsCenter

The uptime flywheel @ AWS

[Figure: the uptime flywheel. Labels: Prepare, Detect, Respond, Recover, Learn & improve, Decouple & empower, “2-pizza” teams, Weekly OpsMetrics meeting, Shorter incidents]
Chaos engineering
Experiment to ensure that the impact of failures is mitigated

Chaos experiment
• Inject events that simulate
  • Hardware failures, like servers dying
  • Software failures, like malformed responses
  • Nonfailure events, like spikes in traffic or scaling events
  • Any event capable of disrupting steady state

[Figure: experiment cycle: steady state, hypothesis, run experiment, verify, improve]

AWS Fault Injection Simulator
Improve resiliency and performance with controlled experiments

• A fast and easy way to get started with fault injection experiments
• Validate how your application performs on AWS
• Safeguard fault injection experiments
• Improve application performance, resiliency, and observability
• Get comprehensive insights by generating real-world failure conditions

AWS Fault Injection Simulator use case: periodic game days

Why conduct a game day?

• Simulate a failure or event to test systems, processes, and team responses
• Should cover the areas of operations, security, reliability, performance, and cost
• Can be carried out with replicas of your production environment using AWS CloudFormation
• Should involve all personnel who normally operate a workload

Game day process

1. Define the scenario you want to practice
2. Execute your simulation
3. Analyze the game day
4. Use a Correction of Error (COE) process to analyze issues

AWS Marketplace: Destination for third-party solutions to use with AWS
DevOps Core practices
• Collaboration & communication
• Continuous integration
• Continuous delivery
• Monitoring & observability
• Microservices and everything-as-code
• Testing & quality management
• Security & compliance
• Incident management

[Figure: delivery pipeline stages: Ideas, Plan, Build, Test, Secure, Release, Operate]

Sample AWS Marketplace solution providers

1,600+ vendors | 8,000+ products

8,000+ listings | 1,600+ ISVs | 24 regions | 290,000+ customers | 1.5M+ subscriptions

AWS Marketplace DevOps Workshop Series participating partner hands-on labs

And more coming soon!

How can you get started?
Find: A breadth of DevOps solutions

Buy: Through flexible pricing options:
• Free trial
• Pay-as-you-go
• Budget alignment
• Bring Your Own License (BYOL)
• Private Offers
• Billing consolidation
• Enterprise Discount Program
• Private Marketplace

Deploy: With multiple deployment options:
• SaaS
• Amazon Machine Image (AMI)
• CloudFormation Template
• Containers
• Amazon EKS / Amazon ECS
• AI / ML models
• AWS Data Exchange
• AWS Control Tower

Next steps

Bookmark the Workshop Series landing page, check back for new content or subscribe to
email updates

Start your lab of choice

Move on to Module 8: DevSecOps

Visit the AWS Marketplace website to experiment with DevOps tooling

Module 7 Hands-on Labs

Move on to Module 8: DevSecOps

https://pages.awscloud.com/awsmp-h2-dev-aws-marketplace-devops-workshop-series.html
THANK YOU!
