Resilience in Azure

Resilience in Microsoft Azure
Designing resilient infrastructure as a service applications on Azure

August 2019

Abstract
This document provides guidance on
designing resilient infrastructure as a
service (IaaS) applications on Azure and
provides sample application design
patterns for varied levels of resilience.
The audience for this document is those
who move applications from on-premises
environments to Azure or who build
applications on Azure, such as cloud
solution architects, business continuity and
disaster recovery administrators, application
developers and operations team members,
as well as CTOs, CIOs, and those involved in
policy, planning, and other strategic roles.

Table of contents

Abstract
Introduction
Concepts and terminology
  Shared responsibility
  Metrics to measure resilience
  Resilient design
What’s resilience?
  Foundation
  Proactive mitigation
  Network considerations
  Built-in Azure resilience services
  Shared responsibility
Design your applications
Define resilience requirements
  Availability requirements
  SLAs
  Composite SLAs
  SLA for multi-region deployments
  Decompose by workload
Design for resilience
  Failure mode analysis
Implement resilience strategies for Azure applications
  Key Azure services
Test failures and response strategies
Deploy using a reliable process
  Resilient deployment concepts
  Application updates
Monitor to detect failures
Respond to failures
Example resilience design patterns
  Tier 4 application
  Tier 3 application
  Tier 2 application
  Tier 1 application
  Tier 0 application
Conclusion

Introduction
Azure is a rapidly growing cloud computing
platform that provides an ever-expanding
suite of cloud services. These include
analytics, computing, database, mobile,
networking, storage, and web services.
Azure integrates tools, templates, and
managed services that work together to
help make it easier to build and manage
enterprise, mobile, web, and Internet
of Things (IoT) apps faster, using the
tools, applications, and frameworks that
customers choose.

Azure is built on trust. The Azure approach to trust is based on five foundational principles: security, compliance, privacy, resilience, and intellectual property (IP) protection.

This document describes a process for achieving resilience using a structured approach throughout the lifetime of an application, from design and implementation to deployment and operations.

A well-designed Azure application should focus on these five pillars of software quality:

Pillar        Description
Scalability   A system’s ability to handle increased load.
Availability  The proportion of time that a system is functional and working.
Resilience    A system’s ability to recover from failures and continue to function.
Management    Operations processes that keep a system running in production.
Security      Protecting applications and data from threats.



Concepts and terminology
Achieving resilience requires understanding how it’s measured, designed, and implemented. The concepts and terms in this section lay the foundation for the example resilience design patterns that follow.

Shared responsibility
Being resilient in the event of any failure is a shared responsibility. How that responsibility is divided between the customer and the cloud provider depends on the cloud service model being used: infrastructure as a service (IaaS), platform as a service (PaaS), or software as a service (SaaS).

Metrics to measure resilience
The following metrics are used to measure resilience:

Recovery time objective (RTO) is the maximum acceptable time that an application can be unavailable after an incident.

Recovery point objective (RPO) is the maximum duration of data loss that is acceptable during a disaster.

Mean time to recover (MTTR) is the average time that it takes to restore a component after a failure.

Mean time between failures (MTBF) is how long a component can reasonably be expected to run between outages.

The service-level agreement (SLA) in Azure describes our commitments regarding uptime and connectivity. If the SLA for a service is 99.9 percent, customers can expect the service to be available 99.9 percent of the time. Dealing with multiple services that have different agreements adds complexity. In this case, it’s necessary to calculate a composite SLA.
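
As a rough illustration of how two of the metrics above relate to availability, the following sketch uses the common steady-state approximation availability = MTBF / (MTBF + MTTR). That formula is a general reliability-engineering convention rather than something defined in this document, and the function name and sample figures are hypothetical.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Approximate availability as running time / (running time + repair time)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical component: runs about 999 hours between failures
# and takes about 1 hour to restore.
availability = steady_state_availability(mtbf_hours=999.0, mttr_hours=1.0)
print(f"Estimated availability: {availability:.3%}")
```

Under these assumed numbers, halving the MTTR improves availability about as much as doubling the MTBF, which is why both measurements matter when setting targets.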

Resilient design

Failures can vary in the scope of their impact. There are many different failure types, including hardware, datacenter, regional or transient failure, as well as dependency service, heavy load, accidental data deletion/corruption, and application deployment failures. Performing a failure mode analysis (FMA) during the design phase helps identify possible points of failure and defines how the application will respond to those failures.

Define resilience requirements based on business needs.

Design the application for resilience.

Implement strategies to detect and recover from failures.

Test the implementation by simulating faults and triggering forced failovers.

Deploy the application into production using a reliable and repeatable process.

Monitor the application to detect failures.

Respond if there are failures that require manual interventions.



What’s resilience?
Resilience is a system’s ability to recover from failures and continue to function. It’s not only about avoiding failures but also involves responding to failures in a way that minimizes downtime or data loss. Because failures can occur at various levels, it’s important to have protection for all types based on your application availability requirements.

Foundation
The first step toward achieving resilience is to avoid failures in the first place. Microsoft Azure invests heavily in cloud services to ensure that customers can run their workloads reliably. Our approach to improving Azure reliability involves improving the platform’s capability to minimize impact during planned maintenance events and giving customers control over their experience during these events.

Cloud technology at Microsoft has come a long way over the years, adding innovative features such as memory-preserving host updates and live migration. With Azure, overall availability has been trending up constantly and is now at approximately 99.999 percent reliability across the fleet. As a result, more customers are running mission-critical workloads on Azure.

For such customers, aggregate reliability isn’t enough: every reboot and every 30-second pause matters. Thus, Azure has moved to a more rigorous definition of what reliability should mean. We now focus on driving down the annual interruption rate (AIR), the likelihood that a given virtual machine (VM) will see an interruption during the year, keeping our efforts focused across the usage lifecycle.

Proactive mitigation
Azure proactively mitigates potential failures, reducing the failure impact on availability by 50 percent. Using deep fleet telemetry, Microsoft enables failure predictions with machine learning (ML) and ties them to automatic live migration for several types of hardware failure cases, including disk failures, input/output (I/O) latency, and CPU frequency anomalies. As a result, VMs are live migrated off of “at-risk” machines before they ever show signs of failing. This means VMs running on Azure are more reliable than the underlying hardware.

For more details on Azure’s innovative use of ML, please see the linked papers on disk failure prediction and node failure prediction.

Network considerations
The reliability and performance of cloud services is determined in part by the network over which the transactions take place. Microsoft has more datacenter regions than any of its competitors, and its network is one of the largest in the world. Unlike many other public cloud providers, data that traverses between Azure datacenters and regions doesn’t go through the public internet. Rather, it stays safe within the Microsoft network.

This includes all traffic between Microsoft services, anywhere in the world. For example, within Azure, traffic between VMs, storage, and SQL communication traverses only the Microsoft network, regardless of the source and destination region. Cross-region traffic stays on the Azure Virtual Network as well. This keeps customers’ applications and data both secure and highly available.

Built-in Azure resilience services
Despite all the investments Microsoft puts
into making the platform reliable, apps
can suffer from downtime because of
unplanned events such as power failures,
data corruption, ransomware attacks,
as well as fires or natural disasters. The
graphic below illustrates key Azure service
offerings for various types of failures that
can occur (Figure 1).

Customer needs: improved availability; building and running highly available applications with near-zero RPO/RTO; implementing a disaster recovery plan with data residency and minimal RPO/RTO.

Single VM with Premium Storage: SLA 99.9%; covers hardware failure.
Availability Sets: SLA 99.95%; covers rack-level failure.
Availability Zones: SLA 99.99%; covers datacenter outage.
Azure Site Recovery and region pairs: RTO 30 minutes; covers natural disaster.
Backup: covers accidental data loss, data corruption, and ransomware by going back to a point in time to restore the healthier version of data.

Figure 1. Azure resilience services



To help protect against the consequences of such failures, Azure provides a comprehensive set of built-in resilience services that customers can easily enable and control based on individual business needs. Whether it’s a single hardware node failure, a rack-level failure, a datacenter outage, or a large-scale regional outage, Azure offers solutions, including the following:

Availability Sets ensure that the VMs deployed on Azure are distributed across multiple isolated hardware nodes in a cluster.

Availability Zones protect customers’ applications and data from datacenter failures across multiple physical locations within a region.

Azure Load Balancer distributes inbound traffic according to rules and health probes.

Azure Traffic Manager enables optimal traffic distribution to services across global Azure regions, while providing high availability (HA) and responsiveness.

Azure Site Recovery replicates workloads from a primary VM to a secondary failover location, allowing for business continuity and disaster recovery needs.

Azure Backup serves as a general backup solution for cloud and on-premises workflows that run on VMs or physical servers.

Geo-replication for Azure SQL Database allows an application to perform quick disaster recovery of individual databases in case of a regional disaster or large-scale outage.

Locally redundant storage (LRS) provides object durability by replicating customer data to a storage scale unit.

Zone redundant storage (ZRS) replicates customer data synchronously across three storage clusters in a single region.

Geo-redundant storage (GRS) provides object durability over a given year by replicating customer data to a secondary region that’s hundreds of miles away from the primary region.

Shared responsibility
Azure invests in platform resilience and complements those investments with the abovementioned offerings to ensure uptime for customers. Being resilient in the event of any failure, however, is a shared responsibility. Who’s responsible for what, in terms of resilience, depends on the cloud service model that’s being used: IaaS, PaaS, or SaaS (Figure 2).

In the traditional on-premises model, the entire responsibility of managing, from the hardware for compute, storage, and networking to the application, falls on the customer. That requires planning for various types of failures and knowing how to deal with them on-premises. With IaaS, Azure is responsible for the core infrastructure resilience, which includes storage, networking, and compute. As the model moves from IaaS to PaaS and then to SaaS, the customer is responsible for less and the cloud service provider is responsible for more.

For more information about the shared responsibility model as it pertains to Azure cloud services, download the paper Shared Responsibilities for Cloud Computing Today.

Figure 2. Shared responsibility for cloud services



Design your applications

The following model provides a framework for designing resilient applications on Azure:

Define resilience requirements based on business needs.

Design the application for resilience with protection from various failures at the application, infrastructure, and regional levels. This document provides proven best practices learnt from customer engagements.

Implement strategies to detect and recover from failures.

Test the implementation by simulating faults and triggering forced failovers.

Deploy the application into production using a reliable and repeatable process.

Monitor the application to detect failures, enable health checks, and allow for incident response when necessary.

Respond if there are failures that require manual interventions.

Define resilience requirements
The required level of resilience depends on several considerations. The first step is to define resilience requirements based on business needs, which include availability requirements and an understanding of the SLA.

Understand SLA requirements: Understand application and composite SLAs.

Design: Classify applications into various tiers based on the business need.

Identify resilience requirements: Identify resilience requirements such as SLA, RPO, and RTO for each tier.

Availability requirements
Understanding the applicable availability requirements is an important part of designing for resilience. To better understand these requirements, ask the following questions:

How much downtime is acceptable?

How much will potential downtime cost the business?

How much money and time can the business realistically invest in making the application highly available?

It’s also important to define what it means for the application to be available. For example, is the application considered down if someone can submit an order, but the system can’t process it within the normal timeframe?

Also, consider the probability of an outage occurring and whether a mitigation strategy is cost-effective. Resilience planning begins with business requirements. The following are some approaches to help in thinking about resilience in those terms.

SLAs

In Azure, the SLA describes our commitment regarding uptime and connectivity. If the SLA for a particular service is 99.9 percent, that means customers can expect the service to be available 99.9 percent of the time.

Azure customers should define their own target SLAs for each workload in their solutions. An SLA makes it possible to evaluate whether the architecture meets the business requirements. For example, if a workload requires 99.99 percent uptime, but depends on a service with a 99.9 percent SLA, that service can’t be a single point of failure in the system.

Of course, higher availability is always better when everything else is equal. But as customers strive for higher percentages, the cost and complexity to achieve that level of availability grows. An uptime of 99.99 percent translates to about 4.3 minutes of total downtime per month. Is it worth the additional complexity and cost to reach that percentage? The answer depends on the individual business requirements.

To achieve four nines (99.99 percent), manual intervention can’t be relied on to recover from failures. The application must be self-diagnosing and self-healing. Beyond four nines, it’s challenging to detect outages quickly enough to meet the SLA. Think about the time window against which the SLA is measured: the smaller the window, the tighter the tolerances.

It probably doesn’t make sense to define the SLA in terms of hourly or daily uptime. Consider the MTBF and MTTR measurements. The higher the SLA, the less frequently the service can go down, and the more quickly the service must recover.

The following table shows the potential cumulative downtime for various SLA levels:

SLA       Downtime per week   Downtime per month   Downtime per year
99%       1.68 hours          7.2 hours            3.65 days
99.9%     10.1 minutes        43.2 minutes         8.76 hours
99.95%    5 minutes           21.6 minutes         4.38 hours
99.99%    1.01 minutes        4.32 minutes         52.56 minutes
99.999%   6 seconds           25.9 seconds         5.26 minutes
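
The figures in the table above follow from multiplying the unavailable fraction (1 minus the SLA) by the length of the period. The sketch below reproduces them; it assumes a 30-day month and a 365-day year, which is what the table's numbers imply, and the function and constant names are illustrative only.

```python
HOURS_PER_WEEK = 7 * 24
HOURS_PER_MONTH = 30 * 24   # assumption: 30-day month, consistent with the table
HOURS_PER_YEAR = 365 * 24   # assumption: non-leap year


def allowed_downtime_hours(sla_percent: float, period_hours: float) -> float:
    """Return the downtime budget, in hours, for an SLA over a period."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return (1 - sla_percent / 100) * period_hours


for sla in (99.0, 99.9, 99.95, 99.99, 99.999):
    week_min = allowed_downtime_hours(sla, HOURS_PER_WEEK) * 60
    month_min = allowed_downtime_hours(sla, HOURS_PER_MONTH) * 60
    year_hours = allowed_downtime_hours(sla, HOURS_PER_YEAR)
    print(f"{sla}%: {week_min:.2f} min/week, "
          f"{month_min:.2f} min/month, {year_hours:.2f} h/year")
```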



Composite SLAs
Dealing with multiple services that have different SLAs adds complexity. In this case, it’s necessary
to calculate a composite SLA. Consider the Web Apps feature of Azure App Service which writes to
Azure SQL Database. At the time of this writing, these Azure services have the following SLAs:

App Service Web Apps = 99.95 percent

SQL Database = 99.99 percent

What’s the maximum downtime that should be expected for this application? If either service fails,
the whole application fails. In general, the probability of each service failing is independent of the
other. So, the composite SLA for this application is the following:

99.95% × 99.99% = 99.94%

That’s lower than the individual SLAs. This isn’t surprising, because an application that relies on
multiple services has more potential failure points.

On the other hand, the composite SLA can be improved by creating independent fallback paths. For
example, if SQL Database is unavailable, transactions are put into a queue to be processed later.

With this design, the application is still available even if it can’t connect to the database. However, it
fails if the database and the queue both fail at the same time. The expected fraction of time for
a simultaneous failure is 0.0001 × 0.001 (the 0.001 factor corresponds to a queue SLA of 99.9 percent), so the composite SLA for this combined path is:

Database OR queue = 1.0 − (0.0001 × 0.001) = 99.99999%

The total composite SLA is:

Web app AND (database OR queue) = 99.95% × 99.99999% = ~99.95%

There are tradeoffs to this approach. The application logic is more complex, the customer is paying
for the queue, and there may be data consistency issues to consider.
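
The composite SLA arithmetic above can be sketched as two small helpers: one for services that are all required, and one for independent fallback paths. This is a minimal illustration that assumes failures are independent, as the text does. The function names are hypothetical, and the queue value of 99.9 percent is inferred from the 0.001 factor used above.

```python
from math import prod


def serial_sla(*slas: float) -> float:
    """Every service is required, so the availabilities multiply."""
    return prod(slas)


def fallback_sla(*slas: float) -> float:
    """Any one path suffices, so the combination fails only if all paths fail."""
    return 1 - prod(1 - s for s in slas)


WEB_APP = 0.9995
SQL_DATABASE = 0.9999
QUEUE = 0.999  # inferred from the 0.001 failure fraction in the text

without_fallback = serial_sla(WEB_APP, SQL_DATABASE)
data_path = fallback_sla(SQL_DATABASE, QUEUE)
with_fallback = serial_sla(WEB_APP, data_path)

print(f"Web app AND database:            {without_fallback:.4%}")
print(f"Database OR queue:               {data_path:.5%}")
print(f"Web app AND (database OR queue): {with_fallback:.4%}")
```

With the fallback path in place, the web app's own SLA becomes the limiting factor, which is why the total stays at roughly 99.95 percent.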

SLA for multi-region deployments


Another HA technique is to deploy the application in more than one region
and use Traffic Manager to fail over if the application fails in one region. For a
multi-region deployment, the composite SLA is calculated as follows:

Let N be the composite SLA for the application deployed in one region,
and R be the number of regions where the application is deployed. The
expected chance that the application will fail in all regions at the same
time is ((1 − N) ^ R).

For example, if the single-region SLA is 99.95 percent:

The combined SLA for two regions = (1 − (1 − 0.9995) ^ 2) = 99.999975%

The combined SLA for four regions = (1 − (1 − 0.9995) ^ 4) ≈ 99.999999% (rounded down)

The SLA for Traffic Manager must also be factored in. At the time of this
writing, the SLA for Traffic Manager is 99.99 percent.

Also, failing over isn’t instantaneous in active/passive configurations, which can result in some downtime during a failover.

See Traffic Manager endpoint monitoring.
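
As a worked illustration of the multi-region calculation, the sketch below computes 1 − (1 − N) ^ R and then factors in Traffic Manager by treating it as a serial dependency in front of all regions. Multiplying by the Traffic Manager SLA is one reasonable reading of "must also be factored in", not a formula given in this document. The function name is hypothetical, and failover time is not modeled.

```python
def multi_region_sla(single_region_sla: float, regions: int) -> float:
    """SLA when the application is down only if every region fails at once."""
    if regions < 1:
        raise ValueError("regions must be at least 1")
    return 1 - (1 - single_region_sla) ** regions


TRAFFIC_MANAGER_SLA = 0.9999  # figure quoted in the text

two_regions = multi_region_sla(0.9995, 2)
# Assumption: all requests pass through Traffic Manager, so its SLA multiplies in.
end_to_end = two_regions * TRAFFIC_MANAGER_SLA

print(f"Two regions:               {two_regions:.6%}")
print(f"Including Traffic Manager: {end_to_end:.4%}")
```

Under this assumption, the end-to-end figure is bounded by the Traffic Manager SLA, so adding regions beyond two yields little further gain.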



Decompose by workload
Many cloud solutions consist of multiple application workloads. The term workload in this context means a discrete capability or computing task, which can be logically separated from other tasks in terms of business logic and data storage requirements. For example, an e-commerce app might include the following workloads:

Browse and search a product catalog.

Create and track orders.

View recommendations.

These workloads might have different requirements for availability, scalability, data consistency, and disaster recovery. This means business decisions are required to balance cost versus risk.

Also consider usage patterns. Are there certain critical periods when the system must be available? For example, a tax-filing service can’t go down right before the filing deadline, a video streaming service must stay up during a big sports event, and so on. During these critical periods, using redundant deployments across several regions will help provide resilience, so the application can fail over if one region fails. Because multi-region deployment is potentially more expensive, it’s more cost effective to run the application in a single region during less critical times.

Categorizing applications into different tiers is a common strategy. Tier zero and tier one applications are made up of those applications that should experience very minimal data loss and downtime. The RTO/RPO for this tier needs to be zero or as close to nil as possible. Tier two applications consist of applications for which it’s acceptable to lose minimal amounts of data with RTO and RPO on the order of a few minutes. With tier three and tier four applications, downtime can affect internal operations for a few hours. While this is inconvenient, it doesn’t pose a huge risk to the business.

The categorization of applications and the exact availability SLA, RPO, and RTO requirements vary from industry to industry, and from customer to customer. For example, an e-commerce website is tier zero or tier one for a retail company where the only sales channel is through the website. For a retail company where brick and mortar stores are the primary sales channel, the point of sale (POS) application is tier zero or tier one. Thus, it’s important to categorize the application inventory into multiple tiers based on the organization’s business needs and resilience requirements.

Sample tiers are as follows:

Tier     Availability SLA (%)   RPO          RTO
Tier 0   99.995                 0            0
Tier 1   99.99                  5 minutes    1 hour
Tier 2   99.95                  30 minutes   4 hours
Tier 3   99.95                  4 hours      8 hours
Tier 4   99                     24 hours     72 hours
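
One way to put the sample tiers to work is to encode them as data and look up the least demanding tier that still satisfies a workload's tolerances. The sketch below copies the values from the table above. The `Tier` class, the `least_strict_tier` helper, and the sample workload are hypothetical, and real tier definitions would come from the organization's own requirements.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    availability_percent: float
    rpo: timedelta
    rto: timedelta


# Values copied from the sample tier table, ordered from strictest to most relaxed.
SAMPLE_TIERS = [
    Tier("Tier 0", 99.995, timedelta(0), timedelta(0)),
    Tier("Tier 1", 99.99, timedelta(minutes=5), timedelta(hours=1)),
    Tier("Tier 2", 99.95, timedelta(minutes=30), timedelta(hours=4)),
    Tier("Tier 3", 99.95, timedelta(hours=4), timedelta(hours=8)),
    Tier("Tier 4", 99.0, timedelta(hours=24), timedelta(hours=72)),
]


def least_strict_tier(max_data_loss: timedelta,
                      max_downtime: timedelta) -> Optional[Tier]:
    """Return the most relaxed tier whose RPO and RTO fit within the limits."""
    for tier in reversed(SAMPLE_TIERS):
        if tier.rpo <= max_data_loss and tier.rto <= max_downtime:
            return tier
    return None  # only reachable for negative limits


# Hypothetical workload: tolerates 1 hour of data loss and 6 hours of downtime.
match = least_strict_tier(timedelta(hours=1), timedelta(hours=6))
print(match.name if match else "no tier fits")
```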



Design for resilience

Failures can vary in the scope of their impact. The table below describes various
types of application failures. While this list isn’t exhaustive, it provides a starting
point to help customers think about various failure types.

Failure type                            Description
Hardware failure                        A failure of any hardware component including computer, network, or storage hardware.
Datacenter failure                      An entire datacenter is impacted by issues such as a power grid outage.
Regional failure                        This includes any natural disaster-like event that impacts multiple datacenters in a region, causing the entire region to go down.
Transient failure                       Requests between various components fail intermittently. End user requests will fail when this isn’t handled properly.
Dependency service failure              This occurs when any service on which the application is dependent is not functioning correctly.
Heavy load                              A sudden spike in incoming requests prevents the application from servicing requests.
Accidental data deletion or corruption  Customers mistakenly delete critical data or data has been corrupted due to unforeseen reasons.
Application deployment failure          A failure that takes place when updating production application deployments.

Failure mode analysis

During the design phase, perform an FMA to identify possible points of failure and define how the application will respond to those failures. Customers can design response strategies for various types of failure depending on the application’s resilience and availability requirements. Answering the following questions will help define an application’s design for resilience.

How will the application detect this type of failure?

How will the application respond to this type of failure?

How will you log and monitor this type of failure?

Making the FMA part of the architecture and design phases ensures that failure recovery is built into the system from the beginning.

Here are the steps for conducting an FMA:

Identify all system components and include external dependencies, such as identity providers, third-party services, and so on.

For each component, identify potential failures that could occur, keeping in mind that a single component may have more than one failure mode. For example, consider read failures and write failures separately, because the impact and possible mitigations of each will be different.

Rate each failure mode according to its overall risk. Consider these factors:

What’s the likelihood of the failure? Exact numbers aren’t necessary because the purpose is to help rank the priority.

What’s the impact on the application, in terms of availability, data loss, monetary cost, and business disruption?

For each failure mode, determine how the application will detect, respond, and recover. Consider trade-offs in terms of cost and application complexity.
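
The rating step of an FMA can be sketched as a small ranking exercise. The example below scores each failure mode as likelihood times impact on coarse 1 to 5 scales. That scoring rule is a common convention assumed here for illustration, not one prescribed by this document, and the components and scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    component: str
    description: str
    likelihood: int  # 1 (rare) to 5 (frequent); a coarse estimate is enough
    impact: int      # 1 (minor) to 5 (severe)

    @property
    def risk(self) -> int:
        # Assumed scoring rule: likelihood multiplied by impact.
        return self.likelihood * self.impact


# Hypothetical entries; read and write failures are listed separately
# because their impact and mitigations differ.
failure_modes = [
    FailureMode("database", "write failure", likelihood=2, impact=5),
    FailureMode("database", "read failure", likelihood=2, impact=3),
    FailureMode("identity provider", "sign-in unavailable", likelihood=1, impact=4),
    FailureMode("web tier", "transient timeout", likelihood=4, impact=2),
]

# Highest risk first, so the most important mitigations are designed first.
ranked = sorted(failure_modes, key=lambda fm: fm.risk, reverse=True)
for fm in ranked:
    print(f"{fm.risk:>2}  {fm.component}: {fm.description}")
```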

Implement resilience strategies for Azure applications
This section provides guidance on implementing common resilience strategies for applications to prevent or respond to various types of failures. Most of these aren’t limited to a particular technology but provide a general idea of how to plan and implement resilience strategies.

Failure type                            Resilience strategy
Hardware failure                        Build redundancy into the application by deploying components across different fault domains. For example, ensure that Azure VMs are placed in different racks by using Availability Sets.
Datacenter failure                      Build redundancy into the application with fault isolation zones across datacenters. For example, ensure that Azure VMs are placed in different fault-isolated datacenters by using Azure Availability Zones.
Regional failure                        Replicate the data and components into another region so that applications can be quickly recovered. For example, use Azure Site Recovery to replicate Azure VMs to another Azure region.
Transient failure                       Retry transient failures. For many Azure services, the client software development kit (SDK) implements automatic retries in a way that’s transparent to the caller.
Dependency service failure              Degrade gracefully if a service fails without a failover path, providing an acceptable user experience.
Heavy load                              Load balance across instances to handle spikes in usage. For example, put two or more Azure VMs behind a load balancer to distribute traffic to all VMs.
Accidental data deletion or corruption  Back up data so it can be restored if there’s any deletion or corruption. For example, use Azure Backup to periodically back up your Azure VMs.
Application deployment failure          Automate deployments with a rollback plan.

Each resilience strategy is discussed in more detail in the subsections below.

Build redundancy. Build redundancy into applications to avoid a single point of failure. Ensure that VMs are deployed into different fault domains by creating an Availability Set and keeping a load balancer in front of it. You can also deploy VMs across two or more Availability Zones with a zone redundant load balancer in front of them.

Replicate data and components. Replicating data is a general strategy for handling non-transient failures in a data store. Many storage technologies provide built-in replication, including Azure Storage, Azure SQL Database, Azure Cosmos DB, and Apache Cassandra. It’s important to consider both the read and write paths. Depending on the storage technology, you might have multiple writable replicas or a single writable replica and multiple read-only replicas.

To maximize availability, replicas can be placed in multiple regions. However, this increases the latency when replicating the data. Typically, replicating across regions is done asynchronously, which implies an eventual consistency model and potential data loss if a replica fails.

Retry transient failures. Transient failures can be caused by momentary loss of network connectivity, a dropped database connection, or a timeout when a service is busy. Often, a transient failure can be resolved simply by retrying the request. For many Azure services, the client SDK implements automatic retries in a way that’s transparent to the caller. For more on this, see Retry guidance for specific services.

Each retry attempt adds to the total latency (Figure 3). In addition, too many failed requests can cause a bottleneck as pending requests accumulate in the queue. These blocked requests might hold critical system resources such as memory, threads, database connections, and more, which can cause cascading failures. To avoid this, increase the delay between each retry attempt, and limit the total number of failed requests.

Figure 3. Total latency
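
For code paths where no client SDK retries on the caller's behalf, the retry guidance in this section (increase the delay between attempts and cap the number of attempts) can be sketched as follows. This is a minimal illustration, not an Azure SDK API: `TransientError`, `call_with_retries`, and the default values are hypothetical, and the random jitter is an added assumption commonly used to keep many clients from retrying at the same moment.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(operation, max_attempts=4, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up so pending requests do not pile up indefinitely
            # The delay ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter: wait a random time up to the ceiling.
            sleep(random.uniform(0, ceiling))


# Usage with a hypothetical operation that fails twice and then succeeds.
attempts = {"count": 0}


def flaky_operation():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("service busy")
    return "ok"


# The sleep function is replaced here so the example runs without waiting.
result = call_with_retries(flaky_operation, sleep=lambda seconds: None)
print(result, attempts["count"])
```

Only errors known to be transient are retried here; any other exception propagates immediately, which avoids adding latency to failures that a retry cannot fix.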

Load balance across instances. Scalability means a cloud application should be able to scale out by adding more instances. This approach also improves resilience, allowing for unhealthy instances to be removed from rotation. For example:

When you put two or more VMs behind a load balancer, it distributes traffic to all of the VMs. See Run load-balanced VMs for scalability and availability.

Scaling out an Azure App Service app to multiple instances automatically balances the load across instances. See how to Run a basic web application in Azure.

Use Traffic Manager to distribute traffic across a set of endpoints.

Degrade gracefully. If a service fails and there’s no failover path, the application may be able to degrade gracefully while still providing an acceptable user experience. For example:

Put a work item on a queue to be handled later.

Return an estimated value.

Use locally cached data.

Display an error message. (This option is better than having the application stop responding to requests.)

Back up data. Always configure backups for all critical production data sources. This includes VMs, databases, and storage, among others. Accidental deletions or data corruptions can happen at any time. Personnel might not become aware of some of these until a few days or even weeks later. Thus, it’s important to configure longer retention times for backup copies, depending on the nature and criticality of the application.

Key Azure services

Azure has features to make an application redundant at every level of failure, from an individual VM to an entire region. These include:

Single VM. Azure provides an uptime SLA for single VMs. (Note that the VM must use premium storage for all operating system disks and data disks.) Although running two or more VMs can result in a higher SLA, a single VM may be reliable enough for some workloads. However, for production workloads, Microsoft recommends using two or more VMs for redundancy.

Availability Sets. To protect against localized hardware failures, such as a disk or network switch failure, deploy two or more VMs in an Availability Set. An Availability
Set consists of two or more fault domains that share a common power source and network switch. VMs in an Availability Set are distributed across the fault domains, so if a hardware failure affects one fault domain, network traffic can still be routed to the VMs in the other fault domains. For more information about Availability Sets, see Manage the availability of Windows virtual machines in Azure.

Availability Zones. An Availability Zone is a physically separate location within an Azure region. Availability Zones provide a combination of low latency and high availability through strategic physical separation within an Azure region. Each Availability Zone has independent physical infrastructure with a distinct power source, network, and cooling system. Deploying VMs across Availability Zones helps protect an application against datacenter-wide failures. See What are Availability Zones in Azure? for a list of supported regions and services.

When planning to use Availability Zones in a deployment, first validate that the application architecture and code base can support this configuration. When deploying commercial off-the-shelf software, consult with the software vendor and test adequately before deploying into production. The application must support running in an elastic and distributed infrastructure with no hard-coded infrastructure components specified in the code base.

Azure Site Recovery. Azure Site Recovery helps to replicate Azure VMs to another Azure region for business continuity and disaster recovery. Conduct periodic disaster recovery drills to ensure compliance requirements are met. The VM will be replicated with the specified settings to the selected region, ensuring customers can recover their applications in the event of outages in the source region. For more information, see Set up disaster recovery to a secondary Azure region for an Azure VM.

Customers should factor in the RTO and RPO numbers for their solutions here and ensure that when testing, the recovery time and recovery point are appropriate for their needs.

Paired regions. To protect an application in case of a regional outage, deploy the application across multiple regions, using Traffic Manager to distribute internet traffic to the different regions. Azure pairs each region with another region, forming a regional pair. With the exception of Brazil South, regional pairs are located within the same geography in order to
meet data residency requirements for taxation and law enforcement jurisdictional purposes.

When designing a multi-region application, keep in mind that network latency across regions is higher than within a region. For this reason, when replicating a database to enable failover, use synchronous data replication within a region, but asynchronous data replication across regions.

When selecting paired regions, ensure both regions have the required Azure services. For a list of services by region, see Products available by region.

It’s also critical to select the best deployment topology for disaster recovery, especially if the RPO and RTO targets are short. To ensure the failover region has enough capacity to support the workload, select either an active/passive (full replica) topology or an active/active topology. Note that these deployment topologies might increase complexity and cost as resources in the secondary region are pre-provisioned and may sit idle. For more information, see Failure and disaster recovery for Azure applications.

Azure Backup. Azure Backup is the final and most powerful line of defense against permanent data loss. An effective backup strategy requires more than simply making copies of data. It needs to take the application’s data architecture and infrastructure into consideration. The app may manage many kinds of data of varying importance, spread widely across filesystems, databases, and other storage services both in the cloud and on premises. Using the right services and products for the job will simplify the backup process and reduce recovery time if a backup needs to be restored. Azure Backup serves as a general-purpose backup solution for cloud and on-premises workflows that run on VMs or physical servers. It’s designed to be a drop-in replacement for traditional backup solutions, storing data in Azure instead of on archive tapes or other local physical media.

Azure Monitor. Azure Monitor maximizes the availability and performance of customer applications by delivering a comprehensive solution for collecting, analyzing, and acting on telemetry from both cloud and on-premises environments. It helps customers understand how their applications are performing and proactively identifies issues that affect them and the resources on which they depend.

The following table compares Availability Sets, Availability Zones, and Azure Site Recovery/paired regions:

                  Availability Set       Availability Zone         Azure Site Recovery/Paired region
Scope of failure  Rack                   Datacenter                Region
Request routing   Load Balancer          Cross-zone Load Balancer  Traffic Manager
Network latency   Very low               Low                       Mid to high
Virtual network   Azure Virtual Network  Azure Virtual Network     Cross-region virtual network peering

Test failures and response strategies

Generally, resilience can’t be tested in the same way as application functionality (that is, by running unit tests and so on). Instead, test how the end-to-end workload performs under failure conditions that only occur intermittently.

Testing is an iterative process. Test the application, measure the outcome, analyze and address any failures that result, and repeat the process.

Fault injection testing. Test the resilience of the system during failures, either by triggering actual failures or by simulating them. The following are some common failure scenarios to test:

Crash processes.

Expire certificates.

Change access keys.

Shut down the DNS service on domain controllers.

Limit available system resources, such as RAM or number of threads.

Unmount disks.

Redeploy a VM.

Measure the recovery times and verify that business requirements are met. Test combinations of failure modes as well. Make sure that failures don’t cascade and are handled in an isolated way.

This is another reason it’s important to analyze possible failure points during the design phase. The results of that analysis should be input into the test plan.
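
One way to simulate a failure without touching real infrastructure is to wrap a dependency call so that it fails on demand, then measure how long the caller takes to recover. The following is a minimal sketch in Python, not a prescribed tool: FaultInjector, call_with_recovery, and flaky_lookup are hypothetical names introduced for illustration, and the number of injected failures is an assumption.

```python
import time


class InjectedFault(Exception):
    """Raised by the injector to simulate a dependency failure."""


class FaultInjector:
    """Wraps a callable and fails the first `failures` calls (hypothetical helper)."""

    def __init__(self, func, failures):
        self.func = func
        self.remaining_failures = failures
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise InjectedFault("simulated dependency outage")
        return self.func(*args, **kwargs)


def call_with_recovery(func, max_attempts, clock=time.monotonic):
    """Calls func until it succeeds or attempts run out.

    Returns (result, elapsed seconds) so the recovery time can be compared
    with the business requirement.
    """
    start = clock()
    last_error = None
    for _ in range(max_attempts):
        try:
            return func(), clock() - start
        except InjectedFault as err:
            last_error = err
    raise RuntimeError("dependency did not recover") from last_error


def flaky_lookup():
    # Stand-in for a real dependency call, for example a database query.
    return "ok"


# Inject two failures, then let the dependency recover.
injected = FaultInjector(flaky_lookup, failures=2)
result, recovery_seconds = call_with_recovery(injected, max_attempts=5)
```

In a real test, the wrapped call would be an actual dependency, and the measured recovery time would be checked against the recovery targets defined for the application.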

Load testing. Load testing is crucial for identifying failures that only happen under load, such as the backend database being overwhelmed or service throttling. Test for peak load, using production data or synthetic data that’s as close to production data as possible. The goal is to observe how the application behaves under real-world conditions.

Disaster recovery drills. It’s not enough to have a good disaster recovery plan in place. It must also be tested periodically to ensure that the recovery plan works properly when it matters most. For Azure virtual machines, use Azure Site Recovery to replicate and perform disaster recovery drills, all without impacting production applications or ongoing replication.
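
The outcome of a disaster recovery drill can be checked against the RPO and RTO targets with simple timestamp arithmetic. The sketch below is illustrative only: RecoveryTargets and evaluate_drill are hypothetical names, the drill timestamps are invented, and the targets shown are the tier 3 values used later in this document.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryTargets:
    rpo: timedelta  # maximum acceptable data loss
    rto: timedelta  # maximum acceptable downtime


def evaluate_drill(targets, last_recovery_point, outage_start, service_restored):
    """Compares the RPO and RTO achieved in a drill with the targets."""
    # Data written after the last recovery point is lost in a failover.
    achieved_rpo = outage_start - last_recovery_point
    # Downtime runs from the start of the outage until service is restored.
    achieved_rto = service_restored - outage_start
    return {
        "achieved_rpo": achieved_rpo,
        "achieved_rto": achieved_rto,
        "rpo_met": achieved_rpo <= targets.rpo,
        "rto_met": achieved_rto <= targets.rto,
    }


# Tier 3 targets from this document: 4-hour RPO, 8-hour RTO.
tier3 = RecoveryTargets(rpo=timedelta(hours=4), rto=timedelta(hours=8))

# Hypothetical drill timestamps, for illustration only.
report = evaluate_drill(
    tier3,
    last_recovery_point=datetime(2019, 8, 1, 9, 30),
    outage_start=datetime(2019, 8, 1, 10, 0),
    service_restored=datetime(2019, 8, 1, 19, 0),
)
```

With these invented timestamps, the drill meets the RPO (30 minutes of data loss) but misses the RTO (nine hours of downtime), which is the kind of gap a drill is meant to surface before a real outage.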

Deploy using a reliable process

After an application is deployed to production, updates present a potential source of errors. In the worst case, a bad update can cause downtime. To avoid this, the deployment process must be predictable and repeatable. Deployment includes provisioning Azure resources, deploying application code, and applying configuration settings. An update may involve all three or a subset of these.

The crucial point is that manual deployments are prone to errors. Therefore, the best practice is to have an automated, idempotent process that can be run on demand and re-run if something fails.

Use Terraform, Ansible, Chef, Puppet, Azure PowerShell, Azure Command-Line Interface (CLI), or Azure Resource Manager templates to automate the provisioning of Azure resources.

Use Azure Automation Desired State Configuration (DSC) to configure Windows VMs. Use Cloud-init for Linux VMs.

Use Azure DevOps Services or Jenkins to automate application deployment.

Resilient deployment concepts

Infrastructure as Code (IaC) and immutable infrastructure are two important concepts in resilient deployments. They’re defined as follows:

IaC is the practice of using code to provision and configure infrastructure. IaC may use a declarative approach, an imperative approach, or a combination of both.

Immutable infrastructure is the principle that infrastructure shouldn’t be modified after it’s deployed to production. Undocumented ad hoc changes are difficult to track and to troubleshoot, ultimately making it harder to address issues with the system.

A declarative approach focuses on what needs to be accomplished, whereas an imperative approach focuses on how to accomplish it. Resource Manager templates are an example of a declarative approach and PowerShell scripts are an example of an imperative approach.
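
The idea of an idempotent, declarative deployment can be illustrated with a small reconciliation loop: describe the desired state, compute the difference from the current state, and apply only that difference. This is a toy sketch in Python, not how any of the tools named above are implemented; plan_changes, apply_changes, and the resource definitions are hypothetical, and an in-memory dictionary stands in for the real environment.

```python
def plan_changes(desired, current):
    """Returns the actions needed to move `current` to `desired` (declarative)."""
    actions = []
    for name, config in sorted(desired.items()):
        if name not in current:
            actions.append(("create", name))
        elif current[name] != config:
            actions.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply_changes(desired, current):
    """Applies the plan in place and returns the actions that were taken."""
    actions = plan_changes(desired, current)
    for action, name in actions:
        if action == "delete":
            del current[name]
        else:
            current[name] = dict(desired[name])
    return actions


# Hypothetical resource definitions, for illustration only.
desired_state = {
    "web-vm": {"size": "Standard_D2s_v3", "count": 2},
    "db-vm": {"size": "Standard_E4s_v3", "count": 1},
}
environment = {"web-vm": {"size": "Standard_D2s_v3", "count": 1}}

first_run = apply_changes(desired_state, environment)   # creates db-vm, updates web-vm
second_run = apply_changes(desired_state, environment)  # nothing left to do
```

Because the second run produces no actions, the process can safely be re-run after a partial failure, which is the property the text above calls idempotent.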

Application updates

When rolling out an application update, Microsoft recommends blue-green deployment or canary releases to push updates in a highly controlled way and minimize possible impacts from a bad deployment. These techniques are explained below:

Blue-green deployment is a technique whereby an update is deployed into a production environment separately from the live application. After validating the deployment, switch the traffic routing to the updated version. For example, Azure App Service enables this with staging slots.

Canary releases are similar to blue-green deployments, but instead of switching all traffic to the updated version, the update is rolled out to a small percentage of users by routing a portion of the traffic to the new deployment. If there’s a problem, back off and revert to the old deployment. Otherwise, incrementally route more traffic to the new version until it’s handling 100 percent of the traffic.

Regardless of which approach is used, ensure the ability to roll back to the last known good deployment in case the new version isn’t functioning. In addition, have a strategy in place to roll back database changes and any other changes to dependent services. If errors occur, the application logs must indicate which version caused the error.
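
The canary routing described above can be sketched as a stable mapping from users to buckets, plus a rule that either increases the canary share or backs off. This is a simplified illustration and not the mechanism used by any particular Azure service; the function names, the 10-point step, and the 1 percent error-rate limit are all assumptions.

```python
import hashlib


def canary_bucket(user_id):
    """Maps a user ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_version(user_id, canary_percent):
    """Routes a user to the new version if their bucket is below canary_percent."""
    return "new" if canary_bucket(user_id) < canary_percent else "old"


def next_canary_percent(current_percent, error_rate, max_error_rate=0.01, step=10):
    """Backs off to 0 on a bad error rate; otherwise increases traffic by `step`."""
    if error_rate > max_error_rate:
        return 0
    return min(100, current_percent + step)
```

Hashing the user ID keeps each user on the same version between requests, and a user who is already on the new version stays there as the percentage grows.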

Monitor to detect failures

Monitoring is crucial for resilience. If something fails, it’s essential to know that it failed, and to get insights into the cause(s) of the failure.

Monitoring a large-scale distributed system poses a significant challenge. Think about an application that runs on a few dozen VMs. It’s not practical to log into each VM one at a time and look through each log file to troubleshoot a problem. Moreover, the number of VM instances is usually not static because VMs are added and removed as the application scales in and out, and occasionally an instance may fail and need to be reprovisioned. To further complicate matters, a typical cloud application might use multiple data stores (Azure Storage, SQL Database, Cosmos DB, and Azure Cache for Redis) and a single user action may span multiple subsystems.

The monitoring process operates like a pipeline with several distinct stages:

Instrumentation. The raw data for monitoring comes from a variety of sources, including application logs, operating system performance metrics, Azure resources, Azure subscriptions, and Azure tenants. Most Azure services expose metrics that can be configured to analyze and determine the cause of problems.

Collection and storage. Raw instrumentation data can be held in various locations and in various formats (application trace logs, Internet Information Services logs, performance counters, and others). These disparate sources are collected, consolidated, and put into reliable data stores such as Azure Application Insights, Azure Monitor Metrics, Azure Service Health, storage accounts, and Azure Log Analytics.

Analysis and diagnosis. After the data is consolidated in these different data stores, it can be analyzed to troubleshoot issues and provide an overall view of application health. Generally, customers can search for the data in Application Insights and Log Analytics using Kusto queries. Azure Advisor provides recommendations with a focus on resilience and optimization.

Visualization and alerts. In this stage, telemetry data is presented in such a way that an operator can quickly notice problems or trends. Examples include dashboards or email alerts. With Azure dashboards, customers can build a single pane of glass view to monitor Application Insights graphs, Log Analytics, Azure Monitor metrics, and service health. Alerts from Azure Monitor enable notifications for service health and resource health issues.

Monitoring isn’t the same as failure detection. For example, an application might detect a transient error and retry, resulting in no downtime. But it should also log the retry operation, so customers can monitor the error rate in order to get an overall picture of application health.

Application logs are an important source of diagnostics data. Best practices for application logging include:

Log in production so as not to lose insight where it’s needed most.

Log events at service boundaries. Include a correlation ID that flows across service boundaries. If a transaction flows through multiple services and one of them fails, the correlation ID will help pinpoint why the transaction failed.

Use semantic logging, also known as structured logging. Unstructured logs make it hard to automate the consumption and analysis of the log data, which is needed at cloud scale.

Use asynchronous logging to avoid application failure. Otherwise, the logging system itself can cause requests to back up when waiting to write a logging event.

Application logging isn’t the same as auditing. Auditing may be done for compliance or regulatory reasons. This means audit records must be complete, and it’s not acceptable to drop or exclude any records while processing transactions. If an application requires auditing, this should be kept separate from diagnostics logging.

For more information, see Monitoring and diagnostics guidance.
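
Several of the logging practices above (logging retries, structured logs, and a correlation ID) can be combined in a few lines. The following sketch uses only the Python standard library; JsonFormatter, TransientError, call_with_retry, the logger name, and the backoff values are hypothetical choices made for illustration, not an Azure API.

```python
import json
import logging
import time
import uuid


class JsonFormatter(logging.Formatter):
    """Formats each record as one JSON object (structured logging)."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
            "attempt": getattr(record, "attempt", None),
        }
        return json.dumps(payload)


logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def call_with_retry(operation, correlation_id, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retries a transient failure with exponential backoff and logs every retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            # The retry hides the error from the user, but it is still logged
            # so that the error rate stays visible to operators.
            logger.warning(
                "transient error, retrying",
                extra={"correlation_id": correlation_id, "attempt": attempt},
            )
            sleep(base_delay * 2 ** (attempt - 1))


# Generate one ID per incoming request and pass it to every downstream call.
request_correlation_id = str(uuid.uuid4())
```

This sketch writes logs synchronously for brevity. For asynchronous logging, the standard library's logging.handlers.QueueHandler and QueueListener are one option for moving log writes off the request path.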

Respond to failures

Previous sections have focused on automated recovery strategies, which are critical for high availability. However, sometimes manual intervention is needed.

Alerts. Monitor the application for warning signs that may require proactive intervention. For example, if SQL Database or Cosmos DB consistently throttles the application, increase your database capacity or optimize queries. In this example, even though the application might handle the throttling errors transparently, telemetry should still raise an alert so that you can follow up. Microsoft recommends configuring alerts on Azure resource metrics and diagnostics logs against the service limits and quota thresholds and further recommends setting up alerts on metrics, as they are lower latency than diagnostics logs. In addition, Azure is able to provide some out-of-the-box health statuses through resource health, which can help diagnose throttling of Azure services.

Failover. Configure a disaster recovery strategy for the application. The appropriate strategy will depend on the SLAs. For most scenarios, an active-passive implementation is sufficient. For more information, see Deployment topologies for disaster recovery. Most Azure services allow for either manual or automated failover. For example, in an IaaS application, use Azure Site Recovery for the web and logic tiers and SQL AlwaysOn availability groups for the database tier. Traffic Manager provides automated failover across regions.

Operational readiness testing. Perform an operational readiness test for both failover to the secondary region and failback to the primary region. Many Azure services support manual failover or test failover for disaster recovery drills. Alternatively, simulate an outage by shutting down or removing services.

Data consistency check. If a failure occurs in a data store, there may be data inconsistencies when the store becomes available again, especially if the data was replicated. For Azure services that provide cross-regional replication, look at the RTO and RPO to understand the expected data loss in a failure. Review the SLAs for Azure services to understand whether cross-regional failover can be initiated manually or if it will be initiated by Microsoft. For some services, Microsoft decides when to perform the failover. Microsoft may prioritize the recovery of data in the primary region, only failing over to a secondary region if data in the primary region is deemed unrecoverable. For example, GRS and Azure Key Vault follow this model.
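
The alerting guidance above, where throttling that the application handles transparently should still raise an alert, can be illustrated with a sliding-window check. This is a simplified, in-process sketch and not a substitute for alerts on Azure resource metrics; ThrottleAlert, the window size, and the 5 percent threshold are assumptions.

```python
from collections import deque


class ThrottleAlert:
    """Signals an alert when the share of throttled requests in a sliding window is too high."""

    def __init__(self, window_size=100, threshold=0.05):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, status_code):
        """Records one response; returns True if an alert should fire."""
        # HTTP 429 (Too Many Requests) is the status commonly used for throttling.
        self.window.append(status_code == 429)
        throttled_share = sum(self.window) / len(self.window)
        return throttled_share > self.threshold
```

Each call to record() returns True once throttled responses exceed the configured share of the most recent requests, even if every one of those requests was eventually retried successfully.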

Restoring from backup. In some scenarios, restoring from backup is only possible within the same region. This is the case for Azure Backup. Other Azure services, such as Azure Cache for Redis, provide geo-replicated backups. The purpose of backups is to protect against accidental deletion or corruption of data by restoring the application to an earlier functional version. Therefore, while backups can serve as a disaster recovery solution in some cases, the inverse isn’t always true. Disaster recovery won’t protect you against accidental deletion or corruption of data.

Example resilience design patterns

This section will focus on resilience design best practices for various application deployments with varied resilience requirements. Typically, customers classify the applications into various categories or tiers based on their resilience requirements, as discussed in the resilience requirements section. The following sections take an example application from each category and discuss how to design that application to be resilient against various types of failures.

Tier 4 application (99% application SLA, 24-hour RPO, 72-hour RTO)

The first category of applications is the one with the least stringent availability requirements. The disaster recovery requirements are also on the low side, with an acceptable data loss (RPO) of 24 hours and an acceptable downtime (RTO) of three days. These can be internal applications such as tooling applications, build servers, project document share websites, and more.

A multi-tier web application in this category can be deployed within an Azure region as a single instance VM for each tier. If an explicit SLA guarantee at the VM level is desired, use premium storage for the VMs. If the application is relatively important within this category, Microsoft recommends using premium storage for the database VM after weighing the premium storage cost. Azure is the first and only public cloud to provide an explicit SLA on single instance VMs.

The databases can be backed up using any backup software. For example, use Azure Backup and configure SQL database backups for SQL servers running on Azure VMs. Have Azure Resource Manager templates pre-created for the VMs so that VMs can be redeployed if there’s an issue with the single instance VM in any tier in that region. For database VMs, use the database backup copy to recreate databases.

Azure Backup can be used on single instance VMs to protect data and test backups using the restore feature. If there’s any data or VM level corruption, recover the file, folder, disk, or VM using Azure Backup restore capabilities.

For disaster recovery during a regional failure scenario, consider replicating only the database VM with Azure Site Recovery. The web and app tier VMs can be redeployed in another Azure region using Azure Resource Manager templates if they’re stateless or can be recovered from the backup copy. Note that the recovery time could be high for this approach. If a higher trade-off on cost is acceptable, replicate the VMs across all tiers to another region. Azure Site Recovery doesn’t require running additional VM instances in the disaster recovery region. The VMs are created only when the user performs failover operations.

Monitor the health of the web application by using an automation script that periodically checks to determine whether the website endpoint is reachable. Create a custom endpoint that reports on the overall health of the application. The endpoint should return an HTTP error code if any critical dependency is unhealthy or unreachable. Don’t report errors for non-critical services.
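
A custom health endpoint of the kind described above might look like the following sketch, written with Python's standard http.server module. The dependency names, the placeholder check functions, the /health path, and the choice of HTTP 503 as the error code are assumptions made for illustration.

```python
from http.server import BaseHTTPRequestHandler
import json


def check_database():
    # Placeholder: replace with a real connectivity check.
    return True


def check_recommendations():
    # Placeholder for a non-critical dependency.
    return True


# Hypothetical dependency registry: name -> (check function, is_critical).
DEPENDENCIES = {
    "database": (check_database, True),
    "recommendations": (check_recommendations, False),
}


def evaluate_health(dependencies):
    """Returns (HTTP status, report). Only critical failures produce an error status."""
    report = {}
    status = 200
    for name, (check, critical) in dependencies.items():
        try:
            healthy = bool(check())
        except Exception:
            # A check that raises is treated as an unreachable dependency.
            healthy = False
        report[name] = "healthy" if healthy else "unhealthy"
        if critical and not healthy:
            status = 503
    return status, report


class HealthHandler(BaseHTTPRequestHandler):
    """Serves the health report at /health."""

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, report = evaluate_health(DEPENDENCIES)
        body = json.dumps(report).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try the handler locally, it can be passed to http.server.HTTPServer, for example HTTPServer(("", 8080), HealthHandler).serve_forever(). A failing non-critical dependency appears in the response body but doesn't change the status code, so the monitoring script only sees an error when a critical dependency is down.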

Use a pre-created script to monitor the simple health metrics of the VMs to detect whether there’s an issue with the VM. Troubleshoot issues by checking the VM health metrics in the portal. Check for component health (such as CPU, memory, and disk) to determine whether there are potential load issues. If there are consistent issues with components such as CPU or RAM, consider increasing the VM to a larger size or consider scaling the application by deploying more VMs at each tier.

Application software updates can be deployed on the VMs using automation scripts during a weekend maintenance window. Ensure the automation script is in place to roll back if any issues are encountered during the deployment process.

The following table shows appropriate resilience strategies for each failure type for tier four applications:

Failure type                            Resilience strategy

Hardware failure                        Maintain ready-to-use templates to deploy another instance using backup copies (if required). Test templates by deploying VMs into a test subnet or a test virtual network.

Datacenter failure                      Maintain ready-to-use templates to deploy another instance using backup copies (if required) in another zone. Test templates by deploying VMs into a test subnet or a test virtual network in another availability zone.

Regional failure                        Use Azure Site Recovery to replicate the database VM. Test the disaster recovery using test failover and Azure Site Recovery plans. Perform a disaster recovery failover in the event of an extended outage in the source region.

Heavy load                              Load balance across instances to handle spikes in usage. For example, put two or more Azure VMs behind a load balancer to distribute traffic to all VMs.

Accidental data deletion or corruption  Use Azure Backup to back up the VMs. Test data recovery by restoring files, disks, and VMs. Restore data if there’s an accidental deletion.

Application deployment failure          Use automation scripts to deploy updates. If there’s an issue observed during the update process or after the update, roll back to the previous version with an automated script.

Tier 3 application (99.95% application SLA, 4-hour RPO, 8-hour RTO)

The next category of applications is critical to business and requires a high application SLA (Figure 4). However, it’s acceptable to have some downtime for these applications. The disaster recovery RPO and RTO requirements can be a few hours. These can be internal applications such as expense management or travel management applications, which can have some impact if the applications are down for a few hours, but there won’t be significant revenue loss. Less revenue-generating, customer-facing applications can also be part of this category.

Build redundancy for the applications in this category by deploying them as two or more VMs at each tier as part of an Availability Set. Availability Sets ensure that the VMs are placed in different fault domains, so that hardware failures such as a cluster or rack failure don’t impact the end application.

Having two or more VMs in an Availability Set provides 99.95 percent availability for each tier. This helps bring the overall composite SLA of the application to approximately 99.9 percent. For the database VMs, use built-in synchronous replication to get high availability and avoid data loss. For example, use SQL AlwaysOn availability groups with asynchronous replication for the SQL databases.

Use load balancers between each tier so that traffic can be load balanced and routed to the healthy VM instances. If there’s an issue with one of the VMs in a tier, the application will continue to work without any impact.

Use Azure Backup on all VMs to protect the data and test backups using the restore feature. The same feature can be used to recover a file, folder, disk, or VM if there’s any data or VM level corruption. Use the SQL server database backup capability that’s offered by Azure Backup to get more granular (as low as 15 minutes) database copies.

For disaster recovery in a regional failure scenario, consider replicating only the database VM with Azure Site Recovery. The web and app tier VMs can be redeployed in another Azure region using Azure Resource Manager templates if they’re stateless or can be recovered from the backup copy. Note that the recovery time could be high for this approach. If a higher trade-off on cost is acceptable, replicate the VMs across all tiers to another region. Azure Site Recovery doesn’t require running additional VM instances in
the disaster recovery region. The VMs are created only when the user performs failover operations.

Monitor the health of the web application by using an automation script that periodically checks to determine whether the website endpoint is reachable. Create a custom endpoint that reports on the overall health of the application. The endpoint should return an HTTP error code if any critical dependency is unhealthy or unreachable.

Use a pre-created script to monitor the simple health metrics of the VMs to detect if there’s an issue with the VM. Troubleshoot any issues by checking the VM health metrics in the portal. Check for component health (such as CPU, memory, and disk) to determine whether there are potential load issues. If there are consistent issues with components such as CPU or RAM, consider increasing the VM to a larger size or consider scaling the application by deploying more VMs at each tier. Monitor advanced metrics for VM health as well as activities such as database failover when using asynchronous replication.

Application software updates can be deployed on the VMs using automation scripts during a weekend maintenance window. Ensure that the automation script is in place to roll back if any issues are encountered during the deployment process.

Figure 4. Typical resilience pattern for a tier 3 application
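
The composite SLA mentioned for the tier 3 pattern follows from multiplying the availability of every tier that a request depends on. The short calculation below is a sketch under stated assumptions: three tiers (web, app, and database), each at 99.95 percent, that must all be available and that fail independently. The tier count is an assumption and isn’t specified in the text.

```python
from math import prod


def composite_sla(tier_slas):
    """Multiplies the availability of tiers that all must be up to serve a request."""
    return prod(tier_slas)


# Assumption: web, app, and database tiers, each at 99.95 percent.
tiers = [0.9995, 0.9995, 0.9995]
overall = composite_sla(tiers)
print(f"{overall:.4%}")  # prints 99.8501%
```

Under these assumptions, the composite figure is about 99.85 percent, which rounds to the approximate 99.9 percent cited for this tier.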



The following table shows appropriate resilience strategies for each failure type for tier three applications:

Failure type                            Resilience strategy

Hardware failure                        Build redundancy by deploying two or more instances in an Availability Set within a datacenter.

Datacenter failure                      Maintain ready-to-use templates to deploy another instance using backup copies (if required) in another availability zone. Test templates by deploying VMs into a test subnet or a test virtual network in another zone.

Regional failure                        Use Azure Site Recovery to replicate the database VM. Test the disaster recovery using test failover and Azure Site Recovery plans. Perform a disaster recovery failover in the event of an extended outage in the source region.

Heavy load                              Use monitoring tools to identify load surges on the VM. Increase the size of the VM (scale up) or add more instances (scale out).

Accidental data deletion or corruption  Use Azure Backup to back up VMs. Test data recovery by restoring files, disks, VMs, or SQL databases. Restore data if an accidental deletion occurs.

Application deployment failure          Use automation scripts to deploy updates. If an issue is observed during the update process or after the update, roll back to the previous version with an automated script.

Tier 2 application (99.99% application SLA, 30-minute RPO, 4-hour RTO)

This category of applications contains business-critical apps for which downtime can have a significant impact on revenue (Figure 5). These applications can be external customer-facing e-commerce websites, content streaming platforms, financial transaction handling applications, and the like. The applications should be highly available with resilience for all component failures.

Figure 5. Typical resilience pattern for a tier 2 application



The following table shows appropriate resilience strategies for each failure type for tier two applications:

Failure type                            Resilience strategy

Hardware failure                        Build redundancy by deploying two or more instances across availability zones within a region.

Datacenter failure                      Build redundancy by deploying two or more instances across availability zones within a region.

Regional failure                        Use Azure Site Recovery to replicate all the VMs. Test the disaster recovery using test failover and Azure Site Recovery plans. Perform a disaster recovery failover in the event of an extended outage in the source region.

Heavy load                              Provision enough capacity into the application. Use tools to monitor the load and automatically add more instances using scripts if the threshold (for example, 70 percent) is reached.

Accidental data deletion or corruption  Use Azure Backup to back up all VMs and SQL databases. Test data recovery by restoring files, disks, VMs, or SQL databases. Restore data if an accidental deletion occurs.

Application deployment failure          Use safe deployment practices to roll out the updates to a minimal set of customers before deploying them widely. Use automation scripts to deploy updates with an automatic rollback capability built in if there’s an issue with the update deployment. Configure alerts to send alarms/notifications if an issue occurs after an update deployment. If so, have the automated rollback script ready to execute.
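
The heavy-load strategy in the table above, adding instances when a load threshold is reached, can be sketched as a simple decision function. This is an illustration only and not an Azure autoscale API; the function name, the use of average CPU as the load metric, and the instance bounds are assumptions, while the 70 percent threshold is the example value from the table.

```python
def desired_instance_count(current, cpu_percent, threshold=70, minimum=2, maximum=10):
    """Adds one instance when average CPU reaches the threshold; stays within the bounds."""
    if cpu_percent >= threshold:
        return min(maximum, current + 1)
    return max(minimum, current)
```

A monitoring script could call this function on each evaluation interval and provision or keep instances accordingly. Scale-in logic is left out to keep the sketch short.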

Tier 1 application (99.99% application SLA, 5-minute RPO, 1-hour RTO)

This category of applications includes business and mission-critical apps that have strict requirements regarding data loss and recovery time. More than a few minutes of data loss can significantly impact the business and revenue. Customer-facing applications such as order processing systems and banking applications fall into this category.

The following table shows appropriate resilience strategies for each failure type for tier one applications:

Failure type                            Resilience strategy

Hardware failure                        Build redundancy by deploying two or more instances across availability zones within a region.

Datacenter failure                      Build redundancy by deploying two or more instances across availability zones within a region.

Regional failure                        Use Azure Site Recovery to replicate all VMs in the web tier and middle tier. Use native replication technologies such as SQL AlwaysOn. Test the disaster recovery of the complete application (including SQL AlwaysOn failover using Azure Site Recovery plans) and test failover capabilities. Perform a disaster recovery failover in the event of an extended outage in the source region.

Heavy load                              Provision enough capacity into the application. Use tools to monitor the load and automatically add more instances using scripts if the threshold (for example, 70 percent) is reached.

Accidental data deletion or corruption  Use Azure Backup to back up all VMs and SQL databases. Test data recovery by restoring files, disks, VMs, or SQL databases. Restore data if an accidental deletion occurs.

Application deployment failure          Use safe deployment practices to roll out the updates to a minimal set of customers before deploying them widely. Use automation scripts to deploy updates with an automatic rollback capability built in if there’s an issue with the update deployment. Configure alerts to send alarms/notifications if an issue occurs after an update deployment. If so, have the automated rollback script ready to execute.

Tier 0 application (99.995 application


SLA, 0 RPO, 0 RTO)

A few business applications will be mission critical and will require close to 100 percent availability, no data loss, and no downtime. An example of this is an e-commerce website where the only sales channel is through a website. This site can’t accommodate any downtime or data loss, especially during the holiday season. Similarly, a stock trading website for a financial services company can’t have any downtime or data loss.

Traditionally, implementing highly available applications across multiple datacenters that are hundreds of kilometers apart wasn’t feasible. With the new technologies and services now available in Microsoft Azure, a mission-critical business application can be made highly available across regions. Note that such availability guarantees come at a high cost.

Customers can architect their applications on Azure using various modern service offerings such as App Service plan, Cosmos DB, Azure Active Directory, Azure Cache for Redis, and Azure Search to ensure the application is highly available across multiple regions and will run with low latencies (Figure 6).

Figure 6. Typical resilience pattern for a tier 0 application
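
To make the 99.995 figure in the tier 0 target concrete, the short calculation below converts an availability percentage into a downtime budget. It is a sketch under stated assumptions: the function name `allowed_downtime_minutes` is hypothetical, a year is taken as 365 days, and availability is measured against total elapsed time, which may differ from how a specific SLA defines its measurement window.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime budget, in minutes, implied by an availability target."""
    if not 0.0 <= sla_percent <= 100.0:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The unavailable fraction of the period is (100 - SLA) percent.
    return total_minutes * (1.0 - sla_percent / 100.0)


if __name__ == "__main__":
    # 99.995 percent over a 365-day year: roughly 26.3 minutes.
    print(round(allowed_downtime_minutes(99.995), 2))
    # The same target over a 30-day month: roughly 2.2 minutes.
    print(round(allowed_downtime_minutes(99.995, period_days=30), 2))
```

A budget of this size leaves little room for manual recovery steps, which is consistent with the multi-region design described above.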



The following table shows appropriate resilience strategies for each failure type for tier zero applications:

Failure type | Resilience strategy

Hardware failure | Build redundancy by deploying two or more VM instances across availability zones within a region.

Datacenter failure | Build redundancy by deploying two or more VM instances across availability zones within a region.

Regional failure | Use Azure Site Recovery to replicate all VMs in the web tier and middle tier. Use global data distribution with Cosmos DB. Test the disaster recovery of the complete application (including Cosmos DB failover). Perform a disaster recovery failover in the event of an extended outage in the source region.

Heavy load | Provision enough capacity for the application. Use tools to monitor the load, and use scripts to add more instances automatically when a threshold (for example, 70 percent) is reached.

Accidental data deletion or corruption | Use Azure Backup to back up all VMs and SQL databases. Test data recovery by restoring files, disks, VMs, or SQL databases. Restore the data if an accidental deletion occurs.

Application deployment failure | Use safe deployment practices to roll out updates to a minimal set of customers before deploying them widely. Use automation scripts to deploy updates, with automatic rollback built in for cases where there’s an issue with the update deployment. Configure alerts to send notifications if an issue occurs after an update deployment, and have the automated rollback script ready to execute.
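The "Application deployment failure" strategy can be sketched as a staged rollout that reverts automatically when a health check fails. Everything named here is hypothetical: `staged_rollout`, the ring names, and the `deploy`, `is_healthy`, and `roll_back` callables stand in for whatever deployment scripts and alerting an organization actually uses. It is one possible shape for such automation, not a description of an Azure service.

```python
from typing import Callable, List, Sequence


def staged_rollout(rings: Sequence[str],
                   deploy: Callable[[str], None],
                   is_healthy: Callable[[str], bool],
                   roll_back: Callable[[str], None]) -> bool:
    """Deploy ring by ring; on the first unhealthy ring, roll back every
    ring updated so far (most recent first) and report failure."""
    deployed: List[str] = []
    for ring in rings:
        deploy(ring)
        deployed.append(ring)
        if not is_healthy(ring):
            for done in reversed(deployed):
                roll_back(done)
            return False
    return True


if __name__ == "__main__":
    log: List[str] = []
    ok = staged_rollout(
        rings=["canary", "early-adopters", "everyone"],  # smallest audience first
        deploy=lambda r: log.append(f"deploy {r}"),
        is_healthy=lambda r: r != "early-adopters",      # simulate a bad update
        roll_back=lambda r: log.append(f"rollback {r}"),
    )
    print(ok)
    print(log)
```

Ordering the rings from the smallest audience to the widest means a bad update is caught before most customers see it, and the health check is the point where post-deployment alerts would feed in.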

Conclusion

This document discusses the importance
of resilience when designing applications
on Azure and the process of designing and
deploying highly resilient applications. It’s
designed to help customers understand
how Microsoft continuously improves
Azure’s platform reliability by investing in
the foundation aspects. It also provides an
overview of the built-in services offered by
Azure that can be leveraged to design and
deploy resilient applications.

It’s important for customers to understand
the shared responsibility model and the
expectations that go along with the SLA
when deploying applications in the cloud.
Microsoft can help organizations design
their application components to be resilient
to failures.

The example resilience design patterns
discussed in this document are meant to
help customers as they start running their
production applications on Azure.

To provide feedback or to share the best
practices you followed to deploy highly
resilient applications on Azure, please
contact us.
