Resilience in Azure
Microsoft Azure
Designing resilient infrastructure as a
service applications on Azure
August 2019
2 August 2019
Abstract
This document provides guidance on
designing resilient infrastructure as a
service (IaaS) applications on Azure and
provides sample application design
patterns for varied levels of resilience.
The audience for this document is those
who move applications from on-premises
environments to Azure or who build
applications on Azure, such as cloud
solution architects, business continuity and
disaster recovery administrators, application
developers and operations team members,
as well as CTOs, CIOs, and those involved in
policy, planning, and other strategic roles.
Resilience in Microsoft Azure 3
Table of contents
Abstract
Test failures and response strategies
Introduction
Azure is a rapidly growing cloud computing
platform that provides an ever-expanding
suite of cloud services. These include
analytics, computing, database, mobile,
networking, storage, and web services.
Azure integrates tools, templates, and
managed services that work together to
help make it easier to build and manage
enterprise, mobile, web, and Internet
of Things (IoT) apps faster, using the
tools, applications, and frameworks that
customers choose.
Azure is built on trust. The Azure approach to trust is based on five foundational principles: security, compliance, privacy, resilience, and intellectual property (IP) protection. A well-designed Azure application should focus on these five pillars of software quality.
This document describes a process for achieving resilience using a structured approach throughout the lifetime of an application, from design and implementation to deployment and operations.
Pillar Description
Concepts and
Achieving resilience requires understanding
how it’s measured, designed, and
implemented. The concepts and terms
Resilient design
Failures can vary in the scope of their impact. There are many different failure types, including hardware, datacenter, regional or transient failure, as well as dependency service, heavy load, accidental data deletion/corruption, and application deployment failures. Performing a failure mode analysis (FMA) during the design phase helps identify possible points of failure and defines how the application will respond to those failures.
What’s
Resilience is a system’s ability to recover
from failures and continue to function.
It’s not only about avoiding failures
Built-in Azure
resilience services
Despite all the investments Microsoft puts
into making the platform reliable, apps
can suffer from downtime because of
unplanned events such as power failures,
data corruption, ransomware attacks,
as well as fires or natural disasters. The
graphic below illustrates key Azure service
offerings for various types of failures that
can occur (Figure 1).
[Figure 1: Azure resilience offerings by failure scope. Panel titles: improved availability; build and run highly available applications; implement disaster recovery plan. Offerings shown: Premium Storage, Availability Sets, Availability Zones, Azure Site Recovery and region pairs, Backup. Scopes shown: single VM, datacenter, Region 1, Region 2.]
Shared responsibility
Azure invests in platform resilience and complements those investments with the abovementioned offerings to ensure uptime for customers. Being resilient in the event of any failure, however, is a shared responsibility. Who’s responsible for what, in terms of resilience, depends on the cloud service model that’s being used: IaaS, PaaS, or SaaS (Figure 2).
In the traditional on-premises model, the entire responsibility of managing, from the hardware for compute, storage, and networking to the application, falls on the customer. That requires planning for various types of failures and knowing how to deal with them on-premises. With IaaS, Azure is responsible for the core infrastructure resilience, which includes storage, networking, and compute. As the model moves from IaaS to PaaS and then to SaaS, the customer is responsible for less and the cloud service provider is responsible for more.
For more information about the shared responsibility model as it pertains to Azure cloud services, download the paper Shared Responsibilities for Cloud Computing Today.
Design your
applications
Design the application for resilience with protection from various failures at the application, infrastructure, and regional levels. This document provides proven best practices learned from customer engagements, along with strategies to detect and recover from failures.
Deploy the application into production using a reliable and repeatable process.
Monitor the application to detect failures, enable health checks, and allow for incident response when necessary.
Respond if there are failures that require manual interventions.
Define
The required level of resilience depends
on several considerations. The first step
is to define resilience requirements
requirements
understanding of the SLA.
Design Classify applications into various tiers based on the business need.
Identify resilience requirements. Identify resilience requirements such as SLA, RPO, and RTO for each tier.
SLAs
In Azure, the SLA describes our commitment regarding uptime and connectivity. If the SLA for a particular service is 99.9 percent, that means customers can expect the service to be available 99.9 percent of the time.
Azure customers should define their own target SLAs for each workload in their solutions. An SLA makes it possible to evaluate whether the architecture meets the business requirements. For example, if a workload requires 99.99 percent uptime, but depends on a service with a 99.9 percent SLA, that service can’t be a single point of failure in the system.
Of course, higher availability is always better when everything else is equal. But as customers strive for higher percentages, the cost and complexity to achieve that level of availability grows. An uptime of 99.99 percent translates to about five minutes of total downtime per month. Is it worth the additional complexity and cost to reach that percentage? The answer depends on the individual business requirements.
To achieve four nines (99.99 percent), manual intervention can’t be relied on to recover from failures. The application must be self-diagnosing and self-healing. Beyond four nines, it’s challenging to detect outages quickly enough to meet the SLA. Think about the time window against which the SLA is measured: the smaller the window, the tighter the tolerances.
It probably doesn’t make sense to define the SLA in terms of hourly or daily uptime. Consider the mean time between failures (MTBF) and mean time to recover (MTTR) measurements. The higher the SLA, the less frequently the service can go down, and the more quickly the service must recover.
The following table shows the potential cumulative downtime for various SLA levels:
SLA Downtime per week Downtime per month Downtime per year
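The downtime columns follow from simple arithmetic: the downtime budget is the unavailable fraction (1 minus the SLA) multiplied by the length of the measurement window. The sketch below is illustrative only. The function name, the SLA levels it loops over, and the 30-day month and 365-day year are assumptions, not values taken from this document.

```python
# Downtime budget for an SLA over common measurement windows.
MINUTES_PER_WEEK = 7 * 24 * 60
MINUTES_PER_MONTH = 30 * 24 * 60   # assumption: 30-day month
MINUTES_PER_YEAR = 365 * 24 * 60   # assumption: 365-day year


def allowed_downtime_minutes(sla_percent: float, window_minutes: int) -> float:
    """Return the downtime budget, in minutes, for an SLA over a window."""
    return (1 - sla_percent / 100) * window_minutes


if __name__ == "__main__":
    # The SLA levels below are illustrative choices.
    for sla in (99.0, 99.9, 99.95, 99.99, 99.999):
        week = allowed_downtime_minutes(sla, MINUTES_PER_WEEK)
        month = allowed_downtime_minutes(sla, MINUTES_PER_MONTH)
        year = allowed_downtime_minutes(sla, MINUTES_PER_YEAR)
        print(f"{sla}%: {week:.2f} min/week, {month:.2f} min/month, {year:.2f} min/year")
```

For 99.99 percent and a 30-day month this gives a budget of roughly 4.3 minutes, consistent with the "about five minutes" figure quoted above.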
Composite SLAs
Dealing with multiple services that have different SLAs adds complexity. In this case, it’s necessary
to calculate a composite SLA. Consider the Web Apps feature of Azure App Service which writes to
Azure SQL Database. At the time of this writing, these Azure services have the following SLAs:
What’s the maximum downtime that should be expected for this application? If either service fails,
the whole application fails. In general, the probability of each service failing is independent of the
other. So, the composite SLA for this application is the following:
That’s lower than the individual SLAs. This isn’t surprising, because an application that relies on
multiple services has more potential failure points.
On the other hand, the composite SLA can be improved by creating independent fallback paths. For
example, if SQL Database is unavailable, transactions are put into a queue to be processed later.
With this design, the application is still available even if it can’t connect to the database. However, it
fails if the database and the queue both fail at the same time. The expected percentage of time for
a simultaneous failure is 0.0001 × 0.001, so the composite SLA for this combined path is 1 − (0.0001 × 0.001) = 99.99999 percent.
There are tradeoffs to this approach. The application logic is more complex, the customer is paying
for the queue, and there may be data consistency issues to consider.
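The two composite calculations described in this section can be sketched as follows, assuming independent failures as the text does. The helper names are hypothetical. The database and queue values (99.99 and 99.9 percent) are inferred from the 0.0001 and 0.001 failure shares quoted above; the Web Apps value of 99.95 percent is an assumption for illustration and should be replaced with the currently published SLA.

```python
from math import prod


def serial_sla(*slas: float) -> float:
    """Every service must be up: multiply the availabilities."""
    return prod(slas)


def fallback_sla(*slas: float) -> float:
    """Any one path is enough: 1 minus the product of the failure shares."""
    return 1 - prod(1 - s for s in slas)


WEB_APP_SLA = 0.9995  # assumption for illustration, not stated in this text
SQL_DB_SLA = 0.9999   # inferred from the 0.0001 failure share
QUEUE_SLA = 0.999     # inferred from the 0.001 failure share

if __name__ == "__main__":
    without_queue = serial_sla(WEB_APP_SLA, SQL_DB_SLA)
    data_path = fallback_sla(SQL_DB_SLA, QUEUE_SLA)
    with_queue = serial_sla(WEB_APP_SLA, data_path)
    print(f"web app + database:          {without_queue:.6%}")
    print(f"database or queue:           {data_path:.6%}")
    print(f"web app + (database|queue):  {with_queue:.6%}")
```

Under these assumptions the fallback queue lifts the data path close to 100 percent, so the web tier becomes the limiting factor for the application as a whole.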
Let N be the composite SLA for the application deployed in one region,
and R be the number of regions where the application is deployed. The
expected chance that the application will fail in all regions at the same
time is ((1 − N) ^ R).
The SLA for Traffic Manager must also be factored in. At the time of this
writing, the SLA for Traffic Manager is 99.99 percent.
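As a rough sketch of the multi-region formula, the function below computes 1 − (1 − N)^R and then multiplies by the Traffic Manager SLA, treating Traffic Manager as a dependency in series with the regional deployments. The function name and the sample value N = 99.95 percent are assumptions for illustration; only the 99.99 percent Traffic Manager figure comes from the text.

```python
def multi_region_sla(n: float, regions: int, traffic_manager_sla: float = 0.9999) -> float:
    """Composite SLA for an application deployed to several regions.

    n        -- composite SLA of the application in one region (0..1)
    regions  -- number of regions, R
    """
    if regions < 1:
        raise ValueError("regions must be at least 1")
    all_regions_fail = (1 - n) ** regions       # (1 - N) ^ R
    across_regions = 1 - all_regions_fail
    # Traffic Manager sits in front of every region, so it is multiplied in.
    return across_regions * traffic_manager_sla


if __name__ == "__main__":
    # N = 0.9995 is an assumed single-region SLA, used only as an example.
    for r in (1, 2, 3):
        print(f"{r} region(s): {multi_region_sla(0.9995, r):.6%}")
```

Once two or more regions are in place, the result is dominated by the Traffic Manager SLA, which is why it has to be factored in.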
Decompose by workload
Many cloud solutions consist of multiple application workloads. The term workload in this context means a discrete capability or computing task, which can be logically separated from other tasks in terms of business logic and data storage requirements. For example, an e-commerce app might include the following workloads:
Browse and search a product catalog.
Create and track orders.
View recommendations.
These workloads might have different requirements for availability, scalability, data consistency, and disaster recovery. This means business decisions are required to balance cost versus risk.
Also consider usage patterns. Are there certain critical periods when the system must be available? For example, a tax-filing service can’t go down right before the filing deadline, a video streaming service must stay up during a big sports event, and so on. During these critical periods, using redundant deployments across several regions will help provide resilience, so the application can fail over if one region fails. Because multi-region deployment is potentially more expensive, it’s more cost effective to run the application in a single region during less critical times.
Categorizing applications into different tiers is a common strategy. Tier zero and tier one applications are made up of those applications that should experience very minimal data loss and downtime.
The RTO/RPO for this tier needs to be zero or as close to nil as possible. Tier two applications consist of applications for which it’s acceptable to lose minimal amounts of data with RTO and RPO of the order of a few minutes. With tier three and tier four applications, downtime can affect internal operations for a few hours. While this is inconvenient, it doesn’t pose a huge risk to the business.
where the only sales channel is through the website. For a retail company where brick and mortar stores are the primary sales channel, the point of sales (POS) application is tier zero or tier one. Thus, it’s important to categorize the application inventory into multiple tiers based on the organization’s business needs and resilience requirements.
Tier 0 99.995 0 0
Design for
resilience
Resilience in Microsoft Azure 25
Failures can vary in the scope of their impact. The table below describes various
types of application failures. While this list isn’t exhaustive, it provides a starting
point to help customers think about various failure types.
Accidental data deletion or corruption: Customers mistakenly delete critical data or data has been corrupted due to unforeseen reasons.
Implement
This section provides guidance on
implementing common resilience strategies
for applications to prevent or respond to
strategies
provide a general idea of how to plan and
implement resilience strategies.
for Azure
applications
Application deployment failure: Automate deployments with a rollback plan.
Load balance across instances. Scalability means a cloud application should be able to scale out by adding more instances. This approach also improves resilience, allowing for unhealthy instances to be removed from rotation. For example:
When you put two or more VMs behind a load balancer, it distributes traffic to all of the VMs. See Run load-balanced VMs for scalability and availability.
Back up data. Always configure backups for all critical production data sources. This includes VMs, databases, storage, among others. Accidental deletions or data corruptions can happen at any time. Personnel might not become aware of some of these until a few days or even weeks later. Thus, it’s important to configure longer retention times for backup copies, depending on the nature and criticality of the application.
An Availability Set consists of two or more fault domains that share a common power source and network switch. VMs in an Availability Set are distributed across the fault domains, so if a hardware failure affects one fault domain, network traffic can still be routed to the VMs in the other fault domains. For more information about Availability Sets, see Manage the availability of Windows virtual machines in Azure.
Availability Zones. An Availability Zone is a physically separate location within an Azure region. Availability Zones provide a combination of low latency and high availability through strategic physical separation within an Azure region. Each Availability Zone has independent physical infrastructure with a distinct power source, network, and cooling system. Deploying VMs across Availability Zones helps protect an application against datacenter-wide failures. See What are Availability Zones in Azure? for a list of supported regions and services.
When planning to use Availability Zones in a deployment, first validate that the application architecture and code base can support this configuration. When deploying commercial off-the-shelf software, consult with the software vendor and test adequately before deploying into production. The application must support running in an elastic and distributed infrastructure with no hard-coded infrastructure components specified in the code base.
Azure Site Recovery. Azure Site Recovery helps to replicate Azure VMs to another Azure region for business continuity and disaster recovery. Conduct periodic disaster recovery drills to ensure compliance requirements are met. The VM will be replicated with the specified settings to the selected region, ensuring customers can recover their applications in the event of outages in the source region. For more information, see Set up disaster recovery to a secondary Azure region for an Azure VM.
Customers should factor in the RTO and RPO numbers for their solutions here and ensure that, when testing, the recovery time and recovery point are appropriate for their needs.
Paired regions. To protect an application in case of a regional outage, deploy the application across multiple regions using Traffic Manager, which distributes internet traffic to the different regions. Azure pairs each region with another region, forming a regional pair. With the exception of Brazil South, regional pairs are located within the same geography.
Test failures
Generally, resilience can’t be tested in
the same way as application functionality
(that is, by running unit tests and so on).
strategies
occur intermittently.
Crash processes.
Expire certificates.
Unmount disks.
Redeploy a VM.
Load testing. Load testing is crucial for identifying failures that only happen under load, such as the backend database being overwhelmed or service throttling. Test for peak load, using production data or synthetic data that’s as close to production data as possible. The goal is to observe how the application behaves under real-world conditions.
Disaster recovery drills. It’s not enough to have a good disaster recovery plan in place. It must also be tested periodically to ensure that the recovery plan works properly when it matters most. For Azure virtual machines, use Azure Site Recovery to replicate and perform disaster recovery drills, all without impacting production applications or ongoing replication.
Deploy using
After an application is deployed to
production, updates present a potential
source of errors. In the worst case, a bad
process
predictable and repeatable. Deployment
includes provisioning Azure resources,
deploying application code, and applying
configuration settings. An update may
involve all three or a subset of these.
Application updates
When rolling out an application update,
Microsoft recommends blue-green deployment
or canary releases to push updates in a highly
controlled way and minimize possible impacts
from a bad deployment. These techniques are
explained below:
Monitor to
Monitoring is crucial for resilience. If
something fails, it’s essential to know that it
failed, and to get insights into the cause(s)
Monitoring isn’t the same as failure detection. For example, an application might detect a transient error and retry, resulting in no downtime. But it should also log the retry operation, so customers can monitor the error rate in order to get an overall picture of application health.
Application logs are an important source of diagnostics data. Best practices for application logging include:
Application logging isn’t the same as auditing. Auditing may be done for compliance or regulatory reasons. This means audit records must be complete, and it’s not acceptable to drop or exclude any records while processing transactions. If an application requires auditing, this should be kept separate from diagnostics logging.
For more information, see Monitoring and diagnostics guidance.
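The point about logging retries can be illustrated with a minimal sketch. It retries an operation on a transient error and writes a log record for each retry, so the error rate stays visible even when no downtime results. The TransientError class, the call_with_retry function, and the backoff values are hypothetical names and choices for illustration; they are not an Azure API.

```python
import logging
import time

logger = logging.getLogger("resilience.retry")


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def call_with_retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying on TransientError with exponential backoff.

    Every retry is logged as a warning so monitoring can track the
    transient error rate; the final failure is logged as an error.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError as exc:
            if attempt == attempts:
                logger.error("operation failed after %d attempts: %s", attempt, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning(
                "transient error on attempt %d, retrying in %.2fs: %s",
                attempt, delay, exc,
            )
            sleep(delay)
```

The sleep parameter is injectable so that tests can skip the real delay. Audit records, if required, would be written through a separate channel, in line with the distinction drawn above.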
Respond to
Previous sections have focused on
automated recovery strategies, which
are critical for high availability. However,
Example
This section will focus on resilience design
best practices for various application
deployments with varied resilience
design
tiers based on their resilience
requirements, as discussed in the resilience
requirements section. The following
Regional failure: Use Azure Site Recovery to replicate the database VM. Test the disaster recovery using test failover and Azure Site Recovery plans. Perform a disaster recovery failover in the event of an extended outage in the source region.
Heavy load: Load balance across instances to handle spikes in usage. For example, put two or more Azure VMs behind a load balancer to distribute traffic to all VMs.
Accidental data deletion or corruption: Use Azure Backup to back up the VMs. Test data recovery by restoring files, disks, and VMs. Restore data if there’s an accidental deletion.
the disaster recovery region. The VMs are created only when the user performs failover operations.
Monitor the health of the web application by using an automation script that periodically checks to determine whether the website endpoint is reachable. Create a custom endpoint that reports on the overall health of the application. The endpoint should return an HTTP error code if any critical dependency is unhealthy or unreachable.
Use a pre-created script to monitor the simple health metrics of the VMs to detect if there’s an issue with the VM. Troubleshoot any issues by checking the VM health metrics in the portal. Check for component health (such as CPU, memory, and disk) to determine whether there are potential load issues. If there are consistent issues with components such as CPU or RAM, consider increasing the VM to a larger size or consider scaling the application by deploying more VMs at each tier. Monitor advanced metrics for a VM’s health as well as activities such as database failover when using asynchronous replication.
Application software updates can be deployed on the VMs using automation scripts during a weekend maintenance window. Ensure that the automation script is in place to roll back if any issues are encountered during the deployment process.
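A custom health endpoint of the kind described above can be sketched with the Python standard library. This is a minimal illustration, not a production design: the /health path, the dependency names, and the check functions are all hypothetical placeholders, and a real check would probe the actual database, queue, or other critical dependency.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database() -> bool:
    """Hypothetical placeholder; a real check would query the database."""
    return True


def check_queue() -> bool:
    """Hypothetical placeholder; a real check would probe the queue."""
    return True


CHECKS = {"database": check_database, "queue": check_queue}


def evaluate_health(checks):
    """Run every check; return (HTTP status, per-dependency results).

    A check that returns False or raises counts as unhealthy, and any
    unhealthy dependency turns the overall status into 503.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, results = evaluate_health(CHECKS)
        body = json.dumps(results).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Start the endpoint; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Calling serve() starts the endpoint locally. A monitoring script or load balancer probe can then poll /health and treat any non-200 response as a signal that a critical dependency is unhealthy or unreachable.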
Regional failure: Use Azure Site Recovery to replicate the database VM. Test the disaster recovery using test failover and Azure Site Recovery plans. Perform a disaster recovery failover in the event of an extended outage in the source region.
Heavy load: Use monitoring tools to identify load surges on the VM. Increase the size of the VM (scale up) or add more instances (scale out).
Accidental data deletion or corruption: Use Azure Backup to back up VMs. Test data recovery by restoring files, disks, VMs, or SQL databases. Restore data if an accidental deletion occurs.
Regional failure: Use Azure Site Recovery to replicate all the VMs. Test the disaster recovery using test failover and Azure Site Recovery plans. Perform a disaster recovery failover in the event of an extended outage in the source region.
Heavy load: Provision enough capacity into the application. Use tools to monitor the load, and use scripts to add more instances automatically when a threshold (for example, 70 percent) is reached.
Accidental data deletion or corruption: Use Azure Backup to back up all VMs and SQL databases. Test data recovery by restoring files, disks, VMs, or SQL databases. Restore data if an accidental deletion occurs.
Application deployment failure: Use safe deployment practices to roll out the updates to a minimal set of customers before deploying them widely. Use automation scripts to deploy updates with automatic rollback capability built in, in case there’s an issue with the update deployment. Configure alerts to send alarms/notifications if an issue occurs after an update deployment. If so, have the automated rollback script ready to execute.
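The threshold-based scaling mentioned in the heavy load row can be sketched as a small decision function. Everything here is a hypothetical illustration: the function name, the use of average CPU as the load signal, the one-instance step, and the instance cap are assumptions, and only the 70 percent example threshold comes from the text. In practice the platform's autoscale features or an operations script would supply the samples and act on the result.

```python
from statistics import mean


def desired_instance_count(cpu_samples, current, threshold=70.0, max_instances=10):
    """Return the instance count to run, given recent CPU readings (percent).

    Adds one instance when the average load reaches the threshold, up to
    max_instances; otherwise keeps the current count. Scale-in is left
    out of this sketch.
    """
    if not cpu_samples:
        return current  # no data: make no change
    if mean(cpu_samples) >= threshold:
        return min(current + 1, max_instances)
    return current


if __name__ == "__main__":
    print(desired_instance_count([82.0, 75.5, 91.0], current=2))
    print(desired_instance_count([35.0, 48.0], current=2))
```

Averaging several samples rather than reacting to a single reading is a simple way to avoid adding instances on a momentary spike.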
Regional failure: Use Azure Site Recovery to replicate all VMs in the web tier and middle tier. Use native replication technologies such as SQL AlwaysOn. Test the disaster recovery of the complete application (including SQL AlwaysOn failover using Azure Site Recovery plans) and test failover capabilities. Perform a disaster recovery failover in the event of an extended outage in the source region.
Heavy load: Provision enough capacity into the application. Use tools to monitor the load, and use scripts to add more instances automatically when a threshold (for example, 70 percent) is reached.
Accidental data deletion or corruption: Use Azure Backup to back up all VMs and SQL databases. Test data recovery by restoring files, disks, VMs, or SQL databases. Restore data if an accidental deletion occurs.
Application deployment failure: Use safe deployment practices to roll out the updates to a minimal set of customers before deploying them widely. Use automation scripts to deploy updates with automatic rollback capability built in, in case there’s an issue with the update deployment. Configure alerts to send alarms/notifications if an issue occurs after an update deployment. If so, have the automated rollback script ready to execute.
A few business applications will be mission critical and will require close to 100 percent availability, no data loss, and no downtime. An example of this is an e-commerce business whose only sales channel is its website. This site can’t accommodate any downtime or data loss, especially during the holiday season. Similarly, a stock trading website for a financial services company can’t have any downtime or data loss.
Traditionally, implementing highly available applications across multiple datacenters that are hundreds of kilometers apart wasn’t feasible. With the new technologies and services now available in Microsoft Azure, a mission-critical business application can be made highly available across regions. Note that such availability guarantees come at a high cost.
Customers can architect their applications on Azure using various modern service offerings such as App Service plan, Cosmos DB, Azure Active Directory, Azure Cache for Redis, and Azure Search to ensure the application is highly available across multiple regions and will run with low latencies (Figure 6).
Regional failure: Use Azure Site Recovery to replicate all VMs in the web tier and middle tier. Use global data distribution with Cosmos DB. Test the disaster recovery of the complete application (including Cosmos DB failover). Perform a disaster recovery failover in the event of an extended outage in the source region.
Heavy load: Provision enough capacity into the application. Use tools to monitor the load, and use scripts to add more instances automatically when a threshold (for example, 70 percent) is reached.
Accidental data deletion or corruption: Use Azure Backup to back up all VMs and SQL databases. Test data recovery by restoring files, disks, VMs, or SQL databases. Restore the data if an accidental deletion occurs.
Application deployment failure: Use safe deployment practices to roll out the updates to a minimal set of customers before deploying them widely. Use automation scripts to deploy updates with automatic rollback capability built in, in case there’s an issue with the update deployment. Configure alerts to send alarms if an issue occurs after an update deployment. If so, have the automated rollback script ready to execute.