Azure Application Architecture Guide


Table of Contents

Azure Architecture Center


Application Architecture Guide
Architecture styles
N-tier application
Web-queue-worker
Microservices
CQRS
Event-driven architecture
Big data
Big compute
Technology choices
Compute options overview
Compute comparison
Data store overview
Data store comparison
Design principles
Design for self-healing
Make all things redundant
Minimize coordination
Design to scale out
Partition around limits
Design for operations
Use managed services
Use the best data store for the job
Design for evolution
Build for the needs of business
Pillars of software quality
Design patterns
This guide presents a structured approach for designing applications on Azure that are scalable, resilient, and highly available. It is
based on proven practices that we have learned from customer engagements.
[Figure: overview of the guide. Labels: Architecture Styles, Reference Architectures, Technology Choices, Compute and Storage
Options, Design Principles, Best Practices, Quality Pillars, Design Review Checklists, Cloud Design Patterns, Design Patterns.]

Introduction
The cloud is changing the way applications are designed. Instead of monoliths, applications are decomposed into smaller,
decentralized services. These services communicate through APIs or by using asynchronous messaging or eventing. Applications
scale horizontally, adding new instances as demand requires.
These trends bring new challenges. Application state is distributed. Operations are done in parallel and asynchronously. The system
as a whole must be resilient when failures occur. Deployments must be automated and predictable. Monitoring and telemetry are
critical for gaining insight into the system. The Azure Application Architecture Guide is designed to help you navigate these
changes.

TRADITIONAL ON-PREMISES                 MODERN CLOUD

Monolithic, centralized                 Decomposed, decentralized
Design for predictable scalability      Design for elastic scale
Relational database                     Polyglot persistence (mix of storage technologies)
Strong consistency                      Eventual consistency
Serial and synchronized processing      Parallel and asynchronous processing
Design to avoid failures (MTBF)         Design for failure (MTTR)
Occasional big updates                  Frequent small updates
Manual management                       Automated self-management
Snowflake servers                       Immutable infrastructure

This guide is intended for application architects, developers, and operations teams. It's not a how-to guide for using individual
Azure services. After reading this guide, you will understand the architectural patterns and best practices to apply when building on
the Azure cloud platform.

How this guide is structured


The Azure Application Architecture Guide is organized as a series of steps, from the architecture and design to implementation. For
each step, there is supporting guidance that will help you with the design of your application architecture.
Architecture styles. The first decision point is the most fundamental. What kind of architecture are you building? It might be a
microservices architecture, a more traditional N-tier application, or a big data solution. We have identified seven distinct
architecture styles. There are benefits and challenges to each.

Azure Reference Architectures show recommended deployments in Azure, along with considerations for scalability,
availability, manageability, and security. Most also include deployable Resource Manager templates.

Technology Choices. Two technology choices should be decided early on, because they affect the entire architecture. These are
the choice of compute and storage technologies. The term compute refers to the hosting model for the computing resources that
your application runs on. Storage includes databases but also storage for message queues, caches, IoT data, unstructured log data,
and anything else that an application might persist to storage.

Compute options and Storage options provide detailed comparison criteria for selecting compute and storage services.

Design Principles. Throughout the design process, keep these ten high-level design principles in mind.

Best practices articles give specific guidance on areas such as auto-scaling, caching, data partitioning, API design, and others.

Pillars. A successful cloud application will focus on these five pillars of software quality: Scalability, availability, resiliency,
management, and security.

Use our Design review checklists to review your design according to these quality pillars.

Cloud Design Patterns. These design patterns are useful for building reliable, scalable, and secure applications on Azure. Each
pattern describes a problem, a pattern that addresses the problem, and an example based on Azure.

View the complete Catalog of cloud design patterns.


An architecture style is a family of architectures that share certain characteristics. For example, N-tier is a common architecture
style. More recently, microservice architectures have started to gain favor. Architecture styles don't require the use of particular
technologies, but some technologies are well-suited for certain architectures. For example, containers are a natural fit for
microservices.
We have identified a set of architecture styles that are commonly found in cloud applications. The article for each style includes:
A description and logical diagram of the style.
Recommendations for when to choose this style.
Benefits, challenges, and best practices.
A recommended deployment using relevant Azure services.

A quick tour of the styles


This section gives a quick tour of the architecture styles that we've identified, along with some high-level considerations for their
use. Read more details in the linked topics.

N-tier
N-tier is a traditional architecture for enterprise applications. Dependencies are managed by dividing the application into layers
that perform logical functions, such as presentation, business logic, and data access. A layer can only call into layers that sit
below it. However, this horizontal layering can be a liability. It can be hard to introduce changes in one part of the application
without touching the rest of the application. That makes frequent updates a challenge, limiting how quickly new features can be
added.
N-tier is a natural fit for migrating existing applications that already use a layered architecture. For that reason, N-tier is most often
seen in infrastructure as a service (IaaS) solutions, or applications that use a mix of IaaS and managed services.

Web-Queue-Worker

For a purely PaaS solution, consider a Web-Queue-Worker architecture. In this style, the application has a web front end that
handles HTTP requests and a back-end worker that performs CPU-intensive tasks or long-running operations. The front end
communicates with the worker through an asynchronous message queue.
Web-queue-worker is suitable for relatively simple domains with some resource-intensive tasks. Like N-tier, the architecture is
easy to understand. The use of managed services simplifies deployment and operations. But with complex domains, it can be hard
to manage dependencies. The front end and the worker can easily become large, monolithic components that are hard to maintain
and update. As with N-tier, this can reduce the frequency of updates and limit innovation.

Microservices
If your application has a more complex domain, consider moving to a Microservices architecture. A microservices application is
composed of many small, independent services. Each service implements a single business capability. Services are loosely
coupled, communicating through API contracts.
Each service can be built by a small, focused development team. Individual services can be deployed without a lot of coordination
between teams, which encourages frequent updates. A microservice architecture is more complex to build and manage than either
N-tier or web-queue-worker. It requires a mature development and DevOps culture. But done right, this style can lead to higher
release velocity, faster innovation, and a more resilient architecture.

CQRS
The CQRS (Command and Query Responsibility Segregation) style separates read and write operations into separate models. This
isolates the parts of the system that update data from the parts that read the data. Moreover, reads can be executed against a
materialized view that is physically separate from the write database. That lets you scale the read and write workloads
independently, and optimize the materialized view for queries.
CQRS makes the most sense when it's applied to a subsystem of a larger architecture. Generally, you shouldn't impose it across
the entire application, as that will just create unneeded complexity. Consider it for collaborative domains where many users
access the same data.
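
The separation described above can be illustrated with a minimal in-memory sketch. The order domain and the names WriteModel,
TotalsView, and place_order are hypothetical, chosen only for illustration. A real system would typically keep the materialized
view in a separate store and update it asynchronously.

```python
from dataclasses import dataclass, field


@dataclass
class WriteModel:
    """Command side: owns the authoritative list of orders (hypothetical domain)."""
    orders: list = field(default_factory=list)
    subscribers: list = field(default_factory=list)

    def place_order(self, customer, amount):
        order = {"customer": customer, "amount": amount}
        self.orders.append(order)
        # Notify each read model so it can update its own view.
        for notify in self.subscribers:
            notify(order)


@dataclass
class TotalsView:
    """Query side: a materialized view shaped for one specific query."""
    totals: dict = field(default_factory=dict)

    def apply(self, order):
        customer = order["customer"]
        self.totals[customer] = self.totals.get(customer, 0.0) + order["amount"]

    def total_for(self, customer):
        return self.totals.get(customer, 0.0)


write_model = WriteModel()
view = TotalsView()
write_model.subscribers.append(view.apply)

write_model.place_order("alice", 30.0)
write_model.place_order("alice", 12.5)
write_model.place_order("bob", 5.0)
```

Queries such as view.total_for("alice") never touch the write model, so in principle the two sides can be scaled and tuned
separately.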

Event-driven architecture
Event-driven architectures use a publish-subscribe (pub-sub) model, where producers publish events, and consumers subscribe to
them. The producers are independent from the consumers, and consumers are independent from each other.
Consider an event-driven architecture for applications that ingest and process a large volume of data with very low latency, such
as IoT solutions. The style is also useful when different subsystems must perform different types of processing on the same event
data.
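
As a rough sketch of the pub-sub model, the following in-memory event bus lets two independent consumers process the same
events in different ways. The EventBus class, the "telemetry" topic, and the 80-degree threshold are all assumptions made for
illustration. A production system would use a messaging service rather than in-process calls.

```python
from collections import defaultdict


class EventBus:
    """Minimal in-process publish-subscribe broker (illustrative only)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, event):
        # The producer does not know who, if anyone, is listening.
        for handler in self._handlers[topic]:
            handler(event)


bus = EventBus()
archive = []  # consumer 1: stores every reading
alerts = []   # consumer 2: keeps only readings above a threshold


def raise_alert(event):
    if event["temp_c"] > 80:
        alerts.append(event)


bus.subscribe("telemetry", archive.append)
bus.subscribe("telemetry", raise_alert)

bus.publish("telemetry", {"device": "d1", "temp_c": 72})
bus.publish("telemetry", {"device": "d2", "temp_c": 95})
```

Each consumer can be added or removed without changing the producer or the other consumer.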

Big Data, Big Compute


Big Data and Big Compute are specialized architecture styles for workloads that fit certain specific profiles. Big data divides a
very large dataset into chunks, performing parallel processing across the entire set, for analysis and reporting. Big compute,
also called high-performance computing (HPC), makes parallel computations across a large number (thousands) of cores.
Domains include simulations, modeling, and 3-D rendering.
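
The divide-and-process idea behind big data can be sketched on a single machine. In this hypothetical example, threads stand in
for the nodes of a cluster, and the functions chunked, summarize, and parallel_mean are illustrative names rather than part of any
Azure service.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def summarize(chunk):
    """Per-chunk work: return a partial (sum, count) pair."""
    return sum(chunk), len(chunk)


def parallel_mean(values, chunk_size=1000, workers=4):
    # Process the chunks in parallel, then combine the partial results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(summarize, chunked(values, chunk_size)))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


readings = list(range(1, 10001))
mean = parallel_mean(readings)
```

The key property is that each chunk is summarized independently, so the partial results can be computed anywhere and merged at
the end.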

Architecture styles as constraints


An architecture style places constraints on the design, including the set of elements that can appear and the allowed relationships
between those elements. Constraints guide the "shape" of an architecture by restricting the universe of choices. When an
architecture conforms to the constraints of a particular style, certain desirable properties emerge.
For example, the constraints in microservices include:
A service represents a single responsibility.
Every service is independent of the others.
Data is private to the service that owns it. Services do not share data.
By adhering to these constraints, what emerges is a system where services can be deployed independently, faults are isolated,
frequent updates are possible, and it's easy to introduce new technologies into the application.
Before choosing an architecture style, make sure that you understand the underlying principles and constraints of that style.
Otherwise, you can end up with a design that conforms to the style at a superficial level, but does not achieve the full potential of
that style. It's also important to be pragmatic. Sometimes it's better to relax a constraint, rather than insist on architectural purity.
The following table summarizes how each style manages dependencies, and the types of domain that are best suited for each.

ARCHITECTURE STYLE         DEPENDENCY MANAGEMENT                           DOMAIN TYPE

N-tier                     Horizontal tiers divided by subnet.             Traditional business domain. Frequency of
                                                                           updates is low.
Web-Queue-Worker           Front and backend jobs, decoupled by async      Relatively simple domain with some
                           messaging.                                      resource-intensive tasks.
Microservices              Vertically (functionally) decomposed services   Complicated domain. Frequent updates.
                           that call each other through APIs.
CQRS                       Read/write segregation. Schema and scale are    Collaborative domain where lots of users
                           optimized separately.                           access the same data.
Event-driven architecture  Producer/consumer. Independent view per         IoT and real-time systems.
                           sub-system.
Big data                   Divide a huge dataset into small chunks.        Batch and real-time data analysis. Predictive
                           Parallel processing on local datasets.          analysis using ML.
Big compute                Data allocation to thousands of cores.          Compute-intensive domains such as simulation.

Consider challenges and benefits


Constraints also create challenges, so it's important to understand the trade-offs when adopting any of these styles. Do the
benefits of the architecture style outweigh the challenges, for this subdomain and bounded context?
Here are some of the types of challenges to consider when selecting an architecture style:
Complexity. Is the complexity of the architecture justified for your domain? Conversely, is the style too simplistic for your
domain? In that case, you risk ending up with a "ball of mud", because the architecture does not help you to manage
dependencies cleanly.
Asynchronous messaging and eventual consistency. Asynchronous messaging can be used to decouple services, and
increase reliability (because messages can be retried) and scalability. However, this also creates challenges such as
exactly-once semantics and eventual consistency.
Inter-service communication. As you decompose an application into separate services, there is a risk that communication
between services will cause unacceptable latency or create network congestion (for example, in a microservices
architecture).
Manageability. How hard is it to manage the application, monitor, deploy updates, and so on?
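
Because retried messages may be delivered more than once, one common mitigation is to make the consumer idempotent. The
sketch below is an assumption-laden illustration, not a prescribed implementation: it presumes every message carries a unique
"id" field, and it keeps the set of processed IDs in memory, where a real service would persist it.

```python
class IdempotentConsumer:
    """Applies each message at most once, even if it is delivered repeatedly."""

    def __init__(self):
        self.processed_ids = set()
        self.balance = 0

    def handle(self, message):
        # Skip anything already applied; a retry must not change state twice.
        if message["id"] in self.processed_ids:
            return False
        self.balance += message["amount"]
        self.processed_ids.add(message["id"])
        return True


consumer = IdempotentConsumer()
deliveries = [
    {"id": "m1", "amount": 100},
    {"id": "m2", "amount": 50},
    {"id": "m1", "amount": 100},  # same message, redelivered after a retry
]
results = [consumer.handle(m) for m in deliveries]
```

The duplicate delivery of "m1" is detected and ignored, so the final balance reflects each message exactly once.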
N-tier architecture style
6/23/2017 • 5 min to read

An N-tier architecture divides an application into logical layers and physical tiers.

[Figure: N-tier architecture. Labels: Client, WAF, Web Tier, Cache, Messaging, Middle Tier 1, Middle Tier 2, Remote Service,
Data Tier.]

Layers are a way to separate responsibilities and manage dependencies. Each layer has a specific responsibility. A
higher layer can use services in a lower layer, but not the other way around.
Tiers are physically separated, running on separate machines. A tier can call to another tier directly, or use
asynchronous messaging (message queue). Although each layer might be hosted in its own tier, that's not required.
Several layers might be hosted on the same tier. Physically separating the tiers improves scalability and resiliency,
but also adds latency from the additional network communication.
A traditional three-tier application has a presentation tier, a middle tier, and a database tier. The middle tier is
optional. More complex applications can have more than three tiers. The diagram above shows an application with
two middle tiers, encapsulating different areas of functionality.
An N-tier application can have a closed layer architecture or an open layer architecture:
In a closed layer architecture, a layer can only call the next layer immediately down.
In an open layer architecture, a layer can call any of the layers below it.
A closed layer architecture limits the dependencies between layers. However, it might create unnecessary network
traffic, if one layer simply passes requests along to the next layer.
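
The difference between the two rules can be expressed as a small check. This is only a sketch: the three layer names and the
can_call function are hypothetical, and the list is ordered from the top layer to the bottom layer.

```python
# Layers ordered from top (index 0) to bottom. Names are illustrative.
LAYERS = ["presentation", "business", "data_access"]


def can_call(caller, callee, closed=True):
    """Return True if `caller` may call `callee` under the chosen rule."""
    src, dst = LAYERS.index(caller), LAYERS.index(callee)
    if dst <= src:
        # A layer never calls itself through the layering, or a layer above it.
        return False
    if closed:
        # Closed: only the layer immediately below is allowed.
        return dst == src + 1
    # Open: any lower layer is allowed.
    return True
```

For example, can_call("presentation", "data_access") is rejected under the closed rule but accepted when closed=False.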

When to use this architecture


N-tier architectures are typically implemented as infrastructure as a service (IaaS) applications, with each tier
running on a separate set of VMs. However, an N-tier application doesn't need to be pure IaaS. Often, it's
advantageous to use managed services for some parts of the architecture, particularly caching, messaging, and data
storage.
Consider an N-tier architecture for:
Simple web applications.
Migrating an on-premises application to Azure with minimal refactoring.
Unified development of on-premises and cloud applications.
N-tier architectures are very common in traditional on-premises applications, so they are a natural fit for migrating
existing workloads to Azure.

Benefits
Portability between cloud and on-premises, and between cloud platforms.
Shallower learning curve for most developers.
Natural evolution from the traditional application model.
Open to heterogeneous environments (Windows/Linux).

Challenges
It's easy to end up with a middle tier that just does CRUD operations on the database, adding extra latency
without doing any useful work.
Monolithic design prevents independent deployment of features.
Managing an IaaS application is more work than an application that uses only managed services.
It can be difficult to manage network security in a large system.

Best practices
Use autoscaling to handle changes in load. See Autoscaling best practices.
Use asynchronous messaging to decouple tiers.
Cache semi-static data. See Caching best practices.
Configure database tier for high availability, using a solution such as SQL Server Always On Availability Groups.
Place a web application firewall (WAF) between the front end and the Internet.
Place each tier in its own subnet, and use subnets as a security boundary.
Restrict access to the data tier, by allowing requests only from the middle tier(s).

N-tier architecture on virtual machines


This section describes a recommended N-tier architecture running on VMs.

Each tier consists of two or more VMs, placed in an availability set or VM scale set. Multiple VMs provide resiliency
in case one VM fails. Load balancers are used to distribute requests across the VMs in a tier. A tier can be scaled
horizontally by adding more VMs to the pool.
Each tier is also placed inside its own subnet, meaning that the internal IP addresses of its VMs fall within the same
address range. That makes it easy to apply network security group (NSG) rules and route tables to individual tiers.
The web and business tiers are stateless. Any VM can handle any request for that tier. The data tier should consist of
a replicated database. For Windows, we recommend SQL Server, using Always On Availability Groups for high
availability. For Linux, choose a database that supports replication, such as Apache Cassandra.
Network Security Groups (NSGs) restrict access to each tier. For example, the database tier only allows access from
the business tier.
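
The effect of placing each tier in its own subnet can be modeled in a few lines. This sketch does not use any Azure API. The
address ranges, the ALLOWED set, and the helper names are assumptions chosen to show how subnet membership makes
tier-to-tier rules easy to express.

```python
import ipaddress

# Hypothetical address plan: one subnet per tier.
SUBNETS = {
    "web": ipaddress.ip_network("10.0.1.0/24"),
    "business": ipaddress.ip_network("10.0.2.0/24"),
    "data": ipaddress.ip_network("10.0.3.0/24"),
}

# Permitted flows, as (source tier, destination tier) pairs.
ALLOWED = {("web", "business"), ("business", "data")}


def tier_of(address):
    """Return the tier whose subnet contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for tier, network in SUBNETS.items():
        if ip in network:
            return tier
    return None


def is_allowed(source_ip, dest_ip):
    """Model an NSG-style decision based only on tier membership."""
    return (tier_of(source_ip), tier_of(dest_ip)) in ALLOWED
```

With this plan, a request from 10.0.2.7 (business tier) to 10.0.3.4 (data tier) is allowed, while a request from 10.0.1.5
(web tier) directly to the data tier is denied.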
For more details and a deployable Resource Manager template, see the following reference architectures:
Run Windows VMs for an N-tier application
Run Linux VMs for an N-tier application
Additional considerations
N-tier architectures are not restricted to three tiers. For more complex applications, it is common to have
more tiers. In that case, consider using layer-7 routing to route requests to a particular tier.
Tiers are the boundary of scalability, reliability, and security. Consider having separate tiers for services with
different requirements in those areas.
Use VM Scale Sets for autoscaling.
Look for places in the architecture where you can use a managed service without significant refactoring. In
particular, look at caching, messaging, storage, and databases.
For higher security, place a network DMZ in front of the application. The DMZ includes network virtual
appliances (NVAs) that implement security functionality such as firewalls and packet inspection. For more
information, see Network DMZ reference architecture.
For high availability, place two or more NVAs in an availability set, with an external load balancer to
distribute Internet requests across the instances. For more information, see Deploy highly available network
virtual appliances.
Do not allow direct RDP or SSH access to VMs that are running application code. Instead, operators should
log into a jumpbox, also called a bastion host. This is a VM on the network that administrators use to connect
to the other VMs. The jumpbox has an NSG that allows RDP or SSH only from approved public IP addresses.
You can extend the Azure virtual network to your on-premises network using a site-to-site virtual private
network (VPN) or Azure ExpressRoute. For more information, see Hybrid network reference architecture.
If your organization uses Active Directory to manage identity, you may want to extend your Active Directory
environment to the Azure VNet. For more information, see Identity management reference architecture.
If you need higher availability than the Azure SLA for VMs provides, replicate the application across two
regions and use Azure Traffic Manager for failover. For more information, see Run Windows VMs in multiple
regions or Run Linux VMs in multiple regions.
Web-Queue-Worker architecture style
6/23/2017 • 3 min to read

The core components of this architecture are a web front end that serves client requests, and a worker that
performs resource-intensive tasks, long-running workflows, or batch jobs. The web front end communicates with
the worker through a message queue.
[Figure: Web-Queue-Worker architecture. Labels: Client, Identity Provider, Web Front End, Queue, Worker, Cache, Database,
Remote Service, CDN, Static Content.]

Other components that are commonly incorporated into this architecture include:
One or more databases.
A cache to store values from the database for quick reads.
A CDN to serve static content.
Remote services, such as an email or SMS service. Often these are provided by third parties.
An identity provider for authentication.
The web and worker are both stateless. Session state can be stored in a distributed cache. Any long-running work is
done asynchronously by the worker. The worker can be triggered by messages on the queue, or run on a schedule
for batch processing. The worker is an optional component. If there are no long-running operations, the worker can
be omitted.
The front end might consist of a web API. On the client side, the web API can be consumed by a single-page
application that makes AJAX calls, or by a native client application.
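
The interaction between the front end, the queue, and the worker can be sketched in a single process. Here an in-memory queue
and a thread stand in for a managed message queue and a separate worker host. The function names and the "sum of squares"
task are hypothetical placeholders for real long-running work.

```python
import queue
import threading

jobs = queue.Queue()
results = {}


def handle_request(job_id, payload):
    """Front end: enqueue the work and return immediately."""
    jobs.put((job_id, payload))
    return {"status": "accepted", "job_id": job_id}


def worker():
    """Worker: pull messages off the queue and do the heavy lifting."""
    while True:
        item = jobs.get()
        if item is None:  # sentinel used here to stop the worker
            jobs.task_done()
            break
        job_id, payload = item
        results[job_id] = sum(x * x for x in payload)
        jobs.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

response = handle_request("job-1", [1, 2, 3])
jobs.join()      # wait until the queued job has been processed
jobs.put(None)   # ask the worker to exit
thread.join()
```

The front end never waits on the computation itself. It only confirms that the job was accepted, which is what lets the two
components scale independently.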

When to use this architecture


The Web-Queue-Worker architecture is typically implemented using managed compute services, either Azure App
Service or Azure Cloud Services.
Consider this architecture style for:
Applications with a relatively simple domain.
Applications with some long-running workflows or batch operations.
When you want to use managed services, rather than infrastructure as a service (IaaS).

Benefits
Relatively simple architecture that is easy to understand.
Easy to deploy and manage.
Clear separation of concerns.
The front end is decoupled from the worker using asynchronous messaging.
The front end and the worker can be scaled independently.

Challenges
Without careful design, the front end and the worker can become large, monolithic components that are difficult
to maintain and update.
There may be hidden dependencies, if the front end and worker share data schemas or code modules.

Best practices
Expose a well-designed API to the client. See API design best practices.
Autoscale to handle changes in load. See Autoscaling best practices.
Cache semi-static data. See Caching best practices.
Use a CDN to host static content. See CDN best practices.
Use polyglot persistence when appropriate. See Use the best data store for the job.
Partition data to improve scalability, reduce contention, and optimize performance. See Data partitioning best
practices.

Web-Queue-Worker on Azure App Service


This section describes a recommended Web-Queue-Worker architecture that uses Azure App Service.

The front end is implemented as an Azure App Service web app, and the worker is implemented as a WebJob. The
web app and the WebJob are both associated with an App Service plan that provides the VM instances.
You can use either Azure Service Bus or Azure Storage queues for the message queue. (The diagram shows an
Azure Storage queue.)
Azure Redis Cache stores session state and other data that needs low latency access.
Azure CDN is used to cache static content such as images, CSS, or HTML.
For storage, choose the storage technologies that best fit the needs of the application. You might use multiple
storage technologies (polyglot persistence). To illustrate this idea, the diagram shows Azure SQL Database and
Azure Cosmos DB.
For more details, see Managed web application reference architecture.
Additional considerations
Not every transaction has to go through the queue and worker to storage. The web front end can perform
simple read/write operations directly. Workers are designed for resource-intensive tasks or long-running
workflows. In some cases, you might not need a worker at all.
Use the built-in autoscale feature of App Service to scale out the number of VM instances. If the load on the
application follows predictable patterns, use schedule-based autoscale. If the load is unpredictable, use
metrics-based autoscaling rules.
Consider putting the web app and the WebJob into separate App Service plans. That way, they are hosted on
separate VM instances and can be scaled independently.
Use separate App Service plans for production and testing. Otherwise, if you use the same plan for
production and testing, it means your tests are running on your production VMs.
Use deployment slots to manage deployments. This lets you deploy an updated version to a staging slot,
then swap over to the new version. It also lets you swap back to the previous version, if there was a problem
with the update.
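
The two autoscaling approaches mentioned above differ in what drives the decision. The following sketch is not the App Service
autoscale API. The thresholds, instance limits, and business hours are invented values that only illustrate the shape of each
kind of rule.

```python
def metric_based(current, cpu_percent, minimum=2, maximum=10):
    """Metrics-based rule: react to observed load (thresholds are assumptions)."""
    if cpu_percent > 70:
        desired = current + 1
    elif cpu_percent < 30:
        desired = current - 1
    else:
        desired = current
    # Never scale outside the configured bounds.
    return max(minimum, min(maximum, desired))


def schedule_based(hour, business_hours=range(8, 18), peak=6, off_peak=2):
    """Schedule-based rule: follow a known daily pattern."""
    return peak if hour in business_hours else off_peak
```

A schedule-based rule needs no measurements but assumes the pattern holds, while a metrics-based rule adapts to unexpected
load at the cost of reacting only after the load has changed.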
Microservices architecture style
6/26/2017 • 7 min to read

A microservices architecture consists of a collection of small, autonomous services. Each service is self-contained
and should implement a single business capability.

[Figure: microservices architecture. Labels: Client, Identity Provider, API Gateway, Microservices (multiple Service boxes),
Remote Service, Management, Service Discovery, CDN, Static Content.]

In some ways, microservices are the natural evolution of service-oriented architectures (SOA), but there are
differences between microservices and SOA. Here are some defining characteristics of a microservice:
In a microservices architecture, services are small, independent, and loosely coupled.
Each service is a separate codebase, which can be managed by a small development team.
Services can be deployed independently. A team can update an existing service without rebuilding and
redeploying the entire application.
Services are responsible for persisting their own data or external state. This differs from the traditional
model, where a separate data layer handles data persistence.
Services communicate with each other by using well-defined APIs. Internal implementation details of each
service are hidden from other services.
Services don't need to share the same technology stack, libraries, or frameworks.
Besides the services themselves, some other components appear in a typical microservices architecture:
Management. The management component is responsible for placing services on nodes, identifying failures,
rebalancing services across nodes, and so forth.
Service Discovery. Maintains a list of services and which nodes they are located on. Enables service lookup to find
the endpoint for a service.
API Gateway. The API gateway is the entry point for clients. Clients don't call services directly. Instead, they call the
API gateway, which forwards the call to the appropriate services on the back end. The API gateway might aggregate
the responses from several services and return the aggregated response.
The advantages of using an API gateway include:
It decouples clients from services. Services can be versioned or refactored without needing to update all of
the clients.
Services can use messaging protocols that are not web friendly, such as AMQP.
The API Gateway can perform other cross-cutting functions such as authentication, logging, SSL termination,
and load balancing.
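
The roles of service discovery and the API gateway can be sketched together. Everything in this example is hypothetical: the
back-end services are plain functions standing in for network calls, the token check stands in for real authentication, and the
class names are illustrative.

```python
class ServiceRegistry:
    """Service discovery: maps a service name to its current endpoint."""

    def __init__(self):
        self._endpoints = {}

    def register(self, name, endpoint):
        self._endpoints[name] = endpoint

    def lookup(self, name):
        return self._endpoints[name]


class ApiGateway:
    """Single entry point: authenticates, routes, and aggregates."""

    def __init__(self, registry, valid_tokens):
        self._registry = registry
        self._valid_tokens = set(valid_tokens)

    def handle(self, token, service_names, request):
        # Cross-cutting concern handled once, at the gateway.
        if token not in self._valid_tokens:
            return {"status": 401, "body": None}
        # Fan out to each back-end service and aggregate the responses.
        body = {
            name: self._registry.lookup(name)(request)
            for name in service_names
        }
        return {"status": 200, "body": body}


def profile_service(request):
    return {"user": request["user_id"], "name": "Ada"}


def orders_service(request):
    return {"user": request["user_id"], "open_orders": 2}


registry = ServiceRegistry()
registry.register("profile", profile_service)
registry.register("orders", orders_service)
gateway = ApiGateway(registry, valid_tokens=["secret-token"])

ok = gateway.handle("secret-token", ["profile", "orders"], {"user_id": 7})
denied = gateway.handle("bad-token", ["profile"], {"user_id": 7})
```

The client makes one call and never learns where the individual services live, so a service can be re-registered at a new
endpoint without any client change.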

When to use this architecture


Consider this architecture style for:
Large applications that require a high release velocity.
Complex applications that need to be highly scalable.
Applications with rich domains or many subdomains.
An organization that consists of small development teams.

Benefits
Independent deployments. You can update a service without redeploying the entire application, and roll
back or roll forward an update if something goes wrong. Bug fixes and feature releases are more
manageable and less risky.
Independent development. A single development team can build, test, and deploy a service. The result is
continuous innovation and a faster release cadence.
Small, focused teams. Teams can focus on one service. The smaller scope of each service makes the code
base easier to understand, and it's easier for new team members to ramp up.
Fault isolation. If a service goes down, it won't take out the entire application. However, that doesn't mean
you get resiliency for free. You still need to follow resiliency best practices and design patterns. See
Designing resilient applications for Azure.
Mixed technology stacks. Teams can pick the technology that best fits their service.
Granular scaling. Services can be scaled independently. At the same time, the higher density of services per
VM means that VM resources are fully utilized. Using placement constraints, a service can be matched to a
VM profile (high CPU, high memory, and so on).

Challenges
Complexity. A microservices application has more moving parts than the equivalent monolithic application.
Each service is simpler, but the entire system as a whole is more complex.
Development and test. Developing against service dependencies requires a different approach. Existing
tools are not necessarily designed to work with service dependencies. Refactoring across service boundaries
can be difficult. It is also challenging to test service dependencies, especially when the application is evolving
quickly.
Lack of governance. The decentralized approach to building microservices has advantages, but it can also
lead to problems. You may end up with so many different languages and frameworks that the application
becomes hard to maintain. It may be useful to put some project-wide standards in place, without overly
restricting teams' flexibility. This especially applies to cross-cutting functionality such as logging.
Network congestion and latency. The use of many small, granular services can result in more interservice
communication. Also, if the chain of service dependencies gets too long (service A calls B, which calls C...), the
additional latency can become a problem. You will need to design APIs carefully. Avoid overly chatty APIs,
think about serialization formats, and look for places to use asynchronous communication patterns.
Data integrity. Each microservice is responsible for its own data persistence. As a result, data
consistency can be a challenge. Embrace eventual consistency where possible.
Management. Success with microservices requires a mature DevOps culture. Correlated logging
across services can be challenging. Typically, logging must correlate multiple service calls for a single user
operation.
Versioning. Updates to a service must not break services that depend on it. Multiple services could be
updated at any given time, so without careful design, you might have problems with backward or forward
compatibility.
Skillset. Microservices are highly distributed systems. Carefully evaluate whether the team has the skills and
experience to be successful.
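
One common way to correlate log entries across services is to generate an identifier at the edge and pass it along with every
downstream call. The sketch below assumes that approach. The two services, the in-memory log list, and the record helper are
hypothetical stand-ins for real services and a real logging pipeline.

```python
import uuid

log = []  # stand-in for a centralized log store


def record(service, correlation_id, message):
    log.append(
        {"service": service, "correlation_id": correlation_id, "message": message}
    )


def inventory_service(item, correlation_id):
    record("inventory", correlation_id, f"reserved {item}")


def order_service(item, correlation_id):
    record("orders", correlation_id, f"order received for {item}")
    # The same ID travels with the downstream call.
    inventory_service(item, correlation_id)


def handle_user_request(item):
    """Entry point: create one correlation ID per user operation."""
    correlation_id = str(uuid.uuid4())
    order_service(item, correlation_id)
    return correlation_id


first = handle_user_request("book")
second = handle_user_request("lamp")

# Reconstruct everything that happened for the first user operation.
first_trace = [entry for entry in log if entry["correlation_id"] == first]
```

Filtering on the ID recovers the full chain of service calls for a single user operation, even when entries from many
operations are interleaved in the log.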

Best practices
Model services around the business domain.
Decentralize everything. Individual teams are responsible for designing and building services. Avoid sharing
code or data schemas.
Data storage should be private to the service that owns the data. Use the best storage for each service and
data type.
Services communicate through well-designed APIs. Avoid leaking implementation details. APIs should
model the domain, not the internal implementation of the service.
Avoid coupling between services. Causes of coupling include shared database schemas and rigid
communication protocols.
Offload cross-cutting concerns, such as authentication and SSL termination, to the gateway.
Keep domain knowledge out of the gateway. The gateway should handle and route client requests without
any knowledge of the business rules or domain logic. Otherwise, the gateway becomes a dependency and
can cause coupling between services.
Services should have loose coupling and high functional cohesion. Functions that are likely to change
together should be packaged and deployed together. If they reside in separate services, those services end
up being tightly coupled, because a change in one service will require updating the other service. Overly
chatty communication between two services may be a symptom of tight coupling and low cohesion.
Isolate failures. Use resiliency strategies to prevent failures within a service from cascading. See Resiliency
patterns and Designing resilient applications.
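One resiliency strategy for isolating failures is a circuit breaker, which stops calling a failing dependency for a cooling-off period so that callers fail fast instead of piling up. The Python sketch below is a simplified illustration. The CircuitBreaker class, its thresholds, and the injectable clock are assumptions made for this example and are not part of any Azure SDK.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency unavailable; failing fast")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller wraps each remote call, for example breaker.call(fetch_profile, user_id), where fetch_profile is whatever function performs the request. While the circuit is open, the failing service receives no traffic from this caller, which gives it time to recover.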

Microservices using Azure Container Service


You can use Azure Container Service to configure and provision a Docker cluster. Azure Container Service
supports several popular container orchestrators, including Kubernetes, DC/OS, and Docker Swarm.
Public nodes. These nodes are reachable through a public-facing load balancer. The API gateway is hosted on
these nodes.
Backend nodes. These nodes run services that clients reach via the API gateway. These nodes don't receive
Internet traffic directly. The backend nodes might include more than one pool of VMs, each with a different
hardware profile. For example, you could create separate pools for general compute workloads, high CPU
workloads, and high memory workloads.
Management VMs. These VMs run the master nodes for the container orchestrator.
Networking. The public nodes, backend nodes, and management VMs are placed in separate subnets within the
same virtual network (VNet).
Load balancers. An externally facing load balancer sits in front of the public nodes. It distributes internet requests
to the public nodes. Another load balancer is placed in front of the management VMs, to allow secure shell (ssh)
traffic to the management VMs, using NAT rules.
For reliability and scalability, each service is replicated across multiple VMs. However, because services are also
relatively lightweight (compared with a monolithic application), multiple services are usually packed into a single
VM. Higher density allows better resource utilization. If a particular service doesn't use a lot of resources, you don't
need to dedicate an entire VM to running that service.
The following diagram shows three nodes running four different services (indicated by different shapes). Notice
that each service has at least two instances.

Microservices using Azure Service Fabric


The following diagram shows a microservices architecture using Azure Service Fabric.
The Service Fabric Cluster is deployed to one or more VM scale sets. You might have more than one VM scale set in
the cluster, in order to have a mix of VM types. An API Gateway is placed in front of the Service Fabric cluster, with
an external load balancer to receive client requests.
The Service Fabric runtime performs cluster management, including service placement, node failover, and health
monitoring. The runtime is deployed on the cluster nodes themselves. There isn't a separate set of cluster
management VMs.
Services communicate with each other using the reverse proxy that is built into Service Fabric. Service Fabric
provides a discovery service that can resolve the endpoint for a named service.
CQRS architecture style
8/25/2017

Command and Query Responsibility Segregation (CQRS) is an architecture style that separates read operations
from write operations.

[Figure: CQRS. A client issues commands to the write model and queries to the read model. Update events flow from the write model to the read model's materialized view.]
In traditional architectures, the same data model is used to query and update a database. That's simple and works
well for basic CRUD operations. In more complex applications, however, this approach can become unwieldy. For
example, on the read side, the application may perform many different queries, returning data transfer objects
(DTOs) with different shapes. Object mapping can become complicated. On the write side, the model may
implement complex validation and business logic. As a result, you can end up with an overly complex model that
does too much.
Another potential problem is that read and write workloads are often asymmetrical, with very different
performance and scale requirements.
CQRS addresses these problems by separating reads and writes into separate models, using commands to update
data, and queries to read data.
Commands should be task based, rather than data centric. ("Book hotel room," not "set ReservationStatus to
Reserved.") Commands may be placed on a queue for asynchronous processing, rather than being
processed synchronously.
Queries never modify the database. A query returns a DTO that does not encapsulate any domain
knowledge.
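The distinction between commands and queries can be sketched in Python as follows. This is a minimal illustration, assuming the simplest case in which both models share one in-memory store. The class names (BookHotelRoom, RoomStatusDto, WriteModel, ReadModel) are hypothetical and chosen to match the hotel example above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BookHotelRoom:
    """A task-based command: it names the business action, not the field to change."""
    room_id: str
    guest: str

@dataclass(frozen=True)
class RoomStatusDto:
    """A plain DTO returned by queries; it carries data but no domain logic."""
    room_id: str
    status: str

class WriteModel:
    """Handles commands and enforces the business rules."""
    def __init__(self):
        self._reservations = {}  # room_id -> guest

    def handle(self, command: BookHotelRoom) -> None:
        if command.room_id in self._reservations:
            raise ValueError(f"room {command.room_id} is already booked")
        self._reservations[command.room_id] = command.guest

    def reserved_rooms(self):
        return set(self._reservations)

class ReadModel:
    """Answers queries; it never modifies the underlying data."""
    def __init__(self, write_model: WriteModel):
        self._write_model = write_model

    def room_status(self, room_id: str) -> RoomStatusDto:
        reserved = room_id in self._write_model.reserved_rooms()
        return RoomStatusDto(room_id, "Reserved" if reserved else "Available")
```

The validation rule lives only in the write model, and the query side returns a flat DTO. A command such as BookHotelRoom could equally be placed on a queue and handled later, because it is a self-contained description of the requested task.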
For greater isolation, you can physically separate the read data from the write data. In that case, the read database
can use its own data schema that is optimized for queries. For example, it can store a materialized view of the data,
in order to avoid complex joins or complex O/RM mappings. It might even use a different type of data store. For
example, the write database might be relational, while the read database is a document database.
If separate read and write databases are used, they must be kept in sync. Typically this is accomplished by having
the write model publish an event whenever it updates the database. Updating the database and publishing the
event must occur in a single transaction.
Some implementations of CQRS use the Event Sourcing pattern. With this pattern, application state is stored as a
sequence of events. Each event represents a set of changes to the data. The current state is constructed by replaying
the events. In a CQRS context, one benefit of Event Sourcing is that the same events can be used to notify other
components — in particular, to notify the read model. The read model uses the events to create a snapshot of the
current state, which is more efficient for queries. However, Event Sourcing adds complexity to the design.
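The following Python sketch shows the mechanics of Event Sourcing in a CQRS context: state is rebuilt by replaying events, and the same events keep a read-side snapshot current. It is an illustration under stated assumptions, not a reference implementation. The account-balance domain and all names (Deposited, Withdrawn, EventStore, BalanceReadModel) are invented for the example, and the store is an in-memory list rather than a durable log.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Deposited:
    account: str
    amount: int

@dataclass(frozen=True)
class Withdrawn:
    account: str
    amount: int

class EventStore:
    """Append-only sequence of events that also notifies subscribers."""
    def __init__(self):
        self.events = []
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def append(self, event):
        self.events.append(event)
        for handler in self._subscribers:
            handler(event)

def apply(balance, event):
    """Apply one event to a balance and return the new balance."""
    if isinstance(event, Deposited):
        return balance + event.amount
    if isinstance(event, Withdrawn):
        return balance - event.amount
    return balance

def replay(events, account):
    """Construct the current state of one account by replaying its events in order."""
    balance = 0
    for event in events:
        if event.account == account:
            balance = apply(balance, event)
    return balance

class BalanceReadModel:
    """Snapshot of current balances, updated from the same events."""
    def __init__(self):
        self.balances = {}

    def on_event(self, event):
        current = self.balances.get(event.account, 0)
        self.balances[event.account] = apply(current, event)
```

Queries read the snapshot in BalanceReadModel directly, which avoids replaying the full history on every read. The replay function remains available to rebuild state from scratch.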

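[Figure: CQRS with Event Sourcing. Commands go to the write model, which writes events to the event store. Published events update the read model's materialized view, which serves queries.]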

When to use this architecture


Consider CQRS for collaborative domains where many users access the same data, especially when the read and
write workloads are asymmetrical.
CQRS is not a top-level architecture that applies to an entire system. Apply CQRS only to those subsystems where
there is clear value in separating reads and writes. Otherwise, you are creating additional complexity for no benefit.

Benefits
Independent scaling. CQRS allows the read and write workloads to scale independently, and may result in
less lock contention.
Optimized data schemas. The read side can use a schema that is optimized for queries, while the write side
uses a schema that is optimized for updates.
Security. It's easier to ensure that only the right domain entities are performing writes on the data.
Separation of concerns. Segregating the read and write sides can result in models that are more maintainable
and flexible. Most of the complex business logic goes into the write model. The read model can be relatively
simple.
Simpler queries. By storing a materialized view in the read database, the application can avoid complex joins
when querying.

Challenges
Complexity. The basic idea of CQRS is simple. But it can lead to a more complex application design,
especially if it includes the Event Sourcing pattern.
Messaging. Although CQRS does not require messaging, it's common to use messaging to process
commands and publish update events. In that case, the application must handle message failures or
duplicate messages.
Eventual consistency. If you separate the read and write databases, the read data may be stale.

Best practices
For more information about implementing CQRS, see CQRS Pattern.
Consider using the Event Sourcing pattern to avoid update conflicts.
Consider using the Materialized View pattern for the read model, to optimize the schema for queries.

CQRS in microservices
CQRS can be especially useful in a microservices architecture. One of the principles of microservices is that a
service cannot directly access another service's data store.

In the following diagram, Service A writes to a data store, and Service B keeps a materialized view of the data.
Service A publishes an event whenever it writes to the data store. Service B subscribes to the event.
Event-driven architecture style
6/23/2017

An event-driven architecture consists of event producers that generate a stream of events, and event consumers
that listen for the events.

[Figure: event producers send events to an event ingestion component, which delivers them to multiple event consumers.]

Events are delivered in near real time, so consumers can respond immediately to events as they occur. Producers
are decoupled from consumers — a producer doesn't know which consumers are listening. Consumers are also
decoupled from each other, and every consumer sees all of the events. This differs from a Competing Consumers
pattern, where consumers pull messages from a queue and a message is processed just once (assuming no errors).
In some systems, such as IoT, events must be ingested at very high volumes.
An event-driven architecture can use a pub/sub model or an event stream model.
Pub/sub: The messaging infrastructure keeps track of subscriptions. When an event is published, it sends
the event to each subscriber. After an event is received, it cannot be replayed, and new subscribers do not
see the event.
Event streaming: Events are written to a log. Events are strictly ordered (within a partition) and durable.
Clients don't subscribe to the stream; instead, a client can read from any part of the stream. The client is
responsible for advancing its position in the stream. That means a client can join at any time, and can replay
events.
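The difference between the two models can be shown with a small in-memory Python sketch. It is a toy illustration only. The PubSubBroker, EventLog, and StreamClient classes are hypothetical, model a single partition, and do not correspond to any particular messaging product.

```python
class PubSubBroker:
    """Pushes each event to the subscribers registered at publish time; nothing is retained."""
    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, event):
        for callback in self._subscribers:
            callback(event)

class EventLog:
    """Append-only log (one partition); events are retained in order."""
    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read(self, offset, max_count=10):
        return self._events[offset:offset + max_count]

class StreamClient:
    """Tracks its own position in the log, so it can join late or replay."""
    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset

    def poll(self):
        batch = self.log.read(self.offset)
        self.offset += len(batch)  # the client, not the log, advances the position
        return batch
```

A subscriber added to the PubSubBroker after an event was published never sees that event. A StreamClient created at any later time starts at offset 0 by default and reads the full history, and resetting its offset replays events it has already read.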
On the consumer side, there are some common variations:
Simple event processing. An event immediately triggers an action in the consumer. For example, you
could use Azure Functions with a Service Bus trigger, so that a function executes whenever a message is
published to a Service Bus topic.
Complex event processing. A consumer processes a series of events, looking for patterns in the event
data, using a technology such as Azure Stream Analytics or Apache Storm. For example, you could aggregate
readings from an embedded device over a time window, and generate a notification if the moving average
crosses a certain threshold.
Event stream processing. Use a data streaming platform, such as Azure IoT Hub or Apache Kafka, as a
pipeline to ingest events and feed them to stream processors. The stream processors act to process or
transform the stream. There may be multiple stream processors for different subsystems of the application.
This approach is a good fit for IoT workloads.
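The complex event processing example above (a moving average over a time window with a threshold notification) can be sketched in a few lines of Python. This is a simplified stand-in for what a stream analytics engine does, assuming that readings arrive in timestamp order and that the window length is positive. The MovingAverageAlert class and its parameters are hypothetical.

```python
from collections import deque

class MovingAverageAlert:
    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._readings = deque()  # (timestamp, value) pairs, oldest first
        self._total = 0.0

    def on_event(self, timestamp, value):
        """Process one reading; return True if the moving average exceeds the threshold."""
        self._readings.append((timestamp, value))
        self._total += value
        # Evict readings that have fallen out of the time window.
        while self._readings[0][0] <= timestamp - self.window_seconds:
            _, old_value = self._readings.popleft()
            self._total -= old_value
        average = self._total / len(self._readings)
        return average > self.threshold
```

The consumer keeps state across events, which is what distinguishes this variation from simple event processing, where each event is handled in isolation.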
The source of the events may be external to the system, such as physical devices in an IoT solution. In that case, the
system must be able to ingest the data at the volume and throughput that is required by the data source.
In the logical diagram above, each type of consumer is shown as a single box. In practice, it's common to have
multiple instances of a consumer, to avoid having the consumer become a single point of failure in the system. Multiple
instances might also be necessary to handle the volume and frequency of events. Also, a single consumer might
process events on multiple threads. This can create challenges if events must be processed in order, or require
exactly-once semantics. See Minimize Coordination.

When to use this architecture


Multiple subsystems must process the same events.
Real-time processing with minimum time lag.
Complex event processing, such as pattern matching or aggregation over time windows.
High volume and high velocity of data, such as IoT.

Benefits
Producers and consumers are decoupled.
No point-to-point integrations. It's easy to add new consumers to the system.
Consumers can respond to events immediately as they arrive.
Highly scalable and distributed.
Subsystems have independent views of the event stream.

Challenges
Guaranteed delivery. In some systems, especially in IoT scenarios, it's crucial to guarantee that events are
delivered.
Processing events in order or exactly once. Each consumer type typically runs in multiple instances, for resiliency
and scalability. This can create a challenge if the events must be processed in order (within a consumer type), or
if the processing logic is not idempotent.
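When processing logic is not naturally idempotent, one common mitigation is to record the IDs of events that have already been handled and skip redeliveries. The Python sketch below shows the idea under simplifying assumptions: every event carries a unique ID, and the set of processed IDs is held in memory, whereas a real consumer would need a durable store shared by its instances. The IdempotentConsumer class is hypothetical.

```python
class IdempotentConsumer:
    def __init__(self, handler):
        self._handler = handler
        self._processed_ids = set()  # would be a durable store in practice

    def on_event(self, event_id, payload):
        """Apply the handler at most once per event ID; return True if it ran."""
        if event_id in self._processed_ids:
            return False  # duplicate delivery, safely ignored
        self._handler(payload)
        self._processed_ids.add(event_id)
        return True
```

This sketch does not address ordering, and it does not make the handler's side effect and the ID bookkeeping atomic. Both would need attention in a production design.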

IoT architecture
Event-driven architectures are central to IoT solutions. The following diagram shows a possible logical architecture
for IoT. The diagram emphasizes the event-streaming components of the architecture.

The cloud gateway ingests device events at the cloud boundary, using a reliable, low latency messaging system.
Devices might send events directly to the cloud gateway, or through a field gateway. A field gateway is a
specialized device or software, usually colocated with the devices, that receives events and forwards them to the
cloud gateway. The field gateway might also preprocess the raw device events, performing functions such as
filtering, aggregation, or protocol transformation.
After ingestion, events go through one or more stream processors that can route the data (for example, to
storage) or perform analytics and other processing.
The following are some common types of processing. (This list is certainly not exhaustive.)
Writing event data to cold storage, for archiving or batch analytics.
Hot path analytics, analyzing the event stream in (near) real time, to detect anomalies, recognize patterns
over rolling time windows, or trigger alerts when a specific condition occurs in the stream.
Handling special types of non-telemetry messages from devices, such as notifications and alarms.
Machine learning.
The boxes that are shaded gray show components of an IoT system that are not directly related to event streaming,
but are included here for completeness.
The device registry is a database of the provisioned devices, including the device IDs and usually device
metadata, such as location.
The provisioning API is a common external interface for provisioning and registering new devices.
Some IoT solutions allow command and control messages to be sent to devices.

This section has presented a very high-level view of IoT, and there are many subtleties and challenges to
consider. For a more detailed reference architecture and discussion, see the Microsoft Azure IoT Reference
Architecture (PDF download).
Big data architecture style
6/23/2017

A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or
complex for traditional database systems.

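[Figure: components of a big data architecture. Data sources feed data storage and batch processing, and also real-time message ingestion and stream processing. Both paths load the analytical data store, which serves analytics and reporting. Orchestration spans the pipeline.]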

Big data solutions typically involve one or more of the following types of workload:
Batch processing of big data sources at rest.
Real-time processing of big data in motion.
Interactive exploration of big data.
Predictive analytics and machine learning.
Most big data architectures include some or all of the following components:
Data sources: All big data solutions start with one or more data sources. Examples include:
Application data stores, such as relational databases.
Static files produced by applications, such as web server log files.
Real-time data sources, such as IoT devices.
Data storage: Data for batch processing operations is typically stored in a distributed file store that can hold
high volumes of large files in various formats. This kind of store is often called a data lake. Options for
implementing this storage include Azure Data Lake Store or blob containers in Azure Storage.
Batch processing: Because the data sets are so large, often a big data solution must process data files using
long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually these jobs
involve reading source files, processing them, and writing the output to new files. Options include running
U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom Map/Reduce jobs in an HDInsight
Hadoop cluster, or using Java, Scala, or Python programs in an HDInsight Spark cluster.
Real-time message ingestion: If the solution includes real-time sources, the architecture must include a
way to capture and store real-time messages for stream processing. This might be a simple data store,
where incoming messages are dropped into a folder for processing. However, many solutions need a
message ingestion store to act as a buffer for messages, and to support scale-out processing, reliable
delivery, and other message queuing semantics. Options include Azure Event Hubs, Azure IoT Hub, and
Kafka.
Stream processing: After capturing real-time messages, the solution must process them by filtering,
aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an
output sink. Azure Stream Analytics provides a managed stream processing service based on perpetually
running SQL queries that operate on unbounded streams. You can also use open source Apache streaming
technologies like Storm and Spark Streaming in an HDInsight cluster.
Analytical data store: Many big data solutions prepare data for analysis and then serve the processed data
in a structured format that can be queried using analytical tools. The analytical data store used to serve these
queries can be a Kimball-style relational data warehouse, as seen in most traditional business intelligence
(BI) solutions. Alternatively, the data could be presented through a low-latency NoSQL technology such as
HBase, or an interactive Hive database that provides a metadata abstraction over data files in the distributed
data store. Azure SQL Data Warehouse provides a managed service for large-scale, cloud-based data
warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL, which can also be used to serve
data for analysis.
Analysis and reporting: The goal of most big data solutions is to provide insights into the data through
analysis and reporting. To empower users to analyze the data, the architecture may include a data modeling
layer, such as a multidimensional OLAP cube or tabular data model in Azure Analysis Services. It might also
support self-service BI, using the modeling and visualization technologies in Microsoft Power BI or Microsoft
Excel. Analysis and reporting can also take the form of interactive data exploration by data scientists or data
analysts. For these scenarios, many Azure services support analytical notebooks, such as Jupyter, enabling
these users to leverage their existing skills with Python or R. For large-scale data exploration, you can use
Microsoft R Server, either standalone or with Spark.
Orchestration: Most big data solutions consist of repeated data processing operations, encapsulated in
workflows, that transform source data, move data between multiple sources and sinks, load the processed
data into an analytical data store, or push the results straight to a report or dashboard. To automate these
workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie and Sqoop.
Azure includes many services that can be used in a big data architecture. They fall roughly into two categories:
Managed services, including Azure Data Lake Store, Azure Data Lake Analytics, Azure SQL Data Warehouse, Azure
Stream Analytics, Azure Event Hubs, Azure IoT Hub, and Azure Data Factory.
Open source technologies based on the Apache Hadoop platform, including HDFS, HBase, Hive, Pig, Spark,
Storm, Oozie, Sqoop, and Kafka. These technologies are available on Azure in the Azure HDInsight service.
These options are not mutually exclusive, and many solutions combine open source technologies with Azure
services.

When to use this architecture


Consider this architecture style when you need to:
Store and process data in volumes too large for a traditional database.
Transform unstructured data for analysis and reporting.
Capture, process, and analyze unbounded streams of data in real time, or with low latency.
Use Azure Machine Learning or Microsoft Cognitive Services.

Benefits
Technology choices. You can mix and match Azure managed services and Apache technologies in HDInsight
clusters, to capitalize on existing skills or technology investments.
Performance through parallelism. Big data solutions take advantage of parallelism, enabling high-
performance solutions that scale to large volumes of data.
Elastic scale. All of the components in the big data architecture support scale-out provisioning, so that you can
adjust your solution to small or large workloads, and pay only for the resources that you use.
Interoperability with existing solutions. The components of the big data architecture are also used for IoT
processing and enterprise BI solutions, enabling you to create an integrated solution across data workloads.

Challenges
Complexity. Big data solutions can be extremely complex, with numerous components to handle data ingestion
from multiple data sources. It can be challenging to build, test, and troubleshoot big data processes. Moreover,
there may be a large number of configuration settings across multiple systems that must be used in order to
optimize performance.
Skillset. Many big data technologies are highly specialized, and use frameworks and languages that are not
typical of more general application architectures. On the other hand, big data technologies are evolving new
APIs that build on more established languages. For example, the U-SQL language in Azure Data Lake Analytics is
based on a combination of Transact-SQL and C#. Similarly, SQL-based APIs are available for Hive, HBase, and
Spark.
Technology maturity. Many of the technologies used in big data are evolving. While core Hadoop
technologies such as Hive and Pig have stabilized, emerging technologies such as Spark introduce extensive
changes and enhancements with each new release. Managed services such as Azure Data Lake Analytics and
Azure Data Factory are relatively young, compared with other Azure services, and will likely evolve over time.
Security. Big data solutions usually rely on storing all static data in a centralized data lake. Securing access to
this data can be challenging, especially when the data must be ingested and consumed by multiple applications
and platforms.

Best practices
Leverage parallelism. Most big data processing technologies distribute the workload across multiple
processing units. This requires that static data files are created and stored in a splittable format. Distributed
file systems such as HDFS can optimize read and write performance, and the actual processing is performed
by multiple cluster nodes in parallel, which reduces overall job times.
Partition data. Batch processing usually happens on a recurring schedule — for example, weekly or
monthly. Partition data files, and data structures such as tables, based on temporal periods that match the
processing schedule. That simplifies data ingestion and job scheduling, and makes it easier to troubleshoot
failures. Also, partitioning tables that are used in Hive, U-SQL, or SQL queries can significantly improve
query performance.
Apply schema-on-read semantics. Using a data lake lets you combine storage for files in multiple
formats, whether structured, semi-structured, or unstructured. Use schema-on-read semantics, which project
a schema onto the data when the data is processed, not when the data is stored. This builds flexibility into
the solution, and prevents bottlenecks during data ingestion caused by data validation and type checking.
Process data in-place. Traditional BI solutions often use an extract, transform, and load (ETL) process to
move data into a data warehouse. With larger volumes of data, and a greater variety of formats, big data
solutions generally use variations of ETL, such as transform, extract, and load (TEL). With this approach, the
data is processed within the distributed data store, transforming it to the required structure, before moving
the transformed data into an analytical data store.
Balance utilization and time costs. For batch processing jobs, it's important to consider two factors: the
per-unit cost of the compute nodes, and the per-minute cost of using those nodes to complete the job. For
example, a batch job may take eight hours with four cluster nodes. However, it might turn out that the job
uses all four nodes only during the first two hours, and after that, only two nodes are required. In that case,
running the entire job on two nodes would increase the total job time, but would not double it, so the total
cost would be less. In some business scenarios, a longer processing time may be preferable to the higher
cost of using under-utilized cluster resources.
Separate cluster resources. When deploying HDInsight clusters, you will normally achieve better
performance by provisioning separate cluster resources for each type of workload. For example, although
Spark clusters include Hive, if you need to perform extensive processing with both Hive and Spark, you
should consider deploying separate dedicated Spark and Hadoop clusters. Similarly, if you are using HBase
and Storm for low latency stream processing and Hive for batch processing, consider separate clusters for
Storm, HBase, and Hadoop.
Orchestrate data ingestion. In some cases, existing business applications may write data files for batch
processing directly into Azure storage blob containers, where they can be consumed by HDInsight or Azure
Data Lake Analytics. However, you will often need to orchestrate the ingestion of data from on-premises or
external data sources into the data lake. Use an orchestration workflow or pipeline, such as those supported
by Azure Data Factory or Oozie, to achieve this in a predictable and centrally manageable fashion.
Scrub sensitive data early. The data ingestion workflow should scrub sensitive data early in the process, to
avoid storing it in the data lake.
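The trade-off described under "Balance utilization and time costs" can be worked through numerically. The short Python calculation below uses the figures from that example (an eight-hour job on four nodes that needs all four nodes only for the first two hours) and adds one idealizing assumption that is not in the text: the four-node phase scales linearly, so it takes twice as long on two nodes. Cost is expressed in node-hours rather than currency.

```python
def node_hours(phases):
    """Sum nodes * hours over a list of (nodes, hours) phases."""
    return sum(nodes * hours for nodes, hours in phases)

# Four-node cluster: the job runs 8 hours and all four nodes are billed throughout.
four_node_elapsed = 8
four_node_billed = node_hours([(4, 8)])

# Work the job actually needs: 2 hours on four nodes, then 6 hours on two nodes.
work_needed = node_hours([(4, 2), (2, 6)])

# Two-node cluster, assuming the first phase scales linearly (an idealization).
two_node_elapsed = work_needed / 2
two_node_billed = 2 * two_node_elapsed

print(f"4 nodes: {four_node_elapsed} h elapsed, {four_node_billed} node-hours billed")
print(f"2 nodes: {two_node_elapsed} h elapsed, {two_node_billed} node-hours billed")
```

Under these assumptions, the two-node run takes 10 hours instead of 8 but is billed for 20 node-hours instead of 32. Real jobs rarely scale perfectly, so measured utilization should replace the idealized figures.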
Big compute architecture style
6/23/2017

The term big compute describes large-scale workloads that require a large number of cores, often numbering in
the hundreds or thousands. Scenarios include image rendering, fluid dynamics, financial risk modeling, oil
exploration, drug design, and engineering stress analysis, among others.

Here are some typical characteristics of big compute applications:


The work can be split into discrete tasks, which can be run across many cores simultaneously.
Each task is finite. It takes some input, does some processing, and produces output. The entire application runs
for a finite amount of time (minutes to days). A common pattern is to provision a large number of cores in a
burst, and then spin down to zero once the application completes.
The application does not need to stay up 24/7. However, the system must handle node failures or application
crashes.
For some applications, tasks are independent and can run in parallel. In other cases, tasks are tightly coupled,
meaning they must interact or exchange intermediate results. In that case, consider using high-speed
networking technologies such as InfiniBand and remote direct memory access (RDMA).
Depending on your workload, you might use compute-intensive VM sizes (H16r, H16mr, and A9).

When to use this architecture


Computationally intensive operations such as simulation and number crunching.
Simulations that are computationally intensive and must be split across CPUs in multiple computers (10-1000s).
Simulations that require too much memory for one computer, and must be split across multiple computers.
Long-running computations that would take too long to complete on a single computer.
Smaller computations that must be run 100s or 1000s of times, such as Monte Carlo simulations.
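A Monte Carlo simulation illustrates the pattern of many small, independent tasks whose results are combined at the end. The Python sketch below estimates pi by splitting the sampling into tasks and fanning them out to a worker pool. It is a local illustration only: the thread pool stands in for the separate cores or VMs that a big compute platform would supply, and Python threads do not actually speed up CPU-bound work. The function names and the per-task seeding scheme are choices made for this example.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def count_hits(seed, samples):
    """One independent task: count random points that land inside the unit quarter circle."""
    rng = random.Random(seed)  # per-task generator, so tasks share no state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(tasks, samples_per_task, workers=4):
    """Fan the tasks out to a pool of workers, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_hits = pool.map(count_hits, range(tasks), [samples_per_task] * tasks)
        total_hits = sum(partial_hits)
    return 4.0 * total_hits / (tasks * samples_per_task)
```

Each task takes input (a seed and a sample count), runs for a finite time, and produces output, so a failed task can simply be rerun. Because tasks never communicate, adding workers does not introduce coordination overhead.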

Benefits
High performance with "embarrassingly parallel" processing.
Can harness hundreds or thousands of computer cores to solve large problems faster.
Access to specialized high-performance hardware, with dedicated high-speed InfiniBand networks.
You can provision VMs as needed to do work, and then tear them down.

Challenges
Managing the VM infrastructure.
Managing the volume of number crunching.
Provisioning thousands of cores in a timely manner.
For tightly coupled tasks, adding more cores can have diminishing returns. You may need to experiment to find
the optimum number of cores.

Big compute using Azure Batch


Azure Batch is a managed service for running large-scale high-performance computing (HPC) applications.
Using Azure Batch, you configure a VM pool, and upload the applications and data files. Then the Batch service
provisions the VMs, assigns tasks to the VMs, runs the tasks, and monitors the progress. Batch can automatically
scale out the VMs in response to the workload. Batch also provides job scheduling.

Big compute running on Virtual Machines


You can use Microsoft HPC Pack to administer a cluster of VMs, and schedule and monitor HPC jobs. With this
approach, you must provision and manage the VMs and network infrastructure. Consider this approach if you have
existing HPC workloads and want to move some or all of them to Azure. You can move the entire HPC cluster to Azure, or
keep your HPC cluster on-premises but use Azure for burst capacity. For more information, see Batch and HPC
solutions for large-scale computing workloads.
HPC Pack deployed to Azure
In this scenario, the HPC cluster is created entirely within Azure.
The head node provides management and job scheduling services to the cluster. For tightly coupled tasks, use an
RDMA network that provides very high bandwidth, low latency communication between VMs. For more
information see Deploy an HPC Pack 2016 cluster in Azure.
Burst an HPC cluster to Azure
In this scenario, an organization is running HPC Pack on-premises, and uses Azure VMs for burst capacity. The
cluster head node is on-premises. ExpressRoute or VPN Gateway connects the on-premises network to the Azure
VNet.
When designing a solution for Azure, there are two technology choices that you should make early in the design process, because
they affect the entire architecture. These are the choice of compute and data store technologies.
Compute refers to the hosting model for the computing resources that your application runs on.
Data stores include databases that hold application or business data, but also include data stores used for caches, IoT data,
telemetry, unstructured log data, and anything else that an application might persist to storage.
This section of the Application Architecture Guide contains the following topics:
Compute options overview introduces some general considerations for choosing a compute service in Azure.
Criteria for choosing a compute option compares specific Azure compute services across several axes, including hosting model,
DevOps, availability, and scalability.
Choose the right data store describes the major categories of data store technologies, including RDBMS, key-value store,
document database, graph database, and others.
Comparison criteria for choosing a data store describes some of the factors to consider when choosing a data store.
Overview of Azure compute options
6/23/2017

The term compute refers to the hosting model for the computing resources that your application runs on.
At one end of the spectrum is Infrastructure-as-a-Service (IaaS). With IaaS, you provision the VMs that you need,
along with associated network and storage components. Then you deploy whatever software and applications you
want onto those VMs. This model is the closest to a traditional on-premises environment, except that Microsoft
manages the infrastructure. You still manage the individual VMs.
Platform-as-a-Service (PaaS) provides a managed hosting environment, where you can deploy your application
without needing to manage VMs or networking resources. For example, instead of creating individual VMs, you
specify an instance count, and the service will provision, configure, and manage the necessary resources. Azure App
Service is an example of a PaaS service.
There is a spectrum from IaaS to pure PaaS. For example, Azure VMs can auto-scale by using VM Scale Sets. This
automatic scaling capability isn't strictly PaaS, but it's the type of management feature that might be found in a
PaaS service.
Functions-as-a-Service (FaaS) goes even further in removing the need to worry about the hosting environment.
Instead of creating compute instances and deploying code to those instances, you simply deploy your code, and the
service automatically runs it. You don’t need to administer the compute resources. These services make use of
serverless architecture, and seamlessly scale up or down to whatever level is necessary to handle the traffic. Azure
Functions is a FaaS service.
IaaS gives the most control, flexibility, and portability. FaaS provides simplicity, elastic scale, and potential cost
savings, because you pay only for the time your code is running. PaaS falls somewhere between the two. In general,
the more flexibility a service provides, the more you are responsible for configuring and managing the resources.
FaaS services automatically manage nearly all aspects of running an application, while IaaS solutions require you to
provision, configure and manage the VMs and network components you create.
Here are the main compute options currently available in Azure:
Virtual Machines are an IaaS service, allowing you to deploy and manage VMs inside a virtual network (VNet).
App Service is a managed service for hosting web apps, mobile app back ends, RESTful APIs, or automated
business processes.
Service Fabric is a distributed systems platform that can run in many environments, including Azure or on
premises. Service Fabric is an orchestrator of microservices across a cluster of machines.
Azure Container Service lets you create, configure, and manage a cluster of VMs that are preconfigured to run
containerized applications.
Azure Functions is a managed FaaS service.
Azure Batch is a managed service for running large-scale parallel and high-performance computing (HPC)
applications.
Cloud Services is a managed service for running cloud applications. It uses a PaaS hosting model.
When selecting a compute option, here are some factors to consider:
Hosting model. How is the service hosted? What requirements and limitations are imposed by this hosting
environment?
DevOps. Is there built-in support for application upgrades? What is the deployment model?
Scalability. How does the service handle adding or removing instances? Can it auto-scale based on load and
other metrics?
Availability. What is the service SLA?
Cost. In addition to the cost of the service itself, consider the operations cost for managing a solution built on
that service. For example, IaaS solutions might have a higher operations cost.
What are the overall limitations of each service?
What kind of application architectures are appropriate for this service?
For a more detailed comparison of compute options in Azure, see Criteria for choosing an Azure compute option.
The term compute refers to the hosting model for the computing resources that your application runs on. The following tables
compare Azure compute services across several axes. Refer to these tables when selecting a compute option for your application.

Hosting model
| Criteria | Virtual Machines | App Service | Service Fabric | Azure Functions | Azure Container Service | Cloud Services | Azure Batch |
|---|---|---|---|---|---|---|---|
| Application composition | Agnostic | Applications | Services, guest executables | Functions | Containers | Roles | Scheduled jobs |
| Density | Agnostic | Multiple apps per instance via app plans | Multiple services per VM | No dedicated instances (note 1) | Multiple containers per VM | One role instance per VM | Multiple apps per VM |
| Minimum number of nodes | 1 (note 2) | 1 | 5 (note 3) | No dedicated nodes (note 1) | 3 | 2 | 1 (note 4) |
| State management | Stateless or stateful | Stateless | Stateless or stateful | Stateless | Stateless or stateful | Stateless | Stateless |
| Web hosting | Agnostic | Built in | Self-host, IIS in containers | Not applicable | Agnostic | Built-in (IIS) | No |
| OS | Windows, Linux | Windows, Linux (preview) | Windows, Linux (preview) | Not applicable | Windows, Linux | Windows | Windows, Linux |
| Can be deployed to dedicated VNet? | Supported | Supported (note 5) | Supported | Not supported | Supported | Supported (note 6) | Supported |
| Hybrid connectivity | Supported | Supported (note 7) | Supported | Not supported | Supported | Supported (note 8) | Supported |

Notes
1. If using Consumption plan. If using App Service plan, functions run on the VMs allocated for your App Service plan. See Choose
the correct service plan for Azure Functions.
2. Higher SLA with two or more instances.
3. For production environments.
4. Can scale down to zero after job completes.
5. Requires App Service Environment (ASE).
6. Classic VNet only.
7. Requires ASE or BizTalk Hybrid Connections.
8. Classic VNet, or Resource Manager VNet via VNet peering.

DevOps
| Criteria | Virtual Machines | App Service | Service Fabric | Azure Functions | Azure Container Service | Cloud Services | Azure Batch |
|---|---|---|---|---|---|---|---|
| Local debugging | Agnostic | IIS Express, others (note 1) | Local node cluster | Azure Functions CLI | Local container runtime | Local emulator | Not supported |
| Programming model | Agnostic | Web application, WebJobs for background tasks | Guest executable, Service model, Actor model, Containers | Functions with triggers | Agnostic | Web role, worker role | Command line application |
| Resource Manager | Supported | Supported | Supported | Supported | Supported | Limited (note 2) | Supported |
| Application update | No built-in support | Deployment slots | Rolling upgrade (per service) | No built-in support | Depends on orchestrator. Most support rolling updates | VIP swap or rolling update | Not applicable |

Notes
1. Options include IIS Express for ASP.NET or node.js (iisnode); PHP web server; Azure Toolkit for IntelliJ, Azure Toolkit for Eclipse.
App Service also supports remote debugging of deployed web app.
2. See Resource Manager providers, regions, API versions and schemas.

Scalability
| Criteria | Virtual Machines | App Service | Service Fabric | Azure Functions | Azure Container Service | Cloud Services | Azure Batch |
|---|---|---|---|---|---|---|---|
| Auto-scaling | VM Scale Sets | Built-in service | VM Scale Sets | Built-in service | Not supported | Built-in service | N/A |
| Load balancer | Azure Load Balancer | Integrated | Azure Load Balancer | Integrated | Azure Load Balancer | Integrated | Azure Load Balancer |
| Scale limit | Platform image: 1000 nodes per VMSS. Custom image: 100 nodes per VMSS | 20 instances, 50 with App Service Environment | 100 nodes per VMSS | Infinite (note 1) | 100 | No defined limit, 200 maximum recommended | 20 core limit by default. Contact customer service for increase. |

Notes
1. If using Consumption plan. If using App Service plan, the App Service scale limits apply. See Choose the correct service plan for
Azure Functions.

Availability
| Criteria | Virtual Machines | App Service | Service Fabric | Azure Functions | Azure Container Service | Cloud Services | Azure Batch |
|---|---|---|---|---|---|---|---|
| SLA | SLA for Virtual Machines | SLA for App Service | SLA for Service Fabric | SLA for Functions | SLA for Azure Container Service | SLA for Cloud Services | SLA for Azure Batch |
| Multi region failover | Traffic manager | Traffic manager | Traffic manager, Multi-Region Cluster | Not supported | Traffic manager | Traffic manager | Not supported |

Security
| Criteria | Virtual Machines | App Service | Service Fabric | Azure Functions | Azure Container Service | Cloud Services | Azure Batch |
|---|---|---|---|---|---|---|---|
| SSL | Configured in VM | Supported | Supported | Supported | Configured in VM | Supported | Supported |
| RBAC | Supported | Supported | Supported | Supported | Supported | Not supported | Supported |

Other
| Criteria | Virtual Machines | App Service | Service Fabric | Azure Functions | Azure Container Service | Cloud Services | Azure Batch |
|---|---|---|---|---|---|---|---|
| Cost | Windows, Linux | App Service pricing | Service Fabric pricing | Azure Functions pricing | Azure Container Service pricing | Cloud Services pricing | Azure Batch pricing |
| Suitable architecture styles | N-Tier, Big compute (HPC) | Web-Queue-Worker | Microservices, Event driven architecture (EDA) | Microservices, EDA | Microservices, EDA | Web-Queue-Worker | Big Compute |
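One way to use the comparison tables above is to encode a few rows as data and filter on them. The following is a minimal sketch, not an official selection tool: the `ComputeOption` type and the `shortlist` function are hypothetical names, only three rows are encoded (state management, VNet deployment, and suitable architecture styles), and footnoted restrictions such as "Requires App Service Environment" are ignored for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeOption:
    name: str
    stateful: bool   # "State management" row: can the service host stateful workloads?
    vnet: bool       # "Can be deployed to dedicated VNet?" row (footnotes ignored)
    styles: tuple    # "Suitable architecture styles" row


# Values transcribed from the tables above; this is an illustrative subset only.
OPTIONS = [
    ComputeOption("Virtual Machines", True, True, ("N-Tier", "Big Compute")),
    ComputeOption("App Service", False, True, ("Web-Queue-Worker",)),
    ComputeOption("Service Fabric", True, True, ("Microservices", "EDA")),
    ComputeOption("Azure Functions", False, False, ("Microservices", "EDA")),
    ComputeOption("Azure Container Service", True, True, ("Microservices", "EDA")),
    ComputeOption("Cloud Services", False, True, ("Web-Queue-Worker",)),
    ComputeOption("Azure Batch", False, True, ("Big Compute",)),
]


def shortlist(style, needs_state=False, needs_vnet=False):
    """Return the names of options that fit the style and the hard requirements."""
    return [
        option.name
        for option in OPTIONS
        if style in option.styles
        and (option.stateful or not needs_state)
        and (option.vnet or not needs_vnet)
    ]
```

For example, `shortlist("Microservices", needs_state=True)` narrows the three microservices-capable services down to the two that can host stateful workloads. A real evaluation would also weigh the DevOps, scalability, and cost rows.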
Choose the right data store
6/23/2017

Modern business systems manage increasingly large volumes of data. Data may be ingested from external services,
generated by the system itself, or created by users. These data sets may have extremely varied characteristics and
processing requirements. Businesses use data to assess trends, trigger business processes, audit their operations,
analyze customer behavior, and many other things.
This heterogeneity means that a single data store is usually not the best approach. Instead, it's often better to store
different types of data in different data stores, each focused towards a specific workload or usage pattern. The term
polyglot persistence is used to describe solutions that use a mix of data store technologies.
Selecting the right data store for your requirements is a key design decision. There are literally hundreds of
implementations to choose from among SQL and NoSQL databases. Data stores are often categorized by how they
structure data and the types of operations they support. This article describes several of the most common storage
models. Note that a particular data store technology may support multiple storage models. For example, a
relational database management system (RDBMS) may also support key/value or graph storage. In fact, there is a
general trend for so-called multimodel support, where a single database system supports several models. But it's
still useful to understand the different models at a high level.
Not all data stores in a given category provide the same feature-set. Most data stores provide server-side
functionality to query and process data. Sometimes this functionality is built into the data storage engine. In other
cases, the data storage and processing capabilities are separated, and there may be several options for processing
and analysis. Data stores also support different programmatic and management interfaces.
Generally, you should start by considering which storage model is best suited for your requirements. Then consider
a particular data store within that category, based on factors such as feature set, cost, and ease of management.

Relational database management systems


Relational databases organize data as a series of two-dimensional tables with rows and columns. Each table has its
own columns, and every row in a table has the same set of columns. This model is mathematically based, and most
vendors provide a dialect of the Structured Query Language (SQL) for retrieving and managing data. An RDBMS
typically implements a transactionally consistent mechanism that conforms to the ACID (Atomic, Consistent,
Isolated, Durable) model for updating information.
An RDBMS typically supports a schema-on-write model, where the data structure is defined ahead of time, and all
read or write operations must use the schema. This is in contrast to most NoSQL data stores, particularly key/value
types, where the schema-on-read model assumes that the client will be imposing its own interpretive schema on
data coming out of the database, and is agnostic to the data format being written.
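The difference between the two schema models can be sketched in a few lines. This example is illustrative only: it uses Python's built-in `sqlite3` module as a stand-in for an RDBMS and a plain dictionary as a stand-in for a key/value store, and the table, keys, and field names are invented for the example.

```python
import json
import sqlite3

# Schema-on-write: the table definition is fixed up front, and the
# database rejects writes that do not fit it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO customers (id, name) VALUES (?, ?)", (1, "Ana"))


def try_insert_without_name(connection):
    """Return True if the schema rejects a row that is missing a required column."""
    try:
        connection.execute("INSERT INTO customers (id) VALUES (?)", (2,))
    except sqlite3.IntegrityError:
        return True
    return False


rejected = try_insert_without_name(conn)

# Schema-on-read: the store keeps opaque bytes and accepts anything.
# The client imposes its own interpretation when it reads a value back.
kv_store = {}
kv_store["customer:1"] = json.dumps({"name": "Ana"}).encode("utf-8")
kv_store["customer:2"] = json.dumps({"full_name": "Ben", "tier": 3}).encode("utf-8")


def read_customer_name(store, key):
    """Client-side schema: tolerate either of two field names."""
    record = json.loads(store[key].decode("utf-8"))
    return record.get("name") or record.get("full_name")
```

In the first half, the store enforces the structure. In the second half, the store never looks inside the value, so every reader has to cope with whatever shapes were written.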
An RDBMS is very useful when strong consistency guarantees are important — where all changes are atomic, and
transactions always leave the data in a consistent state. However, the underlying structures do not lend themselves
to scaling out by distributing storage and processing across machines. Also, information stored in an RDBMS must
be put into a relational structure by following the normalization process. While this process is well understood, it
can lead to inefficiencies, because of the need to disassemble logical entities into rows in separate tables, and then
reassemble the data when running queries.
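Both points, atomic transactions and the cost of normalization, can be seen in a small sketch. It assumes Python's built-in `sqlite3` module as the relational engine; the `orders` and `order_lines` tables and the helper functions are hypothetical. One logical order is split across two tables on write and reassembled with a join on read, and a failed write leaves no partial order behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL);
    CREATE TABLE order_lines (
        order_id INTEGER NOT NULL REFERENCES orders(order_id),
        product  TEXT NOT NULL,
        quantity INTEGER NOT NULL CHECK (quantity > 0)
    );
""")


def save_order(connection, order_id, customer, lines):
    """Write one logical order as rows in two tables, as a single transaction."""
    try:
        with connection:  # commits on success, rolls back if an exception escapes
            connection.execute("INSERT INTO orders VALUES (?, ?)", (order_id, customer))
            connection.executemany(
                "INSERT INTO order_lines VALUES (?, ?, ?)",
                [(order_id, product, quantity) for product, quantity in lines],
            )
        return True
    except sqlite3.IntegrityError:
        return False


def load_order(connection, order_id):
    """Reassemble the logical order from both tables with a join."""
    return connection.execute(
        "SELECT o.customer, l.product, l.quantity "
        "FROM orders o JOIN order_lines l ON l.order_id = o.order_id "
        "WHERE o.order_id = ? ORDER BY l.product",
        (order_id,),
    ).fetchall()


ok = save_order(conn, 1, "Ana", [("bolt", 10), ("nut", 20)])
# The second line violates the CHECK constraint, so the whole order is rolled back.
failed = save_order(conn, 2, "Ben", [("gear", 1), ("washer", 0)])
```

After the failed call, neither the `orders` row nor the valid `gear` line for order 2 exists, which is the consistency guarantee the text describes.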
Relevant Azure services:
Azure SQL Database
Azure Database for MySQL
Azure Database for PostgreSQL

Key/value stores
A key/value store is essentially a large hash table. You associate each data value with a unique key, and the
key/value store uses this key to store the data by using an appropriate hashing function. The hashing function is
selected to provide an even distribution of hashed keys across the data storage.
Most key/value stores only support simple query, insert, and delete operations. To modify a value (either partially
or completely), an application must overwrite the existing data for the entire value. In most implementations,
reading or writing a single value is an atomic operation. If the value is large, writing may take some time.
An application can store arbitrary data as a set of values, although some key/value stores impose limits on the
maximum size of values. The stored values are opaque to the storage system software. Any schema information
must be provided and interpreted by the application. Essentially, values are blobs and the key/value store simply
retrieves or stores the value by key.

Key/value stores are highly optimized for applications performing simple lookups, but are less suitable for systems
that need to query data across different key/value stores. Key/value stores are also not optimized for scenarios
where querying by value is important, rather than performing lookups based only on keys. For example, with a
relational database, you can find a record by using a WHERE clause, but key/value stores usually do not have this
type of lookup capability for values.
A single key/value store can be extremely scalable, as the data store can easily distribute data across multiple nodes
on separate machines.
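The behavior described in this section can be illustrated with a toy in-memory store. This is a sketch under stated assumptions, not how any particular product works: the `PartitionedKeyValueStore` class is hypothetical, "nodes" are plain dictionaries, and SHA-256 stands in for whatever hashing function a real store would use.

```python
import hashlib


class PartitionedKeyValueStore:
    """Toy key/value store that spreads keys over several in-memory 'nodes'."""

    def __init__(self, node_count=4):
        self._nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key):
        # A stable hash of the key (not Python's per-process hash()) picks the node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self._nodes[int.from_bytes(digest[:8], "big") % len(self._nodes)]

    def put(self, key, value: bytes):
        # The value is opaque bytes; a write always replaces the whole value.
        self._node_for(key)[key] = value

    def get(self, key):
        return self._node_for(key).get(key)

    def delete(self, key):
        self._node_for(key).pop(key, None)

    def node_sizes(self):
        return [len(node) for node in self._nodes]


store = PartitionedKeyValueStore(node_count=4)
for i in range(1000):
    store.put(f"session:{i}", b"payload")
store.put("session:7", b"replaced")  # overwrite of the entire value, not a partial update
```

Note that the only operations are lookup, insert, and delete by key. Finding every value that contains a given field would require reading all values on all nodes, which is why querying by value is a poor fit for this model.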
Relevant Azure services:
Cosmos DB
Azure Redis Cache

Document databases
A document database is conceptually similar to a key/value store, except that it stores a collection of named fields
and data (known as documents), each of which could be simple scalar items or compound elements such as lists
and child collections. The data in the fields of a document can be encoded in a variety of ways, including XML,
YAML, JSON, BSON, or even stored as plain text. Unlike key/value stores, the fields in documents are exposed to the
storage management system, enabling an application to query and filter data by using the values in these fields.
Typically, a document contains the entire data for an entity. What items constitute an entity are application specific.
For example, an entity could contain the details of a customer, an order, or a combination of both. A single
document may contain information that would be spread across several relational tables in an RDBMS.
A document store does not require that all documents have the same structure. This free-form approach provides a
great deal of flexibility. Applications can store different data in documents as business requirements change.
The application can retrieve documents by using the document key. This is a unique identifier for the document,
which is often hashed, to help distribute data evenly. Some document databases create the document key
automatically. Others enable you to specify an attribute of the document to use as the key. The application can also
query documents based on the value of one or more fields. Some document databases support indexing to
facilitate fast lookup of documents based on one or more indexed fields.
Many document databases support in-place updates, enabling an application to modify the values of specific fields
in a document without rewriting the entire document. Read and write operations over multiple fields in a single
document are usually atomic.
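A minimal sketch of these ideas follows. The `DocumentCollection` class is a hypothetical in-memory stand-in, not the API of Cosmos DB or any other product. It shows documents with different structures living side by side, lookup by key, querying by field value, and an in-place update of one field.

```python
import copy
import uuid


class DocumentCollection:
    """Toy in-memory document collection keyed by document ID."""

    def __init__(self):
        self._docs = {}

    def insert(self, document, key=None):
        # Generate a key when the caller does not supply one.
        key = key or str(uuid.uuid4())
        self._docs[key] = copy.deepcopy(document)
        return key

    def get(self, key):
        return self._docs.get(key)

    def find(self, **criteria):
        # Field values are visible to the store, so it can filter on them.
        return [
            doc for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]

    def update_fields(self, key, **changes):
        # In-place update of selected fields; the rest of the document is untouched.
        self._docs[key].update(changes)


customers = DocumentCollection()
# The two documents have different fields; no shared schema is required.
customers.insert({"name": "Ana", "city": "Lisbon", "orders": [{"sku": "A1", "qty": 2}]}, key="c1")
customers.insert({"name": "Ben", "city": "Lisbon", "loyalty_tier": "gold"}, key="c2")
customers.update_fields("c1", city="Porto")
```

The first document embeds its orders as a child collection, which is the kind of data that would be spread across several tables in an RDBMS.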
Relevant Azure service: Cosmos DB

Graph databases
A graph database stores two types of information, nodes and edges. You can think of nodes as entities. Edges
specify the relationships between nodes. Both nodes and edges can have properties that provide information
about that node or edge, similar to columns in a table. Edges can also have a direction indicating the nature of the
relationship.
The purpose of a graph database is to allow an application to efficiently perform queries that traverse the network
of nodes and edges, and to analyze the relationships between entities. The following diagram shows an organization's
personnel database structured as a graph. The entities are employees and departments, and the edges indicate
reporting relationships and the department in which employees work. In this graph, the arrows on the edges show
the direction of the relationships.
This structure makes it straightforward to perform queries such as "Find all employees who report directly or
indirectly to Sarah" or "Who works in the same department as John?" For large graphs with lots of entities and
relationships, you can perform very complex analyses very quickly. Many graph databases provide a query
language that you can use to traverse a network of relationships efficiently.
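The two example queries can be sketched as traversals over a list of directed edges. The data below is invented for illustration (only the names Sarah and John come from the text), and a real graph database would use its own query language and index structures rather than scanning a Python list.

```python
from collections import deque

# Directed edges as (source, relationship, target) triples.
edges = [
    ("John", "REPORTS_TO", "Sarah"),
    ("Alice", "REPORTS_TO", "John"),
    ("Walter", "REPORTS_TO", "Alice"),
    ("Mia", "REPORTS_TO", "Sarah"),
    ("John", "WORKS_IN", "Sales"),
    ("Alice", "WORKS_IN", "Sales"),
    ("Mia", "WORKS_IN", "Marketing"),
]


def all_reports(manager):
    """Everyone who reports directly or indirectly to manager (breadth-first)."""
    direct = {}
    for source, relationship, target in edges:
        if relationship == "REPORTS_TO":
            direct.setdefault(target, []).append(source)
    found, seen, queue = [], {manager}, deque([manager])
    while queue:
        for employee in direct.get(queue.popleft(), []):
            if employee not in seen:  # guards against cycles in the data
                seen.add(employee)
                found.append(employee)
                queue.append(employee)
    return found


def same_department(person):
    """Other people with a WORKS_IN edge to any of person's departments."""
    departments = {t for s, r, t in edges if s == person and r == "WORKS_IN"}
    return sorted(
        s for s, r, t in edges
        if r == "WORKS_IN" and t in departments and s != person
    )
```

In a relational model, the "directly or indirectly" query would need a recursive join of unknown depth. Here it is a plain traversal that follows edges until none are left.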
Relevant Azure service: Cosmos DB

Column-family databases
A column-family database organizes data into rows and columns. In its simplest form, a column-family database
can appear very similar to a relational database, at least conceptually. The real power of a column-family database
lies in its denormalized approach to structuring sparse data.
You can think of a column-family database as holding tabular data with rows and columns, but the columns are
divided into groups known as column families. Each column family holds a set of columns that are logically related
together and are typically retrieved or manipulated as a unit. Other data that is accessed separately can be stored in
separate column families. Within a column family, new columns can be added dynamically, and rows can be sparse
(that is, a row doesn't need to have a value for every column).
The following diagram shows an example with two column families, Identity and Contact Info . The data for a single
entity has the same row key in each column-family. This structure, where the rows for any given object in a column
family can vary dynamically, is an important benefit of the column-family approach, making this form of data store
highly suited for storing structured, volatile data.

Unlike a key/value store or a document database, most column-family databases store data in key order, rather
than by computing a hash. Many implementations allow you to create indexes over specific columns in a column-
family. Indexes let you retrieve data by column value, rather than row key.
Read and write operations for a row are usually atomic within a single column-family, although some
implementations provide atomicity across the entire row, spanning multiple column-families.
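The layout described above can be sketched as a mapping from row key to column family to columns. The `ColumnFamilyTable` class is hypothetical and much simpler than HBase or similar systems. It assumes two families named after the example in the text, lets each row carry a different set of columns, and keeps row keys sorted so that range scans are cheap.

```python
import bisect


class ColumnFamilyTable:
    """Toy table: row key -> column family -> {column: value}, kept in key order."""

    def __init__(self, families):
        self._families = tuple(families)
        self._keys = []   # row keys, always sorted
        self._rows = {}   # row key -> {family: {column: value}}

    def put(self, row_key, family, column, value):
        if family not in self._families:
            raise KeyError(f"unknown column family: {family}")
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {name: {} for name in self._families}
        # Columns are created on first write, so rows can be sparse.
        self._rows[row_key][family][column] = value

    def get_family(self, row_key, family):
        """Read one column family of one row as a unit."""
        return dict(self._rows[row_key][family])

    def scan(self, start_key, end_key):
        """Yield (row_key, row) for start_key <= row_key < end_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, end_key)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


table = ColumnFamilyTable(["identity", "contact_info"])
table.put("003", "identity", "first_name", "Ana")
table.put("001", "identity", "first_name", "Ben")
table.put("001", "contact_info", "email", "ben@example.com")
table.put("002", "identity", "first_name", "Cy")
table.put("002", "identity", "suffix", "Jr.")  # a column only this row has
```

Rows were written out of order but come back from `scan` in key order, and row "003" has an empty `contact_info` family without any placeholder values being stored.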
Relevant Azure service: HBase in HDInsight

Data analytics
Data analytics stores provide massively parallel solutions for ingesting, storing, and analyzing data. This data is
distributed across multiple servers using a share-nothing architecture to maximize scalability and minimize
dependencies. The data is unlikely to be static, so these stores must be able to handle large quantities of
information, arriving in a variety of formats from multiple streams, while continuing to process new queries.
Relevant Azure services:
SQL Data Warehouse
Azure Data Lake

Search engine databases


A search engine database supports the ability to search for information held in external data stores and services. A
search engine database can be used to index massive volumes of data and provide near real-time access to these
indexes. Although search engine databases are commonly thought of as being synonymous with the web, many
large-scale systems use them to provide structured and ad-hoc search capabilities on top of their own databases.
The key characteristics of a search engine database are the ability to store and index information very quickly, and
provide fast response times for search requests. Indexes can be multi-dimensional and may support free-text
searches across large volumes of text data. Indexing can be performed using a pull model, triggered by the search
engine database, or using a push model, initiated by external application code.
Searching can be exact or fuzzy. A fuzzy search finds documents that match a set of terms and calculates how
closely they match. Some search engines also support linguistic analysis that can return matches based on
synonyms, genre expansions (for example, matching dogs to pets ), and stemming (matching words with the same
root).
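The core data structure behind this kind of search is an inverted index, which maps each term to the documents that contain it. The sketch below is a simplified illustration: the `InvertedIndex` class is hypothetical, documents are pushed in by application code, and fuzzy matching is approximated with `difflib.get_close_matches` from the Python standard library rather than the linguistic analysis a real search engine would apply.

```python
import difflib
import re
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: term -> set of IDs of documents containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _terms(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, text):
        # Push model: application code hands documents to the index.
        for term in self._terms(text):
            self._postings[term].add(doc_id)

    def search_exact(self, query):
        """IDs of documents that contain every query term exactly."""
        terms = self._terms(query)
        if not terms:
            return set()
        return set.intersection(*(self._postings.get(t, set()) for t in terms))

    def search_fuzzy(self, query, cutoff=0.8):
        """IDs of documents that contain a term similar to any query term."""
        hits = set()
        vocabulary = list(self._postings)
        for term in self._terms(query):
            for match in difflib.get_close_matches(term, vocabulary, n=5, cutoff=cutoff):
                hits |= self._postings[match]
        return hits


index = InvertedIndex()
index.add(1, "Graph database basics")
index.add(2, "Time series storage")
index.add(3, "Database indexing tips")
```

An exact search for the misspelling "databse" finds nothing, while the fuzzy search matches it to the indexed term "database". The `cutoff` parameter controls how close a term must be to count as a match.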
Relevant Azure service: Azure Search

Time series databases


Time series data is a set of values organized by time, and a time series database is a database that is optimized for
this type of data. Time series databases must support a very high number of writes, as they typically collect large
amounts of data in real time from a large number of sources. Updates are rare, and deletes are often done as bulk
operations. Although the records written to a time-series database are generally small, there are often a large
number of records, and total data size can grow rapidly.
Time series databases are good for storing telemetry data. Scenarios include IoT sensors or application/system
counters.
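The access pattern described here (mostly appends, reads over time ranges, and bulk deletes of old data) can be sketched with two parallel sorted lists. The `TimeSeries` class is a hypothetical illustration that holds everything in memory. Real time series databases add compression, partitioning, and durable storage on top of the same idea.

```python
import bisect


class TimeSeries:
    """Toy append-mostly series of (timestamp, value) points kept in time order."""

    def __init__(self):
        self._timestamps = []
        self._values = []

    def append(self, timestamp, value):
        # Fast path: new points normally arrive in time order.
        if not self._timestamps or timestamp >= self._timestamps[-1]:
            self._timestamps.append(timestamp)
            self._values.append(value)
        else:
            # Late arrival: insert at the sorted position instead.
            i = bisect.bisect_right(self._timestamps, timestamp)
            self._timestamps.insert(i, timestamp)
            self._values.insert(i, value)

    def range(self, start, end):
        """Points with start <= timestamp < end, in ascending time order."""
        lo = bisect.bisect_left(self._timestamps, start)
        hi = bisect.bisect_left(self._timestamps, end)
        return list(zip(self._timestamps[lo:hi], self._values[lo:hi]))

    def delete_before(self, cutoff):
        """Bulk-delete the contiguous block of points older than cutoff."""
        i = bisect.bisect_left(self._timestamps, cutoff)
        del self._timestamps[:i]
        del self._values[:i]
        return i  # number of points removed


readings = TimeSeries()
for t in [10, 20, 30, 25, 40]:  # 25 arrives late
    readings.append(t, float(t))
```

Because the data is ordered by time, both the range read and the bulk delete touch one contiguous block rather than scattered records.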
Relevant Azure service: Time Series Insights

Object storage
Object storage is optimized for storing and retrieving large binary objects (images, files, video and audio streams,
large application data objects and documents, virtual machine disk images). Objects in these store types are
composed of the stored data, some metadata, and a unique ID for accessing the object. Object stores enable the
management of extremely large amounts of unstructured data.
Relevant Azure service: Blob Storage

Shared files
Sometimes, using simple flat files can be the most effective means of storing and retrieving information. Using file
shares enables files to be accessed across a network. Given appropriate security and concurrent access control
mechanisms, sharing data in this way can enable distributed services to provide highly scalable data access for
performing basic, low-level operations such as simple read and write requests.
Relevant Azure service: File Storage
Criteria for choosing a data store
6/23/2017

Azure supports many types of data storage solutions, each providing different features and capabilities. This article
describes the comparison criteria you should use when evaluating a data store. The goal is to help you determine
which data storage types can meet your solution's requirements.

General considerations
To start your comparison, gather as much of the following information as you can about your data needs. This
information will help you to determine which data storage types will meet your needs.
Functional requirements
Data format. What type of data are you intending to store? Common types include transactional data, JSON
objects, telemetry, search indexes, or flat files.
Data size. How large are the entities you need to store? Will these entities need to be maintained as a single
document, or can they be split across multiple documents, tables, collections, and so forth?
Scale and structure. What is the overall amount of storage capacity you need? Do you anticipate partitioning
your data?
Data relationships. Will your data need to support one-to-many or many-to-many relationships? Are
relationships themselves an important part of the data? Will you need to join or otherwise combine data from
within the same dataset, or from external datasets?
Consistency model. How important is it for updates made in one node to appear in other nodes, before further
changes can be made? Can you accept eventual consistency? Do you need ACID guarantees for transactions?
Schema flexibility. What kind of schemas will you apply to your data? Will you use a fixed schema, a schema-
on-write approach, or a schema-on-read approach?
Concurrency. What kind of concurrency mechanism do you want to use when updating and synchronizing
data? Will the application perform many updates that could potentially conflict? If so, you may require record
locking and pessimistic concurrency control. Alternatively, can you support optimistic concurrency controls? If
so, is simple timestamp-based concurrency control enough, or do you need the added functionality of multi-
version concurrency control?
Data movement. Will your solution need to perform ETL tasks to move data to other stores or data
warehouses?
Data lifecycle. Is the data write-once, read-many? Can it be moved into cool or cold storage?
Other supported features. Do you need any other specific features, such as schema validation, aggregation,
indexing, full-text search, MapReduce, or other query capabilities?
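To make the concurrency question above more concrete, here is a minimal sketch of optimistic concurrency control. It uses a per-item version number instead of the timestamp the text mentions, because a counter is easier to reason about in a short example. The `VersionedStore` class and `update_with_retry` helper are hypothetical. A writer states which version it read, and the write is rejected if the item has changed since then.

```python
class VersionConflict(Exception):
    """Raised when an item changed between the caller's read and write."""


class VersionedStore:
    """Toy store that keeps a version number next to every value."""

    def __init__(self):
        self._items = {}  # key -> (version, value)

    def read(self, key):
        # Missing keys read as version 0 with no value.
        return self._items.get(key, (0, None))

    def write(self, key, value, expected_version):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        self._items[key] = (current_version + 1, value)
        return current_version + 1


def update_with_retry(store, key, transform, max_attempts=3):
    """Read, transform, and write back; retry if another writer got in first."""
    for _ in range(max_attempts):
        version, value = store.read(key)
        try:
            return store.write(key, transform(value), expected_version=version)
        except VersionConflict:
            continue
    raise VersionConflict(f"{key}: gave up after {max_attempts} attempts")
```

No locks are held between the read and the write. If conflicts are frequent, the retries become expensive, which is the case where record locking and pessimistic concurrency control may be the better choice.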
Non-functional requirements
Performance and scalability. What are your data performance requirements? Do you have specific
requirements for data ingestion rates and data processing rates? What are the acceptable response times for
querying and aggregation of data once ingested? How large will you need the data store to scale up? Is your
workload more read-heavy or write-heavy?
Reliability. What overall SLA do you need to support? What level of fault-tolerance do you need to provide for
data consumers? What kind of backup and restore capabilities do you need?
Replication. Will your data need to be distributed among multiple replicas or regions? What kind of data
replication capabilities do you require?
Limits. Will the limits of a particular data store support your requirements for scale, number of connections, and
throughput?
Management and cost
Managed service. When possible, use a managed data service, unless you require specific capabilities that can
only be found in an IaaS-hosted data store.
Region availability. For managed services, is the service available in all Azure regions? Does your solution
need to be hosted in certain Azure regions?
Portability. Will your data need to be migrated to on-premises, external datacenters, or other cloud hosting
environments?
Licensing. Do you have a preference of a proprietary versus OSS license type? Are there any other external
restrictions on what type of license you can use?
Overall cost. What is the overall cost of using the service within your solution? How many instances will need
to run, to support your uptime and throughput requirements? Consider operations costs in this calculation. One
reason to prefer managed services is the reduced operational cost.
Cost effectiveness. Can you partition your data, to store it more cost effectively? For example, can you move
large objects out of an expensive relational database into an object store?
Security
Security. What type of encryption do you require? Do you need encryption at rest? What authentication
mechanism do you want to use to connect to your data?
Auditing. What kind of audit log do you need to generate?
Networking requirements. Do you need to restrict or otherwise manage access to your data from other
network resources? Does data need to be accessible only from inside the Azure environment? Does the data
need to be accessible from specific IP addresses or subnets? Does it need to be accessible from applications or
services hosted on-premises or in other external datacenters?
DevOps
Skill set. Are there particular programming languages, operating systems, or other technology that your team is
particularly adept at using? Are there others that would be difficult for your team to work with?
Clients. Is there good client support for your development languages?
The following sections compare various data store models in terms of workload profile, data types, and example
use cases.

Relational database management systems (RDBMS)


Workload:
Both the creation of new records and updates to existing data happen regularly.
Multiple operations have to be completed in a single transaction.
Requires aggregation functions to perform cross-tabulation.
Strong integration with reporting tools is required.
Relationships are enforced using database constraints.
Indexes are used to optimize query performance.
Allows access to specific subsets of data.

Data type:
Data is highly normalized.
Database schemas are required and enforced.
Many-to-many relationships between data entities in the database.
Constraints are defined in the schema and imposed on any data in the database.
Data requires high integrity. Indexes and relationships need to be maintained accurately.
Data requires strong consistency. Transactions operate in a way that ensures all data are 100% consistent for all users and processes.
Size of individual data entries is intended to be small to medium-sized.

Examples:
Line of business (human capital management, customer relationship management, enterprise resource planning)
Inventory management
Reporting database
Accounting
Asset management
Fund management
Order management

Document databases
Workload:
General purpose.
Insert and update operations are common. Both the creation of new records and updates to existing data happen regularly.
No object-relational impedance mismatch. Documents can better match the object structures used in application code.
Optimistic concurrency is more commonly used.
Data must be modified and processed by consuming application.
Data requires index on multiple fields.
Individual documents are retrieved and written as a single block.

Data type:
Data can be managed in de-normalized way.
Size of individual document data is relatively small.
Each document type can use its own schema.
Documents can include optional fields.
Document data is semi-structured, meaning that data types of each field are not strictly defined.
Data aggregation is supported.

Examples:
Product catalog
User accounts
Bill of materials
Personalization
Content management
Operations data
Inventory management
Transaction history data
Materialized view of other NoSQL stores. Replaces file/BLOB indexing.

Key/value stores
Workload:
Data is identified and accessed using a single ID key, like a dictionary.
Massively scalable.
No joins, locks, or unions are required.
No aggregation mechanisms are used.
Secondary indexes are generally not used.

Data type:
Data size tends to be large.
Each key is associated with a single value, which is an unmanaged data BLOB.
There is no schema enforcement.
No relationships between entities.

Examples:
Data caching
Session management
User preference and profile management
Product recommendation and ad serving
Dictionaries

Graph databases
Workload:
The relationships between data items are very complex, involving many hops between related data items.
The relationships between data items are dynamic and change over time.
Relationships between objects are first-class citizens, without requiring foreign-keys and joins to traverse.

Data type:
Data is comprised of nodes and relationships.
Nodes are similar to table rows or JSON documents.
Relationships are just as important as nodes, and are exposed directly in the query language.
Composite objects, such as a person with multiple phone numbers, tend to be broken into separate, smaller nodes, combined with traversable relationships.

Examples:
Organization charts
Social graphs
Fraud detection
Analytics
Recommendation engines

Column-family databases
Workload:
Most column-family databases perform write operations extremely quickly.
Update and delete operations are rare.
Designed to provide high throughput and low-latency access.
Supports easy query access to a particular set of fields within a much larger record.
Massively scalable.

Data type:
Data is stored in tables consisting of a key column and one or more column families.
Specific columns can vary by individual rows.
Individual cells are accessed via get and put commands.
Multiple rows are returned using a scan command.

Examples:
Recommendations
Personalization
Sensor data
Telemetry
Messaging
Social media analytics
Web analytics
Activity monitoring
Weather and other time-series data

Search engine databases
Workload Indexing data from multiple sources and services.
Queries are ad-hoc and can be complex.
Requires aggregation.
Full text search is required.
Ad hoc self-service query is required.
Data analysis with index on all fields is required.

Data type Semi-structured or unstructured
Text
Text with reference to structured data

Examples Product catalogs
Site search
Logging
Analytics
Shopping sites

Data warehouse
Workload Data analytics
Enterprise BI

Data type Historical data from multiple sources.
Usually denormalized in a "star" or "snowflake" schema,
consisting of fact and dimension tables.
Usually loaded with new data on a scheduled basis.
Dimension tables often include multiple historic
versions of an entity, referred to as a slowly changing
dimension.

Examples An enterprise data warehouse that provides data for analytical
models, reports, and dashboards.

Time series databases
Workload An overwhelming proportion of operations (95-99%)
are writes.
Records are generally appended sequentially in time
order.
Updates are rare.
Deletes occur in bulk, and are made to contiguous
blocks or records.
Read requests can be larger than available memory.
It's common for multiple reads to occur
simultaneously.
Data is read sequentially in either ascending or
descending time order.

Data type A time stamp that is used as the primary key and
sorting mechanism.
Measurements from the entry or descriptions of what
the entry represents.
Tags that define additional information about the type,
origin, and other information about the entry.

Examples Monitoring and event telemetry.
Sensor or other IoT data.

Object storage
Workload Identified by key.
Objects may be publicly or privately accessible.
Content is typically an asset such as a spreadsheet,
image, or video file.
Content must be durable (persistent), and external to
any application tier or virtual machine.

Data type Data size is large.
Blob data.
Value is opaque.

Examples Images, videos, office documents, PDFs
CSS, Scripts, CSV
Static HTML, JSON
Log and audit files
Database backups

Shared files
Workload Migration from existing apps that interact with the file
system.
Requires SMB interface.

Data type Files in a hierarchical set of folders.
Accessible with standard I/O libraries.

Examples Legacy files
Shared content accessible among a number of VMs or
app instances
Follow these design principles to make your application more scalable, resilient, and manageable.
Design for self healing. In a distributed system, failures happen. Design your application to be self healing when failures occur.
Make all things redundant. Build redundancy into your application, to avoid having single points of failure.
Minimize coordination. Minimize coordination between application services to achieve scalability.
Design to scale out. Design your application so that it can scale horizontally, adding or removing new instances as demand
requires.
Partition around limits. Use partitioning to work around database, network, and compute limits.
Design for operations. Design your application so that the operations team has the tools they need.
Use managed services. When possible, use platform as a service (PaaS) rather than infrastructure as a service (IaaS).
Use the best data store for the job. Pick the storage technology that is the best fit for your data and how it will be used.
Design for evolution. All successful applications change over time. An evolutionary design is key for continuous innovation.
Build for the needs of business. Every design decision must be justified by a business requirement.
Design your application to be self healing when failures occur
In a distributed system, failures happen. Hardware can fail. The network can have transient failures. Rarely, an entire service or
region may experience a disruption, but even those must be planned for.
Therefore, design an application to be self healing when failures occur. This requires a three-pronged approach:
Detect failures.
Respond to failures gracefully.
Log and monitor failures, to give operational insight.
How you respond to a particular type of failure may depend on your application's availability requirements. For example, if you
require very high availability, you might automatically fail over to a secondary region during a regional outage. However, that will
incur a higher cost than a single-region deployment.
Also, don't just consider big events like regional outages, which are generally rare. You should focus as much, if not more, on
handling local, short-lived failures, such as network connectivity failures or failed database connections.

Recommendations
Retry failed operations. Transient failures may occur due to momentary loss of network connectivity, a dropped database
connection, or a timeout when a service is busy. Build retry logic into your application to handle transient failures. For many Azure
services, the client SDK implements automatic retries. For more information, see Transient fault handling and Retry Pattern.
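As a rough illustration of retry logic for transient failures, the following sketch retries an operation with exponential backoff and jitter. The names `TransientError` and `retry`, and the default delays, are assumptions made for this example; they are not part of any Azure SDK, which typically provides its own retry policies.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # The delay doubles on each attempt; random jitter keeps many
            # callers from retrying at the same instant.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))
```

Only errors that are known to be transient are retried here. Retrying a permanent failure (for example, an authorization error) just adds load without any chance of success.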
Protect failing remote services (Circuit Breaker). It's good to retry after a transient failure, but if the failure persists, you can
end up with too many callers hammering a failing service. This can lead to cascading failures, as requests back up. Use the Circuit
Breaker Pattern to fail fast (without making the remote call) when an operation is likely to fail.
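The fail-fast behavior can be sketched as a small state machine. This is a minimal, single-threaded illustration; the class name, thresholds, and the injectable `clock` parameter are assumptions for the example, not a reference implementation of the pattern.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the remote service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast; remote call skipped")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting on a service that is likely to fail, which gives the failing service room to recover.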
Isolate critical resources (Bulkhead). Failures in one subsystem can sometimes cascade. This can happen if a failure causes
some resources, such as threads or sockets, not to get freed in a timely manner, leading to resource exhaustion. To avoid this,
partition a system into isolated groups, so that a failure in one partition does not bring down the entire system.
Perform load leveling. Applications may experience sudden spikes in traffic that can overwhelm services on the backend. To
avoid this, use the Queue-Based Load Leveling Pattern to queue work items to run asynchronously. The queue acts as a buffer that
smooths out peaks in the load.
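The buffering effect can be sketched with an in-process queue. Here Python's `queue.Queue` stands in for a durable message queue, and `run_with_load_leveling` is a hypothetical helper written for this illustration; a real deployment would use a queueing service so that work survives process restarts.

```python
import queue
import threading


def run_with_load_leveling(work_items, handler, max_queue_size=100):
    """Buffer work items in a queue and process them at the worker's own pace."""
    buffer = queue.Queue(maxsize=max_queue_size)
    results = []

    def worker():
        while True:
            item = buffer.get()
            if item is None:  # sentinel: no more work
                break
            results.append(handler(item))

    consumer = threading.Thread(target=worker)
    consumer.start()
    for item in work_items:  # a burst of requests arrives here
        buffer.put(item)  # blocks when the buffer is full (back-pressure)
    buffer.put(None)
    consumer.join()
    return results
```

The producer can enqueue a burst quickly while the single worker drains the queue at a steady rate, so the backend never sees the peak directly.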
Fail over. If an instance can't be reached, fail over to another instance. For things that are stateless, like a web server, put several
instances behind a load balancer or traffic manager. For things that store state, like a database, use replicas and fail over.
Depending on the data store and how it replicates, this may require the application to deal with eventual consistency.
Compensate failed transactions. In general, avoid distributed transactions, as they require coordination across services and
resources. Instead, compose an operation from smaller individual transactions. If the operation fails midway through, use
Compensating Transactions to undo any step that already completed.
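One way to picture this is a list of steps, each paired with an action that undoes it. The helper below is a simplified sketch under that assumption; the function name and the in-memory bookkeeping are illustrative, and a production version would also need to persist progress and handle failures in the compensations themselves.

```python
def run_with_compensation(steps):
    """Run (action, compensation) pairs in order; undo completed steps on failure."""
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)
    except Exception:
        # Undo in reverse order, most recently completed step first.
        for compensation in reversed(completed):
            compensation()
        raise
```

For example, if "reserve stock" and "create shipment" succeed but "charge payment" fails, the shipment is cancelled and then the stock is released, leaving the system in a consistent state without a distributed transaction.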
Checkpoint long-running transactions. Checkpoints can provide resiliency if a long-running operation fails. When the
operation restarts (for example, it is picked up by another VM), it can be resumed from the last checkpoint.
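A minimal sketch of checkpointing follows. The `store` argument is a plain dictionary standing in for durable storage (such as a blob or database row), and `process_with_checkpoint` is a hypothetical name chosen for this example.

```python
def process_with_checkpoint(items, handle, store):
    """Process items in order; `store` is a dict standing in for durable storage."""
    start = store.get("next_index", 0)  # resume from the last checkpoint, if any
    for index in range(start, len(items)):
        handle(items[index])
        store["next_index"] = index + 1  # record progress after each item
```

If the process crashes while handling an item, that item is handled again after the restart, so this approach works best when each step is idempotent.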
Degrade gracefully. Sometimes you can't work around a problem, but you can provide reduced functionality that is still useful.
Consider an application that shows a catalog of books. If the application can't retrieve the thumbnail image for the cover, it might
show a placeholder image. Entire subsystems might be noncritical for the application. For example, in an e-commerce site, showing
product recommendations is probably less critical than processing orders.
Throttle clients. Sometimes a small number of users create excessive load, which can reduce your application's availability for
other users. In this situation, throttle the client for a certain period of time. See Throttling Pattern.
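A token bucket is one common way to implement per-client throttling; it is shown here as an illustration and is not necessarily what the Throttling Pattern guidance prescribes. The class name, parameters, and injectable `clock` are assumptions for the sketch.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, capacity, rate, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the caller should reject the request, e.g. with HTTP 429
```

In practice you would keep one bucket per client (keyed by API key or user ID), so that a single heavy user exhausts only their own quota.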
Block bad actors. Just because you throttle a client, it doesn't mean the client was acting maliciously. It just means the client exceeded
their service quota. But if a client consistently exceeds their quota or otherwise behaves badly, you might block them. Define an
out-of-band process for users to request getting unblocked.
Use leader election. When you need to coordinate a task, use Leader Election to select a coordinator. That way, the coordinator is
not a single point of failure. If the coordinator fails, a new one is selected. Rather than implement a leader election algorithm from
scratch, consider an off-the-shelf solution such as Zookeeper.
Test with fault injection. All too often, the success path is well tested but not the failure path. A system could run in production
for a long time before a failure path is exercised. Use fault injection to test the resiliency of the system to failures, either by
triggering actual failures or by simulating them.
Embrace chaos engineering. Chaos engineering extends the notion of fault injection, by randomly injecting failures or abnormal
conditions into production instances.
For a structured approach to making your applications self healing, see Design resilient applications for Azure.
Build redundancy into your application, to avoid having single points of failure
A resilient application routes around failure. Identify the critical paths in your application. Is there redundancy at each point in the
path? If a subsystem fails, will the application fail over to something else?

Recommendations
Consider business requirements. The amount of redundancy built into a system can affect both cost and complexity. Your
architecture should be informed by your business requirements, such as recovery time objective (RTO). For example, a multi-
region deployment is more expensive than a single-region deployment, and is more complicated to manage. You will need
operational procedures to handle failover and failback. The additional cost and complexity might be justified for some business
scenarios and not others.
Place VMs behind a load balancer. Don't use a single VM for mission-critical workloads. Instead, place multiple VMs behind a
load balancer. If any VM becomes unavailable, the load balancer distributes traffic to the remaining healthy VMs. To learn how to
deploy this configuration, see Multiple VMs for scalability and availability.

[Figure: multiple VMs behind a load balancer]

Replicate databases. Azure SQL Database and Cosmos DB automatically replicate the data within a region, and you can enable
geo-replication across regions. If you are using an IaaS database solution, choose one that supports replication and failover, such
as SQL Server Always On Availability Groups.
Enable geo-replication. Geo-replication for Azure SQL Database and Cosmos DB creates secondary readable replicas of your
data in one or more secondary regions. In the event of an outage, the database can fail over to the secondary region for writes.
Partition for availability. Database partitioning is often used to improve scalability, but it can also improve availability. If one
shard goes down, the other shards can still be reached. A failure in one shard will only disrupt a subset of the total transactions.
Deploy to more than one region. For the highest availability, deploy the application to more than one region. That way, in the
rare case when a problem affects an entire region, the application can fail over to another region. The following diagram shows a
multi-region application that uses Azure Traffic Manager to handle failover.

[Figure: Azure Traffic Manager in front of Region 1 and Region 2]

Synchronize front and backend failover. Use Azure Traffic Manager to fail over the front end. If the front end becomes
unreachable in one region, Traffic Manager will route new requests to the secondary region. Depending on your database solution,
you may need to coordinate failing over the database.
Use automatic failover but manual failback. Use Traffic Manager for automatic failover, but not for automatic failback.
Automatic failback carries a risk that you might switch to the primary region before the region is completely healthy. Instead, verify
that all application subsystems are healthy before manually failing back. Also, depending on the database, you might need to check
data consistency before failing back.
Include redundancy for Traffic Manager. Traffic Manager is a possible failure point. Review the Traffic Manager SLA, and
determine whether using Traffic Manager alone meets your business requirements for high availability. If not, consider adding
another traffic management solution as a fallback. If the Azure Traffic Manager service fails, change your CNAME records in DNS to
point to the other traffic management service.
Minimize coordination between application services to achieve scalability
Most cloud applications consist of multiple application services — web front ends, databases, business processes, reporting and
analysis, and so on. To achieve scalability and reliability, each of those services should run on multiple instances.
What happens when two instances try to perform concurrent operations that affect some shared state? In some cases, there must
be coordination across nodes, for example to preserve ACID guarantees. In this diagram, Node2 is waiting for Node1 to release a
database lock:

[Figure: Node 1 and Node 2 issuing "Update Orders" and "Update OrderItems"; Node 2 is blocked]

Coordination limits the benefits of horizontal scale and creates bottlenecks. In this example, as you scale out the application and
add more instances, you'll see increased lock contention. In the worst case, the front-end instances will spend most of their time
waiting on locks.
"Exactly once" semantics are another frequent source of coordination. For example, an order must be processed exactly once. Two
workers are listening for new orders. Worker1 picks up an order for processing. The application must ensure that Worker2 doesn't
duplicate the work, and also that the order isn't dropped if Worker1 crashes.

[Figure: "Create order" adds order #123 to the Orders queue; Worker 1 and Worker 2 both process orders]

You can use a pattern such as Scheduler Agent Supervisor to coordinate between the workers, but in this case a better approach
might be to partition the work. Each worker is assigned a certain range of orders (say, by billing region). If a worker crashes, a new
instance picks up where the previous instance left off, but multiple instances aren't contending.

Recommendations
Embrace eventual consistency. When data is distributed, it takes coordination to enforce strong consistency guarantees. For
example, suppose an operation updates two databases. Instead of putting it into a single transaction scope, it's better if the system
can accommodate eventual consistency, perhaps by using the Compensating Transaction pattern to logically roll back after a
failure.
Use domain events to synchronize state. A domain event is an event that records when something happens that has
significance within the domain. Interested services can listen for the event, rather than using a global transaction to coordinate
across multiple services. If this approach is used, the system must tolerate eventual consistency (see previous item).
Consider patterns such as CQRS and event sourcing. These two patterns can help to reduce contention between read
workloads and write workloads.
The CQRS pattern separates read operations from write operations. In some implementations, the read data is physically
separated from the write data.
In the Event Sourcing pattern, state changes are recorded as a series of events to an append-only data store. Appending an
event to the stream is an atomic operation, requiring minimal locking.
These two patterns complement each other. If the write-only store in CQRS uses event sourcing, the read-only store can listen for
the same events to create a readable snapshot of the current state, optimized for queries. Before adopting CQRS or event sourcing,
however, be aware of the challenges of this approach. For more information, see CQRS architecture style.
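To make the relationship between the two patterns concrete, here is a small in-memory sketch. The account example, the event names, and the classes `Event`, `EventStore`, and `BalanceView` are all hypothetical; a real system would use a durable append-only store and deliver events to the read side asynchronously.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # hypothetical event names: "deposited" or "withdrawn"
    amount: int


class EventStore:
    """Append-only log of events (the write side)."""

    def __init__(self):
        self._events = []
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def append(self, event):
        self._events.append(event)  # existing events are never modified
        for callback in self._subscribers:
            callback(event)


class BalanceView:
    """Read-side projection, kept current by applying each event."""

    def __init__(self):
        self.balance = 0

    def apply(self, event):
        if event.kind == "deposited":
            self.balance += event.amount
        elif event.kind == "withdrawn":
            self.balance -= event.amount
```

Writers only ever append to the log, and readers query `BalanceView` without touching the write store, which is how the two workloads avoid contending with each other.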
Partition data. Avoid putting all of your data into one data schema that is shared across many application services. A
microservices architecture enforces this principle by making each service responsible for its own data store. Within a single
database, partitioning the data into shards can improve concurrency, because a service writing to one shard does not affect a
service writing to a different shard.
Design idempotent operations. When possible, design operations to be idempotent. That way, they can be handled using at-
least-once semantics. For example, you can put work items on a queue. If a worker crashes in the middle of an operation, another
worker simply picks up the work item.
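One simple way to make a handler idempotent is to record the IDs of work items that have already been applied. The sketch below assumes each message carries a unique ID; `PaymentProcessor` and its in-memory set are illustrative stand-ins for a service with a durable record of completed work.

```python
class PaymentProcessor:
    """Applies each work item at most once, keyed by a caller-supplied ID."""

    def __init__(self):
        self.processed_ids = set()  # stands in for a durable record of completed work
        self.total_charged = 0

    def handle(self, message_id, amount):
        if message_id in self.processed_ids:
            return False  # duplicate delivery: already applied, do nothing
        self.total_charged += amount
        self.processed_ids.add(message_id)
        return True
```

With this check in place, the queue can safely redeliver a message after a worker crash: processing the same message twice has the same effect as processing it once.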
Use asynchronous parallel processing. If an operation requires multiple steps that are performed asynchronously (such as
remote service calls), you might be able to call them in parallel, and then aggregate the results. This approach assumes that each
step does not depend on the results of the previous step.
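The following sketch shows the idea with Python's `asyncio`. The two `fetch_*` coroutines are hypothetical placeholders that simulate remote calls with a short sleep; only `asyncio.gather` and `asyncio.sleep` are real library functions.

```python
import asyncio


async def fetch_price(product_id):
    await asyncio.sleep(0.01)  # simulated network latency
    return {"price": 10 * product_id}


async def fetch_stock(product_id):
    await asyncio.sleep(0.01)  # simulated network latency
    return {"in_stock": product_id % 2 == 0}


async def product_summary(product_id):
    # Neither call needs the other's result, so run them concurrently
    # and aggregate once both have completed.
    price, stock = await asyncio.gather(
        fetch_price(product_id), fetch_stock(product_id)
    )
    return {**price, **stock}
```

The total latency is roughly that of the slowest call rather than the sum of both, which is the benefit of issuing independent steps in parallel.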
Use optimistic concurrency when possible. Pessimistic concurrency control uses database locks to prevent conflicts. This can
cause poor performance and reduce availability. With optimistic concurrency control, each transaction modifies a copy or snapshot
of the data. When the transaction is committed, the database engine validates the transaction and rejects any transactions that
would affect database consistency.
Azure SQL Database and SQL Server support optimistic concurrency through snapshot isolation. Some Azure storage services
support optimistic concurrency through the use of Etags, including DocumentDB API and Azure Storage.
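The version check behind ETag-style optimistic concurrency can be sketched as follows. `VersionedStore` is a hypothetical in-memory stand-in, with an integer version playing the role of the ETag; a real data store performs the compare-and-update atomically on the server.

```python
class ConflictError(Exception):
    """Raised when the stored version no longer matches the caller's version."""


class VersionedStore:
    def __init__(self):
        self._items = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); version 0 means the key does not exist yet."""
        return self._items.get(key, (0, None))

    def write(self, key, value, expected_version):
        current_version, _ = self._items.get(key, (0, None))
        if current_version != expected_version:
            # Someone else updated the item since the caller read it.
            raise ConflictError(
                f"expected v{expected_version}, found v{current_version}"
            )
        self._items[key] = (current_version + 1, value)
        return current_version + 1
```

No lock is held between the read and the write. A writer that loses the race gets a `ConflictError` and can re-read the item and try again.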
Consider MapReduce or other parallel, distributed algorithms. Depending on the data and type of work to be performed,
you may be able to split the work into independent tasks that can be performed by multiple nodes working in parallel. See Big
compute architecture style.
Use leader election for coordination. In cases where you need to coordinate operations, make sure the coordinator does not
become a single point of failure in the application. Using the Leader Election pattern, one instance is the leader at any time, and
acts as the coordinator. If the leader fails, a new instance is elected to be the leader.
Design your application so that it can scale horizontally
A primary advantage of the cloud is elastic scaling — the ability to use as much capacity as you need, scaling out as load increases,
and scaling in when the extra capacity is not needed. Design your application so that it can scale horizontally, adding or removing
new instances as demand requires.

Recommendations
Avoid instance stickiness. Stickiness, or session affinity, is when requests from the same client are always routed to the same
server. Stickiness limits the application's ability to scale out. For example, traffic from a high-volume user will not be distributed
across instances. Causes of stickiness include storing session state in memory, and using machine-specific keys for encryption.
Make sure that any instance can handle any request.
Identify bottlenecks. Scaling out isn't a magic fix for every performance issue. For example, if your backend database is the
bottleneck, it won't help to add more web servers. Identify and resolve the bottlenecks in the system first, before throwing more
instances at the problem. Stateful parts of the system are the most likely cause of bottlenecks.
Decompose workloads by scalability requirements. Applications often consist of multiple workloads, with different
requirements for scaling. For example, an application might have a public-facing site and a separate administration site. The public
site may experience sudden surges in traffic, while the administration site has a smaller, more predictable load.
Offload resource-intensive tasks. Tasks that require a lot of CPU or I/O resources should be moved to background jobs when
possible, to minimize the load on the front end that is handling user requests.
Use built-in autoscaling features. Many Azure compute services have built-in support for autoscaling. If the application has a
predictable, regular workload, scale out on a schedule. For example, scale out during business hours. Otherwise, if the workload is
not predictable, use performance metrics such as CPU or request queue length to trigger autoscaling. For autoscaling best
practices, see Autoscaling.
Consider aggressive autoscaling for critical workloads. For critical workloads, you want to keep ahead of demand. It's better
to add new instances quickly under heavy load to handle the additional traffic, and then gradually scale back.
Design for scale in. Remember that with elastic scale, the application will have periods of scale in, when instances get removed.
The application must gracefully handle instances being removed. Here are some ways to handle scale in:
Listen for shutdown events (when available) and shut down cleanly.
Clients/consumers of a service should support transient fault handling and retry.
For long-running tasks, consider breaking up the work, using checkpoints or the Pipes and Filters pattern.
Put work items on a queue so that another instance can pick up the work, if an instance is removed in the middle of processing.
Use partitioning to work around database, network, and compute limits
In the cloud, all services have limits in their ability to scale up. Azure service limits are documented in Azure subscription and
service limits, quotas, and constraints. Limits include number of cores, database size, query throughput, and network throughput. If
your system grows sufficiently large, you may hit one or more of these limits. Use partitioning to work around these limits.
There are many ways to partition a system, such as:
Partition a database to avoid limits on database size, data I/O, or number of concurrent sessions.
Partition a queue or message bus to avoid limits on the number of requests or the number of concurrent connections.
Partition an App Service web app to avoid limits on the number of instances per App Service plan.
A database can be partitioned horizontally, vertically, or functionally.
In horizontal partitioning, also called sharding, each partition holds data for a subset of the total data set. The partitions
share the same data schema. For example, customers whose names start with A–M go into one partition, N–Z into another
partition.
In vertical partitioning, each partition holds a subset of the fields for the items in the data store. For example, put frequently
accessed fields in one partition, and less frequently accessed fields in another.
In functional partitioning, data is partitioned according to how it is used by each bounded context in the system. For
example, store invoice data in one partition and product inventory data in another. The schemas are independent.
For more detailed guidance, see Data partitioning.

Recommendations
Partition different parts of the application. Databases are one obvious candidate for partitioning, but also consider storage,
cache, queues, and compute instances.
Design the partition key to avoid hot spots. If you partition a database, but one shard still gets the majority of the requests,
then you haven't solved your problem. Ideally, load gets distributed evenly across all the partitions. For example, hash by customer
ID and not the first letter of the customer name, because some letters are more frequent. The same principle applies when
partitioning a message queue. Pick a partition key that leads to an even distribution of messages across the set of queues. For
more information, see Sharding.
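The difference between the two partition keys can be sketched as follows. The function names and the sample customer names are made up for illustration; the only library calls used are `hashlib.sha256` and `collections.Counter`.

```python
import hashlib
from collections import Counter


def shard_by_first_letter(name, shard_count):
    """Naive key: skewed, because some initial letters are far more common."""
    return (ord(name[0].upper()) - ord("A")) % shard_count


def shard_by_hash(customer_id, shard_count):
    """Hash key: a stable digest spreads IDs roughly evenly across shards."""
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def load_per_shard(keys, shard_fn, shard_count):
    """Count how many keys land on each shard."""
    return Counter(shard_fn(key, shard_count) for key in keys)
```

For instance, `load_per_shard(["Smith", "Sanchez", "Singh", "Suzuki", "Adams"], shard_by_first_letter, 4)` puts four of the five customers on the same shard, whereas hashing customer IDs tends to spread them out. A cryptographic digest is used rather than Python's built-in `hash()` because the latter is randomized per process for strings, which would make routing inconsistent across restarts.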
Partition around Azure subscription and service limits. Individual components and services have limits, but there are also
limits for subscriptions and resource groups. For very large applications, you might need to partition around those limits.
Partition at different levels. Consider a database server deployed on a VM. The VM has a VHD that is backed by Azure Storage.
The storage account belongs to an Azure subscription. Notice that each step in the hierarchy has limits. The database server may
have a connection pool limit. VMs have CPU and network limits. Storage has IOPS limits. The subscription has limits on the
number of VM cores. Generally, it's easier to partition lower in the hierarchy. Only large applications should need to partition at the
subscription level.
Design an application so that the operations team has the tools they need
The cloud has dramatically changed the role of the operations team. They are no longer responsible for managing the hardware
and infrastructure that hosts the application. That said, operations is still a critical part of running a successful cloud application.
Some of the important functions of the operations team include:
Deployment
Monitoring
Escalation
Incident response
Security auditing
Robust logging and tracing are particularly important in cloud applications. Involve the operations team in design and planning, to
ensure the application gives them the data and insight they need to be successful.

Recommendations
Make all things observable. Once a solution is deployed and running, logs and traces are your primary insight into the system.
Tracing records a path through the system, and is useful to pinpoint bottlenecks, performance issues, and failure points. Logging
captures individual events such as application state changes, errors, and exceptions. Log in production, or else you lose insight at
the very times when you need it the most.
Instrument for monitoring. Monitoring gives insight into how well (or poorly) an application is performing, in terms of
availability, performance, and system health. For example, monitoring tells you whether you are meeting your SLA. Monitoring
happens during the normal operation of the system. It should be as close to real-time as possible, so that the operations staff can
react to issues quickly. Ideally, monitoring can help avert problems before they lead to a critical failure. For more information, see
Monitoring and diagnostics.
Instrument for root cause analysis. Root cause analysis is the process of finding the underlying cause of failures. It occurs after a
failure has already happened.
Use distributed tracing. Use a distributed tracing system that is designed for concurrency, asynchrony, and cloud scale. Traces
should include a correlation ID that flows across service boundaries. A single operation may involve calls to multiple application
services. If an operation fails, the correlation ID helps to pinpoint the cause of the failure.
Standardize logs and metrics. The operations team will need to aggregate logs from across the various services in your solution.
If every service uses its own logging format, it becomes difficult or impossible to get useful information from them. Define a
common schema that includes fields such as correlation ID, event name, IP address of the sender, and so forth. Individual services
can derive custom schemas that inherit the base schema, and contain additional fields.
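A base schema with service-specific extensions might look like the sketch below. The field names follow the examples in the text (correlation ID, event name, sender IP address); the class names, the `order_id` field, and the JSON-lines output format are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class BaseLogEvent:
    """Fields that every service includes in every log entry."""
    correlation_id: str
    event_name: str
    sender_ip: str


@dataclass
class OrderLogEvent(BaseLogEvent):
    """Schema for a hypothetical order service: base fields plus its own."""
    order_id: str


def to_log_line(event):
    """Serialize an event as one JSON object per line for easy aggregation."""
    return json.dumps(asdict(event), sort_keys=True)
```

Because every entry carries the same base fields, the operations team can filter all services' logs by `correlation_id` without knowing each service's custom fields.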
Automate management tasks, including provisioning, deployment, and monitoring. Automating a task makes it repeatable and
less prone to human errors.
Treat configuration as code. Check configuration files into a version control system, so that you can track and version your
changes, and roll back if needed.
When possible, use platform as a service (PaaS) rather than infrastructure as a
service (IaaS)
IaaS is like having a box of parts. You can build anything, but you have to assemble it yourself. Managed services are easier to
configure and administer. You don't need to provision VMs, set up VNets, manage patches and updates, and all of the other
overhead associated with running software on a VM.
For example, suppose your application needs a message queue. You could set up your own messaging service on a VM, using
something like RabbitMQ. But Azure Service Bus already provides reliable messaging as a service, and it's simpler to set up. Just
create a Service Bus namespace (which can be done as part of a deployment script) and then call Service Bus using the client SDK.
Of course, your application may have specific requirements that make an IaaS approach more suitable. However, even if your
application is based on IaaS, look for places where it may be natural to incorporate managed services. These include cache, queues,
and data storage.

INSTEAD OF RUNNING...    CONSIDER USING...

Active Directory         Azure Active Directory Domain Services
Elasticsearch            Azure Search
Hadoop                   HDInsight
IIS                      App Service
MongoDB                  Cosmos DB
Redis                    Azure Redis Cache
SQL Server               Azure SQL Database


Pick the storage technology that is the best fit for your data and how it will be
used
Gone are the days when you would just stick all of your data into a big relational SQL database. Relational databases are very good
at what they do — providing ACID guarantees for transactions over relational data. But they come with some costs:
Queries may require expensive joins.
Data must be normalized and conform to a predefined schema (schema on write).
Lock contention may impact performance.
In any large solution, it's likely that a single data store technology won't fill all your needs. Alternatives to relational databases
include key/value stores, document databases, search engine databases, time series databases, column family databases, and
graph databases. Each has pros and cons, and different types of data fit more naturally into one or another.
For example, you might store a product catalog in a document database, such as Cosmos DB, which allows for a flexible schema. In
that case, each product description is a self-contained document. For queries over the entire catalog, you might index the catalog
and store the index in Azure Search. Product inventory might go into a SQL database, because that data requires ACID guarantees.
Remember that data includes more than just the persisted application data. It also includes application logs, events, messages, and
caches.

Recommendations
Don't use a relational database for everything. Consider other data stores when appropriate. See Choose the right data store.
Embrace polyglot persistence. In any large solution, it's likely that a single data store technology won't fill all your needs.
Consider the type of data. For example, put transactional data into SQL, put JSON documents into a document database, put
telemetry data into a time series database, put application logs in Elasticsearch, and put blobs in Azure Blob Storage.
Prefer availability over (strong) consistency. The CAP theorem implies that a distributed system must make trade-offs
between availability and consistency. (Network partitions, the other leg of the CAP theorem, can never be completely avoided.)
Often, you can achieve higher availability by adopting an eventual consistency model.
Consider the skill set of the development team. There are advantages to using polyglot persistence, but it's possible to go
overboard. Adopting a new data storage technology requires a new set of skills. The development team must understand how to
get the most out of the technology. They must understand appropriate usage patterns, how to optimize queries, tune for
performance, and so on. Factor this in when considering storage technologies.
Use compensating transactions. A side effect of polyglot persistence is that a single transaction might write data to multiple
stores. If something fails, use compensating transactions to undo any steps that already completed.
Look at bounded contexts. Bounded context is a term from domain driven design. A bounded context is an explicit boundary
around a domain model, and defines which parts of the domain the model applies to. Ideally, a bounded context maps to a
subdomain of the business domain. The bounded contexts in your system are a natural place to consider polyglot persistence. For
example, "products" may appear in both the Product Catalog subdomain and the Product Inventory subdomain, but it's very likely
that these two subdomains have different requirements for storing, updating, and querying products.
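The compensating-transactions recommendation above can be sketched as follows. This is a minimal illustration rather than an Azure API: `run_with_compensation`, the step functions, and the in-memory `orders` and `inventory` dictionaries are hypothetical stand-ins for writes to two separate data stores.

```python
from typing import Callable, List, Tuple

# Each step pairs an action with the compensating action that undoes it.
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
    return True


# Hypothetical stores standing in for two different data store technologies.
orders: dict = {}
inventory: dict = {"widget": 5}


def place_order() -> None:
    orders["order-1"] = "placed"


def cancel_order() -> None:
    orders.pop("order-1", None)


def reserve_stock() -> None:
    # Fails here because only 5 widgets are available.
    if inventory["widget"] < 10:
        raise RuntimeError("insufficient stock")
    inventory["widget"] -= 10


def release_stock() -> None:
    inventory["widget"] += 10


succeeded = run_with_compensation([
    (place_order, cancel_order),
    (reserve_stock, release_stock),
])
```

In this run the second step fails, so the order written by the first step is removed again and both stores end up in their original state. A production version would also need to handle a compensating action that itself fails.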
An evolutionary design is key for continuous innovation
All successful applications change over time, whether to fix bugs, add new features, bring in new technologies, or make existing
systems more scalable and resilient. If all the parts of an application are tightly coupled, it becomes very hard to introduce changes
into the system. A change in one part of the application may break another part, or cause changes to ripple through the entire
codebase.
This problem is not limited to monolithic applications. An application can be decomposed into services, but still exhibit the sort of
tight coupling that leaves the system rigid and brittle. But when services are designed to evolve, teams can innovate and
continuously deliver new features.
Microservices are becoming a popular way to achieve an evolutionary design, because they address many of the considerations
listed here.

Recommendations
Enforce high cohesion and loose coupling. A service is cohesive if it provides functionality that logically belongs together.
Services are loosely coupled if you can change one service without changing the other. High cohesion generally means that
changes in one function will require changes in other related functions. If you find that updating a service requires coordinated
updates to other services, it may be a sign that your services are not cohesive. One of the goals of domain-driven design (DDD) is
to identify those boundaries.
Encapsulate domain knowledge. When a client consumes a service, the responsibility for enforcing the business rules of the
domain should not fall on the client. Instead, the service should encapsulate all of the domain knowledge that falls under its
responsibility. Otherwise, every client has to enforce the business rules, and you end up with domain knowledge spread across
different parts of the application.
Use asynchronous messaging. Asynchronous messaging is a way to decouple the message producer from the consumer. The
producer does not depend on the consumer responding to the message or taking any particular action. With a pub/sub
architecture, the producer may not even know who is consuming the message. New services can easily consume the messages
without any modifications to the producer.
Don't build domain knowledge into a gateway. Gateways can be useful in a microservices architecture, for things like request
routing, protocol translation, load balancing, or authentication. However, the gateway should be restricted to this sort of
infrastructure functionality. It should not implement any domain knowledge, to avoid becoming a heavy dependency.
Expose open interfaces. Avoid creating custom translation layers that sit between services. Instead, a service should expose an
API with a well-defined API contract. The API should be versioned, so that you can evolve the API while maintaining backward
compatibility. That way, you can update a service without coordinating updates to all of the upstream services that depend on it.
Public facing services should expose a RESTful API over HTTP. Backend services might use an RPC-style messaging protocol for
performance reasons.
Design and test against service contracts. When services expose well-defined APIs, you can develop and test against those
APIs. That way, you can develop and test an individual service without spinning up all of its dependent services. (Of course, you
would still perform integration and load testing against the real services.)
Abstract infrastructure away from domain logic. Don't let domain logic get mixed up with infrastructure-related functionality,
such as messaging or persistence. Otherwise, changes in the domain logic will require updates to the infrastructure layers and vice
versa.
Offload cross-cutting concerns to a separate service. For example, if several services need to authenticate requests, you could
move this functionality into its own service. Then you could evolve the authentication service — for example, by adding a new
authentication flow — without touching any of the services that use it.
Deploy services independently. When the DevOps team can deploy a single service independently of other services in the
application, updates can happen more quickly and safely. Bug fixes and new features can be rolled out at a more regular cadence.
Design both the application and the release process to support independent updates.
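To illustrate the asynchronous messaging recommendation above, here is a small sketch of how pub/sub decouples a producer from its consumers. The `MessageBus` class is a hypothetical in-memory stand-in for a real broker, and it dispatches synchronously for simplicity; the topic name and handlers are also assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class MessageBus:
    """In-memory stand-in for a pub/sub broker (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # The producer does not know who, if anyone, handles the message.
        for handler in list(self._subscribers[topic]):
            handler(message)


bus = MessageBus()
shipped: List[str] = []
emailed: List[str] = []

bus.subscribe("order-placed", lambda m: shipped.append(m["order_id"]))


def place_order(order_id: str) -> None:
    """Producer: publishes an event and knows nothing about consumers."""
    bus.publish("order-placed", {"order_id": order_id})


place_order("A1")

# A new consumer is added later without modifying place_order.
bus.subscribe("order-placed", lambda m: emailed.append(m["order_id"]))
place_order("A2")
```

The second subscriber starts receiving messages without any change to `place_order`, which is the property the recommendation describes.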
Every design decision must be justified by a business requirement
This design principle may seem obvious, but it's crucial to keep in mind when designing a solution. Do you anticipate millions of
users, or a few thousand? Is a one-hour application outage acceptable? Do you expect large bursts in traffic, or a very predictable
workload? Ultimately, every design decision must be justified by a business requirement.

Recommendations
Define business objectives, including the recovery time objective (RTO), recovery point objective (RPO), and maximum tolerable
outage (MTO). These numbers should inform decisions about the architecture. For example, to achieve a low RTO, you might
implement automated failover to a secondary region. But if your solution can tolerate a higher RTO, that degree of redundancy
might be unnecessary.
Document service level agreements (SLA) and service level objectives (SLO), including availability and performance metrics.
You might build a solution that delivers 99.95% availability. Is that enough? The answer is a business decision.
Model the application around the business domain. Start by analyzing the business requirements. Use these requirements to
model the application. Consider using a domain-driven design (DDD) approach to create domain models that reflect the business
processes and use cases.
Capture both functional and nonfunctional requirements. Functional requirements let you judge whether the application
does the right thing. Nonfunctional requirements let you judge whether the application does those things well. In particular, make
sure that you understand your requirements for scalability, availability, and latency. These requirements will influence design
decisions and choice of technology.
Decompose by workload. The term "workload" in this context means a discrete capability or computing task, which can be
logically separated from other tasks. Different workloads may have different requirements for availability, scalability, data
consistency, and disaster recovery.
Plan for growth. A solution might meet your current needs, in terms of number of users, volume of transactions, data storage,
and so forth. However, a robust application can handle growth without major architectural changes. See Design to scale out and
Partition around limits. Also consider that your business model and business requirements will likely change over time. If an
application's service model and data models are too rigid, it becomes hard to evolve the application for new use cases and
scenarios. See Design for evolution.
Manage costs. In a traditional on-premises application, you pay upfront for hardware (CAPEX). In a cloud application, you pay for
the resources that you consume. Make sure that you understand the pricing model for the services that you consume. The total
cost will include network bandwidth usage, storage, IP addresses, service consumption, and other factors. See Azure pricing for
more information. Also consider your operations costs. In the cloud, you don't have to manage the hardware or other
infrastructure, but you still need to manage your applications, including DevOps, incident response, disaster recovery, and so forth.
Pillars of software quality
6/23/2017 • 10 min to read

A successful cloud application will focus on these five pillars of software quality: Scalability, availability, resiliency,
management, and security.

PILLAR        DESCRIPTION
Scalability   The ability of a system to handle increased load.
Availability  The proportion of time that a system is functional and working.
Resiliency    The ability of a system to recover from failures and continue to function.
Management    Operations processes that keep a system running in production.
Security      Protecting applications and data from threats.

Scalability
Scalability is the ability of a system to handle increased load. There are two main ways that an application can scale.
Vertical scaling (scaling up) means increasing the capacity of a resource, for example by using a larger VM size.
Horizontal scaling (scaling out) is adding new instances of a resource, such as VMs or database replicas.
Horizontal scaling has significant advantages over vertical scaling:
True cloud scale. Applications can be designed to run on hundreds or even thousands of nodes, reaching scales
that are not possible on a single node.
Horizontal scale is elastic. You can add more instances if load increases, or remove them during quieter periods.
Scaling out can be triggered automatically, either on a schedule or in response to changes in load.
Scaling out may be cheaper than scaling up. Running several small VMs can cost less than a single large VM.
Horizontal scaling can also improve resiliency, by adding redundancy. If an instance goes down, the application
keeps running.
An advantage of vertical scaling is that you can do it without making any changes to the application. But at some
point you'll hit a limit, where you can't scale up any more. At that point, any further scaling must be horizontal.
Horizontal scale must be designed into the system. For example, you can scale out VMs by placing them behind a
load balancer. But each VM in the pool must be able to handle any client request, so the application must be
stateless or store state externally (say, in a distributed cache). Managed PaaS services often have horizontal scaling
and auto-scaling built in. The ease of scaling these services is a major advantage of using PaaS services.
Just adding more instances doesn't mean an application will scale, however. It might simply push the bottleneck
somewhere else. For example, if you scale a web front-end to handle more client requests, that might trigger lock
contentions in the database. You would then need to consider additional measures, such as optimistic concurrency
or data partitioning, to enable more throughput to the database.
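As one illustration of the optimistic concurrency mentioned above, the following sketch replaces locking with a version check at write time. The `Store` class is a hypothetical in-memory stand-in for a database table; real data stores expose this idea through mechanisms such as row versions or ETags, which are not modeled here.

```python
from dataclasses import dataclass
from typing import Dict


class ConflictError(Exception):
    """Raised when a record changed since the caller read it."""


@dataclass
class Record:
    value: int
    version: int = 0


class Store:
    """In-memory stand-in for a database table (hypothetical)."""

    def __init__(self) -> None:
        self._rows: Dict[str, Record] = {}

    def read(self, key: str) -> Record:
        row = self._rows.setdefault(key, Record(value=0))
        return Record(row.value, row.version)  # return a copy

    def write(self, key: str, value: int, expected_version: int) -> None:
        row = self._rows[key]
        if row.version != expected_version:
            raise ConflictError(key)
        row.value = value
        row.version += 1


def increment(store: Store, key: str, max_attempts: int = 3) -> int:
    """Read-modify-write without holding a lock; re-read and retry on conflict."""
    for _ in range(max_attempts):
        current = store.read(key)
        try:
            store.write(key, current.value + 1, current.version)
            return current.value + 1
        except ConflictError:
            continue
    raise ConflictError(key)


store = Store()
stale = store.read("counter")   # a client reads version 0
increment(store, "counter")     # another client updates the row to version 1
try:
    store.write("counter", 99, stale.version)
    stale_write_rejected = False
except ConflictError:
    stale_write_rejected = True
```

The stale write is rejected instead of silently overwriting the newer value, and no lock is held between the read and the write.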
Always conduct performance and load testing to find these potential bottlenecks. The stateful parts of a system,
such as databases, are the most common cause of bottlenecks, and require careful design to scale horizontally.
Resolving one bottleneck may reveal other bottlenecks elsewhere.
Use the Scalability checklist to review your design from a scalability standpoint.
Scalability guidance
Design patterns for scalability and performance
Best practices: Autoscaling, Background jobs, Caching, CDN, Data partitioning

Availability
Availability is the proportion of time that the system is functional and working. It is usually measured as a
percentage of uptime. Application errors, infrastructure problems, and system load can all reduce availability.
A cloud application should have a service level objective (SLO) that clearly defines the expected availability, and
how the availability is measured. When defining availability, look at the critical path. The web front-end might be
able to service client requests, but if every transaction fails because it can't connect to the database, the application
is not available to users.
Availability is often described in terms of "9s" — for example, "four 9s" means 99.99% uptime. The following table
shows the potential cumulative downtime at different availability levels.

% UPTIME  DOWNTIME PER WEEK  DOWNTIME PER MONTH  DOWNTIME PER YEAR
99%       1.68 hours         7.2 hours           3.65 days
99.9%     10 minutes         43.2 minutes        8.76 hours
99.95%    5 minutes          21.6 minutes        4.38 hours
99.99%    1 minute           4.32 minutes        52.56 minutes
99.999%   6 seconds          26 seconds          5.26 minutes

Notice that 99% uptime could translate to an almost 2-hour service outage per week. For many applications,
especially consumer-facing applications, that is not an acceptable SLO. On the other hand, five 9s (99.999%) means
only about 5 minutes of downtime in a year. It's challenging enough just detecting an outage that quickly, let
alone resolving the issue. To get very high availability (99.99% or higher), you can't rely on manual intervention to
recover from failures. The application must be self-diagnosing and self-healing, which is where resiliency becomes
crucial.
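The downtime figures for each availability level follow from simple arithmetic. The sketch below shows the calculation; the function name is hypothetical, and the 30-day month and 365-day year are assumptions that appear to match the rounded values quoted above.

```python
def allowed_downtime_minutes(availability_percent: float, period_hours: float) -> float:
    """Maximum downtime, in minutes, for a given availability over a period."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_hours * 60


HOURS_PER_WEEK = 7 * 24
HOURS_PER_MONTH = 30 * 24   # assumes a 30-day month
HOURS_PER_YEAR = 365 * 24   # assumes a 365-day year

per_week_minutes = allowed_downtime_minutes(99.9, HOURS_PER_WEEK)
per_month_minutes = allowed_downtime_minutes(99.9, HOURS_PER_MONTH)
per_year_hours = allowed_downtime_minutes(99.9, HOURS_PER_YEAR) / 60
```

For 99.9% uptime this gives roughly 10 minutes per week, 43.2 minutes per month, and 8.76 hours per year.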
In Azure, the Service Level Agreement (SLA) describes Microsoft's commitments for uptime and connectivity. If the
SLA for a particular service is 99.95%, it means you should expect the service to be available 99.95% of the time.
Applications often depend on multiple services. In general, the probability of either service having downtime is
independent. For example, suppose your application depends on two services, each with a 99.9% SLA. The
composite SLA for both services is 99.9% × 99.9% ≈ 99.8%, or slightly less than each service by itself.
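The composite SLA calculation generalizes to any number of dependencies. This sketch assumes, as the text does, that every service is required and that failures are independent; `composite_sla` is a hypothetical helper name.

```python
import math
from typing import Iterable


def composite_sla(slas_percent: Iterable[float]) -> float:
    """Composite availability (%) when all services are required and fail independently."""
    return math.prod(s / 100 for s in slas_percent) * 100


two_services = composite_sla([99.9, 99.9])
three_services = composite_sla([99.95, 99.9, 99.99])
```

Two services at 99.9% each yield about 99.8%. Each additional dependency lowers the composite figure further, so it is always below the weakest individual SLA.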
Use the Availability checklist to review your design from an availability standpoint.
Availability guidance
Design patterns for availability
Best practices: Autoscaling, Background jobs

Resiliency
Resiliency is the ability of the system to recover from failures and continue to function. The goal of resiliency is to
return the application to a fully functioning state after a failure occurs. Resiliency is closely related to availability.
In traditional application development, there has been a focus on reducing mean time between failures (MTBF).
Effort was spent trying to prevent the system from failing. In cloud computing, a different mindset is required, due
to several factors:
Distributed systems are complex, and a failure at one point can potentially cascade throughout the system.
Costs for cloud environments are kept low through the use of commodity hardware, so occasional hardware
failures must be expected.
Applications often depend on external services, which may become temporarily unavailable or throttle high-
volume users.
Today's users expect an application to be available 24/7 without ever going offline.
All of these factors mean that cloud applications must be designed to expect occasional failures and recover from
them. Azure has many resiliency features already built into the platform. For example,
Azure Storage, SQL Database, and Cosmos DB all provide built-in data replication, both within a region and
across regions.
Azure Managed Disks are automatically placed in different storage scale units, to limit the effects of hardware
failures.
VMs in an availability set are spread across several fault domains. A fault domain is a group of VMs that share a
common power source and network switch. Spreading VMs across fault domains limits the impact of physical
hardware failures, network outages, or power interruptions.
That said, you still need to build resiliency into your application. Resiliency strategies can be applied at all levels of the
architecture. Some mitigations are more tactical in nature — for example, retrying a remote call after a transient
network failure. Other mitigations are more strategic, such as failing over the entire application to a secondary
region. Tactical mitigations can make a big difference. While it's rare for an entire region to experience a disruption,
transient problems such as network congestion are more common — so target these first. Having the right
monitoring and diagnostics is also important, both to detect failures when they happen, and to find the root causes.
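A tactical mitigation such as retrying a remote call might look like the following sketch. It assumes the failure can be recognized as transient (modeled here by a hypothetical `TransientError`) and uses exponential backoff with jitter, which is one common choice rather than the only option. The delay values are illustrative.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry(operation: Callable[[], T], max_attempts: int = 4,
          base_delay: float = 0.01,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry on TransientError with exponential backoff plus random jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the failure to the caller
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
    raise AssertionError("loop always returns or raises")


calls = {"count": 0}


def flaky_call() -> str:
    """Fails twice, then succeeds, to simulate a transient network fault."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("network congestion")
    return "ok"


# The sleep function is replaced so that the example runs instantly.
result = retry(flaky_call, sleep=lambda seconds: None)
```

Only errors identified as transient are retried; any other exception propagates immediately, which avoids repeating operations that cannot succeed.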
When designing an application to be resilient, you must understand your availability requirements. How much
downtime is acceptable? This is partly a function of cost. How much will potential downtime cost your business?
How much should you invest in making the application highly available?
Use the Resiliency checklist to review your design from a resiliency standpoint.
Resiliency guidance
Designing resilient applications for Azure
Design patterns for resiliency
Best practices: Transient fault handling, Retry guidance for specific services

Management and DevOps


This pillar covers the operations processes that keep an application running in production.
Deployments must be reliable and predictable. They should be automated to reduce the chance of human error.
They should be a fast and routine process, so they don't slow down the release of new features or bug fixes. Equally
important, you must be able to quickly roll back or roll forward if an update has problems.
Monitoring and diagnostics are crucial. Cloud applications run in a remote datacenter where you do not have full
control of the infrastructure or, in some cases, the operating system. In a large application, it's not practical to log
into VMs to troubleshoot an issue or sift through log files. With PaaS services, there may not even be a dedicated
VM to log into. Monitoring and diagnostics give insight into the system, so that you know when and where failures
occur. All systems must be observable. Use a common and consistent logging schema that lets you correlate events
across systems.
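One way to get a common logging schema that lets you correlate events is to emit structured records that all carry the same correlation ID. The sketch below uses Python's standard `logging` module; the field names, the service names, and the in-memory stream standing in for a log collector are assumptions made for illustration.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Emit every record with the same JSON fields, including a correlation ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "service": record.name,
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        })


stream = io.StringIO()  # stands in for a central log collector
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())


def make_logger(service: str) -> logging.Logger:
    """Create a logger for one service that writes to the shared handler."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False
    return logger


frontend = make_logger("frontend")
orders = make_logger("orders")

correlation_id = str(uuid.uuid4())  # generated once, at the edge of the system
frontend.info("request received", extra={"correlation_id": correlation_id})
orders.info("order stored", extra={"correlation_id": correlation_id})

events = [json.loads(line) for line in stream.getvalue().splitlines()]
```

Because both services log the same `correlation_id`, the events for a single request can be found with one query, whichever service emitted them.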
The monitoring and diagnostics process has several distinct phases:
Instrumentation. Generating the raw data, from application logs, web server logs, diagnostics built into the
Azure platform, and other sources.
Collection and storage. Consolidating the data into one place.
Analysis and diagnosis. Examining the data to troubleshoot issues and see the overall health.
Visualization and alerts. Using telemetry data to spot trends or alert the operations team.
Use the DevOps checklist to review your design from a management and DevOps standpoint.
Management and DevOps guidance
Design patterns for management and monitoring
Best practices: Monitoring and diagnostics

Security
You must think about security throughout the entire lifecycle of an application, from design and implementation to
deployment and operations. The Azure platform provides protections against a variety of threats, such as network
intrusion and DDoS attacks. But you still need to build security into your application and into your DevOps
processes.
Here are some broad security areas to consider.
Identity management
Consider using Azure Active Directory (Azure AD) to authenticate and authorize users. Azure AD is a fully managed
identity and access management service. You can use it to create domains that exist purely on Azure, or integrate
with your on-premises Active Directory identities. Azure AD also integrates with Office 365, Dynamics CRM Online,
and many third-party SaaS applications. For consumer-facing applications, Azure Active Directory B2C lets users
authenticate with their existing social accounts (such as Facebook, Google, or LinkedIn), or create a new user
account that is managed by Azure AD.
If you want to integrate an on-premises Active Directory environment with an Azure network, several approaches
are possible, depending on your requirements. For more information, see our Identity Management reference
architectures.
Protecting your infrastructure
Control access to the Azure resources that you deploy. Every Azure subscription has a trust relationship with an
Azure AD tenant. Use Role-Based Access Control (RBAC) to grant users within your organization the correct
permissions to Azure resources. Grant access by assigning an RBAC role to users or groups at a certain scope. The
scope can be a subscription, a resource group, or a single resource. Audit all changes to infrastructure.
Application security
In general, the security best practices for application development still apply in the cloud. These include things like
using SSL everywhere, protecting against CSRF and XSS attacks, preventing SQL injection attacks, and so on.
Cloud applications often use managed services that have access keys. Never check these into source control.
Consider storing application secrets in Azure Key Vault.
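As a minimal sketch of keeping access keys out of source control, the code below reads a secret from the process environment at run time. The variable name `STORAGE_ACCESS_KEY` is hypothetical, and the example sets it in-process only so that the snippet is self-contained. In a real deployment the value would be injected by the hosting environment or fetched from a secret store such as Key Vault.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is not configured."""


def get_secret(name: str) -> str:
    """Read a secret from the environment instead of hard-coding it in source."""
    value = os.environ.get(name)
    if not value:
        raise MissingSecretError(f"secret {name!r} is not configured")
    return value


# Demo only: a real deployment would never set the secret in code.
os.environ["STORAGE_ACCESS_KEY"] = "example-value-for-demo-only"
access_key = get_secret("STORAGE_ACCESS_KEY")
```

Failing fast when the secret is missing makes a misconfigured deployment obvious at startup rather than at the first request.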
Data sovereignty and encryption
Make sure that your data remains in the correct geopolitical zone when using Azure's highly available storage. Azure's
geo-replicated storage uses the concept of a paired region in the same geopolitical region.
Use Key Vault to safeguard cryptographic keys and secrets. By using Key Vault, you can encrypt keys and secrets by
using keys that are protected by hardware security modules (HSMs). Many Azure storage and DB services support
data encryption at rest, including Azure Storage, Azure SQL Database, Azure SQL Data Warehouse, and Cosmos DB.
Security resources
Azure Security Center provides integrated security monitoring and policy management across your Azure
subscriptions.
Azure Security Documentation
Microsoft Trust Center
Cloud Design Patterns
8/14/2017 • 6 min to read

These design patterns are useful for building reliable, scalable, secure applications in the cloud.
Each pattern describes the problem that the pattern addresses, considerations for applying the pattern, and an
example based on Microsoft Azure. Most of the patterns include code samples or snippets that show how to
implement the pattern on Azure. However, most of the patterns are relevant to any distributed system, whether
hosted on Azure or on other cloud platforms.

Challenges in cloud development


Availability
Availability is the proportion of time that the system is functional and working, usually measured as a percentage
of uptime. It can be affected by system errors, infrastructure problems, malicious attacks, and system load. Cloud
applications typically provide users with a service level agreement (SLA), so applications must be designed to
maximize availability.

Data Management
Data management is the key element of cloud applications, and influences most of the quality attributes. Data is
typically hosted in different locations and across multiple servers for reasons such as performance, scalability or
availability, and this can present a range of challenges. For example, data consistency must be maintained, and
data will typically need to be synchronized across different locations.

Design and Implementation


Good design encompasses factors such as consistency and coherence in component design and deployment,
maintainability to simplify administration and development, and reusability to allow components and subsystems
to be used in other applications and in other scenarios. Decisions made during the design and implementation
phase have a huge impact on the quality and the total cost of ownership of cloud hosted applications and services.

Messaging
The distributed nature of cloud applications requires a messaging infrastructure that connects the components
and services, ideally in a loosely coupled manner in order to maximize scalability. Asynchronous messaging is
widely used, and provides many benefits, but also brings challenges such as the ordering of messages, poison
message management, idempotency, and more.
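Idempotency, one of the messaging challenges listed above, can be handled by making the consumer ignore messages it has already processed. This sketch assumes each message carries a unique `id` field and keeps the set of seen IDs in memory. Both are simplifications, and a real consumer would persist that set alongside the data it updates.

```python
from typing import Dict, Set


class IdempotentConsumer:
    """Applies each message at most once, keyed by a message ID."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self.balances: Dict[str, int] = {}

    def handle(self, message: dict) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if message["id"] in self._seen:
            return False
        account = message["account"]
        self.balances[account] = self.balances.get(account, 0) + message["amount"]
        self._seen.add(message["id"])
        return True


consumer = IdempotentConsumer()
deposit = {"id": "msg-1", "account": "alice", "amount": 50}
first = consumer.handle(deposit)
second = consumer.handle(deposit)  # redelivery of the same message
```

The redelivered message is detected and skipped, so the balance is credited once even though the message arrived twice.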

Management and Monitoring


Cloud applications run in a remote datacenter where you do not have full control of the infrastructure or, in
some cases, the operating system. This can make management and monitoring more difficult than an on-premises
deployment. Applications must expose runtime information that administrators and operators can use to manage
and monitor the system, as well as supporting changing business requirements and customization without
requiring the application to be stopped or redeployed.
Performance and Scalability
Performance is an indication of the responsiveness of a system to execute any action within a given time interval,
while scalability is the ability of a system either to handle increases in load without impact on performance or for the
available resources to be readily increased. Cloud applications typically encounter variable workloads and peaks in
activity. Predicting these, especially in a multi-tenant scenario, is almost impossible. Instead, applications should be
able to scale out within limits to meet peaks in demand, and scale in when demand decreases. Scalability concerns
not just compute instances, but other elements such as data storage, messaging infrastructure, and more.
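Scaling out within limits and scaling in when demand drops can be expressed as a small decision rule. The sketch below uses a simple proportional rule based on CPU utilization. The target of 60% and the instance limits are hypothetical values, and this is not the algorithm of any particular autoscaling service.

```python
import math


def desired_instances(current: int, cpu_percent: float, target_cpu: float = 60.0,
                      min_instances: int = 2, max_instances: int = 10) -> int:
    """Scale in proportion to load, clamped to the configured limits."""
    wanted = math.ceil(current * cpu_percent / target_cpu)
    return max(min_instances, min(max_instances, wanted))


scale_out = desired_instances(current=4, cpu_percent=90.0)    # load above target
scale_in = desired_instances(current=4, cpu_percent=15.0)     # load well below target
at_ceiling = desired_instances(current=8, cpu_percent=95.0)   # capped by max_instances
```

The lower limit keeps enough instances for redundancy during quiet periods, and the upper limit bounds cost when load spikes.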

Resiliency
Resiliency is the ability of a system to gracefully handle and recover from failures. The nature of cloud hosting,
where applications are often multi-tenant, use shared platform services, compete for resources and bandwidth,
communicate over the Internet, and run on commodity hardware means there is an increased likelihood that both
transient and more permanent faults will arise. Detecting failures, and recovering quickly and efficiently, is
necessary to maintain resiliency.

Security
Security is the capability of a system to prevent malicious or accidental actions outside of the designed usage, and
to prevent disclosure or loss of information. Cloud applications are exposed on the Internet outside trusted on-
premises boundaries, are often open to the public, and may serve untrusted users. Applications must be designed
and deployed in a way that protects them from malicious attacks, restricts access to only approved users, and
protects sensitive data.

Catalog of patterns
PATTERN                         SUMMARY
Ambassador                      Create helper services that send network requests on behalf of a consumer service or application.
Anti-Corruption Layer           Implement a façade or adapter layer between a modern application and a legacy system.
Backends for Frontends          Create separate backend services to be consumed by specific frontend applications or interfaces.
Bulkhead                        Isolate elements of an application into pools so that if one fails, the others will continue to function.
Cache-Aside                     Load data on demand into a cache from a data store.
Circuit Breaker                 Handle faults that might take a variable amount of time to fix when connecting to a remote service or resource.
CQRS                            Segregate operations that read data from operations that update data by using separate interfaces.
Compensating Transaction        Undo the work performed by a series of steps, which together define an eventually consistent operation.
Competing Consumers             Enable multiple concurrent consumers to process messages received on the same messaging channel.
Compute Resource Consolidation  Consolidate multiple tasks or operations into a single computational unit.
Event Sourcing                  Use an append-only store to record the full series of events that describe actions taken on data in a domain.
External Configuration Store    Move configuration information out of the application deployment package to a centralized location.
Federated Identity              Delegate authentication to an external identity provider.
Gatekeeper                      Protect applications and services by using a dedicated host instance that acts as a broker between clients and the application or service, validates and sanitizes requests, and passes requests and data between them.
Gateway Aggregation             Use a gateway to aggregate multiple individual requests into a single request.
Gateway Offloading              Offload shared or specialized service functionality to a gateway proxy.
Gateway Routing                 Route requests to multiple services using a single endpoint.
Health Endpoint Monitoring      Implement functional checks in an application that external tools can access through exposed endpoints at regular intervals.
Index Table                     Create indexes over the fields in data stores that are frequently referenced by queries.
Leader Election                 Coordinate the actions performed by a collection of collaborating task instances in a distributed application by electing one instance as the leader that assumes responsibility for managing the other instances.
Materialized View               Generate prepopulated views over the data in one or more data stores when the data isn't ideally formatted for required query operations.
Pipes and Filters               Break down a task that performs complex processing into a series of separate elements that can be reused.
Priority Queue                  Prioritize requests sent to services so that requests with a higher priority are received and processed more quickly than those with a lower priority.
Queue-Based Load Leveling       Use a queue that acts as a buffer between a task and a service that it invokes in order to smooth intermittent heavy loads.
Retry                           Enable an application to handle anticipated, temporary failures when it tries to connect to a service or network resource by transparently retrying an operation that's previously failed.
Scheduler Agent Supervisor      Coordinate a set of actions across a distributed set of services and other remote resources.
Sharding                        Divide a data store into a set of horizontal partitions or shards.
Sidecar                         Deploy components of an application into a separate process or container to provide isolation and encapsulation.
Static Content Hosting          Deploy static content to a cloud-based storage service that can deliver them directly to the client.
Strangler                       Incrementally migrate a legacy system by gradually replacing specific pieces of functionality with new applications and services.
Throttling                      Control the consumption of resources used by an instance of an application, an individual tenant, or an entire service.
Valet Key                       Use a token or key that provides clients with restricted direct access to a specific resource or service.
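As a small illustration of one entry in the catalog, the following sketch shows the Cache-Aside idea of loading data on demand into a cache from a data store. The `CacheAside` class, the dictionary used as the data store, and the key names are hypothetical. Expiration and concurrent access are deliberately left out.

```python
from typing import Callable, Dict, List, Optional


class CacheAside:
    """Load data on demand into a cache from a data store."""

    def __init__(self, load_from_store: Callable[[str], Optional[str]]) -> None:
        self._load = load_from_store
        self._cache: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        if key in self._cache:          # cache hit
            return self._cache[key]
        value = self._load(key)         # cache miss: read from the store
        if value is not None:
            self._cache[key] = value    # populate the cache for next time
        return value

    def invalidate(self, key: str) -> None:
        """Call after updating the store so that stale data is not served."""
        self._cache.pop(key, None)


data_store = {"user:1": "Ada"}
store_reads: List[str] = []


def load(key: str) -> Optional[str]:
    """Hypothetical data store read that records each access."""
    store_reads.append(key)
    return data_store.get(key)


cache = CacheAside(load)
first = cache.get("user:1")      # miss: reads the store
second = cache.get("user:1")     # hit: served from the cache
data_store["user:1"] = "Grace"   # the store is updated...
cache.invalidate("user:1")       # ...so the cached entry is dropped
third = cache.get("user:1")      # miss again: reads the new value
```

The store is read only on a miss, and invalidating the entry after an update keeps the cache from serving the old value.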
