Compliments of

Building Real-Time Data Pipelines
Unifying Applications and Analytics with In-Memory Architectures
"MemSQL delivers on the promise of providing data faster. Our clients see the benefits not only through our reporting systems, but more importantly within our real-time decisioning."
Mike Zacharski, Chief Operating Officer at CPXi
Building Real-Time
Data Pipelines
Unifying Applications and Analytics
with In-Memory Architectures
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Building Real-Time Data Pipelines, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the
information and instructions contained in this work are accurate, the publisher and
the authors disclaim all responsibility for errors or omissions, including without
limitation responsibility for damages resulting from the use of or reliance on this
work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your
responsibility to ensure that your use thereof complies with such licenses and/or
rights.
978-1-491-93547-7
[LSI]
Table of Contents

Introduction

5. Spark
    Background
    Characteristics of Spark
    Understanding Databases and Spark
    Other Use Cases
    Conclusion

10. Conclusion
    Recommended Next Steps
Introduction
Imagine you had a time machine that could go back one minute, or an hour. Think about what you could do with it. From the perspective of other people, it would seem like there was nothing you couldn't do, no contest you couldn't win.
In the real world, there are three basic ways to win. One way is to
have something, or to know something, that your competition does
not. Nice work if you can get it. The second way to win is to simply
be more intelligent. However, the number of people who think they
are smarter is much larger than the number of people who actually
are smarter.
The third way is to process information faster so you can make and
act on decisions faster. Being able to make more decisions in less
time gives you an advantage in both information and intelligence. It
allows you to try many ideas, correct the bad ones, and react to
changes before your competition. If your opponent cannot react as
fast as you can, it does not matter what they have, what they know,
or how smart they are. Taken to extremes, it's almost like having a time machine.
An example of the third way can be found in high-frequency stock trading. Every trading desk has access to a large pool of highly intelligent people, and pays them well. All of the players have access to
the same information at the same time, at least in theory. Being
more or less equally smart and informed, the most active area of
competition is the end-to-end speed of their decision loops. In
recent years, traders have gone to the trouble of building their own
wireless long-haul networks, to exploit the fact that microwaves
move through the air 50% faster than light can pulse through fiber
optics. This allows them to execute trades a crucial millisecond
faster.
Finding ways to shorten end-to-end information latency is also a
constant theme at leading tech companies. They are forever working
to reduce the delay between something happening out there in the
world or in their huge clusters of computers, and when it shows up
on a graph. At Facebook in the early 2010s, it was normal to wait
hours after pushing new code to discover whether everything was
working efficiently. The full report came in the next day. After building their own distributed in-memory database and event pipeline, their information loop is now on the order of 30 seconds, and they
push at least two full builds per day. Instead of slowing down as they
got bigger, Facebook doubled down on making more decisions
faster.
What is your system's end-to-end latency? How long is your decision loop, compared to the competition's? Imagine you had a system that was twice as fast. What could you do with it? This might be the most important question for your business.
In this book we'll explore new models of quickly processing information end to end, enabled by long-term hardware trends, lessons from some of the largest and most successful tech companies, and surprisingly powerful ideas that have survived the test of time.
Carlos Bueno
Principal Product Manager at MemSQL,
author of The Mature Optimization Handbook
and Lauren Ipsum
CHAPTER 1
When to Use In-Memory Database
Management Systems (IMDBMS)
Online Transaction Processing (OLTP)
OLTP workloads are characterized by a high volume of low-latency operations that touch relatively few records. OLTP performance is bottlenecked by random data access: how quickly the system finds a given record and performs the desired operation. Conventional databases can capture moderate transaction levels, but trying to query the data simultaneously is nearly impossible. That has led to a range of separate systems focusing on analytics more than transactions. These online analytical processing (OLAP) solutions complement OLTP solutions.
However, in-memory solutions can increase OLTP transactional throughput; each transaction, including the mechanisms to persist the data, is accepted and acknowledged faster than in a disk-based solution. This speed enables OLTP and OLAP systems to converge in a hybrid, or HTAP, system.
When building real-time applications, being able to quickly store more data in memory sets a foundation for unique digital experiences such as a faster and more personalized mobile application, or a richer set of data for business intelligence.
Data latency is the time it takes from when data enters a pipeline to when it is queryable.
Query latency is how quickly you can get answers to your questions, in order to generate reports faster.
Traditionally, OLAP has not been associated with operational workloads. The "online" in OLAP refers to interactive query speed, meaning an analyst can send a query to the database and it returns in some reasonable amount of time (as opposed to a long-running job that may take hours or days to complete). However, many modern applications rely on real-time analytics for things like personalization, and traditional OLAP systems have been unable to meet this need. Addressing this kind of application requires rethinking the traditional tradeoff between data latency and query latency:
If you want to load data very quickly, but only query for basic results, you can use a stream processing framework.
And if you want fast queries but are able to take your time loading data, many columnar databases or data warehouses can fit that bill.
However, rapidly emerging workloads are no longer served by any of the traditional options, which is where new HTAP-optimized architectures provide a highly desirable solution. HTAP represents a combination of low data latency and low query latency, and is delivered via an in-memory database. Reducing both latency variables with a single solution enables new applications and real-time data pipelines across industries.
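To make the idea concrete, here is a minimal sketch of an HTAP-style workload: one table absorbs a low-latency write stream while an analytical query runs over the same, up-to-the-moment rows. The table, connection details, and use of the MySQL-compatible pymysql driver (MemSQL speaks the MySQL wire protocol) are illustrative assumptions, not something prescribed by the text, and the snippet assumes a running database instance.

    # HTAP sketch: transactional inserts and an analytical aggregate against
    # the same table, with no ETL step in between. Names are illustrative.
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", user="root", password="", database="demo")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS page_views (
            ts DATETIME, user_id BIGINT, url VARCHAR(2048)
        )
    """)

    # Transactional side: a low-latency single-row insert (the OLTP stream).
    cur.execute("INSERT INTO page_views VALUES (NOW(), %s, %s)", (42, "/pricing"))
    conn.commit()

    # Analytical side: an aggregate over the same table, accurate to the last
    # committed insert, so data latency is effectively zero.
    cur.execute("""
        SELECT url, COUNT(*) AS views
        FROM page_views
        WHERE ts > NOW() - INTERVAL 5 MINUTE
        GROUP BY url ORDER BY views DESC LIMIT 10
    """)
    print(cur.fetchall())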
Modern Workloads
Near-ubiquitous Internet connectivity now drives modern workloads and a corresponding set of unique requirements. Database systems must have the following characteristics:
Ingest and process data in real time
In many companies, it has traditionally taken a full day from when data is born to when it is understood, analyzed, and usable to analysts. Now companies want to do this in real time.
Generate reports over changing datasets
The generally accepted standard today is to collect data during the day, then run a four- to six-hour process that produces an OLAP cube or materialized reports to give analysts faster access. Today, companies instead expect queries to run on changing datasets, with results accurate to the last transaction.
Anomaly detection as events occur
The time to react to an event can directly correlate with the financial health of a business. For example, quickly understanding unusual trades in financial markets, intruders on a corporate network, or the metrics for a manufacturing process can help companies avoid massive losses (see the sketch after this list).
Subsecond response times
When corporations get access to fresh data, its popularity rises across hundreds to thousands of analysts. Handling that serving workload requires memory-optimized systems.
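As a hedged illustration of the anomaly-detection requirement above, the sketch below flags readings that deviate sharply from a rolling window of recent values. The window size, threshold, and data source are all hypothetical; a real deployment would feed it from the live event stream.

    # Flag a value that sits far outside the rolling mean of recent values.
    # A teaching sketch: thresholds and the input stream are assumptions.
    from collections import deque
    from statistics import mean, stdev

    WINDOW, Z_LIMIT = 100, 4.0
    recent = deque(maxlen=WINDOW)

    def check(value: float) -> bool:
        """Return True if `value` is anomalous relative to the recent window."""
        anomalous = False
        if len(recent) >= 30:  # require enough history for a stable baseline
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(value - mu) > Z_LIMIT * sigma:
                anomalous = True
        recent.append(value)
        return anomalous

    # Feed it a stream; only the spike is reported.
    for v in [10.1, 9.8, 10.3] * 20 + [97.0]:
        if check(v):
            print("anomaly:", v)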
Real-Time Analytics
Agile businesses need to implement tight operational feedback loops so decision makers can refine strategies quickly. In-memory databases support rapid iteration by removing conventional database bottlenecks.
Risk Management
Successful companies must be able to quantify and plan for risk. Risk calculations require aggregating data from many sources, and companies need the ability to calculate present risk while also running ad hoc future planning scenarios.
In-memory solutions calculate volatile metrics frequently for more granular risk assessment, and can ingest millions of records per second without blocking analytical queries. These solutions also serve the results of risk calculations to hundreds of thousands of concurrent users.
Personalization
Today's users expect tailored experiences, and publishers, advertisers, and retailers can drive engagement by targeting recommendations based on users' history and demographic information. Personalization shapes the modern web experience. Building applications to deliver these experiences requires a real-time database to perform segmentation and attribution at scale.
In-memory architectures scale to support large audiences, converge a system of record with a system of insight for tighter feedback loops, and eliminate costly pre-computation with the ability to capture and analyze data in real time.
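A minimal sketch of segmentation on live data follows; the events table, the "engaged user" rule, and the driver are assumptions made for illustration, not a prescribed schema.

    # Hypothetical segmentation query: find the users most active in the last
    # hour so an application can target them while the session is still live.
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", user="root", password="", database="demo")
    cur = conn.cursor()
    cur.execute("""
        SELECT user_id, COUNT(*) AS actions
        FROM events
        WHERE ts > NOW() - INTERVAL 1 HOUR
        GROUP BY user_id
        HAVING actions >= 5   -- illustrative "engaged user" threshold
        ORDER BY actions DESC
        LIMIT 100
    """)
    engaged_users = [row[0] for row in cur.fetchall()]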
Portfolio Tracking
Financial assets and their values change in real time, and reporting dashboards and tools must similarly keep up. HTAP and in-memory systems converge transactional and analytical processing so portfolio value computations are accurate to the last trade.
Now users can update reports more frequently to recognize and capitalize on short-term trends, provide a real-time serving layer to thousands of analysts, and view real-time and historical data through a single interface (Figure 1-1).
Conclusion
In the early days of databases, systems were designed to focus on each individual transaction and treat it as an atomic unit (for example, the debit and credit for accounting, the movement of physical inventory, or the addition of a new employee to payroll). These critical transactions move the business forward and remain a cornerstone of systems of record.
Yet a new model is emerging where the aggregate of all the transactions becomes critical to understanding the shape of the business (for example, the behavior of millions of users across a mobile phone application, the input from sensor arrays in Internet of Things (IoT) applications, or the clicks measured on a popular website). These modern workloads represent a new era of transactions and analytics. For many
users, it might mean crunching through real-time and historical
data simultaneously to derive insight on critical business decisions.
In-Memory
Memory, specifically RAM, provides speeds hundreds of times faster than typical solid state drives with flash, and thousands of times faster than rotating disk drives made with magnetic media. As such, RAM is likely to remain the sweet spot for in-memory processing as a primary media type. That does not preclude incorporating combinations of RAM, flash, and disk, as discussed later in this section.
But there are multiple ways to deploy RAM for in-memory databases, providing different levels of flexibility. In-memory approaches generally fit into three categories: memory after, memory only, and memory optimized (Figure 2-1). What distinguishes these approaches is where the database stores active data in its primary format.
Memory after
Memory-after architectures typically retain the legacy path of committing transactions directly to disk, then quickly staging them into memory afterward. This approach provides speed after the fact, but does not account for rapid ingest.
Memory only
A memory-only approach exclusively uses memory, and provides no native capability to incorporate other media types such as flash or disk. Memory-only databases provide performance for smaller datasets, but fail to account for the large data volumes common in today's workloads, and therefore provide limited functionality.
Memory optimized
Memory-optimized architectures allow for the capture of massive ingest streams by committing transactions to memory first, then persisting to flash or disk. Options exist, of course, to commit every transaction to persistent media. Memory-optimized approaches allow all data to remain in RAM for maximum performance, but also allow data to be stored on disk or flash where that makes sense for a combination of high volume and cost-effectiveness.
SQL
While many distributed solutions discarded SQL in their early days (consider the entire NoSQL market), they are now implementing SQL as a layer for analytics. In essence, they are reimplementing features that have existed in relational databases for many years.
A native SQL implementation will also support full transactional SQL, including inserts, updates, and deletes, which makes it easy to build applications. SQL is also the universal language for interfacing with common business intelligence tools.
Other models
As universal as SQL may be, there are times when it helps to have other models (Figure 2-2). JavaScript Object Notation (JSON) supports semi-structured data. Another relevant data type is geospatial, an essential part of the mobile world, as today every data point has a location.
Completing the picture for additional data models is Spark, a popular data processing framework that incorporates a set of rich programming libraries. In-memory databases that extend to and incorporate Spark can provide immediate access to this functionality.
Mixed Media
Understandably, not every piece of data requires in-memory placement forever. As data ages, retention still matters, but there is typically a higher tolerance for waiting a bit longer for results. Therefore it makes sense for any in-memory database architecture to natively incorporate alternate media types like disk or flash.
One method of incorporating disk or flash with in-memory databases is through columnar storage formats. Disk-based data warehousing solutions typically deploy column-based formats, and these can also be integrated with in-memory database solutions.
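As a sketch of how this pairing can look in practice, the snippet below creates an in-memory rowstore table for hot data alongside a disk-backed columnstore table for history. The DDL is modeled on MemSQL's syntax; the table names are illustrative and the exact clause may vary by product and version.

    # Hypothetical pairing of an in-memory rowstore (hot, recent data) with a
    # disk/flash columnstore (cold, historical data). DDL modeled on MemSQL.
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", user="root", password="", database="demo")
    cur = conn.cursor()

    # Rowstore: the default format, kept in RAM for low-latency reads/writes.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS trades_recent (
            ts DATETIME, symbol VARCHAR(8), price DOUBLE, qty INT
        )
    """)

    # Columnstore: the clustered columnstore key places data on disk/flash in
    # a compressed, scan-friendly columnar format for historical analytics.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS trades_history (
            ts DATETIME, symbol VARCHAR(8), price DOUBLE, qty INT,
            KEY (ts) USING CLUSTERED COLUMNSTORE
        )
    """)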
Conclusion
As with choices in the overall database market, in-memory solutions span a wide range of offerings, with the common theme of memory as a vehicle for speed and agility. However, an in-memory approach is fundamentally different from a traditional disk-based approach, and requires a fresh look at longstanding challenges.
Powerful solutions will not only deliver maximum scale and performance, but will also retain enterprise approaches such as SQL and relational architectures, support application friendliness with flexible schemas, and facilitate integration into the vibrant data ecosystem.
CHAPTER 3
Moving from Data Silos to
Real-Time Data Pipelines
Traditionally, data is batch-loaded from operational systems into an analytical database; whatever the rate of loading, this is not an online operation and runs overnight or at the end of the week.
The challenge with this approach is that fresh, real-time data does not make it to the analytical database until a batch load runs. Suppose you wanted to build a system for optimizing display advertising performance by selecting ads that have performed well recently. This application has a transactional component, recording the impression and charging the advertiser for it, and an analytical component, running a query that selects possible ads to show to a user, ordered by some conversion metric over the past x minutes or hours.
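A minimal sketch of those two components, under assumed table and column names, might look like this; both calls have to finish within a page load, which is exactly what siloed systems struggle to deliver.

    # Illustrative ad-serving flow: pick candidate ads by recent conversion
    # rate (analytical), then record and bill the impression (transactional).
    # The schema and the conversion metric are assumptions for the sketch.
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", user="root", password="", database="ads")
    cur = conn.cursor()

    def serve_ad(user_id: int) -> int:
        # Analytical component: rank ads by conversions per impression over
        # the last 15 minutes, accurate to the latest events.
        cur.execute("""
            SELECT ad_id, SUM(converted) / COUNT(*) AS conv_rate
            FROM impressions
            WHERE ts > NOW() - INTERVAL 15 MINUTE
            GROUP BY ad_id
            ORDER BY conv_rate DESC
            LIMIT 1
        """)
        ad_id, _rate = cur.fetchone()

        # Transactional component: record the impression and bill for it.
        cur.execute("INSERT INTO impressions (ts, ad_id, user_id, converted) "
                    "VALUES (NOW(), %s, %s, 0)", (ad_id, user_id))
        cur.execute("UPDATE ads SET spend = spend + cost_per_view "
                    "WHERE ad_id = %s", (ad_id,))
        conn.commit()
        return ad_id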
In a legacy system with data silos, users can only analyze ad impressions that have been loaded into the data warehouse. Moreover, many data warehouses are not designed around the low-latency requirements of a real-time application. They are meant for business analysts querying interactively, rather than for computing programmatically generated queries in the time it takes a web page to load.
On the other side, the OLTP database should be able to handle the transactional component but, depending on the load on the database, probably will not be able to execute the analytical queries simultaneously. Legacy OLTP databases, especially those that use disk as the primary storage medium, are not designed for and generally cannot handle mixed OLTP/OLAP workloads.
This example of real-time display ad optimization demonstrates the fundamental flaw in the legacy data processing model. Both the transactional and analytical components of the application must complete in the time it takes the page to load and, ideally, take into account the most recent data. As long as data remains siloed, this is impossible.
Conclusion
There is more to the notion of a real-time data pipeline than "what we had before, but faster." Rather, the shift from data silos to pipelines represents a shift in thinking about business opportunities. More than just being faster, a real-time data pipeline eliminates the distinction between real-time and historical data, such that analytics can inform business operations in real time.
CHAPTER 4
Processing Transactions and
Analytics in a Single Database
In-Memory Storage
Storing data in memory allows reads and writes to occur orders of magnitude faster than on disk. This is especially valuable for running concurrent transactional and analytical workloads, as it alleviates bottlenecks caused by disk contention. In-memory operation is necessary for converged processing, as no purely disk-based system can deliver the input/output (I/O) required with any reasonable amount of hardware.
Access to Real-Time and Historical Data
In addition to speed, converged processing requires the ability to compare real-time data to statistical models and aggregations of historical data. To do so, a database must be designed to facilitate two kinds of workloads: (1) high-throughput operational transactions and (2) fast analytical queries. With two powerful storage engines, real-time and historical data can be converged into one database platform and made available through a single interface.
Simplifying Infrastructure
By serving as both a database of record and an analytical warehouse, a hybrid database can significantly simplify an organization's data processing infrastructure by functioning as the single source for day-to-day operational workloads.
There are many advantages to maintaining a simple computing
infrastructure:
Increased uptime
A simple infrastructure has fewer potential points of failure, resulting in fewer component failures and easier problem diagnosis.
Reduced latency
There is no way to avoid latency when transferring data between data stores. Data transfer necessitates ETL, which is time consuming and introduces opportunities for error. The simplified computing structure of a converged processing database forgoes the entire ETL process.
Synchronization
With a hybrid database architecture, drill-down from analytic aggregates always points to the most recent application data. Contrast that with traditional database architectures, where analytical and transactional data is siloed. That separation requires a cumbersome synchronization process and increases the likelihood that the analytics copy of the data will be stale, providing a false representation of it.
Conclusion
Many innovative organizations are already proving that access to real-time analytics, and the ability to power applications with real-time data, brings a substantial competitive advantage. For businesses to support emerging trends like the Internet of Things and the high expectations of users, they will have to operate in real time. To do so, they will turn to converged data processing, as it offers the ability to forgo ETL and simplify database architecture.
CHAPTER 5
Spark
Background
Apache Spark is an open source cluster computing framework originally developed at UC Berkeley in the AMPLab. Spark is a fast and flexible alternative to both stream and batch processing systems like Storm and MapReduce, and can be integrated as a part of batch processing, stream processing, machine learning, and more. A recent survey of 2,100 developers revealed that 82% would choose Spark to replace MapReduce.
Characteristics of Spark
Spark is a versatile distributed data processing engine, providing a rich language for data scientists to explore data. It comes with an ever-growing suite of libraries for analytics and stream processing.
Spark Core consists of a programming interface and a distributed execution environment. On top of this core platform, the Spark developer community has built several libraries, including Spark Streaming, MLlib (for machine learning), Spark SQL, and GraphX (for graph analytics) (Figure 5-1). As of version 1.3, Spark SQL was repackaged as the DataFrame API. Beyond acting as a SQL server, the DataFrame API is meant to provide a general-purpose library for manipulating structured data.
Figure 5-1. Spark data processing framework
The Spark execution engine keeps data in memory and has the ability to schedule jobs distributed over many nodes. Integrating Spark with other in-memory systems, like an in-memory database, facilitates efficient and quick operations.
By design, Spark is stateless: there is no persistent data storage. As such, Spark relies on other systems for serving, storing, and tracking changes to data. Spark can be used with a variety of external storage options, most commonly databases and filesystems. Different external data stores suit different use cases.
Augmenting Spark with a real-time operational database opens a wide array of new use cases. With this setup, Spark can access live production data, and result sets from Spark can immediately be put to use in the database to support mission-critical applications. Pairing Spark with a real-time database enables companies to go from a static view to a dynamic view of operational metrics.
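As a sketch of this pairing, using the modern PySpark API and assuming a MySQL-wire-compatible in-memory database reachable over JDBC with an illustrative transactions table, a Spark job might read live rows, aggregate them, and write a result table back for the application to serve:

    # Hypothetical Spark <-> operational database round trip: read live data
    # over JDBC, aggregate in Spark, write the result back to the database.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("ops-metrics").getOrCreate()

    jdbc_url = "jdbc:mysql://127.0.0.1:3306/demo"   # illustrative endpoint
    props = {"user": "root", "password": "", "driver": "com.mysql.jdbc.Driver"}

    # Read the live table into a DataFrame.
    txns = spark.read.jdbc(jdbc_url, table="transactions", properties=props)

    # Aggregate: revenue per product over the current data.
    revenue = (txns.groupBy("product_id")
                   .agg(F.sum("amount").alias("revenue")))

    # Push the result back so the database can serve it to applications.
    revenue.write.jdbc(jdbc_url, table="product_revenue",
                       mode="overwrite", properties=props)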
Spark's distributed, in-memory execution environment is one of its core innovations. In-memory data processing eliminates the disk I/O bottleneck, and the distributed architecture reduces CPU contention by enabling parallelized execution. Using Spark with a disk-optimized or single-server database offsets the benefits of the Spark architecture (Figure 5-2).
Conclusion
Spark is an exciting technology that is changing the way businesses process and analyze data. More broadly, it reflects the trend toward scale-out, memory-optimized data processing systems.
CHAPTER 6
Architecting Multipurpose
Infrastructure
Running separate specialized systems typically requires additional custom code for synchronizing data between the separate stores.
While introducing additional specialized systems may solve problems in the short run, over time the cost of complexity adds up. This chapter covers trends in modern data processing systems that allow greater flexibility and more streamlined infrastructure: multimodal systems, multimodel systems, and tiered storage.
Multimodal Systems
Multimodal refers to a system with multiple modes of operation.
Commonly this refers to databases that support OLTP and OLAP
workloads, but it could also include stream processing or complex
event processing. The OLTP/OLAP example is the best understood
and most represented in the market, and is discussed in greater
depth in Chapter 4.
One point to consider when evaluating multimodal systems is whether the system can operate in both modes simultaneously. For instance, many databases both support transaction processing and offer analytic query functionality, yet their concurrency model effectively prevents the database from doing both simultaneously.
Multimodel Systems
Multimodel refers to a system that supports multiple data models. A data model specifies how data is logically organized, and generally affects how data is serialized, stored, and queried. For example, most developers are familiar with the relational model, which represents data as keyed tuples (rows) of typed attributes (columns), grouped into tables.
Figure 6-1. In this example, click_stream is a JSON column
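To ground the figure, here is a hedged sketch of a table with a JSON click_stream column and a query that reaches into it. The ::$ extraction operator is modeled on MemSQL's JSON syntax (other databases use JSON_EXTRACT or ->> instead), and the schema is illustrative.

    # Hypothetical table with a semi-structured JSON column, queried with
    # ordinary SQL. Extraction syntax modeled on MemSQL; names illustrative.
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", user="root", password="", database="demo")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS clicks (
            user_id BIGINT,
            click_stream JSON
        )
    """)
    cur.execute(
        "INSERT INTO clicks VALUES (%s, %s)",
        (42, '{"page": "/pricing", "referrer": "search", "ms_on_page": 8140}'),
    )
    conn.commit()

    # Filter on a field inside the JSON document.
    cur.execute("""
        SELECT user_id, click_stream::$page
        FROM clicks
        WHERE click_stream::$referrer = 'search'
    """)
    print(cur.fetchall())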
Tiered Storage
Increasingly, modern data stores support multiple storage media, including DRAM (main memory), flash, and spinning disk. DRAM has established itself as the medium of choice for fast read and write access, especially for OLTP workloads. However, despite drops in the price of memory in recent years, it is not feasible for most companies to store very large datasets entirely in DRAM.
To address this common concern, some modern data stores offer multiple storage options spread across different media, as in Figure 6-2. For example, some databases allow the user to transparently specify which data resides in memory or on disk on a per-table basis. Other databases support multiple storage media, but do not transparently expose those options to the user.
Note that storing some data in memory and some on flash or disk is not necessarily the same as tiered storage. For instance, some vendors have added in-memory analytical caches on top of their existing disk-based offerings. An in-memory analytical cache can accelerate query execution, but does not provide true storage tiering, since the in-memory copy of data is redundant.
Conclusion
In data processing infrastructure, simplicity and efficiency go hand in hand. Every system in a pipeline adds connection points, data transfer, and different data formats and APIs. While there is no single system that can manage all data processing needs for a modern enterprise, it is important to select versatile tools that allow a business to limit infrastructure complexity and to build efficient, resilient data pipelines.
A problem with the best of breed approach is that the best of breed solution for one usage scenario does not integrate well with the best of breed solution for your other usage scenarios. Their APIs don't
play nicely together, their data models are very different, or they have vastly different interfaces, such that you have to train your organization multiple times to use them. The best of breed approach is also not maintainable over time unless you have strictly defined interfaces between your systems. Many companies end up resorting to middleware solutions to integrate the sea of disparate systems, effectively adding another piece of software on top of their growing array of solutions.
The other way companies think about operational systems is consolidation. With this approach, you choose the smallest number of software solutions that covers the largest share of your use cases. The best of breed school would argue that this causes vendor lock-in and over-reliance on one solution that may become more expensive over time. That argument, however, really only applies to software solutions with proprietary interfaces that are not transferrable to other systems. A counterexample is a SQL-based relational database using freely available client drivers. Enterprises should choose solutions that use interfaces where knowledge about their usage is generally available and widely applicable, and that can handle a vast number of use cases. Consolidating your enterprise around such systems reduces vendor lock-in, allows you to use fewer systems to do more things, and makes maintenance over time much easier than the best of breed alternative. This is not to say that the ideal enterprise architecture would be to use only one system; that is unrealistic. Enterprises should, however, seek to consolidate software solutions when appropriate.
Conclusion
Modern technology makes it possible for enterprises to build the ideal operational system. To develop an optimally architected operational system, enterprises should look to use fewer systems doing more, to use systems that allow programmatic decision making on both real-time and historical data, and to use systems that allow fast ad hoc reporting on live data.
Figure 8-1. In-memory database persistence and high availability
Data Durability
For data storage to be durable, it must survive a server failure. After the failure, the data should be recoverable into a transactionally consistent state without any data loss or corruption. In-memory databases guarantee this by periodically flushing snapshots of the in-memory store to a durable copy on disk, maintaining transaction logs, and replaying the snapshot and transaction logs upon server restart.
It is easier to understand data durability in an in-memory database through a specific scenario. Suppose a database application inserts a new record into a database. Once a commit is issued, the following events occur:
1. The new record is applied to the in-memory store.
2. The transaction is appended to a log on durable media and flushed.
3. The commit is acknowledged back to the application.
In the background, periodic snapshots compact the log; on restart, the database replays the latest snapshot plus the transaction log to recover a consistent state.
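The snapshot-plus-log recipe is easy to see in miniature. Below is a toy write-ahead-log sketch, a teaching aid rather than any particular product's implementation: commits are appended and fsynced before being acknowledged, and recovery replays the log.

    # Toy write-ahead log illustrating the durability recipe described above:
    # append and fsync the log before acknowledging, replay the log on restart.
    import json
    import os

    LOG_PATH = "wal.log"
    store = {}  # the "in-memory store"

    def commit(key: str, value: str) -> None:
        # 1. Append the transaction to the log and force it to durable media.
        with open(LOG_PATH, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Apply it to the in-memory store.
        store[key] = value
        # 3. Only now is the commit acknowledged to the caller.

    def recover() -> None:
        # Replay the log to rebuild a transactionally consistent state.
        if os.path.exists(LOG_PATH):
            with open(LOG_PATH) as log:
                for line in log:
                    rec = json.loads(line)
                    store[rec["key"]] = rec["value"]

    recover()
    commit("user:42", "active")
    print(store)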
Data Availability
Most of the time, the requirements around data loss in a database are not about the data remaining fully durable on a single machine. The requirements are about the data remaining available and up to date in the system as a whole. In other words, in a multimachine system it is perfectly fine for data to be lost on one of the machines, as long as the data is still persisted somewhere in the system and queries still return a transactionally consistent result. This is where high availability comes in. For data to be highly available, it must be queryable from the system despite failures of some of the machines in the system.
It is easier to understand high availability through a specific scenario. In a distributed system, any number of machines can fail. If a failure occurs, the system should detect it, transparently redirect queries to another machine holding an up-to-date copy of the affected data, and re-create a redundant copy to restore full protection.
A distributed database system that guarantees high availability also has mechanisms for maintaining at least two copies of the data on different machines at all times. These copies must be kept fully in sync while the database is online, through proper database replication. Distributed databases have settings for controlling network timeouts and data window sizes for replication.
A distributed database system is also very robust. Failures of its different components are mostly recoverable, and machines are automatically added into the distributed database efficiently, without loss of service or much degradation of performance.
Finally, distributed databases should also allow replication of data across wide distances, typically to an offsite disaster recovery center. This process is called cross-datacenter replication, and is provided by most in-memory, distributed SQL databases.
Data Backups
In addition to providing data durability and high availability, databases also provide ways to manually or programmatically create backups. Creating a backup is typically done by issuing a command that immediately creates on-disk copies of the current state of the database. These backups can later be restored into an existing or new database instance for historical analysis, or kept for long-term storage.
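As an illustration, issuing a backup programmatically can be as small as the following; the BACKUP DATABASE statement is modeled on MemSQL's syntax, and the database name and path are placeholders.

    # Hypothetical programmatic backup: one statement snapshots the current
    # state of the database to disk. Syntax modeled on MemSQL; path illustrative.
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", user="root", password="")
    cur = conn.cursor()
    cur.execute("BACKUP DATABASE demo TO '/backups/demo'")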
Conclusion
Databases should always provide persistence and high availability mechanisms for their data, and enterprises should only consider databases that provide this functionality for their mission-critical systems. In-memory SQL databases available today provide these guarantees through mechanisms for data durability (snapshots, transaction logs), data availability (master/slave data copies, replication), and data backups.
Bare metal deployments can also provide more cost-effective operation in the long run if the dataset and size remain relatively predictable.
Bare metal environments are mostly complemented by on-premises
deployments, and in some cases cloud providers offer bare metal
deployments.
Orchestration Frameworks
With the recent proliferation of container-based solutions like Docker, many companies are choosing orchestration frameworks such as Mesos or Kubernetes to manage these deployments. Database architects seeking the most flexibility should evaluate these options; they can help when deploying different systems that need to interact with each other, for example, a messaging queue, a transformation tier, and an in-memory database.
Control
On-premises database systems provide the highest level of control over data processing and performance. The physical systems are all dedicated to their owner, as opposed to being shared on a cloud infrastructure. This avoids being relegated to a lowest common denominator of performance, and instead allows fine-tuned assignment of resources for performance-intensive applications.
Security
If your data is private or highly regulated, an on-premises database infrastructure may be the most straightforward option. Financial and government services and healthcare providers handle sensitive customer data according to complex regulations that are often more easily addressed in a dedicated on-site infrastructure.
RAM
When working with high-value, transactional data, RAM is the best option. RAM is orders of magnitude faster than SSD, and enables real-time processing and analytics on a changing dataset. For organizations with real-time data requirements, high-value data is kept in memory for a specified period of time and later moved to disk for historical analytics.
Deployment Conclusions
Perhaps the only certainty with computer systems is that things are likely to change. As applications evolve and data requirements expand, architects need to ensure that they can adapt rapidly.
CHAPTER 10
Conclusion
Explore leveraging open source frameworks such as Apache Kafka and Apache Spark to streamline data pipelines and enrich data for analysis.
Select a vendor and run a proof of concept that puts your use case(s) to the test.
Go to production at a manageable scale to validate the value of real-time analytics or applications.
There's no getting around the fact that the world is moving toward operating in real time. For your business, the ability to analyze and react to incoming data can be the upper hand that makes the difference between growth and stagnation. With technology advances such as in-memory computing and distributed systems, it's entirely possible to implement a cost-effective, high-performance data processing model that enables your business to operate at the pace and scale of incoming data. The question is: are you up for the challenge?