Streaming Change Data Capture
A Foundation for Modern Data Architectures
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Streaming Change Data Capture,
the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsi‐
bility for errors or omissions, including without limitation responsibility for damages resulting from
the use of or reliance on this work. Use of the information and instructions contained in this work is
at your own risk. If any code samples or other technology this work contains or describes is subject
to open source licenses or the intellectual property rights of others, it is your responsibility to ensure
that your use thereof complies with such licenses and/or rights.
This work is part of a collaboration between O’Reilly and Attunity. See our statement of editorial
independence.
978-1-492-03249-6
Table of Contents
Acknowledgments
Prologue
4. Case Studies
   Case Study 1: Streaming to a Cloud-Based Lambda Architecture
   Case Study 2: Streaming to the Data Lake
   Case Study 3: Streaming, Data Lake, and Cloud Architecture
   Case Study 4: Supporting Microservices on the AWS Cloud Architecture
   Case Study 5: Real-Time Operational Data Store/Data Warehouse
7. Conclusion
Acknowledgments
Experts more knowledgeable than we are helped to make this book happen. First,
of course, are numerous enterprise customers in North America and Europe,
with whom we have the privilege of collaborating, as well as Attunity’s talented
sales and presales organization. Ted Orme, VP of marketing and business devel‐
opment, proposed the idea for this book based on his conversations with many
customers. Other valued contributors include Jordan Martz, Ola Mayer, Clive
Bearman, and Melissa Kolodziej.
Prologue
There is no shortage of hyperbolic metaphors for the role of data in our modern
economy—a tsunami, the new oil, and so on. From an IT perspective, data flows
might best be viewed as the circulatory system of the modern enterprise. We
believe the beating heart is change data capture (CDC) software, which identifies,
copies, and sends live data to its various users.
Although many enterprises are modernizing their businesses by adopting CDC,
there remains a dearth of information about how this critical technology works,
why modern data integration needs it, and how leading enterprises are using it.
This book seeks to close that gap. We hope it serves as a practical guide for enter‐
prise architects, data managers, and CIOs as they build modern data architec‐
tures.
Generally, this book focuses on structured data, which, loosely speaking, refers to
data that is highly organized; for example, using the rows and columns of rela‐
tional databases for easy querying, searching, and retrieval. This includes data
from the Internet of Things (IoT) and social media sources that is collected into
structured repositories.
Introduction: The Rise of Modern
Data Architectures
Data is creating massive waves of change and giving rise to a new data-driven
economy that is only beginning. Organizations in all industries are changing
their business models to monetize data, understanding that doing so is critical to
competition and even survival. There is tremendous opportunity as applications,
instrumented devices, and web traffic are throwing off reams of 1s and 0s, rich in
analytics potential.
These analytics initiatives can reshape sales, operations, and strategy on many
fronts. Real-time processing of customer data can create new revenue opportuni‐
ties. Tracking devices with Internet of Things (IoT) sensors can improve opera‐
tional efficiency, reduce risk, and yield new analytics insights. New artificial
intelligence (AI) approaches such as machine learning can accelerate and
improve the accuracy of business predictions. Such is the promise of modern
analytics.
However, these opportunities change how data needs to be moved, stored, pro‐
cessed, and analyzed, and it’s easy to underestimate the resulting organizational
and technical challenges. From a technology perspective, to achieve the promise
of analytics, underlying data architectures need to efficiently process high vol‐
umes of fast-moving data from many sources. They also need to accommodate
evolving business needs and multiplying data sources.
To adapt, IT organizations are embracing data lake, streaming, and cloud architec‐
tures. These platforms are complementing and even replacing the enterprise data
warehouse (EDW), the traditional structured system of record for analytics.
Figure I-1 summarizes these shifts.
Figure I-1. Key technology shifts
Enterprise architects and other data managers know firsthand that we are in the
early phases of this transition, and it is tricky stuff. A primary challenge is data
integration—the second most likely barrier to Hadoop Data Lake implementa‐
tions, right behind data governance, according to a recent TDWI survey (source:
“Data Lakes: Purposes, Practices, Patterns and Platforms,” TDWI, 2017). IT
organizations must copy data to analytics platforms, often continuously, without
disrupting production applications (a trait known as zero-impact). Data integra‐
tion processes must be scalable, efficient, and able to absorb high data volumes
from many sources without a prohibitive increase in labor or complexity.
Table I-1 summarizes the key data integration requirements of modern analytics
initiatives.
All this entails careful planning and new technologies because traditional batch-
oriented data integration tools do not meet these requirements. Batch replication
jobs and manual extract, transform, and load (ETL) scripting procedures are slow and inefficient: they disrupt production, tie up talented ETL programmers, and create network and processing bottlenecks. They cannot scale sufficiently to support strategic enterprise initiatives. Batch is unsustainable in today's enterprise.
For transaction-intensive enterprise applications, the same database typically cannot both record business data and analyze it, because the underlying server has only so much CPU processing power available. It is not acceptable for an analytics query to slow down production workloads such as the processing of online sales transactions. Hence the need to analyze copies of production records on a different platform: offloading queries lets the business record its data and analyze it without one action interfering with the other.
The first method used for replicating production records (i.e., rows in a database
table) to an analytics platform is batch loading, also known as bulk or full loading.
This process creates files or tables at the target, defines their “metadata” struc‐
tures based on the source, and populates them with data copied from the source
as well as the necessary metadata definitions.
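To make the mechanics concrete, here is a minimal sketch of a full load in Python, using the standard-library sqlite3 module and a hypothetical orders table as stand-ins for real production and analytics systems. Actual batch-load tools work against production-grade databases, but the steps are the same: read the source structure, recreate it at the target, and copy every row.

```python
import sqlite3

def full_load(source_conn, target_conn, table):
    """Bulk-load one table: recreate it on the target, then copy every row."""
    src = source_conn.cursor()
    tgt = target_conn.cursor()

    # Read the source table's metadata (its CREATE TABLE definition).
    ddl = src.execute(
        "SELECT sql FROM sqlite_master WHERE type='table' AND name=?", (table,)
    ).fetchone()[0]

    # Recreate the structure on the target, then copy all rows.
    tgt.execute(f"DROP TABLE IF EXISTS {table}")
    tgt.execute(ddl)
    rows = src.execute(f"SELECT * FROM {table}").fetchall()
    if rows:
        placeholders = ",".join("?" * len(rows[0]))
        tgt.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    target_conn.commit()

# In-memory databases stand in for the production and analytics systems.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 24.50)])
full_load(source, target, "orders")
print(target.execute("SELECT * FROM orders").fetchall())  # [(1, 9.99), (2, 24.5)]
```

Reloading the table later means repeating the entire copy, which is exactly the cost that batch windows are meant to contain.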
Batch loads and periodic reloads with the latest data take time and often consume
significant processing power on the source system. This means administrators
need to run replication loads during “batch windows” of time in which produc‐
tion is paused or will not be heavily affected. Batch windows are increasingly
unacceptable in today’s global, 24×7 business environment.
In Chapter 4, we examine real examples of enterprise struggles with batch loads and how organizations are using CDC to eliminate struggles like these and realize new business value.
Advantages of CDC
CDC has three fundamental advantages over batch replication:
• It enables faster and more accurate decisions based on the most current data;
for example, by feeding database transactions to streaming analytics applica‐
tions.
• It minimizes disruptions to production workloads.
• It reduces the cost of transferring data over the wide area network (WAN) by
sending only incremental changes.
Nucleus Research's findings quantify how quickly data loses value for different types of decisions:
• Data used for tactical decisions, defined as decisions that prioritize daily tasks and activities, on average lost more than half its value 30 minutes after
its creation. Value here is measured by the portion of decisions enabled,
meaning that data more than 30 minutes old contributed to 70% fewer
operational decisions than fresher data. Marketing, sales, and operations per‐
sonnel make these types of decisions using custom dashboards or embedded
analytics capabilities within customer relationship management (CRM)
and/or supply-chain management (SCM) applications.
• Operational data on average lost about half its value after eight hours. Exam‐
ples of operational decisions, usually made over a few weeks, include
improvements to customer service, inventory stocking, and overall organiza‐
tional efficiency, based on data visualization applications or Microsoft Excel.
• Data used for strategic decisions has the longest-range implications, but still
loses half its value roughly 56 hours after creation (a little less than two and a
half days). In the strategic category, data scientists and other specialized ana‐
lysts often are assessing new market opportunities and significant potential
changes to the business, using a variety of advanced statistical tools and
methods.
Figure 1-1 plots Nucleus Research’s findings. The Y axis shows the value of data
to decision making, and the X axis shows the hours after its creation.
Examples bring research findings like this to life. Consider the case of a leading
European payments processor, which we’ll call U Pay. It handles millions of
mobile, online, and in-store transactions daily for hundreds of thousands of merchants in more than 100 countries. Part of U Pay's value to merchants is that it
credit-checks each transaction as it happens. But loading data in batch to the
underlying data lake with Sqoop, an open source ingestion scripting tool for
Hadoop, created damaging bottlenecks. The company could not integrate both
the transactions from its production SQL Server and Oracle systems and credit
agency communications fast enough to meet merchant demands.
U Pay decided to replace Sqoop with CDC, and everything changed. The com‐
pany was able to transact its business much more rapidly and bring the credit
checks in house. U Pay created a new automated decision engine that assesses the
Change data capture (CDC) identifies and captures just the most recent produc‐
tion data and metadata changes that the source has registered during a given time
period, typically measured in seconds or minutes, and then enables replication
software to copy those changes to a separate data repository. A variety of techni‐
cal mechanisms enable CDC to minimize time and overhead in the manner most
suited to the type of analytics or application it supports. CDC can accompany
batch load replication to ensure that the target is and remains synchronized with
the source upon load completion. Like batch loads, CDC helps replication soft‐
ware copy data from one source to one target, or one source to multiple targets.
CDC also identifies and replicates source schema changes (that is, data definition language [DDL] changes), enabling targets to dynamically adapt to structural updates. This eliminates the risk that other data management and analytics processes become brittle and require time-consuming manual updates.
Streaming platforms (such as Apache Kafka, Amazon Kinesis, and Azure Event Hubs) are used both to enable streaming analytics applications and to transmit data to various big data targets.
CDC has evolved to become a critical building block of modern data architec‐
tures. As explained in Chapter 1, CDC identifies and captures the data and meta‐
data changes that were committed to a source during the latest time period,
typically seconds or minutes. This enables replication software to copy and com‐
mit these incremental source database updates to a target. Figure 2-1 offers a
simplified view of CDC’s role in modern data analytics architectures.
So, what are these incremental data changes? There are four primary categories of changes to a source database: three kinds of row changes (inserts, updates, and deletes) plus metadata (DDL) changes (a short sketch after Figure 2-2 shows one way to represent them):
Inserts
These add one or more rows to a database. For example, a new row, also
known as a record, might summarize the time, date, amount, and customer
name for a recent sales transaction.
Updates
These change fields in one or more existing rows; for example, to correct an
address in a customer transaction record.
Deletes
Deletes eliminate one or more rows; for example, when an incorrect sales
transaction is erased.
Figure 2-2. CDC example: row changes (one row = one record)
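As an illustration only, not any particular product's wire format, the following Python sketch shows one way such row changes might be represented as change records, with the operation type and before/after row images. All field names here are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ChangeRecord:
    """One captured change to a single row in a source table (illustrative format)."""
    operation: str                 # "insert", "update", or "delete"
    table: str                     # source table name
    key: dict                      # primary-key columns identifying the row
    before: Optional[dict] = None  # row image before the change (updates/deletes)
    after: Optional[dict] = None   # row image after the change (inserts/updates)
    committed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# An insert adds a new sales transaction row.
insert = ChangeRecord("insert", "sales", {"txn_id": 1001},
                      after={"txn_id": 1001, "customer": "Acme", "amount": 250.0})

# An update corrects a field in an existing row.
update = ChangeRecord("update", "sales", {"txn_id": 1001},
                      before={"txn_id": 1001, "customer": "Acme", "amount": 250.0},
                      after={"txn_id": 1001, "customer": "Acme", "amount": 275.0})

# A delete erases an incorrect transaction.
delete = ChangeRecord("delete", "sales", {"txn_id": 1001},
                      before={"txn_id": 1001, "customer": "Acme", "amount": 275.0})
```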
There are two primary architectural options for CDC: agent-based and agentless.
As the name suggests, agent-based CDC software resides on the source server
itself and therefore interacts directly with the production database to identify and
capture changes. CDC agents are not ideal because they divert CPU, memory, and storage away from source production workloads, thereby degrading performance. Agents are also sometimes required on target end points, where they have a similar impact on management burden and performance.
The more modern, agentless architecture has zero footprint on source or target.
Rather, the CDC software interacts with source and target from a separate inter‐
mediate server. This enables organizations to minimize source impact and
improve ease of use.
Query-based CDC
This approach regularly checks the production database for changes. This
method can also slow production performance by consuming source CPU
cycles. Certain source databases and data warehouses, such as Teradata, do
not have change logs (described in the next section) and therefore require
alternative CDC methods such as queries. You can identify changes by using
timestamps, version numbers, and/or status columns as follows:
• Timestamps in a dedicated source table column can record the time of the most recent update, thereby flagging any row containing data more recent than the last CDC replication task (see the sketch after this list). To use this query method, all of the tables must be altered to include timestamps, and administrators must ensure that the timestamps accurately represent time zones.
• Version numbers increase by one increment with each change to a table. They are similar to timestamps, except that they identify the version number of each row rather than the time of the last change. This method requires a means of tracking the latest replicated version, for example by recording it in a supporting reference table and comparing it to each row's version column.
• Status indicators take a similar approach, using a dedicated column to state whether a given row has been updated since the last replication. These indicators might also show that, although a row has been updated, it is not yet ready to be copied; for example, because the entry needs human validation.
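Here is a minimal sketch of the timestamp variant referenced above, assuming a hypothetical orders table with a last_updated column; SQLite stands in for the production database purely for illustration.

```python
import sqlite3

def capture_changes_since(conn, table, last_run_ts):
    """Query-based CDC: return rows modified after the previous replication run,
    using a dedicated last_updated timestamp column on the source table."""
    cur = conn.execute(
        f"SELECT * FROM {table} WHERE last_updated > ? ORDER BY last_updated",
        (last_run_ts,),
    )
    return cur.fetchall()

# Demo source table with a timestamp column (all times in UTC to avoid time-zone drift).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, last_updated TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 9.99,  "2018-03-01T10:00:00Z"),
    (2, 24.50, "2018-03-01T10:05:00Z"),
])

# Only rows changed after the previous checkpoint are captured; the next run
# advances the checkpoint so it picks up only newer changes.
checkpoint = "2018-03-01T10:02:00Z"
print(capture_changes_since(conn, "orders", checkpoint))
# [(2, 24.5, '2018-03-01T10:05:00Z')]
```

Each replication run records the highest timestamp it has seen and uses it as the checkpoint for the next run; the same pattern applies to version-number and status columns.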
Figure 2-4. Log readers identify changes in backup and recovery logs
Table 2-1 summarizes the functionality and production impact of trigger, query, and log-based CDC; a short sketch of the trigger method follows the table.

Table 2-1. Functionality and production impact of CDC methods delivering data

CDC capture method: Log reader
  Description: Identifies changes by scanning backup/recovery transaction logs; the preferred method when log access is available.
  Production impact: Minimal

CDC capture method: Query
  Description: Flags new transactions in a production table column with timestamps, version numbers, and so on; the CDC engine periodically asks the production database for updates (for example, for Teradata).
  Production impact: Low

CDC capture method: Trigger
  Description: Source transactions "trigger" copies to a change-capture table; the preferred method if there is no access to transaction logs.
  Production impact: Medium
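To illustrate the trigger method summarized in Table 2-1, the following sketch creates a hypothetical change-capture table that a trigger populates whenever a source row changes. SQLite syntax is used for brevity; a real deployment would use the production database's own trigger dialect and would also add triggers for inserts and deletes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, address TEXT);

-- Change-capture table populated by triggers on the source table.
CREATE TABLE customers_changes (
    change_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    operation   TEXT,
    customer_id INTEGER,
    new_address TEXT
);

-- Each committed UPDATE "triggers" a copy of the change into the capture table,
-- which replication software later reads and clears.
CREATE TRIGGER customers_update_capture AFTER UPDATE ON customers
BEGIN
    INSERT INTO customers_changes (operation, customer_id, new_address)
    VALUES ('update', NEW.id, NEW.address);
END;
""")

conn.execute("INSERT INTO customers VALUES (1, '10 Main St')")
conn.execute("UPDATE customers SET address = '22 Oak Ave' WHERE id = 1")
print(conn.execute("SELECT * FROM customers_changes").fetchall())
# [(1, 'update', 1, '22 Oak Ave')]
```

The convenience comes at a cost: every source transaction now does extra write work, which is why the table rates this method's production impact as medium.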
Between each of these phases (Raw, Design/Refine, and Optimize), we need to transform data into the right form.
Change data capture plays an integral role by accelerating ingestion in the raw
phase. This helps improve the timeliness and accuracy of data and metadata in
the subsequent Design/Refine and Optimize phases.
Now let’s examine the architectures in which data workflow and analysis take
place, their role in the modern enterprise, and their integration points with
change data capture (CDC). As shown in Figure 3-1, the methods and terminology for data transfer tend to vary by target. Whereas the transfer of data and metadata into a database involves simple replication, a more complicated extract, transform, and load (ETL) process is required for data warehouse targets. Data lakes, meanwhile, typically can ingest data in its native format. Finally, streaming targets require source data to be published and consumed in a series of messages.
Any of these four target types can reside on-premises, in the cloud, or in a hybrid
combination of the two.
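As a sketch of that last pattern, publishing source changes as messages, the following assumes the open source kafka-python client and a hypothetical sales.changes topic; the broker address and message layout are illustrative rather than prescriptive.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Connect to the streaming platform (broker address is illustrative).
producer = KafkaProducer(
    bootstrap_servers="broker1:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# One captured row change, published as a message for downstream consumers
# such as streaming analytics applications or data lake ingestion jobs.
change = {
    "operation": "insert",
    "table": "sales",
    "after": {"txn_id": 1001, "customer": "Acme", "amount": 250.0},
}
producer.send("sales.changes", value=change)
producer.flush()  # block until the message is acknowledged by the broker
```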
We will explore the role of CDC in five modern architectural scenarios: replica‐
tion to databases, ETL to data warehouses, ingestion to data lakes, publication to
streaming systems, and transfer across hybrid cloud infrastructures. In practice,
most enterprises have a patchwork of the architectures described here, as they
apply different engines to different workloads. A trial-and-error learning process,
changing business requirements, and the rise of new platforms all mean that data
managers will need to keep copying data from one place to another. Data mobi‐
lity will be critical to the success of modern enterprise IT departments for the
foreseeable future.
Replication to Databases
Organizations have long used databases such as Oracle and SQL Server for
operational reporting; for example, to track sales transactions and trends, inven‐
tory levels, and supply-chain status. They can employ batch and CDC replication
to copy the necessary records to reporting databases, thereby offloading the quer‐
ies and analytics workload from production. CDC has become common in these
scenarios as the pace of business quickens and business managers at all levels
increasingly demand real-time operational dashboards.
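The apply side of such replication can be pictured with a short sketch: incremental change records, in the same hypothetical format used earlier, are merged into a reporting table, with SQLite again standing in for the reporting database.

```python
import sqlite3

def apply_change(conn, change):
    """Apply one captured change record to the reporting copy of the table."""
    table, key, after = change["table"], change["key"], change.get("after")
    if change["operation"] == "insert":
        cols = ",".join(after)
        params = ",".join("?" * len(after))
        conn.execute(f"INSERT INTO {table} ({cols}) VALUES ({params})",
                     list(after.values()))
    elif change["operation"] == "update":
        sets = ",".join(f"{c}=?" for c in after)
        where = " AND ".join(f"{c}=?" for c in key)
        conn.execute(f"UPDATE {table} SET {sets} WHERE {where}",
                     list(after.values()) + list(key.values()))
    elif change["operation"] == "delete":
        where = " AND ".join(f"{c}=?" for c in key)
        conn.execute(f"DELETE FROM {table} WHERE {where}", list(key.values()))
    conn.commit()

reporting = sqlite3.connect(":memory:")
reporting.execute("CREATE TABLE sales (txn_id INTEGER PRIMARY KEY, amount REAL)")
apply_change(reporting, {"operation": "insert", "table": "sales",
                         "key": {"txn_id": 1001}, "after": {"txn_id": 1001, "amount": 250.0}})
apply_change(reporting, {"operation": "update", "table": "sales",
                         "key": {"txn_id": 1001}, "after": {"amount": 275.0}})
print(reporting.execute("SELECT * FROM sales").fetchall())  # [(1001, 275.0)]
```

Because only the changed rows cross the wire, the reporting copy stays close to real time without the periodic full reloads that batch replication requires.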
CDC also can transmit source schema/data definition language (DDL) updates
into message streams and integrate with messaging schema registries to ensure
that the analytics consumers understand the metadata. In addition, when CDC
Now let’s explore some case studies. Each of these illustrates the role of change
data capture (CDC) in enabling scalable and efficient analytics architectures that
do not affect production application performance. By moving and processing
incremental data and metadata updates in real time, these organizations have
reduced or eliminated the need for resource-draining and disruptive batch (aka
full) loads. They are siphoning data to multiple platforms for specialized analysis
on each, consuming CPU and other resources in a balanced and sustainable way.
physician’s notes and are testing other new AI approaches such as machine learn‐
ing to improve predictions of clinical treatment outcomes.
Figure 4-1. Data architecture for Kafka streaming to cloud-based Lambda architec‐
ture
After the data arrives in HDFS and HBase, Spark in-memory processing helps
match orders to production on a real-time basis and maintain referential integ‐
rity for purchase order tables. As a result, Suppertime has accelerated sales and
product delivery with accurate real-time operational reporting. It has replaced
batch loads with CDC to operate more efficiently and more profitably.
Figure 5-1. Replication Maturity Model
Level 1: Basic
At the Basic maturity level, organizations have not yet implemented CDC. A sig‐
nificant portion of organizations are still in this phase. During a course on data
integration at a TDWI event in Orlando in December 2017, this
author was surprised to see only half of the attendees raise their hands when
asked if they used CDC.
Instead, organizations use traditional, manual extract, transform, and load (ETL)
tools and scripts, or open source Sqoop software in the case of Hadoop, that rep‐
licate production data to analytics platforms via disruptive batch loads. These
processes often vary by end point and require skilled ETL programmers to learn
multiple processes and spend extra time configuring and reconfiguring replica‐
tion tasks. Data silos persist because most of these organizations lack the resour‐
ces needed to integrate all of their data manually.
Such practices often are symptoms of larger issues that leave much analytics
value unrealized, because the cost and effort of data integration limit both the
number and the scope of analytics projects. Siloed teams often run ad hoc analyt‐
ics initiatives that lack a single source of truth and strategic guidance from execu‐
tives. To move from the Basic to Opportunistic level, IT department leaders need
to recognize these limitations and commit the budget, training, and resources
needed to use CDC replication software.
Level 3: Systematic
Systematic organizations are getting their data house in order. IT departments in
this phase implement automated CDC solutions such as Attunity Replicate that
require no disruptive agents on source systems. These solutions enable uniform
data integration procedures across more platforms, breaking silos while minimiz‐
ing skill and labor requirements with a “self-service” approach. Data architects
rather than specialized ETL programmers can efficiently perform high-scale data
integration, ideally through a consolidated enterprise console and with no man‐
ual scripting. In many cases, they also can integrate full-load replication and
CDC processes into larger IT management frameworks using REST or other
APIs. For example, administrators can invoke and execute Attunity Replicate
tasks from workload automation solutions.
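As a hedged illustration of that kind of API-driven orchestration, a workload automation job might start a replication task with a single REST call. The endpoint path, task name, and authentication below are placeholders, not Attunity Replicate's documented API; consult the product's own API reference for the real calls.

```python
import requests  # pip install requests

# Hypothetical REST endpoint and task name, for illustration only.
BASE_URL = "https://replication.example.com/api"
TASK_NAME = "orders_to_datalake"

session = requests.Session()
session.headers["Authorization"] = "Bearer <api-token>"  # placeholder credential

# A workload automation job could call this to kick off CDC replication
# as one step in a larger pipeline.
response = session.post(f"{BASE_URL}/tasks/{TASK_NAME}/run", timeout=30)
response.raise_for_status()
print("Task started:", response.json())
```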
IT teams at this level often have clear executive guidance and sponsorship in the
form of a crisp corporate data strategy. Leadership is beginning to use data ana‐
lytics as a competitive differentiator. Examples from Chapter 4 include the case
studies for Suppertime and USave, which have taken systematic, data-driven
approaches to improving operational efficiency. StartupBackers (case study 3) is
similarly systematic in its data consolidation efforts to enable new analytics
insights. Another example is illustrated in case study 4, Nest Egg, whose ambi‐
tious campaign to run all transactional records through a coordinated Amazon
Web Services (AWS) cloud data flow is enabling an efficient, high-scale microser‐
vices environment.
Level 4: Transformational
Organizations reaching the Transformational level are automating additional
segments of data pipelines to accelerate data readiness for analytics. For example,
they might use data warehouse automation software to streamline the creation,
management, and updates of data warehouse and data mart environments. They
also might be automating the creation, structuring, and continuous updates of
data stores within data lakes. Attunity Compose for Hive provides these capabili‐
ties for Hive data stores so that datasets compliant with ACID (atomicity, consis‐
tency, isolation, durability) can be structured rapidly in what are effectively SQL-
like data warehouses on top of Hadoop.
We find that leaders within Transformational organizations are often devising
creative strategies to reinvent their businesses with analytics. They seek to
become truly data-driven. GetWell (case study 1 in Chapter 4) is an example of a
transformational organization. By applying the very latest technologies—
machine learning, and so on—to large data volumes, it is reinventing its offerings
to greatly improve the quality of care for millions of patients.
So why not deploy Level 3 or Level 4 solutions and call it a day? Applying a con‐
sistent, nondisruptive and fully automated CDC process to various end points
certainly improves efficiency, enables real-time analytics, and yields other bene‐
fits. However, the technology will take you only so far. We find that the most
effective IT teams achieve the greatest efficiency, scalability, and analytics value
when they are aligned with a C-level strategy to eliminate data silos, and guide
and even transform their business with data-driven decisions.
Attunity Replicate, a modern data integration platform built on change data cap‐
ture (CDC), is designed for the Systematic (Level 3) and Transformational (Level
4) maturity levels described in Chapter 5. It provides a highly automated and
consistent platform for replicating incremental data updates while minimizing
production workload impact. Attunity Replicate integrates with all major data‐
base, data warehouse, data lake, streaming, cloud, and mainframe end points.
With Attunity Replicate, you can address use cases that include the following:
Figure 6-1. Attunity Replicate architecture
Attunity Replicate CDC and the larger Attunity Replicate portfolio enable effi‐
cient, scalable, and low-impact integration of data to break silos. Organizations
can maintain consistent, flexible control of data flows throughout their environ‐
ments and automate key aspects of data transformation for analytics. These key
benefits are achieved while reducing dependence on expensive, highly skilled ETL programmers.
As with any new technology, the greatest barrier to successful adoption can be
inertia. Perhaps your organization is managing to meet business requirements
with traditional extract, transform, and load scripting and/or batch loading
without change data capture. Perhaps Sqoop is enabling your new data lake to
ingest sufficient data volumes with tolerable latencies and a manageable impact
on production database workloads. Or perhaps your CIO grew up in the script‐
ing world and is skeptical of graphical interfaces and automated replication pro‐
cesses.
But we are on a trajectory in which the business is depending more and more on
analyzing growing volumes of data at a faster and faster clip. There is a tipping
point at which traditional manual bulk loading tools and manual scripting begin
to impede your ability to deliver the business-changing benefits of modern ana‐
lytics. Successful enterprises identify the tipping point before it arrives and adopt
the necessary enabling technologies. Change data capture is such a technology. It
provides the necessary heartbeat for efficient, high-scale, and nondisruptive data
flows in modern enterprise circulatory systems.
APPENDIX A
Gartner Maturity Model for Data and
Analytics
The Replication Maturity Model shown in Figure 5-1 is adapted from the Gart‐
ner Maturity Model for Data and Analytics (ITScore for Data and Analytics,
October 23, 2017), as shown in Figure A-1.
Figure A-1. Overview of the Maturity Model for Data and Analytics (D&A = data
and analytics; ROI = return on investment)
About the Authors
Kevin Petrie is senior director of product marketing at Attunity. He has 20 years
of experience in high tech, including marketing, big data services, strategy, and
journalism. Kevin has held leadership roles at EMC and Symantec, and is a fre‐
quent speaker and blogger. He holds a Bachelor of Arts degree from Bowdoin College and an MBA from the Haas School of Business at UC Berkeley. Kevin is a
bookworm, outdoor fitness nut, husband, and father of three boys.
Dan Potter is a 20-year marketing veteran and the vice president of product
management and marketing at Attunity. In this role, he is responsible for product
roadmap management, marketing, and go-to-market strategies. Prior to Attunity,
he held senior marketing roles at Datawatch, IBM, Oracle, and Progress Soft‐
ware. Dan earned a B.S. in Business Administration from the University of New Hampshire.
Itamar Ankorion is the chief marketing officer (CMO) at Attunity, leading global marketing, business development, and product management. Itamar has overall
responsibility for Attunity’s marketing, including the go-to-market strategy,
brand and marketing communications, demand generation, product manage‐
ment, and product marketing. In addition, he is responsible for business devel‐
opment, building and managing Attunity’s alliances including strategic, OEM,
reseller, technology, and system integration partnerships. Itamar has more than
15 years of experience in marketing, business development, and product man‐
agement in the enterprise software space. He holds a B.A. in Computer Science and Business Administration and an MBA from Tel Aviv University.