
VERA: The Path to Cloud-Native Apache Flink®

Contributors: Yuan Mei, Dawn Leamon, Zakelly Lan, Giannis Polyzos, Jinzhong Li, Ben Gamble, Igor Kersic, Karin Landers
Table of Contents
The Motivation Behind VERA
Changing Demands Reshape the Data Streaming Landscape
Closing the Gaps
VERA’s Key Innovations
The Cutting-Edge Technology of VERA
How VERA Tackles Challenges for Flink Users
Challenge 1: Evolving to a Cloud-Native Architecture
Challenge 2: Joining Data at Scale
Challenge 3: Ensuring Stable and Predictable Stream Processing
Challenge 4: Minimizing RPO and RTO
Challenge 5: Achieving Business and Development Agility
Comparative Performance Evaluation
Key Findings At A Glance
Test #1: Processing Performance Improvements
Test #2: Rescaling Speed
Test #3: Key-Value Separation
Bringing VERA to Everyone
Conclusion
About Ververica
References

This white paper introduces the Ververica Runtime Assembly (VERA) architecture, explains why Ververica built VERA, describes the journey to democratize stream processing, and presents the results of our performance comparison between VERA’s state engine (Gemini) and the open-source Apache Flink® state engine (RocksDB).
The Motivation Behind VERA
Let’s travel back to 2009, when the big data landscape was shaped by two key ideas:
“computing as a utility” as seen in Grid computing technologies, and “data as a resource”
leveraged in peer-to-peer networks. The framework at the center of big data during this time
was Apache Hadoop, an open-source framework for distributed data processing inspired by Google’s 2004 MapReduce paper. Hadoop
combined Grid computing’s distributed model with the decentralized, large-scale data
sharing of Peer-to-Peer (P2P) systems to create a platform designed for big data storage
and processing.

Then came two academic projects that would transform the big data landscape: Apache Flink, which started as the “Stratosphere” research project at the Technical University of Berlin, and Apache Spark, born at the AMPLab at the University of California, Berkeley. The
introduction of Spark and Flink marked the shift from batch-only processing in Hadoop to
faster, more flexible systems capable of handling both real-time and batch data.

While both technologies were significant in the big data ecosystem, Spark and Flink followed
two different paths as they evolved to meet varying user needs and technical challenges.

● Spark followed a batch-first model, built around its Resilient Distributed Dataset
(RDD) API. Although it later expanded its capabilities with add-ons for stream
processing, Spark’s roots remain in batch operations.

● Flink focused on real-time stream processing, treating batch jobs as a special case of
streams and handling finite data sets (bounded streams) much like real-time
streams.

By 2016, Flink was the go-to processing solution for event-driven and streaming workloads,
gaining significant adoption as more companies relied on it for critical tasks, marking a
tipping point for widespread use.

Changing Demands Reshape the Data Streaming Landscape


The landscape of data stream processing looks vastly different than it did back in 2016. The
ways we ingest, process, and store data have evolved significantly due to advancements in
technology, the growth of big data, and the rising demand for real-time analytics.
When Apache Flink was first introduced, state sizes were typically measured in hundreds of
megabytes. Today, it's common for stateful streaming applications to handle terabyte-scale
state sizes. The exponential growth of data to process demanded a transition from the
Hadoop era to modern real-time data processing in containerized, cloud-native
environments.

Real-time data movement emerged as the norm, driven by tools like Apache Kafka®, Apache Pulsar™, and Amazon Kinesis. These enable continuous real-time data streams from sources such as Internet of Things (IoT) devices, logs, and social media.

Concurrently, storage models also evolved, shifting from HDFS to cloud-native object
storage like Amazon S3, Google Cloud Storage, and Azure Blob Storage, enhancing cost
effectiveness and scalability for modern data workloads.

In data management and governance, the convergence of Lakehouse architectures (batch) with real-time streaming is gaining popularity. This approach combines the strengths of data
lakes and warehouses, enabling better governance and analytics on large-scale data, while
processing both historical and real-time data streams. As a result, organizations can adopt
more agile, scalable solutions to manage dynamic workloads effectively.

The demand for real-time insights led to the integration of streaming data and machine
learning, enabling organizations to train models on fresh data (data in real-time, not
historical or batch data). This unlocks instant insights in predictive analytics and automated
decision-making, making both faster and more efficient.

At the same time, network bandwidth increased significantly, from hundreds of megabits per
second a decade ago to hundreds of gigabits today. High-speed network connections are becoming standard in data center hardware, particularly at the chassis level (the enclosure housing multiple server blades), with plans to integrate them directly into individual blades for even faster networking. Advances in
technologies like fiber optics, 5G, and higher capacity Ethernet standards have also enabled
faster data transfers, reduced latency, and more efficient computing across cloud regions
and data centers.

This increase in bandwidth, combined with decreasing costs, accelerated the transition from
local to distributed and cloud-based storage solutions, allowing real-time applications to
handle massive data volumes.

Lastly, the rise of microservices and event-driven architectures changed how data is
processed, enabling more granular, scalable, and loosely coupled systems. The data mesh
paradigm, which decentralizes data ownership and management to domain-specific teams,
has emerged to address scalability challenges in large, distributed organizations.

While Apache Flink’s core architecture has not changed significantly over the past decade,
these trends have driven organizations and the community toward real-time processing,
cloud-native solutions, and intelligent automation, with an increasing emphasis on
scalability, resilience, and data governance.

Closing the Gaps


In response to the rapidly changing landscape, Ververica reimagined how Apache Flink
should function in the cloud-native era. We identified several key limitations in Flink’s core
architecture and ecosystem that users face when trying to meet their streaming data needs
with open-source Flink. Three critical areas for improvement emerged:

1. Evolving Flink to run natively in modern cloud environments

2. Simplifying the complexities of fault tolerance

3. Supporting the demands of large stateful applications

To overcome these challenges, we developed the Ververica Runtime Assembly (VERA), the engine powering the Ververica Streaming Data Platform. VERA bridges these gaps, helping organizations navigate the complexities of the current data landscape by simplifying streaming application development and operations and making Flink more scalable, resilient, and adaptable for cloud-native, fault-tolerant, and state-heavy use cases.

VERA’s Key Innovations


This section highlights VERA’s innovations and the benefits that it offers to users.

The Cutting-Edge Technology of VERA


VERA is a performance, resilience, and cost-optimized Apache Flink® runtime that powers the
Ververica Streaming Data Platform. The underlying technology and features include:

● Open-core technology stack that is 100% compatible with open-source Apache Flink®, enabling developers to seamlessly access, modify, and utilize Flink’s features.

● Enterprise-grade technology that Ververica built to address the challenges with open-source Apache Flink®, including Ververica’s Flink optimizations and advanced CDC connectors.

● Advanced features that support its cloud-native architecture, streamline the integration with the underlying technology stack, optimize stream processing, and enhance the usability for enterprise customers, meeting even the most demanding requirements.

Image 1 captures the primary components of VERA and its three layers.

Image 1: Primary components of the Ververica Runtime Assembly (VERA)

Open-Core Tech Stack


VERA follows an open-core model, combining open-source Flink with proprietary technology
and features developed by Ververica. These components complement the open-source core,
ensuring VERA remains 100% compatible with Flink. VERA preserves all the benefits of Flink’s
unified programming model, including low-latency processing, high throughput, and stateful
stream processing. Flink applications can run on VERA without requiring modifications or
compatibility issues.
Because VERA is fully compatible with Flink and guarantees no vendor lock-in, you can
seamlessly move between vendor and open-source Flink. You are not tied to any specific
vendor’s products, services, or pricing.

Ververica-Built Tech Stack


The cutting-edge technology behind VERA centers on Ververica’s custom-built,
enterprise-grade, and cloud-native state storage engine, Gemini Backend StateStore (or
simply “Gemini”). This innovative engine replaces the standard RocksDB State Backend in
open-source Flink and provides superior performance in high-demand state-intensive
scenarios. For an in-depth look at how VERA addresses the challenges of the open-source
Flink state engine, see the section "How VERA Tackles Challenges for Flink Users."

Ververica also provides a set of advanced Change Data Capture (CDC) connectors with
enterprise optimizations for highly scalable ingestion. VERA seamlessly integrates with Flink
CDC and supports advanced CDC capabilities. By abstracting away the complexities related
to deployment, configuration, infrastructure management, scalability, fault tolerance, and
maintenance, VERA makes CDC more accessible for users who prioritize ease of use and
streamlined workflows.

Advanced Features
VERA’s advanced features enhance performance, optimize computing resource utilization,
and elevate the user experience. You can:

● Effortlessly scale large stateful applications because VERA separates compute and
storage resources.

● Enhance performance with streaming join optimization through key-value separation.

● Recover faster from job failures or updates with advanced scheduling functions for
quicker restarts.

● Update event processing rules on the fly with Dynamic Complex Event Processing
(CEP).

● Meet the stringent requirements of modern, cloud-native stream processing workloads with enhanced catalogs.

● Unlock powerful machine learning and AI capabilities with built-in Flink ML and AI applications, fully integrated with the open-source Apache Flink® ML library, which provides pre-installed and pre-configured machine learning APIs.

Next, let’s take the time to understand the common challenges users face with open-source Apache Flink®, and how VERA solves them.

How VERA Tackles Challenges for Flink Users


During the process of addressing the critical gaps in open-source Flink, Ververica identified
five common obstacles that users encounter with Flink technology.

Challenges Users Face with OS Flink

Image 2: Five Challenges VERA Solves

These hurdles make it difficult for users without extensive Flink knowledge to use Apache
Flink effectively out of the box, resulting in a steep learning curve. By overcoming these
limitations, Ververica aims to democratize stream processing, making it accessible to a
broader audience, not just those with Flink expertise and substantial budgets or resources.

To learn more about how we set out to achieve this goal, check out insights from Ben Gamble, Ververica’s Field CTO, in this video:

https://www.youtube.com/watch?v=gSFcXk8TOOM

Now, let’s explore how we solve each of these challenges…


Challenge 1: Evolving to a Cloud-Native Architecture
Local disk limitations in Apache Flink® can significantly affect both performance and
scalability, particularly in stateful stream processing tasks. By default, Flink uses RocksDB, which is optimized for local flash drives or limited RAM and does not scale well.

In cloud-native environments with containerization, the fixed size of local disks creates
challenges for stateful stream processing. As state sizes grow and local disks reach their
storage limits, streaming jobs can become unstable. Addressing these issues by expanding
disk capacity or increasing job parallelism can be costly and requires maintenance windows.
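For reference, the open-source setup that runs into these limits typically looks like the minimal sketch below, assuming Flink 1.13 or later with the flink-statebackend-rocksdb dependency and a hypothetical S3 bucket for checkpoints: state lives in RocksDB on the TaskManager's local disk, so its size and access speed are bounded by that disk.

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbBaseline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // State is kept in RocksDB on the TaskManager's local disk
        // (incremental checkpoints enabled), so state size and access speed
        // are limited by local disk capacity and I/O.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // Checkpoints are written to remote object storage; the bucket name is hypothetical.
        env.enableCheckpointing(60_000);
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");

        // Placeholder pipeline so the example runs end to end.
        env.fromElements(1, 2, 3).map(x -> x * 2).print();
        env.execute("rocksdb-baseline");
    }
}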

Impact
Local disk limitations in Flink can lead to slower processing, reduced scalability, longer
recovery times, and higher failure rates. These impacts are evident in several scenarios:

● Flink often relies on local disk for checkpointing, creating savepoints, and managing
state. Limited I/O performance (e.g., slow read/write speeds) on local disks can create
bottlenecks, slowing down checkpointing and causing back pressure in the system.

● For stateful streaming jobs in Flink that store large amounts of data, limited local disk
capacity can lead to job failures or the inability to save checkpoints.

● As state grows in large-scale applications, local disk storage saturation can hinder
scalability, preventing the system from efficiently handling larger workloads. Resource
management becomes more complex as disks fill up faster, requiring more frequent
monitoring and maintenance to avoid performance degradation.

● Although Flink provides fault tolerance through state snapshots and recovery,
inconsistent local disk capacity may result in incomplete or corrupted checkpoints,
compromising fault tolerance and increasing recovery time after failures.

To manage these issues, many organizations adopt hybrid architectures that combine local and remote storage solutions, such as Amazon S3 or Google Cloud Storage, to balance speed, scalability, and resilience.
Solution
VERA solves this issue by decoupling compute and storage and leveraging a tiered storage
architecture.

Decoupling compute and storage


Separating compute and storage in VERA enables each to scale independently. This
decoupling means you can add more computing power without being constrained by storage
limitations, and storage can scale without impacting compute resources. This flexibility is
crucial for handling varying workloads, especially in stateful streaming applications where
state sizes can grow rapidly.

VERA removes the dependency on local disks for storing state and treats them primarily as a
cache. This strategy enhances fault tolerance and provides seamless storage scaling by:

● Offloading state data to a remote distributed file system, eliminating concerns about
local disk capacity and simplifying job rescaling. This ensures that the size of the
state does not limit computing resources.

● Managing checkpointing directly on the distributed file system, ensuring robust job
recovery and improving resiliency and stability.

● Reducing I/O bottlenecks, as remote storage systems like cloud-based object storage offer higher throughput and reliability than local disks.

In short, by decoupling compute and storage, VERA ensures that stateful applications can scale efficiently, improve resiliency, and optimize cost while delivering the performance needed for dynamic, cloud-native environments.

Tiered storage architecture


Tiered storage manages state data by organizing it across multiple layers or “tiers” of storage
based on access patterns and resource costs. Frequently accessed data is stored in fast,
high-cost storage (like memory or SSDs), while less frequently accessed data is offloaded to
slower, lower-cost storage (e.g., local disks or remote object stores). This layered approach
ensures efficient resource usage and optimizes both performance and cost.

VERA enhances this tiered storage model with rescaling optimizations that significantly
reduce job downtime and improve system scalability.
When scaling a job (either up or down), Flink typically needs to process and download all
state data before resuming operations, which can result in long recovery times. VERA
optimizes this process through two mechanisms:

● File Granular Merging and Trimming: VERA asynchronously trims redundant state data
during rescaling. This allows the job to resume faster, as the state backend does not
need to wait for pruning operations to complete before processing.

● Lazy Restore of State Files: VERA addresses slow state recovery and long job
interruption times by introducing lazy state loading, downloading state data
asynchronously. This allows the job to access remote storage directly, merge
metadata, and resume execution without waiting for the entire state to be restored
locally.

Image 3: Two State Recovery Approaches

VERA's lazy restore mechanism optimizes state file loading during recovery. Unlike eager
restore state recovery, where all state data must be loaded before the job can begin, lazy
restore reads the metadata files along with a small amount of in-memory data from the
remote storage. Jobs can start processing immediately while the rest of the state is
asynchronously downloaded in the background.
Although the job starts fast, reading from remote storage is not ideal as it involves I/O delays
and could introduce some extra latency. To ensure optimal performance, memory and local
disks act as accelerators and background threads perform on-demand, just-in-time
downloads, gradually transferring the remaining files from remote storage to the local disk.
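The flow described above can be pictured with a small, purely conceptual sketch (hypothetical class and method names, not VERA's actual implementation): only metadata is needed up front, and reads fall through to remote storage until a background thread has copied each file into the local cache.

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Conceptual illustration of lazy state-file restore; all names are hypothetical.
public class LazyStateRestore {

    // Stand-in for a remote object store (e.g. S3): returns a file's bytes by name.
    interface RemoteStore {
        byte[] read(String fileName);
    }

    private final RemoteStore remote;
    private final Path localCacheDir;
    private final Map<String, byte[]> localCache = new ConcurrentHashMap<>();
    private final ExecutorService background = Executors.newSingleThreadExecutor();

    public LazyStateRestore(RemoteStore remote, Path localCacheDir) {
        this.remote = remote;
        this.localCacheDir = localCacheDir;
    }

    // The job can resume as soon as the file names (metadata) are known;
    // data files are only scheduled for just-in-time background download.
    public void restore(Iterable<String> stateFileNames) {
        for (String file : stateFileNames) {
            background.submit(() -> cacheLocally(file));
        }
    }

    // Serve a read from the local cache if the file has arrived, otherwise
    // go directly to remote storage (slower, but the job keeps running).
    public byte[] readValue(String file) {
        byte[] cached = localCache.get(file);
        return cached != null ? cached : remote.read(file);
    }

    private void cacheLocally(String file) {
        byte[] bytes = remote.read(file);
        localCache.put(file, bytes);
        try {
            Files.write(localCacheDir.resolve(file), bytes); // local disk acts as an accelerator tier
        } catch (Exception e) {
            // A real system would retry or report this; ignored in the sketch.
        }
    }
}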

Image 4: Comparison of job downtime between OS Flink and VERA state engines

As shown in Image 4, VERA’s approach significantly shortens job downtime compared to open-source Flink. VERA can access the remote storage directly, using the data files on the remote storage to locate and merge metadata, quickly restoring the database instances and getting the job into a RUNNING state. VERA performs the file cleanup process to remove (prune) redundant data asynchronously, with almost no impact on the performance of the state reading and writing threads.

VERA minimizes downtime by streamlining state backend rebuilding. Delayed pruning allows
jobs to begin processing as soon as the state is restored. Lazy state restore further reduces
downtime, allowing data processing to start after retrieving minimal metadata. The remaining
state is downloaded progressively, with job performance improving as more state data is
cached locally.

By allowing jobs to begin processing while the state is still being restored and
asynchronously pruning redundant data, VERA achieves up to 90% faster recovery times
during scaling events. (See the Comparative Performance Evaluation later in this document.)
Result

VERA overcomes local disk limitations in Flink by decoupling compute and storage, enabling
independent scaling for both. This approach removes constraints imposed by local storage
capacity, ensuring efficient handling of large-scale stateful applications.

VERA’s tiered storage architecture further optimizes performance, storing frequently accessed data locally while offloading less critical data to scalable remote storage. Through lazy state loading and asynchronous file download, VERA minimizes downtime, allowing jobs to start processing with minimal metadata while the full state loads in the background.

This multi-tiered strategy enables VERA to balance state recovery speed with resource
efficiency, improving scalability, resilience, and fault tolerance in cloud-native environments.

Challenge 2: Joining Data at Scale


Stateful computations are at the core of many stream processing tasks in Apache Flink®, but
they often face performance issues as state sizes grow. Operations such as streaming joins,
for example, must store unprocessed records from one stream until matching records from
another stream arrive. While RocksDB, Flink’s default state backend, is optimized for
write-intensive workloads, it struggles with performance in two main areas as data volumes
grow:

● Flink’s single-threaded state access and update interface limits performance, especially in state-heavy, low-latency applications.

● As the state grows in size, relying on local disk storage for state management in
RocksDB leads to read/write delays, particularly during dual-stream or multi-stream
joins, where key-value pairs must be stored and accessed efficiently.

Consider the following dual-stream join example:

SELECT c.ad_id, ad_publisher_id, customer_feature
FROM click_mobile c
INNER JOIN show_mobile s
ON c.ad_id = s.ad_id
Image 5: Key-Value pair is stored on each side of the join.

As shown in Image 5, the key-value pair is stored in state for each side of the join, but not all records will produce matching results. For example, in an anomaly detection system, only a small subset of the data may lead to actual results, while the rest must be held in memory.

Impact
As state grows, the latency increases, especially when accessing unprocessed records
waiting for matches across streams. This impacts real-time processing in use cases like
session tracking, fraud detection, and real-time analytics. Users may experience not only
slower processing but also higher operational complexity as Flink’s performance degrades.
These issues lead to longer recovery times, reduced scalability, and stability challenges
during periods of high data volume, which disrupts real-time analytics and complex event
processing pipelines.

Solution
VERA solves these challenges by separating keys from values. Large values are stored
separately, with references placed in the log-structured merge-tree (LSM tree). This reduces
unnecessary I/O overhead by enabling quick access to keys, which minimizes data movement
during updates and compactions.
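A minimal conceptual sketch of the technique, with hypothetical classes rather than Gemini's actual data structures: the sorted index keeps only small keys plus a pointer, while large values are appended to a separate value log and fetched only when they are needed.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual key-value separation; names and structures are hypothetical.
public class KeyValueSeparatedStore {

    // Small, fixed-size entry kept in the index (standing in for an LSM tree).
    private record ValuePointer(int position, int length) {}

    private final Map<String, ValuePointer> index = new TreeMap<>(); // key -> pointer only
    private final List<byte[]> valueLog = new ArrayList<>();         // large values appended here

    public void put(String key, byte[] value) {
        valueLog.add(value);                                         // value goes to the value log
        index.put(key, new ValuePointer(valueLog.size() - 1, value.length));
    }

    // Key lookups (and, in a real LSM tree, compactions) touch only the small index entries.
    public boolean contains(String key) {
        return index.containsKey(key);
    }

    // The large value is read only when it is actually needed.
    public byte[] get(String key) {
        ValuePointer ptr = index.get(key);
        return ptr == null ? null : valueLog.get(ptr.position());
    }

    public static void main(String[] args) {
        KeyValueSeparatedStore store = new KeyValueSeparatedStore();
        store.put("ad-42", new byte[1024]);                          // a "large" value
        System.out.println(store.contains("ad-42"));                 // cheap: index only
        System.out.println(store.get("ad-42").length);               // pays the value read
    }
}
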
Why Key-Value Separation Matters
In scenarios like anomaly detection or recommendation systems, only a small portion of the state contributes to final results. Key-value separation allows VERA to quickly access keys without immediately loading large values, reducing CPU and I/O overhead. This is especially beneficial for large-scale streaming joins, where full-table data must be processed efficiently, even when managing terabytes of state.

For use cases with limited local storage space, VERA stores the separated value data (cold
data) in a distributed file system, ensuring key lookups remain unaffected. VERA
automatically detects local storage capacity thresholds, maintaining performance similar to
local storage but at a lower cost.

Additionally, VERA addresses common issues associated with key-value separation, such as
inefficiency in range queries and storage amplification. By focusing on the specific needs of
stream processing (where range queries are less common), and leveraging object storage,
VERA ensures both performance and resource efficiency.

Finally, to maximize the performance advantages of key-value separation, VERA supports adaptive key-value separation.

Image 6: Adaptive Key-Value Separation Process.

In a stream computing scenario, different jobs have different data characteristics (value length, key and value access frequency), and fixed key-value separation parameters make it difficult to optimize the performance of all jobs. With adaptive key-value separation, VERA continuously monitors state access patterns and dynamically adjusts parameters, ensuring optimal performance without requiring manual configuration or job restarts.
Result
VERA enhances Flink’s state management by implementing advanced techniques such as
adaptive key-value separation and in-memory optimizations to reduce latency and improve
throughput for large-scale, real-time streaming applications.

Challenge 3: Ensuring Stable and Predictable Stream Processing
Predictable stream processing requires systems to handle high-throughput data and large
states with minimal latency, especially in real-time applications. However, Apache Flink®
struggles with several limitations when scaling, leading to performance degradation under
load. Key challenges include:

● State Management Issues: Flink's reliance on RocksDB for local disk storage
introduces significant performance bottlenecks. RocksDB is optimized for
write-heavy operations, but when dealing with large states, particularly in real-time
use cases, it struggles to handle the rapid state access required, resulting in latency
issues during both operation and recovery.

● Job Recovery Latency: Flink's job recovery relies on state snapshots (checkpoints and
savepoints) to restore a job’s state after failure. However, this process is
synchronous, requiring all state data to be fully reloaded before the job can continue.
This synchronous reloading increases downtime significantly, especially for large
stateful applications, as the entire state must be restored before job execution
resumes.

● Rescaling Bottlenecks: When scaling, Flink often needs to restart and reschedule
jobs, which prolongs recovery times and disrupts real-time performance. Rescaling
isn’t seamless and can lead to longer recovery times, particularly in large-scale
applications with heavy data loads.

Due to Flink’s recovery inefficiencies, users encounter unpredictable job recovery times,
making it difficult to meet performance guarantees in critical real-time applications such as
fraud detection or session window tracking.
More on State Snapshot Challenges
Flink’s dependency on state snapshots deserves a deeper look. In Flink, state snapshots are
created using checkpoints and savepoints to ensure fault tolerance and reliable recovery.
Checkpoints are automated snapshots that capture the job's state periodically, enabling the
system to restore state in case of failure with minimal data loss. Savepoints, on the other
hand, are manually triggered, consistent snapshots that allow users to restart or update jobs
at a specific point in time, often used for version upgrades or job rescaling. Both mechanisms
store state data externally, ensuring state consistency across failures or job restarts.

When Flink triggers a checkpoint, it flushes the data from RocksDB memory to disk and
generates new files, introducing undesired outcomes:

● Unnecessary cache misses increase disk reads, degrading performance.

● Frequent, forced compactions increase the overall CPU and I/O overhead.

Image 7: Periodic checkpoint of Flink jobs causes periodic CPU spikes.

Sudden resource contention due to CPU spikes in the cluster complicates predictable capacity planning, as it forces a choice between sacrificing stability by underestimating capacity or losing efficiency by provisioning for peak CPU bursts. The performance of the Flink jobs declines, compromising the stability and reliability of the cluster, particularly in large-scale stateful streaming applications.

Impact
Given that stateful streaming applications are expected to run continuously (24/7), Flink’s
job recovery limitations can result in a range of negative consequences for users:

● Extended Downtime: Users face long delays during state recovery after failures,
disrupting time-sensitive applications like fraud detection or real-time analytics.
● Operational Complexity: Manual intervention is often required to manage checkpoints
and tune job recovery. This increases operational overhead and makes it more
challenging to maintain a smooth-running stream processing infrastructure.

● Scalability Challenges: Flink’s inefficiency in managing large state sizes makes it hard
to scale predictably. Users face bottlenecks when scaling jobs, reducing system
stability and making it harder to handle growing workloads.

● Inconsistent Real-Time Processing: Unpredictable job recovery times make it difficult for users to guarantee stable, low-latency real-time services, affecting mission-critical applications like financial trading and monitoring.

● Missed SLAs: Slow job recovery can lead to missed service-level agreements (SLAs),
resulting in penalties or lost opportunities in high-priority use cases.

Flink’s job recovery limitations lead to extended recovery times, operational complexity,
scalability issues, and unpredictable system behavior, which affect the performance and
reliability of critical stream processing tasks.

Solution
VERA addresses Flink's job recovery limitations through key innovations.

First, as discussed in previous sections, VERA decouples state storage from compute, using
a tiered storage architecture where larger states are stored remotely. This enables faster
scaling since only critical state data is restored, while non-essential data is retrieved as
needed. This lazy state restore mechanism improves the speed and scalability of the system
during recovery and normal operations, allowing for smooth scaling under load without
overwhelming the system with unnecessary state restores.

Additionally, VERA reduces checkpointing overhead by keeping remote state files intact,
eliminating the need to re-upload them during checkpointing. To further optimize, VERA
uploads memory table snapshots directly to remote storage during checkpointing,
minimizing CPU and I/O impact. This streamlined process also reduces savepoint times to
just five seconds, aligning with SLA requirements and ensuring efficient resource use (see Image 8).
Image 8: Word Count example showing that savepoint times are lower with VERA.

Next, VERA enhances Apache Flink's job performance and stability through dynamic
scheduling and resource management. During task failures, Flink's default behavior is to
restart all tasks in the same region, which can lead to downtime. VERA optimizes this by
implementing local task recovery, which isolates and recovers only the failed tasks. This
granular recovery reduces unnecessary job restarts, minimizing the time it takes to recover
and reducing pressure on the cluster during high-parallelism workloads. This feature ensures
faster recovery and better system performance under load.
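For context, open-source Flink already exposes two related knobs, shown in the hedged sketch below using standard configuration keys (VERA's task-level recovery and hot updates go beyond what these settings provide): region-based failover limits how many tasks restart, and local recovery lets restarted tasks reuse state kept on their local disks.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RecoveryBaselineConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Restart only the failed region of the job graph instead of every task.
        conf.setString("jobmanager.execution.failover-strategy", "region");

        // Let restarted tasks reuse state copies kept on their local disks,
        // avoiding a full re-download from checkpoint storage where possible.
        conf.setString("state.backend.local-recovery", "true");

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(1, conf);
        env.fromElements("a", "b", "c").print();
        env.execute("recovery-baseline");
    }
}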

Additionally, VERA supports approximate recovery semantics, which allows certain applications (like online AI training or anomaly detection) to resume operations with partial
state restoration, further improving recovery speed without fully reloading all state data. This
trade-off between recovery precision and speed ensures that the system maintains
near-continuous operation, even in scenarios where some data loss is tolerable.

Unlike Flink's static resource allocation, VERA dynamically adjusts resource allocation based
on task load and resource needs. This approach ensures that tasks with higher demands
(such as those with more intensive compute or memory needs) get the resources required
for efficient execution, while less resource-intensive tasks don't consume unnecessary
system capacity. This flexibility reduces resource waste and improves overall job
performance, especially under high load.

VERA can allocate resources and schedule tasks based on real-time data access patterns
and compute load. This means that as data flow changes or increases, the system
automatically adjusts to optimize task distribution across available resources. By preventing
resource bottlenecks or underutilization, VERA’s load-aware scheduling improves both
throughput and latency.

Lastly, VERA introduces live updates to resource allocations and task parameters without
requiring a full restart. These hot job updates allow for in-place scaling or configuration
changes to happen with minimal downtime, which is particularly useful in long-running jobs
or when system demands shift dynamically.

Result
By integrating dynamic scheduling and resource management, VERA enhances system
performance and resource utilization, even in large-scale and high-throughput stream
processing environments. VERA’s innovations enable faster, more scalable job recovery by
minimizing downtime through task-level recovery and seamless rescaling. These
optimizations reduce operational overhead, allowing for predictable and stable stream
processing under heavy loads, enabling real-time applications to meet SLAs and ensure
efficient system operation.

Challenge 4: Minimizing RPO and RTO


In stream processing, both job recovery and disaster recovery are essential for restoring
operations after a failure, though they address different levels of the system. Job recovery
focuses on resuming individual stream processing tasks after small-scale failures with
minimal disruption, typically involving strategies like checkpointing and asynchronous state
loading. On the other hand, disaster recovery focuses on restoring the entire system
infrastructure following catastrophic events such as hardware failures, network outages, or
data center crashes.

Efficient state management and scalability are key to maintaining stability and performance
in both types of recovery, particularly under load. Disaster recovery involves restoring not
just the stream processing tasks but also rebalancing resources (memory, compute, and networking capacity) across multiple data centers or cloud regions to handle the load after
recovery. This broader recovery is more resource-intensive and time-consuming.

A critical focus of both recovery processes is minimizing the Recovery Point Objective (RPO), which limits data loss, and the Recovery Time Objective (RTO), which limits downtime.
However, Flink’s default recovery mechanism relies heavily on synchronous recovery, which
causes significant downtime, particularly for applications with large state sizes. For example,
in real-time analytics or fraud detection, where any downtime can result in missed insights or
undetected fraudulent transactions, prolonged recovery times can lead to severe
consequences.

Additionally, resource contention during the recovery process may spike CPU and I/O usage,
complicating the overall performance and leading to unpredictable behavior. Such spikes
make it difficult to maintain predictable system performance, further hindering the ability to
meet RTO targets in high-demand environments.

During normal operation and recovery, Flink’s use of RocksDB can cause bottlenecks, leading
to latency issues when handling large states or complex workloads. In the case of a
catastrophic failure, like a data center or cluster outage, Flink’s reliance on synchronous
state reloading during recovery prolongs the recovery process, especially in large-scale or
high-parallelism environments. This challenge makes it difficult to achieve the low RPO and
RTO required in mission-critical real-time applications, where even minutes of downtime can
mean missed service-level agreements (SLAs) and penalties.

Impact
Flink's recovery mechanism has a substantial impact on both RPO and RTO in stream
processing environments. These two objectives define the data loss tolerance and
downtime acceptable in recovery scenarios.

RPO depends on the time interval between checkpoints. In practice, if a failure occurs just
before the next checkpoint, all the data since the last checkpoint is lost, potentially
increasing the RPO to the length of the checkpoint interval. For applications with strict data
integrity requirements, such as financial trading or real-time machine learning, losing
minutes of data can result in substantial financial or operational impact.
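As a concrete illustration of this trade-off, the checkpoint cadence in open-source Flink is set explicitly on the execution environment. The sketch below uses standard Flink APIs with interval values chosen arbitrarily for illustration: a shorter interval caps how much progress since the last checkpoint is at stake after a failure, at the cost of more frequent checkpointing overhead.

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointCadence {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 30 seconds: at most ~30s of work since the last
        // checkpoint has to be redone (or, in the worst case, is lost) after a failure.
        env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);

        // Guarantee at least 10s of checkpoint-free processing between checkpoints,
        // so checkpointing cannot monopolize CPU and I/O.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(10_000);

        // Fail a checkpoint that takes longer than 10 minutes instead of letting it pile up.
        env.getCheckpointConfig().setCheckpointTimeout(600_000);

        // Placeholder pipeline so the example runs end to end.
        env.fromElements(1, 2, 3).map(x -> x + 1).print();
        env.execute("checkpoint-cadence");
    }
}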

Flink’s recovery mechanism increases RTO by requiring the restoration of large states from
checkpoints. The full synchronous state restoration that Flink uses causes long recovery
times under load, which increases downtime. For example, in large-scale applications like
online banking or sensor-based monitoring systems, extended downtime can disrupt
services and erode customer trust. Additionally, resource contention during recovery further
degrades performance, slowing the process and making it harder to meet stringent RTO
requirements.
Solution
VERA significantly enhances both job and disaster recovery by decoupling state storage from
compute resources, allowing for faster and more scalable recovery times.

VERA’s dynamic recovery features also provide critical improvements. By implementing approximate state recovery, it allows certain applications (such as real-time AI training or
sensor data analysis) to resume processing without needing to fully reload all state data.
This dramatically reduces recovery time, providing a more flexible approach to disaster
recovery in scenarios where some tolerance for data loss exists.

In addition, VERA optimizes resource allocation during recovery by dynamically rebalancing compute and memory resources across available clusters, ensuring that the system can
efficiently handle the resumed load post-recovery. By preventing CPU and I/O bottlenecks,
VERA ensures a smoother recovery process, reducing downtime and improving overall
system performance.

Result

With VERA, both job and disaster recovery processes are faster and more scalable. The use
of approximate state recovery dramatically reduces both RPO and RTO, ensuring that critical
data is retained and systems are brought back online with minimal downtime. Resource
contention during recovery is reduced through dynamic resource allocation, resulting in
more predictable performance and smoother recovery for large-scale, stateful applications.

As a result, real-time applications such as fraud detection, financial services, and industrial
monitoring benefit from increased reliability and stability, allowing them to meet strict SLAs,
maintain customer trust, and avoid penalties due to system downtime.

Challenge 5: Achieving Business and Development Agility


Flink’s architecture, while robust for stream processing, lacks the flexibility required to rapidly
respond to evolving business needs, which is vital in real-time, mission-critical applications.
Users often encounter significant downtime when updating event processing rules or
managing large-scale workloads, as Flink’s mechanisms struggle to accommodate dynamic
updates or seamlessly scale across distributed systems. This limits business agility, slows
down development cycles, and complicates the management of real-time applications,
making it challenging to meet the demands of fast-moving industries such as finance,
e-commerce, and IoT.

Impact
The challenges in maintaining operational continuity lead to costly downtimes, which can
disrupt mission-critical services, including real-time financial processing or IoT systems.
These disruptions not only increase the risk of data loss but also undermine business agility,
forcing development teams to delay updates to avoid interruptions.

What’s more, users face difficulties efficiently managing state and metadata, resulting in
bottlenecks that further slow down the scaling of distributed applications and make it
difficult for organizations to meet the demands of large-scale, cloud-native environments. As
a result, productivity declines, development cycles slow down, pipeline management
becomes complicated, and opportunities for real-time optimization or innovation are missed.

Solution
VERA enhances the user experience by introducing intuitive abstractions that simplify
complex business use cases through a number of innovations, including Dynamic Complex
Event Processing (Dynamic CEP), enhanced catalogs, and built-in Flink Machine Learning
(ML).

Update Event Processing Rules On-the-Fly


VERA’s Dynamic Complex Event Processing (Dynamic CEP) allows users to update event
processing rules without service interruption, effectively decoupling upstream changes from
downstream impacts. This capability addresses two critical challenges:

1. Updating rules in real-time. Users can change rules for jobs running in production
without stopping the job, ensuring continuous operation.

2. Efficient serialization and deserialization. Newly created rules are efficiently serialized
and deserialized, facilitating smooth integration with external databases, such as
Amazon Relational Database Service (RDS), to seamlessly fetch new and updated
policies or rules.

This approach is key for managing mission-critical workloads. Some common use cases
include:
● Real-time risk control. Organizations can detect potentially malicious users by monitoring customer behavior logs and flagging those who initiate more than ten transfers totaling over 10,000 within a five-minute window as abnormal users (see the sketch after this list).

● Real-time marketing campaigns. Marketers can optimize strategies by identifying users who add products to their carts more than three times in ten minutes without completing a purchase, enabling adjustments to improve conversion rates during promotions.

● Anomaly detection. In IoT systems or manufacturing, abnormal conditions are detected by triggering alarms when sensor temperature readings exceed predefined thresholds more than three times within a specific time interval.
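To make the first use case concrete, the hedged sketch below expresses a static version of that risk-control rule with the open-source Flink CEP library (the event type, thresholds, and source are illustrative; what VERA's Dynamic CEP adds is the ability to swap such a rule at runtime, fetched from an external store, without stopping the job).

import java.util.List;
import java.util.Map;

import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternFlatSelectFunction;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class RiskControlCep {

    // Hypothetical event type used only for this illustration.
    public static class Transfer {
        public String userId;
        public double amount;
        public Transfer() {}
        public Transfer(String userId, double amount) { this.userId = userId; this.amount = amount; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in source; a production job would read transfers from Kafka, CDC, etc.
        // and assign event-time timestamps and watermarks.
        DataStream<Transfer> transfers = env.fromElements(
                new Transfer("u1", 1200.0), new Transfer("u1", 950.0), new Transfer("u2", 40.0));

        // Static rule: more than ten transfers from the same user within five minutes.
        Pattern<Transfer, ?> manyTransfers = Pattern.<Transfer>begin("transfers")
                .where(new SimpleCondition<Transfer>() {
                    @Override
                    public boolean filter(Transfer t) {
                        return t.amount > 0;            // accept every transfer event
                    }
                })
                .timesOrMore(11)                        // "more than ten"
                .within(Time.minutes(5));

        PatternStream<Transfer> matches = CEP.pattern(transfers.keyBy(t -> t.userId), manyTransfers);

        // Second half of the rule: the matched transfers must exceed 10,000 in total.
        matches.flatSelect(new PatternFlatSelectFunction<Transfer, String>() {
            @Override
            public void flatSelect(Map<String, List<Transfer>> match, Collector<String> out) {
                List<Transfer> events = match.get("transfers");
                double total = events.stream().mapToDouble(t -> t.amount).sum();
                if (total > 10_000) {
                    out.collect("abnormal user: " + events.get(0).userId);
                }
            }
        }).print();

        env.execute("risk-control-cep");
    }
}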

With VERA, teams collaborate more effectively. Business Intelligence (BI) teams can rely on
stable data feeds while upstream teams make necessary changes without risking
operational continuity.

Seamlessly Manage Cloud-Native Stream Processing Workloads

VERA strengthens the reliability and efficiency of large-scale stream processing systems by
ensuring high availability, improving system performance, and enabling scalability. Enhanced
catalogs provide real-time metadata for better state and query management. By optimizing
data access and state storage management through enhanced catalogs, VERA reduces
bottlenecks in real-time application development. VERA supports seamless scaling across
regions or cloud-native environments, crucial for applications handling large datasets in
industries, including e-commerce, IoT, and financial services.

These innovations position VERA to meet the stringent requirements of modern,


cloud-native stream processing workloads.

Build Machine Learning & AI Pipelines Quickly


VERA’s built-in Flink Machine Learning (ML) library simplifies the development and
deployment of ML pipelines by providing standard ML APIs and operators for data
preprocessing, training, and real-time model inference. These features allow users to
seamlessly integrate ML workflows into stream processing environments without the need
for additional tools or complex integrations. Additionally, VERA boosts performance and
usability through pre-installed dependencies and integrated platform capabilities, such as
file storage, catalogs, and cluster management, which enhance operational efficiency and
improve the overall user experience.
This ability to seamlessly integrate machine learning pipelines into real-time data processing
is essential for applications like fraud detection, recommendation systems, and real-time
analytics, where timely and accurate predictions are critical.
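As a flavor of what the bundled library looks like in practice, the hedged sketch below follows the style of the open-source Flink ML 2.x quickstart: a KMeans estimator is fit on a table of feature vectors and the resulting model is applied back to the stream (the input values are illustrative; VERA's pre-installed dependencies and integrated platform services remove the packaging and cluster setup this would otherwise require).

import org.apache.flink.ml.clustering.kmeans.KMeans;
import org.apache.flink.ml.clustering.kmeans.KMeansModel;
import org.apache.flink.ml.linalg.DenseVector;
import org.apache.flink.ml.linalg.Vectors;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class StreamingKMeansExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Illustrative feature vectors; a real pipeline would derive these
        // from a live stream (clicks, sensor readings, transactions, ...).
        DataStream<DenseVector> points = env.fromElements(
                Vectors.dense(0.0, 0.0), Vectors.dense(0.3, 0.1),
                Vectors.dense(9.0, 9.0), Vectors.dense(9.2, 8.8));
        Table input = tEnv.fromDataStream(points).as("features");

        // Train a KMeans model and apply it back to the same table.
        KMeans kmeans = new KMeans().setK(2).setSeed(1L);
        KMeansModel model = kmeans.fit(input);
        Table predictions = model.transform(input)[0];

        // Each row now carries the original features plus a cluster prediction.
        tEnv.toDataStream(predictions).print();
        env.execute("flink-ml-kmeans");
    }
}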

By simplifying machine learning model deployment and management, VERA accelerates development cycles and reduces operational overhead. This enables data scientists and
engineers to collaborate more effectively, moving models from the lab to production faster,
ensuring real-time applications meet their performance and scalability requirements.

Results

VERA’s innovations significantly enhance business and development agility in stream processing environments. With Dynamic CEP, users can update rules without downtime,
ensuring operational continuity and reducing the time needed to adapt to changing business
requirements. Enhanced catalogs minimize downtime and bottlenecks, making large-scale
stream processing more efficient and scalable across distributed systems. VERA’s built-in
Flink ML library accelerates the deployment of machine learning pipelines and simplifies the
process for data scientists and developers to build, train, and deploy models in real-time
environments, further enhancing agility by speeding up the adoption of AI-driven solutions.

Additionally, VERA’s support for seamless scaling and faster recovery reduces operational
overhead, allowing businesses to respond more quickly to market changes while enabling
developers to implement updates with minimal disruption.

Together, these enhancements eliminate repetitive, time-consuming manual tasks, improve system performance through automation, and ensure better resource management, all of which reduce operational overhead. They also ensure that
real-time applications can meet stringent SLAs, improve system reliability, and support rapid
innovation cycles, allowing teams to focus more on innovation and less on maintaining the
infrastructure.

The chart summarizes how VERA solves the challenges users face with open-source Flink:
Image 9: Summary of VERA solutions for the challenges users face with open-source Flink

Now that we’ve explained what VERA is and how it addresses the challenges that Flink users
face, it’s time to show you the results!

Comparative Performance Evaluation


This section reviews the evaluation and test results of the VERA State Engine (Gemini) and
Apache Flink® State Engine (RocksDB). Several tests were conducted to assess their
performance, with results detailed below.

Note: During its development, VERA’s state engine was rigorously validated with real-world
use cases. It has consistently shown superior performance over Flink’s State Engine
(RocksDB) in handling terabyte-scale stream processing workloads.

Key Findings At A Glance


Image 10 provides a preview of the key findings of this evaluation.
Image 10: Key Findings of the Evaluation

With these insights in mind, let’s review the specific results from the performance tests.

Test #1: Processing Performance Improvements


The configuration for this stateful stream processing use case is based on the Nexmark
benchmark suite for queries over continuous data streams. This use case simulates
auction-related events in a streaming environment, including bids, auction items, and user
behavior.

● User: a person submitting an item for auction and/or making a bid on an auction item.
● Auction Item: an item currently being auctioned.
● Bid: an offer made for an item being auctioned.

This test uses the standard Nexmark configuration, which includes four key parameters (a sketch of the corresponding Flink settings follows this list):

● Parallelism (8): The system processes the stream using 8 parallel tasks.

● TaskManagers (8): The system uses 8 TaskManagers (nodes) responsible for executing the individual tasks of the stream processing job.

● Memory (8GB): Each TaskManager is allocated 8 GB of memory to handle stateful operations (such as maintaining auction states, bid history, and user behavior) without running out of memory.

● Dataset Size (100M): The system processes a total dataset of 100 million events (EventsNum=100M), which represents a mixture of bids, auctions, and user data flowing through the system for processing.
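For readers who want to approximate this setup, the sketch below maps those parameters onto standard Flink configuration options (an assumption-laden illustration, not the exact benchmark harness; on a real cluster these values would go into flink-conf.yaml on each of the 8 nodes, and the workload itself comes from the Nexmark repository referenced with the results).

import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.CoreOptions;
import org.apache.flink.configuration.MemorySize;
import org.apache.flink.configuration.TaskManagerOptions;

public class NexmarkLikeSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        conf.set(TaskManagerOptions.TOTAL_PROCESS_MEMORY, MemorySize.parse("8g")); // 8 GB per TaskManager
        conf.set(TaskManagerOptions.NUM_TASK_SLOTS, 1);                            // one slot per TaskManager
        conf.set(CoreOptions.DEFAULT_PARALLELISM, 8);                              // 8 parallel tasks

        System.out.println(conf);
    }
}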

Results
The test results (see Image 11) for point lookups and write operations, which account for a
large proportion of all state access in the Flink scenario, show that VERA significantly boosts
job performance. VERA achieves an average improvement of 50% in single-core throughput,
with some stateful workloads seeing even greater performance gains.

Image 11: Comparison of performance measured by throughput per core. (Note: Higher throughput per
core indicates better performance and more efficient utilization of CPU resources.)

*For information about the queries (q1-q22), see https://github.com/nexmark/nexmark.

Analysis
This test reveals that VERA effectively optimizes state handling and job execution,
outperforming Flink's standard state engine. The results demonstrate VERA’s capability to
handle high-throughput workloads more efficiently, highlighting its potential to deliver
notable performance improvements in real-world stream processing tasks.
Test #2: Rescaling Speed
This test uses a WordCount Benchmark to compare the rescaling performance between
VERA’s state engine (Gemini) and Flink’s state engine (RocksDB). The job's total state size
was maintained at approximately 128GB, and the source data was generated following a
uniform distribution. Each TaskManager was allocated 1 vCPU and 4GB of memory, and each
TaskManager was assigned one task slot.

Results
This evaluation compares VERA’s state engine (Gemini) and Flink’s state engine (RocksDB) in
three scenarios:

● Case A: VERA with Hot Job Updates


● Case B: VERA with Hot Job Updates + Lazy State Loading features
● Case C: VERA with adjusted job parallelism

In all three test cases, VERA exhibits substantially less downtime than Flink’s state engine
(RocksDB).

Case A: VERA with Hot Job Updates


The test results (Image 12) show that job restarts with 128GB of state resulted in a 94%
performance improvement, while restarts with 256GB of state saw a 96% boost.
Image 12: Comparison of job restarts with / without Hot Job Updates enabled

VERA’s Hot Job Updates feature significantly reduces downtime during job restarts, handling
large state sizes with much greater efficiency than Flink’s state engine (RocksDB). This
indicates that VERA’s update mechanism minimizes disruption, enhancing overall system
responsiveness and availability.

Case B: VERA with Hot Job Updates + Lazy State Loading


Test results (see Image 13) reveal that combining hot job updates with lazy state loading cut
job restart times by 90%.

Image 13: Comparison of job restarts with Hot Job Updates and Lazy State Loading enabled
The integration of Lazy State Loading with Hot Job Updates provides a substantial boost in
restart speed, showcasing VERA’s advanced capabilities in minimizing downtime. This
combined approach allows for more rapid job recovery and state management, offering a
more agile and responsive data processing solution.

Case C: VERA with Adjusted Job Parallelism


In scenarios requiring adjustments to the number of parallel tasks, whether scaling up or
down to optimize performance, handle varying workloads, or improve resource utilization,
VERA (Image 14) shows significantly less downtime than OS Flink, with a 32% performance
improvement when scaling up, and a 47% performance improvement when scaling down.

Image 14: Comparison of downtime when scaling job parallelism up and down.

VERA’s ability to manage parallel task adjustments with significantly reduced downtime
highlights its efficiency in optimizing performance and resource utilization. The higher
performance improvements when scaling down suggest that VERA excels in reallocating
resources and managing state in dynamic scenarios, which is crucial for handling fluctuating
workloads and enhancing overall system flexibility.
Analysis
VERA’s state engine (Gemini) demonstrates notable advantages over Flink’s state engine
(RocksDB) in terms of downtime reduction and performance efficiency. By integrating Hot
Job Updates, Lazy State Loading, and optimized parallelism adjustments, VERA achieves
substantial improvements in job restart times and resource management, resulting in a more
reliable and agile Streaming Data Platform.

Test #3: Key-Value Separation


This test uses the Nexmark Q20 benchmark (a Join use case) to evaluate the performance of
key-value separation. To better simulate dual-stream Join scenarios with large state sizes,
we intentionally increased the data volume to 400 million and 800 million events
(EventsNum=400M/800M).

Note: Other setups and test environments remain consistent with those described above in
Test 1.

Results
As shown in Image 15, enabling key-value separation in VERA for the Q20 dual-stream Join scenario leads to a substantial boost in job throughput, ranging from 50% to 70%.
Image 15: Comparison of job throughput with and without key-value separation enabled

Analysis
This improvement indicates VERA’s ability to efficiently manage and process large state sizes,
resulting in more effective resource utilization and faster data processing. The performance
gains observed with key-value separation highlight its effectiveness in optimizing job
throughput compared to scenarios without this feature.

This demonstrates VERA's advanced capability to enhance streaming data processing performance by leveraging key-value separation, which is particularly beneficial for handling complex and high-volume data streams.

Overall, the test underscores VERA’s superior efficiency in managing stateful streaming
workloads, thereby offering significant performance advantages over traditional state
management approaches.
Bringing VERA to Everyone
VERA was developed to tackle critical challenges in data stream processing, such as latency,
scalability, operational complexity, and business agility. By addressing these pain points,
Ververica built a solution designed to meet the demands of modern data-driven applications
while maintaining full compatibility with Apache Flink.

With a deep understanding of the limitations in the existing open-source Flink technology,
VERA integrates advanced features like optimized state storage, smart resource
management, and dynamic processing capabilities. This approach not only enhances
performance, but it also bridges the gap between complex capabilities and broader
accessibility.

VERA’s innovative architecture ensures that stateful stream processing is much more
resilient, accessible, and easier to manage, making Apache Flink® accessible to users with
varying levels of expertise or resources. This marks a significant step toward making
sophisticated data stream processing available and practical for diverse applications and
industries.

Conclusion
The evaluation demonstrates that VERA enhances the functionalities of open-source
Apache Flink® while preserving full compatibility. This combination of open-core and
proprietary technology underscores the strength of the Ververica Streaming Data Platform.

VERA delivers a robust and scalable platform for high-throughput, low-latency workloads
with adaptive performance across various conditions. Its user-friendly design allows
developers to focus on building applications rather than managing infrastructure.

With its modern, powerful, and accessible cloud-native engine, VERA introduces features
aimed at improving real-time stream processing. The future holds great potential as
customers leverage VERA to drive innovation and achieve their goals.

Ready to try out VERA? Contact us: https://www.ververica.com/contact


About Ververica
Ververica empowers businesses with high performance data streaming and processing
solutions that streamline operations, increase developer efficiency, and enable customers to
solve real-time use cases reliably and securely.

Ververica makes it easy for organizations to harness data insights at scale, with our
advanced Streaming Data Platform powered by VERA, the engine that revolutionizes Apache
Flink.

References
● Stateful Stream Processing | Apache Flink: https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/concepts/stateful-stream-processing

● https://cwiki.apache.org/confluence/x/R4p3EQ

● Ververica Runtime Assembly (VERA): https://docs.ververica.com/vvc/about-ververica-cloud/vera/#geministatebackend

● Apache Flink state benchmarks: https://github.com/apache/flink-benchmarks/tree/master/src/main/java/org/apache/flink/state/benchmark

● GitHub - nexmark/nexmark: Benchmarks for queries over continuous data streams: https://github.com/nexmark/nexmark

● https://flink-learning.org.cn/article/detail/baf859431244a0c24aa36c2250c90967?spm=a2csy.flink.0.0.49496ea8tkSye7&name=article&tab=suoyou&page=2

● https://flink-learning.org.cn/article/detail/1563c71112adccac151151e609184377?spm=a2csy.flink.0.0.49496ea8tkSye7&name=article&tab=suoyou&page=2
