VERA White Paper
This white paper introduces the Ververica Runtime Assembly (VERA) architecture, describes
why Ververica built VERA and the journey to democratize stream processing, and presents
the results of our performance comparison between VERA’s state engine (Gemini) and
the open-source Apache Flink® state engine (RocksDB).
The Motivation Behind VERA
Let’s travel back to 2009, when the big data landscape was shaped by two key ideas:
“computing as a utility” as seen in Grid computing technologies, and “data as a resource”
leveraged in peer-to-peer networks. The framework at the center of big data during this time
was Apache Hadoop, an open-source framework for distributed data processing inspired by
Google’s 2004 MapReduce paper. Hadoop combined Grid computing’s distributed model with
the decentralized, large-scale data sharing of Peer-to-Peer (P2P) systems to create a
platform designed for big data storage and processing.
Then came two academic projects that would transform the big data landscape: Apache Flink,
which started as the “Stratosphere” research project at the Technical University of
Berlin, and Apache Spark, born at the AMPLab at the University of California, Berkeley. The
introduction of Spark and Flink marked the shift from batch-only processing in Hadoop to
faster, more flexible systems capable of handling both real-time and batch data.
While both technologies were significant in the big data ecosystem, Spark and Flink followed
two different paths as they evolved to meet varying user needs and technical challenges.
● Spark followed a batch-first model, built around its Resilient Distributed Dataset
(RDD) API. Although it later expanded its capabilities with add-ons for stream
processing, Spark’s roots remain in batch operations.
● Flink focused on real-time stream processing, treating batch jobs as a special case of
streams and handling finite data sets (bounded streams) much like real-time
streams.
By 2016, Flink was the go-to processing solution for event-driven and streaming workloads;
adoption reached a tipping point as more companies relied on it for critical tasks.
Real-time data movement emerged as the norm, driven by tools like Apache Kafka®, Apache
Pulsar™, and AWS Kinesis. These tools enable continuous, real-time data streams from
sources such as Internet of Things (IoT) devices, logs, and social media.
Concurrently, storage models also evolved, shifting from HDFS to cloud-native object
storage like Amazon S3, Google Cloud Storage, and Azure Blob Storage, enhancing cost
effectiveness and scalability for modern data workloads.
The demand for real-time insights led to the integration of streaming data and machine
learning, enabling organizations to train models on fresh, real-time data rather than
historical batches. This unlocks instant insights for predictive analytics and automated
decision-making.
At the same time, network bandwidth increased significantly, from hundreds of megabits per
second a decade ago to hundreds of gigabits today. High-speed network connections are now
becoming standard in data center hardware, particularly at the chassis level (the enclosure
housing multiple server blades), with plans to integrate them directly into individual
blades for even faster networking. Advances in
technologies like fiber optics, 5G, and higher capacity Ethernet standards have also enabled
faster data transfers, reduced latency, and more efficient computing across cloud regions
and data centers.
This increase in bandwidth, combined with decreasing costs, accelerated the transition from
local to distributed and cloud-based storage solutions, allowing real-time applications to
handle massive data volumes.
Lastly, the rise of microservices and event-driven architectures changed how data is
processed, enabling more granular, scalable, and loosely coupled systems. The data mesh
paradigm, which decentralizes data ownership and management to domain-specific teams,
has emerged to address scalability challenges in large, distributed organizations.
While Apache Flink’s core architecture has not changed significantly over the past decade,
these trends have driven organizations and the community toward real-time processing,
cloud-native solutions, and intelligent automation, with an increasing emphasis on
scalability, resilience, and data governance.
Image 1 captures the primary components of VERA and its three layers.
Ververica also provides a set of advanced Change Data Capture (CDC) connectors with
enterprise optimizations for highly scalable ingestion. VERA seamlessly integrates with Flink
CDC and supports advanced CDC capabilities. By abstracting away the complexities related
to deployment, configuration, infrastructure management, scalability, fault tolerance, and
maintenance, VERA makes CDC more accessible for users who prioritize ease of use and
streamlined workflows.
Advanced Features
VERA’s advanced features enhance performance, optimize computing resource utilization,
and elevate the user experience. You can:
● Effortlessly scale large stateful applications because VERA separates compute and
storage resources.
● Recover faster from job failures or updates with advanced scheduling functions for
quicker restarts.
● Update event processing rules on the fly with Dynamic Complex Event Processing
(CEP).
● Unlock powerful machine learning and AI capabilities with built-in Flink ML and AI
applications that are fully integrated with the open-source Apache Flink® library that
provides pre-installed and pre-configured machine learning APIs.
Next, it’s important to take the time to understand the common challenges users face with
open-source Apache Flink®, and how VERA solves them.
These hurdles make it difficult for users without extensive Flink knowledge to use Apache
Flink effectively out of the box, resulting in a steep learning curve. By overcoming these
limitations, Ververica aims to democratize stream processing, making it accessible to a
broader audience, not just those with Flink expertise and substantial budgets or resources.
To learn more about how we set out to achieve this goal, check out insights from Ben
Gamble, Ververica’s Field CTO, in this video:
https://www.youtube.com/watch?v=gSFcXk8TOOM
In cloud-native environments with containerization, the fixed size of local disks creates
challenges for stateful stream processing. As state sizes grow and local disks reach their
storage limits, streaming jobs can become unstable. Addressing these issues by expanding
disk capacity or increasing job parallelism can be costly and requires maintenance windows.
Impact
Local disk limitations in Flink can lead to slower processing, reduced scalability, longer
recovery times, and higher failure rates. These impacts are evident in several scenarios:
● Flink often relies on local disk for checkpointing, creating savepoints, and managing
state. Limited I/O performance (e.g., slow read/write speeds) on local disks can create
bottlenecks, slowing down checkpointing and causing back pressure in the system.
● For stateful streaming jobs in Flink that store large amounts of data, limited local disk
capacity can lead to job failures or the inability to save checkpoints.
● As state grows in large-scale applications, local disk storage saturation can hinder
scalability, preventing the system from efficiently handling larger workloads. Resource
management becomes more complex as disks fill up faster, requiring more frequent
monitoring and maintenance to avoid performance degradation.
● Although Flink provides fault tolerance through state snapshots and recovery,
inconsistent local disk capacity may result in incomplete or corrupted checkpoints,
compromising fault tolerance and increasing recovery time after failures.
To manage these issues, many organizations adopt hybrid architectures that combine local
storage with remote solutions such as Amazon S3 or Google Cloud Storage to balance speed,
scalability, and resilience.
Solution
VERA solves this issue by decoupling compute and storage and leveraging a tiered storage
architecture.
VERA removes the dependency on local disks for storing state and treats them primarily as a
cache. This strategy enhances fault tolerance and provides seamless storage scaling by:
● Offloading state data to a remote distributed file system, eliminating concerns about
local disk capacity and simplifying job rescaling. This ensures that the size of the
state does not limit computing resources.
● Managing checkpointing directly on the distributed file system, ensuring robust job
recovery and improving resiliency and stability.
In short, by decoupling compute and storage, VERA ensures that stateful applications can
scale efficiently, improve resiliency, and optimize cost while delivering the performance
needed for dynamic, cloud-native environments.
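The tiered layout described above can be sketched in a few lines of plain Python. This is an illustrative model, not VERA’s actual API: the class name and methods are invented for the example, but the read path (memory first, then the local-disk cache, then remote storage) mirrors the description.

```python
# Hypothetical sketch of a tiered state read path. Names are illustrative,
# not VERA internals: memory is the hot tier, local disk acts only as a
# cache, and the remote distributed file system is the source of truth.
class TieredStateBackend:
    def __init__(self, remote_store):
        self.memory = {}                  # hot tier: in-memory table
        self.local_cache = {}             # warm tier: local disk as cache
        self.remote_store = remote_store  # cold tier: distributed file system

    def get(self, key):
        if key in self.memory:
            return self.memory[key]
        if key in self.local_cache:           # cache hit avoids remote I/O
            return self.local_cache[key]
        value = self.remote_store.get(key)    # fall back to remote storage
        if value is not None:
            self.local_cache[key] = value     # warm the cache for next access
        return value

    def put(self, key, value):
        # Writes land in memory; flushing to remote storage happens at
        # checkpoint time, so local disk capacity never limits state size.
        self.memory[key] = value
```

Because the local disk is only a cache, evicting from it never loses data, which is what lets the state grow past local disk capacity.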
VERA enhances this tiered storage model with rescaling optimizations that significantly
reduce job downtime and improve system scalability.
When scaling a job (either up or down), Flink typically needs to process and download all
state data before resuming operations, which can result in long recovery times. VERA
optimizes this process through two mechanisms:
● File Granular Merging and Trimming: VERA asynchronously trims redundant state data
during rescaling. This allows the job to resume faster, as the state backend does not
need to wait for pruning operations to complete before processing.
● Lazy Restore of State Files: VERA addresses slow state recovery and long job
interruption times by introducing lazy state loading, downloading state data
asynchronously. This allows the job to access remote storage directly, merge
metadata, and resume execution without waiting for the entire state to be restored
locally.
VERA's lazy restore mechanism optimizes state file loading during recovery. Unlike eager
restore state recovery, where all state data must be loaded before the job can begin, lazy
restore reads the metadata files along with a small amount of in-memory data from the
remote storage. Jobs can start processing immediately while the rest of the state is
asynchronously downloaded in the background.
Although the job starts fast, reading directly from remote storage involves I/O delays and
can add latency. To ensure optimal performance, memory and local disks act as accelerators,
and background threads perform on-demand, just-in-time downloads, gradually transferring
the remaining files from remote storage to the local disk.
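The contrast between eager and lazy restore can be made concrete with a small sketch. This is plain Python for illustration (the class and file names are invented, not VERA’s internals): eager restore would download every state file before the job starts, while lazy restore fetches only the metadata up front and pulls files on demand.

```python
# Illustrative sketch of lazy state restore (not VERA internals): the job
# starts after reading only metadata, and state files are pulled from
# remote storage on demand instead of all up front.
class LazyRestoredState:
    def __init__(self, remote_files):
        # Eager restore would copy every file here. Lazy restore reads only
        # the metadata (file names) synchronously before the job starts.
        self.remote_files = remote_files      # file name -> contents (remote)
        self.metadata = set(remote_files)     # small, fetched synchronously
        self.local_files = {}                 # filled in just-in-time

    def read(self, file_name):
        if file_name not in self.metadata:
            raise KeyError(file_name)
        if file_name not in self.local_files:  # just-in-time download
            self.local_files[file_name] = self.remote_files[file_name]
        return self.local_files[file_name]
```

In the real system the downloads also proceed asynchronously in the background, so performance converges to local-disk speed as the cache fills.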
Image 4: Comparison of job downtime between OS Flink and VERA state engines
VERA minimizes downtime by streamlining state backend rebuilding. Delayed pruning allows
jobs to begin processing as soon as the state is restored. Lazy state restore further reduces
downtime, allowing data processing to start after retrieving minimal metadata. The remaining
state is downloaded progressively, with job performance improving as more state data is
cached locally.
By allowing jobs to begin processing while the state is still being restored and
asynchronously pruning redundant data, VERA achieves up to 90% faster recovery times
during scaling events. (See the Comparative Performance Evaluation later in this document.)
Result
VERA overcomes local disk limitations in Flink by decoupling compute and storage, enabling
independent scaling for both. This approach removes constraints imposed by local storage
capacity, ensuring efficient handling of large-scale stateful applications.
This multi-tiered strategy enables VERA to balance state recovery speed with resource
efficiency, improving scalability, resilience, and fault tolerance in cloud-native environments.
● As the state grows in size, relying on local disk storage for state management in
RocksDB leads to read/write delays, particularly during dual-stream or multi-stream
joins, where key-value pairs must be stored and accessed efficiently.
As shown in Image 5, the key-value pair is stored in state for each side of the join, but not all
records will produce matching results. For example, in an anomaly detection system, only a
small subset of the data may lead to actual results, while the rest must be held in memory.
Impact
As state grows, the latency increases, especially when accessing unprocessed records
waiting for matches across streams. This impacts real-time processing in use cases like
session tracking, fraud detection, and real-time analytics. Users may experience not only
slower processing but also higher operational complexity as Flink’s performance degrades.
These issues lead to longer recovery times, reduced scalability, and stability challenges
during periods of high data volume, which disrupts real-time analytics and complex event
processing pipelines.
Solution
VERA solves these challenges by separating keys from values. Large values are stored
separately, with references placed in the log-structured merge-tree (LSM tree). This reduces
unnecessary I/O overhead by enabling quick access to keys, which minimizes data movement
during updates and compactions.
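A minimal sketch makes the separation concrete. This is illustrative Python, not VERA’s storage code: the 1 KB threshold, class name, and log layout are assumptions, but the structure matches the description, with the LSM tree holding keys and small values while large values live in a separate append-only log.

```python
# Hypothetical sketch of key-value separation: the LSM tree keeps only keys
# and small pointers, while large values live in a separate value log.
# The 1 KB threshold is an assumption for illustration, not a VERA default.
SEPARATION_THRESHOLD = 1024  # bytes

class SeparatedStore:
    def __init__(self):
        self.lsm = {}        # key -> small value, or ("ptr", log offset)
        self.value_log = []  # append-only log holding the large values

    def put(self, key, value):
        if len(value) > SEPARATION_THRESHOLD:
            self.value_log.append(value)                # large value written once
            self.lsm[key] = ("ptr", len(self.value_log) - 1)
        else:
            self.lsm[key] = value                       # small values stay inline

    def get(self, key):
        entry = self.lsm[key]
        if isinstance(entry, tuple) and entry[0] == "ptr":
            return self.value_log[entry[1]]             # one extra lookup
        return entry

    def contains(self, key):
        # Existence checks touch only the compact LSM tree, never the large
        # values -- this is where joins with few matches save I/O.
        return key in self.lsm
```

Compactions only rewrite the compact key/pointer entries, not the large values, which is where the reduced data movement comes from.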
Why Key-Value Separation Matters
In scenarios like anomaly detection or recommendation systems, only a small portion of the
state contributes to final results. Key-value separation allows VERA to quickly access keys
without immediately loading large values, reducing CPU and I/O overhead. This is especially
beneficial for large-scale streaming joins, where full-table data must be processed
efficiently, even when managing terabytes of state.
For use cases with limited local storage space, VERA stores the separated value data (cold
data) in a distributed file system, ensuring key lookups remain unaffected. VERA
automatically detects local storage capacity thresholds, maintaining performance similar to
local storage but at a lower cost.
Additionally, VERA addresses common issues associated with key-value separation, such as
inefficiency in range queries and storage amplification. By focusing on the specific needs of
stream processing (where range queries are less common), and leveraging object storage,
VERA ensures both performance and resource efficiency.
In a stream computing scenario, different jobs have different data characteristics (value
length, key, and value access frequency), and fixed key-value separation parameters make it
difficult to optimize the performance of all jobs. With adaptive key-value separation, VERA
continuously monitors state access patterns and dynamically adjusts parameters, ensuring
optimal performance without requiring manual configuration or job restarts.
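One way to picture adaptive separation is a threshold that tracks observed value sizes instead of staying fixed. The sketch below is an assumption-laden illustration, not VERA’s algorithm: the 100-write re-tuning cadence and the 90th-percentile rule are invented for the example.

```python
# Illustrative sketch of adaptive key-value separation: the separation
# threshold is re-tuned from observed value sizes. The sampling window and
# percentile rule here are assumptions, not VERA's actual parameters.
class AdaptiveThreshold:
    def __init__(self, initial=1024):
        self.threshold = initial
        self.samples = []

    def observe(self, value_len):
        self.samples.append(value_len)
        if len(self.samples) >= 100:          # re-tune after every 100 writes
            self.samples.sort()
            # Separate roughly the largest 10% of values.
            self.threshold = self.samples[int(len(self.samples) * 0.9)]
            self.samples.clear()

    def should_separate(self, value_len):
        return value_len > self.threshold
```

A job writing mostly tiny values would thus converge to a high threshold (little separation), while a job with large join payloads would converge to a low one, without any restart.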
Result
VERA enhances Flink’s state management by implementing advanced techniques such as
adaptive key-value separation and in-memory optimizations to reduce latency and improve
throughput for large-scale, real-time streaming applications.
● State Management Issues: Flink's reliance on RocksDB for local disk storage
introduces significant performance bottlenecks. RocksDB is optimized for
write-heavy operations, but when dealing with large states, particularly in real-time
use cases, it struggles to handle the rapid state access required, resulting in latency
issues during both operation and recovery.
● Job Recovery Latency: Flink's job recovery relies on state snapshots (checkpoints and
savepoints) to restore a job’s state after failure. However, this process is
synchronous, requiring all state data to be fully reloaded before the job can continue.
This synchronous reloading increases downtime significantly, especially for large
stateful applications, as the entire state must be restored before job execution
resumes.
● Rescaling Bottlenecks: When scaling, Flink often needs to restart and reschedule
jobs, which prolongs recovery times and disrupts real-time performance. Rescaling
isn’t seamless and can lead to longer recovery times, particularly in large-scale
applications with heavy data loads.
Due to Flink’s recovery inefficiencies, users encounter unpredictable job recovery times,
making it difficult to meet performance guarantees in critical real-time applications such as
fraud detection or session window tracking.
More on State Snapshots Challenges
Flink’s dependency on state snapshots deserves a deeper look. In Flink, state snapshots are
created using checkpoints and savepoints to ensure fault tolerance and reliable recovery.
Checkpoints are automated snapshots that capture the job's state periodically, enabling the
system to restore state in case of failure with minimal data loss. Savepoints, on the other
hand, are manually triggered, consistent snapshots that allow users to restart or update jobs
at a specific point in time, often used for version upgrades or job rescaling. Both mechanisms
store state data externally, ensuring state consistency across failures or job restarts.
When Flink triggers a checkpoint, it flushes the data from RocksDB memory to disk and
generates new files, introducing undesired outcomes:
● Frequent, forced compactions increase the overall CPU and I/O overhead.
● Sudden resource contention from CPU spikes in the cluster complicates predictable
capacity planning, forcing a choice between sacrificing stability by underestimating
capacity or losing efficiency by provisioning for peak CPU bursts.
The performance of Flink jobs declines, compromising the stability and reliability of the
cluster, particularly in large-scale stateful streaming applications.
Impact
Given that stateful streaming applications are expected to run continuously (24/7), Flink’s
job recovery limitations can result in a range of negative consequences for users:
● Extended Downtime: Users face long delays during state recovery after failures,
disrupting time-sensitive applications like fraud detection or real-time analytics.
● Operational Complexity: Manual intervention is often required to manage checkpoints
and tune job recovery. This increases operational overhead and makes it more
challenging to maintain a smooth-running stream processing infrastructure.
● Scalability Challenges: Flink’s inefficiency in managing large state sizes makes it hard
to scale predictably. Users face bottlenecks when scaling jobs, reducing system
stability and making it harder to handle growing workloads.
● Missed SLAs: Slow job recovery can lead to missed service-level agreements (SLAs),
resulting in penalties or lost opportunities in high-priority use cases.
Flink’s job recovery limitations lead to extended recovery times, operational complexity,
scalability issues, and unpredictable system behavior, which affect the performance and
reliability of critical stream processing tasks.
Solution
VERA addresses Flink's job recovery limitations through key innovations.
First, as discussed in previous sections, VERA decouples state storage from compute, using
a tiered storage architecture where larger states are stored remotely. This enables faster
scaling since only critical state data is restored, while non-essential data is retrieved as
needed. This lazy state restore mechanism improves the speed and scalability of the system
during recovery and normal operations, allowing for smooth scaling under load without
overwhelming the system with unnecessary state restores.
Additionally, VERA reduces checkpointing overhead by keeping remote state files intact,
eliminating the need to re-upload them during checkpointing. To further optimize, VERA
uploads memory table snapshots directly to remote storage during checkpointing,
minimizing CPU and I/O impact. This streamlined process also reduces savepoint times to
just five seconds, aligning with SLA requirements and ensuring efficient resource use (see
Image 8).
Image 8: Word Count example showing reduced savepoint times with VERA
Next, VERA enhances Apache Flink's job performance and stability through dynamic
scheduling and resource management. During task failures, Flink's default behavior is to
restart all tasks in the same region, which can lead to downtime. VERA optimizes this by
implementing local task recovery, which isolates and recovers only the failed tasks. This
granular recovery reduces unnecessary job restarts, minimizing the time it takes to recover
and reducing pressure on the cluster during high-parallelism workloads. This feature ensures
faster recovery and better system performance under load.
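The difference between a region restart and local task recovery can be shown with a toy model. This is plain Python for illustration, not Flink or VERA scheduler code; the task-to-region mapping is invented for the example.

```python
# Illustrative comparison (not Flink/VERA internals): a region restart
# restarts every task in a failed task's region, while local recovery
# restarts only the failed tasks themselves.
def region_restart(tasks, failed):
    """tasks: mapping of task id -> region id. Returns tasks to restart."""
    failed_regions = {tasks[t] for t in failed}
    return {t for t, region in tasks.items() if region in failed_regions}

def local_recovery(tasks, failed):
    # Only the failed tasks are isolated and recovered.
    return set(failed)
```

At high parallelism a region can contain hundreds of tasks, so restarting one task instead of a whole region is where the reduced recovery pressure comes from.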
Unlike Flink's static resource allocation, VERA dynamically adjusts resource allocation based
on task load and resource needs. This approach ensures that tasks with higher demands
(such as those with more intensive compute or memory needs) get the resources required
for efficient execution, while less resource-intensive tasks don't consume unnecessary
system capacity. This flexibility reduces resource waste and improves overall job
performance, especially under high load.
VERA can allocate resources and schedule tasks based on real-time data access patterns
and compute load. This means that as data flow changes or increases, the system
automatically adjusts to optimize task distribution across available resources. By preventing
resource bottlenecks or underutilization, VERA’s load-aware scheduling improves both
throughput and latency.
Lastly, VERA introduces live updates to resource allocations and task parameters without
requiring a full restart. These hot job updates allow for in-place scaling or configuration
changes to happen with minimal downtime, which is particularly useful in long-running jobs
or when system demands shift dynamically.
Result
By integrating dynamic scheduling and resource management, VERA enhances system
performance and resource utilization, even in large-scale and high-throughput stream
processing environments. VERA’s innovations enable faster, more scalable job recovery by
minimizing downtime through task-level recovery and seamless rescaling. These
optimizations reduce operational overhead, allowing for predictable and stable stream
processing under heavy loads, enabling real-time applications to meet SLAs and ensure
efficient system operation.
Efficient state management and scalability are key to maintaining stability and performance
in both types of recovery, particularly under load. Disaster recovery involves restoring not
just the stream processing tasks but also rebalancing resources (memory, compute, and
networking capacity) across multiple data centers or cloud regions to handle the load after
recovery. This broader recovery is more resource-intensive and time-consuming.
A critical focus of both recovery processes is minimizing the Recovery Point Objective
(RPO), which limits data loss, and the Recovery Time Objective (RTO), which limits downtime.
However, Flink’s default recovery mechanism relies heavily on synchronous recovery, which
causes significant downtime, particularly for applications with large state sizes. For example,
in real-time analytics or fraud detection, where any downtime can result in missed insights or
undetected fraudulent transactions, prolonged recovery times can lead to severe
consequences.
Additionally, resource contention during the recovery process may spike CPU and I/O usage,
complicating the overall performance and leading to unpredictable behavior. Such spikes
make it difficult to maintain predictable system performance, further hindering the ability to
meet RTO targets in high-demand environments.
During normal operation and recovery, Flink’s use of RocksDB can cause bottlenecks, leading
to latency issues when handling large states or complex workloads. In the case of a
catastrophic failure, like a data center or cluster outage, Flink’s reliance on synchronous
state reloading during recovery prolongs the recovery process, particularly in large-scale
or high-parallelism environments. This challenge makes it difficult to achieve the low RPO and
RTO required in mission-critical real-time applications, where even minutes of downtime can
mean missed service-level agreements (SLAs) and penalties.
Impact
Flink's recovery mechanism has a substantial impact on both RPO and RTO in stream
processing environments. These two objectives define the data loss tolerance and
downtime acceptable in recovery scenarios.
RPO depends on the time interval between checkpoints. In practice, if a failure occurs just
before the next checkpoint, all the data since the last checkpoint is lost, potentially
increasing the RPO to the length of the checkpoint interval. For applications with strict data
integrity requirements, such as financial trading or real-time machine learning, losing
minutes of data can result in substantial financial or operational impact.
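The worst-case relationship between checkpoint interval and RPO is simple enough to state as a back-of-the-envelope calculation. The sketch below is plain Python for illustration and is not tied to any Flink API; the event rate is an example figure.

```python
# Worst-case RPO under periodic checkpointing: if a failure lands just
# before the next checkpoint, everything since the last one is lost.
def worst_case_rpo_seconds(checkpoint_interval_s):
    return checkpoint_interval_s

def worst_case_events_lost(checkpoint_interval_s, events_per_second):
    # Example figures: a 60 s interval at 10,000 events/s risks losing
    # up to 600,000 events.
    return checkpoint_interval_s * events_per_second
```

This is why cheaper, faster checkpoints matter: lowering the per-checkpoint CPU and I/O cost makes shorter intervals, and therefore a tighter RPO, affordable.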
Flink’s recovery mechanism increases RTO by requiring the restoration of large states from
checkpoints. The full synchronous state restoration that Flink uses causes long recovery
times under load, which increases downtime. For example, in large-scale applications like
online banking or sensor-based monitoring systems, extended downtime can disrupt
services and erode customer trust. Additionally, resource contention during recovery further
degrades performance, slowing the process and making it harder to meet stringent RTO
requirements.
Solution
VERA significantly enhances both job and disaster recovery by decoupling state storage from
compute resources, allowing for faster and more scalable recovery times.
Result
With VERA, both job and disaster recovery processes are faster and more scalable. The use
of approximate state recovery dramatically reduces both RPO and RTO, ensuring that critical
data is retained and systems are brought back online with minimal downtime. Resource
contention during recovery is reduced through dynamic resource allocation, resulting in
more predictable performance and smoother recovery for large-scale, stateful applications.
As a result, real-time applications such as fraud detection, financial services, and industrial
monitoring benefit from increased reliability and stability, allowing them to meet strict SLAs,
maintain customer trust, and avoid penalties due to system downtime.
Impact
The challenges in maintaining operational continuity lead to costly downtimes, which can
disrupt mission-critical services, including real-time financial processing or IoT systems.
These disruptions not only increase the risk of data loss but also undermine business agility,
forcing development teams to delay updates to avoid interruptions.
What’s more, users face difficulties efficiently managing state and metadata, resulting in
bottlenecks that further slow down the scaling of distributed applications and make it
difficult for organizations to meet the demands of large-scale, cloud-native environments. As
a result, productivity declines, development cycles slow down, pipeline management
becomes complicated, and opportunities for real-time optimization or innovation are missed.
Solution
VERA enhances the user experience by introducing intuitive abstractions that simplify
complex business use cases through a number of innovations, including Dynamic Complex
Event Processing (Dynamic CEP), enhanced catalogs, and built-in Flink Machine Learning
(ML).
Dynamic CEP enables:
1. Updating rules in real time. Users can change rules for jobs running in production
without stopping the job, ensuring continuous operation.
2. Efficient serialization and deserialization. Newly created rules are efficiently serialized
and deserialized, facilitating smooth integration with external databases, such as
Amazon Relational Database Service (RDS), to seamlessly fetch new and updated
policies or rules.
This approach is key for managing mission-critical workloads. Some common use cases
include:
● Real-time risk control. Organizations can detect potentially malicious users by
monitoring customer behavior logs and flagging as abnormal any user who initiates more
than ten transfers totalling over 10,000 within a five-minute window.
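The risk-control rule above can be expressed as a small predicate. This is plain Python for illustration only, not VERA’s CEP rule syntax; in Dynamic CEP such a rule would be defined externally and hot-swapped into the running job.

```python
# Sketch of the risk-control rule from the example above: flag a user who
# makes more than ten transfers totalling over 10,000 inside a five-minute
# window. Plain Python for illustration, not VERA's CEP rule format.
WINDOW_S = 5 * 60

def is_abnormal(transfers, now):
    """transfers: list of (timestamp_s, amount) events for one user."""
    recent = [amount for ts, amount in transfers if now - ts <= WINDOW_S]
    return len(recent) > 10 and sum(recent) > 10_000
```

With Dynamic CEP, tightening the rule (say, to five transfers) means publishing an updated rule definition, not stopping and redeploying the job.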
With VERA, teams collaborate more effectively. Business Intelligence (BI) teams can rely on
stable data feeds while upstream teams make necessary changes without risking
operational continuity.
VERA strengthens the reliability and efficiency of large-scale stream processing systems by
ensuring high availability, improving system performance, and enabling scalability. Enhanced
catalogs provide real-time metadata for better state and query management. By optimizing
data access and state storage management through enhanced catalogs, VERA reduces
bottlenecks in real-time application development. VERA supports seamless scaling across
regions or cloud-native environments, crucial for applications handling large datasets in
industries, including e-commerce, IoT, and financial services.
Results
Additionally, VERA’s support for seamless scaling and faster recovery reduces operational
overhead, allowing businesses to respond more quickly to market changes while enabling
developers to implement updates with minimal disruption.
The chart summarizes how VERA solves the challenges users face with open-source Flink:
Image 9: Summary of VERA solutions for the challenges users face with open-source Flink
Now that we’ve explained what VERA is and how it addresses the challenges that Flink users
face, it’s time to show you the results!
Note: During its development, VERA’s state engine was rigorously validated with real-world
use cases. It has consistently shown superior performance over Flink’s state engine
(RocksDB) in handling terabyte-scale stream processing workloads.
With these insights in mind, let’s review the specific results from the performance tests.
● User: a person submitting an item for auction and/or making a bid on an auction item.
● Auction Item: an item currently being auctioned.
● Bid: an offer made for an item being auctioned.
This test uses the standard Nexmark configuration which includes four key parameters:
● Parallelism (8): The system processes the stream using 8 parallel tasks.
Results
The test results (see Image 11) for point lookups and write operations, which account for a
large proportion of all state access in the Flink scenario, show that VERA significantly boosts
job performance. VERA achieves an average improvement of 50% in single-core throughput,
with some stateful workloads seeing even greater performance gains.
Image 11: Comparison of performance measured by throughput per core. (Note: Higher throughput per
core indicates better performance and more efficient utilization of CPU resources.)
Analysis
This test reveals that VERA effectively optimizes state handling and job execution,
outperforming Flink's standard state engine. The results demonstrate VERA’s capability to
handle high-throughput workloads more efficiently, highlighting its potential to deliver
notable performance improvements in real-world stream processing tasks.
Test #2: Rescaling Speed
This test uses a WordCount Benchmark to compare the rescaling performance between
VERA’s state engine (Gemini) and Flink’s state engine (RocksDB). The job's total state size
was maintained at approximately 128GB, and the source data was generated following a
uniform distribution. Each TaskManager was allocated 1 vCPU and 4GB of memory, and each
TaskManager was assigned one task slot.
Results
This evaluation compares VERA’s state engine (Gemini) and Flink’s state engine (RocksDB) in
three scenarios:
In all three test cases, VERA exhibits substantially less downtime than Flink’s state engine
(RocksDB).
VERA’s Hot Job Updates feature significantly reduces downtime during job restarts, handling
large state sizes with much greater efficiency than Flink’s state engine (RocksDB). This
indicates that VERA’s update mechanism minimizes disruption, enhancing overall system
responsiveness and availability.
Image 13: Comparison of job restarts with Hot Job Updates and Lazy State Loading enabled
The integration of Lazy State Loading with Hot Job Updates provides a substantial boost in
restart speed, showcasing VERA’s advanced capabilities in minimizing downtime. This
combined approach allows for more rapid job recovery and state management, offering a
more agile and responsive data processing solution.
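To make the idea behind Lazy State Loading concrete, the sketch below illustrates the general technique in Python. This is a minimal conceptual model, not VERA’s actual implementation: instead of downloading the entire state snapshot before processing resumes, the job becomes available immediately and individual state entries are fetched from the remote snapshot on first access, then cached locally. The class and variable names are invented for illustration.

```python
# Conceptual sketch of lazy state loading (not VERA's internals):
# the job "restarts" instantly, and state entries are materialized
# from the remote snapshot only when they are first touched.

class LazyState:
    def __init__(self, remote_snapshot):
        self.remote = remote_snapshot   # stands in for remote checkpoint storage
        self.local = {}                 # locally materialized entries

    def get(self, key):
        if key not in self.local:       # first access: pull from the snapshot
            self.local[key] = self.remote.get(key, 0)
        return self.local[key]

    def put(self, key, value):
        self.local[key] = value         # writes always land locally

# Usage: only the keys actually touched are ever downloaded.
snapshot = {"user-1": 10, "user-2": 25}
state = LazyState(snapshot)
state.put("user-1", state.get("user-1") + 1)
print(state.get("user-1"))  # 11
print(len(state.local))     # 1 -- untouched keys were never downloaded
```

The trade-off is that the first access to each key pays a remote-read penalty, which is why combining lazy loading with fast restart mechanisms (as the results above describe) is what shrinks overall downtime.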
Image 14: Comparison of downtime when scaling job parallelism up and down.
VERA’s ability to handle parallel task adjustments with significantly reduced downtime
highlights its efficiency in optimizing performance and resource utilization. The larger gains
observed when scaling down suggest that VERA excels at reallocating resources and
managing state in dynamic scenarios, which is crucial for handling fluctuating workloads and
enhancing overall system flexibility.
Analysis
VERA’s state engine (Gemini) demonstrates notable advantages over Flink’s state engine
(RocksDB) in terms of downtime reduction and performance efficiency. By integrating Hot
Job Updates, Lazy State Loading, and optimized parallelism adjustments, VERA achieves
substantial improvements in job restart times and resource management, resulting in a more
reliable and agile Streaming Data Platform.
Test #3: Key-Value Separation
Note: Other setups and test environments remain consistent with those described above in
Test #1.
Results
As shown in Image 15, enabling key-value separation in VERA for the Q20 dual-stream join
scenario leads to a substantial boost in job throughput, ranging from 50% to 70%.
Image 15: Comparison of job throughput with and without key-value separation enabled
Analysis
This improvement indicates VERA’s ability to efficiently manage and process large state sizes,
resulting in more effective resource utilization and faster data processing. The performance
gains observed with key-value separation highlight its effectiveness in optimizing job
throughput compared to scenarios without this feature.
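The sketch below illustrates the general key-value separation technique in Python. This is a conceptual model in the spirit of WiscKey-style designs, not Gemini’s actual internals: large values are appended once to a value log, while the index keeps only small (offset, length) pointers, so the index stays compact and cheap to maintain even when state entries (such as wide joined rows) are large. All names here are invented for illustration.

```python
# Conceptual sketch of key-value separation (not Gemini's internals):
# values live in an append-only log; the index holds only small pointers.

class KVSeparatedStore:
    def __init__(self):
        self.value_log = bytearray()  # append-only log holding the large values
        self.index = {}               # key -> (offset, length) pointer

    def put(self, key, value: bytes):
        offset = len(self.value_log)
        self.value_log += value                  # value is written to the log once
        self.index[key] = (offset, len(value))   # index entry stays tiny

    def get(self, key) -> bytes:
        offset, length = self.index[key]
        return bytes(self.value_log[offset:offset + length])

store = KVSeparatedStore()
store.put("order-42", b"x" * 1024)  # e.g. a large joined row
assert store.get("order-42") == b"x" * 1024
```

Because the index no longer carries value payloads, index maintenance touches far less data, which is consistent with the throughput gains reported above for the large-state join workload.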
Overall, the test underscores VERA’s superior efficiency in managing stateful streaming
workloads, thereby offering significant performance advantages over traditional state
management approaches.
Bringing VERA to Everyone
VERA was developed to tackle critical challenges in data stream processing, such as latency,
scalability, operational complexity, and business agility. By addressing these pain points,
Ververica built a solution designed to meet the demands of modern data-driven applications
while maintaining full compatibility with Apache Flink.
With a deep understanding of the limitations in the existing open-source Flink technology,
VERA integrates advanced features like optimized state storage, smart resource
management, and dynamic processing capabilities. This approach not only enhances
performance but also bridges the gap between complex capabilities and broader
accessibility.
VERA’s innovative architecture makes stateful stream processing markedly more resilient
and easier to manage, opening Apache Flink® to users with varying levels of expertise and
resources. This marks a significant step toward making sophisticated data stream
processing practical for diverse applications and industries.
Conclusion
The evaluation demonstrates that VERA enhances the functionalities of open-source
Apache Flink® while preserving full compatibility. This combination of open-core and
proprietary technology underscores the strength of the Ververica Streaming Data Platform.
VERA delivers a robust and scalable platform for high-throughput, low-latency workloads
with adaptive performance across various conditions. Its user-friendly design allows
developers to focus on building applications rather than managing infrastructure.
With its modern, powerful, and accessible cloud-native engine, VERA introduces features
aimed at improving real-time stream processing. The future holds great potential as
customers leverage VERA to drive innovation and achieve their goals.
Ververica makes it easy for organizations to harness data insights at scale, with our
advanced Streaming Data Platform powered by VERA, the engine that revolutionizes Apache
Flink.
References
● Stateful Stream Processing | Apache Flink: https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/concepts/stateful-stream-processing
● https://cwiki.apache.org/confluence/x/R4p3EQ
● https://github.com/apache/flink-benchmarks/tree/master/src/main/java/org/apache/flink/state/benchmark
● https://flink-learning.org.cn/article/detail/baf859431244a0c24aa36c2250c90967?spm=a2csy.flink.0.0.49496ea8tkSye7&name=article&tab=suoyou&page=2
● https://flink-learning.org.cn/article/detail/1563c71112adccac151151e609184377?spm=a2csy.flink.0.0.49496ea8tkSye7&name=article&tab=suoyou&page=2