ECS765P - W11 - Stream Processing II
ECS765P - W11 - Stream Processing II
Stream Processing II
Lecturer: Ahmed M. A. Sayed
School of Electronic Engineering and Computer Science
ECS640U/765P Big Data Processing
Stream Processing II
Lecturer: Ahmed M. A. Sayed
School of Electronic Engineering and Computer Science
Weeks 9-11: Processing
Data
Ingestion Storage Processing Output
Sources
● Windowing streams
● Micro-batch processing
● Putting things together
Apache Storm: Streams
Storm streams are sequences of tuples. Tuples are the smallest pieces of data that can be processed
by a bolt and consist of a list of fields.
The order of tuples represent the time at which they arrive at the streaming system.
Event time vs processing time
Ideally, tuples arrive to the system in the same order as they have been generated. However, this is not
always the case and it is useful to distinguish between:
• Event time: Time at which one event has been generated by a source
• Processing time: Time at which events are seen by the stream processing system
Event time and processing time allow us to introduce two new concepts:
• Unordered streams: Streams where the event time ordering is different from the processing time
ordering.
• Processing-time lag: Delay between the generation and observation of an event in the system. This lag
can be variable, i.e., will change from tuple to tuple.
Event time vs processing time
Time-agnostic stream processing
In time-agnostic stream processing the event time is irrelevant for processing purposes. Examples
include filtering, where tuples are discarded based on the values of their fields alone.
Aggregating without windowing
Windowing
There are scenarios where either the event time or processing time are relevant:
• To aggregate tuples and produce an aggregate computation (e.g., count, average, ..)
• To observe temporal patterns
Windowing allows streaming systems to deal with temporal dimensions.
During windowing, tuples that fall within temporal boundaries (windows) are grouped together.
Windows can be fixed, sliding (overlapping) or sessions (variable, separated by inactivity gaps longer
than a threshold).
https://cs.stanford.edu/~matei/courses/2015/6.S897/slides/dataflow.pdf
Windowing based on processing time
Using processing time for windowing is simple, but is restricted to applications where the processing time
is the dimension of interest.
For instance, the processing time can be relevant when monitoring incoming requests to a web server, i.e.,
when the client sent the request is irrelevant but rather the time at which it is processed by the system.
Windowing based on event time
When we are interested in the time when the data source produced an event, event time is the domain of
interest and ordering is needed based on the event timestamp. For instance, when dealing with the event
time of system logs to identify the source of failure or breach.
Processing time is always available, but the event time may not.
Aggregating by windowing
Aggregating by windowing
Event Time Window = 2 mins
Aggregating by windowing
Event Time Window = 2 mins, Processing Time Window = 2 mins
Windowing in Storm
originally, no window function
Bolt
Input stream Output stream every minute
● Windowing streams
● Micro-batch processing
● Putting things together
Quiz Time
Spark Streaming
Spark Streaming is an extension of the core Spark API that enables scalable, high-
throughput, fault-tolerant stream processing of data streams.
• Follows a micro-batch processing model
• Scales up to 100s of nodes
• Integrates with Spark’s batch and interactive processing
• Provides batch-like API for complex algorithms
• Can receive data from Kafka, Kinesis, TCP sockets, etc.
Spark Streaming
Spark Streaming uses micro-batches for stream processing. Storm has also developed abstractions
for dealing with micro-batches (Trident project).
The main differences between micro-batch and true stream processing are:
• Streaming has low latency while Micro-batch has higher latency, up to the time interval defining
the micro-batch
• Micro-batch in general achieves high throughput than pure streams
Example 1: Get hashtags from Twitter
tweets DStream
tweets DStream
flatMap
hashTags Dstream
[#cat, #dog, … ]
Example 1: Get hashtags from Twitter
tweets DStream
flatMap
hashTags Dstream
[#cat, #dog, … ]
save
Example 2: Count hashtags
tweets DStream
flatMap
hashTags Dstream
tagCounts countByValue
[(#cat, 10), (#dog, 25), ... ]
Sliding windows operations in DStreams
sliding window
sliding interval
operation
window length
Example 3: Count hashtags over last 10 minutes
hashTags
sliding window
countByValue
tagCounts
count over all
the data in the
window
Sample Twitter processing stream
Input Stream 1
Stream
map flatMap
tweets statuses words
t: t.status s: s.split(“ “)
Stream 1 Stream 2
filter
words hashtags
w: w.startsWith(“#“)
Output
Stream 2
Stream
More efficient reduceByKeyAndWindow((lambda a,b: a+b), (lambda a,b: a-b), 3600*5, 10)
https://www.youtube.com/watch?v=rIDo91_LBV4
https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.streaming.DStream.reduceByKeyAndWindow.html
Recall – Topology for website click analysis in Storm
Spark streaming topology for website click analysis
Append to
HDFS
Compute Compute
count top
referrals referrals
Big Data Processing: Week 10
Topic List:
● Windowing streams
● Micro-batch processing
● Putting things together
Break
Big Data for streaming data
Streaming
Batch Batch
Data Ingestion Storage
Processing Output
Source
Stream Stream
Processing Output
Big Data for streaming data
An architecture including a batch sub-pipeline and a stream sub-pipeline (sometimes called Lambda
architecture) allows to store the true, raw data and produce quick results.
Streaming
Batch Batch
Data Ingestion Storage
Processing Output
Source
Stream Stream
Processing Output
Big Data for streaming data
Streaming
Batch
Data Kafka HDFS MapReduce
Output
Source
Storm Stream
Output
Reliability
In Big Data streaming, we need to consider different reliability aspects, some specific to stream processing
and other shared with other Big Data Frameworks:
• Delivery guarantees (stream): Will the messages that constitute a stream be delivered?
• State management (stream): Will the state of the processing operators be preserved in the event of
failures?
• Fault tolerance (common with big data): In case of failures (node, network, etc), will the framework be
able to recover and start from the point where it left?
Delivery guarantees for in-flight streams
Big Data streaming platforms needs to ingest streaming data, i.e. make streaming data available for
further storage or processing.
There are open-source solutions (Kafka, Flume) and proprietary solutions (AWS: Amazon Kinesis
Firehose, AWS IoT events; Azure: Event Hub, IoT Hub; GCP: Pub/Sub).
Streaming
Batch Batch
Data Ingestion Storage
Processing Output
Source
Stream Stream
Processing Output
Ingestion: Kafka
https://kafka.apache.org
Kafka (developed by LinkedIn and given to Apache Foundation as open source project) has been described
as “the HDFS of unbounded data sources”:
• Distributed message broker providing high throughput, high availability and scalability
• Follows the publish-subscribe paradigm
• Can interface with multiple streaming sources and sinks
• Has been extended to offer stream processing (joins, aggregations, filters): supporting event-time based
processing and exactly once delivery guarantee.
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
https://storm.apache.org/releases/2.3.0/storm-kafka-client.html
Kafka: Main concepts
Messages in a topic persist over a pre-set period of time. Hence, Kafka is a durable system. In addition,
acknowledgement messages are exchanged between Kafka and publishers and subscribers to monitor
their progress.
This guarantees exactly-once delivery and is one of the reasons why Kafka is the de-facto ingestion
solution for stream processing systems.
Big Data Processing: Week 11
Topic List:
● Windowing streams
● Micro-batch processing
● Putting things together
Quiz
Round-up on assessment
● Quiz (20%): deadline was 3-Apr 15:00
● Course work (30%): deadline 5-Apr 15:00 (1 day extension)
● PDF report detailing your approach to solving each task along with the visualisation of the results and discuss challenges faced and
insights gained from each task (Job ID if possible)
● Code, scripts, and plotting data for verification
● Course Exam (50%): Online (like quiz) – 9th of May 2024 (2 hours in 3 hours window)
● Exam questions: MCQ including questions on given code snippets + choose missing value + solve
and write numerical answer in box + Drag and drop matching questions
● Practice Questions: Available in QM+ under General Section from previous exams (in PDF)
END