
ECS640U/765P Big Data Processing

Stream Processing II
Lecturer: Ahmed M. A. Sayed
School of Electronic Engineering and Computer Science
Weeks 9-11: Processing

(Pipeline diagram: Data Sources → Ingestion → Storage → Processing → Output)

This week we will continue learning about Stream Processing


Stream Processing 2
Topic List:

● Windowing streams
● Micro-batch processing
● Putting things together
Apache Storm: Streams

Storm streams are sequences of tuples. Tuples are the smallest pieces of data that can be processed
by a bolt and consist of a list of fields.

(Diagram: a stream as a sequence of tuples: Tuple → Tuple → Tuple → ...)

The order of the tuples represents the time at which they arrive at the streaming system.
Event time vs processing time

Ideally, tuples arrive at the system in the same order in which they were generated. However, this is not
always the case, and it is useful to distinguish between:

• Event time: Time at which one event has been generated by a source
• Processing time: Time at which events are seen by the stream processing system

Event time and processing time allow us to introduce two new concepts:
• Unordered streams: Streams where the event time ordering is different from the processing time
ordering.
• Processing-time lag: Delay between the generation and observation of an event in the system. This lag
can be variable, i.e., it can change from tuple to tuple.
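To make these two concepts concrete, here is a minimal, framework-independent Python sketch (the event data and timestamps are illustrative, not from the lecture) that computes the processing-time lag of each tuple and detects when a stream is unordered with respect to event time:

from datetime import datetime

# Each tuple carries the time it was generated (event time) and the time it
# was observed by the streaming system (processing time).
events = [
    # (event_time, processing_time)
    (datetime(2024, 1, 1, 12, 0, 0), datetime(2024, 1, 1, 12, 0, 2)),
    (datetime(2024, 1, 1, 12, 0, 5), datetime(2024, 1, 1, 12, 0, 6)),
    (datetime(2024, 1, 1, 12, 0, 3), datetime(2024, 1, 1, 12, 0, 7)),  # late arrival
]

last_event_time = None
for event_time, processing_time in events:
    lag = (processing_time - event_time).total_seconds()   # processing-time lag, varies per tuple
    out_of_order = last_event_time is not None and event_time < last_event_time
    print(f"lag={lag}s out_of_order={out_of_order}")
    last_event_time = event_time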
Event time vs processing time
Time-agnostic stream processing

In time-agnostic stream processing the event time is irrelevant for processing purposes. Examples
include filtering, where tuples are discarded based on the values of their fields alone.
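As a concrete illustration (a minimal sketch with made-up field names, not taken from the slides), a time-agnostic filter inspects only a tuple's fields and never looks at any timestamp:

# Tuples are just collections of fields; the filter ignores time entirely.
clicks = [
    {"url": "/home", "status": 200},
    {"url": "/admin", "status": 403},
    {"url": "/about", "status": 200},
]

# Keep only successful requests, based purely on a field value.
successful = [t for t in clicks if t["status"] == 200]
print(successful)

In a DStream the equivalent would simply be stream.filter(lambda t: t["status"] == 200).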
Aggregating without windowing
Windowing

There are scenarios where either the event time or the processing time is relevant:
• To aggregate tuples and produce an aggregate computation (e.g., count, average, ...)
• To observe temporal patterns
Windowing allows streaming systems to deal with these temporal dimensions.

(Diagram: a stream as a sequence of tuples: Tuple → Tuple → Tuple → ...)


Windowing

During windowing, tuples that fall within temporal boundaries (windows) are grouped together.
Windows can be fixed, sliding (overlapping), or session windows (of variable length, separated by
inactivity gaps longer than a threshold).

https://cs.stanford.edu/~matei/courses/2015/6.S897/slides/dataflow.pdf
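To make the window types concrete, here is a small illustrative Python sketch (not from the slides) that assigns event timestamps to fixed and sliding windows; session windows would instead group consecutive events whose gaps stay below the inactivity threshold:

def fixed_window(ts, size):
    # Start of the fixed (non-overlapping) window containing timestamp ts.
    return ts - (ts % size)

def sliding_windows(ts, size, slide):
    # Start times of all overlapping sliding windows containing timestamp ts.
    first = fixed_window(ts, slide)
    return [start for start in range(first - size + slide, first + slide, slide)
            if start <= ts < start + size]

# Event times in seconds; 2-minute windows, sliding every 1 minute.
for ts in [10, 70, 130, 200]:
    print(ts, "fixed:", fixed_window(ts, 120), "sliding:", sliding_windows(ts, 120, 60))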
Windowing based on processing time

Using processing time for windowing is simple, but is restricted to applications where the processing time
is the dimension of interest.
For instance, processing time can be relevant when monitoring incoming requests to a web server: the time at
which the client sent a request is irrelevant; what matters is the time at which it is processed by the system.
Windowing based on event time

When we are interested in the time at which the data source produced an event, event time is the domain of
interest and ordering is needed based on the event timestamp. For instance, event time matters when
analysing system logs to identify the source of a failure or a breach.

Processing time is always available, but the event time may not be.
Aggregating by windowing
(Diagram slides: Event Time Window = 2 mins; then Event Time Window = 2 mins with Processing Time Window = 2 mins)
Windowing in Storm

Storm originally provided no built-in window function. Stream processing by windowing therefore requires:
• Buffers to store tuples (state)
• A trigger stream that triggers the computation

(Diagram: 1) a bolt stores 2 minutes of messages as state; 2) a trigger stream fires once per minute;
3) on each trigger the bolt updates the 2-minute window data and yields output, producing an output
stream every minute.) A sketch of this buffer-and-trigger pattern follows below.
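Below is a minimal framework-agnostic Python sketch of this buffer-and-trigger pattern (this is not the actual Storm bolt API; the class, method names and emit callback are illustrative assumptions):

import time
from collections import deque

class WindowedCountBolt:
    # Buffers 2 minutes of tuples and emits an aggregate whenever a trigger arrives.

    def __init__(self, window_seconds=120, emit=print):
        self.window_seconds = window_seconds
        self.buffer = deque()          # state: (arrival_time, tuple) pairs
        self.emit = emit

    def on_tuple(self, tup):
        # 1) Store incoming messages together with their arrival time.
        self.buffer.append((time.time(), tup))

    def on_trigger(self):
        # 2)+3) On each trigger (once per minute), drop tuples older than the window...
        cutoff = time.time() - self.window_seconds
        while self.buffer and self.buffer[0][0] < cutoff:
            self.buffer.popleft()
        # ...and yield the windowed aggregate (here just a count).
        self.emit(len(self.buffer))

bolt = WindowedCountBolt()
bolt.on_tuple({"user": "a"})
bolt.on_tuple({"user": "b"})
bolt.on_trigger()   # prints 2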
Stream Processing 2
Topic List:

● Windowing streams
● Micro-batch processing
● Putting things together

Quiz Time
Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput,
fault-tolerant stream processing of data streams.
• Follows a micro-batch processing model
• Scales up to 100s of nodes
• Integrates with Spark’s batch and interactive processing
• Provides batch-like API for complex algorithms
• Can receive data from Kafka, Kinesis, TCP sockets, etc.
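A minimal PySpark Streaming setup sketch, assuming a local Spark installation and a TCP text source on localhost:9999 (e.g. started with nc -lk 9999); the word count is just a placeholder computation:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")   # at least 2 threads: one receiver + processing
ssc = StreamingContext(sc, 1)                       # micro-batch interval of 1 second

lines = ssc.socketTextStream("localhost", 9999)     # input DStream from a TCP socket
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                     # print each micro-batch's result

ssc.start()                                         # start receiving and processing
ssc.awaitTermination()                              # wait until stopped or failed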
Spark Streaming

Spark Streaming operates as follows:


• Receives input data streams
• Divides the data into micro-batches (i.e. by temporal windowing)
• Batches are treated as RDDs and processed by the Spark engine
• Results are streamed as batches
Spark Streaming

Discretised streams (DStreams) are Spark Streaming's abstraction for representing continuous streams of data:
• They are continuous series of RDDs
• Each RDD contains data from a time interval
• Allows the reuse of the Spark programming model
• At each time slot, one RDD is processed
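Because a DStream is just a series of RDDs, per-batch logic can be attached with foreachRDD, as in this small sketch (the lines DStream is assumed to be defined as in the earlier setup sketch):

# lines = ssc.socketTextStream("localhost", 9999)   # as in the earlier sketch

def process_batch(time, rdd):
    # Each micro-batch arrives as an ordinary RDD, so the full Spark API is available.
    if not rdd.isEmpty():
        print(f"Batch at {time}: {rdd.count()} records")

lines.foreachRDD(process_batch)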
Micro-batch vs True stream processing

Spark Streaming uses micro-batches for stream processing. Storm has also developed abstractions
for dealing with micro-batches (Trident project).

The main differences between micro-batch and true stream processing are:

• True streaming has low latency, while micro-batching has higher latency, up to the time interval defining
the micro-batch
• Micro-batching in general achieves higher throughput than pure streaming
Example 1: Get hashtags from Twitter

tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

DStream: a sequence of RDDs representing a stream of data

(Diagram: the Twitter Streaming API feeds the tweets DStream as batches @ t, t+1, t+2; each batch is
stored in memory as an RDD, which is immutable and distributed)
Example 1: Get hashtags from Twitter

tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)


hashTags = tweets.flatMap (lambda tweet: getTags(tweet))

(Diagram: flatMap is applied to each batch of the tweets DStream, producing the hashTags DStream,
e.g. [#cat, #dog, ...])
Example 1: Get hashtags from Twitter

tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)


hashTags = tweets.flatMap (lambda tweet: getTags(tweet))
hashTags.saveAsTextFiles("hdfs://...")

(Diagram: as before, flatMap produces the hashTags DStream from the tweets DStream; every batch of
hashTags is then saved to HDFS)
Example 2: Count hashtags

tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)


hashTags = tweets.flatMap (lambda tweet: getTags(tweet))
tagCounts = hashTags.countByValue()

(Diagram: flatMap produces the hashTags DStream; countByValue produces the tagCounts DStream,
e.g. [(#cat, 10), (#dog, 25), ...])
Sliding window operations in DStreams

DStreams allow us to define sliding windows. Two parameters need to be specified:


• Size of the window (in seconds)
• Frequency of computation (in seconds)

Operations such as the following can be implemented:


• “Obtain the maximum temperature over the last 60 seconds every 5 seconds”.
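A sketch of that temperature query with the DStream window operation (the temps DStream of numeric readings is an assumption, e.g. parsed from a socket stream):

# temps: a DStream of temperature readings (floats), e.g.
# temps = ssc.socketTextStream("localhost", 9999).map(float)

# Maximum temperature over the last 60 seconds, computed every 5 seconds:
max_temps = temps.window(60, 5).reduce(max)
max_temps.pprint()

The same query can also be written as temps.reduceByWindow(max, None, 60, 5).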
Example 3: Count hashtags over last 10 minutes

tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)


hashTags = tweets.flatMap (lambda tweet: getTags(tweet))
tagCounts = hashTags.window(60 , 5).countByValue()

(Labels on the code: window(60, 5) is the sliding window operation, with window length = 60 seconds and
sliding interval = 5 seconds)
Example 3: Count hashtags over last 10 minutes

tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)


hashTags = tweets.flatMap (lambda tweet: getTags(tweet))
tagCounts = hashTags.window(60, 5).countByValue()

(Diagram: a sliding window moves over the hashTags DStream batches at t-1, t, t+1, t+2, t+3;
countByValue produces tagCounts by counting over all the data in the window)
Sample Twitter processing stream

(Diagram: Input Stream → tweets → map(t: t.status) → statuses → flatMap(s: s.split(" ")) → words →
filter(w: w.startsWith("#")) → hashtags → map(h: (h, 1)) → reduceByKeyAndWindow((a, b: a + b), 60 * 60 * 5, 10) →
hashtag counts → Output Stream)
Sample Twitter processing stream

ssc = StreamingContext(sc, 1)

tweets = TwitterUtils.createStream(ssc, None)

status = tweets.map(lambda tweet: tweet.getText())
words = status.flatMap(lambda status: status.split(" "))
hashtags = words.filter(lambda word: word.startswith("#"))

# Window duration = 3600*5 seconds (5 hours), sliding interval = 10 seconds.
# The second argument is the inverse reduce function (invFunc); it can be None,
# in which case all the RDDs in the window are re-reduced on every slide, which
# can be slower than having an invFunc.
hashtagCounts = hashtags.map(lambda t: (t, 1)) \
    .reduceByKeyAndWindow(lambda a, b: a + b, None, 3600 * 5, 10)

ssc.start()

# More efficient: provide an invFunc so old data can be "subtracted" incrementally:
# hashtagCounts = hashtags.map(lambda t: (t, 1)) \
#     .reduceByKeyAndWindow(lambda a, b: a + b, lambda a, b: a - b, 3600 * 5, 10)

https://www.youtube.com/watch?v=rIDo91_LBV4
https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.streaming.DStream.reduceByKeyAndWindow.html
Recall – Topology for website click analysis in Storm
Spark streaming topology for website click analysis

(Diagram: a stream of website clicks → Accessed Website URLs → Filter Bot Accesses → then, in parallel:
append to HDFS, save to Cassandra, compute count referrals, and compute top referrals)
Big Data Processing: Week 10
Topic List:

● Windowing streams
● Micro-batch processing
● Putting things together

Break
Big Data for streaming data

Streaming data sources generate data continuously.


Depending on the application, after ingestion the streaming data can be stored for later analysis,
processed directly, or both.
Hence, it is common to see different architectures for processing streaming data.

(Diagram: Streaming Data Source → Ingestion → batch sub-pipeline: Storage → Batch Processing → Batch Output;
stream sub-pipeline: Stream Processing → Stream Output)
Big Data for streaming data

An architecture including a batch sub-pipeline and a stream sub-pipeline (sometimes called a Lambda
architecture) allows the system to store the true, raw data while also producing quick results.

(Same pipeline diagram as above, highlighting the batch and stream sub-pipelines)
Big Data for streaming data

(Diagram: Streaming Data Source → Kafka → batch sub-pipeline: HDFS → MapReduce → Batch Output;
stream sub-pipeline: Storm → Stream Output)
Reliability

In Big Data streaming, we need to consider different reliability aspects, some specific to stream processing
and others shared with other Big Data frameworks:
• Delivery guarantees (stream): Will the messages that constitute a stream be delivered?
• State management (stream): Will the state of the processing operators be preserved in the event of
failures?
• Fault tolerance (common with other Big Data frameworks): In case of failures (node, network, etc.), will the
framework be able to recover and resume from the point where it left off?
Delivery guarantees for in-flight streams

Several delivery guarantees are defined in streaming systems:


• At-most-once (best effort) delivery: Messages will never be delivered more than once, but some
messages might be lost
• At-least-once (guaranteed, but with duplication) delivery: A message will never be lost, but it might be
delivered to a consumer more than once
• Exactly-once (guaranteed via feedback) delivery: This is the ideal case, as each message is delivered
exactly one time. It is very hard to achieve in distributed systems and always comes with performance
trade-offs.
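One way to see where the first two guarantees come from is where a consumer commits its read offsets. A hedged sketch using the kafka-python client (the topic name, group id and broker address are illustrative assumptions):

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "clicks",                          # illustrative topic name
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    enable_auto_commit=False,          # we decide when offsets are committed
)

def process(message):
    print(message.value)               # placeholder for real processing

for message in consumer:
    # Committing *after* processing gives at-least-once: a crash between processing
    # and commit means the message is redelivered (possible duplicate).
    # Committing *before* processing would give at-most-once: a crash after the
    # commit but before processing loses the message.
    process(message)
    consumer.commit()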
Reliability in Storm and Spark Streaming

Storm (True Streaming):


• At-least-once delivery
• Fault tolerance
• State management not supported

Spark streaming (Micro-Batch Streaming):


• At-least-once delivery
• Fault tolerance
• State management: the micro-batch model makes it possible to maintain operator state across batches
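In Spark Streaming this state management is exposed through operations such as updateStateByKey. A minimal sketch keeping a running count per key (the checkpoint directory and the pairs DStream are assumptions for illustration):

# pairs: a DStream of (key, 1) tuples, e.g. hashtags.map(lambda t: (t, 1))
ssc.checkpoint("hdfs:///checkpoints")   # stateful operations require checkpointing for recovery

def update_count(new_values, running_count):
    # new_values: values for this key in the current batch; running_count: previous state
    return sum(new_values) + (running_count or 0)

running_counts = pairs.updateStateByKey(update_count)
running_counts.pprint()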
Ingestion

Big Data streaming platforms need to ingest streaming data, i.e. make streaming data available for
further storage or processing.
There are open-source solutions (Kafka, Flume) and proprietary solutions (AWS: Amazon Kinesis
Firehose, AWS IoT events; Azure: Event Hub, IoT Hub; GCP: Pub/Sub).

(Same pipeline diagram as above: the ingestion stage feeds both the batch and stream sub-pipelines)
Ingestion: Kafka
https://kafka.apache.org

Kafka (developed by LinkedIn and donated to the Apache Software Foundation as an open-source project) has been
described as “the HDFS of unbounded data sources”:
• Distributed message broker providing high throughput, high availability and scalability
• Follows the publish-subscribe paradigm
• Can interface with multiple streaming sources and sinks
• Has been extended to offer stream processing (joins, aggregations, filters), supporting event-time-based
processing and an exactly-once delivery guarantee.

https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
https://storm.apache.org/releases/2.3.0/storm-kafka-client.html
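For reference, consuming a Kafka topic from Spark Structured Streaming looks roughly like the following sketch (broker address and topic name are placeholders, and the spark-sql-kafka package must be on the classpath; see the first link above for the authoritative options):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaIngestion").getOrCreate()

# Subscribe to a Kafka topic; each row carries key, value, topic, partition, offset and timestamp.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "clicks")
      .load())

messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
query = messages.writeStream.format("console").start()
query.awaitTermination()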
Kafka: Main concepts

Kafka defines the following concepts:


• Producers are data sources that publish (write) events to Kafka
• Consumers are systems that subscribe to (read and process) these events
• Events are organised and stored in topics (which can be seen as folders in a filesystem) that are partitioned
and replicated for fault tolerance, high availability and scalability.
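A hedged sketch of the producer side with the kafka-python client (topic name, event fields and broker address are illustrative; a consumer with manual offset commits was sketched in the delivery-guarantees section above):

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish (write) an event to a topic; Kafka appends it to one of the topic's partitions.
producer.send("clicks", {"url": "/home", "user": "alice"})
producer.flush()   # block until buffered records have been sent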
Kafka: Publish-subscribe model

Kafka defines the following concepts:


• Producers are data sources that publish (write) events to Kafka
• Consumers are systems that subscribe to (read and process) these events

Kafka keeps track of how far each consumer has read


Kafka: Delivery guarantees

Messages in a topic persist over a pre-set period of time. Hence, Kafka is a durable system. In addition,
acknowledgement messages are exchanged between Kafka and publishers and subscribers to monitor
their progress.

This guarantees exactly-once delivery and is one of the reasons why Kafka is the de-facto ingestion
solution for stream processing systems.
Big Data Processing: Week 11
Topic List:

● Windowing streams
● Micro-batch processing
● Putting things together

Quiz
Round-up on assessment
● Quiz (20%): deadline was 3-Apr 15:00
● Course work (30%): deadline 5-Apr 15:00 (1 day extension)
● PDF report detailing your approach to solving each task, along with visualisations of the results, and discussing the
challenges faced and insights gained from each task (include the Job ID if possible)
● Code, scripts, and plotting data for verification

● Course Exam (50%): Online (like the quiz) – 9th of May 2024 (2 hours within a 3-hour window)
● Exam questions: MCQs, including questions on given code snippets; choose-the-missing-value questions; questions where
you solve and write a numerical answer in a box; and drag-and-drop matching questions
● Practice Questions: Available in QM+ under General Section from previous exams (in PDF)

END
