Big Data Processing Types

The document discusses different types of big data processing including batch processing and streaming processing. Batch processing involves processing all available data at once which can have high latency but also high throughput. Streaming processing treats data as a continuous stream and processes items individually as they arrive, providing lower latency but also lower throughput than batch processing. The document examines when each type of processing would be appropriate depending on how the data is stored and accessed.

Uploaded by

khaoula fattah
Advanced Big Data Processing

Dr. Kalthoum Zaouali
kalthoum.zaouali@pristini-international.tn

Chapter I: Big Data Processing Types

©Dr. Kalthoum Zaouali 2022-2023


Overview: HDFS & Map-Reduce

❑ HDFS: Hadoop Distributed File System

❑ The Map-Reduce paradigm: its principle is to divide the processing into two main parts:

▪ Mapper: the same processing is carried out identically on each part of the file, with the parts distributed across the machines;

▪ Reducer: gathers the map results, arranged in a well-determined way, and performs a global calculation or general processing over them;
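
The two parts described above can be sketched in plain Python. This is only a single-machine illustration, assuming a word-count task; the function names (`mapper`, `shuffle_and_sort`, `reducer`, `map_reduce`) and the sample chunks are hypothetical and are not Hadoop APIs.

```python
from collections import defaultdict
from itertools import chain


def mapper(chunk):
    """Map step: emit a (word, 1) pair for every word in one part of the file."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_and_sort(pairs):
    """Group the values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Reduce step: combine all values of one key into a single result."""
    return key, sum(values)


def map_reduce(chunks):
    # On a cluster each chunk would be mapped on a different machine;
    # here the mappers simply run one after the other.
    mapped = chain.from_iterable(mapper(chunk) for chunk in chunks)
    return dict(reducer(key, values) for key, values in shuffle_and_sort(mapped))


# Hypothetical file split into three parts.
chunks = ["big data big", "data processing", "big processing"]
word_counts = map_reduce(chunks)
print(word_counts)
```

The mapper only ever sees its own chunk, while the reducer sees every value for a given key; that separation is what lets the map step be distributed.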


o Can this kind of MapReduce processing be applied to any type of calculation?

o Can the MapReduce paradigm cover all the processing that can be done on massive data?

o Is it sufficient?

o Are there any problems with MapReduce?

Map-Reduce: Map-Reduce processing

▪ A set of mappers runs the processing in parallel, then the Shuffle & Sort step is applied.
▪ The results are sent to the Reducer, which gathers the data and returns the final result.

[Figure: Map → Shuffle & Sort → Reduce pipeline leading to the final result; annotated "high latency" and "high throughput"]

Map-Reduce: Disadvantage/Advantage

❖ In particular, MapReduce is implemented on Hadoop, so:

1. Processing tasks are executed in parallel;

2. Then, the machines communicate with each other to execute the Shuffle & Sort operations;

3. Then, the Reduce operations are carried out on each of the reduce partitions;

4. The result must be sent to the Name Node, which communicates again with the client;

➢ Therefore, performing the end-to-end operation is slow;

➢ On the other hand, the advantage is that the throughput is high.


▪ The problem with Map-Reduce processing is the time required to perform it: high latency.

▪ Latency is the time between sending the request and receiving the result.

▪ Throughput is the amount of information delivered at the output per unit of time: as long as requests keep being sent, results keep being delivered at the output;

→ MapReduce is a batch processing type.
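
The two definitions above can be turned into a small worked calculation. The numbers below are purely hypothetical (they are not measurements of any real system), and the helper names `latency_ms` and `throughput_per_s` are introduced only for this illustration.

```python
def latency_ms(sent_at_ms, received_at_ms):
    """Latency: time between sending the request and receiving the result."""
    return received_at_ms - sent_at_ms


def throughput_per_s(n_results, elapsed_s):
    """Throughput: amount of output produced per unit of time."""
    return n_results / elapsed_s


# Hypothetical batch job: submitted at t = 0, and 1,000,000 records come out
# together 200 seconds later.
batch_latency = latency_ms(0, 200_000)
batch_throughput = throughput_per_s(1_000_000, 200)

# Hypothetical streaming job: one record sent at t = 5000 ms is answered at
# t = 5010 ms, and 100 results come out every second.
stream_latency = latency_ms(5_000, 5_010)
stream_throughput = throughput_per_s(100, 1)

print(batch_latency, batch_throughput)
print(stream_latency, stream_throughput)
```

With these assumed figures, the batch job waits far longer for its first answer but delivers many more records per second, which is the latency/throughput trade-off the slide describes.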

Batch Processing: Definition

▪ By definition, Batch Processing (BP) means performing a process that needs to scan all the data stored in the storage system.

▪ Parallelism is used to speed up scanning the file, but the result should be the same as when scanning the file sequentially on an ordinary centralized system.

▪ BP is not necessarily parallel processing: it is any type of processing that produces a global calculation over all the data belonging to a data source.

✓ In general, decision-making processing needs visibility over a very large amount of data and scans the data sources globally to extract a decision.
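
As a minimal sketch of the point that parallelism must not change the result of a batch computation, the example below computes a global average twice: once by scanning the data sequentially, once from partial results computed per block. The data, the block split, and the function names are assumptions made for illustration; threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor


def global_average_sequential(values):
    """Scan all the stored data in one pass on a single machine."""
    return sum(values) / len(values)


def partial_sum_and_count(block):
    """Work done independently on one block of the data."""
    return sum(block), len(block)


def global_average_parallel(blocks):
    """Process the blocks concurrently, then combine the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_sum_and_count, blocks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Hypothetical stored data and one possible split into blocks.
stored_values = [12, 7, 3, 18, 5, 9, 14, 4]
blocks = [stored_values[0:3], stored_values[3:6], stored_values[6:8]]

sequential_result = global_average_sequential(stored_values)
parallel_result = global_average_parallel(blocks)
print(sequential_result, parallel_result)
```

Both versions need to see every stored value before they can answer, which is what makes this a batch computation regardless of whether it runs in parallel.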

Streaming: Definition

▪ Other types of processing do not really require a global pass over the data: these are streaming processing.

▪ Streaming: this word is widely used in video playback. The principle of streaming is that, instead of scanning all the data at once, the data is sent item by item and the processing is done little by little on each part.
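
The item-by-item principle can be sketched with a Python generator. The source (`sensor_stream`), the threshold, and the readings are hypothetical; the point is only that each result is available as soon as its own item has been processed.

```python
def sensor_stream():
    """Hypothetical source that delivers readings one at a time."""
    for reading in [21.5, 22.0, 35.2, 21.8]:
        yield reading


def process(reading, threshold=30.0):
    """Short treatment applied to a single item, independent of the others."""
    return "ALERT" if reading > threshold else "ok"


results = []
for item in sensor_stream():
    # The result for this item is produced immediately,
    # without waiting for the rest of the stream.
    results.append(process(item))

print(results)
```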

[Figure: stream of items … tr3, tr2, tr1, each passing through a short processing step and producing results res3, res2, res1]

Streaming: Advantage/Disadvantage

❑ The latency of this type of processing is rather low:

▪ You don't have to wait for all of the data to go through, because the processing applied to each item is light (heavy processing could create congestion).

▪ It is not really a single queue but a set of parallel queues, so each part can be processed in parallel.

▪ Many systems use this kind of streaming processing, such as Storm and Flink.

▪ On the other hand, throughput is lower than with MapReduce, because processing time elapses between res1 and res2, and so on.


▪ Streaming processing also does not require visibility into the other data being processed.

❖ Example of streaming processing: we want to process Twitter to learn people's impressions of a product for sale:

▪ First, collect messages from Twitter;

▪ Then, keep only the messages that talk about our product: this operation does not require visibility of all messages, so we go through them message by message and keep the ones that concern us;

▪ Then, apply a small analysis to find out the general opinion: if the opinion is positive, we count +1, otherwise -1;

▪ At the end, we can, for example, draw curves to visualize the results.
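
The steps of this example can be sketched as a chain of generators. Everything here is an assumption made for illustration: the messages are invented, the product name `phonex` is hypothetical, no real Twitter API is called, and the "analysis" is a naive keyword lookup rather than a real sentiment model.

```python
POSITIVE_WORDS = {"love", "great", "good"}  # toy vocabulary, assumed


def mentions_product(messages, product):
    """Filter step: keep a message only if it talks about the product."""
    for message in messages:
        if product in message.lower():
            yield message


def score(message):
    """Analysis step: +1 if the opinion looks positive, otherwise -1."""
    words = set(message.lower().split())
    return 1 if words & POSITIVE_WORDS else -1


def running_opinion(messages, product):
    """Yield the cumulative opinion after each relevant message."""
    total = 0
    for message in mentions_product(messages, product):
        total += score(message)
        yield total


# Hypothetical collected messages, handled one by one.
messages = [
    "I love my new phonex",
    "Traffic is bad today",
    "phonex battery is broken",
    "PhoneX camera is great",
    "great phonex screen",
]
curve = list(running_opinion(messages, "phonex"))
print(curve)
```

The list `curve` is the series of points one could plot to visualize how the general opinion evolves; each step looks at a single message and never at the whole collection.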


Use Batch Processing or Streaming?

➢ It depends on the availability of the data and the way it is stored:

▪ If all the data is already stored in a well-defined space:
→ Batch Processing is used;

▪ If the data arrives as a stream (from a streaming source) while it is being processed, that is, the system is still receiving data even in the middle of the processing:
→ streaming processing is used;


Suppose we have data arriving as a stream: what problem can occur?

• The processing rate must keep up with the data arrival rate: if a data item arrives every 10 ms,

→ the processing of each item should ideally take less than 10 ms to avoid the risk of data loss.
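
The 10 ms constraint can be checked with a small deterministic simulation. This is only a sketch under stated assumptions: a single worker, a bounded buffer of waiting items, perfectly regular arrivals, and an item that is dropped when it arrives while the buffer is full. The function name `simulate_lost_items` and all parameter values are hypothetical.

```python
from collections import deque


def simulate_lost_items(n_items, arrival_ms, processing_ms, buffer_size):
    """Count the items lost by one worker fed at a fixed arrival interval."""
    waiting = deque()  # items that arrived while the worker was busy
    busy_until = 0     # time at which the worker becomes free again
    lost = 0
    for i in range(n_items):
        now = i * arrival_ms
        # Each time the worker became free before 'now', it picked up
        # the oldest waiting item and stayed busy for one more period.
        while waiting and busy_until <= now:
            waiting.popleft()
            busy_until += processing_ms
        if busy_until <= now:
            busy_until = now + processing_ms  # worker idle: process at once
        elif len(waiting) < buffer_size:
            waiting.append(i)                 # worker busy: item waits
        else:
            lost += 1                         # buffer full: item is lost
    return lost


# Items arrive every 10 ms in both cases.
lost_when_fast = simulate_lost_items(100, arrival_ms=10, processing_ms=8, buffer_size=5)
lost_when_slow = simulate_lost_items(100, arrival_ms=10, processing_ms=15, buffer_size=5)
print(lost_when_fast, lost_when_slow)
```

When each item takes less than the arrival interval, nothing accumulates; when it takes longer, the buffer fills up and items start to be dropped, however large the buffer is, as long as the stream keeps coming.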

Data ingestion system

➢ The first solution is to add a data ingestion system:

❖ Data ingestion system: a kind of queue. As data arrives, it is stored in the queue and then processed item by item to produce the result.

❖ Message Oriented Middleware (MOM): middleware that ensures the synchronization of the different processes.

❑ For a Big Data system with very high throughput, this queue can itself become congested
→ It cannot bear the large data flow.
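
The queue idea can be sketched with Python's standard `queue.Queue`, which lets a producer and a consumer run at their own pace. This is a single-process illustration, not a real message-oriented middleware; the items, the `STOP` sentinel, and the doubling "processing" are assumptions.

```python
import queue
import threading

ingestion_queue = queue.Queue()  # unbounded FIFO queue between the two sides
STOP = object()                  # sentinel marking the end of the stream
results = []


def producer(items):
    """Data source: stores each arriving item in the queue."""
    for item in items:
        ingestion_queue.put(item)
    ingestion_queue.put(STOP)


def consumer():
    """Processing side: takes items from the queue one by one."""
    while True:
        item = ingestion_queue.get()
        if item is STOP:
            break
        results.append(item * 2)  # placeholder for the real processing


threads = [
    threading.Thread(target=producer, args=([1, 2, 3, 4],)),
    threading.Thread(target=consumer),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(results)
```

Because the queue absorbs bursts, the source never has to wait for the processing to finish; the limitation noted above appears when a single queue like this one has to absorb more data than one machine can hold or serve.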


❖ For a Big Data system:

➢ Use a data ingestion system that consists of multiple distributed queues;

➢ These queues can be divided according to size, topic, and several other criteria.

▪ A well-suited system that can do all this is Kafka;

➢ Kafka offers a distributed ingestion system (MOM)

→ In terms of performance, stored volume, and efficiency, it is extremely powerful.
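
The idea of several queues divided by topic can be illustrated with a toy in-memory broker. This is not the Kafka API and does not reproduce how Kafka actually assigns partitions; the class `ToyBroker`, its methods, and the CRC32-based choice of queue are assumptions made only to show the principle.

```python
import zlib
from collections import defaultdict, deque


class ToyBroker:
    """In-memory stand-in for an ingestion system made of several queues."""

    def __init__(self, partitions_per_topic=3):
        self.partitions_per_topic = partitions_per_topic
        self.queues = defaultdict(deque)  # (topic, partition) -> queue

    def publish(self, topic, key, value):
        """Append the value to one of the topic's queues, chosen from the key."""
        partition = zlib.crc32(key.encode("utf-8")) % self.partitions_per_topic
        self.queues[(topic, partition)].append(value)
        return partition

    def consume(self, topic, partition):
        """Return the oldest value of one queue, or None if it is empty."""
        q = self.queues[(topic, partition)]
        return q.popleft() if q else None


broker = ToyBroker()
first = broker.publish("clicks", "user-1", "page A")
second = broker.publish("clicks", "user-1", "page B")
print(first, second)
print(broker.consume("clicks", first), broker.consume("clicks", first))
```

Messages with the same key land in the same queue and keep their order, while different topics and partitions can be served by different machines, which is what keeps any single queue from becoming the bottleneck.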

Micro-batching processing

➢ The second solution is micro-batching processing (micro-BP).

➢ Micro-BP is the approach used, for example, by Spark.

▪ Instead of processing the data item by item, it collects the data into a small Spark queue.

▪ Once a certain amount of data has accumulated over a certain time interval, it triggers the processing on all the data grouped so far.

➢ It is batching, since it groups the data and then does the processing.

➢ On the other hand, it is micro-batching, since it does not have visibility on all the data, only on the grouped batch.
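
The grouping principle can be sketched in plain Python, without any Spark API. The sketch assumes that events carry a timestamp in milliseconds, arrive in timestamp order, and are grouped into fixed time windows; the function `micro_batches`, the window length, and the sample events are hypothetical.

```python
def micro_batches(events, interval_ms):
    """Group (timestamp_ms, value) events into consecutive time windows."""
    batch, current_window = [], None
    for timestamp_ms, value in events:
        window = timestamp_ms // interval_ms
        if current_window is not None and window != current_window:
            yield batch  # the window is over: hand the group to processing
            batch = []
        current_window = window
        batch.append(value)
    if batch:
        yield batch      # last, possibly incomplete, group


# Hypothetical events: (arrival time in ms, value).
events = [(0, 4), (30, 6), (90, 2), (120, 10), (260, 8)]

# The processing (here a sum) sees one small batch at a time,
# never the whole stream.
batch_sums = [sum(batch) for batch in micro_batches(events, interval_ms=100)]
print(batch_sums)
```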


What have we gained, then?

▪ The operation is no longer executed sequentially (item by item) and repeated for every item:
➢ If the data is grouped → the system performs its operations much faster.
➢ If the data structures do not allow each item to be processed faster → the operation can be executed in one go on the whole batch.

▪ Micro-batching provides better performance for aggregate processing.
➢ There is less risk of data loss, and time can be saved.

▪ Some processing may also depend on the preceding and following data (it needs visibility on neighboring data).
➢ Micro-BP is a good alternative.

➢ Micro-BP risk: problems in production!


▪ Companies that decided to use micro-BP started with Spark, since it offers advanced Machine Learning libraries, graph processing, Spark SQL, and many other libraries.

▪ Spark is advanced compared to other technologies.

▪ However, Spark poses a problem for companies looking to run true streaming processing.

▪ If smooth processing at a constant high data rate is desired, micro-BP can hinder the Spark system.

Interactive processing

❖ Another type of data processing is interactive processing:

➢ It is synchronous processing;

➢ It is the act of interacting with the user: the client sends a request and waits for the response from the system.

❖ What kind of processing cannot be interactive? → Batch processing, because its latency is very high;

▪ For interactive processing, the latency is low.


❖ Another type of processing is the on-demand batch (to be scheduled):

▪ You can click on a button to launch a batch.

▪ It is rare to receive an instant result.

▪ We cannot do Big Data batch processing!

Real-time processing

❖ For real-time processing, the response time is as important as the response itself:

▪ For hard (rigid) real-time, an exact response time is specified.

▪ For soft (flexible) real-time, a time margin is specified.
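
One possible reading of the two cases is sketched below: a strict constraint accepts a response only within the specified time, while a flexible one tolerates an additional margin. The function names, the millisecond unit, and the sample values are assumptions for illustration, not a definition taken from the slides.

```python
def meets_rigid_constraint(response_ms, deadline_ms):
    """Rigid case: the response must arrive within the specified time."""
    return response_ms <= deadline_ms


def meets_flexible_constraint(response_ms, deadline_ms, margin_ms):
    """Flexible case: a response slightly past the deadline is still accepted."""
    return response_ms <= deadline_ms + margin_ms


# Hypothetical response that takes 12 ms for a 10 ms deadline.
rigid_ok = meets_rigid_constraint(12, deadline_ms=10)
flexible_ok = meets_flexible_constraint(12, deadline_ms=10, margin_ms=5)
print(rigid_ok, flexible_ok)
```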

Big Data processing architectures

To do

Research the Lambda and Kappa Big Data architectures (description and design)!
