Advanced Big Data Processing
Chapter I: Big Data Processing Types
Dr. Kalthoum Zaouali
kalthoum.zaouali@pristini-international.tn
❑ The Map-Reduce paradigm: its principle is to divide the processing into two main
phases:
▪ Mapper: the same processing is applied in an identical way to each part of the
file distributed across the machines;
▪ Reducer: brings together the parts, arranged in a well-determined way, and carries
out the processing in a globalized way to produce a calculation or a general
result from the map outputs;
o Is it sufficient?
▪ A set of mappers runs the processing in parallel, then the Shuffle & Sort phase is applied.
▪ The results are sent to the Reducer, which gathers the data and produces the final result (a minimal sketch follows below).
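▪ A minimal sketch of the two phases in plain Python (a word count), assuming an in-memory shuffle; the function names and the sample splits are illustrative only, not the Hadoop API:

from collections import defaultdict

# Mapper: applied identically to each part (split) of the file.
def mapper(split):
    for line in split:
        for word in line.split():
            yield (word, 1)

# Shuffle & Sort: group all intermediate (key, value) pairs by key.
def shuffle(mapped_pairs):
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

# Reducer: global processing over each group of map results.
def reducer(key, values):
    return (key, sum(values))

splits = [["big data processing"], ["data processing types"]]
mapped = [pair for split in splits for pair in mapper(split)]
result = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 1, 'data': 2, 'processing': 2, 'types': 1}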
[Figure: the mappers emit intermediate results (a', b', c', d') that are shuffled, sorted, and combined by the Reducer into the final result; Map-Reduce is characterized by high latency and high throughput.]
Map-Reduce: Disadvantages and Advantages
1. First, the Map operations are executed in parallel on the different machines;
2. Then, there are communications between the machines, where the Shuffle & Sort
operations are executed;
3. Then, the Reduce operations are carried out on the different reduce parts;
4. Finally, the result must be sent back to the Name Node, which returns it to the client;
➢ On the other hand, the advantage is that the throughput is high.
▪ The problem with Map-Reduce processing is the time required to perform this
processing: High Latency.
▪ Latency is the time between sending the request and receiving the result.
▪ Parallelism is used to traverse the file faster, but the processing must give
the same result as traversing the file sequentially in an ordinary centralized
system.
▪ Other types of processing do not really require a global pass over the data:
this is Streaming processing.
[Figure: each incoming item goes through a short processing step that produces a result.]
▪ There is no need to wait for all of the data to arrive, because the processing
applied to each item is not really heavy (waiting would create congestion).
▪ It is not really a single queue but a set of parallel queues, so the processing
of each part can be carried out in parallel.
▪ Many systems use this kind of streaming processing, such as Storm and Flink.
▪ Streaming processing also does not require visibility into the other data that
is being processed.
▪ Example: messages about products arrive continuously as a stream;
▪ First, filter the messages that speak about our product only: it is not an operation
that requires visibility of all the messages, so we go through them message by
message and keep only those that concern us;
▪ Then, we apply a small analysis to determine the general opinion: if the opinion is positive,
we count +1, otherwise -1 (see the sketch below);
▪ At the end, we can draw curves, for example, to visualize the results.
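▪ A minimal sketch of this per-message streaming treatment in plain Python; the message list, the "product X" keyword match, and the naive word-based sentiment rule are illustrative assumptions, not a specific framework API:

# Illustrative stand-ins: an incoming message stream and a naive sentiment rule.
messages = [
    "I love product X, great battery",
    "product Y is fine",
    "product X is terrible",
]

POSITIVE = {"love", "great", "good"}
NEGATIVE = {"terrible", "bad", "awful"}

score = 0
for message in messages:              # process message by message, no global view needed
    if "product x" not in message.lower():
        continue                      # filter: keep only messages about our product
    words = set(message.lower().split())
    score += 1 if words & POSITIVE else -1 if words & NEGATIVE else 0
print(score)  # running opinion score, e.g. 0 here (one +1 and one -1)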
• The processing rate must be adequate to the data rate: if a data item arrives
every 10 ms, the processing of each item must take less than 10 ms, otherwise
items accumulate in the queue.
❖ Data ingestion system: it is a kind of queue: as the data arrives, it is stored in
the queue and then processed item by item to obtain the result (a minimal sketch
follows below).
❖ Message Oriented Middleware (MOM): This is a middleware that will ensure the
synchronization of different processes.
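❖ A minimal illustration of the queue-based ingestion idea using Python's standard library; the producer/consumer names and the fake data source are assumptions for the sketch, not a particular MOM product:

import queue
import threading

ingest_queue = queue.Queue()       # the ingestion queue between producer and consumer

def producer():
    # Stand-in for an arriving data stream.
    for item in range(5):
        ingest_queue.put(item)
    ingest_queue.put(None)         # sentinel: end of stream

def consumer():
    # Items are processed one by one, in arrival order.
    while (item := ingest_queue.get()) is not None:
        print("processed", item)

threading.Thread(target=producer).start()
consumer()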
❑ For a Big Data system with very high throughput, this queue can itself become crowded
→ it cannot bear the large data flow.
➢ These queues can therefore be divided according to size, topic, and several other criteria...
▪ The system that can do all this is the most adequate one: the Kafka solution (see the sketch below);
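▪ A minimal sketch of publishing to and consuming from a topic with the kafka-python client, assuming a broker is reachable at localhost:9092; the topic name and broker address are illustrative placeholders:

from kafka import KafkaProducer, KafkaConsumer

# Producer: messages are appended to a topic, which acts as the ingestion queue.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("product-messages", b"product X is great")
producer.flush()

# Consumer: reads the messages from the topic one by one.
consumer = KafkaConsumer(
    "product-messages",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.value.decode())
    break  # illustrative: stop after the first message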
▪ Instead of processing the data one by one, Spark collects the data into a
mini-batch (a small queue).
▪ Once the batch reaches a certain size, relative to a certain time interval, it triggers the
processing on all the data already grouped.
➢ On the other hand, this is micro-batching: it does not have visibility on all the
data, only on the grouped batch.
▪ The operation is executed on the whole batch instead of being executed sequentially
(one by one) and repeated on each data item:
➢ If the data is grouped → the system performs its operations much faster.
➢ Even if the data structures do not allow faster processing → the operation can still be
executed in one go on the whole data batch (see the sketch below).
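▪ A minimal sketch of Spark's micro-batching with the classic DStream API, assuming a local Spark installation and a text stream on localhost:9999; the 2-second batch interval, host, and port are illustrative choices:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "MicroBatchSketch")
ssc = StreamingContext(sc, 2)          # group incoming data into 2-second micro-batches

# Each micro-batch of lines is processed as a whole (word count per batch).
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()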
▪ We can also have processing that depends on the previous and following data
items (having visibility on neighboring data).
➢ Micro-batch processing (micro-BP) is a good alternative in this case.
▪ Companies that decided to use micro-BP started with Spark, since it offers
advanced Machine Learning libraries, graph processing, Spark SQL, and many
other libraries.
▪ However, Spark poses a problem in case companies are looking to run true
streaming processing.
▪ For smooth, truly continuous processing at a constant high rate, the
micro-batching of the Spark system can become a limitation.
❑ Interactive processing:
➢ It is a synchronous processing;
➢ It is the act of interacting with the user: the client sends a request and remains
waiting for the response from the system.
❖ What kind of processing cannot be interactive? → Batch processing, because its
latency is very high;
❖ For real-time processing, the response time is as important as the response
itself (see the sketch below):
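❖ A minimal sketch of the synchronous, interactive pattern in Python using the requests library; the URL and the 2-second timeout are placeholder assumptions:

import requests

# Synchronous/interactive pattern: the client sends a request
# and blocks while waiting for the system's response.
response = requests.get("http://example.com/api/result", timeout=2)  # placeholder URL
print(response.status_code, response.elapsed)  # the response and how long it took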
To do:
Do a specific search (description and design) on the Lambda and Kappa Big Data architectures!