Advanced Big Data Processing
Chapter I: Big Data Processing Types
Dr. Kalthoum Zaouali
kalthoum.zaouali@pristini-international.tn
❑ The Map-Reduce paradigm: its principle is to divide the processing into two main
phases:
▪ Mapper: the same processing is applied in an identical way to each part of the
file distributed across the machines;
▪ Reducer: brings together the parts, arranged in a well-determined way, and carries
out the processing in a globalized way to produce a calculation or a general
result from the map outputs;
o Is it sufficient?
▪ A set of mappers runs the processing in parallel, then the Shuffle & Sort phase is applied.
▪ The results are sent to the Reducer, which gathers the data and produces the final result (a minimal sketch follows below).
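▪ A minimal sketch of the two phases in plain Python (a word count), assuming an in-memory shuffle; the function names and the sample splits are illustrative only, not the Hadoop API:

from collections import defaultdict

# Mapper: applied identically to each part (split) of the file.
def mapper(split):
    for line in split:
        for word in line.split():
            yield (word, 1)

# Shuffle & Sort: group all intermediate (key, value) pairs by key.
def shuffle(mapped_pairs):
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

# Reducer: global processing over each group of map results.
def reducer(key, values):
    return (key, sum(values))

splits = [["big data processing"], ["data processing types"]]
mapped = [pair for split in splits for pair in mapper(split)]
result = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 1, 'data': 2, 'processing': 2, 'types': 1}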
[Figure: the mappers emit intermediate results (a', b', c', d') that are shuffled, sorted, and combined by the Reducer into the final result; Map-Reduce is characterized by high latency and high throughput.]
Map-Reduce: Disadvantages and Advantages
1. First, the Map operations are executed in parallel on the different machines;
2. Then, there are communications between the machines, where the Shuffle & Sort
operations are executed;
3. Then, the Reduce operations are carried out on the different reduce parts;
4. Finally, the result must be sent back to the Name Node, which returns it to the client;
➢ On the other hand, the advantage is that the throughput is high.
▪ The problem with Map-Reduce processing is the time required to perform this
processing: High Latency.
▪ Latency is the time between sending the request and receiving the result.
▪ Parallelism is used to traverse the file faster, but the processing must give
the same result as traversing the file sequentially in an ordinary centralized
system.
▪ Other types of processing do not really require a global pass over the data:
this is Streaming processing.
[Figure: each incoming item goes through a short processing step that produces a result.]
▪ There is no need to wait for all of the data to arrive, because the processing
applied to each item is not really heavy (waiting would create congestion).
▪ It is not really a single queue but a set of parallel queues, so the processing
of each part can be carried out in parallel.
▪ Many systems use this kind of streaming processing, such as Storm and Flink.
▪ Streaming processing also does not require visibility into the other data that
is being processed.
▪ Example: messages about products arrive continuously as a stream;
▪ First, filter the messages that speak about our product only: it is not an operation
that requires visibility of all the messages, so we go through them message by
message and keep only those that concern us;
▪ Then, we apply a small analysis to determine the general opinion: if the opinion is positive,
we count +1, otherwise -1 (see the sketch below);
▪ At the end, we can draw curves, for example, to visualize the results.
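▪ A minimal sketch of this per-message streaming treatment in plain Python; the message list, the "product X" keyword match, and the naive word-based sentiment rule are illustrative assumptions, not a specific framework API:

# Illustrative stand-ins: an incoming message stream and a naive sentiment rule.
messages = [
    "I love product X, great battery",
    "product Y is fine",
    "product X is terrible",
]

POSITIVE = {"love", "great", "good"}
NEGATIVE = {"terrible", "bad", "awful"}

score = 0
for message in messages:              # process message by message, no global view needed
    if "product x" not in message.lower():
        continue                      # filter: keep only messages about our product
    words = set(message.lower().split())
    score += 1 if words & POSITIVE else -1 if words & NEGATIVE else 0
print(score)  # running opinion score, e.g. 0 here (one +1 and one -1)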
• The processing rate must be adequate to the data rate: if a data item arrives
every 10 ms, the processing of each item must take less than 10 ms, otherwise
items accumulate in the queue.
❖ Data ingestion system: it is a kind of queue: as the data arrives, it is stored in
the queue and then processed item by item to obtain the result (a minimal sketch
follows below).
❖ Message Oriented Middleware (MOM): This is a middleware that will ensure the
synchronization of different processes.
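❖ A minimal illustration of the queue-based ingestion idea using Python's standard library; the producer/consumer names and the fake data source are assumptions for the sketch, not a particular MOM product:

import queue
import threading

ingest_queue = queue.Queue()       # the ingestion queue between producer and consumer

def producer():
    # Stand-in for an arriving data stream.
    for item in range(5):
        ingest_queue.put(item)
    ingest_queue.put(None)         # sentinel: end of stream

def consumer():
    # Items are processed one by one, in arrival order.
    while (item := ingest_queue.get()) is not None:
        print("processed", item)

threading.Thread(target=producer).start()
consumer()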
❑ For a Big Data system with very high throughput, this queue can itself become crowded
→ it cannot bear the large data flow.
➢ These queues can therefore be divided according to size, topic, and several other criteria...
▪ The system that can do all this is the most adequate one: the Kafka solution (see the sketch below);
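▪ A minimal sketch of publishing to and consuming from a topic with the kafka-python client, assuming a broker is reachable at localhost:9092; the topic name and broker address are illustrative placeholders:

from kafka import KafkaProducer, KafkaConsumer

# Producer: messages are appended to a topic, which acts as the ingestion queue.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("product-messages", b"product X is great")
producer.flush()

# Consumer: reads the messages from the topic one by one.
consumer = KafkaConsumer(
    "product-messages",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.value.decode())
    break  # illustrative: stop after the first message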
▪ Instead of processing the data one by one, Spark collects the data into a
mini-batch (a small queue).
▪ Once the batch reaches a certain size, relative to a certain time interval, it triggers the
processing on all the data already grouped.
➢ On the other hand, this is micro-batching: it does not have visibility on all the
data, only on the grouped batch.
▪ The operation is executed on the whole batch instead of being executed sequentially
(one by one) and repeated on each data item:
➢ If the data is grouped → the system performs its operations much faster.
➢ Even if the data structures do not allow faster processing → the operation can still be
executed in one go on the whole data batch (see the sketch below).
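▪ A minimal sketch of Spark's micro-batching with the classic DStream API, assuming a local Spark installation and a text stream on localhost:9999; the 2-second batch interval, host, and port are illustrative choices:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "MicroBatchSketch")
ssc = StreamingContext(sc, 2)          # group incoming data into 2-second micro-batches

# Each micro-batch of lines is processed as a whole (word count per batch).
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()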
▪ We can also have processing that depends on the previous and following data
items (having visibility on neighboring data).
➢ Micro-batch processing (micro-BP) is a good alternative in this case.
▪ Companies that decided to use micro-BP started with Spark, since it offers
advanced Machine Learning libraries, graph processing, Spark SQL, and many
other libraries.
▪ However, Spark poses a problem in case companies are looking to run true
streaming processing.
▪ For smooth, truly continuous processing at a constant high rate, the
micro-batching of the Spark system can become a limitation.
❑ Interactive processing:
➢ It is a synchronous processing;
➢ It is the act of interacting with the user: the client sends a request and remains
waiting for the response from the system.
❖ What kind of processing cannot be interactive? → Batch processing, because its
latency is very high;
❖ For real-time processing, the response time is as important as the response
itself (see the sketch below):
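❖ A minimal sketch of the synchronous, interactive pattern in Python using the requests library; the URL and the 2-second timeout are placeholder assumptions:

import requests

# Synchronous/interactive pattern: the client sends a request
# and blocks while waiting for the system's response.
response = requests.get("http://example.com/api/result", timeout=2)  # placeholder URL
print(response.status_code, response.elapsed)  # the response and how long it took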
To do:
Do a specific search (description and design) on the Lambda and Kappa Big Data architectures!