BSc in Information Technology (Data Science)
SLIIT – 2019 (Semester 2)
Massive or BIG Data Processing
J.Alosius
Introduction to Hadoop
Resolving Challenges in BIG Data Processing
• Suppose you need to process 1 GB of data in a relational database – no problem handling the data.
• Suppose the data grows to 100 TB – now there are problems storing and processing the data.
• What can you do?
Thus, if we parallelize the processing ourselves, we traditionally need explicit synchronization (see the sketch after this list):
• Semaphores (lock, unlock)
• Condition variables (wait, notify, broadcast)
• Barriers
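To make concrete why hand-coded parallelism is hard to get right, here is a minimal Java sketch that splits a word count across threads and coordinates them with a barrier and a shared atomic counter; the class and variable names are illustrative only.

import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: counting words in parallel with explicit synchronization.
public class ManualParallelCount {
    public static void main(String[] args) {
        String[] chunks = { "big data", "data processing", "hadoop cluster" };
        AtomicLong total = new AtomicLong();                  // shared counter, updated atomically
        CyclicBarrier barrier = new CyclicBarrier(chunks.length,
                () -> System.out.println("Total words: " + total.get())); // runs once all threads arrive

        for (String chunk : chunks) {
            new Thread(() -> {
                total.addAndGet(chunk.split("\\s+").length);  // process this thread's chunk
                try {
                    barrier.await();                          // wait for the other threads
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }).start();
        }
    }
}

Even this toy version needs careful coordination, and it does nothing about machine failures or data that cannot fit on one machine. Hadoop's value is that it hides this coordination behind a simple programming model.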
Hadoop runs code across a cluster of computers and performs the following tasks:
• Data are initially divided into files and directories.
• Files are divided into uniformly sized blocks, typically 64 MB or 128 MB.
• Then the files are distributed across various cluster nodes for further processing of
data.
• The JobTracker then schedules the processing programs on the individual nodes.
• Once all the nodes have finished processing, the output is returned (see the WordCount sketch after this list).
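The canonical example of this flow is WordCount, sketched below against the standard org.apache.hadoop.mapreduce API: mappers run where the blocks live and emit (word, 1) pairs, and a reducer sums them per word. The job name and the command-line input/output paths are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: runs on the node holding each block and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts for each word across all mappers.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}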
Hadoop Server Roles
• A rack is a collection of 30 or 40 nodes that are physically stored close together and are
all connected to the same network switch.
• The network bandwidth between any two nodes in the same rack is greater than the bandwidth
between two nodes on different racks. A Hadoop cluster is a collection of racks.
• This concept of choosing closer DataNodes based on rack information is called Rack
Awareness in Hadoop.
• Rack awareness is knowledge of the cluster topology, specifically how the
different DataNodes are distributed across the racks of a Hadoop cluster.
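Hadoop learns this topology from an administrator-supplied mapping of node addresses to rack IDs, typically a topology script configured via net.topology.script.file.name. The following plain-Java stand-in, with hypothetical names, shows the shape of that mapping; it is not the actual Hadoop interface.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for a rack-awareness mapping: node address -> rack ID.
public class SimpleRackMapper {
    private static final Map<String, String> NODE_TO_RACK = Map.of(
            "10.1.1.1", "/dc1/rack1",
            "10.1.1.2", "/dc1/rack1",
            "10.1.2.1", "/dc1/rack2");

    // Resolve each node to its rack; unknown nodes fall into a default rack.
    public List<String> resolve(List<String> nodes) {
        return nodes.stream()
                .map(n -> NODE_TO_RACK.getOrDefault(n, "/default-rack"))
                .collect(Collectors.toList());
    }
}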
Rack Awareness – Replication Policy
Creating a block
• The first replica is placed on the local node (the writer's node).
• The second replica is placed on a node in a different rack.
• The third replica is placed in the same rack as the second replica, but on a different node (see the sketch below).
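A minimal Java sketch of this placement rule, using a hypothetical cluster model rather than HDFS's real placement code:

import java.util.ArrayList;
import java.util.List;

// Hypothetical model: pick three replica targets following the default policy.
public class ReplicaPlacement {
    record Node(String name, String rack) {}

    static List<Node> pickTargets(Node writer, List<Node> cluster) {
        List<Node> targets = new ArrayList<>();
        targets.add(writer);                                  // 1st replica: the local node

        Node remote = cluster.stream()                        // 2nd replica: a different rack
                .filter(n -> !n.rack().equals(writer.rack()))
                .findFirst().orElseThrow();
        targets.add(remote);

        Node sibling = cluster.stream()                       // 3rd replica: same rack as the 2nd,
                .filter(n -> n.rack().equals(remote.rack()))  // but a different node
                .filter(n -> !n.equals(remote))
                .findFirst().orElseThrow();
        targets.add(sibling);
        return targets;
    }
}

This layout survives the loss of a whole rack (one replica is always elsewhere) while keeping two of the three copies on one rack to limit cross-rack write traffic.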
• Missing heartbeats signify lost DataNodes.
• The NameNode consults its metadata and finds the affected data.
• The NameNode consults the Rack Awareness script.
• The NameNode tells a DataNode to re-replicate the missing blocks (see the sketch after this list).
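A toy Java sketch of this failure-handling loop, with hypothetical names and an illustrative timeout (real DataNodes heartbeat to the NameNode every few seconds):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical heartbeat tracker: a node that stops reporting is declared dead,
// and its blocks are queued for re-replication.
public class HeartbeatMonitor {
    private static final long DEAD_AFTER_MS = 10 * 60 * 1000; // illustrative timeout
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    // Called whenever a DataNode reports in.
    public void heartbeat(String node) {
        lastSeen.put(node, System.currentTimeMillis());
    }

    // Periodic sweep: find nodes whose heartbeats are missing.
    public void sweep() {
        long now = System.currentTimeMillis();
        lastSeen.forEach((node, seen) -> {
            if (now - seen > DEAD_AFTER_MS) {
                lastSeen.remove(node);
                scheduleReReplication(node);
            }
        });
    }

    private void scheduleReReplication(String node) {
        // Placeholder: the NameNode would look up the blocks held by this node
        // and ask surviving DataNodes to copy them to rack-aware targets.
        System.out.println("Re-replicating blocks from lost node " + node);
    }
}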