
BSc In Information Technology

(Data Science)
SLIIT – 2019 (Semester 2)
Massive or BIG Data Processing
J.Alosius
Introduction to Hadoop
Resolving Challenges in BIG Data Processing
• Suppose you need to process 1 GB of data in a relational database – no problem storing and handling it.
• Suppose the data grows to 100 TB – now storing and processing the data becomes a problem.
• What can you do?

Big data storage is challenging

• Data Volumes are massive


• Storing PBs of data reliably is challenging
• All kinds of failures: disk/hardware/network failures
• The probability of failure simply increases with the number
of machines …
Distributed processing is nontrivial
• How to assign tasks to different workers in an efficient way?
• What happens if a task fails or a worker dies?
• How do workers exchange full/partial results? How do we aggregate partial results?
• How to synchronize distributed tasks allocated to different workers?
• What if we have more work units than workers?
• How do we know all the workers have finished?
Common Theme
• Parallelization problems arise from:
• Communication between workers (e.g., to exchange state)
• Access to shared resources (e.g., data)

• Thus, we need a synchronization mechanism

Managing Multiple Workers


Difficult because
• We don’t know the order in which workers run
• We don’t know when workers interrupt each other
• We don’t know the order in which workers access shared data

Thus, we need synchronization primitives (sketched in Java below):
• Semaphores (lock, unlock)
• Condition variables (wait, notify, broadcast)
• Barriers

Still, lots of problems:


• Deadlock, livelock, race conditions...
• Dining philosophers, sleeping barbers, cigarette smokers...
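
As an illustration of these primitives (not Hadoop code – a minimal Java sketch with a made-up worker count and shared counter), a semaphore protects shared state and a barrier makes every worker wait until all partial results are in:

import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.Semaphore;

public class WorkerSync {
    // Semaphore used as a lock: at most one worker updates the shared state at a time.
    private static final Semaphore lock = new Semaphore(1);
    // Barrier: no worker proceeds past it until all four have arrived.
    private static final CyclicBarrier barrier =
            new CyclicBarrier(4, () -> System.out.println("all partial results are in"));

    private static int sharedCounter = 0;   // the shared resource

    public static void main(String[] args) {
        for (int i = 0; i < 4; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    lock.acquire();          // "lock"
                    sharedCounter++;         // critical section on shared data
                    lock.release();          // "unlock"
                    barrier.await();         // wait for every other worker
                    System.out.println("worker " + id + " sees counter = " + sharedCounter);
                } catch (Exception e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}

Even this toy example has to be written with care; at cluster scale the framework, not the programmer, should carry that burden.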
Concurrency Challenge
• Concurrency is difficult to reason about
• Concurrency is even more difficult to reason about:
• At the scale of datacenters (even across datacenters)
• In the presence of failures
• In terms of multiple interacting services
• Not to mention debugging…
• The reality:
• Lots of one-off solutions, custom code
• Write your own dedicated library, then program with it
• Burden on the programmer to explicitly manage everything

What Should Be Done?


• It’s all about the right level of abstraction
• The von Neumann architecture has served us well, but is no longer
appropriate for the multi-core/cluster environment
• Hide system-level details from the developers
• No more race conditions, lock contention, etc.
• Separating the what from the how
• Developer specifies the computation that needs to be performed
• Execution framework (“runtime”) handles actual execution
Hadoop offers

• Scale “out”, not “up”


• Limits of SMP and large shared-memory machines
• Move processing to the data
• Clusters have limited bandwidth
• Process data sequentially, avoid random access
• Seeks are expensive, disk throughput is reasonable
• Seamless scalability
Definition of Hadoop
• A software framework written in Java that allows distributed
processing of large datasets across a collection of autonomous
computers, connected through a network and distribution
middleware, which enables the computers to coordinate their
activities and to share the resources of the system.
Hadoop Ecosystem
Components of the Hadoop Ecosystem
NameNode - Represents every file and directory used in the namespace
DataNode - Manages the state of an HDFS node and allows you to interact with the blocks
Master Node - Conducts parallel processing of data using Hadoop MapReduce.
Slave Nodes - Additional machines in the Hadoop cluster.
• They store data and carry out the complex calculations.
• Each contains a DataNode and a TaskTracker
• These synchronize their processes with the NameNode and JobTracker respectively.
Major Components of Hadoop
Hadoop MapReduce: 
• MapReduce is a computational model and software framework for writing
applications that run on Hadoop.
• MapReduce programs can process enormous amounts of data in parallel on large
clusters of computation nodes (see the word-count sketch below).
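
For illustration, here is the classic word-count program written against the Hadoop MapReduce Java API (a minimal sketch; the input and output paths are supplied on the command line):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split given to this mapper.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word (also reused as a combiner).
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The developer only supplies the map and reduce logic; the framework handles splitting the input, scheduling, shuffling partial results, and fault tolerance.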

HDFS (Hadoop Distributed File System):

• HDFS takes care of the storage part of Hadoop applications; MapReduce applications
consume data from HDFS.
• HDFS creates multiple replicas of data blocks and distributes them on compute nodes
in a cluster. This distribution enables reliable and extremely rapid computation (see the client-side sketch below).
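
A minimal client-side sketch (the NameNode address and file path are illustrative assumptions) showing how an application writes to and reads from HDFS through the FileSystem API, while HDFS handles block placement and replication behind the scenes:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed NameNode address; normally picked up from core-site.xml (fs.defaultFS).
    conf.set("fs.defaultFS", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/student/hello.txt");   // hypothetical path

    // Write: the client streams bytes; HDFS splits them into blocks and replicates each block.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello HDFS\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read: the NameNode supplies block locations; the data itself comes from DataNodes.
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }

    fs.close();
  }
}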

Hadoop runs code across a cluster of computers and performs the following tasks:
• Data are initially divided into files and directories.
• Files are divided into uniformly sized blocks, typically 64 MB or 128 MB.
• The files are then distributed across the cluster nodes for further processing of the
data.
• The JobTracker starts its scheduling programs on the individual nodes.
• Once all the nodes have completed their scheduled tasks, the output is returned (see the configuration sketch below).
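
Block size and replication factor are configurable. A minimal sketch, assuming Hadoop 2.x property names (dfs.blocksize, dfs.replication; older releases used dfs.block.size) and a hypothetical file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettingsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed Hadoop 2.x property names; in production these are set cluster-wide in hdfs-site.xml.
    conf.set("dfs.blocksize", "134217728");   // 128 MB blocks for files created by this client
    conf.set("dfs.replication", "3");         // three replicas per block

    FileSystem fs = FileSystem.get(conf);
    // The replication factor of an existing file can also be changed directly.
    fs.setReplication(new Path("/user/student/hello.txt"), (short) 3);   // hypothetical path
    fs.close();
  }
}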
Hadoop Server Roles

Distributed Data Storage (Hadoop Cluster)


• Compute nodes are stored on racks.
• The nodes on a single rack are connected by a network (typically gigabit Ethernet).
• Racks are connected by another level of network or a switch.
• The bandwidth of intra-rack communication is usually much greater than that of
inter-rack communication.
Hadoop Rack Awareness

• A rack is a collection of 30 or 40 nodes that are physically stored close together and are
all connected to the same network switch.
• The network bandwidth between any two nodes in the same rack is greater than the bandwidth between
two nodes on different racks. A Hadoop cluster is a collection of racks.
• The concept of choosing closer DataNodes based on rack information is called Rack
Awareness in Hadoop. 
• Rack awareness is having knowledge of the cluster topology, or more specifically how the
different DataNodes are distributed across the racks of a Hadoop cluster (see the configuration sketch below).
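
Rack awareness is not automatic: an administrator supplies a mapping from node addresses to rack ids, usually through a topology script referenced from the cluster configuration. A minimal sketch, assuming the Hadoop 2.x property name net.topology.script.file.name and a hypothetical script path (in practice this is set in core-site.xml on the NameNode, not in application code):

import org.apache.hadoop.conf.Configuration;

public class RackAwarenessConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Hypothetical admin-provided script: given DataNode hostnames or IPs,
    // it prints a rack id such as /dc1/rack1 for each one.
    conf.set("net.topology.script.file.name", "/etc/hadoop/conf/topology.sh");
    System.out.println("topology script: " + conf.get("net.topology.script.file.name"));
  }
}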
Rack Awareness – Replication Policy

Creating a block
• The first replica is placed on the local node (the node writing the block)
• The second replica is placed on a node in a different rack
• The third replica is placed in the same rack as the second replica, but on a different node

Advantages of Rack Awareness


• Minimizes the write cost and maximizes read speed
• Provides maximum network bandwidth and low latency
• Protects data against rack failure
Name Node Responsibilities
• Managing the file system
• Holds the file/directory structure, metadata, file-to-block mapping, access
permissions, etc.
• Coordinating file operations
• Directs clients to Data Nodes for reads and writes
• No data is moved through the Name Node
• Maintaining overall health:
• Periodic communication with the Data Nodes
• Block re-replication and re-balancing
Secondary Name Node
• Housekeeping: backup of the Name Node metadata
• Connects to the Name Node every hour*
• The saved metadata can rebuild a failed Name Node

Handling Data Node Failures
• Missing heartbeats signify lost Data Nodes
• The Name Node consults its metadata and finds the affected data
• The Name Node consults the Rack Awareness script
• The Name Node tells a Data Node to re-replicate the missing blocks
