HADOOP


Apache Hadoop

• Open-source software framework designed for storage and processing of large-scale data on clusters of commodity hardware.
• Created by Doug Cutting and Mike Cafarella in 2005.
• Cutting named the program after his son's toy elephant.
Uses for Hadoop
Data-intensive text processing
Assembly of large genomes
Graph mining
Machine learning and data mining
Large scale social network analysis
Who Uses Hadoop?
How much data?
Facebook
◦ 500 TB per day
Yahoo
◦ Over 170 PB
eBay
◦ Over 6 PB
Getting the data to the processors becomes
the bottleneck
Requirements for Hadoop
• Must support partial failure
• Must be scalable
Partial Failures
• Failure of a single component must not cause the failure of the entire system, only a degradation of application performance.
• Failure should not result in the loss of any data.
Component Recovery
• If a component fails, it should be able to recover without restarting the entire system.
• Component failure or recovery during a job must not affect the final output.
Scalability
• Increasing resources should increase load capacity.
• Increasing the load on the system should result in a graceful decline in performance for all jobs, not system failure.
The four key characteristics of Hadoop are:
• Economical: Its systems are highly economical, as ordinary computers can be used for data processing.
• Reliable: It is reliable, as it stores copies of the data on different machines and is resistant to hardware failure.
• Scalable: It is easily scalable, both horizontally and vertically; a few extra nodes help in scaling up the framework.
• Flexible: It is flexible and can store as much structured and unstructured data as needed.
Difference between Traditional Database
System and Hadoop
The table given below will help you distinguish between Traditional Database System
and Hadoop.
Traditional Database System: Data is stored in a central location and sent to the processor at runtime.
Hadoop: The program goes to the data. Hadoop initially distributes the data to multiple systems and later runs the computation wherever the data is located.

Traditional Database System: Cannot be used to process and store a significant amount of data (big data).
Hadoop: Works better when the data size is big; it can process and store a large amount of data efficiently and effectively.

Traditional Database System: A traditional RDBMS is used to manage only structured and semi-structured data; it cannot be used to handle unstructured data.
Hadoop: Can process and store a variety of data, whether it is structured or unstructured.

The Hadoop Ecosystem


Hadoop Common: Contains libraries and other modules used by the rest of Hadoop.
HDFS: Hadoop Distributed File System.
Hadoop YARN: Yet Another Resource Negotiator.
Hadoop MapReduce: A programming model for large-scale data processing.

The Hadoop ecosystem is continuously growing to meet the needs of Big Data. It comprises the following twelve components:
• HDFS (Hadoop Distributed File System)
• HBase
• Sqoop
• Flume
• Spark
• Hadoop MapReduce
• Pig
• Impala
• Hive
• Cloudera Search
• Oozie
• Hue

• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming-based data processing
• Spark: In-memory data processing
• Pig, Hive: Query-based processing of data services
• HBase: NoSQL database
• Mahout, Spark MLlib: Machine learning algorithm libraries
• Solr, Lucene: Searching and indexing
• ZooKeeper: Cluster management
• Oozie: Job scheduling

HDFS (Hadoop Distributed File System):

• HDFS is the primary component of the Hadoop ecosystem; it is responsible for storing large data sets of structured or unstructured data across various nodes and for maintaining the metadata in the form of log files.

• The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, fault-tolerant, reliable, and cost-efficient data storage for Big Data.

• HDFS is a distributed file system that runs on commodity hardware.

• Users interact directly with HDFS through shell-like commands.
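
A minimal sketch of driving these shell-like commands from Python, assuming a running Hadoop installation with the hdfs CLI on the PATH; the paths used are hypothetical.

```python
# Sketch: driving HDFS with its shell-like commands from Python.
# Assumes a running Hadoop cluster and the `hdfs` CLI on the PATH;
# the paths used here are hypothetical.
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` sub-command and return its output."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    hdfs("-mkdir", "-p", "/user/demo")            # create a directory
    hdfs("-put", "local_data.txt", "/user/demo")  # copy a local file into HDFS
    print(hdfs("-ls", "/user/demo"))              # list the directory
    print(hdfs("-cat", "/user/demo/local_data.txt"))  # read the file back
```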

• HDFS consists of two core components:
1. NameNode
2. DataNode

• HDFS uses a master/slave architecture in which one device (the master), termed the NameNode, controls one or more other devices (the slaves), termed DataNodes.
◦ It breaks data/files into small blocks (typically 64 MB or 128 MB each), stores them on DataNodes, and replicates each block on other nodes to achieve fault tolerance.

• The NameNode keeps track of the blocks written to the DataNodes.
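
A toy sketch of the block splitting and replica bookkeeping described above; the 128 MB block size and replication factor of 3 are common defaults, while the DataNode names and round-robin placement are simplifying assumptions, not HDFS's real rack-aware policy.

```python
# Toy sketch of how HDFS splits a file into blocks and assigns replicas.
# Block size and replication factor mirror common defaults (128 MB, 3x);
# the round-robin placement below is a simplification, not HDFS's real
# rack-aware placement policy.
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB
REPLICATION = 3
DATANODES = ["dn1", "dn2", "dn3", "dn4"]   # hypothetical DataNodes

def split_into_blocks(file_size):
    """Return the number of blocks needed for a file of `file_size` bytes."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE

def place_replicas(num_blocks):
    """Assign each block to REPLICATION DataNodes (round-robin)."""
    placement = {}
    nodes = itertools.cycle(DATANODES)
    for block_id in range(num_blocks):
        placement[block_id] = [next(nodes) for _ in range(REPLICATION)]
    return placement

if __name__ == "__main__":
    file_size = 300 * 1024 * 1024            # a 300 MB file
    blocks = split_into_blocks(file_size)    # -> 3 blocks
    print(blocks, place_replicas(blocks))
```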

i. NameNode
It is also known as the master node. The NameNode does not store the actual data or dataset; it stores metadata, i.e. the number of blocks, their locations, the rack and DataNode on which the data is stored, and other details. The namespace it manages consists of files and directories.

Tasks of HDFS NameNode

• Manages the file system namespace.
• Regulates clients' access to files.
• Executes file system operations such as naming, opening, and closing files and directories.
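
A simplified sketch of the kind of metadata the NameNode keeps (file → blocks → DataNode locations) and how it answers a client lookup; the structures and names are illustrative, not the real fsimage/edit-log format.

```python
# Simplified view of NameNode metadata: it maps files to block IDs and
# block IDs to the DataNodes holding replicas; it stores no file data itself.
# Structures and names here are illustrative, not HDFS's real fsimage format.
namespace = {
    "/user/demo/local_data.txt": ["blk_1", "blk_2", "blk_3"],
}
block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn4", "dn1", "dn2"],
    "blk_3": ["dn3", "dn4", "dn1"],
}

def locate(path):
    """Answer a client's open() request: which DataNodes hold each block?"""
    return [(blk, block_locations[blk]) for blk in namespace[path]]

print(locate("/user/demo/local_data.txt"))
```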


ii. DataNode

• It is also known as the slave node.
• The HDFS DataNode is responsible for storing the actual data in HDFS.
• The DataNode performs read and write operations as per the requests of the clients.
• Each block replica on a DataNode consists of two files on the local file system:
◦ The first file holds the data; the second records the block's metadata.
• The block metadata includes checksums for the data.
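
A small sketch of that checksum idea: per-chunk CRC-32 values are recorded separately from the block data and re-checked on read. The 512-byte chunk size is assumed here.

```python
# Sketch: a DataNode-style checksum check. HDFS keeps a separate metadata
# file with per-chunk checksums next to each block's data file; here we
# mimic that with CRC-32 over 512-byte chunks (chunk size assumed).
import zlib

CHUNK = 512

def chunk_checksums(data: bytes):
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, stored_checksums):
    """Return True if every chunk still matches its recorded checksum."""
    return chunk_checksums(data) == stored_checksums

block = b"some block contents" * 100
meta = chunk_checksums(block)     # would be written to the block's meta file
assert verify(block, meta)        # the read path re-checks before serving
```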

• At startup, each DataNode connects to its corresponding NameNode and performs a handshake, which verifies the DataNode's namespace ID and software version. If a mismatch is found, the DataNode shuts down automatically.
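
A toy version of that startup handshake; the namespace ID, version string, and function names are made up for illustration.

```python
# Toy version of the DataNode-NameNode startup handshake described above:
# the DataNode offers its namespace ID and software version; a mismatch
# makes it shut down. IDs and versions here are made up for illustration.
NAMENODE_STATE = {"namespace_id": 42, "software_version": "3.3.6"}

def handshake(dn_namespace_id, dn_version):
    if (dn_namespace_id != NAMENODE_STATE["namespace_id"]
            or dn_version != NAMENODE_STATE["software_version"]):
        raise SystemExit("handshake failed: DataNode shutting down")
    return "registered"

print(handshake(42, "3.3.6"))   # succeeds
# handshake(7, "3.3.6")         # would terminate the DataNode process
```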

Tasks of HDFS DataNode

• The DataNode performs operations such as block replica creation, deletion, and replication according to the instructions of the NameNode.
• The DataNode manages the data storage of the system.

YARN (Yet Another Resource Negotiator):

• As the name implies, YARN helps to manage the resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.

• It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Manager

• The Resource Manager has the privilege of allocating resources for the applications in the system.
• Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per machine, and later acknowledge the Resource Manager.
• The Application Manager works as an interface between the Resource Manager and the Node Managers and performs negotiations as per the requirements of the two.
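
A toy illustration of this division of labour: NodeManagers report per-machine capacity and the ResourceManager grants containers against it. The first-fit loop below is a simplification, not YARN's actual scheduler, and the node names and sizes are assumptions.

```python
# Toy illustration of the YARN roles described above: NodeManagers report
# per-machine capacity, and the ResourceManager grants containers against
# it. This is a first-fit sketch, not YARN's actual scheduler.
node_capacity = {            # reported by NodeManagers (vcores, memory MB)
    "node1": {"vcores": 8, "memory_mb": 16384},
    "node2": {"vcores": 4, "memory_mb": 8192},
}

def allocate_container(vcores, memory_mb):
    """ResourceManager side: find a node with room and deduct the resources."""
    for node, cap in node_capacity.items():
        if cap["vcores"] >= vcores and cap["memory_mb"] >= memory_mb:
            cap["vcores"] -= vcores
            cap["memory_mb"] -= memory_mb
            return {"node": node, "vcores": vcores, "memory_mb": memory_mb}
    return None   # the application must wait for resources to free up

# An application (via its application manager) asks for two containers.
print(allocate_container(2, 4096))
print(allocate_container(4, 8192))
```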
MapReduce
Hadoop MapReduce is the core Hadoop ecosystem component that provides data processing.

• MapReduce is a software framework for easily writing applications that process vast amounts of structured and unstructured data stored in the Hadoop Distributed File System.

• MapReduce programs are parallel in nature and are therefore very useful for performing large-scale data analysis using multiple machines in the cluster. This parallel processing improves the speed and reliability of the cluster.
Working of MapReduce
The Hadoop ecosystem component ‘MapReduce’ works by breaking the processing into two phases:
• Map phase
• Reduce phase

Each phase has key-value pairs as input and output. In addition, the programmer specifies two functions: a map function and a reduce function.
1. Map() performs filtering and sorting of the data, thereby organizing it into groups. Map generates key-value pairs as its result, which are later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.

The Mapper
• Reads data as key/value pairs
◦ The key is often discarded
• Outputs zero or more key/value pairs
Shuffle and Sort
• Output from the mapper is sorted by key
• All values with the same key are guaranteed to go to the same machine
The Reducer
• Called once for each unique key
• Gets a list of all values associated with a key as input
• The reducer outputs zero or more final key/value pairs
◦ Usually just one output per input key
MapReduce: Word Count
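
A compact word-count sketch in plain Python that mirrors the Map → shuffle/sort → Reduce flow described above; the shuffle is simulated in memory rather than run on a cluster.

```python
# Word count written as map and reduce functions, with the shuffle/sort
# phase simulated in memory. On a real cluster the same logic could run
# via Hadoop Streaming; this stand-alone version just shows the data flow.
from itertools import groupby
from operator import itemgetter

def mapper(_key, line):
    """Emit (word, 1) for every word in the input line; the key is ignored."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Called once per unique word with all of its counts."""
    yield word, sum(counts)

def run(lines):
    # Map phase
    mapped = [pair for i, line in enumerate(lines) for pair in mapper(i, line)]
    # Shuffle and sort: group all values by key
    mapped.sort(key=itemgetter(0))
    # Reduce phase
    result = {}
    for word, group in groupby(mapped, key=itemgetter(0)):
        for k, v in reducer(word, (count for _, count in group)):
            result[k] = v
    return result

print(run(["the quick brown fox", "the lazy dog", "the fox"]))
# {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```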

Features of MapReduce
• Simplicity – MapReduce jobs are easy to run. Applications can be written in any language, such as Java, C++, and Python.
• Scalability – MapReduce can process petabytes of data.
• Speed – Through parallel processing, problems that take days to solve can be solved in hours or minutes by MapReduce.
• Fault Tolerance – MapReduce takes care of failures. If one copy of the data is unavailable, another machine holds a copy of the same key-value pairs, which can be used to solve the same subtask.

Other Tools
• Hive
◦ Hadoop processing with SQL
• Pig
◦ Hadoop processing with scripting
• Cascading
◦ Pipe-and-filter processing model
• HBase
◦ Database model built on top of Hadoop
• Flume
◦ Designed for large-scale data movement
Matrix Multiplication
https://www.youtube.com/watch?v=RIMA4rvNpI8
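
The linked video covers matrix multiplication with MapReduce; below is a sketch of the standard one-pass algorithm in plain Python, with the shuffle simulated in memory and small made-up matrices as input.

```python
# One-pass MapReduce matrix multiplication (C = A x B), with the shuffle
# simulated in memory. Matrices are given as sparse (row, col, value)
# triples; the small example matrices below are made up for illustration.
from collections import defaultdict

A = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]   # 2x2 matrix A
B = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]   # 2x2 matrix B
M, N, P = 2, 2, 2   # A is MxN, B is NxP

def mapper():
    # Each A[i][j] is needed for every column k of C; each B[j][k] for every row i.
    for i, j, a in A:
        for k in range(P):
            yield (i, k), ("A", j, a)
    for j, k, b in B:
        for i in range(M):
            yield (i, k), ("B", j, b)

def reducer(key, values):
    a_vals = {j: v for tag, j, v in values if tag == "A"}
    b_vals = {j: v for tag, j, v in values if tag == "B"}
    return key, sum(a_vals[j] * b_vals[j] for j in a_vals if j in b_vals)

# Shuffle: group mapper output by (i, k), then reduce each group.
groups = defaultdict(list)
for key, value in mapper():
    groups[key].append(value)
C = dict(reducer(k, v) for k, v in groups.items())
print(C)   # {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}
```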
