
Hadoop overview

• Hadoop is an open-source framework from the Apache Software Foundation.
• It is capable of processing large amounts of heterogeneous data sets
in a distributed fashion across clusters of commodity computers and
hardware, using a simplified programming model.
• Hadoop provides a reliable shared storage and analysis system.
• The Hadoop core is hosted by the Apache Software Foundation.
• It has its origins at Google and is now used by Yahoo!, Facebook, and many others.
• Once data is stored in Hadoop, companies across the globe can use it
to solve big problems, improve revenue, and more.
Hadoop Distributions

• Apache Hadoop is a framework for distributed computing and storage.
• The amount of big data being created grew rapidly, and there was a need
to handle and use this data optimally.
• The introduction of Apache Hadoop helped handle this problem to a
great extent.
• Although Hadoop is free, various distributions offer an easier-to-use
bundle.
• There are a number of Hadoop distributions.
Top 5 Hadoop Distributions
HDFS and MapReduce - the two core components of Hadoop
HDFS (Hadoop Distributed File System)
- HDFS is the Java-based file system for scalable and reliable storage of large
datasets.
- Data in HDFS is stored in the form of blocks.
- It operates on a master/slave architecture.

Hadoop MapReduce -
- This is a Java-based programming paradigm of the Hadoop framework.
- It provides scalability across various Hadoop clusters.
- MapReduce distributes the workload into various tasks that can run in
parallel.
Main components of a Hadoop Application
• Hadoop applications use a wide range of technologies that provide a
great advantage in solving complex business problems.
Core components of a Hadoop application -
• Hadoop Common
• HDFS
• Hadoop MapReduce
• YARN
The components used in Hadoop Ecosystem

• Data Access Components are - Pig and Hive


• Data Storage Component is - HBase
• Data Integration Components are - Apache Flume, Sqoop
• Data Management and Monitoring Components are - Ambari, Oozie
and Zookeeper.
• Data Serialization Components are - Thrift and Avro
• Data Intelligence Components are - Apache Mahout and Drill.
Hadoop streaming
• The Hadoop distribution includes a generic application programming
interface (API).
• This API is used for writing Map and Reduce jobs in any desired
programming language like Python, Perl, Ruby, etc.
• This is referred to as Hadoop Streaming.
• Users can create and run jobs with any kind of shell script or
executable as the mapper or reducer.
What is a commodity hardware?
• Hadoop processes large amounts of heterogeneous data sets in a
distributed fashion.
• Processing is done across clusters of commodity computers and
hardware using a simplified programming model.
• Commodity hardware refers to inexpensive systems.
• They do not offer high availability or high-end quality.
• Commodity hardware still needs adequate RAM, because there are specific
services that need to be executed in RAM.
• Hadoop can be run on any commodity hardware.
• It does not require any supercomputers or high-end hardware
configuration to execute jobs.
• The best configuration for executing Hadoop jobs is dual-core
machines or dual processors with 4 GB or 8 GB of RAM.
What is a cluster? How does a server cluster work?
Cluster –
• In a computer system, a cluster is a group of servers and other resources.
• A cluster acts like a single system and enables high availability, load
balancing, and parallel processing.
Server cluster -
• A server cluster is a collection of servers, called nodes.
• The Nodes communicate with each other to make a set of services highly available to
clients.
Hadoop cluster -
• A Hadoop cluster is a special type of computational cluster
• designed specifically for storing and analyzing huge amounts of unstructured data in a
distributed computing environment.
Required Software for Hadoop installation

• Required software for Linux includes:
• Java™ must be installed.
• ssh must be installed.
• sshd must be running to use the Hadoop scripts that manage remote
Hadoop daemons (if the optional start and stop scripts are to be
used.)
Topology of Hadoop Cluster
Name Node

◼ Hadoop employs a master/slave architecture for both distributed
storage and distributed computation.
◼ The NameNode is the master of HDFS.
◼ The NameNode is the bookkeeper of HDFS.
◼ A single NameNode stores all of the metadata.
◼ Metadata is maintained entirely in RAM for fast lookup.
Data node

◼ Each slave machine in the cluster hosts a DataNode daemon.
◼ It performs the grunt work of the distributed filesystem.
◼ When you want to read or write an HDFS file, the file is broken into blocks.
◼ The NameNode tells the client which DataNode each block resides on.
◼ The client then communicates directly with the DataNode daemons,
as advised by the NameNode, to process the local files
corresponding to the blocks.
◼ DataNodes are constantly reporting to the NameNode.
Secondary Name node

◼ The Secondary NameNode (SNN) is an assistant daemon for
monitoring the state of the cluster's HDFS.

◼ The SNN differs from the NameNode in that this process doesn't
receive or record any real-time changes to HDFS.

◼ Instead, it communicates with the NameNode to take snapshots of
the HDFS metadata at intervals defined by the cluster configuration.
Job tracker

◼ The JobTracker daemon is the liaison between your application and Hadoop.
◼ The JobTracker takes care of resource allocation for the Hadoop job to ensure timely
completion.
◼ JobTracker determines the execution plan by determining which files to process,
assigns nodes to different tasks, and monitors all tasks as they’re running.
◼ Should a task fail, the JobTracker will automatically relaunch
the task, possibly on a different node, up to a predefined limit of retries.
◼ JobTracker is the service within Hadoop that runs MapReduce jobs on the cluster.
◼ There is only one JobTracker daemon per Hadoop cluster.
Task tracker

◼ Runs the tasks assigned by the JobTracker.
◼ Sends progress reports (heartbeats) to the JobTracker.
Q&A
What happens to a NameNode that has no data?
• A NameNode never exists without data.
• If it is a NameNode, then it must hold some metadata.

What happens when a user submits a Hadoop job while the NameNode is down - does the job go on hold or does it fail?
• The Hadoop job fails when the NameNode is down.

What happens when a user submits a Hadoop job while the JobTracker is down - does the job go on hold or does it fail?
• The Hadoop job fails when the JobTracker is down.

Whenever a client submits a Hadoop job, who receives it?
• The NameNode receives the Hadoop job; it then looks for the data requested by the client and provides the block
information.
• The JobTracker takes care of resource allocation for the Hadoop job to ensure timely completion.
HDFS: Architecture
HDFS Key features
◼ Self healing
◼ Fault tolerant
◼ Abstraction for parallel processing
◼ Manage complex structured or unstructured data
HDFS weaknesses
• Block information stored in memory
• File system metadata size is limited to the amount of available RAM on the
NameNode (the small files problem)
• Single Point of Failure
• File system is offline if NameNode goes down
Where HDFS is not a fit
• Low-latency data access
• Low latency describes a computer network that is optimized to process a
very high volume of data messages with minimal delay (latency).
• These networks are designed to support operations that require near
real-time access to rapidly changing data.
• HDFS is designed for very large data sets and high throughput, so data
access has relatively high latency.
• Lots of small files
• The NameNode holds the filesystem metadata in memory, so a very large
number of small files can exhaust the NameNode's RAM.
Modes of Hadoop installation

• Standalone (or local) mode
• everything runs in a single JVM
• Pseudo-distributed mode
• daemons run on the local machine
• Fully distributed mode
• daemons run across a cluster of machines
What is MapReduce
◼ A parallel programming framework
◼ MapReduce provides
◼ Automatic parallelization
◼ Fault tolerance
◼ Monitoring and status updates
Map Reduce Overview
◼ Tasks are distributed to multiple nodes
◼ Each node processes the data stored in that node
◼ Consists of two phases:
◼ Map – Reads input data and produces intermediate keys and values as
output. Map output (intermediate key, values) is stored in local system.
◼ Reduce – Values with the same key are sent to the same reducer for further
processing
Map Reduce Process
Anatomy of a MapReduce Job Run

• At the highest level, there are four independent entities:


• The client, which submits the MapReduce job.
• The jobtracker, which coordinates the job run.
• The tasktrackers, which run the tasks that the job has been split into.
• The distributed filesystem (normally HDFS), which is used for sharing
job files between the other entities.
MapReduce (MRv1): Hadoop Architecture
Map Reduce: Job Tracker

◼ Determines the execution plan by determining which files to process,
assigns nodes to different tasks, and monitors all tasks as they are
running.
◼ When a task fails, the JobTracker will automatically relaunch
(ATTEMPT) the task, possibly on a different node, up to a predefined
limit of retries.
◼ There is only one JobTracker daemon per Hadoop cluster.
MapReduce: Task Tracker

◼ The TaskTracker is responsible for instantiating and monitoring individual
MapReduce tasks.
◼ Maintains a heartbeat with the JobTracker.
◼ There is one TaskTracker daemon per Hadoop data node.
Basic parameters of Mapper and Reducer
The four basic parameters of a mapper are -
<LongWritable, Text, Text, IntWritable>

• LongWritable, Text - represent the input parameters.
• Text, IntWritable - represent the intermediate output parameters.

The four basic parameters of a reducer are -
<Text, IntWritable, Text, IntWritable>

• Text, IntWritable - represent the intermediate output parameters.
• Text, IntWritable - represent the final output parameters.
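As a hedged illustration of where these four type parameters appear (a minimal sketch assuming the org.apache.hadoop.mapreduce API; the class names SketchMapper and SketchReducer are placeholders, not taken from the slides):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<input key, input value, intermediate output key, intermediate output value>
class SketchMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // map() is overridden here; see the WordCount example later in these notes.
}

// Reducer<intermediate key, intermediate value, final output key, final output value>
class SketchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // reduce() is overridden here; see the WordCount example later in these notes.
}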
MapReduce: Programming Model

◼ Borrows from Functional Programming


◼ Users implement two functions: Map and Reduce
MapReduce Data Flow in Hadoop
Map

◼ Records are fed to the map function as key-value pairs.


◼ Produces one or more intermediate values with an output key from
the input
Shuffle and Sort

◼ The map task performs an in-memory sort of the keys and divides them
into partitions corresponding to the reducers that they will be sent to.
◼ The output of every map task is copied to the reducers.
◼ Every reduce task gets a set of keys to execute the task on.
◼ The reduce task performs a merge on the output it receives from
different map tasks.
Reduce

◼ After map(), intermediate values are combined into a list.
◼ reduce() combines the intermediate values into one or more final values
for that same output key.
Data Blocks and Input Splits in Hadoop's MapReduce

• HDFS breaks down very large files into large blocks (for example,
measuring 128MB)
• Then HDFS stores three copies (by default) of these blocks on different nodes
in the cluster.
• HDFS has no awareness of the content of these files.
• The key to efficient MapReduce processing is that, wherever possible,
data is processed locally — on the slave node where it’s stored.
• In Hadoop, files are composed of individual records
• which are ultimately processed one-by-one by mapper tasks.
• For example, the sample data set contains information about
completed flights within the United States between 1987 and 2008.
• Suppose, you have one large file for each year.
• within every file, each individual line represents a single flight.
• In other words, one line represents one record.
• The default block size for a Hadoop cluster is 64 MB or 128 MB, so the
data files are broken into chunks of exactly 64 MB or 128 MB.
If each map task processes all records in a specific data
block, what happens to those records that span block
boundaries?

• File blocks are exactly 64MB (or whatever you set the block size to be)
• and because HDFS has no conception of what’s inside the file blocks,
it can’t gauge when a record might spill over into another block.
• To solve this problem, Hadoop uses a logical representation of the
data stored in file blocks, known as input splits.
• When a MapReduce job client calculates the input splits, it figures out
where the first whole record in a block begins and where the last
record in the block ends.
• In cases where the last record in a block is incomplete, the input split
includes location information for the next block
• and the byte offset of the data needed to complete the record.
• The number of input splits that are calculated for a specific
application determines the number of mapper tasks.
• Each of these mapper tasks is assigned, where possible, to a slave
node where the input split is stored.
• The Resource Manager (or JobTracker, if you’re in Hadoop 1) does its
best to ensure that input splits are processed locally.
Data blocks and input splits in HDFS
Output Format used in Reducer
What is OutputFormat

◼ The OutputFormat determines where and how the results of your job
are persisted.
◼ Hadoop comes with a collection of classes and interfaces for
different output formats.

◼ TextOutputFormat writes line-separated, tab-delimited text files of
key-value pairs.

◼ Hadoop also provides SequenceFileOutputFormat, which can write
the binary representation of objects instead of converting them to text,
and compress the result.
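A minimal driver-side sketch of selecting an OutputFormat (assuming the org.apache.hadoop.mapreduce API; the class name OutputFormatSketch and the job name are illustrative, not from the slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class OutputFormatSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "output format sketch");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // TextOutputFormat is the default; naming SequenceFileOutputFormat here
        // switches the job to a binary, optionally compressed output format.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
    }
}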
RecordWriter

◼ An OutputFormat specifies how to serialize data by providing an implementation
of RecordWriter.

◼ RecordWriter classes handle the job of taking an individual key-value pair
and writing it to the location prepared by the OutputFormat.

◼ A RecordWriter implements two functions: write and close.

◼ The write function takes key-value pairs from the MapReduce job and writes the
bytes to disk.

◼ The default RecordWriter is LineRecordWriter.

◼ The close function closes the Hadoop data stream to the output file.
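A minimal sketch of a custom RecordWriter in the spirit of LineRecordWriter (the class name TabSeparatedRecordWriter is a placeholder; a real OutputFormat would construct and return it from its getRecordWriter() method):

import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class TabSeparatedRecordWriter extends RecordWriter<Text, IntWritable> {
    private final DataOutputStream out;

    public TabSeparatedRecordWriter(DataOutputStream out) {
        this.out = out;
    }

    @Override
    public void write(Text key, IntWritable value) throws IOException {
        // write(): take an individual key-value pair and persist its bytes,
        // one tab-delimited pair per line.
        out.write(key.toString().getBytes("UTF-8"));
        out.writeByte('\t');
        out.write(value.toString().getBytes("UTF-8"));
        out.writeByte('\n');
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException {
        // close(): close the Hadoop data stream to the output file.
        out.close();
    }
}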
Word Count Example (Pseudo code)
What does the word count problem signify?
• The basic goal of this program is to count the number of occurrences of
each word in a text file.
Word Count Example
WordCountMapper.java
• The Map class starts with import statements.
• It imports Hadoop-specific data types for keys and values.
• In Hadoop, key and value data types can only be Hadoop-specific
types.
• LongWritable is similar to the Long data type in Java and is used to
hold a long number.
• Text is similar to the String data type, which is a sequence of characters.
• IntWritable is similar to Integer in Java.
• Every Map class extends the abstract class Mapper and overrides the
map() function.
• LongWritable, Text - data types for the input key and input value, which
Hadoop supplies to map().
• Text, IntWritable - data types for the output key and output value.
• Two fields have been declared - one and word - which are required in the
processing logic.
• map() has the parameters - input key, input value, context.
• The role of context (of the Context data type) is to collect the output key-value pairs.
• In the processing logic of map(), we tokenize the string into words and write
each one into the context using the method context.write(word, one),
• where word is used as the key and one is used as the value.
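The slide's code listing is not reproduced here; the following is a minimal mapper sketch consistent with the description above (assuming the org.apache.hadoop.mapreduce API):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // The two fields used in the processing logic.
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize the input line into words and emit (word, 1) for each word.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}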
WordCountReducer.java
• Every Reduce class needs to extend the Reducer class (an abstract class).
• <Text, IntWritable, Text, IntWritable> are the Hadoop-specific type parameters.
• Text, IntWritable - data types used for the input key and values.
• Text, IntWritable - data types used for the output key and value.
• We need to override the reduce(Text key, Iterable<IntWritable> values,
Context context) method.
• The input to reduce() is the key and a list of values.
• So values is specified as an iterable field.
• Here context collects the output key-value pairs.
Logic used in reduce()
• The logic used in the reduce function uses a for loop which iterates over
the values.
• We add each value to the sum field.
• After all the values of a particular key are processed, we output the
key-value pair through context.write(key, new IntWritable(sum)).
• The data types used for the input key and value of the reducer should
match the data types of the output key and value of map().
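Again, the slide's listing is not reproduced here; a minimal reducer sketch consistent with the description above (assuming the org.apache.hadoop.mapreduce API):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the counts emitted for this word by the mappers.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Emit the word and its total count.
        context.write(key, new IntWritable(sum));
    }
}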
The driver class
• The Job object controls the execution of the job; it is used to set the
job parameters so that Hadoop can take over from that point and
execute the job as specified by the programmer.
• We need to set the file input path and file output path in the driver
class. These paths will be passed as command-line arguments.

• The job.waitForCompletion() method is used to trigger the submission
of the job to Hadoop.
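A minimal driver sketch consistent with the description above (assuming the org.apache.hadoop.mapreduce API and the mapper and reducer sketches shown earlier):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Wire in the mapper and reducer described above.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Output key and value types produced by the reducer.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths are passed as command-line arguments.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Trigger the submission of the job and wait for completion.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}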
Execution of a Map reduce job

• The following instructions are for running a MapReduce job locally.
• We can also execute a job on YARN.
• Steps –
1. Format the filesystem
2. Start NameNode daemon and DataNode daemon
3. Browse the web interface for the NameNode
4. Make the HDFS directories required to execute MapReduce jobs
5. Copy the input files into the distributed filesystem
6. Run some of the examples provided
7. Examine the output files: view the output files on the distributed filesystem
8. When you’re done, stop the daemons
YARN on a Single Node

• You can run a MapReduce job on YARN in a pseudo-distributed mode by
setting a few parameters
• and running the ResourceManager daemon and NodeManager daemon in
addition.
1. Configure the parameters
2. Start the ResourceManager daemon and NodeManager daemon
3. Browse the web interface for the ResourceManager; by default it is
available at http://localhost:8088/
4. Run a MapReduce job.
5. When you're done, stop the daemons
Hadoop Rack Awareness

• Many Hadoop components are rack-aware and take advantage of the
network topology for performance and safety.
• Hadoop daemons obtain the rack information of the workers in the
cluster by invoking an administrator-configured module.
• It is highly recommended to configure rack awareness prior to starting
HDFS.
What it means to run a Hadoop job

• A Hadoop job runs as a set of daemons on different servers in the
network.
• These daemons have specific roles.
• Some exist only on one server; some exist across multiple servers.
▪ Namenode
▪ Datanode
▪ Secondary Namenode
▪ Tasktracker
▪ Jobtracker

Compile and Run the Hadoop job
Map and reduce job is running
Hadoop job completed
What is in the part file
Output from a MapReduce job
The output from running the job provides some useful information.
For example,
1) We can see that the job was given an ID of job_local_0001.
2) The Hadoop job ran one map task and one reduce task, each with its own ID.
3) Knowing the job and task IDs can be very useful when debugging MapReduce jobs.
For example -
attempt_local_0001_m_000000_0
attempt_local_0001_r_000000_0

4) "Counters" shows the statistics that Hadoop generates for each job it runs.
5) These are very useful for checking whether the amount of data processed is what you expected.
• For example, we can follow the number of records that went through the system:
• five map inputs produced five map outputs, then five reduce inputs in two groups produced two reduce
outputs.
