
Hadoop overview

• Hadoop is an open-source framework from the Apache Software Foundation.
• It is capable of processing large amounts of heterogeneous data sets
in a distributed fashion across clusters of commodity computers and
hardware, using a simplified programming model.
• Hadoop provides a reliable shared storage and analysis system.
• The Hadoop core is hosted by the Apache Software Foundation.
• It has its origins at Google and is now used by Yahoo!, Facebook, and many others.
• Once data is stored in Hadoop, companies across the globe can use it
to solve big problems, improve revenue, and more.
Hadoop Distributions

• Apache Hadoop is a framework for distributed computing and storage.
• The amount of big data being created grew rapidly, and there was a need
to handle and use this data optimally.
• The introduction of Apache Hadoop helped handle this problem to a
great extent.
• Although Hadoop is free, various distributions offer an easier-to-use
bundle.
• There are a number of Hadoop distributions.
Top 5 Hadoop Distributions
HDFS and MapReduce - the two core components of Hadoop
HDFS (Hadoop Distributed File System)
- HDFS is the Java-based file system for scalable and reliable storage of large
datasets.
- Data in HDFS is stored in the form of blocks.
- It operates on a master/slave architecture.

Hadoop MapReduce -
- This is a Java-based programming paradigm of the Hadoop framework.
- It provides scalability across various Hadoop clusters.
- MapReduce distributes the workload into various tasks that can run in
parallel.
Main components of a Hadoop Application
• Hadoop applications use a wide range of technologies that provide a
great advantage in solving complex business problems.
Core components of a Hadoop application -
• Hadoop Common
• HDFS
• Hadoop MapReduce
• YARN
The components used in Hadoop Ecosystem

• Data Access Components are - Pig and Hive


• Data Storage Component is - HBase
• Data Integration Components are - Apache Flume, Sqoop
• Data Management and Monitoring Components are - Ambari, Oozie
and Zookeeper.
• Data Serialization Components are - Thrift and Avro
• Data Intelligence Components are - Apache Mahout and Drill.
Hadoop streaming
• The Hadoop distribution includes a generic application programming
interface (API).
• This API is used for writing Map and Reduce jobs in any desired
programming language like Python, Perl, Ruby, etc.
• This is referred to as Hadoop Streaming.
• Users can create and run jobs with any kind of shell script or
executable as the mapper or reducer.
What is a commodity hardware?
• Hadoop processes large amounts of heterogeneous data sets in a
distributed fashion.
• Processing is done across clusters of commodity computers and
hardware using a simplified programming model.
• Commodity hardware refers to inexpensive systems.
• They do not offer high availability or high-end quality.
• Commodity hardware still needs adequate RAM, because there are specific
services that need to be executed in RAM.
• Hadoop can be run on any commodity hardware.
• It does not require any supercomputers or high-end hardware
configuration to execute jobs.
• The best configuration for executing Hadoop jobs is dual-core
machines or dual processors with 4 GB or 8 GB of RAM.
What is a cluster? How does a server cluster work?
Cluster –
• In a computer system, a cluster is a group of servers and other resources.
• A cluster acts like a single system and enables high availability, load
balancing, and parallel processing.
Server cluster -
• A server cluster is a collection of servers, called nodes.
• The Nodes communicate with each other to make a set of services highly available to
clients.
Hadoop cluster -
• A Hadoop cluster is a special type of computational cluster
• designed specifically for storing and analyzing huge amounts of unstructured data in a
distributed computing environment.
Required Software for Hadoop installation

• Required software for Linux includes:
• Java™ must be installed.
• ssh must be installed.
• sshd must be running to use the Hadoop scripts that manage remote
Hadoop daemons (if the optional start and stop scripts are to be
used.)
Topology of Hadoop Cluster
Name Node

◼ Hadoop employs a master/slave architecture for both distributed
storage and distributed computation.
◼ The NameNode is the master of HDFS.
◼ The NameNode is the bookkeeper of HDFS.
◼ A single NameNode stores all of the metadata.
◼ Metadata is maintained entirely in RAM for fast lookup.
Data node

◼ Each slave machine in the cluster hosts a DataNode daemon.
◼ It performs the grunt work of the distributed filesystem.
◼ When you want to read or write an HDFS file, the file is broken into blocks.
◼ The NameNode tells the client which DataNode each block resides on.
◼ The client then communicates directly with the DataNode daemons,
as advised by the NameNode, to process the local files
corresponding to the blocks.
◼ DataNodes are constantly reporting to the NameNode.
Secondary Name node

◼ The Secondary NameNode (SNN) is an assistant daemon for
monitoring the state of the cluster's HDFS.

◼ The SNN differs from the NameNode in that this process doesn't
receive or record any real-time changes to HDFS.

◼ Instead, it communicates with the NameNode to take snapshots of
the HDFS metadata at intervals defined by the cluster configuration.
Job tracker

◼ The JobTracker daemon is the liaison between your application and Hadoop.
◼ The JobTracker takes care of resource allocation for the Hadoop job to ensure timely
completion.
◼ JobTracker determines the execution plan by determining which files to process,
assigns nodes to different tasks, and monitors all tasks as they’re running.
◼ Should a task fail, the JobTracker will automatically relaunch
the task, possibly on a different node, up to a predefined limit of retries.
◼ JobTracker is the service within Hadoop that runs MapReduce jobs on the cluster.
◼ There is only one JobTracker daemon per Hadoop cluster.
Task tracker

◼ Runs the tasks assigned by the JobTracker.
◼ Sends progress reports (heartbeats) to the JobTracker.
Q&A
What happens to a NameNode that has no data?
• A NameNode never exists without data.
• If it is a NameNode, then it must hold some metadata.

What happens when a user submits a Hadoop job while the NameNode is down - does the job go on hold or does it fail?
• The Hadoop job fails when the NameNode is down.

What happens when a user submits a Hadoop job while the JobTracker is down - does the job go on hold or does it fail?
• The Hadoop job fails when the JobTracker is down.

Whenever a client submits a Hadoop job, who receives it?
• The NameNode receives the Hadoop job; it then looks for the data requested by the client and provides the block
information.
• The JobTracker takes care of resource allocation for the Hadoop job to ensure timely completion.
HDFS: Architecture
HDFS Key features
◼ Self healing
◼ Fault tolerant
◼ Abstraction for parallel processing
◼ Manage complex structured or unstructured data
HDFS weaknesses
• Block information stored in memory
• File system metadata size is limited to the amount of available RAM on the
NameNode (the small files problem)
• Single Point of Failure
• File system is offline if NameNode goes down
Where HDFS is not a fit
• Low-latency data access
• Low latency describes a computer network that is optimized to process a
very high volume of data messages with minimal delay (latency).
• These networks are designed to support operations that require near
real-time access to rapidly changing data.
• HDFS is designed for very large data sets and high throughput, so data
access has relatively high latency.
• Lots of small files
• The NameNode holds the filesystem metadata in memory, so a very large
number of small files can exhaust the NameNode's RAM.
Modes of Hadoop installation

• Standalone (or local) mode
• everything runs in a single JVM
• Pseudo-distributed mode
• daemons run on the local machine
• Fully distributed mode
• daemons run across a cluster of machines
What is MapReduce
◼ A parallel programming framework
◼ MapReduce provides
◼ Automatic parallelization
◼ Fault tolerance
◼ Monitoring and status updates
Map Reduce Overview
◼ Tasks are distributed to multiple nodes
◼ Each node processes the data stored in that node
◼ Consists of two phases:
◼ Map – Reads input data and produces intermediate keys and values as
output. Map output (intermediate key, values) is stored in local system.
◼ Reduce – Values with the same key are sent to the same reducer for further
processing
Map Reduce Process
Anatomy of a MapReduce Job Run

• At the highest level, there are four independent entities:


• The client, which submits the MapReduce job.
• The jobtracker, which coordinates the job run.
• The tasktrackers, which run the tasks that the job has been split into.
• The distributed filesystem (normally HDFS), which is used for sharing
job files between the other entities.
MapReduce (MRv1): Hadoop Architecture
Map Reduce: Job Tracker

◼ Determines the execution plan by determining which files to process,
assigns nodes to different tasks, and monitors all tasks as they are
running.
◼ When a task fails, the JobTracker will automatically relaunch
(ATTEMPT) the task, possibly on a different node, up to a predefined
limit of retries.
◼ There is only one JobTracker daemon per Hadoop cluster.
MapReduce: Task Tracker

◼ The TaskTracker is responsible for instantiating and monitoring individual
MapReduce tasks.
◼ Maintains a heartbeat with the JobTracker.
◼ There is one TaskTracker daemon per Hadoop data node.
Basic parameters of Mapper and Reducer
The four basic parameters of a mapper are -
<LongWritable, Text, Text, IntWritable>

• LongWritable, Text - represent the input parameters.
• Text, IntWritable - represent the intermediate output parameters.

The four basic parameters of a reducer are -
<Text, IntWritable, Text, IntWritable>

• Text, IntWritable - represent the intermediate output parameters.
• Text, IntWritable - represent the final output parameters.
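As a hedged illustration of where these four type parameters appear (a minimal sketch assuming the org.apache.hadoop.mapreduce API; the class names SketchMapper and SketchReducer are placeholders, not taken from the slides):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<input key, input value, intermediate output key, intermediate output value>
class SketchMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // map() is overridden here; see the WordCount example later in these notes.
}

// Reducer<intermediate key, intermediate value, final output key, final output value>
class SketchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // reduce() is overridden here; see the WordCount example later in these notes.
}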
MapReduce: Programming Model

◼ Borrows from Functional Programming


◼ Users implement two functions: Map and Reduce
MapReduce Data Flow in Hadoop
Map

◼ Records are fed to the map function as key-value pairs.


◼ Produces one or more intermediate values with an output key from
the input
Shuffle and Sort

◼ The map task performs an in-memory sort of the keys and divides them
into partitions corresponding to the reducers that they will be sent to.
◼ The output of every map task is copied to the reducers.
◼ Every reduce task gets a set of keys to execute the task on.
◼ The reduce task performs a merge on the output it receives from
different map tasks.
Reduce

◼ After map(), intermediate values are combined into a list.
◼ reduce() combines the intermediate values into one or more final values
for that same output key.
Data Blocks and Input Splits in Hadoop's MapReduce

• HDFS breaks down very large files into large blocks (for example,
measuring 128MB)
• Then HDFS stores three copies (by default) of these blocks on different nodes
in the cluster.
• HDFS has no awareness of the content of these files.
• The key to efficient MapReduce processing is that, wherever possible,
data is processed locally — on the slave node where it’s stored.
• In Hadoop, files are composed of individual records
• which are ultimately processed one-by-one by mapper tasks.
• For example, the sample data set contains information about
completed flights within the United States between 1987 and 2008.
• Suppose, you have one large file for each year.
• within every file, each individual line represents a single flight.
• In other words, one line represents one record.
• The default block size for a Hadoop cluster is 64 MB or 128 MB, so the
data files are broken into chunks of exactly 64 MB or 128 MB.
If each map task processes all records in a specific data
block, what happens to those records that span block
boundaries?

• File blocks are exactly 64MB (or whatever you set the block size to be)
• and because HDFS has no conception of what’s inside the file blocks,
it can’t gauge when a record might spill over into another block.
• To solve this problem, Hadoop uses a logical representation of the
data stored in file blocks, known as input splits.
• When a MapReduce job client calculates the input splits, it figures out
where the first whole record in a block begins and where the last
record in the block ends.
• In cases where the last record in a block is incomplete, the input split
includes location information for the next block
• and the byte offset of the data needed to complete the record.
• The number of input splits that are calculated for a specific
application determines the number of mapper tasks.
• Each of these mapper tasks is assigned, where possible, to a slave
node where the input split is stored.
• The Resource Manager (or JobTracker, if you’re in Hadoop 1) does its
best to ensure that input splits are processed locally.
Data blocks and input splits in HDFS
Output Format used in Reducer
What is OutputFormat

◼ The OutputFormat determines where and how the results of your job
are persisted.
◼ Hadoop comes with a collection of classes and interfaces for
different output formats.

◼ TextOutputFormat writes line-separated, tab-delimited text files of
key-value pairs.

◼ Hadoop also provides SequenceFileOutputFormat, which can write
the binary representation of objects instead of converting them to text,
and compress the result.
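A minimal driver-side sketch of selecting an OutputFormat (assuming the org.apache.hadoop.mapreduce API; the class name OutputFormatSketch and the job name are illustrative, not from the slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class OutputFormatSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "output format sketch");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // TextOutputFormat is the default; naming SequenceFileOutputFormat here
        // switches the job to a binary, optionally compressed output format.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
    }
}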
RecordWriter

◼ An OutputFormat specifies how to serialize data by providing an implementation
of RecordWriter.

◼ RecordWriter classes handle the job of taking an individual key-value pair
and writing it to the location prepared by the OutputFormat.

◼ A RecordWriter implements two functions: write and close.

◼ The write function takes key-value pairs from the MapReduce job and writes the
bytes to disk.

◼ The default RecordWriter is LineRecordWriter.

◼ The close function closes the Hadoop data stream to the output file.
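A minimal sketch of a custom RecordWriter in the spirit of LineRecordWriter (the class name TabSeparatedRecordWriter is a placeholder; a real OutputFormat would construct and return it from its getRecordWriter() method):

import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class TabSeparatedRecordWriter extends RecordWriter<Text, IntWritable> {
    private final DataOutputStream out;

    public TabSeparatedRecordWriter(DataOutputStream out) {
        this.out = out;
    }

    @Override
    public void write(Text key, IntWritable value) throws IOException {
        // write(): take an individual key-value pair and persist its bytes,
        // one tab-delimited pair per line.
        out.write(key.toString().getBytes("UTF-8"));
        out.writeByte('\t');
        out.write(value.toString().getBytes("UTF-8"));
        out.writeByte('\n');
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException {
        // close(): close the Hadoop data stream to the output file.
        out.close();
    }
}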
Word Count Example (Pseudo code)
What does the word count problem signify?
• The basic goal of this program is to count the number of occurrences of
each word in a text file.
Word Count Example
WordCountMapper.java
• The Map class starts with import statements.
• It imports Hadoop-specific data types for keys and values.
• In Hadoop, key and value data types can only be Hadoop-specific
types.
• LongWritable is similar to the Long data type in Java and is used to
hold a long number.
• Text is similar to the String data type, which is a sequence of characters.
• IntWritable is similar to Integer in Java.
• Every Map class extends the abstract class Mapper and overrides the
map() function.
• LongWritable, Text - data types for the input key and input value, which
Hadoop supplies to map().
• Text, IntWritable - data types for the output key and output value.
• Two fields have been declared - one and word - which are required in the
processing logic.
• map() has the parameters - input key, input value, context.
• The role of context (of the Context data type) is to collect the output key-value pairs.
• In the processing logic of map(), we tokenize the string into words and write
each one into the context using the method context.write(word, one),
• where word is used as the key and one is used as the value.
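The slide's code listing is not reproduced here; the following is a minimal mapper sketch consistent with the description above (assuming the org.apache.hadoop.mapreduce API):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // The two fields used in the processing logic.
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize the input line into words and emit (word, 1) for each word.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}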
WordCountReducer.java
• Every Reduce class needs to extend the Reducer class (an abstract class).
• <Text, IntWritable, Text, IntWritable> are the Hadoop-specific type parameters.
• Text, IntWritable - data types used for the input key and values.
• Text, IntWritable - data types used for the output key and value.
• We need to override the reduce(Text key, Iterable<IntWritable> values,
Context context) method.
• The input to reduce() is the key and a list of values.
• So values is specified as an iterable field.
• Here context collects the output key-value pairs.
Logic used in reduce()
• The logic used in the reduce function uses a for loop which iterates over
the values.
• We add each value to the sum field.
• After all the values of a particular key are processed, we output the
key-value pair through context.write(key, new IntWritable(sum)).
• The data types used for the input key and value of the reducer should
match the data types of the output key and value of map().
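Again, the slide's listing is not reproduced here; a minimal reducer sketch consistent with the description above (assuming the org.apache.hadoop.mapreduce API):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the counts emitted for this word by the mappers.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Emit the word and its total count.
        context.write(key, new IntWritable(sum));
    }
}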
The driver class
• The Job object controls the execution of the job; it is used to set the
job parameters so that Hadoop can take over from that point and
execute the job as specified by the programmer.
• We need to set the file input path and file output path in the driver
class. These paths will be passed as command-line arguments.

• The job.waitForCompletion() method is used to trigger the submission
of the job to Hadoop.
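A minimal driver sketch consistent with the description above (assuming the org.apache.hadoop.mapreduce API and the mapper and reducer sketches shown earlier):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Wire in the mapper and reducer described above.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Output key and value types produced by the reducer.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths are passed as command-line arguments.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Trigger the submission of the job and wait for completion.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}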
Execution of a Map reduce job

• The following instructions are for running a MapReduce job locally.
• We can also execute a job on YARN.
• Steps –
1. Format the filesystem
2. Start NameNode daemon and DataNode daemon
3. Browse the web interface for the NameNode
4. Make the HDFS directories required to execute MapReduce jobs
5. Copy the input files into the distributed filesystem
6. Run some of the examples provided
7. Examine the output files: view the output files on the distributed filesystem
8. When you’re done, stop the daemons
YARN on a Single Node

• You can run a MapReduce job on YARN in a pseudo-distributed mode by
setting a few parameters
• and running the ResourceManager daemon and NodeManager daemon in
addition.
1. Configure the parameters
2. Start the ResourceManager daemon and NodeManager daemon
3. Browse the web interface for the ResourceManager; by default it is
available at http://localhost:8088/
4. Run a MapReduce job.
5. When you're done, stop the daemons
Hadoop Rack Awareness

• Many Hadoop components are rack-aware and take advantage of the
network topology for performance and safety.
• Hadoop daemons obtain the rack information of the workers in the
cluster by invoking an administrator-configured module.
• It is highly recommended to configure rack awareness prior to starting
HDFS.
What it means to run a Hadoop job

• A Hadoop job runs as a set of daemons on different servers in the
network.
• These daemons have specific roles.
• Some exist only on one server; some exist across multiple servers.
▪ Namenode
▪ Datanode
▪ Secondary Namenode
▪ Tasktracker
▪ Jobtracker

Compile and Run the Hadoop job
Map and reduce job is running
Hadoop job completed
What is in the part file
Output from a MapReduce job
The output from running the job provides some useful information.
For example,
1) We can see that the job was given an ID of job_local_0001.
2) The Hadoop job ran one map task and one reduce task, each with its own ID.
3) Knowing the job and task IDs can be very useful when debugging MapReduce jobs.
For example -
attempt_local_0001_m_000000_0
attempt_local_0001_r_000000_0

4) "Counters" shows the statistics that Hadoop generates for each job it runs.
5) These are very useful for checking whether the amount of data processed is what you expected.
• For example, we can follow the number of records that went through the system:
• five map inputs produced five map outputs, then five reduce inputs in two groups produced two reduce
outputs.
