Lecture 5 - Hadoop and MapReduce
Facebook launched Hive, SQL support for Hadoop
What is Hadoop?
• Open source software framework designed for storage and processing of large-scale datasets on large clusters of commodity hardware
• Large datasets → terabytes or petabytes of data
• Large clusters → hundreds or thousands of nodes
Hadoop Master/Slave Architecture
Design Principles of Hadoop
• Commodity hardware
• A large number of low-end, cheap machines working in parallel to solve a computing problem
Properties of HDFS
• Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system’s data
Hadoop: How it Works
Hadoop Architecture
• Distributed file system (HDFS)
• Execution engine (MapReduce)
Hadoop Distributed File System (HDFS)
• Centralized namenode
• Maintains metadata info about files
(Figure: a file F is divided into 64 MB blocks)
Hadoop Distributed File System
• NameNode:
• Stores metadata (file names, block locations, etc.)
• DataNode:
• Stores the actual HDFS data blocks
(Figure: the namenode maps File1 to blocks 1-4; each block is replicated on several datanodes)
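To make the client-side view concrete, here is a minimal sketch (the path and class name are hypothetical, not from the slides) that reads a file through Hadoop's Java FileSystem API; the client asks the NameNode for block locations, then streams the bytes directly from the DataNodes holding the blocks.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // client handle for the cluster's file system
    Path file = new Path("/data/file1.txt");    // hypothetical HDFS path
    // open() fetches block locations from the namenode; the actual
    // bytes are streamed from the datanodes that store the blocks.
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(file)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```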
Data Retrieval
• JobTracker
• Determines the execution plan for the job
• Assigns individual tasks
• TaskTracker
• Keeps track of the performance of an individual mapper or reducer
Properties of MapReduce Engine
• Job Tracker is the master node (runs with the namenode)
• Receives the user’s job
• Decides how many tasks will run (number of mappers)
• Decides where to run each mapper (concept of locality)
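To make this concrete, here is a minimal driver sketch using Hadoop's Java mapreduce API (class names are my own; the mapper and reducer are sketched under Example 1 below). The client submits the job, the framework creates one map task per input split, and the scheduler tries to place each map task on a node that stores its block (locality).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.WordCountMapper.class);    // sketched in Example 1 below
    job.setReducerClass(WordCount.WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // One map task is created per input split (typically one per HDFS
    // block); the scheduler prefers nodes that already hold the block.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit and wait
  }
}
```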
Properties of MapReduce Engine (Cont’d)
• Task Tracker is the slave node (runs on each datanode)
• Receives the task from the Job Tracker
• Runs the task until completion (either a map or a reduce task)
• Always in communication with the Job Tracker, reporting progress
(Figure: in this example, one map-reduce job consists of 4 map tasks and 3 reduce tasks)
MapReduce Phases
Deciding on what will be the key and what will be the value ➔ developer’s responsibility
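For example, the same text input can be keyed in different ways depending on the job. A hypothetical sketch (not from the slides): a mapper that chooses (line length → line) as its intermediate pair, where a word count job over the same input would instead choose (word → 1).

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: the developer chose (line length -> line) as the
// intermediate (key, value) pair, so the shuffle will group lines of the
// same length at one reducer. A different job over the same input might
// choose (word -> 1) instead, as in the Word Count example below.
public class LineLengthMapper
    extends Mapper<LongWritable, Text, IntWritable, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    context.write(new IntWritable(line.getLength()), line);  // length in bytes
  }
}
```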
Map-Reduce Execution Engine (Example: Color Count)
(Figure: input blocks on HDFS → each mapper parses its block and produces (k, v) pairs such as (color, 1) → shuffle & sorting groups the pairs by key → each reducer consumes (k, [v]) lists such as (color, [1,1,1,1,1,1,...]) and produces (k’, v’) pairs such as (color, 100))
• Mappers:
• Consume <key, value> pairs
• Produce <key, value> pairs
• Reducers:
• Consume <key, <list of values>>
• Produce <key, value>
Example 1: Word Count
• Job: Count the occurrences of each word in a data set
(Figure: data flow from map tasks to reduce tasks)
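A minimal sketch of this job in Hadoop's Java API (class and variable names are my own): each mapper emits (word, 1) for every word in its input split, and after the shuffle groups the pairs by word, each reducer sums the counts.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Word Count: mappers emit (word, 1); reducers sum the 1s per word.
public class WordCount {
  public static class WordCountMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);                   // produce (word, 1)
      }
    }
  }

  public static class WordCountReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();                             // consume (word, [1,1,...])
      }
      context.write(word, new IntWritable(sum));    // produce (word, total)
    }
  }
}
```

Because summing is associative, the same reducer class could also be registered as a combiner to pre-aggregate counts on the map side.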
Example 2: Color Count
Job: Count the number of each color in a data set
(Figure: each reducer produces (k’, v’) pairs such as (color, 100) and writes its own part of the output file: Part0001, Part0002, Part0003. The output file has 3 parts, probably on 3 different machines.)
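A sketch under the assumption (mine, not from the slides) that each input record is a single color name: the mapper emits (color, 1) and the reducer totals the counts per color.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Color Count: assumes each input line holds one color name.
public class ColorCount {
  public static class ColorMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      context.write(new Text(line.toString().trim()), ONE);  // (color, 1)
    }
  }

  public static class ColorReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text color, Iterable<IntWritable> ones, Context context)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable one : ones) total += one.get();
      context.write(color, new IntWritable(total));          // (color, total)
    }
  }
}
```

Calling job.setNumReduceTasks(3) in the driver fixes the number of reduce tasks at 3; each reduce task writes its own part file, which is why the output above has 3 parts.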
Example 3: Color Filter
Job: Select only the blue and the green colors
• Each map task will select only the blue or green colors
(Figure: input blocks on HDFS → each mapper produces (k, v) pairs and writes its output directly to HDFS, with no reduce phase: Part0001 through Part0004. The output file has 4 parts, probably on 4 different machines.)
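A sketch of a map-only filter (assuming, as above, one color per line): with the reduce phase disabled, each map task writes its output directly to HDFS, so 4 input splits yield 4 output parts.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Color Filter: a map-only job that keeps only blue and green records.
public class ColorFilterMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String color = line.toString().trim();
    if (color.equals("blue") || color.equals("green")) {
      context.write(new Text(color), NullWritable.get());  // pass record through
    }
  }
}
```

In the driver, job.setNumReduceTasks(0) turns off the reduce phase entirely; the map outputs are then written straight to the output directory, one part file per map task.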
Other Tools
• Hive
• Hadoop processing with SQL
• Pig
• Hadoop processing with scripting
• HBase
• Database model built on top of Hadoop
Who Uses Hadoop?