Lecture 5 - Hadoop and MapReduce

What is Hadoop?

• An open source framework, written in Java, that allows distributed processing of large datasets across a cluster of machines
• Inspired by Google's MapReduce programming model as well as its file system (GFS)
• Runs on commodity hardware:
  • Economic / affordable
  • Typically low-performance hardware
Hadoop History

• 2002: Doug Cutting started working on Nutch
• 2003-2004: Google published the GFS and MapReduce papers
• 2005: Doug Cutting added DFS & MapReduce to Nutch
• 2006: Development of Hadoop started as a Lucene sub-project
• 2007: The New York Times converted 4 TB of image archives over 100 EC2 instances
• 2008: Hadoop became a top-level Apache project and defeated a supercomputer in the terabyte sort benchmark; Facebook launched Hive, SQL support for Hadoop
• 2009: Doug Cutting joined Cloudera
What is Hadoop?
• Open source software framework designed for storage and processing of large-scale datasets on large clusters of commodity hardware
• Large datasets → terabytes or petabytes of data
• Large clusters → hundreds or thousands of nodes

Uses for Hadoop
• Data-intensive text processing
• Graph mining
• Machine learning and data mining
• Large-scale social network analysis
What is Hadoop (Cont'd)
• The Hadoop framework consists of two main layers:
  • Hadoop Distributed File System (HDFS)
  • Execution engine (MapReduce)
Hadoop Master/Slave Architecture

• Hadoop is designed as a master-slave architecture:
  • One master node (a single node)
  • Many slave nodes
Design Principles of Hadoop

• Need to process big data
• Need to parallelize computation across thousands of nodes
• Commodity hardware:
  • A large number of low-end, cheap machines working in parallel to solve a computing problem
Properties of HDFS
• Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data
• Replication: each data block is replicated many times (default is 3)
• Failure: failure is the norm rather than the exception
• Fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
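The slides stop at the concept, but for concreteness, here is a minimal sketch of inspecting and changing a file's replication factor through Hadoop's Java FileSystem API (the file path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");   // hypothetical file

        // Read the file's current replication factor from its metadata
        short replication = fs.getFileStatus(file).getReplication();
        System.out.println("Replication factor: " + replication);

        // Request a different replication factor; the namenode
        // re-replicates (or deletes excess replicas) in the background
        fs.setReplication(file, (short) 3);
    }
}
```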
Hadoop: How it Works

(architecture diagram)
Hadoop Architecture
• Distributed file system (HDFS)
• Execution engine (MapReduce)
• One master node (a single node) and many slave nodes
Hadoop Distributed File System (HDFS)

• Centralized namenode
  • Maintains metadata info about files
• Many datanodes (1000s)
  • Store the actual data
  • Files are divided into blocks (64 MB each)
  • Each block is replicated N times (default N = 3)
Hadoop Distributed File System

• NameNode:
  • Stores metadata (file names, block locations, etc.)
• DataNode:
  • Stores the actual HDFS data blocks

(diagram: File1 is split into blocks 1-4, and each block is replicated on three of the datanodes)
Data Retrieval

• When a client wants to retrieve data, it communicates with the NameNode to determine which blocks make up a file and on which datanodes those blocks are stored
• It then communicates directly with the datanodes to read the data
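That two-step protocol is hidden behind the client library. A minimal read sketch using Hadoop's Java FileSystem API (the path is hypothetical): open() consults the NameNode for block locations, and the returned stream reads each block directly from a datanode holding a replica.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // open() asks the NameNode which blocks make up the file and where
        // they live; the stream then reads those blocks from the datanodes
        try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```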
MapReduce
Distributing computation across nodes

MapReduce Overview

• A method for distributing computation across multiple nodes
• Each node processes the data that is stored at that node
• Consists of two main phases:
  • Map
  • Reduce
The Mapper

• Reads data as key/value pairs
• The key is often discarded
• Outputs zero or more key/value pairs
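As a sketch of this contract in Hadoop's Java API, here is a word-count style mapper: it discards the input key (the line's byte offset) and emits one (word, 1) pair per token. The class name is our own choice:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line (discarded), value = the line itself
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (word, 1) for each token
        }
    }
}
```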


Shuffle and Sort

• Output from the mapper is sorted by key
• All values with the same key are guaranteed to go to the same machine
The Reducer

• Called once for each unique key
• Gets a list of all values associated with a key as input
• Outputs zero or more final key/value pairs
  • Usually just one output per input key
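A matching sketch in Hadoop's Java API: a reducer that is called once per unique key, sums all of that key's values, and emits one output pair per input key (class name is again our own):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per unique key, with the full list of values for that key
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);   // one final (key, count) pair per input key
    }
}
```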
JobTracker and TaskTracker

• JobTracker
  • Determines the execution plan for the job
  • Assigns individual tasks
• TaskTracker
  • Keeps track of the performance of an individual mapper or reducer
Properties of MapReduce Engine
• The Job Tracker is the master node (runs with the namenode)
  • Receives the user's job
  • Decides how many tasks will run (number of mappers)
  • Decides where to run each mapper (concept of locality)

• Example (file spread over Node 1, Node 2, and Node 3): the file has 5 blocks → run 5 map tasks
  • Where should the task reading block "1" run? Try to run it on Node 1 or Node 3, the nodes holding a replica of that block
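For illustration, the block-to-node mapping that drives this locality decision is also visible to clients through Hadoop's Java FileSystem API; a minimal sketch (hypothetical path):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/sample.txt")); // hypothetical file

        // One BlockLocation per block; getHosts() lists the datanodes holding
        // replicas. The scheduler tries to run each map task on one of them.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block at offset " + block.getOffset()
                    + " -> hosts: " + String.join(", ", block.getHosts()));
        }
    }
}
```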
Properties of MapReduce Engine (Cont'd)
• The Task Tracker is the slave node (runs on each datanode)
  • Receives the task from the Job Tracker
  • Runs the task until completion (either a map or a reduce task)
  • Is always in communication with the Job Tracker, reporting progress

(diagram: map tasks feed, through a parse/hash step, into reduce tasks; in this example, one MapReduce job consists of 4 map tasks and 3 reduce tasks)
MapReduce Phases

• Deciding what will be the key and what will be the value is the developer's responsibility
Map-Reduce Execution Engine (Example: Color Count)

• Input blocks on HDFS are read by map tasks, which produce (k, v) pairs such as (color, 1)
• A parse/hash step shuffles and sorts the pairs based on k, so each reduce task consumes (k, [v]) pairs such as (color, [1,1,1,1,1,1, ...])
• Each reduce task produces (k', v') pairs such as (color, 100)

Users only provide the "Map" and "Reduce" functions.
Key-Value Pairs
• Mappers and Reducers are users' code (provided functions)
• They just need to obey the key-value pairs interface
• Mappers:
  • Consume <key, value> pairs
  • Produce <key, value> pairs
• Reducers:
  • Consume <key, <list of values>>
  • Produce <key, value>
• Shuffling and Sorting:
  • Hidden phase between mappers and reducers
  • Groups all pairs with the same key from all mappers, sorts them, and passes them to a certain reducer in the form of <key, <list of values>>
Example 1: Word Count
• Job: count the occurrences of each word in a data set

(diagram: the job's map tasks and reduce tasks)
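The slides show only the data flow, but a minimal Java driver tying together the mapper and reducer sketches from earlier could look like this (input and output paths come from the command line):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);   // mapper sketch from earlier
        job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);    // reducer sketch from earlier
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```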
Example 2: Color Count
• Job: count the number of each color in a data set
• As before, map tasks read input blocks on HDFS and produce (k, v) pairs such as (color, 1); after shuffle and sort, each reduce task consumes (k, [v]) pairs such as (color, [1,1,1,1,1,1, ...]) and produces (k', v') pairs such as (color, 100)
• Each reduce task writes one part of the output (Part0001, Part0002, Part0003)
• That is the output file: it has 3 parts, probably on 3 different machines
Example 3: Color Filter
• Job: select only the blue and the green colors
• Each map task reads input blocks from HDFS and selects only the blue or green records, producing (k, v) pairs
• No need for a reduce phase: each map task writes its output directly to HDFS (Part0001 through Part0004)
• That is the output file: it has 4 parts, probably on 4 different machines
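As a sketch of such a filter in Hadoop's Java API (the class name and record format, one color word per input line, are our own assumptions), the mapper below passes through only blue and green records; in the driver, calling job.setNumReduceTasks(0) disables the reduce phase, so each mapper writes its output directly to HDFS:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical map-only filter: keeps only records whose value is "blue" or "green"
public class ColorFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String color = value.toString().trim();
        if (color.equals("blue") || color.equals("green")) {
            context.write(value, NullWritable.get());  // pass the record through unchanged
        }
    }
}
```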
Other Tools

• Hive
• Hadoop processing with SQL

• Pig
• Hadoop processing with scripting

• HBase
• Database model built on top of Hadoop

Who Uses Hadoop?
