Lecture 5 - Hadoop and MapReduce

What is Hadoop?

• An open source framework, written in Java, that allows distributed processing of large datasets across a cluster of machines
• Inspired by Google's MapReduce programming model as well as its file system (GFS)
• Runs on commodity hardware:
  • Economic / affordable
  • Typically low-performance hardware
Hadoop History

• 2002: Doug Cutting started working on Nutch
• 2003-2004: Google published the GFS and MapReduce papers
• 2005: Doug Cutting added DFS & MapReduce to Nutch
• 2006: Development of Hadoop started as a Lucene sub-project
• 2007: The New York Times converted 4 TB of image archives over 100 EC2 instances
• 2008: Hadoop became a top-level Apache project and defeated a supercomputer in the terabyte sort benchmark; Facebook launched Hive, SQL support for Hadoop
• 2009: Doug Cutting joined Cloudera
What is Hadoop?
• Open source software framework designed for storage and processing of large-scale datasets on large clusters of commodity hardware
• Large datasets → terabytes or petabytes of data
• Large clusters → hundreds or thousands of nodes

Uses for Hadoop
• Data-intensive text processing
• Graph mining
• Machine learning and data mining
• Large-scale social network analysis
What is Hadoop (Cont'd)
• The Hadoop framework consists of two main layers:
  • Hadoop Distributed File System (HDFS)
  • Execution engine (MapReduce)
Hadoop Master/Slave Architecture

• Hadoop is designed as a master-slave architecture:
  • One master node (a single node)
  • Many slave nodes
Design Principles of Hadoop

• Need to process big data
• Need to parallelize computation across thousands of nodes
• Commodity hardware:
  • A large number of low-end, cheap machines working in parallel to solve a computing problem
Properties of HDFS
• Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data
• Replication: each data block is replicated many times (default is 3)
• Failure: failure is the norm rather than the exception
• Fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
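The slides stop at the concept, but for concreteness, here is a minimal sketch of inspecting and changing a file's replication factor through Hadoop's Java FileSystem API (the file path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");   // hypothetical file

        // Read the file's current replication factor from its metadata
        short replication = fs.getFileStatus(file).getReplication();
        System.out.println("Replication factor: " + replication);

        // Request a different replication factor; the namenode
        // re-replicates (or deletes excess replicas) in the background
        fs.setReplication(file, (short) 3);
    }
}
```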
Hadoop: How it Works

(architecture diagram)
Hadoop Architecture
• Distributed file system (HDFS)
• Execution engine (MapReduce)
• One master node (a single node) and many slave nodes
Hadoop Distributed File System (HDFS)

• Centralized namenode
  • Maintains metadata info about files
• Many datanodes (1000s)
  • Store the actual data
  • Files are divided into blocks (64 MB each)
  • Each block is replicated N times (default N = 3)
Hadoop Distributed File System

• NameNode:
  • Stores metadata (file names, block locations, etc.)
• DataNode:
  • Stores the actual HDFS data blocks

(diagram: File1 is split into blocks 1-4, and each block is replicated on three of the datanodes)
Data Retrieval

• When a client wants to retrieve data, it communicates with the NameNode to determine which blocks make up a file and on which datanodes those blocks are stored
• It then communicates directly with the datanodes to read the data
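That two-step protocol is hidden behind the client library. A minimal read sketch using Hadoop's Java FileSystem API (the path is hypothetical): open() consults the NameNode for block locations, and the returned stream reads each block directly from a datanode holding a replica.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // open() asks the NameNode which blocks make up the file and where
        // they live; the stream then reads those blocks from the datanodes
        try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```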
MapReduce
Distributing computation across nodes

MapReduce Overview

• A method for distributing computation across multiple nodes
• Each node processes the data that is stored at that node
• Consists of two main phases:
  • Map
  • Reduce
The Mapper

• Reads data as key/value pairs
• The key is often discarded
• Outputs zero or more key/value pairs
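As a sketch of this contract in Hadoop's Java API, here is a word-count style mapper: it discards the input key (the line's byte offset) and emits one (word, 1) pair per token. The class name is our own choice:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line (discarded), value = the line itself
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (word, 1) for each token
        }
    }
}
```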


Shuffle and Sort

• Output from the mapper is sorted by key
• All values with the same key are guaranteed to go to the same machine
The Reducer

• Called once for each unique key
• Gets a list of all values associated with a key as input
• Outputs zero or more final key/value pairs
  • Usually just one output per input key
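A matching sketch in Hadoop's Java API: a reducer that is called once per unique key, sums all of that key's values, and emits one output pair per input key (class name is again our own):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per unique key, with the full list of values for that key
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);   // one final (key, count) pair per input key
    }
}
```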
JobTracker and TaskTracker

• JobTracker
  • Determines the execution plan for the job
  • Assigns individual tasks
• TaskTracker
  • Keeps track of the performance of an individual mapper or reducer
Properties of MapReduce Engine
• The Job Tracker is the master node (runs with the namenode)
  • Receives the user's job
  • Decides how many tasks will run (number of mappers)
  • Decides where to run each mapper (concept of locality)

• Example (file spread over Node 1, Node 2, and Node 3): the file has 5 blocks → run 5 map tasks
  • Where should the task reading block "1" run? Try to run it on Node 1 or Node 3, the nodes holding a replica of that block
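For illustration, the block-to-node mapping that drives this locality decision is also visible to clients through Hadoop's Java FileSystem API; a minimal sketch (hypothetical path):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/sample.txt")); // hypothetical file

        // One BlockLocation per block; getHosts() lists the datanodes holding
        // replicas. The scheduler tries to run each map task on one of them.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block at offset " + block.getOffset()
                    + " -> hosts: " + String.join(", ", block.getHosts()));
        }
    }
}
```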
Properties of MapReduce Engine (Cont'd)
• The Task Tracker is the slave node (runs on each datanode)
  • Receives the task from the Job Tracker
  • Runs the task until completion (either a map or a reduce task)
  • Is always in communication with the Job Tracker, reporting progress

(diagram: map tasks feed, through a parse/hash step, into reduce tasks; in this example, one MapReduce job consists of 4 map tasks and 3 reduce tasks)
MapReduce Phases

• Deciding what will be the key and what will be the value is the developer's responsibility
Map-Reduce Execution Engine (Example: Color Count)

• Input blocks on HDFS are read by map tasks, which produce (k, v) pairs such as (color, 1)
• A parse/hash step shuffles and sorts the pairs based on k, so each reduce task consumes (k, [v]) pairs such as (color, [1,1,1,1,1,1, ...])
• Each reduce task produces (k', v') pairs such as (color, 100)

Users only provide the "Map" and "Reduce" functions.
Key-Value Pairs
• Mappers and Reducers are users' code (provided functions)
• They just need to obey the key-value pairs interface
• Mappers:
  • Consume <key, value> pairs
  • Produce <key, value> pairs
• Reducers:
  • Consume <key, <list of values>>
  • Produce <key, value>
• Shuffling and Sorting:
  • Hidden phase between mappers and reducers
  • Groups all pairs with the same key from all mappers, sorts them, and passes them to a certain reducer in the form of <key, <list of values>>
Example 1: Word Count
• Job: count the occurrences of each word in a data set

(diagram: the job's map tasks and reduce tasks)
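The slides show only the data flow, but a minimal Java driver tying together the mapper and reducer sketches from earlier could look like this (input and output paths come from the command line):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);   // mapper sketch from earlier
        job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);    // reducer sketch from earlier
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```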
Example 2: Color Count
• Job: count the number of each color in a data set
• As before, map tasks read input blocks on HDFS and produce (k, v) pairs such as (color, 1); after shuffle and sort, each reduce task consumes (k, [v]) pairs such as (color, [1,1,1,1,1,1, ...]) and produces (k', v') pairs such as (color, 100)
• Each reduce task writes one part of the output (Part0001, Part0002, Part0003)
• That is the output file: it has 3 parts, probably on 3 different machines
Example 3: Color Filter
• Job: select only the blue and the green colors
• Each map task reads input blocks from HDFS and selects only the blue or green records, producing (k, v) pairs
• No need for a reduce phase: each map task writes its output directly to HDFS (Part0001 through Part0004)
• That is the output file: it has 4 parts, probably on 4 different machines
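As a sketch of such a filter in Hadoop's Java API (the class name and record format, one color word per input line, are our own assumptions), the mapper below passes through only blue and green records; in the driver, calling job.setNumReduceTasks(0) disables the reduce phase, so each mapper writes its output directly to HDFS:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical map-only filter: keeps only records whose value is "blue" or "green"
public class ColorFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String color = value.toString().trim();
        if (color.equals("blue") || color.equals("green")) {
            context.write(value, NullWritable.get());  // pass the record through unchanged
        }
    }
}
```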
Other Tools

• Hive
• Hadoop processing with SQL

• Pig
• Hadoop processing with scripting

• HBase
• Database model built on top of Hadoop

Who Uses Hadoop?
