Big Data Analytics - Lecture 4
Isara Anantavrasilp
Client-Server Architecture
• Client-Server is one of the oldest architectures in computer networking
• One powerful machine (the server) serves many (weak) clients
• It works! (WWW, Mail, Games, etc.)
[Diagram: a single server connected to many clients]
Client-Server Limitations
• This architecture does not scale well
– Server could be overwhelmed by clients
– Processing power could be limited
– Storage could be limited
• Extending processors or expanding drives will not get very far
[Diagram: one server surrounded by many clients]
Supersized Server
• Some tasks require High-Performance Computers (HPC)
– Complex tasks: weather forecasts, simulations
– Large data: search engines, LHC
Supersized Server Limitations
• Expensive
• Specialized hardware
• EXPENSIVE!
Commodity Cluster
• Commodity cluster: a computing cluster composed of readily available machines
• In other words, a commodity cluster employs ordinary, generic components
– In contrast to specialized machines designed for specific tasks
• Cheaper component and maintenance costs
• Less powerful than specialized machines
Rapid Cooking Scenario
And Scale!
[Diagram: MapReduce + Distributed File System]
Hadoop
• Software library for distributed processing of large data sets using a network of many computers
– Composed of many software components
– Each component is responsible for a different task in data processing and storage
• Designed to store and process large data sets in a parallel and distributed fashion (sketched below)
• Intended to work on commodity components
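Below is a minimal, local sketch (not from the lecture) of the MapReduce model that Hadoop implements, using word counting as the example task; the input lines and function names are illustrative assumptions.

```python
# Minimal local simulation of MapReduce: map each record to key/value pairs,
# group by key (the "shuffle"), then reduce each group. Word count is the
# illustrative task; the input lines below are made up for this sketch.
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) for every word in one input line.
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Sum the partial counts for one word.
    return word, sum(counts)

if __name__ == "__main__":
    lines = ["big data analytics", "big data on commodity clusters"]

    # Shuffle: group the mapper output by key, as the framework would do
    # between the map and reduce stages.
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_phase(line):
            grouped[word].append(count)

    for word, counts in sorted(grouped.items()):
        print(reduce_phase(word, counts))   # e.g. ('big', 2)
```

On a real cluster the same mapper and reducer logic runs in parallel on many machines, with the distributed file system supplying the input splits.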
Hadoop History
• In 2002, Doug Cutting and Mike Cafarella were developing an open-source web crawler called Apache Nutch
• They estimated that to index 1 billion pages, they would need around $500k for hardware
• However, their architecture would not scale to such a volume
• In 2003, Google released a white paper on the Google File System (GFS)
– Technology to store large files
– Distributed file system
• The Nutch team implemented their own version, called the Nutch Distributed File System (NDFS)
MapReduce
• In 2004, Google introduced MapReduce in another paper
• It works closely with GFS
• The Nutch team incorporated MapReduce with NDFS
• Finally, they extended the system beyond web crawling and called it Hadoop
• In 2008, Hadoop became the fastest system to sort a terabyte of data
• Today, Hadoop has grown into a very mature system
Hadoop Distribution
• HDFS takes care of distributed file storage
• MapReduce processes the files in a distributed fashion
• However, there are other operations as well:
– Query
– Analysis and processing
– Resource negotiation
– Export/import of data
• Thus, there are many more components in the Hadoop Ecosystem (one of them is sketched below)
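As a hedged illustration of the "query" and "analysis" roles in the ecosystem, the sketch below uses Spark SQL to query a file stored in HDFS; the HDFS path, file layout, and column names are hypothetical.

```python
# Querying data that lives in HDFS with Spark SQL (one ecosystem component).
# The path, column names and query are hypothetical examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EcosystemQuerySketch").getOrCreate()

# Read a CSV file already stored in HDFS; the schema is inferred at read time.
orders = spark.read.csv("hdfs:///data/orders.csv", header=True, inferSchema=True)
orders.createOrReplaceTempView("orders")

# Run a SQL query over the distributed data set.
top_customers = spark.sql(
    "SELECT customer_id, SUM(amount) AS total "
    "FROM orders GROUP BY customer_id "
    "ORDER BY total DESC LIMIT 10"
)
top_customers.show()
```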
Hadoop Ecosystem
[Diagram: Hadoop ecosystem components, including Ambari, Spark, HBase, MapReduce, and ZooKeeper]
Hadoop vs RDBMS
• Hadoop is fast, but it is not for everything
• Hadoop is good for analyzing large files, but not good for small changes
• MapReduce is good for data that are written once and read many times
– RDBMS is better when the data must be updated often
• MapReduce interprets data while reading (schema on read); see the sketch below
– RDBMS checks the schema at write time
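A minimal sketch (not from the lecture) contrasting schema on read with schema on write; the log format, field names, and parsing rules are assumptions made for illustration.

```python
# Schema on read (the Hadoop/MapReduce style): raw text is stored as-is and
# structure is imposed only when the data is read. The log format and field
# names below are hypothetical.
raw_lines = [
    "2024-01-05 alice 200",
    "2024-01-05 bob 404",
]

def parse(line):
    # The "schema" lives in the reading code, not in the storage layer.
    date, user, status = line.split()
    return {"date": date, "user": user, "status": int(status)}

records = [parse(line) for line in raw_lines]
print([r for r in records if r["status"] >= 400])

# Schema on write (the RDBMS style) would instead validate structure when a
# row is inserted, e.g. via a table definition such as:
#   CREATE TABLE access_log (log_date DATE, username VARCHAR(32), status INT);
```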
HDFS Architecture
[Diagram: a 380 MB file split into 128 MB blocks distributed across DataNodes, with the NameNode tracking block locations]
• The NameNode keeps track of the location of each block
• Block size can be configured (a worked example follows below)
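A small worked sketch (not from the lecture) of how the 380 MB file in the diagram is split with a 128 MB block size; the helper function name is made up for illustration.

```python
# Splitting a file into fixed-size HDFS blocks. With the slide's example,
# a 380 MB file and 128 MB blocks give three blocks: 128, 128 and 124 MB.
# The last block only occupies its actual size, not a full 128 MB.
def split_into_blocks(file_size_mb, block_size_mb=128):
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

print(split_into_blocks(380))   # [128, 128, 124]
```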
HDFS Fault Tolerance
[Diagram: blocks of the 380 MB file replicated across DataNodes, tracked by the NameNode]
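As a hedged illustration of the fault-tolerance idea, the sketch below places each block on several DataNodes so that a single node failure never loses the only copy; the replication factor of 3 (the HDFS default), the node names, and the round-robin placement are assumptions, not HDFS's actual placement policy.

```python
# Simplified block replication: every block gets `replication` copies on
# distinct DataNodes. Real HDFS placement also considers racks and free
# space; the round-robin policy here is only for illustration.
from itertools import cycle

def place_replicas(blocks, datanodes, replication=3):
    placement = {}
    nodes = cycle(datanodes)
    for block in blocks:
        placement[block] = [next(nodes) for _ in range(replication)]
    return placement

blocks = ["blk_1", "blk_2", "blk_3"]            # the 380 MB file's blocks
datanodes = ["dn1", "dn2", "dn3", "dn4"]
print(place_replicas(blocks, datanodes))
# Losing any one DataNode still leaves at least two copies of every block.
```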