Big Data Analytics - Lecture 4
Isara Anantavrasilp
Client-Server Architecture
• Client-Server is one of the oldest architectures in computer networking
• One powerful machine (the server) serves many (weak) clients
• It works! (WWW, Mail, Games, etc.)
[Diagram: a single server connected to many clients]
Client-Server Limitations
• This architecture does not scale well
– Server could be overwhelmed by clients
– Processing power could be limited
– Storage could be limited
• Extending processors or expanding drives will not get very far
[Diagram: one server surrounded by many clients]
Supersized Server
• Some tasks require High-Performance Computers (HPC)
– Complex tasks: weather forecasts, simulations
– Large data: search engines, LHC
Supersized Server Limitations
• Expensive
• Specialized hardware
• EXPENSIVE!
Commodity Cluster
• Commodity cluster: a computing cluster composed of readily available machines
• In other words, a commodity cluster employs ordinary, generic components
– In contrast to specialized machines designed for specific tasks
• Cheaper component and maintenance costs
• Less powerful than specialized machines
Rapid Cooking Scenario
And Scale!
[Diagram: MapReduce + Distributed File System]
Hadoop
• Software library for distributed processing of large data sets using a network of many computers
– Composed of many software components
– Each component is responsible for a different task in data processing and storage
• Designed to store and process large data sets in a parallel and distributed fashion (sketched below)
• Intended to work on commodity components
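Below is a minimal, local sketch (not from the lecture) of the MapReduce model that Hadoop implements, using word counting as the example task; the input lines and function names are illustrative assumptions.

```python
# Minimal local simulation of MapReduce: map each record to key/value pairs,
# group by key (the "shuffle"), then reduce each group. Word count is the
# illustrative task; the input lines below are made up for this sketch.
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) for every word in one input line.
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Sum the partial counts for one word.
    return word, sum(counts)

if __name__ == "__main__":
    lines = ["big data analytics", "big data on commodity clusters"]

    # Shuffle: group the mapper output by key, as the framework would do
    # between the map and reduce stages.
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_phase(line):
            grouped[word].append(count)

    for word, counts in sorted(grouped.items()):
        print(reduce_phase(word, counts))   # e.g. ('big', 2)
```

On a real cluster the same mapper and reducer logic runs in parallel on many machines, with the distributed file system supplying the input splits.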
Hadoop History
• In 2002, Doug Cutting and Mike Cafarella were developing an open-source web crawler called Apache Nutch
• They estimated that to index 1 billion pages, they would need around $500k for hardware
• However, their architecture would not scale to such a volume
• In 2003, Google released a white paper on the Google File System (GFS)
– Technology to store large files
– Distributed file system
• The Nutch team implemented their own version, called the Nutch Distributed File System (NDFS)
MapReduce
• In 2004, Google introduced MapReduce in another paper
• It works closely with GFS
• The Nutch team incorporated MapReduce with NDFS
• Finally, they extended the system beyond web crawling and called it Hadoop
• In 2008, Hadoop became the fastest system to sort a terabyte of data
• Today, Hadoop has grown into a very mature system
Hadoop Distribution
• HDFS takes care of distributed file storage
• MapReduce processes the files in a distributed fashion
• However, there are other operations as well:
– Query
– Analysis and processing
– Resource negotiation
– Export/import of data
• Thus, there are many more components in the Hadoop Ecosystem (one of them is sketched below)
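As a hedged illustration of the "query" and "analysis" roles in the ecosystem, the sketch below uses Spark SQL to query a file stored in HDFS; the HDFS path, file layout, and column names are hypothetical.

```python
# Querying data that lives in HDFS with Spark SQL (one ecosystem component).
# The path, column names and query are hypothetical examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EcosystemQuerySketch").getOrCreate()

# Read a CSV file already stored in HDFS; the schema is inferred at read time.
orders = spark.read.csv("hdfs:///data/orders.csv", header=True, inferSchema=True)
orders.createOrReplaceTempView("orders")

# Run a SQL query over the distributed data set.
top_customers = spark.sql(
    "SELECT customer_id, SUM(amount) AS total "
    "FROM orders GROUP BY customer_id "
    "ORDER BY total DESC LIMIT 10"
)
top_customers.show()
```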
Hadoop Ecosystem
[Diagram: Hadoop ecosystem components, including Ambari, Spark, HBase, MapReduce, and ZooKeeper]
Hadoop vs RDBMS
• Hadoop is fast, but it is not for everything
• Hadoop is good for analyzing large files, but not good for small changes
• MapReduce is good for data that are written once and read many times
– RDBMS is better when the data must be updated often
• MapReduce interprets data while reading (schema on read); see the sketch below
– RDBMS checks the schema at write time
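A minimal sketch (not from the lecture) contrasting schema on read with schema on write; the log format, field names, and parsing rules are assumptions made for illustration.

```python
# Schema on read (the Hadoop/MapReduce style): raw text is stored as-is and
# structure is imposed only when the data is read. The log format and field
# names below are hypothetical.
raw_lines = [
    "2024-01-05 alice 200",
    "2024-01-05 bob 404",
]

def parse(line):
    # The "schema" lives in the reading code, not in the storage layer.
    date, user, status = line.split()
    return {"date": date, "user": user, "status": int(status)}

records = [parse(line) for line in raw_lines]
print([r for r in records if r["status"] >= 400])

# Schema on write (the RDBMS style) would instead validate structure when a
# row is inserted, e.g. via a table definition such as:
#   CREATE TABLE access_log (log_date DATE, username VARCHAR(32), status INT);
```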
HDFS Architecture
[Diagram: a 380 MB file split into 128 MB blocks distributed across DataNodes, with the NameNode tracking block locations]
• The NameNode keeps track of the location of each block
• Block size can be configured (a worked example follows below)
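A small worked sketch (not from the lecture) of how the 380 MB file in the diagram is split with a 128 MB block size; the helper function name is made up for illustration.

```python
# Splitting a file into fixed-size HDFS blocks. With the slide's example,
# a 380 MB file and 128 MB blocks give three blocks: 128, 128 and 124 MB.
# The last block only occupies its actual size, not a full 128 MB.
def split_into_blocks(file_size_mb, block_size_mb=128):
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

print(split_into_blocks(380))   # [128, 128, 124]
```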
HDFS Fault Tolerance
[Diagram: blocks of the 380 MB file replicated across DataNodes, tracked by the NameNode]
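As a hedged illustration of the fault-tolerance idea, the sketch below places each block on several DataNodes so that a single node failure never loses the only copy; the replication factor of 3 (the HDFS default), the node names, and the round-robin placement are assumptions, not HDFS's actual placement policy.

```python
# Simplified block replication: every block gets `replication` copies on
# distinct DataNodes. Real HDFS placement also considers racks and free
# space; the round-robin policy here is only for illustration.
from itertools import cycle

def place_replicas(blocks, datanodes, replication=3):
    placement = {}
    nodes = cycle(datanodes)
    for block in blocks:
        placement[block] = [next(nodes) for _ in range(replication)]
    return placement

blocks = ["blk_1", "blk_2", "blk_3"]            # the 380 MB file's blocks
datanodes = ["dn1", "dn2", "dn3", "dn4"]
print(place_replicas(blocks, datanodes))
# Losing any one DataNode still leaves at least two copies of every block.
```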