DATA228 Lecture Notes Week 3
Sangjin Lee
Hadoop: history & architecture
Chapter 1 & parts of 10, “Hadoop: The Definitive Guide” 4th Edition, Tom White
Hadoop: Big Data refresher
• “The Google File System”, Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, 2003
• These were based on large-scale systems that were in wide use at Google at the time
Hadoop: history
• 2005 - 2006: Doug Cutting at Yahoo creates a MapReduce implementation and forms an
open-source project called Hadoop
• MR v.2 APIs
Hadoop: history
AWS GCP
• MapReduce APIs
• HDFS APIs
• YARN APIs
• Hadoop is an ecosystem
Hadoop as a distributed system
Two distributed systems for storage and compute
MapReduce API
YARN
Storage
HDFS
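To make the compute side concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce steps that the MapReduce model is built around. It is a toy illustration only, not Hadoop's MapReduce API: every function name here (`map_phase`, `shuffle`, `reduce_phase`, `run_job`, and the word-count mapper and reducer) is hypothetical, and nothing is distributed or written to HDFS.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Call the reducer once per key with all of that key's values."""
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def run_job(records, mapper, reducer):
    """Chain the three steps together in a single process."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# Hypothetical word-count job used only to exercise the sketch.
def word_count_mapper(line):
    for word in line.split():
        yield word.lower(), 1


def word_count_reducer(word, counts):
    return sum(counts)


lines = [
    "Hadoop stores data in HDFS",
    "YARN schedules work",
    "hadoop runs MapReduce",
]
result = run_job(lines, word_count_mapper, word_count_reducer)
print(result["hadoop"])
```

In a real cluster the input records would come from blocks stored in HDFS, and YARN would schedule the map and reduce tasks across machines; the sketch collapses all of that into one process.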
Hadoop as a distributed system
• Hadoop is composable: you can use some (do not have to use all)
• Examples
Client API
Tools
MapReduce
YARN
HDFS
Hadoop Common
Hadoop architecture
• High availability
• Higher-level frameworks that create complex MapReduce workflows: Pig, Oozie, Cascading,
Scalding, …
• Spark
Running Hadoop
• Single-node setup
• Cluster setup
• Cloud setup
                     # of processes   # of machines
local                1                1
pseudo-distributed   several          1
cluster              many             many
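The three modes differ mainly in a handful of configuration properties. The sketch below renders the per-mode site files as XML, assuming the commonly documented settings for each mode; verify them against the documentation for your Hadoop version. The hostnames `namenode` and `resourcemanager` are placeholders, and `render_site_files` is a hypothetical helper, not part of Hadoop.

```python
import xml.etree.ElementTree as ET

# Which site file each property conventionally lives in.
PROPERTY_FILE = {
    "fs.defaultFS": "core-site.xml",
    "dfs.replication": "hdfs-site.xml",
    "mapreduce.framework.name": "mapred-site.xml",
    "yarn.resourcemanager.hostname": "yarn-site.xml",
}

# Assumed key settings per mode; "namenode" and "resourcemanager"
# are placeholder hostnames for a real cluster.
MODE_PROPERTIES = {
    "local": {
        "fs.defaultFS": "file:///",
        "mapreduce.framework.name": "local",
    },
    "pseudo-distributed": {
        "fs.defaultFS": "hdfs://localhost/",
        "dfs.replication": "1",
        "mapreduce.framework.name": "yarn",
        "yarn.resourcemanager.hostname": "localhost",
    },
    "cluster": {
        "fs.defaultFS": "hdfs://namenode/",
        "dfs.replication": "3",
        "mapreduce.framework.name": "yarn",
        "yarn.resourcemanager.hostname": "resourcemanager",
    },
}


def render_site_files(mode):
    """Return {filename: XML text} holding the properties for one mode."""
    roots = {}
    for name, value in MODE_PROPERTIES[mode].items():
        filename = PROPERTY_FILE[name]
        root = roots.setdefault(filename, ET.Element("configuration"))
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return {
        filename: ET.tostring(root, encoding="unicode")
        for filename, root in roots.items()
    }


for filename, xml_text in render_site_files("pseudo-distributed").items():
    print(filename)
    print(xml_text)
```

Local mode needs no HDFS or YARN daemons, which is why it sets only two properties here; pseudo-distributed mode runs the same daemons as a cluster, all on one machine.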
Running Hadoop
Demo
• Install Hadoop